vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
  - [PP] Place remainder PP layers on the last stage(s) when num_layers % pp_stages != 0
  - [Bug]: W4A16 or W8A16 Qwen3.5 9B meet AssertionError
  - [Core] Release NCCL communicator memory in sleep mode
  - [Bugfix] Preserve FP8 indexer WK pairs across incremental load_weights
  - [Bugfix] Avoid mutating tool parameters in _get_tool_schema_defs
  - [CI] Add unit tests for IdentityReasoningParser
  - [Bugfix] DeepseekV4 (nvidia): thread is_sequence_parallel into shared_experts to fix SP-MoE construct/load mismatch
  - [Feat][1/N] CuTeDSL warmup infrastructure, FA4 MLA
  - [ROCm] Enable RDNA3 W4A16 GEMM kernels on gfx1151 (Strix Halo)
  - [Bugfix][ROCm] Fix GDN KKT TP>=2 hang on RDNA4 (gfx1201)
- Docs
  - Python not yet supported