vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
- [Core] Release NCCL communicator memory in sleep mode
- [Bugfix] Preserve FP8 indexer WK pairs across incremental load_weights
- [Bugfix] Avoid mutating tool parameters in _get_tool_schema_defs
- [CI] Add unit tests for IdentityReasoningParser
- [Bugfix] DeepseekV4 (nvidia): thread is_sequence_parallel into shared_experts to fix SP-MoE construct/load mismatch
- [Feat][1/N] CuTeDSL warmup infrastructure, FA4 MLA
- [ROCm] Enable RDNA3 W4A16 GEMM kernels on gfx1151 (Strix Halo)
- [Bugfix][ROCm] Fix GDN KKT TP>=2 hang on RDNA4 (gfx1201)
- [ROCm][Perf] Avoid fp32 round-trip dequant in fp8 KV paged decode
- [ROCm] Enable AITER unified attention on RDNA4 (gfx1201)
- Docs
- Python not yet supported