vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
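As a quick illustration of what the engine does, here is a minimal sketch of vLLM's offline inference API, assuming the package is installed (`pip install vllm`); the model checkpoint named below is just an example, not anything specific to the issues listed on this page.

```python
# Minimal offline-inference sketch with vLLM.
# Assumes `pip install vllm`; "facebook/opt-125m" is only an example model.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # load the model into the engine
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Batch generation: vLLM schedules the prompts for high-throughput decoding.
outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```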
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
26 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Perf] Fuse Qwen3.5 GDN in_proj_ba into 6-way in_proj MergedColumnParallelLinear
- fix(llama): use weightless RMSNorm for FlashNorm-folded checkpoints (has_weight=False)
- [LoRA][MoE] Fix PEFT 0.18+ target_parameters LoRA loading for 3D MoE experts (Qwen3.5)
- [Mamba] NVIDIA GB10 and B200 tuned selective_state_update configs and benchmark tooling
- [PhiMoE] Add MixtureOfExperts protocol support to enable EPLB
- [Bugfix] Fix DeepSeek-V4 MTP deadlock with batch size limitation
- [Core] feat: add optional cap for --max-model-len auto
- [Kernel][ROCm] Native W4A16 kernel for AMD RDNA3 (gfx1100) — fp16 + bf16
- [DSv4] Improved fused Indexer Q quant kernel
- Log dummy DP step in iteration details