vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
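vLLM is typically used as a server that exposes an OpenAI-compatible HTTP API (started with `vllm serve <model>`). As a minimal sketch, the snippet below builds the JSON body for a request to the `/v1/completions` endpoint; the model name and parameter values are placeholders, not defaults from the project.

```python
import json

# Hypothetical request body for vLLM's OpenAI-compatible /v1/completions
# endpoint (the server is started separately with `vllm serve <model>`).
# The model id and sampling values here are illustrative placeholders.
payload = {
    "model": "facebook/opt-125m",
    "prompt": "Hello, my name is",
    "max_tokens": 64,
    "temperature": 0.8,
}

# Serialize to JSON, as it would be sent with Content-Type: application/json.
body = json.dumps(payload)
print(body)
```

Any HTTP client can POST this body to `http://<host>:8000/v1/completions` on a running vLLM server; the response follows the OpenAI completions schema.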
- Issues
- [Bug]: Two-node deployment of kimi2-5, runtime crash
- [Usage]: MoE flatten_tp_size should not unconditionally include dp_size — DP loses its original semantics for MoE layers
- [RFC]: Support quarot for eagle3
- [Bugfix] Remove incorrect assertion blocking mixed decode+spec-decode batches in GDN attention
- [Bugfix] Fix MoE flatten_tp_size unconditionally including dp_size
- [Bug]: Gemma3 mmproj-*.gguf is not downloaded in 'download_gguf'
- [Misc] Use VLLMValidationError consistently in chat completion and completion protocol validators
- [Feature]: [CPU] Performance improvement: Auto-preload Intel OpenMP on x86 for multi-core CPU inference
- [Kernel][MoE] Add fp8_w8a8 MoE tuning config for NVIDIA GB10 (DGX Spark)
- [Bug]: Qwen3.5 4b incompatibility