vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
30 Subscribers
Help out
- Issues
- [Bugfix] Expand packed module names in GPTQ modules_in_block_to_quantize
- [Core] Pluggable sleep-mode backend abstraction (RFC #34303)
- [Core] Avoid scheduler `RoutedExpertsManager` import unless needed
- fix: bind structured-output grammar on Responses API when reasoning parser is active
- [Bugfix] Quark: set moe_quant_config in QuarkW8A8Int8MoEMethod
- fix(spec_decode): ensure draft model uses correct parallel/model config in all proposers
- [Bugfix] routed_experts: fall back to Triton MoE backend (FlashInfer kernels bypass capture)
- [BugFix] Omit empty tool_calls from OpenAI chat responses
- [Doc] Clarify ColBERT token embedding behavior
- [Kernel] Add weightless RMSNorm CUDA kernels for has_weight=False (#41430)
- Docs
- Python not yet supported