vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
  - [Frontend][ToolParser] emit tool name blocks in advance in MinimaxM2
  - routed_experts all-zero with FlashInfer (TRTLLM/CUTLASS) MoE — wire FlashInfer routing_replay_out into the capturer
  - Add env var for FlashInfer autotune cache
  - [Bugfix] Expand packed module names in GPTQ modules_in_block_to_quantize
  - [Core] Pluggable sleep-mode backend abstraction (RFC #34303)
  - [Core] Avoid scheduler `RoutedExpertsManager` import unless needed
  - fix: bind structured-output grammar on Responses API when reasoning parser is active
  - [Bugfix] Quark: set moe_quant_config in QuarkW8A8Int8MoEMethod
  - fix(spec_decode): ensure draft model uses correct parallel/model config in all proposers
  - [Bugfix] routed_experts: fall back to Triton MoE backend (FlashInfer kernels bypass capture)
- Docs
  - Python not yet supported