vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
- [Bugfix] Reject negative values for max_logprobs and long_prefill_token_threshold
- [Feature] Add tanh softcapping support to FlexAttention backend
- [Bug] V1 InputBatch condense can leak stale allowed_token_ids mask to recycled row
- feat(benchmark): support multimodal request_id tracking and label output
- [Bug] MXFP8 MoE always falls back to MARLIN on SM_121 (DGX Spark / GB10): TrtLlmFp8ExpertsBase gates on family(100), excluding SM_12x consumer Blackwell
- Pad FP8 Marlin weights to valid thread tiles
- b12x nvfp4 w4a16 use a16 fix
- [Bug]: DeepSeek reasoning parser can emit reasoning_content and content in the same streaming chunk
- [rust] feat: add Granite 4 tool parser
- [Manual Fusion][ROCm] Port RMS + Group Quant Fused Op for Qwen3 on ROCm Platform
- Docs
- Python not yet supported