vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
  - [Bugfix][Tool Parsers] Validate JSON arguments in extract_tool_calls for Kimi K2, DeepSeek V3/V3.1
  - [Kernel][Helion][1/N] Add Helion kernel for fused_qk_norm_rope
  - Add speculative decoding metrics
  - [CPU][Spec Decode] Enable DFlash SD for CPU
  - [Bugfix] Reject negative values for max_logprobs and long_prefill_token_threshold
  - [Feature] Add tanh softcapping support to FlexAttention backend
  - [Bug] V1 InputBatch condense can leak stale allowed_token_ids mask to recycled row
  - feat(benchmark): support multimodal request_id tracking and label output
  - [Bug] MXFP8 MoE always falls back to MARLIN on SM_121 (DGX Spark / GB10): TrtLlmFp8ExpertsBase gates on family(100), excluding SM_12x consumer Blackwell
  - b12x nvfp4 w4a16 use a16 fix
- Docs
  - Python not yet supported