vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Python not yet supported
18 Subscribers
Help out
- Issues
- [Frontend] Use unified Parser for chat completions non-streaming
- [XPU] Enable topk_per_row and indexer_quant_cache kernels for DeepSeekV3.2 and GLM5
- [MLAAttention] Clear Cudagraph padded region of FI decode Attention kernel
- Add sanitized MiniMax M2 parser variant for path-like output
- [Feature] RIY: Runtime expert pruning for MoE models
- [Feature] Add prefill node failure detection and health query endpoint for MooncakeConnector Proxy
- [Bug]: Intel ARC 140v not supported as XE2 cutlass kernel
- [Bugfix][Perf]: avoid Range allocation and dict hashing in _find_range_for_shape
- [TEST ONLY] Restore AsyncTP fusion for FlashInfer FP8 BMM (needs advice) #27893
- [Bugfix] Fix Qwen3CoderToolParser anyOf/oneOf type resolution for nullable params
- Docs
- Python not yet supported