vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
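For orientation, a minimal sketch of vLLM's offline generation API (the `LLM`/`SamplingParams` interface documented in the project); the model name and sampling settings here are illustrative, not taken from this page:

```python
# Minimal offline-inference sketch with vLLM's Python API.
# Model name and sampling settings are illustrative only.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load the model and run batched generation.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```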
15 Subscribers
Help out
- Issues
  - [Performance]: DeepSeek-v3.2 throughput&TTFT&TPOT is much slower than DeepSeek-v3.1 on 8*H200
  - [V1][Spec Decode] Add Dynamic SD
  - [Frontend] Skip `stop` in reasoning content
  - [Draft] AFD implementation for step3
  - [Kernels] Add support for hybrid DeepEP
  - [feat] Implement Elastic Speculation: Adaptive Draft Length + Confidence-Based Early Exit
  - [Bugfix] Record request stats when request is aborted by client
  - [Frontend][CLI] Add --enable-dashboard for vLLM Web UI
  - [Misc][ViT Cuda Graphs] Enable Piecewise CUDA Graphs for Qwen3-VL and Qwen2.5-VL ViT to Improve Performance
  - [Bug]: Issue of Unstable Output for Identical Queries
- Docs
  - Python not yet supported