vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
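For context on what the project does, here is a minimal offline-inference sketch using vLLM's Python API. The model name and sampling hyperparameters below are placeholders for illustration, not taken from this page.

```python
# Minimal sketch of vLLM's offline inference API (assumes `pip install vllm`).
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# Small placeholder model for a quick smoke test; any HF causal LM id works.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```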
Help out
Issues:
- [CI] Skip FLASH_ATTN+DeepSeek eagle test on Blackwell
- [Bug]: Garbled Output in DeepSeek-V4 with CUDA Graph Enabled Under Concurrent Identical Input Requests
- Bump Transformers version to 5.8.0
- [Bug]: Qwen3-30B-A3B on B200 (TP=8) — K must be divisible by blockK in flashinfer convert_to_block_layout (unquantized MoE oracle path)
- [Feature]: Add cap to --max-model-len auto (auto-fit with upper bound)
- [Bug]: vllm-0.20.0 metrics not accurate
- [Bug]: Gemma4 Fast Prefill Optimization degrades p95 inter-token latency significantly
- [Installation]: GH200 nodes
- [Feature]: Server overloaded response
- [style] Remove redundant None default in dict.get() (ruff SIM910)
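Several of the issues above concern the serving path (`--max-model-len` auto-fit, overload responses, metrics). For reference, a hedged sketch of querying a running vLLM OpenAI-compatible server; the model name, port, and prompt are assumptions, not taken from these issues.

```python
# Assumes a server was started with something like:
#   vllm serve facebook/opt-125m --max-model-len 2048
# vLLM's OpenAI-compatible endpoint defaults to http://localhost:8000/v1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="facebook/opt-125m",  # must match the model the server was launched with
    prompt="The capital of France is",
    max_tokens=32,
)
print(resp.choices[0].text)
```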