vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
30 Subscribers
Help out
- Issues
  - [ROCm][CI] Stabilizing teardown and timeout of flaky tests to prevent rare OOMs
  - Add weights padding for fp8 per-block online quantization
  - [XPU][Minor] format moe kernel name and add in kernel list
  - [Feature]: [Metrics] Add per-request preemption count to Prometheus histogram
  - [Feature]: Explicit cache breakpoints and cache usage accounting in OpenAI-compatible API
  - fix: strip Gemma4 string delimiters from dict keys
  - [Bug] ValueError: Following weights were not initialized from checkpoint (Gemma 4 models with KV sharing)
  - [RFC]: Support Bailing MTP (Multi-Token Prediction) for Ling-2.6-flash
  - [Bug]: ValueError in safe_apply_chat_template deadlocks the HTTP server — every /v1/* request hangs after one malformed chat request
  - [Bugfix] Gemma4 streaming parser for multi-boundary tool deltas
- Docs
  - Python not yet supported