vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
  - [XPU] profile xpu graph memory
  - [Bug]: Hybrid Mamba + KV connector: per-group prefix-hit divergence and vllm engine crashed
  - [Bugfix][Model] Fix DiffusionGemma GGUF tied embedding loading
  - [CI] intel CI: add quantization and awq case for xpu
  - [Feature]: LoRA support for MiniCPMV4_6ForConditionalGeneration
  - [Bug][TurboQuant] Fast-path incorrectly triggers with continuation requests under prefix caching, losing cached prefix K/V
  - [Bug]: DFlash+Qwen3.6-35B-A3B, invalid gpu kv cache usage, low decode speed
  - [Bug]: InternLM2 models crash with IndexError in embedding layer on CPU backend (v0.23.0 regression from v0.22.1)
  - [Bugfix][Parser] Gracefully handle Harmony parser errors
  - [XPU] Support lora serialize and deserialize
- Docs
  - Python not yet supported