vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
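For context, a minimal sketch of what offline inference with the engine looks like through vLLM's Python API (assuming vLLM is installed; `facebook/opt-125m` is just an illustrative model choice, swap in any model you actually use):

```python
# Minimal offline batch inference sketch with vLLM.
from vllm import LLM, SamplingParams

# Illustrative model; replace with the model you intend to serve.
llm = LLM(model="facebook/opt-125m")

# Sampling settings applied to every prompt in the batch.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "The capital of France is",
    "High-throughput LLM serving works by",
]

# generate() batches the prompts and returns one RequestOutput per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```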
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
14 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Bugfix]: nccl connector memory leak
- Fix: align vllm bench serve ignore_eos behavior with legacy benchmark…
- [Bug]: DeepSeek V3.1 Tool Parser: Leading whitespace accumulation in multi-turn tool calling conversations
- v1: clamp max_model_len when KV cache exceeds GPU budget
- Take SM Count During Persistent SiLU Mul Quant
- [WIP][CI][GHA] Use linux.2xlarge GHA runner for pre-commit checks
- [ROCm][fusion] Enable qk_norm mRoPE fusion for Qwen VL models
- Fix LoRA compatibility with quantized MoE models
- [Feature]: Qwen3 Omni Transcriptions
- [Don't merge] Try enabling fastsafetensors as default weight loader
- Docs
- Python not yet supported