vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
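For context, a minimal sketch of offline batch inference with vLLM's Python API (the canonical quickstart pattern); the model name and prompts below are illustrative placeholders, not taken from this page.

```python
# Minimal offline-inference sketch with vLLM's Python API.
# Model name and prompts are illustrative; any supported
# Hugging Face causal LM can be substituted.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "In one sentence, explain quantization:",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # downloads weights on first run
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each RequestOutput carries the prompt and its generated completions.
    print(output.prompt, "->", output.outputs[0].text)
```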
Issues
- [Performance]: Very slow GGUF quantized model
- [Doc]: Inconsistent hash notation in Prefix Caching "Time 5" diagram
- [FA/Chore] Bump FA version for FP8 two-level accumulation every n steps
- [Bugfix] Fix Harmony streaming cross-channel delta accumulation
- [Bug]: Realtime audio transcription (Voxtral) silently hangs after ~10 minutes due to unhandled TimeoutError in background task
- [Bugfix] Update TritonExperts to reflect support for non-gated ReLU^2
- [Feature]: Support DeepGEMM MTP3 NV Kernel
- feat: adjust FA3 num_splits on Hopper for low latency in cudagraph mode
- [Bugfix] Clamp out-of-bounds token IDs in MTP models
- [RFC]: Add PPL and KLD to VLLM