vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Issues
- [Bug]: Realtime audio transcription (Voxtral) silently hangs after ~10 minutes due to unhandled TimeoutError in background task
- [Bugfix] Update TritonExperts to reflect support for non-gated ReLU^2
- [Feature]: Support DeepGEMM MTP3 NV Kernel
- feat: adjust FA3 num_splits on Hopper for low latency in cudagraph mode
- [Bugfix] Clamp out-of-bounds token IDs in MTP models
- [RFC]: Add PPL and KLD to VLLM
- [Bug]: Graph Capturing reports negative memory consumption
- Add support to pass custom presence_penalty and frequency_penalty parameters in the generation config
- [New Model]: MetaCLIP-2 variants
- [Core] Add sharding metadata to model parameters