vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
- [Bugfix] Clean up ModelOpt LM head state before tying weights
- [BugFix] Reset num_output_placeholders on KV load failure recomputation
- [Rust Frontend][RFC]: Elastic Expert Parallel support
- [Bug]: cupy-cuda13x installed instead of cupy-cuda12x in CUDA 12.9 Docker image
- [Bugfix] Validate model_loader_extra_config values in DefaultModelLoader
- fix: [Feature]: Default eplb num_redundant_experts to the lowest valid val...
- [Feature]: Restore support for 2-bit and 3-bit GPTQ
- [ROCm] Fix AMD build from shuffle mask dtype error while compiling `silu_and_mul_per_block_quant_kernel`
- fix: [Bug]: Wrong timestamps if audio > 30s
- fix(multimodal): support gemma4 QAT configs by adding safe num_soft_t…
- Docs
- Python not yet supported