vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
- [Doc] Add google/gemma-2-2b-it to batch invariance tested models
- [Feature] Detect all2all peer fault with fault tolerance backend and prevent corrupted output
- [Bugfix][DSV4] Plumb quant_config + route compressor GEMM through quant-aware dispatch
- [Bugfix] Avoid concurrent port allocation collisions
- [RFC]: Add Gumiho speculative decoding to vLLM
- [Draft][Kernel] Port ActivationQuantFusionPass to manual fusion (#43501)
- [Usage]: How to run Qwen3.5 models on V100 given the conflicting requirements of transformers version and vLLM architecture support
- [Usage]: Intel Xeon Prefill Decode Disaggregation
- [Bug] FP8 block-quant loader rejects artifacts using 'weight_scale' rather than 'weight_scale_inv' naming
- [Bug]: TurboQuant crashes on T4/Turing (SM75) — FlashAttention capability not checked
- Docs
- Python not yet supported