vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
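For context, the library itself is used roughly as in the sketch below, assuming vLLM's standard offline-inference quickstart; the model name and sampling values are illustrative placeholders only:

```python
from vllm import LLM, SamplingParams

# Example prompts; any list of strings works.
prompts = ["Hello, my name is", "The capital of France is"]

# Sampling configuration (values here are arbitrary examples).
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a model and run batched generation.
# "facebook/opt-125m" is just a small placeholder model for illustration.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```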
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
14 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- Fix/compile cache reuse for vision blocks #27590
- [Performance]: DeepSeek-R1 performance degradation after enabling MTP
- [Feature]: Support norm+quant & silu+quant fusion for block (group) quantization
- Adding render group to docker container
- [Bug]: ibm-granite/granite-4.0-h-tiny model fails for CPU on vLLM
- [Bug]: Qwen3-4B Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
- Flashinfer: TopK+TopP sampling from probs
- [BugFix] Fix the issue where there is no parallelism in PP mode
- [Bug]: Prefix caching leads to different outputs for Hermes-3-Llama-3.1-8B
- [Bug]: ValueError: There is no module or parameter named 'mlp_AR' in TransformersForCausalLM
- Docs: Python not yet supported