vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
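As a serving engine, vLLM exposes an OpenAI-compatible HTTP API when launched in server mode (by default on `localhost:8000`). A minimal sketch of building a request for that endpoint; the model name `facebook/opt-125m` and the helper `build_completion_request` are illustrative assumptions, not part of vLLM itself:

```python
import json
import urllib.request

def build_completion_request(prompt, model, base_url="http://localhost:8000"):
    """Build an HTTP POST request for vLLM's OpenAI-compatible
    /v1/completions endpoint (hypothetical helper for illustration)."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 64}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending this request requires a running vLLM server, e.g.:
#   vllm serve facebook/opt-125m
req = build_completion_request("Hello", "facebook/opt-125m")
```

With a server running, `urllib.request.urlopen(req)` would return a JSON completion in the standard OpenAI response shape.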
Docs triage for Python not yet supported (26 subscribers)
Help out
- Issues
- Fix Gemma4 TritonAttention buffer mismatch for heterogeneous head_dim (#41656)
- fix: handle detached HEAD in precompiled wheel commit resolution
- [Bug]: CPUWorker shutdown reports "RuntimeError: Cannot access accelerator device when none is available."
- [Bug]: XPU TP=2 on dual Intel Arc Pro B70 (Battlemage): GP fault + xe BCS engine reset reproduces in intel/vllm:0.17.0-xpu on Ubuntu 24.04 HWE 6.17
- fix-grpc-spawn
- [ROCm] Add QuickReduce min-size override and codec threshold
- Fix CPU-only shutdown cleanup without accelerator
- [Bug]: The last few reasoning output tokens are missing when using Gemma4 and setting "--streaming-interval" to be larger than 1
- [CI/Build] Bump flashinfer to v0.6.10
- [Bug]: GPT-OSS-20B repeats itself for some prompts
- Docs (Python not yet supported)