vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
30 Subscribers
Help out
- Issues
- Fix Qwen3-Coder required tool parsing
- docs: replace PR review request email with Google Form
- Helion `scaled_mm` vs cutlass — benchmark command & result
- [XPU] Fix test_logprobs_e2e import error: pin lm-eval[api]>=0.4.12
- [Bug]: vllm process crashed because of dp coordinator receives unexpected message which send by safety scan software
- [Bug]: Online FP8 (`--quantization fp8`) over-allocates non-gated MoE `w13` (2×intermediate), causing OOM — NemotronH on a single GPU
- [ROCm][Perf] Enable torch.compile / Inductor fusion passes on GLM-4 MTP draft
- Deprecate old FP8 online quantization classes
- [Model] Use native packed audio attention for Qwen2.5-Omni to remove standalone flash-attn dependency
- [Bugfix] Clamp num_computed_tokens after streaming session rebuild
- Docs
- Python not yet supported