vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
  - [Bugfix] Fix collect_env.py crash on non-Linux platforms
  - [Bug]: TRITON_MLA grouped-decode kernel fails for Mistral Small 4 (kv_lora_rank=256): 'Cannot make_shape_compatible: 256 and 512'
  - [Example] Add Top-n-sigma logit truncation custom logits processor
  - llmd+vllm+mori-ep(inter node wide-ep)+mori-io(write) for 2p2d with dp=ep=16 tp=1
  - docs: Add GKE Autopilot deployment & Prometheus monitoring guide
  - [Bug]: Vllm init fails when model-loader-extra-config option provided
  - [Multimodal] Add PixelPrune visual token pruning for Qwen3-VL
  - [Draft] DeepSeek V4 PP/PD disaggregated serving with Mooncake
  - [Core][Perf] Cap Torch intra-op threads in UniProcExecutor to avoid CPU oversubscription
  - fix: use .pop() with default None to avoid KeyError in chat.py (closes #44827)
- Docs
  - Python not yet supported