vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
30 Subscribers
Help out
- Issues
- [RFC]: Support for Domino speculative decoding in vLLM
- [Bug]: With the startup parameters "--max-model-len auto --enable-chunked-prefill --enable-prefix-caching --max-num-batched-tokens 65536", garbled characters are displayed in the output of the GLM-5.1-NVFP4 model deployed using the vLLM Docker image.
- [Feat] Add runtime monitor for post-warmup TileLang compilation
- [Bug]: sparse-MLA indexer error with GLM 5.2 NVFP4 on RTX 6000 Pro SM120
- [Kernel][MoE] Integrate TokenSpeed Mxfp4 MOE Kernel
- [Bugfix] Enable DSML structural tag for DeepSeek-V4 with auto + non-strict tools
- [Bug]: Qwen3.5 397B NVFP4 Crashes on B300
- [Kernel] Support batch invariance for WNA16 Marlin MoE
- [Kernel][MoE] Tune block-FP8 fused MoE for low-batch decode
- [Model Runner v2] Enable all moe models for MRv2
- Docs
- Python not yet supported