vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
- [Bugfix] Correct prompt lengths for timed_traces benchmark
- [Bug]: Silent output corruption: MTP + DCP + FULL_AND_PIECEWISE cudagraphs when the attention backend lacks varlen-decode support under DCP
- [Model] Support top_k and top_p sampling for DiffusionGemma
- fix: reset KV load recompute placeholders
- [Bug] Qwen3CoderToolParser emits no tool calls when the complete call arrives in a single delta
- [PD] [NIXL] NIXL Connector: support hetero block size
- [Bug] [CPU] Serving Model RuntimeError: Worker failed with error 'Current vLLM config is not set.
- Admit MTP/EAGLE spec-decode steps and sliding-window layers into the Triton 3D flash-decoding path (B300, NVFP4)
- Opt-in two-stream + launch-elision latency optimizations for Kimi-K2.6-NVFP4 decode on B300 (TP=4)
- [Bug]: Out of bounds in cp_gather_cache
- Docs
- Python not yet supported