vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
- [Model] Support top_k and top_p sampling for DiffusionGemma
- fix: reset KV load recompute placeholders
- [Bug] Qwen3CoderToolParser emits no tool calls when the complete call arrives in a single delta
- [PD] [NIXL] NIXL Connector: support hetero block size
- [Bug] [CPU] Serving Model RuntimeError: Worker failed with error 'Current vLLM config is not set.
- Admit MTP/EAGLE spec-decode steps and sliding-window layers into the Triton 3D flash-decoding path (B300, NVFP4)
- Opt-in two-stream + launch-elision latency optimizations for Kimi-K2.6-NVFP4 decode on B300 (TP=4)
- [Bug]: Out of bounds in cp_gather_cache
- [Feature]: make the memory profiling assert configurable to enable parallel testing reusing gpus
- [Bug]: `cudaErrorLaunchFailure` in worker under concurrent native KV offloading with DeepSeek-V4-Flash (frequency depends on kv-cache dtype)
- Docs
- Python not yet supported