vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
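For context, vLLM provides an offline-inference Python API alongside its OpenAI-compatible server. Below is a minimal sketch of the offline path; the model name is only a placeholder example, not something specified on this page.

```python
from vllm import LLM, SamplingParams

# Example prompts; any strings work here.
prompts = ["Hello, my name is", "The capital of France is"]

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a model (placeholder name; substitute any model vLLM supports).
llm = LLM(model="facebook/opt-125m")

# Batched generation; vLLM handles scheduling and KV-cache management internally.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```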
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported · 18 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Perf][Kernel] Persistent TopK scheduler: unified CUDAGraph-safe kernel with dynamic per-row dispatch - DeepSeek-V3.2 DSA decode
- [kv_offload+HMA][5/N]: Track group block hashes and block IDs
- [Bugfix] Fix OOM caused by cumem allocator inflating memory_reserved()
- [Feature]: Upstream DGX spark improvements from Avarok-Cybersecurity/dgx-vllm
- [Bug]: responses API, combining of message and tool call
- [Usage]: Qwen3.5-35B-A3B (FP8) with vLLM 0.17.1, the first request takes significantly longer than subsequent requests
- Add SM 120 (RTX Blackwell) MLA attention support
- [Feat][v1] Simple yet General CPU KV Cache Offloading
- [Feature] Support loading LoRA from memory
- [MoE][Offload] Run MoE models exceeding VRAM via expert CPU offloading with GPU cache (--moe-expert-cache-size)
- Docs
- Python not yet supported