vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
18 Subscribers
Help out
- Issues
- [Bugfix] Decode prompt text from token IDs upstream in renderer
- [KV Offload] Unified memory layout for offloading workers
- Add local-runtime CLI, launcher install flow, and easy model management
- [Bug]: Error raised during inference and the model shut down. Deployed model: Qwen3.5-122B-A10B-FP8
- [Perf][Kernel] Persistent TopK scheduler: unified CUDAGraph-safe kernel with dynamic per-row dispatch - DeepSeek-V3.2 DSA decode
- [kv_offload+HMA][5/N]: Track group block hashes and block IDs
- [Bugfix] Fix OOM caused by cumem allocator inflating memory_reserved()
- [Feature]: Upstream DGX spark improvements from Avarok-Cybersecurity/dgx-vllm
- [Bug]: responses API, combining of message and tool call
- [Usage]: Qwen3.5-35B-A3B (FP8) with vLLM 0.17.1, the first request takes significantly longer than subsequent requests
- Docs
- Python not yet supported