vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
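For context on what the project does, here is a minimal sketch of vLLM's offline generation API. The model name and sampling settings are illustrative placeholders, not something taken from this page:

```python
from vllm import LLM, SamplingParams

# Load a model into the engine (model name is illustrative).
llm = LLM(model="facebook/opt-125m")

# Sampling settings here are arbitrary example values.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches prompts through the engine for high-throughput inference.
outputs = llm.generate(["Hello, my name is"], params)
for out in outputs:
    print(out.outputs[0].text)
```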
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
18 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Core] Add register_model() to KVConnectorBase_V1 for CacheBlend
- Automatically add links to API docs for matching strings in docs
- [Bugfix] [Frontend] Responses API, fix merging of message and tool call
- [Models][GDN] Prevent D2H sync in `ChunkGatedDeltaRule`
- [KV Offload] Unified memory layout for offloading workers
- Add local-runtime CLI, launcher install flow, and easy model management
- [Bug]: Error during inference; the model shut down. Deployed model: Qwen3.5-122B-A10B-FP8
- [Perf][Kernel] Persistent TopK scheduler: unified CUDAGraph-safe kernel with dynamic per-row dispatch - DeepSeek-V3.2 DSA decode
- [kv_offload+HMA][5/N]: Track group block hashes and block IDs
- [Feat][v1] Simple yet General CPU KV Cache Offloading
- Docs: Python not yet supported