vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
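To illustrate what the engine does, here is a minimal sketch of offline batch inference with vLLM's Python API; the model name, prompts, and sampling settings are placeholder assumptions, not taken from this page.

```python
from vllm import LLM, SamplingParams

# Placeholder prompts and model; swap in your own.
prompts = [
    "The capital of France is",
    "Explain paged attention in one sentence:",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# The engine handles batching, KV-cache memory management, and scheduling internally.
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```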
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes instead and supercharge your commit history.
Python not yet supported
18 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Bugfix][MLA] Add logits size budget to sparse indexer prefill chunking
- [BugFix] Ensure contiguous input tensor in LoRA shrink kernel
- [Bug]: Symbol not found: __ZN3c1013MessageLoggerC1EPKcii, referenced from <6A12389C-7A10-3CA4-BEDF-893991822933> /opt/anaconda3/envs/vllm-inference/lib/python3.11/site-packages/vllm/_C.abi3.so
- [Feature]: Support CUDAGraphMode.FULL with ChunkedLocalAttention for Llama4 models
- [Bug]: vllm serve quantized GLM-5 failed
- [Bug]: Two-node deployment of kimi2-5, runtime crash
- [Usage]: MoE flatten_tp_size should not unconditionally include dp_size — DP loses its original semantics for MoE layers
- [RFC]: Support quarot for eagle3
- [Bugfix] Remove incorrect assertion blocking mixed decode+spec-decode batches in GDN attention
- [Bugfix] Fix MoE flatten_tp_size unconditionally including dp_size
- Docs
- Python not yet supported