vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
- fix: handle escaped <\\think> tags in reasoning parser (closes #36207)
- feat(openai): add per-request timing metrics and completion_tokens_de…
- [Bug]: Inconsistent PP layer indexing in EAGLE model code
- [BugFix] Ensure contiguous input tensor in LoRA shrink kernel
- [Bug]: Symbol not found: __ZN3c1013MessageLoggerC1EPKcii\n Referenced from: <6A12389C-7A10-3CA4-BEDF-893991822933> /opt/anaconda3/envs/vllm-inference/lib/python3.11/site-packages/vllm/_C.abi3.so\n
- [Feature]: Support CUDAGraphMode.FULL with ChunkedLocalAttention for Llama4 models
- [Bug]: vllm serve quantized GLM-5 failed
- [Bug]: Two-node deployment of kimi2-5, runtime crash
- [Usage]: MoE flatten_tp_size should not unconditionally include dp_size: DP loses its original semantics for MoE layers
- [RFC]: Support quarot for eagle3
- Docs
- Python not yet supported