vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
- [Bug]: responses API, combining of message and tool call
- [Usage]: Qwen3.5-35B-A3B (FP8) with vLLM 0.17.1, the first request takes significantly longer than subsequent requests
- [SM120][GLM-5.1] NVFP4 DCP/MTP stack tracker
- [Bug]: [ROCm][gfx1151] Engine Core segfaults in libhsa-runtime64.so when loading Qwen3-VL-32B-AWQ on AMD Ryzen AI MAX+ 395
- fix(v1/prefix-cache): prevent KV-cache pollution from same-step block registration
- fix: handle missing parameters gracefully in weight offloading
- [Hardware] replace torch.cuda.Stream with torch.Stream
- [ROCm][Quantization] Add Quark W4A16 MXFP4 A16 for LinearLayer
- [RFC]: Active Coordination and Two-Zone Scheduling Mechanism for KV Cache in Long-Running Agents
- [Perf] consolidating, vectorizing and cleaning up CUDA/HIP implementations of custom ops.
- Docs
- Python not yet supported