vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
  - [Usage]: How to launch the Qwen3.5 service using vLLM on a V100 GPU
  - [Bug]: qwen 3.5 model launch get stuck for quite a long time
  - ROCm sometimes compiles problematically on torch.log on MI325
  - fused_moe_kernel opt
  - [Feature]: Parity with CUDA: vLLM router should have ROCm CI
  - [OPT] Optimize the fused moe triton kernel routing expert accumulation
  - [Quantization] Convert NVFP4 weights to FP8 on Hopper for faster inference
  - [Core] Preempt requests with fewer num_computed_tokens to reduce wasted computation
  - [compile] fuse rope and cache insertion for mla
  - [ROCm] Enable dual-stream MoE shared experts, AITER sparse MLA workaround, and GLM-5-FP8 weight loading fix
- Docs
  - Python not yet supported