vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
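As a quick illustration of the engine described above, here is a minimal offline-inference sketch using vLLM's Python API. The model ID and sampling settings are illustrative placeholders only, not recommendations from this page.

    # Minimal offline-inference sketch with vLLM (model ID and sampling
    # settings are illustrative placeholders).
    from vllm import LLM, SamplingParams

    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    # Loads the model weights and pre-allocates the paged KV cache.
    llm = LLM(model="facebook/opt-125m")

    # Generates completions for all prompts in one batched call.
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(output.prompt, "->", output.outputs[0].text)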
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
25 Subscribers
Help out
- Issues
- [Usage]: How to launch the Qwen3.5 service using vLLM on a V100 GPU
- [Bug]: qwen 3.5 model launch gets stuck for quite a long time
- [Bug]: CUDA assert in triton attention for MolmoWeb models (Molmo2 architecture with different max_position_embeddings)
- [Bug]: Jamba tool parser crashes on Mistral-style [TOOL_CALLS] models with standard HF tokenizer (e.g., Apriel-Nemotron-15b)
- [Bug]: parity with CUDA: ROCm nightly & release docker images aren't built with Pollara AINIC or Broadcom Thor-2 NICs
- ROCm sometimes compiles problematically on torch.log on MI325
- Fix Kimi-K2.5 accuracy when Aiter MLA FP8 PS + CUDA graphs are used
- [XPU] Enable group_size=-1/channel-wise for w4a16 and w4a8
- fused_moe_kernel opt
- [Feature]: Parity with CUDA: vLLM router should have ROCm CI
- Docs
- Python not yet supported