vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
  - [Bug]: 0.17.1 - vllm serve deepseek-ai/DeepSeek-OCR-2 on H100 crashes during Capturing CUDA graphs (decode, FULL)
  - [Bugfix] Lower spec decode match threshold from 66% to 60% to increase chances of test pass on CI
  - [WIP] SP+AsyncTP piecewise compilation fix + per-matmul heuristic gating
  - [ROCm] Add VLLM_ROCM_W8A8_TRITON_MAX_M env var for CK/Triton GEMM rou…
  - fix: xgrammar structured output crash
  - [MODEL] Cherry-pick: Adding Support for Qwen3.5 Models
  - [Feature] Rework chunk-based processing with torch.scan
  - mm_fp4 trtllm backend leaks padding scales into real rows (use_8x4_sf_layout=True)
  - [renderer][ez] combine render_chat_async, render_chat
  - [Core] Preallocate sampler logits workspace during memory profiling
- Docs
  - Python not yet supported