vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
- [Bugfix] Defer re-admission of preempted request with in-flight offloading stores
- [Bug] MiniMax-M2.7 multi-node TP=4: NCCL collective deadlock (all ranks spin at SM~96%/mem=0%/~15W)
- [ROCm][Perf] TP-shard Lightning indexer prefill
- [ROCm] Detect ROCm via KFD topology when amdsmi cannot enumerate GPUs
- [ROCm][Perf] MXFP8 dense-linear + grouped-MoE GEMM optimizations for MiniMax-M3
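- [Bug]: This is a known bug where MTP speculative decoding conflicts with structured output (grammar). The speculative tokens generated by MTP include tokens the grammar FSM cannot handle (such as special token 248069), causing the FSM to reject the entire request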
- [ROCm][Perf] FlyDSL BF16 MoE for MiniMax-M3 MXFP8 emulation (gfx942) via --moe-backend aiter
- [Bugfix] Make Kimi's tool parser accept numeric only tool call IDs
- [Misc] Add unit test for ep_gather kernel
- [Bug]: Segfault in DNNL matmul during mixed-batch GQA after #43032 (v0.23.0 regression)
- Docs
- Python not yet supported