vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
- [Bug]: MoE + --enable-sleep-mode OOM during weight load — bisected to #41268, root cause in cumem MemPool reclaim (pytorch#159674)
- [ROCm][Perf] Enable AR+RMS fusion for GemmaRMSNorm models
- [Bug]: NVCC compilation error when launching DeepSeek-V4-Flash on H100
- [Perf] Reduce MTP decode bubbles for Qwen3.5 hybrid models
- [Bugfix] Fix Eagle3 spec-decode with mismatched target/draft hidden size
- [XPU] Fix FP8 block-scaled scheme selection on non-CUDA platforms
- [Bug]: Qwen3-1.7B silent correctness regression in vLLM 0.21.0: TP=2/4 and Triton attention produce wrong answer
- [Bug][Deepseek v4][DBO]: AssertionError: positions is required for C128A metadata build File
- [Bug]: KeyError: 'layers.0.mlp.gate_up_proj.g_idx' of GLM-OCR GPTQ Int8 in v0.21.1rc1
- [Bug]: gpt-oss-120b MXFP4 MoE init OOM-killed on unified-memory ARM (DGX Spark / Jetson Thor)
- Docs
- Python not yet supported