vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
- [Bugfix] Hybrid Mamba + KV connector: reconcile diverged per-group prefix hits instead of `max()`/trim
- [Bugfix][TurboQuant] Add continuation guard to fast-path to prevent prefix K/V loss under prefix caching
- [Bug]: openai_gptoss reasoning parser raises HarmonyError during streamed chat completions
- [ROCm][Test] xfail fused TRITON MXFP4 MoE accuracy on gfx950
- [ROCm] Add regression test for vision encoder MATH SDPA backend
- [ROCm][Perf] Fused shared expert for Minimax M3
- [Misc] Add unit tests for five untested utility modules
- [Scheduler] Fix KeyError in PP2 when last stage finishes request via tool-call parser
- [ROCm] honour skip_kv_gather in AITER MLA sparse prefill chunk loop (#40018)
- [Kernel][SM120] NVFP4 grouped MoE: pingpong schedule at large per-exp…
- Docs
- Python not yet supported