vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
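For context on what the project does, here is a minimal sketch of offline inference with vLLM's Python `LLM`/`SamplingParams` interface; the model id and sampling settings are illustrative assumptions, not taken from this page.

```python
from vllm import LLM, SamplingParams

# Illustrative only: any Hugging Face-compatible model id would work here.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches the prompts and returns one RequestOutput per prompt.
outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    # Each RequestOutput carries the prompt and its generated completions.
    print(out.prompt, out.outputs[0].text)
```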
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
13 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- Add devcontainer configuration file
- [ROCm] [CI] fix test_unrecognized_env (#34350)
- [Feature] Enable uniform KV cache allocation for multi-group HMA models
- [BUG] Reset running requests when clearing cache for pause/resume
- [DO NOT MERGE] Evidence that FlashInfer allreduce_fusion one-shot (kARResidualRMSNorm) causes deterministic NaN corruption and GSM8K collapse
- [KV Connector] Add temporary, off-by-default `VLLM_DISABLE_REQUEST_ID_RANDOMIZATION` workaround
- [Bugfix] Delete unused redundant code in Kimi-K2.5
- [CI] Add GPT-OSS Eval job for H100
- [torch.compile] Remove duplicated split_with_sizes after RoPE
- [Bug Fix] Enable non-gated MoE support in Triton backends (#34356)