vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
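As quick orientation for the description above, here is a minimal sketch of offline generation with the vllm Python package; the model name and prompt are arbitrary examples, not anything this page specifies:

```python
from vllm import LLM, SamplingParams

# Minimal offline-inference sketch. The model name is an arbitrary example;
# any Hugging Face-compatible model identifier can be used instead.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Generate completions for a batch of prompts and print them.
outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.prompt, out.outputs[0].text)
```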
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
18 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- Canonical KV Cache Allocation for HMA Models
- [MoE] Move nixl_ep and mori prepare/finalize into fused_moe/prepare_finalize/
- [Bugfix] Fix MiniMax-M2 parser failing to validate function names
- Fix missing logprobs for <tool_call> in streaming chat completions (#37737)
- [CI/Build] Fix LMCache KV connector install path for CUDA 13 images
- [Bug] FlashInfer + MTP speculative decoding crashes on SM121 (DGX Spark) with GQA=16 model
- [Bug]: FLASHINFER_CUTLASS and FLASHINFER_TRTLLM do not work for Qwen3.5 Bf16 DP/EP
- [Bug]: [OOM] DeepSeek-R1 Out of Memory
- [Bug]: CUDA 13 LMCache KV connector install path still resolves CUDA 12 artifacts
- [MoE] Move GPT OSS Triton kernel experts into fused_moe/experts/
- Docs: Python not yet supported