vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
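For orientation, a minimal sketch of vLLM's offline inference API (the `LLM` and `SamplingParams` classes); the prompts, sampling values, and model name below are illustrative placeholders, not project defaults.

```python
from vllm import LLM, SamplingParams

# Example prompts and sampling settings (values are illustrative).
prompts = [
    "The capital of France is",
    "Explain KV caching in one sentence:",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a model and run batched generation; the model name is a placeholder.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```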
Help out
- Issues
- [Feat] Make EPLB max expert redundancy configurable
- [Platform][XPU] Opt-in integrated-GPU override for unified memory
- [Bug]: Engine crashes with AssertionError when prompt exceeds auto-fitted max_model_len (admission check missing)
- [Bugfix] Fix hybrid KV manager for quantized per-token-head KV cache
- Revert "Fix MoE backend selection for LoRA (unquantized MoE)" (#40273)
- [Feat][KVConnector] Prepend offloaded blocks on offloading complete for lazy mode in simple cpu offloader
- [ROCm] Allow Triton MXFP4 MoE support checks on gfx11xx
- [Bug]: MoRI Connector hangs at >=128 concurrency
- [Docs] [Misc] Add SIG list table in community governance process
- [Bug]: MTP draft head TP allgather deadlock under sustained long-context load (GLM-5.1-FP8)
- Docs (not yet supported for Python)