vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
  - Revert "[bugfix]Indexer init skip and MTP TopK share for iteration" (#45895)
  - Revert "[Bugfix] Fix corrupt outputs in MoE FP8 LoRA responses and MoE base model responses when LoRAs are loaded" (#42120)
  - [Bugfix] Forward upstream error message in Anthropic streaming converter (#46028)
  - docs: clarify uv-specific install options
  - [Misc] Add unit test for _fwd_kernel_ep_scatter_1 and _fwd_kernel_ep_…
  - [Config] Reject negative values for max_logprobs and long_prefill_token_threshold
  - [Misc] Add unit test for per_token_quant_int8 kernel
  - [Bug]: GLM-5.2 (DSA sparse MLA) + fp8_ds_mla — sparse indexer off-by-one crashes concurrent decode at max_model_len >= ~325K
  - [Bugfix] Defer re-admission of preempted request with in-flight offloading stores
  - [Bug] MiniMax-M2.7 multi-node TP=4: NCCL collective deadlock (all ranks spin at SM~96%/mem=0%/~15W)
- Docs
  - Python not yet supported