vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Issues
- [Bugfix] Fix GLM tool-call finish chunk suffix alignment in streaming
- [Bug]: Tool calling breaks after enabling MTP when deploying GLM-4.7 on A2 with 0.17.0rc1
- [Bug]: Shared Expert output is incorrect under Sequence Parallel MoE (EP + TP > 1 + DP > 1) for Qwen3.5 MoE models
- [Bug] UVA CPU offload completely broken on WSL with NVFP4 MoE (Qwen3.5-35B-A3B): three distinct crashes across all parameter combinations
- [Bug]: Engine V1 crash with WorkerProc leaked shared_memory on multi-GPU setup
- [Usage]: Unable to run Qwen3-14B with vLLM (multiple issues)
- [Usage]: Using the NIXL Connector on an AMD GPU
- [Bug]: Minimax-M2.5 on version 0.17.0 raises a KeyError when pipeline parallelism (PP) is greater than or equal to 2
- [RFC]: Prefill Context Parallel for Qwen3.5 Hybrid Attention
- [SpecDecode] Add shortcut in rejection sampler for greedy sampling