vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
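For orientation, here is a minimal offline-inference sketch using the Python API from vLLM's quickstart; the model name and sampling values are illustrative only, not recommendations.

```python
# Minimal sketch, assuming `pip install vllm` and a supported model.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any Hugging Face model vLLM supports
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches prompts and returns one RequestOutput per prompt.
outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```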
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes instead and supercharge your commit history.
Python not yet supported
18 Subscribers
Help out
- Issues
  - [Bugfix] Add dimension alignment check to Marlin MoE kernel selection
  - [Bugfix] Handle reasoning_effort="none" for Harmony models instead of crashing
  - [Bugfix] [Frontend] responses api, refactored simple event streaming
  - Revert "Various Transformers v5 fixes" (#38127)
  - [Bugfix][Frontend] Return 400 for corrupt/truncated image inputs instead of 500
  - [Optimization] Fuse mamba_get_block_table_tensor in align mode
  - [Metrics] Add labeled token waiting metrics for precise load balancing
  - [GGUF Kernel] Remove artificial 255 expert limit to support models with more experts
  - [Frontend] Add /v1/responses/render endpoint and refactor responses preprocessing
  - [Bug]: Qwen3.5 LoRA module is not in model's supported LoRA target modules
- Docs
  - Python not yet supported