vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
- [Kernel] Support UE8M0 scales in fused SiLU block quant
- [Bug]: OpenAIServingChat silently requires new openai_serving_render kwarg since v0.18
- Fix OLMo3 sliding attention RoPE parameters
- [RFC]: Porting compiler fusions to manual fusion
- [CompressedTensors] FP4 Qutlass Integration
- [Bug] PR #36138 grammar-mask spec-decode fix doesn't handle multi-token reasoning boundaries (gpt-oss/openai_gptoss still bleeds; Qwen3 fixed)
- fix: add WSL UVA fallback support for buffer utils
- [Bug]: TurboQuant workspace locked at 3.06 MB — continuation_prefill requires 12 MB on any prompt >4096 tokens (Qwen3.6-27B NVFP4 hybrid, Blackwell SM120)
- [KV Connector][3/N][NIXL] Per-layer-name HMA routing for hybrid (Mamba/SSM) models under PP
- [Bug]: Qwen3.5 397B model occurs assertion error during allocating new blocks
- Docs
- Python not yet supported