vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
- [Build] Use no-guess-dev version scheme to fix version mismatch on release branches
- [Feature]: Batch-invariant support for GDN_ATTN (Qwen3-Next / Qwen3.6 hybrid Mamba+GDN MoE models)
- Resolve silu mul quant padded NaN corruption correctness
- [Test][1/N] Add Platform-Aware Test Skip Mechanism
- [SM120] _dummy_sampler_run hangs indefinitely on RTX 5090 due to top_k=vocab_size-1 triggering an SM120-broken top-k masking kernel (one-line fix)
- [Bug]: Poor Qwen3.5 NVFP4 disagg GSM8K accuracy with 2p1d (2xTEP8 prefill, 1xDEP8 decode)
- [Bug]: vLLM wheel version mismatch
- Make code can be built by clang
- [Bug]: Prefix-cache 0% hit on re-sent request — DeepSeek-V4-Flash hybrid groups lose all first-block cache keys on every request reassignment (DSv4 variant of #32802)
- [Bugfix] Reject non-object JSON bodies with HTTP 400
- Docs
- Python not yet supported