vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
- Issues
- [Windows] RTX 5070 Ti (Blackwell sm_120) - setup and deployment notes
- [Bug]: Qwen3.6 hybrid Mamba models fail KV cache allocation on RTX PRO 6000 Blackwell + WSL2 — 16 GiB invisible CUDA overhead
- [Bug] Skip KVConnector lookup when request opts out of prefix cache
- [RFC]: Route TorchAO and LLM-Compressor Quantized Inference through zentorch on AMD Zen CPUs
- [Bug]: Batch invariant mode does not work for Kimi K2.6
- [Bugfix] Fix GLM zero-arg streaming tool names
- Fix Gemma4 TritonAttention buffer mismatch for heterogeneous head_dim (#41656)
- fix: handle detached HEAD in precompiled wheel commit resolution
- [Bug]: CPUWorker shutdown reports "RuntimeError: Cannot access accelerator device when none is available."
- [Bug]: XPU TP=2 on dual Intel Arc Pro B70 (Battlemage): GP fault + xe BCS engine reset reproduces in intel/vllm:0.17.0-xpu on Ubuntu 24.04 HWE 6.17