vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
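For context before the issue list, vLLM's offline batch API is the quickest way to see the engine in action. A minimal sketch using the documented `LLM`/`SamplingParams` entry points (the model name is just a placeholder; any supported Hugging Face causal LM works):

```python
from vllm import LLM, SamplingParams

# Load a model into the engine; "facebook/opt-125m" is a small placeholder model.
llm = LLM(model="facebook/opt-125m")

# Sampling settings applied to every prompt in the batch.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# vLLM batches and schedules these prompts itself for high throughput.
outputs = llm.generate(["Hello, my name is", "The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```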
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
26 Subscribers
Help out
- Issues
- Fix GPU memory not being released on MRV2 shutdown
- Update quack-kernels requirement from >=0.3.3 to >=0.4.1
- Bump protobuf from 6.33.6 to 7.34.1
- [Bug]: AssertionError on last PP rank with async scheduling + KV offloading
- [Windows] RTX 5070 Ti (Blackwell sm_120) - setup and deployment notes
- [Bug]: Qwen3.6 hybrid Mamba models fail KV cache allocation on RTX PRO 6000 Blackwell + WSL2 — 16 GiB invisible CUDA overhead
- [Bug] Skip KVConnector lookup when request opts out of prefix cache
- [RFC]: Route TorchAO and LLM-Compressor Quantized Inference through zentorch on AMD Zen CPUs
- [Bug]: Batch invariant mode does not work for Kimi K2.6
- [Bugfix] Fix GLM zero-arg streaming tool names