vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
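For context, here is a minimal offline-inference sketch using vLLM's Python API; the model name, prompts, and sampling values are illustrative and not taken from this page.

```python
from vllm import LLM, SamplingParams

# Load a Hugging Face-compatible model (model choice here is only an example).
llm = LLM(model="facebook/opt-125m")

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "The capital of France is",
    "vLLM achieves high throughput by",
]

# Batched generation; vLLM schedules requests internally for throughput.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```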
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
14 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [docs] Add lightweight AI assisted contribution policy
- Implement optimal group size calculation for KV cache layers, preferr…
- [Bug]: NVIDIA NGC container nvcr.io/nvidia/vllm:25.12-py3 ships outdated vLLM 0.11.1 instead of 0.13.x
- Add Cogagent Model to vllm
- FlashInfer Unification
- [4/n] Migrate pos_encoding sampler and fused_qknorm_rope to libtorch stable ABI
- [RFC]: Native Weight Syncing APIs
- [Bug]: Prefix cache hit rate remains 0 in multi-round conversation with history of identical prompts.
- [Bug]: vLLM crashed when load testing GLM 4.7 FP8 on H100
- fix(examples): replace unsafe eval() with safe math evaluator in xLAM tool examples
- Docs (Python not yet supported)