vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
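For context, vLLM is used both as a Python library for offline batch inference and as an OpenAI-compatible server. The snippet below is a minimal sketch of the offline inference path using the library's public `LLM`/`SamplingParams` API; the model name and sampling settings are illustrative, not prescribed by this page.

```python
# Minimal sketch of offline inference with vLLM.
# The model name below is an illustrative small model, not a recommendation.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

llm = LLM(model="facebook/opt-125m")  # load the model into the engine
outputs = llm.generate(prompts, sampling_params)  # batched generation

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```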
Help out
- Issues
- [Bug][flashinfer 0.6.8]: worker hang on Qwen3.5-397B-A17B-NVFP4 EP=8 (B200 SM100) - bisected to flashinfer-python 0.6.7 → 0.6.8.post1
- [Bugfix][Elastic EP] Unify warm+capture across scale-up and scale-down
- [Usage]: How to proactively clear CPU-resident memory left behind by unloaded LoRA adapters after calling `/v1/unload_lora_adapter`?
- [Bug]: Streaming chat completion drops partial content when stop string interrupts auto tool parsing
- [Security] Blocked by CVE-2025-30165 & CVE-2024-11041 (Legacy V0 Engine)
- [Bug]: benchmark_serving_multi_turn.py deadlocks after clients exit when --max-num-requests is used
- [Bug]: Qwen3.5-397B-NVFP4 Disagg accuracy gsm8k collapses with async scheduling
- [CI Failure]: mi300_4: Distributed Torchrun + Examples (4 GPUs)
- [CI Failure]: mi300_2: Distributed Compile Unit Tests (2xH100-2xMI300)
- [Bug]: compiling from source crashes the PC
- Docs
- Doc triage for Python is not yet supported