vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
- Issues
- DeepGEMM SiLU/mul FP8 quant Triton kernel overflows int32 addresses for large DPEP warmup shapes
- [Bug]: FP8 KV cache corrupts output in Qwen3.5-397B-NVFP4 Disagg serving
- [RFC]: Cache-affinity-aware request ordering for the V1 scheduler
- [Bug][flashinfer 0.6.8]: worker hang on Qwen3.5-397B-A17B-NVFP4 EP=8 (B200 SM100) - bisected to flashinfer-python 0.6.7 → 0.6.8.post1
- [Bugfix][Elastic EP] Unify warm+capture across scale-up and scale-down
- [Usage]: How to proactively clear CPU-resident memory left behind by unloaded LoRA adapters after calling `/v1/unload_lora_adapter`?
- [Bug]: Streaming chat completion drops partial content when stop string interrupts auto tool parsing
- [Security] Blocked by CVE-2025-30165 & CVE-2024-11041 (Legacy V0 Engine)
- [Bug]: benchmark_serving_multi_turn.py deadlocks after clients exit when --max-num-requests is used
- [Bug]: Qwen3.5-397B-NVFP4 Disagg accuracy gsm8k collapses with async scheduling