vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
25 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Bug]: Index out of bounds in TurboQuant KV Cache kernel with Qwen3 during high-concurrency 32k context benchmark
- [Performance]: Large fluctuations in vllm 19.0 online server benchmark results
- [Bug]: Inconsistent KV Cache reporting and system hang on long context requests (Gemma-4 26B AWQ Int4)
- [Tracking Issue]: NIXL P/D Disaggregation for Hybrid Models
- Optimize LoRA index lookup from O(n) to O(1) in convert_mapping (sketched below)
- Replace O(n log n) sort with O(n) direct indexing in ubatch result ordering (sketched below)
- Optimize detokenizer string concatenation from O(n^2) to O(n) (sketched below)
- [Bugfix] Increase bench_serve aiohttp read buffer for long-context throughput (sketched below)
- fix: allow max_num_batched_tokens > max_model_len with chunked prefill (sketched below)
- [Bug]: TurboQuant keeps failing with TRITON_ATTN: 'kv_cache_dtype not supported'
- Docs
- Python not yet supported
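Several of the titles above name textbook optimization patterns, sketched below. First, the LoRA index lookup: calling `list.index()` inside a loop costs O(n) per call, while a precomputed reverse dictionary gives O(1) lookups. This is a minimal illustration of the pattern only; the names are hypothetical and do not come from vLLM's actual `convert_mapping` code.

```python
# Hypothetical illustration of the O(n) -> O(1) lookup pattern;
# names here do not correspond to vLLM's actual convert_mapping internals.

lora_ids = [17, 4, 99, 23]          # active adapter IDs; position = slot
requests = [99, 4, 4, 17, 23, 99]   # adapter ID requested per sequence

# O(n) per lookup: list.index scans the list every call -> O(n*m) total.
slow = [lora_ids.index(r) for r in requests]

# O(1) per lookup: build the reverse map once, then use dict access.
id_to_slot = {lora_id: slot for slot, lora_id in enumerate(lora_ids)}
fast = [id_to_slot[r] for r in requests]

assert slow == fast  # [2, 1, 1, 0, 3, 2]
```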
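Second, the ubatch result ordering: when every result already carries its destination index, an O(n log n) sort can be replaced by O(n) direct placement into a preallocated list. Again a hypothetical sketch, not vLLM's actual ubatch code:

```python
# Hypothetical sketch of replacing an O(n log n) sort with O(n) placement;
# not the actual vLLM ubatch result-ordering code.

results = [(2, "c"), (0, "a"), (3, "d"), (1, "b")]  # (original_index, payload)

# O(n log n): sort by original index, then drop it.
sorted_out = [payload for _, payload in sorted(results)]

# O(n): the indices are a permutation of range(len), so write each
# payload straight into its slot in a preallocated output list.
direct_out = [None] * len(results)
for idx, payload in results:
    direct_out[idx] = payload

assert sorted_out == direct_out == ["a", "b", "c", "d"]
```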
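Third, the detokenizer fix targets the classic quadratic string-append pitfall: repeated `s += piece` can copy the whole accumulated string on each step, whereas buffering pieces and joining once is linear. An illustrative sketch with made-up data:

```python
# Hypothetical sketch of the O(n^2) -> O(n) concatenation fix;
# not vLLM's actual detokenizer code.

pieces = ["Hello", ",", " ", "world", "!"]  # e.g. incrementally decoded text

# Worst-case O(n^2): each += may copy the whole accumulated string.
# (CPython sometimes optimizes this in place, but it is not guaranteed.)
text = ""
for p in pieces:
    text += p

# O(n): collect the pieces and build the final string in one pass.
text_fast = "".join(pieces)

assert text == text_fast == "Hello, world!"
```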
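Fourth, the bench_serve fix concerns aiohttp's per-connection read buffer. `aiohttp.ClientSession` accepts a `read_bufsize` argument (64 KiB by default), and long-context responses can bottleneck on a small buffer. The URL and buffer size below are illustrative assumptions, not values from the actual PR:

```python
# Hypothetical sketch of enlarging aiohttp's read buffer for long responses;
# the URL and buffer size are illustrative, not taken from the PR.
import asyncio
import aiohttp

async def fetch(url: str) -> str:
    # read_bufsize defaults to 2**16 (64 KiB); a larger buffer avoids
    # many small buffered reads on very long completion bodies.
    async with aiohttp.ClientSession(read_bufsize=2**20) as session:
        async with session.get(url) as resp:
            return await resp.text()

if __name__ == "__main__":
    print(asyncio.run(fetch("http://localhost:8000/health")))
```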
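Finally, the chunked-prefill fix permits a scheduler token budget larger than the model's context length. The sketch below shows how such a configuration might be expressed through vLLM's engine arguments; the model name and numbers are placeholders, and the exact argument validation depends on the vLLM version.

```python
# Hypothetical configuration exercising the combination the fix allows;
# values are placeholders, argument names follow vLLM's EngineArgs.
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",     # placeholder model
    max_model_len=2048,            # per-request context window
    enable_chunked_prefill=True,   # split long prefills into chunks
    max_num_batched_tokens=4096,   # batch token budget > max_model_len
)
```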