vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Issues:
- [Tracking Issue]: NIXL P/D Disaggregation for Hybrid Models
- Optimize LoRA index lookup from O(n) to O(1) in convert_mapping
- Replace O(n log n) sort with O(n) direct indexing in ubatch result ordering
- Optimize detokenizer string concatenation from O(n^2) to O(n)
- [Bugfix] Increase bench_serve aiohttp read buffer for long-context throughput
- fix: allow max_num_batched_tokens > max_model_len with chunked prefill
- [Bug]: Turbo Quant keep failing TRITON_ATTN 'kv_cache_dtype not supported'
- [Bug]: Gemma4 multimodal: missing vision-aware bidirectional attention mask for use_bidirectional_attention="vision" models
- [CI Failure]: Plugin Tests (2 GPUs)
- [CI Failure]: Multi-Modal Processor (CPU)
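Several of the issue titles above describe standard complexity fixes. The LoRA index lookup one, for example, is the classic replacement of a linear list scan with a precomputed dictionary. A minimal sketch of that pattern, with hypothetical names (this is not vLLM's actual `convert_mapping` code):

```python
# Illustrative only: function and variable names are hypothetical,
# not taken from vLLM's source.

def convert_mapping_linear(lora_ids, requests):
    # O(n) per request: list.index rescans the whole list every time.
    return [lora_ids.index(r) for r in requests]

def convert_mapping_constant(lora_ids, requests):
    # Build the id -> slot index once; each subsequent lookup is O(1) on average.
    id_to_slot = {lora_id: slot for slot, lora_id in enumerate(lora_ids)}
    return [id_to_slot[r] for r in requests]
```

Both functions return the same mapping; only the asymptotic cost differs once the request list grows.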
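The ubatch result-ordering title refers to another common pattern: when each result already carries its original position, sorting is unnecessary; values can be placed directly into an output array. A generic sketch under that assumption (names are illustrative, not vLLM's code):

```python
# Illustrative only: "results" is assumed to be a list of
# (original_index, value) pairs, as the issue title implies.

def reorder_by_sort(results):
    # O(n log n): sort pairs by original index, then strip the index.
    return [value for _, value in sorted(results)]

def reorder_direct(results):
    # O(n): write each value straight into its original slot.
    out = [None] * len(results)
    for index, value in results:
        out[index] = value
    return out
```

Direct indexing only works because the indices are a permutation of 0..n-1; when that invariant holds, the sort buys nothing.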
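The detokenizer concatenation title describes the textbook string-building fix: repeatedly appending to an immutable string can copy the accumulated text on every step, while collecting pieces and joining once is linear. A minimal sketch (not vLLM's actual detokenizer):

```python
# Illustrative only: a toy "detokenizer" that concatenates decoded pieces.

def detokenize_quadratic(pieces):
    # Worst case O(n^2): each += may copy everything accumulated so far.
    text = ""
    for piece in pieces:
        text += piece
    return text

def detokenize_linear(pieces):
    # O(n): one allocation and one pass over all pieces.
    return "".join(pieces)
```

For long-context decoding, where thousands of pieces accumulate per request, the difference between the two is substantial.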