vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
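For context, vLLM exposes an offline inference API alongside its OpenAI-compatible server. A minimal sketch of the offline path, following the project's quickstart-style `LLM` entry point (the small model name is chosen purely for illustration):

```python
from vllm import LLM, SamplingParams

# Prompts and sampling settings for a tiny demo run.
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Loading the engine allocates model weights and the KV cache up front.
llm = LLM(model="facebook/opt-125m")  # illustrative model choice

# generate() batches the prompts through the engine's scheduler.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```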
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
18 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- Replace cuda_device_count_stateless() with current_platform.device_count() (see the sketch at the end of this page)
- [XPU] Enable topk_per_row and indexer_quant_cache kernels for DeepSeekV3.2 and GLM5
- [MLAAttention] Clear Cudagraph padded region of FI decode Attention kernel
- [Feature] Add prefill node failure detection and health query endpoint for MooncakeConnector Proxy
- [Bugfix][Perf]: avoid Range allocation and dict hashing in _find_range_for_shape
- [TEST ONLY] Restore AsyncTP fusion for FlashInfer FP8 BMM (advice needed) #27893
- [Reasoning][Frontend] Add structural tag support for reasoning parser via sampling params
- Canonical KV Cache Allocation for HMA Models
- [UX] Logging - Improve Startup Error Logs
- [Perf] Use torch compile to fuse pack topk in trtllm moe
- Docs
- Python not yet supported
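As referenced in the first issue above, here is a minimal sketch of the proposed device-count migration. It assumes `current_platform` lives in `vllm.platforms` and gains a `device_count()` method as the issue title suggests; the exact module paths and the old helper's location are assumptions, not confirmed API:

```python
# Before (CUDA-specific helper; module path assumed):
# from vllm.utils import cuda_device_count_stateless
# num_devices = cuda_device_count_stateless()

# After (platform-agnostic dispatch, so the same call works on
# CUDA, ROCm, XPU, etc.; method name taken from the issue title):
from vllm.platforms import current_platform

num_devices = current_platform.device_count()
print(f"Visible accelerator devices: {num_devices}")
```

The point of the change is to route device queries through the platform abstraction rather than a CUDA-only utility, so non-CUDA backends resolve the count through their own plugins.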