vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
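For context on what the engine does, below is a minimal sketch of vLLM's offline inference API (it also ships an OpenAI-compatible server). The model name is only a placeholder example; any Hugging Face model supported by vLLM works.

```python
from vllm import LLM, SamplingParams

# Load a model into the engine (placeholder model; swap in your own).
llm = LLM(model="facebook/opt-125m")

# Sampling configuration for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["Hello, my name is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    # Each RequestOutput carries the prompt and one or more completions.
    print(output.prompt, output.outputs[0].text)
```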
- Issues
- [Frontend] Remove pooling multi-task support. (Hold off until v0.20.0)
- Replace cuda_device_count_stateless() with current_platform.device_count() (see the sketch after this list)
- [Frontend] Use unified Parser for non-streaming chat completions
- [XPU] Enable topk_per_row and indexer_quant_cache kernels for DeepSeekV3.2 and GLM5
- [MLAAttention] Clear Cudagraph padded region of FI decode Attention kernel
- Add sanitized MiniMax M2 parser variant for path-like output
- [Feature] RIY: Runtime expert pruning for MoE models
- [Feature] Add prefill node failure detection and health query endpoint for MooncakeConnector Proxy
- [Bug]: Intel ARC 140v not supported as XE2 cutlass kernel
- [Bugfix][Perf]: avoid Range allocation and dict hashing in _find_range_for_shape
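To illustrate the device-count item above, here is a hypothetical before/after sketch of the refactor. The import paths follow recent vLLM layout but may vary by version, and Platform.device_count() is assumed to exist, as the issue title implies.

```python
# Before: CUDA-specific helper (exists in vllm.utils).
from vllm.utils import cuda_device_count_stateless

num_devices = cuda_device_count_stateless()

# After: platform abstraction, so non-CUDA backends (XPU, ROCm, ...)
# report their device count too. Assumes current_platform.device_count()
# is available, per the issue title.
from vllm.platforms import current_platform

num_devices = current_platform.device_count()
```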