vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
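For context on what the project does, here is a minimal offline-inference sketch using vLLM's Python API. The model identifier (facebook/opt-125m) and the sampling settings are illustrative assumptions, not part of this page.

```python
# Minimal offline-inference sketch with vLLM's Python API.
# The model id below is an assumption; any Hugging Face-compatible model works.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "A high-throughput inference engine should",
]

# Sampling parameters: temperature and maximum number of generated tokens.
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# The LLM class loads the model and manages KV-cache memory for batched decoding.
llm = LLM(model="facebook/opt-125m")

# generate() batches the prompts and returns one RequestOutput per prompt.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

For online serving, the same engine can be started as an OpenAI-compatible HTTP server with `vllm serve <model>`.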
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
13 Subscribers
Help out
Issues
- [Kernel] [Helion] [7/N] Use HOP to represent Helion Kernel call to enable fx tracing and pattern matching
- [torch.compile] Disable ar-rms fusion for ds3-fp4
- [CI Failure]: Distributed Tests (8 GPUs)(H100)
- [CPU Offloading] Add offloading connector scheduler load delay metric
- [Bug Fix] Fix MambaManager.cache_blocks() crash on null blocks in align mode
- [CI Failure]: mi325_8: LoRA Test %N
- Add group quantization support to fused FP8 RMSNorm quant kernels
- fix: return HTTP 413 when request exceeds max context length
- [Custom Ops] Add functional + out variant for scaled_fp4_quant
- [Bug]: AR+rms+fp4 fusion results in total accuracy collapse for DSV3-fp4
Docs
- Python not yet supported