vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Python not yet supported · 25 Subscribers
Help out
- Issues
- [Bug]: High concurrency, or enabling eagle3 with streaming + function call, can trigger accuracy issues on Kimi K2.5
- [Bug]: HunyuanOCR crashes with "query and key must have the same dtype" during inference (vLLM 0.19.0, RTX 3050)
- nixl refactor [3/N]: extract model-specific logic into ModelBlockTransferPolicy
- [Bugfix] GLM tool parser: fix streaming corruption for Optional[str]/array args
- [torch.compile] refactor config hashing through compile_factors and normalization
- [Core] Optimize sliding-window cache hit search in SlidingWindowManager
- [Perf] [Hybrid] Fused Triton kernel for GPU-side Mamba state postprocessing
- [Core] Add --deterministic-prefix-caching for reproducible prefill on ROCm
- [Kernel] Support fused_moe tuning with gemma-4-26B-A4B-it
- [Distributed] Add MSCCL++ allreduce support for multi-node communication
- Docs