vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
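For context, a minimal offline-inference sketch with vLLM's Python API (a sketch based on the project's documented `LLM`/`SamplingParams` interface; the model name is just an example):

```python
# Minimal vLLM offline inference sketch.
# Assumes `pip install vllm` and a supported GPU; the model is an example.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# LLM loads the model and manages the paged KV cache internally.
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```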
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
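If you'd rather pull open issues yourself, here is a minimal sketch using GitHub's public REST API (the endpoint and fields are from the standard GitHub API, not from CodeTriage):

```python
# List a few open issues for vllm-project/vllm via the GitHub REST API.
# Unauthenticated requests are rate-limited; pass a token for heavier use.
import requests

url = "https://api.github.com/repos/vllm-project/vllm/issues"
resp = requests.get(url, params={"state": "open", "per_page": 10})
resp.raise_for_status()

for item in resp.json():
    # The issues endpoint also returns pull requests; skip those.
    if "pull_request" in item:
        continue
    print(f"#{item['number']}: {item['title']}")
```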
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, opt to receive undocumented methods or classes instead and supercharge your commit history.
Python not yet supported
25 Subscribers
Add a CodeTriage badge to vllm
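In a README, the badge typically looks like this (a sketch; the image URL follows CodeTriage's usual `badges/users.svg` pattern, which you should verify on codetriage.com):

```markdown
[![Open Source Helpers](https://www.codetriage.com/vllm-project/vllm/badges/users.svg)](https://www.codetriage.com/vllm-project/vllm)
```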
Help out
- Issues
  - [RFC]: Unified ModelOpt Quantization in vLLM
  - [torch.compile] Remove layer name from unified_kv_cache_update / unified_mla_kv_cache_update to fix cold-start (#33267)
  - [RFC]: Add API to restore free_block_queue allocation order for long-running RLHF / rollout sessions
  - [Bug]: High concurrency or enabling eagle3 and streaming + function call can trigger accuracy issues on kimi k2.5
  - [Bug]: HunyuanOCR crashes with "query and key must have the same dtype" during inference (vLLM 0.19.0, RTX 3050)
  - nixl refactor [3/N]: extract model-specific logic into ModelBlockTransferPolicy
  - [Bugfix] GLM tool parser: fix streaming corruption for Optional[str]/array args
  - [torch.compile] refactor config hashing through compile_factors and normalization
  - [Core] Optimize sliding-window cache hit search in SlidingWindowManager
  - [Perf] [Hybrid] Fused Triton kernel for GPU-side Mamba state postprocessing
- Docs
  - Python not yet supported