vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
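For context, vLLM is normally used through its Python API. The snippet below is a minimal offline-inference sketch; the model name `facebook/opt-125m` and the sampling settings are illustrative choices, not taken from this page.

```python
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# LLM() loads the model and allocates the paged KV cache up front.
llm = LLM(model="facebook/opt-125m")

# generate() runs batched inference over all prompts at once.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```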
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
24 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [MyPy] Fix mypy for `vllm/benchmarks`
- Update gpu.xpu.inc.md to use triton-xpu 3.7.0
- [Kernel] Optimize moeTopK memory access with shared memory and fix indexing bugs
- [WIP][Config] Introduce RuntimeDefault fields for runtime-initialized config values with correct static types
- [Core] Immediately evict thinking token blocks from prefix cache
- [Bugfix] Skip gpu_memory_utilization validation when kv_cache_memory_bytes is set
- [XPU] update dp rank w/o env-var isolation
- [Core] Add Ring Attention Primitives for Context Parallelism
- fix: resolve ROCm VRAM release issue in sleep mode
- [Kernel] Add fused routing and scatter-reduce decode optimizations for BS<=64 for GPT-OSS
- Docs
- Python not yet supported