vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
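For context on what the engine does, here is a minimal offline-inference sketch using vLLM's public Python API; the model name, prompt, and sampling values are illustrative placeholders, not anything taken from this page.

```python
# Minimal vLLM offline-inference sketch.
# NOTE: the model id, prompt, and sampling values are illustrative
# assumptions; substitute any Hugging Face-compatible model you actually use.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

llm = LLM(model="facebook/opt-125m")  # placeholder model id
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```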
Help out
- Issues
- [XPU] Keep the SYCL kernel's generator state aligned with PyTorch
- [Bugfix] Flush the final KV block when a SimpleCPUOffload request finishes in the same step as its last full block
- [XPU] Cap topk/topp Triton BLOCK_SIZE to 4096 for deterministic sampling
- [Feature]: Add request-level OTel span attribute for cached prefix-cache input tokens
- [Performance]: RMSNorm op in v0.20 IR layer prevents further PyTorch/Triton op fusion
- NixlConnector hardcodes a backends=["UCX"] default with no environment-variable override; LIBFABRIC/EFA operators must discover kv_connector_extra_config.backends from the source (a hedged config sketch follows this list)
- [Bug]: CUDA illegal instruction in Mamba2 mixed prefill/decode path
- [Bug]: OverflowError in mamba_utils.collect_mamba_copy_meta on XPU when device pointer ≥ 2^63 (hybrid models with align-mode prefix caching)
- [RFC]: Adaptive throughput/latency profile for RL rollout long-tail
- [Feature]: Support Dynamic Pruning for Speculative Decoding Draft Trees in EAGLE-3
- Docs
- Python not yet supported
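Relating to the NixlConnector item above, a hedged sketch of what an explicit backend override via kv_connector_extra_config.backends could look like. Only the kv_connector_extra_config.backends field is quoted from the issue title; the surrounding field names and the --kv-transfer-config server flag are assumptions about vLLM's KV-transfer configuration, not something this page confirms.

```python
# Hedged sketch: building a KV-transfer config that overrides the NIXL
# backend list. Field names other than kv_connector_extra_config.backends
# (taken from the issue title) are assumptions, not verified against source.
import json

kv_transfer_config = {
    "kv_connector": "NixlConnector",
    "kv_role": "kv_both",
    # The issue reports the default is hardcoded to ["UCX"]; LIBFABRIC/EFA
    # operators currently have to discover this knob by reading the source.
    "kv_connector_extra_config": {"backends": ["LIBFABRIC"]},
}

# Typically passed as a JSON string when launching the server, e.g.:
#   vllm serve <model> --kv-transfer-config '<the JSON printed below>'
print(json.dumps(kv_transfer_config))
```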