vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
- [Bug]: Severe Head-of-Line Blocking (147x TTFT) under Prefix Caching with Asymmetric Batches
- Pooling API: expose extra_kwargs and allow nested response data for custom poolers
- Port custom ops to native Inductor multi-stream support
- [Feature]: Allow passing `images` to CompletionRequest
- [Bug]: _find_range_for_shape in hotpath
- fix: include prompt text in RequestLogItem for gpt-oss-20b
- fix(lmcache): handle KeyError in layerwise mode
- Optimize Fusedmoe int8_w8a8 kernel performance
- [Misc] Improve DCP error messages with actionable guidance
- [Doc] Fix inconsistent hash notation in Prefix Caching diagram
- Docs
- Python not yet supported