vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
  - serving: MM warmup background overlap needs real serialization via _mm_executor (design review needed)
  - [v2 of #45208] CuMem slept-L1 fragmentation accounting
  - (security) Enforce allowed_tools at execution time for Responses API
  - Revert "[Render] Add `/derender` endpoints for disaggregated postprocessing" (#43606)
  - fix: cache bad_words tokenization to avoid 'Already borrowed' errors under concurrency
  - [Core] Expose engine pause/resume state as prometheus metrics
  - [v1] Initialize InputBatch in initialize_kv_cache instead of __init__
  - [Bugfix] Bounds-check moe_permute reverse-map write (#45492)
  - [Bug]: minimax M3MXFP8 with mtp can not start success
  - [Security][Rust Frontend] Add input validation to gRPC and HTTP stop_token_ids
- Docs
  - Python not yet supported