vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
30 Subscribers
Help out
- Issues
  - [Model Runner V2][Spec Decode] Use log1p to compute residual during rejection sampling
  - [Bugfix] Exclude disagg P/D transfer tokens from cached_tokens (NixlConnector)
  - [Frontend][Responses] Accept chat-completions image format on /v1/responses (#46631)
  - Support `srt` response format for audio transcription
  - Bump flashinfer version to 0.6.13
  - [Rust Frontend] add repetition_detection support to sampling params across Rust frontend
  - [Bugfix] Fix UVA offload fallback copies
  - [ROCm]: Bump aiter to 0.1.16.post2
  - Fix FlashAttnMLA FP8 KV cache support
  - Releases/v0.22.1
- Docs
  - Python not yet supported