vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
30 Subscribers
Help out
- Issues
- [Core] Improve startup failure diagnostics for early subprocess exits
- [WIP][Model Runner V2][Spec Decode] CUDA graph rejection sampling
- fix: update WNA16 Marlin MoE fake signature
- fix: make FusedMoE expert-map logging meta safe
- [Bugfix] V1: clear stale allowed_token_ids mask in InputBatch.condense
- Support Hybrid&Mamba in kv transfer
- [DSV4]add sequence parallel support for DSV4
- [Bugfix][DeepSeek-V4] Append generation prompt when last message is system
- [ROCm]: Docker/CMake build support for gfx1103 (Radeon 780M / RDNA3 APU)
- [Performance] Reduce DeepEP LL batched Marlin activation overhead
- Docs
- Python not yet supported