vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
- [Bugfix][Model] Fix DiffusionGemma self-conditioning with tensor parallelism
- [Bug]: Streaming parser engine leaks special/structural tokens (BOS/EOS/drop_tokens) in non-streaming responses
- [Bugfix][Parser] Strip special tokens in non-streaming engine parsing
- fix(anthropic): resolve model chat template before deciding inline-system merge
- Update platform checker to reuse cache
- [Misc] Add unit test for apply_expert_map kernel
- [ROCm][Perf] MiniMax-M3 MXFP8 gemm/group gemm dispatch AITER
- [Bugfix][Core] Fix num_output_placeholders underflow with async scheduling + spec decode
- [Bug]: Pipeline-Parallel (PP=2) Engine Crash / Hang on Dual Battlemage GPUs (Arc Pro B70 + Arc B580)
- [BugFix] Fix latent-MoE crash when MoE runner holds the gate (is_internal_router)
- Docs
- Python not yet supported