vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
- [Performance]: Triton fusion for Qwen2/3-MoE shared-expert gate (Qwen2MoeMLP/Qwen3MoeMLP)
- fix(chat): allow multimodal content in tool messages for vision models
- [bug] set vllm config when performing weight sync
- [Perf] Optimize `per_token_group_quant` using register directly, 4.5% E2E Throughput improvement
- [Bugfix] Fix DeepSeek V4 ImportError when cutlass/quack versions are incompatible
- fix(platforms): raise AttributeError on missing Platform attribute
- [Usage]: DeepSeek-V4 startup takes ~8 min — guidance on reducing initialization time
- [Bugfix][Reasoning] Properly detect reasoning end when using thinking_token_budget
- [Perf][DSv4] Add generic cuteDSL LL Blockwise FP8 GEMM with PDL
- [Core][WIP] Check for GPU<->CPU sync during CI
- Docs
- Python not yet supported