vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
30 Subscribers
Help out
- Issues
  - [Model] Migrate QWen to use AutoWeightsLoader
  - [Feature]: add object store e2e tests for kv caching
  - [Model][MiniMax-M2] Fix EAGLE-3 aux hidden-state layer off-by-one
  - [Bug]: OOM killed caused by possible CPU memory leak in vLLM Worker RPC Broadcast Deserialization Path
  - [Bugfix][Core] MTP + enable prefix caching + mamba accuracy fix
  - [torch.compile] Add E2E correctness tests for fusion passes
  - [Frontend] Make DP ZMQ liveness timeout configurable
  - Fix InternVL2-8B compiled generation with InternLM2 backbone
  - [ROCm][Perf] DSv3.2: fuse MLA Q concat+fp8-quant in forward_mqa
  - Fix uniform_random routing simulation to sample without replacement
- Docs
  - Python not yet supported