vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
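As a quick illustration of what the engine does, here is a minimal offline-inference sketch in the style of vLLM's quickstart; the model name, prompts, and sampling settings below are placeholder choices for illustration, not anything prescribed by this page.

```python
# Minimal offline-inference sketch using vLLM's quickstart-style API.
# Model name and sampling values are illustrative placeholders.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "A high-throughput LLM serving engine should",
]

# Sampling settings are placeholders; tune them for your workload.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a small model; any Hugging Face model supported by vLLM works here.
llm = LLM(model="facebook/opt-125m")

# generate() batches the prompts and returns one RequestOutput per prompt.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

The same models can also be served over an OpenAI-compatible HTTP endpoint via the `vllm serve` CLI in recent releases, but that is outside the scope of this page.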
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
14 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- optimize topk_topp_sampling.
- Feature/slots optimization
- [Bug]: Value error, Tensor parallel size (2) cannot be larger than the number of available GPUs (1).
- [WIP][Feat][Sched] Add Buffered_Response
- [Bugfix] Temporarily disable group quant rms norm fusion
- Update to transformers v5
- Algo
- [Bugfix] Fix DeepSeekV32 tool parser incorrect type conversion for array/object parameters
- [docs] Add lightweight AI assisted contribution policy
- Implement optimal group size calculation for KV cache layers, preferr…
- Docs
- Python not yet supported
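One of the open issues above reports a ValueError raised when the requested tensor parallel size (2) exceeds the number of available GPUs (1). As a rough sketch of where that setting lives: vLLM takes `tensor_parallel_size` at engine construction and shards the model across that many GPUs, so the request cannot exceed the GPUs visible to the process. The model name and values below are illustrative, not taken from the issue itself.

```python
# Sketch: keep the requested tensor parallel size within the visible GPUs.
# Model name and requested value are placeholders for illustration.
import torch
from vllm import LLM

available_gpus = torch.cuda.device_count()
requested_tp = 2  # placeholder, mirroring the "(2)" in the bug report

# vLLM shards the model across `tensor_parallel_size` GPUs, so the value
# must not be larger than the GPUs actually available to this process.
tp_size = min(requested_tp, max(available_gpus, 1))

llm = LLM(model="facebook/opt-125m", tensor_parallel_size=tp_size)
```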