vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
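For context, a minimal sketch of offline inference with vllm's Python API. The model id below is an illustrative assumption; any Hugging Face-compatible model id works:

```python
from vllm import LLM, SamplingParams

# Load a model (illustrative choice; swap in any supported model id).
llm = LLM(model="facebook/opt-125m")

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Generate completions for a batch of prompts.
outputs = llm.generate(["The capital of France is"], sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```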
Issues
- [Usage]: Logprobs Scaling with O(n) Complexity – Unexpected Performance Degradation
- [Feature]: `reasoning_tokens` in Chat Completion Response `usage`
- [Model] add colqwen2_vl code & inference
- [Feature]: will whisper add language detection?
- Deepseek MTP for V1
- [Distributed] Add reduce_scatter to DeviceCommunicatorBase
- [Feature]: Implement Concurrent Partial Prefills In V1 Engine
- [Bug]: wake up OOM (72B model in 8*A800(40G))
- Support w8a8 block_fp8_matmul from generated kernel
- [HPU] Enable AutoGPTQ/AutoAWQ quantized model inference
Docs
- Python not yet supported