vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
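For readers new to the project, here is a minimal offline-inference sketch following vLLM's documented `LLM`/`SamplingParams` quickstart API; the model name (`facebook/opt-125m`) is chosen purely for illustration and can be swapped for any supported checkpoint.

```python
# Minimal offline batch inference sketch with vLLM (illustrative model choice).
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "vLLM is a high-throughput inference engine because",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # loads the model and allocates the KV cache
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```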
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
13 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- Does it run on an RTX 5060 Ti 16 GB?
- [Feature]: Faster apply_top_k_top_p without scatter (see the sketch after this list)
- [BugFix] Propagate prefix to backend
- [Feature]: Reading format constraints from tool call parsers for guided decoding.
- [Usage]: With the same configuration, v0.11.0 reports insufficient GPU memory compared to v0.8.5
- [New Model]: Add support for Nanonets-OCR2-3B
- [Bug]: Potential Out-of-Bounds Access in gptq_marlin.cu and marlin_24_cuda_kernel.cu
- [Bug]: Tokenize endpoint for Granite models returns malformed strings in `token_strs` for non-Latin characters
- [Test] Adjust abort sleep time to reduce AsyncLLM test flake
- [Performance][torch.compile]: Inductor partition performance issues
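As background for the `apply_top_k_top_p` feature request listed above, the sketch below shows a conventional PyTorch top-k/top-p (nucleus) filter. It is an illustrative baseline, not vLLM's actual kernel; note the final `scatter` back to the original token order, which is the kind of operation the request aims to avoid.

```python
import torch


def top_k_top_p_filter(logits: torch.Tensor, k: int, p: float) -> torch.Tensor:
    """Mask logits outside the top-k set and the top-p (nucleus) set.

    logits: [batch, vocab_size]; returns logits with filtered entries set to -inf.
    Illustrative baseline only, not vLLM's apply_top_k_top_p implementation.
    """
    neg_inf = float("-inf")

    # Top-k: drop everything below the k-th largest logit in each row.
    if k > 0:
        kth = torch.topk(logits, k, dim=-1).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, neg_inf)

    # Top-p: keep the smallest prefix of the sorted distribution whose
    # cumulative probability reaches p.
    if p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
        probs = torch.softmax(sorted_logits, dim=-1)
        cum_probs = probs.cumsum(dim=-1)
        # A token is dropped if the mass accumulated *before* it already reaches p.
        drop = (cum_probs - probs) >= p
        sorted_logits = sorted_logits.masked_fill(drop, neg_inf)
        # Scatter the filtered values back to the original vocabulary order
        # (the step the feature request above wants to eliminate).
        logits = torch.full_like(logits, neg_inf).scatter(-1, sorted_idx, sorted_logits)

    return logits
```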
- Docs
- Python not yet supported