vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
13 Subscribers
Help out
- Issues
- [Bug]: Compile Integration should reuse for identical code
- [Bug]: SamplingParams.truncate_prompt_tokens has no effect in LLM.chat (see the usage sketch after this list)
- [Bug]: Qwen3-VL-235B-A22B-Instruct stuck with assert placeholder < len(self._out_of_band_tensors)
- [Bug]: Potential out-of-bounds access in paged_attention_v1.cu and paged_attention_v2.cu
- Can it run on an RTX 5060 Ti 16 GB?
- [Feature]: Faster apply_top_k_top_p without scatter (see the sketch after this list)
- [BugFix] Propagate prefix to backend
- [Feature]: Reading format constraints from tool call parsers for guided decoding.
- [Usage]: With the same configuration, v0.11.0 reports insufficient GPU memory while v0.8.5 does not
- [New Model]: Add support for Nanonets-OCR2-3B
- Docs: Python not yet supported
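For context on the SamplingParams.truncate_prompt_tokens report above, here is a minimal sketch of how the parameter is expected to behave when a request goes through LLM.chat. The model name and message content are placeholder assumptions, not details taken from the issue.

```python
# Sketch: truncate_prompt_tokens is expected to cap the prompt at its last N tokens,
# whether the request goes through LLM.generate or LLM.chat.
from vllm import LLM, SamplingParams

# Placeholder model; any model that ships a chat template should work here.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Keep only the last 16 prompt tokens. Per the report above, this setting may be
# ignored when the request is issued via LLM.chat.
params = SamplingParams(max_tokens=32, truncate_prompt_tokens=16)

messages = [{"role": "user", "content": "Summarize the history of GPUs in one sentence."}]
outputs = llm.chat(messages, params)

# If truncation took effect, the scheduled prompt is at most 16 tokens long.
print(len(outputs[0].prompt_token_ids))
```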
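The apply_top_k_top_p feature request above concerns vLLM's internal sampler. As a rough illustration of the idea only, the generic PyTorch sketch below filters logits with top-k and top-p using a sort plus a per-row threshold instead of scattering a sorted mask back into the original order. The function name and signature are invented for illustration and are not vLLM's API.

```python
import torch

def top_k_top_p_no_scatter(logits: torch.Tensor, k: int, p: float) -> torch.Tensor:
    """Mask logits outside top-k / top-p without a scatter back to original order."""
    # Top-k: keep only entries >= the k-th largest value in each row.
    k = min(k, logits.size(-1))
    kth = torch.topk(logits, k, dim=-1).values[..., -1:]
    logits = logits.masked_fill(logits < kth, float("-inf"))

    # Top-p: sort once, find the smallest logit still inside the nucleus,
    # then threshold the original tensor against that value (gather, not scatter).
    sorted_logits, _ = torch.sort(logits, dim=-1, descending=True)
    probs = torch.softmax(sorted_logits, dim=-1)
    cumprobs = torch.cumsum(probs, dim=-1)
    keep = (cumprobs - probs) < p          # cumulative mass *before* each token
    cutoff_idx = keep.sum(dim=-1, keepdim=True) - 1
    cutoff = torch.gather(sorted_logits, -1, cutoff_idx)
    return logits.masked_fill(logits < cutoff, float("-inf"))

# Example: batch of 2 rows over a 6-token vocabulary.
logits = torch.tensor([[2.0, 1.0, 0.5, 0.1, -1.0, -2.0],
                       [0.3, 0.2, 0.1, 0.0, -0.1, -0.2]])
print(top_k_top_p_no_scatter(logits, k=4, p=0.9))
```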