vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes instead and supercharge your commit history.
Python not yet supported
19 Subscribers
Help out
- Issues
- [Bug]: calculate_kv_scales=True isn't doing anything
- [Performance]: non-optimal performance of `linear` for small batches
- Support PP on tpu_inference
- [Bug]: vLLM (TP=8) on 235B model triggers "CUDA error: unspecified launch failure" and persistent "ERR!" state in nvidia-smi
- [Bug]: SamplingParams.truncate_prompt_tokens has no effect in LLM.chat
- [Feature]: Faster apply_top_k_top_p without scatter
- [BugFix] Propagate prefix to backend
- [Feature]: Reading format constraints from tool call parsers for guided decoding.
- [Performance][torch.compile]: Inductor partition performance issues
- Docs: Python not yet supported
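
For context on the calculate_kv_scales issue listed above, here is a minimal, hedged sketch of the configuration that report concerns: an FP8 KV cache with on-the-fly scale calculation. The model name is a placeholder, and whether the scales actually take effect is exactly what the bug report questions.

```python
from vllm import LLM

# Hypothetical reproduction setup (not taken from the issue itself):
# store the KV cache in FP8 and ask vLLM to compute K/V scales dynamically.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",                      # FP8 KV cache
    calculate_kv_scales=True,                  # the flag the bug report says has no effect
)
```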
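The SamplingParams.truncate_prompt_tokens issue above is about the offline chat entry point. A minimal sketch of the expected usage, assuming a placeholder chat-capable model, looks like this; the report is that the truncation setting is honored by LLM.generate but ignored when the same SamplingParams are passed to LLM.chat.

```python
from vllm import LLM, SamplingParams

# Placeholder model; any chat-capable model illustrates the same point.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# truncate_prompt_tokens=64 asks vLLM to keep only the last 64 prompt tokens.
params = SamplingParams(max_tokens=32, truncate_prompt_tokens=64)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize this very long document ..."},
]

# Per the issue title, the truncation reportedly has no effect on this path.
outputs = llm.chat(messages, sampling_params=params)
print(outputs[0].outputs[0].text)
```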
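The "faster apply_top_k_top_p without scatter" request refers to avoiding the scatter that maps a mask computed on sorted logits back to the original token order. The sketch below is a generic PyTorch illustration of that idea, not vLLM's implementation: thresholds are derived from a sorted copy and applied to the unsorted logits with plain comparisons.

```python
import torch

def top_k_top_p_mask(logits: torch.Tensor, k: int, p: float) -> torch.Tensor:
    """Mask logits outside the top-k / top-p (nucleus) set without a scatter."""
    k = min(k, logits.size(-1))
    # Top-k: keep tokens whose logit is at least the k-th largest in the row.
    kth = torch.topk(logits, k, dim=-1).values[..., -1:]
    keep = logits >= kth

    # Top-p: find the nucleus cutoff probability on a sorted copy, then apply it
    # to the *unsorted* probabilities by comparison instead of scattering a mask.
    probs = logits.softmax(dim=-1)
    sorted_probs, _ = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    in_nucleus = (cumulative - sorted_probs) < p  # mass before this token < p
    cutoff = sorted_probs.masked_fill(~in_nucleus, float("inf")).min(dim=-1, keepdim=True).values
    keep &= probs >= cutoff

    return logits.masked_fill(~keep, float("-inf"))

# Example: batch of 2 rows over a 6-token vocabulary.
logits = torch.randn(2, 6)
filtered = top_k_top_p_mask(logits, k=3, p=0.9)
```

One difference from a sort-and-truncate formulation is tie handling: tokens whose probability equals the cutoff are all kept, which is usually acceptable for sampling but worth noting.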