vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
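For context on what the project does, below is a minimal sketch of offline batch inference with vLLM, following the pattern from the project's quickstart; the model name and sampling settings are placeholder examples, not anything this page prescribes.

```python
# Minimal offline-inference sketch using vLLM's Python API.
# Assumes the standard LLM / SamplingParams entry points; the model is an example choice.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load the model and generate completions for all prompts in one batch.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```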
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
14 Subscribers
Help out
Issues
- Granite4 Quantization Bug Fix
- [CLI] fix unicode encode error for `vllm chat/complete` command input
- [Bugfix] fix CUDA illegal memory access when sleep mode is triggered during request processing
- Reset V1 max_model_len after KV sizing
- v1: account for CPU offload capacity in KV cache check
- build: align CUDA 12.1 xformers wheel pin
- [Models] Lfm2-VL Architecture
- [CUDA] cutlass_moe_mm: proper sm version check
- [Bug]: Multinode inference request with Ray and vLLM crashes - regression from vLLM v0.7.3
- [Frontend] add toolparser for deepseek v3.2 reusing qwen xml parser
Docs
- Python not yet supported