vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
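Before triaging, it can help to have run the engine once yourself. Below is a minimal offline-inference sketch using vLLM's documented LLM/SamplingParams API; the model name is only an illustrative placeholder, so swap in whichever model you are testing.

```python
from vllm import LLM, SamplingParams

# Sampling settings for generation (values here are just an example).
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a small model; replace with the model relevant to the issue you triage.
llm = LLM(model="facebook/opt-125m")

# Generate completions for a batch of prompts.
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```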
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
14 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Bug]: CUDA OOM during sampler warmup with DeepSeek-V3.2 (DeepGEMM) on vLLM Nightly (V1 Engine)
- [Bug]: v0.11.2 cannot support Qwen2.5-Omni-
- [RFC]: FlashMask Attention Backend for PrefixLM Models
- Enable vLLM unit tests for Intel GPU
- [Bugfix][LoRA] Fix LoRA weight mapping for DeepSeek MLA attention and…
- Fix MultiConnector for multi-connector use with a push KV connector
- [torch.compile] Improve encoder compilation detection in PiecewiseBackend
- [Usage]: Does Qwen3-VL's local loading mode support loading a LoRA separately?
- [ROCm][Perf] Tune fused_moe and add int4 w4a16 config
- [wip] custom allreduce and custom unquantized_gemm
- Docs
- Python not yet supported