vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
- [CI] De-flake eagle correctness spec decode tests
- SyntaxWarning: invalid escape sequence and spurious trust_remote_code warnings on startup
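- [Usage]: After vLLM starts, total memory usage is 69GB, of which 62.18GiB is the reserved KV cache pool. A single request used only about 50MiB of KV cache. Is it correct to understand that most of the 62.18GiB consists of preallocated but currently idle blocks, and that the active GPU memory for the current request is approximately non_kv_cache_memory + the request's KV cache?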
- [XPU] add awq format for INCXPULinear
- [WIP][CI][ROCm] Isolate test_sleep_mode in its own subprocess
- [Bug]: When will the RTX Pro 6000 support the deepseek-v4-flash model? There are too many versions of the vLLM framework right now, and I’m wondering when a vLLM image specifically optimized for the Pro 6000 will be released.
- [Bug]: DeepSeek V4 Flash Model Output is Garbled
- [Rust Frontend] Improve startup failure reporting UX
- [Bug]: Endless '!' output when a long context request is sent on qwen3.5/3.6, B60 gpus
- [New Model]: OpenMOSS-Team/MOSS-Audio (audio understanding)
- Docs
- Python not yet supported