vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
- [Core][V1] support min_characters for SamplingParams
- [Bugfix] Fix weight transfer tests using stale envs cache - CI test failures
- [ROCm] Add AITER fused kernel support for DeepSeek MLA attention
- [Hardware][GPU] Profiler config additional to increase it scope and annotation details
- Fix NVFP4-quantized MoE checkpoint support for Step-3.5 Flash
- [Community] RTX 5090 (Blackwell sm_120) + WSL2 2.7.0: CUDA graphs work — benchmarks + full config
- [Bug]: prompt is logged as None in RequestLogItem for gpt-oss-20b (Chat Completion API)
- [Performance]: vllm and transformer call the same Qwen3-VL-AI4TEST-V1 model, with roughly the same configuration, but the visual label accuracy is 20% lower in testing.
- [RFC]: Hotness-aware multi-level KV cache management to accelerate dynamic sparse attention
- fix(lmcache): handle KeyError in layerwise storage mode
- Docs
- Python not yet supported