vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
  - Validate num_gpu_blocks_override in CacheConfig
  - [Bugfix][SM121] Extend TrtLlmFp8ExpertsBase device gate to SM_12x (consumer Blackwell / DGX Spark)
  - Remove redundant Triton KV cache dtype asserts
  - [Model Runner V2] Feature: Support `ElasticEPScalingExecutor` for MRv2
  - [Parity with CUDA]: Add top 5-10 popular OSS models into mi355, mi325, mi300 into vLLM performance & accuracy regression testing & dashboard
  - [RFC]: Move trainer-side weight transfer logic out of `vllm`
  - [Bugfix][MLA] Fix LSE log-base mismatch in DCP + FlashInfer MLA decode
  - [Bug]: V1 structured outputs: a malformed grammar request after a valid one crashes EngineCore
  - [Core] Improve startup failure diagnostics for early subprocess exits
  - [WIP][Model Runner V2][Spec Decode] CUDA graph rejection sampling
- Docs
  - Python not yet supported