vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
- [Bug]: n_completions + logprobs Causes Significant TTFT Spike for Co-Scheduled Requests on Cold Cache
- [Bug]: 'placeholder_block_size' is not defined
- Guidance backend structured output doesn't work with openai_gptoss reasoning parser (offline LLM.generate)
- Guidance structured output blocked during thinking with nemotron_v3 reasoning parser (offline LLM.generate)
- fix(compilation): fix piecewise CUDA graph bugs with splitting_ops
- [Bug]: gcc: internal compiler error: Segmentation fault signal terminated program cc1
- [WIP][Model Runner V2] Add Encoder Dummy Run
- [WIP][CT][XPU] Add W8A16 FP8 MoE Support
- [Feature]: Tree speculative decode.
- [Bug]: `redundancy_buffer_memory` is Never really used
- Docs
- Python not yet supported