vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
  - [RFC]: Non-blocking core model loop
  - [Bug] Inconsistent parameter names (`thinking` vs `enable_thinking`) between reasoning parsers and chat templates causes content:null
  - [WIP] Support DCP with FlashInfer MLA
  - [Bugfix][TurboQuant] Fix CUDA graph capture crash with spec-decode + chunked-prefill (#40807)
  - [Bugfix][Reasoning] Fix thinking_token_budget not enforced on re-entry after forced end
  - [Bug]: Why does the video file uploaded to the LLM node parse out as files:[] during the data processing step?
  - [Bugfix][Spec Decode] Disable EAGLE prefill FULL CUDA graph for live multimodal batches
  - [Core][Perf] Add MessagePack prefix block hash algorithms
  - [Bug]: `thinking_token_budget` enforcement fails on multi-turn conversations when `max_completion_tokens` >> `thinking_token_budget` with ignore_eos:true
  - [Bugfix] qwen3_xml: emit one OpenAI tool_call per <function=...>, fix duplicate close
- Docs
  - Python not yet supported