vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
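For quick orientation, here is a minimal offline-inference sketch using vLLM's documented `LLM`/`SamplingParams` API; the model name and sampling values are only illustrative choices, not anything prescribed by this page:

```python
from vllm import LLM, SamplingParams

# Load any Hugging Face model; "facebook/opt-125m" is just a small example.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches prompts to exploit vLLM's high-throughput scheduling.
outputs = llm.generate(["Hello, my name is"], params)
for out in outputs:
    print(out.outputs[0].text)
```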
Issues
- [Frontend] Fix default_chat_template_kwargs handling in Responses API
- [Responses API] Fix tool_choice=required: WebSearch crash, parallel tool merge, JSON fallback
- [Responses API] Fix ParsableContext: defer parsing to end of generation
- [Responses API] Add ToolChoiceRequiredLogitsProcessor for thinking models
- Fix: preserve streaming logprobs
- [Bugfix] Respect VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY in prefetch offloader
- [Refactor] Fix bitsandbytes loader import for pipeline-parallel params
- [Bug]: `/v1/chat/completions/render` crashes for Qwen/Qwen3-ASR-0.6B multimodal audio, and chat audio returns empty/junk
- [Bug]: cudaErrorIllegalAddress crash when running zai-org/GLM-4.7-FP8 with `--max-num-batched-tokens` < default (e.g. 4K) under
- `reshape` instead of `view` in `FP8ScaledMMLinearKernel` (see the sketch after this list)
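On the `reshape`-vs-`view` item: in PyTorch, `Tensor.view` only succeeds when the requested shape can be expressed over the tensor's existing strides (e.g. contiguous memory), while `Tensor.reshape` falls back to a copy when no view is possible. A minimal illustration of the difference in plain PyTorch, not the FP8 kernel itself:

```python
import torch

x = torch.randn(4, 8)
t = x.t()  # transpose produces a non-contiguous tensor

try:
    t.view(-1)  # view() needs a stride-compatible layout and raises here
except RuntimeError as err:
    print(f"view failed: {err}")

flat = t.reshape(-1)  # reshape() silently copies when a view is impossible
print(flat.shape)  # torch.Size([32])
```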