vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
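For context on what the engine does, here is a minimal sketch of offline inference with vLLM's Python API. The model name and sampling settings are illustrative placeholders, not taken from this page.

```python
# Minimal offline-inference sketch using vLLM's Python API.
# The model identifier below is illustrative; substitute any model vLLM supports.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

llm = LLM(model="facebook/opt-125m")  # loads weights and allocates the KV cache
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```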
Issues
- [Feature]: Qwen3 Omni Transcriptions
- [Don't merge] Try enabling fastsafetensors as default weight loader
- [Doc] Add 20251202 vLLM Malaysia Meetup Info
- [Bugfix] Fix infinite loop in V1 scheduler with max-length prompts (#…
- [Bug]: DeepSeek-V3.2 As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one
- [Bug]: Got different `max model len` using MTP with Qwen3 next
- [Fix] Support TP for ModelOpt NVFP4 by adding dynamic padding
- [docker] Only install flashinfer-jit-cache on CUDA 12.8+
- [Bugfix] Improve DCP error message with backend hint
- Fix scheduler yield on arm