vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
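To give a sense of what the engine does, here is a minimal sketch of vLLM's offline inference API as documented in its quickstart; the prompts and the small `facebook/opt-125m` model are placeholder choices, not anything prescribed by this page.

```python
# Minimal offline-inference sketch using vLLM's Python API.
# The model name is just an example; any model supported by vLLM can be substituted.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # small model for a quick local test
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}")
```

The same engine can also be run as an OpenAI-compatible server (`vllm serve <model>`), which is what the serving-related issues listed below refer to.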
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
14 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Doc] Add 20251202 vLLM Malaysia Meetup Info
- [Bugfix] Fix infinite loop in V1 scheduler with max-length prompts (#…
- Added regression test for openai/harmony/issues/78
- [Bug]: DeepSeek-V3.2 As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one
- [Feature]: VLLM_DISABLE_COMPILE_CACHE should be a config flag
- [Bug]: Got different `max model len` using MTP with Qwen3 next
- [Frontend] OpenAI Responses API supports Tool/Function calling with streaming
- [Fix] Support TP for ModelOpt NVFP4 by adding dynamic padding
- [docker] Only install flashinfer-jit-cache on CUDA 12.8+
- [Bugfix] Improve DCP error message with backend hint
- Docs
- Python not yet supported