vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
- [KV Connector] Canonical KV Cache Allocation for HMA Models
- [Bugfix] Fixe MiniMax-M2 parser failed to validate the validity of function names
- Fix missing logprobs for <tool_call> in streaming chat completions (#37737)
- [Bug] FlashInfer + MTP speculative decoding crashes on SM121 (DGX Spark) with GQA=16 model
- [Bug]: [OOM] DeepSeek-R1 Out of Memory
- Consolidate AWQ quantization into single awq_marlin.py file
- [Bugfix] Add size guard to `make_copy_and_call` and improve tests
- Revert "[Frontend] Remove librosa from audio dependency" (#37058)
- [Model] Add GGUF support for Qwen3.5 hybrid models
- [Feature]: Unify MoE "Oracles" with Class Structure
- Docs
- Python not yet supported