vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
30 Subscribers
Help out
- Issues
- Benchmark request: mixed long-prefill / long-decode / repeated-prefix serving boundary
- [Performance][ModelOpt] B300 auto backend is suboptimal for Qwen-Image mixed NVFP4
- [Bug][FP8] ScaledMMLinearKernel rejects valid non-contiguous batched activations
- [Bug]: gemma4_tool_parser._parse_gemma4_args does not strip <|"|> from dict-key positions
- fix(tool_parsers): strip STRING_DELIM from dict keys in gemma4 parser
- [Feature]: Add support for Bailing MTP speculative decoding
- [Bugfix][Frontend] Fix Anthropic count_tokens decorator order driving server load negative
- [Bugfix][Core] Close underlying iterator in merge_async_iterators single-iterator fast path
- [Bugfix][Rust Frontend] Set a structured-output backend so requests do not 500
- 💥 RTX 5090 + WSL2: V1 Engine hangs at startup — EngineCore spawns but never connects via ZMQ, ALL models fail (v0.21-0.22), raw spawn+Pytorch works fine
- Docs
- Python not yet supported