vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
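vLLM provides both an OpenAI-compatible HTTP server and an offline Python API. Below is a minimal sketch of the offline API; the model name, prompts, and sampling settings are placeholders for illustration, not anything prescribed by this page.

```python
# Minimal offline-inference sketch using vLLM's Python API.
# Model name and sampling values are placeholders.
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Serving large language models is hard because",
]

# Sampling configuration; tune temperature/top_p/max_tokens for your use case.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Loads the model weights and sets up vLLM's KV-cache management.
llm = LLM(model="facebook/opt-125m")

# Batched generation; vLLM schedules all prompts together for throughput.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```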
Issues
- fix-grpc-spawn
- [ROCm] Add QuickReduce min-size override and codec threshold
- Fix CPU-only shutdown cleanup without accelerator
- [Bug]: The last few reasoning output tokens are missing when using Gemma4 and setting "--streaming-interval" to be larger than 1
- [CI/Build] Bump flashinfer to v0.6.10
- [Bug]: GPT-OSS-20B repeats itself for some prompts
- [Bug]: TokenizersBackend fallback returns tokenizer without `max_chars_per_token`
- [Bugfix] Limit default Qwen MM budget by scheduler tokens
- Recurring CUDA kernel hang on 2x DGX Spark (GB10, sm_12.1) with MiniMax-M2.7-NVFP4, TP=2 across 2 nodes
- [CI Failure]: mi250_1: LoRA %N