vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
26 Subscribers
Help out
- Issues
  - [Misc] Add VLLM_GPU_NIC_PCIE_MAPPING for per-worker RDMA NIC selection
  - [CI] Add DSV4-Flash to gsm8k moe-refactor/config-b200.txt
  - [Frontend] add support for thinking_token_budget in completions
  - [Bugfix] Fix corrupt outputs in MoE FP8 LoRA responses and MoE base model responses when LoRAs are loaded
  - [Bugfix][Kernel] Fix mxfp8 scale swizzling after EP all-to-all
  - [Kernel] Fuse dual RMSNorm + residual + scalar in Gemma4 MoE layers
  - [Perf] Use 2D-grid to eliminate divmod in W8W8 group quant
  - [Bugfix] Fix Gemma4ToolParser streaming float corruption
  - [Performance] Decode fast path for scheduler and model runner
  - [Core] Added Feather as another waiting request queue to LLM inference scheduler
- Docs
  - Python not yet supported