vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
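For context, here is a minimal offline-inference sketch using vLLM's `LLM` and `SamplingParams` API; the model name and sampling settings are arbitrary examples chosen for illustration, not taken from this page:

```python
from vllm import LLM, SamplingParams

# Load a model into the vLLM engine (model name is an arbitrary example).
llm = LLM(model="facebook/opt-125m")

# Sampling settings are illustrative, not recommendations.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["The capital of France is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```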
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
25 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Performance] Reduce CPU launch overheads for MultiGroupBlockTable and attn_meta_data handling on Blackwell
- [Attention][SM90] Add CUTLASS FA3 sparse MLA attention backend for Hopper GPUs
- [Feature] Fused SiLU + Mul + per-token dynamic FP8 quantization (Triton)
- Add prompt-percentage based selective offload in OffloadConnector
- [Feature]: Qwen3.5-Moe LoRA Support (experts)
- [ROCm][Perf] Replace WNA16 MoE Triton kernel with FlyDSL MoE
- INT16/FP8/INT8 + SR SSM Cache Support
- [Bugfix] Fix gemma4 reasoning leak in multi-turn tool_choice=auto streaming (#39885)
- [Core] Replace routing replay with device cache and async D2H pipeline
- FlashInfer + DFlash + FP8 on RTX 4090
- Docs
- Python not yet supported