vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
- Fix DeepSeek V4 FlashMLA auto KV cache dtype
- [Bug]: ModelOpt NVFP4 checkpoints with quantized lm_head fail to load: "There is no module or parameter named lm_head.input_scale"
- feat(loader): zero-copy checkpoint prefetching
- fix: call parent init in EagleMistralLarge3Model to set use_mha
- fix: compensate timestamp drift in Whisper chunked transcription
- [Kernel] NVIDIA-tuned tile configs + PID swizzling for triton_scaled_mm (up to 1.6x on H800, 2.0x on L20)
- [KvConnector] Fix cupy-cuda13x installed on CUDA 12 images
- [Kernel][Perf] 1.5x speed up tune fused_moe for Qwen3-Next-80B on H20-3e
- [ROCM][DSV32][Perf][MTP] Enable UNIFORM_BATCH CG mode in rocm_aiter_mla_sparse
- fix(distributed): propagate distributed_timeout_seconds to NCCL device groups
- Docs
- Python not yet supported