vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
  - [ROCm][Bugfix][MLA] Fix FP8 KV cache for sparse MLA decode on gfx950 (DSv3.2)
  - [Bugfix][DeepSeek V4] Resolve expert_dtype for FP8 checkpoints missing the field
  - [Kernel] Replace union type-punning UB with std::bit_cast in sampler
  - [Attention] Support skip-softmax attention (BLASST) through TRTLLM flashinfer backends
  - [Core][Perf] Optimize sliding window cache hit search
  - [ROCm][DSv4] Functional fixes for DeepSeek V4 on MI300X (gfx942)
  - [MoE][Perf] Replace torch.compile pack with fused Triton kernels for FlashInfer routed MoE
  - Reapply #42686: [torch.compile] Add patch for fullgraph compilation
  - Implement MLA Decode + fp8 per token quant epilogue kernel
  - Remove Pydantic v2.11 workaround: simplify Mistral tokenizer tool call handling
- Docs
  - Python not yet supported