vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
- [SimpleCPUOffloadConnector]: Add KV Events
- [Quantization] Align online fp8_ptpc/block_fp8/mxfp8 weight quantization with llm-compressor (compressed-tensors) export
- [ROCm]: Verify and enable cross-layer KV cache layout for AITER MLA decode
- Sanitize server file paths from validation error responses
- [compile][RFC] VllmBackend parity with vanilla torch.compile + external FULL cudagraph
- [Bugfix][Mamba2] Fix assert crash when prefill-reclassified-as-decode occurs with no concurrent spec tokens
- [Model] Fix quantization config resolution for Gemma 4 MTP draft model
- [Model][Attention] DiffusionGemma: NVFP4 KV cache via FlashInfer VO-split + per-request causal grouping (sm120)
- [Bugfix][V1] Warm up slot-mapping kernel through BlockTable
- fix(rocm): fall back to cuda_ipc backend for NIXL on ROCm/AMD
- Docs
- Python not yet supported