vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
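
For context, vLLM exposes an offline batch-inference API alongside its OpenAI-compatible server. Below is a minimal sketch using the documented `LLM` entry point; the model name and sampling values are illustrative choices, not project defaults:

```python
# Minimal offline-inference sketch using vLLM's documented Python API.
# The model ID and sampling parameters here are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any Hugging Face-compatible model ID
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```
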
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
26 Subscribers
Help out
- Issues
- fix(model_loader): deterministic suffix mapping for Gemma4 MoE quantization
- [kv_offload]: Add request_finished method to OffloadingManager and decouple store policy
- [ROCm] Enable gluon paged MQA logits on gfx950 (MI355X)
- docs: Add Apple Silicon documentation for vLLM-Metal GPU support
- [Model] Support MOSS-VL
- [Scheduler] Optimize priority queue removals with lazy deletion and batch rebuild (see the sketch after this list)
- [Performance][MLA][ROCm] AITER fused QK-RoPE + KV cache + q-absorb + q-cat + q-quant for decode
- [RFC]: Add DeepStream as a video loader backend for GPU-accelerated Video decode
- Fix: Nemotron 3 rescue whitespace-only final_content, not just None
- [Bug]: ValueError when CUDA_VISIBLE_DEVICES is a MIG device UUID
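
The scheduler issue above names lazy deletion with batch rebuild for priority-queue removals. As a generic illustration of that technique (not vLLM's actual scheduler code), a removal is O(1) because the entry is only marked dead; dead entries are skipped on pop and purged in a single O(n) re-heapify once they outnumber live ones:

```python
import heapq
import itertools

class LazyDeletionQueue:
    """Min-priority queue with O(1) removals via lazy deletion.

    A generic sketch of the technique, not vLLM's implementation.
    Assumes each item is enqueued at most once while present.
    """

    def __init__(self):
        self._heap = []                # (priority, seq, item) entries, some may be stale
        self._removed = set()          # items marked deleted but still sitting in the heap
        self._live = 0                 # number of entries that are not stale
        self._seq = itertools.count()  # tie-breaker so items never get compared directly

    def push(self, priority, item):
        heapq.heappush(self._heap, (priority, next(self._seq), item))
        self._live += 1

    def remove(self, item):
        # O(1): mark only; the heap entry stays until popped or rebuilt.
        self._removed.add(item)
        self._live -= 1
        if len(self._heap) > 2 * max(self._live, 1):
            self._rebuild()  # stale entries dominate: purge them in one batch

    def pop(self):
        while self._heap:
            priority, _, item = heapq.heappop(self._heap)
            if item in self._removed:
                self._removed.discard(item)  # stale entry: skip it
                continue
            self._live -= 1
            return priority, item
        raise IndexError("pop from empty queue")

    def _rebuild(self):
        # Batch rebuild: drop every stale entry, then re-heapify in O(n).
        self._heap = [e for e in self._heap if e[2] not in self._removed]
        heapq.heapify(self._heap)
        self._removed.clear()
```

The trade-off is holding stale entries in memory instead of paying O(log n) per true removal; the 2x threshold keeps the batch rebuild amortized O(1) per removal.
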