vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
30 Subscribers
Help out
- Issues
- Fix NVFP4 pack16 CUDA version guard
- [ROCm][Bugfix] Plumb rotary_dim through fused QK-norm+RoPE+KV-cache kernel (enables GLM-4.7-FP8 on top of #42749)
- [Optimization] Return image_grid_thw in render response for disaggregated mRoPE
- [Doc] Add google/gemma-2-2b-it to batch invariance tested models
- [Feature] Detect all2all peer fault with fault tolerance backend and prevent corrupted output
- [Bugfix][DSV4] Plumb quant_config + route compressor GEMM through quant-aware dispatch
- [Bugfix] Avoid concurrent port allocation collisions
- [RFC]: Add Gumiho speculative decoding to vLLM
- [Draft][Kernel] Port ActivationQuantFusionPass to manual fusion (#43501)
- [Usage]: How to run Qwen3.5 models on V100 given the conflicting requirements of transformers version and vLLM architecture support
- Docs
- Python not yet supported