vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
26 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Bug]: google/gemma-4-E2B-it text-only mode seems to use more VRAM than multimodal mode
- [Bug]: MooncakeConnector may auto-select the wrong host IP when VLLM_HOST_IP is unset on multi-homed hosts
- [Kernel][MoE] Add GELU_TANH to CPU, CUTLASS, and WNA16 MoE backends
- [Fix] Fix one-sided MoE padding sentinel for local expert maps
- [CI Failure][Bug] AsyncScheduler drops first post-resume token after pause_generation(mode="keep") + clear_cache
- [Bugfix] Auto-detect SupportsHMA connectors instead of unconditional disable
- vLLM 0.20.1 hard-pins torch 2.11.0, which OOMs during CUDA initialization on RTX 4090 / cu130
- [Feature]: Output prompt text when `--enable-log-requests` is enabled
- [Performance]: Throughput/TTFT/TPOT slower with quantized KV cache?
- [Bug]: Engine hangs indefinitely during model weight loading for nvidia/Qwen3.5-397B-A17B-NVFP4 on Blackwell GPUs (RTX PRO 6000) with TP=4
- Docs
- Python not yet supported