vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
- Issues
- [Bug]: Gemma 4 torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors
- [Bug]: Potential misalignment between qwen3.5 chat template and recommended tool parser
- [Bug]: Gemma 4 E4B weight loading fails: `Gemma4ClippableLinear` parameter `input_max` not recognized
- [Bug]: Gemma 4 E4B extremely slow on v0.19.0: forced TRITON_ATTN fallback yields ~9 tok/s on RTX 4090 (vs ~100+ tok/s for comparable Llama 3B)
- [Feature]: Mamba `DS` conv state layout | Support speculative decoding with `mamba_cache_mode=align`
- [Bug]: Cross-request context contamination with async scheduling + pipeline parallelism on multi-node
- [Bug]: tool_choice='required' + PD disaggregation: internal server error for GLM-5
- [Bug]: Gemma 4 MoE NVFP4: expert_params_mapping doesn't handle scale key suffixes
- [Bug]: Qwen3.5 Inference TimeoutError with flashinfer gdn backend
- [Feature]: ROCm Kimi K2.5 EAGLE3 MTP heads