vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
  - [Perf] Reduce MTP decode bubbles for Qwen3.5 hybrid models
  - [XPU] Fix Eagle3 initialization on XPU
  - [Bug]: Qwen3-1.7B silent correctness regression in vLLM 0.21.0: TP=2/4 and Triton attention produce wrong answer
  - [Bug][Deepseek v4][DBO]: AssertionError: positions is required for C128A metadata build File
  - [Bug]: KeyError: 'layers.0.mlp.gate_up_proj.g_idx' of GLM-OCR GPTQ Int8 in v0.21.1rc1
  - [Bug]: gpt-oss-120b MXFP4 MoE init OOM-killed on unified-memory ARM (DGX Spark / Jetson Thor)
  - [ROCm][Bugfix] Fix GPT-OSS Quark MXFP4 MoE loading - emulation buffer not block-aligned
  - [Bugfix][Quantization] Fix OCP MX emulation MoE crash with fp8 activations (w_*_a_fp8)
  - [Bug]: --max-logprobs and --long-prefill-token-threshold silently accept negative values (config-validation gap)
  - [Kernel][Helion][1/N] Add Helion kernel for silu_and_mul_per_block_quant
- Docs
  - Python not yet supported