vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
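
As a quick orientation to what the engine does, below is a minimal offline-inference sketch against vLLM's Python API (`LLM`, `SamplingParams`, `generate`); the model name, prompts, and sampling values are illustrative choices, not taken from this page.

```python
from vllm import LLM, SamplingParams

# Load a (small, illustrative) model into the vLLM engine.
llm = LLM(model="facebook/opt-125m")

# Sampling settings are example values only.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["The capital of France is", "vLLM is"]

# Batched generation; each result pairs the prompt with its completions.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)
```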
Help out
- Issues
- [Quantization] Add TurboQuant dynamic kv cache compression
- Hybrid KV offload: planner, MultiConnector, and mamba alignment for hybrid models
- [Bug]: Marlin MoE kernel fails with MXFP4-quantized GPT-OSS 20B - Invalid thread config for non-aligned dimensions (K=2880, N=2880)
- [Models][GDN] Remove GPU/CPU syncs in `GDNAttentionMetadata.build` during speculative decoding
- [Bug]: Possible warm start compile time issue for Deepseek V3.2 and Kimi K2.5
- [Bug]: ImportError: flash_attn.ops.triton.rotary not found on older versions (< v2.1.2)
- [Bug] scalar_types.int4 weight type not supported in Marlin kernel, making W4A8-INT models undeployable
- [Bug] W4A8-INT compressed_tensors silently runs W4A16 — activations never quantized to int8
- [Perf] FP8 FlashInfer Attn for ViT
- [RFC] Redesign enable_return_routed_experts to avoid blocking EngineCore event loop