vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
- Issues
- [Bug]: Marlin MoE kernel fails with MXFP4-quantized GPT-OSS 20B - Invalid thread config for non-aligned dimensions (K=2880, N=2880)
- [Models][GDN] Remove GPU/CPU syncs in `GDNAttentionMetadata.build` during speculative decoding
- [Bug]: Possible warm start compile time issue for Deepseek V3.2 and Kimi K2.5
- [Bug]: ImportError: flash_attn.ops.triton.rotary not found on older versions (< v2.1.2)
- [Bug] scalar_types.int4 weight type not supported in Marlin kernel, making W4A8-INT models undeployable
- [Bug] W4A8-INT compressed_tensors silently runs W4A16 — activations never quantized to int8
- [Perf] FP8 FlashInfer Attn for ViT
- [RFC] Redesign enable_return_routed_experts to avoid blocking EngineCore event loop
- [Bugfix] Fix ImportError for flash_attn < v2.1.2 missing triton rotary module
- [Installation]: Ray not present in Container Image
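The flash_attn ImportError issue above (and its matching bugfix) points at an optional module that only exists in newer flash_attn releases. A minimal sketch of the usual guarded-import pattern, assuming the module path `flash_attn.ops.triton.rotary` from the issue title; the names `apply_rotary` and `HAS_TRITON_ROTARY` are illustrative, not vLLM's actual code:

```python
# Hedged sketch: tolerate flash_attn versions (< v2.1.2 per the issue)
# that lack the triton rotary module, instead of crashing at import time.
try:
    # Present only in newer flash_attn releases.
    from flash_attn.ops.triton.rotary import apply_rotary
    HAS_TRITON_ROTARY = True
except ImportError:
    # Older flash_attn (or flash_attn absent): callers must fall back
    # to a non-triton rotary implementation.
    apply_rotary = None
    HAS_TRITON_ROTARY = False
```

Call sites then branch on `HAS_TRITON_ROTARY` rather than assuming the fast path exists.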