vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
18 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [QeRL] Compose online quantization with quantized reloading
- [Perf] fuse kernels in gdn
- [RFC] Tail-Optimized LRU (T-LRU): Reducing Tail Latency via Conversation-Aware KV Cache Eviction
- [Bug] Potential incorrect tokenizer source path in RunAI object storage pull
- [Installation]: Documented v0.18.0 cu128 release wheel URL returns 404
- [Performance]: Request vLLM input: FlashInfer JIT ops not registered as proper torch.ops custom ops, breaking torch.compile(fullgraph=True) — upstream fix in progress at flashinfer#2734
- [Bug]: Phi qk_layernorm appears to be unsupported in vLLM
- [Bug]: NGC vLLM 26.02 rejects Nemotron-3-Super-120B-A12B-NVFP4 — quant_algo MIXED_PRECISION not in whitelist
- Fix incorrect tokenizer source path in RunAI object storage pull (#37836)
- fix(moe): fix RoutedExpertsCapturer assertion failure with DP>1 and MK path