vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
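For context, the engine's offline batched-inference entry point looks roughly like the sketch below, based on the vLLM quickstart; the model name and sampling settings are illustrative assumptions, not anything specified on this page.

```python
# Minimal offline batched inference with vLLM (sketch; model name and
# sampling settings are illustrative only).
from vllm import LLM, SamplingParams

prompts = [
    "The capital of France is",
    "Explain KV cache quantization in one sentence:",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Loads the model and allocates the paged-attention KV cache.
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

The same engine can also be exposed as an OpenAI-compatible HTTP server via `vllm serve <model>`, which is what the `vllm bench serve` benchmark mentioned in the issue list below exercises.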
Issues
- [Kernel] feat: TurboQuant KV cache quantization (PolarQuant + QJL)
- [5/n] Migrate CUTLASS MLA, hadamard, awq, allspark and DSV3 fused_a_gemm to torch stable ABI
- [XPU] [Quant] rename mxfp8_e4m3_quantize and add xpu backend implementation
- Add opt-in `--record-power` option to `vllm bench serve`
- Fix Nano Nemotron VL regressions
- [torch.compile] Add compile-only mode
- [Renderer] Enforce token-only inputs for LLMEngine and AsyncLLM
- [Feature]: Sharded model loader doesn't support GCS
- [Compile] Fix nvfp4 compile warning
- Revert "[Bugfix][MLA] Change default SM100 MLA prefill backend back to TRT-LLM" (#38562)