vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
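For orientation before triaging, vLLM exposes an offline generation API in Python. Below is a minimal sketch; the model name and sampling settings are illustrative assumptions, not project defaults.

```python
# Minimal sketch of vLLM's offline inference API.
# The model name below is only an example; any supported Hugging Face causal LM works.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# Loads the model weights and allocates the paged KV cache up front.
llm = LLM(model="facebook/opt-125m")

# Batched generation; vLLM schedules the requests internally.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, output.outputs[0].text)
```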
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
13 Subscribers
Add a CodeTriage badge to vllm
Help out
Issues
- [Online Quantization] Support memory-efficient online quantization via layerwise loading
- [Kernel] Optimize grouped topk kernel
- [Kernels] Make GGUF linear method allow 3d inputs
- Add unit tests for fp8 output fusion of triton_attn
- [CI Failure]: mi325_1: Language Models Test (MTEB)
- [KVConnector] Clean up redundant code in KV connectors
- [ROCm][Kernel] Add GFX11 (RDNA3) support for wvSplitK skinny GEMM kernels
- [Misc] Introduce ec_both role EC (encoder cache) connector
- [DNM][Bugfix] Fix mamba cache dtype for Qwen3.5
- [XPU][7/N] enable xpu fp8 moe
Docs
- Python not yet supported