vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
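For context before triaging, vLLM exposes both an offline Python API and an OpenAI-compatible server. Below is a minimal sketch of offline inference; the model choice (facebook/opt-125m) is only an illustrative assumption, not something this page prescribes.

```python
# Minimal offline-inference sketch using vLLM's Python API.
# facebook/opt-125m is an illustrative choice; any supported HF model works.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# LLM() loads the model and allocates the paged KV cache up front.
llm = LLM(model="facebook/opt-125m")

# generate() batches prompts via continuous batching for high throughput.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```

The same engine can also be launched as an OpenAI-compatible server with `vllm serve <model>`.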
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes instead and supercharge your commit history.
Python not yet supported
15 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Metrics] Add Prometheus counters for Model FLOPs Utilization (MFU)
- [Attention][4/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties
- [Model] Add huggingface skt/A.X-K1 model
- [ONLY FOR TEST][MLA] Add nvfp4 packed KV cache decode path via dequant cache op #32220
- Implement the T5GEMMA2 model from Google
- [Attention] FA4 integration
- [W8A8 Block Linear Refactor][2/N] Make FP8 Block Linear Ops use kernel abstraction.
- Add Torchax as an alternative PyTorch->TPU lowering backend
- [RFC]: Enabling Suffix Decoding, LSTM Speculator, Sequence Parallelism from Arctic Inference
- [Bugfix][Frontend] Support webm with audioread fallback
- Docs (Python not yet supported)