vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
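For context, vLLM is used as a Python library for offline inference as well as a serving engine. The snippet below is a minimal sketch of the offline-inference API, assuming vLLM is installed; the model name "facebook/opt-125m" is only an illustrative choice, not something this page specifies.

```python
# Minimal sketch: offline text generation with vLLM's Python API.
from vllm import LLM, SamplingParams

# Load a small example model (assumed here purely for illustration).
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=64)

# Generate a continuation for a single prompt.
outputs = llm.generate(["The capital of France is"], params)
for output in outputs:
    print(output.outputs[0].text)  # generated text for this prompt
```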
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
23 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Attention][4/n] Remove usage of deprecated seq_lens_cpu and num_computed_tokens_cpu CommonAttentionMetadata properties
- [ONLY FOR TEST][MLA] Add nvfp4 packed KV cache decode path via dequant cache op #32220
- Implemented the T5GEMMA2 from Google
- [W8A8 Block Linear Refactor][2/N] Make FP8 Block Linear Ops use kernel abstraction.
- Add Torchax as an alternative Pytorch->TPU lowering backend
- [RFC]: Enabling Suffix Decoding, LSTM Speculator, Sequence Parallelism from Arctic Inference
- [Model] vLLM v1 support for mlp_speculator
- Layered Dockerfile for smaller size and faster image pulling
- [Bug]: FP4 not leveraged on RTX 6000 Pro (Blackwell)
- [Usage]: How to make a quantized model (w4a FP8)? I made one with llm-compressor, but it does not work in vLLM 0.10.2.
- Docs
- Python not yet supported