vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
- Issues
- Add option to disable weakref conversion for last piecewise cudagraph in a module
- Add a flag to use FusedMoE kernel in compressed quantization
- [openai api] log http exception in handler
- [LoRA] Add `--lora-target-modules` to selectively apply LoRA layers
- [RFC]: Separated CPU KV Cache Offloading/Transfer Process
- Optimize dockerfile
- [offloader] v2: Hide weight onloading latency via prefetching
- Implement LMDB-based multi-modal cache
- [Hardware][Power] Add IBM MASS + NUMA optimizations for POWER8
- [Feature]: Extract KV-Cache update from all attention backends