vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
15 Subscribers
Add a CodeTriage badge to vllm
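For context, a CodeTriage badge is just a markdown image link dropped into the repo's README. A minimal sketch, assuming CodeTriage's usual `badges/users.svg` URL pattern (the exact snippet is shown on the project's CodeTriage page):

```markdown
<!-- Hypothetical badge snippet; verify the exact markup on codetriage.com -->
[![Open Source Helpers](https://www.codetriage.com/vllm-project/vllm/badges/users.svg)](https://www.codetriage.com/vllm-project/vllm)
```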
Help out
- Issues
  - [openai api] log http exception in handler
  - [LoRA] Add `--lora-target-modules` to selectively apply LoRA layers
  - [RFC]: Separated CPU KV Cache Offloading/Transfer Process
  - Optimize dockerfile
  - [offloader] v2: Hide weight onloading latency via prefetching
  - Implement LMDB-based multi-modal cache
  - [Hardware][Power] Add IBM MASS + NUMA optimizations for POWER8
  - [Feature]: Extract KV-Cache update from all attention backends
  - [Bug]: vLLM hangs forever on waiting engine process to start
  - Introduce RayCudaCommunicator as Ray Compiled Graph communicator
- Docs
  - Python not yet supported