vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
  - [Frontend] Add OpenAIBaseModel.get_extra_fields() public accessor
  - [Bugfix] Fix crash with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True set and custom_allreduce enabled
  - [Rust Frontend] Support continuous_usage_stats stream option
  - [Bug]:
  - Add Chimera edit-program example showing 1.8x coding-task speedup
  - [Bugfix] Split mixed reasoning/content streaming deltas
  - [AMD][Bugfix][Quantization] Honor fused-name match in is_layer_skipped
  - [TurboQuant] Add MTP spec-decode routing
  - [ROCm][Perf] DSv3.2: fuse indexer Q-RoPE+quant + K-norm/RoPE/quant/cache
  - [ROCm][DSV4] Use aiter mHC pre/post as the default ROCm path
- Docs
  - Python not yet supported