vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
- Issues
- [Bug]:[SM90][FP8 blockwise] swap_ab path for small/non-multiple-of-4 M fails in can_implement() with kInvalid
- skip tokens do not span a full block in remove_skipped_blocks
- Use b''.join() for block hash concatenation in KV cache utils
- [Bug]: Qwen3.5 DFlash gives strange responses on SM90
- [WIP][Perf] Add FlashInfer CuTeDSL backend for NVFP4 GEMM on Blackwell
- [Bugfix] Fix level-2 sleep/wake/reload with enable_lora=True
- [Spec Decode] Support hybrid attention models in extract_hidden_states
- [Test] Add nightly MoE eval tests
- fix: check for missing weight files and enable weight integrity for q…
- [BugFix] Fix the bug where query_start_loc values are polluted/corrupted in _dummy_run
- Docs
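One PR title above proposes using `b''.join()` for block hash concatenation in the KV cache utils. A minimal sketch of the idea (function name and hash scheme are illustrative, not vLLM's actual code): joining a list of `bytes` in one pass avoids the repeated copying that chained `+=` on immutable `bytes` objects can incur.

```python
import hashlib

def concat_block_hashes(block_hashes: list[bytes]) -> bytes:
    # b''.join() allocates the result once, instead of copying
    # the accumulated prefix on every += as bytes concatenation would.
    return b''.join(block_hashes)

# Hypothetical per-block hashes (SHA-256 digests, 32 bytes each).
hashes = [hashlib.sha256(str(i).encode()).digest() for i in range(3)]
combined = concat_block_hashes(hashes)
```
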