vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
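For context on the project being triaged, here is a minimal offline-inference sketch using vLLM's Python API (the model name is just an example; any Hugging Face model supported by vLLM works):

```python
from vllm import LLM, SamplingParams

# Example prompts and sampling settings (values are illustrative)
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a model and run batched generation
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```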
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes instead and supercharge your commit history.
25 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Bug] Fatal AssertionError: Encoder KV cache fails to evict tokens, exceeding max_model_len in long-lived WebSocket sessions
- [Bug]: Step 3.5 Flash MTP failed to start in v0.19.0
- [Bugfix] [ROCm] Add gfx12x FP8 support for fused_batched_moe kernels
- [Bug]:[SM90][FP8 blockwise] swap_ab path for small/non-multiple-of-4 M fails in can_implement() with kInvalid
- skip tokens do not span a full block in remove_skipped_blocks
- Use b''.join() for block hash concatenation in KV cache utils
- [Bug]: Qwen3.5 DFlash gives strange responses on SM90
- [WIP][Perf] Add FlashInfer CuTeDSL backend for NVFP4 GEMM on Blackwell
- [Bugfix] Fix level-2 sleep/wake/reload with enable_lora=True
- [Spec Decode] Support hybrid attention models in extract_hidden_states
- Docs
- Python not yet supported