vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
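For context on what the engine does, here is a minimal offline-inference sketch using vLLM's Python API; the model name is only an example, and parameter defaults may differ across versions:

```python
# Minimal offline-inference sketch with vLLM's Python API.
# The model name below is just an example placeholder; any supported
# Hugging Face causal LM can be substituted, and defaults may vary by version.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

llm = LLM(model="facebook/opt-125m")  # loads the model and sets up the engine
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")
```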
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
13 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Misc] Add VLLM_DISTRIBUTED_INIT_METHOD_OVERRIDE env var
- [Benchmark] Add support for list of tokenized ids for custom dataset
- v1/kv_cache_utils: Respect num_gpu_blocks_override in memory check
- [ROCM] Add sinks argument to AiterFlashAttentionImpl
- [Bug]: (EngineCore_DP0 pid=77352) INFO 10-22 09:43:49 [shm_broadcast.py:466] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation).
- [CI/Build] improve editable mode setup
- [Bug]: Inference Qwen3-VL-30B-A3B with tp=2 pp=4 on 8x4090 gets weird results
- [Bug]: vLLM 0.11.0 is about 10x slower than 0.8.1 on classification task
- [perf] Optimize Qwen2-VL Startup Performance with LRU Cache
- [Core] Prefix cache: frequency- and cost-aware eviction (opt-in)
- Docs
- Python not yet supported