vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
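Below is a minimal sketch of offline batched inference with vLLM's Python API, to show what the engine does; the model name and sampling settings are placeholders, not recommendations from this page.

```python
from vllm import LLM, SamplingParams

# Example prompts and sampling settings (values are illustrative only).
prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load a model and run batched offline inference.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```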
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
25 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Doc] Add Scheduler section to V1 architecture overview
- [Misc][LoRA] Add automerge weight merge for single-adapter LoRA serving
- [Bug]: --kv-cache-dtype fp8 produces garbage output on MLA models (GLM-4.7-Flash) in multi-turn conversations
- [Bug]: MLA attention casts activations to int32 when using Marlin FP8 on GPUs without native FP8 support (sm < 89)
- [Bug]: data_parallel_rpc_port is not robust to invalid traffic and can crash multi-node startup
- Fix llm_request trace context propagation
- [CI][ROCm] Remove unsupported cases in test_fusion.py
- [CPU] Fix OMP lscpu JSON when NUMA node key is absent
- [RFC]: O(1) KV Cache for vLLM: 4.8x Speedup & 22x More Accurate than TurboQuant on Qwen2.5-7B
- [Bug]: qwen3.5 outputs garbled spaces when response_format json_schema is enabled
- Docs
- Python not yet supported