vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
19 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
- [Bug]: --kv-cache-dtype fp8 produces garbage output on MLA models (GLM-4.7-Flash) at multi-turn
- [Bug]: MLA attention casts activations to int32 when using Marlin FP8 on GPUs without native FP8 support (sm < 89)
- [Bug]: Regression: can no longer load Qwen 3.5 397B nvfp4 model - CUBLAS_STATUS_NOT_INITIALIZED
- [Bugfix] Preserve original ImportError in gRPC server entrypoint
- [Bug]: data_parallel_rpc_port is not robust to invalid traffic and can crash multi-node startup
- Fix llm_request trace context propagation
- [CI][ROCm] Remove unsupported cases in test_fusion.py
- [CPU] Fix lscpu NUMA node regex to handle quoted - and null in containers
- [Quantization] Rename mxfp4 quant layer and oracle to gpt_oss_mxfp4
- [ROCm][CI] Remove soft_fail from AMD Docker Image Build
- Docs: Python not yet supported