vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
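As a quick orientation, below is a minimal sketch of vLLM's offline generation API; the model name and sampling values are illustrative examples, not taken from this page.

```python
# Minimal offline-inference sketch with vLLM (model name is illustrative).
from vllm import LLM, SamplingParams

# Load a model; vLLM manages KV-cache memory efficiently via PagedAttention.
llm = LLM(model="facebook/opt-125m")

# Sampling settings here are example values.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Generate completions for a batch of prompts.
outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```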
Help out
Issues
- [SimpleCPUOffloadConnector] PCP + DCP support
- [KV Connector] Remove compat support for pre-v0.12.0 constructor signatures without `KVCacheConfig`
- [Bug]: vllm main process cannot be canceled by interrupting it via the keyboard
- [Bug]: V1 Engine: Child process (EngineCore) dies silently while parent (APIServer) remains alive and responds to health checks, causing service hang
- [Bug]: When function_call is used in the request and tool_choice is set to required, the system performs schema formatting in advance, before the request is sent to the model for inference.
- [RFC]: Replace Hardcoded Device Strings with current_platform and Implement Linting
- [Bug]: Significant Cross-Instance Inference Variance in vLLM v0.18.0 on H20 (~10-point gap) with Qwen3.5-35B-A3B
- [ROCm] Fix AttributeError in GELU activations when C extension ops are absent on cuda_alike platforms
- [v1] Expose num_prompt_tokens in CommonAttentionMetadata
- [RFC]: Async Failure Notification for Fault Tolerant EP Kernels