vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
  - [Feature]: Performance Tiers: Apple-style hardware requirements for stable inference
  - [Model] Nemotron-H 3.5: quantized LM head and MTP compressed-tensors fix
  - [Model] Qwen3.5: quantized LM head, VLM prefix fallbacks, MTP fix
  - [CI Failure]: no logical_output_size in CutlassFP8ScaledMMLinearKernel
  - [CI Failure]: Spec Decode Draft Mode and Spec Decode Draft Model Nightly B200 failing with low match ratio
  - [CI Failure]: Multiple tests failing with assert output_size is not None
  - [CI Failure]: Multiple Gemma4 tests fail due to insufficient permissions
  - [CI Failure]: Quantized Models Test fails with OOM
  - prompt_tokens_details.cached_tokens always reports prompt_tokens - 1 in disaggregated prefill/decode mode
  - [Bug]: RuntimeError: UVA is not available
- Docs
  - Python not yet supported