vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes instead and supercharge your commit history.
Python not yet supported
26 Subscribers
Add a CodeTriage badge to vllm
Help out
- Issues
  - [Feature]: Add nemotron_json as built-in tool parser (NVIDIA Nemotron-Nano-9B-v2 plugin breaks against v0.20.x module reorg)
  - [Bugfix] Fix MTP draft model using local cache path instead of S3 URL with runai_streamer
  - [Bug] Fix kimi dtype issue with `mm_projector_forward`
  - [RFC]: Standardize KV-cache Layouts
  - [Model] llama: add FP8 scale name remapping to LlamaForCausalLM and LlamaModel
  - [CI Failure]: mi300_1: Quantized Models Test
  - fix: replace assert with early return in get_pkg_version for non-Linux platforms
  - Fix DeepSeek V3 tokenizer class override
  - [Doc]: Gemma 4 assistant speculative decoding docs do not match actual behavior on vLLM 0.20.1
  - [Bug]: FP8 MoE models produce corrupted output when serving LoRA adapters
- Docs
  - Python not yet supported