vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
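For a sense of what the engine does, here is a minimal sketch of offline inference with vLLM; the model name, prompts, and sampling settings are illustrative examples, not project defaults.

```python
from vllm import LLM, SamplingParams

# Load a model into the engine (downloaded from Hugging Face if not cached).
# "facebook/opt-125m" is just a small example model.
llm = LLM(model="facebook/opt-125m")

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# A batch of prompts; vLLM schedules them together for high throughput.
prompts = ["Hello, my name is", "The capital of France is"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```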
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're a real pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
23 Subscribers
Help out
- Issues
- Prototyping single batch overlapping for Deepseek EP
- [RFC]: Support Dynamic Context Parallelism
- [Bug]: triton 3.5 for gpt-oss fails on sm11.0a cu130
- Add OpenVLA model support
- Fix missing as_list() conversion in streaming chat completion
- [Bug]: inference fails when the prompt ends with '.' or ':' for video inputs
- [Bug]: DeepSeek-V3.1-Terminus-BF16 run error
- fix(gguf): Skip lm_head mapping for models with tied word embeddings
- [Bugfix] Fix ScalarType NanRepr enum comparisons
- [Installation]: How to install vLLM 0.11.0 with CUDA < 12.9 (Driver 535)? No matching wheels found
- Docs
- Python not yet supported