vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
- FlashInfer NVFP4 NaN propagation plausible fix
- support firered_aed_l model
- [Bugfix] Handle truncate_prompt_tokens in Harmony (GPT-OSS) path
- [Core] Add register_model() to KVConnectorBase_V1 for CacheBlend
- [Bugfix] AsyncLLM: Add the ability to specify the pooling_task
- [Bugfix] Decode prompt text from token IDs upstream in renderer
- Add local-runtime CLI, launcher install flow, and easy model management
- [Bug]: Error during inference, the model shut down. Deployed model: Qwen3.5-122B-A10B-FP8
- [Bugfix] Fix OOM caused by cumem allocator inflating memory_reserved()
- fix: re-record prepare_inputs_event after sample_tokens for spec decode
- Docs
- Python not yet supported