vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
  - [Bug]: FP8 MoE on SM120 (RTX PRO 6000 Blackwell) crashes in Triton fused_moe: AssertionError "Unsupported lhs dtype fp8e4nv"; VLLM_MOE_FORCE_MARLIN=1 not honored
  - Mergify message not on cancelled
  - [vLLM Model] Migrate deepseekv2 to vllm/models
  - [Bug]: Reducing block sizes or `num_stages` may help.
  - [RFC]: Triton Kernel Dispatcher for Multi-Platform Support
  - [Bugfix][V1] Use device-aware timeout for execute_model RPC on CPU
  - [Bug]: [load failure] request could not finish when load failure happens
  - Fix (Benchmark): Prevent Credential Exposure and Sensitive Data Leakage
  - [Bugfix] Clean up ModelOpt LM head state before tying weights
  - [BugFix] Reset num_output_placeholders on KV load failure recomputation
- Docs
  - Python not yet supported