vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
- Issues
- [Misc] Force `_init_reorder_batch_threshold` to be called and make `get_reorder_batch_threshold` an instance property
- Take SM Count During Persistent SiLU Mul Quant
- [Feature]: Qwen3 Omni Transcriptions
- [Don't merge] Try enabling fastsafetensors as default weight loader
- [Doc] Add 20251202 vLLM Malaysia Meetup Info
- [Bugfix] Fix infinite loop in V1 scheduler with max-length prompts (#…
- [Bug]: DeepSeek-V3.2 As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one
- [Bug]: Got different `max model len` using MTP with Qwen3 next
- [Fix] Support TP for ModelOpt NVFP4 by adding dynamic padding
- [docker] Only install flashinfer-jit-cache on CUDA 12.8+