vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
- Issues
- [Hotfix] Minor polish to reduce repeated `key in map` checks (see the lookup sketch below the list).
- [Bug]: Qwen 3.5 model launch gets stuck for a long time
- [Bug]: CUDA assert in triton attention for MolmoWeb models (Molmo2 architecture with different max_position_embeddings)
- [Core][Feat] Safely abort requests where the FSM failed to advance (a hedged sketch follows the list)
- [Bug]: Jamba tool parser crashes on Mistral-style [TOOL_CALLS] models with standard HF tokenizer (e.g., Apriel-Nemotron-15b)
- fix(lora): use a float32 intermediate buffer in fused MoE LoRA to prevent bf16 precision loss (see the accumulation sketch below the list)
- [Bug]: Parity with CUDA: ROCm nightly & release Docker images aren't built with support for Pollara AINIC or Broadcom Thor-2 NICs
- ROCm sometimes compiles torch.log problematically on MI325
- [MRV2][KVConnector] Fix missing build_connector_worker_meta
- Fix invalid logprobs with MTP enabled and sync scheduling
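
The `key in map` hotfix above describes a common micro-optimization: a membership test followed by an index is two hash lookups, while `dict.get` does one. A minimal sketch of the pattern, with illustrative names rather than vLLM's actual code:

```python
from typing import Optional


def lookup_twice(cache: dict[str, str], key: str) -> Optional[str]:
    # Two hash lookups: one for the membership test, one for the access.
    if key in cache:
        return cache[key]
    return None


def lookup_once(cache: dict[str, str], key: str) -> Optional[str]:
    # One hash lookup: dict.get returns None on a miss.
    return cache.get(key)


if __name__ == "__main__":
    cache = {"a": "1"}
    assert lookup_twice(cache, "a") == lookup_once(cache, "a") == "1"
    assert lookup_twice(cache, "b") is None and lookup_once(cache, "b") is None
```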
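For the FSM item, the idea as the title reads is that a guided-decoding state machine that cannot advance on a sampled token should abort only that request rather than raise into the engine loop. The sketch below is hypothetical: `GrammarFSM`, `Request`, `FinishReason`, and `apply_token` are illustrative names, not vLLM's API.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class FinishReason(Enum):
    STOPPED = auto()
    ABORTED = auto()


@dataclass
class Request:
    request_id: str
    finished: bool = False
    finish_reason: Optional[FinishReason] = None


class GrammarFSM:
    """Toy stand-in for a guided-decoding state machine."""

    def __init__(self, transitions: dict[tuple[int, int], int]):
        self.state = 0
        self.transitions = transitions  # (state, token) -> next state

    def advance(self, token: int) -> bool:
        nxt = self.transitions.get((self.state, token))
        if nxt is None:
            return False  # token is not legal in the current state
        self.state = nxt
        return True


def apply_token(req: Request, fsm: GrammarFSM, token: int) -> None:
    # Abort only the failing request instead of raising into the
    # scheduler and taking the rest of the batch down with it.
    if not fsm.advance(token):
        req.finished = True
        req.finish_reason = FinishReason.ABORTED


if __name__ == "__main__":
    fsm = GrammarFSM({(0, 7): 1})
    req = Request("req-0")
    apply_token(req, fsm, 3)  # token 3 is illegal in state 0
    assert req.finished and req.finish_reason is FinishReason.ABORTED
```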
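The fused MoE LoRA fix names a standard numerical trick: keep the running sum in float32 even when inputs and outputs are bf16, because bf16's roughly 8-bit mantissa rounds away low-order bits on every add. A minimal demonstration in plain PyTorch, not the fused kernel itself:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4096)
w = torch.randn(4096)

ref = torch.dot(x, w)  # full-precision reference

xb, wb = x.bfloat16(), w.bfloat16()

# Naive: the running sum itself is bf16, so every add rounds.
acc_bf16 = torch.zeros((), dtype=torch.bfloat16)
for xi, wi in zip(xb, wb):
    acc_bf16 = acc_bf16 + xi * wi

# With a float32 intermediate buffer: the products are still bf16,
# but the accumulator keeps the low-order bits.
acc_f32 = torch.zeros(())
for xi, wi in zip(xb, wb):
    acc_f32 = acc_f32 + (xi * wi).float()

print(f"bf16 accumulator error: {abs(acc_bf16.float() - ref).item():.4f}")
print(f"fp32 accumulator error: {abs(acc_f32 - ref).item():.4f}")
```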