vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
  - v1/engine: emit prefix-cache KV-events at hash_block_size granularity for hybrid Mamba+Attention models
  - [Perf][Frontend] Cache "<|0.00|>" anchor token id for ASR verbose segments
  - [Usage]: embeddings API when task is generate
  - Fix `previous_response_id` dropping tool calls from stored context
  - [Bug]: Triton Attention AssertionError on supported kv_cache_dtype
  - [Bug]: Granite 3.3 / 4.0 H-Small Python-style tool calls not converted to OpenAI tool_calls format
  - [Spec Decode] Add Qwen3 architecture support for EAGLE3
  - [Bug]: UBatch CUDA graph capture stores graph under first-two-microbatch token count when ubatch_size > 2
  - [NIXL][2/N] Cache TP slicing and mapping redesign
  - [Bugfix] Fix UBatchWrapper CUDA graph key to sum all ubatches, not just first two
- Docs
  - Python not yet supported