vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
31 Subscribers
Help out
- Issues
  - [Frontend] Add TLS support with certificate/key files
  - [BugFix] Report correct cached_tokens for disaggregated prefill
  - fix: guard init_fp8_kv_scales against unallocated tensors during wake up after sleep
  - [Multimodal] fix: resolve memory allocation and buffer overflow in audio resampling
  - [Bug]: glm-5-fp8 zcode str object has no attribute items
  - [Attention][Quantization] NVFP4 KV cache on consumer/SoC Blackwell (sm120/sm121) for Gemma 3/4 via FlashInfer FA2
  - [Bugfix][KV Connector] Mooncake: honor logical->physical block ratio in register_kv_caches
  - [Bugfix][Core] Skip stale KV xfer finish notifications for already-freed requests
  - [Docs] Add Qwen2.5-Coder-7B-Instruct to batch invariance tested models
  - v1 P/D: _update_from_kv_xfer_finished AssertionError (kills EngineCore) when an aborted request's finished_recving + finished_sending land in the same step
- Docs
  - Python not yet supported