vllm
https://github.com/vllm-project/vllm
Python
A high-throughput and memory-efficient inference and serving engine for LLMs
Triage Issues!
When you volunteer to triage issues, you'll receive an email each day with a link to an open issue that needs help in this project. You'll also receive instructions on how to triage issues.
Triage Docs!
Receive a documented method or class from your favorite GitHub repos in your inbox every day. If you're really pro, receive undocumented methods or classes and supercharge your commit history.
Python not yet supported
29 Subscribers
Help out
- Issues
- fix(processor): route MiMo-V2-Omni media fetch through MediaConnector
- Refactor CT NVFP4 linear schemes around reusable weight/activation bu…
- [Bug]: ZeroDivisionError in deepgemm_post_process_fp8_weight_block when loading FP8 model with TP=16 on dual-node H20
- 【Feature】Modify the fps parameter when loading the multimodal model Video.
- [Feature] Add External LB Mode Support for Elastic EP Scaling
- [Misc] Add --max-duration-sec to benchmark_serving_multi_turn.py
- [Perf][Spec Decode] add force_max_spec_tokens for FULL cudagraph
- [Frontend] Add readiness endpoint and engine health checks
- Use cumem allocator for the KV cache by default
- [Bugfix] Allow float16/bfloat16 as kv_cache_dtype in Triton attention ops
- Docs
- Python not yet supported