deepspeed
https://github.com/microsoft/deepspeed
Python
DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
15 Subscribers
Help out
- Issues
  - [BUG] Why in the checkpoints of later stage of training, the global_step number is inconsistent with checkpoint step number?
  - How can I merge the original model weights with LoRA weights?
  - [BUG] TypeError: 'staticmethod' object is not callable, in deepcompile (patch_compiled_func.py)
  - Deepspeed Zero3 resume training with Huggingface Trainer Failed
  - [REQUEST] Support Muon Optimizer
  - [BUG] ZeRO-3 partition does not work in Ulysses SP tutorial
  - [BUG] Why are the checkpoints saved during deepspeed-zero0 training larger than the safetensors of the original base model?
  - [BUG] FlopsProfiler will hit error when sequence parallel enabled
  - [BUG] How to drop some batches entirely to avoid calculating backpropagation while still updating the model for the rest
  - [BUG]Deepspeed (v0.15.4 ~v0.16.9) Zero3 training performance is slow,compare than v0.13.1
- Docs
  - Python not yet supported