Amazon SageMaker data parallelism library examples
This page lists blogs and Jupyter notebooks that show how to use the SageMaker distributed data parallelism (SMDDP) library to run distributed training jobs on SageMaker.
Blogs and Case Studies
The following blogs discuss case studies about using the SMDDP library.
SMDDP v2 blogs
- Enable faster training with Amazon SageMaker data parallel library, AWS Machine Learning Blog (December 5, 2023)
SMDDP v1 blogs
- How I trained 10TB for Stable Diffusion on SageMaker, Medium (November 29, 2022)
- Run PyTorch Lightning and native PyTorch DDP on Amazon SageMaker Training, featuring Amazon Search, AWS Machine Learning Blog (August 18, 2022)
- Training YOLOv5 on AWS with PyTorch and the SageMaker distributed data parallel library, Medium (May 6, 2022)
- Speed up EfficientNet model training on SageMaker with PyTorch and the SageMaker distributed data parallel library, Medium (March 21, 2022)
- Speed up EfficientNet training on AWS with the SageMaker distributed data parallel library, Towards Data Science (January 12, 2022)
- Hyundai reduces ML model training time for autonomous driving models using Amazon SageMaker, AWS Machine Learning Blog (June 25, 2021)
- Distributed Training: Train BART/T5 for Summarization using Transformers and Amazon SageMaker, the Hugging Face website (April 8, 2021)
Example notebooks
Example notebooks are provided in the SageMaker examples GitHub repository under training/distributed_training/pytorch/data_parallel.
Note
Clone and run the example notebooks in the following SageMaker ML IDEs.
- SageMaker JupyterLab (available in Studio created after December 2023)
- SageMaker Code Editor (available in Studio created after December 2023)
- Studio Classic (available as an application in Studio created after December 2023)
git clone https://github.com/aws/amazon-sagemaker-examples.git
cd amazon-sagemaker-examples/training/distributed_training/pytorch/data_parallel
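Training jobs that use SMDDP are typically launched through the SageMaker Python SDK. As a rough sketch of what enabling the library can look like, the following builds the keyword arguments for a `sagemaker.pytorch.PyTorch` estimator as a plain dictionary, so it runs without the SDK or AWS credentials. The helper name `smddp_estimator_kwargs`, the script name `train.py`, the role ARN, and the framework and Python versions are illustrative assumptions, not values taken from the example notebooks.

```python
# Sketch only: assembles estimator arguments without importing the SageMaker SDK.
# All concrete values below are placeholders (assumptions), not notebook settings.

def smddp_estimator_kwargs(entry_point, role, instance_count=2,
                           instance_type="ml.p4d.24xlarge"):
    """Return keyword arguments that enable SMDDP on a PyTorch estimator."""
    return {
        "entry_point": entry_point,        # hypothetical training script
        "role": role,                      # IAM execution role ARN (placeholder)
        "instance_count": instance_count,
        "instance_type": instance_type,
        "framework_version": "2.0.1",      # assumed PyTorch version
        "py_version": "py310",             # assumed Python version
        # The distribution setting that turns on the SMDDP library.
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }


kwargs = smddp_estimator_kwargs(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/ExampleRole",
)

# With the SageMaker Python SDK installed and credentials configured,
# the dictionary could be passed on like this:
#   from sagemaker.pytorch import PyTorch
#   estimator = PyTorch(**kwargs)
#   estimator.fit()
```

Keeping the arguments in a dictionary makes it easy to compare settings across notebooks before launching a job. Check the notebook you run for the instance types and framework versions it actually expects.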
SMDDP v2 examples
SMDDP v1 examples