Amazon SageMaker AI data parallelism library examples
This page provides Jupyter notebooks that show how to use the SageMaker AI distributed data parallelism (SMDDP) library to run distributed training jobs on SageMaker AI.
Blogs and Case Studies
The following blogs discuss case studies about using the SMDDP library.
SMDDP v2 blogs
- Enable faster training with Amazon SageMaker AI data parallel library, AWS Machine Learning Blog (December 05, 2023)
SMDDP v1 blogs
- How I trained 10TB for Stable Diffusion on SageMaker AI, Medium (November 29, 2022)
- Run PyTorch Lightning and native PyTorch DDP on Amazon SageMaker Training, featuring Amazon Search, AWS Machine Learning Blog (August 18, 2022)
- Training YOLOv5 on AWS with PyTorch and the SageMaker AI distributed data parallel library, Medium (May 6, 2022)
- Speed up EfficientNet model training on SageMaker AI with PyTorch and the SageMaker AI distributed data parallel library, Medium (March 21, 2022)
- Speed up EfficientNet training on AWS with the SageMaker AI distributed data parallel library, Towards Data Science (January 12, 2022)
- Hyundai reduces ML model training time for autonomous driving models using Amazon SageMaker AI, AWS Machine Learning Blog (June 25, 2021)
- Distributed Training: Train BART/T5 for Summarization using Transformers and Amazon SageMaker AI, the Hugging Face website (April 8, 2021)
Example notebooks
Example notebooks are provided in the SageMaker AI examples GitHub repository at training/distributed_training/pytorch/data_parallel.
Note
Clone and run the example notebooks in the following SageMaker AI ML IDEs.
- SageMaker AI JupyterLab (available in Studio created after December 2023)
- SageMaker AI Code Editor (available in Studio created after December 2023)
- Studio Classic (available as an application in Studio created after December 2023)
git clone https://github.com/aws/amazon-sagemaker-examples.git
cd amazon-sagemaker-examples/training/distributed_training/pytorch/data_parallel
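As a rough orientation before opening the notebooks, an SMDDP training job is usually launched through the SageMaker Python SDK by passing a `distribution` setting to a framework estimator. The following is a minimal sketch, not the code from any particular notebook. The helper names `smddp_estimator_kwargs` and `launch_smddp_job`, the script name, the role ARN, the instance type, and the framework and Python versions are all illustrative assumptions; check the notebook you run for the values it actually uses.

```python
def smddp_estimator_kwargs(entry_point, role, instance_count=2,
                           instance_type="ml.p4d.24xlarge",
                           framework_version="2.0.1", py_version="py310"):
    """Build keyword arguments for a PyTorch estimator with SMDDP enabled.

    All default values here are assumptions for illustration only.
    """
    return {
        "entry_point": entry_point,        # hypothetical training script
        "role": role,                      # IAM role ARN for the training job
        "instance_count": instance_count,  # data parallelism needs multiple GPUs
        "instance_type": instance_type,
        "framework_version": framework_version,
        "py_version": py_version,
        # This setting turns on the SageMaker AI data parallelism library.
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }


def launch_smddp_job(entry_point, role, **overrides):
    """Create the estimator and start training (requires the sagemaker package
    and AWS credentials; starting a job incurs charges)."""
    from sagemaker.pytorch import PyTorch  # imported lazily on purpose

    estimator = PyTorch(**smddp_estimator_kwargs(entry_point, role, **overrides))
    estimator.fit()
    return estimator


# Example call (placeholder role ARN, not executed here):
# launch_smddp_job("train.py", "arn:aws:iam::111122223333:role/ExampleRole")
```

Keeping the argument construction separate from the launch step makes it possible to inspect the configuration without starting a billable training job.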
SMDDP v2 examples
SMDDP v1 examples
- CNN with PyTorch and the SageMaker AI data parallelism library
- BERT with PyTorch and the SageMaker AI data parallelism library
- CNN with TensorFlow 2.3.1 and the SageMaker AI data parallelism library
- BERT with TensorFlow 2.3.1 and the SageMaker AI data parallelism library
- HuggingFace Distributed Data Parallel Training in TensorFlow on SageMaker AI