Run distributed training with the SageMaker AI distributed data parallelism library

The SageMaker AI distributed data parallelism (SMDDP) library extends the SageMaker AI training capabilities for deep learning models with near-linear scaling efficiency by providing implementations of collective communication operations that are optimized for AWS infrastructure.
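To illustrate how the library plugs into an existing workflow, the following sketch shows a PyTorch training script that activates SMDDP as its torch.distributed backend. This assumes SMDDP v1.4.0 or later, where the library registers itself as a process-group backend named "smddp"; exact usage can vary by library and framework version.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Importing this module registers "smddp" as a torch.distributed backend.
import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401

# Use the SMDDP backend in place of "nccl"; collectives such as AllReduce
# and AllGather then run through the AWS-optimized implementations.
dist.init_process_group(backend="smddp")

# Standard PyTorch DDP from here on. LOCAL_RANK is assumed to be set by
# the SageMaker training launcher.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# A toy model stands in for a real network.
model = nn.Linear(10, 10).to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])
```

Because SMDDP is exposed as a backend, the rest of the script stays standard PyTorch DDP; only the import and the backend name change.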

When training large machine learning (ML) models, such as large language models (LLMs) and diffusion models, on huge training datasets, ML practitioners use clusters of accelerators and distributed training techniques to reduce training time or to work around memory constraints for models that cannot fit in the memory of a single GPU. ML practitioners often start with multiple accelerators on a single instance and then scale out to clusters of instances as their workload requirements grow. As the cluster size increases, so does the communication overhead between nodes, which leads to a drop in overall computational performance.

To address this overhead and these memory problems, the SMDDP library offers the following; a sketch of enabling the library in a training job follows the list.

  • The SMDDP library optimizes training jobs for AWS network infrastructure and Amazon SageMaker AI ML instance topology.

  • The SMDDP library improves communication between nodes with implementations of AllReduce and AllGather collective communication operations that are optimized for AWS infrastructure.
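For example, you enable SMDDP for a training job by passing a distribution configuration to a SageMaker PyTorch estimator. The following is a minimal sketch: the entry point name, role ARN, framework and Python versions, and instance settings are placeholders, and supported versions and instance types vary.

```python
from sagemaker.pytorch import PyTorch

# Placeholder values: adjust entry_point, role, versions, and instances
# to your own setup and to the combinations SMDDP supports.
estimator = PyTorch(
    entry_point="train.py",           # training script adapted for SMDDP
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder ARN
    framework_version="2.0.1",        # a PyTorch version supported by SMDDP
    py_version="py310",
    instance_count=2,                 # multi-node cluster
    instance_type="ml.p4d.24xlarge",  # an SMDDP-supported GPU instance type
    # This flag activates the SMDDP library for the training job.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit()
```

With the distribution flag set, SageMaker AI launches the job so that the script's "smddp" process-group backend is available on every node.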

To learn more about the SMDDP library offerings, see Introduction to the SageMaker AI distributed data parallelism library.

For more information about training with the model-parallel strategy offered by SageMaker AI, see (Archived) SageMaker model parallelism library v1.x.
