

# Run distributed training with the SageMaker AI distributed data parallelism library
<a name="data-parallel"></a>

The SageMaker AI distributed data parallelism (SMDDP) library extends SageMaker training capabilities on deep learning models with near-linear scaling efficiency by providing implementations of collective communication operations optimized for AWS infrastructure.

When training large machine learning (ML) models, such as large language models (LLMs) and diffusion models, on a huge training dataset, ML practitioners use clusters of accelerators and distributed training techniques to reduce training time or to resolve memory constraints for models that cannot fit in a single GPU's memory. ML practitioners often start with multiple accelerators on a single instance and then scale to clusters of instances as their workload requirements increase. As the cluster size increases, so does the communication overhead between multiple nodes, which leads to a drop in overall computational performance.

To address such overhead and memory problems, the SMDDP library offers the following.
+ The SMDDP library optimizes training jobs for AWS network infrastructure and Amazon SageMaker AI ML instance topology.
+ The SMDDP library improves communication between nodes with implementations of `AllReduce` and `AllGather` collective communication operations that are optimized for AWS infrastructure. 

To learn more about the details of the SMDDP library offerings, proceed to [Introduction to the SageMaker AI distributed data parallelism library](data-parallel-intro.md).

For more information about training with the model-parallel strategy offered by SageMaker AI, see also [(Archived) SageMaker model parallelism library v1.x](model-parallel.md).

**Topics**
+ [Introduction to the SageMaker AI distributed data parallelism library](data-parallel-intro.md)
+ [Supported frameworks, AWS Regions, and instance types](distributed-data-parallel-support.md)
+ [Distributed training with the SageMaker AI distributed data parallelism library](data-parallel-modify-sdp.md)
+ [Amazon SageMaker AI data parallelism library examples](distributed-data-parallel-v2-examples.md)
+ [Configuration tips for the SageMaker AI distributed data parallelism library](data-parallel-config.md)
+ [Amazon SageMaker AI distributed data parallelism library FAQ](data-parallel-faq.md)
+ [Troubleshooting for distributed training in Amazon SageMaker AI](distributed-troubleshooting-data-parallel.md)
+ [SageMaker AI data parallelism library release notes](data-parallel-release-notes.md)

# Introduction to the SageMaker AI distributed data parallelism library
<a name="data-parallel-intro"></a>

The SageMaker AI distributed data parallelism (SMDDP) library is a collective communication library that improves the compute performance of distributed data parallel training. The SMDDP library addresses the communication overhead of the key collective communication operations by offering the following.

1. The library offers `AllReduce` optimized for AWS. `AllReduce` is a key operation used for synchronizing gradients across GPUs at the end of each training iteration during distributed data parallel training.

1. The library offers `AllGather` optimized for AWS. `AllGather` is another key operation used in sharded data parallel training, which is a memory-efficient data parallelism technique offered by popular libraries such as the SageMaker AI model parallelism (SMP) library, DeepSpeed Zero Redundancy Optimizer (ZeRO), and PyTorch Fully Sharded Data Parallelism (FSDP).

1. The library performs optimized node-to-node communication by fully utilizing AWS network infrastructure and the Amazon EC2 instance topology. 

The SMDDP library increases training speed as you scale your training cluster, maintaining near-linear scaling efficiency.

**Note**  
The SageMaker AI distributed training libraries are available through the AWS deep learning containers for PyTorch and Hugging Face within the SageMaker Training platform. To use the libraries, you must use the SageMaker Python SDK or the SageMaker APIs through the AWS SDK for Python (Boto3) or the AWS Command Line Interface (AWS CLI). Throughout the documentation, instructions and examples focus on how to use the distributed training libraries with the SageMaker Python SDK.

## SMDDP collective communication operations optimized for AWS compute resources and network infrastructure
<a name="data-parallel-collective-operations"></a>

The SMDDP library provides implementations of the `AllReduce` and `AllGather` collective operations that are optimized for AWS compute resources and network infrastructure.

### SMDDP `AllReduce` collective operation
<a name="data-parallel-allreduce"></a>

The SMDDP library achieves optimal overlapping of the `AllReduce` operation with the backward pass, significantly improving GPU utilization. It achieves near-linear scaling efficiency and faster training speed by optimizing kernel operations between CPUs and GPUs. The library performs `AllReduce` in parallel while the GPUs compute gradients, without taking away additional GPU cycles, which enables faster training.
+ *Leverages CPUs*: The library uses CPUs to `AllReduce` gradients, offloading this task from the GPUs.
+ *Improved GPU usage*: The cluster’s GPUs focus on computing gradients, improving their utilization throughout training.
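Conceptually, `AllReduce` produces the element-wise sum of every worker's gradients and hands an identical copy of the result back to each worker. The following pure-Python sketch illustrates only that semantics, not the SMDDP implementation:

```
def allreduce(worker_grads):
    """Simulate AllReduce: every worker ends up with the same
    element-wise sum of all workers' gradients."""
    summed = [sum(vals) for vals in zip(*worker_grads)]
    # Each worker receives an identical copy of the reduced result.
    return [list(summed) for _ in worker_grads]

# Gradients computed independently on 3 workers
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reduced = allreduce(grads)
print(reduced)  # every worker now holds [9.0, 12.0]
```

In real training, each worker would then divide the sum by the world size (or average directly) before applying the optimizer step.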

The following is the high-level workflow of the SMDDP `AllReduce` operation.

1. The library assigns ranks to GPUs (workers).

1. At each iteration, the library divides each global batch by the total number of workers (world size) and assigns small batches (batch shards) to the workers.
   + The size of the global batch is `(number of nodes in a cluster) * (number of GPUs per node) * (size of a batch shard)`. 
   + A batch shard (small batch) is the subset of the dataset assigned to each GPU (worker) per iteration. 

1. The library launches a training script on each worker.

1. The library manages copies of model weights and gradients from the workers at the end of every iteration.

1. The library synchronizes model weights and gradients across the workers to aggregate a single trained model.
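The batch arithmetic in step 2 can be sketched as follows (the cluster dimensions are illustrative):

```
nodes = 3            # instances in the cluster
gpus_per_node = 8    # for example, ml.p4d.24xlarge has 8 GPUs
shard_size = 4       # per-GPU batch shard size (illustrative)

world_size = nodes * gpus_per_node      # total number of workers (ranks)
global_batch = world_size * shard_size  # samples processed per iteration

print(world_size, global_batch)  # 24 workers, global batch of 96
```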

The following architecture diagram shows an example of how the library sets up data parallelism for a cluster of 3 nodes. 

 

![\[SMDDP AllReduce and data parallelism architecture diagram\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/data-parallel/sdp-architecture.png)


### SMDDP `AllGather` collective operation
<a name="data-parallel-allgather"></a>

`AllGather` is a collective operation where each worker starts with an input buffer, and then concatenates or *gathers* the input buffers from all other workers into an output buffer.

**Note**  
The SMDDP `AllGather` collective operation is available in `smdistributed-dataparallel>=2.0.1` and AWS Deep Learning Containers (DLC) for PyTorch v2.0.1 and later.

`AllGather` is heavily used in distributed training techniques such as sharded data parallelism, where each individual worker holds a fraction of a model, or a sharded layer. The workers call `AllGather` before forward and backward passes to reconstruct the sharded layers. The forward and backward passes continue onward after the parameters are *all gathered*. During the backward pass, each worker also calls `ReduceScatter` to collect (reduce) gradients and break (scatter) them into gradient shards to update the corresponding sharded layer. For more details on the role of these collective operations in sharded data parallelism, see the [SMP library's implementation of sharded data parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html), [ZeRO](https://deepspeed.readthedocs.io/en/latest/zero3.html#) in the DeepSpeed documentation, and the blog about [PyTorch Fully Sharded Data Parallelism](https://engineering.fb.com/2021/07/15/open-source/fsdp/).
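The semantics of these two operations can be illustrated with a pure-Python sketch. This models only the data movement, not the SMDDP implementation:

```
def allgather(shards):
    """Each worker contributes its shard; every worker receives the
    concatenation of all shards (e.g. reconstructed layer parameters)."""
    gathered = [x for shard in shards for x in shard]
    return [list(gathered) for _ in shards]

def reduce_scatter(worker_buffers):
    """Element-wise sum across workers (reduce), then split the result
    so each worker keeps only its own shard (scatter)."""
    summed = [sum(vals) for vals in zip(*worker_buffers)]
    n = len(worker_buffers)
    shard_len = len(summed) // n
    return [summed[i * shard_len:(i + 1) * shard_len] for i in range(n)]

# Two workers, each holding one shard of a 4-parameter layer
shards = [[1.0, 2.0], [3.0, 4.0]]
print(allgather(shards))      # both workers hold [1.0, 2.0, 3.0, 4.0]

# Two workers, each holding full-length gradients to reduce and re-shard
grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
print(reduce_scatter(grads))  # [[11.0, 22.0], [33.0, 44.0]]
```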

Because collective operations like `AllGather` are called in every iteration, they are the main contributors to GPU communication overhead. Faster computation of these collective operations directly translates to a shorter training time with no side effects on convergence. To achieve this, the SMDDP library offers `AllGather` optimized for [P4d instances](https://aws.amazon.com/ec2/instance-types/p4/).

SMDDP `AllGather` uses the following techniques to improve computational performance on P4d instances.

1. It transfers data between instances (inter-node) through the [Elastic Fabric Adapter (EFA)](https://aws.amazon.com/hpc/efa/) network with a mesh topology. EFA is the AWS low-latency and high-throughput network solution. A mesh topology for inter-node network communication is more tailored to the characteristics of EFA and AWS network infrastructure. Compared to the NCCL ring or tree topology that involves multiple packet hops, SMDDP avoids accumulating latency from multiple hops as it only needs one hop. SMDDP implements a network rate control algorithm that balances the workload to each communication peer in a mesh topology and achieves a higher global network throughput.

1. It adopts [GDRCopy](https://github.com/NVIDIA/gdrcopy), a low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology, to coordinate local NVLink and EFA network traffic. GDRCopy provides low-latency communication between CPU processes and GPU CUDA kernels. With this technology, the SMDDP library is able to pipeline the intra-node and inter-node data movement.

1. It reduces the usage of GPU streaming multiprocessors to increase compute power for running model kernels. P4d and P4de instances are equipped with NVIDIA A100 GPUs, which each have 108 streaming multiprocessors. While NCCL takes up to 24 streaming multiprocessors to run collective operations, SMDDP uses fewer than 9 streaming multiprocessors. Model compute kernels pick up the saved streaming multiprocessors for faster computation.

# Supported frameworks, AWS Regions, and instance types
<a name="distributed-data-parallel-support"></a>

Before using the SageMaker AI distributed data parallelism (SMDDP) library, check the supported ML frameworks and instance types, and verify that you have enough quotas in your AWS account and AWS Region.

## Supported frameworks
<a name="distributed-data-parallel-supported-frameworks"></a>

The following tables show the deep learning frameworks and their versions that SageMaker AI and SMDDP support. The SMDDP library is available in [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only), integrated in [Docker containers distributed by the SageMaker model parallelism (SMP) library v2](distributed-model-parallel-support-v2.md#distributed-model-parallel-supported-frameworks-v2), or downloadable as a binary file.

**Note**  
To check the latest updates and release notes of the SMDDP library, see the [SageMaker AI data parallelism library release notes](data-parallel-release-notes.md).

**Topics**
+ [PyTorch](#distributed-data-parallel-supported-frameworks-pytorch)
+ [PyTorch Lightning](#distributed-data-parallel-supported-frameworks-lightning)
+ [Hugging Face Transformers](#distributed-data-parallel-supported-frameworks-transformers)
+ [TensorFlow (deprecated)](#distributed-data-parallel-supported-frameworks-tensorflow)

### PyTorch
<a name="distributed-data-parallel-supported-frameworks-pytorch"></a>


| PyTorch version | SMDDP library version | SageMaker AI Framework Container images pre-installed with SMDDP | SMP Docker images pre-installed with SMDDP | URL of the binary file** | 
| --- | --- | --- | --- | --- | 
| v2.4.1 | smdistributed-dataparallel==v2.5.0 | Not available | 658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.4.1/cu121/2024-10-09/smdistributed_dataparallel-2.5.0-cp311-cp311-linux_x86_64.whl | 
| v2.3.0 | smdistributed-dataparallel==v2.3.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker | Currently not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed_dataparallel-2.3.0-cp311-cp311-linux_x86_64.whl | 
| v2.2.0 | smdistributed-dataparallel==v2.2.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.2.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed_dataparallel-2.2.0-cp310-cp310-linux_x86_64.whl | 
| v2.1.0 | smdistributed-dataparallel==v2.1.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed_dataparallel-2.1.0-cp310-cp310-linux_x86_64.whl | 
| v2.0.1 | smdistributed-dataparallel==v2.0.1 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl | 
| v2.0.0 | smdistributed-dataparallel==v1.8.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.0/cu118/2023-03-20/smdistributed_dataparallel-1.8.0-cp310-cp310-linux_x86_64.whl | 
| v1.13.1 | smdistributed-dataparallel==v1.7.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.13.1/cu117/2023-01-09/smdistributed_dataparallel-1.7.0-cp39-cp39-linux_x86_64.whl | 
| v1.12.1 | smdistributed-dataparallel==v1.6.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.1/cu113/2022-12-05/smdistributed_dataparallel-1.6.0-cp38-cp38-linux_x86_64.whl | 
| v1.12.0 | smdistributed-dataparallel==v1.5.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed_dataparallel-1.5.0-cp38-cp38-linux_x86_64.whl | 
| v1.11.0 | smdistributed-dataparallel==v1.4.1 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.11.0/cu113/2022-04-14/smdistributed_dataparallel-1.4.1-cp38-cp38-linux_x86_64.whl | 

** The URLs of the binary files are for installing the SMDDP library in custom containers. For more information, see [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md).

**Note**  
The SMDDP library is available in AWS Regions where the [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) and the [SMP Docker images](distributed-model-parallel-support-v2.md) are in service.

**Note**  
The SMDDP library v1.4.0 and later works as a backend of PyTorch distributed (`torch.distributed`) data parallelism (`torch.nn.parallel.DistributedDataParallel`). In accordance with the change, the following [smdistributed APIs](https://sagemaker.readthedocs.io/en/stable/api/training/sdp_versions/latest/smd_data_parallel_pytorch.html#pytorch-api) for the PyTorch distributed package have been deprecated.  
`smdistributed.dataparallel.torch.distributed` is deprecated. Use the [torch.distributed](https://pytorch.org/docs/stable/distributed.html) package instead.
`smdistributed.dataparallel.torch.parallel.DistributedDataParallel` is deprecated. Use the [torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) API instead.
If you need to use the previous versions of the library (v1.3.0 or before), see the [archived SageMaker AI distributed data parallelism documentation](https://sagemaker.readthedocs.io/en/stable/api/training/sdp_versions/latest.html#documentation-archive) in the *SageMaker AI Python SDK documentation*.

### PyTorch Lightning
<a name="distributed-data-parallel-supported-frameworks-lightning"></a>

The SMDDP library is available for PyTorch Lightning in the following SageMaker AI Framework Containers for PyTorch and the SMP Docker containers.

**PyTorch Lightning v2**


| PyTorch Lightning version | PyTorch version | SMDDP library version | SageMaker AI Framework Container images pre-installed with SMDDP | SMP Docker images pre-installed with SMDDP | URL of the binary file** | 
| --- | --- | --- | --- | --- | --- | 
| 2.2.5 | 2.3.0 | smdistributed-dataparallel==v2.3.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker | Currently not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed_dataparallel-2.3.0-cp311-cp311-linux_x86_64.whl | 
| 2.2.0 | 2.2.0 | smdistributed-dataparallel==v2.2.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.2.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed_dataparallel-2.2.0-cp310-cp310-linux_x86_64.whl | 
| 2.1.2 | 2.1.0 | smdistributed-dataparallel==v2.1.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed_dataparallel-2.1.0-cp310-cp310-linux_x86_64.whl | 
| 2.1.0 | 2.0.1 | smdistributed-dataparallel==v2.0.1 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl | 

**PyTorch Lightning v1**


| PyTorch Lightning version | PyTorch version | SMDDP library version | SageMaker AI Framework Container images pre-installed with SMDDP | URL of the binary file** | 
| --- | --- | --- | --- | --- | 
| 1.7.2, 1.7.0, 1.6.4, 1.6.3, 1.5.10 | 1.12.0 | smdistributed-dataparallel==v1.5.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed_dataparallel-1.5.0-cp38-cp38-linux_x86_64.whl | 

** The URLs of the binary files are for installing the SMDDP library in custom containers. For more information, see [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md).

**Note**  
PyTorch Lightning and its utility libraries such as Lightning Bolts are not preinstalled in the PyTorch DLCs. When you construct a SageMaker AI PyTorch estimator and submit a training job request in [Step 2](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-framework-estimator), you need to provide `requirements.txt` to install `pytorch-lightning` and `lightning-bolts` in the SageMaker AI PyTorch training container.  

```
# requirements.txt
pytorch-lightning
lightning-bolts
```
For more information about specifying the source directory to place the `requirements.txt` file along with your training script and a job submission, see [Using third-party libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#id12) in the *Amazon SageMaker AI Python SDK documentation*.
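For context, enabling the SMDDP library when constructing the SageMaker AI PyTorch estimator comes down to its `distribution` argument. The following is a minimal sketch; the `distribution` dictionary is the documented configuration, while the estimator arguments shown in comments (`train.py`, `src`, and so on) are illustrative placeholders.

```
# The distribution configuration that enables the SMDDP library when
# passed to the SageMaker AI PyTorch estimator.
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

# Illustrative estimator construction (requires the sagemaker package,
# an execution role, and a source_dir containing requirements.txt):
#
# from sagemaker.pytorch import PyTorch
# estimator = PyTorch(
#     entry_point="train.py",
#     source_dir="src",              # directory holding requirements.txt
#     instance_count=2,
#     instance_type="ml.p4d.24xlarge",
#     framework_version="2.0.1",
#     py_version="py310",
#     distribution=distribution,
# )

print(distribution["smdistributed"]["dataparallel"]["enabled"])  # True
```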

### Hugging Face Transformers
<a name="distributed-data-parallel-supported-frameworks-transformers"></a>

The AWS Deep Learning Containers for Hugging Face use the SageMaker Training Containers for PyTorch and TensorFlow as their base images. To look up the Hugging Face Transformers library versions and paired PyTorch and TensorFlow versions, see the latest [Hugging Face Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers) and the [Prior Hugging Face Container Versions](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#prior-hugging-face-container-versions).

### TensorFlow (deprecated)
<a name="distributed-data-parallel-supported-frameworks-tensorflow"></a>

**Important**  
The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow later than v2.11.0. The following table lists previous DLCs for TensorFlow with the SMDDP library installed.


| TensorFlow version | SMDDP library version | 
| --- | --- | 
| 2.9.1, 2.10.1, 2.11.0 |  smdistributed-dataparallel==v1.4.1  | 
| 2.8.3 |  smdistributed-dataparallel==v1.3.0  | 

## AWS Regions
<a name="distributed-data-parallel-availablity-zone"></a>

The SMDDP library is available in all of the AWS Regions where the [AWS Deep Learning Containers for SageMaker AI](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) and the [SMP Docker images](distributed-model-parallel-support-v2.md) are in service.

## Supported instance types
<a name="distributed-data-parallel-supported-instance-types"></a>

The SMDDP library requires one of the following instance types.


| Instance type | 
| --- | 
| ml.p3dn.24xlarge* | 
| ml.p4d.24xlarge | 
| ml.p4de.24xlarge | 

**Tip**  
To properly run distributed training on the EFA-enabled instance types, you should enable traffic between the instances by setting up the security group of your VPC to allow all inbound and outbound traffic to and from the security group itself. To learn how to set up the security group rules, see [Step 1: Prepare an EFA-enabled security group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security) in the *Amazon EC2 User Guide*.

**Important**  
* The SMDDP library has discontinued support for optimizing its collective communication operations on P3 instances. While you can still utilize the SMDDP optimized `AllReduce` collective on `ml.p3dn.24xlarge` instances, there will be no further development support to enhance performance on this instance type. Note that the SMDDP optimized `AllGather` collective is only available for P4 instances.

For specs of the instance types, see the **Accelerated Computing** section in the [Amazon EC2 Instance Types page](https://aws.amazon.com/ec2/instance-types/). For information about instance pricing, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

If you encounter an error message similar to the following, follow the instructions at [Request a service quota increase for SageMaker AI resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure).

```
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling
the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge
for training job usage' is 0 Instances, with current utilization of 0 Instances
and a request delta of 1 Instances.
Please contact AWS support to request an increase for this limit.
```

# Distributed training with the SageMaker AI distributed data parallelism library
<a name="data-parallel-modify-sdp"></a>

The SageMaker AI distributed data parallelism (SMDDP) library is designed for ease of use and to provide seamless integration with PyTorch.

When training a deep learning model with the SMDDP library on SageMaker AI, you can focus on writing your training script and model training. 

To get started, import the SMDDP library to use its collective operations optimized for AWS. The following topics provide instructions on what to add to your training script depending on which collective operation you want to optimize.

**Topics**
+ [Adapting your training script to use the SMDDP collective operations](data-parallel-modify-sdp-select-framework.md)
+ [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md)

# Adapting your training script to use the SMDDP collective operations
<a name="data-parallel-modify-sdp-select-framework"></a>

The training script examples provided in this section are simplified and highlight only the required changes to enable the SageMaker AI distributed data parallelism (SMDDP) library in your training script. For end-to-end Jupyter notebook examples that demonstrate how to run a distributed training job with the SMDDP library, see [Amazon SageMaker AI data parallelism library examples](distributed-data-parallel-v2-examples.md).

**Topics**
+ [Use the SMDDP library in your PyTorch training script](data-parallel-modify-sdp-pt.md)
+ [Use the SMDDP library in your PyTorch Lightning training script](data-parallel-modify-sdp-pt-lightning.md)
+ [Use the SMDDP library in your TensorFlow training script (deprecated)](data-parallel-modify-sdp-tf2.md)

# Use the SMDDP library in your PyTorch training script
<a name="data-parallel-modify-sdp-pt"></a>

Starting from the SageMaker AI distributed data parallelism (SMDDP) library v1.4.0, you can use the library as a backend option for the [PyTorch distributed package](https://pytorch.org/tutorials/beginner/dist_overview.html). To use the SMDDP `AllReduce` and `AllGather` collective operations, you only need to import the SMDDP library at the beginning of your training script and set SMDDP as the backend of PyTorch distributed modules during process group initialization. With the single line of backend specification, you can keep all the native PyTorch distributed modules and the entire training script unchanged. The following code snippets show how to use the SMDDP library as the backend of PyTorch-based distributed training packages: [PyTorch distributed data parallel (DDP)](https://pytorch.org/docs/stable/notes/ddp.html), [PyTorch fully sharded data parallelism (FSDP)](https://pytorch.org/docs/stable/fsdp.html), [DeepSpeed](https://github.com/microsoft/DeepSpeed), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed).

## For PyTorch DDP or FSDP
<a name="data-parallel-enable-for-ptddp-ptfsdp"></a>

Initialize the process group as follows.

```
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp

dist.init_process_group(backend="smddp")
```

**Note**  
(For PyTorch DDP jobs only) The `smddp` backend currently does not support creating subprocess groups with the `torch.distributed.new_group()` API. You also cannot use the `smddp` backend concurrently with other process group backends such as `NCCL` and `Gloo`.
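Putting the pieces together, the following is a minimal runnable sketch of a DDP script that uses the `smddp` backend. The `gloo` fallback and the default environment variables are illustrative conveniences for running the script outside SageMaker (for example, a single-process local test); they are not part of the SMDDP library, which is only available inside the SageMaker containers.

```
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# On SageMaker AI with the SMDDP library installed, importing this module
# registers the "smddp" backend; elsewhere fall back to "gloo" so the
# script can still run locally (illustrative, not part of SMDDP).
try:
    import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401
    backend = "smddp"
except ImportError:
    backend = "gloo"

# Defaults for a single-process local run; on SageMaker these variables
# are preset by the training toolkit.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")

dist.init_process_group(backend=backend)

# Wrap the model in DDP exactly as with the default backends; the single
# backend string above is the only SMDDP-specific change.
model = DDP(torch.nn.Linear(10, 1))
print(backend, dist.get_world_size())
```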

## For DeepSpeed or Megatron-DeepSpeed
<a name="data-parallel-enable-for-deepspeed"></a>

Initialize the process group as follows.

```
import deepspeed
import smdistributed.dataparallel.torch.torch_smddp

deepspeed.init_distributed(dist_backend="smddp")
```

**Note**  
To use SMDDP `AllGather` with the `mpirun`-based launchers (`smdistributed` and `pytorchddp`) in [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md), you also need to set the following environment variable in your training script.  

```
export SMDATAPARALLEL_OPTIMIZE_SDP=true
```

For general guidance on writing a PyTorch FSDP training script, see [Advanced Model Training with Fully Sharded Data Parallel (FSDP)](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) in the PyTorch documentation.

For general guidance on writing a PyTorch DDP training script, see [Getting started with distributed data parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) in the PyTorch documentation.

After you have completed adapting your training script, proceed to [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md).

# Use the SMDDP library in your PyTorch Lightning training script
<a name="data-parallel-modify-sdp-pt-lightning"></a>

If you want to bring your [PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/latest/starter/introduction.html) training script and run a distributed data parallel training job in SageMaker AI, you can run the training job with minimal changes in your training script. The necessary changes include the following: import the `smdistributed.dataparallel` library’s PyTorch modules, set up the environment variables for PyTorch Lightning to accept the SageMaker AI environment variables that are preset by the SageMaker training toolkit, and activate the SMDDP library by setting the process group backend to `"smddp"`. To learn more, walk through the following instructions that break down the steps with code examples.

**Note**  
The PyTorch Lightning support is available in the SageMaker AI data parallel library v1.5.0 and later.

## PyTorch Lightning == v2.1.0 and PyTorch == 2.0.1
<a name="smddp-pt-201-lightning-210"></a>

1. Import the `lightning` library and the `smdistributed.dataparallel.torch` modules.

   ```
   import lightning as pl
   import smdistributed.dataparallel.torch.torch_smddp
   ```

1. Instantiate the [LightningEnvironment](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.plugins.environments.LightningEnvironment.html).

   ```
   import os
   from lightning.fabric.plugins.environments.lightning import LightningEnvironment
   
   env = LightningEnvironment()
   env.world_size = lambda: int(os.environ["WORLD_SIZE"])
   env.global_rank = lambda: int(os.environ["RANK"])
   ```

1. **For PyTorch DDP** – Create an object of the [DDPStrategy](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.DDPStrategy.html) class with `"smddp"` for `process_group_backend` and `"gpu"` for `accelerator`, and pass that to the [Trainer](https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html) class.

   ```
   import lightning as pl
   from lightning.pytorch.strategies import DDPStrategy
   
   ddp = DDPStrategy(
       cluster_environment=env, 
       process_group_backend="smddp", 
       accelerator="gpu"
   )
   
   trainer = pl.Trainer(
       max_epochs=200, 
       strategy=ddp, 
       devices=num_gpus, 
       num_nodes=num_nodes
   )
   ```

   **For PyTorch FSDP** – Create an object of the [FSDPStrategy](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.FSDPStrategy.html) class (with [wrapping policy](https://pytorch.org/docs/stable/fsdp.html) of choice) with `"smddp"` for `process_group_backend` and `"gpu"` for `accelerator`, and pass that to the [Trainer](https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html) class.

   ```
   import lightning as pl
   from lightning.pytorch.strategies import FSDPStrategy
   
   from functools import partial
   from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
   
   policy = partial(
       size_based_auto_wrap_policy, 
       min_num_params=10000
   )
   
   fsdp = FSDPStrategy(
       auto_wrap_policy=policy,
       process_group_backend="smddp", 
       cluster_environment=env
   )
   
   trainer = pl.Trainer(
       max_epochs=200, 
       strategy=fsdp, 
       devices=num_gpus, 
       num_nodes=num_nodes
   )
   ```
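The `WORLD_SIZE` and `RANK` environment variables read in step 2 are preset by the SageMaker training toolkit inside the training container. The following is a minimal illustrative sketch of how they resolve; the fallback values are placeholders for running the snippet outside of a SageMaker training container, not values preset by SageMaker AI.

```python
import os

# Placeholders: outside a SageMaker training container these variables are
# not preset, so fall back to single-process defaults for illustration.
world_size = int(os.environ.get("WORLD_SIZE", "1"))  # total number of processes
global_rank = int(os.environ.get("RANK", "0"))       # global rank of this process

print(world_size, global_rank)
```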

After you have completed adapting your training script, proceed to [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md). 

**Note**  
When you construct a SageMaker AI PyTorch estimator and submit a training job request in [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md), you need to provide `requirements.txt` to install `pytorch-lightning` and `lightning-bolts` in the SageMaker AI PyTorch training container.  

```
# requirements.txt
pytorch-lightning
lightning-bolts
```
For more information about specifying the source directory to place the `requirements.txt` file along with your training script and a job submission, see [Using third-party libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#id12) in the *Amazon SageMaker AI Python SDK documentation*.

# Use the SMDDP library in your TensorFlow training script (deprecated)
<a name="data-parallel-modify-sdp-tf2"></a>

**Important**  
The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow later than v2.11.0. To find previous TensorFlow DLCs with the SMDDP library installed, see [Supported frameworks](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks).

The following steps show you how to modify a TensorFlow training script to utilize SageMaker AI's distributed data parallel library.  

The library APIs are designed to be similar to Horovod APIs. For additional details on each API that the library offers for TensorFlow, see the [SageMaker AI distributed data parallel TensorFlow API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_data_parallel.html#api-documentation).

**Note**  
SageMaker AI distributed data parallel is adaptable to TensorFlow training scripts composed of `tf` core modules, except `tf.keras` modules. SageMaker AI distributed data parallel does not support TensorFlow with Keras implementations.

**Note**  
The SageMaker AI distributed data parallelism library supports Automatic Mixed Precision (AMP) out of the box. No extra action is needed to enable AMP other than the framework-level modifications to your training script. If gradients are in FP16, the SageMaker AI data parallelism library runs its `AllReduce` operation in FP16. For more information about implementing AMP APIs to your training script, see the following resources:  
[Frameworks - TensorFlow](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#tensorflow) in the *NVIDIA Deep Learning Performance documentation*
[Automatic Mixed Precision for Deep Learning](https://developer.nvidia.com/automatic-mixed-precision) in the *NVIDIA Developer Docs*
[TensorFlow mixed precision APIs](https://www.tensorflow.org/guide/mixed_precision) in the *TensorFlow documentation*

1. Import the library's TensorFlow client and initialize it.

   ```
   import smdistributed.dataparallel.tensorflow as sdp 
   sdp.init()
   ```

1. Pin each GPU to a single `smdistributed.dataparallel` process with `local_rank`, which refers to the relative rank of the process within a given node. The `sdp.tensorflow.local_rank()` API provides you with the local rank of the device. The leader node is rank 0, and the worker nodes are rank 1, 2, 3, and so on. This is invoked in the following code block as `sdp.local_rank()`. `set_memory_growth` is not directly related to the SMDDP library, but must be set for distributed training with TensorFlow. 

   ```
   gpus = tf.config.experimental.list_physical_devices('GPU')
   for gpu in gpus:
       tf.config.experimental.set_memory_growth(gpu, True)
   if gpus:
       tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')
   ```

1. Scale the learning rate by the number of workers. The `sdp.tensorflow.size()` API provides you with the number of workers in the cluster. This is invoked in the following code block as `sdp.size()`. 

   ```
   learning_rate = learning_rate * sdp.size()
   ```

1. Use the library’s `DistributedGradientTape` to optimize `AllReduce` operations during training. This wraps `tf.GradientTape`.  

   ```
   with tf.GradientTape() as tape:
       output = model(input)
       loss_value = loss(label, output)
   
   # SageMaker AI data parallel: Wrap tf.GradientTape with the library's DistributedGradientTape
   tape = sdp.DistributedGradientTape(tape)
   ```

1. Broadcast the initial model variables from the leader node (rank 0) to all the worker nodes (ranks 1 through n). This is needed to ensure a consistent initialization across all the worker ranks. Use the `sdp.tensorflow.broadcast_variables` API after the model and optimizer variables are initialized. This is invoked in the following code block as `sdp.broadcast_variables()`. 

   ```
   sdp.broadcast_variables(model.variables, root_rank=0)
   sdp.broadcast_variables(opt.variables(), root_rank=0)
   ```

1. Finally, modify your script to save checkpoints only on the leader node. The leader node has a synchronized model. This also prevents the worker nodes from overwriting the checkpoints and possibly corrupting them. 

   ```
   if sdp.rank() == 0:
       checkpoint.save(checkpoint_dir)
   ```
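As illustrative arithmetic for the learning-rate scaling in step 3 (the values below are placeholders, not tuning recommendations): with a single-worker learning rate of 0.001 and 8 workers, the scaled learning rate becomes 0.008.

```python
# Illustrative arithmetic only; 0.001 and 8 are placeholder values.
base_learning_rate = 0.001
num_workers = 8  # what sdp.size() would return on an 8-GPU cluster
learning_rate = base_learning_rate * num_workers
print(learning_rate)  # 0.008
```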

The following is an example TensorFlow training script for distributed training with the library.

```
import tensorflow as tf

# SageMaker AI data parallel: Import the library TF API
import smdistributed.dataparallel.tensorflow as sdp

# SageMaker AI data parallel: Initialize the library
sdp.init()

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    # SageMaker AI data parallel: Pin GPUs to a single library process
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')

# Prepare Dataset
dataset = tf.data.Dataset.from_tensor_slices(...)

# Define Model
mnist_model = tf.keras.Sequential(...)
loss = tf.losses.SparseCategoricalCrossentropy()

# SageMaker AI data parallel: Scale Learning Rate
# LR for 8 node run : 0.000125
# LR for single node run : 0.001
opt = tf.optimizers.Adam(0.000125 * sdp.size())

@tf.function
def training_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        probs = mnist_model(images, training=True)
        loss_value = loss(labels, probs)

    # SageMaker AI data parallel: Wrap tf.GradientTape with the library's DistributedGradientTape
    tape = sdp.DistributedGradientTape(tape)

    grads = tape.gradient(loss_value, mnist_model.trainable_variables)
    opt.apply_gradients(zip(grads, mnist_model.trainable_variables))

    if first_batch:
        # SageMaker AI data parallel: Broadcast model and optimizer variables
        sdp.broadcast_variables(mnist_model.variables, root_rank=0)
        sdp.broadcast_variables(opt.variables(), root_rank=0)

    return loss_value

...

# SageMaker AI data parallel: Save checkpoints only from the leader node
if sdp.rank() == 0:
    checkpoint.save(checkpoint_dir)
```

After you have completed adapting your training script, move on to [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md). 

# Launching distributed training jobs with SMDDP using the SageMaker Python SDK
<a name="data-parallel-use-api"></a>

To run a distributed training job with your adapted script from [Adapting your training script to use the SMDDP collective operations](data-parallel-modify-sdp-select-framework.md), use the SageMaker Python SDK's framework or generic estimators by specifying the prepared training script as an entry point script and the distributed training configuration.

This page walks you through how to use the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/api/training/index.html) in two ways.
+ If you want to quickly adopt your distributed training job in SageMaker AI, configure a SageMaker AI [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#sagemaker.pytorch.estimator.PyTorch) or [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) framework estimator class. The framework estimator picks up your training script and automatically matches the right image URI of the [pre-built PyTorch or TensorFlow Deep Learning Containers (DLC)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only), given the value specified for the `framework_version` parameter.
+ If you want to extend one of the pre-built containers or build a custom container to create your own ML environment with SageMaker AI, use the SageMaker AI generic `Estimator` class and specify the image URI of the custom Docker container hosted in your Amazon Elastic Container Registry (Amazon ECR).

Your training datasets should be stored in Amazon S3 or [Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html) in the AWS Region in which you are launching your training job. If you use Jupyter notebooks, you should have a SageMaker notebook instance or a SageMaker Studio Classic app running in the same AWS Region. For more information about storing your training data, see the [SageMaker Python SDK data inputs](https://sagemaker.readthedocs.io/en/stable/overview.html#use-file-systems-as-training-input) documentation. 

**Tip**  
We recommend that you use Amazon FSx for Lustre instead of Amazon S3 to improve training performance. Amazon FSx has higher throughput and lower latency than Amazon S3.

**Tip**  
To properly run distributed training on the EFA-enabled instance types, you should enable traffic between the instances by setting up the security group of your VPC to allow all inbound and outbound traffic to and from the security group itself. To learn how to set up the security group rules, see [Step 1: Prepare an EFA-enabled security group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security) in the *Amazon EC2 User Guide*.
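The required self-referencing rules can also be set up programmatically. The following is a minimal sketch; the security group ID is a placeholder, and the `IpPermissions` structure matches what you would pass to the Amazon EC2 `authorize_security_group_ingress` and `authorize_security_group_egress` API operations.

```python
# Hypothetical sketch: allow all traffic to and from the security group itself.
# "sg-0123456789abcdef0" is a placeholder security group ID.
security_group_id = "sg-0123456789abcdef0"

ip_permissions = [{
    "IpProtocol": "-1",  # all protocols and all ports
    "UserIdGroupPairs": [{"GroupId": security_group_id}],  # self-referencing rule
}]

# With boto3 (not imported here), the rules would be applied as:
# ec2 = boto3.client("ec2")
# ec2.authorize_security_group_ingress(GroupId=security_group_id, IpPermissions=ip_permissions)
# ec2.authorize_security_group_egress(GroupId=security_group_id, IpPermissions=ip_permissions)
print(ip_permissions[0]["IpProtocol"])
```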

Choose one of the following topics for instructions on how to run a distributed training job of your training script. After you launch a training job, you can monitor system utilization and model performance using [Amazon SageMaker Debugger](train-debugger.md) or Amazon CloudWatch.

As you follow the instructions in the following topics to learn more about technical details, we also recommend that you try the [Amazon SageMaker AI data parallelism library examples](distributed-data-parallel-v2-examples.md) to get started.

**Topics**
+ [Use the PyTorch framework estimators in the SageMaker Python SDK](data-parallel-framework-estimator.md)
+ [Use the SageMaker AI generic estimator to extend pre-built DLC containers](data-parallel-use-python-skd-api.md)
+ [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md)

# Use the PyTorch framework estimators in the SageMaker Python SDK
<a name="data-parallel-framework-estimator"></a>

You can launch distributed training by adding the `distribution` argument to the SageMaker AI framework estimators, [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#sagemaker.pytorch.estimator.PyTorch) or [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator). For more details, choose one of the frameworks supported by the SageMaker AI distributed data parallelism (SMDDP) library from the following selections.

------
#### [ PyTorch ]

The following launcher options are available for launching PyTorch distributed training.
+ `pytorchddp` – This option runs `mpirun` and sets up environment variables needed for running PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the `distribution` parameter.

  ```
  { "pytorchddp": { "enabled": True } }
  ```
+ `torch_distributed` – This option runs `torchrun` and sets up environment variables needed for running PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the `distribution` parameter.

  ```
  { "torch_distributed": { "enabled": True } }
  ```
+ `smdistributed` – This option also runs `mpirun`, but with `smddprun`, which sets up the environment variables needed for running PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the `distribution` parameter.

  ```
  { "smdistributed": { "dataparallel": { "enabled": True } } }
  ```

If you choose to replace NCCL `AllGather` with SMDDP `AllGather`, you can use any of the three options. Choose the option that fits your use case.

If you choose to replace NCCL `AllReduce` with SMDDP `AllReduce`, you should use one of the `mpirun`-based options: `smdistributed` or `pytorchddp`. You can also add additional MPI options as follows.

```
{ 
    "pytorchddp": {
        "enabled": True, 
        "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
    }
}
```

```
{ 
    "smdistributed": { 
        "dataparallel": {
            "enabled": True, 
            "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
        }
    }
}
```

The following code sample shows the basic structure of a PyTorch estimator with distributed training options.

```
from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    base_job_name="training_job_name_prefix",
    source_dir="subdirectory-to-your-code",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    py_version="py310",
    framework_version="2.0.1",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker AI data parallel library: 
    # ml.p4d.24xlarge, ml.p4de.24xlarge
    instance_type="ml.p4d.24xlarge",

    # Activate distributed training with SMDDP
    distribution={ "pytorchddp": { "enabled": True } }  # mpirun, activates SMDDP AllReduce OR AllGather
    # distribution={ "torch_distributed": { "enabled": True } }  # torchrun, activates SMDDP AllGather
    # distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }  # mpirun, activates SMDDP AllReduce OR AllGather
)

pt_estimator.fit("s3://bucket/path/to/training/data")
```

**Note**  
PyTorch Lightning and its utility libraries such as Lightning Bolts are not preinstalled in the SageMaker AI PyTorch DLCs. Create the following `requirements.txt` file and save in the source directory where you save the training script.  

```
# requirements.txt
pytorch-lightning
lightning-bolts
```
For example, the tree-structured directory should look like the following.  

```
├── pytorch_training_launcher_jupyter_notebook.ipynb
└── sub-folder-for-your-code
    ├──  adapted-training-script.py
    └──  requirements.txt
```
For more information about specifying the source directory to place the `requirements.txt` file along with your training script and a job submission, see [Using third-party libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#id12) in the *Amazon SageMaker AI Python SDK documentation*.

**Considerations for activating SMDDP collective operations and using the right distributed training launcher options**
+ SMDDP `AllReduce` and SMDDP `AllGather` are not mutually compatible at present.
+ SMDDP `AllReduce` is activated by default when you use the `mpirun`-based launchers (`smdistributed` or `pytorchddp`); in this case, NCCL `AllGather` is used.
+ SMDDP `AllGather` is activated by default when you use the `torch_distributed` launcher; in this case, `AllReduce` falls back to NCCL.
+ SMDDP `AllGather` can also be activated when using the `mpirun`-based launchers with an additional environment variable set as follows.

  ```
  export SMDATAPARALLEL_OPTIMIZE_SDP=true
  ```
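Rather than exporting the variable inside your training script, you could pass it through the estimator's `environment` parameter when you launch the job. The following is a minimal sketch under that assumption; the dictionaries shown would be passed to the PyTorch framework estimator.

```python
# Hypothetical sketch: activate SMDDP AllGather with an mpirun-based launcher
# by setting SMDATAPARALLEL_OPTIMIZE_SDP through the estimator's `environment`
# parameter instead of exporting it in the training script.
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}
environment = {"SMDATAPARALLEL_OPTIMIZE_SDP": "true"}

# These dictionaries would then be passed to the framework estimator, e.g.:
# pt_estimator = PyTorch(..., distribution=distribution, environment=environment)
print(environment["SMDATAPARALLEL_OPTIMIZE_SDP"])
```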

------
#### [ TensorFlow ]

**Important**  
The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow later than v2.11.0. To find previous TensorFlow DLCs with the SMDDP library installed, see [TensorFlow (deprecated)](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks-tensorflow).

```
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    base_job_name = "training_job_name_prefix",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    framework_version="2.11.0",
    py_version="py38",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker AI data parallel library: 
    # ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge
    instance_type="ml.p3.16xlarge",

    # Training using the SageMaker AI data parallel distributed training strategy
    distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }
)

tf_estimator.fit("s3://bucket/path/to/training/data")
```

------

# Use the SageMaker AI generic estimator to extend pre-built DLC containers
<a name="data-parallel-use-python-skd-api"></a>

You can customize SageMaker AI prebuilt containers or extend them to handle any additional functional requirements for your algorithm or model that the prebuilt SageMaker AI Docker image doesn't support. For an example of how you can extend a pre-built container, see [Extend a Prebuilt Container](https://docs.aws.amazon.com/sagemaker/latest/dg/prebuilt-containers-extend.html).

To extend a prebuilt container or adapt your own container to use the library, you must use one of the images listed in [Supported frameworks](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks).

**Note**  
Starting with TensorFlow 2.4.1 and PyTorch 1.8.1, SageMaker AI framework DLCs support EFA-enabled instance types. We recommend that you use the DLC images that contain TensorFlow 2.4.1 or later or PyTorch 1.8.1 or later. 

For example, if you use PyTorch, your Dockerfile should contain a `FROM` statement similar to the following:

```
# SageMaker AI PyTorch image
FROM 763104351884.dkr.ecr.<aws-region>.amazonaws.com/pytorch-training:<image-tag>

ENV PATH="/opt/ml/code:${PATH}"

# This environment variable is used by the SageMaker AI PyTorch container to determine the user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# /opt/ml and all subdirectories are utilized by SageMaker AI; use the /code subdirectory to store your user code.
COPY train.py /opt/ml/code/train.py

# Define train.py as the script entry point
ENV SAGEMAKER_PROGRAM train.py
```
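After you build an image like this and push it to Amazon ECR, you can launch a training job with the SageMaker AI generic `Estimator` class by pointing `image_uri` at your image. The following is a minimal sketch; the account ID, AWS Region, repository name, and tag are placeholders, not real values.

```python
# Hypothetical values: replace the account ID, Region, repository, and tag
# with those of your own image in Amazon ECR.
account_id = "111122223333"
region = "us-west-2"
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/my-smddp-training:latest"

estimator_args = dict(
    image_uri=image_uri,
    role="SageMakerRole",
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

# With the SageMaker Python SDK (not imported here):
# from sagemaker.estimator import Estimator
# estimator = Estimator(**estimator_args)
# estimator.fit("s3://bucket/path/to/training/data")
print(estimator_args["image_uri"])
```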

You can further customize your own Docker container to work with SageMaker AI using the [SageMaker Training toolkit](https://github.com/aws/sagemaker-training-toolkit) and the binary file of the SageMaker AI distributed data parallel library. To learn more, see the instructions in the following section.

# Create your own Docker container with the SageMaker AI distributed data parallel library
<a name="data-parallel-bring-your-own-container"></a>

To build your own Docker container for training and use the SageMaker AI data parallel library, you must include the correct dependencies and the binary files of the SageMaker AI distributed parallel libraries in your Dockerfile. This section provides instructions on how to create a complete Dockerfile with the minimum set of dependencies for distributed training in SageMaker AI using the data parallel library.

**Note**  
This custom Docker option with the SageMaker AI data parallel library as a binary is available only for PyTorch.

**To create a Dockerfile with the SageMaker training toolkit and the data parallel library**

1. Start with a Docker image from [NVIDIA CUDA](https://hub.docker.com/r/nvidia/cuda). Use the cuDNN developer versions that contain CUDA runtime and development tools (headers and libraries) to build from the [PyTorch source code](https://github.com/pytorch/pytorch#from-source).

   ```
   FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
   ```
**Tip**  
The official AWS Deep Learning Container (DLC) images are built from the [NVIDIA CUDA base images](https://hub.docker.com/r/nvidia/cuda). If you want to use the prebuilt DLC images as references while following the rest of the instructions, see the [AWS Deep Learning Containers for PyTorch Dockerfiles](https://github.com/aws/deep-learning-containers/tree/master/pytorch). 

1. Add the following arguments to specify versions of PyTorch and other packages. Also, indicate the Amazon S3 bucket paths to the SageMaker AI data parallel library and other software to use AWS resources, such as the Amazon S3 plug-in. 

   To use versions of the third party libraries other than the ones provided in the following code example, we recommend you look into the [official Dockerfiles of AWS Deep Learning Container for PyTorch](https://github.com/aws/deep-learning-containers/tree/master/pytorch/training/docker) to find versions that are tested, compatible, and suitable for your application. 

   To find URLs for the `SMDATAPARALLEL_BINARY` argument, see the lookup tables at [Supported frameworks](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks).

   ```
   ARG PYTORCH_VERSION=1.10.2
   ARG PYTHON_SHORT_VERSION=3.8
   ARG EFA_VERSION=1.14.1
   ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
   ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
   ARG CONDA_PREFIX="/opt/conda"
   ARG BRANCH_OFI=1.1.3-aws
   ```

1. Set the following environment variables to properly build SageMaker training components and run the data parallel library. You use these variables for the components in the subsequent steps.

   ```
   # Set ENV variables required to build PyTorch
   ENV TORCH_CUDA_ARCH_LIST="7.0+PTX 8.0"
   ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
   ENV NCCL_VERSION=2.10.3
   
   # Add OpenMPI to the path.
   ENV PATH /opt/amazon/openmpi/bin:$PATH
   
   # Add Conda to path
   ENV PATH $CONDA_PREFIX/bin:$PATH
   
   # Set this environment variable for SageMaker AI to launch SMDDP correctly.
   ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main
   
   # Add environment variable for processes to be able to call fork()
   ENV RDMAV_FORK_SAFE=1
   
   # Indicate the container type
   ENV DLC_CONTAINER_TYPE=training
   
   # Add EFA and SMDDP to LD library path
   ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH"
   ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH
   ```

1. Install or update `curl`, `wget`, and `git` to download and build packages in the subsequent steps.

   ```
   RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
       apt-get update && apt-get install -y  --no-install-recommends \
           curl \
           wget \
           git \
       && rm -rf /var/lib/apt/lists/*
   ```

1. Install [Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) software for Amazon EC2 network communication.

   ```
   RUN DEBIAN_FRONTEND=noninteractive apt-get update
   RUN mkdir /tmp/efa \
       && cd /tmp/efa \
       && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \
       && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
       && cd aws-efa-installer \
       && ./efa_installer.sh -y --skip-kmod -g \
       && rm -rf /tmp/efa
   ```

1. Install [Conda](https://docs.conda.io/en/latest/) to handle package management. 

   ```
   RUN curl -fsSL -v -o ~/miniconda.sh -O  https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh  && \
       chmod +x ~/miniconda.sh && \
       ~/miniconda.sh -b -p $CONDA_PREFIX && \
       rm ~/miniconda.sh && \
       $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \
       $CONDA_PREFIX/bin/conda clean -ya
   ```

1. Get, build, and install PyTorch and its dependencies. We build [PyTorch from the source code](https://github.com/pytorch/pytorch#from-source) because we need to have control of the NCCL version to guarantee compatibility with the [AWS OFI NCCL plug-in](https://github.com/aws/aws-ofi-nccl).

   1. Following the steps in the [PyTorch official dockerfile](https://github.com/pytorch/pytorch/blob/master/Dockerfile), install build dependencies and set up [ccache](https://ccache.dev/) to speed up recompilation.

      ```
      RUN DEBIAN_FRONTEND=noninteractive \
          apt-get install -y --no-install-recommends \
              build-essential \
              ca-certificates \
              ccache \
              cmake \
              git \
              libjpeg-dev \
              libpng-dev \
          && rm -rf /var/lib/apt/lists/*
        
      # Setup ccache
      RUN /usr/sbin/update-ccache-symlinks
      RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
      ```

   1. Install [PyTorch’s common and Linux dependencies](https://github.com/pytorch/pytorch#install-dependencies).

      ```
      # Common dependencies for PyTorch
      RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
      
      # Linux specific dependency for PyTorch
      RUN conda install -c pytorch magma-cuda113
      ```

   1. Clone the [PyTorch GitHub repository](https://github.com/pytorch/pytorch).

      ```
      RUN --mount=type=cache,target=/opt/ccache \
          cd / \
          && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
      ```

   1. Install and build a specific [NCCL](https://developer.nvidia.com/nccl) version. To do this, replace the content in PyTorch’s default NCCL folder (`/pytorch/third_party/nccl`) with the specific NCCL version from the NVIDIA repository. The NCCL version was set in step 3 of this guide.

      ```
      RUN cd /pytorch/third_party/nccl \
          && rm -rf nccl \
          && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
          && cd nccl \
          && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
          && make pkg.txz.build \
          && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1
      ```

   1. Build and install PyTorch. This process usually takes slightly more than one hour to complete. PyTorch is built using the NCCL version installed in the previous step.

      ```
      RUN cd /pytorch \
          && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
          python setup.py install \
          && rm -rf /pytorch
      ```

1. Build and install [AWS OFI NCCL plugin](https://github.com/aws/aws-ofi-nccl). This enables [libfabric](https://github.com/ofiwg/libfabric) support for the SageMaker AI data parallel library.

   ```
   RUN DEBIAN_FRONTEND=noninteractive apt-get update \
       && apt-get install -y --no-install-recommends \
           autoconf \
           automake \
           libtool
   RUN mkdir /tmp/efa-ofi-nccl \
       && cd /tmp/efa-ofi-nccl \
       && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
       && cd aws-ofi-nccl \
       && ./autogen.sh \
       && ./configure --with-libfabric=/opt/amazon/efa \
       --with-mpi=/opt/amazon/openmpi \
       --with-cuda=/usr/local/cuda \
       --with-nccl=$CONDA_PREFIX \
       && make \
       && make install \
       && rm -rf /tmp/efa-ofi-nccl
   ```

1. Build and install [TorchVision](https://github.com/pytorch/vision.git).

   ```
   RUN pip install --no-cache-dir -U \
       packaging \
       mpi4py==3.0.3
   RUN cd /tmp \
       && git clone https://github.com/pytorch/vision.git -b v0.9.1 \
       && cd vision \
       && BUILD_VERSION="0.9.1+cu111" python setup.py install \
       && cd /tmp \
       && rm -rf vision
   ```

1. Install and configure OpenSSH. OpenSSH is required for MPI to communicate between containers. Allow OpenSSH to talk to containers without asking for confirmation.

   ```
   RUN apt-get update \
       && apt-get install -y  --allow-downgrades --allow-change-held-packages --no-install-recommends \
       && apt-get install -y --no-install-recommends openssh-client openssh-server \
       && mkdir -p /var/run/sshd \
       && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
       && echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
       && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
       && rm -rf /var/lib/apt/lists/*
   
   # Configure OpenSSH so that nodes can communicate with each other
   RUN mkdir -p /var/run/sshd && \
    sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
   RUN rm -rf /root/.ssh/ && \
    mkdir -p /root/.ssh/ && \
    ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
    cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
    && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config
   ```

1. Install the PT S3 plug-in to efficiently access datasets in Amazon S3.

   ```
   RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}
   RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt
   ```

1. Install the [libboost](https://www.boost.org/) library. This package is needed for the asynchronous IO functionality of the SageMaker AI data parallel library.

   ```
   WORKDIR /
   RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
       && tar -xzf boost_1_73_0.tar.gz \
       && cd boost_1_73_0 \
       && ./bootstrap.sh \
       && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
       && cd .. \
       && rm -rf boost_1_73_0.tar.gz \
       && rm -rf boost_1_73_0 \
       && cd ${CONDA_PREFIX}/include/boost
   ```

1. Install the following SageMaker AI tools for PyTorch training.

   ```
   WORKDIR /root
   RUN pip install --no-cache-dir -U \
       smclarify \
       "sagemaker>=2,<3" \
       sagemaker-experiments==0.* \
       sagemaker-pytorch-training
   ```

1. Finally, install the SageMaker AI data parallel binary and the remaining dependencies.

   ```
   RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
     apt-get update && apt-get install -y  --no-install-recommends \
     jq \
     libhwloc-dev \
     libnuma1 \
     libnuma-dev \
     libssl1.1 \
     libtool \
     hwloc \
     && rm -rf /var/lib/apt/lists/*
   
   RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
   ```

1. After you finish creating the Dockerfile, see [Adapting Your Own Training Container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html) to learn how to build the Docker container, host it in Amazon ECR, and run a training job using the SageMaker Python SDK.

The following example code shows a complete Dockerfile after combining all the previous code blocks.

```
# This file creates a docker image with minimum dependencies to run SageMaker AI data parallel training
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

# Set appropriate versions and location for components
ARG PYTORCH_VERSION=1.10.2
ARG PYTHON_SHORT_VERSION=3.8
ARG EFA_VERSION=1.14.1
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
ARG CONDA_PREFIX="/opt/conda"
ARG BRANCH_OFI=1.1.3-aws

# Set ENV variables required to build PyTorch
ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV NCCL_VERSION=2.10.3

# Add OpenMPI to the path.
ENV PATH /opt/amazon/openmpi/bin:$PATH

# Add Conda to path
ENV PATH $CONDA_PREFIX/bin:$PATH

# Set this environment variable for SageMaker AI to launch SMDDP correctly.
ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main

# Add environment variable for processes to be able to call fork()
ENV RDMAV_FORK_SAFE=1

# Indicate the container type
ENV DLC_CONTAINER_TYPE=training

# Add EFA and SMDDP to LD library path
ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH"
ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH

# Install basic dependencies to download and build other dependencies
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
  apt-get update && apt-get install -y  --no-install-recommends \
  curl \
  wget \
  git \
  && rm -rf /var/lib/apt/lists/*

# Install EFA.
# This is required for SMDDP backend communication
RUN DEBIAN_FRONTEND=noninteractive apt-get update
RUN mkdir /tmp/efa \
    && cd /tmp/efa \
    && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \
    && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
    && cd aws-efa-installer \
    && ./efa_installer.sh -y --skip-kmod -g \
    && rm -rf /tmp/efa

# Install Conda
RUN curl -fsSL -o ~/miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    chmod +x ~/miniconda.sh && \
    ~/miniconda.sh -b -p $CONDA_PREFIX && \
    rm ~/miniconda.sh && \
    $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \
    $CONDA_PREFIX/bin/conda clean -ya

# Install PyTorch.
# Start with dependencies listed in official PyTorch dockerfile
# https://github.com/pytorch/pytorch/blob/master/Dockerfile
RUN DEBIAN_FRONTEND=noninteractive \
    apt-get install -y --no-install-recommends \
        build-essential \
        ca-certificates \
        ccache \
        cmake \
        git \
        libjpeg-dev \
        libpng-dev && \
    rm -rf /var/lib/apt/lists/*

# Setup ccache
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache

# Common dependencies for PyTorch
RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses

# Linux specific dependency for PyTorch
RUN conda install -c pytorch magma-cuda113

# Clone PyTorch
RUN --mount=type=cache,target=/opt/ccache \
    cd / \
    && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
# Note that we need to use the same NCCL version for PyTorch and OFI plugin.
# To enforce that, install NCCL from source before building PT and OFI plugin.

# Install NCCL.
# Required for building OFI plugin (OFI requires NCCL's header files and library)
RUN cd /pytorch/third_party/nccl \
    && rm -rf nccl \
    && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
    && cd nccl \
    && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
    && make pkg.txz.build \
    && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1

# Build and install PyTorch.
RUN cd /pytorch \
    && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    python setup.py install \
    && rm -rf /pytorch

RUN ccache -C

# Build and install OFI plugin.
# It is required to use libfabric.
RUN DEBIAN_FRONTEND=noninteractive apt-get update \
    && apt-get install -y --no-install-recommends \
        autoconf \
        automake \
        libtool
RUN mkdir /tmp/efa-ofi-nccl \
    && cd /tmp/efa-ofi-nccl \
    && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
    && cd aws-ofi-nccl \
    && ./autogen.sh \
    && ./configure --with-libfabric=/opt/amazon/efa \
        --with-mpi=/opt/amazon/openmpi \
        --with-cuda=/usr/local/cuda \
        --with-nccl=$CONDA_PREFIX \
    && make \
    && make install \
    && rm -rf /tmp/efa-ofi-nccl

# Build and install Torchvision
RUN pip install --no-cache-dir -U \
    packaging \
    mpi4py==3.0.3
RUN cd /tmp \
    && git clone https://github.com/pytorch/vision.git -b v0.9.1 \
    && cd vision \
    && BUILD_VERSION="0.9.1+cu111" python setup.py install \
    && cd /tmp \
    && rm -rf vision

# Install OpenSSH.
# Required for MPI to communicate between containers, allow OpenSSH to talk to containers without asking for confirmation
RUN apt-get update \
    && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
        openssh-client openssh-server \
    && mkdir -p /var/run/sshd \
    && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
    && echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
    && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
    && rm -rf /var/lib/apt/lists/*
# Configure OpenSSH so that nodes can communicate with each other
RUN mkdir -p /var/run/sshd && \
    sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
RUN rm -rf /root/.ssh/ && \
    mkdir -p /root/.ssh/ && \
    ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
    cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
    && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config

# Install PT S3 plugin.
# Required to efficiently access datasets in Amazon S3
RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}
RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt

# Install libboost from source.
# This package is needed for smdataparallel functionality (for networking asynchronous IO).
WORKDIR /
RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
    && tar -xzf boost_1_73_0.tar.gz \
    && cd boost_1_73_0 \
    && ./bootstrap.sh \
    && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
    && cd .. \
    && rm -rf boost_1_73_0.tar.gz \
    && rm -rf boost_1_73_0 \
    && cd ${CONDA_PREFIX}/include/boost

# Install SageMaker AI PyTorch training.
WORKDIR /root
RUN pip install --no-cache-dir -U \
    smclarify \
    "sagemaker>=2,<3" \
    sagemaker-experiments==0.* \
    sagemaker-pytorch-training

# Install SageMaker AI data parallel binary (SMDDP)
# Start with dependencies
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
    apt-get update && apt-get install -y  --no-install-recommends \
        jq \
        libhwloc-dev \
        libnuma1 \
        libnuma-dev \
        libssl1.1 \
        libtool \
        hwloc \
    && rm -rf /var/lib/apt/lists/*

# Install SMDDP
RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
```

**Tip**  
For more general information about creating a custom Dockerfile for training in SageMaker AI, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).

**Tip**  
If you want to extend the custom Dockerfile to incorporate the SageMaker AI model parallel library, see [Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library](model-parallel-sm-sdk.md#model-parallel-bring-your-own-container).

# Amazon SageMaker AI data parallelism library examples
<a name="distributed-data-parallel-v2-examples"></a>

This page provides Jupyter notebooks that present examples of implementing the SageMaker AI distributed data parallelism (SMDDP) library to run distributed training jobs on SageMaker AI.

## Blogs and Case Studies
<a name="distributed-data-parallel-v2-examples-blog"></a>

The following blogs discuss case studies about using the SMDDP library.

**SMDDP v2 blogs**
+ [Enable faster training with Amazon SageMaker AI data parallel library](https://aws.amazon.com/blogs/machine-learning/enable-faster-training-with-amazon-sagemaker-data-parallel-library/), *AWS Machine Learning Blog* (December 05, 2023)

**SMDDP v1 blogs**
+ [How I trained 10TB for Stable Diffusion on SageMaker AI](https://medium.com/@emilywebber/how-i-trained-10tb-for-stable-diffusion-on-sagemaker-39dcea49ce32) in *Medium* (November 29, 2022)
+ [Run PyTorch Lightning and native PyTorch DDP on Amazon SageMaker Training, featuring Amazon Search ](https://aws.amazon.com/blogs/machine-learning/run-pytorch-lightning-and-native-pytorch-ddp-on-amazon-sagemaker-training-featuring-amazon-search/), *AWS Machine Learning Blog* (August 18, 2022)
+ [Training YOLOv5 on AWS with PyTorch and the SageMaker AI distributed data parallel library](https://medium.com/@sitecao/training-yolov5-on-aws-with-pytorch-and-sagemaker-distributed-data-parallel-library-a196ab01409b), *Medium* (May 6, 2022)
+ [Speed up EfficientNet model training on SageMaker AI with PyTorch and the SageMaker AI distributed data parallel library](https://medium.com/@dangmz/speed-up-efficientnet-model-training-on-amazon-sagemaker-with-pytorch-and-sagemaker-distributed-dae4b048c01a), *Medium* (March 21, 2022)
+ [Speed up EfficientNet training on AWS with the SageMaker AI distributed data parallel library](https://towardsdatascience.com/speed-up-efficientnet-training-on-aws-by-up-to-30-with-sagemaker-distributed-data-parallel-library-2dbf6d1e18e8), *Towards Data Science* (January 12, 2022)
+ [Hyundai reduces ML model training time for autonomous driving models using Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/hyundai-reduces-training-time-for-autonomous-driving-models-using-amazon-sagemaker/), *AWS Machine Learning Blog* (June 25, 2021)
+ [Distributed Training: Train BART/T5 for Summarization using Transformers and Amazon SageMaker AI](https://huggingface.co/blog/sagemaker-distributed-training-seq2seq), the *Hugging Face website* (April 8, 2021)

## Example notebooks
<a name="distributed-data-parallel-v2-examples-pytorch"></a>

Example notebooks are provided in the [SageMaker AI examples GitHub repository](https://github.com/aws/amazon-sagemaker-examples/tree/master/training/distributed_training/). To download the examples, run the following commands to clone the repository and go to `training/distributed_training/pytorch/data_parallel`.

**Note**  
Clone and run the example notebooks in one of the following SageMaker AI ML IDEs.  
+ [SageMaker AI JupyterLab](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl.html) (available in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
+ [SageMaker AI Code Editor](https://docs.aws.amazon.com/sagemaker/latest/dg/code-editor.html) (available in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
+ [Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) (available as an application in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
+ [SageMaker Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html)

```
git clone https://github.com/aws/amazon-sagemaker-examples.git
cd amazon-sagemaker-examples/training/distributed_training/pytorch/data_parallel
```

**SMDDP v2 examples**
+ [Train Llama 2 using the SageMaker AI distributed data parallel library (SMDDP) and DeepSpeed](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/deepspeed/llama2/smddp_deepspeed_example.ipynb)
+ [Train Falcon using the SageMaker AI distributed data parallel library (SMDDP) and PyTorch Fully Sharded Data Parallelism (FSDP)](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/fully_sharded_data_parallel/falcon/smddp_fsdp_example.ipynb)

**SMDDP v1 examples**
+ [CNN with PyTorch and the SageMaker AI data parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/mnist/pytorch_smdataparallel_mnist_demo.ipynb)
+ [BERT with PyTorch and the SageMaker AI data parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/bert/pytorch_smdataparallel_bert_demo.ipynb)
+ [CNN with TensorFlow 2.3.1 and the SageMaker AI data parallelism library](https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/tensorflow/data_parallel/mnist/tensorflow2_smdataparallel_mnist_demo.html)
+ [BERT with TensorFlow 2.3.1 and the SageMaker AI data parallelism library](https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/tensorflow/data_parallel/bert/tensorflow2_smdataparallel_bert_demo.html)
+ [HuggingFace Distributed Data Parallel Training in PyTorch on SageMaker AI - Distributed Question Answering](https://github.com/huggingface/notebooks/blob/master/sagemaker/03_distributed_training_data_parallelism/sagemaker-notebook.ipynb)
+ [HuggingFace Distributed Data Parallel Training in PyTorch on SageMaker AI - Distributed Text Summarization](https://github.com/huggingface/notebooks/blob/master/sagemaker/08_distributed_summarization_bart_t5/sagemaker-notebook.ipynb)
+ [HuggingFace Distributed Data Parallel Training in TensorFlow on SageMaker AI](https://github.com/huggingface/notebooks/blob/master/sagemaker/07_tensorflow_distributed_training_data_parallelism/sagemaker-notebook.ipynb)

# Configuration tips for the SageMaker AI distributed data parallelism library
<a name="data-parallel-config"></a>

Review the following tips before using the SageMaker AI distributed data parallelism (SMDDP) library. This list includes tips that are applicable across frameworks.

**Topics**
+ [Data preprocessing](#data-parallel-config-dataprep)
+ [Single versus multiple nodes](#data-parallel-config-multi-node)
+ [Debug scaling efficiency with Debugger](#data-parallel-config-debug)
+ [Batch size](#data-parallel-config-batch-size)
+ [Custom MPI options](#data-parallel-config-mpi-custom)
+ [Use Amazon FSx and set up an optimal storage and throughput capacity](#data-parallel-config-fxs)

## Data preprocessing
<a name="data-parallel-config-dataprep"></a>

If you preprocess data during training using an external library that utilizes the CPU, you may run into a CPU bottleneck because SageMaker AI distributed data parallel uses the CPU for `AllReduce` operations. You may be able to improve training time by moving preprocessing steps to a library that uses GPUs or by completing all preprocessing before training.

## Single versus multiple nodes
<a name="data-parallel-config-multi-node"></a>

We recommend that you use this library with multiple nodes. The library can be used with a single-host, multi-device setup (for example, a single ML compute instance with multiple GPUs); however, when you use two or more nodes, the library’s `AllReduce` operation gives you significant performance improvement. Also, on a single host, NVLink already contributes to in-node `AllReduce` efficiency.

## Debug scaling efficiency with Debugger
<a name="data-parallel-config-debug"></a>

You can use Amazon SageMaker Debugger to monitor and visualize CPU and GPU utilization and other metrics of interest during training. You can use Debugger [built-in rules](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html) to monitor computational performance issues, such as `CPUBottleneck`, `LoadBalancing`, and `LowGPUUtilization`. You can specify these rules with [Debugger configurations](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configuration-for-debugging.html) when you define an Amazon SageMaker Python SDK estimator. If you use AWS CLI and AWS SDK for Python (Boto3) for training on SageMaker AI, you can enable Debugger as shown in [Configure SageMaker Debugger Using Amazon SageMaker API](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-createtrainingjob-api.html).

To see an example using Debugger in a SageMaker training job, you can reference one of the notebook examples in the [SageMaker Notebook Examples GitHub repository](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger). To learn more about Debugger, see [Amazon SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html).

## Batch size
<a name="data-parallel-config-batch-size"></a>

In distributed training, as more nodes are added, batch sizes should increase proportionally. To improve convergence speed as you add more nodes to your training job and increase the global batch size, increase the learning rate.

One way to achieve this is by using a gradual learning rate warmup where the learning rate is ramped up from a small to a large value as the training job progresses. This ramp avoids a sudden increase of the learning rate, allowing healthy convergence at the start of training. For example, you can use a *Linear Scaling Rule* where each time the mini-batch size is multiplied by k, the learning rate is also multiplied by k. To learn more about this technique, see the research paper, [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/pdf/1706.02677.pdf), Sections 2 and 3.
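
The linear scaling rule with gradual warmup described above can be sketched in a few lines. This is an illustrative sketch, not SageMaker API code; the function names and the batch-size and learning-rate values are hypothetical:

```python
# Linear Scaling Rule: when the global batch size is multiplied by k,
# multiply the base learning rate by k as well.
def scaled_lr(base_lr, base_batch_size, global_batch_size):
    return base_lr * (global_batch_size / base_batch_size)

# Gradual warmup: ramp linearly from base_lr up to target_lr over
# warmup_steps, avoiding a sudden learning-rate jump at the start.
def warmup_lr(target_lr, base_lr, step, warmup_steps):
    if step >= warmup_steps:
        return target_lr
    return base_lr + (target_lr - base_lr) * step / warmup_steps

# Example: scaling from batch size 256 (base LR 0.1) to a global batch
# size of 2048 across 8 nodes gives k = 8, so the target LR is 0.8.
target = scaled_lr(0.1, 256, 2048)
print(target)                            # 0.8
print(warmup_lr(target, 0.1, 50, 100))   # halfway through warmup
```

In a real training script, `warmup_lr` would be evaluated at each optimizer step (or each epoch) and written into the optimizer's learning-rate parameter.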

## Custom MPI options
<a name="data-parallel-config-mpi-custom"></a>

The SageMaker AI distributed data parallel library employs Message Passing Interface (MPI), a popular standard for managing communication between nodes in a high-performance cluster, and uses NVIDIA’s NCCL library for GPU-level communication. When you use the data parallel library with a TensorFlow or PyTorch `Estimator`, the respective container sets up the MPI environment and executes the `mpirun` command to start jobs on the cluster nodes.

You can set custom MPI options using the `custom_mpi_options` parameter in the `Estimator`. Any `mpirun` flags passed in this field are added to the `mpirun` command and executed by SageMaker AI for training. For example, you may define the `distribution` parameter of an `Estimator` as follows to use the [`NCCL_DEBUG`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug) environment variable to print the NCCL version at the start of the program:

```
distribution = {'smdistributed':{'dataparallel':{'enabled': True, "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"}}}
```
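
For context, here is a hedged sketch of how this dictionary plugs into a training job launch. Only the dictionary itself is executable here; the `Estimator` call is shown as a comment because the entry point, role, and instance settings are placeholders rather than values from this guide:

```python
# Distribution configuration enabling SMDDP with custom mpirun flags.
# "-verbose" adds mpirun logging; "-x NCCL_DEBUG=VERSION" exports an
# environment variable so NCCL prints its version at startup.
distribution = {
    "smdistributed": {
        "dataparallel": {
            "enabled": True,
            "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION",
        }
    }
}

# The dict is passed through to the framework Estimator unchanged, e.g.:
# from sagemaker.pytorch import PyTorch
# estimator = PyTorch(
#     entry_point="train.py",          # placeholder script name
#     role=role,                       # your SageMaker execution role
#     instance_type="ml.p4d.24xlarge",
#     instance_count=2,
#     distribution=distribution,
# )

print(distribution["smdistributed"]["dataparallel"]["custom_mpi_options"])
```

Any additional `mpirun` flags you append to `custom_mpi_options` are passed through verbatim, so they must be flags your MPI implementation actually supports.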

## Use Amazon FSx and set up an optimal storage and throughput capacity
<a name="data-parallel-config-fxs"></a>

When training a model on multiple nodes with distributed data parallelism, we highly recommend that you use [FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html). Amazon FSx is a scalable, high-performance storage service that supports shared file storage with high throughput. Using Amazon FSx storage at scale, you can achieve faster data loading across the compute nodes.

Typically, with distributed data parallelism, you would expect the total training throughput to scale near-linearly with the number of GPUs. However, if you use suboptimal Amazon FSx storage, the training performance might slow down due to low Amazon FSx throughput. 

For example, if you use the [**SCRATCH_2** deployment type of the Amazon FSx file system](https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html#fsx-aggregate-perf) with the minimum 1.2 TiB storage capacity, the I/O throughput capacity is 240 MB/s. Amazon FSx throughput scales with the physical storage devices you assign: the more devices assigned, the larger the throughput you get. The smallest storage increment for the SCRATCH_2 type is 1.2 TiB, and the corresponding throughput gain is 240 MB/s.

Assume that you have a model to train on a 4-node cluster over a 100 GB data set. With a given batch size that’s optimized to the cluster, assume that the model can complete one epoch in about 30 seconds. In this case, the minimum required I/O speed is approximately 3.3 GB/s (100 GB / 30 s), a much higher throughput requirement than 240 MB/s. With such a limited Amazon FSx capacity, scaling your distributed training job up to larger clusters might aggravate I/O bottleneck problems; model training throughput might improve in later epochs as the cache builds up, but Amazon FSx throughput can still be a bottleneck.
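
The arithmetic in this example, and the mapping back to storage increments, can be sketched as follows. The storage increment (1.2 TiB) and per-increment throughput (240 MB/s) are the figures quoted above; treat the result as a rough sizing estimate, not a capacity guarantee:

```python
import math

# Estimate the I/O throughput a distributed training job needs, then map
# it onto FSx for Lustre storage increments (figures from the example above).
dataset_gb = 100        # data set read per epoch, in GB
epoch_seconds = 30      # time to complete one epoch
required_gbps = dataset_gb / epoch_seconds        # ~3.33 GB/s needed

increment_tib = 1.2     # smallest storage increment for this deployment type
increment_mbps = 240    # throughput gained per increment

# Increments needed to fully cover the estimate (in practice you may start
# with throughput equal to or slightly below the estimate and experiment).
increments = math.ceil(required_gbps * 1000 / increment_mbps)
storage_tib = increments * increment_tib

print(f"required throughput ~ {required_gbps:.2f} GB/s")
print(f"storage to provision ~ {storage_tib:.1f} TiB ({increments} increments)")
```

This illustrates why a minimum-size file system falls short here: covering the ~3.3 GB/s estimate takes roughly an order of magnitude more provisioned storage than the 1.2 TiB minimum.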

To alleviate such I/O bottleneck problems, increase the Amazon FSx storage size to obtain a higher throughput capacity. Typically, to find an optimal I/O throughput, experiment with different Amazon FSx throughput capacities, assigning a throughput equal to or slightly lower than your estimate, until you find that it is sufficient to resolve the I/O bottleneck problems. In the preceding example, Amazon FSx storage with 2.4 GB/s throughput and 67 GB of RAM cache would be sufficient. If the file system has an optimal throughput, the model training throughput should reach its maximum either immediately or after the first epoch, once the cache has built up.

To learn more about how to increase Amazon FSx storage and deployment types, see the following pages in the *Amazon FSx for Lustre documentation*:
+  [How to increase storage capacity](https://docs.aws.amazon.com/fsx/latest/LustreGuide/managing-storage-capacity.html#increase-storage-capacity) 
+  [Aggregate file system performance](https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html#fsx-aggregate-perf) 

# Amazon SageMaker AI distributed data parallelism library FAQ
<a name="data-parallel-faq"></a>

Use the following to find answers to commonly asked questions about the SMDDP library.

**Q: When using the library, how are the `allreduce`-supporting CPU instances managed? Do I have to create heterogeneous CPU-GPU clusters, or does the SageMaker AI service create extra C5 instances for jobs that use the SMDDP library?**

The SMDDP library only supports GPU instances, more specifically, P4d and P4de instances with NVIDIA A100 GPUs and EFA. No additional C5 or CPU instances are launched; if your SageMaker AI training job runs on an 8-node P4d cluster, only 8 `ml.p4d.24xlarge` instances are used, and no additional instances are provisioned.

**Q: I have a training job taking 5 days on a single `ml.p3.24xlarge` instance with a set of hyperparameters H1 (learning rate, batch size, optimizer, and so on). Is using SageMaker AI's data parallelism library and a five-times bigger cluster enough to achieve an approximate five-times speedup? Or do I have to revisit its training hyperparameters after activating the SMDDP library?**

The library changes the overall batch size: the new overall batch size scales linearly with the number of training instances used. As a result, hyperparameters such as the learning rate have to be adjusted to ensure convergence.

**Q: Does the SMDDP library support Spot Instances?**

Yes. You can use managed spot training. You specify the path to the checkpoint file in the SageMaker training job, and you save and restore checkpoints in your training script as mentioned in the last steps of [Use the SMDDP library in your TensorFlow training script (deprecated)](data-parallel-modify-sdp-tf2.md) and [Use the SMDDP library in your PyTorch training script](data-parallel-modify-sdp-pt.md).
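
The save-and-resume pattern that makes a training script Spot-safe can be sketched framework-agnostically as follows. This is an illustrative sketch only: a temporary directory stands in for the checkpoint path you configure on the training job, and a plain dictionary stands in for model and optimizer state:

```python
import os
import pickle
import tempfile

def save_checkpoint(state, checkpoint_dir):
    """Persist training state so an interrupted Spot job can resume."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    with open(os.path.join(checkpoint_dir, "checkpoint.pkl"), "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(checkpoint_dir):
    """Return the saved state, or None on a fresh start."""
    path = os.path.join(checkpoint_dir, "checkpoint.pkl")
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

# Simulate: train 3 epochs, saving a checkpoint after each one, then resume.
ckpt_dir = tempfile.mkdtemp()
state = load_checkpoint(ckpt_dir) or {"epoch": 0}
for epoch in range(state["epoch"], 3):
    state["epoch"] = epoch + 1          # ... one epoch of training ...
    save_checkpoint(state, ckpt_dir)

resumed = load_checkpoint(ckpt_dir)
print(resumed["epoch"])
```

On a real training job, you would write checkpoints to the local checkpoint path that SageMaker syncs to your Amazon S3 checkpoint location, and use your framework's own serialization (for example, a model state dict) instead of `pickle`.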

**Q: Is the SMDDP library relevant in a single-host, multi-device setup?**

The library can be used in single-host, multi-device training, but it offers performance improvements only in multi-host training.

**Q: Where should the training dataset be stored?**

The training dataset can be stored in an Amazon S3 bucket or in an Amazon FSx file system. See this [document on the supported input file systems for a training job](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.FileSystemInput).

**Q: When using the SMDDP library, is it mandatory to have training data in FSx for Lustre? Can Amazon EFS and Amazon S3 be used?**

We generally recommend you use Amazon FSx because of its lower latency and higher throughput. If you prefer, you can use Amazon EFS or Amazon S3.

**Q: Can the library be used with CPU nodes?** 

No. To find instance types supported by the SMDDP library, see [Supported instance types](distributed-data-parallel-support.md#distributed-data-parallel-supported-instance-types).

**Q: What frameworks and framework versions are currently supported by the SMDDP library at launch?** 

The SMDDP library currently supports PyTorch v1.6.0 or later and TensorFlow v2.3.0 or later. It doesn't support TensorFlow 1.x. For more information about which version of the SMDDP library is packaged within AWS Deep Learning Containers, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html).

**Q: Does the library support AMP?**

Yes, the SMDDP library supports Automatic Mixed Precision (AMP) out of the box. No extra action is needed to use AMP other than the framework-level modifications to your training script. If gradients are in FP16, the SageMaker AI data parallelism library runs its `AllReduce` operation in FP16. For more information about implementing AMP APIs in your training script, see the following resources:
+ [Frameworks - PyTorch](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#pytorch) in the *NVIDIA Deep Learning Performance documentation*
+ [Frameworks - TensorFlow](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#tensorflow) in the *NVIDIA Deep Learning Performance documentation*
+ [Automatic Mixed Precision for Deep Learning](https://developer.nvidia.com/automatic-mixed-precision) in the *NVIDIA Developer Docs*
+ [Introducing native PyTorch automatic mixed precision for faster training on NVIDIA GPUs](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/) in the *PyTorch Blog*
+ [TensorFlow mixed precision APIs](https://www.tensorflow.org/guide/mixed_precision) in the *TensorFlow documentation*

**Q: How do I identify if my distributed training job is slowed down due to I/O bottleneck?**

With a larger cluster, the training job requires more I/O throughput. If the training throughput takes longer (more epochs) to ramp up to its maximum performance as you scale up the number of nodes, this indicates an I/O bottleneck: the cache is harder to build up with a higher throughput requirement and a more complex network topology. For more information about monitoring Amazon FSx throughput on CloudWatch, see [Monitoring FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/monitoring_overview.html) in the *FSx for Lustre User Guide*. 

**Q: How do I resolve I/O bottlenecks when running a distributed training job with data parallelism?**

We highly recommend that you use Amazon FSx as your data channel if you are currently using Amazon S3. If you are already using Amazon FSx but still have I/O bottleneck problems, you might have set up your Amazon FSx file system with a low I/O throughput and a small storage capacity. For more information about how to estimate and choose the right size of I/O throughput capacity, see [Use Amazon FSx and set up an optimal storage and throughput capacity](data-parallel-config.md#data-parallel-config-fxs).

**Q: (For the SMDDP library v1.4.0 or later) How do I resolve the `Invalid backend` error when initializing a process group?**

If you encounter the error message `ValueError: Invalid backend: 'smddp'` when calling `init_process_group`, this is due to the breaking change in the SMDDP library v1.4.0 and later. You must import the PyTorch client of the library, `smdistributed.dataparallel.torch.torch_smddp`, which registers `smddp` as a backend for PyTorch. To learn more, see [Use the SMDDP library in your PyTorch training script](data-parallel-modify-sdp-pt.md).

**Q: (For the SMDDP library v1.4.0 or later) I would like to call the collective primitives of the [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html) interface. Which primitives does the `smddp` backend support?**

In v1.4.0, the SMDDP library supports `all_reduce`, `broadcast`, `reduce`, `all_gather`, and `barrier` of the `torch.distributed` interface.

**Q: (For the SMDDP library v1.4.0 or later) Does this new API work with other custom DDP classes or libraries like Apex DDP?**

The SMDDP library is tested with other third-party distributed data parallel libraries and framework implementations that use the `torch.distributed` modules. Using the SMDDP library with custom DDP classes works as long as the collective operations used by the custom DDP classes are supported by the SMDDP library. See the preceding question for a list of supported collectives. If you have these use cases and need further support, reach out to the SageMaker AI team through the [AWS Support Center](https://console.aws.amazon.com/support/) or [AWS Developer Forums for Amazon SageMaker AI](https://forums.aws.amazon.com/forum.jspa?forumID=285).

**Q: Does the SMDDP library support the bring-your-own-container (BYOC) option? If so, how do I install the library and run a distributed training job by writing a custom Dockerfile?**

If you want to integrate the SMDDP library and its minimum dependencies into your own Docker container, BYOC is the right approach. You can build your own container using the binary file of the library. The recommended process is to write a custom Dockerfile with the library and its dependencies, build the Docker container, host it in Amazon ECR, and use the ECR image URI to launch a training job using the SageMaker AI generic estimator class. For more instructions on how to prepare a custom Dockerfile for distributed training in SageMaker AI with the SMDDP library, see [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md).

# Troubleshooting for distributed training in Amazon SageMaker AI
<a name="distributed-troubleshooting-data-parallel"></a>

If you have problems running a training job when you use the library, use the following list to troubleshoot. If you need further support, reach out to the SageMaker AI team through the [AWS Support Center](https://console.aws.amazon.com/support/) or [AWS Developer Forums for Amazon SageMaker AI](https://forums.aws.amazon.com/forum.jspa?forumID=285).

**Topics**
+ [Using SageMaker AI distributed data parallel with Amazon SageMaker Debugger and checkpoints](#distributed-ts-data-parallel-debugger)
+ [An unexpected prefix attached to model parameter keys](#distributed-ts-data-parallel-pytorch-prefix)
+ [SageMaker AI distributed training job stalling during initialization](#distributed-ts-data-parallel-efa-sg)
+ [SageMaker AI distributed training job stalling at the end of training](#distributed-ts-data-parallel-stall-at-the-end)
+ [Observing scaling efficiency degradation due to Amazon FSx throughput bottlenecks](#distributed-ts-data-parallel-fxs-bottleneck)
+ [SageMaker AI distributed training job with PyTorch returns deprecation warnings](#distributed-ts-data-parallel-deprecation-warnings)

## Using SageMaker AI distributed data parallel with Amazon SageMaker Debugger and checkpoints
<a name="distributed-ts-data-parallel-debugger"></a>

To monitor system bottlenecks, profile framework operations, and debug model output tensors for training jobs with SageMaker AI distributed data parallel, use Amazon SageMaker Debugger. 

However, when you use SageMaker Debugger, SageMaker AI distributed data parallel, and SageMaker AI checkpoints, you might see an error that looks like the following example. 

```
SMDebug Does Not Currently Support Distributed Training Jobs With Checkpointing Enabled
```

This is due to an internal error between Debugger and checkpoints, which occurs when you enable SageMaker AI distributed data parallel. 
+ If you enable all three features, the SageMaker Python SDK automatically turns off Debugger by passing `debugger_hook_config=False`, which is equivalent to the following framework `estimator` example.

  ```
  bucket=sagemaker.Session().default_bucket()
  base_job_name="sagemaker-checkpoint-test"
  checkpoint_in_bucket="checkpoints"
  
  # The S3 URI to store the checkpoints
  checkpoint_s3_bucket="s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)
  
  estimator = TensorFlow(
      ...
  
      distribution={"smdistributed": {"dataparallel": { "enabled": True }}},
      checkpoint_s3_uri=checkpoint_s3_bucket,
      checkpoint_local_path="/opt/ml/checkpoints",
      debugger_hook_config=False
  )
  ```
+ If you want to keep using both SageMaker AI distributed data parallel and SageMaker Debugger, a workaround is manually adding checkpointing functions to your training script instead of specifying the `checkpoint_s3_uri` and `checkpoint_local_path` parameters from the estimator. For more information about setting up manual checkpointing in a training script, see [Saving Checkpoints](distributed-troubleshooting-model-parallel.md#distributed-ts-model-parallel-checkpoints).
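For example, a manual checkpointing routine typically has only one process write each checkpoint. The following helper is an illustrative sketch (the function and file naming are not part of any SageMaker API); the default directory matches the `checkpoint_local_path` shown in the estimator example above:

```python
import os
from typing import Optional

# Illustrative helper (not a SageMaker API): in a manual checkpointing
# workaround, have only rank 0 write each checkpoint so that workers
# do not collide on the same file.
def checkpoint_path(rank: int, epoch: int,
                    local_dir: str = "/opt/ml/checkpoints") -> Optional[str]:
    """Return the file this rank should write, or None if it should skip."""
    if rank != 0:
        return None
    return os.path.join(local_dir, f"checkpoint_epoch{epoch}.pt")
```

Your training script would then call something like `torch.save(model.state_dict(), path)` only when the helper returns a path, and handle uploading checkpoints to Amazon S3 itself.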

## An unexpected prefix attached to model parameter keys
<a name="distributed-ts-data-parallel-pytorch-prefix"></a>

For PyTorch distributed training jobs, an unexpected prefix (for example, `model`) might be attached to the `state_dict` keys (model parameters). The SageMaker AI data parallel library does not directly alter or prepend any model parameter names when PyTorch training jobs save model artifacts. PyTorch's distributed training changes the names in the `state_dict` to go over the network, prepending the prefix. If you encounter a model failure due to mismatched parameter names while you are using the SageMaker AI data parallel library and checkpointing for PyTorch training, adapt the following example code to remove the prefix at the step where you load checkpoints in your training script.

```
state_dict = {k.partition('model.')[2]: v for k, v in state_dict.items()}
```

This takes each `state_dict` key as a string value, separates the string at the first occurrence of `'model.'`, and takes the third element (index 2) of the resulting tuple, which is the part of the key after the prefix.
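Applied to a toy `state_dict` (the parameter names below are made up for illustration), the transformation looks like this:

```python
# Toy example: keys saved with the unexpected "model." prefix.
state_dict = {
    "model.fc.weight": [0.1, 0.2],
    "model.fc.bias": [0.3],
}

# Strip everything up to and including the first occurrence of "model."
state_dict = {k.partition("model.")[2]: v for k, v in state_dict.items()}

print(sorted(state_dict))  # prints ['fc.bias', 'fc.weight']
```

Note that a key without the prefix would map to an empty string, so this one-liner assumes every key carries the prefix.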

For more information about the prefix issue, see a discussion thread at [Prefix parameter names in saved model if trained by multi-GPU?](https://discuss.pytorch.org/t/prefix-parameter-names-in-saved-model-if-trained-by-multi-gpu/494) in the *PyTorch discussion forum*.

For more information about the PyTorch methods for saving and loading models, see [Saving & Loading Model Across Devices](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-across-devices) in the *PyTorch documentation*.

## SageMaker AI distributed training job stalling during initialization
<a name="distributed-ts-data-parallel-efa-sg"></a>

If your SageMaker AI distributed data parallel training job stalls during initialization when using EFA-enabled instances, this might be due to a misconfiguration in the security group of the VPC subnet that's used for the training job. EFA requires a proper security group configuration to enable traffic between the nodes.

**To configure inbound and outbound rules for the security group**

1. Sign in to the AWS Management Console and open the Amazon VPC console at [https://console.aws.amazon.com/vpc/](https://console.aws.amazon.com/vpc/).

1. Choose **Security Groups** in the left navigation pane.

1. Select the security group that's tied to the VPC subnet you use for training. 

1. In the **Details** section, copy the **Security group ID**.

1. On the **Inbound rules** tab, choose **Edit inbound rules**.

1. On the **Edit inbound rules** page, do the following: 

   1. Choose **Add rule**.

   1. For **Type**, choose **All traffic**.

   1. For **Source**, choose **Custom**, paste the security group ID into the search box, and select the security group that pops up.

1. Choose **Save rules** to finish configuring the inbound rule for the security group.

1. On the **Outbound rules** tab, choose **Edit outbound rules**.

1. Repeat steps 6 and 7 to add the same rule as an outbound rule.

After you complete the preceding steps for configuring the security group with the inbound and outbound rules, rerun the training job and verify whether the stalling issue is resolved.

For more information about configuring security groups for VPC and EFA, see [Security groups for your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html) and [Elastic Fabric Adapter](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html).

## SageMaker AI distributed training job stalling at the end of training
<a name="distributed-ts-data-parallel-stall-at-the-end"></a>

One of the root causes of stalling issues at the end of training is a mismatch in the number of batches that are processed per epoch across different ranks. All workers (GPUs) synchronize their local gradients in the backward pass to ensure they all have the same copy of the model at the end of the batch iteration. If the batch sizes are unevenly assigned to different worker groups during the final epoch of training, the training job stalls. For example, while a group of workers (group A) finishes processing all batches and exits the training loop, another group of workers (group B) starts processing another batch and still expects communication from group A to synchronize the gradients. This causes group B to wait for group A, which already completed training and does not have any gradients to synchronize. 

Therefore, when setting up your training dataset, make sure that each worker (rank) receives the same number of data samples, so that every rank processes the same number of batches and no rank is left waiting for a gradient synchronization that never comes.
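The arithmetic can be sketched as follows. This illustrative helper mirrors the behavior of `torch.utils.data.DistributedSampler`, which either truncates the uneven tail (`drop_last=True`) or pads by repeating samples (`drop_last=False`); either way, every rank ends up with the same sample count:

```python
import math

# Sketch of per-rank sample counts, mirroring the behavior of
# torch.utils.data.DistributedSampler: with drop_last=True the uneven
# tail is truncated; otherwise samples are repeated so that every rank
# is padded up to the same count. In both cases, all ranks process the
# same number of batches, which avoids the end-of-training stall.
def samples_per_rank(dataset_len: int, world_size: int, drop_last: bool) -> int:
    if drop_last:
        return dataset_len // world_size
    return math.ceil(dataset_len / world_size)

# 10 samples on 4 ranks: 2 per rank if the tail is dropped, 3 if padded.
```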

## Observing scaling efficiency degradation due to Amazon FSx throughput bottlenecks
<a name="distributed-ts-data-parallel-fxs-bottleneck"></a>

One potential cause of lowered scaling efficiency is the FSx throughput limit. If you observe a sudden drop in scaling efficiency when you switch to a larger training cluster, try using a larger FSx for Lustre file system with a higher throughput limit. For more information, see [Aggregate file system performance](https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html#fsx-aggregate-perf) and [Managing storage and throughput capacity](https://docs.aws.amazon.com/fsx/latest/LustreGuide/managing-storage-capacity.html) in the *Amazon FSx for Lustre User Guide*.

## SageMaker AI distributed training job with PyTorch returns deprecation warnings
<a name="distributed-ts-data-parallel-deprecation-warnings"></a>

Since v1.4.0, the SageMaker AI distributed data parallelism library works as a backend of PyTorch distributed. Because of the breaking change of using the library with PyTorch, you might encounter a warning message that the `smdistributed` APIs for the PyTorch distributed package are deprecated. The warning message should be similar to the following:

```
smdistributed.dataparallel.torch.dist is deprecated in the SageMaker AI distributed data parallel library v1.4.0+.
Please use torch.distributed and specify 'smddp' as a backend when initializing process group as follows:
torch.distributed.init_process_group(backend='smddp')
For more information, see the library's API documentation at
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html
```

In v1.4.0 and later, the library only needs to be imported once at the top of your training script and set as the backend during the PyTorch distributed initialization. With the single line of backend specification, you can keep your PyTorch training script unchanged and directly use the PyTorch distributed modules. See [Use the SMDDP library in your PyTorch training script](data-parallel-modify-sdp-pt.md) to learn about the breaking changes and the new way to use the library with PyTorch.
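In outline, the top of a v1.4.0+ PyTorch training script looks like the following sketch. It runs only inside a SageMaker training environment where the SMDDP library is installed; the rest of a standard `DistributedDataParallel` script stays unchanged:

```python
import torch.distributed as dist

# Importing this module registers "smddp" as a torch.distributed backend.
import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401

# Initialize the process group with the smddp backend; from here on,
# standard torch.distributed and DistributedDataParallel calls apply.
dist.init_process_group(backend="smddp")
```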

# SageMaker AI data parallelism library release notes
<a name="data-parallel-release-notes"></a>

See the following release notes to track the latest updates for the SageMaker AI distributed data parallelism (SMDDP) library.

## The SageMaker AI distributed data parallelism library v2.5.0
<a name="data-parallel-release-notes-20241017"></a>

*Date: October 17, 2024*

**New features**
+ Added support for PyTorch v2.4.1 with CUDA v12.1.

**Integration into Docker containers distributed by the SageMaker AI model parallelism (SMP) library**

This version of the SMDDP library is migrated to [The SageMaker model parallelism library v2.6.0](model-parallel-release-notes.md#model-parallel-release-notes-20241017).

```
658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121
```

For Regions where the SMP Docker images are available, see [AWS Regions](distributed-model-parallel-support-v2.md#distributed-model-parallel-availablity-zone-v2).

**Binary file of this release**

You can download or install the library using the following URL.

```
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.4.1/cu121/2024-10-09/smdistributed_dataparallel-2.5.0-cp311-cp311-linux_x86_64.whl
```

## The SageMaker AI distributed data parallelism library v2.3.0
<a name="data-parallel-release-notes-20240611"></a>

*Date: June 11, 2024*

**New features**
+ Added support for PyTorch v2.3.0 with CUDA v12.1 and Python v3.11.
+ Added support for PyTorch Lightning v2.2.5. This is integrated into the SageMaker AI framework container for PyTorch v2.3.0.
+ Added instance type validation during import to prevent loading the SMDDP library on unsupported instance types. For a list of instance types compatible with the SMDDP library, see [Supported frameworks, AWS Regions, and instances types](distributed-data-parallel-support.md).

**Integration into SageMaker AI Framework Containers**

This version of the SMDDP library is migrated to the following [SageMaker AI Framework Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).
+ PyTorch v2.3.0

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker
  ```

For a complete list of versions of the SMDDP library and the pre-built containers, see [Supported frameworks, AWS Regions, and instances types](distributed-data-parallel-support.md).

**Binary file of this release**

You can download or install the library using the following URL.

```
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed_dataparallel-2.3.0-cp311-cp311-linux_x86_64.whl
```

**Other changes**
+ The SMDDP library v2.2.0 is integrated into the SageMaker AI framework container for PyTorch v2.2.0.

## The SageMaker AI distributed data parallelism library v2.2.0
<a name="data-parallel-release-notes-20240304"></a>

*Date: March 4, 2024*

**New features**
+ Added support for PyTorch v2.2.0 with CUDA v12.1.

**Integration into Docker containers distributed by the SageMaker AI model parallelism (SMP) library**

This version of the SMDDP library is migrated to [The SageMaker model parallelism library v2.2.0](model-parallel-release-notes.md#model-parallel-release-notes-20240307).

```
658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121
```

For Regions where the SMP Docker images are available, see [AWS Regions](distributed-model-parallel-support-v2.md#distributed-model-parallel-availablity-zone-v2).

**Binary file of this release**

You can download or install the library using the following URL.

```
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed_dataparallel-2.2.0-cp310-cp310-linux_x86_64.whl
```

## The SageMaker AI distributed data parallelism library v2.1.0
<a name="data-parallel-release-notes-20240301"></a>

*Date: March 1, 2024*

**New features**
+ Added support for PyTorch v2.1.0 with CUDA v12.1.

**Bug fixes**
+ Fixed the CPU memory leak issue in [SMDDP v2.0.1](#data-parallel-release-notes-20231207).

**Integration into SageMaker AI Framework Containers**

This version of the SMDDP library passed benchmark testing and is migrated to the following [SageMaker AI Framework Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).
+ PyTorch v2.1.0

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker
  ```

**Integration into Docker containers distributed by the SageMaker AI model parallelism (SMP) library**

This version of the SMDDP library is migrated to [The SageMaker model parallelism library v2.1.0](model-parallel-release-notes.md#model-parallel-release-notes-20240206).

```
658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121
```

For Regions where the SMP Docker images are available, see [AWS Regions](distributed-model-parallel-support-v2.md#distributed-model-parallel-availablity-zone-v2).

**Binary file of this release**

You can download or install the library using the following URL.

```
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed_dataparallel-2.1.0-cp310-cp310-linux_x86_64.whl
```

## The SageMaker AI distributed data parallelism library v2.0.1
<a name="data-parallel-release-notes-20231207"></a>

*Date: December 7, 2023*

**New features**
+ Added a new SMDDP-implementation of `AllGather` collective operation optimized for AWS compute resources and network infrastructure. To learn more, see [SMDDP `AllGather` collective operation](data-parallel-intro.md#data-parallel-allgather).
+ The SMDDP `AllGather` collective operation is compatible with PyTorch FSDP and DeepSpeed. To learn more, see [Use the SMDDP library in your PyTorch training script](data-parallel-modify-sdp-pt.md).
+ Added support for PyTorch v2.0.1

**Known issues**
+ There's a CPU memory leak issue from a gradual CPU memory increase while training with SMDDP `AllReduce` in DDP mode.

**Integration into SageMaker AI Framework Containers**

This version of the SMDDP library passed benchmark testing and is migrated to the following [SageMaker AI Framework Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).
+ PyTorch v2.0.1

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker
  ```

**Binary file of this release**

You can download or install the library using the following URL.

```
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl
```

**Other changes**
+ Starting from this release, documentation for the SMDDP library is fully available in this *Amazon SageMaker AI Developer Guide*. In favor of the complete developer guide for SMDDP v2 in the *Amazon SageMaker AI Developer Guide*, the [additional reference for SMDDP v1.x](https://sagemaker.readthedocs.io/en/stable/api/training/smd_data_parallel.html) in the *SageMaker Python SDK documentation* is no longer supported. If you still need the SMDDP v1.x documentation, see the following snapshot of the documentation at [SageMaker Python SDK v2.212.0 documentation](https://sagemaker.readthedocs.io/en/v2.212.0/api/training/distributed.html#the-sagemaker-distributed-data-parallel-library).