

# Distributed training in Amazon SageMaker AI
<a name="distributed-training"></a>

SageMaker AI provides distributed training libraries and supports various distributed training options for deep learning tasks such as computer vision (CV) and natural language processing (NLP). With SageMaker AI’s distributed training libraries, you can run highly scalable and cost-effective custom data parallel and model parallel deep learning training jobs. You can also use other distributed training frameworks and packages such as PyTorch DistributedDataParallel (DDP), `torchrun`, MPI (`mpirun`), and parameter server. The following section gives information about fundamental distributed training concepts. Throughout the documentation, instructions and examples focus on how to set up the distributed training options for deep learning tasks using the SageMaker Python SDK.

**Tip**  
To learn best practices for distributed computing of machine learning (ML) training and processing jobs in general, see [Distributed computing with SageMaker AI best practices](distributed-training-options.md).

## Distributed training concepts
<a name="distributed-training-basic-concepts"></a>

 SageMaker AI’s distributed training libraries use the following distributed training terms and features. 

**Datasets and Batches**
+ **Training Dataset**: All of the data you use to train the model.
+ **Global batch size**: The number of records selected from the training dataset in each iteration to send to the GPUs in the cluster. This is the number of records over which the gradient is computed at each iteration. If data parallelism is used, it is equal to the total number of model replicas multiplied by the per-replica batch size: `global batch size = (the number of model replicas) * (per-replica batch size)`. A single batch of global batch size is often referred to as the *mini-batch* in machine learning literature.
+ **Per-replica batch size:** When data parallelism is used, this is the number of records sent to each model replica. Each model replica performs a forward and backward pass with this batch to calculate weight updates. The resulting weight updates are synchronized (averaged) across all replicas before the next set of per-replica batches are processed. 
+ **Micro-batch**: A subset of the mini-batch or, if hybrid model and data parallelism is used, a subset of the per-replica sized batch. When you use SageMaker AI’s distributed model parallelism library, each micro-batch is fed into the training pipeline one-by-one and follows an [execution schedule](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html#model-parallel-pipeline-execution) defined by the library's runtime.

**Training**
+ **Epoch**: One training cycle through the entire dataset. It is common to have multiple iterations per epoch. The number of epochs you use in training depends on your model and use case.
+ **Iteration**: A single forward and backward pass performed using a global batch sized batch (a mini-batch) of training data. The number of iterations performed during training is determined by the global batch size and the number of epochs used for training. For example, if a dataset includes 5,000 samples, and you use a global batch size of 500, it will take 10 iterations to complete a single epoch.
+ **Learning rate**: A variable that influences the amount that weights are changed in response to the calculated error of the model. The learning rate plays an important role in the model’s ability to converge as well as the speed and optimality of convergence.
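The relationships among these quantities can be sketched with simple arithmetic (the numbers below are hypothetical, chosen to match the iteration example above):

```python
import math

# Hypothetical data-parallel setup (illustrative numbers only)
dataset_size = 5000           # records in the training dataset
per_replica_batch_size = 125  # records sent to each model replica per iteration
num_replicas = 4              # for example, one model replica per GPU

# Global batch size: records over which the gradient is computed per iteration
global_batch_size = num_replicas * per_replica_batch_size

# Iterations needed to complete one epoch (one full pass over the dataset)
iterations_per_epoch = math.ceil(dataset_size / global_batch_size)

print(global_batch_size)     # 500
print(iterations_per_epoch)  # 10
```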

**Instances and GPUs**
+ **Instances**: An AWS [machine learning compute instance](https://aws.amazon.com/sagemaker/pricing/). These are also referred to as *nodes*.
+ **Cluster size**: When using SageMaker AI's distributed training library, this is the number of instances multiplied by the number of GPUs in each instance. For example, if you use two ml.p3.8xlarge instances in a training job, which have 4 GPUs each, the cluster size is 8. While increasing cluster size can lead to faster training times, communication between instances must be optimized; otherwise, communication between the nodes can add overhead and lead to slower training times. The SageMaker AI distributed training library is designed to optimize communication between Amazon EC2 ML compute instances, leading to higher device utilization and faster training times.

**Distributed Training Solutions**
+ **Data parallelism**: A strategy in distributed training where a training dataset is split up across multiple GPUs in a compute cluster, which consists of multiple Amazon EC2 ML Instances. Each GPU contains a *replica* of the model, receives different batches of training data, performs a forward and backward pass, and shares weight updates with the other nodes for synchronization before moving on to the next batch and ultimately another epoch.
+ **Model parallelism**: A strategy in distributed training where the model is partitioned across multiple GPUs in a compute cluster, which consists of multiple Amazon EC2 ML Instances. The model might be complex and have a large number of hidden layers and weights, making it unable to fit in the memory of a single instance. Each GPU carries a subset of the model, through which the data flows and the transformations are shared and compiled. The efficiency of model parallelism, in terms of GPU utilization and training time, is heavily dependent on how the model is partitioned and the execution schedule used to perform forward and backward passes.
+ **Pipeline Execution Schedule** (**Pipelining**): The pipeline execution schedule determines the order in which computations (micro-batches) are made and data is processed across devices during model training. Pipelining is a technique to achieve true parallelization in model parallelism and overcome the performance loss due to sequential computation by having the GPUs compute simultaneously on different data samples. To learn more, see [Pipeline Execution Schedule](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html#model-parallel-pipeline-execution). 
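As a toy illustration of pipelining (a hypothetical GPipe-style forward schedule, not the library's actual runtime), the following sketch shows how splitting a batch into micro-batches lets all GPUs compute simultaneously once the pipeline fills:

```python
# Toy pipeline schedule: micro-batch m reaches pipeline stage (GPU) s at time
# step m + s, so after a ramp-up of num_gpus - 1 steps, every GPU is busy.

def forward_schedule(num_gpus, num_microbatches):
    """Map each time step to the list of (gpu, microbatch) pairs computing then."""
    schedule = {}
    for m in range(num_microbatches):
        for s in range(num_gpus):
            schedule.setdefault(m + s, []).append((s, m))
    return schedule

schedule = forward_schedule(num_gpus=4, num_microbatches=4)

# At time step 3, the pipeline is full: all four GPUs work on different micro-batches
print(schedule[3])  # [(3, 0), (2, 1), (1, 2), (0, 3)]
```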

### Advanced concepts
<a name="distributed-training-advanced-concepts"></a>

Machine Learning (ML) practitioners commonly face two scaling challenges when training models: *scaling model size* and *scaling training data*. While model size and complexity can result in better accuracy, there is a limit to the model size you can fit into a single CPU or GPU. Furthermore, scaling model size may result in more computations and longer training times.

Not all models handle training data scaling equally well: some algorithms need to ingest the entire training dataset *in memory* for training, so they only scale vertically, to bigger and bigger instance types. In most cases, scaling training data results in longer training times.

Deep Learning (DL) is a specific family of ML algorithms consisting of several layers of artificial neural networks. The most common training method is mini-batch Stochastic Gradient Descent (SGD). In mini-batch SGD, the model is trained by making small iterative changes to its coefficients in the direction that reduces its error. Those iterations are conducted on equally sized subsamples of the training dataset called *mini-batches*. For each mini-batch, the model is run on each record of the mini-batch, its error is measured, and the gradient of the error is estimated. Then the average gradient is computed across all the records of the mini-batch and provides an update direction for each model coefficient. One full pass over the training dataset is called an *epoch*. Model training commonly consists of dozens to hundreds of epochs. Mini-batch SGD has several benefits: First, its iterative design makes training time theoretically linear in dataset size. Second, in a given mini-batch each record is processed individually by the model, with no inter-record communication needed other than the final gradient average. The processing of a mini-batch is consequently particularly suitable for parallelization and distribution.
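The mini-batch SGD loop described above can be sketched in a few lines of plain Python (a toy 1-D linear model with hypothetical numbers, not a SageMaker API):

```python
import random

# Minimal mini-batch SGD on a 1-D linear model y = w * x, illustrating the
# per-record gradient computation and the mini-batch gradient average.
random.seed(0)
data = [(x, 3.0 * x) for x in range(1, 101)]  # true coefficient w = 3.0

w = 0.0
learning_rate = 1e-4
mini_batch_size = 10

for epoch in range(20):                    # one epoch = one pass over the data
    random.shuffle(data)
    for i in range(0, len(data), mini_batch_size):
        mini_batch = data[i:i + mini_batch_size]
        # Gradient of the squared error for each record, then averaged over the batch
        grads = [2 * (w * x - y) * x for x, y in mini_batch]
        avg_grad = sum(grads) / len(grads)
        w -= learning_rate * avg_grad      # step in the direction reducing error

print(round(w, 3))  # converges toward 3.0
```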

Parallelizing SGD training by distributing the records of a mini-batch over different computing devices is called *data parallel distributed training*, and is the most commonly used DL distribution paradigm. Data parallel training is a relevant distribution strategy to scale the mini-batch size and process each mini-batch faster. However, data parallel training comes with the extra complexity of computing the mini-batch gradient average from the gradients of all the workers and communicating the result back to all the workers, a step called *allreduce*. This step can represent a growing overhead as the training cluster is scaled, and can drastically penalize training time if improperly implemented or implemented over an unsuitable hardware substrate.
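A minimal sketch of the averaging that allreduce computes (ignoring the communication topology and overlap that real implementations optimize):

```python
# Toy simulation of the allreduce step in data-parallel training: each worker
# computes a gradient on its shard of the mini-batch, and after allreduce every
# worker holds the same element-wise average.

def all_reduce_average(worker_grads):
    """Average per-worker gradient vectors element-wise."""
    num_workers = len(worker_grads)
    return [sum(g) / num_workers for g in zip(*worker_grads)]

# Gradients from 4 workers, each a vector over 3 model parameters
worker_grads = [
    [0.4, -0.2, 1.0],
    [0.0, -0.6, 1.4],
    [0.8,  0.2, 0.6],
    [0.4, -0.2, 1.0],
]
avg = all_reduce_average(worker_grads)
print(avg)  # approximately [0.4, -0.2, 1.0]
```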

Data parallel SGD still requires developers to be able to fit at least the model and a single record in a computing device, such as a single CPU or GPU. When training very large models such as large transformers in Natural Language Processing (NLP), or segmentation models over high-resolution images, there may be situations in which this is not feasible. An alternative way to break up the workload is to partition the model over multiple computing devices, an approach called *model-parallel distributed training*. 
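A toy sketch of the model-parallel idea (hypothetical stand-in layers, not a real partitioning algorithm): the model's layers are assigned to devices in contiguous blocks, and activations flow sequentially from one device's partition to the next:

```python
# A "model" of 8 layers is split across 4 devices; a forward pass moves
# activations device to device. The lambdas are stand-ins for real layers.

layers = [lambda x, k=k: x + k for k in range(8)]

def partition(layers, num_devices):
    """Assign a contiguous block of layers to each device."""
    per_device = len(layers) // num_devices
    return [layers[i * per_device:(i + 1) * per_device] for i in range(num_devices)]

devices = partition(layers, num_devices=4)

def forward(x):
    # Activations flow sequentially through the devices holding the partitions
    for device_layers in devices:
        for layer in device_layers:
            x = layer(x)
    return x

print(forward(0))  # 0 + (0 + 1 + ... + 7) = 28
```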

# Get started with distributed training in Amazon SageMaker AI
<a name="distributed-training-get-started"></a>

The following page gives information about the steps needed to get started with distributed training in Amazon SageMaker AI. If you’re already familiar with distributed training, choose one of the following options that matches your preferred strategy or framework to get started. If you want to learn about distributed training in general, see [Distributed training concepts](distributed-training.md#distributed-training-basic-concepts).

The SageMaker AI distributed training libraries are optimized for the SageMaker training environment, help adapt your distributed training jobs to SageMaker AI, and improve training speed and throughput. The libraries offer both data parallel and model parallel training strategies. They combine software and hardware technologies to improve inter-GPU and inter-node communications, and extend SageMaker AI’s training capabilities with built-in options that require minimal code changes to your training scripts. 

## Before you get started
<a name="distributed-training-before-getting-started"></a>

SageMaker Training supports distributed training on a single instance as well as on multiple instances, so you can run training of any size at scale. We recommend using the framework estimator classes such as [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator) and [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) in the SageMaker Python SDK, which are the training job launchers with various distributed training options. When you create an estimator object, the object sets up the distributed training infrastructure, runs the `CreateTrainingJob` API in the backend, finds the Region where your current session is running, and pulls one of the pre-built AWS Deep Learning Containers, prepackaged with a number of libraries including deep learning frameworks, distributed training frameworks, and the [EFA](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) driver. If you want to mount an FSx file system to the training instances, you need to pass your VPC subnet and security group ID to the estimator. Before running your distributed training job in SageMaker AI, read the following general guidance on the basic infrastructure setup.
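As a sketch of the VPC configuration mentioned above (the entry point, role, framework versions, and IDs below are placeholders, not values from this guide), you pass the subnet and security group IDs directly to the estimator:

```python
from sagemaker.pytorch import PyTorch

# Placeholder values; replace with your own script, role ARN, subnet, and
# security group so the training instances can reach your FSx file system.
estimator = PyTorch(
    entry_point="train.py",
    role="<your-iam-role-arn>",
    framework_version="2.0.0",
    py_version="py310",
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    subnets=["subnet-0123456789abcdef0"],
    security_group_ids=["sg-0123456789abcdef0"],
)
```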

### Availability zones and network backplane
<a name="availability-zones"></a>

When using multiple instances (also called *nodes*), it’s important to understand the network that connects the instances, how they read the training data, and how they share information between themselves. For example, when you run a distributed data-parallel training job, a number of factors, such as communication between the nodes of a compute cluster for running the `AllReduce` operation and data transfer between the nodes and data storage in Amazon Simple Storage Service or Amazon FSx for Lustre, play a crucial role to achieve an optimal use of compute resources and a faster training speed. To reduce communication overhead, make sure that you configure instances, VPC subnet, and data storage in the same AWS Region and Availability Zone.

### GPU instances with faster network and high-throughput storage
<a name="optimized-GPU"></a>

You can technically use any instances for distributed training. For cases where you need to run multi-node distributed training jobs for training large models, such as large language models (LLMs) and diffusion models, which require faster inter-node communication, we recommend [EFA-enabled GPU instances supported by SageMaker AI](http://aws.amazon.com/about-aws/whats-new/2021/05/amazon-sagemaker-supports-elastic-fabric-adapter-distributed-training/). In particular, to achieve the most performant distributed training job in SageMaker AI, we recommend [P4d and P4de instances equipped with NVIDIA A100 GPUs](http://aws.amazon.com/ec2/instance-types/p4/). These are also equipped with high-throughput, low-latency local instance storage and a faster intra-node network. For data storage, we recommend [Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html), which provides high throughput for storing training datasets and model checkpoints.

**Use the SageMaker AI distributed data parallelism (SMDDP) library**

The SMDDP library improves communication between nodes with implementations of `AllReduce` and `AllGather` collective communication operations that are optimized for AWS network infrastructure and Amazon SageMaker AI ML instance topology. You can use the [SMDDP library as the backend of PyTorch-based distributed training packages](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html): [PyTorch distributed data parallel (DDP)](https://pytorch.org/docs/stable/notes/ddp.html), [PyTorch fully sharded data parallelism (FSDP)](https://pytorch.org/docs/stable/fsdp.html), [DeepSpeed](https://github.com/microsoft/DeepSpeed), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed). The following code example shows how to set a `PyTorch` estimator for launching a distributed training job on two `ml.p4d.24xlarge` instances.

```
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    ...,
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    # Activate distributed training with SMDDP
    distribution={ "pytorchddp": { "enabled": True } }  # mpirun, activates SMDDP AllReduce OR AllGather
    # distribution={ "torch_distributed": { "enabled": True } }  # torchrun, activates SMDDP AllGather
    # distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }  # mpirun, activates SMDDP AllReduce OR AllGather
)
```

To learn how to prepare your training script and launch a distributed data-parallel training job on SageMaker AI, see [Run distributed training with the SageMaker AI distributed data parallelism library](data-parallel.md).

**Use the SageMaker AI model parallelism library (SMP)**

SageMaker AI provides the SMP library and supports various distributed training techniques, such as sharded data parallelism, pipelining, tensor parallelism, optimizer state sharding, and more. To learn more about what the SMP library offers, see [Core Features of the SageMaker Model Parallelism Library](model-parallel-core-features.md).

To use SageMaker AI's model parallelism library, configure the `distribution` parameter of the SageMaker AI framework estimators. Supported framework estimators are [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator) and [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator). The following code example shows how to construct a framework estimator for distributed training with the model parallelism library on two `ml.p4d.24xlarge` instances.

```
from sagemaker.framework import Framework

distribution={
    "smdistributed": {
        "modelparallel": {
            "enabled":True,
            "parameters": {
                ...   # enter parameter key-value pairs here
            }
        },
    },
    "mpi": {
        "enabled" : True,
        ...           # enter parameter key-value pairs here
    }
}

estimator = Framework(
    ...,
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    distribution=distribution
)
```

To learn how to adapt your training script, configure distribution parameters in the `estimator` class, and launch a distributed training job, see [SageMaker AI's model parallelism library](model-parallel.md) (see also [Distributed Training APIs](https://sagemaker.readthedocs.io/en/stable/api/training/distributed.html#the-sagemaker-distributed-model-parallel-library) in the *SageMaker Python SDK documentation*).

**Use open source distributed training frameworks**

SageMaker AI also supports the following options, which run `mpirun` and `torchrun` in the backend.
+ To use [PyTorch DistributedDataParallel (DDP)](https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html) in SageMaker AI with the `mpirun` backend, add `distribution={"pytorchddp": {"enabled": True}}` to your PyTorch estimator. For more information, see also [PyTorch Distributed Training](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training) and [SageMaker AI PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator)'s `distribution` argument in the *SageMaker Python SDK documentation*.
**Note**  
This option is available for PyTorch 1.12.0 and later.

  ```
  from sagemaker.pytorch import PyTorch
  
  estimator = PyTorch(
      ...,
      instance_count=2,
      instance_type="ml.p4d.24xlarge",
      distribution={"pytorchddp": {"enabled": True}}  # runs mpirun in the backend
  )
  ```
+ SageMaker AI supports the [PyTorch `torchrun` launcher](https://pytorch.org/docs/stable/elastic/run.html) for distributed training on GPU-based Amazon EC2 instances, such as P3 and P4, as well as Trn1 powered by the [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/) device. 

  To use [PyTorch DistributedDataParallel (DDP)](https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html) in SageMaker AI with the `torchrun` backend, add `distribution={"torch_distributed": {"enabled": True}}` to the PyTorch estimator.
**Note**  
This option is available for PyTorch 1.13.0 and later.

  The following code snippet shows an example of constructing a SageMaker AI PyTorch estimator to run distributed training on two `ml.p4d.24xlarge` instances with the `torch_distributed` distribution option.

  ```
  from sagemaker.pytorch import PyTorch
  
  estimator = PyTorch(
      ...,
      instance_count=2,
      instance_type="ml.p4d.24xlarge",
      distribution={"torch_distributed": {"enabled": True}}   # runs torchrun in the backend
  )
  ```

  For more information, see [Distributed PyTorch Training](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training) and [SageMaker AI PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator)'s `distribution` argument in the *SageMaker Python SDK documentation*.

  **Notes for distributed training on Trn1**

  A Trn1 instance consists of up to 16 Trainium devices, and each Trainium device consists of two [NeuronCores](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuroncores-arch.html#neuroncores-v2-arch). For specs of the AWS Trainium devices, see [Trainium Architecture](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trn1-arch.html#id2) in the *AWS Neuron Documentation*.

  To train on the Trainium-powered instances, you only need to specify the Trn1 instance code, `ml.trn1.*`, as a string for the `instance_type` argument of the SageMaker AI PyTorch estimator class. To find available Trn1 instance types, see [AWS Trn1 Architecture](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trn1-arch.html#aws-trn1-arch) in the *AWS Neuron documentation*.
**Note**  
SageMaker Training on Amazon EC2 Trn1 instances is currently available only for the PyTorch framework in the AWS Deep Learning Containers for PyTorch Neuron starting v1.11.0. To find a complete list of supported versions of PyTorch Neuron, see [Neuron Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers) in the *AWS Deep Learning Containers GitHub repository*.

  When you launch a training job on Trn1 instances using the SageMaker Python SDK, SageMaker AI automatically picks up and runs the right container from [Neuron Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers) provided by AWS Deep Learning Containers. The Neuron Containers are prepackaged with training environment settings and dependencies for easier adaptation of your training job to the SageMaker Training platform and Amazon EC2 Trn1 instances.
**Note**  
To run your PyTorch training job on Trn1 instances with SageMaker AI, you should modify your training script to initialize process groups with the `xla` backend and use [PyTorch/XLA](https://pytorch.org/xla/release/1.12/index.html). To support the XLA adoption process, the AWS Neuron SDK provides PyTorch Neuron, which uses XLA to convert PyTorch operations to Trainium instructions. To learn how to modify your training script, see [Developer Guide for Training with PyTorch Neuron (`torch-neuronx`)](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide.html) in the *AWS Neuron Documentation*.

  For more information, see [Distributed Training with PyTorch Neuron on Trn1 instances](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#id24) and [SageMaker AI PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator)'s `distribution` argument in the *SageMaker Python SDK documentation*.
+ To use MPI in SageMaker AI, add `distribution={"mpi": {"enabled": True}}` to your estimator. The MPI distribution option is available for the following frameworks: MXNet, PyTorch, and TensorFlow.
+ To use a parameter server in SageMaker AI, add `distribution={"parameter_server": {"enabled": True}}` to your estimator. The parameter server option is available for the following frameworks: MXNet, PyTorch, and TensorFlow. 
**Tip**  
For more information about using the MPI and parameter server options per framework, use the following links to the *SageMaker Python SDK documentation*.  
[MXNet Distributed Training](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html#distributed-training) and [SageMaker AI MXNet Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/sagemaker.mxnet.html#mxnet-estimator)'s `distribution` argument
[PyTorch Distributed Training](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training) and [SageMaker AI PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator)'s `distribution` argument
[TensorFlow Distributed Training](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#distributed-training) and [SageMaker AI TensorFlow Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator)'s `distribution` argument.

# Strategies for distributed training
<a name="distributed-training-strategies"></a>

Distributed training is usually split into two approaches: data parallel and model parallel. *Data parallel* is the most common approach to distributed training: You have a lot of data, batch it up, and send blocks of data to multiple CPUs or GPUs (nodes) to be processed by the neural network or ML algorithm, then combine the results. The neural network is the same on each node. A *model parallel* approach is used with large models that won’t fit in a node’s memory in one piece; it breaks up the model and places different parts on different nodes. In this situation, you need to send your batches of data out to each node so that the data is processed on all parts of the model.

The terms *network* and *model* are often used interchangeably: A large model is really a large network with many layers and parameters. Training with a large network produces a large model, and loading the model back onto the network with all your pre-trained parameters and their weights loads a large model into memory. When you break apart a model to split it across nodes, you’re also breaking apart the underlying network. A network consists of layers, and to split up the network, you put layers on different compute devices.

A common pitfall of naively splitting layers across devices is severe GPU under-utilization. Training is inherently sequential in both the forward and backward passes, and at a given time only one GPU can actively compute, while the others wait for activations to be sent. Modern model parallel libraries solve this problem by using pipeline execution schedules to improve device utilization. However, only Amazon SageMaker AI's distributed model parallel library includes automatic model splitting. The two core features of the library, automatic model splitting and pipeline execution scheduling, simplify the process of implementing model parallelism by making automated decisions that lead to efficient device utilization.

## Train with data parallel and model parallel
<a name="distributed-training-data-model-parallel"></a>

If you are training with a large dataset, start with a data parallel approach. If you run out of memory during training, you may want to switch to a model parallel approach, or try hybrid model and data parallelism. You can also try the following to improve performance with data parallel:
+ Change your model’s hyperparameters. 
+ Reduce the batch size, and keep reducing it until it fits. If you reduce the batch size to 1 and still run out of memory, then you should try model-parallel training.

Try gradient compression (FP16, INT8):
+ On NVIDIA TensorCore-equipped hardware, using [mixed precision training](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) creates both speed-up and memory consumption reduction.
+ SageMaker AI's distributed data parallelism library supports Automatic Mixed Precision (AMP) out of the box. No extra action is needed to enable AMP other than the framework-level modifications to your training script. If gradients are in FP16, the SageMaker AI data parallelism library runs its `AllReduce` operation in FP16. For more information about implementing AMP APIs to your training script, see the following resources:
  + [Frameworks - PyTorch](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#pytorch) in the *NVIDIA Deep Learning Performance documentation*
  + [Frameworks - TensorFlow](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#tensorflow) in the *NVIDIA Deep Learning Performance documentation*
  + [Automatic Mixed Precision for Deep Learning](https://developer.nvidia.com/automatic-mixed-precision) in the *NVIDIA Developer Docs*
  + [Introducing native PyTorch automatic mixed precision for faster training on NVIDIA GPUs](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/) in the *PyTorch Blog*
  + [TensorFlow mixed precision APIs](https://www.tensorflow.org/guide/mixed_precision) in the *TensorFlow documentation*

Try reducing the input size:
+ Reduce the NLP sequence length. If you increase the sequence length, you need to adjust the batch size down, or adjust the GPUs up to spread the batch.
+ Reduce image resolution. 

Check if you use batch normalization, since this can impact convergence. When you use distributed training, your batch is split across GPUs, and the effect of a much lower batch size can be a higher error rate, which can prevent the model from converging. For example, if you prototyped your network on a single GPU with a batch size of 64, then scaled up to four p3dn.24xlarge instances, you now have 32 GPUs, and your per-GPU batch size drops from 64 to 2. This will likely break the convergence you saw with a single node.
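The per-GPU batch arithmetic in this example can be sketched as follows (pure illustration; a common remedy, not covered here, is to scale the global batch size with the cluster and retune the learning rate):

```python
# How a fixed global batch shrinks per GPU as the cluster grows.

def per_gpu_batch_size(global_batch_size, num_instances, gpus_per_instance):
    return global_batch_size // (num_instances * gpus_per_instance)

# Prototype: a single GPU with batch size 64
print(per_gpu_batch_size(64, num_instances=1, gpus_per_instance=1))  # 64

# Scaled up: four p3dn.24xlarge instances with 8 GPUs each (32 GPUs total)
print(per_gpu_batch_size(64, num_instances=4, gpus_per_instance=8))  # 2
```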

Start with model-parallel training when: 
+  Your model does not fit on a single device. 
+ Due to your model size, you’re facing limitations in choosing larger batch sizes, such as if your model weights take up most of your GPU memory and you are forced to choose a smaller, suboptimal batch size.  

To learn more about the SageMaker AI distributed libraries, see the following:
+  [Run distributed training with the SageMaker AI distributed data parallelism library](data-parallel.md) 
+  [(Archived) SageMaker model parallelism library v1.x](model-parallel.md) 

# Distributed training optimization
<a name="distributed-training-optimize"></a>

Customize hyperparameters for your use case and your data to get the best scaling efficiency. In the following discussion, we highlight some of the most impactful training variables and provide references to state-of-the-art implementations so you can learn more about your options. Also, we recommend that you refer to your preferred framework’s distributed training documentation. 
+  [Apache MXNet distributed training](https://mxnet.apache.org/versions/1.7/api/faq/distributed_training) 
+  [PyTorch distributed training](https://pytorch.org/tutorials/beginner/dist_overview.html) 
+  [TensorFlow distributed training](https://www.tensorflow.org/guide/distributed_training) 

## Batch Size
<a name="batch-size-intro"></a>

SageMaker AI distributed toolkits generally allow you to train on bigger batches. For example, if a model fits within a single device but can only be trained with a small batch size, using either model-parallel training or data parallel training enables you to experiment with larger batch sizes. 

Be aware that batch size directly influences model accuracy by controlling the amount of noise in the model update at each iteration. Increasing batch size reduces the amount of noise in the gradient estimation, which can be beneficial when increasing from very small batch sizes, but can result in degraded model accuracy as the batch size increases to large values.

**Tip**  
Adjust your hyperparameters to ensure that your model trains to a satisfying convergence as you increase its batch size.

A number of techniques have been developed to maintain good model convergence when the batch size is increased.

## Mini-batch size
<a name="distributed-training-mini-batch"></a>

In SGD, the mini-batch size quantifies the amount of noise present in the gradient estimation. A small mini-batch results in a very noisy mini-batch gradient, which is not representative of the true gradient over the dataset. A large mini-batch results in a mini-batch gradient close to the true gradient over the dataset, and potentially not noisy enough, making training likely to get stuck in poor local minima.
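This noise argument can be demonstrated empirically with a toy simulation (hypothetical synthetic "gradients"; the variance of the mini-batch mean shrinks roughly as 1 over the batch size):

```python
import random
import statistics

# Toy demonstration that larger mini-batches give a less noisy gradient
# estimate: the variance of the mini-batch mean shrinks as the batch grows.
random.seed(42)
population = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # per-record "gradients"

def minibatch_mean_variance(batch_size, trials=2000):
    means = [statistics.fmean(random.sample(population, batch_size))
             for _ in range(trials)]
    return statistics.pvariance(means)

small = minibatch_mean_variance(batch_size=8)
large = minibatch_mean_variance(batch_size=256)
print(small > large)  # the small batch's gradient estimate is noisier
```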

To learn more about these techniques, see the following papers:
+ [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/pdf/1706.02677.pdf), Goyal et al. 
+ [PowerAI DDL](https://arxiv.org/pdf/1708.02188.pdf), Cho et al. 
+ [Scale Out for Large Minibatch SGD: Residual Network Training on ImageNet-1K with Improved Accuracy and Reduced Time to Train](https://arxiv.org/pdf/1711.04291.pdf), Codreanu et al. 
+ [ImageNet Training in Minutes](https://arxiv.org/pdf/1709.05011.pdf), You et al. 
+ [Large Batch Training of Convolutional Networks](https://arxiv.org/pdf/1708.03888.pdf), You et al. 
+ [Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes](https://arxiv.org/pdf/1904.00962.pdf), You et al. 
+ [Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes](https://arxiv.org/pdf/2006.13484.pdf), Zheng et al. 
+ [Deep Gradient Compression](https://arxiv.org/abs/1712.01887), Lin et al. 

# Scaling training
<a name="distributed-training-scenarios"></a>

The following sections cover scenarios in which you may want to scale up training, and how you can do so using AWS resources. You may want to scale training in one of the following situations:
+ Scaling from a single GPU to many GPUs
+ Scaling from a single instance to multiple instances
+ Using custom training scripts

## Scaling from a single GPU to many GPUs
<a name="scaling-from-one-GPU"></a>

The amount of data or the size of the model used in machine learning can create situations in which the time to train a model is longer than you are willing to wait. Sometimes, the training doesn’t work at all because the model or the training data is too large. One solution is to increase the number of GPUs you use for training. On an instance with multiple GPUs, like a `p3.16xlarge` that has eight GPUs, the data and processing are split across the eight GPUs. When you use distributed training libraries, this can result in a near-linear speedup in the time it takes to train your model: it takes slightly more than 1/8 the time it would have taken on a `p3.2xlarge` with one GPU.


|  Instance type  |  GPUs  | 
| --- | --- | 
|  p3.2xlarge  |  1  | 
|  p3.8xlarge  |  4  | 
|  p3.16xlarge  |  8  | 
|  p3dn.24xlarge  |  8  | 

**Note**  
The `ml` instance types used by SageMaker training have the same number of GPUs as the corresponding `p3` instance types. For example, `ml.p3.8xlarge` has the same number of GPUs as `p3.8xlarge`: four. 

## Scaling from a single instance to multiple instances
<a name="scaling-from-one-instance"></a>

If you want to scale your training even further, you can use more instances. However, you should choose a larger instance type before you add more instances. Review the previous table to see how many GPUs are in each p3 instance type. 

If you have made the jump from a single GPU on a `p3.2xlarge` to four GPUs on a `p3.8xlarge`, but decide that you require more processing power, you may see better performance and incur lower costs if you choose a `p3.16xlarge` before trying to increase instance count. Depending on the libraries you use, keeping your training on a single instance typically delivers better performance and lower costs than using multiple instances.

When you are ready to scale the number of instances, you can do so with the SageMaker AI Python SDK estimator by setting `instance_count`. For example, you can set `instance_type="ml.p3.16xlarge"` and `instance_count=2`. Instead of the eight GPUs on a single `p3.16xlarge`, you have 16 GPUs across two identical instances. The following chart shows [scaling and throughput starting with eight GPUs](https://aws.amazon.com/blogs/machine-learning/scalable-multi-node-training-with-tensorflow/) on a single instance and increasing to a total of 256 GPUs across multiple instances. 

 ![\[Chart showing how throughput increases and time to train decreases with more GPUs.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/Distributed-Training-in-SageMaker-image.png) 
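The arithmetic behind this scaling can be sketched as follows. The GPU counts come from the table above; the helper function is illustrative only and not part of the SageMaker Python SDK:

```
# Illustrative sketch: total GPU count for a training cluster, given the
# instance_type and instance_count values you would pass to a SageMaker
# estimator. GPU counts per instance come from the table above.
GPUS_PER_INSTANCE = {
    "ml.p3.2xlarge": 1,
    "ml.p3.8xlarge": 4,
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
}

def total_gpus(instance_type, instance_count):
    """Number of GPUs across the whole cluster."""
    return GPUS_PER_INSTANCE[instance_type] * instance_count

print(total_gpus("ml.p3.16xlarge", 2))  # prints 16
```

With near-linear scaling, doubling `total_gpus` roughly halves the time to train, which is the trend the chart above illustrates.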

## Custom training scripts
<a name="custom-training-scripts"></a>

While SageMaker AI makes it simple to deploy and scale the number of instances and GPUs, managing the data and results can be very challenging depending on your framework of choice, which is why external supporting libraries are often used. This most basic form of distributed training requires you to modify your training script to manage the data distribution. 

SageMaker AI also supports Horovod and implementations of distributed training native to each major deep learning framework. If you choose to use examples from these frameworks, you can follow SageMaker AI’s [container guide](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html) for Deep Learning Containers, and various [example notebooks](https://sagemaker-examples.readthedocs.io/en/latest/training/bring_your_own_container.html) that demonstrate implementations. 

# Run distributed training with the SageMaker AI distributed data parallelism library
<a name="data-parallel"></a>

The SageMaker AI distributed data parallelism (SMDDP) library extends SageMaker training capabilities on deep learning models with near-linear scaling efficiency by providing implementations of collective communication operations optimized for AWS infrastructure.

When training large machine learning (ML) models, such as large language models (LLMs) and diffusion models, on a huge training dataset, ML practitioners use clusters of accelerators and distributed training techniques to reduce the time to train or to resolve memory constraints for models that cannot fit in a single GPU's memory. ML practitioners often start with multiple accelerators on a single instance and then scale to clusters of instances as their workload requirements increase. As the cluster size increases, so does the communication overhead between multiple nodes, which leads to a drop in overall computational performance.

To address such overhead and memory problems, the SMDDP library offers the following.
+ The SMDDP library optimizes training jobs for AWS network infrastructure and Amazon SageMaker AI ML instance topology.
+ The SMDDP library improves communication between nodes with implementations of `AllReduce` and `AllGather` collective communication operations that are optimized for AWS infrastructure. 

To learn more about the details of the SMDDP library offerings, proceed to [Introduction to the SageMaker AI distributed data parallelism library](data-parallel-intro.md).

For more information about training with the model-parallel strategy offered by SageMaker AI, see also [(Archived) SageMaker model parallelism library v1.x](model-parallel.md).

**Topics**
+ [Introduction to the SageMaker AI distributed data parallelism library](data-parallel-intro.md)
+ [Supported frameworks, AWS Regions, and instance types](distributed-data-parallel-support.md)
+ [Distributed training with the SageMaker AI distributed data parallelism library](data-parallel-modify-sdp.md)
+ [Amazon SageMaker AI data parallelism library examples](distributed-data-parallel-v2-examples.md)
+ [Configuration tips for the SageMaker AI distributed data parallelism library](data-parallel-config.md)
+ [Amazon SageMaker AI distributed data parallelism library FAQ](data-parallel-faq.md)
+ [Troubleshooting for distributed training in Amazon SageMaker AI](distributed-troubleshooting-data-parallel.md)
+ [SageMaker AI data parallelism library release notes](data-parallel-release-notes.md)

# Introduction to the SageMaker AI distributed data parallelism library
<a name="data-parallel-intro"></a>

The SageMaker AI distributed data parallelism (SMDDP) library is a collective communication library that improves compute performance of distributed data parallel training. The SMDDP library addresses communications overhead of the key collective communication operations by offering the following.

1. The library offers `AllReduce` optimized for AWS. `AllReduce` is a key operation used for synchronizing gradients across GPUs at the end of each training iteration during distributed data parallel training.

1. The library offers `AllGather` optimized for AWS. `AllGather` is another key operation used in sharded data parallel training, which is a memory-efficient data parallelism technique offered by popular libraries such as the SageMaker AI model parallelism (SMP) library, DeepSpeed Zero Redundancy Optimizer (ZeRO), and PyTorch Fully Sharded Data Parallelism (FSDP).

1. The library performs optimized node-to-node communication by fully utilizing AWS network infrastructure and the Amazon EC2 instance topology. 

The SMDDP library can increase training speed as you scale your training cluster, with near-linear scaling efficiency.

**Note**  
The SageMaker AI distributed training libraries are available through the AWS deep learning containers for PyTorch and Hugging Face within the SageMaker Training platform. To use the libraries, you must use the SageMaker Python SDK or the SageMaker APIs through SDK for Python (Boto3) or AWS Command Line Interface. Throughout the documentation, instructions and examples focus on how to use the distributed training libraries with the SageMaker Python SDK.

## SMDDP collective communication operations optimized for AWS compute resources and network infrastructure
<a name="data-parallel-collective-operations"></a>

The SMDDP library provides implementations of the `AllReduce` and `AllGather` collective operations that are optimized for AWS compute resources and network infrastructure.

### SMDDP `AllReduce` collective operation
<a name="data-parallel-allreduce"></a>

The SMDDP library achieves optimal overlapping of the `AllReduce` operation with the backward pass, significantly improving GPU utilization. It achieves near-linear scaling efficiency and faster training speed by optimizing kernel operations between CPUs and GPUs. The library performs `AllReduce` in parallel while the GPUs compute gradients, without taking away additional GPU cycles, which makes training faster.
+ *Leverages CPUs*: The library uses CPUs to `AllReduce` gradients, offloading this task from the GPUs. 
+ *Improved GPU usage*: The cluster’s GPUs focus on computing gradients, improving their utilization throughout training.
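The contract of the `AllReduce` operation itself can be sketched in plain Python. This is an illustration of the operation's semantics only, not the SMDDP implementation, and the function names are ours:

```
# AllReduce semantics sketch: every worker contributes a gradient buffer
# and every worker receives the same reduced (here, averaged) result.
def all_reduce_mean(gradients_per_worker):
    """Element-wise mean across workers, broadcast back to every worker."""
    n = len(gradients_per_worker)
    summed = [sum(vals) for vals in zip(*gradients_per_worker)]
    mean = [s / n for s in summed]
    # each worker ends up holding an identical copy of the result
    return [list(mean) for _ in gradients_per_worker]

outs = all_reduce_mean([[1.0, 2.0], [3.0, 4.0]])
# every worker now holds [2.0, 3.0]
```

In real data parallel training, this averaging over per-worker gradients is what keeps the model replicas synchronized after each iteration.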

The following is the high-level workflow of the SMDDP `AllReduce` operation.

1. The library assigns ranks to GPUs (workers).

1. At each iteration, the library divides each global batch by the total number of workers (world size) and assigns small batches (batch shards) to the workers.
   + The size of the global batch is `(number of nodes in a cluster) * (number of GPUs per node) * (batch shard size)`. 
   + A batch shard (small batch) is a subset of the dataset assigned to each GPU (worker) per iteration. 

1. The library launches a training script on each worker.

1. The library manages copies of model weights and gradients from the workers at the end of every iteration.

1. The library synchronizes model weights and gradients across the workers to aggregate a single trained model.
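The batch-sharding arithmetic in steps 1–2 above can be sketched as follows. The function and variable names are illustrative, not SMDDP APIs:

```
# Sketch of splitting a global batch into per-worker shards, as in the
# workflow above: world size = nodes * GPUs per node, and each worker
# (rank) receives one contiguous shard per iteration.
def shard_batch(global_batch, num_nodes, gpus_per_node):
    world_size = num_nodes * gpus_per_node
    assert len(global_batch) % world_size == 0, "global batch must divide evenly"
    shard_size = len(global_batch) // world_size
    # worker with rank r gets records [r*shard_size, (r+1)*shard_size)
    return [global_batch[r * shard_size:(r + 1) * shard_size]
            for r in range(world_size)]

# a global batch of 24 records over 3 nodes with 4 GPUs each:
shards = shard_batch(list(range(24)), num_nodes=3, gpus_per_node=4)
# 12 workers, 2 records per worker
```

Each worker computes gradients on its own shard, and the synchronization in steps 4–5 merges those results into a single trained model.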

The following architecture diagram shows an example of how the library sets up data parallelism for a cluster of 3 nodes. 

 

![\[SMDDP AllReduce and data parallelism architecture diagram\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/data-parallel/sdp-architecture.png)


### SMDDP `AllGather` collective operation
<a name="data-parallel-allgather"></a>

`AllGather` is a collective operation where each worker starts with an input buffer, and then concatenates or *gathers* the input buffers from all other workers into an output buffer.
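The operation's contract can be sketched in plain Python. This illustrates the semantics only, not the SMDDP implementation:

```
# AllGather semantics sketch: every worker contributes an input buffer
# and every worker receives the concatenation of all buffers, in rank order.
def all_gather(input_buffers):
    """Concatenate all workers' buffers; every worker gets the full result."""
    output = [x for buf in input_buffers for x in buf]
    return [list(output) for _ in input_buffers]

outs = all_gather([[1, 2], [3, 4], [5, 6]])
# every one of the 3 workers now holds [1, 2, 3, 4, 5, 6]
```

In sharded data parallelism, the "input buffer" is a worker's parameter shard, and the gathered output reconstructs the full layer before the forward and backward passes.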

**Note**  
The SMDDP `AllGather` collective operation is available in `smdistributed-dataparallel>=2.0.1` and AWS Deep Learning Containers (DLC) for PyTorch v2.0.1 and later.

`AllGather` is heavily used in distributed training techniques such as sharded data parallelism where each individual worker holds a fraction of a model, or a sharded layer. The workers call `AllGather` before forward and backward passes to reconstruct the sharded layers. The forward and backward passes continue onward after the parameters are *all gathered*. During the backward pass, each worker also calls `ReduceScatter` to collect (reduce) gradients and break (scatter) them into gradient shards to update the corresponding sharded layer. For more details on the role of these collective operations in sharded data parallelism, see the [SMP library's implementation of sharded data parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html), [ZeRO](https://deepspeed.readthedocs.io/en/latest/zero3.html#) in the DeepSpeed documentation, and the blog about [PyTorch Fully Sharded Data Parallelism](https://engineering.fb.com/2021/07/15/open-source/fsdp/).

Because collective operations like `AllGather` are called in every iteration, they are the main contributors to GPU communication overhead. Faster computation of these collective operations directly translates to a shorter training time with no side effects on convergence. To achieve this, the SMDDP library offers `AllGather` optimized for [P4d instances](https://aws.amazon.com/ec2/instance-types/p4/).

SMDDP `AllGather` uses the following techniques to improve computational performance on P4d instances.

1. It transfers data between instances (inter-node) through the [Elastic Fabric Adapter (EFA)](https://aws.amazon.com/hpc/efa/) network with a mesh topology. EFA is the AWS low-latency and high-throughput network solution. A mesh topology for inter-node network communication is more tailored to the characteristics of EFA and AWS network infrastructure. Compared to the NCCL ring or tree topology that involves multiple packet hops, SMDDP avoids accumulating latency from multiple hops as it only needs one hop. SMDDP implements a network rate control algorithm that balances the workload to each communication peer in a mesh topology and achieves a higher global network throughput.

1. It adopts [low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology (GDRCopy)](https://github.com/NVIDIA/gdrcopy) to coordinate local NVLink and EFA network traffic. GDRCopy, a low-latency GPU memory copy library offered by NVIDIA, provides low-latency communication between CPU processes and GPU CUDA kernels. With this technology, the SMDDP library is able to pipeline the intra-node and inter-node data movement.

1. It reduces the usage of GPU streaming multiprocessors to increase compute power for running model kernels. P4d and P4de instances are equipped with NVIDIA A100 GPUs, which each have 108 streaming multiprocessors. While NCCL takes up to 24 streaming multiprocessors to run collective operations, SMDDP uses fewer than 9 streaming multiprocessors. Model compute kernels pick up the saved streaming multiprocessors for faster computation.

# Supported frameworks, AWS Regions, and instance types
<a name="distributed-data-parallel-support"></a>

Before using the SageMaker AI distributed data parallelism (SMDDP) library, check the supported ML frameworks and instance types, and whether there are sufficient quotas in your AWS account and AWS Region.

## Supported frameworks
<a name="distributed-data-parallel-supported-frameworks"></a>

The following tables show the deep learning frameworks and their versions that SageMaker AI and SMDDP support. The SMDDP library is available in [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only), integrated in [Docker containers distributed by the SageMaker model parallelism (SMP) library v2](distributed-model-parallel-support-v2.md#distributed-model-parallel-supported-frameworks-v2), or downloadable as a binary file.

**Note**  
To check the latest updates and release notes of the SMDDP library, see the [SageMaker AI data parallelism library release notes](data-parallel-release-notes.md).

**Topics**
+ [PyTorch](#distributed-data-parallel-supported-frameworks-pytorch)
+ [PyTorch Lightning](#distributed-data-parallel-supported-frameworks-lightning)
+ [Hugging Face Transformers](#distributed-data-parallel-supported-frameworks-transformers)
+ [TensorFlow (deprecated)](#distributed-data-parallel-supported-frameworks-tensorflow)

### PyTorch
<a name="distributed-data-parallel-supported-frameworks-pytorch"></a>


| PyTorch version | SMDDP library version | SageMaker AI Framework Container images pre-installed with SMDDP | SMP Docker images pre-installed with SMDDP | URL of the binary file\*\* | 
| --- | --- | --- | --- | --- | 
| v2.4.1 | smdistributed-dataparallel==v2.5.0 | Not available | 658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.4.1/cu121/2024-10-09/smdistributed\_dataparallel-2.5.0-cp311-cp311-linux\_x86\_64.whl | 
| v2.3.0 | smdistributed-dataparallel==v2.3.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker | Currently not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed\_dataparallel-2.3.0-cp311-cp311-linux\_x86\_64.whl | 
| v2.2.0 | smdistributed-dataparallel==v2.2.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.2.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed\_dataparallel-2.2.0-cp310-cp310-linux\_x86\_64.whl | 
| v2.1.0 | smdistributed-dataparallel==v2.1.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed\_dataparallel-2.1.0-cp310-cp310-linux\_x86\_64.whl | 
| v2.0.1 | smdistributed-dataparallel==v2.0.1 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed\_dataparallel-2.0.2-cp310-cp310-linux\_x86\_64.whl | 
| v2.0.0 | smdistributed-dataparallel==v1.8.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.0/cu118/2023-03-20/smdistributed\_dataparallel-1.8.0-cp310-cp310-linux\_x86\_64.whl | 
| v1.13.1 | smdistributed-dataparallel==v1.7.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.13.1/cu117/2023-01-09/smdistributed\_dataparallel-1.7.0-cp39-cp39-linux\_x86\_64.whl | 
| v1.12.1 | smdistributed-dataparallel==v1.6.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.1/cu113/2022-12-05/smdistributed\_dataparallel-1.6.0-cp38-cp38-linux\_x86\_64.whl | 
| v1.12.0 | smdistributed-dataparallel==v1.5.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed\_dataparallel-1.5.0-cp38-cp38-linux\_x86\_64.whl | 
| v1.11.0 | smdistributed-dataparallel==v1.4.1 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.11.0/cu113/2022-04-14/smdistributed\_dataparallel-1.4.1-cp38-cp38-linux\_x86\_64.whl | 

\*\* The URLs of the binary files are for installing the SMDDP library in custom containers. For more information, see [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md).

**Note**  
The SMDDP library is available in AWS Regions where the [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) and the [SMP Docker images](distributed-model-parallel-support-v2.md) are in service.

**Note**  
The SMDDP library v1.4.0 and later works as a backend of PyTorch distributed (`torch.distributed`) data parallelism (`torch.nn.parallel.DistributedDataParallel`). In accordance with the change, the following [smdistributed APIs](https://sagemaker.readthedocs.io/en/stable/api/training/sdp_versions/latest/smd_data_parallel_pytorch.html#pytorch-api) for the PyTorch distributed package have been deprecated.  
`smdistributed.dataparallel.torch.distributed` is deprecated. Use the [torch.distributed](https://pytorch.org/docs/stable/distributed.html) package instead.
`smdistributed.dataparallel.torch.parallel.DistributedDataParallel` is deprecated. Use the [torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) API instead.
If you need to use the previous versions of the library (v1.3.0 or before), see the [archived SageMaker AI distributed data parallelism documentation](https://sagemaker.readthedocs.io/en/stable/api/training/sdp_versions/latest.html#documentation-archive) in the *SageMaker AI Python SDK documentation*.

### PyTorch Lightning
<a name="distributed-data-parallel-supported-frameworks-lightning"></a>

The SMDDP library is available for PyTorch Lightning in the following SageMaker AI Framework Containers for PyTorch and the SMP Docker containers.

**PyTorch Lightning v2**


| PyTorch Lightning version | PyTorch version | SMDDP library version | SageMaker AI Framework Container images pre-installed with SMDDP | SMP Docker images pre-installed with SMDDP | URL of the binary file\*\* | 
| --- | --- | --- | --- | --- | --- | 
| 2.2.5 | 2.3.0 | smdistributed-dataparallel==v2.3.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker | Currently not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed\_dataparallel-2.3.0-cp311-cp311-linux\_x86\_64.whl | 
| 2.2.0 | 2.2.0 | smdistributed-dataparallel==v2.2.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.2.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed\_dataparallel-2.2.0-cp310-cp310-linux\_x86\_64.whl | 
| 2.1.2 | 2.1.0 | smdistributed-dataparallel==v2.1.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed\_dataparallel-2.1.0-cp310-cp310-linux\_x86\_64.whl | 
| 2.1.0 | 2.0.1 | smdistributed-dataparallel==v2.0.1 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed\_dataparallel-2.0.2-cp310-cp310-linux\_x86\_64.whl | 

**PyTorch Lightning v1**


| PyTorch Lightning version | PyTorch version | SMDDP library version | SageMaker AI Framework Container images pre-installed with SMDDP | URL of the binary file\*\* | 
| --- | --- | --- | --- | --- | 
| 1.7.2, 1.7.0, 1.6.4, 1.6.3, 1.5.10 | 1.12.0 | smdistributed-dataparallel==v1.5.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed\_dataparallel-1.5.0-cp38-cp38-linux\_x86\_64.whl | 

\*\* The URLs of the binary files are for installing the SMDDP library in custom containers. For more information, see [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md).

**Note**  
PyTorch Lightning and its utility libraries such as Lightning Bolts are not preinstalled in the PyTorch DLCs. When you construct a SageMaker AI PyTorch estimator and submit a training job request in [Step 2](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-framework-estimator), you need to provide `requirements.txt` to install `pytorch-lightning` and `lightning-bolts` in the SageMaker AI PyTorch training container.  

```
# requirements.txt
pytorch-lightning
lightning-bolts
```
For more information about specifying the source directory to place the `requirements.txt` file along with your training script and a job submission, see [Using third-party libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#id12) in the *Amazon SageMaker AI Python SDK documentation*.

### Hugging Face Transformers
<a name="distributed-data-parallel-supported-frameworks-transformers"></a>

The AWS Deep Learning Containers for Hugging Face use the SageMaker Training Containers for PyTorch and TensorFlow as their base images. To look up the Hugging Face Transformers library versions and paired PyTorch and TensorFlow versions, see the latest [Hugging Face Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers) and the [Prior Hugging Face Container Versions](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#prior-hugging-face-container-versions).

### TensorFlow (deprecated)
<a name="distributed-data-parallel-supported-frameworks-tensorflow"></a>

**Important**  
The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow later than v2.11.0. The following table lists previous DLCs for TensorFlow with the SMDDP library installed.


| TensorFlow version | SMDDP library version | 
| --- | --- | 
| 2.9.1, 2.10.1, 2.11.0 |  smdistributed-dataparallel==v1.4.1  | 
| 2.8.3 |  smdistributed-dataparallel==v1.3.0  | 

## AWS Regions
<a name="distributed-data-parallel-availablity-zone"></a>

The SMDDP library is available in all of the AWS Regions where the [AWS Deep Learning Containers for SageMaker AI](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) and the [SMP Docker images](distributed-model-parallel-support-v2.md) are in service.

## Supported instance types
<a name="distributed-data-parallel-supported-instance-types"></a>

The SMDDP library requires one of the following instance types.


| Instance type | 
| --- | 
| ml.p3dn.24xlarge\* | 
| ml.p4d.24xlarge | 
| ml.p4de.24xlarge | 

**Tip**  
To properly run distributed training on the EFA-enabled instance types, you should enable traffic between the instances by setting up the security group of your VPC to allow all inbound and outbound traffic to and from the security group itself. To learn how to set up the security group rules, see [Step 1: Prepare an EFA-enabled security group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security) in the *Amazon EC2 User Guide*.

**Important**  
\* The SMDDP library has discontinued support for optimizing its collective communication operations on P3 instances. While you can still utilize the SMDDP optimized `AllReduce` collective on `ml.p3dn.24xlarge` instances, there will be no further development support to enhance performance on this instance type. Note that the SMDDP optimized `AllGather` collective is only available for P4 instances.

For specs of the instance types, see the **Accelerated Computing** section in the [Amazon EC2 Instance Types page](https://aws.amazon.com/ec2/instance-types/). For information about instance pricing, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

If you encounter an error message similar to the following, follow the instructions at [Request a service quota increase for SageMaker AI resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure).

```
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling
the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge
for training job usage' is 0 Instances, with current utilization of 0 Instances
and a request delta of 1 Instances.
Please contact AWS support to request an increase for this limit.
```

# Distributed training with the SageMaker AI distributed data parallelism library
<a name="data-parallel-modify-sdp"></a>

The SageMaker AI distributed data parallelism (SMDDP) library is designed for ease of use and to provide seamless integration with PyTorch.

When training a deep learning model with the SMDDP library on SageMaker AI, you can focus on writing your training script and model training. 

To get started, import the SMDDP library to use its collective operations optimized for AWS. The following topics provide instructions on what to add to your training script depending on which collective operation you want to optimize.

**Topics**
+ [Adapting your training script to use the SMDDP collective operations](data-parallel-modify-sdp-select-framework.md)
+ [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md)

# Adapting your training script to use the SMDDP collective operations
<a name="data-parallel-modify-sdp-select-framework"></a>

The training script examples provided in this section are simplified and highlight only the required changes to enable the SageMaker AI distributed data parallelism (SMDDP) library in your training script. For end-to-end Jupyter notebook examples that demonstrate how to run a distributed training job with the SMDDP library, see [Amazon SageMaker AI data parallelism library examples](distributed-data-parallel-v2-examples.md).

**Topics**
+ [Use the SMDDP library in your PyTorch training script](data-parallel-modify-sdp-pt.md)
+ [Use the SMDDP library in your PyTorch Lightning training script](data-parallel-modify-sdp-pt-lightning.md)
+ [Use the SMDDP library in your TensorFlow training script (deprecated)](data-parallel-modify-sdp-tf2.md)

# Use the SMDDP library in your PyTorch training script
<a name="data-parallel-modify-sdp-pt"></a>

Starting from the SageMaker AI distributed data parallelism (SMDDP) library v1.4.0, you can use the library as a backend option for the [PyTorch distributed package](https://pytorch.org/tutorials/beginner/dist_overview.html). To use the SMDDP `AllReduce` and `AllGather` collective operations, you only need to import the SMDDP library at the beginning of your training script and set SMDDP as the backend of PyTorch distributed modules during process group initialization. With this single line of backend specification, you can keep all the native PyTorch distributed modules and the entire training script unchanged. The following code snippets show how to use the SMDDP library as the backend of PyTorch-based distributed training packages: [PyTorch distributed data parallel (DDP)](https://pytorch.org/docs/stable/notes/ddp.html), [PyTorch fully sharded data parallelism (FSDP)](https://pytorch.org/docs/stable/fsdp.html), [DeepSpeed](https://github.com/microsoft/DeepSpeed), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed).

## For PyTorch DDP or FSDP
<a name="data-parallel-enable-for-ptddp-ptfsdp"></a>

Initialize the process group as follows.

```
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp

dist.init_process_group(backend="smddp")
```

**Note**  
(For PyTorch DDP jobs only) The `smddp` backend currently does not support creating subprocess groups with the `torch.distributed.new_group()` API. You also cannot use the `smddp` backend concurrently with other process group backends such as `NCCL` and `Gloo`.

## For DeepSpeed or Megatron-DeepSpeed
<a name="data-parallel-enable-for-deepspeed"></a>

Initialize the process group as follows.

```
import deepspeed
import smdistributed.dataparallel.torch.torch_smddp

deepspeed.init_distributed(dist_backend="smddp")
```

**Note**  
To use SMDDP `AllGather` with the `mpirun`-based launchers (`smdistributed` and `pytorchddp`) in [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md), you also need to set the following environment variable in your training script.  

```
export SMDATAPARALLEL_OPTIMIZE_SDP=true
```

For general guidance on writing a PyTorch FSDP training script, see [Advanced Model Training with Fully Sharded Data Parallel (FSDP)](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) in the PyTorch documentation.

For general guidance on writing a PyTorch DDP training script, see [Getting started with distributed data parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) in the PyTorch documentation.

After you have completed adapting your training script, proceed to [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md).

# Use the SMDDP library in your PyTorch Lightning training script
<a name="data-parallel-modify-sdp-pt-lightning"></a>

If you want to bring your [PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/latest/starter/introduction.html) training script and run a distributed data parallel training job in SageMaker AI, you can run the training job with minimal changes in your training script. The necessary changes include the following: import the `smdistributed.dataparallel` library’s PyTorch modules, set up the environment variables for PyTorch Lightning to accept the SageMaker AI environment variables that are preset by the SageMaker training toolkit, and activate the SMDDP library by setting the process group backend to `"smddp"`. To learn more, walk through the following instructions that break down the steps with code examples.

**Note**  
The PyTorch Lightning support is available in the SageMaker AI data parallel library v1.5.0 and later.

## PyTorch Lightning == v2.1.0 and PyTorch == 2.0.1
<a name="smddp-pt-201-lightning-210"></a>

1. Import the `lightning` (PyTorch Lightning) library and the `smdistributed.dataparallel.torch` modules.

   ```
   import lightning as pl
   import smdistributed.dataparallel.torch.torch_smddp
   ```

1. Instantiate the [LightningEnvironment](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.plugins.environments.LightningEnvironment.html).

   ```
   import os

   from lightning.fabric.plugins.environments.lightning import LightningEnvironment
   
   env = LightningEnvironment()
   env.world_size = lambda: int(os.environ["WORLD_SIZE"])
   env.global_rank = lambda: int(os.environ["RANK"])
   ```

1. **For PyTorch DDP** – Create an object of the [DDPStrategy](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.DDPStrategy.html) class with `"smddp"` for `process_group_backend` and `"gpu"` for `accelerator`, and pass that to the [Trainer](https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html) class.

   ```
   import lightning as pl
   from lightning.pytorch.strategies import DDPStrategy
   
   ddp = DDPStrategy(
       cluster_environment=env, 
       process_group_backend="smddp", 
       accelerator="gpu"
   )
   
   trainer = pl.Trainer(
       max_epochs=200, 
       strategy=ddp, 
       devices=num_gpus, 
       num_nodes=num_nodes
   )
   ```

   **For PyTorch FSDP** – Create an object of the [FSDPStrategy](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.FSDPStrategy.html) class (with [wrapping policy](https://pytorch.org/docs/stable/fsdp.html) of choice) with `"smddp"` for `process_group_backend` and `"gpu"` for `accelerator`, and pass that to the [Trainer](https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html) class.

   ```
   import lightning as pl
   from lightning.pytorch.strategies import FSDPStrategy
   
   from functools import partial
   from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
   
   policy = partial(
       size_based_auto_wrap_policy, 
       min_num_params=10000
   )
   
   fsdp = FSDPStrategy(
       auto_wrap_policy=policy,
       process_group_backend="smddp", 
       cluster_environment=env
   )
   
   trainer = pl.Trainer(
       max_epochs=200, 
       strategy=fsdp, 
       devices=num_gpus, 
       num_nodes=num_nodes
   )
   ```

After you have completed adapting your training script, proceed to [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md). 

**Note**  
When you construct a SageMaker AI PyTorch estimator and submit a training job request in [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md), you need to provide `requirements.txt` to install `pytorch-lightning` and `lightning-bolts` in the SageMaker AI PyTorch training container.  

```
# requirements.txt
pytorch-lightning
lightning-bolts
```
For more information about specifying the source directory to place the `requirements.txt` file along with your training script and a job submission, see [Using third-party libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#id12) in the *Amazon SageMaker AI Python SDK documentation*.

# Use the SMDDP library in your TensorFlow training script (deprecated)
<a name="data-parallel-modify-sdp-tf2"></a>

**Important**  
The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow later than v2.11.0. To find previous TensorFlow DLCs with the SMDDP library installed, see [Supported frameworks](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks).

The following steps show you how to modify a TensorFlow training script to utilize SageMaker AI's distributed data parallel library.  

The library APIs are designed to be similar to Horovod APIs. For additional details on each API that the library offers for TensorFlow, see the [SageMaker AI distributed data parallel TensorFlow API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_data_parallel.html#api-documentation).

**Note**  
SageMaker AI distributed data parallel works with TensorFlow training scripts composed of `tf` core modules, except for `tf.keras` modules. SageMaker AI distributed data parallel does not support TensorFlow with Keras implementations.

**Note**  
The SageMaker AI distributed data parallelism library supports Automatic Mixed Precision (AMP) out of the box. No extra action is needed to enable AMP other than the framework-level modifications to your training script. If gradients are in FP16, the SageMaker AI data parallelism library runs its `AllReduce` operation in FP16. For more information about implementing AMP APIs to your training script, see the following resources:  
[Frameworks - TensorFlow](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#tensorflow) in the *NVIDIA Deep Learning Performance documentation*
[Automatic Mixed Precision for Deep Learning](https://developer.nvidia.com/automatic-mixed-precision) in the *NVIDIA Developer Docs*
[TensorFlow mixed precision APIs](https://www.tensorflow.org/guide/mixed_precision) in the *TensorFlow documentation*

1. Import the library's TensorFlow client and initialize it.

   ```
   import smdistributed.dataparallel.tensorflow as sdp 
   sdp.init()
   ```

1. Pin each GPU to a single `smdistributed.dataparallel` process with `local_rank`—this refers to the relative rank of the process within a given node. The `sdp.tensorflow.local_rank()` API provides you with the local rank of the device. The leader node is rank 0, and the worker nodes are rank 1, 2, 3, and so on. This is invoked in the following code block as `sdp.local_rank()`. `set_memory_growth` is not directly related to SageMaker AI distributed, but must be set for distributed training with TensorFlow. 

   ```
   gpus = tf.config.experimental.list_physical_devices('GPU')
   for gpu in gpus:
       tf.config.experimental.set_memory_growth(gpu, True)
   if gpus:
       tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')
   ```

1. Scale the learning rate by the number of workers. The `sdp.tensorflow.size()` API provides you the number of workers in the cluster. This is invoked in the following code block as `sdp.size()`. 

   ```
   learning_rate = learning_rate * sdp.size()
   ```

1. Use the library’s `DistributedGradientTape` to optimize `AllReduce` operations during training. This wraps `tf.GradientTape`.  

   ```
   with tf.GradientTape() as tape:
       output = model(input)
       loss_value = loss(label, output)

   # SageMaker AI data parallel: Wrap tf.GradientTape with the library's DistributedGradientTape
   tape = sdp.DistributedGradientTape(tape)
   ```

1. Broadcast the initial model variables from the leader node (rank 0) to all the worker nodes (ranks 1 through n). This is needed to ensure a consistent initialization across all the worker ranks. Use the `sdp.tensorflow.broadcast_variables` API after the model and optimizer variables are initialized. This is invoked in the following code block as `sdp.broadcast_variables()`. 

   ```
   sdp.broadcast_variables(model.variables, root_rank=0)
   sdp.broadcast_variables(opt.variables(), root_rank=0)
   ```

1. Finally, modify your script to save checkpoints only on the leader node. The leader node has a synchronized model. This also avoids worker nodes overwriting the checkpoints and possibly corrupting the checkpoints. 

   ```
   if sdp.rank() == 0:
       checkpoint.save(checkpoint_dir)
   ```

The following is an example TensorFlow training script for distributed training with the library.

```
import tensorflow as tf

# SageMaker AI data parallel: Import the library TF API
import smdistributed.dataparallel.tensorflow as sdp

# SageMaker AI data parallel: Initialize the library
sdp.init()

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    # SageMaker AI data parallel: Pin GPUs to a single library process
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')

# Prepare Dataset
dataset = tf.data.Dataset.from_tensor_slices(...)

# Define Model
mnist_model = tf.keras.Sequential(...)
loss = tf.losses.SparseCategoricalCrossentropy()

# SageMaker AI data parallel: Scale Learning Rate
# LR for 8 node run : 0.000125
# LR for single node run : 0.001
opt = tf.optimizers.Adam(0.000125 * sdp.size())

@tf.function
def training_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        probs = mnist_model(images, training=True)
        loss_value = loss(labels, probs)

    # SageMaker AI data parallel: Wrap tf.GradientTape with the library's DistributedGradientTape
    tape = sdp.DistributedGradientTape(tape)

    grads = tape.gradient(loss_value, mnist_model.trainable_variables)
    opt.apply_gradients(zip(grads, mnist_model.trainable_variables))

    if first_batch:
       # SageMaker AI data parallel: Broadcast model and optimizer variables
       sdp.broadcast_variables(mnist_model.variables, root_rank=0)
       sdp.broadcast_variables(opt.variables(), root_rank=0)

    return loss_value

...

# SageMaker AI data parallel: Save checkpoints only from the leader node.
if sdp.rank() == 0:
    checkpoint.save(checkpoint_dir)
```

After you have completed adapting your training script, move on to [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md). 

# Launching distributed training jobs with SMDDP using the SageMaker Python SDK
<a name="data-parallel-use-api"></a>

To run a distributed training job with your adapted script from [Adapting your training script to use the SMDDP collective operations](data-parallel-modify-sdp-select-framework.md), use the SageMaker Python SDK's framework or generic estimators by specifying the prepared training script as an entry point script and the distributed training configuration.

This page walks you through how to use the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/api/training/index.html) in two ways.
+ If you want to achieve a quick adoption of your distributed training job in SageMaker AI, configure a SageMaker AI [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#sagemaker.pytorch.estimator.PyTorch) or [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) framework estimator class. The framework estimator picks up your training script and automatically matches the right image URI of the [pre-built PyTorch or TensorFlow Deep Learning Containers (DLC)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only), given the value specified to the `framework_version` parameter.
+ If you want to extend one of the pre-built containers or build a custom container to create your own ML environment with SageMaker AI, use the SageMaker AI generic `Estimator` class and specify the image URI of the custom Docker container hosted in your Amazon Elastic Container Registry (Amazon ECR).

Your training datasets should be stored in Amazon S3 or [Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html) in the AWS Region in which you are launching your training job. If you use Jupyter notebooks, you should have a SageMaker notebook instance or a SageMaker Studio Classic app running in the same AWS Region. For more information about storing your training data, see the [SageMaker Python SDK data inputs](https://sagemaker.readthedocs.io/en/stable/overview.html#use-file-systems-as-training-input) documentation. 

**Tip**  
We recommend that you use Amazon FSx for Lustre instead of Amazon S3 to improve training performance. Amazon FSx has higher throughput and lower latency than Amazon S3.

**Tip**  
To properly run distributed training on the EFA-enabled instance types, you should enable traffic between the instances by setting up the security group of your VPC to allow all inbound and outbound traffic to and from the security group itself. To learn how to set up the security group rules, see [Step 1: Prepare an EFA-enabled security group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security) in the *Amazon EC2 User Guide*.

Choose one of the following topics for instructions on how to run a distributed training job of your training script. After you launch a training job, you can monitor system utilization and model performance using [Amazon SageMaker Debugger](train-debugger.md) or Amazon CloudWatch.

While you follow instructions in the following topics to learn more about technical details, we also recommend that you try the [Amazon SageMaker AI data parallelism library examples](distributed-data-parallel-v2-examples.md) to get started.

**Topics**
+ [Use the PyTorch framework estimators in the SageMaker Python SDK](data-parallel-framework-estimator.md)
+ [Use the SageMaker AI generic estimator to extend pre-built DLC containers](data-parallel-use-python-skd-api.md)
+ [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md)

# Use the PyTorch framework estimators in the SageMaker Python SDK
<a name="data-parallel-framework-estimator"></a>

You can launch distributed training by adding the `distribution` argument to the SageMaker AI framework estimators, [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#sagemaker.pytorch.estimator.PyTorch) or [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator). For more details, choose one of the frameworks supported by the SageMaker AI distributed data parallelism (SMDDP) library from the following selections.

------
#### [ PyTorch ]

The following launcher options are available for launching PyTorch distributed training.
+ `pytorchddp` – This option runs `mpirun` and sets up environment variables needed for running PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the `distribution` parameter.

  ```
  { "pytorchddp": { "enabled": True } }
  ```
+ `torch_distributed` – This option runs `torchrun` and sets up environment variables needed for running PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the `distribution` parameter.

  ```
  { "torch_distributed": { "enabled": True } }
  ```
+ `smdistributed` – This option also runs `mpirun`, but with `smddprun`, which sets up environment variables needed for running PyTorch distributed training on SageMaker AI.

  ```
  { "smdistributed": { "dataparallel": { "enabled": True } } }
  ```

If you chose to replace NCCL `AllGather` with SMDDP `AllGather`, you can use any of the three options. Choose the option that fits your use case.

If you chose to replace NCCL `AllReduce` with SMDDP `AllReduce`, use one of the `mpirun`-based options: `smdistributed` or `pytorchddp`. You can also add additional MPI options as follows.

```
{ 
    "pytorchddp": {
        "enabled": True, 
        "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
    }
}
```

```
{ 
    "smdistributed": { 
        "dataparallel": {
            "enabled": True, 
            "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
        }
    }
}
```

The following code sample shows the basic structure of a PyTorch estimator with distributed training options.

```
from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    base_job_name="training_job_name_prefix",
    source_dir="subdirectory-to-your-code",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    py_version="py310",
    framework_version="2.0.1",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker AI data parallel library: 
    # ml.p4d.24xlarge, ml.p4de.24xlarge
    instance_type="ml.p4d.24xlarge",

    # Activate distributed training with SMDDP
    distribution={ "pytorchddp": { "enabled": True } }  # mpirun, activates SMDDP AllReduce OR AllGather
    # distribution={ "torch_distributed": { "enabled": True } }  # torchrun, activates SMDDP AllGather
    # distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }  # mpirun, activates SMDDP AllReduce OR AllGather
)

pt_estimator.fit("s3://bucket/path/to/training/data")
```

**Note**  
PyTorch Lightning and its utility libraries such as Lightning Bolts are not preinstalled in the SageMaker AI PyTorch DLCs. Create the following `requirements.txt` file and save in the source directory where you save the training script.  

```
# requirements.txt
pytorch-lightning
lightning-bolts
```
For example, the tree-structured directory should look like the following.  

```
├── pytorch_training_launcher_jupyter_notebook.ipynb
└── sub-folder-for-your-code
    ├──  adapted-training-script.py
    └──  requirements.txt
```
For more information about specifying the source directory to place the `requirements.txt` file along with your training script and a job submission, see [Using third-party libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#id12) in the *Amazon SageMaker AI Python SDK documentation*.

**Considerations for activating SMDDP collective operations and using the right distributed training launcher options**
+ SMDDP `AllReduce` and SMDDP `AllGather` are not mutually compatible at present.
+ SMDDP `AllReduce` is activated by default when you use one of the `mpirun`-based launchers (`smdistributed` or `pytorchddp`), and `AllGather` falls back to NCCL.
+ SMDDP `AllGather` is activated by default when you use the `torch_distributed` launcher, and `AllReduce` falls back to NCCL.
+ SMDDP `AllGather` can also be activated when using the `mpirun`-based launchers with an additional environment variable set as follows.

  ```
  export SMDATAPARALLEL_OPTIMIZE_SDP=true
  ```

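The launcher considerations above can be summarized with a small hypothetical helper (not part of the SageMaker Python SDK) that maps a launcher choice to the `distribution` dictionary it requires:

```python
def distribution_for(launcher, want_smddp_allreduce=False):
    """Return the `distribution` dict for a given launcher choice.

    A sketch summarizing the compatibility notes above; the function name
    and signature are illustrative only.
    """
    if launcher == "torch_distributed":
        if want_smddp_allreduce:
            # torchrun-based launch activates SMDDP AllGather only;
            # AllReduce falls back to NCCL.
            raise ValueError("SMDDP AllReduce requires an mpirun-based launcher")
        return {"torch_distributed": {"enabled": True}}
    if launcher == "pytorchddp":
        return {"pytorchddp": {"enabled": True}}
    if launcher == "smdistributed":
        return {"smdistributed": {"dataparallel": {"enabled": True}}}
    raise ValueError(f"unknown launcher: {launcher}")
```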
------
#### [ TensorFlow ]

**Important**  
The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow later than v2.11.0. To find previous TensorFlow DLCs with the SMDDP library installed, see [TensorFlow (deprecated)](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks-tensorflow).

```
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    base_job_name = "training_job_name_prefix",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    framework_version="2.11.0",
    py_version="py38",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker AI data parallel library: 
    # ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge
    instance_type="ml.p3.16xlarge",

    # Training using the SageMaker AI data parallel distributed training strategy
    distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }
)

tf_estimator.fit("s3://bucket/path/to/training/data")
```

------

# Use the SageMaker AI generic estimator to extend pre-built DLC containers
<a name="data-parallel-use-python-skd-api"></a>

You can customize SageMaker AI prebuilt containers or extend them to handle any additional functional requirements for your algorithm or model that the prebuilt SageMaker AI Docker image doesn't support. For an example of how you can extend a pre-built container, see [Extend a Prebuilt Container](https://docs.aws.amazon.com/sagemaker/latest/dg/prebuilt-containers-extend.html).

To extend a prebuilt container or adapt your own container to use the library, you must use one of the images listed in [Supported frameworks](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks).

**Note**  
Starting with TensorFlow 2.4.1 and PyTorch 1.8.1, SageMaker AI framework DLCs support EFA-enabled instance types. We recommend that you use the DLC images that contain TensorFlow 2.4.1 or later and PyTorch 1.8.1 or later. 

For example, if you use PyTorch, your Dockerfile should contain a `FROM` statement similar to the following:

```
# SageMaker AI PyTorch image
FROM 763104351884.dkr.ecr.<aws-region>.amazonaws.com/pytorch-training:<image-tag>

ENV PATH="/opt/ml/code:${PATH}"

# This environment variable is used by the SageMaker AI PyTorch container to determine the user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# /opt/ml and all subdirectories are utilized by SageMaker AI, use the /code subdirectory to store your user code.
COPY train.py /opt/ml/code/train.py

# Defines train.py as the script entry point
ENV SAGEMAKER_PROGRAM train.py
```

You can further customize your own Docker container to work with SageMaker AI using the [SageMaker Training toolkit](https://github.com/aws/sagemaker-training-toolkit) and the binary file of the SageMaker AI distributed data parallel library. To learn more, see the instructions in the following section.
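
As a sketch of how such a custom image fits with the generic `Estimator` class described earlier, the following keyword arguments mirror the framework estimator example; the image URI, role, and instance settings are placeholders, not real values.

```python
# Keyword arguments that would be passed to sagemaker.estimator.Estimator;
# every value below is a placeholder for illustration only.
estimator_kwargs = {
    # Custom image hosted in your Amazon ECR repository
    "image_uri": "<account-id>.dkr.ecr.<aws-region>.amazonaws.com/<repository>:<image-tag>",
    "role": "SageMakerRole",
    "instance_count": 2,
    # Instance types supported by the SMDDP library
    "instance_type": "ml.p4d.24xlarge",
    # Activate distributed training with SMDDP
    "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
}
```

With the SageMaker Python SDK installed, you would construct `Estimator(**estimator_kwargs)` and call its `fit()` method with your Amazon S3 training data path, as in the framework estimator example.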

# Create your own Docker container with the SageMaker AI distributed data parallel library
<a name="data-parallel-bring-your-own-container"></a>

To build your own Docker container for training and use the SageMaker AI data parallel library, you must include the correct dependencies and the binary files of the SageMaker AI distributed parallel libraries in your Dockerfile. This section provides instructions on how to create a complete Dockerfile with the minimum set of dependencies for distributed training in SageMaker AI using the data parallel library.

**Note**  
This custom Docker option with the SageMaker AI data parallel library as a binary is available only for PyTorch.

**To create a Dockerfile with the SageMaker training toolkit and the data parallel library**

1. Start with a Docker image from [NVIDIA CUDA](https://hub.docker.com/r/nvidia/cuda). Use the cuDNN developer versions that contain CUDA runtime and development tools (headers and libraries) to build from the [PyTorch source code](https://github.com/pytorch/pytorch#from-source).

   ```
   FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
   ```
**Tip**  
The official AWS Deep Learning Container (DLC) images are built from the [NVIDIA CUDA base images](https://hub.docker.com/r/nvidia/cuda). If you want to use the prebuilt DLC images as references while following the rest of the instructions, see the [AWS Deep Learning Containers for PyTorch Dockerfiles](https://github.com/aws/deep-learning-containers/tree/master/pytorch). 

1. Add the following arguments to specify versions of PyTorch and other packages. Also, indicate the Amazon S3 bucket paths to the SageMaker AI data parallel library and other software to use AWS resources, such as the Amazon S3 plug-in. 

   To use versions of the third party libraries other than the ones provided in the following code example, we recommend you look into the [official Dockerfiles of AWS Deep Learning Container for PyTorch](https://github.com/aws/deep-learning-containers/tree/master/pytorch/training/docker) to find versions that are tested, compatible, and suitable for your application. 

   To find URLs for the `SMDATAPARALLEL_BINARY` argument, see the lookup tables at [Supported frameworks](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks).

   ```
   ARG PYTORCH_VERSION=1.10.2
   ARG PYTHON_SHORT_VERSION=3.8
   ARG EFA_VERSION=1.14.1
   ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
   ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
   ARG CONDA_PREFIX="/opt/conda"
   ARG BRANCH_OFI=1.1.3-aws
   ```

1. Set the following environment variables to properly build SageMaker training components and run the data parallel library. You use these variables for the components in the subsequent steps.

   ```
   # Set ENV variables required to build PyTorch
   ENV TORCH_CUDA_ARCH_LIST="7.0+PTX 8.0"
   ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
   ENV NCCL_VERSION=2.10.3
   
   # Add OpenMPI to the path.
   ENV PATH /opt/amazon/openmpi/bin:$PATH
   
   # Add Conda to path
   ENV PATH $CONDA_PREFIX/bin:$PATH
   
   # Set this environment variable for SageMaker AI to launch SMDDP correctly.
   ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main
   
   # Add environment variable for processes to be able to call fork()
   ENV RDMAV_FORK_SAFE=1
   
   # Indicate the container type
   ENV DLC_CONTAINER_TYPE=training
   
   # Add EFA and SMDDP to LD library path
   ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH"
   ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH
   ```

1. Install or update `curl`, `wget`, and `git` to download and build packages in the subsequent steps.

   ```
   RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
       apt-get update && apt-get install -y  --no-install-recommends \
           curl \
           wget \
           git \
       && rm -rf /var/lib/apt/lists/*
   ```

1. Install [Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) software for Amazon EC2 network communication.

   ```
   RUN DEBIAN_FRONTEND=noninteractive apt-get update
   RUN mkdir /tmp/efa \
       && cd /tmp/efa \
       && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \
       && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
       && cd aws-efa-installer \
       && ./efa_installer.sh -y --skip-kmod -g \
       && rm -rf /tmp/efa
   ```

1. Install [Conda](https://docs.conda.io/en/latest/) to handle package management. 

   ```
   RUN curl -fsSL -v -o ~/miniconda.sh -O  https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh  && \
       chmod +x ~/miniconda.sh && \
       ~/miniconda.sh -b -p $CONDA_PREFIX && \
       rm ~/miniconda.sh && \
       $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \
       $CONDA_PREFIX/bin/conda clean -ya
   ```

1. Get, build, and install PyTorch and its dependencies. We build [PyTorch from the source code](https://github.com/pytorch/pytorch#from-source) because we need to have control of the NCCL version to guarantee compatibility with the [AWS OFI NCCL plug-in](https://github.com/aws/aws-ofi-nccl).

   1. Following the steps in the [PyTorch official dockerfile](https://github.com/pytorch/pytorch/blob/master/Dockerfile), install build dependencies and set up [ccache](https://ccache.dev/) to speed up recompilation.

      ```
      RUN DEBIAN_FRONTEND=noninteractive \
          apt-get install -y --no-install-recommends \
              build-essential \
              ca-certificates \
              ccache \
              cmake \
              git \
              libjpeg-dev \
              libpng-dev \
          && rm -rf /var/lib/apt/lists/*
        
      # Setup ccache
      RUN /usr/sbin/update-ccache-symlinks
      RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
      ```

   1. Install [PyTorch’s common and Linux dependencies](https://github.com/pytorch/pytorch#install-dependencies).

      ```
      # Common dependencies for PyTorch
      RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
      
      # Linux specific dependency for PyTorch
      RUN conda install -c pytorch magma-cuda113
      ```

   1. Clone the [PyTorch GitHub repository](https://github.com/pytorch/pytorch).

      ```
      RUN --mount=type=cache,target=/opt/ccache \
          cd / \
          && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
      ```

   1. Install and build a specific [NCCL](https://developer.nvidia.com/nccl) version. To do this, replace the content in PyTorch’s default NCCL folder (`/pytorch/third_party/nccl`) with the specific NCCL version from the NVIDIA repository. The NCCL version was set in step 3 of this guide.

      ```
      RUN cd /pytorch/third_party/nccl \
          && rm -rf nccl \
          && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
          && cd nccl \
          && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
          && make pkg.txz.build \
          && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1
      ```

   1. Build and install PyTorch. This process usually takes slightly more than 1 hour to complete. It is built using the NCCL version downloaded in a previous step.

      ```
      RUN cd /pytorch \
          && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
          python setup.py install \
          && rm -rf /pytorch
      ```

1. Build and install [AWS OFI NCCL plugin](https://github.com/aws/aws-ofi-nccl). This enables [libfabric](https://github.com/ofiwg/libfabric) support for the SageMaker AI data parallel library.

   ```
   RUN DEBIAN_FRONTEND=noninteractive apt-get update \
       && apt-get install -y --no-install-recommends \
           autoconf \
           automake \
           libtool
   RUN mkdir /tmp/efa-ofi-nccl \
       && cd /tmp/efa-ofi-nccl \
       && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
       && cd aws-ofi-nccl \
       && ./autogen.sh \
       && ./configure --with-libfabric=/opt/amazon/efa \
       --with-mpi=/opt/amazon/openmpi \
       --with-cuda=/usr/local/cuda \
       --with-nccl=$CONDA_PREFIX \
       && make \
       && make install \
       && rm -rf /tmp/efa-ofi-nccl
   ```

1. Build and install [TorchVision](https://github.com/pytorch/vision.git).

   ```
   RUN pip install --no-cache-dir -U \
       packaging \
       mpi4py==3.0.3
   RUN cd /tmp \
       && git clone https://github.com/pytorch/vision.git -b v0.9.1 \
       && cd vision \
       && BUILD_VERSION="0.9.1+cu111" python setup.py install \
       && cd /tmp \
       && rm -rf vision
   ```

1. Install and configure OpenSSH. OpenSSH is required for MPI to communicate between containers. Allow OpenSSH to talk to containers without asking for confirmation.

   ```
   RUN apt-get update \
       && apt-get install -y  --allow-downgrades --allow-change-held-packages --no-install-recommends \
       && apt-get install -y --no-install-recommends openssh-client openssh-server \
       && mkdir -p /var/run/sshd \
       && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
       && echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
       && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
       && rm -rf /var/lib/apt/lists/*
   
   # Configure OpenSSH so that nodes can communicate with each other
   RUN mkdir -p /var/run/sshd && \
    sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
   RUN rm -rf /root/.ssh/ && \
    mkdir -p /root/.ssh/ && \
    ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
    cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
    && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config
   ```

1. Install the PT S3 plug-in to efficiently access datasets in Amazon S3.

   ```
   RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}
   RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt
   ```

1. Install the [libboost](https://www.boost.org/) library. This package is needed for the asynchronous IO networking functionality of the SageMaker AI data parallel library.

   ```
   WORKDIR /
   RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
       && tar -xzf boost_1_73_0.tar.gz \
       && cd boost_1_73_0 \
       && ./bootstrap.sh \
       && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
       && cd .. \
       && rm -rf boost_1_73_0.tar.gz \
       && rm -rf boost_1_73_0 \
       && cd ${CONDA_PREFIX}/include/boost
   ```

1. Install the following SageMaker AI tools for PyTorch training.

   ```
   WORKDIR /root
   RUN pip install --no-cache-dir -U \
       smclarify \
       "sagemaker>=2,<3" \
       sagemaker-experiments==0.* \
       sagemaker-pytorch-training
   ```

1. Finally, install the SageMaker AI data parallel binary and the remaining dependencies.

   ```
   RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
     apt-get update && apt-get install -y  --no-install-recommends \
     jq \
     libhwloc-dev \
     libnuma1 \
     libnuma-dev \
     libssl1.1 \
     libtool \
     hwloc \
     && rm -rf /var/lib/apt/lists/*
   
   RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
   ```

1. After you finish creating the Dockerfile, see [Adapting Your Own Training Container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html) to learn how to build the Docker container, host it in Amazon ECR, and run a training job using the SageMaker Python SDK.

The following example code shows a complete Dockerfile after combining all the previous code blocks.

```
# This file creates a docker image with minimum dependencies to run SageMaker AI data parallel training
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

# Set appropriate versions and location for components
ARG PYTORCH_VERSION=1.10.2
ARG PYTHON_SHORT_VERSION=3.8
ARG EFA_VERSION=1.14.1
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
ARG CONDA_PREFIX="/opt/conda"
ARG BRANCH_OFI=1.1.3-aws

# Set ENV variables required to build PyTorch
ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV NCCL_VERSION=2.10.3

# Add OpenMPI to the path.
ENV PATH /opt/amazon/openmpi/bin:$PATH

# Add Conda to path
ENV PATH $CONDA_PREFIX/bin:$PATH

# Set this environment variable for SageMaker AI to launch SMDDP correctly.
ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main

# Add environment variable for processes to be able to call fork()
ENV RDMAV_FORK_SAFE=1

# Indicate the container type
ENV DLC_CONTAINER_TYPE=training

# Add EFA and SMDDP to LD library path
ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH"
ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH

# Install basic dependencies to download and build other dependencies
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
  apt-get update && apt-get install -y  --no-install-recommends \
  curl \
  wget \
  git \
  && rm -rf /var/lib/apt/lists/*

# Install EFA.
# This is required for SMDDP backend communication
RUN DEBIAN_FRONTEND=noninteractive apt-get update
RUN mkdir /tmp/efa \
    && cd /tmp/efa \
    && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \
    && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
    && cd aws-efa-installer \
    && ./efa_installer.sh -y --skip-kmod -g \
    && rm -rf /tmp/efa

# Install Conda
RUN curl -fsSL -v -o ~/miniconda.sh -O  https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh  && \
    chmod +x ~/miniconda.sh && \
    ~/miniconda.sh -b -p $CONDA_PREFIX && \
    rm ~/miniconda.sh && \
    $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \
    $CONDA_PREFIX/bin/conda clean -ya

# Install PyTorch.
# Start with dependencies listed in official PyTorch dockerfile
# https://github.com/pytorch/pytorch/blob/master/Dockerfile
RUN DEBIAN_FRONTEND=noninteractive \
    apt-get install -y --no-install-recommends \
        build-essential \
        ca-certificates \
        ccache \
        cmake \
        git \
        libjpeg-dev \
        libpng-dev && \
    rm -rf /var/lib/apt/lists/*

# Setup ccache
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache

# Common dependencies for PyTorch
RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses

# Linux specific dependency for PyTorch
RUN conda install -c pytorch magma-cuda113

# Clone PyTorch
RUN --mount=type=cache,target=/opt/ccache \
    cd / \
    && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
# Note that we need to use the same NCCL version for PyTorch and OFI plugin.
# To enforce that, install NCCL from source before building PT and OFI plugin.

# Install NCCL.
# Required for building OFI plugin (OFI requires NCCL's header files and library)
RUN cd /pytorch/third_party/nccl \
    && rm -rf nccl \
    && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
    && cd nccl \
    && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
    && make pkg.txz.build \
    && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1

# Build and install PyTorch.
RUN cd /pytorch \
    && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    python setup.py install \
    && rm -rf /pytorch

RUN ccache -C

# Build and install OFI plugin.
# It is required to use libfabric.
RUN DEBIAN_FRONTEND=noninteractive apt-get update \
    && apt-get install -y --no-install-recommends \
        autoconf \
        automake \
        libtool
RUN mkdir /tmp/efa-ofi-nccl \
    && cd /tmp/efa-ofi-nccl \
    && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
    && cd aws-ofi-nccl \
    && ./autogen.sh \
    && ./configure --with-libfabric=/opt/amazon/efa \
        --with-mpi=/opt/amazon/openmpi \
        --with-cuda=/usr/local/cuda \
        --with-nccl=$CONDA_PREFIX \
    && make \
    && make install \
    && rm -rf /tmp/efa-ofi-nccl

# Build and install Torchvision
RUN pip install --no-cache-dir -U \
    packaging \
    mpi4py==3.0.3
RUN cd /tmp \
    && git clone https://github.com/pytorch/vision.git -b v0.9.1 \
    && cd vision \
    && BUILD_VERSION="0.9.1+cu111" python setup.py install \
    && cd /tmp \
    && rm -rf vision

# Install OpenSSH.
# Required for MPI to communicate between containers, allow OpenSSH to talk to containers without asking for confirmation
RUN apt-get update \
    && apt-get install -y  --allow-downgrades --allow-change-held-packages --no-install-recommends \
    && apt-get install -y --no-install-recommends openssh-client openssh-server \
    && mkdir -p /var/run/sshd \
    && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
    && echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
    && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
    && rm -rf /var/lib/apt/lists/*
# Configure OpenSSH so that nodes can communicate with each other
RUN mkdir -p /var/run/sshd && \
    sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
RUN rm -rf /root/.ssh/ && \
    mkdir -p /root/.ssh/ && \
    ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
    cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
    && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config

# Install PT S3 plugin.
# Required to efficiently access datasets in Amazon S3
RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}
RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt

# Install libboost from source.
# This package is needed for smdataparallel functionality (for networking asynchronous IO).
WORKDIR /
RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
    && tar -xzf boost_1_73_0.tar.gz \
    && cd boost_1_73_0 \
    && ./bootstrap.sh \
    && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
    && cd .. \
    && rm -rf boost_1_73_0.tar.gz \
    && rm -rf boost_1_73_0 \
    && cd ${CONDA_PREFIX}/include/boost

# Install SageMaker AI PyTorch training.
WORKDIR /root
RUN pip install --no-cache-dir -U \
    smclarify \
    "sagemaker>=2,<3" \
    sagemaker-experiments==0.* \
    sagemaker-pytorch-training

# Install SageMaker AI data parallel binary (SMDDP)
# Start with dependencies
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
    apt-get update && apt-get install -y  --no-install-recommends \
        jq \
        libhwloc-dev \
        libnuma1 \
        libnuma-dev \
        libssl1.1 \
        libtool \
        hwloc \
    && rm -rf /var/lib/apt/lists/*

# Install SMDDP
RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
```

**Tip**  
For more general information about creating a custom Dockerfile for training in SageMaker AI, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).

**Tip**  
If you want to extend the custom Dockerfile to incorporate the SageMaker AI model parallel library, see [Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library](model-parallel-sm-sdk.md#model-parallel-bring-your-own-container).

# Amazon SageMaker AI data parallelism library examples
<a name="distributed-data-parallel-v2-examples"></a>

This page provides Jupyter notebooks that present examples of implementing the SageMaker AI distributed data parallelism (SMDDP) library to run distributed training jobs on SageMaker AI.

## Blogs and Case Studies
<a name="distributed-data-parallel-v2-examples-blog"></a>

The following blogs discuss case studies about using the SMDDP library.

**SMDDP v2 blogs**
+ [Enable faster training with Amazon SageMaker AI data parallel library](https://aws.amazon.com/blogs/machine-learning/enable-faster-training-with-amazon-sagemaker-data-parallel-library/), *AWS Machine Learning Blog* (December 05, 2023)

**SMDDP v1 blogs**
+ [How I trained 10TB for Stable Diffusion on SageMaker AI](https://medium.com/@emilywebber/how-i-trained-10tb-for-stable-diffusion-on-sagemaker-39dcea49ce32) in *Medium* (November 29, 2022)
+ [Run PyTorch Lightning and native PyTorch DDP on Amazon SageMaker Training, featuring Amazon Search ](https://aws.amazon.com/blogs/machine-learning/run-pytorch-lightning-and-native-pytorch-ddp-on-amazon-sagemaker-training-featuring-amazon-search/), *AWS Machine Learning Blog* (August 18, 2022)
+ [Training YOLOv5 on AWS with PyTorch and the SageMaker AI distributed data parallel library](https://medium.com/@sitecao/training-yolov5-on-aws-with-pytorch-and-sagemaker-distributed-data-parallel-library-a196ab01409b), *Medium* (May 6, 2022)
+ [Speed up EfficientNet model training on SageMaker AI with PyTorch and the SageMaker AI distributed data parallel library](https://medium.com/@dangmz/speed-up-efficientnet-model-training-on-amazon-sagemaker-with-pytorch-and-sagemaker-distributed-dae4b048c01a), *Medium* (March 21, 2022)
+ [Speed up EfficientNet training on AWS with the SageMaker AI distributed data parallel library](https://towardsdatascience.com/speed-up-efficientnet-training-on-aws-by-up-to-30-with-sagemaker-distributed-data-parallel-library-2dbf6d1e18e8), *Towards Data Science* (January 12, 2022)
+ [Hyundai reduces ML model training time for autonomous driving models using Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/hyundai-reduces-training-time-for-autonomous-driving-models-using-amazon-sagemaker/), *AWS Machine Learning Blog* (June 25, 2021)
+ [Distributed Training: Train BART/T5 for Summarization using Transformers and Amazon SageMaker AI](https://huggingface.co/blog/sagemaker-distributed-training-seq2seq), the *Hugging Face website* (April 8, 2021)

## Example notebooks
<a name="distributed-data-parallel-v2-examples-pytorch"></a>

Example notebooks are provided in the [SageMaker AI examples GitHub repository](https://github.com/aws/amazon-sagemaker-examples/tree/master/training/distributed_training/). To download the examples, run the following commands to clone the repository and go to `training/distributed_training/pytorch/data_parallel`.

**Note**  
Clone and run the example notebooks in the following SageMaker AI ML IDEs.  
+ [SageMaker AI JupyterLab](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl.html) (available in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
+ [SageMaker AI Code Editor](https://docs.aws.amazon.com/sagemaker/latest/dg/code-editor.html) (available in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
+ [Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) (available as an application in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
+ [SageMaker Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html)

```
git clone https://github.com/aws/amazon-sagemaker-examples.git
cd amazon-sagemaker-examples/training/distributed_training/pytorch/data_parallel
```

**SMDDP v2 examples**
+ [Train Llama 2 using the SageMaker AI distributed data parallel library (SMDDP) and DeepSpeed](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/deepspeed/llama2/smddp_deepspeed_example.ipynb)
+ [Train Falcon using the SageMaker AI distributed data parallel library (SMDDP) and PyTorch Fully Sharded Data Parallelism (FSDP)](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/fully_sharded_data_parallel/falcon/smddp_fsdp_example.ipynb)

**SMDDP v1 examples**
+ [CNN with PyTorch and the SageMaker AI data parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/mnist/pytorch_smdataparallel_mnist_demo.ipynb)
+ [BERT with PyTorch and the SageMaker AI data parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/bert/pytorch_smdataparallel_bert_demo.ipynb)
+ [CNN with TensorFlow 2.3.1 and the SageMaker AI data parallelism library](https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/tensorflow/data_parallel/mnist/tensorflow2_smdataparallel_mnist_demo.html)
+ [BERT with TensorFlow 2.3.1 and the SageMaker AI data parallelism library](https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/tensorflow/data_parallel/bert/tensorflow2_smdataparallel_bert_demo.html)
+ [HuggingFace Distributed Data Parallel Training in PyTorch on SageMaker AI - Distributed Question Answering](https://github.com/huggingface/notebooks/blob/master/sagemaker/03_distributed_training_data_parallelism/sagemaker-notebook.ipynb)
+ [HuggingFace Distributed Data Parallel Training in PyTorch on SageMaker AI - Distributed Text Summarization](https://github.com/huggingface/notebooks/blob/master/sagemaker/08_distributed_summarization_bart_t5/sagemaker-notebook.ipynb)
+ [HuggingFace Distributed Data Parallel Training in TensorFlow on SageMaker AI](https://github.com/huggingface/notebooks/blob/master/sagemaker/07_tensorflow_distributed_training_data_parallelism/sagemaker-notebook.ipynb)

# Configuration tips for the SageMaker AI distributed data parallelism library
<a name="data-parallel-config"></a>

Review the following tips before using the SageMaker AI distributed data parallelism (SMDDP) library. This list includes tips that are applicable across frameworks.

**Topics**
+ [Data preprocessing](#data-parallel-config-dataprep)
+ [Single versus multiple nodes](#data-parallel-config-multi-node)
+ [Debug scaling efficiency with Debugger](#data-parallel-config-debug)
+ [Batch size](#data-parallel-config-batch-size)
+ [Custom MPI options](#data-parallel-config-mpi-custom)
+ [Use Amazon FSx and set up an optimal storage and throughput capacity](#data-parallel-config-fxs)

## Data preprocessing
<a name="data-parallel-config-dataprep"></a>

If you preprocess data during training using an external library that utilizes the CPU, you may run into a CPU bottleneck because SageMaker AI distributed data parallel uses the CPU for `AllReduce` operations. You may be able to improve training time by moving preprocessing steps to a library that uses GPUs or by completing all preprocessing before training.

## Single versus multiple nodes
<a name="data-parallel-config-multi-node"></a>

We recommend that you use this library with multiple nodes. The library can be used with a single-host, multi-device setup (for example, a single ML compute instance with multiple GPUs); however, when you use two or more nodes, the library’s `AllReduce` operation gives you significant performance improvement. Also, on a single host, NVLink already contributes to in-node `AllReduce` efficiency.

## Debug scaling efficiency with Debugger
<a name="data-parallel-config-debug"></a>

You can use Amazon SageMaker Debugger to monitor and visualize CPU and GPU utilization and other metrics of interest during training. You can use Debugger [built-in rules](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html) to monitor computational performance issues, such as `CPUBottleneck`, `LoadBalancing`, and `LowGPUUtilization`. You can specify these rules with [Debugger configurations](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configuration-for-debugging.html) when you define an Amazon SageMaker Python SDK estimator. If you use AWS CLI and AWS SDK for Python (Boto3) for training on SageMaker AI, you can enable Debugger as shown in [Configure SageMaker Debugger Using Amazon SageMaker API](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-createtrainingjob-api.html).

To see an example using Debugger in a SageMaker training job, you can reference one of the notebook examples in the [SageMaker Notebook Examples GitHub repository](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger). To learn more about Debugger, see [Amazon SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html).
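As a sketch of how these rules attach to a training job, the following example configures the built-in profiler rules on a PyTorch estimator. The entry point, role, and instance settings are placeholders, and the rule names assume the built-in profiler rule configurations in the current `sagemaker.debugger` module.

```
from sagemaker.debugger import ProfilerRule, rule_configs
from sagemaker.pytorch import PyTorch

# Built-in profiler rules for common scaling issues.
rules = [
    ProfilerRule.sagemaker(rule_configs.CPUBottleneck()),
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
    ProfilerRule.sagemaker(rule_configs.LoadBalancing()),
]

estimator = PyTorch(
    entry_point="train.py",           # placeholder training script
    role="<your-iam-role-arn>",       # placeholder IAM role
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    framework_version="1.10",
    py_version="py38",
    rules=rules,
)
```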

## Batch size
<a name="data-parallel-config-batch-size"></a>

In distributed training, as more nodes are added, batch sizes should increase proportionally. To improve convergence speed as you add more nodes to your training job and increase the global batch size, increase the learning rate.

One way to achieve this is by using a gradual learning rate warmup where the learning rate is ramped up from a small to a large value as the training job progresses. This ramp avoids a sudden increase of the learning rate, allowing healthy convergence at the start of training. For example, you can use a *Linear Scaling Rule* where each time the mini-batch size is multiplied by k, the learning rate is also multiplied by k. To learn more about this technique, see the research paper, [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/pdf/1706.02677.pdf), Sections 2 and 3.
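The linear scaling rule with gradual warmup can be sketched as a small helper function. This is an illustrative example, not part of the SMDDP API; the function and parameter names are hypothetical.

```
def scaled_lr(base_lr, base_batch_size, global_batch_size, warmup_steps, step):
    # Linear Scaling Rule: when the global batch size is k times the
    # baseline batch size, the target learning rate is k times the base rate.
    k = global_batch_size / base_batch_size
    target_lr = base_lr * k
    if step >= warmup_steps:
        return target_lr
    # Gradual warmup: ramp linearly from base_lr to target_lr to avoid
    # a sudden learning-rate jump at the start of training.
    return base_lr + (target_lr - base_lr) * step / warmup_steps

# Scaling from a 256-record baseline batch to a 2048-record global batch
# (k = 8) ramps the learning rate from 0.1 toward 0.8 over 1,000 steps.
print(scaled_lr(0.1, 256, 2048, warmup_steps=1000, step=500))
```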

## Custom MPI options
<a name="data-parallel-config-mpi-custom"></a>

The SageMaker AI distributed data parallel library employs Message Passing Interface (MPI), a popular standard for managing communication between nodes in a high-performance cluster, and uses NVIDIA’s NCCL library for GPU-level communication. When you use the data parallel library with a TensorFlow or PyTorch `Estimator`, the respective container sets up the MPI environment and executes the `mpirun` command to start jobs on the cluster nodes.

You can set custom MPI options using the `custom_mpi_options` parameter in the `Estimator`. Any `mpirun` flags passed in this field are added to the `mpirun` command and executed by SageMaker AI for training. For example, you may define the `distribution` parameter of an `Estimator` using the following to use the [`NCCL_DEBUG`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug) variable to print the NCCL version at the start of the program:

```
distribution = {'smdistributed':{'dataparallel':{'enabled': True, "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"}}}
```
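To show where this dictionary fits, the following sketch passes it to a PyTorch `Estimator` in the SageMaker Python SDK. The entry point script, IAM role, framework versions, and instance settings are placeholders that you would replace with your own values.

```
from sagemaker.pytorch import PyTorch

distribution = {
    "smdistributed": {
        "dataparallel": {
            "enabled": True,
            "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION",
        }
    }
}

estimator = PyTorch(
    entry_point="train.py",           # placeholder training script
    role="<your-iam-role-arn>",       # placeholder IAM role
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    framework_version="1.10",
    py_version="py38",
    distribution=distribution,
)

estimator.fit()
```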

## Use Amazon FSx and set up an optimal storage and throughput capacity
<a name="data-parallel-config-fxs"></a>

When training a model on multiple nodes with distributed data parallelism, it is highly recommended to use [FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html). Amazon FSx is a scalable and high-performance storage service that supports shared file storage with high throughput. Using Amazon FSx storage at scale, you can achieve faster data loading speeds across the compute nodes.

Typically, with distributed data parallelism, you would expect that the total training throughput scales near-linearly with the number of GPUs. However, if you use suboptimal Amazon FSx storage, the training performance might slow down due to a low Amazon FSx throughput. 

For example, if you use the [**SCRATCH\_2** deployment type of Amazon FSx file system](https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html#fsx-aggregate-perf) with the minimum 1.2 TiB storage capacity, the I/O throughput capacity is 240 MB/s. Amazon FSx storage works in a way that you can assign physical storage devices, and the more devices assigned, the larger throughput you get. The smallest storage increment for the SCRATCH\_2 type is 1.2 TiB, and the corresponding throughput gain is 240 MB/s.

Assume that you have a model to train on a 4-node cluster over a 100 GB data set. With a given batch size that’s optimized to the cluster, assume that the model can complete one epoch in about 30 seconds. In this case, the minimum required I/O speed is approximately 3 GB/s (100 GB / 30 s). This is a much higher throughput requirement than 240 MB/s. With such a limited Amazon FSx capacity, scaling your distributed training job up to larger clusters might aggravate I/O bottleneck problems; model training throughput might improve in later epochs as cache builds up, but Amazon FSx throughput can still be a bottleneck.

To alleviate such I/O bottleneck problems, increase the Amazon FSx storage size to obtain a higher throughput capacity. Typically, to find an optimal I/O throughput, experiment with different Amazon FSx throughput capacities, starting at or slightly below your estimated requirement, until you find one that is sufficient to resolve the I/O bottleneck problems. In the preceding example, an Amazon FSx file system with 2.4 GB/s throughput and 67 GB of RAM cache would be sufficient. If the file system has an optimal throughput, the model training throughput should reach its maximum either immediately or after the first epoch, once the cache has built up.
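The sizing arithmetic above can be sketched as a small helper. This is an illustrative estimate only; the 200 MB/s-per-TiB baseline assumed for SCRATCH\_2 file systems should be checked against the current FSx for Lustre performance documentation.

```
def required_throughput_mbps(dataset_gb, epoch_seconds):
    # Aggregate read speed (MB/s) needed to stream the full dataset
    # once per epoch without stalling the GPUs.
    return dataset_gb * 1000 / epoch_seconds

def scratch2_throughput_mbps(storage_tib):
    # SCRATCH_2 file systems provide roughly 200 MB/s of throughput
    # per TiB of storage (assumed baseline; see the FSx performance docs).
    return 200 * storage_tib

# The example above: a 100 GB dataset at ~30 s per epoch needs ~3.3 GB/s,
# far above the 240 MB/s of a minimum-size (1.2 TiB) file system.
print(required_throughput_mbps(100, 30), scratch2_throughput_mbps(1.2))
```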

To learn more about how to increase Amazon FSx storage and deployment types, see the following pages in the *Amazon FSx for Lustre documentation*:
+  [How to increase storage capacity](https://docs.aws.amazon.com/fsx/latest/LustreGuide/managing-storage-capacity.html#increase-storage-capacity) 
+  [Aggregate file system performance](https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html#fsx-aggregate-perf) 

# Amazon SageMaker AI distributed data parallelism library FAQ
<a name="data-parallel-faq"></a>

Use the following to find answers to commonly asked questions about the SMDDP library.

**Q: When using the library, how are the `allreduce`-supporting CPU instances managed? Do I have to create heterogeneous CPU-GPU clusters, or does the SageMaker AI service create extra C5s for jobs that use the SMDDP library?**

The SMDDP library only supports GPU instances, more specifically, P4d and P4de instances with NVIDIA A100 GPUs and EFA. No additional C5 or CPU instances are launched; if your SageMaker AI training job runs on an 8-node P4d cluster, only eight `ml.p4d.24xlarge` instances are used. No additional instances are provisioned.

**Q: I have a training job taking 5 days on a single `ml.p3.24xlarge` instance with a set of hyperparameters H1 (learning rate, batch size, optimizer, and so on). Is using SageMaker AI's data parallelism library and a cluster five times bigger enough to achieve an approximately five-fold speedup? Or do I have to revisit the training hyperparameters after activating the SMDDP library?**

The library changes the overall batch size, which scales linearly with the number of training instances used. As a result, hyperparameters such as the learning rate have to be adjusted to ensure convergence.

**Q: Does the SMDDP library support Spot Instances?**

Yes. You can use managed spot training. You specify the path to the checkpoint file in the SageMaker training job, and you save and restore checkpoints in your training script as mentioned in the last steps of [Use the SMDDP library in your TensorFlow training script (deprecated)](data-parallel-modify-sdp-tf2.md) and [Use the SMDDP library in your PyTorch training script](data-parallel-modify-sdp-pt.md). 
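The save-and-restore pattern can be sketched in a framework-agnostic way as follows. The directory shown is SageMaker AI's default local checkpoint path; in a real PyTorch script you would typically serialize model and optimizer state with `torch.save` and `torch.load` rather than `pickle`.

```
import os
import pickle

CHECKPOINT_DIR = "/opt/ml/checkpoints"  # SageMaker AI default local checkpoint path

def save_checkpoint(state, directory=CHECKPOINT_DIR):
    # Persist training state so a restarted Spot job can resume.
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "checkpoint.pkl"), "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(directory=CHECKPOINT_DIR):
    # Return the saved state, or None if no checkpoint exists yet
    # (for example, on the very first run of the job).
    path = os.path.join(directory, "checkpoint.pkl")
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)
```

At the start of the training loop, call `load_checkpoint()` and resume from the returned epoch if a checkpoint exists; call `save_checkpoint()` at the end of each epoch.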

**Q: Is the SMDDP library relevant in a single-host, multi-device setup?**

The library can be used in single-host, multi-device training, but the library offers performance improvements only in multi-host training.

**Q: Where should the training dataset be stored?**

The training dataset can be stored in an Amazon S3 bucket or on an Amazon FSx drive. See this [document for various supported input file systems for a training job](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.FileSystemInput). 
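For example, an FSx for Lustre file system can be passed to a training job as a `FileSystemInput` channel, as in the following sketch. The file system ID and directory path are placeholders; for FSx for Lustre, the directory path must begin with the file system's mount name.

```
from sagemaker.inputs import FileSystemInput

train_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",  # placeholder file system ID
    file_system_type="FSxLustre",
    directory_path="/<mount-name>/train",   # placeholder directory path
    file_system_access_mode="ro",
)

# estimator.fit({"train": train_input})
```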

**Q: When using the SMDDP library, is it mandatory to have training data in FSx for Lustre? Can Amazon EFS and Amazon S3 be used?**

We generally recommend you use Amazon FSx because of its lower latency and higher throughput. If you prefer, you can use Amazon EFS or Amazon S3.

**Q: Can the library be used with CPU nodes?** 

No. To find instance types supported by the SMDDP library, see [Supported instance types](distributed-data-parallel-support.md#distributed-data-parallel-supported-instance-types).

**Q: What frameworks and framework versions are currently supported by the SMDDP library at launch?** 

The SMDDP library currently supports PyTorch v1.6.0 or later and TensorFlow v2.3.0 or later. It doesn't support TensorFlow 1.x. For more information about which version of the SMDDP library is packaged within AWS deep learning containers, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html).

**Q: Does the library support AMP?**

Yes, the SMDDP library supports Automatic Mixed Precision (AMP) out of the box. No extra action is needed to use AMP other than the framework-level modifications to your training script. If gradients are in FP16, the SageMaker AI data parallelism library runs its `AllReduce` operation in FP16. For more information about implementing AMP APIs to your training script, see the following resources:
+ [Frameworks - PyTorch](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#pytorch) in the *NVIDIA Deep Learning Performance documentation*
+ [Frameworks - TensorFlow](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#tensorflow) in the *NVIDIA Deep Learning Performance documentation*
+ [Automatic Mixed Precision for Deep Learning](https://developer.nvidia.com/automatic-mixed-precision) in the *NVIDIA Developer Docs*
+ [Introducing native PyTorch automatic mixed precision for faster training on NVIDIA GPUs](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/) in the *PyTorch Blog*
+ [TensorFlow mixed precision APIs](https://www.tensorflow.org/guide/mixed_precision) in the *TensorFlow documentation*
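For PyTorch, the framework-level changes amount to running the forward pass inside an autocast context (and, on GPU, scaling gradients with `torch.cuda.amp.GradScaler`). The following is a minimal sketch with an illustrative model and data; it uses `bfloat16` autocast on CPU so it runs anywhere, whereas a GPU training script would use `device_type="cuda"` with `float16` and a `GradScaler`.

```python
import torch

# Illustrative model and data; in a real script these come from your training job.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(4, 8)
targets = torch.randn(4, 1)

optimizer.zero_grad()
# Run the forward pass in reduced precision.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    outputs = model(inputs)
# Compute the loss in FP32 for numerical stability, then update the weights.
loss = torch.nn.functional.mse_loss(outputs.float(), targets)
loss.backward()
optimizer.step()
```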

**Q: How do I identify if my distributed training job is slowed down due to I/O bottleneck?**

With a larger cluster, the training job requires more I/O throughput, so the training throughput might take longer (more epochs) to ramp up to its maximum. This indicates that I/O is the bottleneck and that the cache is harder to build up as you scale out to more nodes (because of the higher throughput requirement and the more complex network topology). For more information about monitoring Amazon FSx throughput in CloudWatch, see [Monitoring FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/monitoring_overview.html) in the *FSx for Lustre User Guide*. 

**Q: How do I resolve I/O bottlenecks when running a distributed training job with data parallelism?**

We highly recommend that you use Amazon FSx as your data channel instead of reading directly from Amazon S3. If you are already using Amazon FSx but still have I/O bottleneck problems, you might have set up your Amazon FSx file system with a low I/O throughput and a small storage capacity. For more information about how to estimate and choose the right size of I/O throughput capacity, see [Use Amazon FSx and set up an optimal storage and throughput capacity](data-parallel-config.md#data-parallel-config-fxs).

**Q: (For the library v1.4.0 or later) How do I resolve the `Invalid backend` error while initializing a process group?**

If you encounter the error message `ValueError: Invalid backend: 'smddp'` when calling `init_process_group`, this is due to the breaking change in the SMDDP library v1.4.0 and later. You must import the PyTorch client of the library, `smdistributed.dataparallel.torch.torch_smddp`, which registers `smddp` as a backend for PyTorch. To learn more, see [Use the SMDDP library in your PyTorch training script](data-parallel-modify-sdp-pt.md).
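In code, the fix looks like the following two lines at the top of your PyTorch training script. Note that this only runs inside a SageMaker training container that has the SMDDP library v1.4.0 or later installed.

```python
import torch.distributed as dist

# Importing the PyTorch client of the SMDDP library registers "smddp"
# as a valid torch.distributed backend (SMDDP v1.4.0 and later).
import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401

dist.init_process_group(backend="smddp")
```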

**Q: (For the SMDDP library v1.4.0 or later) I would like to call the collective primitives of the [torch.distributed](https://pytorch.org/docs/stable/distributed.html) interface. Which primitives does the `smddp` backend support?**

In v1.4.0, the SMDDP library supports `all_reduce`, `broadcast`, `reduce`, `all_gather`, and `barrier` of the `torch.distributed` interface.

**Q: (For the SMDDP library v1.4.0 or later) Does this new API work with other custom DDP classes or libraries like Apex DDP?**

The SMDDP library is tested with other third-party distributed data parallel libraries and framework implementations that use the `torch.distributed` modules. Using the SMDDP library with custom DDP classes works as long as the collective operations they use are supported by the SMDDP library. See the preceding question for a list of supported collectives. If you have these use cases and need further support, reach out to the SageMaker AI team through the [AWS Support Center](https://console.aws.amazon.com/support/) or [AWS Developer Forums for Amazon SageMaker AI](https://forums.aws.amazon.com/forum.jspa?forumID=285).

**Q: Does the SMDDP library support the bring-your-own-container (BYOC) option? If so, how do I install the library and run a distributed training job by writing a custom Dockerfile?**

If you want to integrate the SMDDP library and its minimum dependencies into your own Docker container, BYOC is the right approach. You can build your own container using the binary file of the library. The recommended process is to write a custom Dockerfile with the library and its dependencies, build the Docker container, host it in Amazon ECR, and use the ECR image URI to launch a training job using the SageMaker AI generic estimator class. For more instructions on how to prepare a custom Dockerfile for distributed training in SageMaker AI with the SMDDP library, see [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md).
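The launch step might look like the following sketch using the SageMaker Python SDK generic estimator class. The image URI, IAM role, entry point, and instance settings are placeholders, not values from this guide.

```python
import sagemaker
from sagemaker.estimator import Estimator

# Placeholders: substitute your own ECR image URI, IAM role, and training script.
estimator = Estimator(
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>",
    role="<your-sagemaker-execution-role>",
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    entry_point="train.py",
    sagemaker_session=sagemaker.Session(),
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit("s3://<bucket>/<training-data-prefix>")
```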

# Troubleshooting for distributed training in Amazon SageMaker AI
<a name="distributed-troubleshooting-data-parallel"></a>

If you have problems in running a training job when you use the library, use the following list to try to troubleshoot. If you need further support, reach out to the SageMaker AI team through [AWS Support Center](https://console.aws.amazon.com/support/) or [AWS Developer Forums for Amazon SageMaker AI](https://forums.aws.amazon.com/forum.jspa?forumID=285).

**Topics**
+ [Using SageMaker AI distributed data parallel with Amazon SageMaker Debugger and checkpoints](#distributed-ts-data-parallel-debugger)
+ [An unexpected prefix attached to model parameter keys](#distributed-ts-data-parallel-pytorch-prefix)
+ [SageMaker AI distributed training job stalling during initialization](#distributed-ts-data-parallel-efa-sg)
+ [SageMaker AI distributed training job stalling at the end of training](#distributed-ts-data-parallel-stall-at-the-end)
+ [Observing scaling efficiency degradation due to Amazon FSx throughput bottlenecks](#distributed-ts-data-parallel-fxs-bottleneck)
+ [SageMaker AI distributed training job with PyTorch returns deprecation warnings](#distributed-ts-data-parallel-deprecation-warnings)

## Using SageMaker AI distributed data parallel with Amazon SageMaker Debugger and checkpoints
<a name="distributed-ts-data-parallel-debugger"></a>

To monitor system bottlenecks, profile framework operations, and debug model output tensors for training jobs with SageMaker AI distributed data parallel, use Amazon SageMaker Debugger. 

However, when you use SageMaker Debugger, SageMaker AI distributed data parallel, and SageMaker AI checkpoints, you might see an error that looks like the following example. 

```
SMDebug Does Not Currently Support Distributed Training Jobs With Checkpointing Enabled
```

This is due to an internal error between Debugger and checkpoints, which occurs when you enable SageMaker AI distributed data parallel. 
+ If you enable all three features, SageMaker Python SDK automatically turns off Debugger by passing `debugger_hook_config=False`, which is equivalent to the following framework `estimator` example.

  ```
  import sagemaker
  from sagemaker.tensorflow import TensorFlow

  bucket=sagemaker.Session().default_bucket()
  base_job_name="sagemaker-checkpoint-test"
  checkpoint_in_bucket="checkpoints"
  
  # The S3 URI to store the checkpoints
  checkpoint_s3_bucket="s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)
  
  estimator = TensorFlow(
      ...
  
      distribution={"smdistributed": {"dataparallel": { "enabled": True }}},
      checkpoint_s3_uri=checkpoint_s3_bucket,
      checkpoint_local_path="/opt/ml/checkpoints",
      debugger_hook_config=False
  )
  ```
+ If you want to keep using both SageMaker AI distributed data parallel and SageMaker Debugger, a workaround is manually adding checkpointing functions to your training script instead of specifying the `checkpoint_s3_uri` and `checkpoint_local_path` parameters from the estimator. For more information about setting up manual checkpointing in a training script, see [Saving Checkpoints](distributed-troubleshooting-model-parallel.md#distributed-ts-model-parallel-checkpoints).
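A minimal sketch of such manual checkpointing in a PyTorch training script might look like the following. The function names, directory layout, and save cadence are illustrative, not part of any SageMaker AI API.

```python
import os
import torch

def save_checkpoint(model, optimizer, epoch, checkpoint_dir):
    """Save model and optimizer state so training can resume later."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        os.path.join(checkpoint_dir, f"checkpoint-{epoch}.pt"),
    )

def load_checkpoint(model, optimizer, path):
    """Restore model and optimizer state from a saved checkpoint."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"]
```

In a training loop, you would call `save_checkpoint` at the end of each epoch (typically from rank 0 only) and `load_checkpoint` once at startup if a checkpoint file exists.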

## An unexpected prefix attached to model parameter keys
<a name="distributed-ts-data-parallel-pytorch-prefix"></a>

For PyTorch distributed training jobs, an unexpected prefix (for example, `model`) might be attached to `state_dict` keys (model parameters). The SageMaker AI data parallel library does not directly alter or prepend any model parameter names when PyTorch training jobs save model artifacts; it is PyTorch's distributed training that prepends the prefix when the names in the `state_dict` go over the network. If you encounter any model failure problem due to different parameter names while you are using the SageMaker AI data parallel library and checkpointing for PyTorch training, adapt the following example code to remove the prefix at the step you load checkpoints in your training script.

```
state_dict = {k.partition('model.')[2]:state_dict[k] for k in state_dict.keys()}
```

This takes each `state_dict` key as a string, splits it at the first occurrence of `'model.'`, and keeps the third element (index 2) of the resulting tuple, which is the part of the key after the prefix.

For more information about the prefix issue, see a discussion thread at [Prefix parameter names in saved model if trained by multi-GPU?](https://discuss.pytorch.org/t/prefix-parameter-names-in-saved-model-if-trained-by-multi-gpu/494) in the *PyTorch discussion forum*.

For more information about the PyTorch methods for saving and loading models, see [Saving & Loading Model Across Devices](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-across-devices) in the *PyTorch documentation*.
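As a minimal illustration of the snippet above, using a plain dictionary with hypothetical keys in place of a real PyTorch `state_dict`:

```python
# Hypothetical checkpoint keys as saved by a distributed training job,
# where each parameter name carries an unexpected "model." prefix.
state_dict = {
    "model.encoder.weight": [1.0, 2.0],
    "model.encoder.bias": [0.5],
}

# Partition each key at the first "model." and keep the remainder (index 2).
state_dict = {k.partition("model.")[2]: v for k, v in state_dict.items()}

print(sorted(state_dict))  # ['encoder.bias', 'encoder.weight']
```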

## SageMaker AI distributed training job stalling during initialization
<a name="distributed-ts-data-parallel-efa-sg"></a>

If your SageMaker AI distributed data parallel training job stalls during initialization when using EFA-enabled instances, this might be due to a misconfiguration in the security group of the VPC subnet that's used for the training job. EFA requires a proper security group configuration to enable traffic between the nodes.

**To configure inbound and outbound rules for the security group**

1. Sign in to the AWS Management Console and open the Amazon VPC console at [https://console.aws.amazon.com/vpc/](https://console.aws.amazon.com/vpc/).

1. Choose **Security Groups** in the left navigation pane.

1. Select the security group that's tied to the VPC subnet you use for training. 

1. In the **Details** section, copy the **Security group ID**.

1. On the **Inbound rules** tab, choose **Edit inbound rules**.

1. On the **Edit inbound rules** page, do the following: 

   1. Choose **Add rule**.

   1. For **Type**, choose **All traffic**.

   1. For **Source**, choose **Custom**, paste the security group ID into the search box, and select the security group that pops up.

1. Choose **Save rules** to finish configuring the inbound rule for the security group.

1. On the **Outbound rules** tab, choose **Edit outbound rules**.

1. Repeat steps 6 and 7 to add the same rule as an outbound rule.

After you complete the preceding steps for configuring the security group with the inbound and outbound rules, re-run the training job and verify if the stalling issue is resolved.

For more information about configuring security groups for VPC and EFA, see [Security groups for your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html) and [Elastic Fabric Adapter](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html).

## SageMaker AI distributed training job stalling at the end of training
<a name="distributed-ts-data-parallel-stall-at-the-end"></a>

One of the root causes of stalling issues at the end of training is a mismatch in the number of batches that are processed per epoch across different ranks. All workers (GPUs) synchronize their local gradients in the backward pass to ensure they all have the same copy of the model at the end of the batch iteration. If the batch sizes are unevenly assigned to different worker groups during the final epoch of training, the training job stalls. For example, while a group of workers (group A) finishes processing all batches and exits the training loop, another group of workers (group B) starts processing another batch and still expects communication from group A to synchronize the gradients. This causes group B to wait for group A, which already completed training and does not have any gradients to synchronize. 

Therefore, when setting up your training dataset, make sure that each worker (rank) gets the same number of data samples, so that every worker goes through the same number of batches during training. This avoids the stalling issue.
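One way to do this in PyTorch is to use `torch.utils.data.distributed.DistributedSampler` with `drop_last=True`, which drops the tail of the dataset so every rank receives the same number of samples. The following sketch passes `num_replicas` and `rank` explicitly for illustration; in a real training script they come from the initialized process group.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# A toy dataset whose size (10) does not divide evenly across 4 workers.
dataset = TensorDataset(torch.arange(10).float())

# With drop_last=True, the sampler drops the tail so every rank gets the
# same number of samples and therefore the same number of batches.
per_rank_counts = []
for rank in range(4):
    sampler = DistributedSampler(dataset, num_replicas=4, rank=rank, drop_last=True)
    loader = DataLoader(dataset, batch_size=2, sampler=sampler)
    per_rank_counts.append(sum(len(batch[0]) for batch in loader))

print(per_rank_counts)  # every rank processes the same number of samples
```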

## Observing scaling efficiency degradation due to Amazon FSx throughput bottlenecks
<a name="distributed-ts-data-parallel-fxs-bottleneck"></a>

One potential cause of lowered scaling efficiency is the FSx throughput limit. If you observe a sudden drop in scaling efficiency when you switch to a larger training cluster, try using a larger FSx for Lustre file system with a higher throughput limit. For more information, see [Aggregate file system performance](https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html#fsx-aggregate-perf) and [Managing storage and throughput capacity](https://docs.aws.amazon.com/fsx/latest/LustreGuide/managing-storage-capacity.html) in the *Amazon FSx for Lustre User Guide*.

## SageMaker AI distributed training job with PyTorch returns deprecation warnings
<a name="distributed-ts-data-parallel-deprecation-warnings"></a>

Since v1.4.0, the SageMaker AI distributed data parallelism library works as a backend of PyTorch distributed. Because of this breaking change in how the library is used with PyTorch, you might encounter a warning message that the `smdistributed` APIs for the PyTorch distributed package are deprecated. The warning message should be similar to the following:

```
smdistributed.dataparallel.torch.dist is deprecated in the SageMaker AI distributed data parallel library v1.4.0+.
Please use torch.distributed and specify 'smddp' as a backend when initializing process group as follows:
torch.distributed.init_process_group(backend='smddp')
For more information, see the library's API documentation at
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html
```

In v1.4.0 and later, the library only needs to be imported once at the top of your training script and set as the backend during the PyTorch distributed initialization. With the single line of backend specification, you can keep your PyTorch training script unchanged and directly use the PyTorch distributed modules. See [Use the SMDDP library in your PyTorch training script](data-parallel-modify-sdp-pt.md) to learn about the breaking changes and the new way to use the library with PyTorch.

# SageMaker AI data parallelism library release notes
<a name="data-parallel-release-notes"></a>

See the following release notes to track the latest updates for the SageMaker AI distributed data parallelism (SMDDP) library.

## The SageMaker AI distributed data parallelism library v2.5.0
<a name="data-parallel-release-notes-20241017"></a>

*Date: October 17, 2024*

**New features**
+ Added support for PyTorch v2.4.1 with CUDA v12.1.

**Integration into Docker containers distributed by the SageMaker AI model parallelism (SMP) library**

This version of the SMDDP library is migrated to [The SageMaker model parallelism library v2.6.0](model-parallel-release-notes.md#model-parallel-release-notes-20241017).

```
658645717510.dkr.ecr.<us-west-2>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121
```

For Regions where the SMP Docker images are available, see [AWS Regions](distributed-model-parallel-support-v2.md#distributed-model-parallel-availablity-zone-v2).

**Binary file of this release**

You can download or install the library using the following URL.

```
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.4.1/cu121/2024-10-09/smdistributed_dataparallel-2.5.0-cp311-cp311-linux_x86_64.whl
```

## The SageMaker AI distributed data parallelism library v2.3.0
<a name="data-parallel-release-notes-20240611"></a>

*Date: June 11, 2024*

**New features**
+ Added support for PyTorch v2.3.0 with CUDA v12.1 and Python v3.11.
+ Added support for PyTorch Lightning v2.2.5. This is integrated into the SageMaker AI framework container for PyTorch v2.3.0.
+ Added instance type validation during import to prevent loading the SMDDP library on unsupported instance types. For a list of instance types compatible with the SMDDP library, see [Supported frameworks, AWS Regions, and instances types](distributed-data-parallel-support.md).

**Integration into SageMaker AI Framework Containers**

This version of the SMDDP library is migrated to the following [SageMaker AI Framework Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).
+ PyTorch v2.3.0

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker
  ```

For a complete list of versions of the SMDDP library and the pre-built containers, see [Supported frameworks, AWS Regions, and instances types](distributed-data-parallel-support.md).

**Binary file of this release**

You can download or install the library using the following URL.

```
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed_dataparallel-2.3.0-cp311-cp311-linux_x86_64.whl
```

**Other changes**
+ The SMDDP library v2.2.0 is integrated into the SageMaker AI framework container for PyTorch v2.2.0.

## The SageMaker AI distributed data parallelism library v2.2.0
<a name="data-parallel-release-notes-20240304"></a>

*Date: March 4, 2024*

**New features**
+ Added support for PyTorch v2.2.0 with CUDA v12.1.

**Integration into Docker containers distributed by the SageMaker AI model parallelism (SMP) library**

This version of the SMDDP library is migrated to [The SageMaker model parallelism library v2.2.0](model-parallel-release-notes.md#model-parallel-release-notes-20240307).

```
658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121
```

For Regions where the SMP Docker images are available, see [AWS Regions](distributed-model-parallel-support-v2.md#distributed-model-parallel-availablity-zone-v2).

**Binary file of this release**

You can download or install the library using the following URL.

```
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed_dataparallel-2.2.0-cp310-cp310-linux_x86_64.whl
```

## The SageMaker AI distributed data parallelism library v2.1.0
<a name="data-parallel-release-notes-20240301"></a>

*Date: March 1, 2024*

**New features**
+ Added support for PyTorch v2.1.0 with CUDA v12.1.

**Bug fixes**
+ Fixed the CPU memory leak issue in [SMDDP v2.0.1](#data-parallel-release-notes-20231207).

**Integration into SageMaker AI Framework Containers**

This version of the SMDDP library passed benchmark testing and is migrated to the following [SageMaker AI Framework Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).
+ PyTorch v2.1.0

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker
  ```

**Integration into Docker containers distributed by the SageMaker AI model parallelism (SMP) library**

This version of the SMDDP library is migrated to [The SageMaker model parallelism library v2.1.0](model-parallel-release-notes.md#model-parallel-release-notes-20240206).

```
658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121
```

For Regions where the SMP Docker images are available, see [AWS Regions](distributed-model-parallel-support-v2.md#distributed-model-parallel-availablity-zone-v2).

**Binary file of this release**

You can download or install the library using the following URL.

```
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed_dataparallel-2.1.0-cp310-cp310-linux_x86_64.whl
```

## The SageMaker AI distributed data parallelism library v2.0.1
<a name="data-parallel-release-notes-20231207"></a>

*Date: December 7, 2023*

**New features**
+ Added a new SMDDP-implementation of `AllGather` collective operation optimized for AWS compute resources and network infrastructure. To learn more, see [SMDDP `AllGather` collective operation](data-parallel-intro.md#data-parallel-allgather).
+ The SMDDP `AllGather` collective operation is compatible with PyTorch FSDP and DeepSpeed. To learn more, see [Use the SMDDP library in your PyTorch training script](data-parallel-modify-sdp-pt.md).
+ Added support for PyTorch v2.0.1

**Known issues**
+ There is a CPU memory leak: CPU memory gradually increases while training with SMDDP `AllReduce` in DDP mode.

**Integration into SageMaker AI Framework Containers**

This version of the SMDDP library passed benchmark testing and is migrated to the following [SageMaker AI Framework Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).
+ PyTorch v2.0.1

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker
  ```

**Binary file of this release**

You can download or install the library using the following URL.

```
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl
```

**Other changes**
+ Starting from this release, documentation for the SMDDP library is fully available in this *Amazon SageMaker AI Developer Guide*. In favor of this complete developer guide for SMDDP v2, the [additional reference for SMDDP v1.x](https://sagemaker.readthedocs.io/en/stable/api/training/smd_data_parallel.html) in the *SageMaker Python SDK documentation* is no longer supported. If you still need the SMDDP v1.x documentation, see the following snapshot of the documentation at [SageMaker Python SDK v2.212.0 documentation](https://sagemaker.readthedocs.io/en/v2.212.0/api/training/distributed.html#the-sagemaker-distributed-data-parallel-library).

# SageMaker model parallelism library v2
<a name="model-parallel-v2"></a>

**Note**  
Since the release of the SageMaker model parallelism (SMP) library v2.0.0 on December 19, 2023, this documentation is renewed for the SMP library v2. For previous versions of the SMP library, see [(Archived) SageMaker model parallelism library v1.x](model-parallel.md).

The Amazon SageMaker AI model parallelism library is a capability of SageMaker AI that enables high performance and optimized large scale training on SageMaker AI accelerated computing instances. The [Core features of the SageMaker model parallelism library v2](model-parallel-core-features-v2.md) include techniques and optimizations to accelerate and simplify large model training, such as hybrid sharded data parallelism, tensor parallelism, activation checkpointing, and activation offloading. You can use the SMP library to accelerate the training and fine-tuning of large language models (LLMs), large vision models (LVMs), and foundation models (FMs) with hundreds of billions of parameters.

The SageMaker model parallelism library v2 (SMP v2) aligns the library’s APIs and methods with open source PyTorch Fully Sharded Data Parallelism (FSDP), which gives you the benefit of SMP performance optimizations with minimal code changes. With SMP v2, you can improve the computational performance of training a state-of-the-art large model on SageMaker AI by bringing your PyTorch FSDP training scripts to SageMaker AI.

You can use SMP v2 for the general [SageMaker Training](train-model.md) jobs and distributed training workloads on [Amazon SageMaker HyperPod](sagemaker-hyperpod.md) clusters.

**Topics**
+ [Model parallelism concepts](model-parallel-intro-v2.md)
+ [Supported frameworks and AWS Regions](distributed-model-parallel-support-v2.md)
+ [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md)
+ [Core features of the SageMaker model parallelism library v2](model-parallel-core-features-v2.md)
+ [Amazon SageMaker AI model parallelism library v2 examples](distributed-model-parallel-v2-examples.md)
+ [SageMaker distributed model parallelism best practices](model-parallel-best-practices-v2.md)
+ [The SageMaker model parallel library v2 reference](distributed-model-parallel-v2-reference.md)
+ [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md)
+ [(Archived) SageMaker model parallelism library v1.x](model-parallel.md)

# Model parallelism concepts
<a name="model-parallel-intro-v2"></a>

Model parallelism is a distributed training method in which the deep learning (DL) model is partitioned across multiple GPUs and instances. The SageMaker model parallel library v2 (SMP v2) is compatible with the native PyTorch APIs and capabilities. This makes it convenient for you to adapt your PyTorch Fully Sharded Data Parallel (FSDP) training script to the SageMaker Training platform and take advantage of the performance improvement that SMP v2 provides. This introduction page provides a high-level overview about model parallelism and a description of how it can help overcome issues that arise when training deep learning (DL) models that are typically very large in size. It also provides examples of what the SageMaker model parallel library offers to help manage model parallel strategies and memory consumption.

## What is model parallelism?
<a name="model-parallel-what-is-v2"></a>

Increasing the size of deep learning models (layers and parameters) yields better accuracy for complex tasks such as computer vision and natural language processing. However, there is a limit to the maximum model size you can fit in the memory of a single GPU. When training DL models, GPU memory limitations can be bottlenecks in the following ways:
+ They limit the size of the model that you can train, because the memory footprint of a model scales proportionally to the number of parameters.
+ They limit the per-GPU batch size during training, driving down GPU utilization and training efficiency.

To overcome the limitations associated with training a model on a single GPU, SageMaker AI provides the model parallel library to help distribute and train DL models efficiently on multiple compute nodes. Furthermore, with the library, you can achieve optimized distributed training using EFA-supported devices, which enhance the performance of inter-node communication with low latency, high throughput, and OS bypass.

## Estimate memory requirements before using model parallelism
<a name="model-parallel-intro-estimate-memory-requirements-v2"></a>

Before you use the SageMaker model parallel library, consider the following to get a sense of the memory requirements of training large DL models.

For a training job that uses automatic mixed precision such as `float16` (FP16) or `bfloat16` (BF16) and Adam optimizers, the required GPU memory per parameter is about 20 bytes, which we can break down as follows:
+ An FP16 or BF16 parameter: 2 bytes
+ An FP16 or BF16 gradient: 2 bytes
+ An FP32 optimizer state: 8 bytes based on the Adam optimizers
+ An FP32 copy of the parameter: 4 bytes (needed for the `optimizer apply` (OA) operation)
+ An FP32 copy of the gradient: 4 bytes (needed for the OA operation)

Even a relatively small DL model with 10 billion parameters can require at least 200 GB of memory, which is much larger than the GPU memory available on a single device (for example, NVIDIA A100 with 40 GB or 80 GB of memory). On top of the memory requirements for model and optimizer states, there are other memory consumers, such as the activations generated in the forward pass, so the memory required can be significantly greater than 200 GB.
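The per-parameter estimate above can be checked with simple arithmetic. For example, for a 10-billion-parameter model:

```python
# Approximate GPU memory per parameter for mixed-precision training with Adam.
bytes_per_param = (
    2    # FP16/BF16 parameter
    + 2  # FP16/BF16 gradient
    + 8  # FP32 Adam optimizer states
    + 4  # FP32 copy of the parameter
    + 4  # FP32 copy of the gradient
)

num_params = 10_000_000_000  # 10 billion parameters
total_gb = num_params * bytes_per_param / 1e9

print(bytes_per_param, total_gb)  # 20 bytes per parameter, 200.0 GB total
```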

For distributed training, we recommend that you use Amazon EC2 P4 and P5 instances that have NVIDIA A100 and H100 Tensor Core GPUs respectively. For more details about specifications such as CPU cores, RAM, attached storage volume, and network bandwidth, see the *Accelerated Computing* section in the [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/) page. For instance types that SMP v2 supports, see [Supported instance types](distributed-model-parallel-support-v2.md#distributed-model-parallel-supported-instance-types-v2).

Even with these accelerated computing instances, models with about 10 billion parameters, such as Megatron-LM and T5, and even larger models with hundreds of billions of parameters, such as GPT-3, cannot fit a full model replica in each GPU device. 

## How the library employs model parallelism and memory saving techniques
<a name="model-parallel-intro-features-v2"></a>

The library consists of various types of model parallelism features and memory-saving features such as optimizer state sharding, activation checkpointing, and activation offloading. All these techniques can be combined to efficiently train large models that consist of hundreds of billions of parameters.

**Topics**
+ [Sharded data parallelism](#model-parallel-intro-sdp-v2)
+ [Expert parallelism](#model-parallel-intro-expert-parallelism-v2)
+ [Tensor parallelism](#model-parallel-intro-tp-v2)
+ [Activation checkpointing and offloading](#model-parallel-intro-activation-offloading-checkpointing-v2)
+ [Choosing the right techniques for your model](#model-parallel-intro-choosing-techniques-v2)

### Sharded data parallelism
<a name="model-parallel-intro-sdp-v2"></a>

*Sharded data parallelism* is a memory-saving distributed training technique that splits the state of a model (model parameters, gradients, and optimizer states) across GPUs within a data-parallel group.

SMP v2 implements sharded data parallelism through FSDP, and extends it to implement the scale aware hybrid sharding strategy discussed in the blog post [Near-linear scaling of gigantic-model training on AWS](https://www.amazon.science/blog/near-linear-scaling-of-gigantic-model-training-on-aws).

You can apply sharded data parallelism to your model as a standalone strategy. Furthermore, if you are using the most performant GPU instances equipped with NVIDIA A100 Tensor Core GPUs, `ml.p4d.24xlarge` and `ml.p4de.24xlarge`, you can take advantage of the improved training speed from the `AllGather` operation offered by the [SageMaker data parallelism (SMDDP) library](data-parallel.md).

To dive deep into sharded data parallelism and learn how to set it up or use a combination of sharded data parallelism with other techniques like tensor parallelism and mixed precision training, see [Hybrid sharded data parallelism](model-parallel-core-features-v2-sharded-data-parallelism.md).

### Expert parallelism
<a name="model-parallel-intro-expert-parallelism-v2"></a>

SMP v2 integrates with [NVIDIA Megatron](https://github.com/NVIDIA/Megatron-LM) for implementing *expert parallelism* on top of its support for the native PyTorch FSDP APIs. You can keep your PyTorch FSDP training code as is and apply SMP expert parallelism for training *Mixture of Experts* (MoE) models within SageMaker AI.

An MoE model is a type of transformer model that consists of multiple *experts*, each of which is a neural network, typically a feed-forward network (FFN). A gate network called a *router* determines which tokens are sent to which expert. These experts specialize in processing specific aspects of the input data, which enables the model to train faster and reduce compute cost while achieving the same performance quality as its dense counterpart. *Expert parallelism* is a parallelism technique that splits the experts of an MoE model across GPU devices.

To learn how to train MoE models with SMP v2, see [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md).

### Tensor parallelism
<a name="model-parallel-intro-tp-v2"></a>

*Tensor parallelism* splits individual layers, or `nn.Modules`, across devices to run in parallel. The following figure shows the simplest example of how the SMP library splits a model with four layers to achieve two-way tensor parallelism (`"tensor_parallel_degree": 2`). In the following figure, the notations for model parallel group, tensor parallel group, and data parallel group are `MP_GROUP`, `TP_GROUP`, and `DP_GROUP` respectively. The layers of each model replica are bisected and distributed into two GPUs. The library manages communication across the tensor-distributed model replicas.

![\[Simplest example of how the SMP library splits a model with four layers to achieve two-way tensor parallelism ("tensor_parallel_degree": 2).\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/smp-v2-tensor-parallel.png)


To dive deep into tensor parallelism and other memory-saving features for PyTorch, and to learn how to set a combination of the core features, see [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md).
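To make the figure concrete, the group sizes can be sketched with simple arithmetic. This mirrors the layout shown above and is illustrative only, not an SMP API call:

```python
# Group-size arithmetic for two-way tensor parallelism on 8 GPUs
# (illustrative; mirrors the figure above, not an SMP API call).
world_size = 8               # total GPUs in the cluster
tensor_parallel_degree = 2   # "tensor_parallel_degree": 2 in the SMP config

tp_group_size = tensor_parallel_degree                # GPUs per TP_GROUP
dp_group_size = world_size // tensor_parallel_degree  # number of model replicas

assert tp_group_size * dp_group_size == world_size
print(dp_group_size)  # 4 model replicas (DP_GROUP), each split across 2 GPUs
```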

### Activation checkpointing and offloading
<a name="model-parallel-intro-activation-offloading-checkpointing-v2"></a>

To save GPU memory, the library supports activation checkpointing to avoid storing internal activations in the GPU memory for user-specified modules during the forward pass. The library recomputes these activations during the backward pass. In addition, with activation offloading, it offloads the stored activations to CPU memory and fetches them back to GPU during the backward pass to further reduce the activation memory footprint. For more information about how to use these features, see [Activation checkpointing](model-parallel-core-features-v2-pytorch-activation-checkpointing.md) and [Activation offloading](model-parallel-core-features-v2-pytorch-activation-offloading.md).
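The trade-off can be sketched with a toy example. The code below is pure Python and purely illustrative — the real feature operates on PyTorch modules — but it shows the core idea: instead of storing every intermediate activation from the forward pass, only the input of a checkpointed segment is kept, and the intermediates are recomputed when the backward pass needs them.

```python
# Toy illustration of activation checkpointing (not the SMP implementation):
# trade recomputation for memory by caching only a segment's input.

def layer(x, w):
    return x * w  # stand-in for a neural network layer

weights = [2.0, 3.0, 5.0]

# Standard forward: store every intermediate activation (high memory).
def forward_store_all(x):
    acts = [x]
    for w in weights:
        acts.append(layer(acts[-1], w))
    return acts  # len(weights) + 1 stored activations

# Checkpointed forward: keep only the segment input (low memory).
def forward_checkpointed(x):
    return x  # only the checkpoint is stored

# During the backward pass, recompute the intermediates from the checkpoint.
def recompute_segment(x_ckpt):
    acts = [x_ckpt]
    for w in weights:
        acts.append(layer(acts[-1], w))
    return acts

stored = forward_store_all(1.0)
recomputed = recompute_segment(forward_checkpointed(1.0))
assert stored == recomputed  # same activations; memory traded for compute
```

Activation offloading goes one step further by moving even the stored checkpoints to CPU memory until they are needed.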

### Choosing the right techniques for your model
<a name="model-parallel-intro-choosing-techniques-v2"></a>

For more information about choosing the right techniques and configurations, see [SageMaker distributed model parallelism best practices](model-parallel-best-practices-v2.md).

# Supported frameworks and AWS Regions
<a name="distributed-model-parallel-support-v2"></a>

Before using the SageMaker model parallelism library v2 (SMP v2), check the supported frameworks and instance types and determine if there are enough quotas in your AWS account and AWS Region.

**Note**  
To check the latest updates and release notes of the library, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md).

## Supported frameworks
<a name="distributed-model-parallel-supported-frameworks-v2"></a>

SMP v2 supports the following deep learning frameworks, which are available through SMP Docker containers and an SMP Conda channel. When you use the framework estimator classes in the SageMaker Python SDK and specify the distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers. To use SMP v2, we recommend that you always keep the SageMaker Python SDK up to date in your development environment.

**PyTorch versions that the SageMaker model parallelism library supports**

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/distributed-model-parallel-support-v2.html)

**SMP Conda channel**

The following Amazon S3 bucket is a public Conda channel hosted by the SMP service team. If you want to install the SMP v2 library in an environment such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.

```
https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/
```

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.

**Note**  
To find previous versions of the SMP library v1.x and pre-packaged DLCs, see [Supported Frameworks](distributed-model-parallel-support.md#distributed-model-parallel-supported-frameworks) in the *SMP v1 documentation*.

### Use SMP v2 with open source libraries
<a name="distributed-model-parallel-supported-frameworks-v2-open-source"></a>

The SMP v2 library works with other PyTorch-based open source libraries such as PyTorch Lightning, Hugging Face Transformers, and Hugging Face Accelerate, because SMP v2 is compatible with the PyTorch FSDP APIs. If you have further questions on using the SMP library with other third party libraries, contact the SMP service team at `sm-model-parallel-feedback@amazon.com`.

## AWS Regions
<a name="distributed-model-parallel-availablity-zone-v2"></a>

SMP v2 is available in the following AWS Regions. If you'd like to use the SMP Docker image URIs or the SMP Conda channel, check the following list, choose the AWS Region that matches yours, and update the image URI or the channel URL accordingly.
+ ap-northeast-1
+ ap-northeast-2
+ ap-northeast-3
+ ap-south-1
+ ap-southeast-1
+ ap-southeast-2
+ ca-central-1
+ eu-central-1
+ eu-north-1
+ eu-west-1
+ eu-west-2
+ eu-west-3
+ sa-east-1
+ us-east-1
+ us-east-2
+ us-west-1
+ us-west-2

## Supported instance types
<a name="distributed-model-parallel-supported-instance-types-v2"></a>

SMP v2 requires one of the following ML instance types.


| Instance type | 
| --- | 
| ml.p4d.24xlarge | 
| ml.p4de.24xlarge | 
| ml.p5.48xlarge | 
| ml.p5e.48xlarge | 

**Tip**  
Starting with SMP v2.2.0, which supports PyTorch v2.2.0 and later, [Mixed precision training with FP8 on P5 instances using Transformer Engine](model-parallel-core-features-v2-mixed-precision.md#model-parallel-core-features-v2-mixed-precision-fp8-training-on-p5) is available.

For specs of the SageMaker machine learning instance types in general, see the **Accelerated Computing** section in the [Amazon EC2 Instance Types page](https://aws.amazon.com/ec2/instance-types/). For information about instance pricing, see [Amazon SageMaker AI Pricing](https://aws.amazon.com/sagemaker/pricing/).

If you encounter an error message similar to the following, follow the instructions at [Requesting a quota increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html) in the *AWS Service Quotas User Guide*.

```
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling
    the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge
    for training job usage' is 0 Instances, with current utilization of 0 Instances
    and a request delta of 1 Instances.
    Please contact AWS support to request an increase for this limit.
```

# Use the SageMaker model parallelism library v2
<a name="model-parallel-use-api-v2"></a>

On this page, you'll learn how to use the SageMaker model parallelism library v2 APIs and get started with running a PyTorch Fully Sharded Data Parallel (FSDP) training job in the SageMaker Training platform or on a SageMaker HyperPod cluster.

There are various scenarios for running a PyTorch training job with SMP v2.

1. For SageMaker training, use one of the pre-built SageMaker Framework Containers for PyTorch v2.0.1 and later, which are pre-packaged with SMP v2.

1. Use the SMP v2 binary file to set up a Conda environment for running a distributed training workload on a SageMaker HyperPod cluster.

1. Extend the pre-built SageMaker Framework Containers for PyTorch v2.0.1 and later to install any additional functional requirements for your use case. To learn how to extend a pre-built container, see [Extend a Pre-built Container](prebuilt-containers-extend.md).

1. You can also bring your own Docker container, manually set up the SageMaker Training environment using the [SageMaker Training toolkit](https://github.com/aws/sagemaker-training-toolkit), and install the SMP v2 binary file. This is the least recommended option due to the complexity of dependencies. To learn how to run your own Docker container, see [Adapting Your Own Training Container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html).

This getting started guide covers the first two scenarios.

**Topics**
+ [Step 1: Adapt your PyTorch FSDP training script](#model-parallel-adapt-pytorch-script-v2)
+ [Step 2: Launch a training job](#model-parallel-launch-a-training-job-v2)

## Step 1: Adapt your PyTorch FSDP training script
<a name="model-parallel-adapt-pytorch-script-v2"></a>

To activate and configure the SMP v2 library, start by importing and adding the `torch.sagemaker.init()` module at the top of the script. This module takes in the SMP configuration dictionary of [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config) that you'll prepare in [Step 2: Launch a training job](#model-parallel-launch-a-training-job-v2). Also, to use the various core features offered by SMP v2, you might need to make a few more changes to adapt your training script. More detailed instructions on adapting your training script for the SMP v2 core features are provided at [Core features of the SageMaker model parallelism library v2](model-parallel-core-features-v2.md).

------
#### [ SageMaker Training ]

In your training script, add the following two lines of code, which are the minimum required to start training with SMP v2. In [Step 2: Launch a training job](#model-parallel-launch-a-training-job-v2), you’ll set up an object of the SageMaker `PyTorch` estimator class with an SMP configuration dictionary through the `distribution` argument of the estimator class.

```
import torch.sagemaker as tsm
tsm.init()
```

**Note**  
You can also directly pass a configuration dictionary of the [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config) to the `torch.sagemaker.init()` module. However, the parameters passed to the PyTorch estimator in [Step 2: Launch a training job](#model-parallel-launch-a-training-job-v2) take priority and override the ones specified to the `torch.sagemaker.init()` module.

------
#### [ SageMaker HyperPod ]

In your training script, add the following two lines of code. In [Step 2: Launch a training job](#model-parallel-launch-a-training-job-v2), you’ll set up an `smp_config.json` file with the SMP configurations in JSON format, and upload it to a storage location or file system mapped to your SageMaker HyperPod cluster. We recommend that you keep the configuration file in the same directory where you upload your training script.

```
import torch.sagemaker as tsm
tsm.init("/dir_to_training_files/smp_config.json")
```

**Note**  
You can also directly pass a configuration dictionary of the [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config) into the `torch.sagemaker.init()` module.

------

## Step 2: Launch a training job
<a name="model-parallel-launch-a-training-job-v2"></a>

Learn how to configure SMP distribution options for launching a PyTorch FSDP training job with SMP core features.

------
#### [ SageMaker Training ]

When you set up a training job launcher object of the [PyTorch framework estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html) class in the SageMaker Python SDK, configure [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config) through `distribution` argument as follows.

**Note**  
The `distribution` configuration for SMP v2 is integrated in the SageMaker Python SDK starting from v2.200. Make sure that you use the SageMaker Python SDK v2.200 or later.

**Note**  
In SMP v2, you should configure `smdistributed` with `torch_distributed` for the `distribution` argument of the SageMaker `PyTorch` estimator. With `torch_distributed`, SageMaker AI runs `torchrun`, which is the default multi-node job launcher of [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html).

```
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    framework_version="2.2.0",
    py_version="py310",
    # image_uri="<smp-docker-image-uri>"  # For prior versions, specify the SMP image URI directly.
    entry_point="your-training-script.py", # Pass the training script you adapted with SMP from Step 1.
    ... # Configure other required and optional parameters
    distribution={
        "torch_distributed": { "enabled": True },
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "hybrid_shard_degree": Integer,
                    "sm_activation_offloading": Boolean,
                    "activation_loading_horizon": Integer,
                    "fsdp_cache_flush_warnings": Boolean,
                    "allow_empty_shards": Boolean,
                    "tensor_parallel_degree": Integer,
                    "expert_parallel_degree": Integer,
                    "random_seed": Integer
                }
            }
        }
    }
)
```

**Important**  
For using one of the prior versions of PyTorch or SMP instead of the latest, you need to specify the SMP Docker image directly using the `image_uri` argument instead of the `framework_version` and `py_version` pair. The following is an example of specifying the SMP Docker image URI directly.  

```
estimator = PyTorch(
    ...,
    image_uri="658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121"
)
```
To find SMP Docker image URIs, see [Supported frameworks](distributed-model-parallel-support-v2.md#distributed-model-parallel-supported-frameworks-v2).

------
#### [ SageMaker HyperPod ]

Before you start, make sure that the following prerequisites are met.
+ An Amazon FSx shared directory (`/fsx`) mounted to your HyperPod cluster.
+ Conda installed in the FSx shared directory. To learn how to install Conda, use the instructions at [Installing on Linux](https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html) in the *Conda User Guide*.
+ `cuda11.8` or `cuda12.1` installed on the head and compute nodes of your HyperPod cluster.

If the prerequisites are all met, proceed to the following instructions on launching a workload with SMP v2 on a HyperPod cluster.

1. Prepare an `smp_config.json` file that contains a dictionary of [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config). Make sure that you upload this JSON file to where you store your training script, or the path you specified to the `torch.sagemaker.init()` module in [Step 1](#model-parallel-adapt-pytorch-script-v2). If you’ve already passed the configuration dictionary to the `torch.sagemaker.init()` module in the training script in [Step 1](#model-parallel-adapt-pytorch-script-v2), you can skip this step. 

   ```
   // smp_config.json
   {
       "hybrid_shard_degree": Integer,
       "sm_activation_offloading": Boolean,
       "activation_loading_horizon": Integer,
       "fsdp_cache_flush_warnings": Boolean,
       "allow_empty_shards": Boolean,
       "tensor_parallel_degree": Integer,
       "expert_parallel_degree": Integer,
       "random_seed": Integer
   }
   ```

1. Upload the `smp_config.json` file to a directory in your file system. The directory path must match the path you specified in [Step 1](#model-parallel-adapt-pytorch-script-v2). If you’ve already passed the configuration dictionary to the `torch.sagemaker.init()` module in the training script, you can skip this step.

1. On the compute nodes of your cluster, start a terminal session with the following command.

   ```
   sudo su -l ubuntu
   ```

1. Create a Conda environment on the compute nodes. The following code is an example script that creates a Conda environment and installs SMP, [SMDDP](data-parallel.md), CUDA, and other dependencies.

   ```
   # Run on compute nodes
   SMP_CUDA_VER=<11.8 or 12.1>
   
   source /fsx/<path_to_miniconda>/miniconda3/bin/activate
   
   export ENV_PATH=/fsx/<path to miniconda>/miniconda3/envs/<ENV_NAME>
   conda create -p ${ENV_PATH} python=3.10
   
   conda activate ${ENV_PATH}
   
   # Verify aws-cli is installed: Expect something like "aws-cli/2.15.0*"
   aws --version
   # Install aws-cli if not already installed
   # https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#cliv2-linux-install
   
   # Install the SMP library
   conda install pytorch="2.0.1=sm_py3.10_cuda${SMP_CUDA_VER}*" packaging --override-channels \
     -c https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-2.0.0-pt-2.0.1/2023-12-11/smp-v2/ \
     -c pytorch -c numba/label/dev \
     -c nvidia -c conda-forge
   
   # Install dependencies of the script as below
   python -m pip install packaging transformers==4.31.0 accelerate ninja tensorboard h5py datasets \
       && python -m pip install expecttest hypothesis \
       && python -m pip install "flash-attn>=2.0.4" --no-build-isolation
   
   # Install the SMDDP wheel
   SMDDP_WHL="smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl" \
     && wget -q https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/${SMDDP_WHL} \
     && pip install --force ${SMDDP_WHL} \
     && rm ${SMDDP_WHL}
   
   # cuDNN installation for Transformer Engine installation for CUDA 11.8
   # Please download from below link, you need to agree to terms 
   # https://developer.nvidia.com/downloads/compute/cudnn/secure/8.9.5/local_installers/11.x/cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz
   
   tar xf cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz \
       && rm -rf /usr/local/cuda-$SMP_CUDA_VER/include/cudnn* /usr/local/cuda-$SMP_CUDA_VER/lib/cudnn* \
       && cp ./cudnn-linux-x86_64-8.9.5.30_cuda11-archive/include/* /usr/local/cuda-$SMP_CUDA_VER/include/ \
       && cp ./cudnn-linux-x86_64-8.9.5.30_cuda11-archive/lib/* /usr/local/cuda-$SMP_CUDA_VER/lib/ \
       && rm -rf cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz \
       && rm -rf cudnn-linux-x86_64-8.9.5.30_cuda11-archive/
   
   # Please download from below link, you need to agree to terms 
   # https://developer.download.nvidia.com/compute/cudnn/secure/8.9.7/local_installers/12.x/cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz \
   # cuDNN installation for TransformerEngine installation for cuda12.1
   tar xf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz \
       && rm -rf /usr/local/cuda-$SMP_CUDA_VER/include/cudnn* /usr/local/cuda-$SMP_CUDA_VER/lib/cudnn* \
       && cp ./cudnn-linux-x86_64-8.9.7.29_cuda12-archive/include/* /usr/local/cuda-$SMP_CUDA_VER/include/ \
       && cp ./cudnn-linux-x86_64-8.9.7.29_cuda12-archive/lib/* /usr/local/cuda-$SMP_CUDA_VER/lib/ \
       && rm -rf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz \
       && rm -rf cudnn-linux-x86_64-8.9.7.29_cuda12-archive/
       
   # TransformerEngine installation
   export CUDA_HOME=/usr/local/cuda-$SMP_CUDA_VER
   export CUDNN_PATH=/usr/local/cuda-$SMP_CUDA_VER/lib
   export CUDNN_LIBRARY=/usr/local/cuda-$SMP_CUDA_VER/lib
   export CUDNN_INCLUDE_DIR=/usr/local/cuda-$SMP_CUDA_VER/include
   export PATH=/usr/local/cuda-$SMP_CUDA_VER/bin:$PATH
   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-$SMP_CUDA_VER/lib
   
   python -m pip install --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v1.0
   ```

1. Run a test training job.

   1. In the shared file system (`/fsx`), clone the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/), and go to the `3.test_cases/11.modelparallel` folder.

      ```
      git clone https://github.com/aws-samples/awsome-distributed-training/
      cd awsome-distributed-training/3.test_cases/11.modelparallel
      ```

   1. Submit a job using `sbatch` as follows.

      ```
      conda activate <ENV_PATH>
      sbatch -N 16 conda_launch.sh
      ```

      If the job submission is successful, the output message of this `sbatch` command should be similar to `Submitted batch job ABCDEF`.

   1. Check the log file in the current directory under `logs/`.

      ```
      tail -f ./logs/fsdp_smp_ABCDEF.out
      ```

------

# Core features of the SageMaker model parallelism library v2
<a name="model-parallel-core-features-v2"></a>

The Amazon SageMaker AI model parallelism library v2 (SMP v2) offers distribution strategies and memory-saving techniques, such as sharded data parallelism, tensor parallelism, and checkpointing. The model parallelism strategies and techniques offered by SMP v2 help distribute large models across multiple devices while optimizing training speed and memory consumption. SMP v2 also provides the Python package `torch.sagemaker` to help you adapt your training script with only a few lines of code change.

This guide follows the basic two-step flow introduced in [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md). To dive deep into the core features of SMP v2 and how to use them, see the following topics.

**Note**  
These core features are available in SMP v2.0.0 and later and the SageMaker Python SDK v2.200.0 and later, and work with PyTorch v2.0.1 and later. To check the versions of the packages, see [Supported frameworks and AWS Regions](distributed-model-parallel-support-v2.md).

**Topics**
+ [Hybrid sharded data parallelism](model-parallel-core-features-v2-sharded-data-parallelism.md)
+ [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md)
+ [Context parallelism](model-parallel-core-features-v2-context-parallelism.md)
+ [Compatibility with the SMDDP library optimized for AWS infrastructure](model-parallel-core-features-v2-smddp-allgather.md)
+ [Mixed precision training](model-parallel-core-features-v2-mixed-precision.md)
+ [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md)
+ [Activation checkpointing](model-parallel-core-features-v2-pytorch-activation-checkpointing.md)
+ [Activation offloading](model-parallel-core-features-v2-pytorch-activation-offloading.md)
+ [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md)
+ [Fine-tuning](model-parallel-core-features-v2-fine-tuning.md)
+ [FlashAttention](model-parallel-core-features-v2-flashattention.md)
+ [Checkpointing using SMP](model-parallel-core-features-v2-checkpoints.md)

# Hybrid sharded data parallelism
<a name="model-parallel-core-features-v2-sharded-data-parallelism"></a>

*Sharded data parallelism* is a memory-saving distributed training technique that splits the state of a model (model parameters, gradients, and optimizer states) across devices. This helps you fit a larger model or increase the batch size using the freed-up GPU memory. The SMP library offers the capability to run sharded data parallelism with PyTorch Fully Sharded Data Parallel (FSDP). By default, PyTorch FSDP shards across the whole set of GPUs being used. In SMP v2, the library offers sharded data parallelism on top of PyTorch FSDP by extending PyTorch hybrid sharding (`HYBRID_SHARD`), which is one of the [sharding strategies provided by PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy): `FULL_SHARD`, `SHARD_GRAD_OP`, `HYBRID_SHARD`, and `_HYBRID_SHARD_ZERO2`. Extending hybrid sharding in this manner helps implement scale-aware sharding as described in the blog [Near-linear scaling of gigantic-model training on AWS](https://www.amazon.science/blog/near-linear-scaling-of-gigantic-model-training-on-aws) for PyTorch FSDP.

The SMP library makes it easy to use `HYBRID_SHARD` and `_HYBRID_SHARD_ZERO2` across any configurable number of GPUs, extending native PyTorch FSDP, which supports sharding across a single node (`HYBRID_SHARD`) or all GPUs (`FULL_SHARD`). PyTorch FSDP calls can stay as is, and you only need to add the `hybrid_shard_degree` argument to the SMP configuration, as shown in the following code example. You don't need to change the value of the `sharding_strategy` argument in the PyTorch FSDP wrapper around your PyTorch model; you can pass `ShardingStrategy.HYBRID_SHARD` as the value. Alternatively, the SMP library overrides the strategy in the script and sets it to `ShardingStrategy.HYBRID_SHARD` if you specify a value equal to or greater than 2 for the `hybrid_shard_degree` parameter.

The following code snippets show how to add the SMP initialization module `torch.sagemaker.init()` to your training script and set up the SMP configuration dictionary in JSON format for the training job launcher while following the two-step process introduced in [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md). You don’t need to make any changes to your PyTorch model or [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp) configuration. For more information about the `hybrid_shard_degree` parameter, see [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config).

**SMP configuration dictionary**

```
{ "hybrid_shard_degree": 16 }
```

**In training script**

```
import torch.sagemaker as tsm
tsm.init()

# Set up a PyTorch model
model = ...

# Wrap the PyTorch model using the PyTorch FSDP module
model = FSDP(
    model,
    ...
)

# Optimizer needs to be created after FSDP wrapper
optimizer = ...
```
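To see what `"hybrid_shard_degree": 16` implies for cluster layout, consider the following illustrative arithmetic. It assumes the semantics described above: model state is sharded within each group of `hybrid_shard_degree` GPUs, and those sharded copies are replicated data-parallel across the rest of the cluster.

```python
# Illustrative layout arithmetic for "hybrid_shard_degree": 16 on 64 GPUs,
# e.g., 8 x ml.p4d.24xlarge instances with 8 GPUs each (assumed semantics:
# model state is sharded within each group and replicated across groups).
total_gpus = 64
hybrid_shard_degree = 16

assert total_gpus % hybrid_shard_degree == 0  # the degree must divide the cluster
replication_groups = total_gpus // hybrid_shard_degree
print(replication_groups)  # 4 sharded replicas of the model state
```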

# Expert parallelism
<a name="model-parallel-core-features-v2-expert-parallelism"></a>

A *Mixture of Experts* (MoE) model is a type of transformer model that employs a *sparse* approach, making it lighter for training compared to training traditional dense models. In this MoE neural network architecture, only a subset of the model's components called *experts* are utilized for each input. This approach offers several advantages, including more efficient training and faster inference, even with a larger model size. In other words, with the same compute budget for training a full dense model, you can fit a larger model or dataset when using MoE.

An MoE model consists of multiple *experts*, each consisting of a neural network, typically a feed-forward network (FFN). A gate network called a *router* determines which tokens are sent to which expert. These experts specialize in processing specific aspects of the input data, enabling the model to train faster and at lower compute cost, while achieving the same performance quality as its counterpart dense model. To learn more about Mixture of Experts in general, refer to the blog [Applying Mixture of Experts in LLM Architectures](https://developer.nvidia.com/blog/applying-mixture-of-experts-in-llm-architectures/) on the *NVIDIA developer website*.
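A toy top-1 router can illustrate the dispatch mechanism. The code below is pure Python and entirely illustrative — not the Megatron or SMP implementation — with made-up experts and a made-up scoring rule:

```python
# Toy top-1 MoE routing sketch (illustrative, not the SMP/Megatron code):
# the router scores each token against each expert and dispatches the
# token to the highest-scoring expert's feed-forward network.
experts = [lambda x: x + 10, lambda x: x * 2]  # stand-ins for expert FFNs

def router_scores(token):
    # Hypothetical scoring rule: even tokens prefer expert 0, odd prefer 1.
    return [1.0, 0.0] if token % 2 == 0 else [0.0, 1.0]

def moe_layer(tokens):
    outputs = []
    for t in tokens:
        scores = router_scores(t)
        expert_id = scores.index(max(scores))  # top-1 dispatch
        outputs.append(experts[expert_id](t))
    return outputs

print(moe_layer([2, 3, 4]))  # [12, 6, 14]
```

Because each token activates only one expert, only a fraction of the model's parameters participate in any given forward pass — this is the sparsity that makes MoE training cheaper than an equivalently sized dense model.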

*Expert parallelism* is a type of parallelism that handles splitting experts of an MoE model across GPU devices.

SMP v2 integrates with [NVIDIA Megatron](https://github.com/NVIDIA/Megatron-LM) for implementing expert parallelism to support training MoE models, and runs on top of PyTorch FSDP APIs. You keep using your PyTorch FSDP training code as is and activate SMP expert parallelism for training MoE models.

## Hugging Face Transformer models compatible with SMP expert parallelism
<a name="model-parallel-core-features-v2-expert-parallelism-supported-models"></a>

SMP v2 currently offers expert parallelism support for the following Hugging Face transformer models.
+ [Mixtral](https://huggingface.co/docs/transformers/en/model_doc/mixtral)

## Configure expert parallelism
<a name="model-parallel-core-features-v2-expert-parallelism-configure"></a>

For `expert_parallel_degree`, you select a value for the degree of expert parallelism. The value must evenly divide the number of GPUs in your cluster. For example, to shard your model while using an instance with 8 GPUs, choose 2, 4, or 8. We recommend that you start with a small number, and gradually increase it until the model fits in the GPU memory.
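For example, the valid degrees for a hypothetical 8-GPU instance can be enumerated as follows (an illustrative check, not part of the SMP API):

```python
# Enumerate valid expert_parallel_degree values for an 8-GPU instance:
# the degree must evenly divide the number of GPUs.
num_gpus = 8
valid_degrees = [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]
print(valid_degrees)  # [1, 2, 4, 8]
```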

The following code snippets show how to add the SMP initialization module `torch.sagemaker.init()` to your training script and set up the SMP configuration dictionary in JSON format for the training job launcher while following the two-step process introduced in [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md). You don’t need to make any changes to your PyTorch model or [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp) configuration. For more information about the `expert_parallel_degree` parameter, see [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config).

**Note**  
You can use expert parallelism with [Hybrid sharded data parallelism](model-parallel-core-features-v2-sharded-data-parallelism.md). Note that expert parallelism is currently not compatible with tensor parallelism.

**Note**  
This expert parallelism training feature is available in the following combination of SageMaker and PyTorch libraries:  
SMP v2.3.0 and later  
The SageMaker Python SDK v2.214.4 and later  
PyTorch v2.2.0 and later

### In your training script
<a name="model-parallel-core-features-v2-expert-parallelism-configure-in-script"></a>

As part of [Step 1](model-parallel-use-api-v2.md#model-parallel-adapt-pytorch-script-v2), initialize your script with `torch.sagemaker.init()` to activate SMP v2 and wrap your model with the [`torch.sagemaker.transform`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-transform) API, adding the `config` parameter to the API to activate MoE. The following code snippet shows how to activate SMP MoE for the generic model class `AutoModelForCausalLM` pulling an MoE transformer model configuration using the `from_config` method for training from scratch, or the `from_pretrained` method for fine-tuning. To learn more about the SMP `MoEConfig` class, see [`torch.sagemaker.moe.moe_config.MoEConfig`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-moe).

```
# Import the torch.sagemaker.transform API and initialize.
import torch.sagemaker as tsm
tsm.init()

# Import transformers AutoModelForCausalLM class.
from transformers import AutoModelForCausalLM

# Import the SMP-implementation of MoE configuration class.
from torch.sagemaker.moe.moe_config import MoEConfig

# Define a transformer model with an MoE model configuration
model = AutoModelForCausalLM.from_config(MoEModelConfig)

# Wrap it by torch.sagemaker.transform with the SMP MoE configuration.
model = tsm.transform(
    model, 
    config=MoEConfig(
        smp_moe=True,
        random_seed=12345,
        moe_load_balancing="sinkhorn",
        global_token_shuffle=False,
        moe_all_to_all_dispatcher=True,
        moe_aux_loss_coeff=0.001,
        moe_z_loss_coeff=0.001
    )
)
```

### SMP configuration
<a name="model-parallel-core-features-v2-expert-parallelism-configure-in-estimator-config"></a>

As part of [Step 2](model-parallel-use-api-v2.md#model-parallel-launch-a-training-job-v2), add the following parameter to the SMP configuration dictionary for the SageMaker PyTorch estimator.

```
{   
    ..., # other SMP config parameters
    "expert_parallel_degree": 8
}
```

# Context parallelism
<a name="model-parallel-core-features-v2-context-parallelism"></a>

*Context parallelism* is a type of model parallelism that partitions the model activations along the sequence dimension. Unlike other [sequence parallelism](https://arxiv.org/abs/2205.05198) techniques, which only partition the `LayerNorm` and `RMSNorm`, context parallelism partitions the network inputs and all intermediate activations along the sequence dimension. 

SMP v2 integrates with [Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine/index.html) for context parallelism and can be used in conjunction with PyTorch FSDP and SMP [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md). You can enable all three parallelisms simultaneously for model training. Context parallelism is beneficial for training models with large activation sizes and long sequence lengths. It accelerates the computation of attention scores and attention outputs by allowing each device to compute only a part of the scores and outputs along the sequence dimension. While tensor parallelism also accelerates computation through partitioning along the hidden dimension, the advantage of context parallelism is more substantial because computational requirements increase quadratically with the sequence dimension.
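
The quadratic scaling can be made concrete with a back-of-the-envelope cost model. The following sketch (illustrative only, not SMP code) estimates the floating-point operations of the attention-score computation and shows how context parallelism divides the per-device share:

```python
# Illustrative cost model (not SMP code): self-attention score
# computation scales quadratically with sequence length. With context
# parallelism of degree cp, each device holds seq_len / cp of the
# sequence, so the per-device share of the quadratic term shrinks by
# a factor of cp.
def attention_score_flops(seq_len, hidden_dim):
    # Q @ K^T: (seq_len x hidden_dim) @ (hidden_dim x seq_len)
    return 2 * seq_len * seq_len * hidden_dim

def per_device_flops(seq_len, hidden_dim, cp_degree):
    # Each device computes the score-matrix rows for its own sequence
    # chunk against the full sequence.
    return attention_score_flops(seq_len, hidden_dim) // cp_degree

total = attention_score_flops(8192, 4096)
print(per_device_flops(8192, 4096, cp_degree=4) == total // 4)  # True
```

Doubling the sequence length quadruples the score-computation cost, which is why partitioning along the sequence dimension pays off more than partitioning along the hidden dimension at long contexts.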

## Hugging Face Transformer models compatible with SMP context parallelism
<a name="model-parallel-core-features-v2-context-parallelism-supported-models"></a>

SMP v2 currently offers context parallelism support for the following Hugging Face transformer models.
+ GPT-NeoX
+ Llama 2 and Llama 3
+ [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.3)

## Configure context parallelism
<a name="model-parallel-core-features-v2-context-parallelism-configure"></a>

Set the `context_parallel_degree` parameter to an integer value that evenly divides the number of GPUs in your cluster. For example, if you have an 8-GPU instance, use 2, 4, or 8 for `context_parallel_degree`. We recommend starting with a small `context_parallel_degree` value and gradually increasing it until the model fits in the GPU memory with the required input sequence length.

The following code snippets show how to add the SMP initialization module `torch.sagemaker.init()` to your training script and set up the SMP configuration dictionary in JSON format for training job launcher while following the two-step process introduced in [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md). You don’t need to make any changes to your PyTorch model or [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp) configuration. For more information about the `context_parallel_degree` parameter, see [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config).

### In your training script
<a name="model-parallel-core-features-v2-context-parallelism-configure-in-script"></a>

As part of [Step 1](model-parallel-use-api-v2.md#model-parallel-adapt-pytorch-script-v2), initialize your script with `torch.sagemaker.init()` to activate SMP v2 and wrap your model with the [`torch.sagemaker.transform`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-transform) API. 

Starting from SMP v2.6.0, you can use the argument `cp_comm_type` to determine which context parallelism implementation to use. The SMP library currently supports two implementations: `p2p` and `all_gather`. The `p2p` implementation uses peer-to-peer send-receive calls for key-value accumulation during the attention computation and runs asynchronously, allowing overlap with compute. The `all_gather` implementation, by contrast, uses the `AllGather` collective operation and runs synchronously.

```
import torch.sagemaker as tsm
tsm.init()

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_config(..)
model = tsm.transform(model, cp_comm_type="p2p")
```
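
The two `cp_comm_type` strategies can be contrasted with a toy, single-process simulation. This is purely conceptual, not the SMP implementation: it only shows that a ring of peer-to-peer exchanges and a single all-gather both leave every rank with the complete set of key-value chunks.

```python
# Conceptual sketch (not the SMP implementation) contrasting the two
# cp_comm_type strategies for assembling key-value chunks across ranks.
def all_gather_kv(local_kv_chunks):
    # all_gather: every rank receives every chunk in one synchronous
    # collective; the result order matches rank order.
    full = list(local_kv_chunks)
    return [list(full) for _ in local_kv_chunks]

def p2p_ring_kv(local_kv_chunks):
    # p2p: each rank passes its chunk around a ring, accumulating one
    # additional chunk per step; in the real implementation each send
    # can overlap with attention compute on already-received chunks.
    n = len(local_kv_chunks)
    gathered = [[chunk] for chunk in local_kv_chunks]
    for step in range(1, n):
        for rank in range(n):
            src = (rank - step) % n
            gathered[rank].append(local_kv_chunks[src])
    return gathered

chunks = ["kv0", "kv1", "kv2", "kv3"]
# Both strategies deliver the same set of chunks to every rank.
assert all(sorted(g) == sorted(chunks) for g in p2p_ring_kv(chunks))
assert all_gather_kv(chunks) == [chunks] * 4
```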

### SMP configuration
<a name="model-parallel-core-features-v2-context-parallelism-configure-in-estimator"></a>

As part of [Step 2](model-parallel-use-api-v2.md#model-parallel-launch-a-training-job-v2), add the following parameter to the SMP configuration dictionary for the SageMaker PyTorch estimator.

```
{   
    ..., # other SMP config parameters
    "context_parallel_degree": 2
}
```

# Compatibility with the SMDDP library optimized for AWS infrastructure
<a name="model-parallel-core-features-v2-smddp-allgather"></a>

You can use the SageMaker model parallelism library v2 (SMP v2) in conjunction with the [SageMaker distributed data parallelism (SMDDP) library](data-parallel.md), which offers the `AllGather` collective communication operation optimized for AWS infrastructure. In distributed training, collective communication operations synchronize multiple GPU workers and exchange information between them. `AllGather` is one of the core collective communication operations typically used in sharded data parallelism. To learn more about the SMDDP `AllGather` operation, see [SMDDP `AllGather` collective operation](data-parallel-intro.md#data-parallel-allgather). Optimizing such collective communication operations contributes directly to faster end-to-end training without side effects on convergence.
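
As a conceptual illustration of the `AllGather` semantics (plain Python, not the SMDDP implementation): each rank contributes its shard, and every rank receives the full, rank-ordered concatenation.

```python
# Minimal sketch of the AllGather collective used in sharded data
# parallelism: each rank holds one shard of a parameter, and AllGather
# reconstructs the full parameter on every rank. This illustrates the
# semantics only; SMDDP provides an AWS-optimized implementation.
def all_gather(shards):
    # Every rank ends up with the concatenation of all shards,
    # ordered by rank.
    full_param = [x for shard in shards for x in shard]
    return [list(full_param) for _ in shards]

# An 8-element parameter sharded across 4 ranks.
shards = [[0, 1], [2, 3], [4, 5], [6, 7]]
gathered = all_gather(shards)
assert all(g == [0, 1, 2, 3, 4, 5, 6, 7] for g in gathered)
```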

**Note**  
The SMDDP library supports P4 and P4de instances (see also [Supported frameworks, AWS Regions, and instances types](distributed-data-parallel-support.md) by the SMDDP library).

The SMDDP library integrates natively with PyTorch through the [process group](https://pytorch.org/docs/stable/distributed.html) layer. To use the SMDDP library, you only need to add two lines of code to your training script. It supports training frameworks such as the SageMaker model parallelism library, PyTorch FSDP, and DeepSpeed.

To activate SMDDP and use its `AllGather` operation, you need to add two lines of code to your training script as part of [Step 1: Adapt your PyTorch FSDP training script](model-parallel-use-api-v2.md#model-parallel-adapt-pytorch-script-v2). Note that you need to initialize PyTorch Distributed with the SMDDP backend first, and then run the SMP initialization.

```
import torch.distributed as dist

# Initialize with SMDDP
import smdistributed.dataparallel.torch.torch_smddp
dist.init_process_group(backend="smddp") # Replacing "nccl"

# Initialize with SMP
import torch.sagemaker as tsm
tsm.init()
```

[SageMaker Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) for PyTorch (see also [Supported frameworks and AWS Regions](distributed-model-parallel-support-v2.md) by SMP v2 and [Supported frameworks, AWS Regions, and instances types](distributed-data-parallel-support.md) by the SMDDP library) are pre-packaged with the SMP binary and the SMDDP binary. To learn more about the SMDDP library, see [Run distributed training with the SageMaker AI distributed data parallelism library](data-parallel.md). 

# Mixed precision training
<a name="model-parallel-core-features-v2-mixed-precision"></a>

The SageMaker model parallelism (SMP) library v2 supports mixed precision training out of the box by integrating with open source frameworks such as PyTorch FSDP and Transformer Engine. To learn more, see the following topics.

**Topics**
+ [Mixed precision training with FP8 on P5 instances using Transformer Engine](#model-parallel-core-features-v2-mixed-precision-fp8-training-on-p5)
+ [Mixed precision training with half-precision data types using PyTorch FSDP](#model-parallel-core-features-v2-mixed-precision-half-precision)

## Mixed precision training with FP8 on P5 instances using Transformer Engine
<a name="model-parallel-core-features-v2-mixed-precision-fp8-training-on-p5"></a>

Starting from the SageMaker model parallelism (SMP) library v2.2.0, the SMP library integrates with [Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine/index.html) and supports [FP8 mixed precision training](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html) out of the box, keeping compatibility with [PyTorch FSDP `MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision). This means that you can use both PyTorch FSDP for mixed precision training and Transformer Engine for FP8 training. For model layers not supported by Transformer Engine's FP8 training feature, those layers fall back to PyTorch FSDP mixed precision.

**Note**  
SMP v2 offers FP8 support for the following Hugging Face Transformer models:  
GPT-NeoX (available in SMP v2.2.0 and later)
Llama 2 (available in SMP v2.2.0 and later)
Mixtral 8x7b and Mixtral 8x22b (available in SMP v2.5.0 and later)

**Note**  
This FP8 training feature on P5 instances is available with the following combination of SageMaker libraries and PyTorch:  
The SageMaker Python SDK v2.212.0 and later
PyTorch v2.2.0 and later

*FP8* (8-bit floating point precision) is a data type that has emerged as another paradigm to accelerate deep learning training of LLMs. With the release of NVIDIA H100 GPUs supporting FP8 data types, you can benefit from the performance improvements of P5 instances equipped with H100 GPUs while accelerating distributed training with FP8 mixed precision training.

The FP8 data type comes in two formats, E4M3 and E5M2. *E4M3* offers better precision but a limited dynamic range, and is ideal for the forward pass in model training. *E5M2* has a broader dynamic range but reduced precision, and is better suited for the backward pass, where precision is less critical and a wider dynamic range becomes beneficial. Hence, we recommend that you use the [hybrid FP8 strategy recipe](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#FP8-recipe) to leverage these characteristics effectively.

For half-precision data types (FP16 and BF16), global loss-scaling techniques such as static loss-scaling or dynamic loss-scaling handle convergence issues that arise from information loss due to rounding gradients in half-precision. However, the dynamic range of FP8 is even narrower, and global loss-scaling techniques are not sufficient; a finer-grained, per-tensor scaling technique is needed. *Delayed scaling* is a strategy that selects a scaling factor based on the maximum absolute values observed in tensors from a number of previous iterations. The trade-off in this strategy is that it achieves the full performance benefits of FP8 computation but requires memory to keep the maximum-value history of the tensors. To learn more about the delayed scaling strategy in general, see the paper [FP8 Formats for Deep Learning](https://arxiv.org/pdf/2209.05433.pdf).
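
To make the idea concrete, the following is a conceptual sketch of delayed scaling in plain Python, not the Transformer Engine implementation. It keeps a bounded history of per-tensor `amax` (maximum absolute value) entries and derives the scaling factor from the history maximum, analogous to the `amax_history_len` and `amax_compute_algo="max"` options of Transformer Engine's `DelayedScaling` recipe.

```python
# Conceptual sketch of delayed scaling (not the Transformer Engine
# implementation): keep a history of per-tensor amax values from
# previous iterations and derive the FP8 scaling factor from the
# history maximum.
E4M3_MAX = 448.0  # largest representable magnitude in the E4M3 format

class DelayedScaler:
    def __init__(self, history_len=32):
        self.history_len = history_len
        self.amax_history = []

    def update(self, tensor_values):
        amax = max(abs(v) for v in tensor_values)
        self.amax_history.append(amax)
        # Keep only the most recent history_len entries.
        self.amax_history = self.amax_history[-self.history_len:]

    def scale(self):
        # Scale chosen so that the historical amax maps to the FP8 max.
        return E4M3_MAX / max(self.amax_history)

scaler = DelayedScaler(history_len=4)
for step_values in ([0.5, -2.0], [1.0, 3.0], [0.25, -1.5]):
    scaler.update(step_values)
print(scaler.scale())  # 448.0 / 3.0
```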

In practice, using FP8 is helpful in all training scenarios on P5 instances. We strongly recommend enabling FP8 whenever possible to enhance training performance.

SMP v2 supports Transformer Engine out of the box. Therefore, when running FP8 training with SMP v2 on P5 instances of SageMaker AI (`ml.p5.48xlarge`), the only thing you need to do is to import `torch.sagemaker` in your training script and keep using the native Transformer Engine Python package. To learn more about using Transformer Engine for FP8 training in general, see [Using FP8 with Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html) in the *NVIDIA Transformer Engine documentation*. The following code snippet shows how the code lines for importing the SMP library and setting up FP8 in your training script should look.

```
import torch.sagemaker as tsm
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Initialize the SMP torch.sagemaker API.
tsm.init()

# Define a transformer model and wrap it with the torch.sagemaker.transform API.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_config(ModelConfig)
model = tsm.transform(model)

# Enable E4M3 during forward pass, E5M2 during backward pass.
fp8_format = Format.HYBRID

# Create an FP8 recipe.
fp8_recipe = DelayedScaling(fp8_format=fp8_format, amax_history_len=32, amax_compute_algo="max")

# Enable FP8 autocasting.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe, fp8_group=tsm.state.world_process_group):
    out = model(inp)

loss = out.sum()
loss.backward()
```

To find a practical example of FP8 training with SMP v2 on P5 instances, see the example notebook at [Accelerate SageMaker PyTorch FSDP Training of Llama-v2 (or GPT-NeoX) with FP8 on P5 instances](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel_v2/llama_v2/smp-train-llama-fsdp-tp-fp8.ipynb).

## Mixed precision training with half-precision data types using PyTorch FSDP
<a name="model-parallel-core-features-v2-mixed-precision-half-precision"></a>

SMP v2 supports [PyTorch FSDP `MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision) for training jobs on P4 and P5 instances. PyTorch FSDP provides various configurations for mixed precision for both performance improvement and memory reduction. 

**Note**  
This mixed precision training feature with PyTorch FSDP is available with the following combination of SageMaker libraries and PyTorch.  
SMP v2.0.0 and later
the SageMaker Python SDK v2.200.0 and later
PyTorch v2.0.1 and later

The standard way to configure a model for mixed precision is to create the model in `float32`, and then allow FSDP to cast the parameters to `float16` or `bfloat16` on the fly by passing a `MixedPrecision` policy, as shown in the following code snippet. For more information about options to change the `dtype` for parameters, reduction, or buffers for mixed precision in PyTorch, see [PyTorch FSDP `MixedPrecision` API](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision) in the *PyTorch documentation*.

```
# Native PyTorch API
from torch.distributed.fsdp import MixedPrecision

dtype = torch.bfloat16
mixed_precision_policy = MixedPrecision(
    param_dtype=dtype, reduce_dtype=dtype, buffer_dtype=dtype
)

model = FSDP(
    model,
    ...,
    mixed_precision=mixed_precision_policy
)
```

Note that certain models (such as the Hugging Face Transformers Llama model) expect buffers as `float32`. To keep buffers in `float32`, set `buffer_dtype=torch.float32` in the `MixedPrecision` policy while leaving `param_dtype` and `reduce_dtype` as `torch.bfloat16`.

# Delayed parameter initialization
<a name="model-parallel-core-features-v2-delayed-param-init"></a>

Initializing a large model for training is not always possible with limited GPU memory. To resolve this problem of insufficient GPU memory, you can initialize the model in CPU memory. However, for larger models with more than 20 or 40 billion parameters, even CPU memory might not be enough. For such cases, we recommend that you initialize the model on what PyTorch calls a *meta device*, which allows the creation of tensors without any data attached to them. A tensor on a meta device only needs the shape information, which makes it possible to create a large model with its parameters on meta devices. [Hugging Face Accelerate](https://huggingface.co/docs/accelerate/index) provides the context manager `init_empty_weights` to help create such a model on meta devices while initializing the buffers on a regular device.

Before training starts, PyTorch FSDP initializes the model parameters. The delayed parameter initialization feature of SMP v2 delays this creation of model parameters until after PyTorch FSDP performs parameter sharding. PyTorch FSDP accepts a parameter initialization function (`param_init_fn`) when sharding the modules, and it calls `param_init_fn` for each module. The `param_init_fn` API takes a module as an argument and initializes all the parameters in it, excluding the parameters of any child module. Note that this behavior *differs* from native PyTorch v2.0.1, which has a bug that causes the parameters to be initialized multiple times.

SMP v2 provides the [`torch.sagemaker.delayed_param.DelayedParamIniter`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-delayed-param-init) API for applying delayed parameter initialization.

The following code snippets show how to apply the `torch.sagemaker.delayed_param.DelayedParamIniter` API to your training script.

Assume that you have a PyTorch FSDP training script as follows.

```
# Creation of model on meta device
from accelerate import init_empty_weights
with init_empty_weights():
    model = create_model()

# Define a param init fn, below is an example for Hugging Face GPTNeoX.
def init_weights(module):
    d = torch.cuda.current_device()
    # Note that the following doesn't work if you have buffers in the model;
    # buffers will need to be reinitialized after this call.
    module.to_empty(device=d, recurse=False)
    if isinstance(module, (nn.Linear, Conv1D)):
        module.weight.data.normal_(mean=0.0, std=args.initializer_range)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, nn.Embedding):
        module.weight.data.normal_(mean=0.0, std=args.initializer_range)
        if module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()
    elif isinstance(module, nn.LayerNorm):
        module.bias.data.zero_()
        module.weight.data.fill_(1.0)

# Changes to FSDP wrapper.
model = FSDP(
    model,
    ...,
    param_init_fn=init_weights
)

# At this point model is initialized and sharded for sharded data parallelism.
```

Note that the delayed parameter initialization approach is not model-agnostic. You need to write an `init_weights` function, as shown in the preceding example, that matches the initialization in the original model definition and covers all the parameters of the model. To simplify the process of preparing such an `init_weights` function, SMP v2 implements this initialization function for the following models: GPT-2, GPT-J, GPT-NeoX, and Llama from Hugging Face Transformers. The `torch.sagemaker.delayed_param.DelayedParamIniter` API also works with the SMP tensor parallel implementation, the `torch.sagemaker.tensor_parallel.transformer.TransformerLMHead` model, which you can apply after the [`torch.sagemaker.transform`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-transform) API call.

Using the `torch.sagemaker.delayed_param.DelayedParamIniter` API, you can adapt your PyTorch FSDP script as follows. After creating a model with empty weights, create a `DelayedParamIniter` object for the model, and pass the object's parameter initialization function to the `param_init_fn` argument of the PyTorch FSDP class.

```
from torch.sagemaker.delayed_param import DelayedParamIniter
from accelerate import init_empty_weights

with init_empty_weights():
    model = create_model()
    
delayed_initer = DelayedParamIniter(model)

with delayed_initer.validate_params_and_buffers_inited():
    model = FSDP(
        model,
        ...,
        param_init_fn=delayed_initer.get_param_init_fn()
    )
```

**Notes on tied weights**

When training models with tied weights, take special care to tie the weights after initializing them with delayed parameter initialization. PyTorch FSDP does not have a mechanism to tie the weights after initializing them using `param_init_fn` as above. To address such cases, we added an API that accepts a `post_init_hook_fn`, which you can use to tie the weights. You can pass any function that accepts the module as an argument, but `DelayedParamIniter` also provides a predefined `post_param_init_fn`, which calls the `tie_weights` method of the module if it exists. Note that it's always safe to pass in `post_param_init_fn`, even if the module has no `tie_weights` method.

```
with delayed_initer.validate_params_and_buffers_inited():
    model = FSDP(
        model,
        ...,
        param_init_fn=delayed_initer.get_param_init_fn(),
        post_param_init_fn=delayed_initer.get_post_param_init_fn()
    )
```

# Activation checkpointing
<a name="model-parallel-core-features-v2-pytorch-activation-checkpointing"></a>

*Activation checkpointing* is a technique to reduce memory usage by clearing activations of certain layers and recomputing them during the backward pass. Effectively, this trades extra computation time for reducing memory usage. If a module is checkpointed, at the end of a forward pass, only the initial inputs to the module and final outputs from the module stay in memory. PyTorch releases any intermediate tensors that are part of the computation inside that module during the forward pass. During the backward pass of the checkpointed modules, PyTorch recomputes these tensors. At this point, the layers beyond this checkpointed module have finished their backward pass, so the peak memory usage with checkpointing becomes lower.
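
The trade-off can be illustrated with a toy example in plain Python (not PyTorch internals): the checkpointed forward keeps only the module input, and the backward pass recomputes the intermediate activations from it.

```python
# Toy illustration (not PyTorch internals) of the activation
# checkpointing trade-off: a checkpointed module keeps only its input,
# and intermediate activations are recomputed during the backward pass.
def forward_layer(x):
    return x * 2  # stand-in for an arbitrary layer computation

def checkpointed_forward(x, num_layers):
    # Keep only the module input; intermediates are discarded.
    saved = {"input": x}
    out = x
    for _ in range(num_layers):
        out = forward_layer(out)
    return out, saved

def backward_with_recompute(saved, num_layers):
    # Recompute the intermediate activations from the saved input
    # before running the (omitted) gradient computation.
    acts = [saved["input"]]
    for _ in range(num_layers):
        acts.append(forward_layer(acts[-1]))
    return acts

out, saved = checkpointed_forward(3, num_layers=3)
acts = backward_with_recompute(saved, num_layers=3)
# Only one tensor was kept in memory, yet the backward pass
# recovers every intermediate activation.
assert out == 24 and acts == [3, 6, 12, 24] and len(saved) == 1
```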

SMP v2 supports the PyTorch activation checkpointing module; see [Activation Checkpointing](https://pytorch.org/blog/scaling-multimodal-foundation-models-in-torchmultimodal-with-pytorch-distributed/#activation-checkpointing) in the PyTorch blog. The following are examples of activation checkpointing of the Hugging Face GPT-NeoX model.

**Checkpointing Transformer layers of the Hugging Face GPT-NeoX model**

```
from transformers.models.gpt_neox import GPTNeoXLayer
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing
)
    
# check_fn receives a module as the arg, 
# and it needs to return whether the module is to be checkpointed
def is_transformer_layer(module):
    return isinstance(module, GPTNeoXLayer)
    
apply_activation_checkpointing(model, check_fn=is_transformer_layer)
```

**Checkpointing every other Transformer layer of the Hugging Face GPT-NeoX model**

```
# check_fn receives a module as arg, 
# and it needs to return whether the module is to be checkpointed
# here we define that function based on global variable (transformer_layers)
from transformers.models.gpt_neox import GPTNeoXLayer
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing
)

transformer_layers = [
    m for m in model.modules() if isinstance(m, GPTNeoXLayer)
]

def is_every_other_transformer_layer(module):
    return transformer_layers.index(module) % 2 == 0

apply_activation_checkpointing(model, check_fn=is_every_other_transformer_layer)
```

Alternatively, PyTorch also has the `torch.utils.checkpoint` module for checkpointing, which is used by a subset of Hugging Face Transformers models. This module also works with SMP v2. However, it requires access to the model definition to add the checkpoint wrapper. Therefore, we recommend that you use the `apply_activation_checkpointing` method.

# Activation offloading
<a name="model-parallel-core-features-v2-pytorch-activation-offloading"></a>

**Important**  
In SMP v2.2.0, the activation offloading functionality of the SMP library doesn't work. Use the native PyTorch activation offloading instead.

Typically, the forward pass computes activations at each layer and keeps them in GPU memory until the backward pass for the corresponding layer finishes. Offloading these tensors to CPU memory after the forward pass and fetching them back to the GPU when they are needed can save substantial GPU memory. PyTorch supports offloading activations, but its implementation leaves GPUs idle while activations are fetched back from the CPU during the backward pass, which causes a major performance degradation when using activation offloading.

SMP v2 improves on this activation offloading. It pre-fetches activations ahead of time, before the GPU needs them to start the backward pass on those activations. This pre-fetching helps training run more efficiently without idle GPUs, providing the benefits of lower memory usage without a performance degradation.
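
The following toy sketch (plain Python, not the SMP implementation) illustrates the pre-fetching idea: offloaded activations are loaded back a configurable number of layers ahead of where the backward pass currently is, mirroring the role of the `activation_loading_horizon` parameter.

```python
# Conceptual sketch of activation offloading with pre-fetching (not the
# SMP implementation): activations move to a CPU store after the
# forward pass, and the backward pass prefetches the activations for
# upcoming layers before they are needed.
def run_backward(num_layers, horizon):
    cpu_store = {i: f"act{i}" for i in range(num_layers)}  # offloaded
    gpu_cache = {}
    order = []
    # Backward visits layers in reverse order.
    for layer in reversed(range(num_layers)):
        # Prefetch this layer plus up to `horizon - 1` earlier layers,
        # so fetches overlap with compute instead of stalling the GPU.
        for l in range(layer, max(layer - horizon, -1), -1):
            if l in cpu_store:
                gpu_cache[l] = cpu_store.pop(l)
        order.append(gpu_cache.pop(layer))  # consume, then free GPU copy
    return order

print(run_backward(num_layers=4, horizon=2))  # ['act3', 'act2', 'act1', 'act0']
```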

You can keep the native PyTorch modules for offloading activations in your training script. The following is an example structure of applying the SMP activation offloading feature in your script. Note that activation offloading is applicable *only* when used together with [Activation checkpointing](model-parallel-core-features-v2-pytorch-activation-checkpointing.md). To learn more about the native PyTorch checkpoint tools for activation offloading, see:
+ [checkpoint_wrapper.py](https://github.com/pytorch/pytorch/blob/v2.0.1/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py#L171) in the *PyTorch GitHub repository*
+ [Activation Checkpointing](https://pytorch.org/blog/scaling-multimodal-foundation-models-in-torchmultimodal-with-pytorch-distributed/#activation-checkpointing) in the PyTorch blog *Scaling Multi-modal Foundation Models in TorchMultimodal with PyTorch Distributed*.

You can apply the SMP activation offloading feature on [PyTorch activation checkpointing](https://pytorch.org/blog/scaling-multimodal-foundation-models-in-torchmultimodal-with-pytorch-distributed/#activation-checkpointing). This is done by adding the `sm_activation_offloading` and `activation_loading_horizon` parameters to the SMP configuration dictionary during [Step 2: Launch a training job](model-parallel-use-api-v2.md#model-parallel-launch-a-training-job-v2). 

The following code snippets show how to add the SMP initialization module `torch.sagemaker.init()` to your training script and set up the SMP configuration dictionary in JSON format for training job launcher while following the two-step process introduced in [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md). You don’t need to make any changes to your PyTorch model or [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp) configuration. For more information about the `sm_activation_offloading` and `activation_loading_horizon` parameters, see [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config).

**SMP configuration**

```
{
    "activation_loading_horizon": 2,
    "sm_activation_offloading": True
}
```

**In training script**

**Note**  
When activating the SMP activation offloading feature, make sure that you also use the PyTorch `offload_wrapper` function and apply it to the root module. The SMP activation offloading feature uses the root module to determine when the forward pass is done and pre-fetching should start.

```
import torch.sagemaker as tsm
tsm.init()

# Native PyTorch module for activation offloading
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing, 
    offload_wrapper,
)

model = FSDP(...)

# Activation offloading requires activation checkpointing.
apply_activation_checkpointing(
    model,
    check_fn=checkpoint_transformer_layers_policy,
)

model = offload_wrapper(model)
```

# Tensor parallelism
<a name="model-parallel-core-features-v2-tensor-parallelism"></a>

*Tensor parallelism* is a type of model parallelism in which specific model weights, gradients, and optimizer states are split across devices. In contrast to pipeline parallelism, which keeps individual weights intact but partitions the *set* of weights, gradients, or optimizer across devices, tensor parallelism shards *individual* weights. This typically involves distributed computation of specific operations, modules, or layers of the model.

Tensor parallelism is required in cases in which a single parameter consumes most of the GPU memory (such as large embedding tables with a large vocabulary size or a large softmax layer with a large number of classes). In this case, treating this large tensor or operation as an atomic unit is inefficient and impedes balance of the memory load.
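
As a toy illustration of sharding an individual weight (plain Python, not SMP internals), consider a linear layer whose weight columns are split across devices; each device computes a slice of the output, and the concatenated result matches the unsharded computation.

```python
# Toy sketch of tensor parallelism (pure Python, no SMP internals):
# a linear layer's weight matrix is sharded column-wise across devices,
# each device computes a partial output slice, and the slices are
# concatenated, so no single device holds the full weight.
def matvec(weight_cols, x):
    # weight_cols: list of output columns; returns one dot product per column.
    return [sum(w * xi for w, xi in zip(col, x)) for col in weight_cols]

def tensor_parallel_matvec(weight_cols, x, tp_degree):
    shard_size = len(weight_cols) // tp_degree
    outputs = []
    for rank in range(tp_degree):
        # Each "device" holds and computes only its own column shard.
        shard = weight_cols[rank * shard_size:(rank + 1) * shard_size]
        outputs.extend(matvec(shard, x))
    return outputs

weight = [[1, 0], [0, 1], [2, 2], [1, 1]]  # 4 output columns, input dim 2
x = [3, 4]
assert tensor_parallel_matvec(weight, x, tp_degree=2) == matvec(weight, x)
```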

SMP v2 integrates with [Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine/index.html) for its implementation of tensor parallelism, and it runs on top of PyTorch FSDP APIs. You can enable PyTorch FSDP and SMP tensor parallelism simultaneously to determine the model parallelism configuration that gives the best performance.

In practice, tensor parallelism is especially helpful in the following scenarios.
+ When training with long context lengths, because long contexts lead to high activation memory with FSDP alone.
+ When training with very large clusters on which the global batch size exceeds desired limits.

## Hugging Face Transformer models compatible with the SMP tensor parallelism
<a name="model-parallel-core-features-v2-tensor-parallelism-supported-models"></a>

SMP v2 currently offers tensor parallelism support for the following Hugging Face transformer models.
+ GPT-NeoX
+ Llama 2
+ Llama 3
+ [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.3)
+ [Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)
+ [Mixtral 8x22B](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1)

For reference configuration for applying tensor parallelism on these models, see [Configuration tips](model-parallel-best-practices-v2.md#model-parallel-best-practices-v2-config-tips).

## Configure tensor parallelism
<a name="model-parallel-core-features-v2-tensor-parallelism-configuration"></a>

For `tensor_parallel_degree`, you select a value for the degree of tensor parallelism. The value must evenly divide the number of GPUs in your cluster. For example, to shard your model while using an instance with 8 GPUs, choose 2, 4, or 8. We recommend that you start with a small number, and gradually increase it until the model fits in the GPU memory.

The following code snippets show how to add the SMP initialization module `torch.sagemaker.init()` to your training script and set up the SMP configuration dictionary in JSON format for training job launcher while following the two-step process introduced in [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md). You don’t need to make any changes to your PyTorch model or [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp) configuration. For more information about the `tensor_parallel_degree` and `random_seed` parameters, see [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config).

**SMP configuration**

```
{
    "tensor_parallel_degree": 8,
    "random_seed": 0 
}
```

**In your training script**

Initialize with `torch.sagemaker.init()` to activate SMP v2 and wrap your model with the [`torch.sagemaker.transform`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-transform) API.

```
import torch.sagemaker as tsm
tsm.init()

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_config(..)
model = tsm.transform(model)
```

## Saving and loading Hugging Face Transformer checkpoints
<a name="model-parallel-core-features-v2-tensor-parallelism-checkpoints"></a>

After the SMP library transforms a model, it changes the state dictionary (`state_dict`) of the model. This means that the model becomes incompatible with the original Hugging Face Transformer checkpointing functionalities. To handle this, the SMP library provides APIs to save checkpoints from a transformed model in Hugging Face Transformer representation, and the `torch.sagemaker.transform` API to load a Hugging Face Transformer model checkpoint for fine-tuning.

For more information about saving checkpoints while using the tensor parallelism feature of SMP v2, see [Checkpointing using SMP](model-parallel-core-features-v2-checkpoints.md).

For more information about fine-tuning a model applying the tensor parallelism feature of SMP v2, see [Fine-tuning](model-parallel-core-features-v2-fine-tuning.md).

# Fine-tuning
<a name="model-parallel-core-features-v2-fine-tuning"></a>

Fine-tuning is the process of continuing to train a pre-trained model to improve its performance for specific use cases.

Fine-tuning small models that fit entirely on a single GPU, or whose eight copies fit entirely in CPU memory, is straightforward and requires no special change to regular FSDP training. For models larger than this, you need to consider using the delayed parameter initialization functionality, which can be tricky.

To address this, the SMP library loads the full model on one of the ranks while the rest of the ranks create models with empty weights on a meta device. Then, PyTorch FSDP initializes the weights on non-zero ranks using the `init_weights` function, and synchronizes the weights on all ranks to the weights on the 0th rank with `sync_module_states` set to `True`. The following code snippet shows how you should set it up in your training script.

```
import torch.distributed as dist
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.sagemaker.delayed_param import DelayedParamIniter

if dist.get_rank() == 0:
    # Rank 0 loads the full pre-trained weights on CPU.
    model = AutoModelForCausalLM.from_pretrained(..., low_cpu_mem_usage=True)
else:
    # All other ranks create the model with empty weights on the meta device.
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(...))
    delayed_initer = DelayedParamIniter(model)

model = FSDP(
    model,
    ...,
    sync_module_states=True,
    param_init_fn=delayed_initer.get_param_init_fn() if dist.get_rank() > 0 else None
)
```

## Fine-tuning a pre-trained Hugging Face Transformer model with SMP tensor parallelism
<a name="model-parallel-core-features-v2-tensor-parallelism-fine-tuning-hf-transformer-with-tp"></a>

This section discusses loading Transformer models for two use cases: fine-tuning small Transformer models and fine-tuning large Transformer models. For smaller models without delayed parameter initialization, wrap the model with the `torch.sagemaker.transform` API before wrapping it with PyTorch FSDP.

```
import functools
from transformers import AutoModelForCausalLM
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.sagemaker import transform

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", low_cpu_mem_usage=True)

# Transform model while loading state dictionary from rank 0.
tp_model = transform(model, load_state_dict_from_rank0=True)

# Wrap with FSDP.
model = FSDP(
    tp_model, 
    ...
    sync_module_states=True,
)
```

For larger models, the preceding approach can cause the process to run out of CPU memory. We recommend that you use delayed parameter initialization to avoid such CPU memory issues. In this case, you can apply the `torch.sagemaker.transform` API and the `torch.sagemaker.delayed_param.DelayedParamIniter` API as shown in the following code example.

```
from contextlib import nullcontext

import torch.distributed as dist
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.sagemaker import transform
from torch.sagemaker.delayed_param import DelayedParamIniter

# Create one instance of the model without delayed param
# on CPU, on one rank.
if dist.get_rank() == 0:
    model = AutoModelForCausalLM.from_pretrained(..., low_cpu_mem_usage=True)
else:
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(...))

# Transform the model while loading the state dictionary from rank 0.
model = transform(model, load_state_dict_from_rank0=True)

if dist.get_rank() != 0:  # For fine-tuning, delay parameter initialization on non-zero ranks
    delayed_initer = DelayedParamIniter(model)
else:
    delayed_initer = None

with (
    delayed_initer.validate_params_and_buffers_inited() if delayed_initer else nullcontext()
):
    # Wrap the model with FSDP
    model = FSDP(
        model, 
        ..., 
        sync_module_states=True,
        param_init_fn=delayed_initer.get_param_init_fn() if delayed_initer else None
    )
```

# FlashAttention
<a name="model-parallel-core-features-v2-flashattention"></a>

SMP v2 supports [FlashAttention](https://github.com/HazyResearch/flash-attention) kernels and makes it easy to apply them to various scenarios for Hugging Face Transformer models. Note that if you install the FlashAttention package v2.0 or later, SMP uses FlashAttention v2. However, the Triton flash attention implementation depends on the flash attention kernel of FlashAttention v1.x, so it is supported exclusively with FlashAttention v1.

The attention module (`nn.Module`) is a low-level API that defines the attention layers of a model. You should apply it right after model creation (for example, from the `AutoModelForCausalLM.from_config()` API), and before the model is transformed or wrapped with FSDP.

## Use FlashAttention kernels for self attention
<a name="model-parallel-core-features-v2-flashattention-self"></a>

The following code snippet shows how to use the [`torch.sagemaker.nn.attn.FlashSelfAttention`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-flashselfattention) API provided by SMP v2.

```
import functools

import torch
from torch.sagemaker.nn.attn import FlashSelfAttention

def new_attn(self, q, k, v, attention_mask=None, head_mask=None):
    return (
        self.flash_mod((q, k, v), causal=True, cast_dtype=torch.bfloat16, layout="b h s d"),
        None,
    )

for layer in model.gpt_neox.layers:
    layer.attention.flash_mod = FlashSelfAttention()
    layer.attention._attn = functools.partial(new_attn, layer.attention)
```

## Use FlashAttention kernels for grouped-query attention
<a name="model-parallel-core-features-v2-flashattention-grouped-query"></a>

SMP v2 also supports [FlashAttention](https://github.com/HazyResearch/flash-attention) kernels for grouped-query attention (GQA) and makes it easy to apply them to various scenarios for Hugging Face Transformer models. Different from the original attention architecture, GQA partitions the query heads into equal groups, and the query heads in the same group share the same key and value heads. Therefore, the q heads and kv heads are passed into the forward call separately. Note that the number of q heads must be divisible by the number of kv heads.
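
The grouping rule can be sketched in plain Python (illustrative only; the actual FlashAttention kernel operates on tensors): with `num_q_heads` query heads and `num_kv_heads` key-value heads, each block of `num_q_heads // num_kv_heads` consecutive query heads shares one kv head.

```python
# Illustrative sketch of grouped-query attention head grouping (not SMP code).
def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    # The number of query heads must be divisible by the number of kv heads.
    assert num_q_heads % num_kv_heads == 0
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

# With 8 query heads and 2 kv heads, query heads 0-3 share kv head 0,
# and query heads 4-7 share kv head 1.
mapping = [kv_head_for_query_head(q, 8, 2) for q in range(8)]
```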

**Example of using FlashGroupedQueryAttention**

The following code snippet shows how to use the [`torch.sagemaker.nn.attn.FlashGroupedQueryAttention`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-flashGroupedQueryAttn) API provided by SMP v2.

```
from typing import Optional

import torch
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaAttention
from torch.sagemaker.nn.attn import FlashGroupedQueryAttention

class LlamaFlashAttention(LlamaAttention):
    def __init__(self, config: LlamaConfig):
        super().__init__(config)

        self.flash_attn = FlashGroupedQueryAttention(
            attention_dropout_prob=0.0,
        )
        
    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        ...
    ):
        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)
        ...
        kv = (key_states, value_states)
        attn_output = self.flash_attn(
            query_states,
            kv,
            attn_mask=attention_mask,
            causal=True,
            layout="b h s d",
        )
        ...
        attn_output = self.o_proj(attn_output)
        ...
        return attn_output
```

The SMP library also provides [`torch.sagemaker.nn.huggingface.llama_flashattn.LlamaFlashAttention`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-llamaFlashAttn), which uses the [`torch.sagemaker.nn.attn.FlashGroupedQueryAttention`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-flashGroupedQueryAttn) API at a low level. Hugging Face Transformers has a similar implementation called [`LlamaFlashAttention2`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py), available from v4.36.0. The following code snippet shows how to use the SMP v2 `LlamaFlashAttention` API or the Transformers `LlamaFlashAttention2` API to replace the attention layers of an existing Llama model.

```
from torch.sagemaker.nn.huggingface.llama_flashattn import LlamaFlashAttention
from transformers.models.llama.modeling_llama import LlamaFlashAttention2

flash_attn_class = LlamaFlashAttention # or flash_attn_class = LlamaFlashAttention2

attn_name = "self_attn"
for layer in model.model.layers:
    prev_layer = getattr(layer, attn_name)  # keep a handle to the original attention module, if needed
    setattr(layer, attn_name, flash_attn_class(model.config))
```

# Checkpointing using SMP
<a name="model-parallel-core-features-v2-checkpoints"></a>

The SageMaker model parallelism (SMP) library supports PyTorch APIs for checkpoints, and provides APIs that help checkpoint properly while using the SMP library. 

PyTorch FSDP (Fully Sharded Data Parallelism) supports three types of checkpoints: full, sharded, and local, each serving different purposes. Full checkpoints are used when exporting the model after training is completed, as generating a full checkpoint is a computationally expensive process. Sharded checkpoints help save and load the state of a model sharded for each individual rank. With sharded checkpoints, you can resume training with different hardware configurations, such as a different number of GPUs. However, loading sharded checkpoints can be slow due to the communication involved among multiple devices. The SMP library provides local checkpointing functionalities, which allow faster retrieval of the model's state without additional communication overhead. Note that checkpoints created by FSDP require writing to a shared network file system such as Amazon FSx.
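
To make the distinction concrete, here is a pure-Python sketch of the relationship between sharded and full checkpoints (conceptual only; the actual FSDP implementation operates on sharded tensors and handles padding, uneven shards, and metadata):

```python
# Conceptual sketch only (not the FSDP implementation): each rank persists
# its own contiguous shard of the flattened parameters, and a full checkpoint
# gathers every shard back into a single object.
full_params = list(range(12))  # stand-in for flattened model parameters
world_size = 4
shard_size = len(full_params) // world_size
shards = {
    rank: full_params[rank * shard_size:(rank + 1) * shard_size]
    for rank in range(world_size)
}

# Reassemble a "full checkpoint" by concatenating the shards in rank order.
reassembled = [p for rank in range(world_size) for p in shards[rank]]
```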

## Async local checkpoints
<a name="w2aac25c25c19c19c33b7"></a>

When training machine learning models, there is no need for subsequent iterations to wait for the checkpoint files to be saved to disk. With the release of SMP v2.5, the library supports saving checkpoint files asynchronously. This means that the subsequent training iteration can run simultaneously with the input and output (I/O) operations for creating checkpoints, without being slowed down or held back by those I/O operations. Also, the process of retrieving sharded model and optimizer parameters in PyTorch can be time-consuming due to the additional collective communication required to exchange distributed tensor metadata across ranks. Even when using `StateDictType.LOCAL_STATE_DICT` to save local checkpoints for each rank, PyTorch still invokes hooks that perform collective communication. To mitigate this issue and reduce the time required for checkpoint retrieval, SMP introduces `SMStateDictType.SM_LOCAL_STATE_DICT`, which allows for faster retrieval of model and optimizer checkpoints by bypassing the collective communication overhead. 
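
The mechanism can be illustrated with a minimal stdlib sketch (a conceptual analogy using a background thread, not the SMP implementation): the save runs on a worker thread so the training loop keeps going, and a finalize call blocks until the in-flight save completes, mirroring `maybe_finalize_async_calls`.

```python
import threading

# Conceptual analogy to async_save / maybe_finalize_async_calls (not SMP code).
class AsyncCheckpointer:
    def __init__(self):
        self._thread = None
        self.saved = []

    def async_save(self, state):
        self.finalize()  # wait for any in-flight save first
        self._thread = threading.Thread(target=self.saved.append, args=(dict(state),))
        self._thread.start()  # "I/O" proceeds while the training loop continues

    def finalize(self):
        if self._thread is not None:
            self._thread.join()
            self._thread = None

ckpt = AsyncCheckpointer()
for step in range(3):
    ckpt.async_save({"step": step})  # does not block the training loop
ckpt.finalize()
```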

**Note**  
Maintaining consistency in the FSDP `SHARD_DEGREE` is a requirement for using `SMStateDictType.SM_LOCAL_STATE_DICT`. Ensure that the `SHARD_DEGREE` remains unchanged: while the number of model replicas can vary, the model shard degree must be identical to that of the previous training setup when resuming from a checkpoint.

```
import os
import torch.distributed as dist
import torch.sagemaker as tsm
from torch.sagemaker import state
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.sagemaker.distributed.checkpoint.state_dict_saver import (
    async_save,
    maybe_finalize_async_calls,
)
from torch.sagemaker.distributed.checkpoint.state_dict_utils import (
    sm_state_dict_type,
    SMStateDictType,
)

global_rank = dist.get_rank()
save_dir = "/opt/ml/checkpoints"
sub_dir = f"tp{state.tp_rank}_ep{state.ep_rank}_fsdp{model.rank}"

# 1. Get replication ranks and group
current_replication_group = None
current_replication_ranks = None
for replication_ranks in state.ranker.get_rep_groups():
    rep_group = dist.new_group(replication_ranks)
    if global_rank in replication_ranks:
        current_replication_group = rep_group
        current_replication_ranks = replication_ranks

coordinator_rank = min(current_replication_ranks)

# 2. Wait for the previous checkpointing done
maybe_finalize_async_calls(
    blocking=True, process_group=current_replication_group
)

# 3. Get model local checkpoint
with sm_state_dict_type(model, SMStateDictType.SM_LOCAL_STATE_DICT):
    state_dict = {
       "model": model.state_dict(),
       "optimizer": optimizer.state_dict(),
        # Potentially add more customized state dicts.
    }

# 4. Save a local checkpoint 
async_save(
    state_dict,
    checkpoint_id=os.path.join(save_dir, sub_dir),
    process_group=current_replication_group,
    coordinator_rank=coordinator_rank,
)
```

The following code snippet demonstrates how to load a checkpoint utilizing `SMStateDictType.SM_LOCAL_STATE_DICT`.

```
import os

import torch.distributed as dist
from torch.sagemaker import state
from torch.sagemaker.distributed.checkpoint.state_dict_loader import load
from torch.sagemaker.distributed.checkpoint.state_dict_utils import (
    sm_state_dict_type,
    SMStateDictType,
    init_optim_state,
)
from torch.sagemaker.distributed.checkpoint.filesystem import (
    DistributedFileSystemReader,
)

load_dir = "/opt/ml/checkpoints"
sub_dir = f"tp{state.tp_rank}_ep{state.ep_rank}_fsdp{model.rank}"
global_rank = dist.get_rank()
checkpoint_id = os.path.join(load_dir, sub_dir)
storage_reader = DistributedFileSystemReader(checkpoint_id)

# 1. Get replication ranks and group
current_replication_group = None
current_replication_ranks = None
for replication_ranks in state.ranker.get_rep_groups():
    rep_group = dist.new_group(replication_ranks)
    if global_rank in replication_ranks:
        current_replication_group = rep_group
        current_replication_ranks = replication_ranks

coordinator_rank = min(current_replication_ranks)

# 2. Create local state_dict
with sm_state_dict_type(model, SMStateDictType.SM_LOCAL_STATE_DICT):
    state_dict = {
        "model": model.state_dict(),
        # Potentially add more customized state dicts.
    }
 
    # Init optimizer state_dict states by setting zero grads and step.
    init_optim_state(optimizer, skip_empty_param=True)
    state_dict["optimizer"] = optimizer.state_dict()
 
# 3. Load a checkpoint
load(
    state_dict=state_dict,
    process_group=current_replication_group,
    coordinator_rank=coordinator_rank,
    storage_reader=storage_reader,
)
```

Storing checkpoints for large language models (LLMs) can be expensive because it often requires creating a large file system volume. To reduce costs, you can save checkpoints directly to Amazon S3 without additional file system services such as Amazon FSx. You can combine the previous example with the following code snippet to save checkpoints to S3 by specifying an S3 URL as the destination. 

```
key = os.path.join(checkpoint_dir, sub_dir)
checkpoint_id = f"s3://{your_s3_bucket}/{key}"
async_save(state_dict, checkpoint_id=checkpoint_id, **kw)
load(state_dict, checkpoint_id=checkpoint_id, **kw)
```

## Async sharded checkpoints
<a name="w2aac25c25c19c19c33b9"></a>

There may be situations where you need to continue training with a different hardware configuration, such as changing the number of GPUs. In these cases, your training processes must load checkpoints while resharding, which means resuming training with a different `SHARD_DEGREE`. To address this scenario, you must save your model checkpoints using the sharded state dictionary type, represented by `StateDictType.SHARDED_STATE_DICT`. Saving checkpoints in this format lets you properly handle the resharding process when continuing training with a modified hardware configuration. The following code snippet illustrates how to use the `tsm` API to asynchronously save sharded checkpoints, enabling a more efficient and streamlined training process.

```
import os

import torch.distributed as dist
import torch.sagemaker as tsm
from torch.sagemaker import state
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType
from torch.sagemaker.utils.process_group_utils import get_global_ranks
from torch.sagemaker.distributed.checkpoint.state_dict_saver import (
    async_save,
    maybe_finalize_async_calls,
)

save_dir = "/opt/ml/checkpoints"
sub_dir = f"tp{state.tp_rank}_ep{state.ep_rank}"
checkpoint_id = os.path.join(save_dir, sub_dir)

# Determine whether the current rank needs to take part in checkpointing.
global_rank = dist.get_rank()
action_rank = state.ranker.get_rep_rank(global_rank) == 0
process_group = model.process_group
coordinator_rank = min(get_global_ranks(process_group))

# 1. wait for the previous checkpointing done
maybe_finalize_async_calls(blocking=True, process_group=process_group)

# 2. retrieve model & optimizer sharded state_dict
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {
        "model": model.state_dict(),
        "optimizer": FSDP.optim_state_dict(model, optimizer),
        # Potentially add more customized state dicts.
    }
 
# 3. save checkpoints asynchronously using async_save
if action_rank:
    async_save(
        state_dict,
        checkpoint_id=checkpoint_id,
        process_group=process_group,
        coordinator_rank=coordinator_rank,
    )
```

The process of loading sharded checkpoints is similar to the previous section, but it involves using the `torch.sagemaker.distributed.checkpoint.filesystem.DistributedFileSystemReader` class and the `load` method. The `load` method lets you read the sharded checkpoint data, following a process analogous to the one described earlier.

```
import os

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType
from torch.distributed.checkpoint.optimizer import load_sharded_optimizer_state_dict
from torch.sagemaker import state
from torch.sagemaker.distributed.checkpoint.state_dict_loader import load
from torch.sagemaker.utils.process_group_utils import get_global_ranks
from torch.sagemaker.distributed.checkpoint.filesystem import (
    DistributedFileSystemReader,
)

load_dir = "/opt/ml/checkpoints"
sub_dir = f"tp{state.tp_rank}_ep{state.ep_rank}"
checkpoint_id = os.path.join(load_dir, sub_dir)
reader = DistributedFileSystemReader(checkpoint_id)

process_group = model.process_group
coordinator_rank = min(get_global_ranks(process_group))

with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    # 1. Load model and everything else except the optimizer.
    state_dict = {
        "model": model.state_dict()
        # Potentially more customized state dicts.
    }
    load(
        state_dict,
        storage_reader=reader,
        process_group=process_group,
        coordinator_rank=coordinator_rank,
    )
    model.load_state_dict(state_dict["model"])

    # 2. Load optimizer.
    optim_state = load_sharded_optimizer_state_dict(
        model_state_dict=state_dict["model"],
        optimizer_key="optimizer",
        storage_reader=reader,
        process_group=process_group,
    )
    flattened_optimizer_state = FSDP.optim_state_dict_to_load(
        optim_state["optimizer"], model, optimizer,
        group=model.process_group,
    )
    optimizer.load_state_dict(flattened_optimizer_state)
```

## Full model checkpoints
<a name="model-parallel-core-features-v2-checkpoints-full"></a>

At the end of training, you can save a full checkpoint that combines all shards of a model into a single model checkpoint file. The SMP library fully supports the PyTorch full model checkpoints API, so you don't need to make any changes.

Note that if you use the SMP [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md), the SMP library transforms the model. When checkpointing the full model in this case, the SMP library translates the model back to the Hugging Face Transformers checkpoint format by default.

In cases where you train with SMP tensor parallelism and want to control the SMP translation process, you can use the `translate_on_save` argument of the PyTorch `FullStateDictConfig` API to switch the SMP auto-translation on or off as needed. For example, if you are focused on training a model, you don’t need to add the translation process, which adds overhead; in that case, we recommend that you set `translate_on_save=False`. Also, if you plan to keep using the SMP transformation of the model for further training in the future, you can switch translation off to save the SMP-transformed checkpoint for later use. Translating the model back to the Hugging Face Transformers checkpoint format is needed when you wrap up the training of your model and use it for inference.

```
import os
import logging

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType, FullStateDictConfig

logger = logging.getLogger(__name__)

# Save checkpoints.
with FSDP.state_dict_type(
    model, 
    StateDictType.FULL_STATE_DICT, 
    FullStateDictConfig(
        rank0_only=True, offload_to_cpu=True,
        # Default value is to translate back to Hugging Face Transformers format,
        # when saving full checkpoints for models trained with SMP tensor parallelism.
        # translate_on_save=True
    ),
):
    state_dict = model.state_dict()
    if dist.get_rank() == 0:
        logger.info("Processed state dict to save. Starting write to disk now.")
        os.makedirs(save_dir, exist_ok=True)
        # This name is needed for HF from_pretrained API to work.
        torch.save(state_dict, os.path.join(save_dir, "pytorch_model.bin"))
        hf_model_config.save_pretrained(save_dir)
    dist.barrier()
```

Note that the option `FullStateDictConfig(rank0_only=True, offload_to_cpu=True)` is to gather the model on the CPU of the 0th rank device to save memory when training large models.

To load the model back for inference, you do so as shown in the following code example. Note that the class `AutoModelForCausalLM` might change to another factory class in Hugging Face Transformers, such as `AutoModelForSeq2SeqLM`, depending on your model. For more information, see the [Hugging Face Transformers documentation](https://huggingface.co/docs/transformers/v4.36.1/en/model_doc/auto#natural-language-processing).

```
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(save_dir)
```

# Amazon SageMaker AI model parallelism library v2 examples
<a name="distributed-model-parallel-v2-examples"></a>

This page provides a list of blogs and Jupyter notebooks that present practical examples of implementing the SageMaker model parallelism (SMP) library v2 to run distributed training jobs on SageMaker AI.

## Blogs and Case Studies
<a name="distributed-model-parallel-v2-examples-blog"></a>

The following blogs discuss case studies about using SMP v2.
+ [Amazon SageMaker AI model parallel library now accelerates PyTorch FSDP workloads by up to 20%](https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-model-parallel-library-now-accelerates-pytorch-fsdp-workloads-by-up-to-20/)

## PyTorch example notebooks
<a name="distributed-model-parallel-examples-v2-pytorch"></a>

Example notebooks are provided in the [SageMaker AI examples GitHub repository](https://github.com/aws/amazon-sagemaker-examples/tree/master/training/distributed_training/). To download the examples, run the following command to clone the repository and go to `training/distributed_training/pytorch/model_parallel_v2`.

**Note**  
Clone and run the example notebooks in the following SageMaker AI ML IDEs.  
[SageMaker JupyterLab](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl.html) (available in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
[SageMaker Code Editor](https://docs.aws.amazon.com/sagemaker/latest/dg/code-editor.html) (available in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
[Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) (available as an application in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
[SageMaker Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html)

```
git clone https://github.com/aws/amazon-sagemaker-examples.git
cd amazon-sagemaker-examples/training/distributed_training/pytorch/model_parallel_v2
```

**SMP v2 example notebooks**
+ [Accelerate training of Llama v2 with SMP v2, PyTorch FSDP, and Transformer Engine by running FP8 training on P5 instances](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel_v2/llama_v2/smp-train-llama-fsdp-tp-fp8.ipynb)
+ [Fine-tune Llama v2 with SMP v2 and PyTorch FSDP at large-scale using tensor parallelism, hybrid sharding, and activation offloading](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel_v2/llama_v2/smp-finetuning-llama-fsdp-tp.ipynb)
+ [Train GPT-NeoX with SMP v2 and PyTorch FSDP at large scale](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel_v2/gpt-neox/smp-train-gpt-neox-fsdp-tp.ipynb)
+ [Fine-tune GPT-NeoX with SMP v2 and PyTorch FSDP at large-scale using tensor parallelism, hybrid sharding, and activation offloading](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel_v2/gpt-neox/smp-finetuning-gpt-neox-fsdp-tp.ipynb)

# SageMaker distributed model parallelism best practices
<a name="model-parallel-best-practices-v2"></a>

Use the following guidelines when you run a distributed training job with the SageMaker model parallel library v2 (SMP v2).

## Setting up the right configuration for distributed training
<a name="model-parallel-best-practices-configuration-v2"></a>

To estimate and find the best starting point to apply distributed training techniques that SMP v2 provides, review the following list. Each list item discusses the advantage of using the [Core features of the SageMaker model parallelism library v2](model-parallel-core-features-v2.md) along with potential tradeoffs. 

### Configuration tips
<a name="model-parallel-best-practices-v2-config-tips"></a>

This section provides guidelines on how to decide on the best model configurations for optimal throughput with global batch size requirements.

First, we recommend the following setups regardless of the size of your model.

1. Use the most powerful instance type that you can use.

1. Turn on [mixed precision](model-parallel-core-features-v2-mixed-precision.md) all the time, as it provides substantial benefits for performance and memory reduction. We recommend you to use `bfloat16` as it's more precise than `float16`.

1. Turn on the [SageMaker distributed data parallelism library](data-parallel.md) (instead of using NCCL) whenever it’s applicable, as shown in [Compatibility with the SMDDP library optimized for AWS infrastructure](model-parallel-core-features-v2-smddp-allgather.md). One exception is for tensor-parallelism-only use cases (`hybrid_shard_degree = 1` and `tensor_parallel_degree > 1`).

1. If your model has more than about 60 billion parameters, we recommend using [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md). You can also use delayed parameter initialization to speed up the initialization for any model.

1. We recommend you to enable [Activation checkpointing](model-parallel-core-features-v2-pytorch-activation-checkpointing.md). 

Depending on the size of your model, we recommend that you start with the following guidance.

1. Use sharded data parallelism.

   1. Depending on the batch size you intend to fit in the GPU memory, choose the appropriate sharded data parallel degree. Normally, you should start with the lowest degree to fit your model in the GPU memory while minimizing overhead from network communication. If you see a warning that cache flushes are happening, we recommend that you increase the sharding degree. 

   1. Determine `world_size` based on the maximum local batch size and required global batch size, if any.

   1. You can experiment with activation offloading. Depending on scenarios, it can address your memory needs without having to increase the sharding degree, which means less communication. 

1. Use sharded data parallelism of PyTorch FSDP and tensor parallelism of SMP v2 simultaneously, as introduced in [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md).

   1. When training on large clusters, with FSDP alone the global batch size can become too large, causing convergence issues for the model. Typically, most research work keeps the batch size under 4 million tokens. In this case, you can resolve the problem by composing PyTorch FSDP with tensor parallelism of SMP v2 to reduce the batch size.

      For example, if you have 256 nodes and sequence length 4096, even a batch size of 1 per GPU leads to a global batch size of 8M tokens. However, when you use tensor parallelism with degree 2 and a batch size of 1 per tensor parallel group, the effective batch size per GPU is halved, which translates to 4 million tokens.

   1. When training with long context lengths such as 8k or 16k, activation memory can become very high. FSDP doesn't shard activations, and activations can cause GPUs to go out of memory. In such scenarios, you can train efficiently by composing PyTorch FSDP with tensor parallelism of SMP v2.
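
The batch size arithmetic from the guidance above can be checked with a short helper (hypothetical, for estimation only):

```python
# Hypothetical helper to estimate the global batch size in tokens.
def global_batch_tokens(nodes, gpus_per_node, seq_len, per_gpu_batch, tp_degree=1):
    # Tensor parallelism of degree N makes N GPUs share one model replica,
    # dividing the number of data-parallel replicas by N.
    replicas = nodes * gpus_per_node // tp_degree
    return replicas * per_gpu_batch * seq_len

# 256 nodes with 8 GPUs each, sequence length 4096, batch size 1 per GPU.
no_tp = global_batch_tokens(256, 8, 4096, 1)                  # ~8M tokens
with_tp = global_batch_tokens(256, 8, 4096, 1, tp_degree=2)   # ~4M tokens
```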

### Reference configurations
<a name="model-parallel-best-practices-configuration-reference-v2"></a>

The SageMaker model parallelism training team provides the following reference points based on experiments with the Llama 2 model transformed to the SMP transformer model using [`torch.sagemaker.transform`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-transform), and trained on `ml.p4d.24xlarge` instance(s) with sequence length 4096 and mixed precision (FP16 or BF16).

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-best-practices-v2.html)

You can extrapolate from the preceding configurations to estimate GPU memory usage for your model configuration. For example, if you increase the sequence length for a 10-billion-parameter model or increase the size of the model to 20 billion, you might want to lower batch size first. If the model still doesn’t fit, try increasing the degree of tensor parallelism.

## Monitoring and logging a training job using the SageMaker AI console and Amazon CloudWatch
<a name="model-parallel-best-practices-monitoring-v2"></a>

To monitor system-level metrics such as CPU memory utilization, GPU memory utilization, and GPU utilization, use visualization provided through the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Training**.

1. Choose **Training jobs**.

1. In the main pane, choose the training job name for which you want to see more details.

1. Browse the main pane and find the **Monitor** section to see the automated visualization.

1. To see training job logs, choose **View logs** in the **Monitor** section. You can access the distributed training job logs of the training job in CloudWatch. If you launched multi-node distributed training, you should see multiple log streams with tags in the format of **algo-n-1234567890**. The **algo-1** log stream tracks training logs from the main (0th) node.

For more information, see [Amazon CloudWatch Metrics for Monitoring and Analyzing Training Jobs](training-metrics.md).

## Permissions
<a name="model-parallel-best-practices-permissions-v2"></a>

To run a SageMaker training job with model parallelism, make sure your IAM role has the right permissions, such as the following:
+ To use [FSx for Lustre](https://aws.amazon.com/fsx/), add [AmazonFSxFullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonFSxFullAccess).
+ To use Amazon S3 as a data channel, add [AmazonS3FullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonS3FullAccess).
+ To use Docker, build your own container, and push it to Amazon ECR, add [AmazonEC2ContainerRegistryFullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonEC2ContainerRegistryFullAccess).
+ To have full access to the entire suite of SageMaker AI features, add [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonSageMakerFullAccess).

# The SageMaker model parallel library v2 reference
<a name="distributed-model-parallel-v2-reference"></a>

The following are references for the SageMaker model parallel library v2 (SMP v2).

**Topics**
+ [SMP v2 core feature configuration parameters](#distributed-model-parallel-v2-reference-init-config)
+ [Reference for the SMP v2 `torch.sagemaker` package](#model-parallel-v2-torch-sagemaker-reference)
+ [Upgrade from SMP v1 to SMP v2](#model-parallel-v2-upgrade-from-v1)

## SMP v2 core feature configuration parameters
<a name="distributed-model-parallel-v2-reference-init-config"></a>

The following is a complete list of parameters to activate and configure the [Core features of the SageMaker model parallelism library v2](model-parallel-core-features-v2.md). These must be written in JSON format and passed to the PyTorch estimator in the SageMaker Python SDK or saved as a JSON file for SageMaker HyperPod.

```
{
    "hybrid_shard_degree": Integer,
    "sm_activation_offloading": Boolean,
    "activation_loading_horizon": Integer,
    "fsdp_cache_flush_warnings": Boolean,
    "allow_empty_shards": Boolean,
    "tensor_parallel_degree": Integer,
    "context_parallel_degree": Integer,
    "expert_parallel_degree": Integer,
    "random_seed": Integer
}
```
+ `hybrid_shard_degree` (Integer) – Specifies a sharded parallelism degree. The value must be an integer between `0` and `world_size`. The default value is `0`.
  + If set to `0`, it falls back to the native PyTorch implementation and API in the script when `tensor_parallel_degree` is 1. Otherwise, it computes the largest possible `hybrid_shard_degree` based on `tensor_parallel_degree` and `world_size`. When falling back to the native PyTorch FSDP use cases, if `FULL_SHARD` is the strategy you use, it shards across the whole cluster of GPUs. If `HYBRID_SHARD` or `_HYBRID_SHARD_ZERO2` is the strategy, it is equivalent to a `hybrid_shard_degree` of 8. When tensor parallelism is enabled, it shards based on the revised `hybrid_shard_degree`.
  + If set to `1`, it falls back to the native PyTorch implementation and API for `NO_SHARD` in the script when `tensor_parallel_degree` is 1. Otherwise, it's equivalent to `NO_SHARD` within any given tensor parallel group.
  + If set to an integer between `2` and `world_size`, sharding happens across the specified number of GPUs. If you don't set `sharding_strategy` in your FSDP script, it is overridden to `HYBRID_SHARD`. If you set `_HYBRID_SHARD_ZERO2`, the `sharding_strategy` you specify is used.
+ `sm_activation_offloading` (Boolean) – Specifies whether to enable the SMP activation offloading implementation. If `False`, offloading uses the native PyTorch implementation. If `True`, it uses the SMP activation offloading implementation. You also need to use the PyTorch activation offload wrapper (`torch.distributed.algorithms._checkpoint.checkpoint_wrapper.offload_wrapper`) in your script. To learn more, see [Activation offloading](model-parallel-core-features-v2-pytorch-activation-offloading.md). The default value is `True`.
+ `activation_loading_horizon` (Integer) – An integer specifying the activation offloading horizon type for FSDP. This is the maximum number of checkpointed or offloaded layers whose inputs can be in the GPU memory simultaneously. To learn more, see [Activation offloading](model-parallel-core-features-v2-pytorch-activation-offloading.md). The input value must be a positive integer. The default value is `2`.
+ `fsdp_cache_flush_warnings` (Boolean) – Detects and warns if cache flushes happen in the PyTorch memory manager, because they can degrade computational performance. The default value is `True`.
+ `allow_empty_shards` (Boolean) – Whether to allow empty shards when sharding tensors that are not evenly divisible. This is an experimental fix for a crash during checkpointing in certain scenarios. Disabling this falls back to the original PyTorch behavior. The default value is `False`.
+ `tensor_parallel_degree` (Integer) – Specifies a tensor parallelism degree. The value must be between `1` and `world_size`. The default value is `1`. Note that passing a value greater than 1 does not enable tensor parallelism automatically; you also need to use the [`torch.sagemaker.transform`](#model-parallel-v2-torch-sagemaker-reference-transform) API to wrap the model in your training script. To learn more, see [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md).
+ `context_parallel_degree` (Integer) – Specifies a context parallelism degree. The value must be between `1` and `world_size`, and must be `<= hybrid_shard_degree`. The default value is `1`. Note that passing a value greater than 1 does not enable context parallelism automatically; you also need to use the [`torch.sagemaker.transform`](#model-parallel-v2-torch-sagemaker-reference-transform) API to wrap the model in your training script. To learn more, see [Context parallelism](model-parallel-core-features-v2-context-parallelism.md).
+ `expert_parallel_degree` (Integer) – Specifies an expert parallelism degree. The value must be between `1` and `world_size`. The default value is `1`. Note that passing a value greater than 1 does not enable expert parallelism automatically; you also need to use the [`torch.sagemaker.transform`](#model-parallel-v2-torch-sagemaker-reference-transform) API to wrap the model in your training script. To learn more, see [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md).
+ `random_seed` (Integer) – A seed number for the random operations in distributed modules by SMP tensor parallelism or expert parallelism. This seed is added to the tensor-parallel or expert-parallel rank to set the actual seed for each rank, so it is unique for each tensor-parallel and expert-parallel rank. SMP v2 makes sure that the random numbers generated across tensor-parallel and expert-parallel ranks match the non-tensor-parallelism and non-expert-parallelism cases respectively.
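For example, the following sketch assembles this configuration as a Python dictionary and serializes it to JSON, with a few sanity checks implied by the parameter descriptions above. The degree values and the 16-GPU `world_size` are illustrative assumptions; how you deliver the JSON (as an estimator argument or as a file for SageMaker HyperPod) depends on your launcher.

```python
import json

world_size = 16  # assumed: 2 x ml.p4d.24xlarge instances with 8 GPUs each

smp_config = {
    "hybrid_shard_degree": 8,
    "sm_activation_offloading": True,
    "activation_loading_horizon": 2,
    "fsdp_cache_flush_warnings": True,
    "allow_empty_shards": False,
    "tensor_parallel_degree": 2,
    "context_parallel_degree": 1,
    "expert_parallel_degree": 1,
    "random_seed": 12345,
}

# Sanity checks implied by the parameter descriptions above.
assert 0 <= smp_config["hybrid_shard_degree"] <= world_size
assert 1 <= smp_config["tensor_parallel_degree"] <= world_size
assert 1 <= smp_config["context_parallel_degree"] <= smp_config["hybrid_shard_degree"]

config_json = json.dumps(smp_config, indent=4)
print(config_json)
```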

## Reference for the SMP v2 `torch.sagemaker` package
<a name="model-parallel-v2-torch-sagemaker-reference"></a>

This section is a reference for the `torch.sagemaker` package provided by SMP v2.

**Topics**
+ [`torch.sagemaker.delayed_param.DelayedParamIniter`](#model-parallel-v2-torch-sagemaker-reference-delayed-param-init)
+ [`torch.sagemaker.distributed.checkpoint.state_dict_saver.async_save`](#model-parallel-v2-torch-sagemaker-reference-checkpoint-async-save)
+ [`torch.sagemaker.distributed.checkpoint.state_dict_saver.maybe_finalize_async_calls`](#model-parallel-v2-torch-sagemaker-reference-checkpoint-state-dict-saver)
+ [`torch.sagemaker.distributed.checkpoint.state_dict_saver.save`](#model-parallel-v2-torch-sagemaker-reference-checkpoint-save)
+ [`torch.sagemaker.distributed.checkpoint.state_dict_loader.load`](#model-parallel-v2-torch-sagemaker-reference-checkpoint-load)
+ [`torch.sagemaker.moe.moe_config.MoEConfig`](#model-parallel-v2-torch-sagemaker-reference-moe)
+ [`torch.sagemaker.nn.attn.FlashSelfAttention`](#model-parallel-v2-torch-sagemaker-reference-flashselfattention)
+ [`torch.sagemaker.nn.attn.FlashGroupedQueryAttention`](#model-parallel-v2-torch-sagemaker-reference-flashGroupedQueryAttn)
+ [`torch.sagemaker.nn.huggingface.llama_flashattn.LlamaFlashAttention`](#model-parallel-v2-torch-sagemaker-reference-llamaFlashAttn)
+ [`torch.sagemaker.transform`](#model-parallel-v2-torch-sagemaker-reference-transform)
+ [`torch.sagemaker` util functions and properties](#model-parallel-v2-torch-sagemaker-reference-utils)

### `torch.sagemaker.delayed_param.DelayedParamIniter`
<a name="model-parallel-v2-torch-sagemaker-reference-delayed-param-init"></a>

An API for applying [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md) to a PyTorch model.

```
class torch.sagemaker.delayed_param.DelayedParamIniter(
    model: nn.Module,
    init_method_using_config : Callable = None,
    verbose: bool = False,
)
```

**Parameters**
+ `model` (`nn.Module`) – A PyTorch model to wrap and apply the delayed parameter initialization functionality of SMP v2.
+ `init_method_using_config` (Callable) – If you use the tensor parallel implementation of SMP v2 or supported [Hugging Face Transformer models compatible with the SMP tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md#model-parallel-core-features-v2-tensor-parallelism-supported-models), keep this parameter at the default value of `None`; the `DelayedParamIniter` API then determines how to initialize the given model correctly. For any other model, you need to create a custom parameter initialization function and add it to your script. The following code snippet is the default `init_method_using_config` function that SMP v2 implements for the [Hugging Face Transformer models compatible with the SMP tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md#model-parallel-core-features-v2-tensor-parallelism-supported-models). Use it as a reference for creating your own initialization configuration function, adding it to your script, and passing it to the `init_method_using_config` parameter of the SMP `DelayedParamIniter` API.

  ```
  import torch
  import torch.nn as nn

  # Conv1D and LlamaRMSNorm are the Hugging Face Transformers classes referenced
  # below; the import paths may vary with your transformers version.
  from transformers.pytorch_utils import Conv1D
  from transformers.models.llama.modeling_llama import LlamaRMSNorm

  from torch.sagemaker.delayed_param import DelayedParamIniter
  from torch.sagemaker.utils.module_utils import empty_module_params, move_buffers_to_device

  # Define a custom init config function.
  # `config` is the Hugging Face model configuration (for example, model.config),
  # which provides `initializer_range`.
  def custom_init_method_using_config(module):
      d = torch.cuda.current_device()
      empty_module_params(module, device=d)
      if isinstance(module, (nn.Linear, Conv1D)):
          module.weight.data.normal_(mean=0.0, std=config.initializer_range)
          if module.bias is not None:
              module.bias.data.zero_()
      elif isinstance(module, nn.Embedding):
          module.weight.data.normal_(mean=0.0, std=config.initializer_range)
          if module.padding_idx is not None:
              module.weight.data[module.padding_idx].zero_()
      elif isinstance(module, nn.LayerNorm):
          module.weight.data.fill_(1.0)
          module.bias.data.zero_()
      elif isinstance(module, LlamaRMSNorm):
          module.weight.data.fill_(1.0)
      move_buffers_to_device(module, device=d)
  
  delayed_initer = DelayedParamIniter(model, init_method_using_config=custom_init_method_using_config)
  ```

  For more information about the `torch.sagemaker.module_util` functions in the preceding code snippet, see [`torch.sagemaker` util functions and properties](#model-parallel-v2-torch-sagemaker-reference-utils).
+ `verbose` (Boolean) – Whether to enable more detailed logging during initialization and validation. The default value is `False`.

**Methods**
+ `get_param_init_fn()` – Returns the parameter initialization function that you can pass to the `param_init_fn` argument of the PyTorch FSDP wrapper class.
+ `get_post_param_init_fn()` – Returns the parameter initialization function that you can pass to the `post_param_init_fn` argument of the PyTorch FSDP wrapper class. This is needed when you have tied weights in the model. The model must implement the method `tie_weights`. For more information, see the **Notes on tied weight** in [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md).
+ `count_num_params` (`module: nn.Module, *args: Tuple[nn.Parameter]`) – Tracks how many parameters are being initialized by the parameter initialization function. This helps implement the following `validate_params_and_buffers_inited` method. You usually don’t need to call this function explicitly, because the `validate_params_and_buffers_inited` method implicitly calls this method in the backend.
+ `validate_params_and_buffers_inited` (`enabled: bool=True`) – A context manager that helps validate that the number of parameters initialized matches the total number of parameters in the model, and that all parameters and buffers are on GPU devices instead of meta devices. It raises `AssertionError` if these conditions are not met. Using this context manager is optional; you're not required to use it to initialize parameters.
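The meta-device pattern that `DelayedParamIniter` manages can be illustrated with plain PyTorch (a conceptual sketch, not the SMP API): build the module on the meta device so no memory is allocated, then materialize and initialize its parameters later.

```python
import torch
import torch.nn as nn

# Construct on the meta device: parameter shapes exist, but no storage is allocated.
with torch.device("meta"):
    layer = nn.Linear(1024, 1024)
assert layer.weight.is_meta

# Later (SMP does this shard by shard under FSDP): materialize empty storage,
# then run the real initialization, mirroring an init_method_using_config function.
layer = layer.to_empty(device="cpu")
nn.init.normal_(layer.weight, mean=0.0, std=0.02)
nn.init.zeros_(layer.bias)
```

Delaying initialization this way means a large model never has to fit in memory fully initialized on a single device.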

### `torch.sagemaker.distributed.checkpoint.state_dict_saver.async_save`
<a name="model-parallel-v2-torch-sagemaker-reference-checkpoint-async-save"></a>

Entry API for asynchronous save. Use this method to save a `state_dict` asynchronously to a specified `checkpoint_id`. 

```
def async_save(
    state_dict: STATE_DICT_TYPE,
    *,
    checkpoint_id: Union[str, os.PathLike, None] = None,
    storage_writer: Optional[StorageWriter] = None,
    planner: Optional[SavePlanner] = None,
    process_group: Optional[dist.ProcessGroup] = None,
    coordinator_rank: int = 0,
    queue : AsyncCallsQueue = None,
    sharded_strategy: Union[SaveShardedStrategy, Tuple[str, int], None] = None,
    wait_error_handling: bool = True,
    force_check_all_plans: bool = True,
    s3_region: Optional[str] = None,
    s3client_config: Optional[S3ClientConfig] = None
) -> None:
```

**Parameters**
+ `state_dict` (dict) - Required. The state dict to save.
+ `checkpoint_id` (str) - Required. The storage path to save checkpoints to.
+ `storage_writer` (StorageWriter) - Optional. An instance of [`StorageWriter`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.StorageWriter) in PyTorch to perform write operations. If this is not specified, the default configuration of `StorageWriter` is used.
+ `planner` (SavePlanner) - Optional. An instance of [`SavePlanner`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.SavePlanner) in PyTorch. If this is not specified, the default configuration of `SavePlanner` is used.
+ `process_group` (ProcessGroup) - Optional. The process group to work on. If `None`, the default (global) process group is used.
+ `coordinator_rank` (int) - Optional. The rank of the coordinator when performing collective communication operators such as `AllReduce`.
+ `queue` (AsyncRequestQueue) - Optional. The async scheduler to use. By default, it takes the global parameter `DEFAULT_ASYNC_REQUEST_QUEUE`.
+ `sharded_strategy` (PyTorchDistSaveShardedStrategy) - Optional. The sharded strategy to use for saving checkpoints. If this is not specified, `torch.sagemaker.distributed.checkpoint.state_dict_saver.PyTorchDistSaveShardedStrategy` is used by default.
+ `wait_error_handling` (bool) - Optional. A flag specifying whether to wait for all ranks to finish error handling. The default value is `True`.
+ `force_check_all_plans` (bool) - Optional. A flag that determines whether to forcibly synchronize plans across ranks, even in the case of a cache hit. The default value is `True`.
+ `s3_region` (str) - Optional. The region where the S3 bucket is located. If not specified, the region is inferred from the `checkpoint_id`.
+ `s3client_config` (S3ClientConfig) - Optional. The dataclass exposing configurable parameters for the S3 client. If not provided, the default configuration of [S3ClientConfig](https://github.com/awslabs/s3-connector-for-pytorch/blob/main/s3torchconnector/src/s3torchconnector/_s3client/s3client_config.py#L7) is used. The `part_size` parameter is set to 64MB by default.

### `torch.sagemaker.distributed.checkpoint.state_dict_saver.maybe_finalize_async_calls`
<a name="model-parallel-v2-torch-sagemaker-reference-checkpoint-state-dict-saver"></a>

This function lets a training process finalize outstanding asynchronous save requests and wait for them to complete.

```
def maybe_finalize_async_calls(
    blocking=True, 
    process_group=None
) -> List[int]:
```

**Parameters**
+ `blocking` (bool) - Optional. If `True`, it will wait until all active requests are completed. Otherwise, it finalizes only the asynchronous requests that have already finished. The default value is `True`.
+ `process_group` (ProcessGroup) - Optional. The process group to operate on. If set to `None`, the default (global) process group is utilized.

**Returns**
+ A list containing the indices of the asynchronous calls that were successfully finalized.
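A pure-Python analogue of the `async_save`/`maybe_finalize_async_calls` pattern (illustrative only, using the standard library rather than the SMP API): queue each checkpoint write on a background thread, and finalize outstanding writes before queuing the next one.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def save_checkpoint(state, path):
    # Stand-in for an asynchronous checkpoint write.
    with open(path, "w") as f:
        json.dump(state, f)
    return path

executor = ThreadPoolExecutor(max_workers=1)
checkpoint_dir = tempfile.mkdtemp()
pending = []

for step in range(3):
    # ... a training step would run here, overlapping with the background write ...
    # Analogue of maybe_finalize_async_calls(blocking=True): wait for earlier writes.
    for future in pending:
        future.result()
    path = os.path.join(checkpoint_dir, f"ckpt_{step}.json")
    pending = [executor.submit(save_checkpoint, {"step": step}, path)]

for future in pending:  # finalize the last write before shutting down
    future.result()
executor.shutdown()

saved = sorted(os.listdir(checkpoint_dir))
print(saved)  # ['ckpt_0.json', 'ckpt_1.json', 'ckpt_2.json']
```

The point of the pattern is overlap: the training loop keeps running while the previous checkpoint is written, and only blocks when an older write has not yet finished.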

### `torch.sagemaker.distributed.checkpoint.state_dict_saver.save`
<a name="model-parallel-v2-torch-sagemaker-reference-checkpoint-save"></a>

Use this method to save a `state_dict` synchronously to a specified `checkpoint_id`.

```
def save(
    state_dict: STATE_DICT_TYPE,
    *,
    checkpoint_id: Union[str, os.PathLike, None] = None,
    storage_writer: Optional[StorageWriter] = None,
    planner: Optional[SavePlanner] = None,
    process_group: Optional[dist.ProcessGroup] = None,
    coordinator_rank: int = 0,
    wait_error_handling: bool = True,
    force_check_all_plans: bool = True,
    s3_region: Optional[str] = None,
    s3client_config: Optional[S3ClientConfig] = None
) -> None:
```

**Parameters**
+ `state_dict` (dict) - Required. The state dict to save.
+ `checkpoint_id` (str) - Required. The storage path to save checkpoints to.
+ `storage_writer` (StorageWriter) - Optional. An instance of [`StorageWriter`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.StorageWriter) in PyTorch to perform write operations. If this is not specified, the default configuration of `StorageWriter` is used.
+ `planner` (SavePlanner) - Optional. An instance of [`SavePlanner`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.SavePlanner) in PyTorch. If this is not specified, the default configuration of `SavePlanner` is used.
+ `process_group` (ProcessGroup) - Optional. The process group to work on. If `None`, the default (global) process group is used.
+ `coordinator_rank` (int) - Optional. The rank of the coordinator when performing collective communication operators such as `AllReduce`.
+ `wait_error_handling` (bool) - Optional. A flag specifying whether to wait for all ranks to finish error handling. The default value is `True`.
+ `force_check_all_plans` (bool) - Optional. A flag that determines whether to forcibly synchronize plans across ranks, even in the case of a cache hit. The default value is `True`.
+ `s3_region` (str) - Optional. The region where the S3 bucket is located. If not specified, the region is inferred from the `checkpoint_id`.
+ `s3client_config` (S3ClientConfig) - Optional. The dataclass exposing configurable parameters for the S3 client. If not provided, the default configuration of [S3ClientConfig](https://github.com/awslabs/s3-connector-for-pytorch/blob/main/s3torchconnector/src/s3torchconnector/_s3client/s3client_config.py#L7) is used. The `part_size` parameter is set to 64MB by default.

### `torch.sagemaker.distributed.checkpoint.state_dict_loader.load`
<a name="model-parallel-v2-torch-sagemaker-reference-checkpoint-load"></a>

Load the state dictionary of a distributed model (`state_dict`).

```
def load(
    state_dict: Dict[str, Any],
    *,
    checkpoint_id: Union[str, os.PathLike, None] = None,
    storage_reader: Optional[StorageReader] = None,
    planner: Optional[LoadPlanner] = None,
    process_group: Optional[dist.ProcessGroup] = None,
    check_keys_matched: bool = True,
    coordinator_rank: int = 0,
    s3_region: Optional[str] = None,
    s3client_config: Optional[S3ClientConfig] = None
) -> None:
```

**Parameters**
+ `state_dict` (dict) - Required. The `state_dict` to load.
+ `checkpoint_id` (str) - Required. The ID of a checkpoint. The meaning of the `checkpoint_id` depends on the storage. It can be a path to a folder or to a file. It can also be a key if the storage is a key-value store.
+ `storage_reader` (StorageReader) - Optional. An instance of [`StorageReader`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.StorageReader) in PyTorch to perform read operations. If not specified, distributed checkpointing automatically infers the reader based on the `checkpoint_id`. If `checkpoint_id` is also `None`, an exception is raised.
+ `planner` (LoadPlanner) - Optional. An instance of [`LoadPlanner`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.LoadPlanner) in PyTorch. If not specified, the default configuration of `LoadPlanner` is used.
+ `check_keys_matched` (bool) - Optional. If enabled, checks whether the `state_dict` keys of all ranks are matched using `AllGather`.
+ `s3_region` (str) - Optional. The region where the S3 bucket is located. If not specified, the region is inferred from the `checkpoint_id`.
+ `s3client_config` (S3ClientConfig) - Optional. The dataclass exposing configurable parameters for the S3 client. If not provided, the default configuration of [S3ClientConfig](https://github.com/awslabs/s3-connector-for-pytorch/blob/main/s3torchconnector/src/s3torchconnector/_s3client/s3client_config.py#L7) is used. The `part_size` parameter is set to 64MB by default.

### `torch.sagemaker.moe.moe_config.MoEConfig`
<a name="model-parallel-v2-torch-sagemaker-reference-moe"></a>

A configuration class for setting up the SMP implementation of Mixture-of-Experts (MoE). You can specify MoE configuration values through this class and pass it to the [`torch.sagemaker.transform`](#model-parallel-v2-torch-sagemaker-reference-transform) API call. To learn more about the usage of this class for training MoE models, see [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md).

```
class torch.sagemaker.moe.moe_config.MoEConfig(
    smp_moe=True,
    random_seed=12345,
    moe_load_balancing="sinkhorn",
    global_token_shuffle=False,
    moe_all_to_all_dispatcher=True,
    moe_aux_loss_coeff=0.001,
    moe_z_loss_coeff=0.001
)
```

**Parameters**
+ `smp_moe` (Boolean) - Whether to use the SMP implementation of MoE. The default value is `True`.
+ `random_seed` (Integer) - A seed number for the random operations in expert-parallel distributed modules. This seed is added to the expert-parallel rank to set the actual seed for each rank, so it is unique for each expert-parallel rank. The default value is `12345`.
+ `moe_load_balancing` (String) - Specifies the load balancing type of the MoE router. Valid options are `aux_loss`, `sinkhorn`, `balanced`, and `none`. The default value is `sinkhorn`.
+ `global_token_shuffle` (Boolean) - Whether to shuffle tokens across EP ranks within the same EP group. The default value is `False`.
+ `moe_all_to_all_dispatcher` (Boolean) - Whether to use all-to-all dispatcher for the communications in MoE. The default value is `True`.
+ `moe_aux_loss_coeff` (Float) - A coefficient for auxiliary load balancing loss. The default value is `0.001`.
+ `moe_z_loss_coeff` (Float) - Coefficient for z-loss. The default value is `0.001`.

### `torch.sagemaker.nn.attn.FlashSelfAttention`
<a name="model-parallel-v2-torch-sagemaker-reference-flashselfattention"></a>

An API for using [FlashAttention](model-parallel-core-features-v2-flashattention.md) with SMP v2.

```
class torch.sagemaker.nn.attn.FlashSelfAttention(
   attention_dropout_prob: float = 0.0,
   scale: Optional[float] = None,
   triton_flash_attention: bool = False,
   use_alibi: bool = False,
)
```

**Parameters**
+ `attention_dropout_prob` (float) – The dropout probability to apply to attention. The default value is `0.0`.
+ `scale` (float) – If passed, this scale factor is applied for softmax. If set to `None` (which is also the default value), the scale factor is `1 / sqrt(attention_head_size)`. The default value is `None`.
+ `triton_flash_attention` (bool) – If passed, the Triton implementation of flash attention is used. This is necessary to support Attention with Linear Biases (ALiBi) (see the following `use_alibi` parameter). This version of the kernel doesn’t support dropout. The default value is `False`.
+ `use_alibi` (bool) – If passed, it enables Attention with Linear Biases (ALiBi) using the mask provided. When using ALiBi, it needs an attention mask prepared as follows. The default value is `False`.

  ```
  def generate_alibi_attn_mask(attention_mask, batch_size, seq_length, 
      num_attention_heads, alibi_bias_max=8):
      device, dtype = attention_mask.device, attention_mask.dtype
      alibi_attention_mask = torch.zeros(
          1, num_attention_heads, 1, seq_length, dtype=dtype, device=device
      )
  
      alibi_bias = torch.arange(1 - seq_length, 1, dtype=dtype, device=device).view(
          1, 1, 1, seq_length
      )
      m = torch.arange(1, num_attention_heads + 1, dtype=dtype, device=device)
      m.mul_(alibi_bias_max / num_attention_heads)
      alibi_bias = alibi_bias * (1.0 / (2 ** m.view(1, num_attention_heads, 1, 1)))
  
      alibi_attention_mask.add_(alibi_bias)
      alibi_attention_mask = alibi_attention_mask[..., :seq_length, :seq_length]
      if attention_mask is not None and attention_mask.bool().any():
          alibi_attention_mask.masked_fill_(  # in-place; masked_fill would discard the result
              attention_mask.bool().view(batch_size, 1, 1, seq_length), float("-inf")
          )
  
      return alibi_attention_mask
  ```

**Methods**
+ `forward(self, qkv, attn_mask=None, causal=False, cast_dtype=None, layout="b h s d")` – A regular PyTorch module function. When a `module(x)` is called, SMP runs this function automatically.
  + `qkv` – `torch.Tensor` of the following form: `(batch_size x seqlen x (3 x num_heads) x head_size)` or `(batch_size x (3 x num_heads) x seqlen x head_size)`; or a tuple of `torch.Tensor`s, each of which might be of shape `(batch_size x seqlen x num_heads x head_size)` or `(batch_size x num_heads x seqlen x head_size)`. An appropriate `layout` argument must be passed based on the shape.
  + `attn_mask` – `torch.Tensor` of the following form `(batch_size x 1 x 1 x seqlen)`. To enable this attention mask parameter, it requires `triton_flash_attention=True` and `use_alibi=True`. To learn how to generate an attention mask using this method, see the code examples at [FlashAttention](model-parallel-core-features-v2-flashattention.md). The default value is `None`.
  + `causal` – When set to `False`, which is the default value of the argument, no mask is applied. When set to `True`, the `forward` method uses the standard lower triangular mask. The default value is `False`.
  + `cast_dtype` – When set to a particular `dtype`, it casts the `qkv` tensors to that `dtype` before `attn`. This is useful for implementations such as the Hugging Face Transformer GPT-NeoX model, which has `q` and `k` with `fp32` after rotary embeddings. If set to `None`, no cast is applied. The default value is `None`.
  + `layout` (string) – Available values are `b h s d` or `b s h d`. This should be set to the layout of `qkv` tensors passed, so appropriate transformations can be applied for `attn`. The default value is `b h s d`.

**Returns**

A single `torch.Tensor` with shape `(batch_size x num_heads x seq_len x head_size)`.

### `torch.sagemaker.nn.attn.FlashGroupedQueryAttention`
<a name="model-parallel-v2-torch-sagemaker-reference-flashGroupedQueryAttn"></a>

An API for using `FlashGroupedQueryAttention` with SMP v2. To learn more about the usage of this API, see [Use FlashAttention kernels for grouped-query attention](model-parallel-core-features-v2-flashattention.md#model-parallel-core-features-v2-flashattention-grouped-query).

```
class torch.sagemaker.nn.attn.FlashGroupedQueryAttention(
    attention_dropout_prob: float = 0.0,
    scale: Optional[float] = None,
)
```

**Parameters**
+ `attention_dropout_prob` (float) – The dropout probability to apply to attention. The default value is `0.0`.
+ `scale` (float) – If passed, this scale factor is applied for softmax. If set to `None`, `1 / sqrt(attention_head_size)` is used as the scale factor. The default value is `None`.

**Methods**
+ `forward(self, q, kv, causal=False, cast_dtype=None, layout="b s h d")` – A regular PyTorch module function. When a `module(x)` is called, SMP runs this function automatically.
  + `q` – `torch.Tensor` of the following form `(batch_size x seqlen x num_heads x head_size)` or `(batch_size x num_heads x seqlen x head_size)`. Appropriate layout arg must be passed based on the shape. 
  + `kv` – `torch.Tensor` of the following form `(batch_size x seqlen x (2 x num_heads) x head_size)` or `(batch_size x (2 x num_heads) x seqlen x head_size)`, or a tuple of two `torch.Tensor`s, each of which might be of shape `(batch_size x seqlen x num_heads x head_size)` or `(batch_size x num_heads x seqlen x head_size)`. An appropriate `layout` argument must also be passed based on the shape.
  + `causal` – When set to `False`, which is the default value of the argument, no mask is applied. When set to `True`, the `forward` method uses the standard lower triangular mask. The default value is `False`.
  + `cast_dtype` – When set to a particular dtype, it casts the `qkv` tensors to that dtype before `attn`. This is useful for implementations such as Hugging Face Transformers GPT-NeoX, which has `q,k` with `fp32` after rotary embeddings. If set to `None`, no cast is applied. The default value is `None`.
  + layout (string) – Available values are `"b h s d"` or `"b s h d"`. This should be set to the layout of `qkv` tensors passed, so appropriate transformations can be applied for `attn`. The default value is `"b h s d"`.

**Returns**

Returns a single `torch.Tensor` of shape `(batch_size x num_heads x seq_len x head_size)` that represents the output of the attention computation.
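The `layout` string only names the order of the four logical axes: batch (`b`), heads (`h`), sequence (`s`), and head size (`d`). The following pure-Python sketch (not SMP code; `layout_to_bhsd` is a hypothetical helper for illustration) shows the transposition a given layout implies:

```python
def layout_to_bhsd(shape, layout):
    """Return the axis permutation and resulting shape that map a qkv
    tensor in the given layout to the canonical "b h s d" order."""
    axes = layout.split()  # e.g. ["b", "s", "h", "d"]
    perm = tuple(axes.index(a) for a in ("b", "h", "s", "d"))
    return perm, tuple(shape[i] for i in perm)

# A "b s h d" tensor of (batch=2, seqlen=128, heads=16, head_size=64)
perm, bhsd = layout_to_bhsd((2, 128, 16, 64), "b s h d")
print(perm)   # (0, 2, 1, 3)
print(bhsd)   # (2, 16, 128, 64)
```

With `layout="b h s d"` (the default), the permutation is the identity and no transposition is needed.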

### `torch.sagemaker.nn.huggingface.llama_flashattn.LlamaFlashAttention`
<a name="model-parallel-v2-torch-sagemaker-reference-llamaFlashAttn"></a>

An API that supports FlashAttention for the Llama model. This API uses the [`torch.sagemaker.nn.attn.FlashGroupedQueryAttention`](#model-parallel-v2-torch-sagemaker-reference-flashGroupedQueryAttn) API under the hood. To learn how to use this, see [Use FlashAttention kernels for grouped-query attention](model-parallel-core-features-v2-flashattention.md#model-parallel-core-features-v2-flashattention-grouped-query).

```
class torch.sagemaker.nn.huggingface.llama_flashattn.LlamaFlashAttention(
    config: LlamaConfig
)
```

**Parameters**
+ `config` – A FlashAttention configuration for the Llama model.

**Methods**
+ `forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)`
  + `hidden_states` (`torch.Tensor`) – Hidden states tensor of shape `(batch_size x seq_len x num_heads x head_size)`.
  + `attention_mask` (`torch.LongTensor`) – Mask of shape `(batch_size x seqlen)` to avoid performing attention on padding token indices. The default value is `None`.
  + `position_ids` (`torch.LongTensor`) – If not `None`, a tensor of shape `(batch_size x seqlen)` indicating the indices of positions of each input sequence token in the position embeddings. The default value is `None`.
  + `past_key_value` (Cache) – Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks). The default value is `None`. 
  + `output_attentions` (bool) – Indicates whether to return the attentions tensors of all attention layers. The default value is `False`. 
  + `use_cache` (bool) – Indicates whether to return `past_key_values` key value states. The default value is `False`. 

**Returns**

Returns a single `torch.Tensor` of shape `(batch_size x num_heads x seq_len x head_size)` that represents the output of the attention computation.

### `torch.sagemaker.transform`
<a name="model-parallel-v2-torch-sagemaker-reference-transform"></a>

SMP v2 provides this `torch.sagemaker.transform()` API for transforming Hugging Face Transformer models to SMP model implementations and enabling the SMP tensor parallelism.

```
torch.sagemaker.transform(
    model: nn.Module, 
    device: Optional[torch.device] = None, 
    dtype: Optional[torch.dtype] = None, 
    config: Optional[Dict] = None, 
    load_state_dict_from_rank0: bool = False,
    cp_comm_type: str = "p2p"
)
```

SMP v2 maintains transformation policies for the [Hugging Face Transformer models compatible with the SMP tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md#model-parallel-core-features-v2-tensor-parallelism-supported-models) by converting the configuration of the Hugging Face Transformer models to the SMP transformer configuration.

**Parameters**
+ `model` (`torch.nn.Module`) – A model from [Hugging Face Transformer models compatible with the SMP tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md#model-parallel-core-features-v2-tensor-parallelism-supported-models) to transform and apply the tensor parallelism feature of the SMP library.
+ `device` (`torch.device`) – If passed, a new model is created on this device. If the original module has any parameter on meta device (see [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md)), then the transformed module will also be created on meta device, ignoring the argument passed here. The default value is `None`.
+ `dtype` (`torch.dtype`) – If passed, sets this as the dtype context manager for the creation of the model and creates a model with this dtype. This is typically unnecessary, as we want to create the model with `fp32` when using `MixedPrecision`, and `fp32` is the default dtype in PyTorch. The default value is `None`.
+ `config` (dict) – This is a dictionary for configuring the SMP transformer. The default value is `None`.
+ `load_state_dict_from_rank0` (Boolean) – By default, this module creates a new instance of the model with new weights. When this argument is set to `True`, SMP tries to load the state dictionary of the original PyTorch model from the 0th rank into transformed model for the tensor parallel group that the 0th rank is part of. When this is set to `True`, rank 0 can’t have any parameters on meta device. Only the first tensor parallel group populates the weights from the 0th rank after this transform call. You need to set `sync_module_states` to `True` in the FSDP wrapper to get these weights from the first tensor parallel group to all other processes. With this activated, the SMP library loads the state dictionary from the original model. The SMP library takes the `state_dict` of the model before transform, converts it to match the structure of the transformed model, shards it for each tensor parallel rank, communicates this state from the 0th rank to other ranks in the tensor parallel group that the 0th rank is part of, and loads it. The default value is `False`.
+ `cp_comm_type` (str) – Determines the context parallelism implementation and is only applicable when the `context_parallel_degree` is greater than 1. Available values for this parameter are `p2p` and `all_gather`. The `p2p` implementation utilizes peer-to-peer send-receive calls for key-and-value (KV) tensor accumulation during the attention computation, running asynchronously and allowing communication to overlap with computation. On the other hand, the `all_gather` implementation employs the `AllGather` communication collective operation for KV tensor accumulation. The default value is `"p2p"`.

**Returns**

Returns a transformed model that you can wrap with PyTorch FSDP. When `load_state_dict_from_rank0` is set to `True`, the tensor parallel group that involves rank 0 has weights loaded from the original state dictionary on rank 0. When using [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md) on the original model, only these ranks have the actual tensors on CPUs for the parameters and buffers of the transformed model. The rest of the ranks continue to have the parameters and buffers on the meta device to save memory.
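Both `cp_comm_type` values accumulate the same full KV sequence across the context parallel group and differ only in the communication pattern. The following pure-Python sketch (no real collectives; `kv_shards` is a hypothetical stand-in for per-rank KV tensor shards) contrasts the ring-style `p2p` accumulation with the one-shot `all_gather`:

```python
def accumulate_kv_p2p(kv_shards):
    """Ring-style accumulation: in each of the cp_degree - 1 steps, every
    rank receives one more shard from its neighbor, which is what lets
    the real p2p implementation overlap communication with attention."""
    n = len(kv_shards)
    gathered = [[shard] for shard in kv_shards]
    for step in range(1, n):
        for rank in range(n):
            # receive the shard that originated `step` ranks to the left
            gathered[rank].append(kv_shards[(rank - step) % n])
    # reorder so every rank holds shards in origin-rank order
    return [sorted(g) for g in gathered]

def accumulate_kv_all_gather(kv_shards):
    """One-shot AllGather: every rank obtains all shards at once."""
    return [list(kv_shards) for _ in kv_shards]

shards = ["kv0", "kv1", "kv2", "kv3"]  # one shard per context parallel rank
print(accumulate_kv_p2p(shards) == accumulate_kv_all_gather(shards))  # True
```

Both paths end with every rank holding the full KV sequence; `p2p` simply spreads the communication over steps that can run concurrently with attention compute.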

### `torch.sagemaker` util functions and properties
<a name="model-parallel-v2-torch-sagemaker-reference-utils"></a>

**torch.sagemaker util functions**
+ `torch.sagemaker.init(config: Optional[Union[str, Dict[str, Any]]] = None) -> None` – Initializes the PyTorch training job with SMP.
+ `torch.sagemaker.is_initialized() -> bool` – Checks whether the training job is initialized with SMP. When falling back to native PyTorch while the job is initialized with SMP, some of the properties are not relevant and become `None`, as indicated in the following **Properties** list.
+ `torch.sagemaker.utils.module_utils.empty_module_params(module: nn.Module, device: Optional[torch.device] = None, recurse: bool = False) -> nn.Module` – Creates empty parameters on the given `device`, if specified, optionally recursing through all nested modules.
+ `torch.sagemaker.utils.module_utils.move_buffers_to_device(module: nn.Module, device: torch.device, recurse: bool = False) -> nn.Module` – Moves module buffers to the given `device`, optionally recursing through all nested modules.

**Properties**

`torch.sagemaker.state` holds multiple useful properties after the initialization of SMP with `torch.sagemaker.init`.
+ `torch.sagemaker.state.hybrid_shard_degree` (int) – The sharded data parallelism degree, a copy from user input in the SMP configuration passed to `torch.sagemaker.init()`. To learn more, see [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md).
+ `torch.sagemaker.state.rank` (int) – The global rank for the device, in the range of `[0, world_size)`.
+ `torch.sagemaker.state.rep_rank_process_group` (`torch.distributed.ProcessGroup`) – The process group including all devices with the same replication rank. Note the subtle but fundamental difference with `torch.sagemaker.state.tp_process_group`. When falling back to native PyTorch, it returns `None`.
+ `torch.sagemaker.state.tensor_parallel_degree` (int) – The tensor parallelism degree, a copy from user input in the SMP configuration passed to `torch.sagemaker.init()`. To learn more, see [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md).
+ `torch.sagemaker.state.tp_size` (int) – An alias to `torch.sagemaker.state.tensor_parallel_degree`.
+ `torch.sagemaker.state.tp_rank` (int) – The tensor parallelism rank for the device in the range of `[0, tp_size)`, determined by the tensor parallelism degree and the ranking mechanism.
+ `torch.sagemaker.state.tp_process_group` (`torch.distributed.ProcessGroup`) – The tensor parallel process group including all devices with the same rank in other dimensions (for example, sharded data parallelism and replication) but unique tensor parallel ranks. When falling back to native PyTorch, it returns `None`.
+ `torch.sagemaker.state.world_size` (int) – The total number of devices used in training.
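To make the group definitions above concrete, the following pure-Python sketch partitions global ranks by tensor parallel coordinate. It assumes tensor parallelism is the innermost (fastest-varying) rank dimension; that layout is an assumption for illustration only, not SMP's documented rank ordering:

```python
def tp_groups(world_size, tp_size):
    """Tensor parallel groups: ranks that share all other parallelism
    coordinates but have unique tensor parallel ranks."""
    return [list(range(start, start + tp_size))
            for start in range(0, world_size, tp_size)]

def same_tp_rank_groups(world_size, tp_size):
    """Groups of ranks that share the same tp_rank."""
    return [list(range(tp_rank, world_size, tp_size))
            for tp_rank in range(tp_size)]

print(tp_groups(8, 2))            # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(same_tp_rank_groups(8, 2))  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Under this assumed layout, a device's `tp_rank` would be `rank % tp_size`, and every global rank appears in exactly one tensor parallel group.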

## Upgrade from SMP v1 to SMP v2
<a name="model-parallel-v2-upgrade-from-v1"></a>

To move from SMP v1 to SMP v2, you must make script changes to remove the SMP v1 APIs and apply the SMP v2 APIs. Instead of starting from your SMP v1 script, we recommend you start from a PyTorch FSDP script, and follow the instructions at [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md).

To bring SMP v1 *models* to SMP v2, in SMP v1 you must collect the full model state dictionary and apply the translation functions to convert it into the Hugging Face Transformers model checkpoint format. Then in SMP v2, as discussed in [Checkpointing using SMP](model-parallel-core-features-v2-checkpoints.md), you can load the Hugging Face Transformers model checkpoints and continue using the PyTorch checkpoint APIs with SMP v2. To use SMP with your PyTorch FSDP model, make sure that you move to SMP v2 and make changes to your training script to use PyTorch FSDP and other latest features.

```
import smdistributed.modelparallel.torch as smp

# Create model
model = ...
model = smp.DistributedModel(model)

# Run training
...

# Save v1 full checkpoint
if smp.rdp_rank() == 0:
    state_dict = model.state_dict(gather_to_rank0=True)  # gather the full model state dict to rank 0
    # Get the corresponding translation function in smp v1 and translate
    if model_type == "gpt_neox":
        from smdistributed.modelparallel.torch.nn.huggingface.gptneox import translate_state_dict_to_hf_gptneox
        translated_state_dict = translate_state_dict_to_hf_gptneox(state_dict, max_seq_len=None)
    
    # Save the checkpoint
    checkpoint_path = "checkpoint.pt"
    if smp.rank() == 0:
        smp.save(
            {"model_state_dict": translated_state_dict},
            checkpoint_path,
            partial=False,
        )
```

To find available translation functions in SMP v1, see [Support for Hugging Face Transformer Models](model-parallel-extended-features-pytorch-hugging-face.md).

For instructions on saving and loading model checkpoints in SMP v2, see [Checkpointing using SMP](model-parallel-core-features-v2-checkpoints.md).
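Conceptually, the translation step renames (and sometimes reshapes) state-dict keys from SMP v1's module naming to the Hugging Face format. The real mappings are model-specific and provided by the SMP v1 translation functions; the following minimal sketch uses made-up key names purely to illustrate the shape of the operation:

```python
# Hypothetical key mapping -- real mappings are model-specific and live
# in the SMP v1 translation functions, not in user code.
KEY_MAP = {
    "transformer.wte.weight": "gpt_neox.embed_in.weight",
    "transformer.lm_head.weight": "embed_out.weight",
}

def translate_keys(state_dict, key_map):
    """Return a new state dict with keys renamed per key_map;
    unmapped keys pass through unchanged."""
    return {key_map.get(k, k): v for k, v in state_dict.items()}

smp_v1_sd = {"transformer.wte.weight": [0.1], "transformer.lm_head.weight": [0.2]}
hf_sd = translate_keys(smp_v1_sd, KEY_MAP)
print(sorted(hf_sd))  # ['embed_out.weight', 'gpt_neox.embed_in.weight']
```

The translated dictionary is what you would then save and later load as a Hugging Face Transformers checkpoint in SMP v2.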

# Release notes for the SageMaker model parallelism library
<a name="model-parallel-release-notes"></a>

See the following release notes to track the latest updates for the SageMaker model parallelism (SMP) library. If you have further questions about the SMP library, contact the SMP service team at `sm-model-parallel-feedback@amazon.com`.

## The SageMaker model parallelism library v2.8.0
<a name="model-parallel-release-notes-20250306"></a>

*Date: April 01, 2025*

### SMP library updates
<a name="model-parallel-release-notes-20250306-smp-lib"></a>

**Bug fixes**
+ SMP gradient norm clipping now supports activation offloading.

### SMP Docker and Enroot containers
<a name="model-parallel-release-notes-20250306-smp-docker"></a>

The SMP library team distributes Docker containers as a replacement for the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify the distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to `v2.243.0` or later.

**Currency updates**
+ Added support for PyTorch v2.5.1
+ Upgraded CUDA support to v12.4
+ Upgraded NCCL support to v2.23.4
+ Upgraded the SMDDP library to v2.6.0

**Container details**
+ SMP Docker container for PyTorch v2.5.1 with CUDA v12.4

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.5.1-gpu-py311-cu124
  ```
+ SMP Enroot container for PyTorch v2.5.1 with CUDA v12.4

  ```
  https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/enroot/2.5.1-gpu-py311-cu124.sqsh
  ```
+ Pre-installed packages
  + The SMP library v2.8.0
  + The SMDDP library v2.6.0
  + CUDNN v9.4.0
  + FlashAttention v2.5.8
  + TransformerEngine v1.10
  + Megatron v0.8.0
  + Hugging Face Transformers v4.44.2
  + Hugging Face Datasets library v2.19.0
  + EFA v1.36.0
  + NCCL v2.23.4
  + AWS-OFI-NCCL v1.13.2

### SMP Conda channel
<a name="model-parallel-release-notes-20250306-smp-conda-channel"></a>

The following S3 bucket is the public Conda channel of the SMP library hosted by the SMP service team. If you want to install the SMP v2 library in an environment such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.
+ `https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/`

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.
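As a sketch of how the channel can be referenced, a Conda environment file could list it as follows; the package entry is a placeholder you should confirm against the channel's index before installing.

```
# environment.yml (sketch)
name: smp-v2
channels:
  - https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/
  - defaults
dependencies:
  # Placeholder -- check the channel index for the exact SMP package
  # name and version.
  - <smp-package>
```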

## The SageMaker model parallelism library v2.7.0
<a name="model-parallel-release-notes-20241204"></a>

*Date: December 04, 2024*

### SMP library updates
<a name="model-parallel-release-notes-20241204-smp-lib"></a>

**New features**
+ Added support for [SageMaker HyperPod recipes](sagemaker-hyperpod-recipes.md).

### SMP Docker and Enroot containers
<a name="model-parallel-release-notes-20241204-smp-docker"></a>

The SMP library team distributes Docker and Enroot containers as a replacement for the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify the distribution configuration to use SMP v2, SageMaker automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to `v2.237.0` or later.

**Container details**
+ SMP Docker container for PyTorch v2.4.1 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121
  ```
+ SMP Enroot container for PyTorch v2.4.1 with CUDA v12.1

  ```
  https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/enroot/2.4.1-gpu-py311-cu121.sqsh
  ```
+ Pre-installed packages
  + The SMP library v2.7.0
  + The SMDDP library v2.5.0
  + CUDNN v9.4.0
  + FlashAttention v2.5.8
  + TransformerEngine v1.10
  + Megatron v0.8.0
  + Hugging Face Transformers v4.44.2
  + Hugging Face Datasets library v2.19.0
  + EFA v1.32.0
  + NCCL v2.21.5

### SMP Conda channel
<a name="model-parallel-release-notes-20241204-smp-conda-channel"></a>

The following S3 bucket is the public Conda channel of the SMP library hosted by the SMP service team. If you want to install the SMP v2 library in an environment such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.
+ `https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/`

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.

## The SageMaker model parallelism library v2.6.1
<a name="model-parallel-release-notes-20241031"></a>

*Date: October 31, 2024*

### SMP library updates
<a name="model-parallel-release-notes-20241031-smp-lib"></a>

**Bug fixes**
+ Fixed an `ImportError` issue that occurred when using older training scripts with SMP v2.6.0. This fixes the backward incompatibility with SMP v2.6.0.
+ Added a `DeprecationWarning` for `torch.sagemaker.distributed.fsdp.checkpoint`. This module will be deprecated and removed in SMP v2.7.0. If you're currently using `torch.sagemaker.distributed.fsdp.checkpoint` in your code, you should plan to update your scripts before the release of SMP v2.7.0 to avoid issues in the future.
+ Fixed a backward compatibility issue identified in SMP v2.6.0. This issue was related to the deprecation of the `USE_PG_WITH_UTIL` checkpoint method in SMP v2.6.0, which broke backward compatibility with previous versions of training scripts. To resolve this issue, re-run your PyTorch training jobs to pick up the latest SMP container packaged with SMP v2.6.1.

### SMP Docker container
<a name="model-parallel-release-notes-20241031-smp-docker"></a>

The SMP library team distributes Docker containers as a replacement for the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify the distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers.

**Container details**
+ SMP Docker container for PyTorch v2.4.1 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121
  ```
+ Pre-installed packages
  + The SMP library v2.6.1
  + The SMDDP library v2.5.0
  + CUDNN v9.4.0
  + FlashAttention v2.5.8
  + TransformerEngine v1.10
  + Megatron v0.8.0
  + Hugging Face Transformers v4.44.2
  + Hugging Face Datasets library v2.19.0
  + EFA v1.32.0
  + NCCL v2.21.5

### SMP Conda channel
<a name="model-parallel-release-notes-20241031-smp-conda-channel"></a>

The following S3 bucket is the public Conda channel of the SMP library hosted by the SMP service team. If you want to install the SMP v2 library in an environment of highly customizable compute resources such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.
+ `https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/`

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.

## The SageMaker model parallelism library v2.6.0
<a name="model-parallel-release-notes-20241017"></a>

*Date: October 17, 2024*

### SMP library updates
<a name="model-parallel-release-notes-20241017-smp-lib"></a>

**New features**
+ Added support for the following LLM model configurations. You can start using [Context parallelism](model-parallel-core-features-v2-context-parallelism.md) and [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md).
  + [Llama3.1 8B](https://huggingface.co/meta-llama/Llama-3.1-8B)
  + [Llama3.1 70B](https://huggingface.co/meta-llama/Llama-3.1-70B)
  + [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.3)
+ Added [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md) support for the following Mixtral model configurations.
  + [Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)
  + [Mixtral 8x22B](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1)
+ Added support for an AllGather-based context parallelism implementation that utilizes the AllGather communication collective to obtain the full sequence of key-and-value tensors. Available implementations are `p2p` and `all_gather`. The `p2p` implementation utilizes peer-to-peer send-receive calls for key-and-value (KV) tensor accumulation during the attention computation, running asynchronously and allowing communication to overlap with computation. On the other hand, the `all_gather` implementation employs the `AllGather` communication collective operation for KV tensor accumulation. To learn how to apply these context parallelism implementations, see [Context parallelism](model-parallel-core-features-v2-context-parallelism.md).
+ Added support for tuning the Rotary Position Embedding (RoPE) theta value.

**Bug fixes**
+ Fixed a bug where Rotary Position Embedding (RoPE) wasn't properly initialized during pre-training when delayed parameter initialization is enabled.

**Known issues**
+ Transformer Engine does not currently support context parallelism or FP8 with sliding window attention enabled. Thus, the SMP version of Mistral transformers doesn't support context parallelism or FP8 training when the sliding window configuration is set to a non-null value.

### SMP Docker container
<a name="model-parallel-release-notes-20241017-smp-docker"></a>

The SMP library team distributes Docker containers as a replacement for the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify the distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers.

**Currency updates**
+ Upgraded PyTorch to v2.4.1
+ Upgraded Megatron to v0.8.0
+ Upgraded the TransformerEngine library to v1.10
+ Upgraded Transformers to v4.44.2
+ Upgraded cuDNN to v9.4.0.58

**Container details**
+ SMP Docker container for PyTorch v2.4.1 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121
  ```
+ Pre-installed packages
  + The SMP library v2.6.0
  + The SMDDP library v2.5.0
  + CUDNN v9.4.0
  + FlashAttention v2.5.8
  + TransformerEngine v1.10
  + Megatron v0.8.0
  + Hugging Face Transformers v4.44.2
  + Hugging Face Datasets library v2.19.0
  + EFA v1.32.0
  + NCCL v2.21.5

### SMP Conda channel
<a name="model-parallel-release-notes-20241017-smp-conda-channel"></a>

The following S3 bucket is the public Conda channel of the SMP library hosted by the SMP service team. If you want to install the SMP v2 library in an environment of highly customizable compute resources such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.
+ `https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/`

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.

## The SageMaker model parallelism library v2.5.0
<a name="model-parallel-release-notes-20240828"></a>

*Date: August 28, 2024*

### SMP library updates
<a name="model-parallel-release-notes-20240828-smp-lib"></a>

**New features**
+ Added support for mixed-precision training using FP8 data format on P5 instances for the Mixtral model.
  + Supported Mixtral configurations are 8x7B and 8x22B. To learn more, see [Mixed precision training with FP8 on P5 instances using Transformer Engine](model-parallel-core-features-v2-mixed-precision.md#model-parallel-core-features-v2-mixed-precision-fp8-training-on-p5).
+ Added support for [Context parallelism](model-parallel-core-features-v2-context-parallelism.md) for the following model configurations.
  + Llama-v2: 7B and 70B
  + Llama-v3: 8B and 70B
  + GPT-NeoX: 20B
+ Added support for saving checkpoints asynchronously. To learn more, see [Checkpointing using SMP](model-parallel-core-features-v2-checkpoints.md).
  + Support for saving checkpoints to S3 directly without using Amazon EBS or file servers.

**Bug fixes**
+ Resolved an issue that caused unexpectedly high initial loss during Llama fine-tuning when loading a pre-trained model checkpoint and utilizing tensor parallelism.

**Notes**
+ To use activation checkpointing for Mixtral with FP8 mixed precision, you will need to checkpoint the attention and expert layers separately. For an example of setting it up properly, see the [example training script](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel_v2/shared-scripts/train_utils.py) in the *Amazon SageMaker AI Examples repository*.

**Known issues**
+ The balanced load balancing type in the MoE configuration ([`torch.sagemaker.moe.moe_config.MoEConfig`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-moe)) is currently incompatible with activation checkpointing.
+ With context parallelism, GPT-NeoX shows performance regression in both pre-training and fine-tuning.
+ For GPT-NeoX on P4 instances, directly loading weights from a delayed parameter initialized transformed model into a Hugging Face transformer model leads to a loss mismatch on the first step.

### SMP Docker container
<a name="model-parallel-release-notes-20240828-smp-docker"></a>

The SMP library team distributes Docker containers as a replacement for the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify the distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.224.0 or later.

**Currency updates**
+ Upgraded the FlashAttention library to v2.5.8
+ Upgraded the Transformer Engine library to v1.8
  + If you want to install Transformer Engine in a Conda environment, you need to build from the source and cherry-pick the specific upstream fixes ([744624d](https://github.com/NVIDIA/TransformerEngine/commit/744624d004f4514ffbaa90ac83e214311c86c607), [27c6342](https://github.com/NVIDIA/TransformerEngine/commit/27c6342ea8ad88034bf04b587dd13cb6088d2474), [7669bf3](https://github.com/NVIDIA/TransformerEngine/commit/7669bf3da68074517b134cd6acebd04b221fd545)).

**Container details**
+ SMP Docker container for PyTorch v2.3.1 with CUDA v12.1

  ```
  658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.3.1-gpu-py311-cu121
  ```

  For a complete list of supported regions, see [AWS Regions](distributed-data-parallel-support.md#distributed-data-parallel-availablity-zone).
+ Pre-installed packages
  + The SMP library v2.5.0
  + The SMDDP library v2.3.0
  + CUDNN v8.9.7.29
  + FlashAttention v2.5.8
  + TransformerEngine v1.8
  + Megatron v0.7.0
  + Hugging Face Transformers v4.40.1
  + Hugging Face Datasets library v2.19.0
  + EFA v1.32.0
  + NCCL v2.21.5

### SMP Conda channel
<a name="model-parallel-release-notes-20240828-smp-conda-channel"></a>

The following S3 bucket is the public Conda channel of the SMP library hosted by the SMP service team. If you want to install the SMP v2 library in an environment of highly customizable compute resources such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.
+ `https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/`

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.

## The SageMaker model parallelism library v2.4.0
<a name="model-parallel-release-notes-20240620"></a>

*Date: June 20, 2024*

### SMP library updates
<a name="model-parallel-release-notes-20240620-lib"></a>

**Bug fixes**
+ Fixed a bug that caused incorrect logit shapes when labels were not passed in the forward pass while using the SMP Transformer.

**Currency updates**
+ Added support for PyTorch v2.3.1.
+ Added support for Python v3.11.
+ Added support for the Hugging Face Transformers library v4.40.1.

**Deprecations**
+ Discontinued support for Python v3.10.
+ Discontinued support for the Hugging Face Transformers library versions before v4.40.1.

**Other changes**
+ Included a patch to toggle saving de-duplicated tensors on different ranks. To learn more, see the [discussion thread](https://github.com/pytorch/pytorch/pull/126569) in the PyTorch GitHub repository.

**Known issues**
+ There is a known issue that the loss might spike and then resume at a higher loss value while fine-tuning Llama-3 70B with tensor parallelism.

### SMP Docker container
<a name="model-parallel-release-notes-20240620-container"></a>

The SMP library team distributes Docker containers as a replacement for the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify the distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.224.0 or later.

**Currency updates**
+ Upgraded the SMDDP library to v2.3.0.
+ Upgraded the NCCL library to v2.21.5.
+ Upgraded the EFA software to v1.32.0.

**Deprecations**
+ Discontinued the installation of the [Torch Distributed Experimental (torchdistX) library](https://pytorch.org/torchdistx/latest/index.html).

**Container details**
+ SMP Docker container for PyTorch v2.3.1 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.3.1-gpu-py311-cu121
  ```
+ Pre-installed packages
  + The SMP library v2.4.0
  + The SMDDP library v2.3.0
  + CUDNN v8.9.7.29
  + FlashAttention v2.3.3
  + TransformerEngine v1.2.1
  + Hugging Face Transformers v4.40.1
  + Hugging Face Datasets library v2.19.0
  + EFA v1.32.0
  + NCCL v2.21.5

### SMP Conda channel
<a name="model-parallel-release-notes-20240620-conda-channel"></a>

The following S3 bucket is the public Conda channel of the SMP library hosted by the SMP service team. If you want to install the SMP v2 library in an environment of highly customizable compute resources such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.
+ `https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/`

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.

## The SageMaker model parallelism library v2.3.1
<a name="model-parallel-release-notes-20240509"></a>

*Date: May 9, 2024*

**Bug fixes**
+ Fixed an `ImportError` issue when using `moe_load_balancing=balanced` in [`torch.sagemaker.moe.moe_config.MoEConfig`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-moe) for expert parallelism.
+ Fixed a fine-tuning issue where the [`torch.sagemaker.transform`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-transform) call raised `KeyError` when `load_state_dict_from_rank0` is enabled.
+ Fixed an out-of-memory (OOM) error raised when loading large Mixture of Experts (MoE) models, such as Mixtral 8x22B, for fine-tuning.

**SMP Docker container**

The SMP library team distributes Docker containers as a replacement for the SageMaker PyTorch framework containers. This release incorporates the aforementioned bug fixes into the following SMP Docker image.
+ SMP Docker container for PyTorch v2.2.0 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121
  ```

## The SageMaker model parallelism library v2.3.0
<a name="model-parallel-release-notes-20240409"></a>

*Date: April 11, 2024*

**New features**
+ Added a new core feature, *expert parallelism*, to support Mixture of Experts transformer models. To learn more, see [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md).

**SMP Docker container**

The SMP library team distributes Docker containers as replacements for the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.214.4 or later.
+ SMP Docker container for PyTorch v2.2.0 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121
  ```
  + Pre-installed packages in this Docker container
    + The SMDDP library v2.2.0
    + CUDNN v8.9.5.29
    + FlashAttention v2.3.3
    + TransformerEngine v1.2.1
    + Hugging Face Transformers v4.37.1
    + Hugging Face Datasets library v2.16.1
    + Megatron-core 0.5.0
    + EFA v1.30.0
    + NCCL v2.19.4
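For reference, the distribution configuration that signals the SageMaker Python SDK to pick up the SMP v2 container takes roughly the following shape. This is a sketch only; the parameter names and values inside `parameters` (here, `expert_parallel_degree` for the expert parallelism feature added in this release) are illustrative assumptions that you should verify against the SMP v2 configuration reference for your library version.

```python
# Hypothetical sketch of the `distribution` argument for the SageMaker
# PyTorch estimator when activating SMP v2. Verify parameter names and
# values against the SMP v2 configuration reference before use.
smp_distribution = {
    "torch_distributed": {"enabled": True},
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {
                # Illustrative value; set the degree to match your cluster.
                "expert_parallel_degree": "2",
            },
        },
    },
}

# Pass this dict to the estimator, for example:
# PyTorch(..., distribution=smp_distribution)
```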

## The SageMaker model parallelism library v2.2.0
<a name="model-parallel-release-notes-20240307"></a>

*Date: March 7, 2024*

**New Features**
+ Added support for [FP8 training](model-parallel-core-features-v2-mixed-precision.md#model-parallel-core-features-v2-mixed-precision-fp8-training-on-p5) of the following Hugging Face transformer models on P5 instances with Transformer Engine integration:
  + GPT-NeoX
  + Llama 2

**Bug Fixes**
+ Fixed a bug where tensors were not guaranteed to be contiguous before the `AllGather` collective call during tensor parallelism training.

**Currency Updates**
+ Added support for PyTorch v2.2.0.
+ Upgraded the SMDDP library to v2.2.0. 
+ Upgraded the FlashAttention library to v2.3.3.
+ Upgraded the NCCL library to v2.19.4.

**Deprecation**
+ Discontinued support for Transformer Engine versions before v1.2.0.

**Known issues**
+ The SMP [Activation offloading](model-parallel-core-features-v2-pytorch-activation-offloading.md) feature currently does not work. Use the native PyTorch activation offloading instead.

**Other changes**
+ Included a patch to fix the performance regression discussed in the issue thread at [https://github.com/pytorch/pytorch/issues/117748](https://github.com/pytorch/pytorch/issues/117748) in the PyTorch GitHub repository.

**SMP Docker container**

The SMP library team distributes Docker containers as replacements for the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.212.0 or later.
+ SMP Docker container for PyTorch v2.2.0 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121
  ```
  + Available for P4d, P4de, and P5 instances
  + Pre-installed packages in this Docker container
    + The SMDDP library v2.2.0
    + CUDNN v8.9.5.29
    + FlashAttention v2.3.3
    + TransformerEngine v1.2.1
    + Hugging Face Transformers v4.37.1
    + Hugging Face Datasets library v2.16.1
    + EFA v1.30.0
    + NCCL v2.19.4

## The SageMaker model parallelism library v2.1.0
<a name="model-parallel-release-notes-20240206"></a>

*Date: February 6, 2024*

**Currency Updates**
+ Added support for PyTorch v2.1.2.

**Deprecation**
+ Discontinued support for Hugging Face Transformers v4.31.0.

**Known issues**
+ An issue was discovered where fine-tuning the Hugging Face Llama 2 model with `attn_implementation=flash_attention_2` and FSDP causes the model to diverge. For reference, see the [issue ticket](https://github.com/huggingface/transformers/issues/28826) in the *Hugging Face Transformers GitHub repository*. To avoid the divergence issue, use `attn_implementation=sdpa`. Alternatively, use the SMP transformer model implementation by setting `use_smp_implementation=True`.

**SMP Docker container**

The SMP library team distributes Docker containers as replacements for the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.207.0 or later.
+ SMP Docker container for PyTorch v2.1.2 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121
  ```
  + Available for P4d, P4de, and P5 instances
  + Pre-installed packages in this Docker container
    + The SMDDP library v2.1.0
    + CUDNN v8.9.5.29
    + FlashAttention v2.3.3
    + TransformerEngine v1.2.1
    + Hugging Face Transformers v4.37.1
    + Hugging Face Datasets library v2.16.1
    + EFA v1.30.0

**SMP Conda channel**

The following S3 bucket is a public Conda channel hosted by the SMP service team. If you want to install the SMP v2 library in an environment of highly customizable compute resources such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.
+ `https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/`

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.

## The SageMaker model parallelism library v2.0.0
<a name="model-parallel-release-notes-20231219"></a>

*Date: December 19, 2023*

**New features**

Released the SageMaker model parallelism (SMP) library v2.0.0 with the following new offerings.
+ A new `torch.sagemaker` package, entirely revamped from the previous `smdistributed.modelparallel.torch` package in SMP v1.x. 
+ Support for PyTorch 2.0.1.
+ Support for PyTorch FSDP.
+ Tensor parallelism implementation by integrating with the [Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine/index.html) library.
+ Support for both [SageMaker Training](train-model.md) and [SageMaker HyperPod](sagemaker-hyperpod.md).

**Breaking changes**
+ SMP v2 revamped the APIs entirely and provides the `torch.sagemaker` package. Mostly, you only need to initialize with the `torch.sagemaker.init()` module and pass model parallel configuration parameters. With this new package, you can significantly simplify code modifications in your training script. To learn more about adapting your training script to use SMP v2, see [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md).
+ If you've used SMP v1 for training Hugging Face Transformer models and want to reuse the models in SMP v2, see [Upgrade from SMP v1 to SMP v2](distributed-model-parallel-v2-reference.md#model-parallel-v2-upgrade-from-v1).
+ For PyTorch FSDP training, you should use SMP v2.

**Known issues**
+ Activation checkpointing currently only works with the following wrapping policies with FSDP.
  + `auto_wrap_policy = functools.partial(transformer_auto_wrap_policy, ...)`
+ To use [Activation offloading](model-parallel-core-features-v2-pytorch-activation-offloading.md), FSDP activation checkpointing type must be [REENTRANT](https://pytorch.org/docs/stable/checkpoint.html).
+ When running with tensor parallelism enabled and the sharded data parallel degree set to `1`, you must use `backend = nccl`. The `smddp` backend option is not supported in this scenario.
+ [Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine/index.html) is required to use PyTorch with the SMP library even when not using tensor parallelism.

**Other changes**
+ Starting from this release, the documentation for the SageMaker model parallelism library is fully available in this *Amazon SageMaker AI Developer Guide*. With this complete developer guide for SMP v2 available in the *Amazon SageMaker AI Developer Guide*, the [additional reference for SMP v1.x](https://sagemaker.readthedocs.io/en/stable/api/training/distributed.html#the-sagemaker-distributed-model-parallel-library) in the *SageMaker Python SDK documentation* is deprecated. If you still need the documentation for SMP v1.x, the developer guide for SMP v1.x is available at [(Archived) SageMaker model parallelism library v1.x](model-parallel.md), and the SMP Python library v1.x reference is available in the [SageMaker Python SDK v2.199.0 documentation](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.html).

**Deprecations**
+ Discontinued support for TensorFlow.
+ Discontinued pipeline parallelism support in SMP v2.
+ Discontinued support for the DeepSpeed library in favor of native PyTorch FSDP.

**SMP Docker container**

The SMP library team distributes Docker containers as replacements for the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.207.0 or later.
+ SMP Docker container for PyTorch v2.0.1 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.0.1-gpu-py310-cu121
  ```

# (Archived) SageMaker model parallelism library v1.x
<a name="model-parallel"></a>

**Important**  
As of December 19, 2023, the SageMaker model parallelism (SMP) library v2 is released. In favor of the SMP library v2, the SMP v1 capabilities are no longer supported in future releases. The following section and topics are archived and are specific to using the SMP library v1. For information about using the SMP library v2, see [SageMaker model parallelism library v2](model-parallel-v2.md).

Use Amazon SageMaker AI's model parallel library to train large deep learning (DL) models that are difficult to train due to GPU memory limitations. The library automatically and efficiently splits a model across multiple GPUs and instances. Using the library, you can achieve a target prediction accuracy faster by efficiently training larger DL models with billions or trillions of parameters.

You can use the library to automatically partition your own TensorFlow and PyTorch models across multiple GPUs and multiple nodes with minimal code changes. You can access the library's API through the SageMaker Python SDK.

Use the following sections to learn more about model parallelism and the SageMaker model parallel library. This library's API documentation is located at [Distributed Training APIs](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html) in the *SageMaker Python SDK v2.199.0 documentation*. 

**Topics**
+ [Introduction to Model Parallelism](model-parallel-intro.md)
+ [Supported Frameworks and AWS Regions](distributed-model-parallel-support.md)
+ [Core Features of the SageMaker Model Parallelism Library](model-parallel-core-features.md)
+ [Run a SageMaker Distributed Training Job with Model Parallelism](model-parallel-use-api.md)
+ [Checkpointing and Fine-Tuning a Model with Model Parallelism](distributed-model-parallel-checkpointing-and-finetuning.md)
+ [Amazon SageMaker AI model parallelism library v1 examples](distributed-model-parallel-examples.md)
+ [SageMaker Distributed Model Parallelism Best Practices](model-parallel-best-practices.md)
+ [The SageMaker Distributed Model Parallelism Library Configuration Tips and Pitfalls](model-parallel-customize-tips-pitfalls.md)
+ [Model Parallel Troubleshooting](distributed-troubleshooting-model-parallel.md)

# Introduction to Model Parallelism
<a name="model-parallel-intro"></a>

Model parallelism is a distributed training method in which the deep learning model is partitioned across multiple devices, within or across instances. This introduction page provides a high-level overview about model parallelism, a description of how it can help overcome issues that arise when training DL models that are typically very large in size, and examples of what the SageMaker model parallel library offers to help manage model parallel strategies as well as memory consumption.

## What is Model Parallelism?
<a name="model-parallel-what-is"></a>

Increasing the size of deep learning models (layers and parameters) yields better accuracy for complex tasks such as computer vision and natural language processing. However, there is a limit to the maximum model size you can fit in the memory of a single GPU. When training DL models, GPU memory limitations can be bottlenecks in the following ways:
+ They limit the size of the model you can train, since the memory footprint of a model scales proportionally to the number of parameters.
+ They limit the per-GPU batch size during training, driving down GPU utilization and training efficiency.

To overcome the limitations associated with training a model on a single GPU, SageMaker provides the model parallel library to help distribute and train DL models efficiently on multiple compute nodes. Furthermore, with the library, you can achieve the most optimized distributed training using EFA-supported devices, which enhance the performance of inter-node communication with low latency, high throughput, and OS bypass.

## Estimate Memory Requirements Before Using Model Parallelism
<a name="model-parallel-intro-estimate-memory-requirements"></a>

Before you use the SageMaker model parallel library, consider the following to get a sense of the memory requirements of training large DL models.

For a training job that uses AMP (FP16) and Adam optimizers, the required GPU memory per parameter is about 20 bytes, which we can break down as follows:
+ An FP16 parameter ~ 2 bytes
+ An FP16 gradient ~ 2 bytes
+ An FP32 optimizer state ~ 8 bytes based on the Adam optimizers
+ An FP32 copy of parameter ~ 4 bytes (needed for the `optimizer apply` (OA) operation)
+ An FP32 copy of gradient ~ 4 bytes (needed for the OA operation)

Even a relatively small DL model with 10 billion parameters can require at least 200 GB of memory, which is much larger than the typical GPU memory available on a single GPU (for example, NVIDIA A100 with 40 GB or 80 GB of memory, and V100 with 16 GB or 32 GB). Note that on top of the memory requirements for model and optimizer states, there are other memory consumers, such as activations generated in the forward pass. The memory required can be much greater than 200 GB.
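The per-parameter breakdown above can be turned into a quick back-of-the-envelope calculation. The following sketch only accounts for model and optimizer states (not activations), so treat its result as a lower bound.

```python
# Rough GPU memory estimate for AMP (FP16) training with Adam, following
# the ~20 bytes-per-parameter breakdown above. Activations are excluded,
# so real usage is higher.
BYTES_PER_PARAM = (
    2    # FP16 parameter
    + 2  # FP16 gradient
    + 8  # FP32 Adam optimizer states
    + 4  # FP32 copy of parameter (for the optimizer apply (OA) operation)
    + 4  # FP32 copy of gradient (for the OA operation)
)

def min_training_memory_gb(num_params: float) -> float:
    """Lower bound on memory for model and optimizer states, in GB."""
    return num_params * BYTES_PER_PARAM / 1e9

# A 10-billion-parameter model needs at least ~200 GB:
print(min_training_memory_gb(10e9))  # 200.0
```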

For distributed training, we recommend that you use Amazon EC2 P3 and P4 instances that have NVIDIA V100 and A100 Tensor Core GPUs respectively. For more details about specifications such as CPU cores, RAM, attached storage volume, and network bandwidth, see the *Accelerated Computing* section in the [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/) page.

Even with these accelerated computing instances, models with about 10 billion parameters, such as Megatron-LM and T5, and even larger models with hundreds of billions of parameters, such as GPT-3, cannot fit a model replica on each GPU device. 

## How the Library Employs Model Parallelism and Memory Saving Techniques
<a name="model-parallel-intro-features"></a>

The library consists of various types of model parallelism features and memory-saving features such as optimizer state sharding, activation checkpointing, and activation offloading. All these techniques can be combined to efficiently train large models that consist of hundreds of billions of parameters.

**Topics**
+ [Sharded data parallelism (available for PyTorch)](#model-parallel-intro-sdp)
+ [Pipeline parallelism (available for PyTorch and TensorFlow)](#model-parallel-intro-pp)
+ [Tensor parallelism (available for PyTorch)](#model-parallel-intro-tp)
+ [Optimizer state sharding (available for PyTorch)](#model-parallel-intro-oss)
+ [Activation offloading and checkpointing (available for PyTorch)](#model-parallel-intro-activation-offloading-checkpointing)
+ [Choosing the right techniques for your model](#model-parallel-intro-choosing-techniques)

### Sharded data parallelism (available for PyTorch)
<a name="model-parallel-intro-sdp"></a>

*Sharded data parallelism* is a memory-saving distributed training technique that splits the state of a model (model parameters, gradients, and optimizer states) across GPUs within a data-parallel group.

SageMaker AI implements sharded data parallelism through MiCS, a library that **mi**nimizes **c**ommunication **s**cale, which is discussed in the blog post [Near-linear scaling of gigantic-model training on AWS](https://www.amazon.science/blog/near-linear-scaling-of-gigantic-model-training-on-aws).

You can apply sharded data parallelism to your model as a stand-alone strategy. Furthermore, if you are using the most performant GPU instances equipped with NVIDIA A100 Tensor Core GPUs, `ml.p4d.24xlarge`, you can take advantage of the improved training speed of the `AllGather` operation offered by SMDDP Collectives.

To dive deep into sharded data parallelism and learn how to set it up or use a combination of sharded data parallelism with other techniques like tensor parallelism and FP16 training, see [Sharded Data Parallelism](model-parallel-extended-features-pytorch-sharded-data-parallelism.md).

### Pipeline parallelism (available for PyTorch and TensorFlow)
<a name="model-parallel-intro-pp"></a>

*Pipeline parallelism* partitions the set of layers or operations across the set of devices, leaving each operation intact. When you specify a value for the number of model partitions (`pipeline_parallel_degree`), the total number of GPUs (`processes_per_host`) must be divisible by the number of model partitions. To set this up properly, you have to specify the correct values for the `pipeline_parallel_degree` and `processes_per_host` parameters. The simple math is as follows:

```
(pipeline_parallel_degree) x (data_parallel_degree) = processes_per_host
```

The library takes care of calculating the number of model replicas (also called `data_parallel_degree`) given the two input parameters you provide. 

For example, if you set `"pipeline_parallel_degree": 2` and `"processes_per_host": 8` to use an ML instance with eight GPU workers, such as `ml.p3.16xlarge`, the library automatically distributes the model across the GPUs and sets up four-way data parallelism. The following image illustrates how a model is distributed across the eight GPUs, achieving four-way data parallelism and two-way pipeline parallelism. Each model replica, which we define as a *pipeline parallel group* and label `PP_GROUP`, is partitioned across two GPUs. Each partition of the model is assigned to four GPUs, where the four partition replicas form a *data parallel group*, labeled `DP_GROUP`. Without tensor parallelism, the pipeline parallel group is essentially the model parallel group.

![\[How a model is distributed across the eight GPUs achieving four-way data parallelism and two-way pipeline parallelism.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/smdmp-pipeline-parallel-only.png)


To dive deep into pipeline parallelism, see [Core Features of the SageMaker Model Parallelism Library](model-parallel-core-features.md). 

To get started with running your model using pipeline parallelism, see [Run a SageMaker Distributed Training Job with the SageMaker Model Parallel Library](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html).
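The degree arithmetic described above can be sketched as a small helper. This is only an illustration of the formula `(pipeline_parallel_degree) x (data_parallel_degree) = processes_per_host`, not the library's actual implementation.

```python
# Illustrative sketch of how the data parallel degree follows from the two
# parameters you provide for pipeline parallelism.
def data_parallel_degree(processes_per_host: int, pipeline_parallel_degree: int) -> int:
    """Number of model replicas derived from the two input parameters."""
    if processes_per_host % pipeline_parallel_degree != 0:
        raise ValueError(
            "processes_per_host must be divisible by pipeline_parallel_degree"
        )
    return processes_per_host // pipeline_parallel_degree

# The example above: two-way pipeline parallelism on an 8-GPU instance
# yields four-way data parallelism.
print(data_parallel_degree(processes_per_host=8, pipeline_parallel_degree=2))  # 4
```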

### Tensor parallelism (available for PyTorch)
<a name="model-parallel-intro-tp"></a>

*Tensor parallelism* splits individual layers, or `nn.Modules`, across devices, to be run in parallel. The following figure shows the simplest example of how the library splits a model with four layers to achieve two-way tensor parallelism (`"tensor_parallel_degree": 2`). The layers of each model replica are bisected and distributed into two GPUs. In this example case, the model parallel configuration also includes `"pipeline_parallel_degree": 1` and `"ddp": True` (uses PyTorch DistributedDataParallel package in the background), so the degree of data parallelism becomes eight. The library manages communication across the tensor-distributed model replicas.

![\[The simplest example of how the library splits a model with four layers to achieve two-way tensor parallelism ("tensor_parallel_degree": 2).\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/smdmp-tensor-parallel-only.png)


The usefulness of this feature is in the fact that you can select specific layers or a subset of layers to apply tensor parallelism. To dive deep into tensor parallelism and other memory-saving features for PyTorch, and to learn how to set a combination of pipeline and tensor parallelism, see [Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism.md).
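When tensor parallelism is combined with pipeline parallelism, the same kind of arithmetic extends to the whole cluster: the total number of GPUs equals the product of the tensor parallel, pipeline parallel, and data parallel degrees. The following sketch illustrates that relationship; it is an assumption-level illustration of the figure above, not library code.

```python
# Illustrative degree arithmetic for combined tensor and pipeline parallelism:
# world_size = tensor_parallel_degree * pipeline_parallel_degree * data_parallel_degree.
def data_parallel_degree(
    world_size: int, tensor_parallel_degree: int, pipeline_parallel_degree: int
) -> int:
    model_parallel_degree = tensor_parallel_degree * pipeline_parallel_degree
    if world_size % model_parallel_degree != 0:
        raise ValueError(
            "world_size must be divisible by "
            "tensor_parallel_degree * pipeline_parallel_degree"
        )
    return world_size // model_parallel_degree

# The figure's example: two-way tensor parallelism and no pipeline
# parallelism over 16 GPUs gives a data parallelism degree of eight.
print(data_parallel_degree(world_size=16, tensor_parallel_degree=2,
                           pipeline_parallel_degree=1))  # 8
```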

### Optimizer state sharding (available for PyTorch)
<a name="model-parallel-intro-oss"></a>

To understand how the library performs *optimizer state sharding*, consider a simple example model with four layers. The key idea in optimizer state sharding is that you don't need to replicate your optimizer state on all of your GPUs. Instead, a single replica of the optimizer state is sharded across data-parallel ranks, with no redundancy across devices. For example, GPU 0 holds the optimizer state for the first layer, GPU 1 holds it for the second layer, and so on. The following animated figure shows a backward propagation with the optimizer state sharding technique. At the end of the backward propagation, there's compute and network time for the `optimizer apply` (OA) operation to update the optimizer states and the `all-gather` (AG) operation to update the model parameters for the next iteration. Most importantly, the `reduce` operation can overlap with the compute on GPU 0, resulting in a more memory-efficient and faster backward propagation. In the current implementation, the AG and OA operations do not overlap with compute. This can result in extended computation during the AG operation, so there might be a tradeoff. 

![\[A backward propagation with the optimizer state sharding technique.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/smdmp-optimizer-state-sharding.gif)


For more information about how to use this feature, see [Optimizer State Sharding](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-optimizer-state-sharding.html).
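The memory saving from sharding is straightforward to quantify: instead of every GPU holding a full copy of the optimizer state, each data-parallel rank holds a 1/N shard. The following back-of-the-envelope sketch (not library code) shows the effect for Adam, whose FP32 states take about 8 bytes per parameter, as noted in the memory-estimate section above.

```python
# Illustration of the per-GPU memory saving from optimizer state sharding.
ADAM_STATE_BYTES = 8  # approximate FP32 Adam optimizer state per parameter

def optimizer_state_gb_per_gpu(num_params: float, dp_degree: int, sharded: bool) -> float:
    """Optimizer state held per GPU, in GB, with and without sharding."""
    total_gb = num_params * ADAM_STATE_BYTES / 1e9
    return total_gb / dp_degree if sharded else total_gb

# A 1-billion-parameter model across 8 data-parallel ranks:
print(optimizer_state_gb_per_gpu(1e9, dp_degree=8, sharded=False))  # 8.0 GB per GPU
print(optimizer_state_gb_per_gpu(1e9, dp_degree=8, sharded=True))   # 1.0 GB per GPU
```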

### Activation offloading and checkpointing (available for PyTorch)
<a name="model-parallel-intro-activation-offloading-checkpointing"></a>

To save GPU memory, the library supports activation checkpointing to avoid storing internal activations in the GPU memory for user-specified modules during the forward pass. The library recomputes these activations during the backward pass. In addition, the activation offloading feature offloads the stored activations to CPU memory and fetches them back to the GPU during the backward pass to further reduce the activation memory footprint. For more information about how to use these features, see [Activation Checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-checkpointing.html) and [Activation Offloading](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-offloading.html).

### Choosing the right techniques for your model
<a name="model-parallel-intro-choosing-techniques"></a>

For more information about choosing the right techniques and configurations, see [SageMaker Distributed Model Parallel Best Practices](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-best-practices.html) and [Configuration Tips and Pitfalls](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-tips-pitfalls.html).

# Supported Frameworks and AWS Regions
<a name="distributed-model-parallel-support"></a>

Before using the SageMaker model parallelism library, check the supported frameworks and instance types, and determine if there are enough quotas in your AWS account and AWS Region.

**Note**  
To check the latest updates and release notes of the library, see the [SageMaker Model Parallel Release Notes](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.html) in the *SageMaker Python SDK documentation*.

## Supported Frameworks
<a name="distributed-model-parallel-supported-frameworks"></a>

The SageMaker model parallelism library supports the following deep learning frameworks and is available in AWS Deep Learning Containers (DLC) or downloadable as a binary file.

PyTorch versions supported by SageMaker AI and the SageMaker model parallelism library


| PyTorch version | SageMaker model parallelism library version | `smdistributed-modelparallel` integrated DLC image URI | URL of the binary file** | 
| --- | --- | --- | --- | 
| v2.0.0 | smdistributed-modelparallel==v1.15.0 | `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker` | https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-2.0.0/build-artifacts/2023-04-14-20-14/smdistributed_modelparallel-1.15.0-cp310-cp310-linux_x86_64.whl | 
| v1.13.1 | smdistributed-modelparallel==v1.15.0 | `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker` | https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-04-17-15-49/smdistributed_modelparallel-1.15.0-cp39-cp39-linux_x86_64.whl | 
| v1.12.1 | smdistributed-modelparallel==v1.13.0 | `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker` | https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.1/build-artifacts/2022-12-08-21-34/smdistributed_modelparallel-1.13.0-cp38-cp38-linux_x86_64.whl | 
| v1.12.0 | smdistributed-modelparallel==v1.11.0 | `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker` | https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.0/build-artifacts/2022-08-12-16-58/smdistributed_modelparallel-1.11.0-cp38-cp38-linux_x86_64.whl | 
| v1.11.0 | smdistributed-modelparallel==v1.10.0 | `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker` | https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.11.0/build-artifacts/2022-07-11-19-23/smdistributed_modelparallel-1.10.0-cp38-cp38-linux_x86_64.whl | 
| v1.10.2 | smdistributed-modelparallel==v1.7.0 | `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker` | - | 
| v1.10.0 | smdistributed-modelparallel==v1.5.0 | `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker` | - | 
| v1.9.1 | smdistributed-modelparallel==v1.4.0 | `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04` | - | 
| v1.8.1 | smdistributed-modelparallel==v1.6.0 | `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04` | - | 

**Note**  
The SageMaker model parallelism library v1.6.0 and later provides extended features for PyTorch. For more information, see [Core Features of the SageMaker Model Parallelism Library](model-parallel-core-features.md).

** The URLs of the binary files are for installing the SageMaker model parallelism library in custom containers. For more information, see [Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library](model-parallel-sm-sdk.md#model-parallel-bring-your-own-container).

TensorFlow versions supported by SageMaker AI and the SageMaker model parallelism library


| TensorFlow version | SageMaker model parallelism library version | `smdistributed-modelparallel` integrated DLC image URI | 
| --- | --- | --- | 
| v2.6.0 | smdistributed-modelparallel==v1.4.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.6.0-gpu-py38-cu112-ubuntu20.04 | 
| v2.5.1 | smdistributed-modelparallel==v1.4.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.5.1-gpu-py37-cu112-ubuntu18.04 | 

**Hugging Face Transformers versions supported by SageMaker AI and the SageMaker distributed data parallel library**

The AWS Deep Learning Containers for Hugging Face use the SageMaker Training Containers for PyTorch and TensorFlow as their base images. To look up the Hugging Face Transformers library versions and paired PyTorch and TensorFlow versions, see the latest [Hugging Face Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers) and the [Prior Hugging Face Container Versions](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#prior-hugging-face-container-versions).

## AWS Regions
<a name="distributed-model-parallel-availablity-zone"></a>

The SageMaker model parallelism library is available in all of the AWS Regions where the [AWS Deep Learning Containers for SageMaker](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) are in service. For more information, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#available-deep-learning-containers-images).

## Supported Instance Types
<a name="distributed-model-parallel-supported-instance-types"></a>

The SageMaker model parallelism library requires one of the following ML instance types.


| Instance type | 
| --- | 
| ml.g4dn.12xlarge | 
| ml.p3.16xlarge | 
| ml.p3dn.24xlarge  | 
| ml.p4d.24xlarge | 
| ml.p4de.24xlarge | 

For specs of the instance types, see the **Accelerated Computing** section in the [Amazon EC2 Instance Types page](https://aws.amazon.com/ec2/instance-types/). For information about instance pricing, see [Amazon SageMaker AI Pricing](https://aws.amazon.com/sagemaker/pricing/).

If you encounter an error message similar to the following, follow the instructions at [Request a service quota increase for SageMaker AI resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure).

```
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling
    the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge
    for training job usage' is 0 Instances, with current utilization of 0 Instances
    and a request delta of 1 Instances.
    Please contact AWS support to request an increase for this limit.
```

# Core Features of the SageMaker Model Parallelism Library
<a name="model-parallel-core-features"></a>

Amazon SageMaker AI's model parallelism library offers distribution strategies and memory-saving techniques, such as sharded data parallelism, tensor parallelism, model partitioning by layers for pipeline scheduling, and checkpointing. The model parallelism strategies and techniques help distribute large models across multiple devices while optimizing training speed and memory consumption. The library also provides Python helper functions, context managers, and wrapper functions to adapt your training script for automated or manual partitioning of your model.

When you implement model parallelism in your training job, you keep the same two-step workflow shown in the [Run a SageMaker Distributed Training Job with Model Parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html) section. To adapt your training script, you add zero or few additional lines of code. To launch a training job with the adapted training script, you set the distribution configuration parameters to activate the memory-saving features or to pass values for the degrees of parallelism.

To get started with examples, see the following Jupyter notebooks that demonstrate how to use the SageMaker model parallelism library.
+ [PyTorch example notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/main/training/distributed_training/pytorch/model_parallel)
+ [TensorFlow example notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/main/training/distributed_training/tensorflow/model_parallel/mnist)

To dive deep into the core features of the library, see the following topics.

**Note**  
The SageMaker distributed training libraries are available through the AWS deep learning containers for PyTorch, Hugging Face, and TensorFlow within the SageMaker Training platform. To use the features of the distributed training libraries, we recommend that you use the SageMaker Python SDK. You can also configure the options manually in JSON request syntax if you use the SageMaker APIs through the SDK for Python (Boto3) or the AWS Command Line Interface. Throughout the documentation, instructions and examples focus on how to use the distributed training libraries with the SageMaker Python SDK.

**Important**  
The SageMaker model parallelism library supports all the core features for PyTorch, and supports pipeline parallelism for TensorFlow.

**Topics**
+ [Sharded Data Parallelism](model-parallel-extended-features-pytorch-sharded-data-parallelism.md)
+ [Pipelining a Model](model-parallel-core-features-pipieline-parallelism.md)
+ [Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism.md)
+ [Optimizer State Sharding](model-parallel-extended-features-pytorch-optimizer-state-sharding.md)
+ [Activation Checkpointing](model-parallel-extended-features-pytorch-activation-checkpointing.md)
+ [Activation Offloading](model-parallel-extended-features-pytorch-activation-offloading.md)
+ [FP16 Training with Model Parallelism](model-parallel-extended-features-pytorch-fp16.md)
+ [Support for FlashAttention](model-parallel-attention-head-size-for-flash-attention.md)

# Sharded Data Parallelism
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism"></a>

*Sharded data parallelism* is a memory-saving distributed training technique that splits the state of a model (model parameters, gradients, and optimizer states) across GPUs in a data parallel group. 

**Note**  
Sharded data parallelism is available for PyTorch in the SageMaker model parallelism library v1.11.0 and later.

When scaling up your training job to a large GPU cluster, you can reduce the per-GPU memory footprint of the model by sharding the training state of the model over multiple GPUs. This provides two benefits: you can fit larger models, which would otherwise run out of memory with standard data parallelism, or you can increase the batch size using the freed-up GPU memory.

The standard data parallelism technique replicates the training states across the GPUs in the data parallel group, and performs gradient aggregation based on the `AllReduce` operation. Sharded data parallelism modifies the standard data-parallel distributed training procedure to account for the sharded nature of the optimizer states. A group of ranks over which the model and optimizer states are sharded is called a *sharding group*. The sharded data parallelism technique shards the trainable parameters of a model and corresponding gradients and optimizer states across the GPUs in the *sharding group*.

SageMaker AI achieves sharded data parallelism through the implementation of MiCS, which is discussed in the AWS blog post [Near-linear scaling of gigantic-model training on AWS](https://www.amazon.science/blog/near-linear-scaling-of-gigantic-model-training-on-aws). In this implementation, you can set the sharding degree as a configurable parameter, which must be less than the data parallelism degree. During each forward and backward pass, MiCS temporarily recombines the model parameters in all GPUs through the `AllGather` operation. After the forward or backward pass of each layer, MiCS shards the parameters again to save GPU memory. During the backward pass, MiCS reduces gradients and simultaneously shards them across GPUs through the `ReduceScatter` operation. Finally, MiCS applies the local reduced and sharded gradients to their corresponding local parameter shards, using the local shards of optimizer states. To bring down communication overhead, the SageMaker model parallelism library prefetches the upcoming layers in the forward or backward pass, and overlaps the network communication with the computation.

The training state of the model is replicated across the sharding groups. This means that before gradients are applied to the parameters, the `AllReduce` operation must take place across the sharding groups, in addition to the `ReduceScatter` operation that takes place within the sharding group.

In effect, sharded data parallelism introduces a tradeoff between the communication overhead and GPU memory efficiency. Using sharded data parallelism increases the communication cost, but the memory footprint per GPU (excluding the memory usage due to activations) is divided by the sharded data parallelism degree, thus larger models can be fit in the GPU cluster.
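The memory arithmetic above can be sketched in a few lines. This is a rough, illustrative estimate only: it assumes mixed-precision Adam training with roughly 16 bytes of training state per parameter (half-precision parameter and gradient plus FP32 master weights, momentum, and variance), a figure commonly used in the ZeRO/MiCS literature, and it excludes activation memory, matching the caveat in the text. The helper name is hypothetical.

```python
# Rough per-GPU memory estimate (GiB) for the model training state
# (parameters, gradients, and Adam optimizer states) under sharded data
# parallelism. Assumes ~16 bytes of state per parameter; activation
# memory is deliberately excluded.

def training_state_gib_per_gpu(num_params, sharded_data_parallel_degree,
                               bytes_per_param=16):
    total_bytes = num_params * bytes_per_param
    return total_bytes / sharded_data_parallel_degree / 1024**3

# A 20-billion-parameter model carries roughly 298 GiB of training state
# in total; sharding it 16 ways brings each GPU's share below 19 GiB.
print(round(training_state_gib_per_gpu(20e9, 1), 1))
print(round(training_state_gib_per_gpu(20e9, 16), 1))
```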

**Selecting the degree of sharded data parallelism**

When you select a value for the degree of sharded data parallelism, the value must evenly divide the degree of data parallelism. For example, for an 8-way data parallelism job, choose 2, 4, or 8 for the sharded data parallelism degree. While choosing the sharded data parallelism degree, we recommend that you start with a small number, and gradually increase it until the model fits in the memory together with the desired batch size.
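The divisibility rule can be sketched with a hypothetical helper that lists the candidate values; note that a degree of 1 means no sharding, and the text recommends starting from a small degree and increasing it.

```python
# Valid values for sharded_data_parallel_degree are the divisors of the
# data parallelism degree (a degree of 1 means no sharding).

def valid_sharding_degrees(data_parallel_degree):
    return [d for d in range(1, data_parallel_degree + 1)
            if data_parallel_degree % d == 0]

print(valid_sharding_degrees(8))  # [1, 2, 4, 8]
```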

**Selecting the batch size**

After setting up sharded data parallelism, make sure you find the most optimal training configuration that can successfully run on the GPU cluster. For training large language models (LLM), start from the batch size 1, and gradually increase it until you reach the point to receive the out-of-memory (OOM) error. If you encounter the OOM error even with the smallest batch size, apply a higher degree of sharded data parallelism or a combination of sharded data parallelism and tensor parallelism.
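The search described above can be sketched as a simple loop. The `fits` predicate is hypothetical; in practice it stands for a short trial training run that either succeeds or raises the out-of-memory (OOM) error.

```python
# Sketch of the batch-size search: start at 1 and grow the batch size
# until the training job no longer fits in GPU memory.

def largest_fitting_batch_size(fits, max_batch_size=1024):
    best = None
    batch_size = 1
    while batch_size <= max_batch_size and fits(batch_size):
        best = batch_size
        batch_size *= 2  # grow geometrically to keep trial runs cheap
    # None means even batch size 1 hit OOM: apply a higher sharded data
    # parallelism degree or add tensor parallelism.
    return best

print(largest_fitting_batch_size(lambda b: b <= 12))  # 8
```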

**Topics**
+ [How to apply sharded data parallelism to your training job](#model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use)
+ [Reference configurations](#model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use-config-sample)
+ [Sharded data parallelism with SMDDP Collectives](#model-parallel-extended-features-pytorch-sharded-data-parallelism-smddp-collectives)
+ [Mixed precision training with sharded data parallelism](#model-parallel-extended-features-pytorch-sharded-data-parallelism-16bits-training)
+ [Sharded data parallelism with tensor parallelism](#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism)
+ [Tips and considerations for using sharded data parallelism](#model-parallel-extended-features-pytorch-sharded-data-parallelism-considerations)

## How to apply sharded data parallelism to your training job
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use"></a>

To get started with sharded data parallelism, apply the required modifications to your training script, and set up the SageMaker PyTorch estimator with the sharded-data-parallelism-specific parameters. You can also use the reference configurations and example notebooks as a starting point.

### Adapt your PyTorch training script
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use-modify-script"></a>

Follow the instructions at [Step 1: Modify a PyTorch Training Script](model-parallel-customize-training-script-pt.md) to wrap the model and optimizer objects with the `smdistributed.modelparallel.torch` wrappers of the `torch.nn.parallel` and `torch.distributed` modules.

**(Optional) Additional modification to register external model parameters**

If your model is built with `torch.nn.Module` and uses parameters that are not defined within the module class, you should register them to the module manually so that SMP can gather the full parameters during the forward and backward passes. To register parameters to a module, use `smp.register_parameter(module, parameter)`.

```
class Module(torch.nn.Module):
    def __init__(self, *args):
        super().__init__(*args)
        self.layer1 = Layer1()
        self.layer2 = Layer2()
        # Register layer1's weight because it is also used
        # outside its own submodule (in self.layer2.forward).
        smp.register_parameter(self, self.layer1.weight)

    def forward(self, input):
        x = self.layer1(input)
        # self.layer1.weight is required by self.layer2.forward
        y = self.layer2(x, self.layer1.weight)
        return y
```

### Set up the SageMaker PyTorch estimator
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use-set-estimator"></a>

When configuring a SageMaker PyTorch estimator in [Step 2: Launch a Training Job Using the SageMaker Python SDK](model-parallel-sm-sdk.md), add the parameters for sharded data parallelism. 

To turn on sharded data parallelism, add the `sharded_data_parallel_degree` parameter to the SageMaker PyTorch estimator. This parameter specifies the number of GPUs over which the training state is sharded. The value for `sharded_data_parallel_degree` must be an integer between one and the data parallelism degree and must evenly divide the data parallelism degree. Note that the library automatically detects the number of GPUs, and thus the data parallel degree. The following additional parameters are available for configuring sharded data parallelism.
+ `"sdp_reduce_bucket_size"` *(int, default: 5e8)* – Specifies the size of [PyTorch DDP gradient buckets](https://pytorch.org/docs/stable/notes/ddp.html#internal-design) in number of elements of the default dtype.
+ `"sdp_param_persistence_threshold"` *(int, default: 1e6)* – Specifies the size of a parameter tensor in number of elements that can persist at each GPU. Sharded data parallelism splits each parameter tensor across GPUs of a data parallel group. If the number of elements in the parameter tensor is smaller than this threshold, the parameter tensor is not split; this helps reduce communication overhead because the parameter tensor is replicated across data-parallel GPUs.
+ `"sdp_max_live_parameters"` *(int, default: 1e9)* – Specifies the maximum number of parameters that can simultaneously be in a recombined training state during the forward and backward pass. Parameter fetching with the `AllGather` operation pauses when the number of active parameters reaches the given threshold. Note that increasing this parameter increases the memory footprint.
+ `"sdp_hierarchical_allgather"` *(bool, default: True)* – If set to `True`, the `AllGather` operation runs hierarchically: it runs within each node first, and then runs across nodes. For multi-node distributed training jobs, the hierarchical `AllGather` operation is automatically activated.
+ `"sdp_gradient_clipping"` *(float, default: 1.0)* – Specifies a threshold for clipping the L2 norm of the gradients before propagating them backward through the model parameters. When sharded data parallelism is activated, gradient clipping is also activated. The default threshold is `1.0`. Adjust this parameter if you encounter the exploding gradients problem.

The following code shows an example of how to configure sharded data parallelism.

```
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        # "pipeline_parallel_degree": 1,    # Optional, default is 1
        # "tensor_parallel_degree": 1,      # Optional, default is 1
        "ddp": True,
        # parameters for sharded data parallelism
        "sharded_data_parallel_degree": 2,              # Add this to activate sharded data parallelism
        "sdp_reduce_bucket_size": int(5e8),             # Optional
        "sdp_param_persistence_threshold": int(1e6),    # Optional
        "sdp_max_live_parameters": int(1e9),            # Optional
        "sdp_hierarchical_allgather": True,             # Optional
        "sdp_gradient_clipping": 1.0                    # Optional
    }
}

mpi_options = {
    "enabled" : True,                      # Required
    "processes_per_host" : 8               # Required
}

smp_estimator = PyTorch(
    entry_point="your_training_script.py", # Specify your train script
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='1.13.1',
    py_version='py3',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="sharded-data-parallel-job"
)

smp_estimator.fit('s3://my_bucket/my_training_data/')
```

## Reference configurations
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use-config-sample"></a>

The SageMaker distributed training team provides the following reference configurations that you can use as a starting point. You can extrapolate from the following configurations to experiment and estimate the GPU memory usage for your model configuration. 

**Sharded data parallelism with SMDDP Collectives**


| Model/the number of parameters | Num instances | Instance type | Sequence length | Global batch size | Mini batch size | Sharded data parallel degree | 
| --- | --- | --- | --- | --- | --- | --- | 
| GPT-NEOX-20B | 2 | ml.p4d.24xlarge | 2048 | 64 | 4 | 16 | 
| GPT-NEOX-20B | 8 | ml.p4d.24xlarge | 2048 | 768 | 12 | 32 | 

For example, if you increase the sequence length for a 20-billion-parameter model or increase the size of the model to 65 billion parameters, you need to try reducing the batch size first. If the model still doesn’t fit with the smallest batch size (the batch size of 1), try increasing the degree of model parallelism.

**Sharded data parallelism with tensor parallelism and NCCL Collectives**


| Model/the number of parameters | Num instances | Instance type | Sequence length | Global batch size | Mini batch size | Sharded data parallel degree | Tensor parallel degree | Activation offloading | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| GPT-NEOX-65B | 64 | ml.p4d.24xlarge | 2048 | 512 | 8 | 16 | 8 | Y | 
| GPT-NEOX-65B | 64 | ml.p4d.24xlarge | 4096 | 512 | 2 | 64 | 2 | Y | 

The combined usage of sharded data parallelism and tensor parallelism is useful when you want to fit a large language model (LLM) into a large-scale cluster while training on text data with a longer sequence length, which requires a smaller batch size and additional handling of the GPU memory usage for training LLMs on longer text sequences. To learn more, see [Sharded data parallelism with tensor parallelism](#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism).

For case studies, benchmarks, and more configuration examples, see the blog post [New performance improvements in Amazon SageMaker AI model parallel library](https://aws.amazon.com/blogs/machine-learning/new-performance-improvements-in-amazon-sagemaker-model-parallel-library/).

## Sharded data parallelism with SMDDP Collectives
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-smddp-collectives"></a>

The SageMaker data parallelism library offers collective communication primitives (SMDDP collectives) optimized for the AWS infrastructure. It achieves optimization by adopting an all-to-all-type communication pattern by making use of [Elastic Fabric Adapter (EFA)](https://aws.amazon.com/hpc/efa/), resulting in high-throughput and less latency-sensitive collectives, offloading the communication-related processing to the CPU, and freeing up GPU cycles for computation. On large clusters, SMDDP Collectives can offer improvements in distributed training performance by up to 40% compared to NCCL. For case studies and benchmark results, see the blog [New performance improvements in the Amazon SageMaker AI model parallelism library](https://aws.amazon.com/blogs/machine-learning/new-performance-improvements-in-amazon-sagemaker-model-parallel-library/).

**Note**  
Sharded data parallelism with SMDDP Collectives is available in the SageMaker model parallelism library v1.13.0 and later, and the SageMaker data parallelism library v1.6.0 and later. See also [Supported configurations](#sharded-data-parallelism-smddp-collectives-supported-config) to use sharded data parallelism with SMDDP Collectives.

In sharded data parallelism, which is a commonly used technique in large-scale distributed training, the `AllGather` collective is used to reconstitute the sharded layer parameters for forward and backward pass computations, in parallel with GPU computation. For large models, performing the `AllGather` operation efficiently is critical to avoid GPU bottleneck problems and slow training speed. When sharded data parallelism is activated, SMDDP Collectives replace these performance-critical `AllGather` collectives, improving training throughput.

**Train with SMDDP Collectives**

When your training job has sharded data parallelism activated and meets the [Supported configurations](#sharded-data-parallelism-smddp-collectives-supported-config), SMDDP Collectives are automatically activated. Internally, SMDDP Collectives optimize the `AllGather` collective to be performant on the AWS infrastructure and fall back to NCCL for all other collectives. Furthermore, under unsupported configurations, all collectives, including `AllGather`, automatically use the NCCL backend.

Since the SageMaker model parallelism library version 1.13.0, the `"ddp_dist_backend"` parameter is added to the `modelparallel` options. The default value for this configuration parameter is `"auto"`, which uses SMDDP Collectives whenever possible, and falls back to NCCL otherwise. To force the library to always use NCCL, specify `"nccl"` to the `"ddp_dist_backend"` configuration parameter. 

The following code example shows how to set up a PyTorch estimator using the sharded data parallelism with the `"ddp_dist_backend"` parameter, which is set to `"auto"` by default and, therefore, optional to add. 

```
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled":True,
    "parameters": {                        
        "partitions": 1,
        "ddp": True,
        "sharded_data_parallel_degree": 64,
        "bf16": True,
        "ddp_dist_backend": "auto"  # Specify "nccl" to force to use NCCL.
    }
}

mpi_options = {
    "enabled" : True,                      # Required
    "processes_per_host" : 8               # Required
}

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=8,
    instance_type='ml.p4d.24xlarge',
    framework_version='1.13.1',
    py_version='py3',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="sharded-data-parallel-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

**Supported configurations**

The `AllGather` operation with SMDDP Collectives is activated in training jobs when all of the following configuration requirements are met.
+ The sharded data parallelism degree is greater than 1
+ `instance_count` greater than 1 
+ `instance_type` equal to `ml.p4d.24xlarge` 
+ SageMaker training container for PyTorch v1.12.1 or later
+ The SageMaker data parallelism library v1.6.0 or later
+ The SageMaker model parallelism library v1.13.0 or later

**Performance and memory tuning**

SMDDP Collectives utilize additional GPU memory. There are two environment variables to configure the GPU memory usage depending on different model training use cases.
+ `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES` – During the SMDDP `AllGather` operation, the `AllGather` input buffer is copied into a temporary buffer for inter-node communication. The `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES` variable controls the size (in bytes) of this temporary buffer. If the size of the temporary buffer is smaller than the `AllGather` input buffer size, the `AllGather` collective falls back to use NCCL.
  + Default value: 16 * 1024 * 1024 (16 MB)
  + Acceptable values: any multiple of 8192
+ `SMDDP_AG_SORT_BUFFER_SIZE_BYTES` – The `SMDDP_AG_SORT_BUFFER_SIZE_BYTES` variable sizes the temporary buffer (in bytes) that holds data gathered from inter-node communication. If the size of this temporary buffer is smaller than `1/8 * sharded_data_parallel_degree * AllGather input size`, the `AllGather` collective falls back to use NCCL.
  + Default value: 128 * 1024 * 1024 (128 MB)
  + Acceptable values: any multiple of 8192
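The two fallback conditions above can be sketched as a small predicate. This is a simplified illustration of the rules stated in the text, not the library's internal logic; the function name is hypothetical and sizes are in bytes.

```python
# SMDDP AllGather falls back to NCCL when either temporary buffer is too
# small for the given AllGather input size: the scratch buffer must hold
# the input, and the sort buffer must hold
# 1/8 * sharded_data_parallel_degree * input size.

def smddp_allgather_falls_back_to_nccl(allgather_input_bytes,
                                       sharded_data_parallel_degree,
                                       scratch_buffer_bytes=16 * 1024 * 1024,
                                       sort_buffer_bytes=128 * 1024 * 1024):
    scratch_too_small = scratch_buffer_bytes < allgather_input_bytes
    sort_too_small = (sort_buffer_bytes <
                      sharded_data_parallel_degree * allgather_input_bytes / 8)
    return scratch_too_small or sort_too_small

# A 1 MB input at degree 64 fits in the default buffers; a 32 MB input
# exceeds the 16 MB scratch buffer and would fall back to NCCL.
print(smddp_allgather_falls_back_to_nccl(1024 * 1024, 64))
print(smddp_allgather_falls_back_to_nccl(32 * 1024 * 1024, 64))
```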

**Tuning guidance on the buffer size variables**

The default values for the environment variables should work well for most use cases. We recommend tuning these variables only if training runs into the out-of-memory (OOM) error. 

The following list discusses some tuning tips to reduce the GPU memory footprint of SMDDP Collectives while retaining the performance gain from them.
+ Tuning `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES`
  + The `AllGather` input buffer size is smaller for smaller models. Hence, the required size for `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES` can be smaller for models with fewer parameters.
  + The `AllGather` input buffer size decreases as `sharded_data_parallel_degree` increases, because the model gets sharded across more GPUs. Hence, the required size for `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES` can be smaller for training jobs with large values for `sharded_data_parallel_degree`.
+ Tuning `SMDDP_AG_SORT_BUFFER_SIZE_BYTES`
  + The amount of data gathered from inter-node communication is less for models with fewer parameters. Hence, the required size for `SMDDP_AG_SORT_BUFFER_SIZE_BYTES` can be smaller for such models.

Some collectives might fall back to use NCCL; hence, you might not get the performance gain from the optimized SMDDP collectives. If additional GPU memory is available for use, you can consider increasing the values of `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES` and `SMDDP_AG_SORT_BUFFER_SIZE_BYTES` to benefit from the performance gain.

The following code shows how you can configure the environment variables by appending them to `mpi_options` in the distribution parameter for the PyTorch estimator.

```
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    .... # All modelparallel configuration options go here
}

mpi_options = {
    "enabled" : True,                      # Required
    "processes_per_host" : 8               # Required
}

# Use the "custom_mpi_options" key to tune the values of the environment variables for the buffers
mpi_options["custom_mpi_options"] = "-x SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES=8192 -x SMDDP_AG_SORT_BUFFER_SIZE_BYTES=8192"

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=8,
    instance_type='ml.p4d.24xlarge',
    framework_version='1.13.1',
    py_version='py3',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="sharded-data-parallel-demo-with-tuning",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

## Mixed precision training with sharded data parallelism
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-16bits-training"></a>

To further save GPU memory with half-precision floating point numbers and sharded data parallelism, you can activate 16-bit floating point format (FP16) or [Brain floating point format](https://en.wikichip.org/wiki/brain_floating-point_format) (BF16) by adding one additional parameter to the distributed training configuration.

**Note**  
Mixed precision training with sharded data parallelism is available in the SageMaker model parallelism library v1.11.0 and later.

**For FP16 Training with Sharded Data Parallelism**

To run FP16 training with sharded data parallelism, add `"fp16": True` to the `smp_options` configuration dictionary. In your training script, you can choose between the static and dynamic loss scaling options through the `smp.DistributedOptimizer` module. For more information, see [FP16 Training with Model Parallelism](model-parallel-extended-features-pytorch-fp16.md).

```
smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "sharded_data_parallel_degree": 2,
        "fp16": True
    }
}
```

**For BF16 Training with Sharded Data Parallelism**

The sharded data parallelism feature of SageMaker AI supports training in the BF16 data type. The BF16 data type uses 8 bits to represent the exponent of a floating point number, while the FP16 data type uses 5 bits. Preserving the 8 bits for the exponent makes it possible to keep the same representation of the exponent as a 32-bit single-precision floating point (FP32) number. This makes the conversion between FP32 and BF16 simpler and significantly less prone to the overflow and underflow issues that often arise in FP16 training, especially when training larger models. While both data types use 16 bits in total, this increased representation range for the exponent in the BF16 format comes at the expense of reduced precision. For training large models, this reduced precision is often considered an acceptable trade-off for the range and training stability.

**Note**  
Currently, BF16 training works only when sharded data parallelism is activated.

To run BF16 training with sharded data parallelism, add `"bf16": True` to the `smp_options` configuration dictionary.

```
smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "sharded_data_parallel_degree": 2,
        "bf16": True
    }
}
```
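The bit-layout relationship between FP32 and BF16 described above can be illustrated with a small sketch. This is not the library's implementation: because BF16 keeps the FP32 sign bit and full 8-bit exponent and shortens the mantissa from 23 to 7 bits, converting amounts to keeping the top 16 bits of the FP32 word. Real conversions typically round to nearest even; this sketch truncates for brevity.

```python
import struct

# Illustrative FP32 -> BF16 conversion by truncation: drop the low
# 16 mantissa bits of the FP32 bit pattern, keeping sign and exponent.

def fp32_to_bf16(x):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(fp32_to_bf16(1.0))      # exactly representable
print(fp32_to_bf16(3.14159))  # same range, reduced precision
print(fp32_to_bf16(1e38))     # large magnitudes survive (FP16 would overflow)
```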

## Sharded data parallelism with tensor parallelism
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism"></a>

If you use sharded data parallelism and also need to reduce the global batch size, consider using [tensor parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html) with sharded data parallelism. When training a large model with sharded data parallelism on a very large compute cluster (typically 128 nodes or beyond), even a small batch size per GPU results in a very large global batch size. This might lead to convergence issues or low computational performance. Reducing the batch size per GPU sometimes is not possible with sharded data parallelism alone; for example, when the batch size per GPU is already 1, it cannot be reduced further. In such cases, using sharded data parallelism in combination with tensor parallelism helps reduce the global batch size.

Choosing the optimal sharded data parallel and tensor parallel degrees depends on the scale of the model, the instance type, and the global batch size that is reasonable for the model to converge. We recommend that you start from a low tensor parallel degree to fit the global batch size into the compute cluster to resolve CUDA out-of-memory errors and achieve the best performance. See the following two example cases to learn how the combination of tensor parallelism and sharded data parallelism helps you adjust the global batch size by grouping GPUs for model parallelism, resulting in a lower number of model replicas and a smaller global batch size.

**Note**  
This feature is available from the SageMaker model parallelism library v1.15, and supports PyTorch v1.13.1.

**Note**  
This feature is available for the models supported by the tensor parallelism functionality of the library. To find the list of the supported models, see [Support for Hugging Face Transformer Models](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-hugging-face.html). Also note that you need to pass `tensor_parallelism=True` to the `smp.model_creation` argument while modifying your training script. To learn more, see the training script [https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/train_gpt_simple.py#L793](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/train_gpt_simple.py#L793) in the *SageMaker AI Examples GitHub repository*.

### Example 1
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism-ex1"></a>

Assume that we want to train a model over a cluster of 1536 GPUs (192 nodes with 8 GPUs in each), setting the degree of sharded data parallelism to 32 (`sharded_data_parallel_degree=32`) and the batch size per GPU to 1, where each batch has a sequence length of 4096 tokens. In this case, there are 1536 model replicas, the global batch size becomes 1536, and each global batch contains about 6 million tokens. 

```
(1536 GPUs) * (1 batch per GPU) = (1536 global batches)
(1536 batches) * (4096 tokens per batch) = (6,291,456 tokens)
```

Adding tensor parallelism to it can lower the global batch size. One configuration example can be setting the tensor parallel degree to 8 and the batch size per GPU to 4. This forms 192 tensor parallel groups or 192 model replicas, where each model replica is distributed across 8 GPUs. The batch size of 4 is the amount of training data per iteration and per tensor parallel group; that is, each model replica consumes 4 batches per iteration. In this case, the global batch size becomes 768, and each global batch contains about 3 million tokens. Hence, the global batch size is reduced by half compared to the previous case with sharded data parallelism only.

```
(1536 GPUs) / (8 tensor parallel degree) = (192 tensor parallelism groups)
(192 tensor parallelism groups) * (4 batches per tensor parallelism group) = (768 global batches)
(768 batches) * (4096 tokens per batch) = (3,145,728 tokens)
```
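The arithmetic in both examples can be sketched as a small helper that computes the number of model replicas, the global batch size, and the tokens per global batch. The function name is illustrative.

```python
# Global batch size under combined tensor parallelism and data
# parallelism: GPUs are grouped into model replicas of size
# tensor_parallel_degree, and each replica consumes one per-replica
# batch per iteration.

def global_batch_stats(num_gpus, tensor_parallel_degree,
                       per_replica_batch_size, sequence_length):
    model_replicas = num_gpus // tensor_parallel_degree
    global_batch_size = model_replicas * per_replica_batch_size
    return global_batch_size, global_batch_size * sequence_length

# Example 1: 1536 GPUs, no tensor parallelism, batch size 1 per GPU.
print(global_batch_stats(1536, 1, 1, 4096))  # (1536, 6291456)
# With tensor parallel degree 8 and per-replica batch size 4.
print(global_batch_stats(1536, 8, 4, 4096))  # (768, 3145728)
```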

### Example 2
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism-ex2"></a>

When both sharded data parallelism and tensor parallelism are activated, the library first applies tensor parallelism and shards the model across this dimension. For each tensor parallel rank, the data parallelism is applied as per `sharded_data_parallel_degree`.

For example, assume that we want to set 32 GPUs with a tensor parallel degree of 4 (forming groups of 4 GPUs), a sharded data parallel degree of 4, ending up with a replication degree of 2. The assignment creates eight GPU groups based on the tensor parallel degree as follows: `(0,1,2,3)`, `(4,5,6,7)`, `(8,9,10,11)`, `(12,13,14,15)`, `(16,17,18,19)`, `(20,21,22,23)`, `(24,25,26,27)`, `(28,29,30,31)`. That is, four GPUs form one tensor parallel group. In this case, the reduced data parallel group for the 0th rank GPUs of the tensor parallel groups would be `(0,4,8,12,16,20,24,28)`. The reduced data parallel group is sharded based on the sharded data parallel degree of 4, resulting in two replication groups for data parallelism. GPUs `(0,4,8,12)` form one sharding group, which collectively hold a complete copy of all parameters for the 0th tensor parallel rank, and GPUs `(16,20,24,28)` form another such group. Other tensor parallel ranks also have similar sharding and replication groups.
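The group assignment described above can be reproduced with a short sketch (illustrative only, not the library's internal implementation):

```python
def parallel_groups(num_gpus, tp_degree, sdp_degree):
    """Sketch of the group assignment described above.

    Consecutive GPUs form tensor parallel groups. For each tensor parallel
    rank, the GPUs holding that rank form the reduced data parallel group,
    which is then split into sharding groups of size `sdp_degree`.
    """
    tp_groups = [list(range(i, i + tp_degree)) for i in range(0, num_gpus, tp_degree)]
    # Reduced data parallel group for tensor parallel rank 0
    rdp_group_rank0 = [g[0] for g in tp_groups]
    # Split the reduced data parallel group into sharding groups
    sharding_groups = [rdp_group_rank0[i:i + sdp_degree]
                       for i in range(0, len(rdp_group_rank0), sdp_degree)]
    return tp_groups, rdp_group_rank0, sharding_groups

tp, rdp, shards = parallel_groups(32, 4, 4)
print(tp[0])    # [0, 1, 2, 3]
print(rdp)      # [0, 4, 8, 12, 16, 20, 24, 28]
print(shards)   # [[0, 4, 8, 12], [16, 20, 24, 28]]
```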

![\[Figure 1: Tensor parallelism groups.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/sdp_tp_group_tp.jpg)


Figure 1: Tensor parallelism groups for (nodes, sharded data parallel degree, tensor parallel degree) = (4, 4, 4), where each rectangle represents a GPU with indices from 0 to 31. The GPUs form tensor parallelism groups from TPG0 to TPG7. Replication groups are {TPG0, TPG4}, {TPG1, TPG5}, {TPG2, TPG6}, and {TPG3, TPG7}; each replication group pair shares the same color but is filled differently.

![\[Figure 2: Sharded data parallelism groups.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/sdp_tp_group_sdp.jpg)


Figure 2: Sharded data parallelism groups for (nodes, sharded data parallel degree, tensor parallel degree) = (4, 4, 4), where each rectangle represents a GPU with indices from 0 to 31. The GPUs form sharded data parallelism groups from SDPG0 to SDPG7. Replication groups are {SDPG0, SDPG4}, {SDPG1, SDPG5}, {SDPG2, SDPG6}, and {SDPG3, SDPG7}; each replication group pair shares the same color but is filled differently.

### How to activate sharded data parallelism with tensor parallelism
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism-activate"></a>

To use sharded data parallelism with tensor parallelism, you need to set both `sharded_data_parallel_degree` and `tensor_parallel_degree` in the configuration for `distribution` while creating an object of the SageMaker PyTorch estimator class. 

You also need to activate `prescaled_batch`. This means that, instead of each GPU reading its own batch of data, each tensor parallel group collectively reads a combined batch of the chosen batch size. Effectively, instead of dividing the dataset into parts equal to the number of GPUs (or data parallel size, `smp.dp_size()`), it divides into parts equal to the number of GPUs divided by `tensor_parallel_degree` (also called reduced data parallel size, `smp.rdp_size()`). For more details on prescaled batch, see [Prescaled Batch](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#prescaled-batch) in the *SageMaker Python SDK documentation*. See also the example training script [https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/train_gpt_simple.py#L164](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/train_gpt_simple.py#L164) for GPT-2 in the *SageMaker AI Examples GitHub repository*.
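The effect of `prescaled_batch` on dataset partitioning can be sketched as follows; the helper function is illustrative only, with the corresponding library calls (`smp.dp_size()`, `smp.rdp_size()`) noted in comments:

```python
def num_dataset_shards(world_size, tensor_parallel_degree, prescaled_batch):
    """Number of dataset partitions under the prescaled batch rule above.

    Without prescaled batch, every GPU reads its own shard (dp_size shards);
    with prescaled batch, each tensor parallel group reads one combined shard
    (rdp_size = dp_size / tensor_parallel_degree shards).
    """
    dp_size = world_size  # assuming no pipeline parallelism
    if prescaled_batch:
        return dp_size // tensor_parallel_degree  # smp.rdp_size()
    return dp_size                                # smp.dp_size()

print(num_dataset_shards(1536, 8, False))  # 1536 shards, one per GPU
print(num_dataset_shards(1536, 8, True))   # 192 shards, one per tensor parallel group
```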

The following code snippet shows an example of creating a PyTorch estimator object based on the aforementioned scenario in [Example 2](#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism-ex2).

```
from sagemaker.pytorch import PyTorch

mpi_options = "-verbose --mca orte_base_help_aggregate 0 "
smp_parameters = {
    "ddp": True,
    "fp16": True,
    "prescaled_batch": True,
    "sharded_data_parallel_degree": 4,
    "tensor_parallel_degree": 4
}

pytorch_estimator = PyTorch(
    entry_point="your_training_script.py",
    role=role,
    instance_type="ml.p4d.24xlarge",
    volume_size=200,
    instance_count=4,
    sagemaker_session=sagemaker_session,
    py_version="py39",
    framework_version="1.13.1",
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled": True, 
                "parameters": smp_parameters,
            }
        },
        "mpi": {
            "enabled": True,
            "processes_per_host": 8,
            "custom_mpi_options": mpi_options,
        },
    },
    source_dir="source_directory_of_your_code",
    output_path=s3_output_location
)
```

## Tips and considerations for using sharded data parallelism
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-considerations"></a>

Consider the following when using the SageMaker model parallelism library's sharded data parallelism.
+ Sharded data parallelism is compatible with FP16 training. To run FP16 training, see the [FP16 Training with Model Parallelism](model-parallel-extended-features-pytorch-fp16.md) section.
+ Sharded data parallelism is compatible with tensor parallelism. The following items are what you might need to consider for using sharded data parallelism with tensor parallelism.
  + When using sharded data parallelism with tensor parallelism, the embedding layers are also automatically distributed across the tensor parallel group. In other words, the `distribute_embedding` parameter is automatically set to `True`. For more information about tensor parallelism, see [Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism.md).
  + Note that sharded data parallelism with tensor parallelism currently uses the NCCL collectives as the backend of the distributed training strategy.

  To learn more, see the [Sharded data parallelism with tensor parallelism](#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism) section.
+ Sharded data parallelism currently is not compatible with [pipeline parallelism](model-parallel-intro.md#model-parallel-intro-pp) or [optimizer state sharding](model-parallel-extended-features-pytorch-optimizer-state-sharding.md). To activate sharded data parallelism, turn off optimizer state sharding and set the pipeline parallel degree to 1.
+ The [activation checkpointing](model-parallel-extended-features-pytorch-activation-checkpointing.md) and [activation offloading](model-parallel-extended-features-pytorch-activation-offloading.md) features are compatible with sharded data parallelism.
+ To use sharded data parallelism with gradient accumulation, set the `backward_passes_per_step` argument to the number of accumulation steps while wrapping your model with the [https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.DistributedModel](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.DistributedModel) module. This ensures that the gradient `AllReduce` operation across the model replication groups (sharding groups) takes place at the boundary of gradient accumulation.
+ You can checkpoint your models trained with sharded data parallelism using the library's checkpointing APIs, `smp.save_checkpoint` and `smp.resume_from_checkpoint`. For more information, see [Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library v1.10.0 and later)](distributed-model-parallel-checkpointing-and-finetuning.md#model-parallel-extended-features-pytorch-checkpoint).
+ The behavior of the [https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.delay_param_initialization](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.delay_param_initialization) configuration parameter changes under sharded data parallelism. When sharded data parallelism and delayed parameter initialization are simultaneously turned on, parameters are immediately initialized upon model creation in a sharded manner instead of being delayed, so that each rank initializes and stores only its own shard of the parameters.
+ When sharded data parallelism is activated, the library performs gradient clipping internally when the `optimizer.step()` call runs. You don't need to use utility APIs for gradient clipping, such as [https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html). To adjust the threshold value for gradient clipping, you can set it through the `sdp_gradient_clipping` parameter for the distribution parameter configuration when you construct the SageMaker PyTorch estimator, as shown in the [How to apply sharded data parallelism to your training job](#model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use) section.
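As an illustrative sketch of the gradient accumulation and gradient clipping settings mentioned in this list (the specific degree and threshold values are assumptions for illustration, not recommendations):

```python
# Illustrative sketch only. `sdp_gradient_clipping` goes into the estimator's
# model parallelism parameters; `backward_passes_per_step` is passed when
# wrapping the model with smp.DistributedModel in the training script.
smp_parameters = {
    "sharded_data_parallel_degree": 32,   # example degree
    "sdp_gradient_clipping": 1.0,         # example gradient clipping threshold
    "fp16": True,
}

# In the training script (shown as a comment because it requires the
# smdistributed runtime):
# model = smp.DistributedModel(model, backward_passes_per_step=4)
```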

# Pipelining a Model
<a name="model-parallel-core-features-pipieline-parallelism"></a>

One of the core features of SageMaker's model parallelism library is *pipeline parallelism*, which determines the order in which computations are made and data is processed across devices during model training. Pipelining is a technique for achieving true parallelization in model parallelism, overcoming the performance loss of sequential computation by having the GPUs compute simultaneously on different data samples. When you use pipeline parallelism, the training job is executed in a pipelined fashion over microbatches to maximize GPU usage.

**Note**  
Pipeline parallelism, also called model partitioning, is available for both PyTorch and TensorFlow. For supported versions of the frameworks, see [Supported Frameworks and AWS Regions](distributed-model-parallel-support.md).

## Pipeline Execution Schedule
<a name="model-parallel-pipeline-execution"></a>

Pipelining is based on splitting a mini-batch into microbatches, which are fed into the training pipeline one-by-one and follow an execution schedule defined by the library runtime. A *microbatch* is a smaller subset of a given training mini-batch. The pipeline schedule determines which microbatch is executed by which device for every time slot. 
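The splitting step can be sketched in plain Python; the list below stands in for a mini-batch of 64 records divided into 4 microbatches (the sizes are illustrative):

```python
# Illustrative sketch: splitting one mini-batch into microbatches, as the
# pipeline runtime does before scheduling them across devices.
mini_batch = list(range(64))      # stands in for a mini-batch of 64 records
num_microbatches = 4
mb_size = len(mini_batch) // num_microbatches

microbatches = [mini_batch[i * mb_size:(i + 1) * mb_size]
                for i in range(num_microbatches)]

print(len(microbatches), len(microbatches[0]))  # 4 16
```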

For example, depending on the pipeline schedule and the model partition, GPU `i` might perform (forward or backward) computation on microbatch `b` while GPU `i+1` performs computation on microbatch `b+1`, thereby keeping both GPUs active at the same time. During a single forward or backward pass, execution flow for a single microbatch might visit the same device multiple times, depending on the partitioning decision. For instance, an operation that is at the beginning of the model might be placed on the same device as an operation at the end of the model, while the operations in between are on different devices, which means this device is visited twice.

The library offers two different pipeline schedules, *simple* and *interleaved*, which can be configured using the `pipeline` parameter in the SageMaker Python SDK. In most cases, interleaved pipeline can achieve better performance by utilizing the GPUs more efficiently.
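For example, a configuration fragment that selects the interleaved schedule might look like the following sketch (the surrounding parameter values are illustrative assumptions, not recommendations):

```python
# Illustrative fragment: the `pipeline` parameter selects the schedule.
smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,
        "pipeline": "interleaved",   # or "simple"
    },
}
```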

### Interleaved Pipeline
<a name="model-parallel-pipeline-execution-interleaved"></a>

In an interleaved pipeline, backward execution of the microbatches is prioritized whenever possible. This allows quicker release of the memory used for activations, using memory more efficiently. It also allows for scaling the number of microbatches higher, reducing the idle time of the GPUs. At steady-state, each device alternates between running forward and backward passes. This means that the backward pass of one microbatch may run before the forward pass of another microbatch finishes.

![\[Example execution schedule for the interleaved pipeline over 2 GPUs.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/interleaved-pipeline-execution.png)


The preceding figure illustrates an example execution schedule for the interleaved pipeline over 2 GPUs. In the figure, F0 represents the forward pass for microbatch 0, and B1 represents the backward pass for microbatch 1. **Update** represents the optimizer update of the parameters. GPU0 always prioritizes backward passes whenever possible (for instance, executes B0 before F2), which allows for clearing of the memory used for activations earlier.

### Simple Pipeline
<a name="model-parallel-pipeline-execution-simple"></a>

A simple pipeline, by contrast, finishes running the forward pass for each microbatch before starting the backward pass. This means that it only pipelines the forward pass and backward pass stages within themselves. The following figure illustrates an example of how this works, over 2 GPUs.

![\[Example on a pipeline running the forward pass for each microbatch before starting the backward pass.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/simple-pipeline-execution.png)


### Pipelining Execution in Specific Frameworks
<a name="model-parallel-pipeline-frameworks"></a>

Use the following sections to learn about the framework-specific pipeline scheduling decisions SageMaker's model parallelism library makes for TensorFlow and PyTorch. 

#### Pipeline Execution with TensorFlow
<a name="model-parallel-pipeline-execution-interleaved-tf"></a>

The following image is an example of a TensorFlow graph partitioned by the model parallelism library, using automated model splitting. When a graph is split, each resulting subgraph is replicated B times (except for the variables), where B is the number of microbatches. In this figure, each subgraph is replicated 2 times (B=2). An `SMPInput` operation is inserted at each input of a subgraph, and an `SMPOutput` operation is inserted at each output. These operations communicate with the library backend to transfer tensors to and from each other.

![\[Example of a TensorFlow graph partitioned by the model parallelism library, using automated model splitting.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/interleaved-pipeline-tf.png)


The following image is an example of 2 subgraphs split with B=2 with gradient operations added. The gradient of a `SMPInput` op is a `SMPOutput` op, and vice versa. This enables the gradients to flow backwards during back-propagation.

![\[Example of 2 subgraphs split with B=2 with gradient operations added.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/interleaved-pipeline-tf.gif)


This GIF demonstrates an example interleaved pipeline execution schedule with B=2 microbatches and 2 subgraphs. Each device sequentially executes one of the subgraph replicas to improve GPU utilization. As B grows larger, the fraction of idle time slots goes to zero. Whenever it is time to do (forward or backward) computation on a specific subgraph replica, the pipeline layer signals to the corresponding blue `SMPInput` operations to start executing.

Once the gradients from all microbatches in a single mini-batch are computed, the library combines the gradients across microbatches, which can then be applied to the parameters. 

#### Pipeline Execution with PyTorch
<a name="model-parallel-pipeline-execution-interleaved-pt"></a>

Conceptually, pipelining follows a similar idea in PyTorch. However, because PyTorch does not use static graphs, the model parallelism library's PyTorch feature uses a more dynamic pipelining paradigm. 

As in TensorFlow, each batch is split into a number of microbatches, which are executed one at a time on each device. However, the execution schedule is handled via execution servers launched on each device. Whenever the output of a submodule that is placed on another device is needed on the current device, an execution request is sent to the execution server of the remote device along with the input tensors to the submodule. The server then executes this module with the given inputs and returns the response to the current device.

Since the current device is idle during the remote submodule execution, the local execution for the current microbatch pauses, and the library runtime switches execution to another microbatch which the current device can actively work on. The prioritization of microbatches is determined by the chosen pipeline schedule. For an interleaved pipeline schedule, microbatches that are in the backward stage of the computation are prioritized whenever possible.

# Tensor Parallelism
<a name="model-parallel-extended-features-pytorch-tensor-parallelism"></a>

*Tensor parallelism* is a type of model parallelism in which specific model weights, gradients, and optimizer states are split across devices. In contrast to pipeline parallelism, which keeps individual weights intact but partitions the *set* of weights, tensor parallelism splits individual weights. This typically involves distributed computation of specific operations, modules, or layers of the model.

Tensor parallelism is required in cases in which a single parameter consumes most of the GPU memory (such as large embedding tables with a large vocabulary size or a large softmax layer with a large number of classes). In this case, treating this large tensor or operation as an atomic unit is inefficient and impedes balance of the memory load. 

Tensor parallelism is also useful for extremely large models in which a pure pipelining is simply not enough. For example, with GPT-3-scale models that require partitioning over tens of instances, a pure microbatch pipelining is inefficient because the pipeline depth becomes too high and the overhead becomes prohibitively large.

**Note**  
Tensor parallelism is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.

**Topics**
+ [How Tensor Parallelism Works](model-parallel-extended-features-pytorch-tensor-parallelism-how-it-works.md)
+ [Run a SageMaker Distributed Model Parallel Training Job with Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism-examples.md)
+ [Support for Hugging Face Transformer Models](model-parallel-extended-features-pytorch-hugging-face.md)
+ [Ranking Mechanism when Using a Combination of Pipeline Parallelism and Tensor Parallelism](model-parallel-extended-features-pytorch-ranking-mechanism.md)

# How Tensor Parallelism Works
<a name="model-parallel-extended-features-pytorch-tensor-parallelism-how-it-works"></a>

Tensor parallelism takes place at the level of `nn.Modules`; it partitions specific modules in the model across tensor parallel ranks. This is in addition to the existing partition of the *set of modules* used in pipeline parallelism.

When a module is partitioned through tensor parallelism, its forward and backward propagation are distributed. The library handles the necessary communication across devices to implement the distributed execution of these modules. The modules are partitioned across multiple data parallel ranks. Contrary to the traditional distribution of workloads, each data parallel rank does **not** have the complete model replica when the library’s tensor parallelism is used. Instead, each data parallel rank may have only a partition of the distributed modules, in addition to the entirety of the modules that are not distributed.

**Example:** Consider tensor parallelism across data parallel ranks, where the degree of data parallelism is 4 and the degree of tensor parallelism is 2. Assume that you have a data parallel group that holds the following module tree, after partitioning the set of modules.

```
A
├── B
|   ├── E
|   ├── F
├── C
└── D
    ├── G
    └── H
```

Assume that tensor parallelism is supported for the modules B, G, and H. One possible outcome of tensor parallel partition of this model could be:

```
dp_rank 0 (tensor parallel rank 0): A, B:0, C, D, G:0, H
dp_rank 1 (tensor parallel rank 1): A, B:1, C, D, G:1, H
dp_rank 2 (tensor parallel rank 0): A, B:0, C, D, G:0, H
dp_rank 3 (tensor parallel rank 1): A, B:1, C, D, G:1, H
```

Each line represents the set of modules stored in that `dp_rank`, and the notation `X:y` represents the `y`th fraction of the module `X`. Note the following:

1. Partitioning takes place across subsets of data parallel ranks, which we call `TP_GROUP`, not the entire `DP_GROUP`, so that the exact model partition is replicated across `dp_rank` 0 and `dp_rank` 2, and similarly across `dp_rank` 1 and `dp_rank` 3.

1. The modules `E` and `F` are no longer part of the model, since their parent module `B` is partitioned, and any execution that is normally a part of `E` and `F` takes place within the (partitioned) `B` module.

1. Even though `H` is supported for tensor parallelism, in this example it is not partitioned, which highlights that whether to partition a module depends on user input. The fact that a module is supported for tensor parallelism does not necessarily mean it is partitioned.
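The replication pattern in the example above can be sketched as follows; the modulo mapping is an illustration of this particular example's layout, not a statement of the library's internal rank-assignment algorithm:

```python
# Illustrative sketch of the rank mapping in the example above: with a
# tensor parallel degree of 2, dp_ranks sharing the same tensor parallel
# rank hold identical model partitions (replication).
tp_degree = 2
dp_ranks = range(4)

tp_rank = {r: r % tp_degree for r in dp_ranks}
print(tp_rank)  # {0: 0, 1: 1, 2: 0, 3: 1}

# dp_rank 0 and dp_rank 2 (both tensor parallel rank 0) hold the same
# partition, as do dp_rank 1 and dp_rank 3.
replicas = {}
for r in dp_ranks:
    replicas.setdefault(tp_rank[r], []).append(r)
print(replicas)  # {0: [0, 2], 1: [1, 3]}
```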

## How the library adapts tensor parallelism to PyTorch `nn.Linear` module
<a name="model-parallel-extended-for-pytorch-adapt-to-module"></a>

When tensor parallelism is performed over data parallel ranks, a subset of the parameters, gradients, and optimizer states are partitioned across the tensor parallel devices *for the modules that are partitioned*. For the rest of the modules, the tensor parallel devices operate in a regular data parallel manner. To execute the partitioned module, a device first collects the necessary parts of *all data samples* across peer devices in the same tensor parallelism group. The device then runs the local fraction of the module on all these data samples, followed by another round of synchronization which both combines the parts of the output for each data sample and returns the combined data samples to the GPUs from which the data sample first originated. The following figure shows an example of this process over a partitioned `nn.Linear` module. 

![\[Two figures showing two tensor parallel concepts.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/tensor-parallel-concept.png)


The first figure shows a small model with a large `nn.Linear` module with data parallelism over the two tensor parallelism ranks. The `nn.Linear` module is replicated into the two parallel ranks. 

The second figure shows tensor parallelism applied on a larger model while splitting the `nn.Linear` module. Each `tp_rank` holds half the linear module, and the entirety of the rest of the operations. While the linear module runs, each `tp_rank` collects the relevant half of all data samples and passes it through their half of the `nn.Linear` module. The result needs to be reduce-scattered (with summation as the reduction operation) so that each rank has the final linear output for their own data samples. The rest of the model runs in the typical data parallel manner.
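This computation can be checked with a small NumPy sketch: splitting the weight of a linear layer along the input dimension and summing the per-rank partial products (the reduction performed by reduce-scatter) reproduces the full output. The shapes below are illustrative.

```python
import numpy as np

# Illustrative check of the distributed linear computation described above.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6))    # 8 data samples, 6 input features
W = rng.standard_normal((6, 4))    # full linear weight (bias omitted)

# Each of 2 tensor parallel ranks holds half of the input features
# and the matching half of the weight.
partial0 = x[:, :3] @ W[:3]        # tp_rank 0's partial output
partial1 = x[:, 3:] @ W[3:]        # tp_rank 1's partial output

# Summing the partial outputs (the reduce-scatter reduction) recovers
# the result of the full, unsplit linear layer.
assert np.allclose(partial0 + partial1, x @ W)
```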

# Run a SageMaker Distributed Model Parallel Training Job with Tensor Parallelism
<a name="model-parallel-extended-features-pytorch-tensor-parallelism-examples"></a>

In this section, you learn:
+ How to configure a SageMaker PyTorch estimator and the SageMaker model parallelism option to use tensor parallelism.
+ How to adapt your training script using the extended `smdistributed.modelparallel` modules for tensor parallelism.

To learn more about the `smdistributed.modelparallel` modules, see the [SageMaker model parallel APIs](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html) in the *SageMaker Python SDK documentation*.

**Topics**
+ [Tensor parallelism alone](#model-parallel-extended-features-pytorch-tensor-parallelism-alone)
+ [Tensor parallelism combined with pipeline parallelism](#model-parallel-extended-features-pytorch-tensor-and-pipeline-parallelism)

## Tensor parallelism alone
<a name="model-parallel-extended-features-pytorch-tensor-parallelism-alone"></a>

The following is an example of a distributed training option to activate tensor parallelism alone, without pipeline parallelism. Configure the `mpi_options` and `smp_options` dictionaries to specify distributed training options to the SageMaker `PyTorch` estimator.

**Note**  
Extended memory-saving features are available through Deep Learning Containers for PyTorch, which implement the SageMaker model parallelism library v1.6.0 and later.

**Configure a SageMaker PyTorch estimator**

```
from sagemaker.pytorch import PyTorch

mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,               # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}
               
smp_options = {
    "enabled":True,
    "parameters": {
        "pipeline_parallel_degree": 1,    # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 4,      # tp over 4 devices
        "ddp": True
    }
}
              
smp_estimator = PyTorch(
    entry_point='your_training_script.py', # Specify
    role=role,
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    framework_version='1.13.1',
    py_version='py39',
    instance_count=1,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smp_estimator.fit('s3://my_bucket/my_training_data/')
```

**Tip**  
To find a complete list of parameters for `distribution`, see [Configuration Parameters for Model Parallelism](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html) in the SageMaker Python SDK documentation.

**Adapt your PyTorch training script**

The following example training script shows how to adapt the SageMaker model parallelism library to a training script. In this example, it is assumed that the script is named `your_training_script.py`. 

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        return F.log_softmax(x, 1)

def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # smdistributed: Move input tensors to the GPU ID used by
        # the current process, based on the set_device call.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target, reduction="mean")
        loss.backward()
        optimizer.step()

# smdistributed: Initialize the backend
smp.init()

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Download only on a single process per instance.
# When this is not present, the file is corrupted by multiple processes trying
# to download and extract at the same time
if smp.local_rank() == 0:
    dataset = datasets.MNIST("../data", train=True, download=True)
smp.barrier()

# smdistributed: Shard the dataset based on data parallel ranks
if smp.dp_size() > 1:
    partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())}
    dataset = SplitDataset(dataset, partitions=partitions_dict)
    dataset.select(f"{smp.dp_rank()}")

train_loader = torch.utils.data.DataLoader(dataset, batch_size=64)

# smdistributed: Enable tensor parallelism for all supported modules in the model
# i.e., nn.Linear in this case. Alternatively, we can use
# smp.set_tensor_parallelism(model.fc1, True)
# to enable it only for model.fc1
with smp.tensor_parallelism():
    model = Net()

# smdistributed: Use the DistributedModel wrapper to distribute the
# modules for which tensor parallelism is enabled
model = smp.DistributedModel(model)

optimizer = optim.Adadelta(model.parameters(), lr=4.0)
optimizer = smp.DistributedOptimizer(optimizer)

train(model, device, train_loader, optimizer)
```

## Tensor parallelism combined with pipeline parallelism
<a name="model-parallel-extended-features-pytorch-tensor-and-pipeline-parallelism"></a>

The following is an example of a distributed training option that enables tensor parallelism combined with pipeline parallelism. Set up the `mpi_options` and `smp_options` parameters to specify model parallel options with tensor parallelism when you configure a SageMaker `PyTorch` estimator.

**Note**  
Extended memory-saving features are available through Deep Learning Containers for PyTorch, which implement the SageMaker model parallelism library v1.6.0 and later.

**Configure a SageMaker PyTorch estimator**

```
from sagemaker.pytorch import PyTorch

mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,               # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}
               
smp_options = {
    "enabled":True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,    # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,      # tp over 2 devices
        "ddp": True
    }
}
              
smp_estimator = PyTorch(
    entry_point='your_training_script.py', # Specify
    role=role,
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    framework_version='1.13.1',
    py_version='py39',
    instance_count=1,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smp_estimator.fit('s3://my_bucket/my_training_data/')  
```

<a name="model-parallel-extended-features-pytorch-tensor-and-pipeline-parallelism-script"></a>**Adapt your PyTorch training script**

The following example training script shows how to adapt the SageMaker model parallelism library to a training script. Note that the training script now includes the `smp.step` decorator: 

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        return F.log_softmax(x, 1)


# smdistributed: Define smp.step. Return any tensors needed outside.
@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)
    return output, loss

def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # smdistributed: Move input tensors to the GPU ID used by
        # the current process, based on the set_device call.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        # Return value, loss_mb is a StepOutput object
        _, loss_mb = train_step(model, data, target)

        # smdistributed: Average the loss across microbatches.
        loss = loss_mb.reduce_mean()

        optimizer.step()

# smdistributed: Initialize the backend
smp.init()

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Download only on a single process per instance.
# When this is not present, the file is corrupted by multiple processes trying
# to download and extract at the same time
if smp.local_rank() == 0:
    dataset = datasets.MNIST("../data", train=True, download=True)
smp.barrier()

# smdistributed: Shard the dataset based on data parallel ranks
if smp.dp_size() > 1:
    partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())}
    dataset = SplitDataset(dataset, partitions=partitions_dict)
    dataset.select(f"{smp.dp_rank()}")

# smdistributed: Set drop_last=True to ensure that batch size is always divisible
# by the number of microbatches
train_loader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)

model = Net()

# smdistributed: enable tensor parallelism only for model.fc1
smp.set_tensor_parallelism(model.fc1, True)

# smdistributed: Use the DistributedModel container to provide the model
# to be partitioned across different ranks. For the rest of the script,
# the returned DistributedModel object should be used in place of
# the model provided for DistributedModel class instantiation.
model = smp.DistributedModel(model)

optimizer = optim.Adadelta(model.parameters(), lr=4.0)
optimizer = smp.DistributedOptimizer(optimizer)

train(model, device, train_loader, optimizer)
```

# Support for Hugging Face Transformer Models
<a name="model-parallel-extended-features-pytorch-hugging-face"></a>

The SageMaker model parallelism library's tensor parallelism offers out-of-the-box support for the following Hugging Face Transformer models:
+ GPT-2, BERT, and RoBERTa (Available in the SageMaker model parallelism library v1.7.0 and later)
+ GPT-J (Available in the SageMaker model parallelism library v1.8.0 and later)
+ GPT-Neo (Available in the SageMaker model parallelism library v1.10.0 and later)

**Note**  
For any other Transformers models, you need to use the [smdistributed.modelparallel.torch.tp\_register\_with\_module()](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.html#smdistributed.modelparallel.torch.tp_register_with_module) API to apply tensor parallelism.

**Note**  
To use tensor parallelism for training Hugging Face Transformer models, make sure you use Hugging Face Deep Learning Containers for PyTorch that include the SageMaker model parallelism library v1.7.0 or later. For more information, see the [SageMaker model parallelism library release notes](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.html).

## Supported Models Out of the Box
<a name="model-parallel-extended-features-pytorch-hugging-face-out-of-the-box"></a>

For the Hugging Face transformer models supported by the library out of the box, you don't need to manually implement hooks to translate Transformer APIs to `smdistributed` transformer layers. You can activate tensor parallelism by using the context manager [smdistributed.modelparallel.torch.tensor\_parallelism()](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.html#smdistributed.modelparallel.torch.tensor_parallelism) and wrapping the model by [smdistributed.modelparallel.torch.DistributedModel()](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.DistributedModel). You don't need to manually register hooks for tensor parallelism using the `smp.tp_register` API.

The `state_dict` translation functions between Hugging Face Transformers and `smdistributed.modelparallel` can be accessed as follows.
+  `smdistributed.modelparallel.torch.nn.huggingface.gpt2.translate_state_dict_to_hf_gpt2(state_dict, max_seq_len=None)`
+  `smdistributed.modelparallel.torch.nn.huggingface.gpt2.translate_hf_state_dict_to_smdistributed_gpt2(state_dict)` 
+  `smdistributed.modelparallel.torch.nn.huggingface.bert.translate_state_dict_to_hf_bert(state_dict, max_seq_len=None)` 
+  `smdistributed.modelparallel.torch.nn.huggingface.bert.translate_hf_state_dict_to_smdistributed_bert(state_dict)` 
+  `smdistributed.modelparallel.torch.nn.huggingface.roberta.translate_state_dict_to_hf_roberta(state_dict, max_seq_len=None)` 
+  `smdistributed.modelparallel.torch.nn.huggingface.roberta.translate_hf_state_dict_to_smdistributed_roberta(state_dict)` 
+ `smdistributed.modelparallel.torch.nn.huggingface.gptj.translate_state_dict_to_hf_gptj(state_dict, max_seq_len=None)` (Available in the SageMaker model parallelism library v1.8.0 and later)
+ `smdistributed.modelparallel.torch.nn.huggingface.gptj.translate_hf_gptj_state_dict_to_smdistributed_gptj(state_dict)` (Available in the SageMaker model parallelism library v1.8.0 and later)
+ `smdistributed.modelparallel.torch.nn.huggingface.gptneo.translate_state_dict_to_hf_gptneo(state_dict, max_seq_len=None)` (Available in the SageMaker model parallelism library v1.10.0 and later)
+ `smdistributed.modelparallel.torch.nn.huggingface.gptneo.translate_hf_state_dict_to_smdistributed_gptneo(state_dict)` (Available in the SageMaker model parallelism library v1.10.0 and later)

**Example usage of the GPT-2 translation function**

Start with wrapping the model as shown in the following code.

```
import smdistributed.modelparallel.torch as smp
from transformers import AutoModelForCausalLM

with smp.tensor_parallelism():
    model = AutoModelForCausalLM.from_config(hf_gpt2_config)

model = smp.DistributedModel(model)
```

Given a `state_dict` from the `DistributedModel` object, you can load the weights into the original Hugging Face GPT-2 model using the `translate_state_dict_to_hf_gpt2` function as shown in the following code.

```
from smdistributed.modelparallel.torch.nn.huggingface.gpt2 \
                                      import translate_state_dict_to_hf_gpt2
max_seq_len = 1024

# [... code block for training ...]

if smp.rdp_rank() == 0:
    state_dict = dist_model.state_dict()
    hf_state_dict = translate_state_dict_to_hf_gpt2(state_dict, max_seq_len)

    # can now call model.load_state_dict(hf_state_dict) to the original HF model
```

**Example usage of the RoBERTa translation function**

Similarly, given a supported Hugging Face model `state_dict`, you can use the `translate_hf_state_dict_to_smdistributed_roberta` function to convert it to a format readable by `smp.DistributedModel`. This can be useful in transfer learning use cases, where a pre-trained model is loaded into a `smp.DistributedModel` for model-parallel fine-tuning:

```
import smdistributed.modelparallel.torch as smp
from transformers import AutoModelForMaskedLM
from smdistributed.modelparallel.torch.nn.huggingface.roberta \
                import translate_hf_state_dict_to_smdistributed_roberta

model = AutoModelForMaskedLM.from_config(roberta_config)
model = smp.DistributedModel(model)

pretrained_model = AutoModelForMaskedLM.from_pretrained("roberta-large")
translated_state_dict = \
        translate_hf_state_dict_to_smdistributed_roberta(pretrained_model.state_dict())

# load the translated pretrained weights into the smp.DistributedModel
model.load_state_dict(translated_state_dict)

# start fine-tuning...
```

# Ranking Mechanism when Using a Combination of Pipeline Parallelism and Tensor Parallelism
<a name="model-parallel-extended-features-pytorch-ranking-mechanism"></a>

This section explains how the ranking mechanism of model parallelism works with tensor parallelism. This is extended from the [Ranking Basics](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#ranking-basics) for [Core Features of the SageMaker Model Parallelism Library](model-parallel-core-features.md). With tensor parallelism, the library introduces three types of ranking and process group APIs: `smp.tp_rank()` for tensor parallel rank, `smp.pp_rank()` for pipeline parallel rank, and `smp.rdp_rank()` for reduced-data parallel rank. The corresponding communication process groups are tensor parallel group (`TP_GROUP`), pipeline parallel group (`PP_GROUP`), and reduced-data parallel group (`RDP_GROUP`). These groups are defined as follows:
+ A *tensor parallel group* (`TP_GROUP`) is an evenly divisible subset of the data parallel group, over which tensor parallel distribution of modules takes place. When the degree of pipeline parallelism is 1, `TP_GROUP` is the same as *model parallel group* (`MP_GROUP`). 
+ A *pipeline parallel group* (`PP_GROUP`) is the group of processes over which pipeline parallelism takes place. When the tensor parallelism degree is 1, `PP_GROUP` is the same as `MP_GROUP`. 
+ A *reduced-data parallel group* (`RDP_GROUP`) is a set of processes that hold both the same pipeline parallelism partitions and the same tensor parallel partitions, and perform data parallelism among themselves. This is called the reduced data parallel group because it is a subset of the entire data parallelism group, `DP_GROUP`. For the model parameters that are distributed within the `TP_GROUP`, the gradient `allreduce` operation is performed only within the reduced-data parallel group, while for the parameters that are not distributed, the gradient `allreduce` takes place over the entire `DP_GROUP`. 
+ A *model parallel group* (`MP_GROUP`) refers to a group of processes that collectively store the entire model. It consists of the union of the `PP_GROUP`s of all the ranks that are in the `TP_GROUP` of the current process. When the degree of tensor parallelism is 1, `MP_GROUP` is equivalent to `PP_GROUP`. It is also consistent with the existing definition of `MP_GROUP` from previous `smdistributed` releases. Note that the current `TP_GROUP` is a subset of both the current `DP_GROUP` and the current `MP_GROUP`. 

To learn more about the communication process APIs in the SageMaker model parallelism library, see the [Common API](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_common_api.html#) and the [PyTorch-specific APIs](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html) in the *SageMaker Python SDK documentation*.

![\[Ranking mechanism, parameter distribution, and associated AllReduce operations of tensor parallelism.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/tensor-parallel-ranking-mechanism.png)


For example, consider process groups for a single node with 8 GPUs, where the degree of tensor parallelism is 2, the degree of pipeline parallelism is 2, and the degree of data parallelism is 4. The upper center part of the preceding figure shows an example of a model with 4 layers. The lower left and lower right parts of the figure illustrate the 4-layer model distributed across 4 GPUs using both pipeline parallelism and tensor parallelism, where tensor parallelism is used for the middle two layers. These two lower figures are simple copies to illustrate different group boundary lines. The partitioned model is replicated for data parallelism across GPUs 0-3 and 4-7. The lower left figure shows the definitions of `MP_GROUP`, `PP_GROUP`, and `TP_GROUP`. The lower right figure shows `RDP_GROUP`, `DP_GROUP`, and `WORLD` over the same set of GPUs. The gradients for the layers and layer slices that have the same color are `allreduce`d together for data parallelism. For example, the first layer (light blue) gets the `allreduce` operations across `DP_GROUP`, whereas the dark orange slice in the second layer only gets the `allreduce` operations within the `RDP_GROUP` of its process. The bold dark red arrows represent tensors that carry the batch for their entire `TP_GROUP`.

```
GPU0: pp_rank 0, tp_rank 0, rdp_rank 0, dp_rank 0, mp_rank 0
GPU1: pp_rank 1, tp_rank 0, rdp_rank 0, dp_rank 0, mp_rank 1
GPU2: pp_rank 0, tp_rank 1, rdp_rank 0, dp_rank 1, mp_rank 2
GPU3: pp_rank 1, tp_rank 1, rdp_rank 0, dp_rank 1, mp_rank 3
GPU4: pp_rank 0, tp_rank 0, rdp_rank 1, dp_rank 2, mp_rank 0
GPU5: pp_rank 1, tp_rank 0, rdp_rank 1, dp_rank 2, mp_rank 1
GPU6: pp_rank 0, tp_rank 1, rdp_rank 1, dp_rank 3, mp_rank 2
GPU7: pp_rank 1, tp_rank 1, rdp_rank 1, dp_rank 3, mp_rank 3
```

In this example, pipeline parallelism occurs across the GPU pairs (0,1); (2,3); (4,5) and (6,7). In addition, data parallelism (`allreduce`) takes place across GPUs 0, 2, 4, 6, and independently over GPUs 1, 3, 5, 7. Tensor parallelism happens over subsets of `DP_GROUP`s, across the GPU pairs (0,2); (1,3); (4,6) and (5,7).

For this kind of hybrid pipeline and tensor parallelism, the math for `data_parallel_degree` remains the same: `data_parallel_degree = number_of_GPUs / pipeline_parallel_degree`. The library further calculates the reduced data parallel degree from the relation `reduced_data_parallel_degree * tensor_parallel_degree = data_parallel_degree`.
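To make the arithmetic concrete, the following pure-Python sketch reproduces the rank listing in the preceding 8-GPU example from the two parallelism degrees alone. The index formulas are an illustration of the placement shown in the figure, not a library API; the library computes these assignments internally.

```python
pipeline_parallel_degree = 2
tensor_parallel_degree = 2
num_gpus = 8

# data_parallel_degree = number_of_GPUs / pipeline_parallel_degree
data_parallel_degree = num_gpus // pipeline_parallel_degree
# reduced_data_parallel_degree * tensor_parallel_degree = data_parallel_degree
reduced_data_parallel_degree = data_parallel_degree // tensor_parallel_degree

for gpu in range(num_gpus):
    pp_rank = gpu % pipeline_parallel_degree
    tp_rank = (gpu // pipeline_parallel_degree) % tensor_parallel_degree
    dp_rank = gpu // pipeline_parallel_degree
    rdp_rank = gpu // (pipeline_parallel_degree * tensor_parallel_degree)
    mp_rank = gpu % (pipeline_parallel_degree * tensor_parallel_degree)
    print(f"GPU{gpu}: pp_rank {pp_rank}, tp_rank {tp_rank}, "
          f"rdp_rank {rdp_rank}, dp_rank {dp_rank}, mp_rank {mp_rank}")
```

Running this prints the same eight lines as the preceding listing, and `data_parallel_degree` and `reduced_data_parallel_degree` come out to 4 and 2, matching the relations above.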

# Optimizer State Sharding
<a name="model-parallel-extended-features-pytorch-optimizer-state-sharding"></a>

*Optimizer state sharding* is a useful memory-saving technique that shards the optimizer state (the set of values, such as momentum estimates, that describes the state of the optimizer) across data parallel device groups. You can use optimizer state sharding whenever you use a stateful optimizer (such as Adam) or an FP16 optimizer (which stores both FP16 and FP32 copies of the parameters).

**Note**  
Optimizer state sharding is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.

## How to Use Optimizer State Sharding
<a name="model-parallel-extended-features-pytorch-optimizer-state-sharding-how-to-use"></a>

You can turn on *optimizer state sharding* by setting `"shard_optimizer_state": True` in the `modelparallel` configuration. 

When this feature is turned on, the library partitions the set of model parameters based on the data parallelism degree. The gradients corresponding to the `i`th partition get reduced only at the `i`th data parallel rank. At the end of the first call to an `smp.step`-decorated function, the optimizer wrapped by `smp.DistributedOptimizer` redefines its parameters to be limited to those parameters corresponding to the partition of the current data parallel rank. The redefined parameters are called *virtual parameters* and share underlying storage with the original parameters. During the first call to `optimizer.step`, the optimizer states are created based on these redefined parameters, so they are sharded according to the original partitioning. After the optimizer update, the `allgather` operation (as part of the `optimizer.step` call) runs across the data parallel ranks to achieve consistent parameter states.
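The memory effect can be illustrated with a toy accounting sketch. This is a conceptual illustration with assumed numbers, not the library's implementation: each data parallel rank keeps optimizer state only for its own partition of the parameters, so per-rank optimizer memory shrinks by roughly the data parallelism degree.

```python
num_params = 12   # toy model with 12 parameter tensors
dp_size = 4       # degree of data parallelism

# Without sharding, every rank holds optimizer state for all parameters.
state_per_rank_unsharded = num_params

# With sharding, the parameter set is partitioned across the dp ranks
# (round-robin here, purely for illustration), and each rank keeps
# optimizer state only for its own partition.
partitions = [list(range(rank, num_params, dp_size)) for rank in range(dp_size)]
state_per_rank_sharded = max(len(p) for p in partitions)

print(state_per_rank_unsharded, state_per_rank_sharded)  # 12 3
```

Together the partitions still cover every parameter exactly once; the `allgather` after the optimizer update is what restores a consistent full copy of the parameters on every rank.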

**Tip**  
Optimizer state sharding can be useful when the degree of data parallelism is greater than 1 and the model has more than a billion parameters.   
The degree of data parallelism is calculated by `(processes_per_host * instance_count / pipeline_parallel_degree)`, and the `smp.dp_size()` function handles the sizing in the background.
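Using the formula from the tip with the values from the example configuration that follows (and assuming `instance_count=1`, which is not specified in this section), the degree of data parallelism works out as:

```python
# Values from the example configuration below; instance_count is an assumption.
processes_per_host = 8
instance_count = 1
pipeline_parallel_degree = 2

data_parallel_degree = (processes_per_host * instance_count) // pipeline_parallel_degree
print(data_parallel_degree)  # 4
```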

**Configure a SageMaker PyTorch estimator**

```
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,               # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}

smp_options = {
    "enabled":True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,    # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,      # tp over 2 devices
        "ddp": True,
        "shard_optimizer_state": True
    }
}
```

**Adapt your PyTorch training script**

See [Adapt your PyTorch training script](model-parallel-extended-features-pytorch-tensor-parallelism-examples.md#model-parallel-extended-features-pytorch-tensor-and-pipeline-parallelism-script) in the *Tensor parallelism combined with pipeline parallelism* section. There’s no additional modification required for the script.

# Activation Checkpointing
<a name="model-parallel-extended-features-pytorch-activation-checkpointing"></a>

*Activation checkpointing* (or *gradient checkpointing*) is a technique to reduce memory usage by clearing activations of certain layers and recomputing them during a backward pass. Effectively, this trades extra computation time for reduced memory usage. If a module is checkpointed, at the end of a forward pass, the inputs to and outputs from the module stay in memory. Any intermediate tensors that would have been part of the computation inside that module are freed up during the forward pass. During the backward pass of checkpointed modules, these tensors are recomputed. At this point, the layers beyond this checkpointed module have finished their backward pass, so the peak memory usage with checkpointing can be lower.
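To see why peak memory drops, consider a toy activation-count model (an illustration with assumed numbers, not a measurement of the library): if a forward pass produces one resident activation per layer and every `k` layers are grouped into one checkpointed module, only the module boundaries stay resident, plus up to `k` recomputed activations while one module runs its backward pass.

```python
def peak_activation_memory(num_layers, checkpoint_every):
    """Toy count of activations resident at the backward-pass peak
    when every `checkpoint_every` layers form one checkpointed module."""
    boundaries = num_layers // checkpoint_every  # module inputs/outputs kept
    recompute_window = checkpoint_every          # recomputed inside one module
    return boundaries + recompute_window

layers = 64
no_checkpointing = layers                        # every activation stays resident
with_checkpointing = peak_activation_memory(layers, 8)
print(no_checkpointing, with_checkpointing)      # 64 16
```

With 64 layers and modules of 8 layers, the resident count drops from 64 activations to roughly 16 (8 boundaries plus up to 8 recomputed), at the cost of one extra forward computation per module; in this toy model the minimum sits near a module size of `sqrt(num_layers)`.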

**Note**  
This feature is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.

## How to Use Activation Checkpointing
<a name="model-parallel-extended-for-pytorch-activation-checkpointing-how-to-use"></a>

With `smdistributed.modelparallel`, you can use activation checkpointing at the granularity of a module. For all `torch.nn` modules except `torch.nn.Sequential`, you can only checkpoint a module tree if it lies within one partition from the perspective of pipeline parallelism. In the case of the `torch.nn.Sequential` module, each module tree inside the sequential module must lie completely within one partition for activation checkpointing to work. When you use manual partitioning, be aware of these restrictions.

When you use [automated model partitioning](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html#model-parallel-automated-model-splitting), you can find the partitioning assignment logs starting with `Partition assignments:` in the training job logs. If a module is partitioned across multiple ranks (for example, with one descendant on one rank and another descendant on a different rank), the library ignores the attempt to checkpoint the module and raises a warning message that the module won't be checkpointed.

**Note**  
The SageMaker model parallelism library supports both overlapping and non-overlapping `allreduce` operations in combination with checkpointing. 

**Note**  
PyTorch’s native checkpointing API is not compatible with `smdistributed.modelparallel`.

**Example 1:** The following sample code shows how to use activation checkpointing when you have a model definition in your script.

```
import torch
import torch.nn as nn
import torch.nn.functional as F

from smdistributed.modelparallel.torch.patches.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        # This call of fc1 will be checkpointed
        x = checkpoint(self.fc1, x)
        x = self.fc2(x)
        return F.log_softmax(x, 1)
```

**Example 2:** The following sample code shows how to use activation checkpointing when you have a sequential model in your script.

```
import torch.nn as nn
import torch.nn.functional as F
from smdistributed.modelparallel.torch.patches.checkpoint import checkpoint_sequential

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.seq = nn.Sequential(
            nn.Conv2d(1,20,5),
            nn.ReLU(),
            nn.Conv2d(20,64,5),
            nn.ReLU()
        )

    def forward(self, x):
        # This call of self.seq will be checkpointed
        x = checkpoint_sequential(self.seq, x)
        return F.log_softmax(x, 1)
```

**Example 3:** The following sample code shows how to use activation checkpointing when you import a prebuilt model from a library, such as PyTorch or Hugging Face Transformers. Whether you checkpoint sequential modules or not, do the following: 

1. Wrap the model by `smp.DistributedModel()`.

1. Define an object for sequential layers.

1. Wrap the sequential layer object by `smp.set_activation_checkpointing()`.

```
import smdistributed.modelparallel.torch as smp
from transformers import AutoModelForCausalLM

smp.init()
model = AutoModelForCausalLM(*args, **kwargs)
model = smp.DistributedModel(model)

# Call set_activation_checkpointing API
transformer_layers = model.module.module.module.transformer.seq_layers
smp.set_activation_checkpointing(
    transformer_layers, pack_args_as_tuple=True, strategy='each')
```

# Activation Offloading
<a name="model-parallel-extended-features-pytorch-activation-offloading"></a>

When activation checkpointing and pipeline parallelism are turned on and the number of microbatches is greater than one, *activation offloading* is an additional feature that can further reduce memory usage. Activation offloading asynchronously moves to the CPU the checkpointed activations for the microbatches that are not currently running. Right before the GPU needs the activations for a microbatch's backward pass, this functionality prefetches the offloaded activations back from the CPU.

**Note**  
This feature is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.

## How to Use Activation Offloading
<a name="model-parallel-extended-for-pytorch-activation-offloading"></a>

Use activation offloading to reduce memory usage when **the number of microbatches is greater than 1, and activation checkpointing is turned on** (see [Activation Checkpointing](model-parallel-extended-features-pytorch-activation-checkpointing.md)). When activation checkpointing is not used, activation offloading has no effect. When it is used with only one microbatch, it does not save memory.

To use activation offloading, set `"offload_activations": True` in the `modelparallel` configuration.

Activation offloading moves the checkpointed activations in `nn.Sequential` modules to CPU asynchronously. The data transfer over the PCIe link overlaps with GPU computation. The offloading happens immediately, as soon as the forward pass for a particular checkpointed layer is computed. The activations are loaded back to the GPU shortly before they are needed for the backward pass of a particular microbatch. The CPU-GPU transfer similarly overlaps with computation. 

To adjust how early the activations are loaded back into the GPU, you can use the configuration parameter `"activation_loading_horizon"` (the default is 4; the value must be an integer greater than 0). A larger activation loading horizon causes the activations to be loaded back to the GPU earlier. If the horizon is too large, the memory-saving impact of activation offloading might be diminished. If the horizon is too small, the activations might not be loaded back in time, reducing the amount of overlap and degrading performance.

**Tip**  
Activation offloading can be useful for large models with over a hundred billion parameters.

**Configure a SageMaker PyTorch estimator**

```
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,               # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}

smp_options = {
    "enabled":True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,    # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,      # tp over 2 devices
        "ddp": True,
        "offload_activations": True,
        "activation_loading_horizon": 4   # optional. default is 4.
    }
}
```

# FP16 Training with Model Parallelism
<a name="model-parallel-extended-features-pytorch-fp16"></a>

For FP16 training, apply the following modifications to your training script and estimator.

**Note**  
This feature is available for PyTorch in the SageMaker model parallelism library v1.10.0 and later.

**Adapt your PyTorch training script**

1. Wrap your model using the [smdistributed.modelparallel.torch.model\_creation()](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.model_creation) context manager.

   ```
   # fp16_training_script.py
   
   import torch
   import smdistributed.modelparallel.torch as smp
   
   with smp.model_creation(
       dtype=torch.float16 if args.fp16 else torch.get_default_dtype()
   ):
       model = ...
   ```
**Tip**  
If you are using tensor parallelism, add `tensor_parallelism=smp.tp_size() > 1` to the `smp.model_creation` context manager. Adding this line also helps automatically detect whether tensor parallelism is activated or not.  

   ```
   with smp.model_creation(
       ... ,
       tensor_parallelism=smp.tp_size() > 1
   ):
       model = ...
   ```

1. When you wrap the optimizer with `smdistributed.modelparallel.torch.DistributedOptimizer`, set either the `static_loss_scale` or `dynamic_loss_scale` argument. By default, `static_loss_scale` is set to `1.0`, and `dynamic_loss_scale` is set to `False`. If you set `dynamic_loss_scale=True`, you can feed dynamic loss scaling options as a dictionary through the `dynamic_loss_args` argument. In most cases, we recommend that you use dynamic loss scaling with the default options. For more information, options, and examples of the optimizer wrapper function, see the [smdistributed.modelparallel.torch.DistributedOptimizer](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed-modelparallel-torch-distributedoptimizer) API.

   The following code is an example of wrapping an `Adadelta` optimizer object with dynamic loss scaling for FP16 training.

   ```
   optimizer = torch.optim.Adadelta(...)
   optimizer = smp.DistributedOptimizer(
       optimizer,
       static_loss_scale=None,
       dynamic_loss_scale=True,
       dynamic_loss_args={
           "scale_window": 1000,
           "min_scale": 1,
           "delayed_shift": 2
       }
   )
   ```
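The dynamic loss scaling options shown above (`scale_window`, `min_scale`) follow the pattern commonly used by mixed-precision optimizers. The following toy sketch illustrates that typical behavior; it is not the `smdistributed` implementation, and the `delayed_shift` option is not modeled:

```python
def step_loss_scale(scale, overflow, good_steps, scale_window=1000, min_scale=1):
    """One update of a typical dynamic loss scale. Returns (scale, good_steps).

    The scale is halved on overflow (never below min_scale) and doubled
    after scale_window consecutive overflow-free steps.
    """
    if overflow:
        return max(scale / 2, min_scale), 0
    good_steps += 1
    if good_steps >= scale_window:
        return scale * 2, 0
    return scale, good_steps

scale, good = 2.0 ** 15, 0
scale, good = step_loss_scale(scale, overflow=True, good_steps=good)  # halved
assert scale == 2.0 ** 14
for _ in range(1000):                                                 # a full window
    scale, good = step_loss_scale(scale, overflow=False, good_steps=good)
assert scale == 2.0 ** 15                                             # doubled back
```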

**Configure a SageMaker PyTorch estimator**

Add the FP16 parameter (`"fp16"`) to the distribution configuration for model parallelism when creating a SageMaker PyTorch estimator object. For a complete list of the configuration parameters for model parallelism, see [Parameters for `smdistributed`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#parameters-for-smdistributed).

```
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters":  {
        "microbatches":  4,
        "pipeline_parallel_degree":  2,
        "tensor_parallel_degree":  2,
        ...,

        "fp16": True
    }
}

fp16_estimator = PyTorch(
    entry_point="fp16_training_script.py", # Specify your train script
    ...,

    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {...}
    }
)

fp16_estimator.fit(...)
```

When FP16 training starts, the model and the optimizer are wrapped by `FP16_Module` and `FP16_Optimizer` respectively, which are modified `smdistributed` versions of the [Apex utils](https://nvidia.github.io/apex/fp16_utils.html#apex-fp16-utils). `FP16_Module` converts the model to FP16 dtype and deals with the forward pass in FP16.

**Tip**  
You can apply gradient clipping by calling `clip_master_grads` before `optimizer.step`.  

```
optimizer.clip_master_grads(max_norm)     # max_norm(float or int): max norm of the gradients
```

**Tip**  
When using `torch.optim.lr_scheduler` and FP16 training, you need to pass `optimizer.optimizer` to the LR scheduler rather than the optimizer. See the following example code.  

```
from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(
    optimizer.optimizer if smp.state.cfg.fp16 else optimizer,
    step_size=1,
    gamma=args.gamma
)
```

# Support for FlashAttention
<a name="model-parallel-attention-head-size-for-flash-attention"></a>

Support for FlashAttention is a feature of the library applicable only to the *distributed transformer* model, which is a Transformer model wrapped by [smdistributed.modelparallel.torch.DistributedModel](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed-modelparallel-torch-distributedmodel) for model-parallel training. This feature is also compatible with [Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism.md). 

The [FlashAttention](https://github.com/HazyResearch/flash-attention) library only supports models whose `attention_head_size` is a multiple of 8 and less than 128. Therefore, when you train a distributed transformer, to make sure that FlashAttention works properly, adjust your parameters so that the attention head size complies with the requirements. For more information, see also [Installation and features](https://github.com/HazyResearch/flash-attention#installation-and-features) in the *FlashAttention GitHub repository*.

For example, assume that you configure a Transformer model with `hidden_width=864` and `num_heads=48`. The head size of FlashAttention is calculated as `attention_head_size = hidden_width / num_heads = 864 / 48 = 18`. To enable FlashAttention, you need to adjust the `num_heads` parameter to `54`, so that `attention_head_size = hidden_width / num_heads = 864 / 54 = 16`, which is a multiple of 8.
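A small helper makes this adjustment mechanical. This is a hypothetical utility for illustration, not part of the library: given a hidden width, it finds the head count closest to the desired one whose head size divides the hidden width evenly, is a multiple of 8, and is less than 128.

```python
def nearest_valid_num_heads(hidden_width, desired_num_heads):
    """Pick num_heads so that attention_head_size = hidden_width / num_heads
    is a multiple of 8 and less than 128 (the FlashAttention constraints)."""
    candidates = [
        hidden_width // head_size
        for head_size in range(8, 128, 8)
        if hidden_width % head_size == 0
    ]
    return min(candidates, key=lambda n: abs(n - desired_num_heads))

num_heads = nearest_valid_num_heads(864, 48)
print(num_heads, 864 // num_heads)  # 54 16
```

For `hidden_width=864` and a desired 48 heads, the helper returns 54, giving an attention head size of 16, which matches the manual adjustment described above.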

# Run a SageMaker Distributed Training Job with Model Parallelism
<a name="model-parallel-use-api"></a>

Learn how to run a model-parallel training job with your own training script using the SageMaker Python SDK and the SageMaker model parallelism library.

There are three use-case scenarios for running a SageMaker training job.

1. You can use one of the pre-built AWS Deep Learning Containers for TensorFlow and PyTorch. This option is recommended if this is your first time using the model parallelism library. To find a tutorial for how to run a SageMaker model parallel training job, see the example notebooks at [PyTorch training with Amazon SageMaker AI's model parallelism library](https://github.com/aws/amazon-sagemaker-examples/tree/main/training/distributed_training/pytorch/model_parallel).

1. You can extend the pre-built containers to handle any additional functional requirements for your algorithm or model that the pre-built SageMaker Docker image doesn't support. To find an example of how you can extend a pre-built container, see [Extend a Pre-built Container](prebuilt-containers-extend.md).

1. You can adapt your own Docker container to work with SageMaker AI using the [SageMaker Training toolkit](https://github.com/aws/sagemaker-training-toolkit). For an example, see [Adapting Your Own Training Container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html).

For options 2 and 3 in the preceding list, refer to [Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel Library](model-parallel-sm-sdk.md#model-parallel-customize-container) to learn how to install the model parallel library in an extended or customized Docker container. 

In all cases, you launch your training job by configuring a SageMaker `TensorFlow` or `PyTorch` estimator to activate the library. To learn more, see the following topics.

**Topics**
+ [Step 1: Modify Your Own Training Script Using SageMaker's Distributed Model Parallel Library](model-parallel-customize-training-script.md)
+ [Step 2: Launch a Training Job Using the SageMaker Python SDK](model-parallel-sm-sdk.md)

# Step 1: Modify Your Own Training Script Using SageMaker's Distributed Model Parallel Library
<a name="model-parallel-customize-training-script"></a>

Use this section to learn how to customize your training script to use the core features of the Amazon SageMaker AI model parallelism library. To use the library-specific API functions and parameters, we recommend you use this documentation alongside the [SageMaker model parallel library APIs](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html) in the *SageMaker Python SDK documentation*.

The training script examples provided in these sections are simplified and designed to highlight the required changes you must make to use the library. For end-to-end, runnable notebook examples that demonstrate how to use a TensorFlow or PyTorch training script with the SageMaker model parallelism library, see [Amazon SageMaker AI model parallelism library v2 examples](distributed-model-parallel-v2-examples.md).

**Topics**
+ [Split the model of your training script using the SageMaker model parallelism library](#model-parallel-model-splitting-using-smp-lib)
+ [Modify a TensorFlow training script](model-parallel-customize-training-script-tf.md)
+ [Modify a PyTorch Training Script](model-parallel-customize-training-script-pt.md)

## Split the model of your training script using the SageMaker model parallelism library
<a name="model-parallel-model-splitting-using-smp-lib"></a>

There are two ways to modify your training script to set up model splitting: automated splitting or manual splitting.

### Automated model splitting
<a name="model-parallel-automated-model-splitting"></a>

When you use SageMaker's model parallelism library, you can take advantage of *automated model splitting*, also referred to as *automated model partitioning*. The library uses a partitioning algorithm that balances memory, minimizes communication between devices, and optimizes performance. You can configure the automated partitioning algorithm to optimize for speed or memory. 

Alternatively, you can use manual model splitting. We recommend automated model splitting, unless you are very familiar with the model architecture and have a good idea of how to efficiently partition your model.

#### How it works
<a name="model-parallel-automated-model-splitting-how-it-works"></a>

Auto-partitioning occurs during the first training step, when the `smp.step`-decorated function is first called. During this call, the library first constructs a version of the model on the CPU RAM (to avoid GPU memory limitations), and then analyzes the model graph and makes a partitioning decision. Based on this decision, each model partition is loaded on a GPU, and only then the first step is executed. Because of these analysis and partitioning steps, the first training step might take longer. 

In either framework, the library manages the communication between devices through its own backend, which is optimized for AWS infrastructure.

The auto-partition design adapts to the characteristics of the framework, and the library does the partitioning at the granularity level that is most natural in each framework. For instance, in TensorFlow, each specific operation can be assigned to a different device, whereas in PyTorch, the assignment is done at the module level, where each module consists of multiple operations. The following sections review the specifics of the design in each framework.

##### Automated model splitting with PyTorch
<a name="model-parallel-auto-model-split-pt"></a>

During the first training step, the model parallelism library internally runs a tracing step that is meant to construct the model graph and determine the tensor and parameter shapes. After this tracing step, the library constructs a tree, which consists of the nested `nn.Module` objects in the model, as well as additional data gathered from tracing, such as the amount of stored `nn.Parameters`, and execution time for each `nn.Module`. 

Next, the library traverses this tree from the root and runs a partitioning algorithm that assigns each `nn.Module` to a device, which balances computational load (measured by module execution time) and memory use (measured by the total stored `nn.Parameter` size and activations). If multiple `nn.Modules` share the same `nn.Parameter`, then these modules are placed on the same device to avoid maintaining multiple versions of the same parameter. Once the partitioning decision is made, the assigned modules and weights are loaded to their devices.
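To make the idea concrete, here is a toy greedy sketch of that kind of cost-balancing assignment. It is purely illustrative and is not the library's actual partitioning algorithm; the module names and the cost model are made up:

```python
# Toy illustration of balancing execution time and parameter memory across
# devices, in the spirit of the partitioning described above (not the real
# algorithm used by the library).

def partition_modules(modules, num_devices, time_weight=0.5):
    """modules: list of (name, exec_time, param_bytes).
    Greedily assigns the costliest modules first to the least-loaded device."""
    loads = [0.0] * num_devices
    assignment = {}

    def cost(m):
        _, exec_time, param_bytes = m
        return time_weight * exec_time + (1 - time_weight) * param_bytes

    for name, exec_time, param_bytes in sorted(modules, key=cost, reverse=True):
        device = min(range(num_devices), key=loads.__getitem__)
        assignment[name] = device
        loads[device] += cost((name, exec_time, param_bytes))
    return assignment

# Hypothetical modules: (name, execution time, parameter size)
modules = [("embed", 1.0, 8.0), ("block0", 4.0, 4.0),
           ("block1", 4.0, 4.0), ("head", 1.0, 2.0)]
print(partition_modules(modules, num_devices=2))
# {'embed': 0, 'block0': 1, 'block1': 1, 'head': 0}
```

A real partitioner must also honor the shared-parameter constraint from the text: modules that share an `nn.Parameter` have to land on the same device.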

For instructions on how to register the `smp.step` decorator to your PyTorch training script, see [Automated splitting with PyTorch](model-parallel-customize-training-script-pt.md#model-parallel-customize-training-script-pt-16).

##### Automated model splitting with TensorFlow
<a name="model-parallel-auto-model-split-tf"></a>

The model parallelism library analyzes the sizes of the trainable variables and the graph structure, and internally uses a graph partitioning algorithm. This algorithm comes up with a device assignment for each operation, with the objective of minimizing the amount of communication needed across devices, subject to two constraints: 
+ Balancing the number of variables stored in each device
+ Balancing the number of operations executed in each device

If you specify `speed` for `optimize` (in the model parallelism parameters in the Python SDK), the library tries to balance the number of operations and `tf.Variable` objects in each device. Otherwise, it tries to balance the total size of the `tf.Variable` objects.

Once the partitioning decision is made, the library creates a serialized representation of the subgraph that each device needs to execute and imports them onto each device. While partitioning, the library places operations that consume the same `tf.Variable` and operations that are part of the same Keras layer onto the same device. It also respects the colocation constraints imposed by TensorFlow. This means that, for example, if there are two Keras layers that share a `tf.Variable`, then all operations that are part of these layers are placed on a single device.

For instructions on how to register the `smp.step` decorator in your TensorFlow training script, see [Automated splitting with TensorFlow](model-parallel-customize-training-script-tf.md#model-parallel-customize-training-script-tf-23).

##### Comparison of automated model splitting between frameworks
<a name="model-parallel-auto-model-split-comparison"></a>

In TensorFlow, the fundamental unit of computation is a `tf.Operation`, and TensorFlow represents the model as a directed acyclic graph (DAG) of `tf.Operation`s. The model parallelism library therefore partitions this DAG so that each node goes to one device. Crucially, `tf.Operation` objects are sufficiently rich with customizable attributes, and they are universal in the sense that every model is guaranteed to consist of a graph of such objects. 

PyTorch, on the other hand, does not have an equivalent notion of an operation that is sufficiently rich and universal. The closest unit of computation in PyTorch that has these characteristics is an `nn.Module`, which is at a much higher granularity level, which is why the library does partitioning at this level in PyTorch.

### Manual Model Splitting
<a name="model-parallel-manual-model-splitting"></a>

If you want to manually specify how to partition your model across devices, use the `smp.partition` context manager. For instructions on how to set the context manager for manual partitioning, see the following pages.
+ [Manual splitting with TensorFlow](model-parallel-customize-training-script-tf.md#model-parallel-customize-training-script-tf-manual)
+ [Manual splitting with PyTorch](model-parallel-customize-training-script-pt.md#model-parallel-customize-training-script-pt-16-hvd)

To use this option after making modifications, in Step 2, you'll need to set `auto_partition` to `False`, and define a `default_partition` in the framework estimator class of the SageMaker Python SDK. Any operation that is not explicitly placed on a partition through the `smp.partition` context manager is executed on the `default_partition`. In this case, the automated splitting logic is bypassed, and each operation is placed based on your specification. Based on the resulting graph structure, the model parallelism library creates a pipelined execution schedule automatically.
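For illustration only, the estimator-side parameters for the manual-splitting case might look like the following sketch (a plain dictionary with example values; see Step 2 for the full estimator setup):

```python
# Illustrative "modelparallel" parameters for manual splitting.
# "auto_partition" and "default_partition" are the knobs described above;
# the other values are examples only.
smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,          # number of model partitions
        "auto_partition": False,  # bypass the automated splitting logic
        "default_partition": 0,   # where ops outside smp.partition contexts run
        "microbatches": 4,
    },
}
print(smp_options["parameters"]["default_partition"])  # 0
```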

# Modify a TensorFlow training script
<a name="model-parallel-customize-training-script-tf"></a>

In this section, you learn how to modify TensorFlow training scripts to configure the SageMaker model parallelism library for auto-partitioning and manual partitioning. This selection of examples also includes an example integrated with Horovod for hybrid model and data parallelism.

**Note**  
To find which TensorFlow versions are supported by the library, see [Supported Frameworks and AWS Regions](distributed-model-parallel-support.md).

The required modifications you must make to your training script to use the library are listed in [Automated splitting with TensorFlow](#model-parallel-customize-training-script-tf-23).

To learn how to modify your training script to use hybrid model and data parallelism with Horovod, see [Automated splitting with TensorFlow and Horovod for hybrid model and data parallelism](#model-parallel-customize-training-script-tf-2.3).

If you want to use manual partitioning, also review [Manual splitting with TensorFlow](#model-parallel-customize-training-script-tf-manual). 

The following topics show examples of training scripts that you can use to configure SageMaker's model parallelism library for auto-partitioning and manual partitioning of TensorFlow models. 

**Note**  
Auto-partitioning is enabled by default. Unless otherwise specified, the example scripts use auto-partitioning.

**Topics**
+ [Automated splitting with TensorFlow](#model-parallel-customize-training-script-tf-23)
+ [Automated splitting with TensorFlow and Horovod for hybrid model and data parallelism](#model-parallel-customize-training-script-tf-2.3)
+ [Manual splitting with TensorFlow](#model-parallel-customize-training-script-tf-manual)
+ [Unsupported framework features](#model-parallel-tf-unsupported-features)

## Automated splitting with TensorFlow
<a name="model-parallel-customize-training-script-tf-23"></a>

The following training script changes are required to run a TensorFlow model with SageMaker's model parallelism library:

1. Import and initialize the library with [smp.init](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#smp.init).

1. Define a Keras model by inheriting from [smp.DistributedModel](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_tensorflow.html) instead of the Keras `Model` class. Return the model outputs from the `call` method of the `smp.DistributedModel` object. Be mindful that any tensors returned from the `call` method are broadcast across model-parallel devices, incurring communication overhead, so any tensors that are not needed outside the `call` method (such as intermediate activations) should not be returned.

1. Set `drop_remainder=True` in the `tf.data.Dataset.batch()` method. This ensures that the batch size is always divisible by the number of microbatches.

1. Seed the random operations in the data pipeline using `smp.dp_rank()`, for example `shuffle(ds, seed=smp.dp_rank())`, to ensure consistency of data samples across GPUs that hold different model partitions.

1. Put the forward and backward logic in a step function and decorate it with `smp.step`.

1. Perform post-processing on the outputs across microbatches using [StepOutput](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#StepOutput) methods such as `reduce_mean`. The `smp.step`-decorated function must have a return value that depends on the output of `smp.DistributedModel`.

1. If there is an evaluation step, similarly place the forward logic inside an `smp.step`-decorated function and post-process the outputs using [`StepOutput` API](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#StepOutput).

To learn more about the SageMaker model parallelism library API, refer to the [API documentation](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html). 
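The divisibility requirement from step 3 is easy to check with plain arithmetic. The following sketch (with illustrative batch sizes) shows why each batch must split evenly into microbatches:

```python
# Toy check of the batch-size / microbatch divisibility rule described above.

def microbatch_size(batch_size: int, microbatches: int) -> int:
    """Each batch is split evenly into microbatches for pipelined execution,
    so the batch size must be divisible by the microbatch count."""
    if batch_size % microbatches != 0:
        raise ValueError(
            f"batch size {batch_size} is not divisible by {microbatches} "
            "microbatches; set drop_remainder=True so no partial batch "
            "reaches the smp.step function"
        )
    return batch_size // microbatches

print(microbatch_size(256, 4))  # 64
```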

The following Python script is an example of a training script after the changes are made.

```
import tensorflow as tf

# smdistributed: Import TF2.x API
import smdistributed.modelparallel.tensorflow as smp

# smdistributed: Initialize
smp.init()

# Download and load MNIST dataset.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(
    "MNIST-data-%d" % smp.rank()
)
x_train, x_test = x_train / 255.0, x_test / 255.0

# Add a channels dimension
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]

# smdistributed: If needed, seed the shuffle with smp.dp_rank(), and drop_remainder
# in batching to make sure batch size is always divisible by number of microbatches
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(10000, seed=smp.dp_rank())
    .batch(256, drop_remainder=True)
)

# smdistributed: Define smp.DistributedModel the same way as Keras sub-classing API 
class MyModel(smp.DistributedModel):
    def __init__(self):
        super(MyModel, self).__init__()
        # define layers

    def call(self, x, training=None):
        # define forward pass and return the model output

model = MyModel()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")

# smdistributed: Define smp.step. Return any tensors needed outside
@smp.step
def get_grads(images, labels):
    predictions = model(images, training=True)
    loss = loss_object(labels, predictions)

    grads = optimizer.get_gradients(loss, model.trainable_variables)
    return grads, loss, predictions


@tf.function
def train_step(images, labels):
    gradients, loss, predictions = get_grads(images, labels)

    # smdistributed: Accumulate the gradients across microbatches
    gradients = [g.accumulate() for g in gradients]
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # smdistributed: Merge predictions and average losses across microbatches
    train_accuracy(labels, predictions.merge())
    return loss.reduce_mean()


for epoch in range(5):
    # Reset the metrics at the start of the next epoch
    train_accuracy.reset_states()
    for images, labels in train_ds:
        loss = train_step(images, labels)
    accuracy = train_accuracy.result()
```

If you are done preparing your training script, proceed to [Step 2: Launch a Training Job Using the SageMaker Python SDK](model-parallel-sm-sdk.md). If you want to run a hybrid model and data parallel training job, continue to the next section.

## Automated splitting with TensorFlow and Horovod for hybrid model and data parallelism
<a name="model-parallel-customize-training-script-tf-2.3"></a>

You can use the SageMaker model parallelism library with Horovod for hybrid model and data parallelism. To read more about how the library splits a model for hybrid parallelism, see [Pipeline parallelism (available for PyTorch and TensorFlow)](model-parallel-intro.md#model-parallel-intro-pp).

This section focuses on how to modify your training script to use the SageMaker model parallelism library with Horovod.

To properly set up your training script to pick up the hybrid parallelism configuration that you'll set in [Step 2: Launch a Training Job Using the SageMaker Python SDK](model-parallel-sm-sdk.md), use the library's helper functions, `smp.dp_rank()` and `smp.mp_rank()`, which automatically detect the data parallel rank and model parallel rank respectively. 

To find all MPI primitives the library supports, see [MPI Basics](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#mpi-basics) in the SageMaker Python SDK documentation. 

The required changes needed in the script are:
+ Adding `hvd.allreduce`
+ Broadcasting variables after the first batch, as required by Horovod
+ Seeding shuffling and/or sharding operations in the data pipeline with `smp.dp_rank()`.

**Note**  
When you use Horovod, you must not directly call `hvd.init` in your training script. Instead, you'll have to set `"horovod"` to `True` in the SageMaker Python SDK `modelparallel` parameters in [Step 2: Launch a Training Job Using the SageMaker Python SDK](model-parallel-sm-sdk.md). This allows the library to internally initialize Horovod based on the device assignments of model partitions. Calling `hvd.init()` directly in your training script can cause problems.

**Note**  
Using the `hvd.DistributedOptimizer` API directly in your training script might result in poor training performance and speed, because the API implicitly places the `AllReduce` operation inside `smp.step`. We recommend using the model parallelism library with Horovod by directly calling `hvd.allreduce` after calling `accumulate()` or `reduce_mean()` on the gradients returned from `smp.step`, as shown in the following example.
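As a sketch of the configuration side of the first note above, Horovod is enabled through the `modelparallel` parameters rather than by calling `hvd.init()` yourself. The parameter values here are illustrative:

```python
# Illustrative "modelparallel" parameters for hybrid parallelism with Horovod.
# Setting "horovod": True lets the library initialize Horovod internally based
# on the partition placement, so the training script must not call hvd.init().
smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,    # example value
        "microbatches": 4,  # example value
        "horovod": True,
    },
}
print(smp_options["parameters"]["horovod"])  # True
```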

To learn more about the SageMaker model parallelism library API, refer to the [API documentation](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html).

```
import tensorflow as tf
import horovod.tensorflow as hvd

# smdistributed: Import TF2.x API 
import smdistributed.modelparallel.tensorflow as smp

# smdistributed: Initialize
smp.init()

# Download and load MNIST dataset.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(
    "MNIST-data-%d" % smp.rank()
)
x_train, x_test = x_train / 255.0, x_test / 255.0

# Add a channels dimension
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]

# smdistributed: Seed the shuffle with smp.dp_rank(), and drop_remainder
# in batching to make sure batch size is always divisible by number of microbatches
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(10000, seed=smp.dp_rank())
    .batch(256, drop_remainder=True)
)

# smdistributed: Define smp.DistributedModel the same way as Keras sub-classing API 
class MyModel(smp.DistributedModel):
    def __init__(self):
        super(MyModel, self).__init__()
        # define layers

    def call(self, x, training=None):
        # define forward pass and return model outputs


model = MyModel()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")

# smdistributed: Define smp.step. Return any tensors needed outside
@smp.step
def get_grads(images, labels):
    predictions = model(images, training=True)
    loss = loss_object(labels, predictions)

    grads = optimizer.get_gradients(loss, model.trainable_variables)
    return grads, loss, predictions


@tf.function
def train_step(images, labels, first_batch):
    gradients, loss, predictions = get_grads(images, labels)

    # smdistributed: Accumulate the gradients across microbatches
    # Horovod: AllReduce the accumulated gradients
    gradients = [hvd.allreduce(g.accumulate()) for g in gradients]
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # Horovod: Broadcast the variables after first batch 
    if first_batch:
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(optimizer.variables(), root_rank=0)

    # smdistributed: Merge predictions across microbatches
    train_accuracy(labels, predictions.merge())
    return loss.reduce_mean()


for epoch in range(5):
    # Reset the metrics at the start of the next epoch
    train_accuracy.reset_states()

    for batch, (images, labels) in enumerate(train_ds):
        loss = train_step(images, labels, tf.constant(batch == 0))
```

## Manual splitting with TensorFlow
<a name="model-parallel-customize-training-script-tf-manual"></a>

Use `smp.partition` context managers to place operations on a specific partition. Any operation not placed in an `smp.partition` context is placed on the `default_partition`. To learn more about the SageMaker model parallelism library API, refer to the [API documentation](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html). 

```
import tensorflow as tf

# smdistributed: Import TF2.x API.
import smdistributed.modelparallel.tensorflow as smp

# smdistributed: Initialize
smp.init()

# Download and load MNIST dataset.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(
    "MNIST-data-%d" % smp.rank()
)
x_train, x_test = x_train / 255.0, x_test / 255.0

# Add a channels dimension
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]

# smdistributed: If needed, seed the shuffle with smp.dp_rank(), and drop_remainder
# in batching to make sure batch size is always divisible by number of microbatches.
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(10000, seed=smp.dp_rank())
    .batch(256, drop_remainder=True)
)

# smdistributed: Define smp.DistributedModel the same way as Keras sub-classing API.
class MyModel(smp.DistributedModel):
    def __init__(self):
        # define layers

    def call(self, x):
        with smp.partition(0):
            x = self.layer0(x)
        with smp.partition(1):
            return self.layer1(x)


model = MyModel()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")

# smdistributed: Define smp.step. Return any tensors needed outside
@smp.step
def get_grads(images, labels):
    predictions = model(images, training=True)
    loss = loss_object(labels, predictions)

    grads = optimizer.get_gradients(loss, model.trainable_variables)
    return grads, loss, predictions


@tf.function
def train_step(images, labels):
    gradients, loss, predictions = get_grads(images, labels)

    # smdistributed: Accumulate the gradients across microbatches
    gradients = [g.accumulate() for g in gradients]
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # smdistributed: Merge predictions and average losses across microbatches
    train_accuracy(labels, predictions.merge())
    return loss.reduce_mean()


for epoch in range(5):
    # Reset the metrics at the start of the next epoch
    train_accuracy.reset_states()
    for images, labels in train_ds:
        loss = train_step(images, labels)
    accuracy = train_accuracy.result()
```

## Unsupported framework features
<a name="model-parallel-tf-unsupported-features"></a>

The following TensorFlow features are not supported by the library:
+ `tf.GradientTape()` is currently not supported. You can use `Optimizer.get_gradients()` or `Optimizer.compute_gradients()` instead to compute gradients.
+ The `tf.train.Checkpoint.restore()` API is currently not supported. For checkpointing, use `smp.CheckpointManager` instead, which provides the same API and functionality. Note that checkpoint restores with `smp.CheckpointManager` should take place after the first step.

# Modify a PyTorch Training Script
<a name="model-parallel-customize-training-script-pt"></a>

In this section, you learn how to modify PyTorch training scripts to configure the SageMaker model parallelism library for auto-partitioning and manual partitioning.

**Note**  
To find which PyTorch versions are supported by the library, see [Supported Frameworks and AWS Regions](distributed-model-parallel-support.md).

**Tip**  
For end-to-end notebook examples that demonstrate how to use a PyTorch training script with the SageMaker model parallelism library, see [Amazon SageMaker AI model parallelism library v1 examples](distributed-model-parallel-examples.md).

Note that auto-partitioning is enabled by default. Unless otherwise specified, the following scripts use auto-partitioning. 

**Topics**
+ [Automated splitting with PyTorch](#model-parallel-customize-training-script-pt-16)
+ [Manual splitting with PyTorch](#model-parallel-customize-training-script-pt-16-hvd)
+ [Considerations](#model-parallel-pt-considerations)
+ [Unsupported framework features](#model-parallel-pt-unsupported-features)

## Automated splitting with PyTorch
<a name="model-parallel-customize-training-script-pt-16"></a>

The following training script changes are required to run a PyTorch training script with SageMaker's model parallelism library:

1. Import and initialize the library with [smp.init](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#smp.init).

1. Wrap the model with [smp.DistributedModel](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.html#smp.DistributedModel). Be mindful that any tensors returned from the `forward` method of the underlying `nn.Module` object are broadcast across model-parallel devices, incurring communication overhead, so any tensors that are not needed outside the `forward` method (such as intermediate activations) should not be returned.
**Note**  
For FP16 training, you need to use the [smdistributed.modelparallel.torch.model_creation()](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html) context manager to wrap the model. For more information, see [FP16 Training with Model Parallelism](model-parallel-extended-features-pytorch-fp16.md).

1. Wrap the optimizer with [smp.DistributedOptimizer](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.html#smp.DistributedOptimizer).
**Note**  
For FP16 training, you need to set up static or dynamic loss scaling. For more information, see [FP16 Training with Model Parallelism](model-parallel-extended-features-pytorch-fp16.md).

1. Use the returned `DistributedModel` object instead of a user model.

1. Put the forward and backward logic in a step function and decorate it with [smp.step](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#smp.step).

1. Restrict each process to its own device through `torch.cuda.set_device(smp.local_rank())`.

1. Move the input tensors to the GPU using the `.to()` API before the `smp.step` call (see example below).

1. Replace `torch.Tensor.backward` and `torch.autograd.backward` with `DistributedModel.backward`.

1. Perform post-processing on the outputs across microbatches using [StepOutput](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#StepOutput) methods such as `reduce_mean`.

1. If there is an evaluation step, similarly place the forward logic inside an `smp.step`-decorated function and post-process the outputs using [`StepOutput` API](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#StepOutput).

1. Set `drop_last=True` in `DataLoader`. Alternatively, manually skip a batch in the training loop if the batch size is not divisible by the number of microbatches.

To learn more about the SageMaker model parallelism library API, refer to the [API documentation](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html). 

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

class GroupedNet(nn.Module):
    def __init__(self):
        super(GroupedNet, self).__init__()
        # define layers

    def forward(self, x):
        # define forward pass and return model outputs


# smdistributed: Define smp.step. Return any tensors needed outside.
@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)
    return output, loss


def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # smdistributed: Move input tensors to the GPU ID used by the current process,
        # based on the set_device call.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        # Return value, loss_mb is a StepOutput object
        _, loss_mb = train_step(model, data, target)

        # smdistributed: Average the loss across microbatches.
        loss = loss_mb.reduce_mean()

        optimizer.step()

# smdistributed: initialize the backend
smp.init()

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Download only on a single process per instance.
# When this is not present, the file is corrupted by multiple processes trying
# to download and extract at the same time
dataset = datasets.MNIST("../data", train=True, download=False)

# smdistributed: Shard the dataset based on data-parallel ranks
if smp.dp_size() > 1:
    partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())}
    dataset = SplitDataset(dataset, partitions=partitions_dict)
    dataset.select(f"{smp.dp_rank()}")

# smdistributed: Set drop_last=True to ensure that batch size is always divisible
# by the number of microbatches
train_loader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)

model = GroupedNet()
optimizer = optim.Adadelta(model.parameters(), lr=4.0)

# smdistributed: Use the DistributedModel container to provide the model
# to be partitioned across different ranks. For the rest of the script,
# the returned DistributedModel object should be used in place of
# the model provided for DistributedModel class instantiation.
model = smp.DistributedModel(model)
optimizer = smp.DistributedOptimizer(optimizer)

train(model, device, train_loader, optimizer)
```

## Manual splitting with PyTorch
<a name="model-parallel-customize-training-script-pt-16-hvd"></a>

Use [`smp.partition`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.html#smp.partition) context managers to place modules on specific devices. Any module that is not placed in an `smp.partition` context is placed in the `default_partition`. The `default_partition` needs to be provided if `auto_partition` is set to `False`. The modules that are created within a specific `smp.partition` context are placed on the corresponding partition.

To learn more about the SageMaker model parallelism library API, refer to the [API documentation](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html). 

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

class GroupedNet(nn.Module):
    def __init__(self):
        super(GroupedNet, self).__init__()
        with smp.partition(0):
            # define child modules on device 0
            ...
        with smp.partition(1):
            # define child modules on device 1
            ...

    def forward(self, x):
        # define forward pass and return model outputs
        ...


# smdistributed: Define smp.step. Return any tensors needed outside.
@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)
    return output, loss


def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # smdistributed: Move input tensors to the GPU ID used by the current process,
        # based on the set_device call.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        # Return value, loss_mb is a StepOutput object
        _, loss_mb = train_step(model, data, target)

        # smdistributed: Average the loss across microbatches.
        loss = loss_mb.reduce_mean()

        optimizer.step()

# smdistributed: initialize the backend
smp.init()

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Download only on a single process per instance.
# When this is not present, the file is corrupted by multiple processes trying
# to download and extract at the same time
dataset = datasets.MNIST("../data", train=True, download=False)

# smdistributed: Shard the dataset based on data-parallel ranks
if smp.dp_size() > 1:
    partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())}
    dataset = SplitDataset(dataset, partitions=partitions_dict)
    dataset.select(f"{smp.dp_rank()}")

# smdistributed: Set drop_last=True to ensure that batch size is always divisible
# by the number of microbatches
train_loader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)

model = GroupedNet()
optimizer = optim.Adadelta(model.parameters(), lr=4.0)

# smdistributed: Use the DistributedModel container to provide the model
# to be partitioned across different ranks. For the rest of the script,
# the returned DistributedModel object should be used in place of
# the model provided for DistributedModel class instantiation.
model = smp.DistributedModel(model)
optimizer = smp.DistributedOptimizer(optimizer)

train(model, device, train_loader, optimizer)
```
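In the example above, `SplitDataset` shards the dataset with equal fractions per data-parallel rank. As a plain-Python sketch of that contiguous partitioning (no SageMaker dependency; `shard_indices` is an illustrative helper, not a library API, and remainder handling is simplified):

```python
def shard_indices(num_samples, dp_size, dp_rank):
    # Give each data-parallel rank an equal contiguous share of the
    # sample indices, mirroring SplitDataset with partitions
    # {rank: 1/dp_size}. Any remainder samples are dropped in this
    # simplified sketch.
    shard = num_samples // dp_size
    start = dp_rank * shard
    return list(range(start, start + shard))
```

Each rank then iterates only over its own shard, so every global batch is composed of disjoint per-replica batches.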

## Considerations
<a name="model-parallel-pt-considerations"></a>

When you configure a PyTorch training script using SageMaker's model parallelism library, you should be aware of the following:
+ If you are using an optimization technique that relies on global gradient norms (for example, the gradient norm of the entire model), such as some variants of the LAMB optimizer or global gradient clipping, you need to gather all the norms across the model partitions for correctness. You can use the library's basic communication data types to do this.
+ All `torch.Tensor` arguments to the forward methods of the `nn.Modules` in your model must be used in the computation of the module output. In other words, the library does not support the case where there is a `torch.Tensor` argument to a module on which the module output does not depend.
+ The argument to the `smp.DistributedModel.backward()` call must depend on all model outputs. In other words, there cannot be an output from the `smp.DistributedModel.forward` call that is not used in the computation of the tensor that is fed into the `smp.DistributedModel.backward` call.
+ If there are `torch.cuda.synchronize()` calls in your code, you might need to call `torch.cuda.set_device(smp.local_rank())` immediately before the synchronize call. Otherwise unnecessary CUDA contexts might be created in device 0, which will needlessly consume memory.
+ Since the library places `nn.Modules` on different devices, the modules in the model must not depend on any global state that is modified inside `smp.step`. Any state that remains fixed throughout training, or that is modified outside `smp.step` in a way that is visible to all processes, is allowed.
+ You don’t need to move the model to GPU (for example, using `model.to(device)`) when using the library. If you try to move the model to GPU before the model is partitioned (before the first `smp.step` call), the move call is ignored. The library automatically moves the part of the model assigned to a rank to its GPU. Once training with the library starts, don’t move the model to CPU and use it, as it won’t have correct parameters for modules not assigned to the partition held by the process. If you want to retrain a model or use it for inference without the library after it was trained using the model parallelism library, the recommended way is to save the full model using our checkpointing API and load it back to a regular PyTorch Module.
+ If you have a list of modules such that output of one feeds into another, replacing that list with `nn.Sequential` can significantly improve performance.
+ The weight update (`optimizer.step()`) needs to happen outside of `smp.step` because that is when the entire backward pass is done and gradients are ready. When using a hybrid model with model and data parallelism, at this point, AllReduce of gradients is also guaranteed to finish.
+ When using the library in combination with data parallelism, make sure that the number of batches on all data parallel ranks is the same so that AllReduce does not hang waiting for a rank which is not participating in the step.
+ If you launch a training job using an ml.p4d instance type (such as ml.p4d.24xlarge), you must set the data loader variable `num_workers=0`. For example, you may define your `DataLoader` as follows:

  ```
  dataloader = torch.utils.data.DataLoader(
              data,
              batch_size=batch_size,
              num_workers=0,
              pin_memory=True,
              drop_last=True,
              shuffle=shuffle,
          )
  ```
+ The inputs to `smp.step` must be the model inputs generated by `DataLoader`. This is because `smp.step` internally splits the input tensors along the batch dimension and pipelines them. This means that passing `DataLoader` itself to the `smp.step` function to generate the model inputs inside does not work. 

  For example, if you define a `DataLoader` as follows:

  ```
  train_loader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)
  ```

  You should access the model inputs generated by `train_loader` and pass those to an `smp.step` decorated function. Do not pass `train_loader` directly to the `smp.step` function.

  ```
  def train(model, device, train_loader, optimizer):
      model.train()
      for batch_idx, (data, target) in enumerate(train_loader):
          ...
          _, loss_mb = train_step(model, data, target)
          ...
  
  @smp.step
  def train_step(model, data, target):
      ...
      return output, loss
  ```
+ The input tensors to `smp.step` must be moved to the current device using the `.to()` API, which must take place after the `torch.cuda.set_device(smp.local_rank())` call.

  For example, you may define the `train` function as follows. This function moves `data` and `target` to the current device using the `.to()` API before using those input tensors to call `train_step`.

  ```
  def train(model, device, train_loader, optimizer):
      model.train()
      for batch_idx, (data, target) in enumerate(train_loader):
          # smdistributed: Move input tensors to the GPU ID used by the current process,
          # based on the set_device call.
          data, target = data.to(device), target.to(device)
          optimizer.zero_grad()
          # Return value, loss_mb is a StepOutput object
          _, loss_mb = train_step(model, data, target)
  
          # smdistributed: Average the loss across microbatches.
          loss = loss_mb.reduce_mean()
  
          optimizer.step()
  ```

  The input tensors to this `smp.step` decorated function have been moved to the current device in the `train` function above. The model does *not* need to be moved to the current device. The library automatically moves the part of the model assigned to a rank to its GPU.

  ```
  @smp.step
  def train_step(model, data, target):
      output = model(data)
      loss = F.nll_loss(output, target, reduction="mean")
      model.backward(loss)
      return output, loss
  ```
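Several of the considerations above come down to how an `smp.step`-decorated function splits each input batch along the batch dimension into microbatches for pipelining. The following plain-Python sketch (no SageMaker dependency; `split_into_microbatches` is an illustrative helper, not a library API) shows the splitting and why `drop_last=True` matters: the batch size must stay divisible by the number of microbatches.

```python
def split_into_microbatches(batch, num_microbatches):
    # Conceptually mirrors how an smp.step-decorated function splits its
    # tensor inputs along the batch dimension before pipelining them.
    if len(batch) % num_microbatches != 0:
        raise ValueError(
            "batch size must be divisible by the number of microbatches; "
            "use drop_last=True in the DataLoader to guarantee this"
        )
    mb_size = len(batch) // num_microbatches
    return [batch[i * mb_size:(i + 1) * mb_size] for i in range(num_microbatches)]
```

With a batch size of 64 and 4 microbatches, each pipeline stage processes 4 slices of 16 records; a final partial batch of, say, 62 records would fail the divisibility check, which is why the examples set `drop_last=True`.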

## Unsupported framework features
<a name="model-parallel-pt-unsupported-features"></a>

The following PyTorch features are unsupported by SageMaker's model parallelism library:
+ If you use data parallelism with the native [PyTorch DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html), the [`torch.nn.parallel.DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) wrapper module is not supported by the library. The library internally manages integrating with PyTorch DDP, including parameter broadcast and gradient AllReduce. When using the library, module buffers are only broadcast once at the start of training. If your model has module buffers that need to be synchronized across data parallel groups at each step, you can do so through the `torch.distributed` API, using the process group that can be obtained via `smp.get_dp_process_group()`.
+ For mixed precision training, the `apex.amp` module is not supported. The recommended way to use the library with automatic mixed-precision is to use `torch.cuda.amp`, with the exception of using `smp.amp.GradScaler` instead of the implementation in torch.
+ `torch.jit.ScriptModules` or `ScriptFunctions` are not supported by `smp.DistributedModel`.
+ `apex` : `FusedLayerNorm`, `FusedAdam`, `FusedLAMB`, and `FusedNovoGrad` from `apex` are not supported. You can use the library implementations of these through `smp.optimizers` and `smp.nn` APIs instead.

# Step 2: Launch a Training Job Using the SageMaker Python SDK
<a name="model-parallel-sm-sdk"></a>

The SageMaker Python SDK supports managed training of models with ML frameworks such as TensorFlow and PyTorch. To launch a training job using one of these frameworks, you define a SageMaker [TensorFlow estimator](https://sagemaker.readthedocs.io/en/v2.199.0/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator), a SageMaker [PyTorch estimator](https://sagemaker.readthedocs.io/en/v2.199.0/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator), or a SageMaker generic [Estimator](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/estimators.html#sagemaker.estimator.Estimator) to use the modified training script and model parallelism configuration.

**Topics**
+ [Using the SageMaker TensorFlow and PyTorch Estimators](#model-parallel-using-sagemaker-pysdk)
+ [Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel Library](#model-parallel-customize-container)
+ [Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library](#model-parallel-bring-your-own-container)

## Using the SageMaker TensorFlow and PyTorch Estimators
<a name="model-parallel-using-sagemaker-pysdk"></a>

The TensorFlow and PyTorch estimator classes contain the `distribution` parameter, which you can use to specify configuration parameters for using distributed training frameworks. The SageMaker model parallel library internally uses MPI for hybrid data and model parallelism, so you must use the MPI option with the library.

The following template of a TensorFlow or PyTorch estimator shows how to configure the `distribution` parameter for using the SageMaker model parallel library with MPI.

------
#### [ Using the SageMaker TensorFlow estimator ]

```
import sagemaker
from sagemaker.tensorflow import TensorFlow

smp_options = {
    "enabled":True,              # Required
    "parameters": {
        "partitions": 2,         # Required
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "horovod": True,         # Use this for hybrid model and data parallelism
    }
}

mpi_options = {
    "enabled" : True,            # Required
    "processes_per_host" : 8,    # Required
    # "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none"
}

smd_mp_estimator = TensorFlow(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='2.6.3',
    py_version='py38',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

------
#### [ Using the SageMaker PyTorch estimator ]

```
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled":True,
    "parameters": {                        # Required
        "pipeline_parallel_degree": 2,     # Required
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "ddp": True,
    }
}

mpi_options = {
    "enabled" : True,                      # Required
    "processes_per_host" : 8,              # Required
    # "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none"
}

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='1.13.1',
    py_version='py38',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

------

To enable the library, you need to pass configuration dictionaries to the `"smdistributed"` and `"mpi"` keys through the `distribution` argument of the SageMaker estimator constructors.

**Configuration parameters for SageMaker model parallelism**
+ For the `"smdistributed"` key, pass a dictionary with the `"modelparallel"` key and the following inner dictionaries. 
**Note**  
Using `"modelparallel"` and `"dataparallel"` in one training job is not supported. 
  + `"enabled"` – Required. To enable model parallelism, set `"enabled": True`.
  + `"parameters"` – Required. Specify a set of parameters for SageMaker model parallelism.
    + For a complete list of common parameters, see [Parameters for `smdistributed`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#smdistributed-parameters) in the *SageMaker Python SDK documentation*.

      For TensorFlow, see [TensorFlow-specific Parameters](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#tensorflow-specific-parameters).

      For PyTorch, see [PyTorch-specific Parameters](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#pytorch-specific-parameters).
    + `"pipeline_parallel_degree"` (or `"partitions"` in `smdistributed-modelparallel<v1.6.0`) – Required. Among the [parameters for `smdistributed`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#smdistributed-parameters), this parameter is required to specify how many model partitions you want to split into.
**Important**  
There is a breaking change in the parameter name. The `"pipeline_parallel_degree"` parameter replaces `"partitions"` starting with `smdistributed-modelparallel` v1.6.0. For more information, see [Common Parameters](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#common-parameters) for SageMaker model parallelism configuration and [SageMaker Distributed Model Parallel Release Notes](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.html) in the *SageMaker Python SDK documentation*.
+ For the `"mpi"` key, pass a dictionary that contains the following:
  + `"enabled"` – Required. Set `True` to launch the distributed training job with MPI.
  + `"processes_per_host"` – Required. Specify the number of processes MPI should launch on each host. In SageMaker AI, a host is a single Amazon EC2 ML instance. The SageMaker Python SDK maintains a one-to-one mapping between processes and GPUs across model and data parallelism. This means that SageMaker AI schedules each process on a single, separate GPU and no GPU contains more than one process. If you are using PyTorch, you must restrict each process to its own device through `torch.cuda.set_device(smp.local_rank())`. To learn more, see [Automated splitting with PyTorch](model-parallel-customize-training-script-pt.md#model-parallel-customize-training-script-pt-16).
**Important**  
 `processes_per_host` *must* not be greater than the number of GPUs per instance and is typically equal to the number of GPUs per instance.
  + `"custom_mpi_options"` (optional) – Use this key to pass any custom MPI options you might need. If you do not pass any MPI custom options to the key, the MPI option is set by default to the following flag.

    ```
    --mca btl_vader_single_copy_mechanism none
    ```
**Note**  
You do not need to explicitly specify this default flag to the key. If you explicitly specify it, your distributed model parallel training job might fail with the following error:  

    ```
    The following MCA parameter has been listed multiple times on the command line: 
    MCA param: btl_vader_single_copy_mechanism MCA parameters can only be listed once 
    on a command line to ensure there is no ambiguity as to its value. 
    Please correct the situation and try again.
    ```
**Tip**  
If you launch a training job using an EFA-enabled instance type, such as `ml.p4d.24xlarge` and `ml.p3dn.24xlarge`, use the following flag for best performance:  

    ```
    -x FI_EFA_USE_DEVICE_RDMA=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1
    ```
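Putting the parameters above together, a small sanity check before launching can catch common mistakes such as setting `processes_per_host` above the GPU count. The following plain-Python sketch assembles the `distribution` dictionary (`build_distribution` is an illustrative helper, not part of the SageMaker Python SDK, and its validation is simplified):

```python
def build_distribution(pipeline_parallel_degree, processes_per_host,
                       gpus_per_instance, custom_mpi_options=None):
    # Assemble the `distribution` argument for a SageMaker estimator.
    # Illustrative helper; validation rules here are simplified.
    if processes_per_host > gpus_per_instance:
        raise ValueError(
            "processes_per_host must not exceed the number of GPUs per instance"
        )
    mpi = {"enabled": True, "processes_per_host": processes_per_host}
    if custom_mpi_options:
        mpi["custom_mpi_options"] = custom_mpi_options
    return {
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "pipeline_parallel_degree": pipeline_parallel_degree,
                },
            }
        },
        "mpi": mpi,
    }
```

The returned dictionary can then be passed as `distribution=...` to a `TensorFlow` or `PyTorch` estimator, matching the structure of the templates shown earlier.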

To launch the training job using the estimator and your SageMaker model parallel configured training script, run the `estimator.fit()` function.

Use the following resources to learn more about using the model parallelism features in the SageMaker Python SDK:
+ [Use TensorFlow with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/v2.199.0/frameworks/tensorflow/using_tf.html)
+ [Use PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/v2.199.0/frameworks/pytorch/using_pytorch.html)
+ We recommend that you use a SageMaker notebook instance if you are a new user. To see an example of how you can launch a training job using a SageMaker notebook instance, see [Amazon SageMaker AI model parallelism library v2 examples](distributed-model-parallel-v2-examples.md).
+ You can also submit a distributed training job from your machine using the AWS CLI. To set up the AWS CLI on your machine, see [Set up your AWS credentials and Region for development](https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html).

## Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel Library
<a name="model-parallel-customize-container"></a>

To extend a pre-built container and use SageMaker's model parallelism library, you must use one of the available AWS Deep Learning Containers (DLC) images for PyTorch or TensorFlow. The SageMaker model parallelism library is included in the TensorFlow (2.3.0 and later) and PyTorch (1.6.0 and later) DLC images with CUDA (`cuxyz`). For a complete list of DLC images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) in the *AWS Deep Learning Containers GitHub repository*.

**Tip**  
We recommend that you use the image that contains the latest version of TensorFlow or PyTorch to access the most up-to-date version of the SageMaker model parallelism library.

For example, your Dockerfile should contain a `FROM` statement similar to the following:

```
# Use the SageMaker DLC image URI for TensorFlow or PyTorch
FROM aws-dlc-account-id.dkr.ecr.aws-region.amazonaws.com/framework-training:{framework-version-tag}

# Add your dependencies here
RUN ...

ENV PATH="/opt/ml/code:${PATH}"

# this environment variable is used by the SageMaker AI container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
```

Additionally, when you define a PyTorch or TensorFlow estimator, you must specify the `entry_point` for your training script. This should be the same path identified with `ENV SAGEMAKER_SUBMIT_DIRECTORY` in your Dockerfile. 

**Tip**  
You must push this Docker container to Amazon Elastic Container Registry (Amazon ECR) and use the image URI (`image_uri`) to define a SageMaker estimator for training. For more information, see [Extend a Pre-built Container](prebuilt-containers-extend.md). 

After you finish hosting the Docker container and retrieving the image URI of the container, create a SageMaker generic `Estimator` object as follows. This example assumes that you have already defined `smp_options` and `mpi_options`. 

```
smd_mp_estimator = Estimator(
    entry_point="your_training_script.py",
    role=sagemaker.get_execution_role(),
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    image_uri='your_aws_account_id.dkr.ecr.region.amazonaws.com/name:tag',
    instance_count=1,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

## Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library
<a name="model-parallel-bring-your-own-container"></a>

To build your own Docker container for training and use the SageMaker model parallel library, you must include the correct dependencies and the binary files of the SageMaker distributed parallel libraries in your Dockerfile. This section provides the minimum set of code blocks you must include to properly prepare a SageMaker training environment and the model parallel library in your own Docker container.

**Note**  
This custom Docker option with the SageMaker model parallel library as a binary is available only for PyTorch.

**To create a Dockerfile with the SageMaker training toolkit and the model parallel library**

1. Start with one of the [NVIDIA CUDA base images](https://hub.docker.com/r/nvidia/cuda).

   ```
   FROM <cuda-cudnn-base-image>
   ```
**Tip**  
The official AWS Deep Learning Container (DLC) images are built from the [NVIDIA CUDA base images](https://hub.docker.com/r/nvidia/cuda). We recommend you look into the [official Dockerfiles of AWS Deep Learning Container for PyTorch](https://github.com/aws/deep-learning-containers/tree/master/pytorch/training/docker) to find which versions of the libraries you need to install and how to configure them. The official Dockerfiles are complete, benchmark tested, and managed by the SageMaker and Deep Learning Container service teams. In the provided link, choose the PyTorch version you use, choose the CUDA (`cuxyz`) folder, and choose the Dockerfile ending with `.gpu` or `.sagemaker.gpu`.

1. To set up a distributed training environment, you need to install software for communication and network devices, such as [Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html), [NVIDIA Collective Communications Library (NCCL)](https://developer.nvidia.com/nccl), and [Open MPI](https://www.open-mpi.org/). Depending on the PyTorch and CUDA versions you choose, you must install compatible versions of the libraries.
**Important**  
Because the SageMaker model parallel library requires the SageMaker data parallel library in the subsequent steps, we highly recommend that you follow the instructions at [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md) to properly set up a SageMaker training environment for distributed training.

   For more information about setting up EFA with NCCL and Open MPI, see [Get started with EFA and MPI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html) and [Get started with EFA and NCCL](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl.html).

1. Add the following arguments to specify the URLs of the SageMaker distributed training packages for PyTorch. The SageMaker model parallel library requires the SageMaker data parallel library to use cross-node Remote Direct Memory Access (RDMA).

   ```
   ARG SMD_MODEL_PARALLEL_URL=https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-02-21-19-26/smdistributed_modelparallel-1.7.0-cp38-cp38-linux_x86_64.whl
   ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.10.2/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
   ```

1. Install dependencies that the SageMaker model parallel library requires.

   1. Install the [METIS](http://glaros.dtc.umn.edu/gkhome/metis/metis/overview) library.

      ```
      ARG METIS=metis-5.1.0
      
      RUN rm /etc/apt/sources.list.d/* \
        && wget -nv http://glaros.dtc.umn.edu/gkhome/fetch/sw/metis/${METIS}.tar.gz \
        && gunzip -f ${METIS}.tar.gz \
        && tar -xvf ${METIS}.tar \
        && cd ${METIS} \
        && apt-get update \
        && make config shared=1 \
        && make install \
        && cd .. \
        && rm -rf ${METIS}.tar* \
        && rm -rf ${METIS} \
        && rm -rf /var/lib/apt/lists/* \
        && apt-get clean
      ```

   1. Install the [RAPIDS Memory Manager library](https://github.com/rapidsai/rmm#rmm-rapids-memory-manager). This requires [CMake](https://cmake.org/) 3.14 or later.

      ```
      ARG RMM_VERSION=0.15.0
      
      RUN  wget -nv https://github.com/rapidsai/rmm/archive/v${RMM_VERSION}.tar.gz \
        && tar -xvf v${RMM_VERSION}.tar.gz \
        && cd rmm-${RMM_VERSION} \
        && INSTALL_PREFIX=/usr/local ./build.sh librmm \
        && cd .. \
        && rm -rf v${RMM_VERSION}.tar* \
        && rm -rf rmm-${RMM_VERSION}
      ```

1. Install the SageMaker model parallel library.

   ```
   RUN pip install --no-cache-dir -U ${SMD_MODEL_PARALLEL_URL}
   ```

1. Install the SageMaker data parallel library.

   ```
   RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
   ```

1. Install the [sagemaker-training toolkit](https://github.com/aws/sagemaker-training-toolkit). The toolkit contains the common functionality that's necessary to create a container compatible with the SageMaker training platform and the SageMaker Python SDK.

   ```
   RUN pip install sagemaker-training
   ```

1. After you finish creating the Dockerfile, see [Adapting Your Own Training Container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html) to learn how to build the Docker container and host it in Amazon ECR.

**Tip**  
For more general information about creating a custom Dockerfile for training in SageMaker AI, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).

# Checkpointing and Fine-Tuning a Model with Model Parallelism
<a name="distributed-model-parallel-checkpointing-and-finetuning"></a>

The SageMaker model parallelism library provides checkpointing APIs to save the model state and the optimizer state split by the various model parallelism strategies, and to load checkpoints for continuous training from where you want to restart training and fine-tune. The APIs also support options to save the model and optimizer states partially or fully.

**Topics**
+ [Checkpointing a distributed model](#distributed-model-parallel-checkpoint)
+ [Fine-tuning a distributed model](#distributed-model-parallel-fine-tuning)

## Checkpointing a distributed model
<a name="distributed-model-parallel-checkpoint"></a>

Choose one of the following topics depending on your framework (PyTorch or TensorFlow) and the version of the SageMaker model parallelism library that you use.

**Topics**
+ [Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library v1.10.0 and later)](#model-parallel-extended-features-pytorch-checkpoint)
+ [Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library between v1.6.0 and v1.9.0)](#model-parallel-extended-features-pytorch-saving-loading-checkpoints)
+ [Checkpointing a distributed TensorFlow model](#distributed-model-parallel-checkpoint-tensorflow)

### Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library v1.10.0 and later)
<a name="model-parallel-extended-features-pytorch-checkpoint"></a>

The SageMaker model parallelism library provides checkpoint APIs to save and load full or partial checkpoints of the distributed model state and its optimizer state.

**Note**  
This checkpointing method is recommended if you use PyTorch and the SageMaker model parallelism library v1.10.0 or later.

**Partial checkpointing**

To save checkpoints of a model trained with model parallelism, use the [`smdistributed.modelparallel.torch.save_checkpoint`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.save_checkpoint) API with the partial checkpointing option set to true (`partial=True`). This saves each model partition individually. In addition to the model and the optimizer state, you can also save any additional custom data through the `user_content` argument. The checkpointed model, optimizer, and user content are saved as separate files. The `save_checkpoint` API call creates checkpoint folders in the following structure. 

```
- path
  - ${tag}_partial (folder for partial checkpoints)
    - model_rankinfo.pt
    - optimizer_rankinfo.pt
    - fp16_states_rankinfo.pt
    - user_content.pt
  - $tag (checkpoint file for full checkpoints)
  - user_content_$tag (user_content file for full checkpoints)
  - newest (a file that indicates the newest checkpoint)
```

To resume training from partial checkpoints, use the [smdistributed.modelparallel.torch.resume_from_checkpoint](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.resume_from_checkpoint) API with `partial=True`, and specify the checkpoint directory and the tag used while saving the partial checkpoints. Note that the actual loading of model weights happens after model partitioning, during the first run of the `smdistributed.modelparallel.torch.step`-decorated training step function.

When saving a partial checkpoint, the library also saves the model partition decision as files with the `.pt` file extension. Likewise, when resuming from a partial checkpoint, the library loads the partition decision files together with the model files. Once the partition decision is loaded, you can't change the partition.

The following code snippet shows how to set the checkpoint APIs in a PyTorch training script.

```
import smdistributed.modelparallel.torch as smp

model = ...
model = smp.DistributedModel(model)
optimizer = ...
optimizer = smp.DistributedOptimizer(optimizer)
user_content = ...     # additional custom data
checkpoint_path = "/opt/ml/checkpoint/model_parallel"

# Save a checkpoint.
smp.save_checkpoint(
    path=checkpoint_path,
    tag=f"total_steps{total_steps}",
    partial=True,
    model=model,
    optimizer=optimizer,
    user_content=user_content,
    num_kept_partial_checkpoints=5
)

# Load a checkpoint.
# This automatically loads the most recently saved checkpoint.
smp_checkpoint = smp.resume_from_checkpoint(
    path=checkpoint_path, 
    partial=True
)
```

**Full checkpointing**

To save the final model artifact for inference purposes, use the `smdistributed.modelparallel.torch.save_checkpoint` API with `partial=False`, which combines the model partitions to create a single model artifact. Note that this does not combine the optimizer states.

To initialize training with particular weights, given a full model checkpoint, you can use the `smdistributed.modelparallel.torch.resume_from_checkpoint` API with `partial=False`. Note that this does not load optimizer states.

**Note**  
With tensor parallelism, in general, the `state_dict` must be translated between the original model implementation and the `DistributedModel` implementation. Optionally, you can provide the `state_dict` translation function as an argument to the `smdistributed.modelparallel.torch.resume_from_checkpoint`. However, for [Supported Models Out of the Box](model-parallel-extended-features-pytorch-hugging-face.md#model-parallel-extended-features-pytorch-hugging-face-out-of-the-box), the library takes care of this translation automatically.

The following code shows an example of how to use the checkpoint APIs for fully checkpointing a PyTorch model trained with model parallelism.

```
import smdistributed.modelparallel.torch as smp

model = ...
model = smp.DistributedModel(model)
optimizer = ...
optimizer = smp.DistributedOptimizer(optimizer)
user_content = ...     # additional custom data
checkpoint_path = "/opt/ml/checkpoint/model_parallel"

# Save a checkpoint.
smp.save_checkpoint(
    path=checkpoint_path,
    tag=f"total_steps{total_steps}",
    partial=False,
    model=model,
    optimizer=optimizer,
    user_content=user_content,
    num_kept_partial_checkpoints=5
)

# Load a checkpoint.
# This automatically loads the most recently saved checkpoint.
smp_checkpoint = smp.resume_from_checkpoint(
    path=checkpoint_path, 
    partial=False
)
```

### Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library between v1.6.0 and v1.9.0)
<a name="model-parallel-extended-features-pytorch-saving-loading-checkpoints"></a>

The SageMaker model parallelism library provides Python functions for saving partial or full checkpoints for training jobs with tensor parallelism. The following procedure shows how to use [smdistributed.modelparallel.torch.save](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.save) and [smdistributed.modelparallel.torch.load](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.load) to save and load a checkpoint when you use tensor parallelism.

**Note**  
This checkpointing method is recommended if you use PyTorch, [Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism.md), and the SageMaker model parallelism library between v1.6.0 and v1.9.0.

1. Prepare a model object and wrap it with the library's wrapper function `smp.DistributedModel()`.

   ```
   model = MyModel(...)
   model = smp.DistributedModel(model)
   ```

1. Prepare an optimizer for the model. A set of model parameters is an iterable argument required by optimizer functions. To prepare a set of model parameters, you must process `model.parameters()` to assign unique IDs to individual model parameters. 

   If there are parameters with duplicated IDs in the model parameter iterable, loading the checkpointed optimizer state fails. To create an iterable of model parameters with unique IDs for your optimizer, see the following:

   ```
   unique_params = []
   unique_params_set = set()
   for p in model.parameters():
     if p not in unique_params_set:
       unique_params.append(p)
       unique_params_set.add(p)
   del unique_params_set
   
   optimizer = MyOpt(unique_params, ...)
   ```

1. Wrap the optimizer using the library's wrapper function `smp.DistributedOptimizer()`.

   ```
   optimizer = smp.DistributedOptimizer(optimizer)
   ```

1. Save the model and the optimizer state using [smdistributed.modelparallel.torch.save](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.save). Depending on how you want to save checkpoints, choose one of the following two options:
   + **Option 1:** Save a partial model on each `mp_rank` for a single `MP_GROUP`.

     ```
     model_dict = model.local_state_dict() # save a partial model
     opt_dict = optimizer.local_state_dict() # save a partial optimizer state
     # Save the dictionaries at rdp_rank 0 as a checkpoint
     if smp.rdp_rank() == 0:
         smp.save(
             {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict},
             f"/checkpoint.pt",
             partial=True,
         )
     ```

     With tensor parallelism, the library saves checkpointed files named in the following format: `checkpoint.pt_{pp_rank}_{tp_rank}`.
**Note**  
With tensor parallelism, make sure you set the if statement as `if smp.rdp_rank() == 0` instead of `if smp.dp_rank() == 0`. When the optimizer state is sharded with tensor parallelism, all reduced-data parallel ranks must save their own partition of the optimizer state. Using the wrong `if` statement for checkpointing might result in a stalled training job. For more information about using `if smp.dp_rank() == 0` without tensor parallelism, see [General Instruction for Saving and Loading](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#general-instruction-for-saving-and-loading) in the *SageMaker Python SDK documentation*. 
   + **Option 2:** Save the full model.

     ```
     if smp.rdp_rank() == 0:
         model_dict = model.state_dict(gather_to_rank0=True) # save the full model
         if smp.rank() == 0:
             smp.save(
                 {"model_state_dict": model_dict},
                 "/checkpoint.pt",
                 partial=False,
             )
     ```
**Note**  
Consider the following for full checkpointing:   
If you set `gather_to_rank0=True`, all ranks other than `0` return empty dictionaries.
For full checkpointing, you can only checkpoint the model. Full checkpointing of optimizer states is currently not supported.
The full model only needs to be saved at `smp.rank() == 0`.

1. Load the checkpoints using [smdistributed.modelparallel.torch.load](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.load). Depending on how you checkpointed in the previous step, choose one of the following two options:
   + **Option 1:** Load the partial checkpoints.

     ```
     checkpoint = smp.load("/checkpoint.pt", partial=True)
     model.load_state_dict(checkpoint["model_state_dict"], same_partition_load=False)
     optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
     ```

     You can set `same_partition_load=True` in `model.load_state_dict()` for a faster load, if you know that the partition will not change.
   + **Option 2:** Load the full checkpoints.

     ```
     if smp.rdp_rank() == 0:
         checkpoint = smp.load("/checkpoint.pt", partial=False)
         model.load_state_dict(checkpoint["model_state_dict"])
     ```

     The `if smp.rdp_rank() == 0` condition is not required, but it can help avoid redundant loading among different `MP_GROUP`s. Full checkpointing optimizer state dict is currently not supported with tensor parallelism.

### Checkpointing a distributed TensorFlow model
<a name="distributed-model-parallel-checkpoint-tensorflow"></a>

To save a TensorFlow model while training with model parallelism, use the following functions provided by the SageMaker model parallelism library.
+ [smp.DistributedModel.save_model](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_tensorflow.html#smp.DistributedModel.save_model)
+ [smp.CheckpointManager](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_tensorflow.html#smp.CheckpointManager)

## Fine-tuning a distributed model
<a name="distributed-model-parallel-fine-tuning"></a>

The fine-tuning needs to be configured in your training script. The following code snippet shows an example structure of a training script using the [AutoModelForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForCausalLM) class of Hugging Face Transformers with modifications for registering the `smdistributed.model.parallel.torch` modules and settings for fine-tuning.

**Note**  
Fine-tuning a distributed transformer (a Transformer model wrapped by `smp.DistributedModel()`) with the [smp.delay_param_initialization](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.delay_param_initialization) function activated requires the fine-tuning job to be configured with an FSx for Lustre file system. In cases where you want to fine-tune a large-scale model with the delayed parameter initialization option, you should set up an FSx for Lustre file system.

```
import argparse
import os

import torch
from transformers import AutoModelForCausalLM

import smdistributed.modelparallel.torch as smp

def parse_args():

    parser = argparse.ArgumentParser()

    # set an arg group for model
    model_grp = parser.add_argument_group(
        title="model", description="arguments to describe model configuration"
    )

    ... # set up numerous args to parse from the configuration dictionary to the script for training

    # add arg for activating fine-tuning
    model_grp.add_argument(
        "--fine_tune",
        type=int,
        default=0,
        help="Fine-tune model from checkpoint or pretrained model",
    )

def main():
    """Main function to train GPT."""
    args = parse_args()

    ... # parse numerous args

    if args.fine_tune > 0 and args.delayed_param > 0 and smp.rank() == 0:
        pretrained_model = AutoModelForCausalLM.from_pretrained(
            args.model_name or args.model_dir
        )
        model_state_dict = pretrained_model.state_dict()
        path = os.path.join(args.model_dir, "fullmodel.pt")
        torch.save(model_state_dict, path)

    # create a Transformer model and wrap by smp.model_creation() 
    # with options to configure model parallelism parameters offered by SageMaker AI
    with smp.model_creation(
        tensor_parallelism=smp.tp_size() > 1 or args.use_distributed_transformer > 0,
        zero_init=args.use_distributed_transformer == 0,
        dtype=dtype,
        distribute_embedding=args.sharded_data_parallel_degree > 1 and smp.tp_size() > 1,
        use_alibi=args.alibi > 0,
        attention_in_fp32=args.attention_in_fp32 > 0,
        fp32_residual_addition=args.residual_addition_in_fp32 > 0,
        query_key_layer_scaling=args.query_key_layer_scaling > 0 and args.bf16 < 1,
        fused_softmax=args.fused_softmax > 0,
        fused_dropout=args.fused_dropout > 0,
        fused_bias_gelu=args.fused_bias_gelu > 0,
        flash_attention=args.flash_attention > 0,
    ):
        if args.fine_tune > 0 and args.delayed_param == 0:
            model = AutoModelForCausalLM.from_pretrained(
                args.model_name or args.model_dir
            )
        else:
            model = AutoModelForCausalLM.from_config(model_config)

    # wrap the model by smp.DistributedModel() to apply SageMaker model parallelism
    model = smp.DistributedModel(
        model, trace_device="gpu", backward_passes_per_step=args.gradient_accumulation
    )

    # wrap the optimizer by smp.DistributedOptimizer() to apply SageMaker model parallelism
    optimizer = ...  # define an optimizer
    optimizer = smp.DistributedOptimizer(
        optimizer,
        static_loss_scale=None,
        dynamic_loss_scale=True,
        dynamic_loss_args={"scale_window": 1000, "min_scale": 1, "delayed_shift": 2},
    )

    # for fine-tuning, use smp.resume_from_checkpoint() to load a pre-trained model
    if args.fine_tune > 0 and args.delayed_param > 0:
        smp.resume_from_checkpoint(args.model_dir, tag="fullmodel.pt", partial=False)
```

For a complete example of training scripts and Jupyter notebooks, see the [GPT-2 examples for PyTorch](https://github.com/aws/amazon-sagemaker-examples/tree/main/training/distributed_training/pytorch/model_parallel/gpt2) in the *SageMaker AI Examples GitHub repository*. 

# Amazon SageMaker AI model parallelism library v1 examples
<a name="distributed-model-parallel-examples"></a>

This page provides a list of blogs and Jupyter notebooks that present practical examples of implementing the SageMaker model parallelism (SMP) library v1 to run distributed training jobs on SageMaker AI.

## Blogs and Case Studies
<a name="distributed-model-parallel-examples-blog"></a>

The following blogs discuss case studies about using SMP v1.
+ [New performance improvements in the Amazon SageMaker AI model parallelism library](https://aws.amazon.com/blogs/machine-learning/new-performance-improvements-in-amazon-sagemaker-model-parallel-library/), *AWS Machine Learning Blog* (December 16, 2022)
+ [Train gigantic models with near-linear scaling using sharded data parallelism on Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/train-gigantic-models-with-near-linear-scaling-using-sharded-data-parallelism-on-amazon-sagemaker/), *AWS Machine Learning Blog* (October 31, 2022)

## Example notebooks
<a name="distributed-model-parallel-examples-pytorch"></a>

Example notebooks are provided in the [SageMaker AI examples GitHub repository](https://github.com/aws/amazon-sagemaker-examples/tree/master/training/distributed_training/). To download the examples, run the following command to clone the repository and go to `training/distributed_training/pytorch/model_parallel`.

**Note**  
Clone and run the example notebooks in the following SageMaker AI ML IDEs.  
[SageMaker JupyterLab](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl.html) (available in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
[SageMaker Code Editor](https://docs.aws.amazon.com/sagemaker/latest/dg/code-editor.html) (available in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
[Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) (available as an application in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
[SageMaker Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html)

```
git clone https://github.com/aws/amazon-sagemaker-examples.git
cd amazon-sagemaker-examples/training/distributed_training/pytorch/model_parallel
```

**SMP v1 example notebooks for PyTorch**
+ [Train GPT-2 with near-linear scaling using the sharded data parallelism technique in the SageMaker model parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/smp-train-gpt-sharded-data-parallel.ipynb)
+ [Fine-tune GPT-2 with near-linear scaling using sharded data parallelism technique in the SageMaker model parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/smp-fine-tune-gpt-sharded-data-parallel.ipynb)
+ [Train GPT-NeoX-20B with near-linear scaling using the sharded data parallelism technique in the SageMaker model parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt-neox/smp-train-gpt-neox-sharded-data-parallel.ipynb)
+ [Train GPT-J 6B using the sharded data parallelism and tensor parallelism techniques in the SageMaker model parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt-j/smp-train-gptj-sharded-data-parallel-tp.ipynb)
+ [Train FLAN-T5 with near-linear scaling using sharded data parallelism technique in the SageMaker model parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb)
+ [Train Falcon with near-linear scaling using sharded data parallelism technique in the SageMaker model parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/falcon/smp-train-falcon-sharded-data-parallel.ipynb)

**SMP v1 example notebooks for TensorFlow**
+ [CNN with TensorFlow 2.3.1 and the SageMaker model parallelism library](https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/tensorflow/model_parallel/mnist/tensorflow_smmodelparallel_mnist.html)
+ [HuggingFace with TensorFlow Distributed model parallelism library Training on SageMaker AI](https://github.com/huggingface/notebooks/blob/master/sagemaker/04_distributed_training_model_parallelism/sagemaker-notebook.ipynb)

# SageMaker Distributed Model Parallelism Best Practices
<a name="model-parallel-best-practices"></a>

Use the following guidelines when you run a distributed training job with the SageMaker model parallel library.

## Setting Up the Right Configuration for a Given Model
<a name="model-parallel-best-practices-configuration"></a>

When scaling up a model, we recommend that you go through the following list in order. Each list item discusses the advantages of using the library's techniques along with the tradeoffs that might arise. 

**Tip**  
If a model can fit well using a subset of the library's features, adding more model parallelism or memory saving features does not usually improve performance.

**Using large GPU instance types**
+ In the realm of model parallelism, it is best to use powerful instances with large GPU memories to handle overhead from model parallelism operations such as partitioning models across multiple GPUs. We recommend using `ml.p4d` or `ml.p3dn` instances for training large DL models. These instances are also equipped with Elastic Fabric Adapter (EFA), which provides higher network bandwidth and enables large-scale training with model parallelism.

**Sharding optimizer state**
+ The impact of sharding optimizer state depends on the number of data parallel ranks. Typically, a higher degree of data parallelism (proportional to the size of the compute cluster) improves the efficiency of memory usage.

  When you want to downsize a cluster, make sure you check the optimizer state sharding configuration. For example, a large DL model with optimizer state sharding that fits on a compute cluster with 16 GPUs (for example, two P4d or P4de instances) might not always fit on a node with 8 GPUs (for example, a single P4d or P4de instance). This is because the combined memory of 8 GPUs is lower than the combined memory of 16 GPUs, and the required memory per GPU for sharding over 8 GPUs is also higher than the memory per GPU for sharding over the 16-GPU scenario. As a result, the increased memory requirement might not fit into the smaller cluster.

  For more information, see [Optimizer State Sharding](model-parallel-extended-features-pytorch-optimizer-state-sharding.md).
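The arithmetic behind the downsizing caveat can be sketched as follows. This is an illustrative calculation only, not an `smdistributed` API; the function name and the 320 GB figure are hypothetical.

```
# Sketch: why a model that fits with optimizer state sharding on 16 GPUs
# might not fit on 8. Each GPU's share of the sharded optimizer state
# doubles when the sharding group is halved.
def per_gpu_shard_gb(total_optimizer_state_gb: float, num_gpus: int) -> float:
    """Optimizer-state memory each GPU holds when sharding across num_gpus."""
    return total_optimizer_state_gb / num_gpus

state_gb = 320.0                                 # hypothetical total optimizer state
on_16_gpus = per_gpu_shard_gb(state_gb, 16)      # 20.0 GB per GPU
on_8_gpus = per_gpu_shard_gb(state_gb, 8)        # 40.0 GB per GPU
```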

**Activation checkpointing**
+ Memory efficiency can be improved by using activation checkpointing for a group of modules. The more you group the modules, the more efficient the memory usage. When checkpointing sequential modules for layers, the `strategy` argument of the `smp.set_activation_checkpointing` function groups the layers together for checkpointing. For example, grouping two or more layers together for checkpointing is more memory efficient than checkpointing one layer at a time, and this trades extra computation time for reduced memory usage.

  For more information, see [Activation Checkpointing](model-parallel-extended-features-pytorch-activation-checkpointing.md).

**Tensor parallelism**
+ The degree of tensor parallelism should be a power of two (2, 4, 8, ..., 2^n), where the maximum degree must be equal to the number of GPUs per node. For example, if you use a node with 8 GPUs, possible numbers for the degree of tensor parallelism are 2, 4, and 8. We don’t recommend arbitrary numbers (such as 3, 5, 6, and 7) for the degree of tensor parallelism. When you use multiple nodes, misconfiguring the degree of tensor parallelism might result in running tensor parallelism across the nodes; this adds significant overhead from communication of activations across the nodes and can become computationally expensive.

  For more information, see [Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism.md).<a name="model-parallel-best-practices-configuration-pipeline-across-nodes"></a>
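The degree rule above can be expressed as a small check. The following is an illustrative sketch, not part of the `smdistributed` API; the function name and the default of 8 GPUs per node are assumptions for the example.

```
# Sketch: validate a proposed tensor parallelism degree against the guidance
# above (a power of two, no larger than the number of GPUs per node).
def valid_tp_degree(tp_degree: int, gpus_per_node: int = 8) -> bool:
    is_power_of_two = tp_degree > 0 and (tp_degree & (tp_degree - 1)) == 0
    return is_power_of_two and tp_degree <= gpus_per_node

# On an 8-GPU node, degrees 2, 4, and 8 pass; 3, 5, 6, 7, and 16 do not.
```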

**Pipeline parallelism across nodes**
+ You can run pipeline parallelism both within a single node and across multiple nodes. When you use pipeline parallelism in combination with tensor parallelism, we recommend running pipeline parallelism across multiple nodes and keeping tensor parallelism within individual nodes. 
+ Pipeline parallelism comes with the following three knobs: `microbatches`, `active_microbatches`, and `prescaled_batch`.
  + When you use tensor parallelism with pipeline parallelism, we recommend activating `prescaled_batch` so that the batch size per model parallel group can be increased for efficient pipelining. With `prescaled_batch` activated, the batch size set in the training script becomes `tp_size` times the batch size set for each rank without `prescaled_batch`.
  + Increasing the number of `microbatches` helps achieve efficient pipelining and better performance. Note that the effective microbatch size is the batch size divided by number of microbatches. If you increase the number of microbatches while keeping batch size constant, each microbatch processes fewer samples.
  + The number of `active_microbatches` is the maximum number of microbatches that are simultaneously in process during pipelining. For each active microbatch in process, its activations and gradients take up GPU memory. Therefore, increasing `active_microbatches` takes up more GPU memory.
+ If both GPU and GPU memory are underutilized, increase `active_microbatches` for better parallelization during pipelining.
+ For more information about how to use tensor parallelism with pipeline parallelism, see [Tensor parallelism combined with pipeline parallelism](model-parallel-extended-features-pytorch-tensor-parallelism-examples.md#model-parallel-extended-features-pytorch-tensor-and-pipeline-parallelism).
+ To find descriptions of the aforementioned parameters, see [Parameters for `smdistributed`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#parameters-for-smdistributed) in the *SageMaker Python SDK documentation*.
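The relationships among the three knobs described above can be sketched as simple arithmetic. This is an illustrative example only, not an `smdistributed` API; the function names are assumptions.

```
# Sketch of the pipeline-parallelism knob relationships described above.
def prescaled_batch_size(per_rank_batch_size: int, tp_size: int) -> int:
    # With prescaled_batch activated, the batch size set in the training
    # script becomes tp_size times the per-rank batch size.
    return per_rank_batch_size * tp_size

def microbatch_size(batch_size: int, microbatches: int) -> int:
    # The effective microbatch size is the batch size divided by the
    # number of microbatches.
    return batch_size // microbatches

# With a per-rank batch size of 8 and tp_size of 4, the script-level batch
# size becomes 32; splitting it into 4 microbatches yields 8 samples each.
```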

**Offloading activations to CPU**
+ Make sure that this is used in combination with activation checkpointing and pipeline parallelism. To ensure that the offloading and preloading happen in the background, specify a value greater than 1 for the `microbatches` parameter. 
+ When offloading activations, you might be able to increase `active_microbatches` and sometimes match with the total number of microbatches. This depends on which modules are checkpointed and how the model is partitioned.

  For more information, see [Activation Offloading](model-parallel-extended-features-pytorch-activation-offloading.md).

### Reference configurations
<a name="model-parallel-best-practices-configuration-reference"></a>

The SageMaker model parallelism training team provides the following reference points based on experiments with the GPT-2 model, a sequence length of 512, and a vocabulary size of 50,000. 


| The number of model parameters | Instance type | Pipeline parallelism | Tensor parallelism | Optimizer state sharding | Activation checkpointing | Prescaled batch | Batch size | 
| --- | --- | --- | --- | --- | --- | --- | --- | 
| 10 billion | 16 ml.p4d.24xlarge | 1 | 4 | True | Each transformer layer | True | batch_size=40 | 
| 30 billion | 16 ml.p4d.24xlarge | 1 | 8 | True | Each transformer layer | True | batch_size=32 | 
| 60 billion | 32 ml.p4d.24xlarge | 2 | 8 | True | Each transformer layer | True | batch_size=56, microbatches=4, active_microbatches=2 | 

You can extrapolate from the preceding configurations to estimate GPU memory usage for your model configuration. For example, if you increase the sequence length for a 10-billion-parameter model or increase the size of the model to 20 billion, you might want to lower batch size first. If the model still doesn’t fit, try increasing the degree of tensor parallelism.
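The tuning order suggested above can be sketched as a helper. This is a hypothetical illustration of the guidance, not an `smdistributed` API; the function name and the halving/doubling steps are assumptions for the example.

```
# Sketch: when a configuration runs out of GPU memory, first lower the
# batch size; once the batch size reaches 1, increase the degree of tensor
# parallelism (up to the number of GPUs per node).
def next_config(batch_size: int, tp_degree: int, gpus_per_node: int = 8):
    if batch_size > 1:
        return batch_size // 2, tp_degree
    if tp_degree < gpus_per_node:
        return batch_size, tp_degree * 2
    raise RuntimeError(
        "Batch size and tensor parallelism exhausted; "
        "try more memory-saving features or larger instances."
    )
```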

## Modifying Your Training Script
<a name="model-parallel-best-practices-modify-training-script"></a>
+ Before you use the SageMaker model parallel library’s features in your training script, review [The SageMaker Distributed Model Parallelism Library Configuration Tips and Pitfalls](model-parallel-customize-tips-pitfalls.md).
+ To launch a training job faster, use the [SageMaker AI local mode](https://sagemaker.readthedocs.io/en/v2.199.0/overview.html?highlight=local%20mode#local-mode). This helps you quickly run a training job locally on a SageMaker notebook instance. Depending on the scale of the ML instance on which your SageMaker notebook instance is running, you might need to adjust the size of your model by changing the model configurations, such as the hidden width, number of transformer layers, and attention heads. Validate if the reduced model runs well on the notebook instance before using a large cluster for training the full model. 

## Monitoring and Logging a Training Job Using the SageMaker AI Console and Amazon CloudWatch
<a name="model-parallel-best-practices-monitoring"></a>

To monitor system-level metrics such as CPU memory utilization, GPU memory utilization, and GPU utilization, use the visualizations provided in the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Training**.

1. Choose **Training jobs**.

1. In the main pane, choose the training job name for which you want to see more details.

1. Browse the main pane and find the **Monitor** section to see the automated visualization.

1. To see training job logs, choose **View logs** in the **Monitor** section. You can access the distributed training job logs of the training job in CloudWatch. If you launched multi-node distributed training, you should see multiple log streams with tags in the format of **algo-n-1234567890**. The **algo-1** log stream tracks training logs from the main (0th) node.

For more information, see [Amazon CloudWatch Metrics for Monitoring and Analyzing Training Jobs](training-metrics.md).

## Permissions
<a name="model-parallel-best-practices-permissions"></a>

To run a SageMaker training job with model parallelism or the [SageMaker distributed training example notebooks](https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html), make sure you have the right permissions in your IAM role, such as the following:
+ To use [FSx for Lustre](https://aws.amazon.com/fsx/), add the [AmazonFSxFullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonFSxFullAccess) policy.
+ To use Amazon S3 as a data channel, add the [AmazonS3FullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonS3FullAccess) policy.
+ To use Docker, build your own container, and push it to Amazon ECR, add the [AmazonEC2ContainerRegistryFullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonEC2ContainerRegistryFullAccess) policy.
+ For full access to the entire suite of SageMaker AI features, add the [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonSageMakerFullAccess) policy. 

# The SageMaker Distributed Model Parallelism Library Configuration Tips and Pitfalls
<a name="model-parallel-customize-tips-pitfalls"></a>

Review the following tips and pitfalls before using Amazon SageMaker AI's model parallelism library. This list includes tips that are applicable across frameworks. For TensorFlow and PyTorch specific tips, see [Modify a TensorFlow training script](model-parallel-customize-training-script-tf.md) and [Modify a PyTorch Training Script](model-parallel-customize-training-script-pt.md), respectively. 

## Batch Size and Number of Microbatches
<a name="model-parallel-customize-tips-pitfalls-batch-size"></a>
+ The library is most efficient when the batch size is increased. For use cases where the model fits within a single device, but can only be trained with a small batch size, batch size can and should be increased after the library is integrated. Model parallelism saves memory for large models, enabling you to train using batch sizes that previously did not fit in memory.
+ Choosing a number of microbatches that is too small or too large can lower performance. The library executes each microbatch sequentially in each device, so microbatch size (batch size divided by number of microbatches) must be large enough to fully utilize each GPU. At the same time, pipeline efficiency increases with the number of microbatches, so striking the right balance is important. Typically, a good starting point is to try 2 or 4 microbatches, increasing the batch size to the memory limit, and then experiment with larger batch sizes and numbers of microbatches. As the number of microbatches is increased, larger batch sizes might become feasible if an interleaved pipeline is used.
+ Your batch size must always be divisible by the number of microbatches. Note that depending on the size of the dataset, the last batch of each epoch can sometimes be smaller than the rest, and this smaller batch needs to be divisible by the number of microbatches as well. If it is not, you can set `drop_remainder=True` in the `tf.Dataset.batch()` call (in TensorFlow), or set `drop_last=True` in `DataLoader` (in PyTorch), so that this last, small batch is not used. If you are using a different API for the data pipeline, you might need to manually skip the last batch whenever it is not divisible by the number of microbatches.
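As a minimal sketch of the divisibility requirement, the following example (using a hypothetical dataset of 1,000 samples) shows how `drop_last=True` in PyTorch discards a trailing partial batch that would not divide evenly into microbatches.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical example: 1,000 samples, batch size of 64, 4 microbatches.
# 1000 is not divisible by 64, so the last batch would hold only 40 records,
# which is not divisible by 4 microbatches. drop_last=True discards it.
dataset = TensorDataset(torch.randn(1000, 16))
loader = DataLoader(dataset, batch_size=64, drop_last=True)

num_microbatches = 4
assert all(batch[0].shape[0] % num_microbatches == 0 for batch in loader)
```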

## Manual Partitioning
<a name="model-parallel-customize-tips-pitfalls-manual-partitioning"></a>
+ If you use manual partitioning, be mindful of the parameters that are consumed by multiple operations and modules in your model, such as the embedding table in transformer architectures. Modules that share the same parameter must be placed in the same device for correctness. When auto-partitioning is used, the library automatically enforces this constraint.

## Data Preparation
<a name="model-parallel-customize-tips-pitfalls-data-preparation"></a>
+ If the model takes multiple inputs, make sure you seed the random operations in your data pipeline (e.g., shuffling) with `smp.dp_rank()`. If the dataset is being deterministically sharded across data parallel devices, make sure that the shard is indexed by `smp.dp_rank()`. This is to make sure that the order of the data seen on all ranks that form a model partition is consistent.
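The rank-based seeding described above can be sketched as follows. This is an illustrative example only: `make_loader` is a hypothetical helper, and the `dp_rank` argument stands in for `smp.dp_rank()`, which is available only inside the training container.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(dataset, dp_rank, batch_size=8):
    # Sketch: seed the shuffle generator with the data parallel rank. In a
    # real training script, dp_rank would come from smp.dp_rank(), so that
    # all ranks forming the same model partition (which share a dp_rank)
    # shuffle their data identically.
    g = torch.Generator()
    g.manual_seed(42 + dp_rank)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True, generator=g)

dataset = TensorDataset(torch.arange(64).float())
# Two loaders built with the same dp_rank yield data in the same order.
order_a = torch.cat([batch[0] for batch in make_loader(dataset, dp_rank=0)])
order_b = torch.cat([batch[0] for batch in make_loader(dataset, dp_rank=0)])
assert torch.equal(order_a, order_b)
```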

## Returning Tensors from `smp.DistributedModel`
<a name="model-parallel-customize-tips-pitfalls-return-tensors"></a>
+ Any tensor that is returned from the `smp.DistributedModel.call` (for TensorFlow) or `smp.DistributedModel.forward` (for PyTorch) function is broadcast to all other ranks, from the rank that computed that particular tensor. As a result, any tensor that is not needed outside the call and forward methods (intermediate activations, for example) should not be returned, as this causes needless communication and memory overhead and hurts performance.

## The `@smp.step` Decorator
<a name="model-parallel-customize-tips-pitfalls-smp-step-decorator"></a>
+ If an `smp.step`-decorated function has a tensor argument that does not have a batch dimension, the argument name must be provided in the `non_split_inputs` list when calling `smp.step`. This prevents the library from attempting to split the tensor into microbatches. For more information see [https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_common_api.html](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_common_api.html) in the API documentation.

## Delaying Parameter Initialization
<a name="model-parallel-customize-tips-pitfalls-delaying-param-initialization"></a>

For very large models with over 100 billion parameters, weight initialization through the CPU memory might result in an out-of-memory error. To work around this, the library offers the `smp.delay_param_initialization` context manager. This delays the physical allocation of parameters until they move to the GPU during the first execution of an `smp.step`-decorated function, avoiding unnecessary CPU memory usage during the initialization of training. Use the context manager when you create a model object, as shown in the following code.

```
with smp.delay_param_initialization(enabled=True):    
    model = MyModel()
```

## Tensor Parallelism for PyTorch
<a name="model-parallel-customize-tips-pitfalls-tensor-parallelism-pytorch"></a>
+ If you are using a seed for deterministic results, set the seed based on `smp.dp_rank()` (for example, `torch.manual_seed(42 + smp.dp_rank())`). If you do not do this, different partitions of an `nn.Parameter` are initialized in the same way, impacting convergence. 
+ SageMaker’s model parallelism library uses NCCL to implement collectives needed for the distribution of the modules. Especially for smaller models, if too many NCCL calls are scheduled on the GPU at the same time, memory usage might increase because of additional space used by NCCL. To counteract this, `smp` throttles the NCCL calls so that the number of ongoing NCCL operations at any given time is less than or equal to a given limit. The default limit is 8, but this can be adjusted using the environment variable `SMP_NCCL_THROTTLE_LIMIT`. If you observe more memory usage than you expect while using tensor parallelism, you can try reducing this limit. However, choosing a limit that is too small might cause throughput loss. To disable throttling altogether, you can set `SMP_NCCL_THROTTLE_LIMIT=-1`. 
+ The following identity, which holds when the degree of tensor parallelism is 1, does not hold when the degree of tensor parallelism is greater than 1: `smp.mp_size() * smp.dp_size() == smp.size()`. This is because the tensor parallel group is part of both the model parallelism group and the data parallelism group. If your code has existing references to `mp_rank`, `mp_size`, `MP_GROUP`, and so on, and if you want to work with only the pipeline parallel group, you might need to replace the references with `smp.pp_size()`. The following identities are always true: 
  +  `smp.mp_size() * smp.rdp_size() == smp.size()` 
  +  `smp.pp_size() * smp.dp_size() == smp.size()` 
  +  `smp.pp_size() * smp.tp_size() * smp.rdp_size() == smp.size()` 
+ Since the `smp.DistributedModel` wrapper modifies the model parameters when tensor parallelism is enabled, the optimizer should be created after calling `smp.DistributedModel`, with the distributed parameters. For example, the following does not work: 

  ```
  ## WRONG
  model = MyModel()
  optimizer = SomeOptimizer(model.parameters())
  model = smp.DistributedModel(model)  # optimizer now has outdated parameters! 
  ```

  Instead, the optimizer should be created with the parameters of the `smp.DistributedModel` as follows:

  ```
  ## CORRECT
  model = smp.DistributedModel(MyModel())
  optimizer = SomeOptimizer(model.optimizers())
  ```
+ When a module is replaced with its distributed counterpart through tensor parallelism, the distributed module does not inherit its weights from the original module, and initializes new weights. This means that, for instance, if the weights need to be initialized in a particular way (for example, through a `load_state_dict` call), this needs to happen after the `smp.DistributedModel` call, once the module distribution takes place. 
+ When accessing the parameters of distributed modules directly, note that the weight does not have the same shape as the original module. For instance,  

  ```
  with smp.tensor_parallelism():
      linear = nn.Linear(60, 60)
  
  # will pass
  assert tuple(linear.weight.shape) == (60, 60)
  
  distributed_linear = smp.DistributedModel(linear)
  
  # will fail. the number of input channels will have been divided by smp.tp_size()
  assert tuple(distributed_linear.module.weight.shape) == (60, 60)
  ```
+ Using `torch.utils.data.distributed.DistributedSampler` is strongly recommended for tensor parallelism. This ensures that every data parallel rank receives the same number of data samples, which prevents hangs that might result from different `dp_rank`s taking a different number of steps. 
+ If you use the `join` API of PyTorch's `DistributedDataParallel` class to handle cases in which different data parallel ranks have different numbers of batches, you still need to make sure that ranks that are in the same `TP_GROUP` have the same number of batches; otherwise the communication collectives used in distributed execution of modules may hang. Ranks that are in different `TP_GROUP`s can have different numbers of batches, as long as `join` API is used. 
+ If you want to checkpoint your model and use tensor parallelism, consider the following: 
  + To avoid stalling and race conditions while saving and loading models when you use tensor parallelism, make sure you save and load the model and optimizer states from a reduced-data parallelism rank (`smp.rdp_rank() == 0`).
  + If you are transitioning an existing pipeline parallel script and enabling tensor parallelism for the script, make sure you replace any `if smp.dp_rank() == 0` block used for saving and loading with an `if smp.rdp_rank() == 0` block. Otherwise, it might cause your training job to stall. 

  For more information about checkpointing a model with tensor parallelism, see [Checkpointing a distributed model](distributed-model-parallel-checkpointing-and-finetuning.md#distributed-model-parallel-checkpoint).
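The `DistributedSampler` recommendation above can be sketched as follows. In this illustrative example, `dp_rank` and `dp_size` are placeholder values standing in for `smp.dp_rank()` and `smp.dp_size()`, which are available only inside the training container.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Sketch: a dataset whose size (103) is not divisible by the number of
# data parallel ranks (4), which would otherwise cause ranks to take a
# different number of steps and hang collective communication.
dataset = TensorDataset(torch.randn(103, 4))
dp_rank, dp_size = 0, 4  # placeholders for smp.dp_rank(), smp.dp_size()

# DistributedSampler pads the index list with repeated samples so that
# every rank draws the same number of samples: ceil(103 / 4) = 26 per rank.
sampler = DistributedSampler(dataset, num_replicas=dp_size, rank=dp_rank)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)
```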

# Model Parallel Troubleshooting
<a name="distributed-troubleshooting-model-parallel"></a>

If you run into an error, you can use the following list to try to troubleshoot your training job. If the problem persists, contact [AWS Support](https://aws.amazon.com/premiumsupport). 

**Topics**
+ [Considerations for Using SageMaker Debugger with the SageMaker Model Parallelism Library](#distributed-ts-model-parallel-debugger)
+ [Saving Checkpoints](#distributed-ts-model-parallel-checkpoints)
+ [Convergence Using Model Parallel and TensorFlow](#distributed-ts-model-parallel-tf-convergence)
+ [Stalling or Crashing Distributed Training Jobs](#distributed-ts-model-parallel-training-issues)
+ [Receiving NCCL Error for a PyTorch Training Job](#distributed-ts-model-parallel-nccl-error)
+ [Receiving `RecursionError` for a PyTorch Training Job](#distributed-ts-model-parallel-super-forward-not-supported)

## Considerations for Using SageMaker Debugger with the SageMaker Model Parallelism Library
<a name="distributed-ts-model-parallel-debugger"></a>

SageMaker Debugger is not available for the SageMaker model parallelism library. Debugger is enabled by default for all SageMaker TensorFlow and PyTorch training jobs, and you might see an error that looks like the following: 

```
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/checkpoints/metadata.json.sagemaker-uploading
```

To fix this issue, disable Debugger by passing `debugger_hook_config=False` when creating a framework `estimator` as shown in the following example.

```
bucket=sagemaker.Session().default_bucket()
base_job_name="sagemaker-checkpoint-test"
checkpoint_in_bucket="checkpoints"

# The S3 URI to store the checkpoints
checkpoint_s3_bucket="s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)

estimator = TensorFlow(
    ...

    distribution={"smdistributed": {"modelparallel": { "enabled": True }}},
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path="/opt/ml/checkpoints",
    debugger_hook_config=False
)
```

## Saving Checkpoints
<a name="distributed-ts-model-parallel-checkpoints"></a>

You might run into the following error when saving checkpoints of a large model on SageMaker AI: 

```
InternalServerError: We encountered an internal error. Please try again
```

This could be caused by a SageMaker AI limitation on uploading the local checkpoint to Amazon S3 during training. If you run into this error, do not pass `checkpoint_s3_uri` to the SageMaker `estimator` call; this disables SageMaker AI's built-in checkpoint upload. Instead, when saving checkpoints for larger models, we recommend saving checkpoints to a custom directory and uploading them explicitly using helper functions such as the following, passing that directory as the `local_path` argument.

```
import os

def aws_s3_sync(source, destination):
    """aws s3 sync in quiet mode and time profile"""
    import time, subprocess
    cmd = ["aws", "s3", "sync", "--quiet", source, destination]
    print(f"Syncing files from {source} to {destination}")
    start_time = time.time()
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.wait()
    end_time = time.time()
    print("Time Taken to Sync: ", (end_time-start_time))
    return

def sync_local_checkpoints_to_s3(local_path="/opt/ml/checkpoints", s3_uri=os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', '')))+'/checkpoints'):
    """ sample function to sync checkpoints from local path to s3 """

    import boto3
    #check if local path exists
    if not os.path.exists(local_path):
        raise RuntimeError(f"Provided local path {local_path} does not exist. Please check.")

    #check if s3 bucket exists
    s3 = boto3.resource('s3')
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Provided s3 uri {s3_uri} is not valid.")

    s3_bucket = s3_uri.replace('s3://','').split('/')[0]
    print(f"S3 Bucket: {s3_bucket}")
    try:
        s3.meta.client.head_bucket(Bucket=s3_bucket)
    except Exception as e:
        raise e
    aws_s3_sync(local_path, s3_uri)
    return

def sync_s3_checkpoints_to_local(local_path="/opt/ml/checkpoints", s3_uri=os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', '')))+'/checkpoints'):
    """ sample function to sync checkpoints from s3 to local path """

    import boto3
    #try to create local path if it does not exist
    if not os.path.exists(local_path):
        print(f"Provided local path {local_path} does not exist. Creating...")
        try:
            os.makedirs(local_path)
        except Exception as e:
            raise RuntimeError(f"Failed to create {local_path}")

    #check if s3 bucket exists
    s3 = boto3.resource('s3')
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Provided s3 uri {s3_uri} is not valid.")

    s3_bucket = s3_uri.replace('s3://','').split('/')[0]
    print(f"S3 Bucket: {s3_bucket}")
    try:
        s3.meta.client.head_bucket(Bucket=s3_bucket)
    except Exception as e:
        raise e
    aws_s3_sync(s3_uri, local_path)
    return
```

Usage of helper functions:

```
#base_s3_uri - user input s3 uri or save to model directory (default)
#curr_host - to save checkpoints of current host
#iteration - current step/epoch during which checkpoint is saved

# save checkpoints on every node using local_rank
if smp.local_rank() == 0:
    base_s3_uri = os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', '')))
    curr_host = os.environ['SM_CURRENT_HOST']
    full_s3_uri = f'{base_s3_uri}/checkpoints/{curr_host}/{iteration}'
    sync_local_checkpoints_to_s3(local_path=checkpoint_dir, s3_uri=full_s3_uri)
```

## Convergence Using Model Parallel and TensorFlow
<a name="distributed-ts-model-parallel-tf-convergence"></a>

When you use SageMaker AI multi-node training with TensorFlow and the model parallelism library, the loss may not converge as expected because the order of training input files may be different on each node. This may cause different ranks in the same model parallel group to work on different input files, causing inconsistencies. To prevent this, ensure the input files are ordered the same way in all the ranks before they get converted to TensorFlow datasets. One way to achieve this is to sort the input file names in the training script.
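One way to enforce a consistent file order is to sort the file names before building the dataset. The following is a minimal sketch; `list_training_files` is a hypothetical helper and the `*.tfrecord` pattern and `data_dir` path are example values to adapt to your input channel.

```python
import glob
import os

def list_training_files(data_dir):
    """Sketch: return input file names in a deterministic (sorted) order so
    that every rank builds its tf.data pipeline from the same file sequence.
    data_dir is a hypothetical local path for your input channel."""
    return sorted(glob.glob(os.path.join(data_dir, "*.tfrecord")))
```

You can then pass the resulting list to your dataset constructor (for example, `tf.data.TFRecordDataset`) so all ranks consume files in the same order.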

## Stalling or Crashing Distributed Training Jobs
<a name="distributed-ts-model-parallel-training-issues"></a>

If your training job stalls, crashes, or stops responding, read the following troubleshooting items to identify the cause of the issue. If you need further support, reach out to the SageMaker distributed training team through [AWS Support](https://aws.amazon.com/premiumsupport).
+  If you see **a distributed training job stalling at the NCCL initialization step**, consider the following: 
  + If you are using one of the EFA-enabled instances (`ml.p4d` or `ml.p3dn` instances) with a custom VPC and its subnet, ensure that the security group you use allows inbound and outbound traffic on all ports to and from the same security group. You also generally need a separate outbound rule allowing traffic to any IP address (for internet access). To find instructions on how to add inbound and outbound rules for EFA communication, refer to [SageMaker AI distributed training job stalling during initialization](distributed-troubleshooting-data-parallel.md#distributed-ts-data-parallel-efa-sg).
+ If you see a **distributed training job stalling when checkpointing** the full model, this might be because the `state_dict()` call on the model or optimizer was not made on all ranks with `rdp_rank()==0` (when using tensor parallelism) or `dp_rank()==0` (when using only pipeline parallelism). These ranks need to communicate to construct the checkpoint to be saved. Similar stalling issues can also happen when checkpointing the partial optimizer state if `shard_optimizer_state` is enabled. 

  For more information about checkpointing a model with model parallelism, see [General Instruction for Saving and Loading](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#general-instruction-for-saving-and-loading) and [Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library between v1.6.0 and v1.9.0)](distributed-model-parallel-checkpointing-and-finetuning.md#model-parallel-extended-features-pytorch-saving-loading-checkpoints).
+ If the training job crashes with a **CUDA Out of Memory error**, this means that the distributed training configuration needs to be adjusted to fit the model on the GPU cluster. For more information and best practices, see [Setting Up the Right Configuration for a Given Model](model-parallel-best-practices.md#model-parallel-best-practices-configuration).
+ If the training job crashes with an **uncorrectable [ECC error](https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html)**, this means that one of the GPUs in the cluster has gone bad. If you need technical support, share the job ARN with the AWS team and restart your training job from a checkpoint if possible.
+ In rare cases, a job configuration that worked previously but is close to the limits of GPU memory might fail later on a different cluster due to a **CUDA Out of Memory error**. This could be because a GPU has less available memory than usual due to ECC errors.
+ A **network timeout crash** might happen when running a multi-node job that doesn't use all GPUs in the node. To work around this, use all GPUs on the node by setting the `processes_per_host` parameter to the number of GPUs in each instance. For example, this is `processes_per_host=8` for `ml.p3.16xlarge`, `ml.p3dn.24xlarge`, and `ml.p4d.24xlarge` instances.
+ If you find that your training job takes a long time during the data downloading stage, make sure the Amazon S3 path you provide to `checkpoint_s3_uri` for the SageMaker `Estimator` class is unique to the current training job. If this path is reused across multiple training jobs running simultaneously, all of those jobs upload and download checkpoints to the same Amazon S3 path, which might significantly increase checkpoint loading time.
+ Use FSx for Lustre when you deal with large data and models.
  + If your dataset is large and fetching it takes a long time, we recommend keeping your dataset in [FSx for Lustre](https://aws.amazon.com/fsx/lustre/).
  + When training models with more than 10 billion parameters, we recommend using FSx for Lustre for checkpointing.
  + After you create a file system, make sure to wait for the status to become **available** before starting a training job using it. 
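One item above recommends a `checkpoint_s3_uri` that is unique to each training job. A minimal sketch of one way to achieve this is appending a timestamp to the path; the bucket and job name below are hypothetical values.

```python
import time

def unique_checkpoint_uri(bucket, base_job_name):
    """Sketch: build a checkpoint_s3_uri that is unique per training job by
    appending a launch timestamp. bucket and base_job_name are hypothetical
    values; substitute your own."""
    return f"s3://{bucket}/{base_job_name}/checkpoints-{int(time.time())}"

# Pass the result as checkpoint_s3_uri when constructing the Estimator, so
# that simultaneous jobs never share the same checkpoint path.
uri = unique_checkpoint_uri("my-bucket", "smp-train")
```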

## Receiving NCCL Error for a PyTorch Training Job
<a name="distributed-ts-model-parallel-nccl-error"></a>

If you encountered the following error, it might be due to a process running out of GPU memory.

```
NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
```

You can resolve this by reducing the batch size or `active_microbatches`. If auto partitioning is not resulting in a well-balanced partitioning, you might have to consider manual partitioning. For more information, see [Pipeline parallelism across nodes](model-parallel-best-practices.md#model-parallel-best-practices-configuration-pipeline-across-nodes).

## Receiving `RecursionError` for a PyTorch Training Job
<a name="distributed-ts-model-parallel-super-forward-not-supported"></a>

The library does not support calling `super.forward()` inside a module's forward call. If you use `super.forward()`, you might receive the following error message. 

```
RecursionError: maximum recursion depth exceeded
```

To fix the error, instead of calling `super.forward()`, you should call `super()._orig_forward()`. 

# Distributed computing with SageMaker AI best practices
<a name="distributed-training-options"></a>

This best practices page presents various flavors of distributed computing for machine learning (ML) jobs in general. The term *distributed computing* on this page encompasses distributed training for machine learning tasks and parallel computing for data processing, data generation, feature engineering, and reinforcement learning. On this page, we discuss common challenges in distributed computing and the options available in SageMaker Training and SageMaker Processing. For additional reading about distributed computing, see [What Is Distributed Computing?](https://aws.amazon.com/what-is/distributed-computing/).

You can configure ML tasks to run in a distributed manner across multiple nodes (instances), accelerators (NVIDIA GPUs, AWS Trainium chips), and vCPU cores. By running distributed computation, you can achieve a variety of goals such as computing operations faster, handling large datasets, or training large ML models.

The following list covers common challenges that you might face when you run an ML training job at scale.
+ You need to make decisions on how to distribute computation depending on the ML task, the software libraries you want to use, and the compute resources.
+ Not all ML tasks are straightforward to distribute. Also, not all ML libraries support distributed computation.
+ Distributed computation might not always result in a linear increase in compute efficiency. In particular, you need to identify whether data I/O and inter-GPU communication create bottlenecks or overhead.
+ Distributed computation might disturb numerical processes and change model accuracy. Specifically, in data-parallel neural network training, when you change the global batch size while scaling up to a larger compute cluster, you also need to adjust the learning rate accordingly.

SageMaker AI provides distributed training solutions to ease such challenges for various use cases. Choose one of the following options that best fits your use case.

**Topics**
+ [Option 1: Use a SageMaker AI built-in algorithm that supports distributed training](#distributed-training-options-1)
+ [Option 2: Run custom ML code in the SageMaker AI managed training or processing environment](#distributed-training-options-2)
+ [Option 3: Write your own custom distributed training code](#distributed-training-options-3)
+ [Option 4: Launch multiple jobs in parallel or sequentially](#distributed-training-options-4)

## Option 1: Use a SageMaker AI built-in algorithm that supports distributed training
<a name="distributed-training-options-1"></a>

SageMaker AI provides [built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html) that you can use out of the box through the SageMaker AI console or the SageMaker Python SDK. With the built-in algorithms, you don't need to spend time on code customization, understanding the science behind the models, or running Docker on provisioned Amazon EC2 instances. 

A subset of the SageMaker AI built-in algorithms support distributed training. To check if the algorithm of your choice supports distributed training, see the **Parallelizable** column in the [Common Information About Built-in Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/common-info-all-im-models.html) table. Some of the algorithms support multi-instance distributed training, while the rest of the parallelizable algorithms support parallelization across multiple GPUs in a single instance, as indicated in the **Parallelizable** column.

## Option 2: Run custom ML code in the SageMaker AI managed training or processing environment
<a name="distributed-training-options-2"></a>

SageMaker AI jobs can instantiate a distributed training environment for specific use cases and frameworks. This environment acts as a ready-to-use workspace, where you can bring and run your own ML code. 

### If your ML code uses a deep learning framework
<a name="distributed-training-options-2-1"></a>

You can launch distributed training jobs using the [Deep Learning Containers (DLC)](https://github.com/aws/deep-learning-containers) for SageMaker Training, which you can orchestrate either through the dedicated Python modules in the [SageMaker AI Python SDK](http://sagemaker.readthedocs.io/), or through the SageMaker APIs with the [AWS CLI](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/index.html) or the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html). SageMaker AI provides training containers for machine learning frameworks, including [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/index.html), [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/index.html), [Hugging Face Transformers](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html), and [Apache MXNet](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/index.html). You have two options to write deep learning code for distributed training.
+ **The SageMaker AI distributed training libraries**

  The SageMaker AI distributed training libraries provide AWS-managed code for neural network data parallelism and model parallelism. SageMaker AI distributed training also comes with launcher clients built into the SageMaker Python SDK, so you don't need to author parallel launch code. To learn more, see [SageMaker AI's data parallelism library](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html) and [SageMaker AI's model parallelism library](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).
+ **Open-source distributed training libraries** 

  Open source frameworks have their own distribution mechanisms, such as [DistributedDataParallel (DDP) in PyTorch](https://pytorch.org/docs/stable/notes/ddp.html) or the `tf.distribute` modules in TensorFlow. You can choose to run these distributed training frameworks in the SageMaker AI-managed framework containers. For example, the sample code for [training MaskRCNN in SageMaker AI](https://github.com/aws-samples/amazon-sagemaker-cv) shows how to use both PyTorch DDP in the SageMaker AI PyTorch framework container and [Horovod](https://horovod.readthedocs.io/en/stable/) in the SageMaker TensorFlow framework container.

SageMaker AI ML containers also come with [MPI](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/mpi_on_sagemaker/intro/mpi_demo.ipynb) preinstalled, so you can parallelize your entry point script using [mpi4py](https://mpi4py.readthedocs.io/en/stable/). Using the MPI-integrated training containers is a good option when you launch a third-party distributed training launcher or write ad-hoc parallel code in the SageMaker AI managed training environment.

*Notes for data-parallel neural network training on GPUs*
+ **Scale to multi-GPU and multi-machine parallelism when appropriate**

  We often run neural network training jobs on multi-CPU or multi-GPU instances. Each GPU-based instance usually contains multiple GPU devices. Consequently, distributed GPU computing can happen either within a single GPU instance with multiple GPUs (single-node multi-GPU training) or across multiple GPU instances with multiple GPU devices in each (multi-node multi-GPU training). Single-instance training is easier to code and debug, and intra-node GPU-to-GPU throughput is usually faster than inter-node GPU-to-GPU throughput. Therefore, it is a good idea to scale data parallelism vertically first (use one GPU instance with multiple GPUs) and expand to multiple GPU instances if needed. This might not apply to cases where the CPU budget is high (for example, a massive workload for data pre-processing) and when the CPU-to-GPU ratio of a multi-GPU instance is too low. In all cases, you need to experiment with different combinations of instance types based on your own ML training needs and workload. 
+ **Monitor the quality of convergence**

  When training a neural network with data parallelism, increasing the number of GPUs while keeping the mini-batch size per GPU constant increases the size of the global mini-batch for the mini-batch stochastic gradient descent (MSGD) process. The size of the mini-batches for MSGD is known to affect the descent noise and convergence. To scale properly while preserving accuracy, you need to adjust other hyperparameters such as the learning rate [[Goyal et al.](https://arxiv.org/abs/1706.02677) (2017)].
+ **Monitor I/O bottlenecks**

  As you increase the number of GPUs, the throughput for reading and writing storage should also increase. Make sure that your data source and pipeline don’t become bottlenecks.
+ **Modify your training script as needed**

  Training scripts written for single-GPU training must be modified for multi-node multi-GPU training. In most data parallelism libraries, script modification is required to do the following.
  + Assign batches of training data to each GPU.
  + Use an optimizer that can deal with gradient computation and parameter updates across multiple GPUs.
  + Assign responsibility of checkpointing to a specific host and GPU. 
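The learning-rate adjustment described under *Monitor the quality of convergence* is commonly done with the linear scaling rule from Goyal et al. (2017). The following is a minimal sketch; the helper name and base values are illustrative, and other schemes (such as warmup) are often combined with it.

```python
def scaled_lr(base_lr, base_num_gpus, num_gpus):
    """Linear scaling rule (Goyal et al., 2017): when the global mini-batch
    grows k times because you use k times more GPUs (per-GPU batch size
    fixed), multiply the base learning rate by k."""
    return base_lr * num_gpus / base_num_gpus

# Example: a base learning rate of 0.1 tuned on 8 GPUs scales to 0.8
# when training with 64 GPUs at the same per-GPU batch size.
lr = scaled_lr(base_lr=0.1, base_num_gpus=8, num_gpus=64)
```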

   

### If your ML code involves tabular data processing
<a name="distributed-training-options-2-2"></a>

PySpark is a Python frontend of Apache Spark, which is an open-source distributed computing framework. PySpark has been widely adopted for distributed tabular data processing for large-scale production workloads. If you want to run tabular data processing code, consider using the [SageMaker Processing PySpark containers](https://docs.aws.amazon.com/sagemaker/latest/dg/use-spark-processing-container.html) and running parallel jobs. You can also run data processing jobs in parallel using SageMaker Training and SageMaker Processing APIs in Amazon SageMaker Studio Classic, which is integrated with [Amazon EMR](https://aws.amazon.com/blogs/machine-learning/part-1-create-and-manage-amazon-emr-clusters-from-sagemaker-studio-to-run-interactive-spark-and-ml-workloads/) and [AWS Glue](https://aws.amazon.com/about-aws/whats-new/2022/09/sagemaker-studio-supports-glue-interactive-sessions/?nc1=h_ls).

## Option 3: Write your own custom distributed training code
<a name="distributed-training-options-3"></a>

When you submit a training or processing job to SageMaker AI, the SageMaker Training and SageMaker AI Processing APIs launch Amazon EC2 compute instances. You can customize the training and processing environment on these instances by running your own Docker container or by installing additional libraries in the AWS managed containers. For more information about Docker with SageMaker Training, see [Adapting your own Docker container to work with SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-adapt-your-own.html) and [Create a container with your own algorithms and models](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-create.html). For more information about Docker with SageMaker AI Processing, see [Use Your Own Processing Code](https://docs.aws.amazon.com/sagemaker/latest/dg/use-your-own-processing-code.html).

Every SageMaker training job environment contains a configuration file at `/opt/ml/input/config/resourceconfig.json`, and every SageMaker Processing job environment contains a similar configuration file at `/opt/ml/config/resourceconfig.json`. Your code can read this file to find the host names and establish inter-node communication. To learn more, including the schema of the JSON file, see [Distributed Training Configuration](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html#your-algorithms-training-algo-running-container-dist-training) and [How Amazon SageMaker Processing Configures Your Processing Container](https://docs.aws.amazon.com/sagemaker/latest/dg/build-your-own-processing-container.html#byoc-config). You can also install and use third-party distributed computing libraries such as [Ray](https://github.com/aws-samples/aws-samples-for-ray/tree/main/sagemaker) or DeepSpeed in SageMaker AI.
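
For example, a training script might derive a stable rank for its node from this file as follows. This is a minimal sketch that relies on the documented `current_host` and `hosts` fields; check the schema links above for the full file contents.

```python
import json

def read_cluster_info(path="/opt/ml/input/config/resourceconfig.json"):
    # Read the resource configuration that SageMaker writes into every
    # training container, and derive this node's rank from the sorted
    # host list (e.g. ["algo-1", "algo-2", ...]).
    with open(path) as f:
        cfg = json.load(f)
    hosts = sorted(cfg["hosts"])
    rank = hosts.index(cfg["current_host"])
    return rank, len(hosts), hosts
```

A script can then, for example, let the rank-0 host act as the coordinator while the remaining hosts connect to it as workers.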

You can also use SageMaker Training and SageMaker Processing to run custom distributed computations that do not require inter-worker communication. In the computing literature, these tasks are often described as *embarrassingly parallel* or *share-nothing*. Examples include parallel processing of data files, training models in parallel with different configurations, and running batch inference on a collection of records. You can trivially parallelize such share-nothing use cases with Amazon SageMaker AI. When you launch a SageMaker Training or SageMaker Processing job on a cluster with multiple nodes, SageMaker AI by default replicates and launches your training code (in Python or Docker) on all the nodes. For tasks that require spreading the input data across the nodes, set `S3DataDistributionType=ShardedByS3Key` in the data input configuration of the SageMaker AI `TrainingInput` API.
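
Conceptually, `ShardedByS3Key` gives each node a disjoint subset of the S3 objects, instead of the full copy that the default `FullyReplicated` mode provides. The following pure-Python sketch illustrates one possible key-based split; the actual assignment is performed by SageMaker and is not guaranteed to follow this round-robin order, and the bucket and key names are illustrative.

```python
def shard_by_s3_key(s3_keys, num_nodes):
    # One possible key-to-node assignment: sort keys, then deal them
    # out round-robin so every object lands on exactly one node.
    shards = [[] for _ in range(num_nodes)]
    for i, key in enumerate(sorted(s3_keys)):
        shards[i % num_nodes].append(key)
    return shards

# With 2 nodes, each node sees a disjoint subset covering all 5 objects.
keys = [f"s3://my-bucket/data/part-{i:03d}.csv" for i in range(5)]
node_inputs = shard_by_s3_key(keys, 2)
```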

## Option 4: Launch multiple jobs in parallel or sequentially
<a name="distributed-training-options-4"></a>

You can also distribute an ML compute workflow into smaller parallel or sequential compute tasks, each represented by its own SageMaker Training or SageMaker Processing job. Splitting a task into multiple jobs can be beneficial for the following situations or tasks:
+ When you have specific [data channels](https://docs.aws.amazon.com/sagemaker/latest/dg/model-train-storage.html) and metadata entries (such as hyperparameters, model configuration, or instance types) for each sub-task.
+ When you implement retry steps at the sub-task level.
+ When you vary the configuration of the sub-tasks over the course of the workload, such as when training on increasing batch sizes.
+ When you need to run an ML task that takes longer than the maximum training time allowed for a single training job (28 days maximum).
+ When different steps of a compute workflow require different instance types.

For the specific case of hyperparameter search, use [SageMaker AI Automated Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html). SageMaker AI Automated Model Tuning is a serverless parameter search orchestrator that launches multiple training jobs on your behalf, according to a search logic that can be random, Bayesian, or HyperBand.

To orchestrate multiple training jobs, you can also use workflow orchestration tools such as [SageMaker Pipelines](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-pipelines/index.html), [AWS Step Functions](https://docs.aws.amazon.com/step-functions/latest/dg/connect-sagemaker.html), and Apache Airflow through [Amazon Managed Workflows for Apache Airflow (MWAA)](https://aws.amazon.com/managed-workflows-for-apache-airflow/) and [SageMaker AI Workflows](https://sagemaker.readthedocs.io/en/stable/workflows/airflow/using_workflow.html).