

# Core Features of the SageMaker Model Parallelism Library
<a name="model-parallel-core-features"></a>

Amazon SageMaker AI's model parallelism library offers distribution strategies and memory-saving techniques, such as sharded data parallelism, tensor parallelism, model partitioning by layers for pipeline scheduling, and checkpointing. The model parallelism strategies and techniques help distribute large models across multiple devices while optimizing training speed and memory consumption. The library also provides Python helper functions, context managers, and wrapper functions to adapt your training script for automated or manual partitioning of your model.

When you implement model parallelism in your training job, you keep the same two-step workflow shown in the [Run a SageMaker Distributed Training Job with Model Parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html) section. To adapt your training script, you add few or no additional lines of code. To launch a training job with the adapted script, you set the distribution configuration parameters to activate the memory-saving features or to pass values for the degrees of parallelism.

To get started with examples, see the following Jupyter notebooks that demonstrate how to use the SageMaker model parallelism library.
+ [PyTorch example notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/main/training/distributed_training/pytorch/model_parallel)
+ [TensorFlow example notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/main/training/distributed_training/tensorflow/model_parallel/mnist)

To dive deep into the core features of the library, see the following topics.

**Note**  
The SageMaker distributed training libraries are available through the AWS deep learning containers for PyTorch, Hugging Face, and TensorFlow within the SageMaker Training platform. To use the features of the distributed training libraries, we recommend the SageMaker Python SDK. You can also manually configure the JSON request syntax if you use the SageMaker APIs through the SDK for Python (Boto3) or the AWS Command Line Interface. Throughout the documentation, instructions and examples focus on how to use the distributed training libraries with the SageMaker Python SDK.

**Important**  
The SageMaker model parallelism library supports all the core features for PyTorch, and supports pipeline parallelism for TensorFlow.

**Topics**
+ [Sharded Data Parallelism](model-parallel-extended-features-pytorch-sharded-data-parallelism.md)
+ [Pipelining a Model](model-parallel-core-features-pipieline-parallelism.md)
+ [Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism.md)
+ [Optimizer State Sharding](model-parallel-extended-features-pytorch-optimizer-state-sharding.md)
+ [Activation Checkpointing](model-parallel-extended-features-pytorch-activation-checkpointing.md)
+ [Activation Offloading](model-parallel-extended-features-pytorch-activation-offloading.md)
+ [FP16 Training with Model Parallelism](model-parallel-extended-features-pytorch-fp16.md)
+ [Support for FlashAttention](model-parallel-attention-head-size-for-flash-attention.md)

# Sharded Data Parallelism
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism"></a>

*Sharded data parallelism* is a memory-saving distributed training technique that splits the state of a model (model parameters, gradients, and optimizer states) across GPUs in a data parallel group. 

**Note**  
Sharded data parallelism is available for PyTorch in the SageMaker model parallelism library v1.11.0 and later.

When scaling up your training job to a large GPU cluster, you can reduce the per-GPU memory footprint of the model by sharding the training state of the model over multiple GPUs. This returns two benefits: you can fit larger models, which would otherwise run out of memory with standard data parallelism, or you can increase the batch size using the freed-up GPU memory.

The standard data parallelism technique replicates the training states across the GPUs in the data parallel group, and performs gradient aggregation based on the `AllReduce` operation. Sharded data parallelism modifies the standard data-parallel distributed training procedure to account for the sharded nature of the optimizer states. A group of ranks over which the model and optimizer states are sharded is called a *sharding group*. The sharded data parallelism technique shards the trainable parameters of a model and corresponding gradients and optimizer states across the GPUs in the *sharding group*.

SageMaker AI achieves sharded data parallelism through the implementation of MiCS, which is discussed in the AWS blog post [Near-linear scaling of gigantic-model training on AWS](https://www.amazon.science/blog/near-linear-scaling-of-gigantic-model-training-on-aws). In this implementation, you can set the sharding degree as a configurable parameter, which must be less than the data parallelism degree. During each forward and backward pass, MiCS temporarily recombines the model parameters in all GPUs through the `AllGather` operation. After the forward or backward pass of each layer, MiCS shards the parameters again to save GPU memory. During the backward pass, MiCS reduces gradients and simultaneously shards them across GPUs through the `ReduceScatter` operation. Finally, MiCS applies the local reduced and sharded gradients to their corresponding local parameter shards, using the local shards of optimizer states. To bring down communication overhead, the SageMaker model parallelism library prefetches the upcoming layers in the forward or backward pass, and overlaps the network communication with the computation.

The training state of the model is replicated across the sharding groups. This means that before gradients are applied to the parameters, the `AllReduce` operation must take place across the sharding groups, in addition to the `ReduceScatter` operation that takes place within the sharding group.

In effect, sharded data parallelism introduces a tradeoff between communication overhead and GPU memory efficiency. Using sharded data parallelism increases the communication cost, but the memory footprint per GPU (excluding the memory usage due to activations) is divided by the sharded data parallelism degree, so larger models can fit in the GPU cluster.
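As a rough illustration of this division, the following sketch estimates the per-GPU training-state memory under sharding. The 16-bytes-per-parameter figure is an assumption for mixed-precision Adam training (FP16 parameters and gradients plus FP32 optimizer states), not a value defined by the library:

```
# Back-of-the-envelope estimate of per-GPU training-state memory.
# Assumes ~16 bytes per parameter for mixed-precision Adam training;
# activation memory is excluded, as in the text above.
def training_state_gb_per_gpu(num_params, sharded_degree, bytes_per_param=16):
    return num_params * bytes_per_param / sharded_degree / 1e9

# A 20-billion-parameter model needs ~320 GB of training state in total;
# sharding it 16 ways leaves ~20 GB per GPU.
per_gpu_gb = training_state_gb_per_gpu(20e9, sharded_degree=16)
```

Under these assumptions, the same model that cannot fit on a single GPU with standard data parallelism fits comfortably once the training state is divided across the sharding group.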

**Selecting the degree of sharded data parallelism**

When you select a value for the degree of sharded data parallelism, the value must evenly divide the degree of data parallelism. For example, for an 8-way data parallelism job, choose 2, 4, or 8 for the sharded data parallelism degree. While choosing the sharded data parallelism degree, we recommend that you start with a small number, and gradually increase it until the model fits in the memory together with the desired batch size.
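The divisibility rule can be captured in a small helper (illustrative only, not part of the SMP API):

```
# Valid sharded data parallelism degrees are the divisors of the
# data parallelism degree (illustrative helper, not an SMP function).
def valid_sharding_degrees(data_parallel_degree):
    return [d for d in range(1, data_parallel_degree + 1)
            if data_parallel_degree % d == 0]

# For an 8-way data parallelism job, the valid choices are 1, 2, 4, and 8.
choices = valid_sharding_degrees(8)
```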

**Selecting the batch size**

After setting up sharded data parallelism, make sure you find the most optimal training configuration that can successfully run on the GPU cluster. For training large language models (LLMs), start with a batch size of 1 and gradually increase it until you hit an out-of-memory (OOM) error. If you encounter an OOM error even with the smallest batch size, apply a higher degree of sharded data parallelism, or a combination of sharded data parallelism and tensor parallelism.
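This trial-and-error search can be sketched as follows; `try_one_step` is a hypothetical callable that runs one training step at a given batch size and reports whether it fit in GPU memory:

```
# Illustrative batch-size search: start at 1 and double until OOM,
# then keep the last size that fit. `try_one_step` stands in for
# running a single training step and catching an out-of-memory error.
def find_max_batch_size(try_one_step, limit=4096):
    best = None
    batch_size = 1
    while batch_size <= limit and try_one_step(batch_size):
        best = batch_size
        batch_size *= 2
    return best  # None means even batch size 1 went OOM

# Example: a setup that fits batch sizes up to 12 settles on 8.
max_bs = find_max_batch_size(lambda b: b <= 12)
```

If the search returns `None`, that corresponds to the OOM-at-batch-size-1 case above, where a higher sharding degree or tensor parallelism is needed.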

**Topics**
+ [How to apply sharded data parallelism to your training job](#model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use)
+ [Reference configurations](#model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use-config-sample)
+ [Sharded data parallelism with SMDDP Collectives](#model-parallel-extended-features-pytorch-sharded-data-parallelism-smddp-collectives)
+ [Mixed precision training with sharded data parallelism](#model-parallel-extended-features-pytorch-sharded-data-parallelism-16bits-training)
+ [Sharded data parallelism with tensor parallelism](#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism)
+ [Tips and considerations for using sharded data parallelism](#model-parallel-extended-features-pytorch-sharded-data-parallelism-considerations)

## How to apply sharded data parallelism to your training job
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use"></a>

To get started with sharded data parallelism, apply the required modifications to your training script, and set up the SageMaker PyTorch estimator with the sharded-data-parallelism-specific parameters. You can also use the reference configurations and example notebooks as a starting point.

### Adapt your PyTorch training script
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use-modify-script"></a>

Follow the instructions at [Step 1: Modify a PyTorch Training Script](model-parallel-customize-training-script-pt.md) to wrap the model and optimizer objects with the `smdistributed.modelparallel.torch` wrappers of the `torch.nn.parallel` and `torch.distributed` modules.

**(Optional) Additional modification to register external model parameters**

If your model is built with `torch.nn.Module` and uses parameters that are not defined within the module class, register them with the module manually so that SMP can gather the full parameters during the forward and backward pass. To register parameters to a module, use `smp.register_parameter(module, parameter)`.

```
import torch
import smdistributed.modelparallel.torch as smp

class Module(torch.nn.Module):
    def __init__(self, *args):
        super().__init__(*args)
        self.layer1 = Layer1()
        self.layer2 = Layer2()
        # Register the externally used parameter with this module
        smp.register_parameter(self, self.layer1.weight)

    def forward(self, input):
        x = self.layer1(input)
        # self.layer1.weight is required by self.layer2.forward
        y = self.layer2(x, self.layer1.weight)
        return y
```

### Set up the SageMaker PyTorch estimator
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use-set-estimator"></a>

When configuring a SageMaker PyTorch estimator in [Step 2: Launch a Training Job Using the SageMaker Python SDK](model-parallel-sm-sdk.md), add the parameters for sharded data parallelism. 

To turn on sharded data parallelism, add the `sharded_data_parallel_degree` parameter to the SageMaker PyTorch estimator. This parameter specifies the number of GPUs over which the training state is sharded. The value for `sharded_data_parallel_degree` must be an integer between one and the data parallelism degree and must evenly divide the data parallelism degree. Note that the library automatically detects the number of GPUs and, therefore, the data parallelism degree. The following additional parameters are available for configuring sharded data parallelism.
+ `"sdp_reduce_bucket_size"` *(int, default: 5e8)* – Specifies the size of [PyTorch DDP gradient buckets](https://pytorch.org/docs/stable/notes/ddp.html#internal-design) in number of elements of the default dtype.
+ `"sdp_param_persistence_threshold"` *(int, default: 1e6)* – Specifies the size of a parameter tensor in number of elements that can persist at each GPU. Sharded data parallelism splits each parameter tensor across GPUs of a data parallel group. If the number of elements in the parameter tensor is smaller than this threshold, the parameter tensor is not split; this helps reduce communication overhead because the parameter tensor is replicated across data-parallel GPUs.
+ `"sdp_max_live_parameters"` *(int, default: 1e9)* – Specifies the maximum number of parameters that can simultaneously be in a recombined training state during the forward and backward pass. Parameter fetching with the `AllGather` operation pauses when the number of active parameters reaches the given threshold. Note that increasing this parameter increases the memory footprint.
+ `"sdp_hierarchical_allgather"` *(bool, default: True)* – If set to `True`, the `AllGather` operation runs hierarchically: it runs within each node first, and then runs across nodes. For multi-node distributed training jobs, the hierarchical `AllGather` operation is automatically activated.
+ `"sdp_gradient_clipping"` *(float, default: 1.0)* – Specifies a threshold for clipping the L2 norm of the gradients before propagating them backward through the model parameters. When sharded data parallelism is activated, gradient clipping is also activated. The default threshold is `1.0`. Adjust this parameter if you encounter the exploding gradients problem.

The following code shows an example of how to configure sharded data parallelism.

```
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        # "pipeline_parallel_degree": 1,    # Optional, default is 1
        # "tensor_parallel_degree": 1,      # Optional, default is 1
        "ddp": True,
        # parameters for sharded data parallelism
        "sharded_data_parallel_degree": 2,              # Add this to activate sharded data parallelism
        "sdp_reduce_bucket_size": int(5e8),             # Optional
        "sdp_param_persistence_threshold": int(1e6),    # Optional
        "sdp_max_live_parameters": int(1e9),            # Optional
        "sdp_hierarchical_allgather": True,             # Optional
        "sdp_gradient_clipping": 1.0                    # Optional
    }
}

mpi_options = {
    "enabled" : True,                      # Required
    "processes_per_host" : 8               # Required
}

smp_estimator = PyTorch(
    entry_point="your_training_script.py", # Specify your train script
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='1.13.1',
    py_version='py3',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="sharded-data-parallel-job"
)

smp_estimator.fit('s3://my_bucket/my_training_data/')
```

## Reference configurations
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use-config-sample"></a>

The SageMaker distributed training team provides the following reference configurations that you can use as a starting point. You can extrapolate from the following configurations to experiment and estimate the GPU memory usage for your model configuration. 

**Sharded data parallelism with SMDDP Collectives**


| Model (number of parameters) | Number of instances | Instance type | Sequence length | Global batch size | Mini batch size | Sharded data parallel degree | 
| --- | --- | --- | --- | --- | --- | --- | 
| GPT-NEOX-20B | 2 | ml.p4d.24xlarge | 2048 | 64 | 4 | 16 | 
| GPT-NEOX-20B | 8 | ml.p4d.24xlarge | 2048 | 768 | 12 | 32 | 

For example, if you increase the sequence length for a 20-billion-parameter model or increase the size of the model to 65 billion parameters, you need to try reducing the batch size first. If the model still doesn’t fit with the smallest batch size (the batch size of 1), try increasing the degree of model parallelism.

**Sharded data parallelism with tensor parallelism and NCCL Collectives**


| Model (number of parameters) | Number of instances | Instance type | Sequence length | Global batch size | Mini batch size | Sharded data parallel degree | Tensor parallel degree | Activation offloading | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| GPT-NEOX-65B | 64 | ml.p4d.24xlarge | 2048 | 512 | 8 | 16 | 8 | Y | 
| GPT-NEOX-65B | 64 | ml.p4d.24xlarge | 4096 | 512 | 2 | 64 | 2 | Y | 

The combined use of sharded data parallelism and tensor parallelism is useful when you want to fit a large language model (LLM) into a large-scale cluster while training on text data with a longer sequence length. A longer sequence length requires a smaller batch size, and combining the two techniques helps keep the GPU memory usage manageable when training LLMs on longer text sequences. To learn more, see [Sharded data parallelism with tensor parallelism](#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism).

For case studies, benchmarks, and more configuration examples, see the blog post [New performance improvements in Amazon SageMaker AI model parallel library](https://aws.amazon.com/blogs/machine-learning/new-performance-improvements-in-amazon-sagemaker-model-parallel-library/).

## Sharded data parallelism with SMDDP Collectives
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-smddp-collectives"></a>

The SageMaker data parallelism library offers collective communication primitives (SMDDP collectives) optimized for the AWS infrastructure. It achieves this optimization by adopting an all-to-all-type communication pattern using [Elastic Fabric Adapter (EFA)](https://aws.amazon.com/hpc/efa/), which results in high-throughput, less latency-sensitive collectives, offloads the communication-related processing to the CPU, and frees up GPU cycles for computation. On large clusters, SMDDP Collectives can improve distributed training performance by up to 40% compared to NCCL. For case studies and benchmark results, see the blog [New performance improvements in the Amazon SageMaker AI model parallelism library](https://aws.amazon.com/blogs/machine-learning/new-performance-improvements-in-amazon-sagemaker-model-parallel-library/).

**Note**  
Sharded data parallelism with SMDDP Collectives is available in the SageMaker model parallelism library v1.13.0 and later, and the SageMaker data parallelism library v1.6.0 and later. See also [Supported configurations](#sharded-data-parallelism-smddp-collectives-supported-config) to use sharded data parallelism with SMDDP Collectives.

In sharded data parallelism, which is a commonly used technique in large-scale distributed training, the `AllGather` collective is used to reconstitute the sharded layer parameters for forward and backward pass computations, in parallel with GPU computation. For large models, performing the `AllGather` operation efficiently is critical to avoid GPU bottlenecks and slower training. When sharded data parallelism is activated, SMDDP Collectives replace these performance-critical `AllGather` collectives, improving training throughput.

**Train with SMDDP Collectives**

When your training job has sharded data parallelism activated and meets the [Supported configurations](#sharded-data-parallelism-smddp-collectives-supported-config), SMDDP Collectives are automatically activated. Internally, SMDDP Collectives optimize the `AllGather` collective to be performant on the AWS infrastructure and fall back to NCCL for all other collectives. Furthermore, under unsupported configurations, all collectives, including `AllGather`, automatically use the NCCL backend.

Since the SageMaker model parallelism library v1.13.0, the `"ddp_dist_backend"` parameter is available in the `modelparallel` options. The default value for this configuration parameter is `"auto"`, which uses SMDDP Collectives whenever possible and falls back to NCCL otherwise. To force the library to always use NCCL, specify `"nccl"` for the `"ddp_dist_backend"` configuration parameter.

The following code example shows how to set up a PyTorch estimator using the sharded data parallelism with the `"ddp_dist_backend"` parameter, which is set to `"auto"` by default and, therefore, optional to add. 

```
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled":True,
    "parameters": {                        
        "partitions": 1,
        "ddp": True,
        "sharded_data_parallel_degree": 64,
        "bf16": True,
        "ddp_dist_backend": "auto"  # Specify "nccl" to force to use NCCL.
    }
}

mpi_options = {
    "enabled" : True,                      # Required
    "processes_per_host" : 8               # Required
}

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=8,
    instance_type='ml.p4d.24xlarge',
    framework_version='1.13.1',
    py_version='py3',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="sharded-data-parallel-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

**Supported configurations**

The `AllGather` operation with SMDDP Collectives is activated in training jobs when all of the following configuration requirements are met.
+ The sharded data parallelism degree is greater than 1
+ `instance_count` is greater than 1
+ `instance_type` is `ml.p4d.24xlarge`
+ SageMaker training container for PyTorch v1.12.1 or later
+ The SageMaker data parallelism library v1.6.0 or later
+ The SageMaker model parallelism library v1.13.0 or later

**Performance and memory tuning**

SMDDP Collectives utilize additional GPU memory. There are two environment variables to configure the GPU memory usage depending on different model training use cases.
+ `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES` – During the SMDDP `AllGather` operation, the `AllGather` input buffer is copied into a temporary buffer for inter-node communication. The `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES` variable controls the size (in bytes) of this temporary buffer. If the size of the temporary buffer is smaller than the `AllGather` input buffer size, the `AllGather` collective falls back to use NCCL.
  + Default value: 16 * 1024 * 1024 (16 MB)
  + Acceptable values: any multiple of 8192
+ `SMDDP_AG_SORT_BUFFER_SIZE_BYTES` – The `SMDDP_AG_SORT_BUFFER_SIZE_BYTES` variable controls the size (in bytes) of the temporary buffer that holds data gathered from inter-node communication. If the size of this temporary buffer is smaller than `1/8 * sharded_data_parallel_degree * AllGather input size`, the `AllGather` collective falls back to use NCCL.
  + Default value: 128 * 1024 * 1024 (128 MB)
  + Acceptable values: any multiple of 8192
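The two fallback conditions above can be summarized in a small illustrative check; this mirrors the documented thresholds, not SMDDP's internal logic:

```
# Illustrative check of the documented fallback conditions: AllGather
# falls back to NCCL when either temporary buffer is too small for
# the data it must hold. Defaults match the documented default values.
def falls_back_to_nccl(allgather_input_bytes, sharded_degree,
                       scratch_buffer_bytes=16 * 1024 * 1024,
                       sort_buffer_bytes=128 * 1024 * 1024):
    # Scratch buffer must fit the AllGather input buffer.
    if scratch_buffer_bytes < allgather_input_bytes:
        return True
    # Sort buffer must fit 1/8 * sharded_data_parallel_degree * input size.
    if sort_buffer_bytes < sharded_degree * allgather_input_bytes / 8:
        return True
    return False
```

For example, with the default buffer sizes, an 8 MB `AllGather` input at sharding degree 64 stays on SMDDP Collectives, while a 32 MB input exceeds the 16 MB scratch buffer and falls back to NCCL.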

**Tuning guidance on the buffer size variables**

The default values for the environment variables should work well for most use cases. We recommend tuning these variables only if training runs into the out-of-memory (OOM) error. 

The following list discusses some tuning tips to reduce the GPU memory footprint of SMDDP Collectives while retaining the performance gain from them.
+ Tuning `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES`
  + The `AllGather` input buffer size is smaller for smaller models. Hence, the required size for `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES` can be smaller for models with fewer parameters.
  + The `AllGather` input buffer size decreases as `sharded_data_parallel_degree` increases, because the model gets sharded across more GPUs. Hence, the required size for `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES` can be smaller for training jobs with large values for `sharded_data_parallel_degree`.
+ Tuning `SMDDP_AG_SORT_BUFFER_SIZE_BYTES`
  + The amount of data gathered from inter-node communication is less for models with fewer parameters. Hence, the required size for `SMDDP_AG_SORT_BUFFER_SIZE_BYTES` can be smaller for such models.

If you reduce these buffer sizes too much, some collectives might fall back to NCCL, and you might not get the performance gain from the optimized SMDDP collectives. If additional GPU memory is available, consider increasing the values of `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES` and `SMDDP_AG_SORT_BUFFER_SIZE_BYTES` to benefit from the performance gain.

The following code shows how you can configure the environment variables by appending them to `mpi_options` in the distribution parameter for the PyTorch estimator.

```
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    .... # All modelparallel configuration options go here
}

mpi_options = {
    "enabled" : True,                      # Required
    "processes_per_host" : 8,              # Required
    # Use the following line to tune the values of the environment variables for the buffers
    "custom_mpi_options" : "-x SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES=8192 -x SMDDP_AG_SORT_BUFFER_SIZE_BYTES=8192"
}

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=8,
    instance_type='ml.p4d.24xlarge',
    framework_version='1.13.1',
    py_version='py3',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="sharded-data-parallel-demo-with-tuning",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

## Mixed precision training with sharded data parallelism
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-16bits-training"></a>

To further save GPU memory with half-precision floating point numbers and sharded data parallelism, you can activate 16-bit floating point format (FP16) or [Brain floating point format](https://en.wikichip.org/wiki/brain_floating-point_format) (BF16) by adding one additional parameter to the distributed training configuration.

**Note**  
Mixed precision training with sharded data parallelism is available in the SageMaker model parallelism library v1.11.0 and later.

**For FP16 Training with Sharded Data Parallelism**

To run FP16 training with sharded data parallelism, add `"fp16": True` to the `smp_options` configuration dictionary. In your training script, you can choose between the static and dynamic loss scaling options through the `smp.DistributedOptimizer` module. For more information, see [FP16 Training with Model Parallelism](model-parallel-extended-features-pytorch-fp16.md).

```
smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "sharded_data_parallel_degree": 2,
        "fp16": True
    }
}
```

**For BF16 Training with Sharded Data Parallelism**

The sharded data parallelism feature of SageMaker AI supports training in the BF16 data type. The BF16 data type uses 8 bits to represent the exponent of a floating point number, while the FP16 data type uses 5 bits. Preserving the 8 bits for the exponent allows BF16 to keep the same representation range for the exponent as a 32-bit single-precision floating point (FP32) number. This makes the conversion between FP32 and BF16 simpler and significantly less prone to the overflow and underflow issues that often arise in FP16 training, especially when training larger models. While both data types use 16 bits in total, this increased representation range for the exponent in the BF16 format comes at the expense of reduced precision. For training large models, this reduced precision is often considered an acceptable trade-off for the range and training stability.

**Note**  
Currently, BF16 training works only when sharded data parallelism is activated.

To run BF16 training with sharded data parallelism, add `"bf16": True` to the `smp_options` configuration dictionary.

```
smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "sharded_data_parallel_degree": 2,
        "bf16": True
    }
}
```

## Sharded data parallelism with tensor parallelism
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism"></a>

If you use sharded data parallelism and also need to reduce the global batch size, consider using [tensor parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html) with sharded data parallelism. When training a large model with sharded data parallelism on a very large compute cluster (typically 128 nodes or beyond), even a small batch size per GPU results in a very large global batch size. This might lead to convergence issues or low computational performance. Reducing the batch size per GPU is sometimes not possible with sharded data parallelism alone when it is already at its minimum and cannot be reduced further. In such cases, using sharded data parallelism in combination with tensor parallelism helps reduce the global batch size.

Choosing the optimal sharded data parallel and tensor parallel degrees depends on the scale of the model, the instance type, and the global batch size that is reasonable for the model to converge. We recommend that you start from a low tensor parallel degree to fit the global batch size into the compute cluster to resolve CUDA out-of-memory errors and achieve the best performance. See the following two example cases to learn how the combination of tensor parallelism and sharded data parallelism helps you adjust the global batch size by grouping GPUs for model parallelism, resulting in a lower number of model replicas and a smaller global batch size.

**Note**  
This feature is available from the SageMaker model parallelism library v1.15, and supports PyTorch v1.13.1.

**Note**  
This feature is available for the models supported by the tensor parallelism functionality of the library. To find the list of supported models, see [Support for Hugging Face Transformer Models](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-hugging-face.html). Also note that you need to pass `tensor_parallelism=True` to the `smp.model_creation` argument while modifying your training script. To learn more, see the training script [https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/train_gpt_simple.py#L793](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/train_gpt_simple.py#L793) in the *SageMaker AI Examples GitHub repository*.

### Example 1
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism-ex1"></a>

Assume that we want to train a model over a cluster of 1536 GPUs (192 nodes with 8 GPUs in each), setting the degree of sharded data parallelism to 32 (`sharded_data_parallel_degree=32`) and the batch size per GPU to 1, where each batch has a sequence length of 4096 tokens. In this case, there are 1536 model replicas, the global batch size becomes 1536, and each global batch contains about 6 million tokens. 

```
(1536 GPUs) * (1 batch per GPU) = (1536 global batches)
(1536 batches) * (4096 tokens per batch) = (6,291,456 tokens)
```

Adding tensor parallelism can lower the global batch size. For example, you can set the tensor parallel degree to 8 and the batch size per GPU to 4. This forms 192 tensor parallel groups, or 192 model replicas, where each model replica is distributed across 8 GPUs. The batch size of 4 is the amount of training data per iteration and per tensor parallel group; that is, each model replica consumes 4 batches per iteration. In this case, the global batch size becomes 768, and each global batch contains about 3 million tokens. Hence, the global batch size is reduced by half compared to the previous case with sharded data parallelism only.

```
(1536 GPUs) / (8 tensor parallel degree) = (192 tensor parallelism groups)
(192 tensor parallelism groups) * (4 batches per tensor parallelism group) = (768 global batches)
(768 batches) * (4096 tokens per batch) = (3,145,728 tokens)
```

### Example 2
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism-ex2"></a>

When both sharded data parallelism and tensor parallelism are activated, the library first applies tensor parallelism and shards the model across this dimension. For each tensor parallel rank, the data parallelism is applied as per `sharded_data_parallel_degree`.

For example, assume a cluster of 32 GPUs with a tensor parallel degree of 4 (forming groups of 4 GPUs) and a sharded data parallel degree of 4, which results in a replication degree of 2. The assignment creates eight GPU groups based on the tensor parallel degree as follows: `(0,1,2,3)`, `(4,5,6,7)`, `(8,9,10,11)`, `(12,13,14,15)`, `(16,17,18,19)`, `(20,21,22,23)`, `(24,25,26,27)`, `(28,29,30,31)`. That is, four GPUs form one tensor parallel group. In this case, the reduced data parallel group for the 0th rank GPUs of the tensor parallel groups would be `(0,4,8,12,16,20,24,28)`. The reduced data parallel group is sharded based on the sharded data parallel degree of 4, resulting in two replication groups for data parallelism. GPUs `(0,4,8,12)` form one sharding group, which collectively holds a complete copy of all parameters for the 0th tensor parallel rank, and GPUs `(16,20,24,28)` form another such group. Other tensor parallel ranks have similar sharding and replication groups.
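This rank assignment can be reproduced with plain Python. The following is an illustrative reconstruction of the example, assuming global ranks are laid out tensor-parallel-major as described above:

```
# Illustrative reconstruction of the grouping in this example
# (not the library's internal implementation).
def build_groups(num_gpus, tp_degree, sdp_degree):
    # Consecutive ranks form tensor parallel groups.
    tp_groups = [list(range(i, i + tp_degree))
                 for i in range(0, num_gpus, tp_degree)]
    # Reduced data parallel group for tensor parallel rank 0:
    # the 0th rank of every tensor parallel group.
    rdp_group_rank0 = list(range(0, num_gpus, tp_degree))
    # Shard the reduced group into sharding groups of size sdp_degree;
    # the resulting groups replicate each other.
    sharding_groups = [rdp_group_rank0[i:i + sdp_degree]
                       for i in range(0, len(rdp_group_rank0), sdp_degree)]
    return tp_groups, rdp_group_rank0, sharding_groups

tp_groups, rdp_rank0, sharding_groups = build_groups(32, tp_degree=4, sdp_degree=4)
```

Running this reproduces the groups listed above: eight tensor parallel groups of four GPUs, the reduced data parallel group `(0,4,8,12,16,20,24,28)` for tensor parallel rank 0, and the two sharding groups `(0,4,8,12)` and `(16,20,24,28)`.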

![\[Figure 1: Tensor parallelism groups.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/sdp_tp_group_tp.jpg)


Figure 1: Tensor parallelism groups for (nodes, sharded data parallel degree, tensor parallel degree) = (4, 4, 4), where each rectangle represents a GPU with indices from 0 to 31. The GPUs form tensor parallelism groups from TPG0 to TPG7. Replication groups are ({TPG0, TPG4}, {TPG1, TPG5}, {TPG2, TPG6}, and {TPG3, TPG7}); each replication group pair shares the same color but is filled differently.

![\[Figure 2: Sharded data parallelism groups.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/sdp_tp_group_sdp.jpg)


Figure 2: Sharded data parallelism groups for (nodes, sharded data parallel degree, tensor parallel degree) = (4, 4, 4), where each rectangle represents a GPU with indices from 0 to 31. The GPUs form sharded data parallelism groups from SDPG0 to SDPG7. Replication groups are ({SDPG0, SDPG4}, {SDPG1, SDPG5}, {SDPG2, SDPG6}, and {SDPG3, SDPG7}); each replication group pair shares the same color but is filled differently.

### How to activate sharded data parallelism with tensor parallelism
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism-activate"></a>

To use sharded data parallelism with tensor parallelism, you need to set both `sharded_data_parallel_degree` and `tensor_parallel_degree` in the configuration for `distribution` while creating an object of the SageMaker PyTorch estimator class. 

You also need to activate `prescaled_batch`. This means that, instead of each GPU reading its own batch of data, each tensor parallel group collectively reads a combined batch of the chosen batch size. Effectively, instead of dividing the dataset into parts equal to the number of GPUs (or data parallel size, `smp.dp_size()`), it divides into parts equal to the number of GPUs divided by `tensor_parallel_degree` (also called reduced data parallel size, `smp.rdp_size()`). For more details on prescaled batch, see [Prescaled Batch](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#prescaled-batch) in the *SageMaker Python SDK documentation*. See also the example training script [https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/train_gpt_simple.py#L164](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/train_gpt_simple.py#L164) for GPT-2 in the *SageMaker AI Examples GitHub repository*.
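As a rough illustration of what `prescaled_batch` changes: with it off, the dataset splits into `dp_size` shards (one per GPU); with it on, it splits into `rdp_size` shards (one per tensor parallel group). The numbers and helper below are hypothetical, not library calls.

```python
world_size = 32                      # total GPUs (illustrative)
tp_degree = 4                        # tensor_parallel_degree

dp_size = world_size                 # analogous to smp.dp_size()
rdp_size = world_size // tp_degree   # analogous to smp.rdp_size()

def num_dataset_shards(prescaled_batch):
    # With prescaled_batch, each tensor parallel group collectively reads
    # one combined batch, so the dataset splits into rdp_size parts
    # instead of dp_size parts.
    return rdp_size if prescaled_batch else dp_size

print(num_dataset_shards(False))  # 32
print(num_dataset_shards(True))   # 8
```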

The following code snippet shows an example of creating a PyTorch estimator object based on the aforementioned scenario in [Example 2](#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism-ex2).

```
mpi_options = "-verbose --mca orte_base_help_aggregate 0 "
smp_parameters = {
    "ddp": True,
    "fp16": True,
    "prescaled_batch": True,
    "sharded_data_parallel_degree": 4,
    "tensor_parallel_degree": 4
}

pytorch_estimator = PyTorch(
    entry_point="your_training_script.py",
    role=role,
    instance_type="ml.p4d.24xlarge",
    volume_size=200,
    instance_count=4,
    sagemaker_session=sagemaker_session,
    py_version="py3",
    framework_version="1.13.1",
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled": True, 
                "parameters": smp_parameters,
            }
        },
        "mpi": {
            "enabled": True,
            "processes_per_host": 8,
            "custom_mpi_options": mpi_options,
        },
    },
    source_dir="source_directory_of_your_code",
    output_path=s3_output_location
)
```

## Tips and considerations for using sharded data parallelism
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-considerations"></a>

Consider the following when using the SageMaker model parallelism library's sharded data parallelism.
+ Sharded data parallelism is compatible with FP16 training. To run FP16 training, see the [FP16 Training with Model Parallelism](model-parallel-extended-features-pytorch-fp16.md) section.
+ Sharded data parallelism is compatible with tensor parallelism. Consider the following items when using them together.
  + When using sharded data parallelism with tensor parallelism, the embedding layers are also automatically distributed across the tensor parallel group. In other words, the `distribute_embedding` parameter is automatically set to `True`. For more information about tensor parallelism, see [Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism.md).
  + Note that sharded data parallelism with tensor parallelism currently uses the NCCL collectives as the backend of the distributed training strategy.

  To learn more, see the [Sharded data parallelism with tensor parallelism](#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism) section.
+ Sharded data parallelism currently is not compatible with [pipeline parallelism](model-parallel-intro.md#model-parallel-intro-pp) or [optimizer state sharding](model-parallel-extended-features-pytorch-optimizer-state-sharding.md). To activate sharded data parallelism, turn off optimizer state sharding and set the pipeline parallel degree to 1.
+ The [activation checkpointing](model-parallel-extended-features-pytorch-activation-checkpointing.md) and [activation offloading](model-parallel-extended-features-pytorch-activation-offloading.md) features are compatible with sharded data parallelism.
+ To use sharded data parallelism with gradient accumulation, set the `backward_passes_per_step` argument to the number of accumulation steps while wrapping your model with the [https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.DistributedModel](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.DistributedModel) module. This ensures that the gradient `AllReduce` operation across the model replication groups (sharding groups) takes place at the boundary of gradient accumulation.
+ You can checkpoint your models trained with sharded data parallelism using the library's checkpointing APIs, `smp.save_checkpoint` and `smp.resume_from_checkpoint`. For more information, see [Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library v1.10.0 and later)](distributed-model-parallel-checkpointing-and-finetuning.md#model-parallel-extended-features-pytorch-checkpoint).
+ The behavior of the [https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.delay_param_initialization](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.delay_param_initialization) configuration parameter changes under sharded data parallelism. When delayed parameter initialization and sharded data parallelism are both turned on, parameters are immediately initialized upon model creation in a sharded manner instead of delaying the parameter initialization, so that each rank initializes and stores its own shard of parameters.
+ When sharded data parallelism is activated, the library performs gradient clipping internally when the `optimizer.step()` call runs. You don't need to use utility APIs for gradient clipping, such as [https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html). To adjust the threshold value for gradient clipping, you can set it through the `sdp_gradient_clipping` parameter for the distribution parameter configuration when you construct the SageMaker PyTorch estimator, as shown in the [How to apply sharded data parallelism to your training job](#model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use) section.

# Pipelining a Model
<a name="model-parallel-core-features-pipieline-parallelism"></a>

One of the core features of SageMaker's model parallelism library is *pipeline parallelism*, which determines the order in which computations are made and data is processed across devices during model training. Pipelining is a technique to achieve true parallelization in model parallelism, by having the GPUs compute simultaneously on different data samples, and to overcome the performance loss due to sequential computation. When you use pipeline parallelism, the training job is executed in a pipelined fashion over microbatches to maximize GPU usage.

**Note**  
Pipeline parallelism, also called model partitioning, is available for both PyTorch and TensorFlow. For supported versions of the frameworks, see [Supported Frameworks and AWS Regions](distributed-model-parallel-support.md).

## Pipeline Execution Schedule
<a name="model-parallel-pipeline-execution"></a>

Pipelining is based on splitting a mini-batch into microbatches, which are fed into the training pipeline one-by-one and follow an execution schedule defined by the library runtime. A *microbatch* is a smaller subset of a given training mini-batch. The pipeline schedule determines which microbatch is executed by which device for every time slot. 

For example, depending on the pipeline schedule and the model partition, GPU `i` might perform (forward or backward) computation on microbatch `b` while GPU `i+1` performs computation on microbatch `b+1`, thereby keeping both GPUs active at the same time. During a single forward or backward pass, execution flow for a single microbatch might visit the same device multiple times, depending on the partitioning decision. For instance, an operation that is at the beginning of the model might be placed on the same device as an operation at the end of the model, while the operations in between are on different devices, which means this device is visited twice.
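The basic splitting of a mini-batch into microbatches can be sketched in plain Python. This is an illustrative helper, not the library's implementation, which handles the split internally based on the `microbatches` parameter.

```python
def split_into_microbatches(minibatch, num_microbatches):
    """Split a mini-batch (a list of samples here) into equal microbatches."""
    assert len(minibatch) % num_microbatches == 0, "batch size must divide evenly"
    size = len(minibatch) // num_microbatches
    return [minibatch[i * size:(i + 1) * size] for i in range(num_microbatches)]

minibatch = list(range(8))  # a mini-batch of 8 samples
print(split_into_microbatches(minibatch, 4))
# [[0, 1], [2, 3], [4, 5], [6, 7]]
```

This is also why the PyTorch example later in this topic sets `drop_last=True` on the `DataLoader`: every mini-batch must be divisible by the number of microbatches.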

The library offers two different pipeline schedules, *simple* and *interleaved*, which can be configured using the `pipeline` parameter in the SageMaker Python SDK. In most cases, interleaved pipeline can achieve better performance by utilizing the GPUs more efficiently.
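A minimal sketch of selecting the schedule through the `pipeline` parameter in the `smdistributed` model parallel options follows; the surrounding parameter values are illustrative.

```python
# Illustrative model parallel options; "pipeline" selects the schedule,
# either "interleaved" or "simple".
smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 8,
        "pipeline": "interleaved",  # or "simple"
        "partitions": 2,
    },
}
print(smp_options["parameters"]["pipeline"])  # interleaved
```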

### Interleaved Pipeline
<a name="model-parallel-pipeline-execution-interleaved"></a>

In an interleaved pipeline, backward execution of the microbatches is prioritized whenever possible. This allows quicker release of the memory used for activations, using memory more efficiently. It also allows for scaling the number of microbatches higher, reducing the idle time of the GPUs. At steady-state, each device alternates between running forward and backward passes. This means that the backward pass of one microbatch may run before the forward pass of another microbatch finishes.

![\[Example execution schedule for the interleaved pipeline over 2 GPUs.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/interleaved-pipeline-execution.png)


The preceding figure illustrates an example execution schedule for the interleaved pipeline over 2 GPUs. In the figure, F0 represents the forward pass for microbatch 0, and B1 represents the backward pass for microbatch 1. **Update** represents the optimizer update of the parameters. GPU0 always prioritizes backward passes whenever possible (for instance, executes B0 before F2), which allows for clearing of the memory used for activations earlier.

### Simple Pipeline
<a name="model-parallel-pipeline-execution-simple"></a>

A simple pipeline, by contrast, finishes running the forward pass for each microbatch before starting the backward pass. This means that it only pipelines the forward pass and backward pass stages within themselves. The following figure illustrates an example of how this works, over 2 GPUs.

![\[Example on a pipeline running the forward pass for each microbatch before starting the backward pass.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/simple-pipeline-execution.png)
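The benefit of raising the number of microbatches can be estimated with the standard "bubble" formula for an idealized simple pipeline with `p` stages and `m` microbatches. This is a rough analytical model, not an exact description of the library's scheduler.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Idle-time fraction (p - 1) / (m + p - 1) of an idealized simple pipeline."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# With 2 pipeline stages, more microbatches shrink the idle "bubble":
print(bubble_fraction(2, 2))   # ~0.333
print(bubble_fraction(2, 8))   # ~0.111
print(bubble_fraction(2, 64))  # ~0.015
```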


### Pipelining Execution in Specific Frameworks
<a name="model-parallel-pipeline-frameworks"></a>

Use the following sections to learn about the framework-specific pipeline scheduling decisions SageMaker's model parallelism library makes for TensorFlow and PyTorch. 

#### Pipeline Execution with TensorFlow
<a name="model-parallel-pipeline-execution-interleaved-tf"></a>

The following image is an example of a TensorFlow graph partitioned by the model parallelism library, using automated model splitting. When a graph is split, each resulting subgraph is replicated B times (except for the variables), where B is the number of microbatches. In this figure, each subgraph is replicated 2 times (B=2). An `SMPInput` operation is inserted at each input of a subgraph, and an `SMPOutput` operation is inserted at each output. These operations communicate with the library backend to transfer tensors to and from each other.

![\[Example of a TensorFlow graph partitioned by the model parallelism library, using automated model splitting.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/interleaved-pipeline-tf.png)


The following image is an example of 2 subgraphs split with B=2 with gradient operations added. The gradient of an `SMPInput` op is an `SMPOutput` op, and vice versa. This enables the gradients to flow backwards during back-propagation.

![\[Example of 2 subgraphs split with B=2 with gradient operations added.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/interleaved-pipeline-tf.gif)


This GIF demonstrates an example interleaved pipeline execution schedule with B=2 microbatches and 2 subgraphs. Each device sequentially executes one of the subgraph replicas to improve GPU utilization. As B grows larger, the fraction of idle time slots goes to zero. Whenever it is time to do (forward or backward) computation on a specific subgraph replica, the pipeline layer signals to the corresponding blue `SMPInput` operations to start executing.

Once the gradients from all microbatches in a single mini-batch are computed, the library combines the gradients across microbatches, which can then be applied to the parameters. 

#### Pipeline Execution with PyTorch
<a name="model-parallel-pipeline-execution-interleaved-pt"></a>

Conceptually, pipelining follows a similar idea in PyTorch. However, because PyTorch does not rely on static graphs, the model parallelism library's PyTorch feature uses a more dynamic pipelining paradigm. 

As in TensorFlow, each batch is split into a number of microbatches, which are executed one at a time on each device. However, the execution schedule is handled via execution servers launched on each device. Whenever the output of a submodule that is placed on another device is needed on the current device, an execution request is sent to the execution server of the remote device along with the input tensors to the submodule. The server then executes this module with the given inputs and returns the response to the current device.

Since the current device is idle during the remote submodule execution, the local execution for the current microbatch pauses, and the library runtime switches execution to another microbatch which the current device can actively work on. The prioritization of microbatches is determined by the chosen pipeline schedule. For an interleaved pipeline schedule, microbatches that are in the backward stage of the computation are prioritized whenever possible.

# Tensor Parallelism
<a name="model-parallel-extended-features-pytorch-tensor-parallelism"></a>

*Tensor parallelism* is a type of model parallelism in which specific model weights, gradients, and optimizer states are split across devices. In contrast to pipeline parallelism, which keeps individual weights intact but partitions the *set* of weights, tensor parallelism splits individual weights. This typically involves distributed computation of specific operations, modules, or layers of the model.

Tensor parallelism is required in cases in which a single parameter consumes most of the GPU memory (such as large embedding tables with a large vocabulary size or a large softmax layer with a large number of classes). In this case, treating this large tensor or operation as an atomic unit is inefficient and impedes balance of the memory load. 

Tensor parallelism is also useful for extremely large models in which a pure pipelining is simply not enough. For example, with GPT-3-scale models that require partitioning over tens of instances, a pure microbatch pipelining is inefficient because the pipeline depth becomes too high and the overhead becomes prohibitively large.

**Note**  
Tensor parallelism is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.

**Topics**
+ [How Tensor Parallelism Works](model-parallel-extended-features-pytorch-tensor-parallelism-how-it-works.md)
+ [Run a SageMaker Distributed Model Parallel Training Job with Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism-examples.md)
+ [Support for Hugging Face Transformer Models](model-parallel-extended-features-pytorch-hugging-face.md)
+ [Ranking Mechanism when Using a Combination of Pipeline Parallelism and Tensor Parallelism](model-parallel-extended-features-pytorch-ranking-mechanism.md)

# How Tensor Parallelism Works
<a name="model-parallel-extended-features-pytorch-tensor-parallelism-how-it-works"></a>

Tensor parallelism takes place at the level of `nn.Modules`; it partitions specific modules in the model across tensor parallel ranks. This is in addition to the existing partition of the *set of modules* used in pipeline parallelism.

When a module is partitioned through tensor parallelism, its forward and backward propagation are distributed. The library handles the necessary communication across devices to implement the distributed execution of these modules. The modules are partitioned across multiple data parallel ranks. Contrary to the traditional distribution of workloads, each data parallel rank does **not** have the complete model replica when the library’s tensor parallelism is used. Instead, each data parallel rank may have only a partition of the distributed modules, in addition to the entirety of the modules that are not distributed.

**Example:** Consider tensor parallelism across data parallel ranks, where the degree of data parallelism is 4 and the degree of tensor parallelism is 2. Assume that you have a data parallel group that holds the following module tree, after partitioning the set of modules.

```
A
├── B
|   ├── E
|   ├── F
├── C
└── D
    ├── G
    └── H
```

Assume that tensor parallelism is supported for the modules B, G, and H. One possible outcome of tensor parallel partition of this model could be:

```
dp_rank 0 (tensor parallel rank 0): A, B:0, C, D, G:0, H
dp_rank 1 (tensor parallel rank 1): A, B:1, C, D, G:1, H
dp_rank 2 (tensor parallel rank 0): A, B:0, C, D, G:0, H
dp_rank 3 (tensor parallel rank 1): A, B:1, C, D, G:1, H
```

Each line represents the set of modules stored in that `dp_rank`, and the notation `X:y` represents the `y`th fraction of the module `X`. Note the following:

1. Partitioning takes place across subsets of data parallel ranks, which we call `TP_GROUP`, not the entire `DP_GROUP`, so that the exact model partition is replicated across `dp_rank` 0 and `dp_rank` 2, and similarly across `dp_rank` 1 and `dp_rank` 3.

1. The modules `E` and `F` are no longer part of the model, since their parent module `B` is partitioned, and any execution that is normally a part of `E` and `F` takes place within the (partitioned) `B` module.

1. Even though `H` is supported for tensor parallelism, in this example it is not partitioned, which highlights that whether to partition a module depends on user input. The fact that a module is supported for tensor parallelism does not necessarily mean it is partitioned.
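The mapping from data parallel rank to tensor parallel rank in this layout can be sketched as follows; the helper is illustrative, not the library's API.

```python
def tp_rank_of(dp_rank, tp_degree):
    """Tensor parallel rank of a data parallel rank in the layout above."""
    return dp_rank % tp_degree

# Degree of data parallelism 4, degree of tensor parallelism 2,
# matching the example: dp_rank 0 and 2 replicate tensor parallel rank 0,
# and dp_rank 1 and 3 replicate tensor parallel rank 1.
tp_degree, dp_degree = 2, 4
assignment = {r: tp_rank_of(r, tp_degree) for r in range(dp_degree)}
print(assignment)  # {0: 0, 1: 1, 2: 0, 3: 1}
```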

## How the library adapts tensor parallelism to PyTorch `nn.Linear` module
<a name="model-parallel-extended-for-pytorch-adapt-to-module"></a>

When tensor parallelism is performed over data parallel ranks, a subset of the parameters, gradients, and optimizer states are partitioned across the tensor parallel devices *for the modules that are partitioned*. For the rest of the modules, the tensor parallel devices operate in a regular data parallel manner. To execute the partitioned module, a device first collects the necessary parts of *all data samples* across peer devices in the same tensor parallelism group. The device then runs the local fraction of the module on all these data samples, followed by another round of synchronization which both combines the parts of the output for each data sample and returns the combined data samples to the GPUs from which the data sample first originated. The following figure shows an example of this process over a partitioned `nn.Linear` module. 

![\[Two figures showing two tensor parallel concepts.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/tensor-parallel-concept.png)


The first figure shows a small model with a large `nn.Linear` module, with data parallelism over the two tensor parallelism ranks. The `nn.Linear` module is replicated across the two ranks. 

The second figure shows tensor parallelism applied on a larger model while splitting the `nn.Linear` module. Each `tp_rank` holds half the linear module, and the entirety of the rest of the operations. While the linear module runs, each `tp_rank` collects the relevant half of all data samples and passes it through their half of the `nn.Linear` module. The result needs to be reduce-scattered (with summation as the reduction operation) so that each rank has the final linear output for their own data samples. The rest of the model runs in the typical data parallel manner.
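The split-and-sum computation can be illustrated with plain NumPy over two hypothetical ranks: each rank multiplies its half of the input features by the matching rows of the weight matrix, and summing the partial products (the reduction step) recovers the full linear output.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # 4 data samples, 8 input features
W = rng.standard_normal((8, 3))   # a linear layer with 3 output features

# Each of two "ranks" holds half of the input features and the matching
# rows of the weight matrix.
x0, x1 = x[:, :4], x[:, 4:]
W0, W1 = W[:4, :], W[4:, :]

partial0 = x0 @ W0  # computed on "rank 0"
partial1 = x1 @ W1  # computed on "rank 1"

# Summing the partial products (the reduction) matches the full output.
assert np.allclose(partial0 + partial1, x @ W)
```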

# Run a SageMaker Distributed Model Parallel Training Job with Tensor Parallelism
<a name="model-parallel-extended-features-pytorch-tensor-parallelism-examples"></a>

In this section, you learn:
+ How to configure a SageMaker PyTorch estimator and the SageMaker model parallelism option to use tensor parallelism.
+ How to adapt your training script using the extended `smdistributed.modelparallel` modules for tensor parallelism.

To learn more about the `smdistributed.modelparallel` modules, see the [SageMaker model parallel APIs](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html) in the *SageMaker Python SDK documentation*.

**Topics**
+ [Tensor parallelism alone](#model-parallel-extended-features-pytorch-tensor-parallelism-alone)
+ [Tensor parallelism combined with pipeline parallelism](#model-parallel-extended-features-pytorch-tensor-and-pipeline-parallelism)

## Tensor parallelism alone
<a name="model-parallel-extended-features-pytorch-tensor-parallelism-alone"></a>

The following is an example of a distributed training option to activate tensor parallelism alone, without pipeline parallelism. Configure the `mpi_options` and `smp_options` dictionaries to specify distributed training options to the SageMaker `PyTorch` estimator.

**Note**  
Extended memory-saving features are available through Deep Learning Containers for PyTorch, which implements the SageMaker model parallelism library v1.6.0 or later.

**Configure a SageMaker PyTorch estimator**

```
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,               # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}
               
smp_options = {
    "enabled":True,
    "parameters": {
        "pipeline_parallel_degree": 1,    # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 4,      # tp over 4 devices
        "ddp": True
    }
}
              
smp_estimator = PyTorch(
    entry_point='your_training_script.py', # Specify
    role=role,
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    framework_version='1.13.1',
    py_version='py39',
    instance_count=1,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smp_estimator.fit('s3://my_bucket/my_training_data/')
```

**Tip**  
To find a complete list of parameters for `distribution`, see [Configuration Parameters for Model Parallelism](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html) in the SageMaker Python SDK documentation.

**Adapt your PyTorch training script**

The following example training script shows how to adapt the SageMaker model parallelism library to a training script. In this example, it is assumed that the script is named `your_training_script.py`. 

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        return F.log_softmax(x, 1)

def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # smdistributed: Move input tensors to the GPU ID used by
        # the current process, based on the set_device call.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target, reduction="mean")
        loss.backward()
        optimizer.step()

# smdistributed: Initialize the backend
smp.init()

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Download only on a single process per instance.
# When this is not present, the file is corrupted by multiple processes trying
# to download and extract at the same time
if smp.local_rank() == 0:
    dataset = datasets.MNIST("../data", train=True, download=True)
smp.barrier()

# smdistributed: Shard the dataset based on data parallel ranks
if smp.dp_size() > 1:
    partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())}
    dataset = SplitDataset(dataset, partitions=partitions_dict)
    dataset.select(f"{smp.dp_rank()}")

train_loader = torch.utils.data.DataLoader(dataset, batch_size=64)

# smdistributed: Enable tensor parallelism for all supported modules in the model
# i.e., nn.Linear in this case. Alternatively, we can use
# smp.set_tensor_parallelism(model.fc1, True)
# to enable it only for model.fc1
with smp.tensor_parallelism():
    model = Net()

# smdistributed: Use the DistributedModel wrapper to distribute the
# modules for which tensor parallelism is enabled
model = smp.DistributedModel(model)

optimizer = optim.Adadelta(model.parameters(), lr=4.0)
optimizer = smp.DistributedOptimizer(optimizer)

train(model, device, train_loader, optimizer)
```

## Tensor parallelism combined with pipeline parallelism
<a name="model-parallel-extended-features-pytorch-tensor-and-pipeline-parallelism"></a>

The following is an example of a distributed training option that enables tensor parallelism combined with pipeline parallelism. Set up the `mpi_options` and `smp_options` parameters to specify model parallel options with tensor parallelism when you configure a SageMaker `PyTorch` estimator.

**Note**  
Extended memory-saving features are available through Deep Learning Containers for PyTorch, which implements the SageMaker model parallelism library v1.6.0 or later.

**Configure a SageMaker PyTorch estimator**

```
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,               # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}
               
smp_options = {
    "enabled":True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,    # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,      # tp over 2 devices
        "ddp": True
    }
}
              
smp_estimator = PyTorch(
    entry_point='your_training_script.py', # Specify
    role=role,
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    framework_version='1.13.1',
    py_version='py39',
    instance_count=1,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smp_estimator.fit('s3://my_bucket/my_training_data/')  
```

<a name="model-parallel-extended-features-pytorch-tensor-and-pipeline-parallelism-script"></a>**Adapt your PyTorch training script**

The following example training script shows how to adapt the SageMaker model parallelism library to a training script. Note that the training script now includes the `smp.step` decorator: 

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        return F.log_softmax(x, 1)


# smdistributed: Define smp.step. Return any tensors needed outside.
@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)
    return output, loss

def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # smdistributed: Move input tensors to the GPU ID used by
        # the current process, based on the set_device call.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        # Return value, loss_mb is a StepOutput object
        _, loss_mb = train_step(model, data, target)

        # smdistributed: Average the loss across microbatches.
        loss = loss_mb.reduce_mean()

        optimizer.step()

# smdistributed: Initialize the backend
smp.init()

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Download only on a single process per instance.
# When this is not present, the file is corrupted by multiple processes trying
# to download and extract at the same time
if smp.local_rank() == 0:
    dataset = datasets.MNIST("../data", train=True, download=True)
smp.barrier()

# smdistributed: Shard the dataset based on data parallel ranks
if smp.dp_size() > 1:
    partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())}
    dataset = SplitDataset(dataset, partitions=partitions_dict)
    dataset.select(f"{smp.dp_rank()}")

# smdistributed: Set drop_last=True to ensure that batch size is always divisible
# by the number of microbatches
train_loader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)

model = Net()

# smdistributed: enable tensor parallelism only for model.fc1
smp.set_tensor_parallelism(model.fc1, True)

# smdistributed: Use the DistributedModel container to provide the model
# to be partitioned across different ranks. For the rest of the script,
# the returned DistributedModel object should be used in place of
# the model provided for DistributedModel class instantiation.
model = smp.DistributedModel(model)

optimizer = optim.Adadelta(model.parameters(), lr=4.0)
optimizer = smp.DistributedOptimizer(optimizer)

train(model, device, train_loader, optimizer)
```

# Support for Hugging Face Transformer Models
<a name="model-parallel-extended-features-pytorch-hugging-face"></a>

The SageMaker model parallelism library's tensor parallelism offers out-of-the-box support for the following Hugging Face Transformer models:
+ GPT-2, BERT, and RoBERTa (Available in the SageMaker model parallelism library v1.7.0 and later)
+ GPT-J (Available in the SageMaker model parallelism library v1.8.0 and later)
+ GPT-Neo (Available in the SageMaker model parallelism library v1.10.0 and later)

**Note**  
For any other Transformers models, you need to use the [smdistributed.modelparallel.torch.tp\_register\_with\_module()](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.html#smdistributed.modelparallel.torch.tp_register_with_module) API to apply tensor parallelism.

**Note**  
To use tensor parallelism for training Hugging Face Transformer models, make sure you use Hugging Face Deep Learning Containers for PyTorch that include the SageMaker model parallelism library v1.7.0 or later. For more information, see the [SageMaker model parallelism library release notes](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.html).

## Supported Models Out of the Box
<a name="model-parallel-extended-features-pytorch-hugging-face-out-of-the-box"></a>

For the Hugging Face transformer models supported by the library out of the box, you don't need to manually implement hooks to translate Transformer APIs to `smdistributed` transformer layers, and you don't need to register hooks with the `smp.tp_register` API. You can activate tensor parallelism by using the context manager [smdistributed.modelparallel.torch.tensor\_parallelism()](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.html#smdistributed.modelparallel.torch.tensor_parallelism) and wrapping the model with [smdistributed.modelparallel.torch.DistributedModel()](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.DistributedModel).

The `state_dict` translation functions between Hugging Face Transformers and `smdistributed.modelparallel` can be accessed as follows.
+  `smdistributed.modelparallel.torch.nn.huggingface.gpt2.translate_state_dict_to_hf_gpt2(state_dict, max_seq_len=None)`
+  `smdistributed.modelparallel.torch.nn.huggingface.gpt2.translate_hf_state_dict_to_smdistributed_gpt2(state_dict)` 
+  `smdistributed.modelparallel.torch.nn.huggingface.bert.translate_state_dict_to_hf_bert(state_dict, max_seq_len=None)` 
+  `smdistributed.modelparallel.torch.nn.huggingface.bert.translate_hf_state_dict_to_smdistributed_bert(state_dict)` 
+  `smdistributed.modelparallel.torch.nn.huggingface.roberta.translate_state_dict_to_hf_roberta(state_dict, max_seq_len=None)` 
+  `smdistributed.modelparallel.torch.nn.huggingface.roberta.translate_hf_state_dict_to_smdistributed_roberta(state_dict)` 
+ `smdistributed.modelparallel.torch.nn.huggingface.gptj.translate_state_dict_to_hf_gptj(state_dict, max_seq_len=None)` (Available in the SageMaker model parallelism library v1.8.0 and later)
+ `smdistributed.modelparallel.torch.nn.huggingface.gptj.translate_hf_gptj_state_dict_to_smdistributed_gptj` (Available in the SageMaker model parallelism library v1.8.0 and later)
+ `smdistributed.modelparallel.torch.nn.huggingface.gptneo.translate_state_dict_to_hf_gptneo(state_dict, max_seq_len=None)` (Available in the SageMaker model parallelism library v1.10.0 and later)
+ `smdistributed.modelparallel.torch.nn.huggingface.gptneo.translate_hf_state_dict_to_smdistributed_gptneo(state_dict)` (Available in the SageMaker model parallelism library v1.10.0 and later)

**Example usage of the GPT-2 translation function**

Start with wrapping the model as shown in the following code.

```
import smdistributed.modelparallel.torch as smp
from transformers import AutoModelForCausalLM

with smp.tensor_parallelism():
    model = AutoModelForCausalLM.from_config(hf_gpt2_config)

model = smp.DistributedModel(model)
```

Given a `state_dict` from the `DistributedModel` object, you can load the weights into the original Hugging Face GPT-2 model using the `translate_state_dict_to_hf_gpt2` function as shown in the following code.

```
from smdistributed.modelparallel.torch.nn.huggingface.gpt2 \
                                      import translate_state_dict_to_hf_gpt2
max_seq_len = 1024

# [... code block for training ...]

if smp.rdp_rank() == 0:
    state_dict = model.state_dict()
    hf_state_dict = translate_state_dict_to_hf_gpt2(state_dict, max_seq_len)

    # You can now call load_state_dict(hf_state_dict) on the original HF model
```

**Example usage of the RoBERTa translation function**

Similarly, given a supported Hugging Face model `state_dict`, you can use the `translate_hf_state_dict_to_smdistributed` function to convert it to a format readable by `smp.DistributedModel`. This can be useful in transfer learning use cases, where a pre-trained model is loaded into an `smp.DistributedModel` for model-parallel fine-tuning:

```
import smdistributed.modelparallel.torch as smp
from transformers import AutoModelForMaskedLM
from smdistributed.modelparallel.torch.nn.huggingface.roberta \
        import translate_hf_state_dict_to_smdistributed_roberta

model = AutoModelForMaskedLM.from_config(roberta_config)
model = smp.DistributedModel(model)

pretrained_model = AutoModelForMaskedLM.from_pretrained("roberta-large")
translated_state_dict = \
        translate_hf_state_dict_to_smdistributed_roberta(pretrained_model.state_dict())

# load the translated pretrained weights into the smp.DistributedModel
model.load_state_dict(translated_state_dict)

# start fine-tuning...
```

# Ranking Mechanism when Using a Combination of Pipeline Parallelism and Tensor Parallelism
<a name="model-parallel-extended-features-pytorch-ranking-mechanism"></a>

This section explains how the ranking mechanism of model parallelism works with tensor parallelism. This is extended from the [Ranking Basics](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#ranking-basics) for [Core Features of the SageMaker Model Parallelism Library](model-parallel-core-features.md). With tensor parallelism, the library introduces three types of ranking and process group APIs: `smp.tp_rank()` for tensor parallel rank, `smp.pp_rank()` for pipeline parallel rank, and `smp.rdp_rank()` for reduced-data parallel rank. The corresponding communication process groups are tensor parallel group (`TP_GROUP`), pipeline parallel group (`PP_GROUP`), and reduced-data parallel group (`RDP_GROUP`). These groups are defined as follows:
+ A *tensor parallel group* (`TP_GROUP`) is an evenly divisible subset of the data parallel group, over which tensor parallel distribution of modules takes place. When the degree of pipeline parallelism is 1, `TP_GROUP` is the same as *model parallel group* (`MP_GROUP`). 
+ A *pipeline parallel group* (`PP_GROUP`) is the group of processes over which pipeline parallelism takes place. When the tensor parallelism degree is 1, `PP_GROUP` is the same as `MP_GROUP`. 
+ A *reduced-data parallel group* (`RDP_GROUP`) is a set of processes that hold both the same pipeline parallelism partitions and the same tensor parallel partitions, and perform data parallelism among themselves. This is called the reduced data parallel group because it is a subset of the entire data parallelism group, `DP_GROUP`. For the model parameters that are distributed within the `TP_GROUP`, the gradient `allreduce` operation is performed only over the reduced-data parallel group, while for the parameters that are not distributed, the gradient `allreduce` takes place over the entire `DP_GROUP`. 
+ A model parallel group (`MP_GROUP`) refers to a group of processes that collectively store the entire model. It consists of the union of the `PP_GROUP`s of all the ranks that are in the `TP_GROUP` of the current process. When the degree of tensor parallelism is 1, `MP_GROUP` is equivalent to `PP_GROUP`. It is also consistent with the existing definition of `MP_GROUP` from previous `smdistributed` releases. Note that the current `TP_GROUP` is a subset of both the current `DP_GROUP` and the current `MP_GROUP`. 

To learn more about the communication process APIs in the SageMaker model parallelism library, see the [Common API](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_common_api.html#) and the [PyTorch-specific APIs](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html) in the *SageMaker Python SDK documentation*.

![Ranking mechanism, parameter distribution, and associated AllReduce operations of tensor parallelism.](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/tensor-parallel-ranking-mechanism.png)


For example, consider process groups for a single node with 8 GPUs, where the degree of tensor parallelism is 2, the degree of pipeline parallelism is 2, and the degree of data parallelism is 4. The upper center part of the preceding figure shows an example of a model with 4 layers. The lower left and lower right parts of the figure illustrate the 4-layer model distributed across 4 GPUs using both pipeline parallelism and tensor parallelism, where tensor parallelism is used for the middle two layers. These two lower figures are simple copies to illustrate different group boundary lines. The partitioned model is replicated for data parallelism across GPUs 0-3 and 4-7. The lower left figure shows the definitions of `MP_GROUP`, `PP_GROUP`, and `TP_GROUP`. The lower right figure shows `RDP_GROUP`, `DP_GROUP`, and `WORLD` over the same set of GPUs. The gradients for the layers and layer slices that have the same color are `allreduce`d together for data parallelism. For example, the first layer (light blue) gets the `allreduce` operations across `DP_GROUP`, whereas the dark orange slice in the second layer gets the `allreduce` operations only within the `RDP_GROUP` of its process. The bold dark red arrows represent tensors carrying the batch for their entire `TP_GROUP`. In this example, the ranks are assigned as follows.

```
GPU0: pp_rank 0, tp_rank 0, rdp_rank 0, dp_rank 0, mp_rank 0
GPU1: pp_rank 1, tp_rank 0, rdp_rank 0, dp_rank 0, mp_rank 1
GPU2: pp_rank 0, tp_rank 1, rdp_rank 0, dp_rank 1, mp_rank 2
GPU3: pp_rank 1, tp_rank 1, rdp_rank 0, dp_rank 1, mp_rank 3
GPU4: pp_rank 0, tp_rank 0, rdp_rank 1, dp_rank 2, mp_rank 0
GPU5: pp_rank 1, tp_rank 0, rdp_rank 1, dp_rank 2, mp_rank 1
GPU6: pp_rank 0, tp_rank 1, rdp_rank 1, dp_rank 3, mp_rank 2
GPU7: pp_rank 1, tp_rank 1, rdp_rank 1, dp_rank 3, mp_rank 3
```

In this example, pipeline parallelism occurs across the GPU pairs (0,1); (2,3); (4,5) and (6,7). In addition, data parallelism (`allreduce`) takes place across GPUs 0, 2, 4, 6, and independently over GPUs 1, 3, 5, 7. Tensor parallelism happens over subsets of `DP_GROUP`s, across the GPU pairs (0,2); (1,3); (4,6) and (5,7).

For this kind of hybrid pipeline and tensor parallelism, the formula `data_parallel_degree = number_of_GPUs / pipeline_parallel_degree` still holds. The library then derives the reduced data parallel degree from the relation `reduced_data_parallel_degree * tensor_parallel_degree = data_parallel_degree`.
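The degree arithmetic above can be checked with a few lines of plain Python. These helper functions are hypothetical and shown only for illustration; the library computes these values for you internally.

```python
# Hypothetical helpers illustrating the degree arithmetic described above.

def data_parallel_degree(number_of_gpus: int, pipeline_parallel_degree: int) -> int:
    # data_parallel_degree = number_of_GPUs / pipeline_parallel_degree
    assert number_of_gpus % pipeline_parallel_degree == 0
    return number_of_gpus // pipeline_parallel_degree

def reduced_data_parallel_degree(dp_degree: int, tensor_parallel_degree: int) -> int:
    # reduced_data_parallel_degree * tensor_parallel_degree = data_parallel_degree
    assert dp_degree % tensor_parallel_degree == 0
    return dp_degree // tensor_parallel_degree

# The single-node example above: 8 GPUs, pipeline parallel degree 2, tensor parallel degree 2
dp = data_parallel_degree(8, 2)             # 4, matching dp_rank 0..3
rdp = reduced_data_parallel_degree(dp, 2)   # 2, matching rdp_rank 0..1
```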

# Optimizer State Sharding
<a name="model-parallel-extended-features-pytorch-optimizer-state-sharding"></a>

*Optimizer state sharding* is a useful memory-saving technique that shards the optimizer state (the set of weights that describes the state of the optimizer) across data parallel device groups. You can use optimizer state sharding whenever you use a stateful optimizer (such as Adam) or an FP16 optimizer (which stores both FP16 and FP32 copies of the parameters).

**Note**  
Optimizer state sharding is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.

## How to Use Optimizer State Sharding
<a name="model-parallel-extended-features-pytorch-optimizer-state-sharding-how-to-use"></a>

You can turn on *optimizer state sharding* by setting `"shard_optimizer_state": True` in the `modelparallel` configuration. 

When this feature is turned on, the library partitions the set of model parameters based on the data parallelism degree. The gradients corresponding to the `i`th partition get reduced only at the `i`th data parallel rank. At the end of the first call to an `smp.step` decorator function, the optimizer wrapped by `smp.DistributedOptimizer` redefines its parameters to be only limited to those parameters corresponding to the partition of the current data parallel rank. The redefined parameters are called *virtual parameters* and share underlying storage with the original parameters. During the first call to `optimizer.step`, the optimizer states are created based on these redefined parameters, which are sharded because of the original partition. After the optimizer update, the AllGather operation (as part of the `optimizer.step` call) runs across the data parallel ranks to achieve consistent parameter states.
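As a framework-free illustration of this scheme, consider the following toy sketch (not the library's implementation): each data-parallel rank updates only the partition of the parameters it owns, and an allgather then restores a consistent copy of the parameters on every rank.

```python
# Toy sketch (not the library's implementation) of optimizer state sharding:
# each data-parallel rank owns one partition of the parameters, keeps optimizer
# state only for that partition, and an allgather makes all copies consistent.

def shard(num_params, dp_size):
    """Assign parameter indices to data-parallel ranks round-robin."""
    return {rank: [i for i in range(num_params) if i % dp_size == rank]
            for rank in range(dp_size)}

def sharded_sgd_step(params, grads, dp_size, lr=0.5):
    owned = shard(len(params), dp_size)
    updated = list(params)
    for rank in range(dp_size):
        # Each rank updates only its own "virtual parameters".
        for i in owned[rank]:
            updated[i] = params[i] - lr * grads[i]
    # In the library, an AllGather across the data parallel ranks then
    # achieves consistent parameter states on every rank.
    return updated

print(sharded_sgd_step([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5], dp_size=2))
# -> [0.75, 1.75, 2.75, 3.75]
```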

**Tip**  
Optimizer state sharding can be useful when the degree of data parallelism is greater than 1 and the model has more than a billion parameters.   
The degree of data parallelism is calculated by `(processes_per_host * instance_count / pipeline_parallel_degree)`, and the `smp.dp_size()` function handles the sizing in the background.

**Configure a SageMaker PyTorch estimator**

```
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,               # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}

smp_options = {
    "enabled":True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,    # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,      # tp over 2 devices
        "ddp": True,
        "shard_optimizer_state": True
    }
}
```
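To launch a training job with this configuration, you can pass these options to a SageMaker PyTorch estimator's `distribution` argument, following the same pattern as the FP16 estimator example later in this guide. The entry point file name below is a placeholder for your adapted training script.

```
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="training_script.py",  # placeholder: your adapted training script
    ...,

    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    }
)

estimator.fit()
```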

**Adapt your PyTorch training script**

See [Adapt your PyTorch training script](model-parallel-extended-features-pytorch-tensor-parallelism-examples.md#model-parallel-extended-features-pytorch-tensor-and-pipeline-parallelism-script) in the *Tensor parallelism combined with pipeline parallelism* section. There’s no additional modification required for the script.

# Activation Checkpointing
<a name="model-parallel-extended-features-pytorch-activation-checkpointing"></a>

*Activation checkpointing* (or *gradient checkpointing*) is a technique to reduce memory usage by clearing activations of certain layers and recomputing them during a backward pass. Effectively, this trades extra computation time for reduced memory usage. If a module is checkpointed, at the end of a forward pass, the inputs to and outputs from the module stay in memory. Any intermediate tensors that would have been part of the computation inside that module are freed up during the forward pass. During the backward pass of checkpointed modules, these tensors are recomputed. At this point, the layers beyond this checkpointed module have finished their backward pass, so the peak memory usage with checkpointing can be lower.

**Note**  
This feature is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.

## How to Use Activation Checkpointing
<a name="model-parallel-extended-for-pytorch-activation-checkpointing-how-to-use"></a>

With `smdistributed.modelparallel`, you can use activation checkpointing at the granularity of a module. For all `torch.nn` modules except `torch.nn.Sequential`, you can only checkpoint a module tree if it lies within one partition from the perspective of pipeline parallelism. In the case of the `torch.nn.Sequential` module, each module tree inside the sequential module must lie completely within one partition for activation checkpointing to work. When you use manual partitioning, be aware of these restrictions.

When you use [automated model partitioning](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html#model-parallel-automated-model-splitting), you can find the partitioning assignment logs starting with `Partition assignments:` in the training job logs. If a module is partitioned across multiple ranks (for example, with one descendant on one rank and another descendant on a different rank), the library ignores the attempt to checkpoint the module and raises a warning message that the module won't be checkpointed.

**Note**  
The SageMaker model parallelism library supports both overlapping and non-overlapping `allreduce` operations in combination with checkpointing. 

**Note**  
PyTorch’s native checkpointing API is not compatible with `smdistributed.modelparallel`.

**Example 1:** The following sample code shows how to use activation checkpointing when you have a model definition in your script.

```
import torch
import torch.nn as nn
import torch.nn.functional as F

from smdistributed.modelparallel.torch.patches.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        # This call of fc1 will be checkpointed
        x = checkpoint(self.fc1, x)
        x = self.fc2(x)
        return F.log_softmax(x, 1)
```

**Example 2:** The following sample code shows how to use activation checkpointing when you have a sequential model in your script.

```
import torch.nn as nn
import torch.nn.functional as F
from smdistributed.modelparallel.torch.patches.checkpoint import checkpoint_sequential

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.seq = nn.Sequential(
            nn.Conv2d(1,20,5),
            nn.ReLU(),
            nn.Conv2d(20,64,5),
            nn.ReLU()
        )

    def forward(self, x):
        # This call of self.seq will be checkpointed
        x = checkpoint_sequential(self.seq, x)
        return F.log_softmax(x, 1)
```

**Example 3:** The following sample code shows how to use activation checkpointing when you import a prebuilt model from a library, such as PyTorch or Hugging Face Transformers. Whether or not you checkpoint sequential modules, do the following: 

1. Wrap the model by `smp.DistributedModel()`.

1. Define an object for sequential layers.

1. Wrap the sequential layer object by `smp.set_activation_checkpointing()`.

```
import smdistributed.modelparallel.torch as smp
from transformers import AutoModelForCausalLM

smp.init()
model = AutoModelForCausalLM.from_config(config)  # or AutoModelForCausalLM.from_pretrained(...)
model = smp.DistributedModel(model)

# Call set_activation_checkpointing API
transformer_layers = model.module.module.module.transformer.seq_layers
smp.set_activation_checkpointing(
    transformer_layers, pack_args_as_tuple=True, strategy='each')
```

# Activation Offloading
<a name="model-parallel-extended-features-pytorch-activation-offloading"></a>

When activation checkpointing and pipeline parallelism are turned on and the number of microbatches is greater than one, *activation offloading* is an additional feature that can further reduce memory usage. Activation offloading asynchronously moves the checkpointed activations for the microbatches that are not currently running to the CPU. Right before the GPU needs the activations for a microbatch’s backward pass, this functionality prefetches the offloaded activations back from the CPU.

**Note**  
This feature is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.

## How to Use Activation Offloading
<a name="model-parallel-extended-for-pytorch-activation-offloading"></a>

Use activation offloading to reduce memory usage when **the number of microbatches is greater than 1, and activation checkpointing is turned on** (see [Activation Checkpointing](model-parallel-extended-features-pytorch-activation-checkpointing.md)). When activation checkpointing is not used, activation offloading has no effect. When it is used with only one microbatch, activation offloading does not save memory.

To use activation offloading, set `"offload_activations": True` in the `modelparallel` configuration.

Activation offloading moves the checkpointed activations in `nn.Sequential` modules to CPU asynchronously. The data transfer over the PCIe link overlaps with GPU computation. The offloading happens immediately, as soon as the forward pass for a particular checkpointed layer is computed. The activations are loaded back to the GPU shortly before they are needed for the backward pass of a particular microbatch. The CPU-GPU transfer similarly overlaps with computation. 

To adjust how early the activations are loaded back into the GPU, you can use the configuration parameter `"activation_loading_horizon"` (default is 4; must be an `int` greater than 0). A larger activation loading horizon causes the activations to be loaded back to the GPU earlier. If the horizon is too large, the memory-saving impact of activation offloading might be diminished. If the horizon is too small, the activations might not be loaded back in time, reducing the amount of overlap and degrading performance.

**Tip**  
Activation offloading can be useful for large models with over a hundred billion parameters.

**Configure a SageMaker PyTorch estimator**

```
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,               # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}

smp_options = {
    "enabled":True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,    # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,      # tp over 2 devices
        "ddp": True,
        "offload_activations": True,
        "activation_loading_horizon": 4   # optional. default is 4.
    }
}
```

# FP16 Training with Model Parallelism
<a name="model-parallel-extended-features-pytorch-fp16"></a>

For FP16 training, apply the following modifications to your training script and estimator.

**Note**  
This feature is available for PyTorch in the SageMaker model parallelism library v1.10.0 and later.

**Adapt your PyTorch training script**

1. Wrap your model using the [smdistributed.modelparallel.torch.model\_creation()](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.model_creation) context manager.

   ```
   # fp16_training_script.py
   
   import torch
   import smdistributed.modelparallel.torch as smp
   
   with smp.model_creation(
       dtype=torch.float16 if args.fp16 else torch.get_default_dtype()
   ):
       model = ...
   ```
**Tip**  
If you are using tensor parallelism, add `tensor_parallelism=smp.tp_size() > 1` to the `smp.model_creation` context manager. Adding this line also helps automatically detect whether tensor parallelism is activated or not.  

   ```
   with smp.model_creation(
       ... ,
       tensor_parallelism=smp.tp_size() > 1
   ):
       model = ...
   ```

1. When you wrap the optimizer with `smdistributed.modelparallel.torch.DistributedOptimizer`, set either the `static_loss_scale` or `dynamic_loss_scale` argument. By default, `static_loss_scale` is set to `1.0`, and `dynamic_loss_scale` is set to `False`. If you set `dynamic_loss_scale=True`, you can feed dynamic loss scaling options as a dictionary through the `dynamic_loss_args` argument. In most cases, we recommend you use dynamic loss scaling with the default options. For more information, options, and examples of the optimizer wrapper function, see the [smdistributed.modelparallel.torch.DistributedOptimizer](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed-modelparallel-torch-distributedoptimizer) API.

   The following code is an example of wrapping an `Adadelta` optimizer object with dynamic loss scaling for FP16 training.

   ```
   optimizer = torch.optim.Adadelta(...)
   optimizer = smp.DistributedOptimizer(
       optimizer,
       static_loss_scale=None,
       dynamic_loss_scale=True,
       dynamic_loss_args={
           "scale_window": 1000,
           "min_scale": 1,
           "delayed_shift": 2
       }
   )
   ```

**Configure a SageMaker PyTorch estimator**

Add the FP16 parameter (`"fp16"`) to the distribution configuration for model parallelism when creating a SageMaker PyTorch estimator object. For a complete list of the configuration parameters for model parallelism, see [Parameters for `smdistributed`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#parameters-for-smdistributed).

```
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters":  {
        "microbatches":  4,
        "pipeline_parallel_degree":  2,
        "tensor_parallel_degree":  2,
        ...,

        "fp16": True
    }
}

fp16_estimator = PyTorch(
    entry_point="fp16_training_script.py", # Specify your train script
    ...,

    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {...}
    }
)

fp16_estimator.fit(...)
```

When FP16 training starts, the model and the optimizer are wrapped by `FP16_Module` and `FP16_Optimizer` respectively, which are modified `smdistributed` versions of the [Apex utils](https://nvidia.github.io/apex/fp16_utils.html#apex-fp16-utils). `FP16_Module` converts the model to FP16 dtype and deals with the forward pass in FP16.

**Tip**  
You can apply gradient clipping by calling `clip_master_grads` before `optimizer.step`.  

```
optimizer.clip_master_grads(max_norm)     # max_norm(float or int): max norm of the gradients
```

**Tip**  
When using `torch.optim.lr_scheduler` and FP16 training, you need to pass `optimizer.optimizer` to the LR scheduler rather than the optimizer. See the following example code.  

```
from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(
    optimizer.optimizer if smp.state.cfg.fp16 else optimizer,
    step_size=1,
    gamma=args.gamma
)
```

# Support for FlashAttention
<a name="model-parallel-attention-head-size-for-flash-attention"></a>

Support for FlashAttention is a feature of the library only applicable for the *distributed transformer* model, which is a Transformer model wrapped by [smdistributed.modelparallel.torch.DistributedModel](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed-modelparallel-torch-distributedmodel) for model-parallel training. This feature is also compatible with [Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism.md). 

The [FlashAttention](https://github.com/HazyResearch/flash-attention) library only supports models whose `attention_head_size` is a multiple of 8 and less than 128. Therefore, when you train a distributed transformer, adjust your model parameters so that the attention head size complies with these requirements; otherwise, FlashAttention doesn't work properly. For more information, see also [Installation and features](https://github.com/HazyResearch/flash-attention#installation-and-features) in the *FlashAttention GitHub repository*.

For example, assume that you configure a Transformer model with `hidden_width=864` and `num_heads=48`. The head size of FlashAttention is calculated as `attention_head_size = hidden_width / num_heads = 864 / 48 = 18`. To enable FlashAttention, you need to adjust the `num_heads` parameter to `54`, so that `attention_head_size = hidden_width / num_heads = 864 / 54 = 16`, which is a multiple of 8.
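The head-size arithmetic above can be sketched as a small helper. This function is hypothetical and not part of the library; it simply enumerates `num_heads` values that satisfy the FlashAttention constraints for a given hidden width.

```python
# Hypothetical helper: list num_heads values whose attention head size
# (hidden_width / num_heads) is a multiple of 8 and less than 128,
# as required by FlashAttention.

def valid_num_heads(hidden_width: int):
    candidates = []
    for head_size in range(8, 128, 8):      # multiples of 8 that are less than 128
        if hidden_width % head_size == 0:
            candidates.append(hidden_width // head_size)
    return sorted(candidates)

# For hidden_width=864, head sizes 8, 16, 24, 32, 48, 72, and 96 divide evenly,
# giving num_heads options 9, 12, 18, 27, 36, 54, and 108. Choosing num_heads=54
# yields attention_head_size = 864 / 54 = 16, as in the example above.
print(valid_num_heads(864))
```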