

# Amazon SageMaker Training Compiler
<a name="training-compiler"></a>

**Important**  
Amazon Web Services (AWS) will not release new versions of SageMaker Training Compiler. You can continue to use SageMaker Training Compiler through the existing AWS Deep Learning Containers (DLCs) for SageMaker Training. Note that while the existing DLCs remain accessible, they no longer receive patches or updates from AWS, in accordance with the [AWS Deep Learning Containers Framework Support Policy](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/support-policy.html).

Use Amazon SageMaker Training Compiler to train deep learning (DL) models faster on scalable GPU instances managed by SageMaker AI.

## What Is SageMaker Training Compiler?
<a name="training-compiler-what-is"></a>

State-of-the-art deep learning (DL) models consist of complex multi-layered neural networks with billions of parameters that can take thousands of GPU hours to train. Optimizing such models on training infrastructure requires extensive knowledge of DL and systems engineering; this is challenging even for narrow use cases. Although there are open-source implementations of compilers that optimize the DL training process, they can lack the flexibility to integrate DL frameworks with some hardware such as GPU instances.

SageMaker Training Compiler is a capability of SageMaker AI that makes these hard-to-implement optimizations to reduce training time on GPU instances. The compiler optimizes DL models to accelerate training by more efficiently using SageMaker AI machine learning (ML) GPU instances. SageMaker Training Compiler is available at no additional charge within SageMaker AI and can help reduce total billable time as it accelerates training.

![\[A conceptual diagram of how SageMaker Training Compiler works with SageMaker AI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/training-compiler-marketing-diagram.png)


SageMaker Training Compiler is integrated into the AWS Deep Learning Containers (DLCs). Using the SageMaker Training Compiler–enabled AWS DLCs, you can compile and optimize training jobs on GPU instances with minimal changes to your code. Bring your deep learning models to SageMaker AI and enable SageMaker Training Compiler to accelerate your training jobs on SageMaker AI ML instances for accelerated computing.

## How It Works
<a name="training-compiler-how-it-works"></a>

SageMaker Training Compiler converts DL models from their high-level language representation to hardware-optimized instructions. Specifically, SageMaker Training Compiler applies graph-level optimizations, dataflow-level optimizations, and backend optimizations to produce an optimized model that efficiently uses hardware resources. As a result, you can train your models faster than when you train them without compilation.

Activating SageMaker Training Compiler for your training job is a two-step process:

1. Bring your own DL script and, if needed, adapt it to compile and train with SageMaker Training Compiler. To learn more, see [Bring Your Own Deep Learning Model](training-compiler-modify-scripts.md).

1. Create a SageMaker AI estimator object with the compiler configuration parameter using the SageMaker Python SDK.

   1. Turn on SageMaker Training Compiler by adding `compiler_config=TrainingCompilerConfig()` to the SageMaker AI estimator class.

   1. Adjust hyperparameters (`batch_size` and `learning_rate`) to maximize the benefit that SageMaker Training Compiler provides.

      Compilation through SageMaker Training Compiler changes the memory footprint of the model. Most commonly, this manifests as a reduction in memory utilization and a consequent increase in the largest batch size that can fit on the GPU. In some cases, the compiler intelligently promotes caching, which leads to a decrease in the largest batch size that can fit on the GPU.

      For a reference for `batch_size` tested for popular models, see [Tested Models](training-compiler-support.md#training-compiler-tested-models).

      When you adjust the batch size, you also have to adjust the `learning_rate` appropriately. For best practices for adjusting the learning rate along with the change in batch size, see [SageMaker Training Compiler Best Practices and Considerations](training-compiler-tips-pitfalls.md).

   1. When you run the `estimator.fit()` class method, SageMaker AI compiles your model and starts the training job.

   For instructions on how to launch a training job, see [Enable SageMaker Training Compiler](training-compiler-enable.md).
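The following is a minimal sketch of step 2 using the SageMaker Python SDK's Hugging Face estimator. The script name, IAM role, framework versions, and hyperparameter values are placeholders to replace with your own; the linear learning-rate rescaling shown is a common heuristic, not a fixed rule.

```python
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# Placeholder values -- replace with batch sizes observed for your model.
native_batch_size = 64        # largest batch size without compilation
compiler_batch_size = 160     # larger batch size enabled by the compiler
native_learning_rate = 5e-5

# Heuristic: scale the learning rate linearly with the batch-size ratio.
learning_rate = native_learning_rate * (compiler_batch_size / native_batch_size)

estimator = HuggingFace(
    entry_point="train.py",                    # your training script (step 1)
    role="<your-sagemaker-execution-role>",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    transformers_version="4.21.1",
    pytorch_version="1.11.0",
    py_version="py38",
    hyperparameters={
        "batch_size": compiler_batch_size,
        "learning_rate": learning_rate,
    },
    # Step 2a: turn on SageMaker Training Compiler.
    compiler_config=TrainingCompilerConfig(),
)

# Step 2c: compile the model and start the training job.
# Pass your input data channels to fit() if your script reads them.
estimator.fit()
```

Running this sketch requires an AWS account, a SageMaker AI execution role, and quota for the chosen instance type.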

SageMaker Training Compiler does not alter the final trained model; it only accelerates the training job by using GPU memory more efficiently and fitting a larger batch size per iteration. The final trained model from a compiler-accelerated training job is identical to the one from an ordinary training job.

**Tip**  
SageMaker Training Compiler only compiles DL models for training on [supported GPU instances](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html#training-compiler-supported-instance-types) managed by SageMaker AI. To compile your model for inference and deploy it to run anywhere in the cloud and at the edge, use [SageMaker Neo compiler](https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html).

**Topics**
+ [What Is SageMaker Training Compiler?](#training-compiler-what-is)
+ [How It Works](#training-compiler-how-it-works)
+ [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md)
+ [Bring Your Own Deep Learning Model](training-compiler-modify-scripts.md)
+ [Enable SageMaker Training Compiler](training-compiler-enable.md)
+ [SageMaker Training Compiler Example Notebooks and Blogs](training-compiler-examples-and-blogs.md)
+ [SageMaker Training Compiler Best Practices and Considerations](training-compiler-tips-pitfalls.md)
+ [SageMaker Training Compiler FAQ](training-compiler-faq.md)
+ [SageMaker Training Compiler Troubleshooting](training-compiler-troubleshooting.md)
+ [Amazon SageMaker Training Compiler Release Notes](training-compiler-release-notes.md)

# Supported Frameworks, AWS Regions, Instance Types, and Tested Models
<a name="training-compiler-support"></a>

**Important**  
Amazon Web Services (AWS) will not release new versions of SageMaker Training Compiler. You can continue to use SageMaker Training Compiler through the existing AWS Deep Learning Containers (DLCs) for SageMaker Training. Note that while the existing DLCs remain accessible, they no longer receive patches or updates from AWS, in accordance with the [AWS Deep Learning Containers Framework Support Policy](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/support-policy.html).

Before using SageMaker Training Compiler, check if your framework of choice is supported, the instance types are available in your AWS account, and your AWS account is in one of the supported AWS Regions.

**Note**  
SageMaker Training Compiler is available in the SageMaker Python SDK v2.70.0 or later.
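A quick way to verify this requirement is to compare the installed SDK version string against the minimum. The helper below is a minimal sketch; in practice you would pass it `sagemaker.__version__`.

```python
def meets_minimum_version(installed: str, minimum: str = "2.70.0") -> bool:
    """Return True if a dotted version string is at least the minimum."""
    def as_tuple(v: str):
        # Keep only the leading numeric release segments (e.g. "2.70.0").
        parts = []
        for piece in v.split("."):
            digits = "".join(ch for ch in piece if ch.isdigit())
            if not digits:
                break
            parts.append(int(digits))
        return tuple(parts)

    return as_tuple(installed) >= as_tuple(minimum)

# Example: check the running SageMaker Python SDK.
# import sagemaker
# assert meets_minimum_version(sagemaker.__version__)
```

If the check fails, upgrade with `pip install "sagemaker>=2.70.0"`.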

## Supported Frameworks
<a name="training-compiler-supported-frameworks"></a>

SageMaker Training Compiler supports the following deep learning frameworks and is available through AWS Deep Learning Containers.

**Topics**
+ [PyTorch](#training-compiler-supported-frameworks-pytorch)
+ [TensorFlow](#training-compiler-supported-frameworks-tensorflow)

### PyTorch
<a name="training-compiler-supported-frameworks-pytorch"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html)

### TensorFlow
<a name="training-compiler-supported-frameworks-tensorflow"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html)

For more information, see [Available Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) in the *AWS Deep Learning Containers GitHub repository*.

## AWS Regions
<a name="training-compiler-availablity-zone"></a>

The [SageMaker Training Compiler Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-training-compiler-containers) are available in the AWS Regions where [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) are in service, except the China Regions.

## Supported Instance Types
<a name="training-compiler-supported-instance-types"></a>

SageMaker Training Compiler is tested on and supports the following ML instance types.
+ P4 instances
+ P3 instances
+ G4dn instances
+ G5 instances

For specs of the instance types, see the **Accelerated Computing** section in the [Amazon EC2 Instance Types page](https://aws.amazon.com/ec2/instance-types/). For information about instance pricing, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

If you encounter an error message similar to the following, follow the instructions at [Request a service quota increase for SageMaker AI resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure).

```
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling
the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge
for training job usage' is 0 Instances, with current utilization of 0 Instances
and a request delta of 1 Instances.
Please contact AWS support to request an increase for this limit.
```

## Tested Models
<a name="training-compiler-tested-models"></a>

The following tables list the models that have been tested with SageMaker Training Compiler. For reference, the largest batch size that fits into memory is also included alongside other training parameters. SageMaker Training Compiler can change the memory footprint of the model training process; as a result, a larger batch size can often be used during training, further decreasing total training time. In some cases, SageMaker Training Compiler intelligently promotes caching, which leads to a decrease in the largest batch size that can fit on the GPU. You must retune your model hyperparameters and find an optimal batch size for your case. To save time, use the following reference tables to look up a batch size that can be a good starting point for your use case.

**Note**  
The batch sizes are local batch sizes that fit into each individual GPU in the respective instance type. You should also adjust the learning rate when changing the batch size.
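For example, when a table row lets you raise the batch size, a common heuristic is to scale the learning rate by the same factor (or by the square root of the factor with adaptive optimizers). The base learning rate below is a placeholder, and the scaling rules are rules of thumb, not guarantees.

```python
import math

def scaled_learning_rate(base_lr: float, old_batch: int, new_batch: int,
                         rule: str = "linear") -> float:
    """Rescale a learning rate after a batch-size change.

    "linear" multiplies by the batch-size ratio (common for SGD);
    "sqrt" uses the square root of the ratio (often used with Adam).
    """
    ratio = new_batch / old_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")

# Example: a compiler-enabled job raises the local batch size from 58 to 164.
lr_linear = scaled_learning_rate(5e-5, 58, 164)          # ~1.41e-4
lr_sqrt = scaled_learning_rate(5e-5, 58, 164, "sqrt")    # ~8.41e-5
```

Whichever rule you start from, validate the result against your model's convergence before committing to a long training run.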

### PyTorch 1.13.1
<a name="training-compiler-tested-models-pt1131"></a>

**Natural language processing (NLP) models**

The following models were tested for training jobs in all combinations of single-node and multi-node training with single or multiple GPUs, with Automatic Mixed Precision (AMP) as indicated.


| Model | Dataset | Instance type | Precision | Sequence Length | Batch size for native frameworks | Batch size for SageMaker Training Compiler | 
| --- | --- | --- | --- | --- | --- | --- | 
| albert-base-v2 | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 80 | 192 | 
| albert-base-v2 | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 128 | 332 | 
| albert-base-v2 | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 80 | 224 | 
| bert-base-uncased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 160 | 288 | 
| camembert-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 160 | 280 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 240 | 472 | 
| distilgpt2 | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 77 | 128 | 
| distilgpt2 | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 138 | 390 | 
| distilgpt2 | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 96 | 256 | 
| distilroberta-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 96 | 192 | 
| distilroberta-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 171 | 380 | 
| distilroberta-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 112 | 256 | 
| gpt2 | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 52 | 152 | 
| gpt2 | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 84 | 240 | 
| gpt2 | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 58 | 164 | 
| microsoft/deberta-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 48 | 128 | 
| microsoft/deberta-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 84 | 207 | 
| microsoft/deberta-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 53 | 133 | 
| roberta-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 125 | 224 | 
| xlm-roberta-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 16 | 31 | 
| xlm-roberta-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 18 | 50 | 
| xlnet-base-cased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 128 | 240 | 
| bert-base-uncased | wikitext-103-v1 | g5.48xlarge | float16 | 512 | 29 | 50 | 
| distilbert-base-uncased | wikitext-103-v1 | g5.48xlarge | float16 | 512 | 45 | 64 | 
| gpt2 | wikitext-103-v1 | g5.48xlarge | float16 | 512 | 18 | 45 | 
| roberta-base | wikitext-103-v1 | g5.48xlarge | float16 | 512 | 23 | 44 | 
| gpt2 | wikitext-103-v1 | p4d.24xlarge | float16 | 512 | 36 | 64 | 

**Computer Vision (CV) models**

Tested using [TensorFlow Model Garden](https://github.com/tensorflow/models) in single-node and multi-node, single-GPU and multi-GPU configurations, with Automatic Mixed Precision (AMP) as indicated.


| Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for SageMaker Training Compiler | 
| --- | --- | --- | --- | --- | 
| ResNet152 | food101 | g4dn.16xlarge | float16 | 128 | 144 | 
| ResNet152 | food101 | g5.4xlarge | float16 | 128 | 192 | 
| ResNet152 | food101 | p3.2xlarge | float16 | 152 | 156 | 
| ViT | food101 | g4dn.16xlarge | float16 | 512 | 512 | 
| ViT | food101 | g5.4xlarge | float16 | 992 | 768 | 
| ViT | food101 | p3.2xlarge | float16 | 848 | 768 | 

### PyTorch 1.12.0
<a name="training-compiler-tested-models-pt1120"></a>

**Natural language processing (NLP) models**

The following models were tested for training jobs in all combinations of single-node and multi-node training with single or multiple GPUs, with Automatic Mixed Precision (AMP) as indicated.


| Model | Dataset | Instance type | Precision | Sequence Length | Batch size for native frameworks | Batch size for SageMaker Training Compiler | 
| --- | --- | --- | --- | --- | --- | --- | 
| albert-base-v2 | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 128 | 248 | 
| bert-base-uncased | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 160 | 288 | 
| camembert-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 160 | 279 | 
| camembert-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 105 | 164 | 
| distilgpt2 | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 136 | 256 | 
| distilgpt2 | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 80 | 118 | 
| gpt2 | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 84 | 240 | 
| gpt2 | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 80 | 119 | 
| microsoft/deberta-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 93 | 197 | 
| microsoft/deberta-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 113 | 130 | 
| roberta-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 125 | 224 | 
| roberta-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 78 | 112 | 
| xlnet-base-cased | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 138 | 240 | 
| bert-base-uncased | wikitext-103-v1 | ml.p4d.24xlarge | float16 | 512 |  | 52 | 
| distilbert-base-uncased | wikitext-103-v1 | ml.p4d.24xlarge | float16 | 512 |  | 160 | 
| gpt2 | wikitext-103-v1 | ml.p4d.24xlarge | float16 | 512 |  | 25 | 
| roberta-base | wikitext-103-v1 | ml.p4d.24xlarge | float16 | 512 |  | 64 | 

### TensorFlow 2.11.0
<a name="training-compiler-tested-models-tf2110"></a>

**Computer Vision (CV) models**

Tested using [TensorFlow Model Garden](https://github.com/tensorflow/models) in single-node and multi-node, single-GPU and multi-GPU configurations, with Automatic Mixed Precision (AMP) as indicated.


| Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for SageMaker Training Compiler | 
| --- | --- | --- | --- | --- | --- | 
| MaskRCNN-ResNet50-FPN | COCO-2017 | ml.g5.2xlarge | float16 | 6 | 8 | 
| MaskRCNN-ResNet50-FPN | COCO-2017 | ml.p3.2xlarge | float16 | 4 | 6 | 
| ResNet50 | ImageNet | ml.g5.2xlarge | float16 | 192 | 256 | 
| ResNet50 | ImageNet | ml.p3.2xlarge | float16 | 256 | 256 | 
| ResNet101 | ImageNet | ml.g5.2xlarge | float16 | 128 | 256 | 
| ResNet101 | ImageNet | ml.p3.2xlarge | float16 | 128 | 128 | 
| ResNet152 | ImageNet | ml.g5.2xlarge | float16 | 128 | 224 | 
| ResNet152 | ImageNet | ml.p3.2xlarge | float16 | 128 | 128 | 
| VisionTransformer | ImageNet | ml.g5.2xlarge | float16 | 112 | 144 | 
| VisionTransformer | ImageNet | ml.p3.2xlarge | float16 | 96 | 128 | 

**Natural Language Processing (NLP) models**

Tested using [Transformer models](https://github.com/huggingface/transformers) with `Sequence_Len=128`, in single-node and multi-node, single-GPU and multi-GPU configurations, with Automatic Mixed Precision (AMP) as indicated.


| Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for SageMaker Training Compiler | 
| --- | --- | --- | --- | --- | --- | 
| albert-base-v2 | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 160 | 197 | 
| albert-base-v2 | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 95 | 127 | 
| bert-base-uncased | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 160 | 128 | 
| bert-base-uncased | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 104 | 111 | 
| bert-large-uncased | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 65 | 48 | 
| bert-large-uncased | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 40 | 35 | 
| camembert-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 162 | 
| camembert-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 105 | 111 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 256 | 264 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 169 | 
| gpt2 | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 120 | 
| gpt2 | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 80 | 83 | 
| jplu/tf-xlm-roberta-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 32 | 32 | 
| jplu/tf-xlm-roberta-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 32 | 36 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 144 | 160 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 106 | 110 | 
| roberta-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 128 | 
| roberta-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 72 | 98 | 
| albert-base-v2 | wikitext-2-raw-v1 | ml.g5.48xlarge | float16 | 128 | 192 | 
| albert-base-v2 | wikitext-2-raw-v1 | ml.p3.16xlarge | float16 | 95 | 96 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | ml.g5.48xlarge | float16 | 256 | 256 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | ml.p3.16xlarge | float16 | 140 | 184 | 
| google/electra-small-discriminator | wikitext-2-raw-v1 | ml.g5.48xlarge | float16 | 256 | 384 | 
| google/electra-small-discriminator | wikitext-2-raw-v1 | ml.p3.16xlarge | float16 | 256 | 268 | 
| gpt2 | wikitext-2-raw-v1 | ml.g5.48xlarge | float16 | 116 | 116 | 
| gpt2 | wikitext-2-raw-v1 | ml.p3.16xlarge | float16 | 85 | 83 | 
| gpt2 | wikitext-2-raw-v1 | ml.p4d.24xlarge | float16 | 94 | 110 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | ml.g5.48xlarge | float16 | 187 | 164 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | ml.p3.16xlarge | float16 | 106 | 111 | 

### TensorFlow 2.10.0
<a name="training-compiler-tested-models-tf2100"></a>

**Computer Vision (CV) models**

Tested using [TensorFlow Model Garden](https://github.com/tensorflow/models) in single-node, single-GPU and multi-GPU configurations, with Automatic Mixed Precision (AMP) as indicated.


| Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for SageMaker Training Compiler | 
| --- | --- | --- | --- | --- | --- | 
| DetectionTransformer-ResNet50 | COCO-2017 | ml.g4dn.2xlarge | float32 | 2 | 4 | 
| DetectionTransformer-ResNet50 | COCO-2017 | ml.g5.2xlarge | float32 | 3 | 6 | 
| DetectionTransformer-ResNet50 | COCO-2017 | ml.p3.2xlarge | float32 | 2 | 4 | 
| MaskRCNN-ResNet50-FPN | COCO-2017 | ml.g4dn.2xlarge | float16 | 4 | 6 | 
| MaskRCNN-ResNet50-FPN | COCO-2017 | ml.g5.2xlarge | float16 | 6 | 8 | 
| MaskRCNN-ResNet50-FPN | COCO-2017 | ml.g5.48xlarge | float16 | 48 | 64 | 
| MaskRCNN-ResNet50-FPN | COCO-2017 | ml.p3.2xlarge | float16 | 4 | 6 | 
| ResNet50 | ImageNet | ml.g4dn.2xlarge | float16 | 224 | 256 | 
| ResNet50 | ImageNet | ml.g5.2xlarge | float16 | 192 | 160 | 
| ResNet50 | ImageNet | ml.g5.48xlarge | float16 | 2048 | 2048 | 
| ResNet50 | ImageNet | ml.p3.2xlarge | float16 | 224 | 160 | 
| ResNet101 | ImageNet | ml.g4dn.2xlarge | float16 | 160 | 128 | 
| ResNet101 | ImageNet | ml.g5.2xlarge | float16 | 192 | 256 | 
| ResNet101 | ImageNet | ml.g5.48xlarge | float16 | 2048 | 2048 | 
| ResNet101 | ImageNet | ml.p3.2xlarge | float16 | 160 | 224 | 
| ResNet152 | ImageNet | ml.g4dn.2xlarge | float16 | 128 | 128 | 
| ResNet152 | ImageNet | ml.g5.2xlarge | float16 | 192 | 224 | 
| ResNet152 | ImageNet | ml.g5.48xlarge | float16 | 1536 | 1792 | 
| ResNet152 | ImageNet | ml.p3.2xlarge | float16 | 128 | 160 | 
| VisionTransformer | ImageNet | ml.g4dn.2xlarge | float16 | 80 | 128 | 
| VisionTransformer | ImageNet | ml.g5.2xlarge | float16 | 112 | 144 | 
| VisionTransformer | ImageNet | ml.g5.48xlarge | float16 | 896 | 1152 | 
| VisionTransformer | ImageNet | ml.p3.2xlarge | float16 | 80 | 128 | 

**Natural Language Processing (NLP) models**

Tested using [Transformer models](https://github.com/huggingface/transformers) with `Sequence_Len=128`, in single-node, single-GPU and multi-GPU configurations, with Automatic Mixed Precision (AMP) as indicated.


| Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for SageMaker Training Compiler | 
| --- | --- | --- | --- | --- | --- | 
| albert-base-v2 | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 112 | 
| albert-base-v2 | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 128 | 
| albert-base-v2 | wikitext-2-raw-v1 | p3.8xlarge | float16 | 128 | 135 | 
| albert-base-v2 | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 191 | 
| bert-base-uncased | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 64 | 94 | 
| bert-base-uncased | wikitext-2-raw-v1 | p3.2xlarge | float16 | 96 | 101 | 
| bert-base-uncased | wikitext-2-raw-v1 | p3.8xlarge | float16 | 96 | 96 | 
| bert-base-uncased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 128 | 
| bert-large-uncased | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 35 | 21 | 
| bert-large-uncased | wikitext-2-raw-v1 | p3.2xlarge | float16 | 39 | 26 | 
| bert-large-uncased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 60 | 50 | 
| camembert-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 96 | 90 | 
| camembert-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 96 | 98 | 
| camembert-base | wikitext-2-raw-v1 | p3.8xlarge | float16 | 96 | 96 | 
| camembert-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 128 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 256 | 160 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 176 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | p3.8xlarge | float16 | 128 | 160 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 256 | 258 | 
| google/electra-small-discriminator | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 256 | 216 | 
| google/electra-small-discriminator | wikitext-2-raw-v1 | p3.2xlarge | float16 | 256 | 230 | 
| google/electra-small-discriminator | wikitext-2-raw-v1 | p3.8xlarge | float16 | 256 | 224 | 
| google/electra-small-discriminator | wikitext-2-raw-v1 | g5.4xlarge | float16 | 256 | 320 | 
| gpt2 | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 80 | 64 | 
| gpt2 | wikitext-2-raw-v1 | p3.2xlarge | float16 | 80 | 77 | 
| gpt2 | wikitext-2-raw-v1 | p3.8xlarge | float16 | 80 | 72 | 
| gpt2 | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 120 | 
| jplu/tf-xlm-roberta-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 28 | 24 | 
| jplu/tf-xlm-roberta-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 32 | 24 | 
| jplu/tf-xlm-roberta-base | wikitext-2-raw-v1 | p3.8xlarge | float16 | 32 | 26 | 
| jplu/tf-xlm-roberta-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 66 | 52 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 96 | 92 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 96 | 101 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | p3.8xlarge | float16 | 96 | 101 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 152 | 
| roberta-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 64 | 72 | 
| roberta-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 64 | 84 | 
| roberta-base | wikitext-2-raw-v1 | p3.8xlarge | float16 | 64 | 86 | 
| roberta-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 128 | 

### TensorFlow 2.9.1
<a name="training-compiler-tested-models-tf291"></a>

Tested using [TensorFlow Model Garden](https://github.com/tensorflow/models) with Automatic Mixed Precision (AMP).

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html)

\* The batch sizes marked with the asterisk symbol (\*) indicate the largest batch size tested by the SageMaker Training Compiler developer team. For the marked cells, the instance might be able to fit a larger batch size than what is indicated.

### Transformers 4.21.1 with PyTorch 1.11.0
<a name="training-compiler-tested-models-hf421-pt111"></a>

Tested with `Sequence_Len=512` and Automatic Mixed Precision (AMP).

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html)

### Transformers 4.17.0 with PyTorch 1.10.2
<a name="training-compiler-tested-models-hf417-pt110"></a>

Tested with `Sequence_Len=512` and Automatic Mixed Precision (AMP).

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html)

### Transformers 4.11.0 with PyTorch 1.9.0
<a name="training-compiler-tested-models-hf411-pt190"></a>

Tested with `Sequence_Len=512` and Automatic Mixed Precision (AMP).


**Single-node single-GPU**

| Model | Instance type | Batch size for native | Batch size for Training Compiler | 
| --- | --- | --- | --- | 
| albert-base-v2  | ml.p3.2xlarge | 12 | 32 | 
| bert-base-cased  | ml.p3.2xlarge | 14 | 24 | 
| bert-base-chinese | ml.p3.2xlarge | 16 | 24 | 
| bert-base-multilingual-cased  | ml.p3.2xlarge | 4 | 16 | 
| bert-base-multilingual-uncased  | ml.p3.2xlarge | 8 | 16 | 
| bert-base-uncased  | ml.p3.2xlarge | 12 | 24 | 
| cl-tohoku/bert-base-japanese-whole-word-masking | ml.p3.2xlarge | 12 | 24 | 
| cl-tohoku/bert-base-japanese  | ml.p3.2xlarge | 12 | 24 | 
| distilbert-base-uncased  | ml.p3.2xlarge | 28 | 32 | 
| distilbert-base-uncased-finetuned-sst-2-english | ml.p3.2xlarge | 28 | 32 | 
| distilgpt2  | ml.p3.2xlarge | 16 | 32 | 
| facebook/bart-base  | ml.p3.2xlarge | 4 | 8 | 
| gpt2 | ml.p3.2xlarge | 6 | 20 | 
| nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large  | ml.p3.2xlarge | 20 | 32 | 
| roberta-base  | ml.p3.2xlarge | 12 | 20 | 


**Single-node multi-GPU**

| Model | Instance type | Batch size for native | Batch size for Training Compiler | 
| --- | --- | --- | --- | 
| bert-base-chinese  | ml.p3.8xlarge | 16 | 26 | 
| bert-base-multilingual-cased  | ml.p3.8xlarge | 6 | 16 | 
| bert-base-multilingual-uncased | ml.p3.8xlarge | 6 | 16 | 
| bert-base-uncased  | ml.p3.8xlarge | 14 | 24 | 
| distilbert-base-uncased  | ml.p3.8xlarge | 14 | 32 | 
| distilgpt2 | ml.p3.8xlarge | 6 | 32 | 
| facebook/bart-base | ml.p3.8xlarge | 8 | 16 | 
| gpt2  | ml.p3.8xlarge | 8 | 20 | 
| roberta-base  | ml.p3.8xlarge | 12 | 20 | 

### Transformers 4.17.0 with TensorFlow 2.6.3
<a name="training-compiler-tested-models-hf417-tf263"></a>

Tested with `Sequence_Len=128` and Automatic Mixed Precision (AMP).


| Model  | Instance type | Batch size for native frameworks | Batch size for Training Compiler | 
| --- | --- | --- | --- | 
| albert-base-v2 | ml.g4dn.16xlarge | 136 | 208 | 
| albert-base-v2 | ml.g5.4xlarge | 219 | 312 | 
| albert-base-v2 | ml.p3.2xlarge | 152 | 208 | 
| albert-base-v2 | ml.p3.8xlarge | 152 | 192 | 
| bert-base-uncased | ml.g4dn.16xlarge | 120 | 101 | 
| bert-base-uncased | ml.g5.4xlarge | 184 | 160 | 
| bert-base-uncased | ml.p3.2xlarge | 128 | 108 | 
| bert-large-uncased | ml.g4dn.16xlarge | 37 | 28 | 
| bert-large-uncased | ml.g5.4xlarge | 64 | 55 | 
| bert-large-uncased | ml.p3.2xlarge | 40 | 32 | 
| camembert-base | ml.g4dn.16xlarge | 96 | 100 | 
| camembert-base | ml.g5.4xlarge | 190 | 160 | 
| camembert-base | ml.p3.2xlarge | 129 | 108 | 
| camembert-base | ml.p3.8xlarge | 128 | 104 | 
| distilbert-base-uncased | ml.g4dn.16xlarge | 210 | 160 | 
| distilbert-base-uncased | ml.g5.4xlarge | 327 | 288 | 
| distilbert-base-uncased | ml.p3.2xlarge | 224 | 196 | 
| distilbert-base-uncased | ml.p3.8xlarge | 192 | 182 | 
| google/electra-small-discriminator | ml.g4dn.16xlarge | 336 | 288 | 
| google/electra-small-discriminator | ml.g5.4xlarge | 504 | 384 | 
| google/electra-small-discriminator | ml.p3.2xlarge | 352 | 323 | 
| gpt2 | ml.g4dn.16xlarge | 89 | 64 | 
| gpt2 | ml.g5.4xlarge | 140 | 146 | 
| gpt2 | ml.p3.2xlarge | 94 | 96 | 
| gpt2 | ml.p3.8xlarge | 96 | 88 | 
| jplu/tf-xlm-roberta-base | ml.g4dn.16xlarge | 52 | 16 | 
| jplu/tf-xlm-roberta-base | ml.g5.4xlarge | 64 | 44 | 
| microsoft/mpnet-base | ml.g4dn.16xlarge | 120 | 100 | 
| microsoft/mpnet-base | ml.g5.4xlarge | 192 | 160 | 
| microsoft/mpnet-base | ml.p3.2xlarge | 128 | 104 | 
| microsoft/mpnet-base | ml.p3.8xlarge | 130 | 92 | 
| roberta-base | ml.g4dn.16xlarge | 108 | 64 | 
| roberta-base | ml.g5.4xlarge | 176 | 142 | 
| roberta-base | ml.p3.2xlarge | 118 | 100 | 
| roberta-base | ml.p3.8xlarge | 112 | 88 | 

### Transformers 4.11.0 with TensorFlow 2.5.1
<a name="training-compiler-tested-models-hf411-tf251"></a>

Tested with `Sequence_Len=128` and Automatic Mixed Precision (AMP).


**Single-node single-GPU**

| Model | Instance type | Batch size for native | Batch size for Training Compiler | 
| --- | --- | --- | --- | 
| albert-base-v2  | ml.p3.2xlarge | 128 | 128 | 
| bart-base  | ml.p3.2xlarge | 12 | 64 | 
| bart-large  | ml.p3.2xlarge | 4 | 28 | 
| bert-base-cased  | ml.p3.2xlarge | 16 | 128 | 
| bert-base-chinese | ml.p3.2xlarge | 16 | 128 | 
| bert-base-multilingual-cased  | ml.p3.2xlarge | 12 | 64 | 
| bert-base-multilingual-uncased  | ml.p3.2xlarge | 16 | 96 | 
| bert-base-uncased | ml.p3.2xlarge | 16 | 96 | 
| bert-large-uncased  | ml.p3.2xlarge | 4 | 24 | 
| cl-tohoku/bert-base-japanese  | ml.p3.2xlarge | 16 | 128 | 
| cl-tohoku/bert-base-japanese-whole-word-masking  | ml.p3.2xlarge | 16 | 128 | 
| distilbert-base-sst2  | ml.p3.2xlarge | 32 | 128 | 
| distilbert-base-uncased  | ml.p3.2xlarge | 32 | 128 | 
| distilgpt2 | ml.p3.2xlarge | 32 | 128 | 
| gpt2  | ml.p3.2xlarge | 12 | 64 | 
| gpt2-large  | ml.p3.2xlarge | 2 | 24 | 
| jplu/tf-xlm-roberta-base  | ml.p3.2xlarge | 12 | 32 | 
| roberta-base  | ml.p3.2xlarge | 4 | 64 | 
| roberta-large  | ml.p3.2xlarge | 4 | 64 | 
| t5-base  | ml.p3.2xlarge | 64 | 64 | 
| t5-small  | ml.p3.2xlarge | 128 | 128 | 

# Bring Your Own Deep Learning Model
<a name="training-compiler-modify-scripts"></a>


This guide walks you through how to adapt your training script for a compiler-accelerated training job. The preparation of your training script depends on the following:
+ Training settings such as single-core or distributed training.
+ Frameworks and libraries that you use to create the training script.

Choose one of the following topics depending on the framework you use.

**Topics**
+ [PyTorch](training-compiler-pytorch-models.md)
+ [TensorFlow](training-compiler-tensorflow.md)

**Note**  
After you finish preparing your training script, you can run a SageMaker training job using the SageMaker AI framework estimator classes. For more information, see the previous topic at [Enable SageMaker Training Compiler](training-compiler-enable.md).

# PyTorch
<a name="training-compiler-pytorch-models"></a>

Bring your own PyTorch model to SageMaker AI, and run the training job with SageMaker Training Compiler.

**Topics**
+ [PyTorch Models with Hugging Face Transformers](#training-compiler-pytorch-models-transformers)

## PyTorch Models with Hugging Face Transformers
<a name="training-compiler-pytorch-models-transformers"></a>

PyTorch models with [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) are based on PyTorch's [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) API. Hugging Face Transformers also provides [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) and pretrained model classes for PyTorch to help reduce the effort for configuring natural language processing (NLP) models. After preparing your training script, you can launch a training job using the SageMaker AI `PyTorch` or `HuggingFace` estimator with the SageMaker Training Compiler configuration, and then proceed to the next topic at [Enable SageMaker Training Compiler](training-compiler-enable.md).

**Tip**  
When you create a tokenizer for an NLP model using Transformers in your training script, make sure that you use a static input tensor shape by specifying `padding='max_length'`. Do not use `padding='longest'` because padding to the longest sequence in the batch can change the tensor shape for each training batch. The dynamic input shape can trigger recompilation of the model and might increase total training time. For more information about padding options of the Transformers tokenizers, see [Padding and truncation](https://huggingface.co/docs/transformers/pad_truncation) in the *Hugging Face Transformers documentation*.
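For example, using the `bert-base-uncased` tokenizer (shown here only for illustration; substitute your own model), `padding='max_length'` gives every batch the same static shape:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# padding='max_length' pads every sequence to the same fixed length,
# so the compiled graph can be reused across training steps.
encodings = tokenizer(
    ["a short example", "a slightly longer example sentence"],
    padding="max_length",   # avoid padding="longest", which varies per batch
    truncation=True,
    max_length=128,
)

# Both sequences are padded to exactly 128 tokens.
lengths = [len(ids) for ids in encodings["input_ids"]]
print(lengths)  # [128, 128]
```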

**Topics**
+ [Large Language Models Using the Hugging Face Transformers `Trainer` Class](#training-compiler-pytorch-models-transformers-trainer)
+ [Large Language Models Using PyTorch Directly (without the Hugging Face Transformers Trainer API)](#training-compiler-pytorch-models-non-trainer)

### Large Language Models Using the Hugging Face Transformers `Trainer` Class
<a name="training-compiler-pytorch-models-transformers-trainer"></a>

If you use the Hugging Face Transformers `Trainer` class, you don't need to make any additional changes to your training script. SageMaker Training Compiler automatically compiles your `Trainer` model if you enable the compiler through the estimator class. The following code shows the basic form of a PyTorch training script with the Hugging Face `Trainer` API.

```
from transformers import Trainer, TrainingArguments

training_args=TrainingArguments(**kwargs)
trainer=Trainer(args=training_args, **kwargs)
```

**Topics**
+ [For single GPU training](#training-compiler-pytorch-models-transformers-trainer-single-gpu)
+ [For distributed training](#training-compiler-pytorch-models-transformers-trainer-distributed)
+ [Best Practices to Use SageMaker Training Compiler with `Trainer`](#training-compiler-pytorch-models-transformers-trainer-best-practices)

#### For single GPU training
<a name="training-compiler-pytorch-models-transformers-trainer-single-gpu"></a>

You don't need to change your code when you use the Hugging Face Transformers [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class.

#### For distributed training
<a name="training-compiler-pytorch-models-transformers-trainer-distributed"></a>

**PyTorch v1.11.0 and later**

To run distributed training with SageMaker Training Compiler, you must add the following `_mp_fn()` function in your training script and wrap the `main()` function. It redirects the `_mp_fn(index)` function calls from the SageMaker AI distributed runtime for PyTorch (`pytorchxla`) to the `main()` function of your training script. 

```
def _mp_fn(index):
    main()
```

This function accepts the `index` argument to indicate the rank of the current GPU in the cluster for distributed training. To find more example scripts, see the [Hugging Face Transformers language modeling example scripts](https://github.com/huggingface/transformers/blob/v4.21.1/examples/pytorch/language-modeling).

**For Transformers v4.17 and before with PyTorch v1.10.2 and before**

SageMaker Training Compiler uses an alternate mechanism for launching a distributed training job, and you don't need to make any modification in your training script. Instead, SageMaker Training Compiler requires you to pass a SageMaker AI distributed training launcher script to the `entry_point` argument and pass your training script to the `hyperparameters` argument in the SageMaker AI Hugging Face estimator.

#### Best Practices to Use SageMaker Training Compiler with `Trainer`
<a name="training-compiler-pytorch-models-transformers-trainer-best-practices"></a>
+ Make sure that you use SyncFree optimizers by setting the `optim` argument to `adamw_torch_xla` while setting up [transformers.TrainingArgument](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments). See also [Optimizer](https://huggingface.co/docs/transformers/v4.23.1/en/perf_train_gpu_one#optimizer) in the *Hugging Face Transformers documentation*.
+ Ensure that the throughput of the data processing pipeline is higher than the training throughput. You can tweak the `dataloader_num_workers` and `preprocessing_num_workers` arguments of the [transformers.TrainingArgument](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) class to achieve this. Typically, these need to be greater than or equal to the number of GPUs but less than the number of CPUs.

After you have completed adapting your training script, proceed to [Run PyTorch Training Jobs with SageMaker Training Compiler](training-compiler-enable-pytorch.md).

### Large Language Models Using PyTorch Directly (without the Hugging Face Transformers Trainer API)
<a name="training-compiler-pytorch-models-non-trainer"></a>

If you have a training script that uses PyTorch directly, you need to make additional changes to your PyTorch training script to implement PyTorch/XLA. Follow the instructions to modify your script to properly set up the PyTorch/XLA primitives.

**Topics**
+ [For single GPU training](#training-compiler-pytorch-models-non-trainer-single-gpu)
+ [For distributed training](#training-compiler-pytorch-models-non-trainer-distributed)
+ [Best Practices to Use SageMaker Training Compiler with PyTorch/XLA](#training-compiler-pytorch-models-best-practices)

#### For single GPU training
<a name="training-compiler-pytorch-models-non-trainer-single-gpu"></a>

1. Import the optimization libraries.

   ```
   import torch_xla
   import torch_xla.core.xla_model as xm
   ```

1. Change the target device to XLA instead of `torch.device("cuda")`.

   ```
   device=xm.xla_device()
   ```

1. If you're using PyTorch's [Automatic Mixed Precision](https://pytorch.org/docs/stable/amp.html) (AMP), do the following:

   1. Replace `torch.cuda.amp` with the following:

      ```
      import torch_xla.amp
      ```

   1. Replace `torch.optim.SGD` and `torch.optim.Adam` with the sync-free versions:

      ```
      from torch_xla.amp.syncfree import Adam, SGD
      ```

   1. Replace `torch.cuda.amp.GradScaler` with the following:

      ```
      from torch_xla.amp import GradScaler
      ```

1. If you're not using AMP, replace `optimizer.step()` with the following:

   ```
   xm.optimizer_step(optimizer)
   ```

1. If you're using a distributed dataloader, wrap your dataloader in the PyTorch/XLA's `ParallelLoader` class:

   ```
   import torch_xla.distributed.parallel_loader as pl
   parallel_loader=pl.ParallelLoader(dataloader, [device]).per_device_loader(device)
   ```

1. Add `mark_step` at the end of the training loop when you're not using `parallel_loader`:

   ```
   xm.mark_step()
   ```

1. To checkpoint your training, use the PyTorch/XLA's model checkpoint method:

   ```
   xm.save(model.state_dict(), path_to_save)
   ```
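Putting the steps above together, a minimal single-GPU training loop (without AMP) might look like the following sketch. This is illustrative only and assumes `model`, `optimizer`, and `dataloader` are defined earlier in your script; it requires a SageMaker Training Compiler container where `torch_xla` is installed.

```python
import torch.nn.functional as F
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

device = xm.xla_device()                  # step 2: target the XLA device
model = model.to(device)
loader = pl.ParallelLoader(dataloader, [device]).per_device_loader(device)  # step 5

for data, target in loader:
    optimizer.zero_grad()
    loss = F.nll_loss(model(data), target)
    loss.backward()
    xm.optimizer_step(optimizer)          # step 4: replaces optimizer.step()

xm.save(model.state_dict(), "checkpoint.pt")  # step 7: checkpoint with xm.save
```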

After you have completed adapting your training script, proceed to [Run PyTorch Training Jobs with SageMaker Training Compiler](training-compiler-enable-pytorch.md).

#### For distributed training
<a name="training-compiler-pytorch-models-non-trainer-distributed"></a>

In addition to the changes listed in the previous [For single GPU training](#training-compiler-pytorch-models-non-trainer-single-gpu) section, add the following changes to properly distribute workload across GPUs.

1. If you're using AMP, add `all_reduce` after `scaler.scale(loss).backward()`:

   ```
   gradients=xm._fetch_gradients(optimizer)
   xm.all_reduce('sum', gradients, scale=1.0/xm.xrt_world_size())
   ```

1. If you need to set variables for `local_rank` and `world_size`, use code similar to the following:

   ```
   local_rank=xm.get_local_ordinal()
   world_size=xm.xrt_world_size()
   ```

1. For any `world_size` (`num_gpus_per_node*num_nodes`) greater than `1`, you must define a train sampler which should look similar to the following:

   ```
   import torch_xla.core.xla_model as xm
   
   # Initialize to None so the DataLoader arguments below also work
   # in the single-process case.
   train_sampler=None
   if xm.xrt_world_size() > 1:
       train_sampler=torch.utils.data.distributed.DistributedSampler(
           train_dataset,
           num_replicas=xm.xrt_world_size(),
           rank=xm.get_ordinal(),
           shuffle=True
       )
   
   train_loader=torch.utils.data.DataLoader(
       train_dataset, 
       batch_size=args.batch_size,
       sampler=train_sampler,
       drop_last=args.drop_last,
       shuffle=False if train_sampler else True,
       num_workers=args.num_workers
   )
   ```

1. Make the following changes to make sure you use the `parallel_loader` provided by the `torch_xla distributed` module. 

   ```
   import torch_xla.distributed.parallel_loader as pl
   train_device_loader=pl.MpDeviceLoader(train_loader, device)
   ```

   The `train_device_loader` functions like a regular PyTorch loader as follows: 

   ```
   for step, (data, target) in enumerate(train_device_loader):
       optimizer.zero_grad()
       output=model(data)
       loss=torch.nn.functional.nll_loss(output, target)
       loss.backward()
       xm.optimizer_step(optimizer)
   ```

   With all of these changes, you should be able to launch distributed training with any PyTorch model without the Transformers Trainer API. Note that these instructions can be used for both single-node multi-GPU and multi-node multi-GPU training.

1. **For PyTorch v1.11.0 and later**

   To run distributed training with SageMaker Training Compiler, you must add the following `_mp_fn()` function in your training script and wrap the `main()` function. It redirects the `_mp_fn(index)` function calls from the SageMaker AI distributed runtime for PyTorch (`pytorchxla`) to the `main()` function of your training script. 

   ```
   def _mp_fn(index):
       main()
   ```

   This function accepts the `index` argument to indicate the rank of the current GPU in the cluster for distributed training. To find more example scripts, see the [Hugging Face Transformers language modeling example scripts](https://github.com/huggingface/transformers/blob/v4.21.1/examples/pytorch/language-modeling).

   **For Transformers v4.17 and before with PyTorch v1.10.2 and before**

   SageMaker Training Compiler uses an alternate mechanism for launching a distributed training job and requires you to pass a SageMaker AI distributed training launcher script to the `entry_point` argument and pass your training script to the `hyperparameters` argument in the SageMaker AI Hugging Face estimator.

After you have completed adapting your training script, proceed to [Run PyTorch Training Jobs with SageMaker Training Compiler](training-compiler-enable-pytorch.md).

#### Best Practices to Use SageMaker Training Compiler with PyTorch/XLA
<a name="training-compiler-pytorch-models-best-practices"></a>

If you want to leverage the SageMaker Training Compiler on your native PyTorch training script, you may want to first get familiar with [PyTorch on XLA devices](https://pytorch.org/xla/release/1.9/index.html). The following sections list some best practices to enable XLA for PyTorch.

**Note**  
This section for best practices assumes that you use the following PyTorch/XLA modules:  

```
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
```

##### Understand the lazy mode in PyTorch/XLA
<a name="training-compiler-pytorch-models-best-practices-lazy-mode"></a>

One significant difference between PyTorch/XLA and native PyTorch is that the PyTorch/XLA system runs in lazy mode while the native PyTorch runs in eager mode. Tensors in lazy mode are placeholders for building the computational graph until they are materialized after the compilation and evaluation are complete. The PyTorch/XLA system builds the computational graph on the fly when you call PyTorch APIs to build the computation using tensors and operators. The computational graph gets compiled and executed when `xm.mark_step()` is called explicitly or implicitly by `pl.MpDeviceLoader/pl.ParallelLoader`, or when you explicitly request the value of a tensor such as by calling `loss.item()` or `print(loss)`. 

##### Minimize the number of *compilation-and-executions* using `pl.MpDeviceLoader/pl.ParallelLoader` and `xm.step_closure`
<a name="training-compiler-pytorch-models-best-practices-minimize-comp-exec"></a>

For best performance, you should keep in mind the possible ways to initiate *compilation-and-executions* as described in [Understand the lazy mode in PyTorch/XLA](#training-compiler-pytorch-models-best-practices-lazy-mode) and should try to minimize the number of compilation-and-executions. Ideally, only one compilation-and-execution is necessary per training iteration, and it is initiated automatically by `pl.MpDeviceLoader/pl.ParallelLoader`. The `MpDeviceLoader` is optimized for XLA and should always be used if possible for best performance. During training, you might want to examine some intermediate results such as loss values. In such cases, printing of lazy tensors should be wrapped using `xm.add_step_closure()` to avoid unnecessary compilation-and-executions.

##### Use AMP and `syncfree` optimizers
<a name="training-compiler-pytorch-models-best-practices-amp-optimizers"></a>

Training in Automatic Mixed Precision (AMP) mode significantly accelerates your training speed by leveraging the Tensor cores of NVIDIA GPUs. SageMaker Training Compiler provides `syncfree` optimizers that are optimized for XLA to improve AMP performance. Currently, the following three `syncfree` optimizers are available and should be used if possible for best performance.

```
torch_xla.amp.syncfree.SGD
torch_xla.amp.syncfree.Adam
torch_xla.amp.syncfree.AdamW
```

These `syncfree` optimizers should be paired with `torch_xla.amp.GradScaler` for gradient scaling/unscaling.
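As an example, an AMP training step that pairs a `syncfree` optimizer with the XLA `GradScaler` might look like the following sketch. The `model`, `loss_fn`, and `train_device_loader` (an `MpDeviceLoader`) objects are assumptions defined elsewhere in your script, and the code requires a container with `torch_xla` installed.

```python
import torch_xla.amp
import torch_xla.core.xla_model as xm
from torch_xla.amp.syncfree import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)  # sync-free AdamW
scaler = torch_xla.amp.GradScaler()

for data, target in train_device_loader:
    optimizer.zero_grad()
    with torch_xla.amp.autocast(xm.xla_device()):
        loss = loss_fn(model(data), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # skips the step if gradients contain inf/NaN
    scaler.update()
```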

**Tip**  
Starting with PyTorch 1.13.1, SageMaker Training Compiler improves performance by letting PyTorch/XLA automatically override the optimizers (such as SGD, Adam, AdamW) in `torch.optim` or `transformers.optimization` with the `syncfree` versions of them in `torch_xla.amp.syncfree` (such as `torch_xla.amp.syncfree.SGD`, `torch_xla.amp.syncfree.Adam`, `torch_xla.amp.syncfree.AdamW`). You don't need to change the code lines where you define optimizers in your training script.

# TensorFlow
<a name="training-compiler-tensorflow"></a>

Bring your own TensorFlow model to SageMaker AI, and run the training job with SageMaker Training Compiler.

## TensorFlow Models
<a name="training-compiler-tensorflow-models"></a>

SageMaker Training Compiler automatically optimizes model training workloads that are built on top of the native TensorFlow API or the high-level Keras API.

**Tip**  
For preprocessing your input dataset, ensure that you use a static input shape. Dynamic input shape can initiate recompilation of the model and might increase total training time. 

### Using Keras (Recommended)
<a name="training-compiler-tensorflow-models-keras"></a>

For the best compiler acceleration, we recommend using models that are subclasses of TensorFlow Keras ([tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model)).

#### For single GPU training
<a name="training-compiler-tensorflow-models-keras-single-gpu"></a>

There's no additional change you need to make in the training script.

### Without Keras
<a name="training-compiler-tensorflow-models-no-keras"></a>

SageMaker Training Compiler does not support eager execution in TensorFlow. Accordingly, you should wrap your model and training loops with the TensorFlow function decorator (`@tf.function`) to leverage compiler acceleration.

SageMaker Training Compiler performs a graph-level optimization, and uses the decorator to make sure your TensorFlow functions are set to run in [graph mode](https://www.tensorflow.org/guide/intro_to_graphs).

#### For single GPU training
<a name="training-compiler-tensorflow-models-no-keras-single-gpu"></a>

TensorFlow 2.0 or later has the eager execution on by default, so you should add the `@tf.function` decorator in front of every function that you use for constructing a TensorFlow model.
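As a minimal illustration, the following toy step function is traced into a graph by `@tf.function`; the "model" here is a single scalar weight, purely for demonstration:

```python
import tensorflow as tf

w = tf.Variable(2.0)  # toy model: a single trainable weight

@tf.function  # run in graph mode so the compiler can optimize the whole step
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((w * x - y) ** 2)
    grad = tape.gradient(loss, w)
    w.assign_sub(0.1 * grad)  # plain gradient-descent update
    return loss

loss = train_step(tf.constant([1.0, 2.0]), tf.constant([3.0, 4.0]))
```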

## TensorFlow Models with Hugging Face Transformers
<a name="training-compiler-tensorflow-models-transformers"></a>

TensorFlow models with [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) are based on TensorFlow's [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) API. Hugging Face Transformers also provides pretrained model classes for TensorFlow to help reduce the effort for configuring natural language processing (NLP) models. After creating your own training script using the Transformers library, you can run the training script using the SageMaker AI `HuggingFace` estimator with the SageMaker Training Compiler configuration class as shown in the previous topic at [Run TensorFlow Training Jobs with SageMaker Training Compiler](training-compiler-enable-tensorflow.md).

SageMaker Training Compiler automatically optimizes model training workloads that are built on top of the native TensorFlow API or the high-level Keras API, such as the TensorFlow transformer models.

**Tip**  
When you create a tokenizer for an NLP model using Transformers in your training script, make sure that you use a static input tensor shape by specifying `padding='max_length'`. Do not use `padding='longest'` because padding to the longest sequence in the batch can change the tensor shape for each training batch. The dynamic input shape can initiate recompilation of the model and might increase total training time. For more information about padding options of the Transformers tokenizers, see [Padding and truncation](https://huggingface.co/docs/transformers/pad_truncation) in the *Hugging Face Transformers documentation*.

**Topics**
+ [Using Keras](#training-compiler-tensorflow-models-transformers-keras)
+ [Without Keras](#training-compiler-tensorflow-models-transformers-no-keras)

### Using Keras
<a name="training-compiler-tensorflow-models-transformers-keras"></a>

For the best compiler acceleration, we recommend using models that are subclasses of TensorFlow Keras ([tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model)). As noted in the [Quick tour](https://huggingface.co/docs/transformers/quicktour) page in the *Hugging Face Transformers documentation*, you can use the models as regular TensorFlow Keras models.

#### For single GPU training
<a name="training-compiler-tensorflow-models-transformers-keras-single-gpu"></a>

There's no additional change you need to make in the training script.

#### For distributed training
<a name="training-compiler-tensorflow-models-transformers-keras-distributed"></a>

SageMaker Training Compiler acceleration works transparently for multi-GPU workloads when the model is constructed and trained using Keras APIs within the scope of a [tf.distribute.Strategy](https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy) call.

1. Choose the right distributed training strategy.

   1. For single-node multi-GPU, use `tf.distribute.MirroredStrategy` to set the strategy.

      ```
      strategy = tf.distribute.MirroredStrategy()
      ```

   1. For multi-node multi-GPU, add the following code to properly set the TensorFlow distributed training configuration before creating the strategy.

      ```
      import json
      import os
      
      def set_sm_dist_config():
          DEFAULT_PORT = '8890'
          DEFAULT_CONFIG_FILE = '/opt/ml/input/config/resourceconfig.json'
          with open(DEFAULT_CONFIG_FILE) as f:
              config = json.loads(f.read())
              current_host = config['current_host']
          tf_config = {
              'cluster': {
                  'worker': []
              },
              'task': {'type': 'worker', 'index': -1}
          }
          for i, host in enumerate(config['hosts']):
              tf_config['cluster']['worker'].append("%s:%s" % (host, DEFAULT_PORT))
              if current_host == host:
                  tf_config['task']['index'] = i
          os.environ['TF_CONFIG'] = json.dumps(tf_config)
      
      set_sm_dist_config()
      ```

       Use `tf.distribute.MultiWorkerMirroredStrategy` to set the strategy.

      ```
      strategy = tf.distribute.MultiWorkerMirroredStrategy()
      ```

1. Using the strategy of your choice, wrap the model.

   ```
   with strategy.scope():
       # create a model and do fit
   ```
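For example, a minimal Keras model created and trained under `MirroredStrategy` might look like the following sketch; the synthetic data is for illustration only, and on a CPU-only or single-GPU machine the strategy simply runs with one replica:

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

with strategy.scope():
    # create a model and do fit
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Synthetic data, purely for demonstration.
x = np.random.rand(64, 8).astype("float32")
y = np.random.rand(64, 1).astype("float32")
model.fit(x, y, batch_size=32, epochs=1, verbose=0)
```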

### Without Keras
<a name="training-compiler-tensorflow-models-transformers-no-keras"></a>

If you want to bring custom models with custom training loops using TensorFlow without Keras, you should wrap the model and the training loop with the TensorFlow function decorator (`@tf.function`) to leverage compiler acceleration.

SageMaker Training Compiler performs a graph-level optimization, and uses the decorator to make sure your TensorFlow functions are set to run in graph mode. 

#### For single GPU training
<a name="training-compiler-tensorflow-models-transformers-no-keras-single-gpu"></a>

TensorFlow 2.0 or later has the eager execution on by default, so you should add the `@tf.function` decorator in front of every function that you use for constructing a TensorFlow model.

#### For distributed training
<a name="training-compiler-tensorflow-models-transformers-no-keras-distributed"></a>

In addition to the changes needed for [Using Keras for distributed training](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-tensorflow-models.html#training-compiler-tensorflow-models-transformers-keras), you need to ensure that functions to be run on each GPU are annotated with `@tf.function`, while cross-GPU communication functions are not annotated. Example training code should look like the following:

```
@tf.function()
def compiled_step(inputs, outputs):
    with tf.GradientTape() as tape:
        pred=model(inputs, training=True)
        total_loss=loss_object(outputs, pred)/args.batch_size
    gradients=tape.gradient(total_loss, model.trainable_variables)
    return total_loss, pred, gradients

def train_step(inputs, outputs):
    total_loss, pred, gradients=compiled_step(inputs, outputs)
    if args.weight_decay > 0.:
        gradients=[g+v*args.weight_decay for g,v in zip(gradients, model.trainable_variables)]

    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss.update_state(total_loss)
    train_accuracy.update_state(outputs, pred)

@tf.function()
def train_step_dist(inputs, outputs):
    strategy.run(train_step, args=(inputs, outputs))
```

Note that this instruction can be used for both single-node multi-GPU and multi-node multi-GPU.

# Enable SageMaker Training Compiler
<a name="training-compiler-enable"></a>


SageMaker Training Compiler is built into the SageMaker Python SDK and AWS Deep Learning Containers so that you don’t need to change your workflows to enable Training Compiler. Choose one of the following topics that matches with your use case.

**Topics**
+ [Run PyTorch Training Jobs with SageMaker Training Compiler](training-compiler-enable-pytorch.md)
+ [Run TensorFlow Training Jobs with SageMaker Training Compiler](training-compiler-enable-tensorflow.md)

# Run PyTorch Training Jobs with SageMaker Training Compiler
<a name="training-compiler-enable-pytorch"></a>

You can use any of the SageMaker AI interfaces to run a training job with SageMaker Training Compiler: Amazon SageMaker Studio Classic, Amazon SageMaker notebook instances, AWS SDK for Python (Boto3), and AWS Command Line Interface.

**Topics**
+ [Using the SageMaker Python SDK](#training-compiler-enable-pytorch-pysdk)
+ [Using the SageMaker AI `CreateTrainingJob` API Operation](#training-compiler-enable-pytorch-api)

## Using the SageMaker Python SDK
<a name="training-compiler-enable-pytorch-pysdk"></a>

SageMaker Training Compiler for PyTorch is available through the SageMaker AI [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html) and [HuggingFace](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-estimator) framework estimator classes. To turn on SageMaker Training Compiler, add the `compiler_config` parameter to the SageMaker AI estimators. Import the `TrainingCompilerConfig` class and pass an instance of it to the `compiler_config` parameter. The following code examples show the structure of SageMaker AI estimator classes with SageMaker Training Compiler turned on.

**Tip**  
To get started with prebuilt models provided by PyTorch or Transformers, try using the batch sizes provided in the reference table at [Tested Models](training-compiler-support.md#training-compiler-tested-models).

**Note**  
The native PyTorch support is available in the SageMaker Python SDK v2.121.0 and later. Make sure that you update the SageMaker Python SDK accordingly.

**Note**  
Starting with PyTorch v1.12.0, SageMaker Training Compiler containers for PyTorch are available. Note that the SageMaker Training Compiler containers for PyTorch are not prepackaged with Hugging Face Transformers. If you need to install the library in the container, make sure that you add the `requirements.txt` file under the source directory when submitting a training job.  
For PyTorch v1.11.0 and before, use the previous versions of the SageMaker Training Compiler containers for Hugging Face and PyTorch.  
For a complete list of framework versions and corresponding container information, see [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).

For information that fits your use case, see one of the following options.

### For single GPU training
<a name="training-compiler-estimator-pytorch-single"></a>

------
#### [ PyTorch v1.12.0 and later ]

To compile and train a PyTorch model, configure a SageMaker AI PyTorch estimator with SageMaker Training Compiler as shown in the following code example.

**Note**  
This native PyTorch support is available in the SageMaker AI Python SDK v2.120.0 and later. Make sure that you update the SageMaker AI Python SDK.

```
from sagemaker.pytorch import PyTorch, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_estimator=PyTorch(
    entry_point='train.py',
    source_dir='path-to-requirements-file', # Optional. Add this if need to install additional packages.
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='1.13.1',
    py_version='py3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_estimator.fit()
```

------
#### [ Hugging Face Transformers with PyTorch v1.11.0 and before ]

To compile and train a transformer model with PyTorch, configure a SageMaker AI Hugging Face estimator with SageMaker Training Compiler as shown in the following code example.

```
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_huggingface_estimator=HuggingFace(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    transformers_version='4.21.1',
    pytorch_version='1.11.0',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_huggingface_estimator.fit()
```

To prepare your training script, see the following pages.
+ [For single GPU training](training-compiler-pytorch-models.md#training-compiler-pytorch-models-transformers-trainer-single-gpu) of a PyTorch model using Hugging Face Transformers' [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer)
+ [For single GPU training](training-compiler-pytorch-models.md#training-compiler-pytorch-models-non-trainer-single-gpu) of a PyTorch model without Hugging Face Transformers' [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html)

To find end-to-end examples, see the following notebooks:
+ [Compile and Train a Hugging Face Transformers Trainer Model for Question and Answering with the SQuAD dataset ](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/huggingface/pytorch_single_gpu_single_node/albert-base-v2/albert-base-v2.html) 
+ [Compile and Train a Hugging Face Transformer `BERT` Model with the SST Dataset using SageMaker Training Compiler](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/huggingface/pytorch_single_gpu_single_node/bert-base-cased/bert-base-cased-single-node-single-gpu.html) 
+ [Compile and Train a Binary Classification Trainer Model with the SST2 Dataset for Single-Node Single-GPU Training ](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/huggingface/pytorch_single_gpu_single_node/roberta-base/roberta-base.html)

------

### For distributed training
<a name="training-compiler-estimator-pytorch-distributed"></a>

------
#### [ PyTorch v1.12 ]

For PyTorch v1.12, you can run distributed training with SageMaker Training Compiler by adding the `pytorchxla` option to the `distribution` parameter of the SageMaker AI PyTorch estimator class.

**Note**  
This native PyTorch support is available in the SageMaker AI Python SDK v2.121.0 and later. Make sure that you update the SageMaker AI Python SDK.

```
from sagemaker.pytorch import PyTorch, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit to GPU memory without compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit to GPU memory with compiler
batch_size=26

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_estimator=PyTorch(
    entry_point='your_training_script.py',
    source_dir='path-to-requirements-file', # Optional. Add this if you need to install additional packages.
    instance_count=instance_count,
    instance_type=instance_type,
    framework_version='1.13.1',
    py_version='py3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    distribution={'pytorchxla': {'enabled': True}},
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_estimator.fit()
```

**Tip**  
To prepare your training script, see [PyTorch](training-compiler-pytorch-models.md)

------
#### [ Transformers v4.21 with PyTorch v1.11 ]

For PyTorch v1.11 and later, SageMaker Training Compiler is available for distributed training when you add the `pytorchxla` option to the `distribution` parameter.

```
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit to GPU memory without compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit to GPU memory with compiler
batch_size=26

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_huggingface_estimator=HuggingFace(
    entry_point='your_training_script.py',
    instance_count=instance_count,
    instance_type=instance_type,
    transformers_version='4.21.1',
    pytorch_version='1.11.0',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    distribution={'pytorchxla': {'enabled': True}},
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_huggingface_estimator.fit()
```

**Tip**  
To prepare your training script, see the following pages.  
[For distributed training](training-compiler-pytorch-models.md#training-compiler-pytorch-models-transformers-trainer-distributed) of a PyTorch model using Hugging Face Transformers' [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html)
[For distributed training](training-compiler-pytorch-models.md#training-compiler-pytorch-models-non-trainer-distributed) of a PyTorch model without Hugging Face Transformers' [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html)

------
#### [ Transformers v4.17 with PyTorch v1.10.2 and before ]

For the supported version of PyTorch v1.10.2 and before, SageMaker Training Compiler requires an alternate mechanism for launching a distributed training job. To run distributed training, SageMaker Training Compiler requires you to pass a SageMaker AI distributed training launcher script to the `entry_point` argument, and pass your training script to the `hyperparameters` argument. The following code example shows how to configure a SageMaker AI Hugging Face estimator applying the required changes.

```
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit to GPU memory without compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit to GPU memory with compiler
batch_size=26

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

training_script="your_training_script.py"

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate,
    "training_script": training_script     # Specify the file name of your training script.
}

pytorch_huggingface_estimator=HuggingFace(
    entry_point='distributed_training_launcher.py',    # Specify the distributed training launcher script.
    instance_count=instance_count,
    instance_type=instance_type,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_huggingface_estimator.fit()
```

The launcher script should look like the following. It wraps your training script and configures the distributed training environment based on the training instance type you choose.

```
#!/bin/python
# distributed_training_launcher.py

import subprocess
import sys

if __name__ == "__main__":
    arguments_command = " ".join(sys.argv[1:])
    """
    The following line takes care of setting up inter-node communication
    as well as managing intra-node workers for each GPU.
    """
    subprocess.check_call("python -m torch_xla.distributed.sm_dist " + arguments_command, shell=True)
```
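SageMaker AI passes each hyperparameter to the entry point as a `--key value` pair of command-line arguments, which is why the launcher can simply re-join `sys.argv[1:]` and forward everything to the wrapped training script. The following is a minimal, standalone illustration of that pass-through; the argument values are made up, and `build_launch_command` is a hypothetical helper, not part of any SDK.

```python
# Sketch of the argument forwarding done by the launcher script above.
# SageMaker AI passes each hyperparameter as a "--key value" pair, so the
# launcher only needs to re-join them into the command it executes.

def build_launch_command(argv):
    """Rebuild the command the launcher would run (illustration only)."""
    return "python -m torch_xla.distributed.sm_dist " + " ".join(argv)

# Hypothetical contents of sys.argv[1:] for the estimator configured above
args = [
    "--n_gpus", "4",
    "--batch_size", "26",
    "--learning_rate", "0.000325",
    "--training_script", "your_training_script.py",
]
print(build_launch_command(args))
```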

**Tip**  
To prepare your training script, see the following pages.  
[For distributed training](training-compiler-pytorch-models.md#training-compiler-pytorch-models-transformers-trainer-distributed) of a PyTorch model using Hugging Face Transformers' [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html)
[For distributed training](training-compiler-pytorch-models.md#training-compiler-pytorch-models-non-trainer-distributed) of a PyTorch model without Hugging Face Transformers' [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html)

**Tip**  
To find end-to-end examples, see the following notebooks:  
[Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2 Dataset for Single-Node Multi-GPU Training](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_single_node/language-modeling-multi-gpu-single-node.html)
[Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2 Dataset for Multi-Node Multi-GPU Training](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/language-modeling-multi-gpu-multi-node.html)

------

The following list is the minimal set of parameters required to run a SageMaker training job with the compiler.

**Note**  
When using the SageMaker AI Hugging Face estimator, you must specify the `transformers_version`, `pytorch_version`, `hyperparameters`, and `compiler_config` parameters to enable SageMaker Training Compiler. You cannot use `image_uri` to manually specify the Training Compiler integrated Deep Learning Containers that are listed at [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).
+ `entry_point` (str) – Required. Specify the file name of your training script.
**Note**  
To run a distributed training with SageMaker Training Compiler and PyTorch v1.10.2 and before, specify the file name of a launcher script to this parameter. The launcher script should be prepared to wrap your training script and configure the distributed training environment. For more information, see the following example notebooks:  
[Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2 Dataset for Single-Node Multi-GPU Training](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_single_node/language-modeling-multi-gpu-single-node.html)
[Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2 Dataset for Multi-Node Multi-GPU Training](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/language-modeling-multi-gpu-multi-node.html)
+ `source_dir` (str) – Optional. Add this if you need to install additional packages. To install packages, prepare a `requirements.txt` file in this directory.
+ `instance_count` (int) – Required. Specify the number of instances.
+ `instance_type` (str) – Required. Specify the instance type.
+ `transformers_version` (str) – Required only when using the SageMaker AI Hugging Face estimator. Specify the Hugging Face Transformers library version supported by SageMaker Training Compiler. To find available versions, see [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).
+ `framework_version` or `pytorch_version` (str) – Required. Specify the PyTorch version supported by SageMaker Training Compiler. To find available versions, see [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).
**Note**  
When using the SageMaker AI Hugging Face estimator, you must specify both `transformers_version` and `pytorch_version`.
+ `hyperparameters` (dict) – Optional. Specify hyperparameters for the training job, such as `n_gpus`, `batch_size`, and `learning_rate`. When you enable SageMaker Training Compiler, try larger batch sizes and adjust the learning rate accordingly. To find case studies of using the compiler and adjusted batch sizes to improve training speed, see [Tested Models](training-compiler-support.md#training-compiler-tested-models) and [SageMaker Training Compiler Example Notebooks and Blogs](training-compiler-examples-and-blogs.md).
**Note**  
To run a distributed training with SageMaker Training Compiler and PyTorch v1.10.2 and before, you need to add an additional parameter, `"training_script"`, to specify your training script, as shown in the preceding code example.
+ `compiler_config` (TrainingCompilerConfig object) – Required to activate SageMaker Training Compiler. Include this parameter to turn on SageMaker Training Compiler. The following are parameters for the `TrainingCompilerConfig` class.
  + `enabled` (bool) – Optional. Specify `True` or `False` to turn on or turn off SageMaker Training Compiler. The default value is `True`.
  + `debug` (bool) – Optional. To receive more detailed training logs from your compiler-accelerated training jobs, change it to `True`. However, the additional logging might add overhead and slow down the compiled training job. The default value is `False`.
+ `distribution` (dict) – Optional. To run a distributed training job with SageMaker Training Compiler, add `distribution = { 'pytorchxla' : { 'enabled': True }}`.

**Warning**  
If you turn on SageMaker Debugger, it might impact the performance of SageMaker Training Compiler. We recommend that you turn off Debugger when running SageMaker Training Compiler to make sure there's no impact on performance. For more information, see [Considerations](training-compiler-tips-pitfalls.md#training-compiler-tips-pitfalls-considerations). To turn the Debugger functionalities off, add the following two arguments to the estimator:  

```
disable_profiler=True,
debugger_hook_config=False
```

If the training job with the compiler is launched successfully, you receive the following logs during the job initialization phase: 
+ With `TrainingCompilerConfig(debug=False)`

  ```
  Found configuration for Training Compiler
  Configuring SM Training Compiler...
  ```
+ With `TrainingCompilerConfig(debug=True)`

  ```
  Found configuration for Training Compiler
  Configuring SM Training Compiler...
  Training Compiler set to debug mode
  ```
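All of the preceding examples adjust the learning rate linearly with the effective global batch size (per-device batch size × GPUs per instance × instance count). A small helper capturing that rule follows; the function name is illustrative and not part of the SageMaker AI Python SDK, and linear scaling is only a starting point that you should validate for your model.

```python
def scale_learning_rate(lr_native, batch_size_native, batch_size,
                        num_gpus=1, instance_count=1):
    """Linearly scale the learning rate with the effective global batch size."""
    return lr_native / batch_size_native * batch_size * num_gpus * instance_count

# Single GPU, batch size 12 -> 64, as in the single-GPU examples
single_gpu_lr = scale_learning_rate(5e-5, 12, 64)

# 4 GPUs on one instance, batch size 16 -> 26, as in the distributed examples
distributed_lr = scale_learning_rate(5e-5, 16, 26, num_gpus=4, instance_count=1)
```

After any batch-size change, confirm that the model still converges as expected before relying on the scaled value.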

## Using the SageMaker AI `CreateTrainingJob` API Operation
<a name="training-compiler-enable-pytorch-api"></a>

SageMaker Training Compiler configuration options must be specified through the `AlgorithmSpecification` and `HyperParameters` fields in the request syntax for the [`CreateTrainingJob` API operation](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html).

```
"AlgorithmSpecification": {
    "TrainingImage": "<sagemaker-training-compiler-enabled-dlc-image>"
},

"HyperParameters": {
    "sagemaker_training_compiler_enabled": "true",
    "sagemaker_training_compiler_debug_mode": "false",
    "sagemaker_pytorch_xla_multi_worker_enabled": "false"    // set to "true" for distributed training
}
```
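The same fields can be assembled programmatically before calling the API with the AWS SDK. The following sketch only builds the request fragment shown above; the image URI is a placeholder, and a real `create_training_job` call additionally requires fields such as `TrainingJobName`, `RoleArn`, `ResourceConfig`, `StoppingCondition`, and `OutputDataConfig`.

```python
# Assemble the Training Compiler-related request fields for CreateTrainingJob.
# Note that every hyperparameter value must be a string, including booleans.
request_fragment = {
    "AlgorithmSpecification": {
        # A Training Compiler-enabled DLC image URI from Supported Frameworks
        "TrainingImage": "<sagemaker-training-compiler-enabled-dlc-image>",
        "TrainingInputMode": "File",
    },
    "HyperParameters": {
        "sagemaker_training_compiler_enabled": "true",
        "sagemaker_training_compiler_debug_mode": "false",
        # Set to "true" for distributed training
        "sagemaker_pytorch_xla_multi_worker_enabled": "false",
    },
}

# With boto3, this fragment would be merged into the full request:
# import boto3
# boto3.client("sagemaker").create_training_job(
#     TrainingJobName="my-compiled-job", **request_fragment, ...)
```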

To find a complete list of deep learning container image URIs that have SageMaker Training Compiler implemented, see [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).

# Run TensorFlow Training Jobs with SageMaker Training Compiler
<a name="training-compiler-enable-tensorflow"></a>

You can use any of the SageMaker AI interfaces to run a training job with SageMaker Training Compiler: Amazon SageMaker Studio Classic, Amazon SageMaker notebook instances, AWS SDK for Python (Boto3), and AWS Command Line Interface.

**Topics**
+ [Using the SageMaker Python SDK](#training-compiler-enable-tensorflow-pysdk)
+ [Using the SageMaker AI Python SDK and Extending SageMaker AI Framework Deep Learning Containers](#training-compiler-enable-tensorflow-sdk-extend-container)
+ [Enable SageMaker Training Compiler Using the SageMaker AI `CreateTrainingJob` API Operation](#training-compiler-enable-tensorflow-api)

## Using the SageMaker Python SDK
<a name="training-compiler-enable-tensorflow-pysdk"></a>

To turn on SageMaker Training Compiler, add the `compiler_config` parameter to the SageMaker AI TensorFlow or Hugging Face estimator. Import the `TrainingCompilerConfig` class and pass an instance of it to the `compiler_config` parameter. The following code examples show the structure of the SageMaker AI estimator classes with SageMaker Training Compiler turned on.

**Tip**  
To get started with prebuilt models provided by the TensorFlow and Transformers libraries, try using the batch sizes provided in the reference table at [Tested Models](training-compiler-support.md#training-compiler-tested-models).

**Note**  
SageMaker Training Compiler for TensorFlow is available through the SageMaker AI [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) and [Hugging Face](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-estimator) framework estimators.

For information that fits your use case, see one of the following options.

### For single GPU training
<a name="training-compiler-estimator-tensorflow-single"></a>

------
#### [ TensorFlow ]

```
from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64    

# update the global learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_estimator=TensorFlow(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.9.1',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_estimator.fit()
```

To prepare your training script, see the following pages.
+ [For single GPU training](training-compiler-tensorflow.md#training-compiler-tensorflow-models-keras-single-gpu) of a model constructed using TensorFlow Keras (`tf.keras.*`).
+ [For single GPU training](training-compiler-tensorflow.md#training-compiler-tensorflow-models-no-keras-single-gpu) of a model constructed using TensorFlow modules (`tf.*` excluding the TensorFlow Keras modules).

------
#### [ Hugging Face Estimator with TensorFlow ]

```
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update the global learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_huggingface_estimator=HuggingFace(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    transformers_version='4.21.1',
    tensorflow_version='2.6.3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_huggingface_estimator.fit()
```

To prepare your training script, see the following pages.
+ [For single GPU training](training-compiler-tensorflow.md#training-compiler-tensorflow-models-transformers-keras-single-gpu) of a TensorFlow Keras model with Hugging Face Transformers
+ [For single GPU training](training-compiler-tensorflow.md#training-compiler-tensorflow-models-transformers-no-keras-single-gpu) of a TensorFlow model with Hugging Face Transformers

------

### For distributed training
<a name="training-compiler-estimator-tensorflow-distributed"></a>

------
#### [ Hugging Face Estimator with TensorFlow ]

```
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit to GPU memory without compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit to GPU memory with compiler
batch_size=26

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_huggingface_estimator=HuggingFace(
    entry_point='train.py',
    instance_count=instance_count,
    instance_type=instance_type,
    transformers_version='4.21.1',
    tensorflow_version='2.6.3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_huggingface_estimator.fit()
```

**Tip**  
To prepare your training script, see the following pages.  
[For distributed training](training-compiler-tensorflow.md#training-compiler-tensorflow-models-transformers-keras-distributed) of a TensorFlow Keras model with Hugging Face Transformers
[For distributed training](training-compiler-tensorflow.md#training-compiler-tensorflow-models-transformers-no-keras-distributed) of a TensorFlow model with Hugging Face Transformers

------

The following list is the minimal set of parameters required to run a SageMaker training job with the compiler.

**Note**  
When using the SageMaker AI Hugging Face estimator, you must specify the `transformers_version`, `tensorflow_version`, `hyperparameters`, and `compiler_config` parameters to enable SageMaker Training Compiler. You cannot use `image_uri` to manually specify the Training Compiler integrated Deep Learning Containers that are listed at [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).
+ `entry_point` (str) – Required. Specify the file name of your training script.
+ `instance_count` (int) – Required. Specify the number of instances.
+ `instance_type` (str) – Required. Specify the instance type.
+ `transformers_version` (str) – Required only when using the SageMaker AI Hugging Face estimator. Specify the Hugging Face Transformers library version supported by SageMaker Training Compiler. To find available versions, see [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).
+ `framework_version` or `tensorflow_version` (str) – Required. Specify the TensorFlow version supported by SageMaker Training Compiler. To find available versions, see [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).
**Note**  
When using the SageMaker AI TensorFlow estimator, you must specify `framework_version`.  
When using the SageMaker AI Hugging Face estimator, you must specify both `transformers_version` and `tensorflow_version`.
+ `hyperparameters` (dict) – Optional. Specify hyperparameters for the training job, such as `n_gpus`, `batch_size`, and `learning_rate`. When you enable SageMaker Training Compiler, try larger batch sizes and adjust the learning rate accordingly. To find case studies of using the compiler and adjusted batch sizes to improve training speed, see [Tested Models](training-compiler-support.md#training-compiler-tested-models) and [SageMaker Training Compiler Example Notebooks and Blogs](training-compiler-examples-and-blogs.md).
+ `compiler_config` (TrainingCompilerConfig object) – Required. Include this parameter to turn on SageMaker Training Compiler. The following are parameters for the `TrainingCompilerConfig` class.
  + `enabled` (bool) – Optional. Specify `True` or `False` to turn on or turn off SageMaker Training Compiler. The default value is `True`.
  + `debug` (bool) – Optional. To receive more detailed training logs from your compiler-accelerated training jobs, change it to `True`. However, the additional logging might add overhead and slow down the compiled training job. The default value is `False`.

**Warning**  
If you turn on SageMaker Debugger, it might impact the performance of SageMaker Training Compiler. We recommend that you turn off Debugger when running SageMaker Training Compiler to make sure there's no impact on performance. For more information, see [Considerations](training-compiler-tips-pitfalls.md#training-compiler-tips-pitfalls-considerations). To turn the Debugger functionalities off, add the following two arguments to the estimator:  

```
disable_profiler=True,
debugger_hook_config=False
```

If the training job with the compiler is launched successfully, you receive the following logs during the job initialization phase: 
+ With `TrainingCompilerConfig(debug=False)`

  ```
  Found configuration for Training Compiler
  Configuring SM Training Compiler...
  ```
+ With `TrainingCompilerConfig(debug=True)`

  ```
  Found configuration for Training Compiler
  Configuring SM Training Compiler...
  Training Compiler set to debug mode
  ```

## Using the SageMaker AI Python SDK and Extending SageMaker AI Framework Deep Learning Containers
<a name="training-compiler-enable-tensorflow-sdk-extend-container"></a>

AWS Deep Learning Containers (DLC) for TensorFlow use adapted versions of TensorFlow that include changes on top of the open-source framework. The [SageMaker AI Framework Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) are optimized for the underlying AWS infrastructure and Amazon SageMaker AI. On top of the DLCs, the SageMaker Training Compiler integration adds further performance improvements over native TensorFlow. Furthermore, you can create a custom training container by extending the DLC image.

**Note**  
This Docker customization feature is currently available only for TensorFlow.

To extend and customize the SageMaker AI TensorFlow DLCs for your use-case, use the following instructions.

### Create a Dockerfile
<a name="training-compiler-enable-tensorflow-sdk-extend-container-create-dockerfile"></a>

Use the following Dockerfile template to extend the SageMaker AI TensorFlow DLC. You must use the SageMaker AI TensorFlow DLC image as the base image of your Docker container. To find the SageMaker AI TensorFlow DLC image URIs, see [Supported Frameworks](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html#training-compiler-supported-frameworks).

```
# SageMaker AI TensorFlow Deep Learning Container image
FROM 763104351884.dkr.ecr.<aws-region>.amazonaws.com/tensorflow-training:<image-tag>

ENV PATH="/opt/ml/code:${PATH}"

# This environment variable is used by the SageMaker AI container 
# to determine user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# Add more code lines to customize for your use-case
...
```

For more information, see [Step 2: Create and upload the Dockerfile and Python training scripts](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html#byoc-training-step2).

Consider the following pitfalls when extending SageMaker AI Framework DLCs:
+ Do not explicitly uninstall or change the version of TensorFlow packages in SageMaker AI containers. Doing so causes the AWS optimized TensorFlow packages to be overwritten by open-source TensorFlow packages, which might result in performance degradation.
+ Watch out for packages that have a particular TensorFlow version or flavor as a dependency. These packages might implicitly uninstall the AWS optimized TensorFlow and install open-source TensorFlow packages.

For example, there’s a known issue where the [tensorflow/models](https://github.com/tensorflow/models) and [tensorflow/text](https://github.com/tensorflow/text) libraries always attempt to [reinstall open-source TensorFlow](https://github.com/tensorflow/models/issues/9267). If you need to install these libraries to choose a specific version for your use case, we recommend that you look into the SageMaker AI TensorFlow DLC Dockerfiles for v2.9 or later. The paths to the Dockerfiles are typically in the following format: `tensorflow/training/docker/<tensorflow-version>/py3/<cuda-version>/Dockerfile.gpu`. In the Dockerfiles, you should find the code lines that reinstall the AWS-managed TensorFlow binary (specified by the `TF_URL` environment variable) and other dependencies, in order. The reinstallation section should look like the following example:

```
# tf-models does not respect existing installations of TensorFlow 
# and always installs open source TensorFlow

RUN pip3 install --no-cache-dir -U \
    tf-models-official==x.y.z

RUN pip3 uninstall -y tensorflow tensorflow-gpu \
  ; pip3 install --no-cache-dir -U \
    ${TF_URL} \
    tensorflow-io==x.y.z \
    tensorflow-datasets==x.y.z
```

### Build and push to ECR
<a name="training-compiler-enable-tensorflow-sdk-extend-container-build-and-push"></a>

To build and push your Docker container to Amazon ECR, follow the instructions in the following links:
+ [Step 3: Build the container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html#byoc-training-step3)
+ [Step 4: Test the container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html#byoc-training-step4)
+ [Step 5: Push the container to Amazon ECR](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html#byoc-training-step5)

### Run using the SageMaker Python SDK Estimator
<a name="training-compiler-enable-tensorflow-sdk-extend-container-run-job"></a>

Use the SageMaker AI TensorFlow framework estimator as usual. You must specify `image_uri` to use the new container you hosted in Amazon ECR.

```
import sagemaker, boto3
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'tf-custom-container-test'
tag = ':latest'

region = boto3.session.Session().region_name

uri_suffix = 'amazonaws.com'

byoc_image_uri = '{}.dkr.ecr.{}.{}/{}'.format(
    account_id, region, uri_suffix, ecr_repository + tag
)

byoc_image_uri
# This should return something like
# 111122223333.dkr.ecr.us-east-2.amazonaws.com/tf-custom-container-test:latest

estimator = TensorFlow(
    image_uri=byoc_image_uri,
    role=get_execution_role(),
    base_job_name='tf-custom-container-test-job',
    instance_count=1,
    instance_type='ml.p3.8xlarge',
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

# Start training
estimator.fit()
```

## Enable SageMaker Training Compiler Using the SageMaker AI `CreateTrainingJob` API Operation
<a name="training-compiler-enable-tensorflow-api"></a>

SageMaker Training Compiler configuration options must be specified through the `AlgorithmSpecification` and `HyperParameters` fields in the request syntax for the [`CreateTrainingJob` API operation](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html).

```
"AlgorithmSpecification": {
    "TrainingImage": "<sagemaker-training-compiler-enabled-dlc-image>"
},

"HyperParameters": {
    "sagemaker_training_compiler_enabled": "true",
    "sagemaker_training_compiler_debug_mode": "false"
}
```

To find a complete list of deep learning container image URIs that have SageMaker Training Compiler implemented, see [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).

# SageMaker Training Compiler Example Notebooks and Blogs
<a name="training-compiler-examples-and-blogs"></a>


The following blogs, case studies, and notebooks provide examples of how to implement SageMaker Training Compiler.

Example notebooks are provided in the [SageMaker AI examples GitHub repository](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-training-compiler), and you can also browse them on the [SageMaker AI examples website](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/index.html).

## Blogs and Case Studies
<a name="training-compiler-blogs"></a>

The following blogs discuss case studies about using SageMaker Training Compiler.
+ [New – Introducing SageMaker Training Compiler](https://aws.amazon.com/blogs/aws/new-introducing-sagemaker-training-compiler/)
+ [Hugging Face Transformers BERT fine-tuning using Amazon SageMaker Training Compiler](https://www.philschmid.de/huggingface-amazon-sagemaker-training-compiler)
+ [Speed up Hugging Face Training Jobs on AWS by Up to 50% with SageMaker Training Compiler](https://towardsdatascience.com/speed-up-hugging-face-training-jobs-on-aws-by-up-to-50-with-sagemaker-training-compiler-9ad2ac5b0eb)

## Example Notebooks
<a name="training-compiler-example-notebooks"></a>

To find examples of using SageMaker Training Compiler, see the [Training Compiler page](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/index.html) in the *Amazon SageMaker AI Example Read the Docs website*.

# SageMaker Training Compiler Best Practices and Considerations
<a name="training-compiler-tips-pitfalls"></a>


Review the following best practices and considerations when using SageMaker Training Compiler.

## Best Practices
<a name="training-compiler-tips-pitfalls-best-practices"></a>

Use the following guidelines to achieve the best results when you run training jobs with SageMaker Training Compiler.

**General Best Practices**
+ Make sure that you use one of the [Supported Instance Types](training-compiler-support.md#training-compiler-supported-instance-types) and [Tested Models](training-compiler-support.md#training-compiler-tested-models). 
+ When you create a tokenizer for an NLP model using the Hugging Face Transformers library in your training script, make sure that you use a static input tensor shape by specifying `padding='max_length'`. Do not use `padding='longest'` because padding to the longest sequence in the batch can change the tensor shape for each training batch. The dynamic input shape can initiate recompilation of the model and might increase total training time. For more information about padding options of the Transformers tokenizers, see [Padding and truncation](https://huggingface.co/docs/transformers/pad_truncation) in the *Hugging Face Transformers documentation*.
+ Measure GPU memory utilization to make sure that you use the maximum batch size that can fit into the GPU memory. Amazon SageMaker Training Compiler reduces the memory footprint of your model during training, which typically allows you to fit a larger `batch_size` in the GPU memory. Using a larger `batch_size` results in a better GPU utilization and reduces the total training time. 

  When you adjust the batch size, you also have to adjust the `learning_rate` appropriately. For example, if you increase the batch size by a factor of `k`, scale the `learning_rate` either linearly (multiply by `k`) or by the square root of `k`. This helps achieve the same or similar convergence behavior within the reduced training time. For `batch_size` values tested for popular models, see [Tested Models](training-compiler-support.md#training-compiler-tested-models).
+ To debug the compiler-accelerated training job, enable the `debug` flag in the `compiler_config` parameter. This enables SageMaker AI to put the debugging logs into SageMaker training job logs.

  ```
  huggingface_estimator=HuggingFace(
      ...
      compiler_config=TrainingCompilerConfig(debug=True)
  )
  ```

  Note that enabling full debugging of the training job with the compiler might add some overhead.
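
The learning-rate adjustment described above can be sketched as follows. The helper function is illustrative only and not part of the SageMaker Python SDK:

```python
import math

def scale_learning_rate(base_lr, k, rule="linear"):
    """Scale a learning rate after increasing the batch size by a factor of k."""
    if rule == "linear":
        return base_lr * k             # linear scaling rule
    if rule == "sqrt":
        return base_lr * math.sqrt(k)  # square-root scaling rule
    raise ValueError(f"unknown rule: {rule}")

# example: batch size grows 4x, e.g. from 8 to 32
print(scale_learning_rate(5e-5, 4, "linear"))  # 0.0002
print(scale_learning_rate(5e-5, 4, "sqrt"))   # 0.0001
```

In practice, which rule converges better depends on the model and optimizer, so validate the retuned values against your baseline run.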

**Best Practices for PyTorch**
+ If you bring a PyTorch model and want to checkpoint it, make sure you use PyTorch/XLA's model save function to properly checkpoint your model. For more information about the function, see [https://pytorch.org/xla/release/1.9/index.html#torch_xla.core.xla_model.save](https://pytorch.org/xla/release/1.9/index.html#torch_xla.core.xla_model.save) in the *PyTorch on XLA Devices documentation*. 

  To learn how to add the modifications to your PyTorch script, see [Large Language Models Using PyTorch Directly (without the Hugging Face Transformers Trainer API)](training-compiler-pytorch-models.md#training-compiler-pytorch-models-non-trainer).

  For more information about the actual application of using the model save function, see [Checkpoint Writing and Loading](https://huggingface.co/blog/pytorch-xla#checkpoint-writing-and-loading) in the *Hugging Face on PyTorch/XLA TPUs: Faster and cheaper training blog*.
+ To achieve the most optimal training time for distributed training, consider the following.
  + Use instances with multiple GPUs instead of single-GPU instances. For example, a single `ml.p3dn.24xlarge` instance (8 GPUs) trains faster than 8 x `ml.p3.2xlarge` instances (1 GPU each).
  + Use instances with EFA support such as `ml.p3dn.24xlarge` and `ml.p4d.24xlarge`. These instance types have accelerated networking speed and reduce training time.
  + Tune the `preprocessing_num_workers` parameter for datasets, so that model training is not delayed by slow preprocessing.

## Considerations
<a name="training-compiler-tips-pitfalls-considerations"></a>

Consider the following when using SageMaker Training Compiler.

### Performance degradation due to logging, checkpointing, and profiling
<a name="training-compiler-considerations-performance-degradation"></a>
+ Avoid logging, checkpointing, and profiling model tensors that lead to explicit evaluations. To understand what an explicit evaluation is, consider the following code compiling example.

  ```
  a = b+c
  e = a+d
  ```

  A compiler interprets the code as follows and reduces the memory footprint for the variable `a`:

  ```
  e = b+c+d
  ```

  Now consider the following case in which the code is changed to add a print function for the variable `a`.

  ```
  a = b+c
  e = a+d
  print(a)
  ```

  The compiler makes an explicit evaluation of the variable `a` as follows.

  ```
  e = b+c+d
  a = b+c    # Explicit evaluation
  print(a)
  ```

  In PyTorch, for example, avoid using [torch.Tensor.item()](https://pytorch.org/docs/stable/generated/torch.Tensor.item.html), which might introduce explicit evaluations. In deep learning, such explicit evaluations can cause overhead because they break fused operations in a compilation graph of a model and lead to recomputation of the tensors.

  If you still want to periodically evaluate the model during training while using SageMaker Training Compiler, we recommend logging and checkpointing at a lower frequency to reduce overhead due to explicit evaluations. For example, log every 10 epochs instead of every epoch.
+ Graph compilation runs during the first few steps of training. As a result, the first few steps are expected to be exceptionally slow. However, this is a one-time compilation cost and can be amortized by training for a longer duration because compilation makes future steps much faster. The initial compilation overhead depends on the size of the model, the size of the input tensors, and the distribution of input tensor shapes.
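
The explicit-evaluation overhead described above can be illustrated with a minimal deferred-execution sketch. This is a toy stand-in for a lazy compiler graph, not `torch_xla` code; the `Lazy` class is hypothetical:

```python
ops_run = 0  # counts how many recorded operations are actually executed

class Lazy:
    """Toy lazy-evaluation node: records an expression, computes only on force()."""
    def __init__(self, value=None, fn=None, deps=()):
        self.value, self.fn, self.deps = value, fn, deps

    def __add__(self, other):
        return Lazy(fn=lambda x, y: x + y, deps=(self, other))

    def force(self):
        global ops_run
        if self.fn is None:          # leaf node holds a concrete value
            return self.value
        ops_run += 1                 # executing one recorded operation
        return self.fn(*(dep.force() for dep in self.deps))

b, c, d = Lazy(1), Lazy(2), Lazy(3)
a = b + c          # recorded, not executed
e = a + d          # recorded, not executed

print(e.force())   # 6 -- evaluates the whole graph in one pass
print(a.force())   # 3 -- an explicit evaluation like print(a) forces extra work
```

As in the compiled-training case, evaluating only the final result lets intermediate work stay fused, while explicitly materializing an intermediate value triggers recomputation.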

### Incorrect use of the PyTorch/XLA APIs when using PyTorch directly
<a name="training-compiler-considerations-incorrect-api-use"></a>

PyTorch/XLA defines a set of APIs to replace some of the existing PyTorch training APIs. Failing to use them properly causes PyTorch training to fail.
+ One of the most typical errors when compiling a PyTorch model is due to a wrong device type for operators and tensors. To properly compile a PyTorch model, make sure you use XLA devices ([https://pytorch.org/xla/release/1.9/index.html](https://pytorch.org/xla/release/1.9/index.html)) instead of using CUDA or mixing CUDA devices and XLA devices.
+ `mark_step()` is a barrier just for XLA. Failing to set it correctly causes a training job to stall.
+ PyTorch/XLA provides additional distributed training APIs. Failing to program the APIs properly causes gradients to be collected incorrectly, which causes a training convergence failure.

To properly set up your PyTorch script and avoid the aforementioned incorrect API uses, see [Large Language Models Using PyTorch Directly (without the Hugging Face Transformers Trainer API)](training-compiler-pytorch-models.md#training-compiler-pytorch-models-non-trainer).

# SageMaker Training Compiler FAQ
<a name="training-compiler-faq"></a>


Use the following FAQ items to find answers to commonly asked questions about SageMaker Training Compiler.

**Q. How do I know SageMaker Training Compiler is working?**

If you successfully launched your training job with SageMaker Training Compiler, you receive the following log messages:
+ With `TrainingCompilerConfig(debug=False)`

  ```
  Found configuration for Training Compiler
  Configuring SM Training Compiler...
  ```
+ With `TrainingCompilerConfig(debug=True)`

  ```
  Found configuration for Training Compiler
  Configuring SM Training Compiler...
  Training Compiler set to debug mode
  ```

**Q. Which models does SageMaker Training Compiler accelerate?**

SageMaker Training Compiler supports the most popular deep learning models from the Hugging Face transformers library. Because the compiler supports most of the operators these models use, they can be trained faster with SageMaker Training Compiler. Compilable models include, but are not limited to, the following: `bert-base-cased`, `bert-base-chinese`, `bert-base-uncased`, `distilbert-base-uncased`, `distilbert-base-uncased-finetuned-sst-2-english`, `gpt2`, `roberta-base`, `roberta-large`, `t5-base`, and `xlm-roberta-base`. The compiler works with most DL operators and data structures and can accelerate many other DL models beyond those that have been tested.

**Q. What happens if I enable SageMaker Training Compiler with a model that isn't tested?**

For an untested model, you might need to first modify the training script to be compatible with SageMaker Training Compiler. For more information, see [Bring Your Own Deep Learning Model](training-compiler-modify-scripts.md) and follow the instructions on how to prepare your training script.

Once you have updated your training script, you can start the training job. The compiler proceeds to compile the model. However, training speed may not increase and might even decrease relative to the baseline with an untested model. You might need to retune training parameters such as `batch_size` and `learning_rate` to achieve any speedup benefits.

If compilation of the untested model fails, the compiler returns an error. See [SageMaker Training Compiler Troubleshooting](training-compiler-troubleshooting.md) for detailed information about the failure types and error messages.

**Q. Will I always get a faster training job with SageMaker Training Compiler?**

No, not necessarily. First, SageMaker Training Compiler adds some compilation overhead before the ongoing training process can be accelerated. The optimized training job must run sufficiently long to amortize and make up for this incremental compilation overhead at the beginning of the training job.
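
The amortization argument can be made concrete with a back-of-the-envelope sketch. All numbers below are illustrative, not measured:

```python
def break_even_steps(compile_overhead_s, baseline_step_s, compiled_step_s):
    """Steps after which the one-time compilation cost is paid back by faster steps."""
    saving_per_step = baseline_step_s - compiled_step_s
    if saving_per_step <= 0:
        return None  # compiled steps are not faster; the overhead never amortizes
    return compile_overhead_s / saving_per_step

# hypothetical numbers: 300 s compile cost, per-step time drops from 1.0 s to 0.5 s
print(break_even_steps(300, 1.0, 0.5))  # 600.0 -- jobs shorter than this see no net gain
```

Jobs that run well past the break-even point benefit; very short jobs may not.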

Additionally, as with any model training process, training with suboptimal parameters can increase training time. SageMaker Training Compiler can change the characteristics of the training job by, for example, changing the memory footprint of the job. Because of these differences, you might need to retune your training job parameters to speed up training. A reference table specifying the best performing parameters for training jobs with different instance types and models can be found at [Tested Models](training-compiler-support.md#training-compiler-tested-models).

Finally, some code in a training script might add additional overhead or disrupt the compiled computation graph and slow training. If working with a customized or untested model, see the instructions at [Best Practices to Use SageMaker Training Compiler with PyTorch/XLA](training-compiler-pytorch-models.md#training-compiler-pytorch-models-best-practices).

**Q. Can I always use a larger batch size with SageMaker Training Compiler?**

Batch size increases in most, but not all, cases. The optimizations made by SageMaker Training Compiler can change the characteristics of your training job, such as the memory footprint. Typically, a Training Compiler job occupies less memory than an uncompiled training job with the native framework, which allows for a larger batch size during training. A larger batch size, and a corresponding adjustment to the learning rate, increases training throughput and can decrease total training time.

However, there could be cases where SageMaker Training Compiler might actually increase memory footprint based on its optimization scheme. The compiler uses an analytical cost model to predict the execution schedule with the lowest cost of execution for any compute-intensive operator. This model could find an optimal schedule that increases memory use. In this case, you won’t be able to increase batch sizes, but your sample throughput is still higher.

**Q. Does SageMaker Training Compiler work with other SageMaker training features, such as the SageMaker AI distributed training libraries and SageMaker Debugger?**

SageMaker Training Compiler is currently not compatible with SageMaker AI’s distributed training libraries.

SageMaker Training Compiler is compatible with SageMaker Debugger, but Debugger might degrade computational performance by adding overhead.

**Q. Does SageMaker Training Compiler support custom containers (bring your own container)?**

SageMaker Training Compiler is provided through AWS Deep Learning Containers, and you can extend a subset of the containers to customize for your use-case. Containers that are extended from AWS DLCs are supported by SageMaker Training Compiler. For more information, see [Supported Frameworks](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html#training-compiler-supported-frameworks) and [Using the SageMaker AI Python SDK and Extending SageMaker AI Framework Deep Learning Containers](training-compiler-enable-tensorflow.md#training-compiler-enable-tensorflow-sdk-extend-container). If you need further support, reach out to the SageMaker AI team through [AWS Support](https://console.aws.amazon.com/support/) or [AWS Developer Forums for Amazon SageMaker AI](https://forums.aws.amazon.com/forum.jspa?forumID=285).

# SageMaker Training Compiler Troubleshooting
<a name="training-compiler-troubleshooting"></a>


If you run into an error, you can use the following list to try to troubleshoot your training job. If you need further support, reach out to the SageMaker AI team through [AWS Support](https://console.aws.amazon.com/support/) or [AWS Developer Forums for Amazon SageMaker AI](https://forums.aws.amazon.com/forum.jspa?forumID=285).

## Training job is not converging as expected when compared to the native framework training job
<a name="training-compiler-troubleshooting-convergence-issue"></a>

Convergence issues range from “the model is not learning when SageMaker Training Compiler is turned on” to “the model is learning but slower than the native framework”. In this troubleshooting guide, we assume your convergence is fine without SageMaker Training Compiler (in the native framework) and consider this the baseline.

When faced with such convergence issues, the first step is to identify if the issue is limited to distributed training or stems from single-GPU training. Distributed training with SageMaker Training Compiler is an extension of single-GPU training with additional steps.

1. Set up a cluster with multiple instances or GPUs.

1. Distribute input data to all workers.

1. Synchronize the model updates from all workers.

Therefore, any convergence issue in single-GPU training propagates to distributed training with multiple workers.

![\[A flow chart to troubleshoot convergence issues in training jobs when using SageMaker Training Compiler.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/training-compiler-troubleshooting-convergence-flow.jpg)


### Convergence issues occurring in single-GPU training
<a name="training-compiler-troubleshooting-convergence-issue-single-gpu"></a>

If your convergence issue stems from single-GPU training, this is likely due to improper settings for hyperparameters or the `torch_xla` APIs.

**Check the hyperparameters**

Training with SageMaker Training Compiler changes the memory footprint of a model. The compiler intelligently arbitrates between re-use and re-computation, which leads to a corresponding increase or decrease in memory consumption. To leverage this, it is essential to re-tune the batch size and associated hyperparameters when migrating a training job to SageMaker Training Compiler. However, incorrect hyperparameter settings often cause oscillation in training loss and possibly slower convergence as a result. In rare cases, aggressive hyperparameters might result in the model not learning (the training loss metric doesn’t decrease or returns `NaN`). To identify whether the convergence issue is due to the hyperparameters, run a side-by-side test of two training jobs, with and without SageMaker Training Compiler, while keeping all the hyperparameters the same.

**Check if the `torch_xla` APIs are properly set up for single-GPU training**

If the convergence issue persists with the baseline hyperparameters, you need to check whether there’s any improper usage of the `torch_xla` APIs, specifically the ones for updating the model. Fundamentally, `torch_xla` continues to accumulate instructions (deferring execution) in the form of a graph until it is explicitly instructed to run the accumulated graph. The `torch_xla.core.xla_model.mark_step()` function facilitates the execution of the accumulated graph. The graph execution should be synchronized using this function ***after each model update*** and ***before printing and logging any variables***. Without this synchronization step, the model might use stale values from memory during prints, logs, and subsequent forward passes, instead of the most recent values, which must be synchronized after every iteration and model update.

It can be more complicated when using SageMaker Training Compiler with gradient scaling (possibly from the use of AMP) or gradient clipping techniques. The appropriate order of gradient computation with AMP is as follows.

1. Gradient computation with scaling

1. Gradient un-scaling, gradient clipping, and then scaling

1. Model update

1. Synchronizing the graph execution with `mark_step()`

To find the right APIs for the operations mentioned in the list, see the guide for [migrating your training script to SageMaker Training Compiler](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-pytorch-models.html).
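
As pseudocode, one training iteration that follows this order might look like the following sketch. This is illustrative only: `scaler` is a gradient scaler and `xm` stands for `torch_xla.core.xla_model`, following PyTorch/XLA and AMP conventions rather than verbatim SageMaker AI code:

```
# 1. Gradient computation with scaling
scaler.scale(loss).backward()

# 2. Un-scale and clip; the scaler re-applies scaling checks on step
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# 3. Model update
scaler.step(optimizer)
scaler.update()

# 4. Synchronize the accumulated graph
xm.mark_step()
```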

**Consider using Automatic Model Tuning**

If the convergence issue arises when re-tuning the batch size and associated hyperparameters such as the learning rate while using SageMaker Training Compiler, consider using [Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) to tune your hyperparameters. You can refer to the [example notebook on tuning hyperparameters with SageMaker Training Compiler](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-training-compiler/tensorflow/single_gpu_single_node/hyper-parameter-tuning.ipynb). 

### Convergence issues occurring in distributed training
<a name="training-compiler-troubleshooting-convergence-issue-distributed-training"></a>

If your convergence issue persists in distributed training, this is likely due to improper settings for weight initialization or the `torch_xla` APIs. 

**Check weight initialization across the workers**

If the convergence issue arises when running a distributed training job with multiple workers, ensure there is a uniform deterministic behavior across all workers by setting a constant seed where applicable. Beware of techniques such as weight initialization, which involves randomization. Each worker might end up training a different model in the absence of a constant seed.
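
A minimal sketch of deterministic initialization under a constant seed follows. It uses Python's `random` module for illustration; a real training script would seed the framework's RNGs instead, for example with `torch.manual_seed`:

```python
import random

def init_weights(seed, n=4):
    """Deterministically initialize n weights from a constant seed."""
    rng = random.Random(seed)  # per-worker RNG created from the same shared seed
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

# with a shared constant seed, every worker starts from identical weights
worker_0 = init_weights(seed=42)
worker_1 = init_weights(seed=42)
assert worker_0 == worker_1
```

Without the shared seed, each worker would draw different initial weights and could effectively train a different model.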

**Check if the `torch_xla` APIs are properly set up for distributed training**

If the issue still persists, this is likely due to improper use of the `torch_xla` APIs for distributed training. Make sure that you add the following in your estimator to set up a cluster for distributed training with SageMaker Training Compiler.

```
distribution={'torchxla': {'enabled': True}}
```

This should be accompanied by a function `_mp_fn(index)` in your training script, which is invoked once per worker. Without the `_mp_fn(index)` function, each worker might end up training the model independently without sharing model updates. 

Next, make sure that you use the `torch_xla.distributed.parallel_loader.MpDeviceLoader` API along with a distributed data sampler such as the following, as guided in the documentation about [migrating your training script to SageMaker Training Compiler](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-pytorch-models.html).

```
torch.utils.data.distributed.DistributedSampler()
```

This ensures that the input data is properly distributed across all workers.

Finally, to synchronize model updates from all workers, use `torch_xla.core.xla_model._fetch_gradients` to gather gradients from all workers and `torch_xla.core.xla_model.all_reduce` to combine all the gathered gradients into a single update. 

It can be more complicated when using SageMaker Training Compiler with gradient scaling (possibly from use of AMP) or gradient clipping techniques. The appropriate order of gradient computation with AMP is as follows.

1. Gradient computation with scaling

1. Gradient synchronization across all workers

1. Gradient un-scaling, gradient clipping, and then gradient scaling

1. Model update

1. Synchronizing the graph execution with `mark_step()`

Note that this checklist has an additional item for synchronizing all workers, compared to the checklist for single-GPU training.

## Training job fails due to missing PyTorch/XLA configuration
<a name="training-compiler-troubleshooting-missing-xla-config"></a>

If a training job fails with the `Missing XLA configuration` error message, it might be due to a misconfiguration in the number of GPUs per instance that you use.

XLA requires additional environment variables to compile the training job. The most common missing environment variable is `GPU_NUM_DEVICES`. For the compiler to work properly, you must set this environment variable equal to the number of GPUs per instance.

There are three approaches to set the `GPU_NUM_DEVICES` environment variable:
+ **Approach 1** – Use the `environment` argument of the SageMaker AI estimator class. For example, if you use an `ml.p3.8xlarge` instance that has four GPUs, do the following:

  ```
  # Using the SageMaker Python SDK's HuggingFace estimator
  
  hf_estimator=HuggingFace(
      ...
      instance_type="ml.p3.8xlarge",
      hyperparameters={...},
      environment={
          ...
          "GPU_NUM_DEVICES": "4" # corresponds to number of GPUs on the specified instance
      },
  )
  ```
+ **Approach 2** – Use the `hyperparameters` argument of the SageMaker AI estimator class and parse it in your training script.

  1. To specify the number of GPUs, add a key-value pair to the `hyperparameters` argument.

     For example, if you use an `ml.p3.8xlarge` instance that has four GPUs, do the following:

     ```
     # Using the SageMaker Python SDK's HuggingFace estimator
     
     hf_estimator=HuggingFace(
         ...
         entry_point="train.py",
         instance_type="ml.p3.8xlarge",
         hyperparameters = {
             ...
             "n_gpus": 4 # corresponds to number of GPUs on specified instance
         }
     )
     hf_estimator.fit()
     ```

  1. In your training script, parse the `n_gpus` hyperparameter and specify it as an input for the `GPU_NUM_DEVICES` environment variable.

     ```
     # train.py
     import os, argparse
     
     if __name__ == "__main__":
         parser = argparse.ArgumentParser()
         ...
         # Data, model, and output directories
         parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
         parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
         parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
         parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])
         parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])
     
         args, _ = parser.parse_known_args()
     
         os.environ["GPU_NUM_DEVICES"] = args.n_gpus
     ```
+ **Approach 3** – Hard-code the `GPU_NUM_DEVICES` environment variable in your training script. For example, add the following to your script if you use an instance that has four GPUs.

  ```
  # train.py
  
  import os
  os.environ["GPU_NUM_DEVICES"] = "4"  # environment variable values must be strings
  ```
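
Note that values assigned to `os.environ` must be strings, regardless of which approach you use. A quick standalone illustration:

```python
import os

os.environ["GPU_NUM_DEVICES"] = str(4)   # os.environ only accepts string values
print(os.environ["GPU_NUM_DEVICES"])     # prints 4

# assigning a non-string raises TypeError
try:
    os.environ["GPU_NUM_DEVICES"] = 4
except TypeError:
    print("environment variable values must be strings")
```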

**Tip**  
To find the number of GPU devices on machine learning instances that you want to use, see [Accelerated Computing](https://aws.amazon.com/ec2/instance-types/#Accelerated_Computing) in the *Amazon EC2 Instance Types page*. 

## SageMaker Training Compiler doesn't reduce the total training time
<a name="training-compiler-troubleshooting-no-improved-training-time"></a>

If the total training time does not decrease with SageMaker Training Compiler, we highly recommend you to go over the [SageMaker Training Compiler Best Practices and Considerations](training-compiler-tips-pitfalls.md) page to check your training configuration, padding strategy for the input tensor shape, and hyperparameters. 

# Amazon SageMaker Training Compiler Release Notes
<a name="training-compiler-release-notes"></a>


See the following release notes to track the latest updates for Amazon SageMaker Training Compiler.

## SageMaker Training Compiler Release Notes: February 13, 2023
<a name="training-compiler-release-notes-20230213"></a>

**Currency Updates**
+ Added support for PyTorch v1.13.1

**Bug Fixes**
+ Fixed a race condition on GPU that caused `NaN` loss in some models, such as vision transformer (ViT) models.

**Other Changes**
+ SageMaker Training Compiler improves performance by letting PyTorch/XLA automatically override the optimizers (such as SGD, Adam, AdamW) in `torch.optim` or `transformers.optimization` with their sync-free versions in `torch_xla.amp.syncfree` (such as `torch_xla.amp.syncfree.SGD`, `torch_xla.amp.syncfree.Adam`, `torch_xla.amp.syncfree.AdamW`). You don't need to change the lines of code where you define optimizers in your training script.

**Migration to AWS Deep Learning Containers**

This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:
+ PyTorch v1.13.1

  ```
  763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-trcomp-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
  ```

  To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).

## SageMaker Training Compiler Release Notes: January 9, 2023
<a name="training-compiler-release-notes-20230109"></a>

**Breaking Changes**
+ `tf.keras.optimizers.Optimizer` points to a new optimizer in TensorFlow 2.11.0 and later. The old optimizers are moved to `tf.keras.optimizers.legacy`. You might encounter a job failure due to this breaking change when you do the following. 
  + Load checkpoints from an old optimizer. We recommend that you switch to the legacy optimizers.
  + Use TensorFlow v1. We recommend that you migrate to TensorFlow v2, or switch to the legacy optimizers if you need to continue using TensorFlow v1.

  For a more detailed list of breaking changes from the optimizer changes, see the [official TensorFlow v2.11.0 release notes](https://github.com/tensorflow/tensorflow/releases/tag/v2.11.0) in the TensorFlow GitHub repository.
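
  For example, a training script that depends on the pre-2.11 optimizer behavior can switch to the legacy namespace with a change like the following fragment (assuming TensorFlow v2.11.0 or later; the learning rate is a placeholder):

  ```
  from tensorflow.keras.optimizers.legacy import Adam  # old-style optimizer

  optimizer = Adam(learning_rate=1e-4)
  ```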

**Migration to AWS Deep Learning Containers**

This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:
+ TensorFlow v2.11.0

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.11.0-gpu-py39-cu112-ubuntu20.04-sagemaker
  ```

  To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).

## SageMaker Training Compiler Release Notes: December 8, 2022
<a name="training-compiler-release-notes-20221208"></a>

**Bug Fixes**
+ Fixed the seed for PyTorch training jobs starting with PyTorch v1.12 to ensure that there is no discrepancy in model initialization across different processes. See also [PyTorch Reproducibility](https://pytorch.org/docs/stable/notes/randomness.html).
+ Fixed the issue that caused PyTorch distributed training jobs on G4dn and G5 instances not to default to communication over [PCIe](https://en.wikipedia.org/wiki/PCI_Express).

**Known Issues**
+ Improper use of PyTorch/XLA APIs in Hugging Face’s vision transformers might lead to convergence issues.

**Other Changes**
+ When using the Hugging Face Transformers `Trainer` class, make sure that you use SyncFree optimizers by setting the `optim` argument to `adamw_torch_xla`. For more information, see [Large Language Models Using the Hugging Face Transformers `Trainer` Class](training-compiler-pytorch-models.md#training-compiler-pytorch-models-transformers-trainer). See also [Optimizer](https://huggingface.co/docs/transformers/v4.23.1/en/perf_train_gpu_one#optimizer) in the *Hugging Face Transformers documentation*.
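  As a minimal sketch of this setting (the output directory below is a placeholder), selecting the SyncFree optimizer through the `Trainer` API might look like the following:

  ```python
  from transformers import TrainingArguments

  # Setting optim="adamw_torch_xla" selects the SyncFree AdamW optimizer,
  # which avoids host-device synchronization during PyTorch/XLA training.
  training_args = TrainingArguments(
      output_dir="/opt/ml/model",  # placeholder path
      optim="adamw_torch_xla",
  )
  ```

  You would then pass `training_args` to the `Trainer` class as usual.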

**Migration to AWS Deep Learning Containers**

This release passed benchmark testing and has been migrated to the following AWS Deep Learning Container:
+ PyTorch v1.12.0

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-trcomp-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker
  ```

  To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).

## SageMaker Training Compiler Release Notes: October 4, 2022
<a name="training-compiler-release-notes-20221004"></a>

**Currency Updates**
+ Added support for TensorFlow v2.10.0.

**Other Changes**
+ Added Hugging Face NLP models using the Transformers library to TensorFlow framework tests. To find the tested Transformer models, see [Tested Models](training-compiler-support.md#training-compiler-tested-models).

**Migration to AWS Deep Learning Containers**

This release passed benchmark testing and has been migrated to the following AWS Deep Learning Container:
+ TensorFlow v2.10.0

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.10.0-gpu-py39-cu112-ubuntu20.04-sagemaker
  ```

  To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).

## SageMaker Training Compiler Release Notes: September 1, 2022
<a name="training-compiler-release-notes-20220825"></a>

**Currency Updates**
+ Added support for Hugging Face Transformers v4.21.1 with PyTorch v1.11.0.

**Improvements**
+ Implemented a new distributed training launcher mechanism to activate SageMaker Training Compiler for Hugging Face Transformer models with PyTorch. To learn more, see [Run PyTorch Training Jobs with SageMaker Training Compiler for Distributed Training](training-compiler-enable-pytorch.md#training-compiler-estimator-pytorch-distributed).
+ Integrated with Elastic Fabric Adapter (EFA) to improve collective communication in distributed training.
+ Added support for G5 instances for PyTorch training jobs. For more information, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).
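A configuration sketch of launching a distributed training job with the compiler through the SageMaker Python SDK might look like the following. The entry point, IAM role, and instance settings are placeholders; adjust them for your account and workload.

```python
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# Sketch of a Hugging Face estimator with the compiler enabled; all
# values below are placeholders, not a tested configuration.
estimator = HuggingFace(
    entry_point="train.py",          # placeholder training script
    role="<your-iam-role-arn>",      # placeholder IAM role
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    transformers_version="4.21.1",
    pytorch_version="1.11.0",
    py_version="py38",
    compiler_config=TrainingCompilerConfig(),
    # Activates the PyTorch/XLA-based distributed training launcher.
    distribution={"pytorchxla": {"enabled": True}},
)
# estimator.fit()  # starts the SageMaker training job
```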

**Migration to AWS Deep Learning Containers**

This release passed benchmark testing and has been migrated to the following AWS Deep Learning Container:
+ [HuggingFace v4.21.1 with PyTorch v1.11.0](https://github.com/aws/deep-learning-containers/releases/tag/v1.0-trcomp-hf-4.21.1-pt-1.11.0-tr-gpu-py38)

  ```
  763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-trcomp-training:1.11.0-transformers4.21.1-gpu-py38-cu113-ubuntu20.04
  ```

  To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).

## SageMaker Training Compiler Release Notes: June 14, 2022
<a name="training-compiler-release-notes-20220614"></a>

**New Features**
+ Added support for TensorFlow v2.9.1. SageMaker Training Compiler fully supports compiling TensorFlow modules (`tf.*`) and TensorFlow Keras modules (`tf.keras.*`).
+ Added support for custom containers created by extending AWS Deep Learning Containers for TensorFlow. For more information, see [Enable SageMaker Training Compiler Using the SageMaker Python SDK and Extending SageMaker AI Framework Deep Learning Containers](training-compiler-enable-tensorflow.md#training-compiler-enable-tensorflow-sdk-extend-container).
+ Added support for G5 instances for TensorFlow training jobs.

**Migration to AWS Deep Learning Containers**

This release passed benchmark testing and has been migrated to the following AWS Deep Learning Container:
+ TensorFlow 2.9.1

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.9.1-gpu-py39-cu112-ubuntu20.04-sagemaker
  ```

  To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).

## SageMaker Training Compiler Release Notes: April 26, 2022
<a name="training-compiler-release-notes-20220426"></a>

**Improvements**
+ Added support for all of the AWS Regions where [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) are in service except the China regions.

## SageMaker Training Compiler Release Notes: April 12, 2022
<a name="training-compiler-release-notes-20220412"></a>

**Currency Updates**
+ Added support for Hugging Face Transformers v4.17.0 with TensorFlow v2.6.3 and PyTorch v1.10.2.

## SageMaker Training Compiler Release Notes: February 21, 2022
<a name="training-compiler-release-notes-20220221"></a>

**Improvements**
+ Completed benchmark tests and confirmed training speed-ups on the `ml.g4dn` instance types. To find a complete list of tested `ml` instances, see [Supported Instance Types](training-compiler-support.md#training-compiler-supported-instance-types).

## SageMaker Training Compiler Release Notes: December 01, 2021
<a name="training-compiler-release-notes-20211201"></a>

**New Features**
+ Launched Amazon SageMaker Training Compiler at AWS re:Invent 2021.

**Migration to AWS Deep Learning Containers**
+ Amazon SageMaker Training Compiler passed benchmark testing and has been migrated to AWS Deep Learning Containers. To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).