

# Model training
<a name="train-model"></a>

The training stage of the full machine learning (ML) lifecycle spans from accessing your training dataset to generating a final model and selecting the best performing model for deployment. The following sections provide an overview of available SageMaker training features and resources with in-depth technical information for each.

## The basic architecture of SageMaker Training
<a name="train-model-simple-case"></a>

If you’re using SageMaker AI for the first time and want a quick ML solution for training a model on your dataset, consider using a no-code or low-code solution such as [SageMaker Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas.html), [JumpStart within SageMaker Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html), or [SageMaker Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html).

For intermediate coding experiences, consider using a [SageMaker Studio Classic notebook](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks.html) or [SageMaker Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html). To get started, follow the instructions at [Train a Model](ex1-train-model.md) in the SageMaker AI *Getting Started* guide. We recommend this option for use cases in which you create your own model and training script using an ML framework. 

The core of SageMaker AI jobs is the containerization of ML workloads and the capability of managing compute resources. The SageMaker Training platform takes care of the heavy lifting associated with setting up and managing infrastructure for ML training workloads. With SageMaker Training, you can focus on developing, training, and fine-tuning your model.

The following architecture diagram shows how SageMaker AI manages ML training jobs and provisions Amazon EC2 instances on behalf of SageMaker AI users. As a SageMaker AI user, you can bring your own training dataset and save it to Amazon S3. You can choose an ML algorithm from the available SageMaker AI built-in algorithms, or bring your own training script with a model built using a popular machine learning framework.

![\[How users provide data and choose algorithms and SageMaker AI provisions compute infrastructure.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-training.png)


## Full view of the SageMaker Training workflow and features
<a name="train-model-full-view"></a>

The full journey of ML training involves tasks beyond ingesting data into ML models, training models on compute instances, and obtaining model artifacts and outputs. You need to evaluate each phase before, during, and after training to make sure your model is trained well enough to meet the target accuracy for your objectives.

The following flow chart shows a high-level overview of your actions (in blue boxes) and available SageMaker Training features (in light blue boxes) throughout the training phase of the ML lifecycle.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/training-main.png)


The following sections walk you through each phase of training depicted in the previous flow chart and the useful features offered by SageMaker AI throughout the three sub-stages of ML training.

**Topics**
+ [Before training](#train-model-full-view-before-training)
+ [During training](#train-model-full-view-during-training)
+ [After training](#train-model-full-view-after-training)

### Before training
<a name="train-model-full-view-before-training"></a>

There are a number of scenarios for setting up data resources and access that you need to consider before training. Refer to the following diagram and the details of each before-training stage to get a sense of the decisions you need to make.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/training-before.png)

+ **Prepare data:** Before training, you must have finished data cleaning and feature engineering during the data preparation stage. SageMaker AI has several labeling and feature engineering tools to help you. See [Label Data](https://docs.aws.amazon.com/sagemaker/latest/dg/data-label.html), [Prepare and Analyze Datasets](https://docs.aws.amazon.com/sagemaker/latest/dg/data-prep.html), [Process Data](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html), and [Create, Store, and Share Features](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html) for more information. 
+ **Choose an algorithm or framework:** Depending on how much customization you need, there are different options for algorithms and frameworks.
  + If you prefer a low-code implementation of a pre-built algorithm, use one of the built-in algorithms offered by SageMaker AI. For more information, see [Choose an Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/algorithms-choose.html).
  + If you need more flexibility to customize your model, run your training script using your preferred frameworks and toolkits within SageMaker AI. For more information, see [ML Frameworks and Toolkits](https://docs.aws.amazon.com/sagemaker/latest/dg/frameworks.html).
  + To extend pre-built SageMaker AI Docker images as the base image of your own container, see [Use Pre-built SageMaker AI Docker images](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-prebuilt.html).
  + To bring your custom Docker container to SageMaker AI, see [Adapting your own Docker container to work with SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-adapt-your-own.html). You need to install the [sagemaker-training-toolkit](https://github.com/aws/sagemaker-training-toolkit) to your container.
+ **Manage data storage:** Understand the mapping between your data storage (such as Amazon S3, Amazon EFS, or Amazon FSx) and the training container that runs in the Amazon EC2 compute instance. SageMaker AI helps map the storage paths and local paths in the training container, or you can specify them manually. After mapping is done, consider using one of the three data input modes: File, Pipe, or FastFile. To learn how SageMaker AI maps storage paths, see [Training Storage Folders](https://docs.aws.amazon.com/sagemaker/latest/dg/model-train-storage.html).
+ **Set up access to training data:** Use Amazon SageMaker AI domain, a domain user profile, IAM, Amazon VPC, and AWS KMS to meet the requirements of the most security-sensitive organizations.
  + For account administration, see [Amazon SageMaker AI domain](https://docs.aws.amazon.com/sagemaker/latest/dg/sm-domain.html).
  + For a complete reference about IAM policies and security, see [Security in Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/security.html).
+ **Stream your input data:** SageMaker AI provides three data input modes: *File*, *Pipe*, and *FastFile*. The default input mode is File mode, which downloads the entire dataset while the training job initializes. To learn about general best practices for streaming data from your data storage to the training container, see [Access Training Data](https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html). 

  If you use [Pipe mode](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html), you can also consider using an augmented manifest file to stream your data directly from Amazon Simple Storage Service (Amazon S3) and train your model. Pipe mode reduces disk usage because the Amazon Elastic Block Store volume only needs to store your final model artifacts, rather than the full training dataset. For more information, see [Provide Dataset Metadata to Training Jobs with an Augmented Manifest File](https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html).
+ **Analyze your data for bias:** Before training, you can analyze your dataset and model for bias against a disfavored group so that you can check that your model learns an unbiased dataset using [SageMaker Clarify](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-detect-data-bias.html).
+ **Choose which SageMaker SDK to use:** There are two ways to launch a training job in SageMaker AI: using the high-level SageMaker AI Python SDK, or using the low-level SageMaker APIs for the SDK for Python (Boto3) or the AWS CLI. The SageMaker Python SDK abstracts the low-level SageMaker API to provide convenient tools. As mentioned in [The basic architecture of SageMaker Training](#train-model-simple-case), you can also pursue no-code or minimal-code options using [SageMaker Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas.html), [JumpStart within SageMaker Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html), or [SageMaker AI Autopilot](https://docs.aws.amazon.com/sagemaker/latest/dg/autopilot-automate-model-development.html).
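The two SDK levels can be made concrete with a sketch of the request that the low-level `CreateTrainingJob` API expects; the high-level SageMaker Python SDK's `Estimator` class assembles an equivalent request for you. The image URI, role ARN, and bucket name below are hypothetical placeholders, not values from this guide.

```python
# A minimal sketch of a low-level CreateTrainingJob request. With Boto3 you
# would pass this dictionary as keyword arguments to
# boto3.client("sagemaker").create_training_job(**request).

def build_training_job_request(job_name, image_uri, role_arn, bucket):
    """Assemble a Boto3-style create_training_job request dictionary."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",  # or "Pipe" / "FastFile"
        },
        "RoleArn": role_arn,
        "InputDataConfig": [
            {
                "ChannelName": "train",  # mounted at /opt/ml/input/data/train
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": f"s3://{bucket}/train/",
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/output/"},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

# Hypothetical account, image, role, and bucket names for illustration only.
request = build_training_job_request(
    "my-training-job",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",
    "arn:aws:iam::123456789012:role/MySageMakerRole",
    "my-training-bucket",
)
```

With the high-level SDK, the same job reduces to constructing a `sagemaker.estimator.Estimator` with matching arguments and calling its `fit()` method on your S3 input.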

### During training
<a name="train-model-full-view-during-training"></a>

During training, you need to continuously improve training stability, speed, and efficiency, while scaling compute resources, optimizing cost, and, most importantly, improving model performance. Read on for more information about the during-training stages and relevant SageMaker Training features.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/training-during.png)

+ **Set up infrastructure:** Choose the right instance type and infrastructure management tools for your use case. You can start from a small instance and scale up depending on your workload. For training a model on a tabular dataset, start with the smallest CPU instance of the C4 or C5 instance families. For training a large model for computer vision or natural language processing, start with the smallest GPU instance of the P2, P3, G4dn, or G5 instance families. You can also mix different instance types in a cluster, or keep instances in warm pools, using the following instance management tools offered by SageMaker AI. You can also use the persistent cache to reduce latency and billable time on iterative training jobs beyond the latency reduction that warm pools alone provide. To learn more, see the following topics.
  + [Running training jobs on a heterogeneous cluster](train-heterogeneous-cluster.md) 
  + [SageMaker AI Managed Warm Pools](train-warm-pools.md)
  + [Using persistent cache](train-warm-pools.md#train-warm-pools-persistent-cache)

  You must have sufficient quota to run a training job. If you run your training job on an instance type for which you have insufficient quota, you receive a `ResourceLimitExceeded` error. To check the quotas currently available in your account, use the [Service Quotas console](https://console.aws.amazon.com/servicequotas/home/services/sagemaker/quotas). To learn how to request a quota increase, see [Supported Regions and Quotas](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html). For pricing information and the instance types available in each AWS Region, see the tables on the [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) page.
+ **Run a training job from local code:** You can annotate your local code with a remote decorator to run it as a SageMaker training job from inside Amazon SageMaker Studio Classic, an Amazon SageMaker notebook, or your local integrated development environment. For more information, see [Run your local code as a SageMaker training job](train-remote-decorator.md).
+ **Track training jobs:** Monitor and track your training jobs using SageMaker Experiments, SageMaker Debugger, or Amazon CloudWatch. You can watch model performance in terms of accuracy and convergence, and run comparative analysis of metrics between multiple training jobs, by using SageMaker Experiments. You can watch the compute resource utilization rate by using SageMaker Debugger’s profiling tools or Amazon CloudWatch. To learn more, see the following topics.
  + [Manage Machine Learning with Amazon SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html)
  + [Profile Training Jobs Using Amazon SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profile-training-jobs.html)
  + [Monitor and Analyze Using CloudWatch Metrics ](https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html)

  Additionally, for deep learning tasks, use the [Amazon SageMaker Debugger model debugging tools](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-debug-training-jobs.html) and [built-in rules](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html) to identify more complex issues in model convergence and weight update processes.
+ **Distributed training:** If your training job has reached a stable stage without breaking due to misconfiguration of the training infrastructure or out-of-memory issues, you might want more options to scale your job and run it over an extended period of time, even for days or months. When you’re ready to scale up, consider distributed training. SageMaker AI provides various options for distributed computation, from light ML workloads to heavy deep learning workloads. 

  For deep learning tasks that involve training very large models on very large datasets, consider using one of the [SageMaker AI distributed training strategies](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html) to scale up and achieve data parallelism, model parallelism, or a combination of the two. You can also use [SageMaker Training Compiler](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html) for compiling and optimizing model graphs on GPU instances. These SageMaker AI features support deep learning frameworks such as PyTorch, TensorFlow, and Hugging Face Transformers.
+ **Model hyperparameter tuning:** Tune your model hyperparameters using [Automatic Model Tuning with SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html). SageMaker AI provides hyperparameter tuning methods such as grid search and Bayesian search, launches parallel tuning jobs, and applies early stopping to tuning jobs that are not improving.
+ **Checkpointing and cost saving with Spot instances:** If training time is not a big concern, you might consider optimizing model training costs with managed Spot instances. Note that you must activate checkpointing for Spot training so that jobs can resume after intermittent pauses caused by Spot instance replacement. You can also use the checkpointing functionality to back up your models in case of unexpected training job termination. To learn more, see the following topics.
  + [Managed Spot Training](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html) 
  + [Use Checkpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html) 
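The checkpoint-and-resume pattern used with Managed Spot Training can be sketched in a framework-agnostic way: the training script writes state to a local checkpoint directory (SageMaker syncs that directory, conventionally `/opt/ml/checkpoints`, with the S3 checkpoint location you configure) and, on startup, resumes from the latest checkpoint if one exists. The training loop below is a placeholder, not a real model.

```python
# Minimal checkpoint/resume sketch. A real training script would save and
# restore model weights and optimizer state instead of a small JSON file.
import json
import os

def save_checkpoint(checkpoint_dir, epoch, state):
    """Persist the latest training state to the local checkpoint directory."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    with open(os.path.join(checkpoint_dir, "checkpoint.json"), "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)

def load_checkpoint(checkpoint_dir):
    """Return (start_epoch, state), resuming if a checkpoint exists."""
    path = os.path.join(checkpoint_dir, "checkpoint.json")
    if os.path.exists(path):
        with open(path) as f:
            ckpt = json.load(f)
        return ckpt["epoch"] + 1, ckpt["state"]
    return 0, None  # fresh start

def train(checkpoint_dir, total_epochs=5):
    """Run (or resume) training; returns the epoch it started from."""
    start_epoch, state = load_checkpoint(checkpoint_dir)
    state = state or {"loss": None}
    for epoch in range(start_epoch, total_epochs):
        state["loss"] = 1.0 / (epoch + 1)  # placeholder for a training step
        save_checkpoint(checkpoint_dir, epoch, state)
    return start_epoch
```

If a Spot interruption ends the job partway through, restarting it with the same checkpoint location makes `train` continue from the epoch after the last saved one instead of from zero.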

### After training
<a name="train-model-full-view-after-training"></a>

After training, you obtain a final model artifact to use for model deployment and inference. There are additional actions involved in the after-training phase as shown in the following diagram.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/training-after.png)

+ **Obtain baseline model:** After you have the model artifact, you can set it as a baseline model. Consider the following post-training actions and using SageMaker AI features before moving on to model deployment to production.
+ **Examine model performance and check for bias:** Use Amazon CloudWatch Metrics and [SageMaker Clarify for post-training bias](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-detect-post-training-bias.html) to detect bias in incoming data and in your model over time, compared against the baseline. You need to evaluate new data and your model's predictions against that data regularly or in real time. Using these features, you can receive alerts about acute changes and anomalies, as well as gradual changes and drift, in your data and model.
+ You can also use the [Incremental Training](https://docs.aws.amazon.com/sagemaker/latest/dg/incremental-training.html) functionality of SageMaker AI to load and update your model (or fine-tune) with an expanded dataset.
+ You can register model training as a step in your [SageMaker AI Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html) or as part of other [Workflow](https://docs.aws.amazon.com/sagemaker/latest/dg/workflows.html) features offered by SageMaker AI in order to orchestrate the full ML lifecycle.

# Train a Model with Amazon SageMaker
<a name="how-it-works-training"></a>

Amazon SageMaker Training is a fully managed machine learning (ML) service offered by SageMaker that helps you efficiently train a wide range of ML models at scale. The core of SageMaker AI jobs is the containerization of ML workloads and the capability of managing AWS compute resources. The SageMaker Training platform takes care of the heavy lifting associated with setting up and managing infrastructure for ML training workloads. With SageMaker Training, you can focus on developing, training, and fine-tuning your model. This page introduces three recommended ways to get started with training a model on SageMaker, followed by additional options you can consider.

**Tip**  
For information about training foundation models for Generative AI, see [Use SageMaker JumpStart foundation models in Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-use-studio-updated.html).

## Choosing a feature within Amazon SageMaker Training
<a name="choose-a-feature-of-sagemaker-training"></a>

There are three main use cases for training ML models within SageMaker AI. This section describes those use cases, as well as the SageMaker AI features we recommend for each use case. 

Whether you are training complex deep learning models or implementing smaller machine learning algorithms, SageMaker Training provides streamlined and cost-effective solutions that meet the requirements of your use cases.

### Use cases
<a name="choose-use-cases-sagemaker-training"></a>

The following are the main use cases for training ML models within SageMaker AI.
+ **Use case 1**: Develop a machine learning model in a low-code or no-code environment.
+ **Use case 2**: Use code to develop machine learning models with more flexibility and control.
+ **Use case 3**: Develop machine learning models at scale with maximum flexibility and control.

### Recommended features
<a name="choose-recommended-features-of-sagemaker-training"></a>

The following table describes three common scenarios of training ML models and corresponding options to get started with SageMaker Training.


| Descriptor | Use case 1 | Use case 2 | Use case 3 | 
| --- | --- | --- | --- | 
| SageMaker AI feature | [Build a model using Amazon SageMaker Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-build-model.html). | Train a model using one of the [SageMaker AI built-in ML algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html) such as [XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html#xgboost-modes) or [Task-Specific Models by SageMaker JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-models.html) with the SageMaker Python SDK. | Train a model at scale with maximum flexibility leveraging [script mode](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-script-mode/sagemaker-script-mode.html) or [custom containers](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-adapt-your-own.html) in SageMaker AI. | 
| Description | Bring your data. SageMaker AI helps manage building ML models and setting up the training infrastructure and resources. |  Bring your data and choose one of the built-in ML algorithms provided by SageMaker AI. Set up the model hyperparameters, output metrics, and basic infrastructure settings using the SageMaker Python SDK. The SageMaker Training platform helps provision the training infrastructure and resources.  |  Develop your own ML code and bring it as a script or a set of scripts to SageMaker AI. To learn more, see [Distributed computing with SageMaker best practices](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training-options.html#distributed-training-options-2). Additionally, you can [bring your own Docker container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html#byoc-training-step2). The SageMaker Training platform helps provision the training infrastructure and resources at scale based on your custom settings.  | 
| Optimized for |  Low/no-code and UI-driven model development with quick experimentation with a training dataset. When you [build a custom model](canvas-build-model.md), an algorithm is automatically selected based on your data. For advanced customization options like algorithm selection, see [advanced model building configurations](canvas-advanced-settings.md).  |  Training ML models with high-level customization for hyperparameters, infrastructure settings, and the ability to directly use ML frameworks and entrypoint scripts for more flexibility. Use built-in algorithms, pre-trained models, and JumpStart models through the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) to develop ML models. For more information, see [Low-code deployment with the JumpStart class](https://sagemaker.readthedocs.io/en/stable/overview.html#low-code-deployment-with-the-jumpstartmodel-class).  |  ML training workloads at scale, requiring multiple instances and maximum flexibility. See [distributed computing with SageMaker best practices](distributed-training-options.md). SageMaker AI uses Docker images to host the training and serving of all models. You can use any SageMaker AI or external algorithms and [use Docker containers to build models](docker-containers.md).  | 
| Considerations |  Minimal flexibility to customize the model provided by Amazon SageMaker Canvas.  |  The SageMaker Python SDK provides a simplified interface and fewer configuration options compared to the low-level SageMaker Training API.   |  Requires knowledge of AWS infrastructure and distributed training options. See also [Create your own training container](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html) using the [SageMaker Training toolkit](https://docs.aws.amazon.com/sagemaker/latest/dg/amazon-sagemaker-toolkits.html).  | 
| Recommended environment | Use [Amazon SageMaker Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html#canvas-prerequisites). To learn how to set it up, see [Getting started with using SageMaker Canvas](https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html). | Use [SageMaker AI JupyterLab](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl.html) within [Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-launch.html). To learn how to set it up, see [Launch Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-launch.html). | Use [SageMaker JupyterLab](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl.html) within [Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-launch.html). To learn how to set it up, see [Launch Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-launch.html). | 

## Additional options
<a name="choose-additional-options-for-sagemaker-training"></a>

SageMaker AI offers the following additional options for training ML models.

**SageMaker AI features offering training capabilities**
+ **[SageMaker JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html)**: SageMaker JumpStart provides access to the SageMaker AI public model hub that contains the latest publicly available and proprietary foundation models (FMs). You can fine-tune, evaluate, and deploy these models within Amazon SageMaker Studio. SageMaker JumpStart streamlines the process of leveraging foundation models for your generative AI use-cases and allows you to create private model hubs to use foundation models while enforcing governance guardrails and ensuring that your organization can only access approved models. To get started with SageMaker JumpStart, see [SageMaker JumpStart Foundation Models](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models.html).
+ **[SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod.html)**: SageMaker HyperPod is a persistent cluster service for use cases that need resilient clusters for massive machine learning (ML) workloads and developing state-of-the-art foundation models (FMs). It accelerates development of such models by removing the undifferentiated heavy lifting involved in building and maintaining large-scale compute clusters powered by thousands of accelerators such as AWS Trainium or NVIDIA A100 and H100 Graphics Processing Units (GPUs). You can use workload manager software such as Slurm on HyperPod.

**More features of SageMaker Training**
+ **[Hyperparameter Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html)**: This SageMaker AI feature helps you define a set of hyperparameters for a model and launch many training jobs on a dataset. Model training performance varies depending on the hyperparameter values. This feature finds the best-performing set of hyperparameters within the hyperparameter ranges you set it to search through.
+ **[Distributed training](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html)**: Pre-train or fine-tune FMs built with PyTorch, NVIDIA CUDA, and other PyTorch-based frameworks. To efficiently utilize GPU instances, use the SageMaker AI distributed training libraries that offer collective communication operations and various model parallelism techniques, such as expert parallelism and sharded data parallelism, that are optimized for AWS infrastructure.
+ **Observability features**: Use the profiling and debugging functionalities of SageMaker Training to gain insights into model training workloads, model performance, and resource utilization. To learn more, see [Debug and improve model performance](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debug-and-improve-model-performance.html) and [Profile and optimize computational performance](https://docs.aws.amazon.com/sagemaker/latest/dg/train-profile-computational-performance.html).
+ **Cost-saving and efficient instance options**: To optimize compute cost and efficiency for training instance provisioning, use [Heterogeneous Cluster](https://docs.aws.amazon.com/sagemaker/latest/dg/train-heterogeneous-cluster.html), [Managed Spot instances](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html), or [Managed Warm Pools](https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html).
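Conceptually, a hyperparameter tuning job samples candidate values from the ranges you declare, runs a training job per candidate, and keeps the best result by the objective metric. The toy random search below illustrates that idea locally; the range names are hypothetical, the objective function is a stand-in for a real training job's reported metric, and a real tuner launches SageMaker training jobs instead of calling a local function.

```python
# Toy illustration of what a tuner automates: sample from declared ranges,
# evaluate an objective per candidate, and keep the best-performing set.
import random

def random_search(ranges, objective, num_trials=20, seed=0):
    """Minimize `objective` over uniformly sampled hyperparameter sets."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(num_trials):
        params = {
            name: rng.uniform(low, high) for name, (low, high) in ranges.items()
        }
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical continuous ranges, analogous to the parameter ranges you
# would declare for a tuning job.
ranges = {"learning_rate": (0.001, 0.1), "dropout": (0.0, 0.5)}
# Stand-in for a validation metric reported by each training job.
objective = lambda p: (p["learning_rate"] - 0.01) ** 2 + p["dropout"]
best, score = random_search(ranges, objective)
```

Bayesian search improves on this sketch by using earlier trial results to choose the next candidates, and early stopping halts trials that are clearly not improving.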

# Types of Algorithms
<a name="algorithms-choose"></a>

Machine learning can help you accomplish empirical tasks that require some sort of inductive inference. These tasks involve induction because they use data to train algorithms to make generalizable inferences. This means that the algorithms can make statistically reliable predictions or decisions, or complete other tasks, when applied to new data that was not used to train them. 

To help you select the best algorithm for your task, we classify these tasks on various levels of abstraction. At the highest level of abstraction, machine learning attempts to find patterns or relationships between features or less structured items, such as text in a data set. Pattern recognition techniques can be classified into distinct machine learning paradigms, each of which addresses specific problem types. There are currently three basic paradigms for machine learning used to address various problem types: 
+ [Supervised learning](#algorithms-choose-supervised-learning)
+ [Unsupervised learning](#algorithms-choose-unsupervised-learning)
+ [Reinforcement learning](#algorithms-choose-reinforcement-learning)

The types of problems that each learning paradigm can address are identified by considering the inferences (or predictions, decisions, or other tasks) you want to make from the type of data that you have or could collect. Machine learning paradigms use algorithmic methods to address their various problem types. The algorithms provide recipes for solving these problems. 

However, many algorithms, such as neural networks, can be deployed with different learning paradigms and on different types of problems. Multiple algorithms can also address a specific problem type. Some algorithms are more generally applicable and others are quite specific for certain kinds of objectives and data. So the mapping between machine learning algorithms and problem types is many-to-many. Also, there are various implementation options available for algorithms. 

The following sections provide guidance concerning implementation options, machine learning paradigms, and algorithms appropriate for different problem types.

**Topics**
+ [Choose an algorithm implementation](#algorithms-choose-implementation)
+ [Problem types for the basic machine learning paradigms](#basic-machine-learning-paradigms)
+ [Built-in algorithms and pretrained models in Amazon SageMaker](algos.md)
+ [Use Reinforcement Learning with Amazon SageMaker AI](reinforcement-learning.md)

## Choose an algorithm implementation
<a name="algorithms-choose-implementation"></a>

After choosing an algorithm, you must decide which implementation of it you want to use. Amazon SageMaker AI supports four implementation options that require increasing levels of effort. 
+ **Pre-trained models** require the least effort and are models ready to deploy or to fine-tune and deploy using SageMaker JumpStart.
+ **Built-in algorithms** require more effort, but they scale well when the data set is large and significant resources are needed to train and deploy the model.
+ If there is no built-in solution that works, try to develop one that uses **pre-made images for machine learning and deep learning frameworks** for supported frameworks such as Scikit-Learn, TensorFlow, PyTorch, MXNet, or Chainer.
+ If you need to run custom packages or use any code that isn’t part of a supported framework or available via PyPI, then you need to build **your own custom Docker image** that is configured to install the necessary packages or software. The custom image must also be pushed to an online repository such as Amazon Elastic Container Registry (Amazon ECR).
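As an example of the conventions the framework images and training toolkit establish, a script-mode entry point typically resolves its input and output locations from `SM_*` environment variables that the container injects (such as `SM_CHANNEL_TRAIN`, `SM_MODEL_DIR`, and `SM_HPS`). The sketch below shows that resolution with local fallbacks so it can run outside a container; it is a minimal illustration, not a complete training script.

```python
# Minimal sketch of how a script-mode entry point resolves its paths and
# hyperparameters from SageMaker-style environment variables.
import json
import os

def parse_sm_environment(environ=os.environ):
    """Resolve data/model paths and hyperparameters from SM_* variables."""
    return {
        # Each input channel is mounted under /opt/ml/input/data/<channel>.
        "train_dir": environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
        # Artifacts written here are uploaded to the S3 output path.
        "model_dir": environ.get("SM_MODEL_DIR", "/opt/ml/model"),
        # SM_HPS carries the job's hyperparameters as a JSON object.
        "hyperparameters": json.loads(environ.get("SM_HPS", "{}")),
    }

# Simulated container environment for illustration; inside a real training
# container these variables are set for you.
config = parse_sm_environment(
    {"SM_CHANNEL_TRAIN": "/tmp/data", "SM_HPS": '{"epochs": "3"}'}
)
```

A real entry point would then load data from `config["train_dir"]`, train, and save artifacts to `config["model_dir"]` so SageMaker can upload them.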

**Topics**
+ [Use a built-in algorithm](#built-in-algorithms-benefits)
+ [Use script mode in a supported framework](#supported-frameworks-benefits)
+ [Use a custom Docker image](#custom-image-use-case)

Algorithm implementation guidance


| Implementation | Requires code | Pre-coded algorithms | Support for third party packages | Support for custom code | Level of effort | 
| --- | --- | --- | --- | --- | --- | 
| Built-in | No | Yes | No | No | Low | 
| Scikit-learn | Yes | Yes | PyPi only | Yes | Medium | 
| Spark ML | Yes | Yes | PyPi only | Yes | Medium | 
| XGBoost (open source) | Yes | Yes | PyPi only | Yes | Medium | 
| TensorFlow | Yes | No | PyPi only | Yes | Medium-high | 
| PyTorch | Yes | No | PyPi only | Yes | Medium-high | 
| MXNet | Yes | No | PyPi only | Yes | Medium-high | 
| Chainer | Yes | No | PyPi only | Yes | Medium-high | 
| Custom image | Yes | No | Yes, from any source | Yes | High | 

### Use a built-in algorithm
<a name="built-in-algorithms-benefits"></a>

When choosing an algorithm for your type of problem and data, the easiest option is to use one of Amazon SageMaker AI's built-in algorithms. These built-in algorithms come with two major benefits.
+ The built-in algorithms require no coding to start running experiments. The only inputs you need to provide are the data, hyperparameters, and compute resources. This allows you to run experiments more quickly, with less overhead for tracking results and code changes.
+ The built-in algorithms come with parallelization across multiple compute instances and GPU support right out of the box for all applicable algorithms (some algorithms may not be included due to inherent limitations). If you have a lot of data with which to train your model, most built-in algorithms can easily scale to meet the demand. Even if you already have a pre-trained model, it may still be easier to use its counterpart in SageMaker AI and input the hyperparameters you already know than to port it over using script mode on a supported framework.

For more information on the built-in algorithms provided by SageMaker AI, see [Built-in algorithms and pretrained models in Amazon SageMaker](algos.md).

For important information about docker registry paths, data formats, recommended EC2 instance types, and CloudWatch logs common to all of the built-in algorithms provided by SageMaker AI, see [Parameters for Built-in Algorithms](common-info-all-im-models.md).
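As a sketch of what a built-in algorithm training job involves at the API level, the following assembles the request that would be passed to the low-level `CreateTrainingJob` operation (here for XGBoost). The image URI, role ARN, and bucket name are placeholders, and the `build_training_request` helper is illustrative, not part of any SDK.

```python
# Sketch of a CreateTrainingJob request for a built-in algorithm
# (XGBoost shown). The image URI, role ARN, and S3 paths are
# placeholders -- look up the real registry path for your Region.
def build_training_request(job_name, image_uri, role_arn, bucket):
    """Assemble the request dict passed to boto3's create_training_job."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "HyperParameters": {           # no training code, just settings
            "num_round": "100",
            "objective": "reg:squarederror",
        },
        "InputDataConfig": [{
            "ChannelName": "train",
            "ContentType": "text/csv",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": f"s3://{bucket}/train/",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": f"s3://{bucket}/output/"},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }

request = build_training_request(
    "xgboost-demo", "<xgboost-image-uri>", "<execution-role-arn>", "my-bucket")
```

In practice you would pass this dict to `boto3`'s `create_training_job`, or use the higher-level SageMaker Python SDK `Estimator` class, which assembles an equivalent request for you.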

### Use script mode in a supported framework
<a name="supported-frameworks-benefits"></a>

If the algorithm you want to use for your model is not supported by a built-in choice and you are comfortable coding your own solution, then you should consider using an Amazon SageMaker AI supported framework. This is referred to as "script mode" because you write your custom code (script) in a text file with a `.py` extension. As the table above indicates, SageMaker AI supports most of the popular machine learning frameworks. These framework images come preloaded with the corresponding framework and some additional Python packages, such as Pandas and NumPy, so you can write your own code for training an algorithm. These frameworks also allow you to install any Python package hosted on PyPi by including a `requirements.txt` file with your training code or to include your own code directories. R is also supported natively in SageMaker notebook kernels. Some frameworks, like scikit-learn and Spark ML, have pre-coded algorithms you can use easily, while other frameworks like TensorFlow and PyTorch may require you to implement the algorithm yourself. The only limitation when using a supported framework image is that you cannot import any software packages that are not hosted on PyPi or that are not already included with the framework’s image.

For more information on the frameworks supported by SageMaker AI, see [Machine Learning Frameworks and Languages](frameworks.md).
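As an illustrative sketch (not an official template), a script-mode entry point typically reads hyperparameters from the command line and the data and output locations from the `SM_*` environment variables that the training container sets:

```python
# Minimal script-mode entry point (train.py). SageMaker AI passes
# hyperparameters as command-line flags and exposes channel and output
# paths through SM_* environment variables inside the container.
import argparse
import os

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as CLI flags.
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.01)
    # SageMaker-provided locations, with container-default fallbacks.
    parser.add_argument(
        "--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument(
        "--train",
        default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"))
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    # ...load data from args.train, fit the model, then save artifacts
    # under args.model_dir so SageMaker AI uploads them to Amazon S3.
```

The same file works locally for debugging, because every SageMaker-specific value has a fallback default.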

### Use a custom Docker image
<a name="custom-image-use-case"></a>

Amazon SageMaker AI's built-in algorithms and supported frameworks should cover most use cases, but there are times when you may need to use an algorithm from a package not included in any of the supported frameworks. You might also have a pre-trained model picked or persisted somewhere which you need to deploy. SageMaker AI uses Docker images to host the training and serving of all models, so you can supply your own custom Docker image if the package or software you need is not included in a supported framework. This may be your own Python package or an algorithm coded in a language like Stan or Julia. For these images you must also configure the training of the algorithm and serving of the model properly in your Dockerfile. This requires intermediate knowledge of Docker and is not recommended unless you are comfortable writing your own machine learning algorithm. Your Docker image must be uploaded to an online repository, such as the Amazon Elastic Container Registry (ECR) before you can train and serve your model properly.

 For more information on custom Docker images in SageMaker AI, see [Docker containers for training and deploying models](docker-containers.md).

## Problem types for the basic machine learning paradigms
<a name="basic-machine-learning-paradigms"></a>

The following three sections describe the main problem types addressed by the three basic paradigms for machine learning. For a list of the built-in algorithms that SageMaker AI provides to address these problem types, see [Built-in algorithms and pretrained models in Amazon SageMaker](algos.md).

**Topics**
+ [Supervised learning](#algorithms-choose-supervised-learning)
+ [Unsupervised learning](#algorithms-choose-unsupervised-learning)
+ [Reinforcement learning](#algorithms-choose-reinforcement-learning)

### Supervised learning
<a name="algorithms-choose-supervised-learning"></a>

If your data set consists of features or attributes (inputs) that contain target values (outputs), then you have a supervised learning problem. If your target values are categorical (mathematically discrete), then you have a **classification problem**. It is a standard practice to distinguish binary from multiclass classification. 
+ **Binary classification** is a type of supervised learning that assigns an individual to one of two predefined and mutually exclusive classes based on the individual's attributes. It is supervised because the models are trained using examples in which the attributes are provided with correctly labeled objects. A medical diagnosis for whether an individual has a disease or not based on the results of diagnostic tests is an example of binary classification.
+ **Multiclass classification** is a type of supervised learning that assigns an individual to one of several classes based on the individual's attributes. It is supervised because the models are trained using examples in which the attributes are provided with correctly labeled objects. An example is the prediction of the topic most relevant to a text document. A document may be classified as being about religion, politics, or finance, or as about one of several other predefined topic classes.

If the target values you are trying to predict are mathematically continuous, then you have a **regression** problem. Regression estimates the values of a dependent target variable based on one or more other variables or attributes that are correlated with it. An example is the prediction of house prices using features like the number of bathrooms and bedrooms and the square footage of the house and garden. Regression analysis can create a model that takes one or more of these features as an input and predicts the price of a house.
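To make the regression idea concrete, here is a minimal sketch with made-up house data, fitting a one-feature linear model in closed form (SageMaker AI's Linear Learner generalizes this to many features):

```python
# Toy regression: fit price = slope * square_feet + intercept by
# ordinary least squares (closed form for a single feature).
# All data values below are made up for illustration.
sqft  = [1000, 1500, 2000, 2500, 3000]
price = [200, 290, 410, 490, 610]          # in thousands

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(price) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price))
         / sum((x - mean_x) ** 2 for x in sqft))
intercept = mean_y - slope * mean_x

def predict(square_feet):
    """Predicted price (in thousands) for a house of the given size."""
    return slope * square_feet + intercept
```

A multifeature model works the same way conceptually: the fitted function maps attribute values to a continuous target.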

For more information on the built-in supervised learning algorithms provided by SageMaker AI, see [Supervised learning](algos.md#algorithms-built-in-supervised-learning).

### Unsupervised learning
<a name="algorithms-choose-unsupervised-learning"></a>

If your data set consists of features or attributes (inputs) that do not contain labels or target values (outputs), then you have an unsupervised learning problem. In this type of problem, the output must be predicted based on the pattern discovered in the input data. The goal in unsupervised learning problems is to discover patterns such as groupings within the data. There is a wide variety of tasks or problem types to which unsupervised learning can be applied. Principal component and cluster analyses are two of the main methods commonly deployed for preprocessing data. Here is a short list of problem types that can be addressed by unsupervised learning:
+ **Dimension reduction** is typically part of a data exploration step used to determine the most relevant features to use for model construction. The idea is to transform data from a high-dimensional, sparsely populated space into a low-dimensional space that retains most significant properties of the original data. This provides relief for the curse of dimensionality that can arise with sparsely populated, high-dimensional data on which statistical analysis becomes problematic. It can also be used to help understand data, reducing high-dimensional data to a lower dimensionality that can be visualized.
+ **Cluster analysis** is a class of techniques that are used to classify objects or cases into groups called clusters. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the features or attributes that you want the algorithm to use to determine similarity, select a distance function to measure similarity, and specify the number of clusters to use in the analysis.
+ **Anomaly detection** is the identification of rare items, events, or observations in a data set which raise suspicions because they differ significantly from the rest of the data. The identification of anomalous items can be used, for example, to detect bank fraud or medical errors. Anomalies are also referred to as outliers, novelties, noise, deviations, and exceptions.
+ **Density estimation** is the construction of estimates of unobservable underlying probability density functions based on observed data. A natural use of density estimates is for data exploration. Density estimates can discover features such as skewness and multimodality in the data. The most basic form of density estimation is a rescaled histogram.
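As an illustration of the cluster analysis idea (not the SageMaker AI K-Means implementation, which is a scalable managed algorithm), here is a minimal two-cluster k-means sketch on made-up 2-D points:

```python
# Minimal k-means on 2-D points: alternate between assigning each
# point to its nearest centroid and moving centroids to cluster means.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if its cluster is empty).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers = kmeans(points, centroids=[(0, 0), (10, 10)])
```

The two groupings emerge purely from the distances between inputs; no labels are involved, which is what makes this unsupervised.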

SageMaker AI provides several built-in machine learning algorithms that you can use for these unsupervised learning tasks. For more information on the built-in unsupervised algorithms provided by SageMaker AI, see [Unsupervised learning](algos.md#algorithms-built-in-unsupervised-learning).

### Reinforcement learning
<a name="algorithms-choose-reinforcement-learning"></a>

Reinforcement learning is a type of learning that is based on interaction with the environment. This type of learning is used by an agent that must learn behavior through trial-and-error interactions with a dynamic environment in which the goal is to maximize the long-term rewards that the agent receives as a result of its actions. Rewards are maximized by trading off exploring actions that have uncertain rewards with exploiting actions that have known rewards. 
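The explore/exploit trade-off can be sketched with an epsilon-greedy rule, one of the simplest such strategies (illustrative only, not SageMaker-specific): with probability epsilon the agent explores a random action, and otherwise it exploits the action with the highest estimated reward.

```python
import random

# Epsilon-greedy action selection: explore a random action with
# probability epsilon, otherwise exploit the best-known action.
def choose_action(value_estimates, epsilon, rng):
    if rng.random() < epsilon:                           # explore
        return rng.randrange(len(value_estimates))
    return value_estimates.index(max(value_estimates))   # exploit

rng = random.Random(0)                  # seeded for reproducibility
estimates = [0.2, 0.8, 0.5]             # made-up reward estimates per action
actions = [choose_action(estimates, epsilon=0.1, rng=rng) for _ in range(1000)]
```

In a full reinforcement learning agent, the reward observed after each action would also update `estimates`, so exploration gradually refines the agent's knowledge of uncertain actions.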

For more information on SageMaker AI's frameworks, toolkits, and environments for reinforcement learning, see [Use Reinforcement Learning with Amazon SageMaker AI](reinforcement-learning.md).

# Built-in algorithms and pretrained models in Amazon SageMaker
<a name="algos"></a>

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning practitioners get started on training and deploying machine learning models quickly. For someone who is new to SageMaker, choosing the right algorithm for your particular use case can be a challenging task. The following table provides a quick cheat sheet that shows how you can start with an example problem or use case and find an appropriate built-in algorithm offered by SageMaker that is valid for that problem type. Additional guidance organized by learning paradigms (supervised and unsupervised) and important data domains (text and images) is provided in the sections following the table.

Table: Mapping use cases to built-in algorithms

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/algos.html)

For important information about the following items common to all of the built-in algorithms provided by SageMaker AI, see [Parameters for Built-in Algorithms](common-info-all-im-models.md).
+ Docker registry paths
+ data formats
+ recommended Amazon EC2 instance types
+ CloudWatch logs

The following sections provide additional guidance for the Amazon SageMaker AI built-in algorithms grouped by the supervised and unsupervised learning paradigms to which they belong. For descriptions of these learning paradigms and their associated problem types, see [Types of Algorithms](algorithms-choose.md). Sections are also provided for the SageMaker AI built-in algorithms available to address two important machine learning domains: textual analysis and image processing.
+ [Pre-trained models and solution templates](#algorithms-built-in-jumpstart)
+ [Supervised learning](#algorithms-built-in-supervised-learning)
+ [Unsupervised learning](#algorithms-built-in-unsupervised-learning)
+ [Textual analysis](#algorithms-built-in-text-analysis)
+ [Image processing](#algorithms-built-in-image-processing)

## Pre-trained models and solution templates
<a name="algorithms-built-in-jumpstart"></a>

Amazon SageMaker JumpStart provides a wide range of pre-trained models, pre-built solution templates, and examples for popular problem types. These use the SageMaker SDK as well as Studio Classic. For more information about these models, solutions, and the example notebooks provided by Amazon SageMaker JumpStart, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).

## Supervised learning
<a name="algorithms-built-in-supervised-learning"></a>

Amazon SageMaker AI provides several built-in general purpose algorithms that can be used for either classification or regression problems.
+ [AutoGluon-Tabular](autogluon-tabular.md)—an open-source AutoML framework that succeeds by ensembling models and stacking them in multiple layers. 
+ [CatBoost](catboost.md)—an implementation of the gradient-boosted trees algorithm that introduces ordered boosting and an innovative algorithm for processing categorical features.
+ [Factorization Machines Algorithm](fact-machines.md)—an extension of a linear model that is designed to economically capture interactions between features within high-dimensional sparse datasets.
+ [K-Nearest Neighbors (k-NN) Algorithm](k-nearest-neighbors.md)—a non-parametric method that uses the k nearest labeled points to assign a value. For classification, it is a label to a new data point. For regression, it is a predicted target value from the average of the k nearest points.
+ [LightGBM](lightgbm.md)—an implementation of the gradient-boosted trees algorithm that adds two novel techniques for improved efficiency and scalability. These two novel techniques are Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
+ [Linear Learner Algorithm](linear-learner.md)—learns a linear function for regression or a linear threshold function for classification.
+ [TabTransformer](tabtransformer.md)—a novel deep tabular data modeling architecture built on self-attention-based Transformers. 
+ [XGBoost algorithm with Amazon SageMaker AI](xgboost.md)—an implementation of the gradient-boosted trees algorithm that combines an ensemble of estimates from a set of simpler and weaker models.

Amazon SageMaker AI also provides several built-in supervised learning algorithms used for more specialized tasks during feature engineering and forecasting from time series data.
+ [Object2Vec Algorithm](object2vec.md)—a new highly customizable multi-purpose algorithm used for feature engineering. It can learn low-dimensional dense embeddings of high-dimensional objects to produce features that improve training efficiencies for downstream models. While this is a supervised algorithm, there are many scenarios in which the relationship labels can be obtained purely from natural clusterings in data. Even though it requires labeled data for training, this can occur without any explicit human annotation.
+ [Use the SageMaker AI DeepAR forecasting algorithm](deepar.md)—a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).

## Unsupervised learning
<a name="algorithms-built-in-unsupervised-learning"></a>

Amazon SageMaker AI provides several built-in algorithms that can be used for a variety of unsupervised learning tasks, such as clustering, dimension reduction, pattern recognition, and anomaly detection.
+ [Principal Component Analysis (PCA) Algorithm](pca.md)—reduces the dimensionality (number of features) within a dataset by projecting data points onto the first few principal components. The objective is to retain as much information or variation as possible. For mathematicians, principal components are eigenvectors of the data's covariance matrix.
+ [K-Means Algorithm](k-means.md)—finds discrete groupings within data. This occurs where members of a group are as similar as possible to one another and as different as possible from members of other groups.
+ [IP Insights](ip-insights.md)—learns the usage patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers.
+ [Random Cut Forest (RCF) Algorithm](randomcutforest.md)—detects anomalous data points within a data set that diverge from otherwise well-structured or patterned data.

## Textual analysis
<a name="algorithms-built-in-text-analysis"></a>

SageMaker AI provides algorithms that are tailored to the analysis of textual documents. This includes text used in natural language processing, document classification or summarization, topic modeling or classification, and language transcription or translation.
+ [BlazingText algorithm](blazingtext.md)—a highly optimized implementation of the Word2vec and text classification algorithms that scale to large datasets easily. It is useful for many downstream natural language processing (NLP) tasks.
+ [Sequence-to-Sequence Algorithm](seq-2-seq.md)—a supervised algorithm commonly used for neural machine translation. 
+ [Latent Dirichlet Allocation (LDA) Algorithm](lda.md)—an algorithm suitable for determining topics in a set of documents. It is an *unsupervised algorithm*, which means that it doesn't use example data with answers during training.
+ [Neural Topic Model (NTM) Algorithm](ntm.md)—another unsupervised technique for determining topics in a set of documents, using a neural network approach.
+ [Text Classification - TensorFlow](text-classification-tensorflow.md)—a supervised algorithm that supports transfer learning with available pretrained models for text classification.

## Image processing
<a name="algorithms-built-in-image-processing"></a>

SageMaker AI also provides image processing algorithms that are used for image classification, object detection, and computer vision.
+ [Image Classification - MXNet](image-classification.md)—uses example data with answers (referred to as a *supervised algorithm*). Use this algorithm to classify images.
+ [Image Classification - TensorFlow](image-classification-tensorflow.md)—uses pretrained TensorFlow Hub models to fine-tune for specific tasks (referred to as a *supervised algorithm*). Use this algorithm to classify images.
+ [Semantic Segmentation Algorithm](semantic-segmentation.md)—provides a fine-grained, pixel-level approach to developing computer vision applications.
+ [Object Detection - MXNet](object-detection.md)—detects and classifies objects in images using a single deep neural network. It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene.
+ [Object Detection - TensorFlow](object-detection-tensorflow.md)—detects bounding boxes and object labels in an image. It is a supervised learning algorithm that supports transfer learning with available pretrained TensorFlow models.

**Topics**
+ [Pre-trained models and solution templates](#algorithms-built-in-jumpstart)
+ [Supervised learning](#algorithms-built-in-supervised-learning)
+ [Unsupervised learning](#algorithms-built-in-unsupervised-learning)
+ [Textual analysis](#algorithms-built-in-text-analysis)
+ [Image processing](#algorithms-built-in-image-processing)
+ [Parameters for Built-in Algorithms](common-info-all-im-models.md)
+ [Built-in SageMaker AI Algorithms for Tabular Data](algorithms-tabular.md)
+ [Built-in SageMaker AI Algorithms for Text Data](algorithms-text.md)
+ [Built-in SageMaker AI Algorithms for Time-Series Data](algorithms-time-series.md)
+ [Unsupervised Built-in SageMaker AI Algorithms](algorithms-unsupervised.md)
+ [Built-in SageMaker AI Algorithms for Computer Vision](algorithms-vision.md)

# Parameters for Built-in Algorithms
<a name="common-info-all-im-models"></a>

The following table lists parameters for each of the algorithms provided by Amazon SageMaker AI.


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| AutoGluon-Tabular | training and (optionally) validation | File | CSV | CPU or GPU (single instance only) | No | 
| BlazingText | train | File or Pipe | Text file (one sentence per line with space-separated tokens)  | CPU or GPU (single instance only)  | No | 
| CatBoost | training and (optionally) validation | File | CSV | CPU (single instance only) | No | 
| DeepAR Forecasting | train and (optionally) test | File | JSON Lines or Parquet | CPU or GPU | Yes | 
| Factorization Machines | train and (optionally) test | File or Pipe | recordIO-protobuf | CPU (GPU for dense data) | Yes | 
| Image Classification - MXNet | train and validation, (optionally) train_lst, validation_lst, and model | File or Pipe | recordIO or image files (.jpg or .png)  | GPU | Yes | 
| Image Classification - TensorFlow | training and validation | File | image files (.jpg, .jpeg, or .png)  | CPU or GPU | Yes (only across multiple GPUs on a single instance) | 
| IP Insights | train and (optionally) validation | File | CSV | CPU or GPU | Yes | 
| K-Means | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU (single GPU device on one or more instances) | No | 
| K-Nearest-Neighbors (k-NN) | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU (single GPU device on one or more instances) | Yes | 
| LDA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU (single instance only) | No | 
| LightGBM | train/training and (optionally) validation | File | CSV | CPU | Yes | 
| Linear Learner | train and (optionally) validation, test, or both | File or Pipe | recordIO-protobuf or CSV | CPU or GPU | Yes | 
| Neural Topic Model | train and (optionally) validation, test, or both | File or Pipe | recordIO-protobuf or CSV | CPU or GPU | Yes | 
| Object2Vec | train and (optionally) validation, test, or both | File | JSON Lines  | CPU or GPU (single instance only) | No | 
| Object Detection - MXNet | train and validation, (optionally) train_annotation, validation_annotation, and model | File or Pipe | recordIO or image files (.jpg or .png)  | GPU | Yes | 
| Object Detection - TensorFlow | training and validation | File | image files (.jpg, .jpeg, or .png)  | GPU | Yes (only across multiple GPUs on a single instance) | 
| PCA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU | Yes | 
| Random Cut Forest | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU | Yes | 
| Semantic Segmentation | train and validation, train_annotation, validation_annotation, and (optionally) label_map and model | File or Pipe | Image files | GPU (single instance only) | No | 
| Seq2Seq Modeling | train, validation, and vocab | File | recordIO-protobuf | GPU (single instance only) | No | 
| TabTransformer | training and (optionally) validation | File | CSV | CPU or GPU (single instance only) | No | 
| Text Classification - TensorFlow | training and validation | File | CSV | CPU or GPU | Yes (only across multiple GPUs on a single instance) | 
| XGBoost (0.90-1, 0.90-2, 1.0-1, 1.2-1, 1.2-21) | train and (optionally) validation | File or Pipe | CSV, LibSVM, or Parquet | CPU (or GPU for 1.2-1) | Yes | 

Algorithms that are *parallelizable* can be deployed on multiple compute instances for distributed training.

The following topics provide information about data formats, recommended Amazon EC2 instance types, and CloudWatch logs common to all of the built-in algorithms provided by Amazon SageMaker AI.

**Note**  
To look up the Docker image URIs of the built-in algorithms managed by SageMaker AI, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths).

**Topics**
+ [Common Data Formats for Training](cdf-training.md)
+ [Common data formats for inference](cdf-inference.md)
+ [Instance Types for Built-in Algorithms](cmn-info-instance-types.md)
+ [Logs for Built-in Algorithms](common-info-all-sagemaker-models-logs.md)

# Common Data Formats for Training
<a name="cdf-training"></a>

To prepare for training, you can preprocess your data using a variety of AWS services, including AWS Glue, Amazon EMR, Amazon Redshift, Amazon Relational Database Service, and Amazon Athena. After preprocessing, publish the data to an Amazon S3 bucket. For training, the data must go through a series of conversions and transformations, including: 
+ Training data serialization (handled by you) 
+ Training data deserialization (handled by the algorithm) 
+ Training model serialization (handled by the algorithm) 
+ Trained model deserialization (optional, handled by you) 

When training with Amazon SageMaker AI, make sure to upload all of your training data at once. If you later add data to that location, you must start a new training job to construct a brand new model.

**Topics**
+ [Content Types Supported by Built-In Algorithms](#cdf-common-content-types)
+ [Using Pipe Mode](#cdf-pipe-mode)
+ [Using CSV Format](#cdf-csv-format)
+ [Using RecordIO Format](#cdf-recordio-format)
+ [Trained Model Deserialization](#td-deserialization)

## Content Types Supported by Built-In Algorithms
<a name="cdf-common-content-types"></a>

The following table lists some of the commonly supported [ContentType](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html#SageMaker-Type-Channel-ContentType) values and the algorithms that use them:

ContentTypes for Built-in Algorithms


| ContentType | Algorithm | 
| --- | --- | 
| application/x-image | Object Detection Algorithm, Semantic Segmentation | 
| application/x-recordio |  Object Detection Algorithm  | 
| application/x-recordio-protobuf |  Factorization Machines, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, Sequence-to-Sequence  | 
| application/jsonlines |  BlazingText, DeepAR  | 
| image/jpeg |  Object Detection Algorithm, Semantic Segmentation  | 
| image/png |  Object Detection Algorithm, Semantic Segmentation  | 
| text/csv |  IP Insights, K-Means, k-NN, Latent Dirichlet Allocation, Linear Learner, NTM, PCA, RCF, XGBoost  | 
| text/libsvm |  XGBoost  | 

For a summary of the parameters used by each algorithm, see the documentation for the individual algorithms or this [table](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html).

## Using Pipe Mode
<a name="cdf-pipe-mode"></a>

In *Pipe mode*, your training job streams data directly from Amazon Simple Storage Service (Amazon S3). Streaming can provide faster start times for training jobs and better throughput. This is in contrast to *File mode*, in which your data from Amazon S3 is stored on the training instance volumes. File mode uses disk space to store both your final model artifacts and your full training dataset. By streaming in your data directly from Amazon S3 in Pipe mode, you reduce the size of Amazon Elastic Block Store volumes of your training instances. Pipe mode needs only enough disk space to store your final model artifacts. See the [AlgorithmSpecification](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AlgorithmSpecification.html) API reference for additional details on the training input mode.

## Using CSV Format
<a name="cdf-csv-format"></a>

Many Amazon SageMaker AI algorithms support training with data in CSV format. To use data in CSV format for training, specify **text/csv** as the [ContentType](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html#SageMaker-Type-Channel-ContentType) in the input data channel specification. Amazon SageMaker AI requires that a CSV file have no header record and that the target variable be in the first column. To run unsupervised learning algorithms that don't have a target, specify the number of label columns in the content type, for example **'content_type=text/csv;label_size=0'**. For more information, see [Now use Pipe mode with CSV datasets for faster training on Amazon SageMaker AI built-in algorithms](https://aws.amazon.com/blogs/machine-learning/now-use-pipe-mode-with-csv-datasets-for-faster-training-on-amazon-sagemaker-built-in-algorithms/).
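For example, a training file that matches these requirements (no header record, target value first) can be assembled with Python's standard `csv` module; the rows below are made-up toy values:

```python
import csv
import io

# Build a headerless CSV payload with the target (label) in the first
# column, as SageMaker AI built-in algorithms expect for training data.
rows = [
    (1, 5.1, 3.5),   # (label, feature_1, feature_2) -- toy values
    (0, 4.9, 3.0),
]
buf = io.StringIO()
writer = csv.writer(buf)
for label, f1, f2 in rows:
    writer.writerow([label, f1, f2])  # target first, no header row
payload = buf.getvalue()
```

Writing to a file instead of `io.StringIO` and uploading the result to Amazon S3 produces a valid **text/csv** training channel.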

## Using RecordIO Format
<a name="cdf-recordio-format"></a>

In the protobuf recordIO format, SageMaker AI converts each observation in the dataset into a binary representation as a set of 4-byte floats, then loads it in the protobuf values field. If you are using Python for your data preparation, we strongly recommend that you use these existing transformations. However, if you are using another language, the protobuf definition file below provides the schema that you use to convert your data into SageMaker AI protobuf format.
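As a sketch of the sparse key/shape convention, for a tensor with `shape = [10, 20]` each key encodes a flat position in row-major order, so `row = key // 20` and `col = key % 20`:

```python
# How sparse keys index into a tensor with shape = [10, 20]:
# key = row * 20 + col, so row = key // 20 and col = key % 20.
shape = [10, 20]

def key_to_row_col(key, shape):
    """Map a flat sparse-tensor key to its (row, col) coordinates."""
    rows, cols = shape
    assert key < rows * cols, "key out of range for this shape"
    return key // cols, key % cols

# A sparse vector with nonzero values at flat positions 0, 25, and 199:
keys = [0, 25, 199]
coords = [key_to_row_col(k, shape) for k in keys]
```

The same scheme extends to n-dimensional tensors by dividing out each trailing dimension in turn.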

**Note**  
For an example that shows how to convert the commonly used numPy array into the protobuf recordIO format, see *[An Introduction to Factorization Machines with MNIST](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/factorization_machines_mnist/factorization_machines_mnist.html)* .

```
syntax = "proto2";

 package aialgs.data;

 option java_package = "com.amazonaws.aialgorithms.proto";
 option java_outer_classname = "RecordProtos";

 // A sparse or dense rank-R tensor that stores data as floats (float32).
 message Float32Tensor   {
     // Each value in the vector. If keys is empty, this is treated as a
     // dense vector.
     repeated float values = 1 [packed = true];

     // If keys is not empty, the vector is treated as sparse, with
     // each key specifying the location of the value in the sparse vector.
     repeated uint64 keys = 2 [packed = true];

     // An optional shape that allows the vector to represent a matrix.
     // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
     // and keys[i] % 20 gives the column.
     // This also supports n-dimensional tensors.
     // Note: If the tensor is sparse, you must specify this value.
     repeated uint64 shape = 3 [packed = true];
 }

 // A sparse or dense rank-R tensor that stores data as doubles (float64).
 message Float64Tensor {
     // Each value in the vector. If keys is empty, this is treated as a
     // dense vector.
     repeated double values = 1 [packed = true];

     // If this is not empty, the vector is treated as sparse, with
     // each key specifying the location of the value in the sparse vector.
     repeated uint64 keys = 2 [packed = true];

     // An optional shape that allows the vector to represent a matrix.
     // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
     // and keys[i] % 20 gives the column.
     // This also supports n-dimensional tensors.
     // Note: If the tensor is sparse, you must specify this value.
     repeated uint64 shape = 3 [packed = true];
 }

 // A sparse or dense rank-R tensor that stores data as 32-bit ints (int32).
 message Int32Tensor {
     // Each value in the vector. If keys is empty, this is treated as a
     // dense vector.
     repeated int32 values = 1 [packed = true];

     // If this is not empty, the vector is treated as sparse with
     // each key specifying the location of the value in the sparse vector.
     repeated uint64 keys = 2 [packed = true];

     // An optional shape that allows the vector to represent a matrix.
     // For example, if shape = [ 10, 20 ], floor(keys[i] / 20) gives the row,
     // and keys[i] % 20 gives the column.
     // This also supports n-dimensional tensors.
     // Note: If the tensor is sparse, you must specify this value.
     repeated uint64 shape = 3 [packed = true];
 }

 // Support for storing binary data for parsing in other ways (such as JPEG/etc).
 // This is an example of another type of value and may not immediately be supported.
 message Bytes {
     repeated bytes value = 1;

     // If the content type of the data is known, stores it.
     // This allows for the possibility of using decoders for common formats
     // in the future.
     optional string content_type = 2;
 }

 message Value {
     oneof value {
         // The numbering assumes the possible use of:
         // - float16, float128
         // - int8, int16, int32
         Float32Tensor float32_tensor = 2;
         Float64Tensor float64_tensor = 3;
         Int32Tensor int32_tensor = 7;
         Bytes bytes = 9;
     }
 }

 message Record {
     // Map from the name of the feature to the value.
     //
     // For vectors and libsvm-like datasets,
     // a single feature with the name `values`
     // should be specified.
     map<string, Value> features = 1;

     // An optional set of labels for this record.
     // Similar to the features field above, the key used for
     // generic scalar / vector labels should be 'values'.
     map<string, Value> label = 2;

     // A unique identifier for this record in the dataset.
     //
     // Whilst not necessary, this allows better
     // debugging where there are data issues.
     //
     // This is not used by the algorithm directly.
     optional string uid = 3;

     // Textual metadata describing the record.
     //
     // This may include JSON-serialized information
     // about the source of the record.
     //
     // This is not used by the algorithm directly.
     optional string metadata = 4;

     // An optional serialized JSON object that allows per-record
     // hyper-parameters/configuration/other information to be set.
     //
     // The meaning/interpretation of this field is defined by
     // the algorithm author and may not be supported.
     //
     // This is used to pass additional inference configuration
     // when batch inference is used (e.g. types of scores to return).
     optional string configuration = 5;
 }
```
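
The sparse-tensor convention in the schema comments above (for `shape = [rows, cols]`, a flat key `k` maps to row `floor(k / cols)` and column `k % cols`) can be sketched in plain Python. This toy helper is illustrative only and is not the official SageMaker AI tooling; the SageMaker Python SDK provides helpers for writing the actual protobuf recordIO format:

```python
def to_sparse(matrix):
    """Convert a dense 2-D list into the keys/values/shape layout
    used by the sparse tensor messages above."""
    rows, cols = len(matrix), len(matrix[0])
    keys, values = [], []
    for r in range(rows):
        for c in range(cols):
            if matrix[r][c] != 0:
                # Flat key: row index times column count plus column index.
                keys.append(r * cols + c)
                values.append(matrix[r][c])
    return {"keys": keys, "values": values, "shape": [rows, cols]}

dense = [
    [0.0, 1.5, 0.0],
    [2.0, 0.0, 0.0],
]
sparse = to_sparse(dense)
print(sparse)  # {'keys': [1, 3], 'values': [1.5, 2.0], 'shape': [2, 3]}
```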

After creating the protocol buffer, store it in an Amazon S3 location that Amazon SageMaker AI can access, and pass that location as part of `InputDataConfig` in `create_training_job`. 

**Note**  
For all Amazon SageMaker AI algorithms, the `ChannelName` in `InputDataConfig` must be set to `train`. Some algorithms also support validation or test input channels. These are typically used to evaluate the model's performance by using a hold-out dataset. Hold-out datasets are not used in the initial training but can be used to further tune the model.

## Trained Model Deserialization
<a name="td-deserialization"></a>

Amazon SageMaker AI models are stored as `model.tar.gz` in the S3 bucket specified by the `S3OutputPath` parameter of `OutputDataConfig` in the `create_training_job` call. The S3 bucket must be in the same AWS Region as the notebook instance. You can specify most of these model artifacts when creating a hosting model. You can also open and review them in your notebook instance. When `model.tar.gz` is untarred, it contains `model_algo-1`, which is a serialized Apache MXNet object. For example, use the following to load the k-means model into memory and view it: 

```
import mxnet as mx
print(mx.ndarray.load('model_algo-1'))
```

# Common data formats for inference
<a name="cdf-inference"></a>

Amazon SageMaker AI algorithms accept and produce several different MIME types for the HTTP payloads used in retrieving online and mini-batch predictions. You can use multiple AWS services to transform or preprocess records before running inference. At a minimum, you need to convert the data for the following:
+ Inference request serialization (handled by you) 
+ Inference request deserialization (handled by the algorithm) 
+ Inference response serialization (handled by the algorithm) 
+ Inference response deserialization (handled by you) 

**Topics**
+ [Convert data for inference request serialization](#ir-serialization)
+ [Convert data for inference response deserialization](#ir-deserialization)
+ [Common request formats for all algorithms](#common-in-formats)
+ [Use batch transform with built-in algorithms](#cm-batch)

## Convert data for inference request serialization
<a name="ir-serialization"></a>

Content type options for Amazon SageMaker AI algorithm inference requests include: `text/csv`, `application/json`, and `application/x-recordio-protobuf`. Algorithms that don't support all of these types might support others. XGBoost, for example, supports only `text/csv` from this list, but also supports `text/libsvm`.

For `text/csv`, the value for the Body argument to `invoke_endpoint` should be a string with commas separating the values for each feature. For example, a record for a model with four features might look like `1.5,16.0,14,23.0`. Any transformations performed on the training data should also be performed on the data before obtaining inference. The order of the features matters and must remain unchanged. 
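
As a sketch, serializing one record into the `text/csv` body is a simple join; the feature values here are the ones from the example above:

```python
# Serialize a single record's features as a comma-separated string,
# suitable for a text/csv inference request body.
features = [1.5, 16.0, 14, 23.0]
body = ",".join(str(f) for f in features)
print(body)  # 1.5,16.0,14,23.0
```

The resulting string would then be passed as the `Body` argument to `invoke_endpoint`, with `ContentType` set to `text/csv`.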

`application/json` is more flexible and provides multiple possible formats for developers to use in their applications. At a high level, in JavaScript, the payload might look like the following: 

```
let request = {
  // Instances might contain multiple rows that predictions are sought for.
  "instances": [
    {
      // Request and algorithm specific inference parameters.
      "configuration": {},
      // Data in the specific format required by the algorithm.
      "data": {
         "<field name>": dataElement
       }
    }
  ]
}
```

You have the following options for specifying the `dataElement`: 

**Protocol buffers equivalent**

```
// Has the same format as the protocol buffers implementation described for training.
let dataElement = {
  "keys": [],
  "values": [],
  "shape": []
}
```

**Simple numeric vector**

```
// An array containing numeric values is treated as an instance containing a
// single dense vector.
let dataElement = [1.5, 16.0, 14.0, 23.0]

// It will be converted to the following representation by the SDK.
let converted = {
  "features": {
    "values": dataElement
  }
}
```

**For multiple records**

```
let request = {
  "instances": [
    // First instance.
    {
      "features": [ 1.5, 16.0, 14.0, 23.0 ]
    },
    // Second instance.
    {
      "features": [ -2.0, 100.2, 15.2, 9.2 ]
    }
  ]
}
```
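
Building such a multi-record body can be sketched with Python's standard `json` module; the feature values are taken from the example above:

```python
import json

# Build a multi-record application/json request body from plain lists.
instances = [
    [1.5, 16.0, 14.0, 23.0],
    [-2.0, 100.2, 15.2, 9.2],
]
request = {"instances": [{"features": row} for row in instances]}
body = json.dumps(request)

# Round-trip to confirm the structure survives serialization.
print(json.loads(body)["instances"][1]["features"][0])  # -2.0
```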

## Convert data for inference response deserialization
<a name="ir-deserialization"></a>

Amazon SageMaker AI algorithms return JSON in several layouts. At a high level, the structure is:

```
let response = {
  "predictions": [{
    // Fields in the response object are defined on a per algorithm-basis.
  }]
}
```

The fields that are included in predictions differ across algorithms. The following are examples of output for the k-means algorithm.

**Single-record inference** 

```
let response = {
  "predictions": [{
    "closest_cluster": 5,
    "distance_to_cluster": 36.5
  }]
}
```

**Multi-record inference**

```
let response = {
  "predictions": [
    // First instance prediction.
    {
      "closest_cluster": 5,
      "distance_to_cluster": 36.5
    },
    // Second instance prediction.
    {
      "closest_cluster": 2,
      "distance_to_cluster": 90.3
    }
  ]
}
```

**Multi-record inference with protobuf input**

```
{
  "features": [],
  "label": {
    "closest_cluster": {
      "values": [ 5.0 ] // e.g., the closest centroid/cluster was 5
    },
    "distance_to_cluster": {
      "values": [ 36.5 ]
    }
  },
  "uid": "abc123",
  "metadata": "{ \"created_at\": \"2017-06-03\" }"
}
```

SageMaker AI algorithms also support the JSONLINES format, where the per-record response content is the same as in the JSON format. The multi-record structure is a collection of per-record response objects separated by newline characters. The response content for the built-in k-means algorithm for two input data points is:

```
{"distance_to_cluster": 23.40593910217285, "closest_cluster": 0.0}
{"distance_to_cluster": 27.250282287597656, "closest_cluster": 0.0}
```
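
Deserializing a JSONLINES body reduces to parsing each nonempty line as a JSON object; a minimal sketch using the response above:

```python
import json

# A JSONLINES response body: one JSON object per line.
body = (
    '{"distance_to_cluster": 23.40593910217285, "closest_cluster": 0.0}\n'
    '{"distance_to_cluster": 27.250282287597656, "closest_cluster": 0.0}'
)
predictions = [json.loads(line) for line in body.splitlines() if line]
print(predictions[0]["closest_cluster"])  # 0.0
```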

While running batch transform, we recommend using the `jsonlines` response type by setting the `Accept` field in the `CreateTransformJobRequest` to `application/jsonlines`.

## Common request formats for all algorithms
<a name="common-in-formats"></a>

Most algorithms use many of the following inference request formats.

### JSON request format
<a name="cm-json"></a>

**Content type:** application/json

**Dense format**

```
let request =   {
    "instances":    [
        {
            "features": [1.5, 16.0, 14.0, 23.0]
        }
    ]
}


let request =   {
    "instances":    [
        {
            "data": {
                "features": {
                    "values": [ 1.5, 16.0, 14.0, 23.0]
                }
            }
        }
    ]
}
```

**Sparse format**

```
{
	"instances": [
		{"data": {"features": {
					"keys": [26, 182, 232, 243, 431],
					"shape": [2000],
					"values": [1, 1, 1, 4, 1]
				}
			}
		},
		{"data": {"features": {
					"keys": [0, 182, 232, 243, 431],
					"shape": [2000],
					"values": [13, 1, 1, 4, 1]
				}
			}
		}
	]
}
```

### JSONLINES request format
<a name="cm-jsonlines"></a>

**Content type:** application/jsonlines

**Dense format**

A single record in dense format can be represented as either:

```
{ "features": [1.5, 16.0, 14.0, 23.0] }
```

or:

```
{ "data": { "features": { "values": [ 1.5, 16.0, 14.0, 23.0] } } }
```

**Sparse format**

A single record in sparse format is represented as:

```
{"data": {"features": { "keys": [26, 182, 232, 243, 431], "shape": [2000], "values": [1, 1, 1, 4, 1] } } }
```

Multiple records are represented as a collection of single-record representations, separated by newline characters:

```
{"data": {"features": { "keys": [0, 1, 3], "shape": [4], "values": [1, 4, 1] } } }
{ "data": { "features": { "values": [ 1.5, 16.0, 14.0, 23.0] } } }
{ "features": [1.5, 16.0, 14.0, 23.0] }
```

### CSV request format
<a name="cm-csv"></a>

**Content type:** text/csv;label_size=0

**Note**  
CSV support is not available for factorization machines.

### RECORDIO request format
<a name="cm-recordio"></a>

**Content type:** application/x-recordio-protobuf

## Use batch transform with built-in algorithms
<a name="cm-batch"></a>

While running batch transform, we recommend using the JSONLINES response type instead of JSON if the algorithm supports it. To do this, set the `Accept` field in the `CreateTransformJobRequest` to `application/jsonlines`.

When you create a transform job, the `SplitType` must be set based on the `ContentType` of the input data. Similarly, set `AssembleWith` based on the `Accept` field in the `CreateTransformJobRequest`. Use the following tables to set these fields:


| ContentType | Recommended SplitType | 
| --- | --- | 
| application/x-recordio-protobuf | RecordIO | 
| text/csv | Line | 
| application/jsonlines | Line | 
| application/json | None | 
| application/x-image | None | 
| image/\* | None | 


| Accept | Recommended AssembleWith | 
| --- | --- | 
| application/x-recordio-protobuf | None | 
| application/json | None | 
| application/jsonlines | Line | 
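
As an illustrative sketch, the following request pairs `text/csv` input with `Line` splitting and `application/jsonlines` output with `Line` assembly, per the tables above. The job, model, bucket, and instance names are placeholders; in practice you would pass these fields to `create_transform_job` on a boto3 SageMaker client:

```python
# Placeholder CreateTransformJob parameters; names are hypothetical.
transform_request = {
    "TransformJobName": "example-transform-job",
    "ModelName": "example-model",
    "TransformInput": {
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://example-bucket/input/",
            }
        },
        "ContentType": "text/csv",
        "SplitType": "Line",          # matches text/csv per the table above
    },
    "TransformOutput": {
        "S3OutputPath": "s3://example-bucket/output/",
        "Accept": "application/jsonlines",
        "AssembleWith": "Line",       # matches application/jsonlines
    },
    "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
}
# boto3.client("sagemaker").create_transform_job(**transform_request)
print(transform_request["TransformOutput"]["AssembleWith"])  # Line
```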

For more information on response formats for specific algorithms, see the following:
+ [DeepAR Inference Formats](deepar-in-formats.md)
+ [Factorization Machines Response Formats](fm-in-formats.md)
+ [IP Insights Inference Data Formats](ip-insights-inference-data-formats.md)
+ [K-Means Response Formats](km-in-formats.md)
+ [k-NN Request and Response Formats](kNN-inference-formats.md)
+ [Linear learner response formats](LL-in-formats.md)
+ [NTM Response Formats](ntm-in-formats.md)
+ [Data Formats for Object2Vec Inference](object2vec-inference-formats.md)
+ [Encoder Embeddings for Object2Vec](object2vec-encoder-embeddings.md)
+ [PCA Response Formats](PCA-in-formats.md)
+ [RCF Response Formats](rcf-in-formats.md)

# Instance Types for Built-in Algorithms
<a name="cmn-info-instance-types"></a>

Most Amazon SageMaker AI algorithms have been engineered to take advantage of GPU computing for training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective. Exceptions are noted in this guide.

To learn about the supported EC2 instances, see [Instance details](https://aws.amazon.com/sagemaker-ai/pricing/#Instance_details).

The size and type of data can have a great effect on which hardware configuration is most effective. When the same model is trained on a recurring basis, initial testing across a spectrum of instance types can discover configurations that are more cost-effective in the long run. Additionally, algorithms that train most efficiently on GPUs might not require GPUs for efficient inference. Experiment to determine the most cost-effective solution. To get an automatic instance recommendation or conduct custom load tests, use [Amazon SageMaker Inference Recommender](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender.html).

For more information on SageMaker AI hardware specifications, see [Amazon SageMaker AI pricing](https://aws.amazon.com/sagemaker/ai/pricing/).

**UltraServers**

UltraServers connect multiple Amazon EC2 instances using a low-latency, high-bandwidth accelerator interconnect. They are built to handle large-scale AI/ML workloads that require significant processing power. For more information, see [Amazon EC2 UltraServers](https://aws.amazon.com/ec2/ultraservers/). To get started with UltraServers, see [Reserve training plans for your training jobs or HyperPod clusters](https://docs.aws.amazon.com/sagemaker/latest/dg/reserve-capacity-with-training-plans.html).

To get started with UltraServers on Amazon SageMaker AI, [ create a training plan](https://docs.aws.amazon.com/sagemaker/latest/dg/reserve-capacity-with-training-plans.html). Once your UltraServer is available in the training plan, create a training job with the AWS Management Console, Amazon SageMaker AI API, or AWS CLI. Remember to specify the UltraServer instance type that you purchased in the training plan.

An UltraServer can run one or multiple jobs at a time. UltraServers group instances together, which gives you flexibility in how you allocate UltraServer capacity within your organization. As you configure your jobs, also keep your organization's data security guidelines in mind, because instances in one UltraServer can access data for another job running on another instance in the same UltraServer.

If you run into hardware failures in the UltraServer, SageMaker AI automatically tries to resolve the issue. As SageMaker AI investigates and resolves the issue, you might receive notifications and actions through AWS Health Events or AWS Support.

Once your training job finishes, SageMaker AI stops the instances, but they remain available in your training plan if the plan is still active. To keep an instance in an UltraServer running after a job finishes, you can use [managed warm pools](https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html).

If your training plan has enough capacity, you can even run training jobs across multiple UltraServers. By default, each UltraServer comes with 18 instances: 17 compute instances and 1 spare. If you need more instances, you must buy more UltraServers. When creating a training job, you can configure how jobs are placed across UltraServers using the `InstancePlacementConfig` parameter.

If you don't configure job placement, SageMaker AI automatically allocates jobs to instances within your UltraServers. This default strategy is best effort: it prioritizes filling all of the instances in a single UltraServer before using a different one. For example, if you request 14 instances and have 2 UltraServers in your training plan, SageMaker AI uses instances from the first UltraServer only. If you request 20 instances and have 2 UltraServers in your training plan, SageMaker AI uses all 17 instances in the first UltraServer and then 3 from the second. Instances within an UltraServer communicate over NVLink, but separate UltraServers communicate over Elastic Fabric Adapter (EFA), which might affect model training performance.

# Logs for Built-in Algorithms
<a name="common-info-all-sagemaker-models-logs"></a>

Amazon SageMaker AI algorithms produce Amazon CloudWatch logs, which provide detailed information on the training process. To see the logs, in the AWS Management Console, choose **CloudWatch**, choose **Logs**, and then choose the **/aws/sagemaker/TrainingJobs** log group. Each training job has one log stream per node on which it was trained. The log stream’s name begins with the value specified in the `TrainingJobName` parameter when the job was created.

**Note**  
If a job fails and logs do not appear in CloudWatch, it's likely that an error occurred before the start of training. Reasons include specifying the wrong training image or S3 location.

The contents of logs vary by algorithms. However, you can typically find the following information:
+ Confirmation of arguments provided at the beginning of the log
+ Errors that occurred during training
+ Measurement of an algorithm's accuracy or numerical performance
+ Timings for the algorithm and any major stages within the algorithm

## Common Errors
<a name="example-errors"></a>

If a training job fails, some details about the failure are provided by the `FailureReason` return value in the training job description, as follows:

```
sage = boto3.client('sagemaker')
sage.describe_training_job(TrainingJobName=job_name)['FailureReason']
```

Others are reported only in the CloudWatch logs. Common errors include the following:

1. Misspecifying a hyperparameter or specifying a hyperparameter that is invalid for the algorithm.

   **From the CloudWatch Log**

   ```
   [10/16/2017 23:45:17 ERROR 139623806805824 train.py:48]
   Additional properties are not allowed (u'mini_batch_siz' was
   unexpected)
   ```

1. Specifying an invalid value for a hyperparameter.

   **FailureReason**

   ```
   AlgorithmError: u'abc' is not valid under any of the given
   schemas\n\nFailed validating u'oneOf' in
   schema[u'properties'][u'feature_dim']:\n    {u'oneOf':
   [{u'pattern': u'^([1-9][0-9]*)$', u'type': u'string'},\n
   {u'minimum': 1, u'type': u'integer'}]}\
   ```

   **From the CloudWatch Log**

   ```
   [10/16/2017 23:57:17 ERROR 140373086025536 train.py:48] u'abc'
   is not valid under any of the given schemas
   ```

1. Inaccurate protobuf file format.

   **From the CloudWatch log**

   ```
   [10/17/2017 18:01:04 ERROR 140234860816192 train.py:48] cannot
                      copy sequence with size 785 to array axis with dimension 784
   ```

# Built-in SageMaker AI Algorithms for Tabular Data
<a name="algorithms-tabular"></a>

Amazon SageMaker AI provides built-in algorithms that are tailored to the analysis of tabular data. Tabular data refers to any datasets that are organized in tables consisting of rows (observations) and columns (features). The built-in SageMaker AI algorithms for tabular data can be used for either classification or regression problems.
+ [AutoGluon-Tabular](autogluon-tabular.md)—an open-source AutoML framework that succeeds by ensembling models and stacking them in multiple layers. 
+ [CatBoost](catboost.md)—an implementation of the gradient-boosted trees algorithm that introduces ordered boosting and an innovative algorithm for processing categorical features.
+ [Factorization Machines Algorithm](fact-machines.md)—an extension of a linear model that is designed to economically capture interactions between features within high-dimensional sparse datasets.
+ [K-Nearest Neighbors (k-NN) Algorithm](k-nearest-neighbors.md)—a non-parametric method that uses the k nearest labeled points to assign a label to a new data point for classification or a predicted target value from the average of the k nearest points for regression.
+ [LightGBM](lightgbm.md)—an implementation of the gradient-boosted trees algorithm that adds two novel techniques for improved efficiency and scalability: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB).
+ [Linear Learner Algorithm](linear-learner.md)—learns a linear function for regression or a linear threshold function for classification.
+ [TabTransformer](tabtransformer.md)—a novel deep tabular data modeling architecture built on self-attention-based Transformers. 
+ [XGBoost algorithm with Amazon SageMaker AI](xgboost.md)—an implementation of the gradient-boosted trees algorithm that combines an ensemble of estimates from a set of simpler and weaker models.


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| AutoGluon-Tabular | training and (optionally) validation | File | CSV | CPU or GPU (single instance only) | No | 
| CatBoost | training and (optionally) validation | File | CSV | CPU (single instance only) | No | 
| Factorization Machines | train and (optionally) test | File or Pipe | recordIO-protobuf | CPU (GPU for dense data) | Yes | 
| K-Nearest-Neighbors (k-NN) | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU (single GPU device on one or more instances) | Yes | 
| LightGBM | training and (optionally) validation | File | CSV | CPU (single instance only) | No | 
| Linear Learner | train and (optionally) validation, test, or both | File or Pipe | recordIO-protobuf or CSV | CPU or GPU | Yes | 
| TabTransformer | training and (optionally) validation | File | CSV | CPU or GPU (single instance only) | No | 
| XGBoost (0.90-1, 0.90-2, 1.0-1, 1.2-1, 1.2-2) | train and (optionally) validation | File or Pipe | CSV, LibSVM, or Parquet | CPU (or GPU for 1.2-1) | Yes | 

# AutoGluon-Tabular
<a name="autogluon-tabular"></a>

[AutoGluon-Tabular](https://auto.gluon.ai/stable/index.html) is a popular open-source AutoML framework that trains highly accurate machine learning models on an unprocessed tabular dataset. Unlike existing AutoML frameworks that primarily focus on model and hyperparameter selection, AutoGluon-Tabular succeeds by ensembling multiple models and stacking them in multiple layers. This page includes information about Amazon EC2 instance recommendations and sample notebooks for AutoGluon-Tabular.

# How to use SageMaker AI AutoGluon-Tabular
<a name="autogluon-tabular-modes"></a>

You can use AutoGluon-Tabular as an Amazon SageMaker AI built-in algorithm. The following section describes how to use AutoGluon-Tabular with the SageMaker Python SDK. For information on how to use AutoGluon-Tabular from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).
+ **Use AutoGluon-Tabular as a built-in algorithm**

  Use the AutoGluon-Tabular built-in algorithm to build an AutoGluon-Tabular training container as shown in the following code example. You can automatically retrieve the AutoGluon-Tabular built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API (or the `get_image_uri` API if using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 1). 

  After specifying the AutoGluon-Tabular image URI, you can use the AutoGluon-Tabular container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. The AutoGluon-Tabular built-in algorithm runs in script mode, but the training script is provided for you and there is no need to replace it. If you have extensive experience using script mode to create a SageMaker training job, then you can incorporate your own AutoGluon-Tabular training scripts.

  ```
  from sagemaker import image_uris, model_uris, script_uris
  import sagemaker
  
  # Session, Region, and execution role used later in this example
  sess = sagemaker.Session()
  aws_region = sess.boto_region_name
  aws_role = sagemaker.get_execution_role()
  
  train_model_id, train_model_version, train_scope = "autogluon-classification-ensemble", "*", "training"
  training_instance_type = "ml.p3.2xlarge"
  
  # Retrieve the docker image
  train_image_uri = image_uris.retrieve(
      region=None,
      framework=None,
      model_id=train_model_id,
      model_version=train_model_version,
      image_scope=train_scope,
      instance_type=training_instance_type
  )
  
  # Retrieve the training script
  train_source_uri = script_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
  )
  
  train_model_uri = model_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
  )
  
  # Sample training data is available in this bucket
  training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
  training_data_prefix = "training-datasets/tabular_binary"
  
  training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
  validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"
  
  output_bucket = sess.default_bucket()
  output_prefix = "jumpstart-example-tabular-training"
  
  s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
  
  from sagemaker import hyperparameters
  
  # Retrieve the default hyperparameters for training the model
  hyperparameters = hyperparameters.retrieve_default(
      model_id=train_model_id, model_version=train_model_version
  )
  
  # [Optional] Override default hyperparameters with custom values
  hyperparameters[
      "auto_stack"
  ] = "True"
  print(hyperparameters)
  
  from sagemaker.estimator import Estimator
  from sagemaker.utils import name_from_base
  
  training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
  
  # Create SageMaker Estimator instance
  tabular_estimator = Estimator(
      role=aws_role,
      image_uri=train_image_uri,
      source_dir=train_source_uri,
      model_uri=train_model_uri,
      entry_point="transfer_learning.py",
      instance_count=1,
      instance_type=training_instance_type,
      max_run=360000,
      hyperparameters=hyperparameters,
      output_path=s3_output_location
  )
  
  # Launch a SageMaker Training job by passing the S3 path of the training data
  tabular_estimator.fit(
      {
          "training": training_dataset_s3_path,
          "validation": validation_dataset_s3_path,
      }, logs=True, job_name=training_job_name
  )
  ```

  For more information about how to set up the AutoGluon-Tabular as a built-in algorithm, see the following notebook examples. Any S3 bucket used in these examples must be in the same AWS Region as the notebook instance used to run them.
  + [Tabular classification with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Classification_AutoGluon.ipynb)
  + [Tabular regression with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Regression_AutoGluon.ipynb)

# Input and Output interface for the AutoGluon-Tabular algorithm
<a name="InputOutput-AutoGluon-Tabular"></a>

Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of AutoGluon-Tabular supports CSV for training and inference:
+ For **Training ContentType**, valid inputs must be *text/csv*.
+ For **Inference ContentType**, valid inputs must be *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.   
For CSV inference, the algorithm assumes that CSV input does not have the label column. 

**Input format for training data, validation data, and categorical features**

Be mindful of how to format your training data for input to the AutoGluon-Tabular model. You must provide the path to an Amazon S3 bucket that contains your training and validation data. You can also include a list of categorical features. Use both the `training` and `validation` channels to provide your input data. Alternatively, you can use only the `training` channel.

**Use both the `training` and `validation` channels**

You can provide your input data by way of two S3 paths, one for the `training` channel and one for the `validation` channel. Each S3 path can either be an S3 prefix or a full S3 path pointing to one specific CSV file. The target variables should be in the first column of your CSV file. The predictor variables (features) should be in the remaining columns. The validation data is used to compute a validation score at the end of each boosting iteration. Early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your training data file. If you provide a JSON file for categorical features, your `training` channel must point to an S3 prefix and not a specific CSV file. This file should contain a Python dictionary where the key is the string `"cat_index_list"` and the value is a list of unique integers. Each integer in the value list should indicate the column index of the corresponding categorical features in your training data CSV file. Each value should be a positive integer (greater than zero because zero represents the target value), less than the `Int32.MaxValue` (2147483647), and less than the total number of columns. There should only be one categorical index JSON file.
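
A minimal sketch of writing such a file with Python's standard `json` module; the column indices are illustrative:

```python
import json

# Declare that columns 1 and 3 of the training CSV hold categorical
# features (column 0 is the target, so indices must be greater than zero).
with open("categorical_index.json", "w") as f:
    json.dump({"cat_index_list": [1, 3]}, f)

# Reading the file back confirms the expected dictionary layout.
with open("categorical_index.json") as f:
    cat_indices = json.load(f)["cat_index_list"]
print(cat_indices)  # [1, 3]
```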

**Use only the `training` channel**:

You can alternatively provide your input data by way of a single S3 path for the `training` channel. This S3 path should point to a directory with a subdirectory named `training/` that contains a CSV file. You can optionally include another subdirectory in the same location called `validation/` that also has a CSV file. If the validation data is not provided, then 20% of your training data is randomly sampled to serve as the validation data. If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your data subdirectories.

**Note**  
For CSV training input mode, the total memory available to the algorithm (instance count multiplied by the memory available in the `InstanceType`) must be able to hold the training dataset.

SageMaker AI AutoGluon-Tabular uses the `autogluon.tabular.TabularPredictor` class to serialize and deserialize the model for saving and loading.

**To use a model trained with SageMaker AI AutoGluon-Tabular with the AutoGluon framework**
+ Use the following Python code:

  ```
  import tarfile
  from autogluon.tabular import TabularPredictor
  
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()
  
  # model_file_path: directory containing the extracted predictor artifacts
  model = TabularPredictor.load(model_file_path)
  
  # prediction with test data
  # dtest should be a pandas DataFrame with column names feature_0, feature_1, ..., feature_d
  pred = model.predict(dtest)
  ```

## Amazon EC2 instance recommendation for the AutoGluon-Tabular algorithm
<a name="Instance-AutoGluon-Tabular"></a>

SageMaker AI AutoGluon-Tabular supports single-instance CPU and single-instance GPU training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective. To take advantage of GPU training, specify the instance type as one of the GPU instances (for example, P3). SageMaker AI AutoGluon-Tabular currently does not support multi-GPU training.

## AutoGluon-Tabular sample notebooks
<a name="autogluon-tabular-sample-notebooks"></a>

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI AutoGluon-Tabular algorithm.



| **Notebook Title** | **Description** | 
| --- | --- | 
|  [Tabular classification with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Classification_AutoGluon.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI AutoGluon-Tabular algorithm to train and host a tabular classification model.  | 
|  [Tabular regression with Amazon SageMaker AI AutoGluon-Tabular algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/autogluon_tabular/Amazon_Tabular_Regression_AutoGluon.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI AutoGluon-Tabular algorithm to train and host a tabular regression model.  | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How AutoGluon-Tabular works
<a name="autogluon-tabular-HowItWorks"></a>

AutoGluon-Tabular performs advanced data processing, deep learning, and multi-layer model ensemble methods. It automatically recognizes the data type in each column for robust data preprocessing, including special handling of text fields. 

AutoGluon fits various models ranging from off-the-shelf boosted trees to customized neural networks. These models are ensembled in a novel way: models are stacked in multiple layers and trained in a layer-wise manner that guarantees raw data can be translated into high-quality predictions within a given time constraint. This process mitigates overfitting by splitting the data in various ways with careful tracking of out-of-fold examples.
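The out-of-fold mechanics can be sketched as follows (a minimal illustration with least-squares base learners and a single stacking layer, not AutoGluon's actual implementation): each first-layer model predicts only the rows it did not train on, and the next layer trains on those held-out predictions.

```
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def fit_linear(X, y):
    # least-squares base learner (intercept via an appended ones column)
    w, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
    return w

def predict_linear(w, X):
    return np.c_[X, np.ones(len(X))] @ w

# layer 1: out-of-fold predictions, so no model predicts its own training rows
k = 5
folds = np.array_split(np.arange(len(X)), k)
oof = np.zeros(len(X))
for i in range(k):
    val = folds[i]
    trn = np.concatenate([folds[j] for j in range(k) if j != i])
    w = fit_linear(X[trn], y[trn])
    oof[val] = predict_linear(w, X[val])

# layer 2: the stacker trains on the original features plus the held-out predictions
X2 = np.c_[X, oof]
w2 = fit_linear(X2, y)
final = predict_linear(w2, X2)
```

Because the second layer only ever sees predictions made on held-out rows, it cannot simply memorize first-layer overfitting, which is the leakage the layer-wise procedure guards against.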

The AutoGluon-Tabular algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, and distributions. You can use AutoGluon-Tabular for regression, classification (binary and multiclass), and ranking problems.

The following diagram illustrates how the multi-layer stacking strategy works.

![AutoGluon's multi-layer stacking strategy shown with two stacking layers.](http://docs.aws.amazon.com/sagemaker/latest/dg/images/autogluon_tabular_illustration.png)


For more information, see *[AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data](https://arxiv.org/pdf/2003.06505.pdf)*.

# AutoGluon-Tabular hyperparameters
<a name="autogluon-tabular-hyperparameters"></a>

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI AutoGluon-Tabular algorithm. Users set these parameters to facilitate the estimation of model parameters from data. The SageMaker AI AutoGluon-Tabular algorithm is an implementation of the open-source [AutoGluon-Tabular](https://github.com/awslabs/autogluon) package.

**Note**  
The default hyperparameters are based on example datasets in the [AutoGluon-Tabular sample notebooks](autogluon-tabular.md#autogluon-tabular-sample-notebooks).

By default, the SageMaker AI AutoGluon-Tabular algorithm automatically chooses an evaluation metric based on the type of classification problem. The algorithm detects the type of classification problem based on the number of labels in your data. For regression problems, the evaluation metric is root mean squared error. For binary classification problems, the evaluation metric is area under the receiver operating characteristic curve (AUC). For multiclass classification problems, the evaluation metric is accuracy. You can use the `eval_metric` hyperparameter to change the default evaluation metric. Refer to the following table for more information on AutoGluon-Tabular hyperparameters, including descriptions, valid values, and default values.


| Parameter Name | Description | 
| --- | --- | 
| eval\_metric |  The evaluation metric for validation data. If `eval_metric` is set to the default `"auto"` value, then the algorithm automatically chooses an evaluation metric based on the type of classification problem: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular-hyperparameters.html) Valid values: string, refer to the [AutoGluon documentation](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html) for valid values. Default value: `"auto"`.  | 
| presets |  List of preset configurations for various arguments in `fit()`.  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/autogluon-tabular-hyperparameters.html) For more details, see [AutoGluon Predictors](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html). Valid values: string, any of the following: (`"best_quality"`, `"high_quality"`, `"good_quality"`, `"medium_quality"`, `"optimize_for_deployment"`, or `"interpretable"`). Default value: `"medium_quality"`.  | 
| auto\_stack |  Whether AutoGluon should automatically utilize bagging and multi-layer stack ensembling to boost predictive accuracy. Set `auto_stack` to `"True"` if you are willing to tolerate longer training times in order to maximize predictive accuracy. This automatically sets the `num_bag_folds` and `num_stack_levels` arguments based on dataset properties.  Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| num\_bag\_folds |  Number of folds used for bagging of models. When `num_bag_folds` is equal to `k`, training time is roughly increased by a factor of `k`. Set `num_bag_folds` to 0 to deactivate bagging. This is disabled by default, but we recommend using values between 5 and 10 to maximize predictive performance. Increasing `num_bag_folds` results in models with lower bias, but that are more prone to overfitting. One is an invalid value for this parameter, and will raise a `ValueError`. Values greater than 10 may produce diminishing returns and can even harm overall results due to overfitting. To further improve predictions, avoid increasing `num_bag_folds` and instead increase `num_bag_sets`. Valid values: string, any integer between (and including) `"0"` and `"10"`. Default value: `"0"`.  | 
| num\_bag\_sets |  Number of repeats of k-fold bagging to perform (values must be greater than or equal to 1). The total number of models trained during bagging is equal to `num_bag_folds` \* `num_bag_sets`. This parameter defaults to 1 if `time_limit` is not specified. This parameter is disabled if `num_bag_folds` is not specified. Values greater than 1 result in superior predictive performance, especially on smaller problems and with stacking enabled.  Valid values: integer, range: [`1`, `20`]. Default value: `1`.  | 
| num\_stack\_levels |  Number of stacking levels to use in the stack ensemble. Roughly increases model training time by a factor of `num_stack_levels` + 1. Set this parameter to 0 to deactivate stack ensembling. This parameter is deactivated by default, but we recommend using values between 1 and 3 to maximize predictive performance. To prevent overfitting and a `ValueError`, `num_bag_folds` must be greater than or equal to 2. Valid values: float, range: [`0`, `3`]. Default value: `0`.  | 
| refit\_full |  Whether or not to retrain all models on all of the data (training and validation) after the normal training procedure. For more details, see [AutoGluon Predictors](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html). Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| set\_best\_to\_refit\_full |  Whether or not to change the default model that the predictor uses for prediction. If `set_best_to_refit_full` is set to `"True"`, the default model changes to the model that exhibited the highest validation score as a result of refitting (activated by `refit_full`). Only valid if `refit_full` is set. Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| save\_space |  Whether or not to reduce the memory and disk size of the predictor by deleting auxiliary model files that aren’t needed for prediction on new data. This has no impact on inference accuracy. We recommend setting `save_space` to `"True"` if your only goal is to use the trained model for prediction. Certain advanced functionality may no longer be available if `save_space` is set to `"True"`. Refer to the [`predictor.save_space()`](https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.save_space.html) documentation for more details. Valid values: string, `"True"` or `"False"`. Default value: `"False"`.  | 
| verbosity |  The verbosity of print messages. `verbosity` levels range from `0` to `4`, with higher levels corresponding to more detailed print statements. A `verbosity` of `0` suppresses warnings.  Valid values: integer, any of the following: (`0`, `1`, `2`, `3`, or `4`). Default value: `2`.  | 

# Tuning an AutoGluon-Tabular model
<a name="autogluon-tabular-tuning"></a>

Although AutoGluon-Tabular can be used with automatic model tuning, its design delivers good performance without extensive hyperparameter optimization. Rather than relying on model tuning, AutoGluon-Tabular succeeds by stacking models in multiple layers and training in a layer-wise manner.

For more information about AutoGluon-Tabular hyperparameters, see [AutoGluon-Tabular hyperparameters](autogluon-tabular-hyperparameters.md).

# CatBoost
<a name="catboost"></a>

[CatBoost](https://catboost.ai/) is a popular and high-performance open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models.
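The core GBDT idea, combining weak learners that each fit the previous ensemble's residuals, can be sketched with depth-1 stumps and squared error (a toy illustration of gradient boosting in general, not the CatBoost implementation):

```
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=200)
y = np.sin(X) + rng.normal(scale=0.1, size=200)

def fit_stump(x, residual):
    # pick the threshold that best splits the residuals into two constant leaves
    best = None
    for t in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

pred = np.zeros_like(y)
lr = 0.3  # learning rate shrinks each weak learner's contribution
for _ in range(50):
    residual = y - pred                    # negative gradient of squared error
    t, lv, rv = fit_stump(X, residual)
    pred += lr * np.where(X <= t, lv, rv)  # add the weak learner's correction
```

Each iteration reduces the remaining error a little, so the ensemble of many weak stumps ends up far more accurate than any single one.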

CatBoost introduces two critical algorithmic advances to GBDT:

1. The implementation of ordered boosting, a permutation-driven alternative to the classic algorithm

1. An innovative algorithm for processing categorical features

Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms. This page includes information about Amazon EC2 instance recommendations and sample notebooks for CatBoost.

# How to use SageMaker AI CatBoost
<a name="catboost-modes"></a>

You can use CatBoost as an Amazon SageMaker AI built-in algorithm. The following section describes how to use CatBoost with the SageMaker Python SDK. For information on how to use CatBoost from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).
+ **Use CatBoost as a built-in algorithm**

  Use the CatBoost built-in algorithm to build a CatBoost training container as shown in the following code example. You can retrieve the CatBoost built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API (or the `get_image_uri` API if using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 1). 

  After specifying the CatBoost image URI, you can use the CatBoost container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. The CatBoost built-in algorithm runs in script mode, but the training script is provided for you and there is no need to replace it. If you have extensive experience using script mode to create a SageMaker training job, then you can incorporate your own CatBoost training scripts.

  ```
  import sagemaker
  from sagemaker import image_uris, model_uris, script_uris
  
  # Set up the session, Region, and execution role used later in this example
  sess = sagemaker.Session()
  aws_region = sess.boto_region_name
  aws_role = sagemaker.get_execution_role()
  
  train_model_id, train_model_version, train_scope = "catboost-classification-model", "*", "training"
  training_instance_type = "ml.m5.xlarge"
  
  # Retrieve the docker image
  train_image_uri = image_uris.retrieve(
      region=None,
      framework=None,
      model_id=train_model_id,
      model_version=train_model_version,
      image_scope=train_scope,
      instance_type=training_instance_type
  )
  
  # Retrieve the training script
  train_source_uri = script_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
  )
  
  train_model_uri = model_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
  )
  
  # Sample training data is available in this bucket
  training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
  training_data_prefix = "training-datasets/tabular_multiclass/"
  
  training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
  validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"
  
  output_bucket = sess.default_bucket()
  output_prefix = "jumpstart-example-tabular-training"
  
  s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
  
  from sagemaker import hyperparameters
  
  # Retrieve the default hyperparameters for training the model
  hyperparameters = hyperparameters.retrieve_default(
      model_id=train_model_id, model_version=train_model_version
  )
  
  # [Optional] Override default hyperparameters with custom values
  hyperparameters[
      "iterations"
  ] = "500"
  print(hyperparameters)
  
  from sagemaker.estimator import Estimator
  from sagemaker.utils import name_from_base
  
  training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
  
  # Create SageMaker Estimator instance
  tabular_estimator = Estimator(
      role=aws_role,
      image_uri=train_image_uri,
      source_dir=train_source_uri,
      model_uri=train_model_uri,
      entry_point="transfer_learning.py",
      instance_count=1,
      instance_type=training_instance_type,
      max_run=360000,
      hyperparameters=hyperparameters,
      output_path=s3_output_location
  )
  
  # Launch a SageMaker Training job by passing the S3 path of the training data
  tabular_estimator.fit(
      {
          "training": training_dataset_s3_path,
          "validation": validation_dataset_s3_path,
      }, logs=True, job_name=training_job_name
  )
  ```

  For more information about how to set up CatBoost as a built-in algorithm, see the following notebook examples.
  + [Tabular classification with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.ipynb)
  + [Tabular regression with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Regression_LightGBM_CatBoost.ipynb)

# Input and Output interface for the CatBoost algorithm
<a name="InputOutput-CatBoost"></a>

Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of CatBoost supports CSV for training and inference:
+ For **Training ContentType**, valid inputs must be *text/csv*.
+ For **Inference ContentType**, valid inputs must be *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.   
For CSV inference, the algorithm assumes that CSV input does not have the label column. 
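The expected training CSV shape (target in column 0, no header record; the values shown are hypothetical) can be produced with the standard library:

```
import csv
import os
import tempfile

# Hypothetical rows: label first, then the feature columns; no header record.
rows = [
    [1, 5.1, 3.5, 1.4],
    [0, 4.9, 3.0, 1.3],
]
csv_dir = tempfile.mkdtemp()
train_csv = os.path.join(csv_dir, "train.csv")
with open(train_csv, "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

For inference input, omit the label so that each row contains only the feature columns.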

**Input format for training data, validation data, and categorical features**

Be mindful of how to format your training data for input to the CatBoost model. You must provide the path to an Amazon S3 bucket that contains your training and validation data. You can also include a list of categorical features. Use both the `training` and `validation` channels to provide your input data. Alternatively, you can use only the `training` channel.

**Use both the `training` and `validation` channels**

You can provide your input data by way of two S3 paths, one for the `training` channel and one for the `validation` channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a full S3 path pointing to one specific CSV file. The target variable should be in the first column of your CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files are provided for the `training` or `validation` channels, the CatBoost algorithm concatenates the files. The validation data is used to compute a validation score at the end of each boosting iteration. Early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your training data file or files. If you provide a JSON file for categorical features, your `training` channel must point to an S3 prefix and not a specific CSV file. This file should contain a Python dictionary where the key is the string `"cat_index_list"` and the value is a list of unique integers. Each integer in the value list should indicate the column index of the corresponding categorical features in your training data CSV file. Each value should be a positive integer (greater than zero because zero represents the target value), less than the `Int32.MaxValue` (2147483647), and less than the total number of columns. There should only be one categorical index JSON file.

**Use only the `training` channel**

You can alternatively provide your input data by way of a single S3 path for the `training` channel. This S3 path should point to a directory with a subdirectory named `training/` that contains one or more CSV files. You can optionally include another subdirectory in the same location called `validation/` that also has one or more CSV files. If the validation data is not provided, then 20% of your training data is randomly sampled to serve as the validation data. If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your data subdirectories.
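As an illustration (with hypothetical file names), the single-channel layout mirrors the following local structure before being uploaded to the `training` channel's S3 path:

```
import os
import tempfile

# Layout expected when only the training channel is used:
#   <prefix>/training/*.csv             required
#   <prefix>/validation/*.csv           optional
#   <prefix>/categorical_index.json     optional
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "training"))
os.makedirs(os.path.join(root, "validation"))
with open(os.path.join(root, "training", "train.csv"), "w") as f:
    f.write("1,0.5,2.3\n0,1.2,0.7\n")  # target first, no header
```

Point the `training` channel at the S3 prefix that corresponds to `root` after uploading; the algorithm then discovers the `training/` and optional `validation/` subdirectories itself.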

**Note**  
For CSV training input mode, the total memory available to the algorithm (instance count multiplied by the memory available in the `InstanceType`) must be able to hold the training dataset.

SageMaker AI CatBoost uses the `catboost.CatBoostClassifier` and `catboost.CatBoostRegressor` classes to serialize and deserialize the model for saving and loading.

**To use a model trained with SageMaker AI CatBoost with `catboost`**
+ Use the following Python code:

  ```
  import os
  import tarfile
  from catboost import CatBoostClassifier
  
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()
  
  # model_file_path: directory containing the extracted model artifacts
  file_path = os.path.join(model_file_path, "model")
  model = CatBoostClassifier()
  model.load_model(file_path)
  
  # prediction with test data
  # dtest should be a pandas DataFrame with column names feature_0, feature_1, ..., feature_d
  pred = model.predict(dtest)
  ```

## Amazon EC2 instance recommendation for the CatBoost algorithm
<a name="Instance-CatBoost"></a>

SageMaker AI CatBoost currently only trains using CPUs. CatBoost is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice than a compute-optimized instance (for example, C5). Further, we recommend that you have enough total memory in selected instances to hold the training data. 

## CatBoost sample notebooks
<a name="catboost-sample-notebooks"></a>

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI CatBoost algorithm.



| **Notebook Title** | **Description** | 
| --- | --- | 
|  [Tabular classification with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI CatBoost algorithm to train and host a tabular classification model.   | 
|  [Tabular regression with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Regression_LightGBM_CatBoost.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI CatBoost algorithm to train and host a tabular regression model.   | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How CatBoost Works
<a name="catboost-HowItWorks"></a>

CatBoost implements a conventional Gradient Boosting Decision Tree (GBDT) algorithm with the addition of two critical algorithmic advances:

1. The implementation of ordered boosting, a permutation-driven alternative to the classic algorithm

1. An innovative algorithm for processing categorical features

Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms.
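The leakage-avoidance idea behind CatBoost's categorical handling can be sketched with ordered target statistics (a simplified illustration, not the library's exact formula): each row is encoded using only the labels of rows that precede it in a random permutation, so a row never "sees" its own target.

```
import numpy as np

rng = np.random.default_rng(0)
categories = rng.integers(0, 3, size=10)            # one categorical feature
target = rng.integers(0, 2, size=10).astype(float)  # binary labels

prior, weight = target.mean(), 1.0  # smoothing toward the global mean
perm = rng.permutation(len(categories))

# Ordered target statistics: encode row i from the running per-category
# target sums of rows that come before it in the permutation.
encoded = np.zeros(len(categories))
counts = {}  # category -> (sum of targets seen so far, rows seen so far)
for i in perm:
    s, n = counts.get(categories[i], (0.0, 0))
    encoded[i] = (s + weight * prior) / (n + weight)
    counts[categories[i]] = (s + target[i], n + 1)
```

A plain (unordered) target encoding would include each row's own label in its encoding, which is exactly the target leakage that produces the prediction shift described above.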

The CatBoost algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, distributions, and the diversity of hyperparameters that you can fine-tune. You can use CatBoost for regression, classification (binary and multiclass), and ranking problems.

For more information on gradient boosting, see [How the SageMaker AI XGBoost algorithm works](xgboost-HowItWorks.md). For in-depth details about the ordered boosting and categorical feature processing techniques used in the CatBoost method, see *[CatBoost: unbiased boosting with categorical features](https://arxiv.org/pdf/1706.09516.pdf)*.

# CatBoost hyperparameters
<a name="catboost-hyperparameters"></a>

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI CatBoost algorithm. Users set these parameters to facilitate the estimation of model parameters from data. The SageMaker AI CatBoost algorithm is an implementation of the open-source [CatBoost](https://github.com/catboost/catboost) package.

**Note**  
The default hyperparameters are based on example datasets in the [CatBoost sample notebooks](catboost.md#catboost-sample-notebooks).

By default, the SageMaker AI CatBoost algorithm automatically chooses an evaluation metric and loss function based on the type of classification problem. The CatBoost algorithm detects the type of classification problem based on the number of labels in your data. For regression problems, the evaluation metric and loss functions are both root mean squared error. For binary classification problems, the evaluation metric is Area Under the Curve (AUC) and the loss function is log loss. For multiclass classification problems, the evaluation metric and loss functions are multiclass cross entropy. You can use the `eval_metric` hyperparameter to change the default evaluation metric. Refer to the following table for more information on CatBoost hyperparameters, including descriptions, valid values, and default values.


| Parameter Name | Description | 
| --- | --- | 
| iterations |  The maximum number of trees that can be built. Valid values: integer, range: Positive integer. Default value: `500`.  | 
| early\_stopping\_rounds |  Training stops if one metric on one validation dataset does not improve in the last `early_stopping_rounds` rounds. If `early_stopping_rounds` is less than or equal to zero, this hyperparameter is ignored. Valid values: integer. Default value: `5`.  | 
| eval\_metric |  The evaluation metric for validation data. If `eval_metric` is set to the default `"auto"` value, then the algorithm automatically chooses an evaluation metric based on the type of classification problem: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/catboost-hyperparameters.html) Valid values: string, refer to the [CatBoost documentation](https://catboost.ai/en/docs/references/eval-metric__supported-metrics) for valid values. Default value: `"auto"`.  | 
| learning\_rate |  The rate at which the model weights are updated after working through each batch of training examples. Valid values: float, range: (`0.0`, `1.0`). Default value: `0.009`.  | 
| depth |  Depth of the tree. Valid values: integer, range: (`1`, `16`). Default value: `6`.  | 
| l2\_leaf\_reg |  Coefficient for the L2 regularization term of the cost function. Valid values: integer, range: Positive integer. Default value: `3`.  | 
| random\_strength |  The amount of randomness to use for scoring splits when the tree structure is selected. Use this parameter to avoid overfitting the model. Valid values: float, range: Positive floating point number. Default value: `1.0`.  | 
| max\_leaves |  The maximum number of leaves in the resulting tree. Can only be used with the `"Lossguide"` growing policy. Valid values: integer, range: [`2`, `64`]. Default value: `31`.  | 
| rsm |  Random subspace method. The percentage of features to use at each split selection, when features are selected over again at random. Valid values: float, range: (`0.0`, `1.0`]. Default value: `1.0`.  | 
| sampling\_frequency |  Frequency to sample weights and objects when building trees. Valid values: string, either: (`"PerTreeLevel"` or `"PerTree"`). Default value: `"PerTreeLevel"`.  | 
| min\_data\_in\_leaf |  The minimum number of training samples in a leaf. CatBoost does not search for new splits in leaves with a sample count less than the specified value. Can only be used with the `"Lossguide"` and `"Depthwise"` growing policies. Valid values: integer, range: [`1`, `∞`). Default value: `1`.  | 
| bagging\_temperature |  Defines the settings of the Bayesian bootstrap. Use the Bayesian bootstrap to assign random weights to objects. If `bagging_temperature` is set to `1.0`, then the weights are sampled from an exponential distribution. If `bagging_temperature` is set to `0.0`, then all weights are 1.0. Valid values: float, range: Non-negative float. Default value: `1.0`.  | 
| boosting\_type |  The boosting scheme. `"Auto"` means that the `boosting_type` is selected based on processing unit type, the number of objects in the training dataset, and the selected learning mode. Valid values: string, any of the following: (`"Auto"`, `"Ordered"`, `"Plain"`). Default value: `"Auto"`.  | 
| scale\_pos\_weight |  The weight for the positive class in binary classification. The value is used as a multiplier for the weights of objects from the positive class. Valid values: float, range: Positive float. Default value: `1.0`.  | 
| max\_bin |  The number of splits for numerical features. `"Auto"` means that `max_bin` is selected based on the processing unit type and other parameters. For details, see the CatBoost documentation. Valid values: string, either: (`"Auto"` or string of integer from `"1"` to `"65535"` inclusively). Default value: `"Auto"`.  | 
| grow\_policy |  The tree growing policy. Defines how to perform greedy tree construction. Valid values: string, any of the following: (`"SymmetricTree"`, `"Depthwise"`, or `"Lossguide"`). Default value: `"SymmetricTree"`.  | 
| random\_seed |  The random seed used for training. Valid values: integer, range: Non-negative integer. Default value: `1`. | 
| thread\_count |  The number of threads to use during the training. If `thread_count` is `-1`, then the number of threads is equal to the number of processor cores. `thread_count` cannot be `0`. Valid values: integer, either: (`-1` or positive integer). Default value: `-1`.  | 
| verbose |  The verbosity of print messages, with higher levels corresponding to more detailed print statements. Valid values: integer, range: Positive integer. Default value: `1`.  | 

# Tune a CatBoost model
<a name="catboost-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. Model tuning focuses on the following hyperparameters:

**Note**  
The learning loss function is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column. For more information, see [CatBoost hyperparameters](catboost-hyperparameters.md).
+ A learning loss function to optimize during model training
+ An evaluation metric that is used to evaluate model performance during validation
+ A set of hyperparameters and a range of values for each to use when tuning the model automatically

Automatic model tuning searches your chosen hyperparameters to find the combination of values that results in a model that optimizes the chosen evaluation metric.

**Note**  
Automatic model tuning for CatBoost is only available from the Amazon SageMaker SDKs, not from the SageMaker AI console.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Evaluation metrics computed by the CatBoost algorithm
<a name="catboost-metrics"></a>

The SageMaker AI CatBoost algorithm computes the following metrics to use for model validation. The evaluation metric is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column.


| Metric Name | Description | Optimization Direction | Regex Pattern | 
| --- | --- | --- | --- | 
| RMSE | root mean square error | minimize | "bestTest = ([0-9\\.]+)" | 
| MAE | mean absolute error | minimize | "bestTest = ([0-9\\.]+)" | 
| MedianAbsoluteError | median absolute error | minimize | "bestTest = ([0-9\\.]+)" | 
| R2 | r2 score | maximize | "bestTest = ([0-9\\.]+)" | 
| Logloss | binary cross entropy | maximize | "bestTest = ([0-9\\.]+)" | 
| Precision | precision | maximize | "bestTest = ([0-9\\.]+)" | 
| Recall | recall | maximize | "bestTest = ([0-9\\.]+)" | 
| F1 | f1 score | maximize | "bestTest = ([0-9\\.]+)" | 
| AUC | auc score | maximize | "bestTest = ([0-9\\.]+)" | 
| MultiClass | multiclass cross entropy | maximize | "bestTest = ([0-9\\.]+)" | 
| Accuracy | accuracy | maximize | "bestTest = ([0-9\\.]+)" | 
| BalancedAccuracy | balanced accuracy | maximize | "bestTest = ([0-9\\.]+)" | 
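
Each regex pattern extracts the metric value from the CatBoost training log, which prints lines of the form `bestTest = <value>`. A minimal sketch of how such a pattern captures the value (the log line here is hypothetical):

```python
import re

# A pattern like the ones in the table, matching the numeric value
# that CatBoost prints after "bestTest = " in the training log.
pattern = re.compile(r"bestTest = ([0-9\.]+)")

log_line = "bestTest = 0.8765"  # hypothetical log line
match = pattern.search(log_line)
best_test = float(match.group(1))
```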

## Tunable CatBoost hyperparameters
<a name="catboost-tunable-hyperparameters"></a>

Tune the CatBoost model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the CatBoost evaluation metrics are: `learning_rate`, `depth`, `l2_leaf_reg`, and `random_strength`. For a list of all the CatBoost hyperparameters, see [CatBoost hyperparameters](catboost-hyperparameters.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| learning\_rate | ContinuousParameterRanges | MinValue: 0.001, MaxValue: 0.01 | 
| depth | IntegerParameterRanges | MinValue: 4, MaxValue: 10 | 
| l2\_leaf\_reg | IntegerParameterRanges | MinValue: 2, MaxValue: 10 | 
| random\_strength | ContinuousParameterRanges | MinValue: 0, MaxValue: 10 | 
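
To illustrate what a tuning job searches over, the following sketch draws random candidates from the recommended ranges above. This is illustration only: SageMaker automatic model tuning performs the search for you (for example with random or Bayesian search), and the function name here is hypothetical.

```python
import random

random.seed(0)

def sample_catboost_candidate():
    """Draw one hyperparameter candidate from the recommended
    tuning ranges in the table above (illustration only)."""
    return {
        "learning_rate": random.uniform(0.001, 0.01),
        "depth": random.randint(4, 10),
        "l2_leaf_reg": random.randint(2, 10),
        "random_strength": random.uniform(0.0, 10.0),
    }

candidates = [sample_catboost_candidate() for _ in range(10)]
```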

# Factorization Machines Algorithm
<a name="fact-machines"></a>

The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically. For example, in a click prediction system, the Factorization Machines model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation.

**Note**  
The Amazon SageMaker AI implementation of the Factorization Machines algorithm considers only pair-wise (2nd order) interactions between features.

**Topics**
+ [Input/Output Interface for the Factorization Machines Algorithm](#fm-inputoutput)
+ [EC2 Instance Recommendation for the Factorization Machines Algorithm](#fm-instances)
+ [Factorization Machines Sample Notebooks](#fm-sample-notebooks)
+ [How Factorization Machines Work](fact-machines-howitworks.md)
+ [Factorization Machines Hyperparameters](fact-machines-hyperparameters.md)
+ [Tune a Factorization Machines Model](fm-tuning.md)
+ [Factorization Machines Response Formats](fm-in-formats.md)

## Input/Output Interface for the Factorization Machines Algorithm
<a name="fm-inputoutput"></a>

The Factorization Machines algorithm can be run in either binary classification mode or regression mode. In each mode, a dataset can be provided to the **test** channel along with the train channel dataset. The scoring depends on the mode used. In regression mode, the testing dataset is scored using Root Mean Square Error (RMSE). In binary classification mode, the test dataset is scored using Binary Cross Entropy (Log Loss), Accuracy (at threshold=0.5), and F1 Score (at threshold=0.5).

For **training**, the Factorization Machines algorithm currently supports only the `recordIO-protobuf` format with `Float32` tensors. Because the use case for factorization machines is predominantly sparse data, `CSV` is not a good candidate. Both File and Pipe mode training are supported for recordIO-wrapped protobuf.

For **inference**, the Factorization Machines algorithm supports the `application/json` and `x-recordio-protobuf` formats. 
+ For the **binary classification** problem, the algorithm predicts a score and a label. The label is a number and can be either `0` or `1`. The score is a number that indicates how strongly the algorithm believes that the label should be `1`. The algorithm computes score first and then derives the label from the score value. If the score is greater than or equal to 0.5, the label is `1`.
+ For the **regression** problem, just a score is returned and it is the predicted value. For example, if Factorization Machines is used to predict a movie rating, score is the predicted rating value.

For more details on training and inference file formats, see [Factorization Machines Sample Notebooks](#fm-sample-notebooks).
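
The score-to-label rule for binary classification can be sketched as follows; the function name and threshold parameter are illustrative, not part of the SageMaker API:

```python
def label_from_score(score, threshold=0.5):
    """Derive the binary classification label from the score, as
    described above: scores at or above 0.5 map to label 1."""
    return 1 if score >= threshold else 0
```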

## EC2 Instance Recommendation for the Factorization Machines Algorithm
<a name="fm-instances"></a>

The Amazon SageMaker AI Factorization Machines algorithm is highly scalable and can train across distributed instances. We recommend training and inference with CPU instances for both sparse and dense datasets. In some circumstances, training with one or more GPUs on dense data might provide some benefit. Training with GPUs is available only on dense data. Use CPU instances for sparse data. The Factorization Machines algorithm supports P2, P3, G4dn, and G5 instances for training and inference.

## Factorization Machines Sample Notebooks
<a name="fm-sample-notebooks"></a>

For a sample notebook that uses the SageMaker AI Factorization Machines algorithm to analyze the images of handwritten digits from zero to nine in the MNIST dataset, see [An Introduction to Factorization Machines with MNIST](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/factorization_machines_mnist/factorization_machines_mnist.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). Once you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. Example notebooks that use the Factorization Machines algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, click on its **Use** tab and select **Create copy**.

# How Factorization Machines Work
<a name="fact-machines-howitworks"></a>

The prediction task for a Factorization Machines model is to estimate a function ŷ from a feature set xi to a target domain. This domain is real-valued for regression and binary for classification. The Factorization Machines model is supervised and so has a training dataset (xi, yi) available. The advantages this model presents lie in the way it uses a factorized parametrization to capture the pairwise feature interactions. It can be represented mathematically as follows: 

![\[An image containing the equation for the Factorization Machines model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/FM1.jpg)


 The three terms in this equation correspond respectively to the three components of the model: 
+ The w0 term represents the global bias.
+ The wi linear terms model the strength of the ith variable.
+ The <vi,vj> factorization terms model the pairwise interaction between the ith and jth variable.

The global bias and linear terms are the same as in a linear model. The pairwise feature interactions are modeled in the third term as the inner product of the corresponding factors learned for each feature. Learned factors can also be considered as embedding vectors for each feature. For example, in a classification task, if a pair of features tends to co-occur more often in positive labeled samples, then the inner product of their factors would be large. In other words, their embedding vectors would be close to each other in cosine similarity. For more information about the Factorization Machines model, see [Factorization Machines](https://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf).
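The three-term model above can be sketched in NumPy. The pairwise term uses a standard algebraic identity that reduces the cost from O(d²k) to O(dk); the function name and the shapes are illustrative, not the SageMaker implementation:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization Machines score for one example.

    x: (d,) feature vector; w0: global bias; w: (d,) linear weights;
    V: (d, k) matrix of k-dimensional factors, one row per feature.
    The pairwise term uses the identity
      sum_{i<j} <v_i, v_j> x_i x_j
        = 0.5 * sum_f [(sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2],
    which costs O(d*k) instead of O(d^2*k).
    """
    s = V.T @ x                                            # (k,)
    pairwise = 0.5 * (np.sum(s ** 2) - np.sum((V ** 2).T @ (x ** 2)))
    return w0 + w @ x + pairwise
```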

For regression tasks, the model is trained by minimizing the squared error between the model prediction ŷn and the target value yn. This is known as the square loss:

![\[An image containing the equation for square loss.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/FM2.jpg)


For a classification task, the model is trained by minimizing the cross entropy loss, also known as the log loss: 

![\[An image containing the equation for log loss.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/FM3.jpg)


where: 

![\[An image containing the logistic function of the predicted values.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/FM4.jpg)


For more information about loss functions for classification, see [Loss functions for classification](https://en.wikipedia.org/wiki/Loss_functions_for_classification).
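
The two training losses above can be written out directly; this is a hedged sketch where `y_hat` is the raw model output, `y` is the target (real-valued for square loss, 0/1 for log loss), and `p` is the logistic function of the prediction:

```python
import numpy as np

def square_loss(y_hat, y):
    """Mean squared error between predictions and targets (regression)."""
    return np.mean((y_hat - y) ** 2)

def log_loss(y_hat, y):
    """Binary cross entropy with p = sigmoid(y_hat) (classification)."""
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```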

# Factorization Machines Hyperparameters
<a name="fact-machines-hyperparameters"></a>

The following table contains the hyperparameters for the Factorization Machines algorithm. These are parameters that are set by users to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order.


| Parameter Name | Description | 
| --- | --- | 
| feature\_dim | The dimension of the input feature space. This could be very high with sparse input. **Required** Valid values: Positive integer. Suggested value range: [10000,10000000]  | 
| num\_factors | The dimensionality of factorization. **Required** Valid values: Positive integer. Suggested value range: [2,1000], 64 typically generates good outcomes and is a good starting point.  | 
| predictor\_type | The type of predictor. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines-hyperparameters.html) **Required** Valid values: String: `binary_classifier` or `regressor`  | 
| bias\_init\_method | The initialization method for the bias term: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines-hyperparameters.html) **Optional** Valid values: `uniform`, `normal`, or `constant` Default value: `normal`  | 
| bias\_init\_scale | Range for initialization of the bias term. Takes effect if `bias_init_method` is set to `uniform`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: None  | 
| bias\_init\_sigma | The standard deviation for initialization of the bias term. Takes effect if `bias_init_method` is set to `normal`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.01  | 
| bias\_init\_value | The initial value of the bias term. Takes effect if `bias_init_method` is set to `constant`.  **Optional** Valid values: Float. Suggested value range: [1e-8, 512]. Default value: None  | 
| bias\_lr | The learning rate for the bias term.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.1  | 
| bias\_wd | The weight decay for the bias term.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.01  | 
| clip\_gradient | Gradient clipping optimizer parameter. Clips the gradient by projecting onto the interval [-`clip_gradient`, +`clip_gradient`].  **Optional** Valid values: Float Default value: None  | 
| epochs | The number of training epochs to run.  **Optional** Valid values: Positive integer Default value: 1  | 
| eps | Epsilon parameter to avoid division by 0. **Optional** Valid values: Float. Suggested value: small. Default value: None  | 
| factors\_init\_method | The initialization method for factorization terms: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines-hyperparameters.html) **Optional** Valid values: `uniform`, `normal`, or `constant`. Default value: `normal`  | 
| factors\_init\_scale  | The range for initialization of factorization terms. Takes effect if `factors_init_method` is set to `uniform`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: None  | 
| factors\_init\_sigma | The standard deviation for initialization of factorization terms. Takes effect if `factors_init_method` is set to `normal`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.001  | 
| factors\_init\_value | The initial value of factorization terms. Takes effect if `factors_init_method` is set to `constant`.  **Optional** Valid values: Float. Suggested value range: [1e-8, 512]. Default value: None  | 
| factors\_lr | The learning rate for factorization terms.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.0001  | 
| factors\_wd | The weight decay for factorization terms.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.00001  | 
| linear\_lr | The learning rate for linear terms.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.001  | 
| linear\_init\_method | The initialization method for linear terms: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/fact-machines-hyperparameters.html) **Optional** Valid values: `uniform`, `normal`, or `constant`. Default value: `normal`  | 
| linear\_init\_scale | Range for initialization of linear terms. Takes effect if `linear_init_method` is set to `uniform`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: None  | 
| linear\_init\_sigma | The standard deviation for initialization of linear terms. Takes effect if `linear_init_method` is set to `normal`.  **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.01  | 
| linear\_init\_value | The initial value of linear terms. Takes effect if `linear_init_method` is set to `constant`.  **Optional** Valid values: Float. Suggested value range: [1e-8, 512]. Default value: None  | 
| linear\_wd | The weight decay for linear terms. **Optional** Valid values: Non-negative float. Suggested value range: [1e-8, 512]. Default value: 0.001  | 
| mini\_batch\_size | The size of mini-batch used for training.  **Optional** Valid values: Positive integer Default value: 1000  | 
| rescale\_grad |  Gradient rescaling optimizer parameter. If set, multiplies the gradient with `rescale_grad` before updating. Often chosen to be 1.0/`batch_size`.  **Optional** Valid values: Float Default value: None  | 

# Tune a Factorization Machines Model
<a name="fm-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the Factorization Machines Algorithm
<a name="fm-metrics"></a>

The Factorization Machines algorithm has both binary classification and regression predictor types. The predictor type determines which metric you can use for automatic model tuning. The algorithm reports a `test:rmse` regressor metric, which is computed during training. When tuning the model for regression tasks, choose this metric as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:rmse | Root Mean Square Error | Minimize | 

The Factorization Machines algorithm reports three binary classification metrics, which are computed during training. When tuning the model for binary classification tasks, choose one of these as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:binary\_classification\_accuracy | Accuracy | Maximize | 
| test:binary\_classification\_cross\_entropy | Cross Entropy | Minimize | 
| test:binary\_f\_beta | F-beta score | Maximize | 

## Tunable Factorization Machines Hyperparameters
<a name="fm-tunable-hyperparameters"></a>

You can tune the following hyperparameters for the Factorization Machines algorithm. The initialization parameters that contain the terms bias, linear, and factorization depend on their initialization method. There are three initialization methods: `uniform`, `normal`, and `constant`. These initialization methods are not themselves tunable. The parameters that are tunable are dependent on this choice of the initialization method. For example, if the initialization method is `uniform`, then only the `scale` parameters are tunable. Specifically, if `bias_init_method==uniform`, then `bias_init_scale`, `linear_init_scale`, and `factors_init_scale` are tunable. Similarly, if the initialization method is `normal`, then only `sigma` parameters are tunable. If the initialization method is `constant`, then only `value` parameters are tunable. These dependencies are listed in the following table. 


| Parameter Name | Parameter Type | Recommended Ranges | Dependency | 
| --- | --- | --- | --- | 
| bias\_init\_scale | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==uniform | 
| bias\_init\_sigma | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==normal | 
| bias\_init\_value | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==constant | 
| bias\_lr | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| bias\_wd | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| epoch | IntegerParameterRange | MinValue: 1, MaxValue: 1000 | None | 
| factors\_init\_scale | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==uniform | 
| factors\_init\_sigma | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==normal | 
| factors\_init\_value | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==constant | 
| factors\_lr | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| factors\_wd | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| linear\_init\_scale | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==uniform | 
| linear\_init\_sigma | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==normal | 
| linear\_init\_value | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | bias\_init\_method==constant | 
| linear\_lr | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| linear\_wd | ContinuousParameterRange | MinValue: 1e-8, MaxValue: 512 | None | 
| mini\_batch\_size | IntegerParameterRange | MinValue: 100, MaxValue: 10000 | None | 
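
The dependency rule described in this section can be expressed as a simple lookup: for each initialization method, only the matching `scale`, `sigma`, or `value` parameters become tunable. The dictionary below is an illustrative restatement of the table, not a SageMaker API:

```python
# Which init-dependent FM hyperparameters are tunable for each
# choice of initialization method, per the dependency column above.
TUNABLE_BY_INIT_METHOD = {
    "uniform": ("bias_init_scale", "linear_init_scale", "factors_init_scale"),
    "normal": ("bias_init_sigma", "linear_init_sigma", "factors_init_sigma"),
    "constant": ("bias_init_value", "linear_init_value", "factors_init_value"),
}

def tunable_init_params(init_method):
    """Return the tunable init parameters for a given init method."""
    return TUNABLE_BY_INIT_METHOD[init_method]
```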

# Factorization Machines Response Formats
<a name="fm-in-formats"></a>

Amazon SageMaker AI provides several response formats for getting inference from the Factorization Machines model, such as JSON, JSONLINES, and RECORDIO, with specific structures for binary classification and regression tasks.

## JSON Response Format
<a name="fm-json"></a>

Binary classification

```
let response =   {
    "predictions":    [
        {
            "score": 0.4,
            "predicted_label": 0
        } 
    ]
}
```

Regression

```
let response =   {
    "predictions":    [
        {
            "score": 0.4
        } 
    ]
}
```

## JSONLINES Response Format
<a name="fm-jsonlines"></a>

Binary classification

```
{"score": 0.4, "predicted_label": 0}
```

Regression

```
{"score": 0.4}
```

## RECORDIO Response Format
<a name="fm-recordio"></a>

Binary classification

```
[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.4]  # float32
            },
            'predicted_label': {
                keys: [],
                values: [0.0]  # float32
            }
        }
    }
]
```

Regression

```
[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.4]  # float32
            }   
        }
    }
]
```

# K-Nearest Neighbors (k-NN) Algorithm
<a name="k-nearest-neighbors"></a>

Amazon SageMaker AI k-nearest neighbors (k-NN) algorithm is an index-based algorithm. It uses a non-parametric method for classification or regression. For classification problems, the algorithm queries the *k* points that are closest to the sample point and returns the most frequently used label of their class as the predicted label. For regression problems, the algorithm queries the *k* closest points to the sample point and returns the average of their feature values as the predicted value. 

Training with the k-NN algorithm has three steps: sampling, dimension reduction, and index building. Sampling reduces the size of the initial dataset so that it fits into memory. For dimension reduction, the algorithm decreases the feature dimension of the data to reduce the footprint of the k-NN model in memory and inference latency. Two dimension reduction methods are provided: random projection and the fast Johnson-Lindenstrauss transform. Typically, you use dimension reduction for high-dimensional (d >1000) datasets to avoid the “curse of dimensionality” that troubles the statistical analysis of data that becomes sparse as dimensionality increases. The main objective of k-NN's training is to construct the index. The index enables efficient lookups of distances between points whose values or class labels have not yet been determined and the k nearest points to use for inference.

**Topics**
+ [Input/Output Interface for the k-NN Algorithm](#kNN-input_output)
+ [k-NN Sample Notebooks](#kNN-sample-notebooks)
+ [How the k-NN Algorithm Works](kNN_how-it-works.md)
+ [EC2 Instance Recommendation for the k-NN Algorithm](#kNN-instances)
+ [k-NN Hyperparameters](kNN_hyperparameters.md)
+ [Tune a k-NN Model](kNN-tuning.md)
+ [Data Formats for k-NN Training Input](kNN-in-formats.md)
+ [k-NN Request and Response Formats](kNN-inference-formats.md)

## Input/Output Interface for the k-NN Algorithm
<a name="kNN-input_output"></a>

SageMaker AI k-NN supports train and test data channels.
+ Use a *train channel* for data that you want to sample and construct into the k-NN index.
+ Use a *test channel* to emit scores in log files. Scores are listed as one line per mini-batch: accuracy for `classifier`, mean squared error (MSE) for `regressor`.

For training inputs, k-NN supports `text/csv` and `application/x-recordio-protobuf` data formats. For input type `text/csv`, the first `label_size` columns are interpreted as the label vector for that row. You can use either File mode or Pipe mode to train models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.

For inference inputs, k-NN supports the `application/json`, `application/x-recordio-protobuf`, and `text/csv` data formats. The `text/csv` format accepts a `label_size` and encoding parameter. It assumes a `label_size` of 0 and a UTF-8 encoding.

For inference outputs, k-NN supports the `application/json` and `application/x-recordio-protobuf` data formats. These two data formats also support a verbose output mode. In verbose output mode, the API provides the search results with the distances vector sorted from smallest to largest, and corresponding elements in the labels vector.

For batch transform, k-NN supports the `application/jsonlines` data format for both input and output. An example input is as follows:

```
content-type: application/jsonlines

{"features": [1.5, 16.0, 14.0, 23.0]}
{"data": {"features": {"values": [1.5, 16.0, 14.0, 23.0]}}}
```

An example output is as follows:

```
accept: application/jsonlines

{"predicted_label": 0.0}
{"predicted_label": 2.0}
```
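
Building such a payload and parsing the response takes only the standard `json` module, since each line is an independent JSON object. The values here are illustrative:

```python
import json

# Build a batch-transform input body, one JSON object per line.
rows = [[1.5, 16.0, 14.0, 23.0], [2.0, 11.0, 9.0, 5.0]]
request_body = "\n".join(json.dumps({"features": r}) for r in rows)

# Parse a response body in the same one-object-per-line format.
response_body = '{"predicted_label": 0.0}\n{"predicted_label": 2.0}'
labels = [json.loads(line)["predicted_label"]
          for line in response_body.splitlines()]
```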

For more information on input and output file formats, see [Data Formats for k-NN Training Input](kNN-in-formats.md) for training, [k-NN Request and Response Formats](kNN-inference-formats.md) for inference, and the [k-NN Sample Notebooks](#kNN-sample-notebooks).

## k-NN Sample Notebooks
<a name="kNN-sample-notebooks"></a>

For a sample notebook that uses the SageMaker AI k-nearest neighbor algorithm to predict wilderness cover types from geological and forest service data, see [K-Nearest Neighbor Covertype](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/k_nearest_neighbors_covtype/k_nearest_neighbors_covtype.html). 

Use a Jupyter notebook instance to run the example in SageMaker AI. To learn how to create and open a Jupyter notebook instance in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). Once you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI example notebooks. Find K-Nearest Neighbor notebooks in the **Introduction to Amazon algorithms** section. To open a notebook, click on its **Use** tab and select **Create copy**.

# How the k-NN Algorithm Works
<a name="kNN_how-it-works"></a>

The Amazon SageMaker AI k-nearest neighbors (k-NN) algorithm follows a multi-step training process which includes sampling the input data, performing dimension reduction, and building an index. The indexed data is then used during inference to efficiently find the k-nearest neighbors for a given data point and make predictions based on the neighboring labels or values.

## Step 1: Sample
<a name="step1-k-NN-sampling"></a>

To specify the total number of data points to be sampled from the training dataset, use the `sample_size` parameter. For example, if the initial dataset has 1,000 data points, `sample_size` is set to 100, and the total number of instances is 2, then each worker samples 50 points, so that a total set of 100 data points is collected. Sampling runs in linear time with respect to the number of data points. 

## Step 2: Perform Dimension Reduction
<a name="step2-kNN-dim-reduction"></a>

The current implementation of the k-NN algorithm has two methods of dimension reduction. You specify the method in the `dimension_reduction_type` hyperparameter. The `sign` method specifies a random projection, which uses a linear projection using a matrix of random signs, and the `fjlt` method specifies a fast Johnson-Lindenstrauss transform, a method based on the Fourier transform. Both methods preserve the L2 and inner product distances. The `fjlt` method should be used when the target dimension is large and has better performance with CPU inference. The methods differ in their computational complexity. The `sign` method requires O(ndk) time to reduce the dimension of a batch of n points of dimension d into a target dimension k. The `fjlt` method requires O(nd log(d)) time, but the constants involved are larger. Using dimension reduction introduces noise into the data and this noise can reduce prediction accuracy.
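
A random-sign projection of the kind the `sign` method describes can be sketched in a few lines of NumPy. This is an illustrative sketch, not the SageMaker implementation; the function name and scaling convention are assumptions:

```python
import numpy as np

def sign_projection(X, target_dim, seed=0):
    """Project (n, d) data to (n, target_dim) with a random +/-1
    matrix, scaled by 1/sqrt(target_dim) so that L2 distances are
    approximately preserved. Cost is O(n*d*target_dim), matching
    the complexity of the `sign` method noted above."""
    rng = np.random.default_rng(seed)
    R = rng.choice([-1.0, 1.0], size=(X.shape[1], target_dim))
    return (X @ R) / np.sqrt(target_dim)
```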

## Step 3: Build an Index
<a name="step3-kNN-build-index"></a>

During inference, the algorithm queries the index for the k-nearest-neighbors of a sample point. Based on the references to the points, the algorithm makes the classification or regression prediction. It makes the prediction based on the class labels or values provided. k-NN provides three different types of indexes: a flat index, an inverted index, and an inverted index with product quantization. You specify the type with the `index_type` parameter.

## Serialize the Model
<a name="kNN-model-serialization"></a>

When the k-NN algorithm finishes training, it serializes three files to prepare for inference. 
+ model\_algo-1: Contains the serialized index for computing the nearest neighbors.
+ model\_algo-1.labels: Contains serialized labels (np.float32 binary format) for computing the predicted label based on the query result from the index.
+ model\_algo-1.json: Contains the JSON-formatted model metadata which stores the `k` and `predictor_type` hyperparameters from training for inference along with other relevant state.

With the current implementation of k-NN, you can modify the metadata file to change the way predictions are computed. For example, you can change `k` to 10 or change `predictor_type` to *regressor*.

```
{
  "k": 5,
  "predictor_type": "classifier",
  "dimension_reduction": {"type": "sign", "seed": 3, "target_dim": 10, "input_dim": 20},
  "normalize": false,
  "version": "1.0"
}
```

## EC2 Instance Recommendation for the k-NN Algorithm
<a name="kNN-instances"></a>

We recommend training on a CPU instance (such as ml.m5.2xlarge) or on a GPU instance. The k-NN algorithm supports P2, P3, G4dn, and G5 GPU instance families for training and inference.

Inference requests from CPUs generally have a lower average latency than requests from GPUs because there is a tax on CPU-to-GPU communication when you use GPU hardware. However, GPUs generally have higher throughput for larger batches.

# k-NN Hyperparameters
<a name="kNN_hyperparameters"></a>

The following table lists the hyperparameters that you can set for the Amazon SageMaker AI k-nearest neighbors (k-NN) algorithm.


| Parameter Name | Description | 
| --- | --- | 
| feature\$1dim |  The number of features in the input data. **Required** Valid values: positive integer.  | 
| k |  The number of nearest neighbors. **Required** Valid values: positive integer  | 
| predictor\$1type |  The type of inference to use on the data labels. **Required** Valid values: *classifier* for classification or *regressor* for regression.  | 
| sample\$1size |  The number of data points to be sampled from the training data set.  **Required** Valid values: positive integer  | 
| dimension\$1reduction\$1target |  The target dimension to reduce to. **Required** when you specify the `dimension_reduction_type` parameter. Valid values: positive integer greater than 0 and less than `feature_dim`.  | 
| dimension\$1reduction\$1type |  The type of dimension reduction method.  **Optional** Valid values: *sign* for random projection or *fjlt* for the fast Johnson-Lindenstrauss transform. Default value: No dimension reduction  | 
| faiss\$1index\$1ivf\$1nlists |  The number of centroids to construct in the index when `index_type` is *faiss.IVFFlat* or *faiss.IVFPQ*. **Optional** Valid values: positive integer Default value: *auto*, which resolves to `sqrt(sample_size)`.  | 
| faiss\_index\_pq\_m |  The number of vector sub-components to construct in the index when `index_type` is set to *faiss.IVFPQ*.  The Facebook AI Similarity Search (FAISS) library requires that the value of `faiss_index_pq_m` is a divisor of the data dimension. If `faiss_index_pq_m` is not a divisor of the data dimension, we increase the data dimension to the smallest integer divisible by `faiss_index_pq_m`. If no dimension reduction is applied, the algorithm adds a padding of zeros. If dimension reduction is applied, the algorithm increases the value of the `dimension_reduction_target` hyper-parameter. **Optional** Valid values: One of the following positive integers: 1, 2, 3, 4, 8, 12, 16, 20, 24, 28, 32, 40, 48, 56, 64, 96  | 
| index\_metric |  The metric to measure the distance between points when finding nearest neighbors. When training with `index_type` set to `faiss.IVFPQ`, the `INNER_PRODUCT` distance and `COSINE` similarity are not supported. **Optional**  Valid values: *L2* for Euclidean-distance, *INNER\_PRODUCT* for inner-product distance, *COSINE* for cosine similarity. Default value: *L2*  | 
| index\$1type |  The type of index. **Optional** Valid values: *faiss.Flat*, *faiss.IVFFlat*, *faiss.IVFPQ*. Default values: *faiss.Flat*  | 
| mini\$1batch\$1size |  The number of observations per mini-batch for the data iterator.  **Optional** Valid values: positive integer Default value: 5000  | 

# Tune a k-NN Model
<a name="kNN-tuning"></a>

The Amazon SageMaker AI k-nearest neighbors algorithm is a supervised algorithm. The algorithm consumes a test dataset and emits an accuracy metric for classification tasks or a mean squared error metric for regression tasks. These metrics compare the model predictions for their respective task to the ground truth provided by the empirical test data. To find the best model that reports the highest accuracy or lowest error on the test dataset, run a hyperparameter tuning job for k-NN.

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric appropriate for the prediction task of the algorithm. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric. The hyperparameters are used only to help estimate model parameters and are not used by the trained model to make predictions.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the k-NN Algorithm
<a name="km-metrics"></a>

The k-nearest neighbors algorithm computes one of two metrics in the following table during training, depending on the type of task specified by the `predictor_type` hyperparameter.
+ *classifier* specifies a classification task and computes `test:accuracy`.
+ *regressor* specifies a regression task and computes `test:mse`.

When tuning a model, choose the `predictor_type` value appropriate for your task so that the relevant objective metric is calculated.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:accuracy |  When `predictor_type` is set to *classifier*, k-NN compares the predicted label, based on the average of the k-nearest neighbors' labels, to the ground truth label provided in the test channel data. The accuracy reported ranges from 0.0 (0%) to 1.0 (100%).  |  Maximize  | 
| test:mse |  When `predictor_type` is set to *regressor*, k-NN compares the predicted label, based on the average of the k-nearest neighbors' labels, to the ground truth label provided in the test channel data. The mean squared error is computed by comparing the two labels.  |  Minimize  | 



## Tunable k-NN Hyperparameters
<a name="km-tunable-hyperparameters"></a>

Tune the Amazon SageMaker AI k-nearest neighbor model with the following hyperparameters.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| k |  IntegerParameterRanges  |  MinValue: 1, MaxValue: 1024  | 
| sample\_size |  IntegerParameterRanges  |  MinValue: 256, MaxValue: 20000000  | 
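
These recommended ranges map directly onto the `ParameterRanges` structure used by the SageMaker `CreateHyperParameterTuningJob` API. The following minimal sketch shows only that structure and an objective for a *classifier* task; the rest of the tuning-job configuration (training job definition, resource limits) is omitted:

```python
# Recommended k-NN tuning ranges in the ParameterRanges structure of the
# CreateHyperParameterTuningJob API. MinValue/MaxValue are strings in this API.
knn_parameter_ranges = {
    "IntegerParameterRanges": [
        {"Name": "k", "MinValue": "1", "MaxValue": "1024"},
        {"Name": "sample_size", "MinValue": "256", "MaxValue": "20000000"},
    ],
    "ContinuousParameterRanges": [],
    "CategoricalParameterRanges": [],
}

# The objective metric depends on predictor_type: test:accuracy (Maximize)
# for classifier, or test:mse (Minimize) for regressor.
knn_objective = {"Type": "Maximize", "MetricName": "test:accuracy"}
```

For a *regressor* task, swap the objective to `{"Type": "Minimize", "MetricName": "test:mse"}`.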

# Data Formats for k-NN Training Input
<a name="kNN-in-formats"></a>

All Amazon SageMaker AI built-in algorithms adhere to the common input training formats described in [Common Data Formats - Training](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-training.html). This topic contains a list of the available input formats for the SageMaker AI k-nearest-neighbor algorithm.

## CSV Data Format
<a name="kNN-training-data-csv"></a>

content-type: text/csv; label\_size=1

```
4,1.2,1.3,9.6,20.3
```

The first `label_size` columns are interpreted as the label vector for that row.
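
For example, a training row with label `4` and four features can be assembled with the standard library as follows (a minimal sketch; the values match the example above):

```python
import csv
import io

# One labeled observation: with label_size=1, the label (4) occupies the
# first column and the remaining columns are features.
label = 4
features = [1.2, 1.3, 9.6, 20.3]

buf = io.StringIO()
csv.writer(buf).writerow([label] + features)
row = buf.getvalue().strip()
print(row)  # 4,1.2,1.3,9.6,20.3
```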

## RECORDIO Data Format
<a name="kNN-training-data-recordio"></a>

content-type: application/x-recordio-protobuf

```
[
    Record = {
        features = {
            'values': {
                values: [1.2, 1.3, 9.6, 20.3]  # float32
            }
        },
        label = {
            'values': {
                values: [4]  # float32
            }
        }
    }
]
```

# k-NN Request and Response Formats
<a name="kNN-inference-formats"></a>

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). This topic contains a list of the available output formats for the SageMaker AI k-nearest-neighbor algorithm.

## INPUT: CSV Request Format
<a name="kNN-input-csv"></a>

content-type: text/csv

```
1.2,1.3,9.6,20.3
```

This content type does not accept a `label_size` or encoding parameter; it assumes a `label_size` of 0 and a UTF-8 encoding.
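
A request body in this format is simply the feature values joined with commas. The following sketch builds the payload; the endpoint name in the commented `invoke_endpoint` call is hypothetical:

```python
# Build a text/csv inference payload (no label column, label_size of 0).
features = [1.2, 1.3, 9.6, 20.3]
payload = ",".join(str(v) for v in features)
print(payload)  # 1.2,1.3,9.6,20.3

# With boto3, the payload would be sent to a deployed endpoint like this:
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="my-knn-endpoint",  # hypothetical endpoint name
#     ContentType="text/csv",
#     Body=payload,
# )
```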

## INPUT: JSON Request Format
<a name="kNN-input-json"></a>

content-type: application/json

```
{
  "instances": [
    {"data": {"features": {"values": [-3, -1, -4, 2]}}},
    {"features": [3.0, 0.1, 0.04, 0.002]}
  ]
}
```

## INPUT: JSONLINES Request Format
<a name="kNN-input-jsonlines"></a>

content-type: application/jsonlines

```
{"features": [1.5, 16.0, 14.0, 23.0]}
{"data": {"features": {"values": [1.5, 16.0, 14.0, 23.0]}}}
```

## INPUT: RECORDIO Request Format
<a name="kNN-input-recordio"></a>

content-type: application/x-recordio-protobuf

```
[
    Record = {
        features = {
            'values': {
                values: [-3, -1, -4, 2]  # float32
            }
        },
        label = {}
    },
    Record = {
        features = {
            'values': {
                values: [3.0, 0.1, 0.04, 0.002]  # float32
            }
        },
        label = {}
    },
]
```

## OUTPUT: JSON Response Format
<a name="kNN-output-json"></a>

accept: application/json

```
{
  "predictions": [
    {"predicted_label": 0.0},
    {"predicted_label": 2.0}
  ]
}
```

## OUTPUT: JSONLINES Response Format
<a name="kNN-output-jsonlines"></a>

accept: application/jsonlines

```
{"predicted_label": 0.0}
{"predicted_label": 2.0}
```

## OUTPUT: VERBOSE JSON Response Format
<a name="KNN-output-verbose-json"></a>

In verbose mode, the API provides the search results with the distances vector sorted from smallest to largest, with corresponding elements in the labels vector. In this example, k is set to 3.

accept: application/json; verbose=true

```
{
  "predictions": [
    {
        "predicted_label": 0.0,
        "distances": [3.11792408, 3.89746071, 6.32548437],
        "labels": [0.0, 1.0, 0.0]
    },
    {
        "predicted_label": 2.0,
        "distances": [1.08470316, 3.04917915, 5.25393973],
        "labels": [2.0, 2.0, 0.0]
    }
  ]
}
```
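
In the classifier case, the `predicted_label` is consistent with a majority vote over the returned `labels`. The following sketch parses the verbose response shown above and recomputes that vote:

```python
import json
from collections import Counter

# The verbose JSON response from the example above.
verbose_response = """
{
  "predictions": [
    {"predicted_label": 0.0,
     "distances": [3.11792408, 3.89746071, 6.32548437],
     "labels": [0.0, 1.0, 0.0]},
    {"predicted_label": 2.0,
     "distances": [1.08470316, 3.04917915, 5.25393973],
     "labels": [2.0, 2.0, 0.0]}
  ]
}
"""

predictions = json.loads(verbose_response)["predictions"]
for p in predictions:
    # Majority vote over the k nearest labels matches predicted_label.
    majority = Counter(p["labels"]).most_common(1)[0][0]
    assert majority == p["predicted_label"]
    # Distances arrive sorted from smallest to largest.
    assert p["distances"] == sorted(p["distances"])
```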

## OUTPUT: RECORDIO-PROTOBUF Response Format
<a name="kNN-output-recordio-protobuf"></a>

accept: application/x-recordio-protobuf

```
[
    Record = {
        features = {},
        label = {
            'predicted_label': {
                values: [0.0]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'predicted_label': {
                values: [2.0]  # float32
            }
        }
    }
]
```

## OUTPUT: VERBOSE RECORDIO-PROTOBUF Response Format
<a name="kNN-output-verbose-recordio"></a>

In verbose mode, the API provides the search results with the distances vector sorted from smallest to largest, with corresponding elements in the labels vector. In this example, k is set to 3.

accept: application/x-recordio-protobuf; verbose=true

```
[
    Record = {
        features = {},
        label = {
            'predicted_label': {
                values: [0.0]  # float32
            },
            'distances': {
                values: [3.11792408, 3.89746071, 6.32548437]  # float32
            },
            'labels': {
                values: [0.0, 1.0, 0.0]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'predicted_label': {
                values: [2.0]  # float32
            },
            'distances': {
                values: [1.08470316, 3.04917915, 5.25393973]  # float32
            },
            'labels': {
                values: [2.0, 2.0, 0.0]  # float32
            }
        }
    }
]
```

## SAMPLE OUTPUT for the k-NN Algorithm
<a name="kNN-sample-output"></a>

For regressor tasks:

```
[06/08/2018 20:15:33 INFO 140026520049408] #test_score (algo-1) : ('mse', 0.013333333333333334)
```

For classifier tasks:

```
[06/08/2018 20:15:46 INFO 140285487171328] #test_score (algo-1) : ('accuracy', 0.98666666666666669)
```

# LightGBM
<a name="lightgbm"></a>

[LightGBM](https://lightgbm.readthedocs.io/en/latest/) is a popular and efficient open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. LightGBM uses additional techniques to significantly improve the efficiency and scalability of conventional GBDT. This page includes information about Amazon EC2 instance recommendations and sample notebooks for LightGBM.

# How to use SageMaker AI LightGBM
<a name="lightgbm-modes"></a>

You can use LightGBM as an Amazon SageMaker AI built-in algorithm. The following section describes how to use LightGBM with the SageMaker Python SDK. For information on how to use LightGBM from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).
+ **Use LightGBM as a built-in algorithm**

  Use the LightGBM built-in algorithm to build a LightGBM training container as shown in the following code example. You can retrieve the LightGBM built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API (or the `get_image_uri` API if using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 2). 

  After specifying the LightGBM image URI, you can use the LightGBM container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. The LightGBM built-in algorithm runs in script mode, but the training script is provided for you and there is no need to replace it. If you have extensive experience using script mode to create a SageMaker training job, then you can incorporate your own LightGBM training scripts.

  ```
  import sagemaker
  from sagemaker import image_uris, model_uris, script_uris
  
  # Session, Region, and execution role used later in this example
  sess = sagemaker.Session()
  aws_region = sess.boto_region_name
  aws_role = sagemaker.get_execution_role()
  
  train_model_id, train_model_version, train_scope = "lightgbm-classification-model", "*", "training"
  training_instance_type = "ml.m5.xlarge"
  
  # Retrieve the docker image
  train_image_uri = image_uris.retrieve(
      region=None,
      framework=None,
      model_id=train_model_id,
      model_version=train_model_version,
      image_scope=train_scope,
      instance_type=training_instance_type
  )
  
  # Retrieve the training script
  train_source_uri = script_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
  )
  
  train_model_uri = model_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
  )
  
  # Sample training data is available in this bucket
  training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
  training_data_prefix = "training-datasets/tabular_multiclass"
  
  training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train" 
  validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation" 
  
  output_bucket = sess.default_bucket()
  output_prefix = "jumpstart-example-tabular-training"
  
  s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
  
  from sagemaker import hyperparameters
  
  # Retrieve the default hyperparameters for training the model
  hyperparameters = hyperparameters.retrieve_default(
      model_id=train_model_id, model_version=train_model_version
  )
  
  # [Optional] Override default hyperparameters with custom values
  hyperparameters["num_boost_round"] = "500"
  print(hyperparameters)
  
  from sagemaker.estimator import Estimator
  from sagemaker.utils import name_from_base
  
  training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
  
  # Create SageMaker Estimator instance
  tabular_estimator = Estimator(
      role=aws_role,
      image_uri=train_image_uri,
      source_dir=train_source_uri,
      model_uri=train_model_uri,
      entry_point="transfer_learning.py",
      instance_count=1, # for distributed training, specify an instance_count greater than 1
      instance_type=training_instance_type,
      max_run=360000,
      hyperparameters=hyperparameters,
      output_path=s3_output_location
  )
  
  # Launch a SageMaker Training job by passing the S3 path of the training data
  tabular_estimator.fit(
      {
          "train": training_dataset_s3_path,
          "validation": validation_dataset_s3_path,
      }, logs=True, job_name=training_job_name
  )
  ```

  For more information about how to set up LightGBM as a built-in algorithm, see the following notebook examples.
  + [Tabular classification with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.ipynb)
  + [Tabular regression with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Regression_LightGBM_CatBoost.ipynb)

# Input and Output interface for the LightGBM algorithm
<a name="InputOutput-LightGBM"></a>

Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of LightGBM supports CSV for training and inference:
+ For **Training ContentType**, valid inputs must be *text/csv*.
+ For **Inference ContentType**, valid inputs must be *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.   
For CSV inference, the algorithm assumes that CSV input does not have the label column. 
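
As a sketch, the following writes a small training CSV in the expected layout (target value first, no header record); the file name and data are illustrative only:

```python
import csv

# Each row: target value first, then the feature columns; no header record.
rows = [
    [1, 0.5, 3.2, 7.1],
    [0, 1.4, 0.9, 2.6],
]

with open("train_data.csv", "w", newline="") as f:  # illustrative file name
    csv.writer(f).writerows(rows)

with open("train_data.csv") as f:
    first_line = f.readline().strip()
print(first_line)  # 1,0.5,3.2,7.1
```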

**Input format for training data, validation data, and categorical features**

Be mindful of how to format your training data for input to the LightGBM model. You must provide the path to an Amazon S3 bucket that contains your training and validation data. You can also include a list of categorical features. Use both the `train` and `validation` channels to provide your input data. Alternatively, you can use only the `train` channel.

**Note**  
Both `train` and `training` are valid channel names for LightGBM training.

**Use both the `train` and `validation` channels**

You can provide your input data by way of two S3 paths, one for the `train` channel and one for the `validation` channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a full S3 path pointing to one specific CSV file. The target variables should be in the first column of your CSV file. The predictor variables (features) should be in the remaining columns. If multiple CSV files are provided for the `train` or `validation` channels, the LightGBM algorithm concatenates the files. The validation data is used to compute a validation score at the end of each boosting iteration. Early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your training data file or files. If you provide a JSON file for categorical features, your `train` channel must point to an S3 prefix and not a specific CSV file. This file should contain a Python dictionary where the key is the string `"cat_index_list"` and the value is a list of unique integers. Each integer in the value list should indicate the column index of the corresponding categorical features in your training data CSV file. Each value should be a positive integer (greater than zero because zero represents the target value), less than the `Int32.MaxValue` (2147483647), and less than the total number of columns. There should only be one categorical index JSON file.
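
For example, if columns 1 and 3 of your training CSV hold categorical features, the file could be generated as follows (a minimal sketch; the column indices are illustrative):

```python
import json

# Column indices of the categorical features in the training CSV.
# Index 0 is the target column, so valid indices start at 1.
categorical_columns = [1, 3]

with open("categorical_index.json", "w") as f:
    json.dump({"cat_index_list": categorical_columns}, f)

# Read it back to confirm the expected structure.
with open("categorical_index.json") as f:
    loaded = json.load(f)
print(loaded)  # {'cat_index_list': [1, 3]}
```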

**Use only the `train` channel**:

You can alternatively provide your input data by way of a single S3 path for the `train` channel. This S3 path should point to a directory with a subdirectory named `train/` that contains one or more CSV files. You can optionally include another subdirectory in the same location called `validation/` that also has one or more CSV files. If the validation data is not provided, then 20% of your training data is randomly sampled to serve as the validation data. If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your data subdirectories.

**Note**  
For CSV training input mode, the total memory available to the algorithm (instance count multiplied by the memory available in the `InstanceType`) must be able to hold the training dataset.

SageMaker AI LightGBM uses the Python Joblib module to serialize or deserialize the model, which can be used for saving or loading the model.

**To use a model trained with SageMaker AI LightGBM with the joblib module**
+ Use the following Python code:

  ```
  import joblib 
  import tarfile
  
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()
  
  # model_file_path points to the model file extracted from model.tar.gz
  model = joblib.load(model_file_path)
  
  # prediction with test data
  # dtest should be a pandas DataFrame with column names feature_0, feature_1, ..., feature_d
  pred = model.predict(dtest)
  ```

## Amazon EC2 instance recommendation for the LightGBM algorithm
<a name="Instance-LightGBM"></a>

SageMaker AI LightGBM currently supports single-instance and multi-instance CPU training. For multi-instance CPU training (distributed training), specify an `instance_count` greater than 1 when you define your Estimator. For more information on distributed training with LightGBM, see [Amazon SageMaker AI LightGBM Distributed training using Dask](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/sagemaker_lightgbm_distributed_training_dask/sagemaker-lightgbm-distributed-training-dask.html).

LightGBM is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice than a compute-optimized instance (for example, C5). Further, we recommend that you have enough total memory in selected instances to hold the training data. 

## LightGBM sample notebooks
<a name="lightgbm-sample-notebooks"></a>

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI LightGBM algorithm.



| **Notebook Title** | **Description** | 
| --- | --- | 
|  [Tabular classification with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Classification_LightGBM_CatBoost.html)  |  This notebook demonstrates the use of the Amazon SageMaker AI LightGBM algorithm to train and host a tabular classification model.   | 
|  [Tabular regression with Amazon SageMaker AI LightGBM and CatBoost algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/lightgbm_catboost_tabular/Amazon_Tabular_Regression_LightGBM_CatBoost.html)  |  This notebook demonstrates the use of the Amazon SageMaker AI LightGBM algorithm to train and host a tabular regression model.   | 
|  [Amazon SageMaker AI LightGBM Distributed training using Dask](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/sagemaker_lightgbm_distributed_training_dask/sagemaker-lightgbm-distributed-training-dask.html)  |  This notebook demonstrates distributed training with the Amazon SageMaker AI LightGBM algorithm using the Dask framework.  | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How LightGBM works
<a name="lightgbm-HowItWorks"></a>

LightGBM implements a conventional Gradient Boosting Decision Tree (GBDT) algorithm with the addition of two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). These techniques are designed to significantly improve the efficiency and scalability of GBDT.

The LightGBM algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, distributions, and the diversity of hyperparameters that you can fine-tune. You can use LightGBM for regression, classification (binary and multiclass), and ranking problems.

For more information on gradient boosting, see [How the SageMaker AI XGBoost algorithm works](xgboost-HowItWorks.md). For in-depth details about the additional GOSS and EFB techniques used in the LightGBM method, see *[LightGBM: A Highly Efficient Gradient Boosting Decision Tree](https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf)*.

# LightGBM hyperparameters
<a name="lightgbm-hyperparameters"></a>

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI LightGBM algorithm. Users set these parameters to facilitate the estimation of model parameters from data. The SageMaker AI LightGBM algorithm is an implementation of the open-source [LightGBM](https://github.com/microsoft/LightGBM) package. 

**Note**  
The default hyperparameters are based on example datasets in the [LightGBM sample notebooks](lightgbm.md#lightgbm-sample-notebooks).

By default, the SageMaker AI LightGBM algorithm automatically chooses an evaluation metric and objective function based on the type of classification problem. The LightGBM algorithm detects the type of classification problem based on the number of labels in your data. For regression problems, the evaluation metric is root mean squared error and the objective function is L2 loss. For binary classification problems, the evaluation metric and objective function are both binary cross entropy. For multiclass classification problems, the evaluation metric is multiclass cross entropy and the objective function is softmax. You can use the `metric` hyperparameter to change the default evaluation metric. Refer to the following table for more information on LightGBM hyperparameters, including descriptions, valid values, and default values.


| Parameter Name | Description | 
| --- | --- | 
| num\_boost\_round |  The maximum number of boosting iterations. **Note:** Internally, LightGBM constructs `num_class * num_boost_round` trees for multiclass classification problems. Valid values: integer, range: Positive integer. Default value: `100`.  | 
| early\_stopping\_rounds |  Training stops if one metric of one validation data point does not improve in the last `early_stopping_rounds` rounds. If `early_stopping_rounds` is less than or equal to zero, this hyperparameter is ignored. Valid values: integer. Default value: `10`.  | 
| metric |  The evaluation metric for validation data. If `metric` is set to the default `"auto"` value, then the algorithm automatically chooses an evaluation metric based on the type of classification problem: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/lightgbm-hyperparameters.html) Valid values: string, any of the following: (`"auto"`, `"rmse"`, `"l1"`, `"l2"`, `"huber"`, `"fair"`, `"binary_logloss"`, `"binary_error"`, `"auc"`, `"average_precision"`, `"multi_logloss"`, `"multi_error"`, `"auc_mu"`, or `"cross_entropy"`). Default value: `"auto"`.  | 
| learning\_rate |  The rate at which the model weights are updated after working through each batch of training examples. Valid values: float, range: (`0.0`, `1.0`). Default value: `0.1`.  | 
| num\_leaves |  The maximum number of leaves in one tree. Valid values: integer, range: (`1`, `131072`). Default value: `64`.  | 
| feature\_fraction |  A subset of features to be selected on each iteration (tree). Must be less than 1.0. Valid values: float, range: (`0.0`, `1.0`). Default value: `0.9`.  | 
| bagging\_fraction |  Similar to `feature_fraction`, but `bagging_fraction` randomly selects a subset of the data without resampling. Valid values: float, range: (`0.0`, `1.0`]. Default value: `0.9`.  | 
| bagging\_freq |  The frequency to perform bagging. At every `bagging_freq` iteration, LightGBM randomly selects a percentage of the data to use for the next `bagging_freq` iterations. This percentage is determined by the `bagging_fraction` hyperparameter. If `bagging_freq` is zero, then bagging is deactivated. Valid values: integer, range: Non-negative integer. Default value: `1`.  | 
| max\_depth |  The maximum depth for a tree model. This is used to deal with overfitting when the amount of data is small. If `max_depth` is less than or equal to zero, there is no limit on maximum depth. Valid values: integer. Default value: `6`.  | 
| min\_data\_in\_leaf |  The minimum amount of data in one leaf. Can be used to deal with overfitting. Valid values: integer, range: Non-negative integer. Default value: `3`.  | 
| max\_delta\_step |  Used to limit the max output of tree leaves. If `max_delta_step` is less than or equal to 0, then there is no constraint. The final max output of leaves is `learning_rate * max_delta_step`. Valid values: float. Default value: `0.0`.  | 
| lambda\_l1 |  L1 regularization. Valid values: float, range: Non-negative float. Default value: `0.0`.  | 
| lambda\_l2 |  L2 regularization. Valid values: float, range: Non-negative float. Default value: `0.0`.  | 
| boosting |  The boosting type. Valid values: string, any of the following: (`"gbdt"`, `"rf"`, `"dart"`, or `"goss"`). Default value: `"gbdt"`.  | 
| min\_gain\_to\_split |  The minimum gain to perform a split. Can be used to speed up training. Valid values: float, range: Non-negative float. Default value: `0.0`.  | 
| scale\_pos\_weight |  The weight of the labels with positive class. Used only for binary classification tasks. `scale_pos_weight` cannot be used if `is_unbalance` is set to `"True"`. Valid values: float, range: Positive float. Default value: `1.0`.  | 
| tree\_learner |  Tree learner type. Valid values: string, any of the following: (`"serial"`, `"feature"`, `"data"`, or `"voting"`). Default value: `"serial"`.  | 
| feature\_fraction\_bynode |  Randomly selects a subset of features on each tree node. For example, if `feature_fraction_bynode` is `0.8`, then 80% of features are selected. Can be used to deal with overfitting. Valid values: float, range: (`0.0`, `1.0`]. Default value: `1.0`.  | 
| is\_unbalance |  Set to `"True"` if training data is unbalanced. Used only for binary classification tasks. `is_unbalance` cannot be used with `scale_pos_weight`. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
| max\_bin |  The maximum number of bins used to bucket feature values. A small number of bins may reduce training accuracy, but may improve generalization. Can be used to deal with overfitting. Valid values: integer, range: (1, ∞). Default value: `255`.  | 
| tweedie\_variance\_power |  Controls the variance of the Tweedie distribution. Set this closer to `2.0` to shift toward a gamma distribution. Set this closer to `1.0` to shift toward a Poisson distribution. Used only for regression tasks. Valid values: float, range: [`1.0`, `2.0`). Default value: `1.5`.  | 
| num\_threads |  The number of parallel threads used to run LightGBM. A value of 0 means the default number of threads in OpenMP. Valid values: integer, range: Non-negative integer. Default value: `0`.  | 
| verbosity |  The verbosity of print messages. If `verbosity` is less than `0`, then print messages only show fatal errors. If `verbosity` is set to `0`, then print messages include errors and warnings. If `verbosity` is `1`, then print messages show more information. A `verbosity` greater than `1` shows the most information in print messages and can be used for debugging. Valid values: integer. Default value: `1`.  | 

# Tune a LightGBM model
<a name="lightgbm-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. Model tuning focuses on the following hyperparameters: 

**Note**  
The learning objective function is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column. For more information, see [LightGBM hyperparameters](lightgbm-hyperparameters.md).
+ A learning objective function to optimize during model training
+ An evaluation metric that is used to evaluate model performance during validation
+ A set of hyperparameters and a range of values for each to use when tuning the model automatically

Automatic model tuning searches your specified hyperparameters to find the combination of values that results in a model that optimizes the chosen evaluation metric.

**Note**  
Automatic model tuning for LightGBM is only available from the Amazon SageMaker SDKs, not from the SageMaker AI console.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Evaluation metrics computed by the LightGBM algorithm
<a name="lightgbm-metrics"></a>

The SageMaker AI LightGBM algorithm computes the following metrics to use for model validation. The evaluation metric is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column.


| Metric Name | Description | Optimization Direction | Regex Pattern | 
| --- | --- | --- | --- | 
| rmse | root mean square error | minimize | "rmse: ([0-9\\.]+)" | 
| l1 | mean absolute error | minimize | "l1: ([0-9\\.]+)" | 
| l2 | mean squared error | minimize | "l2: ([0-9\\.]+)" | 
| huber | huber loss | minimize | "huber: ([0-9\\.]+)" | 
| fair | fair loss | minimize | "fair: ([0-9\\.]+)" | 
| binary\_logloss | binary cross entropy | minimize | "binary\_logloss: ([0-9\\.]+)" | 
| binary\_error | binary error | minimize | "binary\_error: ([0-9\\.]+)" | 
| auc | AUC | maximize | "auc: ([0-9\\.]+)" | 
| average\_precision | average precision score | maximize | "average\_precision: ([0-9\\.]+)" | 
| multi\_logloss | multiclass cross entropy | minimize | "multi\_logloss: ([0-9\\.]+)" | 
| multi\_error | multiclass error score | minimize | "multi\_error: ([0-9\\.]+)" | 
| auc\_mu | AUC-mu | maximize | "auc\_mu: ([0-9\\.]+)" | 
| cross\_entropy | cross entropy | minimize | "cross\_entropy: ([0-9\\.]+)" | 

## Tunable LightGBM hyperparameters
<a name="lightgbm-tunable-hyperparameters"></a>

Tune the LightGBM model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the LightGBM evaluation metrics are: `learning_rate`, `num_leaves`, `feature_fraction`, `bagging_fraction`, `bagging_freq`, `max_depth`, and `min_data_in_leaf`. For a list of all the LightGBM hyperparameters, see [LightGBM hyperparameters](lightgbm-hyperparameters.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| learning\_rate | ContinuousParameterRanges | MinValue: 0.001, MaxValue: 0.01 | 
| num\_leaves | IntegerParameterRanges | MinValue: 10, MaxValue: 100 | 
| feature\_fraction | ContinuousParameterRanges | MinValue: 0.1, MaxValue: 1.0 | 
| bagging\_fraction | ContinuousParameterRanges | MinValue: 0.1, MaxValue: 1.0 | 
| bagging\_freq | IntegerParameterRanges | MinValue: 0, MaxValue: 10 | 
| max\_depth | IntegerParameterRanges | MinValue: 15, MaxValue: 100 | 
| min\_data\_in\_leaf | IntegerParameterRanges | MinValue: 10, MaxValue: 200 | 
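As a sketch, the recommended ranges above could be expressed in the `ParameterRanges` structure accepted by the `CreateHyperParameterTuningJob` API (all range values are passed as strings):

```python
# Sketch: the recommended LightGBM ranges expressed in the ParameterRanges
# structure used by the CreateHyperParameterTuningJob API. Per the API,
# MinValue and MaxValue are strings.
parameter_ranges = {
    "ContinuousParameterRanges": [
        {"Name": "learning_rate", "MinValue": "0.001", "MaxValue": "0.01"},
        {"Name": "feature_fraction", "MinValue": "0.1", "MaxValue": "1.0"},
        {"Name": "bagging_fraction", "MinValue": "0.1", "MaxValue": "1.0"},
    ],
    "IntegerParameterRanges": [
        {"Name": "num_leaves", "MinValue": "10", "MaxValue": "100"},
        {"Name": "bagging_freq", "MinValue": "0", "MaxValue": "10"},
        {"Name": "max_depth", "MinValue": "15", "MaxValue": "100"},
        {"Name": "min_data_in_leaf", "MinValue": "10", "MaxValue": "200"},
    ],
}
```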

# Linear Learner Algorithm
<a name="linear-learner"></a>

*Linear models* are supervised learning algorithms used for solving either classification or regression problems. For input, you give the model labeled examples (*x*, *y*). *x* is a high-dimensional vector and *y* is a numeric label. For binary classification problems, the label must be either 0 or 1. For multiclass classification problems, the labels must be from 0 to `num_classes` - 1. For regression problems, *y* is a real number. The algorithm learns a linear function, or, for classification problems, a linear threshold function, and maps a vector *x* to an approximation of the label *y*. 
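As an illustration of this mapping, the following sketch applies a linear function, and its thresholded variant for binary classification, using made-up weights:

```python
# Sketch of a learned linear model with illustrative (made-up) parameters.
w = [0.5, -1.2, 0.3]   # weight vector, one entry per feature
b = 0.1                # bias (intercept) term

def predict_regression(x):
    """Linear function: approximate y with w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict_binary(x):
    """Linear threshold function: label 1 if w.x + b > 0, else 0."""
    return 1 if predict_regression(x) > 0 else 0

print(predict_binary([1.0, 0.2, 0.5]))  # 1
```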

The Amazon SageMaker AI linear learner algorithm provides a solution for both classification and regression problems. With the SageMaker AI algorithm, you can simultaneously explore different training objectives and choose the best solution from a validation set. You can also explore a large number of models and choose the best. The best model optimizes either of the following:
+ Continuous objectives, such as mean square error, cross entropy loss, or absolute error.
+ Discrete objectives suited for classification, such as F1 measure, precision, recall, or accuracy. 

Compared with methods that provide a solution for only continuous objectives, the SageMaker AI linear learner algorithm provides a significant increase in speed over naive hyperparameter optimization techniques. It is also more convenient. 

The linear learner algorithm requires a data matrix, with rows representing the observations, and columns representing the dimensions of the features. It also requires an additional column that contains the labels that match the data points. At a minimum, Amazon SageMaker AI linear learner requires you to specify input and output data locations, and objective type (classification or regression) as arguments. The feature dimension is also required. For more information, see [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html). You can specify additional parameters in the `HyperParameters` string map of the request body. These parameters control the optimization procedure, or specifics of the objective function that you train on. For example, the number of epochs, regularization, and loss type. 
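For example, a sketch of the `HyperParameters` string map within a `CreateTrainingJob` request body might look like the following (the job name and values are placeholders; every value must be a string):

```python
# Sketch of the HyperParameters portion of a CreateTrainingJob request.
# The job name and values shown are placeholders; per the API, every
# hyperparameter value is passed as a string.
training_job_config = {
    "TrainingJobName": "linear-learner-example",  # hypothetical name
    "HyperParameters": {
        "predictor_type": "binary_classifier",
        "feature_dim": "10",
        "epochs": "15",
        "loss": "auto",
        "wd": "0.0001",
    },
}

assert all(isinstance(v, str) for v in training_job_config["HyperParameters"].values())
```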

If you're using [Managed Spot Training](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html), the linear learner algorithm supports using [checkpoints to take a snapshot of the state of the model](https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html).

**Topics**
+ [Input/Output interface for the linear learner algorithm](#ll-input_output)
+ [EC2 instance recommendation for the linear learner algorithm](#ll-instances)
+ [Linear learner sample notebooks](#ll-sample-notebooks)
+ [How linear learner works](ll_how-it-works.md)
+ [Linear learner hyperparameters](ll_hyperparameters.md)
+ [Tune a linear learner model](linear-learner-tuning.md)
+ [Linear learner response formats](LL-in-formats.md)

## Input/Output interface for the linear learner algorithm
<a name="ll-input_output"></a>

The Amazon SageMaker AI linear learner algorithm supports three data channels: train, validation (optional), and test (optional). If you provide validation data, the `S3DataDistributionType` should be `FullyReplicated`. The algorithm logs validation loss at every epoch, and uses a sample of the validation data to calibrate and select the best model. If you don't provide validation data, the algorithm uses a sample of the training data to calibrate and select the model. If you provide test data, the algorithm logs include the test score for the final model.

**For training**, the linear learner algorithm supports both `recordIO-wrapped protobuf` and `CSV` formats. For the `application/x-recordio-protobuf` input type, only Float32 tensors are supported. For the `text/csv` input type, the first column is assumed to be the label, which is the target variable for prediction. You can use either File mode or Pipe mode to train linear learner models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.

**For inference**, the linear learner algorithm supports the `application/json`, `application/x-recordio-protobuf`, and `text/csv` formats. When you make predictions on new data, the format of the response depends on the type of model. **For regression** (`predictor_type='regressor'`), the `score` is the prediction produced by the model. **For classification** (`predictor_type='binary_classifier'` or `predictor_type='multiclass_classifier'`), the model returns a `score` and also a `predicted_label`. The `predicted_label` is the class predicted by the model and the `score` measures the strength of that prediction. 
+ **For binary classification**, `predicted_label` is `0` or `1`, and `score` is a single floating point number that indicates how strongly the algorithm believes that the label should be 1.
+ **For multiclass classification**, `predicted_label` will be an integer from `0` to `num_classes-1`, and `score` will be a list of one floating point number per class. 
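For example, a binary-classification JSON response of the shape described above can be parsed as follows (the response body shown is illustrative):

```python
import json

# Illustrative JSON response for a binary classifier, in the shape
# described above: a score plus a predicted_label per record.
response_body = '{"predictions": [{"score": 0.92, "predicted_label": 1}]}'

predictions = json.loads(response_body)["predictions"]
first = predictions[0]
print(first["predicted_label"], first["score"])  # 1 0.92
```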

To interpret the `score` in classification problems, you have to consider the loss function used. If the `loss` hyperparameter value is `logistic` for binary classification or `softmax_loss` for multiclass classification, then the `score` can be interpreted as the probability of the corresponding class. These are the loss values that the linear learner uses when the `loss` hyperparameter is set to its default value of `auto`. But if the loss is set to `hinge_loss`, then the score cannot be interpreted as a probability. This is because hinge loss corresponds to a Support Vector Classifier, which does not produce probability estimates.

For more information on input and output file formats, see [Linear learner response formats](LL-in-formats.md) and the [Linear learner sample notebooks](#ll-sample-notebooks).

## EC2 instance recommendation for the linear learner algorithm
<a name="ll-instances"></a>

The linear learner algorithm supports both CPU and GPU instances for training and inference. For GPU, the linear learner algorithm supports P2, P3, G4dn, and G5 GPU families.

During testing, we have not found substantial evidence that multi-GPU instances are faster than single-GPU instances. Results can vary, depending on your specific use case.

## Linear learner sample notebooks
<a name="ll-sample-notebooks"></a>

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI linear learner algorithm.


| **Notebook Title** | **Description** | 
| --- | --- | 
|  [An Introduction with the MNIST dataset](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/linear_learner_mnist/linear_learner_mnist.html)  |   Using the MNIST dataset, we train a binary classifier to predict a single digit.  | 
|  [How to Build a Multiclass Classifier?](https://sagemaker-examples.readthedocs.io/en/latest/scientific_details_of_algorithms/linear_learner_multiclass_classification/linear_learner_multiclass_classification.html)  |   Using UCI's Covertype dataset, we demonstrate how to train a multiclass classifier.   | 
|  [How to Build a Machine Learning (ML) Pipeline for Inference? ](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference%20Pipeline%20with%20Scikit-learn%20and%20Linear%20Learner.html)  |   Using a Scikit-learn container, we demonstrate how to build an end-to-end ML pipeline.   | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the examples in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. The example notebooks that use the linear learner algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and choose **Create copy**. 

# How linear learner works
<a name="ll_how-it-works"></a>

There are three steps involved in the implementation of the linear learner algorithm: preprocess, train, and validate. 

## Step 1: Preprocess
<a name="step1-preprocessing"></a>

Normalization, or feature scaling, is an important preprocessing step for certain loss functions that ensures the model being trained on a dataset does not become dominated by the weight of a single feature. The Amazon SageMaker AI Linear Learner algorithm has a normalization option to assist with this preprocessing step. If normalization is turned on, the algorithm first goes over a small sample of the data to learn the mean value and standard deviation for each feature and for the label. Each of the features in the full dataset is then shifted to have mean of zero and scaled to have a unit standard deviation.
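The shift-and-scale step can be sketched as follows (the sample values are illustrative):

```python
# Sketch of the normalization described above: shift each feature to
# zero mean and scale to unit standard deviation, using statistics
# estimated from a sample of the data.
sample = [2.0, 4.0, 6.0, 8.0]  # one feature, illustrative values

mean = sum(sample) / len(sample)
var = sum((x - mean) ** 2 for x in sample) / len(sample)
std = var ** 0.5

normalized = [(x - mean) / std for x in sample]
print(round(sum(normalized), 10))  # 0.0 (zero mean after the shift)
```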

**Note**  
For best results, ensure your data is shuffled before training. Training with unshuffled data may cause training to fail. 

You can configure whether the linear learner algorithm normalizes the feature data and the labels using the `normalize_data` and `normalize_label` hyperparameters, respectively. Normalization is enabled by default for both features and labels for regression. Only the features can be normalized for binary classification and this is the default behavior. 

## Step 2: Train
<a name="step2-training"></a>

With the linear learner algorithm, you train with a distributed implementation of stochastic gradient descent (SGD). You can control the optimization process by choosing the optimization algorithm. For example, you can choose to use Adam, AdaGrad, stochastic gradient descent, or other optimization algorithms. You also specify their hyperparameters, such as momentum, learning rate, and the learning rate schedule. If you aren't sure which algorithm or hyperparameter value to use, choose a default that works for the majority of datasets. 
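As a toy illustration of the SGD update rule (a single-machine sketch, not the distributed implementation that SageMaker AI uses):

```python
import random

# Toy single-machine SGD for a one-feature linear regression,
# illustrating the update rule only.
random.seed(0)
data = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]  # y = 2x + 1

w, b = 0.0, 0.0
learning_rate = 0.1

for _ in range(200):           # epochs
    random.shuffle(data)
    for x, y in data:
        pred = w * x + b
        grad = pred - y        # gradient of 0.5 * (pred - y)^2 w.r.t. pred
        w -= learning_rate * grad * x
        b -= learning_rate * grad

print(round(w, 2), round(b, 2))  # approaches 2.0 and 1.0
```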

During training, you simultaneously optimize multiple models, each with slightly different objectives. For example, you vary L1 or L2 regularization and try out different optimizer settings. 

## Step 3: Validate and set the threshold
<a name="step3-validation"></a>

When training multiple models in parallel, the models are evaluated against a validation set to select the best model once training is complete. For regression, the best model is the one that achieves the best loss on the validation set. For classification, a sample of the validation set is used to calibrate the classification threshold. The model selected is the one that achieves the best binary classification selection criteria on the validation set. Examples of such criteria include the F1 measure, accuracy, and cross-entropy loss. 
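The threshold calibration step can be sketched as scanning candidate thresholds over validation scores and keeping the one that maximizes the selection criterion, F1 in this illustrative example:

```python
# Sketch of threshold calibration: pick the threshold that maximizes
# F1 on held-out validation data. Scores and labels are illustrative.
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2]
labels = [0,   0,   1,    1,   1,    0]

def f1_at(threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Use the observed scores themselves as candidate thresholds.
best_threshold = max(scores, key=f1_at)
print(best_threshold)  # 0.35
```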

**Note**  
If you don't provide the algorithm with a validation set, evaluating and selecting the best model is not possible. To take advantage of parallel training and model selection, ensure that you provide a validation set to the algorithm. 

# Linear learner hyperparameters
<a name="ll_hyperparameters"></a>

The following table contains the hyperparameters for the linear learner algorithm. These are parameters that are set by users to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order. When a hyperparameter is set to `auto`, Amazon SageMaker AI will automatically calculate and set the value of that hyperparameter. 


| Parameter Name | Description | 
| --- | --- | 
| num\_classes |  The number of classes for the response variable. The algorithm assumes that classes are labeled `0`, ..., `num_classes - 1`. **Required** when `predictor_type` is `multiclass_classifier`. Otherwise, the algorithm ignores it. Valid values: Integers from 3 to 1,000,000  | 
| predictor\_type |  Specifies the type of target variable as a binary classification, multiclass classification, or regression. **Required** Valid values: `binary_classifier`, `multiclass_classifier`, or `regressor`  | 
| accuracy\_top\_k |  When computing the top-k accuracy metric for multiclass classification, the value of *k*. If the model assigns one of the top-k scores to the true label, an example is scored as correct. **Optional** Valid values: Positive integers Default value: 3   | 
| balance\_multiclass\_weights |  Specifies whether to use class weights, which give each class equal importance in the loss function. Used only when the `predictor_type` is `multiclass_classifier`. **Optional** Valid values: `true`, `false` Default value: `false`  | 
| beta\_1 |  The exponential decay rate for first-moment estimates. Applies only when the `optimizer` value is `adam`. **Optional** Valid values: `auto` or floating-point value between 0 and 1.0 Default value: `auto`  | 
| beta\_2 |  The exponential decay rate for second-moment estimates. Applies only when the `optimizer` value is `adam`. **Optional** Valid values: `auto` or floating-point value between 0 and 1.0  Default value: `auto`  | 
| bias\_lr\_mult |  Allows a different learning rate for the bias term. The actual learning rate for the bias is `learning_rate` \* `bias_lr_mult`. **Optional** Valid values: `auto` or positive floating-point number Default value: `auto`  | 
| bias\_wd\_mult |  Allows different regularization for the bias term. The actual L2 regularization weight for the bias is `wd` \* `bias_wd_mult`. By default, there is no regularization on the bias term. **Optional** Valid values: `auto` or non-negative floating-point number Default value: `auto`  | 
| binary\_classifier\_model\_selection\_criteria |  When `predictor_type` is set to `binary_classifier`, the model evaluation criteria for the validation dataset (or for the training dataset if you don't provide a validation dataset). Criteria include: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html) **Optional** Valid values: `accuracy`, `f_beta`, `precision_at_target_recall`, `recall_at_target_precision`, or `loss_function` Default value: `accuracy`  | 
| early\_stopping\_patience | If no improvement is made in the relevant metric, the number of epochs to wait before ending training. If you have provided a value for `binary_classifier_model_selection_criteria`, the metric is that value. Otherwise, the metric is the same as the value specified for the `loss` hyperparameter. The metric is evaluated on the validation data. If you haven't provided validation data, the metric is always the same as the value specified for the `loss` hyperparameter and is evaluated on the training data. To disable early stopping, set `early_stopping_patience` to a value greater than the value specified for `epochs`. **Optional** Valid values: Positive integer Default value: 3 | 
| early\_stopping\_tolerance |  The relative tolerance to measure an improvement in loss. If the ratio of the improvement in loss divided by the previous best loss is smaller than this value, early stopping considers the improvement to be zero. **Optional** Valid values: Positive floating-point number Default value: 0.001  | 
| epochs |  The maximum number of passes over the training data. **Optional** Valid values: Positive integer Default value: 15  | 
| f\_beta |  The value of beta to use when calculating F score metrics for binary or multiclass classification. Also used if the value specified for `binary_classifier_model_selection_criteria` is `f_beta`. **Optional** Valid values: Positive floating-point numbers Default value: 1.0   | 
| feature\_dim |  The number of features in the input data.  **Optional** Valid values: `auto` or positive integer Default value: `auto`  | 
| huber\_delta |  The parameter for Huber loss. During training and metric evaluation, compute L2 loss for errors smaller than delta and L1 loss for errors larger than delta. **Optional** Valid values: Positive floating-point number Default value: 1.0   | 
| init\_bias |  Initial weight for the bias term. **Optional** Valid values: Floating-point number Default value: 0  | 
| init\_method |  Sets the initial distribution function used for model weights. Functions include: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html) **Optional** Valid values: `uniform` or `normal` Default value: `uniform`  | 
| init\_scale |  Scales an initial uniform distribution for model weights. Applies only when the `init_method` hyperparameter is set to `uniform`. **Optional** Valid values: Positive floating-point number Default value: 0.07  | 
| init\_sigma |  The initial standard deviation for the normal distribution. Applies only when the `init_method` hyperparameter is set to `normal`. **Optional** Valid values: Positive floating-point number Default value: 0.01  | 
| l1 |  The L1 regularization parameter. If you don't want to use L1 regularization, set the value to 0. **Optional** Valid values: `auto` or non-negative float Default value: `auto`  | 
| learning\_rate |  The step size used by the optimizer for parameter updates. **Optional** Valid values: `auto` or positive floating-point number Default value: `auto`, whose value depends on the optimizer chosen.  | 
| loss |  Specifies the loss function.  The available loss functions and their default values depend on the value of `predictor_type`: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html) Valid values: `auto`, `logistic`, `squared_loss`, `absolute_loss`, `hinge_loss`, `eps_insensitive_squared_loss`, `eps_insensitive_absolute_loss`, `quantile_loss`, or `huber_loss`  **Optional** Default value: `auto`  | 
| loss\_insensitivity |  The parameter for the epsilon-insensitive loss type. During training and metric evaluation, any error smaller than this value is considered to be zero. **Optional** Valid values: Positive floating-point number Default value: 0.01   | 
| lr\_scheduler\_factor |  For every `lr_scheduler_step` hyperparameter, the learning rate decreases by this quantity. Applies only when the `use_lr_scheduler` hyperparameter is set to `true`. **Optional** Valid values: `auto` or positive floating-point number between 0 and 1 Default value: `auto`  | 
| lr\_scheduler\_minimum\_lr |  The learning rate never decreases to a value lower than the value set for `lr_scheduler_minimum_lr`. Applies only when the `use_lr_scheduler` hyperparameter is set to `true`. **Optional** Valid values: `auto` or positive floating-point number Default value: `auto`  | 
| lr\_scheduler\_step |  The number of steps between decreases of the learning rate. Applies only when the `use_lr_scheduler` hyperparameter is set to `true`. **Optional** Valid values: `auto` or positive integer Default value: `auto`  | 
| margin |  The margin for the `hinge_loss` function. **Optional** Valid values: Positive floating-point number Default value: 1.0  | 
| mini\_batch\_size |  The number of observations per mini-batch for the data iterator. **Optional** Valid values: Positive integer Default value: 1000  | 
| momentum |  The momentum of the `sgd` optimizer. **Optional** Valid values: `auto` or a floating-point number between 0 and 1.0 Default value: `auto`  | 
| normalize\_data |  Normalizes the feature data before training. Data normalization shifts the data for each feature to have a mean of zero and scales it to have unit standard deviation. **Optional** Valid values: `auto`, `true`, or `false` Default value: `true`  | 
| normalize\_label |  Normalizes the label. Label normalization shifts the label to have a mean of zero and scales it to have unit standard deviation. The `auto` default value normalizes the label for regression problems but does not for classification problems. If you set the `normalize_label` hyperparameter to `true` for classification problems, the algorithm ignores it. **Optional** Valid values: `auto`, `true`, or `false` Default value: `auto`  | 
| num\_calibration\_samples |  The number of observations from the validation dataset to use for model calibration (when finding the best threshold). **Optional** Valid values: `auto` or positive integer Default value: `auto`  | 
| num\_models |  The number of models to train in parallel. For the default, `auto`, the algorithm decides the number of parallel models to train. One model is trained according to the given training parameter (regularization, optimizer, loss), and the rest by close parameters. **Optional** Valid values: `auto` or positive integer Default value: `auto`  | 
| num\_point\_for\_scaler |  The number of data points to use for calculating normalization or unbiasing of terms. **Optional** Valid values: Positive integer Default value: 10,000  | 
| optimizer |  The optimization algorithm to use. **Optional** Valid values: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ll_hyperparameters.html) Default value: `auto`. The default setting for `auto` is `adam`.  | 
| positive\_example\_weight\_mult |  The weight assigned to positive examples when training a binary classifier. The weight of negative examples is fixed at 1. If you want the algorithm to choose a weight so that errors in classifying negative *vs.* positive examples have equal impact on training loss, specify `balanced`. If you want the algorithm to choose the weight that optimizes performance, specify `auto`. **Optional** Valid values: `balanced`, `auto`, or a positive floating-point number Default value: 1.0  | 
| quantile |  The quantile for quantile loss. For quantile q, the model attempts to produce predictions so that the value of `true_label` is greater than the prediction with probability q. **Optional** Valid values: Floating-point number between 0 and 1 Default value: 0.5  | 
| target\_precision |  The target precision. If `binary_classifier_model_selection_criteria` is `recall_at_target_precision`, then precision is held at this value while recall is maximized. **Optional** Valid values: Floating-point number between 0 and 1.0 Default value: 0.8  | 
| target\_recall |  The target recall. If `binary_classifier_model_selection_criteria` is `precision_at_target_recall`, then recall is held at this value while precision is maximized. **Optional** Valid values: Floating-point number between 0 and 1.0 Default value: 0.8  | 
| unbias\_data |  Unbiases the features before training so that the mean is 0. By default, data is unbiased because the `use_bias` hyperparameter defaults to `true`. **Optional** Valid values: `auto`, `true`, or `false` Default value: `auto`  | 
| unbias\_label |  Unbiases labels before training so that the mean is 0. Applies to regression only if the `use_bias` hyperparameter is set to `true`. **Optional** Valid values: `auto`, `true`, or `false` Default value: `auto`  | 
| use\_bias |  Specifies whether the model should include a bias term, which is the intercept term in the linear equation. **Optional** Valid values: `true` or `false` Default value: `true`  | 
| use\_lr\_scheduler |  Whether to use a scheduler for the learning rate. If you want to use a scheduler, specify `true`.  **Optional** Valid values: `true` or `false` Default value: `true`  | 
| wd |  The weight decay parameter, also known as the L2 regularization parameter. If you don't want to use L2 regularization, set the value to 0. **Optional** Valid values: `auto` or non-negative floating-point number Default value: `auto`  | 

# Tune a linear learner model
<a name="linear-learner-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric. 

The linear learner algorithm also has an internal mechanism for tuning hyperparameters separate from the automatic model tuning feature described here. By default, the linear learner algorithm tunes hyperparameters by training multiple models in parallel. When you use automatic model tuning, the linear learner internal tuning mechanism is turned off automatically. This sets the number of parallel models, `num_models`, to 1. The algorithm ignores any value that you set for `num_models`.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics computed by the linear learner algorithm
<a name="linear-learner-metrics"></a>

The linear learner algorithm reports the metrics in the following table, which are computed during training. Choose one of them as the objective metric. To avoid overfitting, we recommend tuning the model against a validation metric instead of a training metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:absolute\$1loss |  The absolute loss of the final model on the test dataset. This objective metric is only valid for regression.  |  Minimize  | 
| test:binary\$1classification\$1accuracy |  The accuracy of the final model on the test dataset. This objective metric is only valid for binary classification.  |  Maximize  | 
| test:binary\$1f\$1beta |  The F-beta score of the final model on the test dataset. By default, it is the F1 score, which is the harmonic mean of precision and recall. This objective metric is only valid for binary classification.  |  Maximize  | 
| test:dcg |  The discounted cumulative gain of the final model on the test dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:macro\$1f\$1beta |  The F-beta score of the final model on the test dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:macro\$1precision |  The precision score of the final model on the test dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:macro\$1recall |  The recall score of the final model on the test dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:mse |  The mean square error of the final model on the test dataset. This objective metric is only valid for regression.  |  Minimize  | 
| test:multiclass\$1accuracy |  The accuracy of the final model on the test dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:multiclass\$1top\$1k\$1accuracy |  The accuracy among the top k labels predicted on the test dataset. If you choose this metric as the objective, we recommend setting the value of k using the `accuracy_top_k` hyperparameter. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| test:objective\$1loss |  The mean value of the objective loss function on the test dataset after the model is trained. By default, the loss is logistic loss for binary classification and squared loss for regression. To set the loss to other types, use the `loss` hyperparameter.  |  Minimize  | 
| test:precision |  The precision of the final model on the test dataset. If you choose this metric as the objective, we recommend setting a target recall by setting the `binary_classifier_model_selection` hyperparameter to `precision_at_target_recall` and setting the value for the `target_recall` hyperparameter. This objective metric is only valid for binary classification.  |  Maximize  | 
| test:recall |  The recall of the final model on the test dataset. If you choose this metric as the objective, we recommend setting a target precision by setting the `binary_classifier_model_selection` hyperparameter to `recall_at_target_precision` and setting the value for the `target_precision` hyperparameter. This objective metric is only valid for binary classification.  |  Maximize  | 
| test:roc\$1auc\$1score |  The area under receiving operating characteristic curve (ROC curve) of the final model on the test dataset. This objective metric is only valid for binary classification.  |  Maximize  | 
| validation:absolute\$1loss |  The absolute loss of the final model on the validation dataset. This objective metric is only valid for regression.  |  Minimize  | 
| validation:binary\$1classification\$1accuracy |  The accuracy of the final model on the validation dataset. This objective metric is only valid for binary classification.  |  Maximize  | 
| validation:binary\$1f\$1beta |  The F-beta score of the final model on the validation dataset. By default, the F-beta score is the F1 score, which is the harmonic mean of the `validation:precision` and `validation:recall` metrics. This objective metric is only valid for binary classification.  |  Maximize  | 
| validation:dcg |  The discounted cumulative gain of the final model on the validation dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:macro\$1f\$1beta |  The F-beta score of the final model on the validation dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:macro\$1precision |  The precision score of the final model on the validation dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:macro\$1recall |  The recall score of the final model on the validation dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:mse |  The mean square error of the final model on the validation dataset. This objective metric is only valid for regression.  |  Minimize  | 
| validation:multiclass\$1accuracy |  The accuracy of the final model on the validation dataset. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:multiclass\$1top\$1k\$1accuracy |  The accuracy among the top k labels predicted on the validation dataset. If you choose this metric as the objective, we recommend setting the value of k using the `accuracy_top_k` hyperparameter. This objective metric is only valid for multiclass classification.  |  Maximize  | 
| validation:objective\$1loss |  The mean value of the objective loss function on the validation dataset every epoch. By default, the loss is logistic loss for binary classification and squared loss for regression. To set loss to other types, use the `loss` hyperparameter.  |  Minimize  | 
| validation:precision |  The precision of the final model on the validation dataset. If you choose this metric as the objective, we recommend setting a target recall by setting the `binary_classifier_model_selection` hyperparameter to `precision_at_target_recall` and setting the value for the `target_recall` hyperparameter. This objective metric is only valid for binary classification.  |  Maximize  | 
| validation:recall |  The recall of the final model on the validation dataset. If you choose this metric as the objective, we recommend setting a target precision by setting the `binary_classifier_model_selection` hyperparameter to `recall_at_target_precision` and setting the value for the `target_precision` hyperparameter. This objective metric is only valid for binary classification.  |  Maximize  | 
| validation:rmse |  The root mean square error of the final model on the validation dataset. This objective metric is only valid for regression.  |  Minimize  | 
| validation:roc\_auc\_score |  The area under the receiver operating characteristic curve (ROC curve) of the final model on the validation dataset. This objective metric is only valid for binary classification.  |  Maximize  | 

## Tuning linear learner hyperparameters
<a name="linear-learner-tunable-hyperparameters"></a>

You can tune a linear learner model with the following hyperparameters.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| wd |  `ContinuousParameterRanges`  |  `MinValue`: `1e-7`, `MaxValue`: `1`  | 
| l1 |  `ContinuousParameterRanges`  |  `MinValue`: `1e-7`, `MaxValue`: `1`  | 
| learning\_rate |  `ContinuousParameterRanges`  |  `MinValue`: `1e-5`, `MaxValue`: `1`  | 
| mini\_batch\_size |  `IntegerParameterRanges`  |  `MinValue`: `100`, `MaxValue`: `5000`  | 
| use\_bias |  `CategoricalParameterRanges`  |  `[True, False]`  | 
| positive\_example\_weight\_mult |  `ContinuousParameterRanges`  |  `MinValue`: `1e-5`, `MaxValue`: `1e5`  | 
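The recommended ranges above can be expressed in the `ParameterRanges` structure that the `CreateHyperParameterTuningJob` API accepts. The following is an illustrative sketch only; the range values come from the table above, and the dictionary can be passed (for example, via boto3) when you define a tuning job.

```python
# Recommended linear learner tuning ranges from the table above, in the
# ParameterRanges shape used by CreateHyperParameterTuningJob. Note that
# the API expects MinValue/MaxValue as strings.
parameter_ranges = {
    "ContinuousParameterRanges": [
        {"Name": "wd", "MinValue": "1e-7", "MaxValue": "1"},
        {"Name": "l1", "MinValue": "1e-7", "MaxValue": "1"},
        {"Name": "learning_rate", "MinValue": "1e-5", "MaxValue": "1"},
        {"Name": "positive_example_weight_mult", "MinValue": "1e-5", "MaxValue": "1e5"},
    ],
    "IntegerParameterRanges": [
        {"Name": "mini_batch_size", "MinValue": "100", "MaxValue": "5000"},
    ],
    "CategoricalParameterRanges": [
        {"Name": "use_bias", "Values": ["True", "False"]},
    ],
}
```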

# Linear learner response formats
<a name="LL-in-formats"></a>

## JSON response formats
<a name="LL-json"></a>

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). The following are the available output formats for the SageMaker AI linear learner algorithm.

**Binary Classification**

```
let response =   {
    "predictions":    [
        {
            "score": 0.4,
            "predicted_label": 0
        } 
    ]
}
```

**Multiclass Classification**

```
let response =   {
    "predictions":    [
        {
            "score": [0.1, 0.2, 0.4, 0.3],
            "predicted_label": 2
        } 
    ]
}
```

**Regression**

```
let response =   {
    "predictions":    [
        {
            "score": 0.4
        } 
    ]
}
```
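After calling a deployed endpoint (for example, through the SageMaker Runtime `InvokeEndpoint` API), the JSON body can be parsed with the standard library. The following sketch uses a hard-coded binary classification body matching the format above; the variable names are illustrative.

```python
import json

# Hard-coded example body in the binary classification format shown above
body = '{"predictions": [{"score": 0.4, "predicted_label": 0}]}'

response = json.loads(body)
labels = [p["predicted_label"] for p in response["predictions"]]
scores = [p["score"] for p in response["predictions"]]
print(labels, scores)  # [0] [0.4]
```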

## JSONLINES response formats
<a name="LL-jsonlines"></a>

**Binary Classification**

```
{"score": 0.4, "predicted_label": 0}
```

**Multiclass Classification**

```
{"score": [0.1, 0.2, 0.4, 0.3], "predicted_label": 2}
```

**Regression**

```
{"score": 0.4}
```
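Because a JSONLINES body carries one JSON object per line, it can be parsed line by line. The body below is a hard-coded example matching the formats above:

```python
import json

# Hard-coded example body: one prediction object per line
body = '{"score": 0.4, "predicted_label": 0}\n{"score": 0.8, "predicted_label": 1}'

records = [json.loads(line) for line in body.splitlines() if line.strip()]
```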

## RECORDIO response formats
<a name="LL-recordio"></a>

**Binary Classification**

```
[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.4]  # float32
            },
            'predicted_label': {
                keys: [],
                values: [0.0]  # float32
            }
        }
    }
]
```

**Multiclass Classification**

```
[
    Record = {
    "features": [],
    "label":    {
            "score":  {
                    "values":   [0.1, 0.2, 0.3, 0.4]   
            },
            "predicted_label":  {
                    "values":   [3]
            }
       },
    "uid":  "abc123",
    "metadata": "{created_at: '2017-06-03'}"
   }
]
```

**Regression**

```
[
    Record = {
        features = {},
        label = {
            'score': {
                keys: [],
                values: [0.4]  # float32
            }   
        }
    }
]
```

# TabTransformer
<a name="tabtransformer"></a>

[TabTransformer](https://arxiv.org/abs/2012.06678) is a novel deep tabular data modeling architecture for supervised learning. The TabTransformer architecture is built on self-attention-based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. Furthermore, the contextual embeddings learned from TabTransformer are highly robust against both missing and noisy data features, and provide better interpretability. This page includes information about Amazon EC2 instance recommendations and sample notebooks for TabTransformer.

# How to use SageMaker AI TabTransformer
<a name="tabtransformer-modes"></a>

You can use TabTransformer as an Amazon SageMaker AI built-in algorithm. The following section describes how to use TabTransformer with the SageMaker Python SDK. For information on how to use TabTransformer from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).
+ **Use TabTransformer as a built-in algorithm**

  Use the TabTransformer built-in algorithm to build a TabTransformer training container as shown in the following code example. You can retrieve the TabTransformer built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API (or the `get_image_uri` API if using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 1). 

  After specifying the TabTransformer image URI, you can use the TabTransformer container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. The TabTransformer built-in algorithm runs in script mode, but the training script is provided for you and there is no need to replace it. If you have extensive experience using script mode to create a SageMaker training job, then you can incorporate your own TabTransformer training scripts.

  ```
  from sagemaker import image_uris, model_uris, script_uris
  import sagemaker
  
  # Set up the session, Region, and execution role referenced later in this example
  sess = sagemaker.Session()
  aws_region = sess.boto_region_name
  aws_role = sagemaker.get_execution_role()
  
  train_model_id, train_model_version, train_scope = "pytorch-tabtransformerclassification-model", "*", "training"
  training_instance_type = "ml.p3.2xlarge"
  
  # Retrieve the docker image
  train_image_uri = image_uris.retrieve(
      region=None,
      framework=None,
      model_id=train_model_id,
      model_version=train_model_version,
      image_scope=train_scope,
      instance_type=training_instance_type
  )
  
  # Retrieve the training script
  train_source_uri = script_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, script_scope=train_scope
  )
  
  train_model_uri = model_uris.retrieve(
      model_id=train_model_id, model_version=train_model_version, model_scope=train_scope
  )
  
  # Sample training data is available in this bucket
  training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
  training_data_prefix = "training-datasets/tabular_binary"
  
  training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/train"
  validation_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}/validation"
  
  output_bucket = sess.default_bucket()
  output_prefix = "jumpstart-example-tabular-training"
  
  s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"
  
  from sagemaker import hyperparameters
  
  # Retrieve the default hyperparameters for training the model
  hyperparameters = hyperparameters.retrieve_default(
      model_id=train_model_id, model_version=train_model_version
  )
  
  # [Optional] Override default hyperparameters with custom values
  hyperparameters[
      "n_epochs"
  ] = "50"
  print(hyperparameters)
  
  from sagemaker.estimator import Estimator
  from sagemaker.utils import name_from_base
  
  training_job_name = name_from_base(f"built-in-algo-{train_model_id}-training")
  
  # Create SageMaker Estimator instance
  tabular_estimator = Estimator(
      role=aws_role,
      image_uri=train_image_uri,
      source_dir=train_source_uri,
      model_uri=train_model_uri,
      entry_point="transfer_learning.py",
      instance_count=1,
      instance_type=training_instance_type,
      max_run=360000,
      hyperparameters=hyperparameters,
      output_path=s3_output_location
  )
  
  # Launch a SageMaker Training job by passing the S3 path of the training data
  tabular_estimator.fit(
      {
          "training": training_dataset_s3_path,
          "validation": validation_dataset_s3_path,
      }, logs=True, job_name=training_job_name
  )
  ```

  For more information about how to set up the TabTransformer as a built-in algorithm, see the following notebook examples.
  + [Tabular classification with Amazon SageMaker AI TabTransformer algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/tabtransformer_tabular/Amazon_Tabular_Classification_TabTransformer.ipynb)
  + [Tabular regression with Amazon SageMaker AI TabTransformer algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/tabtransformer_tabular/Amazon_Tabular_Regression_TabTransformer.ipynb)

# Input and Output interface for the TabTransformer algorithm
<a name="InputOutput-TabTransformer"></a>

TabTransformer operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of TabTransformer supports CSV for training and inference:
+ For **Training ContentType**, valid inputs must be *text/csv*.
+ For **Inference ContentType**, valid inputs must be *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.   
For CSV inference, the algorithm assumes that the CSV input does not have the label column. 
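As a brief sketch of the layout the note above requires (the column names here are made up for illustration), the following writes a training CSV with the target first and no header record, using only the standard library:

```python
import csv
import io

# Hypothetical records; "label" is the target variable
records = [
    {"age": 34, "income": 72000, "label": 1},
    {"age": 51, "income": 48000, "label": 0},
]

buf = io.StringIO()
writer = csv.writer(buf)
for rec in records:
    # Target in the first column, features after it, no header record
    writer.writerow([rec["label"], rec["age"], rec["income"]])

csv_body = buf.getvalue()
```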

**Input format for training data, validation data, and categorical features**

Be mindful of how to format your training data for input to the TabTransformer model. You must provide the path to an Amazon S3 bucket that contains your training and validation data. You can also include a list of categorical features. Use both the `training` and `validation` channels to provide your input data. Alternatively, you can use only the `training` channel.

**Use both the `training` and `validation` channels**

You can provide your input data by way of two S3 paths, one for the `training` channel and one for the `validation` channel. Each S3 path can either be an S3 prefix that points to one or more CSV files or a full S3 path pointing to one specific CSV file. The target variable should be in the first column of your CSV file, with the predictor variables (features) in the remaining columns. If multiple CSV files are provided for the `training` or `validation` channels, the TabTransformer algorithm concatenates the files. The validation data is used to compute a validation score at the end of each training epoch, and early stopping is applied when the validation score stops improving.

If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your training data file or files. If you provide a JSON file for categorical features, your `training` channel must point to an S3 prefix and not a specific CSV file. This file should contain a Python dictionary where the key is the string `"cat_index_list"` and the value is a list of unique integers. Each integer in the value list should indicate the column index of the corresponding categorical feature in your training data CSV file. Each value should be a positive integer (greater than zero, because column zero holds the target), less than `Int32.MaxValue` (2147483647), and less than the total number of columns. There should only be one categorical index JSON file.
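A minimal sketch of creating the `categorical_index.json` file described above, assuming (hypothetically) that columns 2 and 5 of the training CSV hold categorical features:

```python
import json

# Column 0 holds the target, so valid categorical indices are positive integers
categorical_index = {"cat_index_list": [2, 5]}

# Serialize this and upload it as categorical_index.json under the same
# S3 prefix as the training CSV files
payload = json.dumps(categorical_index)
```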

**Use only the `training` channel**:

You can alternatively provide your input data by way of a single S3 path for the `training` channel. This S3 path should point to a directory with a subdirectory named `training/` that contains one or more CSV files. You can optionally include another subdirectory in the same location called `validation/` that also has one or more CSV files. If the validation data is not provided, then 20% of your training data is randomly sampled to serve as the validation data. If your predictors include categorical features, you can provide a JSON file named `categorical_index.json` in the same location as your data subdirectories.

**Note**  
For CSV training input mode, the total memory available to the algorithm (instance count multiplied by the memory available in the `InstanceType`) must be able to hold the training dataset.

## Amazon EC2 instance recommendation for the TabTransformer algorithm
<a name="Instance-TabTransformer"></a>

SageMaker AI TabTransformer supports single-instance CPU and single-instance GPU training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective. To take advantage of GPU training, specify the instance type as one of the GPU instances (for example, P3). SageMaker AI TabTransformer currently does not support multi-GPU training.

## TabTransformer sample notebooks
<a name="tabtransformer-sample-notebooks"></a>

The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker AI TabTransformer algorithm.



| **Notebook Title** | **Description** | 
| --- | --- | 
|  [Tabular classification with Amazon SageMaker AI TabTransformer algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/tabtransformer_tabular/Amazon_Tabular_Classification_TabTransformer.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI TabTransformer algorithm to train and host a tabular classification model.   | 
|  [Tabular regression with Amazon SageMaker AI TabTransformer algorithm](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/tabtransformer_tabular/Amazon_Tabular_Regression_TabTransformer.ipynb)  |  This notebook demonstrates the use of the Amazon SageMaker AI TabTransformer algorithm to train and host a tabular regression model.   | 

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How TabTransformer works
<a name="tabtransformer-HowItWorks"></a>

TabTransformer is a novel deep tabular data modeling architecture for supervised learning. The TabTransformer is built upon self-attention based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. Furthermore, the contextual embeddings learned from TabTransformer are highly robust against both missing and noisy data features, and provide better interpretability.

TabTransformer performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, distributions, and the diversity of hyperparameters that you can fine-tune. You can use TabTransformer for regression, classification (binary and multiclass), and ranking problems.

The following diagram illustrates the TabTransformer architecture.

![\[The architecture of TabTransformer.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/tabtransformer_illustration.png)


For more information, see *[TabTransformer: Tabular Data Modeling Using Contextual Embeddings](https://arxiv.org/abs/2012.06678)*.

# TabTransformer hyperparameters
<a name="tabtransformer-hyperparameters"></a>

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI TabTransformer algorithm. Users set these parameters to facilitate the estimation of model parameters from data. The SageMaker AI TabTransformer algorithm is an implementation of the open-source [TabTransformer](https://github.com/jrzaurin/pytorch-widedeep) package.

**Note**  
The default hyperparameters are based on example datasets in the [TabTransformer sample notebooks](tabtransformer.md#tabtransformer-sample-notebooks).

The SageMaker AI TabTransformer algorithm automatically chooses an evaluation metric and objective function based on the type of classification problem. The TabTransformer algorithm detects the type of classification problem based on the number of labels in your data. For regression problems, the evaluation metric is R squared and the objective function is mean squared error. For binary classification problems, the evaluation metric and objective function are both binary cross entropy. For multiclass classification problems, the evaluation metric and objective function are both multiclass cross entropy.

**Note**  
The TabTransformer evaluation metric and objective functions are not currently available as hyperparameters. Instead, the SageMaker AI TabTransformer built-in algorithm automatically detects the type of classification task (regression, binary, or multiclass) based on the number of unique integers in the label column and assigns an evaluation metric and objective function.
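As an illustrative sketch only (the built-in algorithm performs this detection internally, and the exact rule is not published beyond the description above), one plausible reading of the detection logic is:

```python
def detect_task_type(labels):
    """Guess the task type from the label column, following the rule in
    the note above: non-integer labels imply regression; otherwise the
    number of unique values decides binary vs. multiclass."""
    unique = set(labels)
    if any(value != int(value) for value in unique):
        return "regression"
    if len(unique) == 2:
        return "binary classification"
    return "multiclass classification"

print(detect_task_type([0, 1, 1, 0]))      # binary classification
print(detect_task_type([0, 1, 2, 3]))      # multiclass classification
print(detect_task_type([0.5, 1.25, 3.0]))  # regression
```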


| Parameter Name | Description | 
| --- | --- | 
| n\_epochs |  Number of epochs to train the deep neural network. Valid values: integer, range: Positive integer. Default value: `5`.  | 
| patience |  Training stops if no metric on the validation data improves within the last `patience` rounds. Valid values: integer, range: (`2`, `60`). Default value: `10`.  | 
| learning\_rate |  The rate at which the model weights are updated after working through each batch of training examples. Valid values: float, range: Positive floating point number. Default value: `0.001`.  | 
| batch\_size |  The number of examples propagated through the network. Valid values: integer, range: (`1`, `2048`). Default value: `256`.  | 
| input\_dim |  The dimension of embeddings to encode the categorical and/or continuous columns. Valid values: string, any of the following: `"16"`, `"32"`, `"64"`, `"128"`, `"256"`, or `"512"`. Default value: `"32"`.  | 
| n\_blocks |  The number of Transformer encoder blocks. Valid values: integer, range: (`1`, `12`). Default value: `4`.  | 
| attn\_dropout |  Dropout rate applied to the Multi-Head Attention layers. Valid values: float, range: (`0`, `1`). Default value: `0.2`.  | 
| mlp\_dropout |  Dropout rate applied to the FeedForward network within the encoder layers and the final MLP layers on top of the Transformer encoders. Valid values: float, range: (`0`, `1`). Default value: `0.1`.  | 
| frac\_shared\_embed |  The fraction of embeddings shared by all the different categories for one particular column. Valid values: float, range: (`0`, `1`). Default value: `0.25`.  | 

# Tune a TabTransformer model
<a name="tabtransformer-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. Model tuning focuses on the following hyperparameters: 

**Note**  
The learning objective function and evaluation metric are both automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column. For more information, see [TabTransformer hyperparameters](tabtransformer-hyperparameters.md).
+ A learning objective function to optimize during model training
+ An evaluation metric that is used to evaluate model performance during validation
+ A set of hyperparameters and a range of values for each to use when tuning the model automatically

Automatic model tuning searches your chosen hyperparameters to find the combination of values that results in a model that optimizes the chosen evaluation metric.

**Note**  
Automatic model tuning for TabTransformer is only available from the Amazon SageMaker SDKs, not from the SageMaker AI console.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Evaluation metrics computed by the TabTransformer algorithm
<a name="tabtransformer-metrics"></a>

The SageMaker AI TabTransformer algorithm computes the following metrics to use for model validation. The evaluation metric is automatically assigned based on the type of classification task, which is determined by the number of unique integers in the label column.


| Metric Name | Description | Optimization Direction | Regex Pattern | 
| --- | --- | --- | --- | 
| r2 | R squared | maximize | "metrics={'r2': (\\S+)}" | 
| f1\_score | binary cross entropy | maximize | "metrics={'f1': (\\S+)}" | 
| accuracy\_score | multiclass cross entropy | maximize | "metrics={'accuracy': (\\S+)}" | 

## Tunable TabTransformer hyperparameters
<a name="tabtransformer-tunable-hyperparameters"></a>

Tune the TabTransformer model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the TabTransformer evaluation metrics are: `learning_rate`, `input_dim`, `n_blocks`, `attn_dropout`, `mlp_dropout`, and `frac_shared_embed`. For a list of all the TabTransformer hyperparameters, see [TabTransformer hyperparameters](tabtransformer-hyperparameters.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| learning\_rate | ContinuousParameterRanges | MinValue: 0.001, MaxValue: 0.01 | 
| input\_dim | CategoricalParameterRanges | [16, 32, 64, 128, 256, 512] | 
| n\_blocks | IntegerParameterRanges | MinValue: 1, MaxValue: 12 | 
| attn\_dropout | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.8 | 
| mlp\_dropout | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.8 | 
| frac\_shared\_embed | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.5 | 

# XGBoost algorithm with Amazon SageMaker AI
<a name="xgboost"></a>

[XGBoost](https://github.com/dmlc/xgboost) (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that tries to accurately predict a target variable by combining multiple estimates from a set of simpler models. The XGBoost algorithm performs well in machine learning competitions for the following reasons:
+ Its robust handling of a variety of data types, relationships, and distributions.
+ The variety of hyperparameters that you can fine-tune.

You can use XGBoost for regression, classification (binary and multiclass), and ranking problems. 

You can use the new release of the XGBoost algorithm as either:
+ An Amazon SageMaker AI built-in algorithm.
+ A framework to run training scripts in your local environments.

This implementation has a smaller memory footprint, better logging, improved hyperparameter validation, and a larger set of metrics than the original versions. It provides an XGBoost `estimator` that runs a training script in a managed XGBoost environment. The current release of SageMaker AI XGBoost is based on the original XGBoost versions 1.0, 1.2, 1.3, 1.5, 1.7, and 3.0.

For more information about the Amazon SageMaker AI XGBoost algorithm, see the following blog posts:
+ [Introducing the open-source Amazon SageMaker AI XGBoost algorithm container](https://aws.amazon.com/blogs/machine-learning/introducing-the-open-source-amazon-sagemaker-xgboost-algorithm-container/)
+ [Amazon SageMaker AI XGBoost now offers fully distributed GPU training](https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-xgboost-now-offers-fully-distributed-gpu-training/)

## Supported versions
<a name="xgboost-supported-versions"></a>

For more details, see our [support policy](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-support-policy.html#pre-built-containers-support-policy-ml-framework).
+ Framework (open source) mode: 1.2-1, 1.2-2, 1.3-1, 1.5-1, 1.7-1, 3.0-5
+ Algorithm mode: 1.2-1, 1.2-2, 1.3-1, 1.5-1, 1.7-1, 3.0-5

**Warning**  
Due to required compute capacity, version 3.0-5 of SageMaker AI XGBoost is not compatible with GPU instances from the P3 instance family for training or inference.

**Warning**  
Due to package compatibility issues, version 3.0-5 of SageMaker AI XGBoost does not support SageMaker Debugger.

**Warning**  
Due to required compute capacity, version 1.7-1 of SageMaker AI XGBoost is not compatible with GPU instances from the P2 instance family for training or inference.

**Warning**  
Network Isolation Mode: Do not upgrade pip beyond version 25.2. Newer versions may attempt to fetch setuptools from PyPI during module installation.

**Important**  
When you retrieve the SageMaker AI XGBoost image URI, do not use `:latest` or `:1` for the image URI tag. You must specify one of the [Supported versions](#xgboost-supported-versions) to choose the SageMaker AI-managed XGBoost container with the native XGBoost package version that you want to use. To find the package version migrated into the SageMaker AI XGBoost containers, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html). Then choose your AWS Region, and navigate to the **XGBoost (algorithm)** section.

**Warning**  
The XGBoost 0.90 versions are deprecated. Support for security updates and bug fixes for XGBoost 0.90 has been discontinued. We highly recommend that you upgrade your XGBoost version to one of the newer versions.

**Note**  
XGBoost v1.1 is not supported on SageMaker AI. XGBoost 1.1 cannot run prediction when the test input has fewer features than the training data in LIBSVM inputs. This capability was restored in XGBoost v1.2. Consider using SageMaker AI XGBoost 1.2-2 or later.

**Note**  
You can use XGBoost v1.0-1, but it's not officially supported.

## EC2 instance recommendation for the XGBoost algorithm
<a name="Instance-XGBoost"></a>

SageMaker AI XGBoost supports CPU and GPU training and inference. Instance recommendations depend on training and inference needs, as well as the version of the XGBoost algorithm. Choose one of the following options for more information:
+ [CPU training](#Instance-XGBoost-training-cpu)
+ [GPU training](#Instance-XGBoost-training-gpu)
+ [Distributed CPU training](#Instance-XGBoost-distributed-training-cpu)
+ [Distributed GPU training](#Instance-XGBoost-distributed-training-gpu)
+ [Inference](#Instance-XGBoost-inference)

### Training
<a name="Instance-XGBoost-training"></a>

The SageMaker AI XGBoost algorithm supports CPU and GPU training.

#### CPU training
<a name="Instance-XGBoost-training-cpu"></a>

SageMaker AI XGBoost 1.0-1 or earlier only trains using CPUs. It is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M5) is a better choice than a compute-optimized instance (for example, C4). Further, we recommend that you have enough total memory in selected instances to hold the training data. It supports the use of disk space to handle data that does not fit into main memory. This is a result of the out-of-core feature available with the libsvm input mode. Even so, writing cache files onto disk slows the algorithm processing time. 

#### GPU training
<a name="Instance-XGBoost-training-gpu"></a>

SageMaker AI XGBoost version 1.2-2 or later supports GPU training. Despite higher per-instance costs, GPUs train more quickly, making them more cost effective. 

SageMaker AI XGBoost version 1.2-2 or later supports P2, P3, G4dn, and G5 GPU instance families.

SageMaker AI XGBoost version 1.7-1 or later supports P3, G4dn, and G5 GPU instance families. Note that due to compute capacity requirements, version 1.7-1 or later does not support the P2 instance family.

SageMaker AI XGBoost version 3.0-5 or later supports G4dn and G5 GPU instance families. Note that due to compute capacity requirements, version 3.0-5 or later does not support the P3 instance family.

To take advantage of GPU training:
+ Specify the instance type as one of the GPU instances (for example, G4dn) 
+ Set the `tree_method` hyperparameter to `gpu_hist` in your existing XGBoost script
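In a custom training script run in framework (script) mode, the second step might look like the following sketch. Only `tree_method` is prescribed above; the other parameter values are illustrative placeholders. (In newer open-source XGBoost releases, 2.0 and later, the equivalent is `tree_method="hist"` combined with `device="cuda"`.)

```python
# Illustrative native XGBoost training parameters for GPU training;
# only tree_method="gpu_hist" comes from the steps above
params = {
    "tree_method": "gpu_hist",   # GPU-accelerated histogram algorithm
    "objective": "binary:logistic",
    "max_depth": 6,
    "eta": 0.2,
}
```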

### Distributed training
<a name="Instance-XGBoost-distributed-training"></a>

SageMaker AI XGBoost supports CPU and GPU instances for distributed training.

#### Distributed CPU training
<a name="Instance-XGBoost-distributed-training-cpu"></a>

To run CPU training on multiple instances, set the `instance_count` parameter for the estimator to a value greater than one. The input data must be divided between the total number of instances. 

##### Divide input data across instances
<a name="Instance-XGBoost-distributed-training-divide-data"></a>

Divide the input data using the following steps:

1. Break the input data down into smaller files. The number of files should be at least equal to the number of instances used for distributed training. Using multiple smaller files as opposed to one large file also decreases the data download time for the training job.

1. When creating your [TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html), set the distribution parameter to `ShardedByS3Key`. With this, each instance gets approximately *1/n* of the number of files in S3 if there are *n* instances specified in the training job.
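The effect of `ShardedByS3Key` can be sketched in plain Python. This is only an illustration of the distribution behavior, not code you need to write, because SageMaker AI performs the assignment for you:

```python
def assign_shards(file_keys, instance_count):
    """Round-robin S3 object keys across instances, approximating how
    ShardedByS3Key gives each of n instances about 1/n of the files."""
    shards = [[] for _ in range(instance_count)]
    for i, key in enumerate(sorted(file_keys)):
        shards[i % instance_count].append(key)
    return shards

# Hypothetical file names: 10 shard files across 3 training instances
files = [f"train/part-{i:05d}.csv" for i in range(10)]
shards = assign_shards(files, 3)
print([len(s) for s in shards])  # [4, 3, 3]
```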

#### Distributed GPU training
<a name="Instance-XGBoost-distributed-training-gpu"></a>

You can use distributed training with either single-GPU or multi-GPU instances.

**Distributed training with single-GPU instances**

SageMaker AI XGBoost versions 1.2-2 through 1.3-1 only support single-GPU instance training. This means that even if you select a multi-GPU instance, only one GPU is used per instance.

You must divide your input data between the total number of instances if: 
+ You use XGBoost versions 1.2-2 through 1.3-1.
+ You do not need to use multi-GPU instances.

 For more information, see [Divide input data across instances](#Instance-XGBoost-distributed-training-divide-data).

**Note**  
Versions 1.2-2 through 1.3-1 of SageMaker AI XGBoost only use one GPU per instance even if you choose a multi-GPU instance.

**Distributed training with multi-GPU instances**

Starting with version 1.5-1, SageMaker AI XGBoost offers distributed GPU training with [Dask](https://www.dask.org/). With Dask you can utilize all GPUs when using one or more multi-GPU instances. Dask also works when using single-GPU instances. 

Train with Dask using the following steps:

1. Either omit the `distribution` parameter in your [TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html) or set it to `FullyReplicated`.

1. When defining your hyperparameters, set `use_dask_gpu_training` to `"true"`.
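Putting the two steps together, the configuration might include the following. Every value other than `use_dask_gpu_training` and `FullyReplicated` is a hypothetical placeholder:

```python
# Step 2: enable Dask-based distributed GPU training via hyperparameters
hyperparameters = {
    "use_dask_gpu_training": "true",
    "num_round": "100",  # hypothetical example value
}

# Step 1: the TrainingInput distribution parameter should be omitted or
# set to "FullyReplicated" (shown here as a plain value for illustration)
distribution = "FullyReplicated"
```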

**Important**  
Distributed training with Dask only supports CSV and Parquet input formats. If you use other data formats such as LIBSVM or PROTOBUF, the training job fails.   
For Parquet data, ensure that the column names are saved as strings. Columns that have names of other data types will fail to load.

**Important**  
Distributed training with Dask does not support pipe mode. If pipe mode is specified, the training job fails.

There are a few considerations to be aware of when training SageMaker AI XGBoost with Dask. Be sure to split your data into smaller files. Dask reads each Parquet file as a partition. There is a Dask worker for every GPU. As a result, the number of files should be greater than the total number of GPUs (instance count \* number of GPUs per instance). Having a very large number of files can also degrade performance. For more information, see [Dask Best Practices](https://docs.dask.org/en/stable/best-practices.html).

#### Variations in output
<a name="Instance-XGBoost-distributed-training-output"></a>

The specified `tree_method` hyperparameter determines the algorithm that is used for XGBoost training. The tree methods `approx`, `hist` and `gpu_hist` are all approximate methods and use sketching for quantile calculation. For more information, see [Tree Methods](https://xgboost.readthedocs.io/en/stable/treemethod.html) in the XGBoost documentation. Sketching is an approximate algorithm. Therefore, you can expect variations in the model depending on factors such as the number of workers chosen for distributed training. The significance of the variation is data-dependent.
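
The following toy example gives intuition for why partition-based approximation can vary with the number of workers. It is emphatically *not* XGBoost's quantile sketch algorithm (which merges weighted summaries with accuracy guarantees); it only shows that a statistic computed from per-worker partitions can differ from the exact global value:

```python
from statistics import median

def exact_median(values):
    return median(values)

def partitioned_median(values, num_workers):
    """Toy approximation: each worker computes a local median and the
    results are averaged. NOT how XGBoost sketching works; it only
    illustrates that partition-based estimates depend on how data is
    split across workers."""
    size = len(values) // num_workers
    parts = [values[i * size:(i + 1) * size] for i in range(num_workers)]
    return sum(median(p) for p in parts) / num_workers

data = [1, 1, 1, 1, 10, 10, 100, 1000]
# exact_median(data) == 5.5, but partitioned_median(data, 2) == 28.0:
# the estimate shifts with the data distribution across workers.
```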

### Inference
<a name="Instance-XGBoost-inference"></a>

SageMaker AI XGBoost supports CPU and GPU instances for inference. For information about the instance types for inference, see [Amazon SageMaker AI ML Instance Types](https://aws.amazon.com/sagemaker/pricing/).

# How to use SageMaker AI XGBoost
<a name="xgboost-how-to-use"></a>

With SageMaker AI, you can use XGBoost as a built-in algorithm or framework. When you use XGBoost as a framework, you have more flexibility and access to more advanced scenarios because you can customize your own training scripts. The following sections describe how to use XGBoost with the SageMaker Python SDK, and the input/output interface for the XGBoost algorithm. For information on how to use XGBoost from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).

**Topics**
+ [Use XGBoost as a framework](#xgboost-how-to-framework)
+ [Use XGBoost as a built-in algorithm](#xgboost-how-to-built-in)
+ [Input/Output interface for the XGBoost algorithm](#InputOutput-XGBoost)

## Use XGBoost as a framework
<a name="xgboost-how-to-framework"></a>

Use XGBoost as a framework to run your customized training scripts that can incorporate additional data processing into your training jobs. In the following code example, SageMaker Python SDK provides the XGBoost API as a framework. This functions similarly to how SageMaker AI provides other framework APIs, such as TensorFlow, MXNet, and PyTorch.

```
import boto3
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "verbosity":"1",
        "objective":"reg:squarederror",
        "num_round":"50"}

# set an output path where the trained model will be saved
bucket = sagemaker.Session().default_bucket()
prefix = 'DEMO-xgboost-as-a-framework'
output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'abalone-xgb-framework')

# construct a SageMaker AI XGBoost estimator
# specify the entry_point to your xgboost training script
estimator = XGBoost(entry_point = "your_xgboost_abalone_script.py", 
                    framework_version='1.7-1',
                    hyperparameters=hyperparameters,
                    role=sagemaker.get_execution_role(),
                    instance_count=1,
                    instance_type='ml.m5.2xlarge',
                    output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})
```

For an end-to-end example of using SageMaker AI XGBoost as a framework, see [Regression with Amazon SageMaker AI XGBoost](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone_dist_script_mode.html).

## Use XGBoost as a built-in algorithm
<a name="xgboost-how-to-built-in"></a>

Use the XGBoost built-in algorithm to build an XGBoost training container as shown in the following code example. You can automatically retrieve the XGBoost built-in algorithm image URI using the SageMaker AI `image_uris.retrieve` API. If using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) version 1, use the `get_image_uri` API. To make sure that the `image_uris.retrieve` API finds the correct URI, see [Common parameters for built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html). Then look up `xgboost` in the full list of built-in algorithm image URIs and available regions.

After specifying the XGBoost image URI, use the XGBoost container to construct an estimator using the SageMaker AI Estimator API and initiate a training job. This XGBoost built-in algorithm mode does not incorporate your own XGBoost training script and runs directly on the input datasets.

**Important**  
When you retrieve the SageMaker AI XGBoost image URI, do not use `:latest` or `:1` for the image URI tag. You must specify one of the [Supported versions](xgboost.md#xgboost-supported-versions) to choose the SageMaker AI-managed XGBoost container with the native XGBoost package version that you want to use. To find the package version migrated into the SageMaker AI XGBoost containers, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html). Then choose your AWS Region, and navigate to the **XGBoost (algorithm)** section.

```
import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.session import Session
from sagemaker.inputs import TrainingInput

# initialize hyperparameters
hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "objective":"reg:squarederror",
        "num_round":"50"}

# set an output path where the trained model will be saved
bucket = sagemaker.Session().default_bucket()
prefix = 'DEMO-xgboost-as-a-built-in-algo'
output_path = 's3://{}/{}/{}/output'.format(bucket, prefix, 'abalone-xgb-built-in-algo')

# this line automatically looks up the XGBoost image URI and builds an XGBoost container.
# specify the repo_version depending on your preference.
region = boto3.Session().region_name
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.7-1")

# construct a SageMaker AI estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)

# define the data type and paths to the training and validation datasets
content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), content_type=content_type)

# execute the XGBoost training job
estimator.fit({'train': train_input, 'validation': validation_input})
```

For more information about how to set up XGBoost as a built-in algorithm, see the following notebook examples.
+ [Managed Spot Training for XGBoost](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_managed_spot_training.html)
+ [Regression with Amazon SageMaker AI XGBoost (Parquet input)](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_parquet_input_training.html)

## Input/Output interface for the XGBoost algorithm
<a name="InputOutput-XGBoost"></a>

Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of XGBoost supports the following data formats for training and inference:
+  *text/libsvm* (default) 
+  *text/csv*
+  *application/x-parquet*
+  *application/x-recordio-protobuf*

**Note**  
There are a few considerations to be aware of regarding training and inference input:  
For increased performance, we recommend using XGBoost with *File mode*, in which your data from Amazon S3 is stored on the training instance volumes.
For training with columnar input, the algorithm assumes that the target variable (label) is the first column. For inference, the algorithm assumes that the input has no label column.
For CSV data, the input should not have a header record.
For LIBSVM training, the algorithm assumes that subsequent columns after the label column contain the zero-based index:value pairs for features. So each row has the format: <label> <index0>:<value0> <index1>:<value1>.
For information on instance types and distributed training, see [EC2 instance recommendation for the XGBoost algorithm](xgboost.md#Instance-XGBoost).
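
For concreteness, a row in the LIBSVM layout described in the note above can be produced like this (a minimal helper for illustration, not part of SageMaker AI):

```python
def to_libsvm_row(label, features):
    """Format one training row as LIBSVM: the label followed by
    zero-based index:value pairs for the (sparse) features."""
    pairs = " ".join(f"{idx}:{val}" for idx, val in sorted(features.items()))
    return f"{label} {pairs}"

# A row with label 1 and nonzero values in features 0 and 3:
row = to_libsvm_row(1, {0: 0.5, 3: 2.0})
# -> "1 0:0.5 3:2.0"
```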

For CSV training input mode, the total memory available to the algorithm must be able to hold the training dataset. The total memory available is calculated as `Instance Count * the memory available in the InstanceType`. For LIBSVM training input mode, this is not required, but we recommend it.
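
That memory requirement is a straightforward calculation, sketched below. The instance memory figure in the example is an assumption; check the current instance specifications for the actual value of your instance type:

```python
def fits_in_memory(dataset_gib, instance_count, memory_per_instance_gib):
    """For CSV input in File mode, the dataset must fit within the
    total cluster memory: instance_count * memory per instance."""
    total_memory_gib = instance_count * memory_per_instance_gib
    return dataset_gib <= total_memory_gib

# e.g. a 100 GiB CSV dataset on 4 instances with 32 GiB each:
# 4 * 32 = 128 GiB total, so the dataset fits.
```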

For v1.3-1 and later, SageMaker AI XGBoost saves the model in the XGBoost internal binary format, using `Booster.save_model`. Previous versions use the Python pickle module to serialize/deserialize the model.

**Note**  
Be mindful of versions when using a SageMaker AI XGBoost model in open source XGBoost. Versions 1.3-1 and later use the XGBoost internal binary format, while previous versions use the Python pickle module.

**To use a model trained with SageMaker AI XGBoost v1.3-1 or later in open source XGBoost**
+ Use the following Python code:

  ```
  import xgboost as xgb
  
  xgb_model = xgb.Booster()
  xgb_model.load_model(model_file_path)
  xgb_model.predict(dtest)
  ```

**To use a model trained with previous versions of SageMaker AI XGBoost in open source XGBoost**
+ Use the following Python code:

  ```
  import pickle as pkl 
  import tarfile
  
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()
  
  model = pkl.load(open(model_file_path, 'rb'))
  
  # prediction with test data
  pred = model.predict(dtest)
  ```

**To differentiate the importance of labeled data points, use instance weight support**
+ SageMaker AI XGBoost allows customers to differentiate the importance of labeled data points by assigning each instance a weight value. For *text/libsvm* input, customers can assign weight values to data instances by attaching them after the labels. For example, `label:weight idx_0:val_0 idx_1:val_1...`. For *text/csv* input, customers need to turn on the `csv_weights` flag in the parameters and attach weight values in the column after labels. For example: `label,weight,val_0,val_1,...`.
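
The two weighted-input layouts described above can be sketched as follows (hypothetical helpers for illustration only):

```python
def libsvm_weighted_row(label, weight, features):
    """LIBSVM input: the weight is attached to the label as
    label:weight, followed by index:value feature pairs."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label}:{weight} {pairs}"

def csv_weighted_row(label, weight, values):
    """CSV input (requires the csv_weights flag): the weight goes in
    the column immediately after the label."""
    return ",".join(str(x) for x in [label, weight, *values])

# libsvm_weighted_row(1, 0.5, {0: 1.2}) -> "1:0.5 0:1.2"
# csv_weighted_row(1, 0.5, [1.2, 3.4]) -> "1,0.5,1.2,3.4"
```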

# XGBoost sample notebooks
<a name="xgboost-sample-notebooks"></a>

The following list contains a variety of sample Jupyter notebooks that address different use cases of the Amazon SageMaker AI XGBoost algorithm.
+ [How to Create a Custom XGBoost container](https://sagemaker-examples.readthedocs.io/en/latest/aws_sagemaker_studio/sagemaker_studio_image_build/xgboost_bring_your_own/Batch_Transform_BYO_XGB.html) – This notebook shows you how to build a custom XGBoost container with Amazon SageMaker AI Batch Transform.
+ [Regression with XGBoost using Parquet](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_parquet_input_training.html) – This notebook shows you how to use the Abalone dataset in Parquet to train an XGBoost model.
+ [How to Train and Host a Multiclass Classification Model](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_mnist/xgboost_mnist.html) – This notebook shows how to use the MNIST dataset to train and host a multiclass classification model.
+ [How to train a Model for Customer Churn Prediction](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.html) – This notebook shows you how to train a model to predict mobile customer departure in an effort to identify unhappy customers.
+ [An Introduction to Amazon SageMaker AI Managed Spot infrastructure for XGBoost Training](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_managed_spot_training.html) – This notebook shows you how to use Spot Instances for training with an XGBoost container.
+ [How to use Amazon SageMaker Debugger to debug XGBoost Training Jobs](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/xgboost_census_explanations/xgboost-census-debugger-rules.html) – This notebook shows you how to use Amazon SageMaker Debugger to monitor training jobs to detect inconsistencies using built-in debugging rules.

For instructions on how to create and access Jupyter notebook instances that you can use to run these examples in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI samples. The example notebooks that use the XGBoost algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How the SageMaker AI XGBoost algorithm works
<a name="xgboost-HowItWorks"></a>

[XGBoost](https://github.com/dmlc/xgboost) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm, which attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models.

When using [gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting) for regression, the weak learners are regression trees, and each regression tree maps an input data point to one of its leaves that contains a continuous score. XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function (based on the difference between the predicted and target outputs) and a penalty term for model complexity (in other words, the regression tree functions). The training proceeds iteratively, adding new trees that predict the residuals or errors of prior trees that are then combined with previous trees to make the final prediction. It's called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
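
As a toy illustration of the iterative residual fitting described above, the sketch below uses constant learners (depth-0 "stumps") instead of real regression trees. This is not the XGBoost algorithm itself; it only demonstrates how adding learners fit to residuals, scaled by a learning rate, drives the combined prediction toward the target:

```python
def boost_constants(targets, num_rounds=10, eta=0.3):
    """Toy gradient boosting for squared error: each round fits a
    constant learner to the current residuals (the mean residual is
    the squared-error-optimal constant) and adds it, scaled by the
    learning rate eta, to the running prediction."""
    prediction = 0.0
    for _ in range(num_rounds):
        residuals = [y - prediction for y in targets]
        learner = sum(residuals) / len(residuals)  # fit to residuals
        prediction += eta * learner                # shrink and add
    return prediction

# The prediction approaches the target mean as rounds accumulate:
# boost_constants([2.0, 4.0, 6.0], num_rounds=50) is close to 4.0
```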

The following is a brief illustration of how gradient tree boosting works.

![\[A diagram illustrating gradient tree boosting.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/xgboost_illustration.png)


**For more detail on XGBoost, see:**
+ [XGBoost: A Scalable Tree Boosting System](https://arxiv.org/pdf/1603.02754.pdf)
+ [Gradient Tree Boosting ](https://www.sas.upenn.edu/~fdiebold/NoHesitations/BookAdvanced.pdf#page=380)
+ [Introduction to Boosted Trees](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)

# XGBoost hyperparameters
<a name="xgboost_hyperparameters"></a>

The following table contains the subset of hyperparameters that are required or most commonly used for the Amazon SageMaker AI XGBoost algorithm. These are parameters that are set by users to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order. The SageMaker AI XGBoost algorithm is an implementation of the open-source DMLC XGBoost package. For details about the full set of hyperparameters that can be configured for this version of XGBoost, see [ XGBoost Parameters](https://xgboost.readthedocs.io/en/release_1.2.0/).


| Parameter Name | Description | 
| --- | --- | 
| num\_class |  The number of classes. **Required** if `objective` is set to *multi:softmax* or *multi:softprob*. Valid values: Integer.  | 
| num\_round |  The number of rounds to run the training. **Required** Valid values: Integer.  | 
| alpha |  L1 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: Float. Default value: 0  | 
| base\_score |  The initial prediction score of all instances, global bias. **Optional** Valid values: Float. Default value: 0.5  | 
| booster |  Which booster to use. The `gbtree` and `dart` values use a tree-based model, while `gblinear` uses a linear function. **Optional** Valid values: String. One of `"gbtree"`, `"gblinear"`, or `"dart"`. Default value: `"gbtree"`  | 
| colsample\_bylevel |  Subsample ratio of columns for each split, in each level. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| colsample\_bynode |  Subsample ratio of columns from each node. **Optional** Valid values: Float. Range: (0,1]. Default value: 1  | 
| colsample\_bytree |  Subsample ratio of columns when constructing each tree. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| csv\_weights |  When this flag is enabled, XGBoost differentiates the importance of instances for csv input by taking the second column (the column after labels) in training data as the instance weights. **Optional** Valid values: 0 or 1 Default value: 0  | 
| deterministic\_histogram |  When this flag is enabled, XGBoost builds histograms on the GPU deterministically. Used only if `tree_method` is set to `gpu_hist`. For a full list of valid inputs, please refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: String. Range: `"true"` or `"false"`. Default value: `"true"`  | 
| early\_stopping\_rounds |  The model trains until the validation score stops improving. Validation error needs to decrease at least every `early_stopping_rounds` to continue training. SageMaker AI hosting uses the best model for inference. **Optional** Valid values: Integer. Default value: -  | 
| eta |  Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The `eta` parameter actually shrinks the feature weights to make the boosting process more conservative. **Optional** Valid values: Float. Range: [0,1]. Default value: 0.3  | 
| eval\_metric |  Evaluation metrics for validation data. A default metric is assigned according to the objective: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html) For a list of valid inputs, see [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: String. Default value: Default according to objective.  | 
| gamma |  Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm is. **Optional** Valid values: Float. Range: [0,∞). Default value: 0  | 
| grow\_policy |  Controls the way that new nodes are added to the tree. Currently supported only if `tree_method` is set to `hist`. **Optional** Valid values: String. Either `"depthwise"` or `"lossguide"`. Default value: `"depthwise"`  | 
| interaction\_constraints |  Specify groups of variables that are allowed to interact. **Optional** Valid values: Nested list of integers. Each integer represents a feature, and each nested list contains features that are allowed to interact, e.g., [[1,2], [3,4,5]]. Default value: None  | 
| lambda |  L2 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: Float. Default value: 1  | 
| lambda\_bias |  L2 regularization term on bias. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0  | 
| max\_bin |  Maximum number of discrete bins to bucket continuous features. Used only if `tree_method` is set to `hist`.  **Optional** Valid values: Integer. Default value: 256  | 
| max\_delta\_step |  Maximum delta step allowed for each tree's weight estimation. When a positive integer is used, it helps make the update more conservative. The preferred option is to use it in logistic regression. Set it to 1-10 to help control the update.  **Optional** Valid values: Integer. Range: [0,∞). Default value: 0  | 
| max\_depth |  Maximum depth of a tree. Increasing this value makes the model more complex and likely to be overfit. 0 indicates no limit. A limit is required when `grow_policy`=`depthwise`. **Optional** Valid values: Integer. Range: [0,∞) Default value: 6  | 
| max\_leaves |  Maximum number of nodes to be added. Relevant only if `grow_policy` is set to `lossguide`. **Optional** Valid values: Integer. Default value: 0  | 
| min\_child\_weight |  Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than `min_child_weight`, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is. **Optional** Valid values: Float. Range: [0,∞). Default value: 1  | 
| monotone\_constraints |  Specifies monotonicity constraints on any feature. **Optional** Valid values: Tuple of Integers. Valid integers: -1 (decreasing constraint), 0 (no constraint), 1 (increasing constraint).  E.g., (0, 1): No constraint on first predictor, and an increasing constraint on the second. (-1, 1): Decreasing constraint on first predictor, and an increasing constraint on the second. Default value: (0, 0)  | 
| normalize\_type |  Type of normalization algorithm. **Optional** Valid values: Either *tree* or *forest*. Default value: *tree*  | 
| nthread |  Number of parallel threads used to run *xgboost*. **Optional** Valid values: Integer. Default value: Maximum number of threads.  | 
| objective |  Specifies the learning task and the corresponding learning objective. Examples: `reg:logistic`, `multi:softmax`, `reg:squarederror`. For a full list of valid inputs, refer to [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: String Default value: `"reg:squarederror"`  | 
| one\_drop |  When this flag is enabled, at least one tree is always dropped during the dropout. **Optional** Valid values: 0 or 1 Default value: 0  | 
| process\_type |  The type of boosting process to run. **Optional** Valid values: String. Either `"default"` or `"update"`. Default value: `"default"`  | 
| rate\_drop |  The dropout rate that specifies the fraction of previous trees to drop during the dropout. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| refresh\_leaf |  This is a parameter of the 'refresh' updater plug-in. When set to `true` (1), tree leaves and tree node stats are updated. When set to `false`(0), only tree node stats are updated. **Optional** Valid values: 0/1 Default value: 1  | 
| sample\_type |  Type of sampling algorithm. **Optional** Valid values: Either `uniform` or `weighted`. Default value: `uniform`  | 
| scale\_pos\_weight |  Controls the balance of positive and negative weights. It's useful for unbalanced classes. A typical value to consider: `sum(negative cases)` / `sum(positive cases)`. **Optional** Valid values: Float Default value: 1  | 
| seed |  Random number seed. **Optional** Valid values: Integer Default value: 0  | 
| single\_precision\_histogram |  When this flag is enabled, XGBoost uses single precision to build histograms instead of double precision. Used only if `tree_method` is set to `hist` or `gpu_hist`. For a full list of valid inputs, please refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: String. Range: `"true"` or `"false"` Default value: `"false"`  | 
| sketch\_eps |  Used only for the approximate greedy algorithm. This translates into O(1 / `sketch_eps`) number of bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee of sketch accuracy. **Optional** Valid values: Float. Range: [0, 1]. Default value: 0.03  | 
| skip\_drop |  Probability of skipping the dropout procedure during a boosting iteration. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| subsample |  Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees. This prevents overfitting. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| tree\_method |  The tree construction algorithm used in XGBoost. **Optional** Valid values: One of `auto`, `exact`, `approx`, `hist`, or `gpu_hist`. Default value: `auto`  | 
| tweedie\_variance\_power |  Parameter that controls the variance of the Tweedie distribution. **Optional** Valid values: Float. Range: (1, 2). Default value: 1.5  | 
| updater |  A comma-separated string that defines the sequence of tree updaters to run. This provides a modular way to construct and to modify the trees. For a full list of valid inputs, please refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: Comma-separated string. Default value: `grow_colmaker`, prune  | 
| use\_dask\_gpu\_training |  Set `use_dask_gpu_training` to `"true"` if you want to run distributed GPU training with Dask. Dask GPU training is only supported for versions 1.5-1 and later. Do not set this value to `"true"` for versions preceding 1.5-1. For more information, see [Distributed GPU training](xgboost.md#Instance-XGBoost-distributed-training-gpu). **Optional** Valid values: String. Range: `"true"` or `"false"` Default value: `"false"`  | 
| verbosity | Verbosity of printing messages. Valid values: 0 (silent), 1 (warning), 2 (info), 3 (debug). **Optional** Default value: 1  | 

# Tune an XGBoost Model
<a name="xgboost-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. You choose three types of hyperparameters:
+ a learning `objective` function to optimize during model training
+ an `eval_metric` to use to evaluate model performance during validation
+ a set of hyperparameters and a range of values for each to use when tuning the model automatically

You choose the evaluation metric from the set of evaluation metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameters to find the combination of values that results in the model that optimizes the evaluation metric. 

**Note**  
Automatic model tuning for XGBoost 0.90 is only available from the Amazon SageMaker SDKs, not from the SageMaker AI console.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Evaluation Metrics Computed by the XGBoost Algorithm
<a name="xgboost-metrics"></a>

The XGBoost algorithm computes the following metrics to use for model validation. When tuning the model, choose one of these metrics to evaluate the model. For a full list of valid `eval_metric` values, refer to [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters).


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:accuracy |  Classification rate, calculated as #(right)/#(all cases).  |  Maximize  | 
| validation:auc |  Area under the curve.  |  Maximize  | 
| validation:error |  Binary classification error rate, calculated as #(wrong cases)/#(all cases).  |  Minimize  | 
| validation:f1 |  Indicator of classification accuracy, calculated as the harmonic mean of precision and recall.  |  Maximize  | 
| validation:logloss |  Negative log-likelihood.  |  Minimize  | 
| validation:mae |  Mean absolute error.  |  Minimize  | 
| validation:map |  Mean average precision.  |  Maximize  | 
| validation:merror |  Multiclass classification error rate, calculated as #(wrong cases)/#(all cases).  |  Minimize  | 
| validation:mlogloss |  Negative log-likelihood for multiclass classification.  |  Minimize  | 
| validation:mse |  Mean squared error.  |  Minimize  | 
| validation:ndcg |  Normalized Discounted Cumulative Gain.  |  Maximize  | 
| validation:rmse |  Root mean square error.  |  Minimize  | 
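
For intuition, two of the metrics above computed in plain Python (these are the standard textbook definitions, not SageMaker AI's implementation):

```python
import math

def rmse(y_true, y_pred):
    """validation:rmse - root mean square error (minimize)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def error_rate(y_true, y_pred):
    """validation:error - #(wrong cases)/#(all cases) for binary
    labels (minimize)."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# rmse([1.0, 2.0], [1.0, 4.0]) -> sqrt(4/2), about 1.414
# error_rate([0, 1, 1, 0], [0, 1, 0, 0]) -> 0.25
```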

## Tunable XGBoost Hyperparameters
<a name="xgboost-tunable-hyperparameters"></a>

Tune the XGBoost model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the XGBoost evaluation metrics are: `alpha`, `min_child_weight`, `subsample`, `eta`, and `num_round`. 


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| alpha |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| colsample\_bylevel |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 1  | 
| colsample\_bynode |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 1  | 
| colsample\_bytree |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 
| eta |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 0.5  | 
| gamma |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 5  | 
| lambda |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| max\_delta\_step |  IntegerParameterRanges  |  [0, 10]  | 
| max\_depth |  IntegerParameterRanges  |  [0, 10]  | 
| min\_child\_weight |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 120  | 
| num\_round |  IntegerParameterRanges  |  [1, 4000]  | 
| subsample |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 

# Deprecated Versions of XGBoost and their Upgrades
<a name="xgboost-previous-versions"></a>

This topic contains documentation for previous versions of Amazon SageMaker AI XGBoost that are still available but deprecated. It also provides instructions on how to upgrade deprecated versions of XGBoost, when possible, to more current versions.

**Topics**
+ [Upgrade XGBoost Version 0.90 to Version 1.5](xgboost-version-0.90.md)
+ [XGBoost Version 0.72](xgboost-72.md)

# Upgrade XGBoost Version 0.90 to Version 1.5
<a name="xgboost-version-0.90"></a>

If you are using the SageMaker Python SDK, to upgrade existing XGBoost 0.90 jobs to version 1.5, you must have version 2.x of the SDK installed and change the XGBoost `version` and `framework_version` parameters to 1.5-1. If you are using Boto3, you need to update the Docker image, and a few hyperparameters and learning objectives.

**Topics**
+ [Upgrade SageMaker AI Python SDK Version 1.x to Version 2.x](#upgrade-xgboost-version-0.90-sagemaker-python-sdk)
+ [Change the image tag to 1.5-1](#upgrade-xgboost-version-0.90-change-image-tag)
+ [Change Docker Image for Boto3](#upgrade-xgboost-version-0.90-boto3)
+ [Update Hyperparameters and Learning Objectives](#upgrade-xgboost-version-0.90-hyperparameters)

## Upgrade SageMaker AI Python SDK Version 1.x to Version 2.x
<a name="upgrade-xgboost-version-0.90-sagemaker-python-sdk"></a>

If you are still using version 1.x of the SageMaker Python SDK, you must upgrade to version 2.x of the SageMaker Python SDK. For information on the latest version of the SageMaker Python SDK, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). To install the latest version, run:

```
python -m pip install --upgrade sagemaker
```
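If you want to verify the installed SDK version programmatically, a minimal sketch such as the following works; the helper name is our own, and the only assumption is a standard dotted version string like the one exposed by `sagemaker.__version__`:

```python
# Check whether a SageMaker Python SDK version string is 2.x or later.
def is_sdk_v2_or_later(version: str) -> bool:
    major = int(version.split(".")[0])
    return major >= 2

print(is_sdk_v2_or_later("2.75.1"))  # True
print(is_sdk_v2_or_later("1.72.0"))  # False
```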

## Change the image tag to 1.5-1
<a name="upgrade-xgboost-version-0.90-change-image-tag"></a>

If you are using the SageMaker Python SDK with the XGBoost built-in algorithm, change the `version` parameter in `image_uris.retrieve`.

```
from sagemaker import image_uris
xgboost_container = image_uris.retrieve(framework="xgboost", region="us-west-2", version="1.5-1")

estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          role=sagemaker.get_execution_role(),
                                          instance_count=1, 
                                          instance_type='ml.m5.2xlarge', 
                                          volume_size=5, # 5 GB 
                                          output_path=output_path)
```

If you are using the SageMaker Python SDK and using XGBoost as a framework to run your customized training scripts, change the `framework_version` parameter in the XGBoost API.

```
estimator = XGBoost(entry_point = "your_xgboost_abalone_script.py", 
                    framework_version='1.5-1',
                    hyperparameters=hyperparameters,
                    role=sagemaker.get_execution_role(),
                    instance_count=1,
                    instance_type='ml.m5.2xlarge',
                    output_path=output_path)
```

`sagemaker.session.s3_input` in SageMaker Python SDK version 1.x has been renamed to `sagemaker.inputs.TrainingInput`. You must use `sagemaker.inputs.TrainingInput` as in the following example.

```
from sagemaker.inputs import TrainingInput

content_type = "libsvm"
train_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'train'), content_type=content_type)
validation_input = TrainingInput("s3://{}/{}/{}/".format(bucket, prefix, 'validation'), content_type=content_type)
```

 For the full list of SageMaker Python SDK version 2.x changes, see [Use Version 2.x of the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html). 

## Change Docker Image for Boto3
<a name="upgrade-xgboost-version-0.90-boto3"></a>

If you are using Boto3 to train or deploy your model, change the Docker image tag (`1`, `0.72`, `0.90-1`, or `0.90-2`) to `1.5-1`.

```
{
    "AlgorithmSpecification": {
        "TrainingImage": "746614075791.dkr.ecr.us-west-1.amazonaws.com/sagemaker-xgboost:1.5-1"
    }
    ...
}
```

If you are using the SageMaker Python SDK to retrieve the registry path, change the `version` parameter in `image_uris.retrieve`.

```
from sagemaker import image_uris
image_uris.retrieve(framework="xgboost", region="us-west-2", version="1.5-1")
```

## Update Hyperparameters and Learning Objectives
<a name="upgrade-xgboost-version-0.90-hyperparameters"></a>

The `silent` parameter has been deprecated and is no longer available in XGBoost 1.5 and later versions. Use `verbosity` instead. The `reg:linear` learning objective has also been deprecated in favor of `reg:squarederror`. Use `reg:squarederror` instead.

```
hyperparameters = {
    "verbosity": "2",
    "objective": "reg:squarederror",
    "num_round": "50",
    ...
}

estimator = sagemaker.estimator.Estimator(image_uri=xgboost_container, 
                                          hyperparameters=hyperparameters,
                                          ...)
```
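As a sketch of the mechanical change described above, a small helper (ours, not part of any SDK) can rewrite a 0.90-era hyperparameter dictionary for `1.5-1`:

```python
# Illustrative migration of deprecated XGBoost 0.90 hyperparameters:
# `silent` becomes `verbosity`, and `reg:linear` becomes `reg:squarederror`.
def migrate_hyperparameters(old):
    new = dict(old)
    if "silent" in new:
        # silent=1 meant suppress output; verbosity=0 is the closest equivalent.
        new["verbosity"] = "0" if new.pop("silent") == "1" else "2"
    if new.get("objective") == "reg:linear":
        new["objective"] = "reg:squarederror"
    return new

print(migrate_hyperparameters(
    {"silent": "1", "objective": "reg:linear", "num_round": "50"}))
```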

# XGBoost Version 0.72
<a name="xgboost-72"></a>

**Important**  
XGBoost 0.72 is deprecated by Amazon SageMaker AI. You can still use this old version of XGBoost (as a built-in algorithm) by pulling its image URI as shown in the following code samples. For XGBoost, the image URI ending with `:1` is for the old version.  

With version 1.x of the SageMaker Python SDK:

```
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri

xgb_image_uri = get_image_uri(boto3.Session().region_name, "xgboost", repo_version="1")
```

With version 2.x of the SageMaker Python SDK:

```
import boto3
from sagemaker import image_uris

xgb_image_uri = image_uris.retrieve("xgboost", boto3.Session().region_name, "1")
```
If you want to use newer versions, you have to explicitly specify the image URI tags (see [Supported versions](xgboost.md#xgboost-supported-versions)).

This previous release of the Amazon SageMaker AI XGBoost algorithm is based on the 0.72 release. [XGBoost](https://github.com/dmlc/xgboost) (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining the estimates of a set of simpler, weaker models. XGBoost has done remarkably well in machine learning competitions because it robustly handles a variety of data types, relationships, and distributions, and because of the large number of hyperparameters that can be tweaked and tuned for improved fits. This flexibility makes XGBoost a solid choice for problems in regression, classification (binary and multiclass), and ranking.

Customers should consider using the new release of [XGBoost algorithm with Amazon SageMaker AI](xgboost.md). They can use it as a SageMaker AI built-in algorithm or as a framework to run scripts in their local environments, as they typically would with, for example, the TensorFlow deep learning framework. The new implementation has a smaller memory footprint, better logging, improved hyperparameter validation, and an expanded set of metrics. The earlier implementation of XGBoost remains available to customers if they need to postpone migrating to the new version, but this previous implementation will remain tied to the 0.72 release of XGBoost.

## Input/Output Interface for the XGBoost Release 0.72
<a name="xgboost-72-InputOutput"></a>

Gradient boosting operates on tabular data, with the rows representing observations, one column representing the target variable or label, and the remaining columns representing features. 

The SageMaker AI implementation of XGBoost supports CSV and libsvm formats for training and inference:
+ For Training ContentType, valid inputs are *text/libsvm* (default) or *text/csv*.
+ For Inference ContentType, valid inputs are *text/libsvm* (default) or *text/csv*.

**Note**  
For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record. For CSV inference, the algorithm assumes that CSV input does not have the label column.   
For libsvm training, the algorithm assumes that the label is in the first column. Subsequent columns contain the zero-based index-value pairs for features. So each row has the format: `<label> <index0>:<value0> <index1>:<value1> ...` Inference requests for libsvm may or may not have labels in the libsvm format.

This differs from other SageMaker AI algorithms, which use the protobuf training input format to maintain greater consistency with standard XGBoost data formats.
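The libsvm row layout described in the note above can be parsed with a few lines of plain Python. This is an illustration of the format, not SageMaker AI code:

```python
# Parse one libsvm row: "<label> <index0>:<value0> <index1>:<value1> ..."
def parse_libsvm_row(row):
    parts = row.split()
    label = float(parts[0])
    # Feature indexes are zero-based integers; values are floats.
    features = {int(i): float(v) for i, v in (p.split(":") for p in parts[1:])}
    return label, features

label, features = parse_libsvm_row("1 0:0.5 3:2.25")
print(label, features)  # 1.0 {0: 0.5, 3: 2.25}
```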

For CSV training input mode, the total memory available to the algorithm (Instance Count \* the memory available in the `InstanceType`) must be able to hold the training dataset. For libsvm training input mode, it's not required, but we recommend it.

SageMaker AI XGBoost uses the Python pickle module to serialize/deserialize the model, which can be used for saving/loading the model.

**To use a model trained with SageMaker AI XGBoost in open source XGBoost**
+ Use the following Python code:

  ```
  import pickle as pkl 
  import tarfile
  import xgboost
  
  t = tarfile.open('model.tar.gz', 'r:gz')
  t.extractall()
  
  # The extracted archive contains the pickled model file. For the built-in
  # algorithm, this file is typically named 'xgboost-model'.
  model_file_path = 'xgboost-model'
  model = pkl.load(open(model_file_path, 'rb'))
  
  # prediction with test data (dtest is an xgboost.DMatrix of test features)
  pred = model.predict(dtest)
  ```

**To differentiate the importance of labelled data points, use instance weight supports**
+ SageMaker AI XGBoost allows customers to differentiate the importance of labelled data points by assigning each instance a weight value. For *text/libsvm* input, customers can assign weight values to data instances by attaching them after the labels. For example, `label:weight idx_0:val_0 idx_1:val_1...`. For *text/csv* input, customers need to turn on the `csv_weights` flag in the parameters and attach weight values in the column after labels. For example: `label,weight,val_0,val_1,...`.
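As an illustration of the two weighted layouts just described (the helper names are hypothetical, not part of the SageMaker AI API):

```python
# Build a weighted libsvm row: label:weight idx_0:val_0 idx_1:val_1 ...
def libsvm_weighted_row(label, weight, features):
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label}:{weight} {pairs}"

# Build a weighted CSV row: label,weight,val_0,val_1,... (requires csv_weights=1)
def csv_weighted_row(label, weight, values):
    return ",".join(str(x) for x in [label, weight, *values])

print(libsvm_weighted_row(1, 0.5, {0: 3.0, 2: 1.5}))  # 1:0.5 0:3.0 2:1.5
print(csv_weighted_row(1, 0.5, [3.0, 1.5]))           # 1,0.5,3.0,1.5
```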

## EC2 Instance Recommendation for the XGBoost Release 0.72
<a name="xgboost-72-Instance"></a>

SageMaker AI XGBoost currently only trains using CPUs. It is a memory-bound (as opposed to compute-bound) algorithm. So, a general-purpose compute instance (for example, M4) is a better choice than a compute-optimized instance (for example, C4). Further, we recommend that you have enough total memory in selected instances to hold the training data. Although it supports the use of disk space to handle data that does not fit into main memory (the out-of-core feature available with the libsvm input mode), writing cache files onto disk slows the algorithm processing time.

## XGBoost Release 0.72 Sample Notebooks
<a name="xgboost-72-sample-notebooks"></a>

For a sample notebook that shows how to use the latest version of SageMaker AI XGBoost as a built-in algorithm to train and host a regression model, see [Regression with Amazon SageMaker AI XGBoost algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone.html). To use the 0.72 version of XGBoost, change the version in the sample code to 0.72. For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The topic modeling example notebooks that use the XGBoost algorithms are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and select **Create copy**.

## XGBoost Release 0.72 Hyperparameters
<a name="xgboost-72-hyperparameters"></a>

The following table contains the hyperparameters for the XGBoost algorithm. These are parameters that are set by users to facilitate the estimation of model parameters from data. The required hyperparameters that must be set are listed first, in alphabetical order. The optional hyperparameters that can be set are listed next, also in alphabetical order. The SageMaker AI XGBoost algorithm is an implementation of the open-source XGBoost package. Currently SageMaker AI supports version 0.72. For more detail about hyperparameter configuration for this version of XGBoost, see [ XGBoost Parameters](https://xgboost.readthedocs.io/en/release_0.72/parameter.html).


| Parameter Name | Description | 
| --- | --- | 
| num\_class | The number of classes. **Required** if `objective` is set to *multi:softmax* or *multi:softprob*. Valid values: integer  | 
| num\_round | The number of rounds to run the training. **Required** Valid values: integer  | 
| alpha | L1 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: float Default value: 0  | 
| base\_score | The initial prediction score of all instances, global bias. **Optional** Valid values: float Default value: 0.5  | 
| booster | Which booster to use. The `gbtree` and `dart` values use a tree-based model, while `gblinear` uses a linear function. **Optional** Valid values: String. One of `gbtree`, `gblinear`, or `dart`. Default value: `gbtree`  | 
| colsample\_bylevel | Subsample ratio of columns for each split, in each level. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| colsample\_bytree | Subsample ratio of columns when constructing each tree. **Optional** Valid values: Float. Range: [0,1]. Default value: 1 | 
| csv\_weights | When this flag is enabled, XGBoost differentiates the importance of instances for csv input by taking the second column (the column after labels) in training data as the instance weights. **Optional** Valid values: 0 or 1 Default value: 0  | 
| early\_stopping\_rounds | The model trains until the validation score stops improving. Validation error needs to decrease at least every `early_stopping_rounds` to continue training. SageMaker AI hosting uses the best model for inference. **Optional** Valid values: integer Default value: -  | 
| eta | Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The `eta` parameter actually shrinks the feature weights to make the boosting process more conservative. **Optional** Valid values: Float. Range: [0,1]. Default value: 0.3  | 
| eval\_metric | Evaluation metrics for validation data. A default metric is assigned according to the objective:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/xgboost-72.html) For a list of valid inputs, see [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: string Default value: Default according to objective.  | 
| gamma | Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm is. **Optional** Valid values: Float. Range: [0,∞). Default value: 0  | 
| grow\_policy | Controls the way that new nodes are added to the tree. Currently supported only if `tree_method` is set to `hist`. **Optional** Valid values: String. Either `depthwise` or `lossguide`. Default value: `depthwise`  | 
| lambda | L2 regularization term on weights. Increasing this value makes models more conservative. **Optional** Valid values: float Default value: 1  | 
| lambda\_bias | L2 regularization term on bias. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0  | 
| max\_bin | Maximum number of discrete bins to bucket continuous features. Used only if `tree_method` is set to `hist`.  **Optional** Valid values: integer Default value: 256  | 
| max\_delta\_step | Maximum delta step allowed for each tree's weight estimation. When a positive integer is used, it helps make the update more conservative. The preferred option is to use it in logistic regression. Set it to 1-10 to help control the update.  **Optional** Valid values: Integer. Range: [0,∞). Default value: 0  | 
| max\_depth | Maximum depth of a tree. Increasing this value makes the model more complex and likely to be overfit. 0 indicates no limit. A limit is required when `grow_policy`=`depthwise`. **Optional** Valid values: Integer. Range: [0,∞) Default value: 6  | 
| max\_leaves | Maximum number of nodes to be added. Relevant only if `grow_policy` is set to `lossguide`. **Optional** Valid values: integer Default value: 0  | 
| min\_child\_weight | Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than `min_child_weight`, the building process gives up further partitioning. In linear regression models, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is. **Optional** Valid values: Float. Range: [0,∞). Default value: 1  | 
| normalize\_type | Type of normalization algorithm. **Optional** Valid values: Either *tree* or *forest*. Default value: *tree*  | 
| nthread | Number of parallel threads used to run *xgboost*. **Optional** Valid values: integer Default value: Maximum number of threads.  | 
| objective | Specifies the learning task and the corresponding learning objective. Examples: `reg:logistic`, `reg:linear`, `multi:softmax`. For a full list of valid inputs, refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters). **Optional** Valid values: string Default value: `reg:linear`  | 
| one\_drop | When this flag is enabled, at least one tree is always dropped during the dropout. **Optional** Valid values: 0 or 1 Default value: 0  | 
| process\_type | The type of boosting process to run. **Optional** Valid values: String. Either `default` or `update`. Default value: `default`  | 
| rate\_drop | The dropout rate that specifies the fraction of previous trees to drop during the dropout. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| refresh\_leaf | This is a parameter of the 'refresh' updater plug-in. When set to `true` (1), tree leaves and tree node stats are updated. When set to `false`(0), only tree node stats are updated. **Optional** Valid values: 0/1 Default value: 1  | 
| sample\_type | Type of sampling algorithm. **Optional** Valid values: Either `uniform` or `weighted`. Default value: `uniform`  | 
| scale\_pos\_weight | Controls the balance of positive and negative weights. It's useful for unbalanced classes. A typical value to consider: `sum(negative cases)` / `sum(positive cases)`. **Optional** Valid values: float Default value: 1  | 
| seed | Random number seed. **Optional** Valid values: integer Default value: 0  | 
| silent | 0 means print running messages, 1 means silent mode. Valid values: 0 or 1 **Optional** Default value: 0  | 
| sketch\_eps | Used only for the approximate greedy algorithm. This translates into O(1 / `sketch_eps`) number of bins. Compared to directly selecting the number of bins, this comes with a theoretical guarantee of sketch accuracy. **Optional** Valid values: Float, Range: [0, 1]. Default value: 0.03  | 
| skip\_drop | Probability of skipping the dropout procedure during a boosting iteration. **Optional** Valid values: Float. Range: [0.0, 1.0]. Default value: 0.0  | 
| subsample | Subsample ratio of the training instance. Setting it to 0.5 means that XGBoost randomly collects half of the data instances to grow trees. This prevents overfitting. **Optional** Valid values: Float. Range: [0,1]. Default value: 1  | 
| tree\_method | The tree construction algorithm used in XGBoost. **Optional** Valid values: One of `auto`, `exact`, `approx`, or `hist`. Default value: `auto`  | 
| tweedie\_variance\_power | Parameter that controls the variance of the Tweedie distribution. **Optional** Valid values: Float. Range: (1, 2). Default value: 1.5  | 
| updater | A comma-separated string that defines the sequence of tree updaters to run. This provides a modular way to construct and to modify the trees. For a full list of valid inputs, please refer to [XGBoost Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). **Optional** Valid values: comma-separated string. Default value: `grow_colmaker,prune`  | 
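The rule of thumb quoted for `scale_pos_weight` in the table above can be computed directly from the labels. This helper is illustrative only and assumes binary 0/1 labels:

```python
# Suggest scale_pos_weight as sum(negative cases) / sum(positive cases).
def suggest_scale_pos_weight(labels):
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return negatives / positives

print(suggest_scale_pos_weight([1, 0, 0, 0, 1, 0]))  # 2.0
```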

## Tune an XGBoost Release 0.72 Model
<a name="xgboost-72-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your training and validation datasets. You choose three types of hyperparameters:
+ a learning `objective` function to optimize during model training
+ an `eval_metric` to use to evaluate model performance during validation
+ a set of hyperparameters and a range of values for each to use when tuning the model automatically

You choose the evaluation metric from the set of evaluation metrics that the algorithm computes. Automatic model tuning searches over the chosen hyperparameters to find the combination of values that results in the model that optimizes the evaluation metric. 

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

### Metrics Computed by the XGBoost Release 0.72 Algorithm
<a name="xgboost-72-metrics"></a>

The XGBoost algorithm based on version 0.72 computes the following nine metrics to use for model validation. When tuning the model, choose one of these metrics to evaluate the model. For the full list of valid `eval_metric` values, refer to [XGBoost Learning Task Parameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst#learning-task-parameters).


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:auc |  Area under the curve.  |  Maximize  | 
| validation:error |  Binary classification error rate, calculated as \#(wrong cases)/\#(all cases).  |  Minimize  | 
| validation:logloss |  Negative log-likelihood.  |  Minimize  | 
| validation:mae |  Mean absolute error.  |  Minimize  | 
| validation:map |  Mean average precision.  |  Maximize  | 
| validation:merror |  Multiclass classification error rate, calculated as \#(wrong cases)/\#(all cases).  |  Minimize  | 
| validation:mlogloss |  Negative log-likelihood for multiclass classification.  |  Minimize  | 
| validation:ndcg |  Normalized Discounted Cumulative Gain.  |  Maximize  | 
| validation:rmse |  Root mean square error.  |  Minimize  | 
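Two of these metrics are simple enough to compute by hand. The following is a plain illustration of their definitions, not the algorithm's internal code:

```python
import math

# Binary/multiclass classification error: #(wrong cases) / #(all cases).
def classification_error(y_true, y_pred):
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Root mean square error.
def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

print(classification_error([1, 0, 1, 1], [1, 1, 1, 0]))  # 0.5
print(rmse([3.0, 5.0], [2.0, 6.0]))                      # 1.0
```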

### Tunable XGBoost Release 0.72 Hyperparameters
<a name="xgboost-72-tunable-hyperparameters"></a>

Tune the XGBoost model with the following hyperparameters. The hyperparameters that have the greatest effect on optimizing the XGBoost evaluation metrics are: `alpha`, `min_child_weight`, `subsample`, `eta`, and `num_round`. 


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| alpha |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| colsample\_bylevel |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 1  | 
| colsample\_bytree |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 
| eta |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 0.5  | 
| gamma |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 5  | 
| lambda |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 1000  | 
| max\_delta\_step |  IntegerParameterRanges  |  [0, 10]  | 
| max\_depth |  IntegerParameterRanges  |  [0, 10]  | 
| min\_child\_weight |  ContinuousParameterRanges  |  MinValue: 0, MaxValue: 120  | 
| num\_round |  IntegerParameterRanges  |  [1, 4000]  | 
| subsample |  ContinuousParameterRanges  |  MinValue: 0.5, MaxValue: 1  | 

# Built-in SageMaker AI Algorithms for Text Data
<a name="algorithms-text"></a>

SageMaker AI provides algorithms that are tailored to the analysis of textual documents used in natural language processing, document classification or summarization, topic modeling or classification, and language transcription or translation.
+ [BlazingText algorithm](blazingtext.md)—a highly optimized implementation of the Word2vec and text classification algorithms that scale to large datasets easily. It is useful for many downstream natural language processing (NLP) tasks.
+ [Latent Dirichlet Allocation (LDA) Algorithm](lda.md)—an algorithm suitable for determining topics in a set of documents. It is an *unsupervised algorithm*, which means that it doesn't use example data with answers during training.
+ [Neural Topic Model (NTM) Algorithm](ntm.md)—another unsupervised technique for determining topics in a set of documents, using a neural network approach.
+ [Object2Vec Algorithm](object2vec.md)—a general-purpose neural embedding algorithm that can be used for recommendation systems, document classification, and sentence embeddings.
+ [Sequence-to-Sequence Algorithm](seq-2-seq.md)—a supervised algorithm commonly used for neural machine translation. 
+ [Text Classification - TensorFlow](text-classification-tensorflow.md)—a supervised algorithm that supports transfer learning with available pretrained models for text classification. 


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| BlazingText | train | File or Pipe | Text file (one sentence per line with space-separated tokens)  | GPU (single instance only) or CPU | No | 
| LDA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU (single instance only) | No | 
| Neural Topic Model | train and (optionally) validation, test, or both | File or Pipe | recordIO-protobuf or CSV | GPU or CPU | Yes | 
| Object2Vec | train and (optionally) validation, test, or both | File | JSON Lines  | GPU or CPU (single instance only) | No | 
| Seq2Seq Modeling | train, validation, and vocab | File | recordIO-protobuf | GPU (single instance only) | No | 
| Text Classification - TensorFlow | training and validation | File | CSV | CPU or GPU | Yes (only across multiple GPUs on a single instance) | 

# BlazingText algorithm
<a name="blazingtext"></a>

The Amazon SageMaker AI BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms. The Word2vec algorithm is useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, and machine translation. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification.

The Word2vec algorithm maps words to high-quality distributed vectors. The resulting vector representation of a word is called a *word embedding*. Words that are semantically similar correspond to vectors that are close together. That way, word embeddings capture the semantic relationships between words. 
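The closeness of embeddings is usually measured with cosine similarity. The following toy sketch uses made-up 3-dimensional vectors purely to illustrate the idea that related words score higher than unrelated ones:

```python
import math

# Cosine similarity between two vectors of equal length.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embeddings (values are invented for illustration).
king, queen, banana = [0.9, 0.8, 0.1], [0.85, 0.82, 0.15], [0.1, 0.05, 0.9]
print(cosine(king, queen) > cosine(king, banana))  # True
```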

Many natural language processing (NLP) applications learn word embeddings by training on large collections of documents. These pretrained vector representations provide information about semantics and word distributions that typically improves the generalizability of other models that are later trained on a more limited amount of data. Most implementations of the Word2vec algorithm are not optimized for multi-core CPU architectures. This makes it difficult to scale to large datasets. 

With the BlazingText algorithm, you can scale to large datasets easily. Similar to Word2vec, it provides the Skip-gram and continuous bag-of-words (CBOW) training architectures. BlazingText's implementation of the supervised multi-class, multi-label text classification algorithm extends the fastText text classifier to use GPU acceleration with custom [CUDA ](https://docs.nvidia.com/cuda/index.html) kernels. You can train a model on more than a billion words in a couple of minutes using a multi-core CPU or a GPU. And, you achieve performance on par with the state-of-the-art deep learning text classification algorithms.

The BlazingText algorithm is not parallelizable. For more information on parameters related to training, see [ Docker Registry Paths for SageMaker Built-in Algorithms](https://docs.aws.amazon.com/en_us/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html).

The SageMaker AI BlazingText algorithm provides the following features:
+ Accelerated training of the fastText text classifier on multi-core CPUs or a GPU and Word2Vec on GPUs using highly optimized CUDA kernels. For more information, see [BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs](https://dl.acm.org/citation.cfm?doid=3146347.3146354).
+ [Enriched Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606) by learning vector representations for character n-grams. This approach enables BlazingText to generate meaningful vectors for out-of-vocabulary (OOV) words by representing their vectors as the sum of the character n-gram (subword) vectors.
+ A `batch_skipgram` mode for the Word2Vec algorithm that allows faster training and distributed computation across multiple CPU nodes. The `batch_skipgram` mode does mini-batching using the Negative Sample Sharing strategy to convert level-1 BLAS operations into level-3 BLAS operations. This efficiently leverages the multiply-add instructions of modern architectures. For more information, see [Parallelizing Word2Vec in Shared and Distributed Memory](https://arxiv.org/pdf/1604.04661.pdf).

To summarize, BlazingText supports the following modes on different instance types:


| Modes |  Word2Vec (Unsupervised Learning)  |  Text Classification (Supervised Learning)  | 
| --- | --- | --- | 
|  Single CPU instance  |  `cbow` `Skip-gram` `Batch Skip-gram`  |  `supervised`  | 
|  Single GPU instance (with 1 or more GPUs)  |  `cbow` `Skip-gram`  |  `supervised` with one GPU  | 
|  Multiple CPU instances  | `Batch Skip-gram`  | None | 

For more information about the mathematics behind BlazingText, see [BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs](https://dl.acm.org/citation.cfm?doid=3146347.3146354).

**Topics**
+ [Input/Output Interface for the BlazingText Algorithm](#bt-inputoutput)
+ [EC2 Instance Recommendation for the BlazingText Algorithm](#blazingtext-instances)
+ [BlazingText Sample Notebooks](#blazingtext-sample-notebooks)
+ [BlazingText Hyperparameters](blazingtext_hyperparameters.md)
+ [Tune a BlazingText Model](blazingtext-tuning.md)

## Input/Output Interface for the BlazingText Algorithm
<a name="bt-inputoutput"></a>

The BlazingText algorithm expects a single preprocessed text file with space-separated tokens. Each line in the file should contain a single sentence. If you need to train on multiple text files, concatenate them into one file and upload the file in the respective channel.
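A minimal way to do that concatenation locally (the file names here are hypothetical, and the demonstration uses temporary files):

```python
import os
import tempfile

# Concatenate several preprocessed text files into one training file,
# ensuring each sentence ends with a newline.
def concatenate(paths, out_path):
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as f:
                for line in f:
                    out.write(line if line.endswith("\n") else line + "\n")

# Tiny demonstration with temporary files:
tmp = tempfile.mkdtemp()
for name, text in [("a.txt", "first sentence"), ("b.txt", "second sentence")]:
    with open(os.path.join(tmp, name), "w") as f:
        f.write(text)
concatenate([os.path.join(tmp, "a.txt"), os.path.join(tmp, "b.txt")],
            os.path.join(tmp, "train.txt"))
print(open(os.path.join(tmp, "train.txt")).read())
```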

### Training and Validation Data Format
<a name="blazingtext-data-formats"></a>

#### Training and Validation Data Format for the Word2Vec Algorithm
<a name="blazingtext-data-formats-word2vec"></a>

For Word2Vec training, upload the file under the *train* channel. No other channels are supported. The file should contain a training sentence per line.

#### Training and Validation Data Format for the Text Classification Algorithm
<a name="blazingtext-data-formats-text-class"></a>

For supervised mode, you can train with file mode or with the augmented manifest text format.

##### Train with File Mode
<a name="blazingtext-data-formats-text-class-file-mode"></a>

For `supervised` mode, the training/validation file should contain a training sentence per line along with the labels. Labels are words that are prefixed by the string *\_\_label\_\_*. Here is an example of a training/validation file:

```
__label__4  linux ready for prime time , intel says , despite all the linux hype , the open-source movement has yet to make a huge splash in the desktop market . that may be about to change , thanks to chipmaking giant intel corp .

__label__2  bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly as the indian skippers return to international cricket was short lived .
```

**Note**  
The order of labels within the sentence doesn't matter. 

Upload the training file under the train channel, and optionally upload the validation file under the validation channel.
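A labelled training line like the ones above can be assembled as follows; the helper name is illustrative, not part of BlazingText:

```python
# Build a fastText-style supervised training line from labels and tokens.
def to_supervised_line(labels, tokens):
    prefix = " ".join(f"__label__{label}" for label in labels)
    return f"{prefix}  {' '.join(tokens)}"

print(to_supervised_line([4], ["linux", "ready", "for", "prime", "time"]))
```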

##### Train with Augmented Manifest Text Format
<a name="blazingtext-data-formats-text-class-augmented-manifest"></a>

Supervised mode for CPU instances also supports the augmented manifest format, which enables you to do training in pipe mode without needing to create RecordIO files. To use this format, generate an S3 manifest file that contains the list of sentences and their corresponding labels. The manifest file should be in [JSON Lines](http://jsonlines.org/) format, in which each line represents one sample. The sentences are specified using the `source` tag, and the label can be specified using the `label` tag. Both the `source` and `label` tags should be provided under the `AttributeNames` parameter value as specified in the request.

```
{"source":"linux ready for prime time , intel says , despite all the linux hype", "label":1}
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly", "label":2}
```

Multi-label training is also supported by specifying a JSON array of labels.

```
{"source":"linux ready for prime time , intel says , despite all the linux hype", "label": [1, 3]}
{"source":"bowled by the slower one again , kolkata , november 14 the past caught up with sourav ganguly", "label": [2, 4, 5]}
```

For more information on augmented manifest files, see [Augmented Manifest Files for Training Jobs](augmented-manifest.md).
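For example, a minimal script that writes such a manifest file might look like the following (the file name and samples here are illustrative):

```
import json

# Sketch: generate an augmented manifest (JSON Lines) file for supervised
# training. "source" and "label" would then be listed under AttributeNames
# in the training request. The file name and samples are illustrative.
samples = [
    {"source": "linux ready for prime time , intel says", "label": 1},
    {"source": "bowled by the slower one again", "label": [2, 4]},  # multi-label
]
with open("train.manifest", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```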

### Model Artifacts and Inference
<a name="blazingtext-artifacts-inference"></a>

#### Model Artifacts for the Word2Vec Algorithm
<a name="blazingtext--artifacts-inference-word2vec"></a>

For Word2Vec training, the model artifacts consist of *vectors.txt*, which contains the word-to-vector mapping, and *vectors.bin*, a binary used by BlazingText for hosting, inference, or both. *vectors.txt* stores the vectors in a format that is compatible with other tools like Gensim and spaCy. For example, a Gensim user can run the following commands to load the *vectors.txt* file:

```
from gensim.models import KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format('vectors.txt', binary=False)
word_vectors.most_similar(positive=['woman', 'king'], negative=['man'])
word_vectors.doesnt_match("breakfast cereal dinner lunch".split())
```

If the `evaluation` parameter is set to `True`, an additional file, *eval.json*, is created. This file contains the similarity evaluation results, computed using Spearman's rank correlation coefficient, on the WS-353 dataset. The number of words from the WS-353 dataset that aren't present in the training corpus is also reported.

For inference requests, the model accepts a JSON file containing a list of strings and returns a list of vectors. If a word is not found in the vocabulary, inference returns a vector of zeros. If `subwords` is set to `True` during training, the model is able to generate vectors for out-of-vocabulary (OOV) words.

##### Sample JSON Request
<a name="word2vec-json-request"></a>

Mime-type: `application/json`

```
{
"instances": ["word1", "word2", "word3"]
}
```

#### Model Artifacts for the Text Classification Algorithm
<a name="blazingtext-artifacts-inference-text-class"></a>

Training with supervised outputs creates a *model.bin* file that can be consumed by BlazingText hosting. For inference, the BlazingText model accepts a JSON file containing a list of sentences and returns a list of corresponding predicted labels and probability scores. Each sentence is expected to be a string with space-separated tokens, words, or both.

##### Sample JSON Request
<a name="text-class-json-request"></a>

Mime-type: `application/json`

```
{
 "instances": ["the movie was excellent", "i did not like the plot ."]
}
```

By default, the server returns only one prediction, the one with the highest probability. To retrieve the top *k* predictions, you can set *k* in the configuration, as follows:

```
{
 "instances": ["the movie was excellent", "i did not like the plot ."],
 "configuration": {"k": 2}
}
```
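As a hedged illustration, the request body above can be assembled and a response parsed as follows. The endpoint call itself is omitted, and the response values shown are made up, but they follow the label/probability shape described in this section:

```
import json

# Build the inference request body shown above.
payload = json.dumps({
    "instances": ["the movie was excellent", "i did not like the plot ."],
    "configuration": {"k": 2},
})
# With boto3, this body would be sent through the SageMaker runtime's
# invoke_endpoint(..., ContentType="application/json", Body=payload).

# Illustrative response for one sentence: labels sorted by probability.
response = '[{"label": ["__label__1", "__label__2"], "prob": [0.9, 0.1]}]'
predictions = json.loads(response)
top_label = predictions[0]["label"][0]
```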

For BlazingText, the `content-type` and `accept` parameters must be equal. For batch transform, they both need to be `application/jsonlines`. If they differ, the `Accept` field is ignored. The format for input follows:

```
content-type: application/jsonlines

{"source": "source_0"}
{"source": "source_1"}
```

If you need to pass the value of *k* for top-*k* retrieval, you can do it per record, in the following way:

```
{"source": "source_0", "k": 2}
{"source": "source_1", "k": 3}
```

The format for output follows:

```
accept: application/jsonlines

{"prob": [prob_1], "label": ["__label__1"]}
{"prob": [prob_1], "label": ["__label__1"]}
```

If you pass a value of *k* greater than 1, the response is in this format:

```
{"prob": [prob_1, prob_2], "label": ["__label__1", "__label__2"]}
{"prob": [prob_1, prob_2], "label": ["__label__1", "__label__2"]}
```
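A small sketch of consuming that batch transform output; the records below are illustrative stand-ins for a real output file:

```
import json

# Sketch: parse application/jsonlines batch-transform output and keep the
# top prediction for each record. Labels are returned sorted by probability,
# highest first. The records are illustrative only.
output_lines = [
    '{"prob": [0.97, 0.02], "label": ["__label__1", "__label__2"]}',
    '{"prob": [0.88, 0.10], "label": ["__label__2", "__label__1"]}',
]
top = []
for line in output_lines:
    record = json.loads(line)
    top.append((record["label"][0], record["prob"][0]))
```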

For both supervised (text classification) and unsupervised (Word2Vec) modes, the binaries (*.bin*) produced by BlazingText are interchangeable with fastText: you can use model binaries produced by BlazingText with fastText, and you can host model binaries created with fastText using BlazingText.

Here is an example of how to use a model generated with BlazingText with fastText:

```
# Download the model artifact from S3
aws s3 cp s3://<YOUR_S3_BUCKET>/<PREFIX>/model.tar.gz model.tar.gz

# Extract the model archive
tar -xzf model.tar.gz

# Use the extracted model binary with fastText
fasttext predict ./model.bin test.txt
```

However, the binaries are only supported when training on CPU or a single GPU; training on multiple GPUs does not produce binaries.

## EC2 Instance Recommendation for the BlazingText Algorithm
<a name="blazingtext-instances"></a>

For `cbow` and `skipgram` modes, BlazingText supports single CPU and single GPU instances. Both of these modes support learning of `subwords` embeddings. To achieve the highest speed without compromising accuracy, we recommend that you use an ml.p3.2xlarge instance. 

For `batch_skipgram` mode, BlazingText supports single or multiple CPU instances. When training on multiple instances, set the value of the `S3DataDistributionType` field of the [S3DataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html) object that you pass to [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) to `FullyReplicated`. BlazingText takes care of distributing data across machines.

For the supervised text classification mode, a C5 instance is recommended if the training dataset is less than 2 GB. For larger datasets, use an instance with a single GPU. BlazingText supports P2, P3, G4dn, and G5 instances for training and inference.

## BlazingText Sample Notebooks
<a name="blazingtext-sample-notebooks"></a>

For a sample notebook that trains and deploys the SageMaker AI BlazingText algorithm to generate word vectors, see [Learning Word2Vec Word Representations using BlazingText](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/blazingtext_word2vec_text8/blazingtext_word2vec_text8.html). For instructions on creating and accessing Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After creating and opening a notebook instance, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI examples. The example notebooks that use the BlazingText algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab, then choose **Create copy**.

# BlazingText Hyperparameters
<a name="blazingtext_hyperparameters"></a>

When you start a training job with a `CreateTrainingJob` request, you specify a training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The hyperparameters for the BlazingText algorithm depend on which mode you use: Word2Vec (unsupervised) or Text Classification (supervised).

## Word2Vec Hyperparameters
<a name="blazingtext_hyperparameters_word2vec"></a>

The following table lists the hyperparameters for the BlazingText Word2Vec training algorithm provided by Amazon SageMaker AI.


| Parameter Name | Description | 
| --- | --- | 
| mode |  The Word2vec architecture used for training. **Required** Valid values: `batch_skipgram`, `skipgram`, or `cbow`  | 
| batch\_size |  The size of each batch when `mode` is set to `batch_skipgram`. Set to a number between 10 and 20. **Optional** Valid values: Positive integer Default value: 11  | 
| buckets |  The number of hash buckets to use for subwords. **Optional** Valid values: Positive integer Default value: 2000000  | 
| epochs |  The number of complete passes through the training data. **Optional** Valid values: Positive integer Default value: 5  | 
| evaluation |  Whether the trained model is evaluated using the [WordSimilarity-353 Test](http://www.gabrilovich.com/resources/data/wordsim353/wordsim353.html). **Optional** Valid values: (Boolean) `True` or `False` Default value: `True`  | 
| learning\_rate |  The step size used for parameter updates. **Optional** Valid values: Positive float Default value: 0.05  | 
| min\_char |  The minimum number of characters to use for subwords/character n-grams. **Optional** Valid values: Positive integer Default value: 3  | 
| min\_count |  Words that appear less than `min_count` times are discarded. **Optional** Valid values: Non-negative integer Default value: 5  | 
| max\_char |  The maximum number of characters to use for subwords/character n-grams. **Optional** Valid values: Positive integer Default value: 6  | 
| negative\_samples |  The number of negative samples for the negative sample sharing strategy. **Optional** Valid values: Positive integer Default value: 5  | 
| sampling\_threshold |  The threshold for the occurrence of words. Words that appear with higher frequency in the training data are randomly down-sampled. **Optional** Valid values: Positive fraction. The recommended range is (0, 1e-3] Default value: 0.0001  | 
| subwords |  Whether to learn subword embeddings or not. **Optional** Valid values: (Boolean) `True` or `False` Default value: `False`  | 
| vector\_dim |  The dimension of the word vectors that the algorithm learns. **Optional** Valid values: Positive integer Default value: 100  | 
| window\_size |  The size of the context window. The context window is the number of words surrounding the target word used for training. **Optional** Valid values: Positive integer Default value: 5  | 

## Text Classification Hyperparameters
<a name="blazingtext_hyperparameters_text_class"></a>

The following table lists the hyperparameters for the Text Classification training algorithm provided by Amazon SageMaker AI.

**Note**  
Although some of the parameters are common between the Text Classification and Word2Vec modes, they might have different meanings depending on the context.


| Parameter Name | Description | 
| --- | --- | 
| mode |  The training mode. **Required** Valid values: `supervised`  | 
| buckets |  The number of hash buckets to use for word n-grams. **Optional** Valid values: Positive integer Default value: 2000000  | 
| early\_stopping |  Whether to stop training if validation accuracy doesn't improve after a `patience` number of epochs. Note that a validation channel is required if early stopping is used. **Optional** Valid values: (Boolean) `True` or `False` Default value: `False`  | 
| epochs |  The maximum number of complete passes through the training data. **Optional** Valid values: Positive integer Default value: 5  | 
| learning\_rate |  The step size used for parameter updates. **Optional** Valid values: Positive float Default value: 0.05  | 
| min\_count |  Words that appear less than `min_count` times are discarded. **Optional** Valid values: Non-negative integer Default value: 5  | 
| min\_epochs |  The minimum number of epochs to train before early stopping logic is invoked. **Optional** Valid values: Positive integer Default value: 5  | 
| patience |  The number of epochs to wait before applying early stopping when no progress is made on the validation set. Used only when `early_stopping` is `True`. **Optional** Valid values: Positive integer Default value: 4  | 
| vector\_dim |  The dimension of the embedding layer. **Optional** Valid values: Positive integer Default value: 100  | 
| word\_ngrams |  The number of word n-gram features to use. **Optional** Valid values: Positive integer Default value: 2  | 

# Tune a BlazingText Model
<a name="blazingtext-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the BlazingText Algorithm
<a name="blazingtext-metrics"></a>

The BlazingText Word2Vec algorithm (`skipgram`, `cbow`, and `batch_skipgram` modes) reports on a single metric during training: `train:mean_rho`. This metric is computed on [WS-353 word similarity datasets](https://aclweb.org/aclwiki/WordSimilarity-353_Test_Collection_(State_of_the_art)). When tuning the hyperparameter values for the Word2Vec algorithm, use this metric as the objective.

The BlazingText Text Classification algorithm (`supervised` mode) also reports a single metric during training: `validation:accuracy`. When tuning the hyperparameter values for the text classification algorithm, use this metric as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| train:mean\_rho |  The mean rho (Spearman's rank correlation coefficient) on [WS-353 word similarity datasets](http://alfonseca.org/pubs/ws353simrel.tar.gz)  |  Maximize  | 
| validation:accuracy |  The classification accuracy on the user-specified validation dataset  |  Maximize  | 
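The `train:mean_rho` objective is Spearman's rank correlation computed between the model's word-similarity scores and the human WS-353 judgments. As a rough illustration of the statistic itself (the score pairs below are made up, not real WS-353 data):

```
# Sketch: Spearman's rank correlation, the statistic behind train:mean_rho.
# The score pairs are illustrative; in training they would be model cosine
# similarities versus human similarity judgments.
def ranks(values):
    # Assign ranks 1..n by sorted order (no tie handling in this sketch).
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman_rho(xs, ys):
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid for untied ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

human = [9.8, 7.4, 3.1, 1.2]      # human similarity judgments
model = [0.91, 0.80, 0.35, 0.40]  # model cosine similarities
rho = spearman_rho(human, model)
```

A rho near 1 means the model ranks word pairs in nearly the same order as human annotators do.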

## Tunable BlazingText Hyperparameters
<a name="blazingtext-tunable-hyperparameters"></a>

### Tunable Hyperparameters for the Word2Vec Algorithm
<a name="blazingtext-tunable-hyperparameters-word2vec"></a>

Tune an Amazon SageMaker AI BlazingText Word2Vec model with the following hyperparameters. The hyperparameters that have the greatest impact on Word2Vec objective metrics are: `mode`, `learning_rate`, `window_size`, `vector_dim`, and `negative_samples`.


| Parameter Name | Parameter Type | Recommended Ranges or Values | 
| --- | --- | --- | 
| batch\_size |  `IntegerParameterRange`  |  [8-32]  | 
| epochs |  `IntegerParameterRange`  |  [5-15]  | 
| learning\_rate |  `ContinuousParameterRange`  |  MinValue: 0.005, MaxValue: 0.01  | 
| min\_count |  `IntegerParameterRange`  |  [0-100]  | 
| mode |  `CategoricalParameterRange`  |  [`'batch_skipgram'`, `'skipgram'`, `'cbow'`]  | 
| negative\_samples |  `IntegerParameterRange`  |  [5-25]  | 
| sampling\_threshold |  `ContinuousParameterRange`  |  MinValue: 0.0001, MaxValue: 0.001  | 
| vector\_dim |  `IntegerParameterRange`  |  [32-300]  | 
| window\_size |  `IntegerParameterRange`  |  [1-10]  | 

### Tunable Hyperparameters for the Text Classification Algorithm
<a name="blazingtext-tunable-hyperparameters-text_class"></a>

Tune an Amazon SageMaker AI BlazingText text classification model with the following hyperparameters.


| Parameter Name | Parameter Type | Recommended Ranges or Values | 
| --- | --- | --- | 
| buckets |  `IntegerParameterRange`  |  [1000000-10000000]  | 
| epochs |  `IntegerParameterRange`  |  [5-15]  | 
| learning\_rate |  `ContinuousParameterRange`  |  MinValue: 0.005, MaxValue: 0.01  | 
| min\_count |  `IntegerParameterRange`  |  [0-100]  | 
| vector\_dim |  `IntegerParameterRange`  |  [32-300]  | 
| word\_ngrams |  `IntegerParameterRange`  |  [1-3]  | 

# Latent Dirichlet Allocation (LDA) Algorithm
<a name="lda"></a>

The Amazon SageMaker AI Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. Since the method is unsupervised, the topics are not specified up front, and are not guaranteed to align with how a human may naturally categorize documents. The topics are learned as a probability distribution over the words that occur in each document. Each document, in turn, is described as a mixture of topics.

The exact content of two documents with similar topic mixtures will not be the same. But overall, you would expect these documents to use a shared subset of words more frequently than a document from a different topic mixture would. This allows LDA to discover these word groups and use them to form topics. As an extremely simple example, given a set of documents where the only words that occur within them are *eat*, *sleep*, *play*, *meow*, and *bark*, LDA might produce topics like the following:


| **Topic** | *eat* | *sleep*  | *play* | *meow* | *bark* | 
| --- | --- | --- | --- | --- | --- | 
| Topic 1  | 0.1  | 0.3  | 0.2  | 0.4  | 0.0  | 
| Topic 2  | 0.2  | 0.1 | 0.4  | 0.0  | 0.3  | 

You can infer that documents that are more likely to fall into Topic 1 are about cats (who are more likely to *meow* and *sleep*), and documents that fall into Topic 2 are about dogs (who prefer to *play* and *bark*). These topics can be found even though the words dog and cat never appear in any of the texts. 
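Under this model, a document's word probabilities are a mixture of the per-topic word distributions, weighted by the document's topic mixture. A small sketch using the table values above (the 90/10 mixture is an assumption for illustration):

```
# Sketch: score words under the two example topics from the table above.
# P(word | theta) = sum over topics of theta_k * beta_k[word].
topics = {
    "topic1": {"eat": 0.1, "sleep": 0.3, "play": 0.2, "meow": 0.4, "bark": 0.0},
    "topic2": {"eat": 0.2, "sleep": 0.1, "play": 0.4, "meow": 0.0, "bark": 0.3},
}

def word_prob(word, theta):
    """Mixture probability of a word given a topic mixture theta."""
    return sum(theta[t] * dist[word] for t, dist in topics.items())

# A mostly cat-like document (90% Topic 1, an illustrative mixture)
# assigns "meow" a high probability: 0.9 * 0.4 + 0.1 * 0.0 = 0.36.
theta_cat = {"topic1": 0.9, "topic2": 0.1}
p_meow = word_prob("meow", theta_cat)
```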

**Topics**
+ [Choosing between Latent Dirichlet Allocation (LDA) and Neural Topic Model (NTM)](#lda-or-ntm)
+ [Input/Output Interface for the LDA Algorithm](#lda-inputoutput)
+ [EC2 Instance Recommendation for the LDA Algorithm](#lda-instances)
+ [LDA Sample Notebooks](#LDA-sample-notebooks)
+ [How LDA Works](lda-how-it-works.md)
+ [LDA Hyperparameters](lda_hyperparameters.md)
+ [Tune an LDA Model](lda-tuning.md)

## Choosing between Latent Dirichlet Allocation (LDA) and Neural Topic Model (NTM)
<a name="lda-or-ntm"></a>

Topic models are commonly used to produce topics from corpuses that (1) coherently encapsulate semantic meaning and (2) describe documents well. As such, topic models aim to minimize perplexity and maximize topic coherence. 

Perplexity is an intrinsic language modeling evaluation metric that measures the inverse of the geometric mean per-word likelihood in your test data. A lower perplexity score indicates better generalization performance. Research has shown that the likelihood computed per word often does not align with human judgment, and can be entirely uncorrelated with it, so topic coherence has been introduced. Each inferred topic from your model consists of words, and topic coherence is computed over the top N words for that particular topic. It is often defined as the average or median of the pairwise word-similarity scores of the words in that topic, for example, Pointwise Mutual Information (PMI). A promising model generates coherent topics, that is, topics with high topic coherence scores. 

While the objective is to train a topic model that minimizes perplexity and maximizes topic coherence, there is often a tradeoff between the two with both LDA and NTM. Recent research by Amazon (Ding et al., 2018) has shown that NTM is promising for achieving high topic coherence, but LDA trained with collapsed Gibbs sampling achieves better perplexity. From a practical standpoint regarding hardware and compute power, SageMaker NTM is more flexible than LDA and can scale better, because NTM can run on CPUs and GPUs and can be parallelized across multiple GPU instances, whereas LDA supports only single-instance CPU training. 

## Input/Output Interface for the LDA Algorithm
<a name="lda-inputoutput"></a>

LDA expects data to be provided on the train channel, and optionally supports a test channel, which is scored by the final model. LDA supports both `recordIO-wrapped-protobuf` (dense and sparse) and `CSV` file formats. For `CSV`, the data must be dense and have dimension equal to *number of records* \* *vocabulary size*. LDA can be trained in File or Pipe mode when using recordIO-wrapped protobuf, but only in File mode for the `CSV` format.
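As a sketch of the dense `CSV` layout (the file name, documents, and vocabulary are illustrative), each row holds the word counts for one record, with one column per vocabulary word and no header row:

```
import csv
from collections import Counter

# Sketch: turn tokenized documents into the dense CSV layout LDA expects,
# one row per record with one count column per vocabulary word.
docs = [
    "eat sleep meow meow".split(),
    "play bark play eat".split(),
]
vocab = ["eat", "sleep", "play", "meow", "bark"]  # fixed column order

with open("lda_train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for doc in docs:
        counts = Counter(doc)
        writer.writerow([counts.get(w, 0) for w in vocab])
```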

For inference, `text/csv`, `application/json`, and `application/x-recordio-protobuf` content types are supported. Sparse data can also be passed for `application/json` and `application/x-recordio-protobuf`. LDA inference returns `application/json` or `application/x-recordio-protobuf` *predictions*, which include the `topic_mixture` vector for each observation.

For more detail on training and inference formats, see the [LDA Sample Notebooks](#LDA-sample-notebooks).

## EC2 Instance Recommendation for the LDA Algorithm
<a name="lda-instances"></a>

LDA currently only supports single-instance CPU training. CPU instances are recommended for hosting/inference.

## LDA Sample Notebooks
<a name="LDA-sample-notebooks"></a>

For a sample notebook that shows how to train the SageMaker AI Latent Dirichlet Allocation algorithm on a dataset and then deploy the trained model to perform inferences about the topic mixtures in input documents, see [An Introduction to SageMaker AI LDA](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/lda_topic_modeling/LDA-Introduction.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created and opened a notebook instance, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The topic modeling example notebooks that use the LDA algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab, then choose **Create copy**.

# How LDA Works
<a name="lda-how-it-works"></a>

Amazon SageMaker AI LDA is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of different categories. These categories are themselves a probability distribution over the features. LDA is a generative probability model, which means it attempts to provide a model for the distribution of outputs and inputs based on latent variables. This is opposed to discriminative models, which attempt to learn how inputs map to outputs.

You can use LDA for a variety of tasks, from clustering customers based on product purchases to automatic harmonic analysis in music. However, it is most commonly associated with topic modeling in text corpuses. Observations are referred to as documents. The feature set is referred to as vocabulary. A feature is referred to as a word. And the resulting categories are referred to as topics.

**Note**  
Lemmatization significantly increases algorithm performance and accuracy. Consider pre-processing any input text data. For more information, see [Stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html).

An LDA model is defined by two parameters:
+ α—A prior estimate on topic probability (in other words, the average frequency that each topic within a given document occurs). 
+ β—A collection of k topics where each topic is given a probability distribution over the vocabulary used in a document corpus, also called a "topic-word distribution."

LDA is a "bag-of-words" model, which means that the order of words does not matter. LDA is a generative model where each document is generated word-by-word by choosing a topic mixture θ ∼ Dirichlet(α). 

 For each word in the document: 
+  Choose a topic z ∼ Multinomial(θ) 
+  Choose the corresponding topic-word distribution β\_z. 
+  Draw a word w ∼ Multinomial(β\_z). 
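This generative story can be simulated directly. The sketch below uses toy values for α and for the topic-word distributions β (all illustrative), and samples the Dirichlet by normalizing independent Gamma draws:

```
import random

random.seed(0)

# Sketch: simulate LDA's generative process for one document using only the
# standard library. The vocabulary, topics, and alpha are toy values.
vocab = ["eat", "sleep", "play", "meow", "bark"]
beta = [  # one word distribution per topic (each row sums to 1)
    [0.1, 0.3, 0.2, 0.4, 0.0],
    [0.2, 0.1, 0.4, 0.0, 0.3],
]
alpha = [0.5, 0.5]

def sample_dirichlet(alpha):
    # Normalize independent Gamma draws to obtain a Dirichlet sample.
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words):
    theta = sample_dirichlet(alpha)  # topic mixture for this document
    doc = []
    for _ in range(n_words):
        z = random.choices(range(len(beta)), weights=theta)[0]  # choose a topic
        w = random.choices(vocab, weights=beta[z])[0]           # draw a word
        doc.append(w)
    return doc

doc = generate_document(6)
```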

When training the model, the goal is to find parameters α and β, which maximize the probability that the text corpus is generated by the model.

The most popular methods for estimating the LDA model use Gibbs sampling or Expectation Maximization (EM) techniques. The Amazon SageMaker AI LDA uses tensor spectral decomposition. This provides several advantages:
+  **Theoretical guarantees on results**. The standard EM-method is guaranteed to converge only to local optima, which are often of poor quality. 
+  **Embarrassingly parallelizable**. The work can be trivially divided over input documents in both training and inference. The EM-method and Gibbs Sampling approaches can be parallelized, but not as easily. 
+  **Fast**. Although the EM-method has a low iteration cost, it is prone to slow convergence rates. Gibbs sampling is also subject to slow convergence rates and requires a large number of samples. 

At a high-level, the tensor decomposition algorithm follows this process:

1.  The goal is to calculate the spectral decomposition of a **V** x **V** x **V** tensor, which summarizes the moments of the documents in our corpus. **V** is vocabulary size (in other words, the number of distinct words in all of the documents). The spectral components of this tensor are the LDA parameters α and β, which maximize the overall likelihood of the document corpus. However, because vocabulary size tends to be large, this **V** x **V** x **V** tensor is prohibitively large to store in memory. 

1.  Instead, the algorithm uses a **V** x **V** moment matrix, which is the two-dimensional analog of the tensor from step 1, to find a whitening matrix of dimension **V** x **k**. This matrix can be used to convert the **V** x **V** moment matrix into a **k** x **k** identity matrix. **k** is the number of topics in the model. 

1.  This same whitening matrix can then be used to find a smaller **k** x **k** x **k** tensor. When spectrally decomposed, this tensor has components that have a simple relationship with the components of the **V** x **V** x **V** tensor. 

1.  *Alternating Least Squares* is used to decompose the smaller **k** x **k** x **k** tensor. This provides a substantial improvement in memory consumption and speed. The parameters α and β can be found by “unwhitening” these outputs in the spectral decomposition. 

After the LDA model’s parameters have been found, you can find the topic mixtures for each document. You use stochastic gradient descent to maximize the likelihood function of observing a given topic mixture corresponding to these data.

Topic quality can be improved by increasing the number of topics to look for in training and then filtering out poor quality ones. This is in fact done automatically in SageMaker AI LDA: 25% more topics are computed and only the ones with largest associated Dirichlet priors are returned. To perform further topic filtering and analysis, you can increase the topic count and modify the resulting LDA model as follows:

```
import mxnet as mx
alpha, beta = mx.ndarray.load('model.tar.gz')
# modify alpha and beta
mx.nd.save('new_model.tar.gz', [new_alpha, new_beta])
# upload to S3 and create a new SageMaker model using the console
```

For more information about algorithms for LDA and the SageMaker AI implementation, see the following:
+ Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. *Tensor Decompositions for Learning Latent Variable Models*, Journal of Machine Learning Research, 15:2773–2832, 2014.
+  David M Blei, Andrew Y Ng, and Michael I Jordan. *Latent Dirichlet Allocation*. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
+  Thomas L Griffiths and Mark Steyvers. *Finding Scientific Topics*. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004. 
+  Tamara G Kolda and Brett W Bader. *Tensor Decompositions and Applications*. SIAM Review, 51(3):455–500, 2009. 

# LDA Hyperparameters
<a name="lda_hyperparameters"></a>

In the `CreateTrainingJob` request, you specify the training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the LDA training algorithm provided by Amazon SageMaker AI. For more information, see [How LDA Works](lda-how-it-works.md).


| Parameter Name | Description | 
| --- | --- | 
| num\_topics |  The number of topics for LDA to find within the data. **Required** Valid values: Positive integer  | 
| feature\_dim |  The size of the vocabulary of the input document corpus. **Required** Valid values: Positive integer  | 
| mini\_batch\_size |  The total number of documents in the input document corpus. **Required** Valid values: Positive integer  | 
| alpha0 |  Initial guess for the concentration parameter: the sum of the elements of the Dirichlet prior. Small values are more likely to generate sparse topic mixtures and large values (greater than 1.0) produce more uniform mixtures.  **Optional** Valid values: Positive float Default value: 1.0  | 
| max\_restarts |  The number of restarts to perform during the Alternating Least Squares (ALS) spectral decomposition phase of the algorithm. Can be used to find better quality local minima at the expense of additional computation, but typically should not be adjusted.  **Optional** Valid values: Positive integer Default value: 10  | 
| max\_iterations |  The maximum number of iterations to perform during the ALS phase of the algorithm. Can be used to find better quality minima at the expense of additional computation, but typically should not be adjusted.  **Optional** Valid values: Positive integer Default value: 1000  | 
| tol |  Target error tolerance for the ALS phase of the algorithm. Can be used to find better quality minima at the expense of additional computation, but typically should not be adjusted.  **Optional** Valid values: Positive float Default value: 1e-8  | 

# Tune an LDA Model
<a name="lda-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

LDA is an unsupervised topic modeling algorithm that attempts to describe a set of observations (documents) as a mixture of different categories (topics). The “per-word log-likelihood” (PWLL) metric measures the likelihood that a learned set of topics (an LDA model) accurately describes a test document dataset. Larger values of PWLL indicate that the test data is more likely to be described by the LDA model.
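As a toy illustration of PWLL (the topic mixture, topic-word distributions, and test document below are all made up), the metric is the average log-probability the model assigns to each test word:

```
import math

# Sketch: per-word log-likelihood (PWLL) on a toy "test document", using a
# mixture model of the kind LDA learns. All values are illustrative.
theta = [0.9, 0.1]  # inferred topic mixture for the test document
beta = [
    {"eat": 0.1, "sleep": 0.3, "play": 0.2, "meow": 0.4},
    {"eat": 0.2, "sleep": 0.1, "play": 0.4, "meow": 0.0},
]
test_doc = ["meow", "sleep", "eat"]

def pwll(doc):
    """Average log P(word | theta, beta) over the words in doc."""
    total = 0.0
    for w in doc:
        p = sum(t * b[w] for t, b in zip(theta, beta))
        total += math.log(p)
    return total / len(doc)

score = pwll(test_doc)  # always negative; closer to 0 (larger) is better
```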

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the LDA Algorithm
<a name="lda-metrics"></a>

The LDA algorithm reports on a single metric during training: `test:pwll`. When tuning a model, choose this metric as the objective metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:pwll | Per-word log-likelihood on the test dataset. The likelihood that the test dataset is accurately described by the learned LDA model. | Maximize | 

## Tunable LDA Hyperparameters
<a name="lda-tunable-hyperparameters"></a>

You can tune the following hyperparameters for the LDA algorithm. Both hyperparameters, `alpha0` and `num_topics`, can affect the LDA objective metric (`test:pwll`). If you don't already know the optimal values for these hyperparameters, which maximize per-word log-likelihood and produce an accurate LDA model, automatic model tuning can help find them.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| alpha0 | ContinuousParameterRanges | MinValue: 0.1, MaxValue: 10 | 
| num\_topics | IntegerParameterRanges | MinValue: 1, MaxValue: 150 | 

# Neural Topic Model (NTM) Algorithm
<a name="ntm"></a>

Amazon SageMaker AI NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into *topics* that contain word groupings based on their statistical distribution. Documents that contain frequent occurrences of words such as "bike", "car", "train", "mileage", and "speed" are likely to share a topic on "transportation" for example. Topic modeling can be used to classify or summarize documents based on the topics detected or to retrieve information or recommend content based on topic similarities. The topics from documents that NTM learns are characterized as a *latent representation* because the topics are inferred from the observed word distributions in the corpus. The semantics of topics are usually inferred by examining the top ranking words they contain. Because the method is unsupervised, only the number of topics, not the topics themselves, are prespecified. In addition, the topics are not guaranteed to align with how a human might naturally categorize documents.

Topic modeling provides a way to visualize the contents of a large document corpus in terms of the learned topics. Documents relevant to each topic might be indexed or searched for based on their soft topic labels. The latent representations of documents might also be used to find similar documents in the topic space. You can also use the latent representations of documents that the topic model learns for input to another supervised algorithm such as a document classifier. Because the latent representations of documents are expected to capture the semantics of the underlying documents, algorithms based in part on these representations are expected to perform better than those based on lexical features alone.

Although you can use both the Amazon SageMaker AI NTM and LDA algorithms for topic modeling, they are distinct algorithms and can be expected to produce different results on the same input data.

For more information on the mathematics behind NTM, see [Neural Variational Inference for Text Processing](https://arxiv.org/pdf/1511.06038.pdf).

**Topics**
+ [Input/Output Interface for the NTM Algorithm](#NTM-inputoutput)
+ [EC2 Instance Recommendation for the NTM Algorithm](#NTM-instances)
+ [NTM Sample Notebooks](#NTM-sample-notebooks)
+ [NTM Hyperparameters](ntm_hyperparameters.md)
+ [Tune an NTM Model](ntm-tuning.md)
+ [NTM Response Formats](ntm-in-formats.md)

## Input/Output Interface for the NTM Algorithm
<a name="NTM-inputoutput"></a>

Amazon SageMaker AI Neural Topic Model supports four data channels: train, validation, test, and auxiliary. The validation, test, and auxiliary data channels are optional. If you specify any of these optional channels, set the value of the `S3DataDistributionType` parameter for them to `FullyReplicated`. If you provide validation data, the loss on this data is logged at every epoch, and the model stops training as soon as it detects that the validation loss is not improving. If you don't provide validation data, the algorithm stops early based on the training data, but this can be less efficient. If you provide test data, the algorithm reports the test loss from the final model. 

The train, validation, and test data channels for NTM support both `recordIO-wrapped-protobuf` (dense and sparse) and `CSV` file formats. For `CSV` format, each row must be represented densely with zero counts for words not present in the corresponding document, so the data has dimension (number of records) \* (vocabulary size). You can use either File mode or Pipe mode to train models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`. The auxiliary channel is used to supply a text file that contains the vocabulary. By supplying the vocabulary file, users are able to see the top words for each of the topics printed in the log instead of their integer IDs. Having the vocabulary file also allows NTM to compute the Word Embedding Topic Coherence (WETC) scores, a metric displayed in the log that captures similarity among the top words in each topic. The `ContentType` for the auxiliary channel is `text/plain`, with each line containing a single word, in the order corresponding to the integer IDs provided in the data. The vocabulary file must be named `vocab.txt`, and currently only UTF-8 encoding is supported. 

For inference, `text/csv`, `application/json`, `application/jsonlines`, and `application/x-recordio-protobuf` content types are supported. Sparse data can also be passed for `application/json` and `application/x-recordio-protobuf`. NTM inference returns `application/json` or `application/x-recordio-protobuf` *predictions*, which include the `topic_weights` vector for each observation.
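As a small, hypothetical illustration of these request formats, the following sketch builds a single-document request body for the `text/csv` and `application/json` content types, assuming a 10-word vocabulary and the dense `instances`/`features` JSON layout used by SageMaker built-in algorithms:

```python
import json

# Hypothetical dense bag-of-words counts for one document;
# the row length must equal the vocabulary size (feature_dim).
doc_counts = [0, 2, 0, 1, 0, 0, 3, 0, 0, 1]

# text/csv: one densely represented document per line
csv_body = ",".join(str(c) for c in doc_counts)

# application/json: dense "instances"/"features" envelope
json_body = json.dumps({"instances": [{"features": doc_counts}]})

print(csv_body)
```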

See the [blog post](https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-neural-topic-model-now-supports-auxiliary-vocabulary-channel-new-topic-evaluation-metrics-and-training-subsampling/) for more details on using the auxiliary channel and the WETC scores. For more information on how to compute the WETC score, see [Coherence-Aware Neural Topic Modeling](https://arxiv.org/pdf/1809.02687.pdf). We used the pairwise WETC described in this paper for the Amazon SageMaker AI Neural Topic Model.

For more information on input and output file formats, see [NTM Response Formats](ntm-in-formats.md) for inference and the [NTM Sample Notebooks](#NTM-sample-notebooks).

## EC2 Instance Recommendation for the NTM Algorithm
<a name="NTM-instances"></a>

NTM training supports both GPU and CPU instance types. We recommend GPU instances, but for certain workloads, CPU instances may result in lower training costs. CPU instances should be sufficient for inference. NTM supports the P2, P3, G4dn, and G5 GPU instance families for training and inference.

## NTM Sample Notebooks
<a name="NTM-sample-notebooks"></a>

For a sample notebook that uses the SageMaker AI NTM algorithm to uncover topics in documents from a synthetic data source where the topic distributions are known, see [Introduction to Basic Functionality of NTM](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/ntm_synthetic/ntm_synthetic.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The topic modeling example notebooks that use the NTM algorithm are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab, and then select **Create copy**.

# NTM Hyperparameters
<a name="ntm_hyperparameters"></a>

The following table lists the hyperparameters that you can set for the Amazon SageMaker AI Neural Topic Model (NTM) algorithm.


| Parameter Name | Description | 
| --- | --- | 
|  `feature_dim`  |  The vocabulary size of the dataset. **Required** Valid values: Positive integer (min: 1, max: 1,000,000)  | 
| num\_topics |  The number of required topics. **Required** Valid values: Positive integer (min: 2, max: 1000)  | 
| batch\_norm |  Whether to use batch normalization during training. **Optional** Valid values: *true* or *false* Default value: *false*  | 
| clip\_gradient |  The maximum magnitude for each gradient component. **Optional** Valid values: Float (min: 1e-3) Default value: Infinity  | 
| encoder\_layers |  The number of layers in the encoder and the output size of each layer. When set to *auto*, the algorithm uses two layers of sizes 3 x `num_topics` and 2 x `num_topics` respectively.  **Optional** Valid values: Comma-separated list of positive integers or *auto* Default value: *auto*  | 
| encoder\_layers\_activation |  The activation function to use in the encoder layers. **Optional** Valid values:  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ntm_hyperparameters.html) Default value: `sigmoid`  | 
| epochs |  The maximum number of passes over the training data. **Optional** Valid values: Positive integer (min: 1) Default value: 50  | 
| learning\_rate |  The learning rate for the optimizer. **Optional** Valid values: Float (min: 1e-6, max: 1.0) Default value: 0.001  | 
| mini\_batch\_size |  The number of examples in each mini batch. **Optional** Valid values: Positive integer (min: 1, max: 10000) Default value: 256  | 
| num\_patience\_epochs |  The number of successive epochs over which the early stopping criterion is evaluated. Early stopping is triggered when the change in the loss function drops below the specified `tolerance` within the last `num_patience_epochs` number of epochs. To disable early stopping, set `num_patience_epochs` to a value larger than `epochs`. **Optional** Valid values: Positive integer (min: 1) Default value: 3  | 
| optimizer |  The optimizer to use for training. **Optional** Valid values: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/ntm_hyperparameters.html) Default value: `adadelta`  | 
| rescale\_gradient |  The rescale factor for the gradient. **Optional** Valid values: Float (min: 1e-3, max: 1.0) Default value: 1.0  | 
| sub\_sample |  The fraction of the training data to sample for training per epoch. **Optional** Valid values: Float (min: 0.0, max: 1.0) Default value: 1.0  | 
| tolerance |  The maximum relative change in the loss function. Early stopping is triggered when the change in the loss function drops below this value within the last `num_patience_epochs` number of epochs. **Optional** Valid values: Float (min: 1e-6, max: 0.1) Default value: 0.001  | 
| weight\_decay |  The weight decay coefficient. Adds L2 regularization. **Optional** Valid values: Float (min: 0.0, max: 1.0) Default value: 0.0  | 
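For example, the hyperparameters from the table might be supplied as a simple map (the values below are illustrative, and `feature_dim` must match your vocabulary size), for instance via the SageMaker Python SDK's `Estimator.set_hyperparameters`:

```python
# Illustrative NTM hyperparameters for a 20-topic model over a
# hypothetical 5,000-word vocabulary; names follow the table above.
ntm_hyperparameters = {
    "feature_dim": 5000,      # required: vocabulary size of the dataset
    "num_topics": 20,         # required: number of topics to learn
    "epochs": 50,
    "mini_batch_size": 256,
    "learning_rate": 0.001,
    "num_patience_epochs": 3, # early-stopping patience window
    "tolerance": 0.001,       # early-stopping loss-change threshold
}

# estimator.set_hyperparameters(**ntm_hyperparameters)  # with a configured estimator
```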

# Tune an NTM Model
<a name="ntm-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

Amazon SageMaker AI NTM is an unsupervised learning algorithm that learns latent representations of large collections of discrete data, such as a corpus of documents. Latent representations use inferred variables that are not directly measured to model the observations in a dataset. Automatic model tuning on NTM helps you find the model that minimizes loss over the training or validation data. *Training loss* measures how well the model fits the training data. *Validation loss* measures how well the model can generalize to data that it is not trained on. Low training loss indicates that a model is a good fit to the training data. Low validation loss indicates that a model has not overfit the training data and so should be able to successfully model documents on which it has not been trained. Usually, it's preferable to have both losses be small. However, minimizing training loss too much might result in overfitting and increase validation loss, which would reduce the generality of the model. 

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the NTM Algorithm
<a name="ntm-metrics"></a>

The NTM algorithm reports a single metric that is computed during training: `validation:total_loss`. The total loss is the sum of the reconstruction loss and Kullback-Leibler divergence. When tuning hyperparameter values, choose this metric as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:total\_loss |  Total loss on the validation set  |  Minimize  | 

## Tunable NTM Hyperparameters
<a name="ntm-tunable-hyperparameters"></a>

You can tune the following hyperparameters for the NTM algorithm. Usually setting low `mini_batch_size` and small `learning_rate` values results in lower validation losses, although it might take longer to train. Low validation losses don't necessarily produce more coherent topics as interpreted by humans. The effect of other hyperparameters on training and validation loss can vary from dataset to dataset. To see which values are compatible, see [NTM Hyperparameters](ntm_hyperparameters.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| encoder\_layers\_activation |  CategoricalParameterRanges  |  ['sigmoid', 'tanh', 'relu']  | 
| learning\_rate |  ContinuousParameterRanges  |  MinValue: 1e-4, MaxValue: 0.1  | 
| mini\_batch\_size |  IntegerParameterRanges  |  MinValue: 16, MaxValue: 2048  | 
| optimizer |  CategoricalParameterRanges  |  ['sgd', 'adam', 'adadelta']  | 
| rescale\_gradient |  ContinuousParameterRanges  |  MinValue: 0.1, MaxValue: 1.0  | 
| weight\_decay |  ContinuousParameterRanges  |  MinValue: 0.0, MaxValue: 1.0  | 

# NTM Response Formats
<a name="ntm-in-formats"></a>

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). This topic contains a list of the available output formats for the SageMaker AI NTM algorithm.

## JSON Response Format
<a name="ntm-json"></a>

```
{
    "predictions":    [
        {"topic_weights": [0.02, 0.1, 0,...]},
        {"topic_weights": [0.25, 0.067, 0,...]}
    ]
}
```

## JSONLINES Response Format
<a name="ntm-jsonlines"></a>

```
{"topic_weights": [0.02, 0.1, 0,...]}
{"topic_weights": [0.25, 0.067, 0,...]}
```
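A JSONLINES response like the one above can be consumed line by line. The following sketch (with hypothetical weights over three topics) extracts the dominant topic for each document:

```python
import json

# Hypothetical JSONLINES response body: one JSON object per document.
response_body = (
    '{"topic_weights": [0.02, 0.1, 0.88]}\n'
    '{"topic_weights": [0.7, 0.2, 0.1]}'
)

# For each document, take the index of the largest topic weight.
dominant_topics = []
for line in response_body.splitlines():
    weights = json.loads(line)["topic_weights"]
    dominant_topics.append(max(range(len(weights)), key=weights.__getitem__))

print(dominant_topics)  # [2, 0]
```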

## RECORDIO Response Format
<a name="ntm-recordio"></a>

```
[
    Record = {
        features = {},
        label = {
            'topic_weights': {
                keys: [],
                values: [0.25, 0.067, 0, ...]  # float32
            }
        }
    },
    Record = {
        features = {},
        label = {
            'topic_weights': {
                keys: [],
                values: [0.25, 0.067, 0, ...]  # float32
            }
        }
    }  
]
```

# Object2Vec Algorithm
<a name="object2vec"></a>

The Amazon SageMaker AI Object2Vec algorithm is a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. The embeddings are learned so that the semantics of the relationship between pairs of objects in the original space are preserved in the embedding space. You can use the learned embeddings, for example, to efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in low-dimensional space. You can also use the embeddings as features of the corresponding objects in downstream supervised tasks, such as classification or regression. 

Object2Vec generalizes the well-known Word2Vec embedding technique for words that is optimized in the SageMaker AI [BlazingText algorithm](blazingtext.md). For a blog post that discusses how to apply Object2Vec to some practical use cases, see [Introduction to Amazon SageMaker AI Object2Vec](https://aws.amazon.com/blogs/machine-learning/introduction-to-amazon-sagemaker-object2vec/). 

**Topics**
+ [I/O Interface for the Object2Vec Algorithm](#object2vec-inputoutput)
+ [EC2 Instance Recommendation for the Object2Vec Algorithm](#object2vec--instances)
+ [Object2Vec Sample Notebooks](#object2vec-sample-notebooks)
+ [How Object2Vec Works](object2vec-howitworks.md)
+ [Object2Vec Hyperparameters](object2vec-hyperparameters.md)
+ [Tune an Object2Vec Model](object2vec-tuning.md)
+ [Data Formats for Object2Vec Training](object2vec-training-formats.md)
+ [Data Formats for Object2Vec Inference](object2vec-inference-formats.md)
+ [Encoder Embeddings for Object2Vec](object2vec-encoder-embeddings.md)

## I/O Interface for the Object2Vec Algorithm
<a name="object2vec-inputoutput"></a>

You can use Object2Vec on many input data types, including the following examples.


| Input Data Type | Example | 
| --- | --- | 
|  Sentence-sentence pairs  | "A soccer game with multiple males playing." and "Some men are playing a sport." | 
|  Labels-sequence pairs  | The genre tags of the movie "Titanic", such as "Romance" and "Drama", and its short description: "James Cameron's Titanic is an epic, action-packed romance set against the ill-fated maiden voyage of the R.M.S. Titanic. She was the most luxurious liner of her era, a ship of dreams, which ultimately carried over 1,500 people to their death in the ice cold waters of the North Atlantic in the early hours of April 15, 1912." | 
|  Customer-customer pairs  |  The customer ID of Jane and customer ID of Jackie.  | 
|  Product-product pairs  |  The product ID of football and product ID of basketball.  | 
|  Item review user-item pairs  |  A user's ID and the items she has bought, such as apple, pear, and orange.  | 

To transform the input data into the supported formats, you must preprocess it. Currently, Object2Vec natively supports two types of input: 
+ A discrete token, which is represented as a list containing a single `integer-id`. For example, `[10]`.
+ A sequence of discrete tokens, which is represented as a list of `integer-ids`. For example, `[0,12,10,13]`.

The object in each pair can be asymmetric. For example, the pairs can be (token, sequence) or (token, token) or (sequence, sequence). For token inputs, the algorithm supports simple embeddings as compatible encoders. For sequences of token vectors, the algorithm supports the following as encoders:
+ Average-pooled embeddings
+ Hierarchical convolutional neural networks (CNNs)
+ Multi-layered bidirectional long short-term memory networks (BiLSTMs)

The input label for each pair can be one of the following:
+ A categorical label that expresses the relationship between the objects in the pair 
+ A score that expresses the strength of the similarity between the two objects 

For categorical labels used in classification, the algorithm supports the cross-entropy loss function. For ratings/score-based labels used in regression, the algorithm supports the mean squared error (MSE) loss function. Specify these loss functions with the `output_layer` hyperparameter when you create the model training job.
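As a sketch of how such pairs might look on disk (see [Data Formats for Object2Vec Training](object2vec-training-formats.md) for the authoritative schema), the following builds JSON Lines records in which `in0` and `in1` hold the integer-id token lists for a pair, and `label` is a class index for classification or a score for regression. The token ids are hypothetical:

```python
import json

# Classification-style pair: categorical label (class index 1).
classification_record = {"label": 1, "in0": [774, 14, 21], "in1": [21, 10, 97]}

# Regression-style pair: a similarity score between a sequence and a token.
regression_record = {"label": 0.75, "in0": [774, 14, 21], "in1": [21]}

# One JSON object per line, as JSON Lines requires.
lines = "\n".join(json.dumps(r) for r in (classification_record, regression_record))
print(lines)
```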

## EC2 Instance Recommendation for the Object2Vec Algorithm
<a name="object2vec--instances"></a>

The type of Amazon Elastic Compute Cloud (Amazon EC2) instance that you use depends on whether you are training or running inference. 

When training a model using the Object2Vec algorithm on a CPU, start with an ml.m5.2xlarge instance. For training on a GPU, start with an ml.p2.xlarge instance. If the training takes too long on this instance, you can use a larger instance. Currently, the Object2Vec algorithm can train only on a single machine. However, it does offer support for multiple GPUs. Object2Vec supports P2, P3, G4dn, and G5 GPU instance families for training and inference.

For inference with a trained Object2Vec model that has a deep neural network, we recommend using an ml.p3.2xlarge GPU instance. Because GPU memory is scarce, the `INFERENCE_PREFERRED_MODE` environment variable can be specified to select whether the [GPU optimization: Classification or Regression](object2vec-inference-formats.md#object2vec-inference-gpu-optimize-classification) or the [GPU optimization: Encoder Embeddings](object2vec-encoder-embeddings.md#object2vec-inference-gpu-optimize-encoder-embeddings) inference network is loaded into the GPU.

## Object2Vec Sample Notebooks
<a name="object2vec-sample-notebooks"></a>
+ [Using Object2Vec to Encode Sentences into Fixed Length Embeddings](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/object2vec_sentence_similarity/object2vec_sentence_similarity.html)

# How Object2Vec Works
<a name="object2vec-howitworks"></a>

When using the Amazon SageMaker AI Object2Vec algorithm, you follow the standard workflow: process the data, train the model, and produce inferences. 

**Topics**
+ [Step 1: Process Data](#object2vec-step-1-data-preprocessing)
+ [Step 2: Train a Model](#object2vec-step-2-training-model)
+ [Step 3: Produce Inferences](#object2vec-step-3-inference)

## Step 1: Process Data
<a name="object2vec-step-1-data-preprocessing"></a>

During preprocessing, convert the data to the [JSON Lines](http://jsonlines.org/) text file format specified in [Data Formats for Object2Vec Training](object2vec-training-formats.md). To get the highest accuracy during training, also randomly shuffle the data before feeding it into the model. How you generate random permutations depends on the language. For Python, you could use `np.random.shuffle`; for Unix, `shuf`.

## Step 2: Train a Model
<a name="object2vec-step-2-training-model"></a>

The SageMaker AI Object2Vec algorithm has the following main components:
+ **Two input channels** – The input channels take a pair of objects of the same or different types as inputs, and pass them to independent and customizable encoders.
+ **Two encoders** – The two encoders, enc0 and enc1, convert each object into a fixed-length embedding vector. The encoded embeddings of the objects in the pair are then passed into a comparator.
+ **A comparator** – The comparator compares the embeddings in different ways and outputs scores that indicate the strength of the relationship between the paired objects. In the output score for a sentence pair, for example, 1 indicates a strong relationship between the sentences, and 0 represents a weak relationship. 

During training, the algorithm accepts pairs of objects and their relationship labels or scores as inputs. The objects in each pair can be of different types, as described earlier. If the inputs to both encoders are composed of the same token-level units, you can use a shared token embedding layer by setting the `tied_token_embedding_weight` hyperparameter to `True` when you create the training job. This is possible, for example, when comparing sentences that both have word token-level units. To generate negative samples at a specified rate, set the `negative_sampling_rate` hyperparameter to the desired ratio of negative to positive samples. This hyperparameter expedites learning how to discriminate between the positive samples observed in the training data and the negative samples that are not likely to be observed. 

Pairs of objects are passed through independent, customizable encoders that are compatible with the input types of the corresponding objects. The encoders convert each object in a pair into a fixed-length embedding vector of equal length. The pair of vectors is passed to a comparator operator, which assembles the vectors into a single vector using the value specified in the `comparator_list` hyperparameter. The assembled vector then passes through a multilayer perceptron (MLP) layer, which produces an output that the loss function compares with the labels that you provided. This comparison evaluates the strength of the relationship between the objects in the pair as predicted by the model. The following figure shows this workflow.

![\[Architecture of the Object2Vec Algorithm from Data Inputs to Scores\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/object2vec-training-image.png)


## Step 3: Produce Inferences
<a name="object2vec-step-3-inference"></a>

After the model is trained, you can use the trained encoder to preprocess input objects or to perform two types of inference:
+ To convert singleton input objects into fixed-length embeddings using the corresponding encoder
+ To predict the relationship label or score between a pair of input objects

The inference server automatically figures out which of the types is requested based on the input data. To get the embeddings as output, provide only one input. To predict the relationship label or score, provide both inputs in the pair.
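The two request shapes can be sketched as follows (token ids are hypothetical; see [Data Formats for Object2Vec Inference](object2vec-inference-formats.md) for the authoritative schema). Providing only one input requests that object's embedding; providing both requests the predicted relationship label or score:

```python
import json

# One input only: the server returns the embedding for that object.
embedding_request = {"instances": [{"in0": [774, 14, 21]}]}

# Both inputs: the server returns the relationship label or score.
pair_request = {"instances": [{"in0": [774, 14, 21], "in1": [21, 10, 97]}]}

embedding_body = json.dumps(embedding_request)
pair_body = json.dumps(pair_request)
```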

# Object2Vec Hyperparameters
<a name="object2vec-hyperparameters"></a>

In the `CreateTrainingJob` request, you specify the training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the Object2Vec training algorithm.


| Parameter Name | Description | 
| --- | --- | 
| enc0\_max\_seq\_len |  The maximum sequence length for the enc0 encoder. **Required** Valid values: 1 ≤ integer ≤ 5000  | 
| enc0\_vocab\_size |  The vocabulary size of enc0 tokens. **Required** Valid values: 2 ≤ integer ≤ 3000000  | 
| bucket\_width |  The allowed difference between data sequence length when bucketing is enabled. To enable bucketing, specify a non-zero value for this parameter. **Optional** Valid values: 0 ≤ integer ≤ 100 Default value: 0 (no bucketing)  | 
| comparator\_list |  A list used to customize the way in which two embeddings are compared. The Object2Vec comparator operator layer takes the encodings from both encoders as inputs and outputs a single vector. This vector is a concatenation of subvectors. The string values passed to the `comparator_list` and the order in which they are passed determine how these subvectors are assembled. For example, if `comparator_list="hadamard, concat"`, then the comparator operator constructs the vector by concatenating the Hadamard product of the two encodings and the concatenation of the two encodings. If, on the other hand, `comparator_list="hadamard"`, then the comparator operator constructs the vector as the Hadamard product of the two encodings only.  **Optional** Valid values: A string that contains any combination of the names of the three binary operators: `hadamard`, `concat`, or `abs_diff`. The Object2Vec algorithm currently requires that the two vector encodings have the same dimension. These operators produce the subvectors as follows: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `"hadamard, concat, abs_diff"`  | 
| dropout |  The dropout probability for network layers. *Dropout* is a form of regularization used in neural networks that reduces overfitting by trimming codependent neurons. **Optional** Valid values: 0.0 ≤ float ≤ 1.0 Default value: 0.0  | 
| early\_stopping\_patience |  The number of consecutive epochs without improvement allowed before early stopping is applied. Improvement is defined by the `early_stopping_tolerance` hyperparameter. **Optional** Valid values: 1 ≤ integer ≤ 5 Default value: 3  | 
| early\_stopping\_tolerance |  The reduction in the loss function that an algorithm must achieve between consecutive epochs to avoid early stopping after the number of consecutive epochs specified in the `early_stopping_patience` hyperparameter concludes. **Optional** Valid values: 0.000001 ≤ float ≤ 0.1 Default value: 0.01  | 
| enc\_dim |  The dimension of the output of the embedding layer. **Optional** Valid values: 4 ≤ integer ≤ 10000 Default value: 4096  | 
| enc0\_network |  The network model for the enc0 encoder. **Optional** Valid values: `hcnn`, `bilstm`, or `pooled_embedding` [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `hcnn`  | 
| enc0\_cnn\_filter\_width |  The filter width of the convolutional neural network (CNN) enc0 encoder. **Conditional** Valid values: 1 ≤ integer ≤ 9 Default value: 3  | 
| enc0\_freeze\_pretrained\_embedding |  Whether to freeze enc0 pretrained embedding weights. **Conditional** Valid values: `True` or `False` Default value: `True`  | 
| enc0\_layers |  The number of layers in the enc0 encoder. **Conditional** Valid values: `auto` or 1 ≤ integer ≤ 4 [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `auto`  | 
| enc0\_pretrained\_embedding\_file |  The filename of the pretrained enc0 token embedding file in the auxiliary data channel. **Conditional** Valid values: String with alphanumeric characters, underscore, or period: [A-Za-z0-9\_\.]  Default value: "" (empty string)  | 
| enc0\_token\_embedding\_dim |  The output dimension of the enc0 token embedding layer. **Conditional** Valid values: 2 ≤ integer ≤ 1000 Default value: 300  | 
| enc0\_vocab\_file |  The vocabulary file for mapping pretrained enc0 token embedding vectors to numerical vocabulary IDs. **Conditional** Valid values: String with alphanumeric characters, underscore, or period: [A-Za-z0-9\_\.]  Default value: "" (empty string)  | 
| enc1\_network |  The network model for the enc1 encoder. If you want the enc1 encoder to use the same network model as enc0, including the hyperparameter values, set the value to `enc0`.   Even when the enc0 and enc1 encoder networks have symmetric architectures, you can't share parameter values for these networks.  **Optional** Valid values: `enc0`, `hcnn`, `bilstm`, or `pooled_embedding` [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `enc0`  | 
| enc1\_cnn\_filter\_width |  The filter width of the CNN enc1 encoder. **Conditional** Valid values: 1 ≤ integer ≤ 9 Default value: 3  | 
| enc1\_freeze\_pretrained\_embedding |  Whether to freeze enc1 pretrained embedding weights. **Conditional** Valid values: `True` or `False` Default value: `True`  | 
| enc1\_layers |  The number of layers in the enc1 encoder. **Conditional** Valid values: `auto` or 1 ≤ integer ≤ 4 [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `auto`  | 
| enc1\_max\_seq\_len |  The maximum sequence length for the enc1 encoder. **Conditional** Valid values: 1 ≤ integer ≤ 5000  | 
| enc1\_pretrained\_embedding\_file |  The name of the enc1 pretrained token embedding file in the auxiliary data channel. **Conditional** Valid values: String with alphanumeric characters, underscore, or period: [A-Za-z0-9\_\.]  Default value: "" (empty string)  | 
| enc1\_token\_embedding\_dim |  The output dimension of the enc1 token embedding layer. **Conditional** Valid values: 2 ≤ integer ≤ 1000 Default value: 300  | 
| enc1\_vocab\_file |  The vocabulary file for mapping pretrained enc1 token embeddings to vocabulary IDs. **Conditional** Valid values: String with alphanumeric characters, underscore, or period: [A-Za-z0-9\_\.]  Default value: "" (empty string)  | 
| enc1\_vocab\_size |  The vocabulary size of enc1 tokens. **Conditional** Valid values: 2 ≤ integer ≤ 3000000  | 
| epochs |  The number of epochs to run for training.  **Optional** Valid values: 1 ≤ integer ≤ 100 Default value: 30  | 
| learning\_rate |  The learning rate for training. **Optional** Valid values: 1.0E-6 ≤ float ≤ 1.0 Default value: 0.0004  | 
| mini\_batch\_size |  The batch size that the dataset is split into for an `optimizer` during training. **Optional** Valid values: 1 ≤ integer ≤ 10000 Default value: 32  | 
| mlp\_activation |  The type of activation function for the multilayer perceptron (MLP) layer. **Optional** Valid values: `tanh`, `relu`, or `linear` [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `linear`  | 
| mlp\_dim |  The dimension of the output from MLP layers. **Optional** Valid values: 2 ≤ integer ≤ 10000 Default value: 512  | 
| mlp\_layers |  The number of MLP layers in the network. **Optional** Valid values: 0 ≤ integer ≤ 10 Default value: 2  | 
| negative\_sampling\_rate |  The ratio of negative samples, generated to assist in training the algorithm, to positive samples that are provided by users. Negative samples represent data that is unlikely to occur in reality and are labeled negatively for training. They facilitate training a model to discriminate between the positive samples observed and the negative samples that are not. To specify the ratio of negative samples to positive samples used for training, set the value to a positive integer. For example, if you train the algorithm on input data in which all of the samples are positive and set `negative_sampling_rate` to 2, the Object2Vec algorithm internally generates two negative samples per positive sample. If you don't want to generate or use negative samples during training, set the value to 0.  **Optional** Valid values: 0 ≤ integer Default value: 0 (off)  | 
| num\_classes |  The number of classes for classification training. Amazon SageMaker AI ignores this hyperparameter for regression problems. **Optional** Valid values: 2 ≤ integer ≤ 30 Default value: 2  | 
| optimizer |  The optimizer type. **Optional** Valid values: `adadelta`, `adagrad`, `adam`, `sgd`, or `rmsprop`. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `adam`  | 
| output\_layer |  The type of output layer where you specify that the task is regression or classification. **Optional** Valid values: `softmax` or `mean_squared_error` [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) Default value: `softmax`  | 
| tied\_token\_embedding\_weight |  Whether to use a shared embedding layer for both encoders. If the inputs to both encoders use the same token-level units, use a shared token embedding layer. For example, for a collection of documents, if one encoder encodes sentences and another encodes whole documents, you can use a shared token embedding layer. That's because both sentences and documents are composed of word tokens from the same vocabulary. **Optional** Valid values: `True` or `False` Default value: `False`  | 
| token\_embedding\_storage\_type |  The mode of gradient update used during training: when the `dense` mode is used, the optimizer calculates the full gradient matrix for the token embedding layer even if most rows of the gradient are zero-valued. When `sparse` mode is used, the optimizer only stores rows of the gradient that are actually being used in the mini-batch. If you want the algorithm to perform lazy gradient updates, which calculate the gradients only in the non-zero rows and which speed up training, specify `row_sparse`. Setting the value to `row_sparse` constrains the values available for other hyperparameters, as follows:  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object2vec-hyperparameters.html) **Optional** Valid values: `dense` or `row_sparse` Default value: `dense`  | 
| weight\_decay |  The weight decay parameter used for optimization. **Optional** Valid values: 0 ≤ float ≤ 10000 Default value: 0 (no decay)  | 
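
When setting these hyperparameters programmatically, a quick range check can catch typos before a training job is launched. The following is a minimal sketch that validates candidate values against the documented ranges above; the `validate_hyperparameters` helper and the subset of hyperparameters covered are illustrative, not part of any SageMaker API.

```python
# Minimal validator for a subset of Object2Vec hyperparameters, using the
# documented valid ranges from the table above. Helper name is illustrative.

RANGES = {
    "epochs": (1, 100),
    "learning_rate": (1.0e-6, 1.0),
    "mini_batch_size": (1, 10000),
    "mlp_dim": (2, 10000),
    "mlp_layers": (0, 10),
    "num_classes": (2, 30),
}
CATEGORICAL = {
    "mlp_activation": {"tanh", "relu", "linear"},
    "optimizer": {"adadelta", "adagrad", "adam", "sgd", "rmsprop"},
    "output_layer": {"softmax", "mean_squared_error"},
}

def validate_hyperparameters(params):
    """Return a list of (name, value) pairs that fall outside the documented ranges."""
    bad = []
    for name, value in params.items():
        if name in RANGES:
            lo, hi = RANGES[name]
            if not (lo <= value <= hi):
                bad.append((name, value))
        elif name in CATEGORICAL and value not in CATEGORICAL[name]:
            bad.append((name, value))
    return bad
```

For example, `validate_hyperparameters({"epochs": 0})` flags `epochs` because the valid range starts at 1.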

# Tune an Object2Vec Model
<a name="object2vec-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. For the objective metric, you use one of the metrics that the algorithm computes. Automatic model tuning searches the chosen hyperparameters to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the Object2Vec Algorithm
<a name="object2vec-metrics"></a>

The Object2Vec algorithm has both classification and regression metrics. The `output_layer` type determines which metric you can use for automatic model tuning. 

### Regressor Metrics Computed by the Object2Vec Algorithm
<a name="object2vec-regressor-metrics"></a>

The algorithm reports a mean squared error regressor metric, which is computed during testing and validation. When tuning the model for regression tasks, choose this metric as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:mean\_squared\_error | The mean squared error | Minimize | 
| validation:mean\_squared\_error | The mean squared error | Minimize | 

### Classification Metrics Computed by the Object2Vec Algorithm
<a name="object2vec-classification-metrics"></a>

The Object2Vec algorithm reports accuracy and cross-entropy classification metrics, which are computed during test and validation. When tuning the model for classification tasks, choose one of these as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:accuracy | Accuracy | Maximize | 
| test:cross\_entropy | Cross-entropy | Minimize | 
| validation:accuracy | Accuracy | Maximize | 
| validation:cross\_entropy | Cross-entropy | Minimize | 

## Tunable Object2Vec Hyperparameters
<a name="object2vec-tunable-hyperparameters"></a>

You can tune the following hyperparameters for the Object2Vec algorithm.


| Hyperparameter Name | Hyperparameter Type | Recommended Ranges and Values | 
| --- | --- | --- | 
| dropout | ContinuousParameterRange | MinValue: 0.0, MaxValue: 1.0 | 
| early\_stopping\_patience | IntegerParameterRange | MinValue: 1, MaxValue: 5 | 
| early\_stopping\_tolerance | ContinuousParameterRange | MinValue: 0.001, MaxValue: 0.1 | 
| enc\_dim | IntegerParameterRange | MinValue: 4, MaxValue: 4096 | 
| enc0\_cnn\_filter\_width | IntegerParameterRange | MinValue: 1, MaxValue: 5 | 
| enc0\_layers | IntegerParameterRange | MinValue: 1, MaxValue: 4 | 
| enc0\_token\_embedding\_dim | IntegerParameterRange | MinValue: 5, MaxValue: 300 | 
| enc1\_cnn\_filter\_width | IntegerParameterRange | MinValue: 1, MaxValue: 5 | 
| enc1\_layers | IntegerParameterRange | MinValue: 1, MaxValue: 4 | 
| enc1\_token\_embedding\_dim | IntegerParameterRange | MinValue: 5, MaxValue: 300 | 
| epochs | IntegerParameterRange | MinValue: 4, MaxValue: 20 | 
| learning\_rate | ContinuousParameterRange | MinValue: 1e-6, MaxValue: 1.0 | 
| mini\_batch\_size | IntegerParameterRange | MinValue: 1, MaxValue: 8192 | 
| mlp\_activation | CategoricalParameterRanges |  [`tanh`, `relu`, `linear`]  | 
| mlp\_dim | IntegerParameterRange | MinValue: 16, MaxValue: 1024 | 
| mlp\_layers | IntegerParameterRange | MinValue: 1, MaxValue: 4 | 
| optimizer | CategoricalParameterRanges | [`adagrad`, `adam`, `rmsprop`, `sgd`, `adadelta`] | 
| weight\_decay | ContinuousParameterRange | MinValue: 0.0, MaxValue: 1.0 | 

# Data Formats for Object2Vec Training
<a name="object2vec-training-formats"></a>

When training with the Object2Vec algorithm, make sure that the input data in your request is in JSON Lines format, where each line represents a single data point.

## Input: JSON Lines Request Format
<a name="object2vec-in-training-data-jsonlines"></a>

Content-type: application/jsonlines

```
{"label": 0, "in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]}
{"label": 1, "in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4], "in1": [22, 32, 13, 25, 1016, 573, 3252, 4]}
{"label": 1, "in0": [774, 14, 21, 206], "in1": [21, 366, 125]}
```

The `"in0"` and `"in1"` fields are the inputs for encoder0 and encoder1, respectively. The same format is valid for both classification and regression problems. For regression, the `"label"` field can accept real-valued inputs.
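
As a sketch, a training file in this format can be produced with the standard `json` module; the token IDs and label values below are placeholders, not real vocabulary entries.

```python
import json

# Write labeled pairs of integer-token sequences as JSON Lines,
# one record per line, matching the training format shown above.
# The token IDs here are placeholder values.
records = [
    {"label": 0, "in0": [6, 17, 606], "in1": [16, 21, 13]},
    {"label": 1, "in0": [774, 14, 21, 206], "in1": [21, 366, 125]},
]
with open("train.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

Each line is a self-contained JSON object, so the file can be streamed record by record during training.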

# Data Formats for Object2Vec Inference
<a name="object2vec-inference-formats"></a>

The following page describes the input request and output response formats for getting scoring inference from the Amazon SageMaker AI Object2Vec model.

## GPU optimization: Classification or Regression
<a name="object2vec-inference-gpu-optimize-classification"></a>

Because GPU memory is limited, the `INFERENCE_PREFERRED_MODE` environment variable can be specified to choose whether the classification/regression inference network or the [Output: Encoder Embeddings](object2vec-encoder-embeddings.md#object2vec-out-encoder-embeddings-data) inference network is loaded into GPU memory. If the majority of your inference is for classification or regression, specify `INFERENCE_PREFERRED_MODE=classification`. The following is a batch transform example that uses four ml.p3.2xlarge instances and optimizes for classification/regression inference:

```
transformer = o2v.transformer(instance_count=4,
                              instance_type="ml.p3.2xlarge",
                              max_concurrent_transforms=2,
                              max_payload=1,  # 1MB
                              strategy='MultiRecord',
                              env={'INFERENCE_PREFERRED_MODE': 'classification'},  # only useful with GPU
                              output_path=output_s3_path)
```

## Input: Classification or Regression Request Format
<a name="object2vec-in-inference-data"></a>

Content-type: application/json

```
{
  "instances" : [
    {"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]},
    {"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4], "in1": [22, 32, 13, 25, 1016, 573, 3252, 4]},
    {"in0": [774, 14, 21, 206], "in1": [21, 366, 125]}
  ]
}
```

Content-type: application/jsonlines

```
{"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4], "in1": [16, 21, 13, 45, 14, 9, 80, 59, 164, 4]}
{"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4], "in1": [22, 32, 13, 25, 1016, 573, 3252, 4]}
{"in0": [774, 14, 21, 206], "in1": [21, 366, 125]}
```

For classification problems, the length of the scores vector corresponds to `num_classes`. For regression problems, the length is 1.
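
As a sketch, the request body can be assembled and the response parsed with standard JSON handling; the payload below reuses the example instances, the response string is a made-up illustration, and the actual endpoint invocation is omitted.

```python
import json

# Assemble a classification/regression request body in the JSON format above.
payload = json.dumps({
    "instances": [
        {"in0": [774, 14, 21, 206], "in1": [21, 366, 125]},
    ]
})

# Parse a response in the corresponding format. For classification the
# scores vector has num_classes entries; for regression it has one entry.
response_body = '{"predictions": [{"scores": [0.1, 0.6, 0.3]}]}'
scores = json.loads(response_body)["predictions"][0]["scores"]
predicted_class = max(range(len(scores)), key=scores.__getitem__)
```

With the illustrative response above, the highest score is at index 1, so that index is the predicted class.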

## Output: Classification or Regression Response Format
<a name="object2vec-out-inference-data"></a>

Accept: application/json

```
{
    "predictions": [
        {
            "scores": [
                0.6533935070037842,
                0.07582679390907288,
                0.2707797586917877
            ]
        },
        {
            "scores": [
                0.026291321963071823,
                0.6577019095420837,
                0.31600672006607056
            ]
        }
    ]
}
```

Accept: application/jsonlines

```
{"scores":[0.195667684078216,0.395351558923721,0.408980727195739]}
{"scores":[0.251988261938095,0.258233487606048,0.489778339862823]}
{"scores":[0.280087798833847,0.368331134319305,0.351581096649169]}
```

In both the classification and regression formats, the scores apply to individual labels. 

# Encoder Embeddings for Object2Vec
<a name="object2vec-encoder-embeddings"></a>

The following page lists the input request and output response formats for getting encoder embedding inference from the Amazon SageMaker AI Object2Vec model.

## GPU optimization: Encoder Embeddings
<a name="object2vec-inference-gpu-optimize-encoder-embeddings"></a>

An embedding is a mapping from discrete objects, such as words, to vectors of real numbers.

Because GPU memory is limited, the `INFERENCE_PREFERRED_MODE` environment variable can be specified to choose whether the classification/regression inference network (see [Data Formats for Object2Vec Inference](object2vec-inference-formats.md)) or the encoder embedding inference network is loaded into GPU memory. If the majority of your inference is for encoder embeddings, specify `INFERENCE_PREFERRED_MODE=embedding`. The following is a batch transform example that uses four ml.p3.2xlarge instances and optimizes for encoder embedding inference:

```
transformer = o2v.transformer(instance_count=4,
                              instance_type="ml.p3.2xlarge",
                              max_concurrent_transforms=2,
                              max_payload=1,  # 1MB
                              strategy='MultiRecord',
                              env={'INFERENCE_PREFERRED_MODE': 'embedding'},  # only useful with GPU
                              output_path=output_s3_path)
```

## Input: Encoder Embeddings
<a name="object2vec-in-encoder-embeddings-data"></a>

Content-type: application/json; infer\_max\_seqlens=<FWD-LENGTH>,<BCK-LENGTH>

Where <FWD-LENGTH> and <BCK-LENGTH> are integers in the range [1,5000] and define the maximum sequence lengths for the forward and backward encoder.

```
{
  "instances" : [
    {"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4]},
    {"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4]},
    {"in0": [774, 14, 21, 206]}
  ]
}
```

Content-type: application/jsonlines; infer\_max\_seqlens=<FWD-LENGTH>,<BCK-LENGTH>

Where <FWD-LENGTH> and <BCK-LENGTH> are integers in the range [1,5000] and define the maximum sequence lengths for the forward and backward encoder.

```
{"in0": [6, 17, 606, 19, 53, 67, 52, 12, 5, 10, 15, 10178, 7, 33, 652, 80, 15, 69, 821, 4]}
{"in0": [22, 1016, 32, 13, 25, 11, 5, 64, 573, 45, 5, 80, 15, 67, 21, 7, 9, 107, 4]}
{"in0": [774, 14, 21, 206]}
```

In both of these formats, you specify only one input type: `"in0"` or `"in1"`. The inference service then invokes the corresponding encoder and outputs the embeddings for each of the instances. 
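
As a sketch, the content-type value with the sequence-length attributes and a single-encoder JSON Lines body can be built as follows; the lengths and token IDs shown are arbitrary placeholder values.

```python
import json

# Build the content-type value with forward/backward maximum sequence
# lengths, each an integer in [1, 5000]. The values here are placeholders.
fwd_length, bck_length = 100, 100
content_type = f"application/jsonlines; infer_max_seqlens={fwd_length},{bck_length}"

# Build a JSON Lines body that supplies only one input type ("in0"),
# one instance per line, matching the request format above.
sequences = [[774, 14, 21, 206], [22, 32, 13]]
body = "\n".join(json.dumps({"in0": seq}) for seq in sequences)
```

The resulting `content_type` string and `body` would then be passed as the request headers and payload when invoking the endpoint.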

## Output: Encoder Embeddings
<a name="object2vec-out-encoder-embeddings-data"></a>

Content-type: application/json

```
{
  "predictions": [
    {"embeddings":[0.057368703186511,0.030703511089086,0.099890425801277,0.063688032329082,0.026327300816774,0.003637571120634,0.021305780857801,0.004316598642617,0.0,0.003397724591195,0.0,0.000378780066967,0.0,0.0,0.0,0.007419463712722]},
    {"embeddings":[0.150190666317939,0.05145975202322,0.098204270005226,0.064249359071254,0.056249320507049,0.01513972133398,0.047553978860378,0.0,0.0,0.011533712036907,0.011472506448626,0.010696629062294,0.0,0.0,0.0,0.008508535102009]}
  ]
}
```

Content-type: application/jsonlines

```
{"embeddings":[0.057368703186511,0.030703511089086,0.099890425801277,0.063688032329082,0.026327300816774,0.003637571120634,0.021305780857801,0.004316598642617,0.0,0.003397724591195,0.0,0.000378780066967,0.0,0.0,0.0,0.007419463712722]}
{"embeddings":[0.150190666317939,0.05145975202322,0.098204270005226,0.064249359071254,0.056249320507049,0.01513972133398,0.047553978860378,0.0,0.0,0.011533712036907,0.011472506448626,0.010696629062294,0.0,0.0,0.0,0.008508535102009]}
```

The vector length of the embeddings output by the inference service is equal to the value of one of the following hyperparameters that you specify at training time: `enc0_token_embedding_dim`, `enc1_token_embedding_dim`, or `enc_dim`.
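
A common downstream use of these embeddings is comparing objects by cosine similarity. The following is a minimal, dependency-free sketch; the vectors passed in would be `embeddings` arrays taken from the response format above.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Identical directions score 1.0 and orthogonal vectors score 0.0, so higher values indicate more similar objects.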

# Sequence-to-Sequence Algorithm
<a name="seq-2-seq"></a>

Amazon SageMaker AI Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text or audio) and the generated output is another sequence of tokens. Example applications include machine translation (input a sentence in one language and predict that sentence in another language), text summarization (input a long string of words and predict a shorter string that summarizes it), and speech-to-text (audio clips converted into output sentences of tokens). Recently, problems in this domain have been successfully modeled with deep neural networks that show a significant performance boost over previous methodologies. Amazon SageMaker AI seq2seq uses Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) models with attention as encoder-decoder architectures. 

**Topics**
+ [Input/Output Interface for the Sequence-to-Sequence Algorithm](#s2s-inputoutput)
+ [EC2 Instance Recommendation for the Sequence-to-Sequence Algorithm](#s2s-instances)
+ [Sequence-to-Sequence Sample Notebooks](#seq-2-seq-sample-notebooks)
+ [How Sequence-to-Sequence Works](seq-2-seq-howitworks.md)
+ [Sequence-to-Sequence Hyperparameters](seq-2-seq-hyperparameters.md)
+ [Tune a Sequence-to-Sequence Model](seq-2-seq-tuning.md)

## Input/Output Interface for the Sequence-to-Sequence Algorithm
<a name="s2s-inputoutput"></a>

**Training**

SageMaker AI seq2seq expects data in RecordIO-Protobuf format. However, the tokens are expected as integers, not as floating points, as is usually the case.

A script to convert data from tokenized text files to the protobuf format is included in [the seq2seq example notebook](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/seq2seq_translation_en-de/SageMaker-Seq2Seq-Translation-English-German.html). In general, it packs the data into 32-bit integer tensors and generates the necessary vocabulary files, which are needed for metric calculation and inference.

After preprocessing is done, the algorithm can be invoked for training. The algorithm expects three channels:
+ `train`: It should contain the training data (for example, the `train.rec` file generated by the preprocessing script).
+ `validation`: It should contain the validation data (for example, the `val.rec` file generated by the preprocessing script).
+ `vocab`: It should contain two vocabulary files (`vocab.src.json` and `vocab.trg.json`).

If the algorithm doesn't find data in any of these three channels, training results in an error.
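
The preprocessing behind these channels boils down to mapping whitespace-separated tokens to the integer IDs stored in the vocabulary files. The following is a minimal sketch; the vocabulary contents and the unknown-token convention are illustrative, not the exact ones the example notebook's script uses.

```python
# Map whitespace-separated tokens to integer IDs using a vocabulary
# dictionary of the kind stored in vocab.src.json / vocab.trg.json.
# The unknown-token ID below is an illustrative convention.
UNK_ID = 1

def tokens_to_ids(sentence, vocab):
    """Return the integer-token sequence for one whitespace-tokenized sentence."""
    return [vocab.get(token, UNK_ID) for token in sentence.split()]

vocab = {"the": 4, "cat": 57, "sat": 212}
ids = tokens_to_ids("the cat sat down", vocab)
```

Out-of-vocabulary tokens (like `down` above) fall back to the unknown-token ID; the resulting integer sequences are what get packed into the RecordIO-Protobuf tensors.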

**Inference**

For hosted endpoints, inference supports two data formats. To perform inference using space-separated text tokens, use the `application/json` format. Otherwise, use the `recordio-protobuf` format to work with the integer-encoded data. Both modes support batching of input data. The `application/json` format also allows you to visualize the attention matrix.
+ `application/json`: Expects the input in JSON format and returns the output in JSON format. Both content and accept types should be `application/json`. Each sequence is expected to be a string with whitespace-separated tokens. This format is recommended when the number of source sequences in the batch is small. It also supports the following additional configuration option:

  `configuration`: {`attention_matrix`: `true`}: Returns the attention matrix for the particular input sequence.
+ `application/x-recordio-protobuf`: Expects the input in `recordio-protobuf` format and returns the output in `recordio-protobuf` format. Both content and accept types should be `application/x-recordio-protobuf`. For this format, the source sequences must be converted into a list of integers for subsequent protobuf encoding. This format is recommended for bulk inference.

For batch transform, inference supports JSON Lines format. Batch transform expects the input in JSON Lines format and returns the output in JSON Lines format. Both content and accept types should be `application/jsonlines`. The format for input is as follows:

```
content-type: application/jsonlines

{"source": "source_sequence_0"}
{"source": "source_sequence_1"}
```

The format for response is as follows:

```
accept: application/jsonlines

{"target": "predicted_sequence_0"}
{"target": "predicted_sequence_1"}
```

For additional details on how to serialize and deserialize the inputs and outputs to specific formats for inference, see the [Sequence-to-Sequence Sample Notebooks](#seq-2-seq-sample-notebooks).

## EC2 Instance Recommendation for the Sequence-to-Sequence Algorithm
<a name="s2s-instances"></a>

The Amazon SageMaker AI seq2seq algorithm is supported only on GPU instance types and can train only on a single machine. However, you can use instances with multiple GPUs. The seq2seq algorithm supports the P2, P3, G4dn, and G5 GPU instance families.

## Sequence-to-Sequence Sample Notebooks
<a name="seq-2-seq-sample-notebooks"></a>

For a sample notebook that shows how to use the SageMaker AI Sequence to Sequence algorithm to train an English-German translation model, see [Machine Translation English-German Example Using SageMaker AI Seq2Seq](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/seq2seq_translation_en-de/SageMaker-Seq2Seq-Translation-English-German.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The seq2seq example notebook is located in the **Introduction to Amazon algorithms** section. To open a notebook, click on its **Use** tab and select **Create copy**.

# How Sequence-to-Sequence Works
<a name="seq-2-seq-howitworks"></a>

Typically, a neural network for sequence-to-sequence modeling consists of a few layers, including: 
+ An **embedding layer**. In this layer, the input matrix, which consists of input tokens encoded in a sparse way (for example, one-hot encoded), is mapped to a dense feature layer. This is required because a high-dimensional feature vector is more capable of encoding information regarding a particular token (a word for text corpora) than a simple one-hot-encoded vector. It is also standard practice to initialize this embedding layer with pretrained word vectors like [FastText](https://fasttext.cc/) or [GloVe](https://nlp.stanford.edu/projects/glove/), or to initialize it randomly and learn the parameters during training. 
+ An **encoder layer**. After the input tokens are mapped into a high-dimensional feature space, the sequence is passed through an encoder layer to compress all the information from the input embedding layer (of the entire sequence) into a fixed-length feature vector. Typically, an encoder is made of RNN-type networks like long short-term memory (LSTM) or gated recurrent units (GRU). ([Colah's blog](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) explains LSTM in great detail.) 
+ A **decoder layer**. The decoder layer takes this encoded feature vector and produces the output sequence of tokens. This layer is also usually built with RNN architectures (LSTM and GRU). 

The whole model is trained jointly to maximize the probability of the target sequence given the source sequence. This model was first introduced by [Sutskever et al.](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) in 2014. 

**Attention mechanism**. The disadvantage of an encoder-decoder framework is that model performance decreases as the length of the source sequence increases, because of the limit on how much information the fixed-length encoded feature vector can contain. To tackle this problem, in 2015, Bahdanau et al. proposed the [attention mechanism](https://arxiv.org/pdf/1409.0473.pdf). In an attention mechanism, the decoder tries to find the location in the encoder sequence where the most important information could be located and uses that information and previously decoded words to predict the next token in the sequence. 

For more details, see the whitepaper [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025) by Luong et al., which explains and simplifies calculations for various attention mechanisms. Additionally, the whitepaper [Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation](https://arxiv.org/abs/1609.08144) by Wu et al. describes Google's architecture for machine translation, which uses skip connections between encoder and decoder layers.
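
To make the idea concrete, a single decoding step of dot-product attention can be sketched in a few lines. This is a simplified illustration of the general mechanism, not the exact formulation from either paper: the decoder state is scored against each encoder state, the scores are softmaxed into weights, and the weighted sum of encoder states forms the context vector.

```python
import math

def attention_step(decoder_state, encoder_states):
    """One step of dot-product attention: score each encoder hidden state
    against the decoder state, softmax the scores into weights, and return
    (weights, context) where context is the weighted sum of encoder states."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # Numerically stable softmax over the scores.
    max_s = max(scores)
    exps = [math.exp(s - max_s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of encoder hidden states.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context
```

The decoder then combines this context vector with its own state (and the previously decoded token) to predict the next output token.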

# Sequence-to-Sequence Hyperparameters
<a name="seq-2-seq-hyperparameters"></a>

The following table lists the hyperparameters that you can set when training with the Amazon SageMaker AI Sequence-to-Sequence (seq2seq) algorithm.


| Parameter Name | Description | 
| --- | --- | 
| batch\_size | Mini batch size for gradient descent. **Optional** Valid values: positive integer Default value: 64 | 
| beam\_size | Length of the beam for beam search. Used during training for computing `bleu` and used during inference. **Optional** Valid values: positive integer Default value: 5 | 
| bleu\_sample\_size | Number of instances to pick from the validation dataset to decode and compute the `bleu` score during training. Set to -1 to use the full validation set (if `bleu` is chosen as `optimized_metric`). **Optional** Valid values: integer Default value: 0 | 
| bucket\_width | Returns (source,target) buckets up to (`max_seq_len_source`, `max_seq_len_target`). The longer side of the data uses steps of `bucket_width` while the shorter side uses steps scaled down by the average target/source length ratio. If one side reaches its maximum length before the other, the width of the extra buckets on that side is fixed to that side's `max_len`. **Optional** Valid values: positive integer Default value: 10 | 
| bucketing\_enabled | Set to `false` to disable bucketing and unroll to maximum length. **Optional** Valid values: `true` or `false` Default value: `true` | 
| checkpoint\_frequency\_num\_batches | Checkpoint and evaluate every x batches. This checkpointing hyperparameter is passed to SageMaker AI's seq2seq algorithm for early stopping and retrieving the best model. The algorithm's checkpointing runs locally in the algorithm's training container and is not compatible with SageMaker AI checkpointing. The algorithm temporarily saves checkpoints to a local path and stores the best model artifact to the model output path in S3 after the training job has stopped. **Optional** Valid values: positive integer Default value: 1000 | 
| checkpoint\_threshold | Maximum number of checkpoints the model is allowed to not improve in `optimized_metric` on the validation dataset before training is stopped. This checkpointing hyperparameter is passed to SageMaker AI's seq2seq algorithm for early stopping and retrieving the best model. The algorithm's checkpointing runs locally in the algorithm's training container and is not compatible with SageMaker AI checkpointing. The algorithm temporarily saves checkpoints to a local path and stores the best model artifact to the model output path in S3 after the training job has stopped. **Optional** Valid values: positive integer Default value: 3 | 
| clip\_gradient | Clip absolute gradient values greater than this. Set to a negative value to disable. **Optional** Valid values: float Default value: 1 | 
| cnn\_activation\_type | The `cnn` activation type to be used. **Optional** Valid values: String. One of `glu`, `relu`, `softrelu`, `sigmoid`, or `tanh`. Default value: `glu` | 
| cnn\_hidden\_dropout | Dropout probability for dropout between convolutional layers. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| cnn\_kernel\_width\_decoder | Kernel width for the `cnn` decoder. **Optional** Valid values: positive integer Default value: 5 | 
| cnn\_kernel\_width\_encoder | Kernel width for the `cnn` encoder. **Optional** Valid values: positive integer Default value: 3 | 
| cnn\_num\_hidden | Number of `cnn` hidden units for encoder and decoder. **Optional** Valid values: positive integer Default value: 512 | 
| decoder\_type | Decoder type. **Optional** Valid values: String. Either `rnn` or `cnn`. Default value: `rnn` | 
| embed\_dropout\_source | Dropout probability for source-side embeddings. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| embed\_dropout\_target | Dropout probability for target-side embeddings. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| encoder\_type | Encoder type. The `rnn` architecture is based on the attention mechanism by Bahdanau et al., and the `cnn` architecture is based on Gehring et al. **Optional** Valid values: String. Either `rnn` or `cnn`. Default value: `rnn` | 
| fixed\_rate\_lr\_half\_life | Half-life for the learning rate, in terms of number of checkpoints, for `fixed_rate_*` schedulers. **Optional** Valid values: positive integer Default value: 10 | 
| learning\_rate | Initial learning rate. **Optional** Valid values: float Default value: 0.0003 | 
| loss\_type | Loss function for training. **Optional** Valid values: String. `cross-entropy` Default value: `cross-entropy` | 
| lr\_scheduler\_type | Learning rate scheduler type. `plateau_reduce` means reduce the learning rate whenever `optimized_metric` on `validation_accuracy` plateaus. `inv_t` is inverse time decay: `learning_rate/(1 + decay_rate * t)`. **Optional** Valid values: String. One of `plateau_reduce`, `fixed_rate_inv_t`, or `fixed_rate_inv_sqrt_t`. Default value: `plateau_reduce` | 
| max\_num\_batches | Maximum number of updates/batches to process. -1 for infinite. **Optional** Valid values: integer Default value: -1 | 
| max\_num\_epochs | Maximum number of epochs to pass through the training data before fitting is stopped. If this parameter is passed, training continues up to this number of epochs even if validation accuracy is not improving. Ignored if not passed. **Optional** Valid values: Positive integer and less than or equal to max\_num\_epochs. Default value: none | 
| max\_seq\_len\_source | Maximum length for the source sequence. Sequences longer than this length are truncated to this length. **Optional** Valid values: positive integer Default value: 100  | 
| max\_seq\_len\_target | Maximum length for the target sequence. Sequences longer than this length are truncated to this length. **Optional** Valid values: positive integer Default value: 100 | 
| min\_num\_epochs | Minimum number of epochs the training must run before it is stopped via `early_stopping` conditions. **Optional** Valid values: positive integer Default value: 0 | 
| momentum | Momentum constant used for `sgd`. Don't pass this parameter if you are using `adam` or `rmsprop`. **Optional** Valid values: float Default value: none | 
| num\_embed\_source | Embedding size for source tokens. **Optional** Valid values: positive integer Default value: 512 | 
| num\_embed\_target | Embedding size for target tokens. **Optional** Valid values: positive integer Default value: 512 | 
| num\_layers\_decoder | Number of layers for the decoder `rnn` or `cnn`. **Optional** Valid values: positive integer Default value: 1 | 
| num\_layers\_encoder | Number of layers for the encoder `rnn` or `cnn`. **Optional** Valid values: positive integer Default value: 1 | 
| optimized\_metric | Metric to optimize with early stopping. **Optional** Valid values: String. One of `perplexity`, `accuracy`, or `bleu`. Default value: `perplexity` | 
| optimizer\_type | Optimizer to choose from. **Optional** Valid values: String. One of `adam`, `sgd`, or `rmsprop`. Default value: `adam` | 
| plateau\_reduce\_lr\_factor | Factor to multiply the learning rate by (for `plateau_reduce`). **Optional** Valid values: float Default value: 0.5 | 
| plateau\_reduce\_lr\_threshold | For the `plateau_reduce` scheduler, multiply the learning rate by the reduce factor if `optimized_metric` didn't improve for this many checkpoints. **Optional** Valid values: positive integer Default value: 3 | 
| rnn\_attention\_in\_upper\_layers | Pass the attention to the upper layers of the `rnn`, as in the Google NMT paper. Only applicable if more than one layer is used. **Optional** Valid values: boolean (`true` or `false`) Default value: `true` | 
| rnn\_attention\_num\_hidden | Number of hidden units for attention layers. Defaults to `rnn_num_hidden`. **Optional** Valid values: positive integer Default value: `rnn_num_hidden` | 
| rnn\_attention\_type | Attention model for encoders. `mlp` refers to concat, and `bilinear` refers to general, from the Luong et al. paper. **Optional** Valid values: String. One of `dot`, `fixed`, `mlp`, or `bilinear`. Default value: `mlp` | 
| rnn\_cell\_type | Specific type of `rnn` architecture. **Optional** Valid values: String. Either `lstm` or `gru`. Default value: `lstm` | 
| rnn\_decoder\_state\_init | How to initialize `rnn` decoder states from encoders. **Optional** Valid values: String. One of `last`, `avg`, or `zero`. Default value: `last` | 
| rnn\_first\_residual\_layer | First `rnn` layer to have a residual connection; only applicable if the number of layers in the encoder or decoder is more than 1. **Optional** Valid values: positive integer Default value: 2 | 
| rnn\_num\_hidden | The number of `rnn` hidden units for the encoder and decoder. This must be a multiple of 2 because the algorithm uses a bidirectional long short-term memory (LSTM) by default. **Optional** Valid values: positive even integer Default value: 1024 | 
| rnn\_residual\_connections | Add residual connections to the stacked `rnn`. The number of layers should be more than 1. **Optional** Valid values: boolean (`true` or `false`) Default value: `false` | 
| rnn\_decoder\_hidden\_dropout | Dropout probability for the hidden state that combines the context with the `rnn` hidden state in the decoder. **Optional** Valid values: Float. Range in [0,1]. Default value: 0 | 
| training\_metric | Metric to track on validation data during training. **Optional** Valid values: String. Either `perplexity` or `accuracy`. Default value: `perplexity` | 
| weight\_decay | Weight decay constant. **Optional** Valid values: float Default value: 0 | 
| weight\_init\_scale | Weight initialization scale (for `uniform` and `xavier` initialization).  **Optional** Valid values: float Default value: 2.34 | 
| weight\_init\_type | Type of weight initialization.  **Optional** Valid values: String. Either `uniform` or `xavier`. Default value: `xavier` | 
| xavier\_factor\_type | Xavier factor type. **Optional** Valid values: String. One of `in`, `out`, or `avg`. Default value: `in` | 

# Tune a Sequence-to-Sequence Model
<a name="seq-2-seq-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the Sequence-to-Sequence Algorithm
<a name="seq-2-seq-metrics"></a>

The Sequence-to-Sequence algorithm reports three metrics that are computed during training. Choose one of them as an objective to optimize when tuning the hyperparameter values.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:accuracy |  Accuracy computed on the validation dataset.  |  Maximize  | 
| validation:bleu |  [BLEU](https://en.wikipedia.org/wiki/BLEU) score computed on the validation dataset. Because BLEU computation is expensive, you can choose to compute BLEU on a random subsample of the validation dataset to speed up the overall training process. Use the `bleu_sample_size` parameter to specify the subsample.  |  Maximize  | 
| validation:perplexity |  [Perplexity](https://en.wikipedia.org/wiki/Perplexity) is a loss function computed on the validation dataset. Perplexity measures the cross-entropy between an empirical sample and the distribution predicted by a model, and so provides a measure of how well a model predicts the sample values. Models that are good at predicting a sample have a low perplexity.  |  Minimize  | 
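
Perplexity is the exponential of the average cross-entropy (negative log-likelihood) per predicted token, which is why lower values indicate a better model. A minimal stdlib Python sketch of that relationship (illustrative probabilities, not algorithm internals):

```
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood) over the
    probabilities a model assigned to the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that is confident and correct scores a lower (better)
# perplexity than one that spreads probability mass thinly.
good = perplexity([0.9, 0.8, 0.95])
poor = perplexity([0.2, 0.1, 0.3])
```

A model assigning probability 0.5 to every token has a perplexity of exactly 2, and a perfect model has a perplexity of 1.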

## Tunable Sequence-to-Sequence Hyperparameters
<a name="seq-2-seq-tunable-hyperparameters"></a>

You can tune the following hyperparameters for the SageMaker AI Sequence-to-Sequence algorithm. The hyperparameters that have the greatest impact on sequence-to-sequence objective metrics are: `batch_size`, `optimizer_type`, `learning_rate`, `num_layers_encoder`, and `num_layers_decoder`.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| num\_layers\_encoder |  IntegerParameterRange  |  [1-10]  | 
| num\_layers\_decoder |  IntegerParameterRange  |  [1-10]  | 
| batch\_size |  CategoricalParameterRange  |  [16,32,64,128,256,512,1024,2048]  | 
| optimizer\_type |  CategoricalParameterRange  |  ['adam', 'sgd', 'rmsprop']  | 
| weight\_init\_type |  CategoricalParameterRange  |  ['xavier', 'uniform']  | 
| weight\_init\_scale |  ContinuousParameterRange  |  For the `xavier` type: MinValue: 2.0, MaxValue: 3.0. For the `uniform` type: MinValue: -1.0, MaxValue: 1.0  | 
| learning\_rate |  ContinuousParameterRange  |  MinValue: 0.00005, MaxValue: 0.2  | 
| weight\_decay |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.1  | 
| momentum |  ContinuousParameterRange  |  MinValue: 0.5, MaxValue: 0.9  | 
| clip\_gradient |  ContinuousParameterRange  |  MinValue: 1.0, MaxValue: 5.0  | 
| rnn\_num\_hidden |  CategoricalParameterRange  |  Applicable only to recurrent neural networks (RNNs). [128,256,512,1024,2048]   | 
| cnn\_num\_hidden |  CategoricalParameterRange  |  Applicable only to convolutional neural networks (CNNs). [128,256,512,1024,2048]   | 
| num\_embed\_source |  IntegerParameterRange  |  [256-512]  | 
| num\_embed\_target |  IntegerParameterRange  |  [256-512]  | 
| embed\_dropout\_source |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| embed\_dropout\_target |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| rnn\_decoder\_hidden\_dropout |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| cnn\_hidden\_dropout |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 0.5  | 
| lr\_scheduler\_type |  CategoricalParameterRange  |  ['plateau\_reduce', 'fixed\_rate\_inv\_t', 'fixed\_rate\_inv\_sqrt\_t']  | 
| plateau\_reduce\_lr\_factor |  ContinuousParameterRange  |  MinValue: 0.1, MaxValue: 0.5  | 
| plateau\_reduce\_lr\_threshold |  IntegerParameterRange  |  [1-5]  | 
| fixed\_rate\_lr\_half\_life |  IntegerParameterRange  |  [10-30]  | 
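
The ranges in the table map onto the `ParameterRanges` structure of the `CreateHyperParameterTuningJob` API, where all bounds are passed as strings. The following partial sketch shows how a few of them could be expressed; it covers only a subset of the table and is illustrative, not a complete tuning job configuration:

```
# Sketch of the ParameterRanges structure used by the
# CreateHyperParameterTuningJob API; values mirror the table above.
parameter_ranges = {
    "IntegerParameterRanges": [
        {"Name": "num_layers_encoder", "MinValue": "1", "MaxValue": "10"},
        {"Name": "num_layers_decoder", "MinValue": "1", "MaxValue": "10"},
    ],
    "ContinuousParameterRanges": [
        {"Name": "learning_rate", "MinValue": "0.00005", "MaxValue": "0.2"},
        {"Name": "weight_decay", "MinValue": "0.0", "MaxValue": "0.1"},
    ],
    "CategoricalParameterRanges": [
        {"Name": "optimizer_type", "Values": ["adam", "sgd", "rmsprop"]},
        {"Name": "batch_size",
         "Values": ["16", "32", "64", "128", "256", "512", "1024", "2048"]},
    ],
}
```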

# Text Classification - TensorFlow
<a name="text-classification-tensorflow"></a>

The Amazon SageMaker AI Text Classification - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the [TensorFlow Hub](https://tfhub.dev/). Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of text data is not available. The text classification algorithm takes a text string as input and outputs a probability for each of the class labels. Training datasets must be in CSV format. This page includes information about Amazon EC2 instance recommendations and sample notebooks for Text Classification - TensorFlow.

**Topics**
+ [How to use the SageMaker AI Text Classification - TensorFlow algorithm](text-classification-tensorflow-how-to-use.md)
+ [Input and output interface for the Text Classification - TensorFlow algorithm](text-classification-tensorflow-inputoutput.md)
+ [Amazon EC2 instance recommendation for the Text Classification - TensorFlow algorithm](#text-classification-tensorflow-instances)
+ [Text Classification - TensorFlow sample notebooks](#text-classification-tensorflow-sample-notebooks)
+ [How Text Classification - TensorFlow Works](text-classification-tensorflow-HowItWorks.md)
+ [TensorFlow Hub Models](text-classification-tensorflow-Models.md)
+ [Text Classification - TensorFlow Hyperparameters](text-classification-tensorflow-Hyperparameter.md)
+ [Tune a Text Classification - TensorFlow model](text-classification-tensorflow-tuning.md)

# How to use the SageMaker AI Text Classification - TensorFlow algorithm
<a name="text-classification-tensorflow-how-to-use"></a>

You can use Text Classification - TensorFlow as an Amazon SageMaker AI built-in algorithm. The following section describes how to use Text Classification - TensorFlow with the SageMaker AI Python SDK. For information on how to use Text Classification - TensorFlow from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).

The Text Classification - TensorFlow algorithm supports transfer learning using any of the compatible pretrained TensorFlow models. For a list of all available pretrained models, see [TensorFlow Hub Models](text-classification-tensorflow-Models.md). Every pretrained model has a unique `model_id`. The following example uses BERT Base Uncased (`model_id`: `tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2`) to fine-tune on a custom dataset. The pretrained models are all pre-downloaded from the TensorFlow Hub and stored in Amazon S3 buckets so that training jobs can run in network isolation. Use these pre-generated model training artifacts to construct a SageMaker AI Estimator.

First, retrieve the Docker image URI, training script URI, and pretrained model URI. Then, change the hyperparameters as you see fit. You can see a Python dictionary of all available hyperparameters and their default values with `hyperparameters.retrieve_default`. For more information, see [Text Classification - TensorFlow Hyperparameters](text-classification-tensorflow-Hyperparameter.md). Use these values to construct a SageMaker AI Estimator.

**Note**  
Default hyperparameter values are different for different models. For example, for larger models, the default batch size is smaller. 

This example uses the [SST-2](https://www.tensorflow.org/datasets/catalog/glue#gluesst2) dataset, which contains positive and negative movie reviews. We pre-downloaded the dataset and made it available in Amazon S3. To fine-tune your model, call `.fit` using the Amazon S3 location of your training dataset. Any S3 bucket used in a notebook must be in the same AWS Region as the notebook instance that accesses it.

```
import sagemaker
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.estimator import Estimator

# Create a SageMaker session and resolve the Region and execution role
# that the rest of this example uses
sess = sagemaker.Session()
aws_region = sess.boto_region_name
aws_role = sagemaker.get_execution_role()

model_id, model_version = "tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2", "*"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the Docker image
train_image_uri = image_uris.retrieve(
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
    region=None,
    framework=None,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="training")

# Retrieve the pretrained model tarball for transfer learning
train_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="training")

# Retrieve the default hyperparameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "5"

# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/SST2/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-tc-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

# Create an Estimator instance
tf_tc_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Launch a training job
tf_tc_estimator.fit({"training": training_dataset_s3_path}, logs=True)
```

For more information about how to use the SageMaker Text Classification - TensorFlow algorithm for transfer learning on a custom dataset, see the [Introduction to JumpStart - Text Classification](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart_text_classification/Amazon_JumpStart_Text_Classification.ipynb) notebook.

# Input and output interface for the Text Classification - TensorFlow algorithm
<a name="text-classification-tensorflow-inputoutput"></a>

Each of the pretrained models listed in TensorFlow Hub Models can be fine-tuned to any dataset made up of text sentences with any number of classes. The pretrained model attaches a classification layer to the Text Embedding model and initializes the layer parameters to random values. The output dimension of the classification layer is determined based on the number of classes detected in the input data. 

Be mindful of how to format your training data for input to the Text Classification - TensorFlow model.
+ **Training data input format:** A directory containing a `data.csv` file. Each row of the first column should have an integer class label between 0 and the number of classes minus 1. Each row of the second column should have the corresponding text data.

The following is an example of an input CSV file. Note that the file should not have any header. The file should be hosted in an Amazon S3 bucket with a path similar to the following: `s3://bucket_name/input_directory/`. Note that the trailing `/` is required.

```
0,hide new secretions from the parental units
0,contains no wit , only labored gags
1,that loves its characters and communicates something rather beautiful about human nature
...
```
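
As a sketch, a header-less `data.csv` in this layout can be produced with the Python standard library; the labels and sentences below are the sample rows shown above:

```
import csv

rows = [
    (0, "hide new secretions from the parental units"),
    (0, "contains no wit , only labored gags"),
    (1, "that loves its characters and communicates something rather "
        "beautiful about human nature"),
]

# Write a header-less CSV: first column is the integer class label,
# second column is the raw text.
with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for label, text in rows:
        writer.writerow([label, text])
```

The `csv` module quotes any text field that itself contains a comma, which keeps the two-column layout intact.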

## Incremental training
<a name="text-classification-tensorflow-incremental-training"></a>

You can seed the training of a new model with artifacts from a model that you trained previously with SageMaker AI. Incremental training saves training time when you want to train a new model with the same or similar data.

**Note**  
You can only seed a SageMaker AI Text Classification - TensorFlow model with another Text Classification - TensorFlow model trained in SageMaker AI. 

You can use any dataset for incremental training, as long as the set of classes remains the same. The incremental training step is similar to the fine-tuning step, but instead of starting with a pretrained model, you start with an existing fine-tuned model. 

For more information on using incremental training with the SageMaker AI Text Classification - TensorFlow algorithm, see the [Introduction to JumpStart - Text Classification](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart_text_classification/Amazon_JumpStart_Text_Classification.ipynb) sample notebook.

## Inference with the Text Classification - TensorFlow algorithm
<a name="text-classification-tensorflow-inference"></a>

You can host the fine-tuned model that results from your Text Classification - TensorFlow training for inference. Raw text input for inference must use content type `application/x-text`.

Running inference results in probability values, class labels for all classes, and the predicted label corresponding to the class index with the highest probability encoded in JSON format. The Text Classification - TensorFlow model processes a single string per request and outputs only one line. The following is an example of a JSON format response:

```
accept: application/json;verbose

{"probabilities": [prob_0, prob_1, prob_2, ...],
"labels": [label_0, label_1, label_2, ...],
"predicted_label": predicted_label}
```

If `accept` is set to `application/json`, then the model only outputs probabilities. 
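
A response in the verbose format above can be parsed with standard JSON tooling. The following sketch uses a hypothetical response body that follows the schema shown, not captured endpoint output:

```
import json

# Hypothetical response body following the application/json;verbose schema.
body = '{"probabilities": [0.07, 0.93], "labels": [0, 1], "predicted_label": 1}'

response = json.loads(body)

# The predicted label corresponds to the class with the highest probability.
best_index = response["probabilities"].index(max(response["probabilities"]))
predicted = response["labels"][best_index]
```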

## Amazon EC2 instance recommendation for the Text Classification - TensorFlow algorithm
<a name="text-classification-tensorflow-instances"></a>

The Text Classification - TensorFlow algorithm supports all CPU and GPU instances for training, including:
+ `ml.p2.xlarge`
+ `ml.p2.16xlarge`
+ `ml.p3.2xlarge`
+ `ml.p3.16xlarge`
+ `ml.g4dn.xlarge`
+ `ml.g4dn.16xlarge`
+ `ml.g5.xlarge`
+ `ml.g5.48xlarge`

We recommend GPU instances with more memory for training with large batch sizes. Both CPU (such as M5) and GPU (P2, P3, G4dn, or G5) instances can be used for inference. For a comprehensive list of SageMaker training and inference instances across AWS Regions, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

## Text Classification - TensorFlow sample notebooks
<a name="text-classification-tensorflow-sample-notebooks"></a>

For more information about how to use the SageMaker AI Text Classification - TensorFlow algorithm for transfer learning on a custom dataset, see the [Introduction to JumpStart - Text Classification](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart_text_classification/Amazon_JumpStart_Text_Classification.ipynb) notebook.

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How Text Classification - TensorFlow Works
<a name="text-classification-tensorflow-HowItWorks"></a>

The Text Classification - TensorFlow algorithm takes text as input and classifies it into one of the output class labels. Deep learning networks such as [BERT](https://arxiv.org/pdf/1810.04805.pdf) are highly accurate for text classification. There are also deep learning networks that are trained on large text datasets, such as TextNet, which has more than 11 million texts with about 11,000 categories. After a network is trained with TextNet data, you can then fine-tune the network on a dataset with a particular focus to perform more specific text classification tasks. The Amazon SageMaker AI Text Classification - TensorFlow algorithm supports transfer learning on many pretrained models that are available in the TensorFlow Hub.

According to the number of class labels in your training data, a text classification layer is attached to the pretrained TensorFlow model of your choice. The classification layer consists of a dropout layer and a dense layer (a fully connected layer with 2-norm regularization), and is initialized with random weights. You can change the hyperparameter values for the dropout rate of the dropout layer and the L2 regularization factor for the dense layer.

You can fine-tune either the entire network (including the pretrained model) or only the top classification layer on new training data. With this method of transfer learning, training with smaller datasets is possible.

# TensorFlow Hub Models
<a name="text-classification-tensorflow-Models"></a>

The following pretrained models are available to use for transfer learning with the Text Classification - TensorFlow algorithm. 

The following models vary significantly in size, number of model parameters, training time, and inference latency for any given dataset. The best model for your use case depends on the complexity of your fine-tuning dataset and any requirements that you have on training time, inference latency, or model accuracy.


| Model Name | `model_id` | Source | 
| --- | --- | --- | 
|  BERT Base Uncased  | `tensorflow-tc-bert-en-uncased-L-12-H-768-A-12-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3) | 
|  BERT Base Cased  | `tensorflow-tc-bert-en-cased-L-12-H-768-A-12-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3) | 
|  BERT Base Multilingual Cased  | `tensorflow-tc-bert-multi-cased-L-12-H-768-A-12-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3) | 
|  Small BERT L-2\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-2-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1) | 
|  Small BERT L-2\_H-256\_A-4 | `tensorflow-tc-small-bert-bert-en-uncased-L-2-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1) | 
|  Small BERT L-2\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-2-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1) | 
|  Small BERT L-2\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-2-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1) | 
|  Small BERT L-4\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-4-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1) | 
|  Small BERT L-4\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-4-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1) | 
|  Small BERT L-4\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-4-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1) | 
|  Small BERT L-4\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-4-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1) | 
|  Small BERT L-6\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-6-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/1) | 
|  Small BERT L-6\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-6-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1) | 
|  Small BERT L-6\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-6-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1) | 
|  Small BERT L-6\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-6-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1) | 
|  Small BERT L-8\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-8-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1) | 
|  Small BERT L-8\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-8-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1) | 
|  Small BERT L-8\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-8-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1) | 
|  Small BERT L-8\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-8-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1) | 
|  Small BERT L-10\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-10-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1) | 
|  Small BERT L-10\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-10-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1) | 
|  Small BERT L-10\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-10-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1) | 
|  Small BERT L-10\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-10-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1) | 
|  Small BERT L-12\_H-128\_A-2  | `tensorflow-tc-small-bert-bert-en-uncased-L-12-H-128-A-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1) | 
|  Small BERT L-12\_H-256\_A-4  | `tensorflow-tc-small-bert-bert-en-uncased-L-12-H-256-A-4` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1) | 
|  Small BERT L-12\_H-512\_A-8  | `tensorflow-tc-small-bert-bert-en-uncased-L-12-H-512-A-8` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1) | 
|  Small BERT L-12\_H-768\_A-12  | `tensorflow-tc-small-bert-bert-en-uncased-L-12-H-768-A-12` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1) | 
|  BERT Large Uncased  | `tensorflow-tc-bert-en-uncased-L-24-H-1024-A-16-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/3) | 
|  BERT Large Cased  | `tensorflow-tc-bert-en-cased-L-24-H-1024-A-16-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_cased_L-24_H-1024_A-16/3) | 
|  BERT Large Uncased Whole Word Masking  | `tensorflow-tc-bert-en-wwm-uncased-L-24-H-1024-A-16-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_wwm_uncased_L-24_H-1024_A-16/3) | 
|  BERT Large Cased Whole Word Masking  | `tensorflow-tc-bert-en-wwm-cased-L-24-H-1024-A-16-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/bert_en_wwm_cased_L-24_H-1024_A-16/3) | 
|  ALBERT Base  | `tensorflow-tc-albert-en-base` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/albert_en_base/2) | 
|  ELECTRA Small++  | `tensorflow-tc-electra-small-1` | [TensorFlow Hub link](https://tfhub.dev/google/electra_small/2) | 
|  ELECTRA Base  | `tensorflow-tc-electra-base-1` | [TensorFlow Hub link](https://tfhub.dev/google/electra_base/2) | 
|  BERT Base Wikipedia and BooksCorpus  | `tensorflow-tc-experts-bert-wiki-books-1` | [TensorFlow Hub link](https://tfhub.dev/google/experts/bert/wiki_books/2) | 
|  BERT Base MEDLINE/PubMed  | `tensorflow-tc-experts-bert-pubmed-1` | [TensorFlow Hub link](https://tfhub.dev/google/experts/bert/pubmed/2) | 
|  Talking Heads Base  | `tensorflow-tc-talking-heads-base` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1) | 
|  Talking Heads Large  | `tensorflow-tc-talking-heads-large` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_large/1) | 

# Text Classification - TensorFlow Hyperparameters
<a name="text-classification-tensorflow-Hyperparameter"></a>

Hyperparameters are parameters that are set before a machine learning model begins learning. The following hyperparameters are supported by the Amazon SageMaker AI built-in Text Classification - TensorFlow algorithm. See [Tune a Text Classification - TensorFlow model](text-classification-tensorflow-tuning.md) for information on hyperparameter tuning. 


| Parameter Name | Description | 
| --- | --- | 
| batch\_size |  The batch size for training. For training on instances with multiple GPUs, this batch size is used across the GPUs. Valid values: positive integer. Default value: `32`.  | 
| beta\_1 |  The beta1 for the `"adam"` and `"adamw"` optimizers. Represents the exponential decay rate for the first moment estimates. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.9`.  | 
| beta\_2 |  The beta2 for the `"adam"` and `"adamw"` optimizers. Represents the exponential decay rate for the second moment estimates. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.999`.  | 
| dropout\_rate | The dropout rate for the dropout layer in the top classification layer. Used only when `reinitialize_top_layer` is set to `"True"`. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.2`. | 
| early\_stopping |  Set to `"True"` to use early stopping logic during training. If `"False"`, early stopping is not used. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
| early\_stopping\_min\_delta | The minimum change needed to qualify as an improvement. An absolute change less than the value of `early_stopping_min_delta` does not qualify as improvement. Used only when `early_stopping` is set to `"True"`. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0`. | 
| early\_stopping\_patience |  The number of epochs to continue training with no improvement. Used only when `early_stopping` is set to `"True"`. Valid values: positive integer. Default value: `5`.  | 
| epochs |  The number of training epochs. Valid values: positive integer. Default value: `10`.  | 
| epsilon |  The epsilon for `"adam"`, `"rmsprop"`, `"adadelta"`, and `"adagrad"` optimizers. Usually set to a small value to avoid division by 0. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `1e-7`.  | 
| initial\_accumulator\_value |  The starting value for the accumulators, or the per-parameter momentum values, for the `"adagrad"` optimizer. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0001`.  | 
| learning\_rate | The optimizer learning rate. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.001`. | 
| momentum |  The momentum for the `"sgd"` and `"nesterov"` optimizers. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.9`.  | 
| optimizer |  The optimizer type. For more information, see [Optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers) in the TensorFlow documentation. Valid values: string, any of the following: (`"adamw"`, `"adam"`, `"sgd"`, `"nesterov"`, `"rmsprop"`, `"adagrad"`, `"adadelta"`). Default value: `"adam"`.  | 
| regularizers\_l2 |  The L2 regularization factor for the dense layer in the classification layer. Used only when `reinitialize_top_layer` is set to `"True"`. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0001`.  | 
| reinitialize\_top\_layer |  If set to `"Auto"`, the top classification layer parameters are re-initialized during fine-tuning. For incremental training, top classification layer parameters are not re-initialized unless set to `"True"`. Valid values: string, any of the following: (`"Auto"`, `"True"` or `"False"`). Default value: `"Auto"`.  | 
| rho |  The discounting factor for the gradient of the `"adadelta"` and `"rmsprop"` optimizers. Ignored for other optimizers.  Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.95`.  | 
| train\_only\_on\_top\_layer |  If `"True"`, only the top classification layer parameters are fine-tuned. If `"False"`, all model parameters are fine-tuned. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
| validation\_split\_ratio |  The fraction of training data to randomly split to create validation data. Only used if validation data is not provided through the `validation` channel. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.2`.  | 
| warmup\_steps\_fraction |  The fraction of the total number of gradient update steps, where the learning rate increases from 0 to the initial learning rate as a warm up. Only used with the `adamw` optimizer. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.1`.  | 
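
To make the interplay of `early_stopping_min_delta` and `early_stopping_patience` concrete, here is an illustrative stdlib sketch of the usual early-stopping rule (a conceptual model, not the algorithm's internal implementation):

```
def epochs_run(metric_history, min_delta=0.0, patience=5):
    """Return how many epochs run before early stopping triggers.
    An epoch counts as an improvement only if the metric exceeds the
    best value seen so far by more than min_delta."""
    best = float("-inf")
    waited = 0
    for epoch, value in enumerate(metric_history, start=1):
        if value > best + min_delta:
            best = value
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(metric_history)

# Validation accuracy per epoch for a hypothetical run.
history = [0.70, 0.75, 0.751, 0.752, 0.90]
```

With `min_delta=0.01` and `patience=2`, the tiny gains at epochs 3 and 4 don't count as improvements, so training stops at epoch 4; with `min_delta=0.0` they do count and all five epochs run.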

# Tune a Text Classification - TensorFlow model
<a name="text-classification-tensorflow-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics computed by the Text Classification - TensorFlow algorithm
<a name="text-classification-tensorflow-metrics"></a>

Refer to the following chart to find which metrics are computed by the Text Classification - TensorFlow algorithm.


| Metric Name | Description | Optimization Direction | Regex Pattern | 
| --- | --- | --- | --- | 
| validation:accuracy | The ratio of the number of correct predictions to the total number of predictions made. | Maximize | `val_accuracy=([0-9\\.]+)` | 
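
The regex pattern in the table is applied to the training job's log stream to extract the objective metric. A quick stdlib check of how it matches a typical log line (the log line itself is illustrative):

```
import re

# Pattern from the metric definitions table.
pattern = r"val_accuracy=([0-9\.]+)"

# Hypothetical line from a training job's log stream.
log_line = "Epoch 3/10 - loss=0.412 - val_accuracy=0.8675"

match = re.search(pattern, log_line)
accuracy = float(match.group(1))
```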

## Tunable Text Classification - TensorFlow hyperparameters
<a name="text-classification-tensorflow-tunable-hyperparameters"></a>

Tune a text classification model with the following hyperparameters. The hyperparameters that have the greatest impact on text classification objective metrics are: `batch_size`, `learning_rate`, and `optimizer`. Tune the optimizer-related hyperparameters, such as `momentum`, `regularizers_l2`, `beta_1`, `beta_2`, and `eps` based on the selected `optimizer`. For example, use `beta_1` and `beta_2` only when `adamw` or `adam` is the `optimizer`.

For more information about which hyperparameters are used for each `optimizer`, see [Text Classification - TensorFlow Hyperparameters](text-classification-tensorflow-Hyperparameter.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| batch_size | IntegerParameterRanges | MinValue: 4, MaxValue: 128 | 
| beta_1 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| beta_2 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| eps | ContinuousParameterRanges | MinValue: 1e-8, MaxValue: 1.0 | 
| learning_rate | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.5 | 
| momentum | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| optimizer | CategoricalParameterRanges | ['adamw', 'adam', 'sgd', 'rmsprop', 'nesterov', 'adagrad', 'adadelta'] | 
| regularizers_l2 | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| train_only_on_top_layer | CategoricalParameterRanges | ['True', 'False'] | 
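
The ranges above can be expressed as plain data before being handed to a tuning job. In practice you would wrap these in the SageMaker Python SDK's `IntegerParameter`, `ContinuousParameter`, and `CategoricalParameter` classes (an assumption about your setup); they're shown here as dictionaries for illustration:

```python
# Sketch: the recommended tuning ranges above as plain Python data.
integer_ranges = {"batch_size": (4, 128)}
continuous_ranges = {
    "beta_1": (1e-6, 0.999),
    "beta_2": (1e-6, 0.999),
    "eps": (1e-8, 1.0),
    "learning_rate": (1e-6, 0.5),
    "momentum": (0.0, 0.999),
    "regularizers_l2": (0.0, 0.999),
}
categorical_ranges = {
    "optimizer": ["adamw", "adam", "sgd", "rmsprop", "nesterov", "adagrad", "adadelta"],
    "train_only_on_top_layer": ["True", "False"],
}
```

Remember that `beta_1`, `beta_2`, `eps`, and `momentum` only take effect when a compatible `optimizer` is selected, so tuning them alongside `optimizer` can waste trials on ignored combinations.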

# Built-in SageMaker AI Algorithms for Time-Series Data
<a name="algorithms-time-series"></a>

SageMaker AI provides algorithms that are tailored to the analysis of time-series data for forecasting product demand, server loads, webpage requests, and more.
+ [Use the SageMaker AI DeepAR forecasting algorithm](deepar.md)—a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN).


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| DeepAR Forecasting | train and (optionally) test | File | JSON Lines or Parquet | GPU or CPU | Yes | 

# Use the SageMaker AI DeepAR forecasting algorithm
<a name="deepar"></a>

The Amazon SageMaker AI DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN). Classical forecasting methods, such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ETS), fit a single model to each individual time series. They then use that model to extrapolate the time series into the future. 

In many applications, however, you have many similar time series across a set of cross-sectional units. For example, you might have time series groupings for demand for different products, server loads, and requests for webpages. For this type of application, you can benefit from training a single model jointly over all of the time series. DeepAR takes this approach. When your dataset contains hundreds of related time series, DeepAR outperforms the standard ARIMA and ETS methods. You can also use the trained model to generate forecasts for new time series that are similar to the ones it has been trained on.

The training input for the DeepAR algorithm is one or, preferably, more `target` time series that have been generated by the same process or similar processes. Based on this input dataset, the algorithm trains a model that learns an approximation of these processes and uses it to predict how the target time series evolve. Each target time series can be optionally associated with a vector of static (time-independent) categorical features provided by the `cat` field and a vector of dynamic (time-dependent) time series provided by the `dynamic_feat` field. SageMaker AI trains the DeepAR model by randomly sampling training examples from each target time series in the training dataset. Each training example consists of a pair of adjacent context and prediction windows with fixed predefined lengths. To control how far in the past the network can see, use the `context_length` hyperparameter. To control how far in the future predictions can be made, use the `prediction_length` hyperparameter. For more information, see [How the DeepAR Algorithm Works](deepar_how-it-works.md).

**Topics**
+ [Input/Output Interface for the DeepAR Algorithm](#deepar-inputoutput)
+ [Best Practices for Using the DeepAR Algorithm](#deepar_best_practices)
+ [EC2 Instance Recommendations for the DeepAR Algorithm](#deepar-instances)
+ [DeepAR Sample Notebooks](#deepar-sample-notebooks)
+ [How the DeepAR Algorithm Works](deepar_how-it-works.md)
+ [DeepAR Hyperparameters](deepar_hyperparameters.md)
+ [Tune a DeepAR Model](deepar-tuning.md)
+ [DeepAR Inference Formats](deepar-in-formats.md)

## Input/Output Interface for the DeepAR Algorithm
<a name="deepar-inputoutput"></a>

DeepAR supports two data channels. The required `train` channel describes the training dataset. The optional `test` channel describes a dataset that the algorithm uses to evaluate model accuracy after training. You can provide training and test datasets in [JSON Lines](http://jsonlines.org/) format. Files can be in gzip or [Parquet](https://parquet.apache.org/) file format.

When specifying the paths for the training and test data, you can specify a single file or a directory that contains multiple files, which can be stored in subdirectories. If you specify a directory, DeepAR uses all files in the directory as inputs for the corresponding channel, except those that start with a period (.) and those named `_SUCCESS`. This ensures that you can directly use output folders produced by Spark jobs as input channels for your DeepAR training jobs.

By default, the DeepAR model determines the input format from the file extension (`.json`, `.json.gz`, or `.parquet`) in the specified input path. If the path does not end in one of these extensions, you must explicitly specify the format in the SDK for Python. Use the `content_type` parameter of the [s3_input](https://sagemaker.readthedocs.io/en/stable/session.html#sagemaker.session.s3_input) class.
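
The extension-based detection can be sketched as follows. The helper and the content-type strings are illustrative assumptions for local validation of your paths; the actual format detection happens on the service side:

```python
def infer_content_type(input_path):
    """Hypothetical helper mirroring DeepAR's extension-based format detection.

    Returns None when the path has no recognized extension, in which case the
    format must be set explicitly via the SDK's content_type parameter.
    (The content-type strings below are assumptions for illustration.)
    """
    if input_path.endswith((".json", ".json.gz")):
        return "application/jsonlines"
    if input_path.endswith(".parquet"):
        return "application/x-parquet"
    return None  # explicit content_type required
```

For a path such as `s3://bucket/train/part-0000` with no extension, the helper returns `None`, signaling that `content_type` must be passed explicitly.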

The records in your input files should contain the following fields:
+ `start`—A string with the format `YYYY-MM-DD HH:MM:SS`. The start timestamp can't contain time zone information.
+ `target`—An array of floating-point values or integers that represent the time series. You can encode missing values as `null` literals, or as `"NaN"` strings in JSON, or as `nan` floating-point values in Parquet.
+ `dynamic_feat` (optional)—An array of arrays of floating-point values or integers that represents the vector of custom feature time series (dynamic features). If you set this field, all records must have the same number of inner arrays (the same number of feature time series). In addition, each inner array must be the same length as the associated `target` value plus `prediction_length`. Missing values are not supported in the features. For example, if the target time series represents the demand of different products, an associated `dynamic_feat` might be a boolean time series that indicates whether a promotion was applied (1) or not (0) to the particular product: 

  ```
  {"start": ..., "target": [1, 5, 10, 2], "dynamic_feat": [[0, 1, 1, 0]]}
  ```
+ `cat` (optional)—An array of categorical features that can be used to encode the groups that the record belongs to. Categorical features must be encoded as a 0-based sequence of integers. For example, the categorical domain {R, G, B} can be encoded as {0, 1, 2}. All values from each categorical domain must be represented in the training dataset. That's because the DeepAR algorithm can forecast only for categories that have been observed during training. Each categorical feature is embedded in a low-dimensional space whose dimensionality is controlled by the `embedding_dimension` hyperparameter. For more information, see [DeepAR Hyperparameters](deepar_hyperparameters.md).

If you use a JSON file, it must be in [JSON Lines](http://jsonlines.org/) format. For example:

```
{"start": "2009-11-01 00:00:00", "target": [4.3, "NaN", 5.1, ...], "cat": [0, 1], "dynamic_feat": [[1.1, 1.2, 0.5, ...]]}
{"start": "2012-01-30 00:00:00", "target": [1.0, -5.0, ...], "cat": [2, 3], "dynamic_feat": [[1.1, 2.05, ...]]}
{"start": "1999-01-30 00:00:00", "target": [2.0, 1.0], "cat": [1, 4], "dynamic_feat": [[1.3, 0.4]]}
```

In this example, each time series has two associated categorical features and one custom feature time series.
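
A record set like the one above can be produced with a few lines of Python. The records here are made-up examples that follow the field definitions described earlier:

```python
import json

# Made-up records following the start/target/cat/dynamic_feat format above.
records = [
    {"start": "2009-11-01 00:00:00", "target": [4.3, "NaN", 5.1],
     "cat": [0, 1], "dynamic_feat": [[1.1, 1.2, 0.5]]},
    {"start": "1999-01-30 00:00:00", "target": [2.0, 1.0],
     "cat": [1, 4], "dynamic_feat": [[1.3, 0.4]]},
]

# JSON Lines: one JSON object per line. Write this string to a .json file
# before uploading it to the S3 location used by the train channel.
json_lines = "\n".join(json.dumps(record) for record in records)
```

Because each record is a single line, the file can be streamed and split easily, which is why the JSON Lines format is used rather than one large JSON array.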

For Parquet, you use the same three fields as columns. In addition, `"start"` can be the `datetime` type. You can compress Parquet files using gzip (`gzip`) or the Snappy compression library (`snappy`).

If the algorithm is trained without `cat` and `dynamic_feat` fields, it learns a "global" model, that is a model that is agnostic to the specific identity of the target time series at inference time and is conditioned only on its shape.

If the model is conditioned on the `cat` and `dynamic_feat` feature data provided for each time series, the prediction will probably be influenced by the character of time series with the corresponding `cat` features. For example, if the `target` time series represents the demand of clothing items, you can associate a two-dimensional `cat` vector that encodes the type of item (e.g. 0 = shoes, 1 = dress) in the first component and the color of an item (e.g. 0 = red, 1 = blue) in the second component. A sample input would look as follows:

```
{ "start": ..., "target": ..., "cat": [0, 0], ... } # red shoes
{ "start": ..., "target": ..., "cat": [1, 1], ... } # blue dress
```

At inference time, you can request predictions for targets with `cat` values that are combinations of the `cat` values observed in the training data, for example:

```
{ "start": ..., "target": ..., "cat": [0, 1], ... } # blue shoes
{ "start": ..., "target": ..., "cat": [1, 0], ... } # red dress
```

The following guidelines apply to training data:
+ The start time and length of the time series can differ. For example, in marketing, products often enter a retail catalog at different dates, so their start dates naturally differ. But all series must have the same frequency, number of categorical features, and number of dynamic features. 
+ Shuffle the training file with respect to the position of the time series in the file. In other words, the time series should occur in random order in the file.
+ Make sure to set the `start` field correctly. The algorithm uses the `start` timestamp to derive the internal features. 
+ If you use categorical features (`cat`), all time series must have the same number of categorical features. If the dataset contains the `cat` field, the algorithm uses it and extracts the cardinality of the groups from the dataset. By default, `cardinality` is `"auto"`. If the dataset contains the `cat` field, but you don't want to use it, you can disable it by setting `cardinality` to `""`. If a model was trained using a `cat` feature, you must include it for inference.
+ If your dataset contains the `dynamic_feat` field, the algorithm uses it automatically. All time series must have the same number of feature time series. The time points in each of the feature time series correspond one-to-one to the time points in the target, so in the training data each entry in the `dynamic_feat` field should have the same length as the `target`. If the dataset contains the `dynamic_feat` field, but you don't want to use it, disable it by setting `num_dynamic_feat` to `""`. If the model was trained with the `dynamic_feat` field, you must provide this field for inference. In that case, each of the features has to have the length of the provided target plus the `prediction_length`. In other words, you must provide the feature values in the future.
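
The guidelines above can be checked programmatically before uploading a dataset. This hypothetical validator covers the `cat`-count, `dynamic_feat`-count, and length rules; it accepts a feature length of either the target length (as in training data) or the target length plus `prediction_length` (as required for inference):

```python
def validate_record(record, num_cat, num_dynamic_feat, prediction_length):
    """Hypothetical pre-upload check for the guidelines above.

    Raises ValueError on the first violation found; returns None when valid.
    """
    if len(record.get("cat", [])) != num_cat:
        raise ValueError("every record must have the same number of cat features")
    feats = record.get("dynamic_feat", [])
    if len(feats) != num_dynamic_feat:
        raise ValueError("every record must have the same number of dynamic features")
    target_len = len(record["target"])
    for feat in feats:
        # training data: same length as target; inference: target + prediction_length
        if len(feat) not in (target_len, target_len + prediction_length):
            raise ValueError("dynamic_feat length must match the target "
                             "(plus prediction_length for inference)")
```

Running such a check locally catches format errors before they surface as training job failures.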

If you specify optional test channel data, the DeepAR algorithm evaluates the trained model with different accuracy metrics. The algorithm calculates the root mean square error (RMSE) over the test data as follows:

![\[RMSE Formula: Sqrt(1/nT(Sum[i,t](y-hat(i,t)-y(i,t))^2))\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/deepar-1.png)


*y*<sub>*i*,*t*</sub> is the true value of time series *i* at time *t*, and *ŷ*<sub>*i*,*t*</sub> is the mean prediction. The sum is over all *n* time series in the test set and over the last *T* time points for each time series, where *T* corresponds to the forecast horizon. You specify the length of the forecast horizon by setting the `prediction_length` hyperparameter. For more information, see [DeepAR Hyperparameters](deepar_hyperparameters.md).
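
The RMSE above can be sketched in a few lines of Python. Here `actuals` and `predictions` hold the full test series and the model's mean predictions, and only the last `horizon` points of each series enter the sum (a hypothetical standalone implementation, not the service's code):

```python
import math

def deepar_rmse(actuals, predictions, horizon):
    """RMSE over the last `horizon` points of each of the n test series."""
    sq_err = 0.0
    count = 0
    for y, y_hat in zip(actuals, predictions):
        for t in range(-horizon, 0):  # the last T time points
            sq_err += (y_hat[t] - y[t]) ** 2
            count += 1
    return math.sqrt(sq_err / count)
```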

In addition, the algorithm evaluates the accuracy of the forecast distribution using weighted quantile loss. For a quantile in the range [0, 1], the weighted quantile loss is defined as follows:

![\[Weighted quantile loss equation.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/deepar-2.png)


*q*<sub>*i*,*t*</sub>(τ) is the τ-quantile of the distribution that the model predicts. To specify which quantiles to calculate loss for, set the `test_quantiles` hyperparameter. In addition to these, the average of the prescribed quantile losses is reported as part of the training logs. For information, see [DeepAR Hyperparameters](deepar_hyperparameters.md). 
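
In code, the weighted quantile loss for a single quantile τ can be sketched as follows, using the common pinball-loss form (the exact normalization is an assumption; the service's implementation may differ):

```python
def weighted_quantile_loss(tau, actuals, quantile_preds):
    """Weighted quantile loss for one quantile tau over flattened test points."""
    numerator = 0.0
    denominator = 0.0
    for y, q in zip(actuals, quantile_preds):
        # Pinball loss: under-prediction is penalized by tau,
        # over-prediction by (1 - tau).
        numerator += tau * (y - q) if y > q else (1 - tau) * (q - y)
        denominator += abs(y)
    return 2 * numerator / denominator
```

Note the asymmetry: for a high quantile such as 0.9, under-predicting is penalized far more than over-predicting, which is what makes the metric sensitive to the calibration of the forecast distribution.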

For inference, DeepAR accepts a JSON request with the following fields:
+ `"instances"`—includes one or more time series in JSON Lines format
+ `"configuration"`—includes parameters for generating the forecast

For more information, see [DeepAR Inference Formats](deepar-in-formats.md).

## Best Practices for Using the DeepAR Algorithm
<a name="deepar_best_practices"></a>

When preparing your time series data, follow these best practices to achieve the best results:
+ Except for when splitting your dataset for training and testing, always provide the entire time series for training, testing, and when calling the model for inference. Regardless of how you set `context_length`, don't break up the time series or provide only a part of it. The model uses data points further back than the value set in `context_length` for the lagged values feature.
+ When tuning a DeepAR model, you can split the dataset to create a training dataset and a test dataset. In a typical evaluation, you would test the model on the same time series used for training, but on the future `prediction_length` time points that follow immediately after the last time point visible during training. You can create training and test datasets that satisfy this criterion by using the entire dataset (the full length of all time series that are available) as a test set and removing the last `prediction_length` points from each time series for training. During training, the model doesn't see the target values for time points on which it is evaluated during testing. During testing, the algorithm withholds the last `prediction_length` points of each time series in the test set and generates a prediction. Then it compares the forecast with the withheld values. You can create more complex evaluations by repeating time series multiple times in the test set, but cutting them at different endpoints. With this approach, accuracy metrics are averaged over multiple forecasts from different time points. For more information, see [Tune a DeepAR Model](deepar-tuning.md).
+ Avoid using very large values (>400) for the `prediction_length` because it makes the model slow and less accurate. If you want to forecast further into the future, consider aggregating your data at a lower frequency. For example, use `5min` instead of `1min`.
+ Because lags are used, a model can look further back in the time series than the value specified for `context_length`. Therefore, you don't need to set this parameter to a large value. We recommend starting with the value that you used for `prediction_length`.
+ We recommend training a DeepAR model on as many time series as are available. Although a DeepAR model trained on a single time series might work well, standard forecasting algorithms, such as ARIMA or ETS, might provide more accurate results. The DeepAR algorithm starts to outperform the standard methods when your dataset contains hundreds of related time series. Currently, DeepAR requires that the total number of observations available across all training time series is at least 300.
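
The train/test split described in the second practice above can be sketched as follows: the full series form the test set, and the training copies drop the last `prediction_length` points (hypothetical helper):

```python
def training_series(full_series, prediction_length):
    """Return the training version of a series: everything except the final
    prediction_length points, which are withheld for evaluation."""
    return full_series[:-prediction_length]

# The test channel gets the full series; the train channel gets truncated copies.
test_set = [[1, 2, 3, 4, 5, 6, 7, 8], [10, 20, 30, 40, 50, 60]]
train_set = [training_series(s, prediction_length=2) for s in test_set]
```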

## EC2 Instance Recommendations for the DeepAR Algorithm
<a name="deepar-instances"></a>

You can train DeepAR on both GPU and CPU instances and in both single and multi-machine settings. We recommend starting with a single CPU instance (for example, ml.c4.2xlarge or ml.c4.4xlarge), and switching to GPU instances and multiple machines only when necessary. Using GPUs and multiple machines improves throughput only for larger models (with many cells per layer and many layers) and for large mini-batch sizes (for example, greater than 512).

For inference, DeepAR supports only CPU instances.

Specifying large values for `context_length`, `prediction_length`, `num_cells`, `num_layers`, or `mini_batch_size` can create models that are too large for small instances. In this case, use a larger instance type or reduce the values for these parameters. This problem also frequently occurs when running hyperparameter tuning jobs. In that case, use an instance type large enough for the model tuning job and consider limiting the upper values of the critical parameters to avoid job failures. 

## DeepAR Sample Notebooks
<a name="deepar-sample-notebooks"></a>

For a sample notebook that shows how to prepare a time series dataset for training the SageMaker AI DeepAR algorithm and how to deploy the trained model for performing inferences, see [DeepAR demo on electricity dataset](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/deepar_electricity/DeepAR-Electricity.html), which illustrates the advanced features of DeepAR on a real world dataset. For instructions on creating and accessing Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After creating and opening a notebook instance, choose the **SageMaker AI Examples** tab to see a list of all of the SageMaker AI examples. To open a notebook, choose its **Use** tab, and choose **Create copy**.

For more information about the Amazon SageMaker AI DeepAR algorithm, see the following blog posts:
+ [Now available in Amazon SageMaker AI: DeepAR algorithm for more accurate time series forecasting](https://aws.amazon.com/blogs/machine-learning/now-available-in-amazon-sagemaker-deepar-algorithm-for-more-accurate-time-series-forecasting/)
+ [Deep demand forecasting with Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/deep-demand-forecasting-with-amazon-sagemaker/)

# How the DeepAR Algorithm Works
<a name="deepar_how-it-works"></a>

During training, DeepAR accepts a training dataset and an optional test dataset. It uses the test dataset to evaluate the trained model. In general, the datasets don't have to contain the same set of time series. You can use a model trained on a given training set to generate forecasts for the future of the time series in the training set, and for other time series. Both the training and the test datasets consist of one or, preferably, more target time series. Each target time series can optionally be associated with a vector of feature time series and a vector of categorical features. For more information, see [Input/Output Interface for the DeepAR Algorithm](deepar.md#deepar-inputoutput). 

For example, the following is an element of a training set indexed by *i* which consists of a target time series, *Zi,t*, and two associated feature time series, *Xi,1,t* and *Xi,2,t*:

![\[Figure 1: Target time series and associated feature time series\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ts-full-159.base.png)


The target time series might contain missing values, which are represented by breaks in the plotted time series. DeepAR supports only feature time series that are known in the future. This allows you to run "what if?" scenarios. What happens, for example, if I change the price of a product in some way? 

Each target time series can also be associated with a number of categorical features. You can use these features to encode which groupings a time series belongs to. Categorical features allow the model to learn typical behavior for groups, which it can use to increase model accuracy. DeepAR implements this by learning an embedding vector for each group that captures the common properties of all time series in the group. 

## How Feature Time Series Work in the DeepAR Algorithm
<a name="deepar_under-the-hood"></a>

To facilitate learning time-dependent patterns, such as spikes during weekends, DeepAR automatically creates feature time series based on the frequency of the target time series. It uses these derived feature time series with the custom feature time series that you provide during training and inference. The following figure shows two of these derived time series features: *ui,1,t* represents the hour of the day and *ui,2,t* the day of the week.

![\[Figure 2: Derived time series\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ts-full-159.derived.png)


The DeepAR algorithm automatically generates these feature time series. The following table lists the derived features for the supported basic time frequencies.


| Frequency of the Time Series | Derived Features | 
| --- | --- | 
| Minute |  `minute-of-hour`, `hour-of-day`, `day-of-week`, `day-of-month`, `day-of-year`  | 
| Hour |  `hour-of-day`, `day-of-week`, `day-of-month`, `day-of-year`  | 
| Day |  `day-of-week`, `day-of-month`, `day-of-year`  | 
| Week |  `day-of-month`, `week-of-year`  | 
| Month |  `month-of-year`  | 

DeepAR trains a model by randomly sampling several training examples from each of the time series in the training dataset. Each training example consists of a pair of adjacent context and prediction windows with fixed predefined lengths. The `context_length` hyperparameter controls how far in the past the network can see, and the `prediction_length` hyperparameter controls how far in the future predictions can be made. During training, the algorithm ignores training set elements containing time series that are shorter than a specified prediction length. The following figure represents five samples with context lengths of 12 hours and prediction lengths of 6 hours drawn from element *i*. For brevity, we've omitted the feature time series *xi,1,t* and *ui,2,t*.

![\[Figure 3: Sampled time series\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ts-full-159.sampled.png)


To capture seasonality patterns, DeepAR also automatically feeds lagged values from the target time series. In the example with hourly frequency, for each time index, *t = T*, the model exposes the *zi,t* values, which occurred approximately one, two, and three days in the past.

![\[Figure 4: Lagged time series\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ts-full-159.lags.png)


For inference, the trained model takes as input target time series, which might or might not have been used during training, and forecasts a probability distribution for the next `prediction_length` values. Because DeepAR is trained on the entire dataset, the forecast takes into account patterns learned from similar time series.

For information on the mathematics behind DeepAR, see [DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks](https://arxiv.org/abs/1704.04110). 

# DeepAR Hyperparameters
<a name="deepar_hyperparameters"></a>

The following table lists the hyperparameters that you can set when training with the Amazon SageMaker AI DeepAR forecasting algorithm.


| Parameter Name | Description | 
| --- | --- | 
| context_length |  The number of time-points that the model gets to see before making the prediction. The value for this parameter should be about the same as the `prediction_length`. The model also receives lagged inputs from the target, so `context_length` can be much smaller than typical seasonalities. For example, a daily time series can have yearly seasonality. The model automatically includes a lag of one year, so the context length can be shorter than a year. The lag values that the model picks depend on the frequency of the time series. For example, lag values for daily frequency are previous week, 2 weeks, 3 weeks, 4 weeks, and year. **Required** Valid values: Positive integer  | 
| epochs |  The maximum number of passes over the training data. The optimal value depends on your data size and learning rate. See also `early_stopping_patience`. Typical values range from 10 to 1000. **Required** Valid values: Positive integer  | 
| prediction_length |  The number of time-steps that the model is trained to predict, also called the forecast horizon. The trained model always generates forecasts with this length. It can't generate longer forecasts. The `prediction_length` is fixed when a model is trained and it cannot be changed later. **Required** Valid values: Positive integer  | 
| time_freq |  The granularity of the time series in the dataset. Use `time_freq` to select appropriate date features and lags. The model supports the following basic frequencies. It also supports multiples of these basic frequencies. For example, `5min` specifies a frequency of 5 minutes. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deepar_hyperparameters.html) **Required** Valid values: An integer followed by *M*, *W*, *D*, *H*, or *min*. For example, `5min`.  | 
| cardinality |  When using the categorical features (`cat`), `cardinality` is an array specifying the number of categories (groups) per categorical feature. Set this to `auto` to infer the cardinality from the data. The `auto` mode also works when no categorical features are used in the dataset. This is the recommended setting for the parameter. Set cardinality to `ignore` to force DeepAR to not use categorical features, even if they are present in the data. To perform additional data validation, it is possible to explicitly set this parameter to the actual value. For example, if two categorical features are provided where the first has 2 and the other has 3 possible values, set this to [2, 3]. For more information on how to use categorical features, see the data section on the main documentation page of DeepAR. **Optional** Valid values: `auto`, `ignore`, array of positive integers, or empty string Default value: `auto`  | 
| dropout_rate |  The dropout rate to use during training. The model uses zoneout regularization. For each iteration, a random subset of hidden neurons is not updated. Typical values are less than 0.2. **Optional** Valid values: float Default value: 0.1  | 
| early_stopping_patience |  If this parameter is set, training stops when no progress is made within the specified number of `epochs`. The model that has the lowest loss is returned as the final model. **Optional** Valid values: integer  | 
| embedding_dimension |  Size of embedding vector learned per categorical feature (same value is used for all categorical features). The DeepAR model can learn group-level time series patterns when a categorical grouping feature is provided. To do this, the model learns an embedding vector of size `embedding_dimension` for each group, capturing the common properties of all time series in the group. A larger `embedding_dimension` allows the model to capture more complex patterns. However, because increasing the `embedding_dimension` increases the number of parameters in the model, more training data is required to accurately learn these parameters. Typical values for this parameter are between 10-100.  **Optional** Valid values: positive integer Default value: 10  | 
| learning_rate |  The learning rate used in training. Typical values range from 1e-4 to 1e-1. **Optional** Valid values: float Default value: 1e-3  | 
| likelihood |  The model generates a probabilistic forecast, and can provide quantiles of the distribution and return samples. Depending on your data, select an appropriate likelihood (noise model) that is used for uncertainty estimates. The following likelihoods can be selected: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/deepar_hyperparameters.html) **Optional** Valid values: One of *gaussian*, *beta*, *negative-binomial*, *student-T*, or *deterministic-L1*. Default value: `student-T`  | 
| mini_batch_size |  The size of mini-batches used during training. Typical values range from 32 to 512. **Optional** Valid values: positive integer Default value: 128  | 
| num_cells |  The number of cells to use in each hidden layer of the RNN. Typical values range from 30 to 100. **Optional** Valid values: positive integer Default value: 40  | 
| num_dynamic_feat |  The number of `dynamic_feat` provided in the data. Set this to `auto` to infer the number of dynamic features from the data. The `auto` mode also works when no dynamic features are used in the dataset. This is the recommended setting for the parameter. To force DeepAR to not use dynamic features, even if they are present in the data, set `num_dynamic_feat` to `ignore`.  To perform additional data validation, it is possible to explicitly set this parameter to the actual integer value. For example, if two dynamic features are provided, set this to 2.  **Optional** Valid values: `auto`, `ignore`, positive integer, or empty string Default value: `auto`  | 
| num_eval_samples |  The number of samples that are used per time-series when calculating test accuracy metrics. This parameter does not have any influence on the training or the final model. In particular, the model can be queried with a different number of samples. This parameter only affects the reported accuracy scores on the test channel after training. Smaller values result in faster evaluation, but then the evaluation scores are typically worse and more uncertain. When evaluating with higher quantiles, for example 0.95, it may be important to increase the number of evaluation samples. **Optional** Valid values: integer Default value: 100  | 
| num_layers |  The number of hidden layers in the RNN. Typical values range from 1 to 4. **Optional** Valid values: positive integer Default value: 2  | 
| test_quantiles |  Quantiles for which to calculate quantile loss on the test channel. **Optional** Valid values: array of floats Default value: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  | 
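
When launching a training job with the SageMaker Python SDK, these hyperparameters are typically passed as strings. The following dictionary is a sketch with illustrative values for an hourly dataset (the specific values are assumptions, not recommendations):

```python
# Illustrative DeepAR hyperparameter values for an hourly dataset.
# SageMaker expects all hyperparameter values to be passed as strings.
hyperparameters = {
    "time_freq": "H",                # hourly granularity
    "context_length": "72",          # roughly the same order as prediction_length
    "prediction_length": "24",       # forecast horizon: one day
    "epochs": "100",
    "early_stopping_patience": "10",
    "mini_batch_size": "128",
    "learning_rate": "1e-3",
    "num_cells": "40",
    "num_layers": "2",
    "likelihood": "student-T",
    "cardinality": "auto",
    "num_dynamic_feat": "auto",
}
```

You might then pass this dictionary to an estimator's `set_hyperparameters(**hyperparameters)` call before starting the training job.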

# Tune a DeepAR Model
<a name="deepar-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the DeepAR Algorithm
<a name="deepar-metrics"></a>

The DeepAR algorithm reports three metrics, which are computed during training. When tuning a model, choose one of these as the objective. For the objective, use either the forecast accuracy on a provided test channel (recommended) or the training loss. For recommendations for the training/test split for the DeepAR algorithm, see [Best Practices for Using the DeepAR Algorithm](deepar.md#deepar_best_practices). 


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:RMSE |  The root mean square error between the forecast and the actual target computed on the test set.  |  Minimize  | 
| test:mean_wQuantileLoss |  The average over all quantile losses computed on the test set. To control which quantiles are used, set the `test_quantiles` hyperparameter.   |  Minimize  | 
| train:final_loss |  The training negative log-likelihood loss averaged over the last training epoch for the model.  |  Minimize  | 

## Tunable Hyperparameters for the DeepAR Algorithm
<a name="deepar-tunable-hyperparameters"></a>

Tune a DeepAR model with the following hyperparameters. The hyperparameters that have the greatest impact, listed in order from the most to least impactful, on DeepAR objective metrics are: `epochs`, `context_length`, `mini_batch_size`, `learning_rate`, and `num_cells`.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| epochs |  `IntegerParameterRanges`  |  MinValue: 1, MaxValue: 1000  | 
| context\_length |  `IntegerParameterRanges`  |  MinValue: 1, MaxValue: 200  | 
| mini\_batch\_size |  `IntegerParameterRanges`  |  MinValue: 32, MaxValue: 1028  | 
| learning\_rate |  `ContinuousParameterRange`  |  MinValue: 1e-5, MaxValue: 1e-1  | 
| num\_cells |  `IntegerParameterRanges`  |  MinValue: 30, MaxValue: 200  | 
| num\_layers |  `IntegerParameterRanges`  |  MinValue: 1, MaxValue: 8  | 
| dropout\_rate |  `ContinuousParameterRange`  |  MinValue: 0.00, MaxValue: 0.2  | 
| embedding\_dimension |  `IntegerParameterRanges`  |  MinValue: 1, MaxValue: 50  | 

# DeepAR Inference Formats
<a name="deepar-in-formats"></a>

This page describes the request and response formats for inference with the Amazon SageMaker AI DeepAR model.

## DeepAR JSON Request Formats
<a name="deepar-json-request"></a>

Query a trained model by using the model's endpoint. The endpoint takes the following JSON request format. 

In the request, the `instances` field corresponds to the time series that should be forecast by the model. 

If the model was trained with categories, you must provide a `cat` field for each instance. If the model was trained without the `cat` field, omit it from the request.

If the model was trained with a custom feature time series (`dynamic_feat`), you must provide the same number of `dynamic_feat` values for each instance. Each of them should have a length of `length(target) + prediction_length`, where the last `prediction_length` values correspond to the time points in the future that will be predicted. If the model was trained without custom feature time series, do not include the field in the request.

```
{
    "instances": [
        {
            "start": "2009-11-01 00:00:00",
            "target": [4.0, 10.0, "NaN", 100.0, 113.0],
            "cat": [0, 1],
            "dynamic_feat": [[1.0, 1.1, 2.1, 0.5, 3.1, 4.1, 1.2, 5.0, ...]]
        },
        {
            "start": "2012-01-30",
            "target": [1.0],
            "cat": [2, 1],
            "dynamic_feat": [[2.0, 3.1, 4.5, 1.5, 1.8, 3.2, 0.1, 3.0, ...]]
        },
        {
            "start": "1999-01-30",
            "target": [2.0, 1.0],
            "cat": [1, 3],
            "dynamic_feat": [[1.0, 0.1, -2.5, 0.3, 2.0, -1.2, -0.1, -3.0, ...]]
        }
    ],
    "configuration": {
         "num_samples": 50,
         "output_types": ["mean", "quantiles", "samples"],
         "quantiles": ["0.5", "0.9"]
    }
}
```

The `configuration` field is optional. `configuration.num_samples` sets the number of sample paths that the model generates to estimate the mean and quantiles. `configuration.output_types` describes the information that is returned in the response. Valid values are `"mean"`, `"quantiles"`, and `"samples"`. If you specify `"quantiles"`, each of the quantile values in `configuration.quantiles` is returned as a time series. If you specify `"samples"`, the model also returns the raw samples used to calculate the other outputs.
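For illustration, a request body like the one above might be assembled and validated client-side as follows. This is a sketch: the endpoint name and `prediction_length` value are hypothetical, and the commented-out call shows where a `boto3` SageMaker runtime `invoke_endpoint` call would fit.

```python
import json

prediction_length = 3  # must match the value the model was trained with (assumed here)

instance = {
    "start": "2009-11-01 00:00:00",
    "target": [4.0, 10.0, "NaN", 100.0, 113.0],
    "cat": [0, 1],
    # each dynamic_feat series must cover the target plus the forecast horizon
    "dynamic_feat": [[1.0, 1.1, 2.1, 0.5, 3.1, 4.1, 1.2, 5.0]],
}

# sanity check: length(target) + prediction_length values per feature series
for feat in instance["dynamic_feat"]:
    assert len(feat) == len(instance["target"]) + prediction_length

payload = {
    "instances": [instance],
    "configuration": {
        "num_samples": 50,
        "output_types": ["mean", "quantiles"],
        "quantiles": ["0.5", "0.9"],
    },
}

body = json.dumps(payload)
# The request body would then be sent to the endpoint, for example:
# boto3.client("sagemaker-runtime").invoke_endpoint(
#     EndpointName="my-deepar-endpoint",   # hypothetical endpoint name
#     ContentType="application/json",
#     Body=body,
# )
```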

## DeepAR JSON Response Formats
<a name="deepar-json-response"></a>

The following is the format of a response, where `[...]` are arrays of numbers:

```
{
    "predictions": [
        {
            "quantiles": {
                "0.9": [...],
                "0.5": [...]
            },
            "samples": [...],
            "mean": [...]
        },
        {
            "quantiles": {
                "0.9": [...],
                "0.5": [...]
            },
            "samples": [...],
            "mean": [...]
        },
        {
            "quantiles": {
                "0.9": [...],
                "0.5": [...]
            },
            "samples": [...],
            "mean": [...]
        }
    ]
}
```

DeepAR has a response timeout of 60 seconds. When you pass multiple time series in a single request, the forecasts are generated sequentially. Because the forecast for each time series typically takes about 300 to 1000 milliseconds or longer, depending on the model size, passing too many time series in a single request can cause timeouts. It's better to send fewer time series per request and more requests overall. Because the DeepAR algorithm uses multiple workers per instance, you can achieve much higher throughput by sending multiple requests in parallel.

By default, DeepAR uses one worker per CPU for inference, if there is sufficient memory per CPU. If the model is large and there isn't enough memory to run a model on each CPU, the number of workers is reduced. The number of workers used for inference can be overridden using the `MODEL_SERVER_WORKERS` environment variable (for example, by setting `MODEL_SERVER_WORKERS=1`) when calling the SageMaker AI [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API.
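The advice above — fewer series per request, more requests in parallel — can be sketched as follows. The `invoke` function here is a stand-in for a real `invoke_endpoint` call, so the example runs without an endpoint.

```python
from concurrent.futures import ThreadPoolExecutor

def invoke(batch):
    # Stand-in for a real sagemaker-runtime invoke_endpoint call;
    # it just returns one (empty) prediction per series in the request.
    return {"predictions": [{"mean": []} for _ in batch]}

# 100 time series, sent as 20 requests of 5 series each, in parallel
series = [{"start": "2009-11-01 00:00:00", "target": [1.0]} for _ in range(100)]
batches = [series[i:i + 5] for i in range(0, len(series), 5)]

with ThreadPoolExecutor(max_workers=8) as pool:
    responses = list(pool.map(invoke, batches))

# Flatten back to one forecast per input series, in input order
predictions = [p for r in responses for p in r["predictions"]]
```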

## Batch Transform with the DeepAR Algorithm
<a name="deepar-batch"></a>

DeepAR forecasting supports getting inferences by using batch transform from data using the JSON Lines format. In this format, each record is represented on a single line as a JSON object, and lines are separated by newline characters. The format is identical to the JSON Lines format used for model training. For information, see [Input/Output Interface for the DeepAR Algorithm](deepar.md#deepar-inputoutput). For example:

```
{"start": "2009-11-01 00:00:00", "target": [4.3, "NaN", 5.1, ...], "cat": [0, 1], "dynamic_feat": [[1.1, 1.2, 0.5, ..]]}
{"start": "2012-01-30 00:00:00", "target": [1.0, -5.0, ...], "cat": [2, 3], "dynamic_feat": [[1.1, 2.05, ...]]}
{"start": "1999-01-30 00:00:00", "target": [2.0, 1.0], "cat": [1, 4], "dynamic_feat": [[1.3, 0.4]]}
```

**Note**  
When creating the transformation job with [CreateTransformJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html), set the `BatchStrategy` value to `SingleRecord` and set the `SplitType` value in the [TransformInput](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformInput.html) configuration to `Line`, because the default values currently cause runtime failures.

Similar to the hosted endpoint inference request format, the `cat` and the `dynamic_feat` fields for each instance are required if both of the following are true:
+ The model is trained on a dataset that contained both the `cat` and the `dynamic_feat` fields.
+ The corresponding `cardinality` and `num_dynamic_feat` values used in the training job are not set to `""`.

Unlike hosted endpoint inference, the configuration field is set once for the entire batch inference job using an environment variable named `DEEPAR_INFERENCE_CONFIG`. The value of `DEEPAR_INFERENCE_CONFIG` can be passed when the transform job is created by calling the [CreateTransformJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) API. If `DEEPAR_INFERENCE_CONFIG` is missing in the container environment, the inference container uses the following default:

```
{
    "num_samples": 100,
    "output_types": ["mean", "quantiles"],
    "quantiles": ["0.1", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9"]
}
```

The output is also in JSON Lines format, with one line per prediction, in an order identical to the instance order in the corresponding input file. Predictions are encoded as objects identical to the ones returned by responses in online inference mode. For example:

```
{ "quantiles": { "0.1": [...], "0.2": [...] }, "samples": [...], "mean": [...] }
```
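Reading batch output in this format back into Python is one `json.loads` per line; a minimal sketch with two inlined output lines standing in for the contents of an output file:

```python
import json

# Two output lines as a batch transform job might produce them (inlined here
# instead of reading a real output file)
raw = "\n".join([
    '{"quantiles": {"0.1": [1.1, 1.2], "0.9": [2.1, 2.2]}, "mean": [1.6, 1.7]}',
    '{"quantiles": {"0.1": [0.4, 0.5], "0.9": [0.9, 1.0]}, "mean": [0.7, 0.8]}',
])

# One prediction per line, in the same order as the input records
predictions = [json.loads(line) for line in raw.splitlines() if line.strip()]
```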

Note that in the [TransformInput](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TransformInput.html) configuration of the SageMaker AI [CreateTransformJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) request, clients must explicitly set the `AssembleWith` value to `Line`, because the default value `None` concatenates all JSON objects on the same line.

For example, here is a SageMaker AI [CreateTransformJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html) request for a DeepAR job with a custom `DEEPAR_INFERENCE_CONFIG`:

```
{
   "BatchStrategy": "SingleRecord",
   "Environment": { 
      "DEEPAR_INFERENCE_CONFIG" : "{ \"num_samples\": 200, \"output_types\": [\"mean\"] }",
      ...
   },
   "TransformInput": {
      "SplitType": "Line",
      ...
   },
   "TransformOutput": { 
      "AssembleWith": "Line",
      ...
   },
   ...
}
```

# Unsupervised Built-in SageMaker AI Algorithms
<a name="algorithms-unsupervised"></a>

Amazon SageMaker AI provides several built-in algorithms that can be used for a variety of unsupervised learning tasks such as clustering, dimension reduction, pattern recognition, and anomaly detection.
+ [IP Insights](ip-insights.md)—learns the usage patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers.
+ [K-Means Algorithm](k-means.md)—finds discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups.
+ [Principal Component Analysis (PCA) Algorithm](pca.md)—reduces the dimensionality (number of features) within a dataset by projecting data points onto the first few principal components. The objective is to retain as much information or variation as possible. For mathematicians, principal components are eigenvectors of the data's covariance matrix.
+ [Random Cut Forest (RCF) Algorithm](randomcutforest.md)—detects anomalous data points within a data set that diverge from otherwise well-structured or patterned data.


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| IP Insights | train and (optionally) validation | File | CSV | CPU or GPU | Yes | 
| K-Means | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU or GPU (single GPU device on one or more instances) | No | 
| PCA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | GPU or CPU | Yes | 
| Random Cut Forest | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU | Yes | 

# IP Insights
<a name="ip-insights"></a>

Amazon SageMaker AI IP Insights is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers. You can use it to identify a user attempting to log into a web service from an anomalous IP address, for example. Or you can use it to identify an account that is attempting to create computing resources from an unusual IP address. Trained IP Insights models can be hosted at an endpoint for making real-time predictions or used for processing batch transforms.

SageMaker AI IP Insights ingests historical data as (entity, IPv4 address) pairs and learns the IP usage patterns of each entity. When queried with an (entity, IPv4 address) event, a SageMaker AI IP Insights model returns a score that infers how anomalous the pattern of the event is. For example, when a user attempts to log in from an IP address, if the IP Insights score is high enough, a web login server might decide to trigger a multi-factor authentication system. In more advanced solutions, you can feed the IP Insights score into another machine learning model. For example, you can combine the IP Insights score with other features to rank the findings of another security system, such as those from [Amazon GuardDuty](https://docs.aws.amazon.com/guardduty/latest/ug/what-is-guardduty.html).

The SageMaker AI IP Insights algorithm can also learn vector representations of IP addresses, known as *embeddings*. You can use vector-encoded embeddings as features in downstream machine learning tasks that use the information observed in the IP addresses. For example, you can use them in tasks such as measuring similarities between IP addresses in clustering and visualization tasks.

**Topics**
+ [Input/Output Interface for the IP Insights Algorithm](#ip-insights-inputoutput)
+ [EC2 Instance Recommendation for the IP Insights Algorithm](#ip-insights-instances)
+ [IP Insights Sample Notebooks](#ip-insights-sample-notebooks)
+ [How IP Insights Works](ip-insights-howitworks.md)
+ [IP Insights Hyperparameters](ip-insights-hyperparameters.md)
+ [Tune an IP Insights Model](ip-insights-tuning.md)
+ [IP Insights Data Formats](ip-insights-data-formats.md)

## Input/Output Interface for the IP Insights Algorithm
<a name="ip-insights-inputoutput"></a>

**Training and Validation**

The SageMaker AI IP Insights algorithm supports training and validation data channels. It uses the optional validation channel to compute an area-under-curve (AUC) score on a predefined negative sampling strategy. The AUC metric validates how well the model discriminates between positive and negative samples. Training and validation data content types need to be in `text/csv` format. The first column of the CSV data is an opaque string that provides a unique identifier for the entity. The second column is an IPv4 address in decimal-dot notation. IP Insights currently supports only File mode. For more information and some examples, see [IP Insights Training Data Formats](ip-insights-training-data-formats.md).

**Inference**

For inference, IP Insights supports `text/csv`, `application/json`, and `application/jsonlines` data content types. For more information about the common data formats for inference provided by SageMaker AI, see [Common data formats for inference](cdf-inference.md). IP Insights inference returns output formatted as either `application/json` or `application/jsonlines`. Each record in the output data contains the corresponding `dot_product` (or compatibility score) for each input data point. For more information and some examples, see [IP Insights Inference Data Formats](ip-insights-inference-data-formats.md).

## EC2 Instance Recommendation for the IP Insights Algorithm
<a name="ip-insights-instances"></a>

The SageMaker AI IP Insights algorithm can run on both GPU and CPU instances. For training jobs, we recommend using GPU instances. However, for certain workloads with large training datasets, distributed CPU instances might reduce training costs. For inference, we recommend using CPU instances. IP Insights supports P2, P3, G4dn, and G5 GPU families.

### GPU Instances for the IP Insights Algorithm
<a name="ip-insights-instances-gpu"></a>

IP Insights supports all available GPUs. If you need to speed up training, we recommend starting with a single GPU instance, such as ml.p3.2xlarge, and then moving to a multi-GPU environment, such as ml.p3.8xlarge and ml.p3.16xlarge. Multi-GPUs automatically divide the mini batches of training data across themselves. If you switch from a single GPU to multiple GPUs, the `mini_batch_size` is divided equally into the number of GPUs used. You may want to increase the value of the `mini_batch_size` to compensate for this.
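A quick arithmetic illustration of that division (the GPU counts match ml.p3.2xlarge and ml.p3.8xlarge; the batch size is illustrative):

```python
# mini_batch_size is split equally across GPUs, so the per-GPU batch shrinks
# when moving to a multi-GPU instance; scaling mini_batch_size by the GPU
# count restores the original per-GPU batch.
mini_batch_size = 10000

per_gpu_single = mini_batch_size // 1   # ml.p3.2xlarge: 1 GPU
per_gpu_multi = mini_batch_size // 4    # ml.p3.8xlarge: 4 GPUs

compensated = mini_batch_size * 4       # keeps 10000 examples per GPU on 4 GPUs
```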

### CPU Instances for the IP Insights Algorithm
<a name="ip-insights-instances-cpu"></a>

The type of CPU instance that we recommend depends largely on the instance's available memory and the model size. The model size is determined by two hyperparameters: `vector_dim` and `num_entity_vectors`. The maximum supported model size is 8 GB. The following table lists typical EC2 instance types that you would deploy based on these input parameters for various model sizes. In the table, the values for `vector_dim` in the first column range from 32 to 2048, and the values for `num_entity_vectors` in the first row range from 10,000 to 50,000,000.


| `vector_dim` \ `num_entity_vectors` | 10,000 | 50,000 | 100,000 | 500,000 | 1,000,000 | 5,000,000 | 10,000,000 | 50,000,000 | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
|  `32`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.2xlarge | ml.m5.4xlarge | 
|  `64`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.2xlarge | ml.m5.2xlarge |  | 
|  `128`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.2xlarge | ml.m5.4xlarge |  | 
|  `256`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.4xlarge |  |  | 
|  `512`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.large | ml.m5.2xlarge |  |  |  | 
|  `1024`  |  ml.m5.large  | ml.m5.large | ml.m5.large | ml.m5.xlarge | ml.m5.4xlarge |  |  |  | 
|  `2048`  |  ml.m5.large  | ml.m5.large | ml.m5.xlarge | ml.m5.xlarge |  |  |  |  | 

The values for the `mini_batch_size`, `num_ip_encoder_layers`, `random_negative_sampling_rate`, and `shuffled_negative_sampling_rate` hyperparameters also affect the amount of memory required. If these values are large, you might need to use a larger instance type than normal.

## IP Insights Sample Notebooks
<a name="ip-insights-sample-notebooks"></a>

For a sample notebook that shows how to train the SageMaker AI IP Insights algorithm and perform inferences with it, see [An Introduction to the SageMaker AI IP Insights Algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/ipinsights_login/ipinsights-tutorial.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After creating a notebook instance, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI examples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How IP Insights Works
<a name="ip-insights-howitworks"></a>

Amazon SageMaker AI IP Insights is an unsupervised algorithm that consumes observed data in the form of (entity, IPv4 address) pairs that associates entities with IP addresses. IP Insights determines how likely it is that an entity would use a particular IP address by learning latent vector representations for both entities and IP addresses. The distance between these two representations can then serve as the proxy for how likely this association is.

The IP Insights algorithm uses a neural network to learn the latent vector representations for entities and IP addresses. Entities are first hashed to a large but fixed hash space and then encoded by a simple embedding layer. Character strings such as user names or account IDs can be fed directly into IP Insights as they appear in log files. You don't need to preprocess the data for entity identifiers. You can provide entities as an arbitrary string value during both training and inference. The hash size should be configured with a value that is high enough to ensure that the number of *collisions*, which occur when distinct entities are mapped to the same latent vector, remains insignificant. For more information about how to select appropriate hash sizes, see [Feature Hashing for Large Scale Multitask Learning](https://alex.smola.org/papers/2009/Weinbergeretal09.pdf). For representing IP addresses, on the other hand, IP Insights uses a specially designed encoder network to uniquely represent each possible IPv4 address by exploiting the prefix structure of IP addresses.

During training, IP Insights automatically generates negative samples by randomly pairing entities and IP addresses. These negative samples represent data that is less likely to occur in reality. The model is trained to discriminate between positive samples that are observed in the training data and these generated negative samples. More specifically, the model is trained to minimize the *cross entropy*, also known as the *log loss*, defined as follows: 

![\[An image containing the equation for log loss.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ip-insight-image-cross-entropy.png)


yn is the label that indicates whether the sample is from the real distribution governing observed data (yn=1) or from the distribution generating negative samples (yn=0). pn is the probability that the sample is from the real distribution, as predicted by the model.

Generating negative samples is an important process that is used to achieve an accurate model of the observed data. If negative samples are extremely unlikely, for example, if all of the IP addresses in negative samples are 10.0.0.0, then the model trivially learns to distinguish negative samples and fails to accurately characterize the actual observed dataset. To keep negative samples more realistic, IP Insights generates negative samples both by randomly generating IP addresses and randomly picking IP addresses from training data. You can configure the type of negative sampling and the rates at which negative samples are generated with the `random_negative_sampling_rate` and `shuffled_negative_sampling_rate` hyperparameters.
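The two sampling strategies can be illustrated schematically as follows. This is a sketch of the idea, not the algorithm's internal implementation; the entity names and rates are made up.

```python
import random

random.seed(0)

# A mini batch of observed (entity, IPv4) pairs
batch = [("user_a", "10.0.0.1"), ("user_b", "192.168.1.7"), ("user_c", "172.16.0.9")]

def random_negatives(batch, rate):
    """Pair each entity with `rate` uniformly random IPv4 addresses."""
    out = []
    for entity, _ in batch:
        for _ in range(rate):
            ip = ".".join(str(random.randrange(256)) for _ in range(4))
            out.append((entity, ip))
    return out

def shuffled_negatives(batch, rate):
    """Pair each entity with `rate` IPs drawn from other records in the batch."""
    ips = [ip for _, ip in batch]
    out = []
    for entity, _ in batch:
        for _ in range(rate):
            out.append((entity, random.choice(ips)))
    return out

# With both rates set to 1, each input pair yields two negatives
negatives = random_negatives(batch, rate=1) + shuffled_negatives(batch, rate=1)
```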

Given the nth (entity, IP address) pair, the IP Insights model outputs a *score*, Sn, that indicates how compatible the entity is with the IP address. This score corresponds to the log odds ratio of the pair coming from the real distribution as compared to coming from the negative distribution. It is defined as follows:

![\[An image containing the equation for the score, a log odds ratio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ip-insight-image-log-odds.png)


The score is essentially a measure of the similarity between the vector representations of the nth entity and IP address. It can be interpreted as how much more likely it would be to observe this event in reality than in a randomly generated dataset. During training, the algorithm uses this score to calculate an estimate of the probability of a sample coming from the real distribution, pn, to use in the cross entropy minimization, where:

![\[An image showing the equation for probability that the sample is from a real distribution.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/ip-insight-image-sample-probability.png)
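To make the relationship between the score, the probability, and the loss concrete, here is a small numerical sketch; the scores are made up, and the sigmoid form of pn follows from the log odds definition above.

```python
import math

def sigmoid(s):
    # p_n = 1 / (1 + exp(-s_n)): converts the log odds score into a probability
    return 1.0 / (1.0 + math.exp(-s))

def cross_entropy(scores, labels):
    # labels: 1 for observed (positive) samples, 0 for generated negatives
    total = 0.0
    for s, y in zip(scores, labels):
        p = sigmoid(s)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(scores)

# Two positives with high scores, two negatives with low scores -> small loss,
# because the model's probabilities agree with the labels
loss = cross_entropy([4.0, 3.0, -3.5, -4.2], [1, 1, 0, 0])
```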


# IP Insights Hyperparameters
<a name="ip-insights-hyperparameters"></a>

In the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, you specify the training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the Amazon SageMaker AI IP Insights algorithm.


| Parameter Name | Description | 
| --- | --- | 
| num\_entity\_vectors | The number of entity vector representations (entity embedding vectors) to train. Each entity in the training set is randomly assigned to one of these vectors using a hash function. Because of hash collisions, multiple entities can be assigned to the same vector, causing one vector to represent multiple entities. This generally has a negligible effect on model performance, as long as the collision rate is not too severe. To keep the collision rate low, set this value as high as possible. However, the model size, and, therefore, the memory requirement, for both training and inference, scales linearly with this hyperparameter. We recommend that you set this value to twice the number of unique entity identifiers. **Required** Valid values: 1 ≤ positive integer ≤ 250,000,000  | 
| vector\_dim | The size of embedding vectors to represent entities and IP addresses. The larger the value, the more information that can be encoded using these representations. In practice, model size scales linearly with this parameter and limits how large the dimension can be. In addition, using vector representations that are too large can cause the model to overfit, especially for small training datasets. Overfitting occurs when a model doesn't learn any pattern in the data but effectively memorizes the training data and, therefore, cannot generalize well and performs poorly during inference. The recommended value is 128. **Required** Valid values: 4 ≤ positive integer ≤ 4096  | 
| batch\_metrics\_publish\_interval | The interval (every X batches) at which the Apache MXNet Speedometer function prints the training speed of the network (samples/second).  **Optional** Valid values: positive integer ≥ 1 Default value: 1,000 | 
| epochs | The number of passes over the training data. The optimal value depends on your data size and learning rate. Typical values range from 5 to 100. **Optional** Valid values: positive integer ≥ 1 Default value: 10 | 
| learning\_rate | The learning rate for the optimizer. IP Insights uses a gradient-descent-based Adam optimizer. The learning rate effectively controls the step size to update model parameters at each iteration. Too large a learning rate can cause the model to diverge because the training is likely to overshoot a minimum. On the other hand, too small a learning rate slows down convergence. Typical values range from 1e-4 to 1e-1. **Optional** Valid values: 1e-6 ≤ float ≤ 10.0 Default value: 0.001 | 
| mini\_batch\_size | The number of examples in each mini batch. The training procedure processes data in mini batches. The optimal value depends on the number of unique account identifiers in the dataset. In general, the larger the `mini_batch_size`, the faster the training and the greater the number of possible shuffled-negative-sample combinations. However, with a large `mini_batch_size`, the training is more likely to converge to a poor local minimum and perform relatively worse for inference.  **Optional** Valid values: 1 ≤ positive integer ≤ 500,000 Default value: 10,000 | 
| num\_ip\_encoder\_layers | The number of fully connected layers used to encode the IP address embedding. The larger the number of layers, the greater the model's capacity to capture patterns among IP addresses. However, using a large number of layers increases the chance of overfitting. **Optional** Valid values: 0 ≤ positive integer ≤ 100 Default value: 1 | 
| random\_negative\_sampling\_rate | The number of random negative samples, R, to generate per input example. The training procedure relies on negative samples to prevent the vector representations of the model collapsing to a single point. Random negative sampling generates R random IP addresses for each input account in the mini batch. The sum of the `random_negative_sampling_rate` (R) and `shuffled_negative_sampling_rate` (S) must be in the interval: 1 ≤ R + S ≤ 500. **Optional** Valid values: 0 ≤ positive integer ≤ 500 Default value: 1 | 
| shuffled\_negative\_sampling\_rate | The number of shuffled negative samples, S, to generate per input example. In some cases, it helps to use more realistic negative samples that are randomly picked from the training data itself. This kind of negative sampling is achieved by shuffling the data within a mini batch. Shuffled negative sampling generates S negative IP addresses by shuffling the IP address and account pairings within a mini batch. The sum of the `random_negative_sampling_rate` (R) and `shuffled_negative_sampling_rate` (S) must be in the interval: 1 ≤ R + S ≤ 500. **Optional** Valid values: 0 ≤ positive integer ≤ 500 Default value: 1 | 
| weight\_decay | The weight decay coefficient. This parameter adds an L2 regularization factor that helps prevent the model from overfitting the training data. **Optional** Valid values: 0.0 ≤ float ≤ 10.0 Default value: 0.00001 | 

# Tune an IP Insights Model
<a name="ip-insights-tuning"></a>

*Automatic model tuning*, also called hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the IP Insights Algorithm
<a name="ip-insights-metrics"></a>

The Amazon SageMaker AI IP Insights algorithm is an unsupervised learning algorithm that learns associations between IP addresses and entities. The algorithm trains a discriminator model, which learns to separate observed data points (*positive samples*) from randomly generated data points (*negative samples*). Automatic model tuning on IP Insights helps you find the model that can most accurately distinguish between unlabeled validation data and automatically generated negative samples. Model accuracy on the validation dataset is measured by the area under the receiver operating characteristic curve. This `validation:discriminator_auc` metric can take values between 0.0 and 1.0, where 1.0 indicates perfect accuracy.

The IP Insights algorithm computes a `validation:discriminator_auc` metric during validation, the value of which is used as the objective function to optimize for hyperparameter tuning.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:discriminator\_auc |  Area under the receiver operating characteristic curve on the validation dataset. The validation dataset is not labeled. Area Under the Curve (AUC) is a metric that describes the model's ability to discriminate validation data points from randomly generated data points.  |  Maximize  | 

## Tunable IP Insights Hyperparameters
<a name="ip-insights-tunable-hyperparameters"></a>

You can tune the following hyperparameters for the SageMaker AI IP Insights algorithm. 


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| epochs |  IntegerParameterRanges  |  MinValue: 1, MaxValue: 100  | 
| learning\_rate |  ContinuousParameterRange  |  MinValue: 1e-4, MaxValue: 0.1  | 
| mini\_batch\_size |  IntegerParameterRanges  |  MinValue: 100, MaxValue: 50000  | 
| num\_entity\_vectors |  IntegerParameterRanges  |  MinValue: 10000, MaxValue: 1000000  | 
| num\_ip\_encoder\_layers |  IntegerParameterRanges  |  MinValue: 1, MaxValue: 10  | 
| random\_negative\_sampling\_rate |  IntegerParameterRanges  |  MinValue: 0, MaxValue: 10  | 
| shuffled\_negative\_sampling\_rate |  IntegerParameterRanges  |  MinValue: 0, MaxValue: 10  | 
| vector\_dim |  IntegerParameterRanges  |  MinValue: 8, MaxValue: 256  | 
| weight\_decay |  ContinuousParameterRange  |  MinValue: 0.0, MaxValue: 1.0  | 

# IP Insights Data Formats
<a name="ip-insights-data-formats"></a>

This section provides examples of the available input and output data formats used by the IP Insights algorithm during training and inference.

**Topics**
+ [IP Insights Training Data Formats](ip-insights-training-data-formats.md)
+ [IP Insights Inference Data Formats](ip-insights-inference-data-formats.md)

# IP Insights Training Data Formats
<a name="ip-insights-training-data-formats"></a>

The following are the available data input formats for the IP Insights algorithm. Amazon SageMaker AI built-in algorithms adhere to the common input training format described in [Common Data Formats for Training](cdf-training.md). However, the SageMaker AI IP Insights algorithm currently supports only the CSV data input format.

## IP Insights Training Data Input Formats
<a name="ip-insights-training-input-format-requests"></a>

### INPUT: CSV
<a name="ip-insights-input-csv"></a>

The CSV file must have two columns. The first column is an opaque string that corresponds to an entity's unique identifier. The second column is the IPv4 address of the entity's access event in decimal-dot notation. 

content-type: text/csv

```
entity_id_1, 192.168.1.2
entity_id_2, 10.10.1.2
```
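
A minimal sketch of preparing such a file with the Python standard library (the entity IDs and addresses are illustrative):

```python
import csv
import io
import ipaddress

# Illustrative access events: (entity ID, IPv4 address in decimal-dot notation)
events = [("entity_id_1", "192.168.1.2"), ("entity_id_2", "10.10.1.2")]

buf = io.StringIO()
writer = csv.writer(buf)
for entity_id, ip in events:
    ipaddress.IPv4Address(ip)  # raises ValueError if not a valid IPv4 address
    writer.writerow([entity_id, ip])

print(buf.getvalue())
```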

# IP Insights Inference Data Formats
<a name="ip-insights-inference-data-formats"></a>

The following are the available input and output formats for the IP Insights algorithm. Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common data formats for inference](cdf-inference.md). However, the SageMaker AI IP Insights algorithm does not currently support RecordIO format.

## IP Insights Input Request Formats
<a name="ip-insights-input-format-requests"></a>

### INPUT: CSV Format
<a name="ip-insights-input-csv-format"></a>

The CSV file must have two columns. The first column is an opaque string that corresponds to an entity's unique identifier. The second column is the IPv4 address of the entity's access event in decimal-dot notation. 

content-type: text/csv

```
entity_id_1, 192.168.1.2
entity_id_2, 10.10.1.2
```

### INPUT: JSON Format
<a name="ip-insights-input-json"></a>

JSON data can be provided in different formats. IP Insights follows the common SageMaker AI formats. For more information about inference formats, see [Common data formats for inference](cdf-inference.md).

content-type: application/json

```
{
  "instances": [
    {"data": {"features": {"values": ["entity_id_1", "192.168.1.2"]}}},
    {"features": ["entity_id_2", "10.10.1.2"]}
  ]
}
```

### INPUT: JSONLINES Format
<a name="ip-insights-input-jsonlines"></a>

The JSON Lines content type is useful for running batch transform jobs. For more information on SageMaker AI inference formats, see [Common data formats for inference](cdf-inference.md). For more information on running batch transform jobs, see [Batch transform for inference with Amazon SageMaker AI](batch-transform.md).

content-type: application/jsonlines

```
{"data": {"features": {"values": ["entity_id_1", "192.168.1.2"]}}},
{"features": ["entity_id_2", "10.10.1.2"]}]
```
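
A payload in this format can be assembled by serializing one JSON object per line, as in this sketch:

```python
import json

# Records in the two accepted shapes shown above
records = [
    {"data": {"features": {"values": ["entity_id_1", "192.168.1.2"]}}},
    {"features": ["entity_id_2", "10.10.1.2"]},
]

# JSON Lines: one serialized object per line, no enclosing array
payload = "\n".join(json.dumps(r) for r in records)
print(payload)
```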

## IP Insights Output Response Formats
<a name="ip-insights-ouput-format-response"></a>

### OUTPUT: JSON Response Format
<a name="ip-insights-output-json"></a>

The default output of the SageMaker AI IP Insights algorithm is the `dot_product` between the input entity and IP address. The `dot_product` signifies how compatible the model considers the entity and IP address. The `dot_product` is unbounded. To make predictions about whether an event is anomalous, you need to set a threshold based on the score distribution that you observe. For information about how to use the `dot_product` for anomaly detection, see [An Introduction to the SageMaker AI IP Insights Algorithm](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/ipinsights_login/ipinsights-tutorial.html).

accept: application/json

```
{
  "predictions": [
    {"dot_product": 0.0},
    {"dot_product": 2.0}
  ]
}
```
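
One possible thresholding approach (an illustration, not a prescribed method) is to score known-normal traffic first and then flag new events whose `dot_product` falls well below that baseline:

```python
import statistics

# Hypothetical dot_product scores returned for known-normal traffic
baseline = [4.1, 3.8, 5.0, 4.4, 3.9, 4.7, 4.2, 4.5]

# Flag events scoring below (mean - 3 * stdev) of the baseline; the
# cutoff rule is illustrative -- choose one that fits your traffic.
cutoff = statistics.mean(baseline) - 3 * statistics.stdev(baseline)

new_scores = {"event_a": 4.3, "event_b": 0.2}  # hypothetical new events
flags = {name: score < cutoff for name, score in new_scores.items()}
print(flags)
```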

Advanced users can access the model's learned entity and IP embeddings by appending the additional parameter `verbose=True` to the Accept header. You can use the `entity_embedding` and `ip_embedding` for debugging, visualizing, and understanding the model. Additionally, you can use these embeddings in other machine learning techniques, such as classification or clustering.

accept: application/json;verbose=True

```
{
  "predictions": [
    {
        "dot_product": 0.0,
        "entity_embedding": [1.0, 0.0, 0.0],
        "ip_embedding": [0.0, 1.0, 0.0]
    },
    {
        "dot_product": 2.0,
        "entity_embedding": [1.0, 0.0, 1.0],
        "ip_embedding": [1.0, 0.0, 1.0]
    }
  ]
}
```
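
The `dot_product` in the verbose response is simply the inner product of the two embeddings, which you can confirm locally using the values from the second prediction above:

```python
def dot(u, v):
    """Inner product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

# Values taken from the second prediction in the example response
entity_embedding = [1.0, 0.0, 1.0]
ip_embedding = [1.0, 0.0, 1.0]
print(dot(entity_embedding, ip_embedding))  # matches the dot_product of 2.0
```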

### OUTPUT: JSONLINES Response Format
<a name="ip-insights-jsonlines"></a>

accept: application/jsonlines 

```
{"dot_product": 0.0}
{"dot_product": 2.0}
```

accept: application/jsonlines; verbose=True 

```
{"dot_product": 0.0, "entity_embedding": [1.0, 0.0, 0.0], "ip_embedding": [0.0, 1.0, 0.0]}
{"dot_product": 2.0, "entity_embedding": [1.0, 0.0, 1.0], "ip_embedding": [1.0, 0.0, 1.0]}
```

# K-Means Algorithm
<a name="k-means"></a>

K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the attributes that you want the algorithm to use to determine similarity. 

Amazon SageMaker AI uses a modified version of the web-scale k-means clustering algorithm. Compared with the original version of the algorithm, the version used by Amazon SageMaker AI is more accurate. Like the original algorithm, it scales to massive datasets and delivers improvements in training time. To do this, the version used by Amazon SageMaker AI streams mini-batches (small, random subsets) of the training data. For more information about mini-batch k-means, see [Web-scale k-means Clustering](https://dl.acm.org/doi/10.1145/1772690.1772862).

The k-means algorithm expects tabular data, where rows represent the observations that you want to cluster, and the columns represent attributes of the observations. The *n* attributes in each row represent a point in *n*-dimensional space. The Euclidean distance between these points represents the similarity of the corresponding observations. The algorithm groups observations with similar attribute values (the points corresponding to these observations are closer together). For more information about how k-means works in Amazon SageMaker AI, see [How K-Means Clustering Works](algo-kmeans-tech-notes.md).

**Topics**
+ [Input/Output Interface for the K-Means Algorithm](#km-inputoutput)
+ [EC2 Instance Recommendation for the K-Means Algorithm](#km-instances)
+ [K-Means Sample Notebooks](#kmeans-sample-notebooks)
+ [How K-Means Clustering Works](algo-kmeans-tech-notes.md)
+ [K-Means Hyperparameters](k-means-api-config.md)
+ [Tune a K-Means Model](k-means-tuning.md)
+ [K-Means Response Formats](km-in-formats.md)

## Input/Output Interface for the K-Means Algorithm
<a name="km-inputoutput"></a>

For training, the k-means algorithm expects data to be provided in the *train* channel (recommended `S3DataDistributionType=ShardedByS3Key`), with an optional *test* channel (recommended `S3DataDistributionType=FullyReplicated`) to score the data on. Both `recordIO-wrapped-protobuf` and `CSV` formats are supported for training. You can use either File mode or Pipe mode to train models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.

For inference, `text/csv`, `application/json`, and `application/x-recordio-protobuf` are supported. k-means returns a `closest_cluster` label and the `distance_to_cluster` for each observation.

For more information on input and output file formats, see [K-Means Response Formats](km-in-formats.md) for inference and the [K-Means Sample Notebooks](#kmeans-sample-notebooks). The k-means algorithm does not support multiple instance learning, in which the training set consists of labeled “bags”, each of which is a collection of unlabeled instances.

## EC2 Instance Recommendation for the K-Means Algorithm
<a name="km-instances"></a>

We recommend training k-means on CPU instances. You can train on GPU instances, but should limit GPU training to single-GPU instances (such as ml.g4dn.xlarge) because only one GPU is used per instance. The k-means algorithm supports P2, P3, G4dn, and G5 instances for training and inference.

## K-Means Sample Notebooks
<a name="kmeans-sample-notebooks"></a>

For a sample notebook that uses the SageMaker AI K-means algorithm to segment the population of counties in the United States by attributes identified using principal component analysis, see [Analyze US census data for population segmentation using Amazon SageMaker AI](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/US-census_population_segmentation_PCA_Kmeans/sagemaker-countycensusclustering.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, click its **Use** tab and select **Create copy**.

# How K-Means Clustering Works
<a name="algo-kmeans-tech-notes"></a>

K-means is an algorithm that trains a model that groups similar objects together. The k-means algorithm accomplishes this by mapping each observation in the input dataset to a point in the *n*-dimensional space (where *n* is the number of attributes of the observation). For example, your dataset might contain observations of temperature and humidity in a particular location, which are mapped to points (*t, h*) in 2-dimensional space. 



**Note**  
Clustering algorithms are unsupervised. In unsupervised learning, labels that might be associated with the objects in the training dataset aren't used. For more information, see [Unsupervised learning](algorithms-choose.md#algorithms-choose-unsupervised-learning).

In k-means clustering, each cluster has a center. During model training, the k-means algorithm uses the distance of the point that corresponds to each observation in the dataset to the cluster centers as the basis for clustering. You choose the number of clusters (*k*) to create. 

For example, suppose that you want to create a model to recognize handwritten digits and you choose the MNIST dataset for training. The dataset provides thousands of images of handwritten digits (0 through 9). In this example, you might choose to create 10 clusters, one for each digit (0, 1, …, 9). As part of model training, the k-means algorithm groups the input images into 10 clusters.

Each image in the MNIST dataset is a 28x28-pixel image, with a total of 784 pixels. Each image corresponds to a point in a 784-dimensional space, similar to a point in a 2-dimensional space (x,y). To find a cluster to which a point belongs, the k-means algorithm finds the distance of that point from all of the cluster centers. It then chooses the cluster with the closest center as the cluster to which the image belongs. 
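
The nearest-center rule can be sketched in a few lines of Python; the points and centers below are tiny 2-dimensional stand-ins for the 784-dimensional MNIST vectors:

```python
import math

def closest_cluster(point, centers):
    """Return the index of the center nearest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centers]
    return distances.index(min(distances))

# Illustrative 2-D cluster centers
centers = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
print(closest_cluster((1.0, 2.0), centers))   # nearest to (0, 0)
print(closest_cluster((9.0, 9.5), centers))   # nearest to (10, 10)
```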

**Note**  
Amazon SageMaker AI uses a customized version of the algorithm where, instead of specifying that the algorithm create *k* clusters, you might choose to improve model accuracy by specifying extra cluster centers *(K = k\*x)*. However, the algorithm ultimately reduces these to *k* clusters.

In SageMaker AI, you specify the number of clusters when creating a training job. For more information, see [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html). In the request body, you add the `HyperParameters` string map to specify the `k` and `extra_center_factor` strings.

The following is a summary of how k-means works for model training in SageMaker AI:

1. It determines the initial *K* cluster centers. 
**Note**  
In the following topics, *K* clusters refer to *k \* x*, where you specify *k* and *x* when creating a model training job. 

1. It iterates over input training data and recalculates cluster centers.

1. It reduces resulting clusters to *k* (if the data scientist specified the creation of *k\*x* clusters in the request). 

The following sections also explain some of the parameters that a data scientist might specify to configure a model training job as part of the `HyperParameters` string map. 

**Topics**
+ [Step 1: Determine the Initial Cluster Centers](#kmeans-step1)
+ [Step 2: Iterate over the Training Dataset and Calculate Cluster Centers](#kmeans-step2)
+ [Step 3: Reduce the Clusters from *K* to *k*](#kmeans-step3)

## Step 1: Determine the Initial Cluster Centers
<a name="kmeans-step1"></a>

When using k-means in SageMaker AI, the initial cluster centers are chosen from the observations in a small, randomly sampled batch. Choose one of the following strategies to determine how these initial cluster centers are selected:
+ The random approach: randomly choose *K* observations from your input dataset as cluster centers. For example, you might choose the points in the 784-dimensional space that correspond to any 10 images in the MNIST training dataset as your cluster centers.
+ The k-means++ approach, which works as follows: 

  1. Start with one cluster and determine its center. You randomly select an observation from your training dataset and use the point corresponding to the observation as the cluster center. For example, in the MNIST dataset, randomly choose a handwritten digit image. Then choose the point in the 784-dimensional space that corresponds to the image as your cluster center. This is cluster center 1.

  1. Determine the center for cluster 2. From the remaining observations in the training dataset, pick an observation at random. Choose one that is different from the one you previously selected and whose corresponding point is far from cluster center 1. Using the MNIST dataset as an example, you do the following:
     + For each of the remaining images, find the distance of the corresponding point from cluster center 1. Square the distance and assign a probability that is proportional to the square of the distance. That way, an image that is different from the one that you previously selected has a higher probability of getting selected as cluster center 2. 
     + Choose one of the images randomly, based on probabilities assigned in the previous step. The point that corresponds to the image is cluster center 2.

  1. Repeat Step 2 to find cluster center 3. This time, find the distances of the remaining images from cluster center 2.

  1. Repeat the process until you have the *K* cluster centers.
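
The selection procedure above can be sketched in Python, using squared distance to the nearest existing center as the sampling weight (the data points are illustrative):

```python
import math
import random

def kmeans_pp_init(points, k, rng):
    """Pick k initial centers: the first at random, the rest weighted by
    squared distance to the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Already-chosen centers get weight 0 and cannot be picked again.
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
centers = kmeans_pp_init(points, 3, random.Random(0))
print(centers)
```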

To train a model in SageMaker AI, you create a training job. In the request, you provide configuration information by specifying the following `HyperParameters` string maps:
+ To specify the number of clusters to create, add the `k` string.
+ For greater accuracy, add the optional `extra_center_factor` string. 
+ To specify the strategy that you want to use to determine the initial cluster centers, add the `init_method` string and set its value to `random` or `kmeans++`.
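
In a `CreateTrainingJob` request, these settings travel as a string-to-string map. A minimal sketch (the specific values are illustrative):

```python
# Sketch of the HyperParameters map in a CreateTrainingJob request.
# All values must be strings, even the numeric ones.
hyperparameters = {
    "feature_dim": "784",          # required: number of input features
    "k": "10",                     # required: number of clusters
    "extra_center_factor": "auto",
    "init_method": "kmeans++",
}
assert all(isinstance(v, str) for v in hyperparameters.values())
```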

For more information about the SageMaker AI k-means estimator, see [K-means](https://sagemaker.readthedocs.io/en/stable/algorithms/unsupervised/kmeans.html) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) documentation.

You now have an initial set of cluster centers. 

## Step 2: Iterate over the Training Dataset and Calculate Cluster Centers
<a name="kmeans-step2"></a>

The cluster centers that you created in the preceding step are mostly random, with some consideration for the training dataset. In this step, you use the training dataset to move these centers toward the true cluster centers. The algorithm iterates over the training dataset, and recalculates the *K* cluster centers.

1. Read a mini-batch of observations (a small, randomly chosen subset of all records) from the training dataset and do the following. 
**Note**  
When creating a model training job, you specify the batch size in the `mini_batch_size` string in the `HyperParameters` string map. 

   1. Assign all of the observations in the mini-batch to one of the clusters with the closest cluster center.

   1. Calculate the number of observations assigned to each cluster. Then, calculate the proportion of new points assigned per cluster.

      For example, consider the following clusters:

      Cluster c1 = 100 previously assigned points. You added 25 points from the mini-batch in this step.

      Cluster c2 = 150 previously assigned points. You added 40 points from the mini-batch in this step.

      Cluster c3 = 450 previously assigned points. You added 5 points from the mini-batch in this step.

      Calculate the proportion of new points assigned to each of clusters as follows:

      ```
      p1 = proportion of points assigned to c1 = 25/(100+25)
      p2 = proportion of points assigned to c2 = 40/(150+40)
      p3 = proportion of points assigned to c3 = 5/(450+5)
      ```

   1. Compute the center of the new points added to each cluster:

      ```
      d1 = center of the new points added to cluster 1
      d2 = center of the new points added to cluster 2
      d3 = center of the new points added to cluster 3
      ```

   1. Compute the weighted average to find the updated cluster centers as follows:

      ```
      Center of cluster 1 = ((1 - p1) * center of cluster 1) + (p1 * d1)
      Center of cluster 2 = ((1 - p2) * center of cluster 2) + (p2 * d2)
      Center of cluster 3 = ((1 - p3) * center of cluster 3) + (p3 * d3)
      ```

1. Read the next mini-batch, and repeat Step 1 to recalculate the cluster centers. 

For more information about mini-batch *k*-means, see [Web-scale k-means Clustering](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=b452a856a3e3d4d37b1de837996aa6813bedfdcf).
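
The arithmetic in these steps can be reproduced directly. The following sketch updates cluster c1 from the example above (25 new points against 100 previously assigned), using illustrative one-dimensional centers:

```python
# Weighted-average update for one cluster, mirroring the c1 example above.
prev_count = 100          # points previously assigned to c1
new_count = 25            # points from this mini-batch assigned to c1
old_center = 4.0          # current center of c1 (1-D for brevity)
new_points_center = 6.0   # d1: center of the 25 new points

p1 = new_count / (prev_count + new_count)                 # 25 / 125 = 0.2
updated = (1 - p1) * old_center + p1 * new_points_center  # weighted average
print(p1, updated)
```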

## Step 3: Reduce the Clusters from *K* to *k*
<a name="kmeans-step3"></a>

If the algorithm created *K* clusters—*(K = k\*x)* where *x* is greater than 1—then it reduces the *K* clusters to *k* clusters. (For more information, see `extra_center_factor` in the preceding discussion.) It does this by applying Lloyd's method with `kmeans++` initialization to the *K* cluster centers. For more information about Lloyd's method, see [k-means clustering](https://pdfs.semanticscholar.org/0074/4cb7cc9ccbbcdadbd5ff2f2fee6358427271.pdf). 

# K-Means Hyperparameters
<a name="k-means-api-config"></a>

In the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, you specify the training algorithm that you want to use. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the k-means training algorithm provided by Amazon SageMaker AI. For more information about how k-means clustering works, see [How K-Means Clustering Works](algo-kmeans-tech-notes.md).


| Parameter Name | Description | 
| --- | --- | 
| feature\_dim | The number of features in the input data. **Required** Valid values: Positive integer  | 
| k |  The number of required clusters. **Required** Valid values: Positive integer  | 
| epochs | The number of passes done over the training data. **Optional** Valid values: Positive integer Default value: 1  | 
| eval\_metrics | A JSON list of metric types used to report a score for the model. Allowed values are `msd` for Mean Squared Deviation and `ssd` for Sum of Squared Distances. If test data is provided, the score is reported for each of the metrics requested. **Optional** Valid values: Either `[\"msd\"]` or `[\"ssd\"]` or `[\"msd\",\"ssd\"]` . Default value: `[\"msd\"]`  | 
| extra\_center\_factor | The algorithm creates K centers = `num_clusters` \* `extra_center_factor` as it runs and reduces the number of centers from K to `k` when finalizing the model. **Optional** Valid values: Either a positive integer or `auto`. Default value: `auto`  | 
| half\_life\_time\_size | Used to determine the weight given to an observation when computing a cluster mean. This weight decays exponentially as more points are observed. When a point is first observed, it is assigned a weight of 1 when computing the cluster mean. The decay constant for the exponential decay function is chosen so that after observing `half_life_time_size` points, its weight is 1/2. If set to 0, there is no decay. **Optional** Valid values: Non-negative integer Default value: 0  | 
| init\_method | Method by which the algorithm chooses the initial cluster centers. The standard k-means approach chooses them at random. An alternative k-means++ method chooses the first cluster center at random. Then it spreads out the position of the remaining initial centers by weighting the selection of centers with a probability distribution that is proportional to the square of the distance of the remaining data points from existing centers. **Optional** Valid values: Either `random` or `kmeans++`. Default value: `random`  | 
| local\_lloyd\_init\_method | The initialization method for Lloyd's expectation-maximization (EM) procedure used to build the final model containing `k` centers. **Optional** Valid values: Either `random` or `kmeans++`. Default value: `kmeans++`  | 
| local\_lloyd\_max\_iter | The maximum number of iterations for Lloyd's expectation-maximization (EM) procedure used to build the final model containing `k` centers. **Optional** Valid values: Positive integer Default value: 300  | 
| local\_lloyd\_num\_trials | The number of times Lloyd's expectation-maximization (EM) procedure is run, keeping the run with the least loss, when building the final model containing `k` centers. **Optional** Valid values: Either a positive integer or `auto`. Default value: `auto`  | 
| local\_lloyd\_tol | The tolerance for change in loss for early stopping of Lloyd's expectation-maximization (EM) procedure used to build the final model containing `k` centers. **Optional** Valid values: Float. Range in [0, 1]. Default value: 0.0001  | 
| mini\_batch\_size | The number of observations per mini-batch for the data iterator. **Optional** Valid values: Positive integer Default value: 5000  | 

# Tune a K-Means Model
<a name="k-means-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

The Amazon SageMaker AI k-means algorithm is an unsupervised algorithm that groups data into clusters whose members are as similar as possible. Because it is unsupervised, it doesn't use a validation dataset against which hyperparameters can be optimized. But it does take a test dataset and emits metrics that depend on the squared distance between the data points and the final cluster centroids at the end of each training run. To find the model that reports the tightest clusters (those whose members are most similar to one another) on the test dataset, you can use a hyperparameter tuning job.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the K-Means Algorithm
<a name="km-metrics"></a>

The k-means algorithm computes the following metrics during training. When tuning a model, choose one of these metrics as the objective metric. 


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:msd | Mean squared distances between each record in the test set and the closest center of the model. | Minimize | 
| test:ssd | Sum of the squared distances between each record in the test set and the closest center of the model. | Minimize | 



## Tunable K-Means Hyperparameters
<a name="km-tunable-hyperparameters"></a>

Tune the Amazon SageMaker AI k-means model with the following hyperparameters. The hyperparameters that have the greatest impact on k-means objective metrics are: `mini_batch_size`, `extra_center_factor`, and `init_method`. Tuning the hyperparameter `epochs` generally results in minor improvements.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| epochs | IntegerParameterRanges | MinValue: 1, MaxValue: 10 | 
| extra\_center\_factor | IntegerParameterRanges | MinValue: 4, MaxValue: 10 | 
| init\_method | CategoricalParameterRanges | ['kmeans++', 'random'] | 
| mini\_batch\_size | IntegerParameterRanges | MinValue: 3000, MaxValue: 15000 | 

# K-Means Response Formats
<a name="km-in-formats"></a>

All SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). This topic contains a list of the available output formats for the SageMaker AI k-means algorithm.

## JSON Response Format
<a name="km-json"></a>

```
{
    "predictions": [
        {
            "closest_cluster": 1.0,
            "distance_to_cluster": 3.0,
        },
        {
            "closest_cluster": 2.0,
            "distance_to_cluster": 5.0,
        },

        ....
    ]
}
```

## JSONLINES Response Format
<a name="km-jsonlines"></a>

```
{"closest_cluster": 1.0, "distance_to_cluster": 3.0}
{"closest_cluster": 2.0, "distance_to_cluster": 5.0}
```

## RECORDIO Response Format
<a name="km-recordio"></a>

```
[
    Record = {
        features = {},
        label = {
            'closest_cluster': {
                keys: [],
                values: [1.0, 2.0]  # float32
            },
            'distance_to_cluster': {
                keys: [],
                values: [3.0, 5.0]  # float32
            },
        }
    }
]
```

## CSV Response Format
<a name="km-csv"></a>

The first value in each line corresponds to `closest_cluster`.

The second value in each line corresponds to `distance_to_cluster`.

```
1.0,3.0
2.0,5.0
```
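
Parsing this response back into per-record results is straightforward; the following is a sketch using the example lines above:

```python
import csv
import io

# Example CSV response body from the k-means inference endpoint
response_body = "1.0,3.0\n2.0,5.0\n"

results = [
    {"closest_cluster": float(row[0]), "distance_to_cluster": float(row[1])}
    for row in csv.reader(io.StringIO(response_body))
]
print(results)
```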

# Principal Component Analysis (PCA) Algorithm
<a name="pca"></a>

PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of features called *components*, which are composites of the original features that are uncorrelated with one another. They are also constrained so that the first component accounts for the largest possible variability in the data, the second component the second most variability, and so on.

In Amazon SageMaker AI, PCA operates in two modes, depending on the scenario: 
+ **regular**: For datasets with sparse data and a moderate number of observations and features.
+ **randomized**: For datasets with both a large number of observations and features. This mode uses an approximation algorithm. 

PCA uses tabular data. 

The rows represent observations you want to embed in a lower dimensional space. The columns represent features that you want to find a reduced approximation for. The algorithm calculates the covariance matrix (or an approximation thereof in a distributed manner), and then performs the singular value decomposition on this summary to produce the principal components. 

**Topics**
+ [Input/Output Interface for the PCA Algorithm](#pca-inputoutput)
+ [EC2 Instance Recommendation for the PCA Algorithm](#pca-instances)
+ [PCA Sample Notebooks](#PCA-sample-notebooks)
+ [How PCA Works](how-pca-works.md)
+ [PCA Hyperparameters](PCA-reference.md)
+ [PCA Response Formats](PCA-in-formats.md)

## Input/Output Interface for the PCA Algorithm
<a name="pca-inputoutput"></a>

For training, PCA expects data provided in the *train* channel, and optionally supports a dataset passed to the *test* channel, which is scored by the final model. Both `recordIO-wrapped-protobuf` and `CSV` formats are supported for training. You can use either File mode or Pipe mode to train models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.

For inference, PCA supports `text/csv`, `application/json`, and `application/x-recordio-protobuf`. Results are returned in either `application/json` or `application/x-recordio-protobuf` format with a vector of "projections."

For more information on input and output file formats, see [PCA Response Formats](PCA-in-formats.md) for inference and the [PCA Sample Notebooks](#PCA-sample-notebooks).

## EC2 Instance Recommendation for the PCA Algorithm
<a name="pca-instances"></a>

PCA supports CPU and GPU instances for training and inference. Which instance type is most performant depends heavily on the specifics of the input data. For GPU instances, PCA supports P2, P3, G4dn, and G5.

## PCA Sample Notebooks
<a name="PCA-sample-notebooks"></a>

For a sample notebook that shows how to use the SageMaker AI Principal Component Analysis algorithm to analyze the images of handwritten digits from zero to nine in the MNIST dataset, see [An Introduction to PCA with MNIST](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/pca_mnist/pca_mnist.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, click its **Use** tab and select **Create copy**.

# How PCA Works
<a name="how-pca-works"></a>

Principal Component Analysis (PCA) is a learning algorithm that reduces the dimensionality (number of features) within a dataset while still retaining as much information as possible. 

PCA reduces dimensionality by finding a new set of features called *components*, which are composites of the original features, but are uncorrelated with one another. The first component accounts for the largest possible variability in the data, the second component the second most variability, and so on.

It is an unsupervised dimensionality reduction algorithm. In unsupervised learning, labels that might be associated with the objects in the training dataset aren't used.

Given the input of a matrix with rows ![\[x_1,…,x_n\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-39b.png) each of dimension `1 * d`, the data is partitioned into mini-batches of rows and distributed among the training nodes (workers). Each worker then computes a summary of its data. The summaries of the different workers are then unified into a single solution at the end of the computation. 

**Modes**

The Amazon SageMaker AI PCA algorithm uses either of two modes to calculate these summaries, depending on the situation:
+ **regular**: for datasets with sparse data and a moderate number of observations and features.
+ **randomized**: for datasets with both a large number of observations and features. This mode uses an approximation algorithm. 

As the algorithm's last step, it performs the singular value decomposition on the unified solution, from which the principal components are then derived.

## Mode 1: Regular
<a name="mode-1"></a>

The workers jointly compute both ![\[Equation in text-form: \sum x_i^T x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-1b.png) and ![\[Equation in text-form: \sum x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-2b.png) .

**Note**  
Because ![\[Equation in text-form: x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-3b.png) are `1 * d` row vectors, ![\[Equation in text-form: x_i^T x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-4b.png) is a matrix (not a scalar). Using row vectors within the code allows us to obtain efficient caching.

The covariance matrix is computed as ![\[Equation in text-form: \sum x_i^T x_i - (1/n) (\sum x_i)^T \sum x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-32b.png) , and its top `num_components` singular vectors form the model.

**Note**  
If `subtract_mean` is `False`, we avoid computing and subtracting ![\[Equation in text-form: \sum x_i\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-2b.png) .

Use this algorithm when the dimension `d` of the vectors is small enough so that ![\[Equation in text-form: d^2\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-7b.png) can fit in memory.
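
The statistics accumulated in regular mode can be sketched as follows. This is an illustration of the formulas above, not the SageMaker AI implementation; a single in-memory array stands in for the distributed workers:

```python
import numpy as np

# Regular-mode summaries: accumulate sum(x_i^T x_i) and sum(x_i), then form
# the covariance matrix  sum(x_i^T x_i) - (1/n) (sum x_i)^T (sum x_i).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))            # n = 200 rows of dimension d = 4
n, d = X.shape

gram = X.T @ X                           # sum over rows of x_i^T x_i (d x d)
col_sum = X.sum(axis=0, keepdims=True)   # sum of x_i, shape (1, d)
cov = gram - (1.0 / n) * col_sum.T @ col_sum

# The top num_components singular vectors of cov form the model.
num_components = 2
_, _, Vt = np.linalg.svd(cov)
model = Vt[:num_components]
```

Dividing `cov` by `n` recovers the textbook (biased) covariance of the centered data, which is why its top singular vectors are the principal components.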

## Mode 2: Randomized
<a name="mode-2"></a>

When the number of features in the input dataset is large, we use a method to approximate the covariance matrix. For every mini-batch ![\[Equation in text-form: X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-23b.png) of dimension `b * d`, we randomly initialize a `(num_components + extra_components) * b` matrix that we multiply by each mini-batch to create a `(num_components + extra_components) * d` matrix. The sum of these matrices is computed by the workers, and the servers perform SVD on the final `(num_components + extra_components) * d` matrix. Its top-right `num_components` singular vectors are the approximation of the top singular vectors of the input matrix.

Let ![\[Equation in text-form: \ell\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-38b.png) ` = num_components + extra_components`. Given a mini-batch ![\[Equation in text-form: X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-23b.png) of dimension `b * d`, the worker draws a random matrix ![\[Equation in text-form: H_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-24b.png) of dimension ![\[Equation in text-form: \ell * b\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-38.png) . Depending on whether the environment uses a GPU or CPU and the dimension size, the matrix is either a random sign matrix where each entry is `+-1` or a *FJLT* (fast Johnson Lindenstrauss transform; for information, see [FJLT Transforms](https://www.cs.princeton.edu/~chazelle/pubs/FJLT-sicomp09.pdf) and the follow-up papers). The worker then computes ![\[Equation in text-form: H_t X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-26b.png) and maintains ![\[Equation in text-form: B = \sum H_t X_t\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-27b.png) . The worker also maintains ![\[Equation in text-form: h^T\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-28b.png) , the sum of columns of ![\[Equation in text-form: H_1,..,H_T\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-29b.png) (`T` being the total number of mini-batches), and `s`, the sum of all input rows. After processing the entire shard of data, the worker sends the server `B`, `h`, `s`, and `n` (the number of input rows).

Denote the different inputs to the server as ![\[Equation in text-form: B^1, h^1, s^1, n^1,…\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-30b.png) . The server computes `B`, `h`, `s`, and `n`, the sums of the respective inputs. It then computes ![\[Equation in text-form: C = B – (1/n) h^T s\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/PCA-31b.png) , and finds its singular value decomposition. The top-right singular vectors and singular values of `C` are used as the approximate solution to the problem.
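
A single-worker sketch of this randomized mode follows. It is illustrative only: the mini-batch loop, random sign matrices, and server-side merge are simplified, and the FJLT alternative is omitted:

```python
import numpy as np

# Randomized-mode sketch on one worker. Names follow the hyperparameter
# table (num_components, extra_components); batching is simplified.
rng = np.random.default_rng(2)
num_components, extra_components = 3, 7
ell = num_components + extra_components  # sketch dimension
b, d = 32, 50                            # mini-batch rows x features

B = np.zeros((ell, d))                   # running sum of H_t X_t
h = np.zeros(ell)                        # running sum of the columns of H_t
s = np.zeros(d)                          # running sum of input rows
n = 0                                    # number of input rows seen

for _ in range(10):                      # T = 10 mini-batches
    X_t = rng.normal(size=(b, d))
    H_t = rng.choice([-1.0, 1.0], size=(ell, b))  # random sign matrix
    B += H_t @ X_t
    h += H_t.sum(axis=1)
    s += X_t.sum(axis=0)
    n += b

# Server-side merge (trivial with one worker): C = B - (1/n) h s^T, then SVD.
C = B - np.outer(h, s) / n
_, _, Vt = np.linalg.svd(C, full_matrices=False)
approx_components = Vt[:num_components]  # approximate top singular vectors
```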

# PCA Hyperparameters
<a name="PCA-reference"></a>

In the `CreateTrainingJob` request, you specify the training algorithm. You can also specify algorithm-specific HyperParameters as string-to-string maps. The following table lists the hyperparameters for the PCA training algorithm provided by Amazon SageMaker AI. For more information about how PCA works, see [How PCA Works](how-pca-works.md). 


| Parameter Name | Description | 
| --- | --- | 
| feature\_dim |  Input dimension. **Required** Valid values: positive integer  | 
| mini\_batch\_size |  Number of rows in a mini-batch. **Required** Valid values: positive integer  | 
| num\_components |  The number of principal components to compute. **Required** Valid values: positive integer  | 
| algorithm\_mode |  Mode for computing the principal components.  **Optional** Valid values: *regular* or *randomized* Default value: *regular*  | 
| extra\_components |  As the value increases, the solution becomes more accurate but the runtime and memory consumption increase linearly. The default, -1, means the maximum of 10 and `num_components`. Valid for *randomized* mode only. **Optional** Valid values: Non-negative integer or -1 Default value: -1  | 
| subtract\_mean |  Indicates whether the data should be unbiased both during training and at inference.  **Optional** Valid values: One of *true* or *false* Default value: *true*  | 
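
The hyperparameters above are passed as string-to-string maps in the training request. The following sketch shows one way to set them with the SageMaker Python SDK; the image URI, role ARN, and S3 paths are placeholders you must supply for your account:

```python
import sagemaker

# Illustrative configuration only: replace the placeholder image URI,
# role, and output path with values from your own account.
pca = sagemaker.estimator.Estimator(
    image_uri="<pca-algorithm-image-uri>",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.c5.xlarge",
    output_path="s3://<bucket>/pca-output",
)
pca.set_hyperparameters(
    feature_dim=784,           # required: input dimension
    mini_batch_size=200,       # required: rows per mini-batch
    num_components=10,         # required: components to compute
    algorithm_mode="regular",  # optional: "regular" or "randomized"
    subtract_mean=True,        # optional: unbias the data
)
```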

# PCA Response Formats
<a name="PCA-in-formats"></a>

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). This topic contains a list of the available output formats for the SageMaker AI PCA algorithm.

## JSON Response Format
<a name="PCA-json"></a>

Accept—application/json

```
{
    "projections": [
        {
            "projection": [1.0, 2.0, 3.0, 4.0, 5.0]
        },
        {
            "projection": [6.0, 7.0, 8.0, 9.0, 0.0]
        },
        ....
    ]
}
```

## JSONLINES Response Format
<a name="PCA-jsonlines"></a>

Accept—application/jsonlines

```
{ "projection": [1.0, 2.0, 3.0, 4.0, 5.0] }
{ "projection": [6.0, 7.0, 8.0, 9.0, 0.0] }
```

## RECORDIO Response Format
<a name="PCA-recordio"></a>

Accept—application/x-recordio-protobuf

```
[
    Record = {
        features = {},
        label = {
            'projection': {
                keys: [],
                values: [1.0, 2.0, 3.0, 4.0, 5.0]
            }
        }
    },
    Record = {
        features = {},
        label = {
            'projection': {
                keys: [],
                values: [1.0, 2.0, 3.0, 4.0, 5.0]
            }
        }
    }  
]
```

# Random Cut Forest (RCF) Algorithm
<a name="randomcutforest"></a>

Amazon SageMaker AI Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. These are observations which diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. They are easy to describe in that, when viewed in a plot, they are often easily distinguishable from the "regular" data. Including these anomalies in a data set can drastically increase the complexity of a machine learning task since the "regular" data can often be described with a simple model.

With each data point, RCF associates an anomaly score. Low score values indicate that the data point is considered "normal." High values indicate the presence of an anomaly in the data. The definitions of "low" and "high" depend on the application but common practice suggests that scores beyond three standard deviations from the mean score are considered anomalous.
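
The three-standard-deviation heuristic can be applied directly to a vector of scores. The scores below are made up for illustration; they are not output from the algorithm:

```python
import numpy as np

# Hypothetical anomaly scores: 50 "normal" points near 0.5 and one outlier.
scores = np.array([0.49, 0.51] * 25 + [0.95])

# Flag scores beyond three standard deviations above the mean score.
threshold = scores.mean() + 3 * scores.std()
anomalies = scores > threshold           # flags only the 0.95 score
```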

While there are many applications of anomaly detection algorithms to one-dimensional time series data such as traffic volume analysis or sound volume spike detection, RCF is designed to work with arbitrary-dimensional input. Amazon SageMaker AI RCF scales well with respect to number of features, data set size, and number of instances.

**Topics**
+ [Input/Output Interface for the RCF Algorithm](#rcf-input_output)
+ [Instance Recommendations for the RCF Algorithm](#rcf-instance-recommend)
+ [RCF Sample Notebooks](#rcf-sample-notebooks)
+ [How RCF Works](rcf_how-it-works.md)
+ [RCF Hyperparameters](rcf_hyperparameters.md)
+ [Tune an RCF Model](random-cut-forest-tuning.md)
+ [RCF Response Formats](rcf-in-formats.md)

## Input/Output Interface for the RCF Algorithm
<a name="rcf-input_output"></a>

Amazon SageMaker AI Random Cut Forest supports the `train` and `test` data channels. The optional test channel is used to compute accuracy, precision, recall, and F1-score metrics on labeled data. Train and test data content types can be either `application/x-recordio-protobuf` or `text/csv`. For test data in `text/csv` format, the content type must be specified as `text/csv;label_size=1`, where the first column of each row represents the anomaly label: "1" for an anomalous data point and "0" for a normal data point. You can use either File mode or Pipe mode to train RCF models on data that is formatted as `recordIO-wrapped-protobuf` or as `CSV`.

The train channel only supports `S3DataDistributionType=ShardedByS3Key` and the test channel only supports `S3DataDistributionType=FullyReplicated`. The following example specifies the S3 distribution type for the train channel using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html).

**Note**  
The `sagemaker.inputs.s3_input` method was renamed to `sagemaker.inputs.TrainingInput` in [SageMaker Python SDK v2](https://sagemaker.readthedocs.io/en/stable/v2.html#s3-input).

```
  import sagemaker
    
  # specify Random Cut Forest training job information and hyperparameters
  rcf = sagemaker.estimator.Estimator(...)
    
  # explicitly specify "ShardedByS3Key" distribution type
  train_data = sagemaker.inputs.TrainingInput(
       s3_data=s3_training_data_location,
       content_type='text/csv;label_size=0',
       distribution='ShardedByS3Key')
    
  # run the training job on input data stored in S3
  rcf.fit({'train': train_data})
```

To avoid common errors around execution roles, ensure that your execution role has the required policies, `AmazonSageMakerFullAccess` and `AmazonEC2ContainerRegistryFullAccess`. To avoid errors about your image not existing or its permissions being incorrect, ensure that your Amazon ECR image is not larger than the disk space allocated on the training instance; run your training job on an instance type with sufficient disk space. In addition, if your Amazon ECR image is in a different AWS account's repository and you do not set repository permissions to grant access, you will get an error. See [ECR repository permissions](https://docs.aws.amazon.com/AmazonECR/latest/userguide/set-repository-policy.html) for more information on setting a repository policy statement.

See [S3DataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html) for more information on customizing the S3 data source attributes. Finally, to take advantage of multi-instance training, the training data must be partitioned into at least as many files as instances.

For inference, RCF supports `application/x-recordio-protobuf`, `text/csv` and `application/json` input data content types. See the [Parameters for Built-in Algorithms](common-info-all-im-models.md) documentation for more information. RCF inference returns `application/x-recordio-protobuf` or `application/json` formatted output. Each record in these output data contains the corresponding anomaly scores for each input data point. See [Common Data Formats--Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html) for more information.

For more information on input and output file formats, see [RCF Response Formats](rcf-in-formats.md) for inference and the [RCF Sample Notebooks](#rcf-sample-notebooks).

## Instance Recommendations for the RCF Algorithm
<a name="rcf-instance-recommend"></a>

For training, we recommend the `ml.m4`, `ml.c4`, and `ml.c5` instance families. For inference, we recommend the `ml.c5.xl` instance type in particular, for maximum performance and minimized cost per hour of usage. Although the algorithm could technically run on GPU instance types, it does not take advantage of GPU hardware.

## RCF Sample Notebooks
<a name="rcf-sample-notebooks"></a>

For an example of how to train an RCF model and perform inferences with it, see the [An Introduction to SageMaker AI Random Cut Forests](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.html) notebook. For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). Once you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, click on its **Use** tab and select **Create copy**.

For a blog post on using the RCF algorithm, see [Use the built-in Amazon SageMaker AI Random Cut Forest algorithm for anomaly detection](https://aws.amazon.com/blogs/machine-learning/use-the-built-in-amazon-sagemaker-random-cut-forest-algorithm-for-anomaly-detection/).

# How RCF Works
<a name="rcf_how-it-works"></a>

Amazon SageMaker AI Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a dataset. These are observations which diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. They are easy to describe in that, when viewed in a plot, they are often easily distinguishable from the "regular" data. Including these anomalies in a dataset can drastically increase the complexity of a machine learning task since the "regular" data can often be described with a simple model.

The main idea behind the RCF algorithm is to create a forest of trees where each tree is obtained using a partition of a sample of the training data. For example, a random sample of the input data is first determined. The random sample is then partitioned according to the number of trees in the forest. Each tree is given such a partition and organizes that subset of points into a k-d tree. The anomaly score assigned to a data point by the tree is defined as the expected change in complexity of the tree as a result of adding that point to the tree, which, in approximation, is inversely proportional to the resulting depth of the point in the tree. The random cut forest assigns an anomaly score by computing the average score from each constituent tree and scaling the result with respect to the sample size. The RCF algorithm is based on the one described in reference [1].

## Sample Data Randomly
<a name="rcf-rndm-sample-data"></a>

The first step in the RCF algorithm is to obtain a random sample of the training data. In particular, suppose we want a sample of size ![\[Equation in text-form: K\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf13.jpg) from ![\[Equation in text-form: N\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf14.jpg) total data points. If the training data is small enough, the entire dataset can be used, and we could randomly draw ![\[Equation in text-form: K\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf13.jpg) elements from this set. However, frequently the training data is too large to fit all at once, and this approach isn't feasible. Instead, we use a technique called reservoir sampling.

[Reservoir sampling](https://en.wikipedia.org/wiki/Reservoir_sampling) is an algorithm for efficiently drawing random samples from a dataset ![\[Equation in text-form: S={S_1,...,S_N}\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf3.jpg) where the elements in the dataset can only be observed one at a time or in batches. In fact, reservoir sampling works even when ![\[Equation in text-form: N\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf14.jpg) is not known *a priori*. If only one sample is requested, such as when ![\[Equation in text-form: K=1\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf15.jpg), the algorithm is as follows:

**Algorithm: Reservoir Sampling**
+  Input: dataset or data stream ![\[Equation in text-form: S={S_1,...,S_N}\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf3.jpg) 
+  Initialize the random sample ![\[Equation in text-form: X=S_1\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf4.jpg) 
+  For each observed sample ![\[Equation in text-form: S_n,n=2,...,N\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf5.jpg):
  +  Pick a uniform random number ![\[Equation in text-form: \xi \in [0,1]\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf6.jpg) 
  +  If ![\[Equation in text-form: \xi < 1/n\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf7.jpg) 
    +  Set ![\[Equation in text-form: X=S_n\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf8.jpg) 
+  Return ![\[Equation in text-form: X\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf9.jpg) 

This algorithm selects a random sample such that ![\[Equation in text-form: P(X=S_n)=1/N\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf10.jpg) for all ![\[Equation in text-form: n=1,...,N\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf11.jpg). When ![\[Equation in text-form: K>1\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf12.jpg) the algorithm is more complicated. Additionally, a distinction must be made between random sampling that is with and without replacement. RCF performs an augmented reservoir sampling without replacement on the training data based on the algorithms described in [2].
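
The ![\[Equation in text-form: K=1\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/rcf15.jpg) case above translates directly into code. This is a sketch of the basic technique, not the augmented without-replacement variant RCF actually uses:

```python
import random

# Reservoir sampling for K = 1: each stream element S_n replaces the
# current sample with probability 1/n, so every element is ultimately
# selected with probability 1/N, without knowing N in advance.
def reservoir_sample(stream):
    sample = None
    for n, item in enumerate(stream, start=1):
        if random.random() < 1.0 / n:
            sample = item
    return sample
```

Over many runs, each element of the stream is returned with equal frequency.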

## Train an RCF Model and Produce Inferences
<a name="rcf-training-inference"></a>

The next step in RCF is to construct a random cut forest using the random sample of data. First, the sample is partitioned into equal-sized partitions, one per tree in the forest. Then, each partition is sent to an individual tree. The tree recursively organizes its partition into a binary tree by partitioning the data domain into bounding boxes.

This procedure is best illustrated with an example. Suppose a tree is given the following two-dimensional dataset. The corresponding tree is initialized to the root node:

![\[A two-dimensional dataset.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/RCF1.jpg)


Figure: A two-dimensional dataset where the majority of data lies in a cluster (blue) except for one anomalous data point (orange). The tree is initialized with a root node.

The RCF algorithm organizes these data in a tree by first computing a bounding box of the data, selecting a random dimension (giving more weight to dimensions with higher "variance"), and then randomly determining the position of a hyperplane "cut" through that dimension. The two resulting subspaces define their own subtree. In this example, the cut happens to separate a lone point from the remainder of the sample. The first level of the resulting binary tree consists of two nodes, one consisting of the subtree of points to the left of the initial cut and the other representing the single point on the right.

![\[A random cut partitioning the two-dimensional dataset.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/RCF2.jpg)


Figure: A random cut partitioning the two-dimensional dataset. An anomalous data point is more likely to lie isolated in a bounding box at a smaller tree depth than other points. 

Bounding boxes are then computed for the left and right halves of the data and the process is repeated until every leaf of the tree represents a single data point from the sample. Note that if the lone point is sufficiently far away then it is more likely that a random cut would result in point isolation. This observation provides the intuition that tree depth is, loosely speaking, inversely proportional to the anomaly score.

When performing inference using a trained RCF model the final anomaly score is reported as the average across scores reported by each tree. Note that it is often the case that the new data point does not already reside in the tree. To determine the score associated with the new point the data point is inserted into the given tree and the tree is efficiently (and temporarily) reassembled in a manner equivalent to the training process described above. That is, the resulting tree is as if the input data point were a member of the sample used to construct the tree in the first place. The reported score is inversely proportional to the depth of the input point within the tree.

## Choose Hyperparameters
<a name="rcf-choose-hyperparam"></a>

The primary hyperparameters used to tune the RCF model are `num_trees` and `num_samples_per_tree`. Increasing `num_trees` has the effect of reducing the noise observed in anomaly scores since the final score is the average of the scores reported by each tree. While the optimal value is application-dependent, we recommend using 100 trees to begin with as a balance between score noise and model complexity. Note that inference time is proportional to the number of trees. Although training time is also affected, it is dominated by the reservoir sampling algorithm described above.

The parameter `num_samples_per_tree` is related to the expected density of anomalies in the dataset. In particular, `num_samples_per_tree` should be chosen such that `1/num_samples_per_tree` approximates the ratio of anomalous data to normal data. For example, if 256 samples are used in each tree then we expect our data to contain anomalies 1/256 or approximately 0.4% of the time. Again, an optimal value for this hyperparameter is dependent on the application.
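
This rule of thumb amounts to simple arithmetic, shown here with a hypothetical anomaly ratio:

```python
# Rule of thumb: choose num_samples_per_tree so that 1/num_samples_per_tree
# approximates the fraction of anomalous data. The 0.4% ratio here is a
# hypothetical estimate for illustration.
estimated_anomaly_ratio = 0.004
num_samples_per_tree = round(1 / estimated_anomaly_ratio)

# Conversely, the default of 256 samples per tree implies an expected
# anomaly ratio of 1/256, or about 0.4%.
implied_ratio = 1 / 256
```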

## References
<a name="references"></a>

1.  Sudipto Guha, Nina Mishra, Gourav Roy, and Okke Schrijvers. "Robust random cut forest based anomaly detection on streams." In *International Conference on Machine Learning*, pp. 2712-2721. 2016.

1.  Byung-Hoon Park, George Ostrouchov, Nagiza F. Samatova, and Al Geist. "Reservoir-based random sampling with replacement from data stream." In *Proceedings of the 2004 SIAM International Conference on Data Mining*, pp. 492-496. Society for Industrial and Applied Mathematics, 2004.

# RCF Hyperparameters
<a name="rcf_hyperparameters"></a>

In the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, you specify the training algorithm. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the Amazon SageMaker AI RCF algorithm. For more information, including recommendations on how to choose hyperparameters, see [How RCF Works](rcf_how-it-works.md).




| Parameter Name | Description | 
| --- | --- | 
| feature\_dim |  The number of features in the data set. (If you use the [Random Cut Forest](https://sagemaker.readthedocs.io/en/stable/algorithms/unsupervised/randomcutforest.html) estimator, this value is calculated for you and need not be specified.) **Required** Valid values: Positive integer (min: 1, max: 10000)  | 
| eval\_metrics |  A list of metrics used to score a labeled test data set. The following metrics can be selected for output: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/rcf_hyperparameters.html) **Optional** Valid values: a list with possible values taken from `accuracy` or `precision_recall_fscore`.  Default value: Both `accuracy`, `precision_recall_fscore` are calculated.  | 
| num\_samples\_per\_tree |  Number of random samples given to each tree from the training data set. **Optional** Valid values: Positive integer (min: 1, max: 2048) Default value: 256  | 
| num\_trees |  Number of trees in the forest. **Optional** Valid values: Positive integer (min: 50, max: 1000) Default value: 100  | 

# Tune an RCF Model
<a name="random-cut-forest-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning or hyperparameter optimization, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

The Amazon SageMaker AI RCF algorithm is an unsupervised anomaly-detection algorithm that requires a labeled test dataset for hyperparameter optimization. RCF calculates anomaly scores for test data points and then labels the data points as anomalous if their scores are beyond three standard deviations from the mean score. This is known as the three-sigma limit heuristic. The F1-score is based on the difference between calculated labels and actual labels. The hyperparameter tuning job finds the model that maximizes that score. The success of hyperparameter optimization depends on the applicability of the three-sigma limit heuristic to the test dataset.
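
The labeling-and-scoring step can be sketched as follows. The scores and labels below are invented for illustration, and the F1 computation is a hand-rolled sketch, not the tuning job's internal implementation:

```python
import numpy as np

# Hypothetical test-channel data: scores near 0.5 are normal; the last
# two points are labeled anomalies with elevated scores.
scores = np.array([0.49, 0.51] * 25 + [0.95, 0.97])
true_labels = np.array([0, 0] * 25 + [1, 1])

# Three-sigma limit heuristic: label points beyond three standard
# deviations above the mean score as anomalous.
threshold = scores.mean() + 3 * scores.std()
pred = (scores > threshold).astype(int)

# F1-score from the difference between calculated and actual labels.
tp = int(((pred == 1) & (true_labels == 1)).sum())
fp = int(((pred == 1) & (true_labels == 0)).sum())
fn = int(((pred == 0) & (true_labels == 1)).sum())
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```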

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the RCF Algorithm
<a name="random-cut-forest-metrics"></a>

The RCF algorithm computes the following metric during training. When tuning the model, choose this metric as the objective metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| test:f1 | F1-score on the test dataset, based on the difference between calculated labels and actual labels. | Maximize | 

## Tunable RCF Hyperparameters
<a name="random-cut-forest-tunable-hyperparameters"></a>

You can tune an RCF model with the following hyperparameters.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| num\_samples\_per\_tree | IntegerParameterRanges | MinValue: 1, MaxValue: 2048 | 
| num\_trees | IntegerParameterRanges | MinValue: 50, MaxValue: 1000 | 

# RCF Response Formats
<a name="rcf-in-formats"></a>

All Amazon SageMaker AI built-in algorithms adhere to the common input inference format described in [Common Data Formats - Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html). Note that SageMaker AI Random Cut Forest supports both dense and sparse JSON and RecordIO formats. This topic contains a list of the available output formats for the SageMaker AI RCF algorithm.

## JSON Response Format
<a name="RCF-json"></a>

ACCEPT: application/json.

```
{
    "scores": [
        {"score": 0.02},
        {"score": 0.25}
    ]
}
```

## JSONLINES Response Format
<a name="RCF-jsonlines"></a>

ACCEPT: application/jsonlines.

```
{"score": 0.02}
{"score": 0.25}
```

## RECORDIO Response Format
<a name="rcf-recordio"></a>

ACCEPT: application/x-recordio-protobuf.

```
    [                                                                                                                                                                                                                                                                                    
         Record = {                                                                                                                                                                                                                                                                           
             features = {},                                                                                                                                                                                                                                                                   
             label = {
                 'score': {
                     keys: [],
                     values: [0.25]  # float32
                 }
             }
         },
         Record = {
             features = {},
             label = {
                 'score': {
                     keys: [],
                     values: [0.23]  # float32
                 }
             }
         }
    ]
```

# Built-in SageMaker AI Algorithms for Computer Vision
<a name="algorithms-vision"></a>

SageMaker AI provides image processing algorithms for computer vision tasks such as image classification, object detection, and semantic segmentation.
+ [Image Classification - MXNet](image-classification.md)—uses example data with answers (referred to as a *supervised algorithm*). Use this algorithm to classify images.
+ [Image Classification - TensorFlow](image-classification-tensorflow.md)—uses pretrained TensorFlow Hub models to fine-tune for specific tasks (referred to as a *supervised algorithm*). Use this algorithm to classify images.
+ [Object Detection - MXNet](object-detection.md)—detects and classifies objects in images using a single deep neural network. It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene.
+ [Object Detection - TensorFlow](object-detection-tensorflow.md)—detects bounding boxes and object labels in an image. It is a supervised learning algorithm that supports transfer learning with available pretrained TensorFlow models.
+ [Semantic Segmentation Algorithm](semantic-segmentation.md)—provides a fine-grained, pixel-level approach to developing computer vision applications.


| Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable | 
| --- | --- | --- | --- | --- | --- | 
| Image Classification - MXNet | train and validation, (optionally) train\_lst, validation\_lst, and model | File or Pipe | recordIO or image files (.jpg or .png)  | GPU | Yes | 
| Image Classification - TensorFlow | training and validation | File | image files (.jpg, .jpeg, or .png)  | CPU or GPU | Yes (only across multiple GPUs on a single instance) | 
| Object Detection | train and validation, (optionally) train\_annotation, validation\_annotation, and model | File or Pipe | recordIO or image files (.jpg or .png)  | GPU | Yes | 
| Object Detection - TensorFlow | training and validation | File | image files (.jpg, .jpeg, or .png)  | GPU | Yes (only across multiple GPUs on a single instance) | 
| Semantic Segmentation | train and validation, train\_annotation, validation\_annotation, and (optionally) label\_map and model | File or Pipe | Image files | GPU (single instance only) | No | 

# Image Classification - MXNet
<a name="image-classification"></a>

The Amazon SageMaker image classification algorithm is a supervised learning algorithm that supports multi-label classification. It takes an image as input and outputs one or more labels assigned to that image. It uses a convolutional neural network that can be trained from scratch, or trained using transfer learning when a large number of training images are not available.

The recommended input format for the Amazon SageMaker AI image classification algorithms is Apache MXNet [RecordIO](https://mxnet.apache.org/api/faq/recordio). However, you can also use raw images in .jpg or .png format. Refer to [this discussion](https://mxnet.apache.org/api/architecture/note_data_loading) for a broad overview of efficient data preparation and loading for machine learning systems. 

**Note**  
To maintain better interoperability with existing deep learning frameworks, this differs from the protobuf data formats commonly used by other Amazon SageMaker AI algorithms.

For more information on convolutional networks, see: 
+ [Deep residual learning for image recognition](https://arxiv.org/abs/1512.03385) Kaiming He, et al., 2016 IEEE Conference on Computer Vision and Pattern Recognition
+ [ImageNet image database](http://www.image-net.org/)
+ [Image classification with Gluon-CV and MXNet](https://gluon-cv.mxnet.io/build/examples_classification/index.html)

**Topics**
+ [Input/Output Interface for the Image Classification Algorithm](#IC-inputoutput)
+ [EC2 Instance Recommendation for the Image Classification Algorithm](#IC-instances)
+ [Image Classification Sample Notebooks](#IC-sample-notebooks)
+ [How Image Classification Works](IC-HowItWorks.md)
+ [Image Classification Hyperparameters](IC-Hyperparameter.md)
+ [Tune an Image Classification Model](IC-tuning.md)

## Input/Output Interface for the Image Classification Algorithm
<a name="IC-inputoutput"></a>

The SageMaker AI Image Classification algorithm supports both RecordIO (`application/x-recordio`) and image (`image/png`, `image/jpeg`, and `application/x-image`) content types for training in file mode, and supports the RecordIO (`application/x-recordio`) content type for training in pipe mode. However, you can also train in pipe mode using the image files (`image/png`, `image/jpeg`, and `application/x-image`), without creating RecordIO files, by using the augmented manifest format.

Distributed training is supported for file mode and pipe mode. When using the RecordIO content type in pipe mode, you must set the `S3DataDistributionType` of the `S3DataSource` to `FullyReplicated`. The algorithm supports a fully replicated model where your data is copied onto each machine.

The algorithm supports `image/png`, `image/jpeg`, and `application/x-image` for inference.

### Train with RecordIO Format
<a name="IC-recordio-training"></a>

If you use the RecordIO format for training, specify both `train` and `validation` channels as values for the `InputDataConfig` parameter of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. Specify one RecordIO (`.rec`) file in the `train` channel and one RecordIO file in the `validation` channel. Set the content type for both channels to `application/x-recordio`. 

### Train with Image Format
<a name="IC-image-training"></a>

If you use the Image format for training, specify `train`, `validation`, `train_lst`, and `validation_lst` channels as values for the `InputDataConfig` parameter of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. Specify the individual image data (`.jpg` or `.png` files) for the `train` and `validation` channels. Specify one `.lst` file in each of the `train_lst` and `validation_lst` channels. Set the content type for all four channels to `application/x-image`. 

**Note**  
SageMaker AI reads the training and validation data separately from different channels, so you must store the training and validation data in different folders.
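As a sketch, the four channels for image-format training can be declared in the `InputDataConfig` parameter as follows. The bucket name and prefixes are placeholders, and the helper function is hypothetical; verify the exact field layout against the `CreateTrainingJob` API reference.

```python
# Hedged sketch: build the four InputDataConfig channels for image-format
# training. Bucket and prefixes are placeholders; verify field names
# against the CreateTrainingJob API reference before use.
def image_format_channels(bucket):
    def channel(name, prefix):
        return {
            "ChannelName": name,
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": f"s3://{bucket}/{prefix}",
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            # All four channels use the same content type for image format.
            "ContentType": "application/x-image",
        }

    return [
        channel("train", "train"),
        channel("validation", "validation"),
        channel("train_lst", "train_lst"),
        channel("validation_lst", "validation_lst"),
    ]
```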

A `.lst` file is a tab-separated file with three columns that contains a list of image files. The first column specifies the image index, the second column specifies the class label index for the image, and the third column specifies the relative path of the image file. The image index in the first column must be unique across all of the images. The set of class label indices are numbered successively and the numbering should start with 0. For example, 0 for the cat class, 1 for the dog class, and so on for additional classes. 

 The following is an example of a `.lst` file: 

```
5      1   your_image_directory/train_img_dog1.jpg
1000   0   your_image_directory/train_img_cat1.jpg
22     1   your_image_directory/train_img_dog2.jpg
```

For example, if your training images are stored in `s3://<your_bucket>/train/class_dog`, `s3://<your_bucket>/train/class_cat`, and so on, specify the path for your `train` channel as `s3://<your_bucket>/train`, which is the top-level directory for your data. In the `.lst` file, specify the relative path for an individual file named `train_image_dog1.jpg` in the `class_dog` class directory as `class_dog/train_image_dog1.jpg`. You can also store all your image files under one subdirectory inside the `train` directory. In that case, use that subdirectory for the relative path. For example, `s3://<your_bucket>/train/your_image_directory`. 
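To illustrate the `.lst` layout, the following sketch builds one from class subdirectories such as `class_cat` and `class_dog`. The helper name and directory naming are hypothetical, not part of the algorithm's interface.

```python
import os

# Hypothetical helper: write a tab-separated .lst file from class
# subdirectories (e.g. {"class_cat": 0, "class_dog": 1}). Paths in the
# output are relative to the top-level training directory, and the image
# index in the first column is unique across all images.
def write_lst(train_dir, class_to_label, lst_path):
    index = 0
    with open(lst_path, "w") as f:
        for class_dir, label in sorted(class_to_label.items()):
            for name in sorted(os.listdir(os.path.join(train_dir, class_dir))):
                # columns: unique image index, class label index, relative path
                f.write(f"{index}\t{label}\t{class_dir}/{name}\n")
                index += 1
```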

### Train with Augmented Manifest Image Format
<a name="IC-augmented-manifest-training"></a>

The augmented manifest format enables you to train in Pipe mode using image files without needing to create RecordIO files. You must specify both `train` and `validation` channels as values for the `InputDataConfig` parameter of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. To use this format, generate an S3 manifest file that contains the list of images and their corresponding annotations. The manifest file must be in [JSON Lines](http://jsonlines.org/) format, in which each line represents one sample. The images are specified using the `'source-ref'` tag, which points to the S3 location of the image. The annotations are provided under the `"AttributeNames"` parameter value as specified in the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. The file can also contain additional metadata under the `metadata` tag, but the algorithm ignores it. In the following example, the `"AttributeNames"` are contained in the list of image and annotation references `["source-ref", "class"]`. The corresponding label value is `"0"` for the first image and `"1"` for the second image:

```
{"source-ref":"s3://image/filename1.jpg", "class":"0"}
{"source-ref":"s3://image/filename2.jpg", "class":"1", "class-metadata": {"class-name": "cat", "type" : "groundtruth/image-classification"}}
```
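A minimal sketch of generating such a manifest from a list of (S3 URI, label) pairs follows; the helper name is hypothetical.

```python
import json

# Hypothetical helper: write an augmented manifest in JSON Lines format,
# one {"source-ref": ..., "class": ...} object per line.
def write_manifest(samples, path):
    with open(path, "w") as f:
        for s3_uri, label in samples:
            f.write(json.dumps({"source-ref": s3_uri, "class": str(label)}) + "\n")
```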

The order of `"AttributeNames"` in the input files matters when training the image classification algorithm. It accepts piped data in a specific order, with `image` first, followed by `label`. So the `"AttributeNames"` in this example are provided with `"source-ref"` first, followed by `"class"`. When using the image classification algorithm with the augmented manifest format, the value of the `RecordWrapperType` parameter must be `"RecordIO"`.

Multi-label training is also supported by specifying a JSON array of values. The `num_classes` hyperparameter must be set to match the total number of classes. There are two valid label formats: multi-hot and class-id. 

In the multi-hot format, each label is a multi-hot encoded vector of all classes, where each class takes the value of 0 or 1. In the following example, there are three classes. The first image is labeled with classes 0 and 2, while the second image is labeled with class 2 only: 

```
{"image-ref": "s3://amzn-s3-demo-bucket/sample01/image1.jpg", "class": "[1, 0, 1]"}
{"image-ref": "s3://amzn-s3-demo-bucket/sample02/image2.jpg", "class": "[0, 0, 1]"}
```

In the class-id format, each label is a list of the class ids, from [0, `num_classes`), which apply to the data point. The previous example would instead look like this:

```
{"image-ref": "s3://amzn-s3-demo-bucket/sample01/image1.jpg", "class": "[0, 2]"}
{"image-ref": "s3://amzn-s3-demo-bucket/sample02/image2.jpg", "class": "[2]"}
```
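The two label formats carry the same information. As an illustration, the following sketch converts between them (the function names are hypothetical):

```python
# Hypothetical converters between the two multi-label formats.
def multi_hot_to_class_id(multi_hot):
    # e.g. [1, 0, 1] -> [0, 2]
    return [i for i, v in enumerate(multi_hot) if v == 1]

def class_id_to_multi_hot(class_ids, num_classes):
    # e.g. [0, 2] with num_classes=3 -> [1, 0, 1]
    present = set(class_ids)
    return [1 if i in present else 0 for i in range(num_classes)]
```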

The multi-hot format is the default, but can be set explicitly in the content type with the `label-format` parameter: `application/x-recordio; label-format=multi-hot`. The class-id format, which is the format output by Ground Truth, must be set explicitly: `application/x-recordio; label-format=class-id`.

For more information on augmented manifest files, see [Augmented Manifest Files for Training Jobs](augmented-manifest.md).

### Incremental Training
<a name="IC-incremental-training"></a>

You can also seed the training of a new model with the artifacts from a model that you trained previously with SageMaker AI. Incremental training saves training time when you want to train a new model with the same or similar data. SageMaker AI image classification models can be seeded only with another built-in image classification model trained in SageMaker AI.

To use a pretrained model, in the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, specify the `ChannelName` as "model" in the `InputDataConfig` parameter. Set the `ContentType` for the model channel to `application/x-sagemaker-model`. The input hyperparameters of both the new model and the pretrained model that you upload to the model channel must have the same settings for the `num_layers`, `image_shape`, and `num_classes` input parameters. These parameters define the network architecture. For the pretrained model file, use the compressed model artifacts (in .tar.gz format) output by SageMaker AI. You can use either RecordIO or image formats for input data.

### Inference with the Image Classification Algorithm
<a name="IC-inference"></a>

The generated models can be hosted for inference and support encoded `.jpg` and `.png` image formats with the `image/png`, `image/jpeg`, and `application/x-image` content types. The input image is resized automatically. The output is the probability values for all classes, encoded in JSON format, or in [JSON Lines text format](http://jsonlines.org/) for batch transform. The image classification model processes a single image per request, and so outputs only one line in the JSON or JSON Lines format. The following is an example of a response in JSON Lines format:

```
accept: application/jsonlines

{"prediction": [prob_0, prob_1, prob_2, prob_3, ...]}
```
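A sketch of parsing such a JSON Lines response to recover the most probable class (the probabilities in the usage example are made up):

```python
import json

# Parse one line of a JSON Lines inference response and return
# (index of the most probable class, its probability).
def top_class(response_line):
    probs = json.loads(response_line)["prediction"]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]
```

For example, `top_class('{"prediction": [0.1, 0.7, 0.2]}')` returns `(1, 0.7)`.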

For more details on training and inference, see the image classification sample notebook instances referenced in the introduction.

## EC2 Instance Recommendation for the Image Classification Algorithm
<a name="IC-instances"></a>

For image classification, we support P2, P3, G4dn, and G5 instances. We recommend using GPU instances with more memory for training with large batch sizes. You can also run the algorithm on multi-GPU and multi-machine settings for distributed training. Both CPU (such as C4) and GPU (P2, P3, G4dn, or G5) instances can be used for inference.

## Image Classification Sample Notebooks
<a name="IC-sample-notebooks"></a>

For a sample notebook that uses the SageMaker AI image classification algorithm, see [Build and Register an MXNet Image Classification Model via SageMaker Pipelines](https://github.com/aws-samples/amazon-sagemaker-pipelines-mxnet-image-classification/blob/main/image-classification-sagemaker-pipelines.ipynb). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The example image classification notebooks are located in the **Introduction to Amazon algorithms** section. To open a notebook, choose its **Use** tab and select **Create copy**.

# How Image Classification Works
<a name="IC-HowItWorks"></a>

The image classification algorithm takes an image as input and classifies it into one of the output categories. Deep learning has revolutionized the image classification domain and has achieved great performance. Various deep learning networks such as [ResNet](https://arxiv.org/abs/1512.03385), [DenseNet](https://arxiv.org/abs/1608.06993), [Inception](https://arxiv.org/pdf/1409.4842.pdf), and so on, have been developed to be highly accurate for image classification. At the same time, there have been efforts to collect labeled image data that are essential for training these networks. [ImageNet](https://www.image-net.org/) is one such large dataset that has more than 11 million images with about 11,000 categories. Once a network is trained with ImageNet data, it can then be used to generalize with other datasets as well, by simple re-adjustment or fine-tuning. In this transfer learning approach, a network is initialized with weights (in this example, trained on ImageNet), which can be later fine-tuned for an image classification task in a different dataset. 

Image classification in Amazon SageMaker AI can be run in two modes: full training and transfer learning. In full training mode, the network is initialized with random weights and trained on user data from scratch. In transfer learning mode, the network is initialized with pre-trained weights and just the top fully connected layer is initialized with random weights. Then, the whole network is fine-tuned with new data. In this mode, training can be achieved even with a smaller dataset. This is because the network is already trained and therefore can be used in cases without sufficient training data.

# Image Classification Hyperparameters
<a name="IC-Hyperparameter"></a>

Hyperparameters are parameters that are set before a machine learning model begins learning. The following hyperparameters are supported by the Amazon SageMaker AI built-in Image Classification algorithm. See [Tune an Image Classification Model](IC-tuning.md) for information on image classification hyperparameter tuning. 


| Parameter Name | Description | 
| --- | --- | 
| num\_classes | Number of output classes. This parameter defines the dimensions of the network output and is typically set to the number of classes in the dataset. Besides multi-class classification, multi-label classification is supported too. Refer to [Input/Output Interface for the Image Classification Algorithm](image-classification.md#IC-inputoutput) for details on how to work with multi-label classification with augmented manifest files.  **Required** Valid values: positive integer  | 
| num\_training\_samples | Number of training examples in the input dataset. If there is a mismatch between this value and the number of samples in the training set, then the behavior of the `lr_scheduler_step` parameter is undefined and distributed training accuracy might be affected. **Required** Valid values: positive integer  | 
| augmentation\_type |  Data augmentation type. The input images can be augmented in multiple ways as specified below. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html) **Optional**  Valid values: `crop`, `crop_color`, or `crop_color_transform`. Default value: no default value  | 
| beta\_1 | The beta1 for `adam`; that is, the exponential decay rate for the first moment estimates. **Optional**  Valid values: float. Range in [0, 1]. Default value: 0.9 | 
| beta\_2 | The beta2 for `adam`; that is, the exponential decay rate for the second moment estimates. **Optional**  Valid values: float. Range in [0, 1]. Default value: 0.999 | 
| checkpoint\_frequency | Period to store model parameters (in number of epochs). Note that all checkpoint files are saved as part of the final model file "model.tar.gz" and uploaded to S3 to the specified model location. This increases the size of the model file proportionally to the number of checkpoints saved during training. **Optional** Valid values: positive integer no greater than `epochs`. Default value: no default value (Save checkpoint at the epoch that has the best validation accuracy) | 
| early\_stopping | `True` to use early stopping logic during training. `False` not to use it. **Optional** Valid values: `True` or `False` Default value: `False` | 
| early\_stopping\_min\_epochs | The minimum number of epochs that must be run before the early stopping logic can be invoked. It is used only when `early_stopping` = `True`. **Optional** Valid values: positive integer Default value: 10 | 
| early\_stopping\_patience | The number of epochs to wait before ending training if no improvement is made in the relevant metric. It is used only when `early_stopping` = `True`. **Optional** Valid values: positive integer Default value: 5 | 
| early\_stopping\_tolerance | Relative tolerance to measure an improvement in the accuracy validation metric. If the ratio of the improvement in accuracy divided by the previous best accuracy is smaller than the `early_stopping_tolerance` value set, early stopping considers there is no improvement. It is used only when `early_stopping` = `True`. **Optional** Valid values: 0 ≤ float ≤ 1 Default value: 0.0 | 
| epochs | Number of training epochs. **Optional** Valid values: positive integer Default value: 30 | 
| eps | The epsilon for `adam` and `rmsprop`. It is usually set to a small value to avoid division by 0. **Optional** Valid values: float. Range in [0, 1]. Default value: 1e-8 | 
| gamma | The gamma for `rmsprop`, the decay factor for the moving average of the squared gradient. **Optional** Valid values: float. Range in [0, 1]. Default value: 0.9 | 
| image\_shape | The input image dimensions, which is the same size as the input layer of the network. The format is defined as '`num_channels`, height, width'. The image dimension can take on any value as the network can handle varied dimensions of the input. However, there may be memory constraints if a larger image dimension is used. Pretrained models can use only a fixed 224 x 224 image size. Typical image dimensions for image classification are '3,224,224'. This is similar to the ImageNet dataset.  For training, if any input image is smaller than this parameter in any dimension, training fails. If an image is larger, a portion of the image is cropped, with the cropped area specified by this parameter. If the hyperparameter `augmentation_type` is set, a random crop is taken; otherwise, a central crop is taken.  At inference, input images are resized to the `image_shape` that was used during training. Aspect ratio is not preserved, and images are not cropped. **Optional** Valid values: string Default value: '3,224,224' | 
| kv\_store |  Weight update synchronization mode during distributed training. The weight updates can be updated either synchronously or asynchronously across machines. Synchronous updates typically provide better accuracy than asynchronous updates but can be slower. See distributed training in MXNet for more details. This parameter is not applicable to single machine training. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html) **Optional** Valid values: `dist_sync` or `dist_async` Default value: no default value  | 
| learning\_rate | Initial learning rate. **Optional** Valid values: float. Range in [0, 1]. Default value: 0.1 | 
| lr\_scheduler\_factor | The ratio to reduce the learning rate, used in conjunction with the `lr_scheduler_step` parameter, defined as `lr_new = lr_old * lr_scheduler_factor`. **Optional** Valid values: float. Range in [0, 1]. Default value: 0.1 | 
| lr\_scheduler\_step | The epochs at which to reduce the learning rate. As explained in the `lr_scheduler_factor` parameter, the learning rate is reduced by `lr_scheduler_factor` at these epochs. For example, if the value is set to "10, 20", then the learning rate is reduced by `lr_scheduler_factor` after the 10th epoch and again by `lr_scheduler_factor` after the 20th epoch. The epochs are delimited by ",". **Optional** Valid values: string Default value: no default value | 
| mini\_batch\_size | The batch size for training. In a single-machine multi-GPU setting, each GPU handles `mini_batch_size`/`num_gpu` training samples. For multi-machine training in `dist_sync` mode, the actual batch size is `mini_batch_size` multiplied by the number of machines. See the MXNet docs for more details. **Optional** Valid values: positive integer Default value: 32 | 
| momentum | The momentum for `sgd` and `nag`, ignored for other optimizers. **Optional** Valid values: float. Range in [0, 1]. Default value: 0.9 | 
| multi\_label |  Flag to use for multi-label classification, where each sample can be assigned multiple labels. Average accuracy across all classes is logged. **Optional** Valid values: 0 or 1 Default value: 0  | 
| num\_layers | Number of layers for the network. For data with large image size (for example, 224x224 - like ImageNet), we suggest selecting the number of layers from the set [18, 34, 50, 101, 152, 200]. For data with small image size (for example, 28x28 - like CIFAR), we suggest selecting the number of layers from the set [20, 32, 44, 56, 110]. The number of layers in each set is based on the ResNet paper. For transfer learning, the number of layers defines the architecture of the base network and hence can only be selected from the set [18, 34, 50, 101, 152, 200]. **Optional** Valid values: positive integer in [18, 34, 50, 101, 152, 200] or [20, 32, 44, 56, 110] Default value: 152 | 
| optimizer | The optimizer type. For more details of the parameters for the optimizers, refer to MXNet's API. **Optional** Valid values: One of `sgd`, `adam`, `rmsprop`, or `nag`. [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html) Default value: `sgd` | 
| precision\_dtype | The precision of the weights used for training. The algorithm can use either single precision (`float32`) or half precision (`float16`) for the weights. Using half precision for weights results in reduced memory consumption. **Optional** Valid values: `float32` or `float16` Default value: `float32` | 
| resize | The number of pixels in the shortest side of an image after resizing it for training. If the parameter is not set, then the training data is used without resizing. The parameter should be larger than both the width and height components of `image_shape` to prevent training failure. **Required** when using image content types **Optional** when using the RecordIO content type Valid values: positive integer Default value: no default value  | 
| top\_k | Reports the top-k accuracy during training. This parameter has to be greater than 1, since the top-1 training accuracy is the same as the regular training accuracy that has already been reported. **Optional** Valid values: positive integer larger than 1. Default value: no default value | 
| use\_pretrained\_model | Flag to use a pretrained model for training. If set to 1, then the pretrained model with the corresponding number of layers is loaded and used for training. Only the top fully connected layer is reinitialized with random weights. Otherwise, the network is trained from scratch. **Optional** Valid values: 0 or 1 Default value: 0 | 
| use\_weighted\_loss |  Flag to use weighted cross-entropy loss for multi-label classification (used only when `multi_label` = 1), where the weights are calculated based on the distribution of classes. **Optional** Valid values: 0 or 1 Default value: 0  | 
| weight\_decay | The coefficient of weight decay for `sgd` and `nag`, ignored for other optimizers. **Optional** Valid values: float. Range in [0, 1]. Default value: 0.0001 | 

# Tune an Image Classification Model
<a name="IC-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the Image Classification Algorithm
<a name="IC-metrics"></a>

The image classification algorithm is a supervised algorithm. It reports an accuracy metric that is computed during training. When tuning the model, choose this metric as the objective metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:accuracy | The ratio of the number of correct predictions to the total number of predictions made. | Maximize | 

## Tunable Image Classification Hyperparameters
<a name="IC-tunable-hyperparameters"></a>

Tune an image classification model with the following hyperparameters. The hyperparameters that have the greatest impact on image classification objective metrics are: `mini_batch_size`, `learning_rate`, and `optimizer`. Tune the optimizer-related hyperparameters, such as `momentum`, `weight_decay`, `beta_1`, `beta_2`, `eps`, and `gamma`, based on the selected `optimizer`. For example, use `beta_1` and `beta_2` only when `adam` is the `optimizer`.

For more information about which hyperparameters are used in each optimizer, see [Image Classification Hyperparameters](IC-Hyperparameter.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| beta\_1 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| beta\_2 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| eps | ContinuousParameterRanges | MinValue: 1e-8, MaxValue: 1.0 | 
| gamma | ContinuousParameterRanges | MinValue: 1e-8, MaxValue: 0.999 | 
| learning\_rate | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.5 | 
| mini\_batch\_size | IntegerParameterRanges | MinValue: 8, MaxValue: 512 | 
| momentum | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| optimizer | CategoricalParameterRanges | ['sgd', 'adam', 'rmsprop', 'nag'] | 
| weight\_decay | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 

# Image Classification - TensorFlow
<a name="image-classification-tensorflow"></a>

The Amazon SageMaker Image Classification - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the [TensorFlow Hub](https://tfhub.dev/s?fine-tunable=yes&module-type=image-classification&subtype=module,placeholder&tf-version=tf2). Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of image data is not available. The image classification algorithm takes an image as input and outputs a probability for each provided class label. Training datasets must consist of images in .jpg, .jpeg, or .png format. This page includes information about Amazon EC2 instance recommendations and sample notebooks for Image Classification - TensorFlow.

**Topics**
+ [How to use the SageMaker Image Classification - TensorFlow algorithm](IC-TF-how-to-use.md)
+ [Input and output interface for the Image Classification - TensorFlow algorithm](IC-TF-inputoutput.md)
+ [Amazon EC2 instance recommendation for the Image Classification - TensorFlow algorithm](#IC-TF-instances)
+ [Image Classification - TensorFlow sample notebooks](#IC-TF-sample-notebooks)
+ [How Image Classification - TensorFlow Works](IC-TF-HowItWorks.md)
+ [TensorFlow Hub Models](IC-TF-Models.md)
+ [Image Classification - TensorFlow Hyperparameters](IC-TF-Hyperparameter.md)
+ [Tune an Image Classification - TensorFlow model](IC-TF-tuning.md)

# How to use the SageMaker Image Classification - TensorFlow algorithm
<a name="IC-TF-how-to-use"></a>

You can use Image Classification - TensorFlow as an Amazon SageMaker AI built-in algorithm. The following section describes how to use Image Classification - TensorFlow with the SageMaker AI Python SDK. For information on how to use Image Classification - TensorFlow from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).

The Image Classification - TensorFlow algorithm supports transfer learning using any of the compatible pretrained TensorFlow Hub models. For a list of all available pretrained models, see [TensorFlow Hub Models](IC-TF-Models.md). Every pretrained model has a unique `model_id`. The following example uses MobileNet V2 1.00 224 (`model_id`: `tensorflow-ic-imagenet-mobilenet-v2-100-224-classification-4`) to fine-tune on a custom dataset. The pretrained models are all pre-downloaded from the TensorFlow Hub and stored in Amazon S3 buckets so that training jobs can run in network isolation. Use these pre-generated model training artifacts to construct a SageMaker AI Estimator.

First, retrieve the Docker image URI, training script URI, and pretrained model URI. Then, change the hyperparameters as you see fit. You can see a Python dictionary of all available hyperparameters and their default values with `hyperparameters.retrieve_default`. For more information, see [Image Classification - TensorFlow Hyperparameters](IC-TF-Hyperparameter.md). Use these values to construct a SageMaker AI Estimator.

**Note**  
Default hyperparameter values are different for different models. For larger models, the default batch size is smaller and the `train_only_top_layer` hyperparameter is set to `"True"`.

This example uses the [https://www.tensorflow.org/datasets/catalog/tf_flowers](https://www.tensorflow.org/datasets/catalog/tf_flowers) dataset, which contains five classes of flower images. We pre-downloaded the dataset from TensorFlow under the Apache 2.0 license and made it available in Amazon S3. To fine-tune your model, call `.fit` using the Amazon S3 location of your training dataset.

```
import sagemaker
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.estimator import Estimator

# Set up the session, execution role, and Region used later in this example
sess = sagemaker.Session()
aws_role = sagemaker.get_execution_role()
aws_region = sess.boto_region_name

model_id, model_version = "tensorflow-ic-imagenet-mobilenet-v2-100-224-classification-4", "*"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the Docker image
train_image_uri = image_uris.retrieve(
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
    region=None,
    framework=None,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="training")

# Retrieve the pretrained model tarball for transfer learning
train_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="training")

# Retrieve the default hyperparameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "5"

# The sample training data is available in the following S3 bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/tf_flowers/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-ic-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

# Create SageMaker Estimator instance
tf_ic_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Use S3 path of the training data to launch SageMaker TrainingJob
tf_ic_estimator.fit({"training": training_dataset_s3_path}, logs=True)
```

# Input and output interface for the Image Classification - TensorFlow algorithm
<a name="IC-TF-inputoutput"></a>

Each of the pretrained models listed in TensorFlow Hub Models can be fine-tuned to any dataset with any number of image classes. Be mindful of how to format your training data for input to the Image Classification - TensorFlow model.
+ **Training data input format:** Your training data should be a directory with as many subdirectories as the number of classes. Each subdirectory should contain images belonging to that class in .jpg, .jpeg, or .png format.

The following is an example of an input directory structure. This example dataset has two classes: `roses` and `dandelion`. The image files in each class folder can have any name. The input directory should be hosted in an Amazon S3 bucket with a path similar to the following: `s3://bucket_name/input_directory/`. Note that the trailing `/` is required.

```
input_directory
    |--roses
        |--abc.jpg
        |--def.jpg
    |--dandelion
        |--ghi.jpg
        |--jkl.jpg
```

Trained models output label mapping files that map class folder names to the indices in the list of output class probabilities. This mapping is in alphabetical order. For example, in the preceding example, the dandelion class is index 0 and the roses class is index 1. 
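The alphabetical index mapping can be reproduced in plain Python for the example dataset above by sorting the class folder names; each name's position in the sorted list is its index in the output probability vector:

```python
# Class folder names from the example input_directory above
class_folders = ["roses", "dandelion"]

# The label mapping sorts folder names alphabetically and assigns indices
label_mapping = {index: name for index, name in enumerate(sorted(class_folders))}
print(label_mapping)  # → {0: 'dandelion', 1: 'roses'}
```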

After training, you have a fine-tuned model that you can further train using incremental training or deploy for inference. The Image Classification - TensorFlow algorithm automatically adds a pre-processing and post-processing signature to the fine-tuned model so that it can take in images as input and return class probabilities. The file mapping class indices to class labels is saved along with the models. 

## Incremental training
<a name="IC-TF-incremental-training"></a>

You can seed the training of a new model with artifacts from a model that you trained previously with SageMaker AI. Incremental training saves training time when you want to train a new model with the same or similar data.

**Note**  
You can only seed a SageMaker Image Classification - TensorFlow model with another Image Classification - TensorFlow model trained in SageMaker AI. 

You can use any dataset for incremental training, as long as the set of classes remains the same. The incremental training step is similar to the fine-tuning step, but instead of starting with a pretrained model, you start with an existing fine-tuned model. For an example of incremental training with the SageMaker AI Image Classification - TensorFlow algorithm, see the [Introduction to SageMaker TensorFlow - Image Classification](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/image_classification_tensorflow/Amazon_TensorFlow_Image_Classification.ipynb) sample notebook.

## Inference with the Image Classification - TensorFlow algorithm
<a name="IC-TF-inference"></a>

You can host the fine-tuned model that results from your TensorFlow Image Classification training for inference. Any input image for inference must be in `.jpg`, `.jpeg`, or `.png` format and be content type `application/x-image`. The Image Classification - TensorFlow algorithm resizes input images automatically. 

Running inference results in probability values, class labels for all classes, and the predicted label corresponding to the class index with the highest probability encoded in JSON format. The Image Classification - TensorFlow model processes a single image per request and outputs only one line. The following is an example of a JSON format response:

```
accept: application/json;verbose

 {"probabilities": [prob_0, prob_1, prob_2, ...],
  "labels":        [label_0, label_1, label_2, ...],
  "predicted_label": predicted_label}
```

If `accept` is set to `application/json`, then the model only outputs probabilities. For more information on training and inference with the Image Classification - TensorFlow algorithm, see the [Introduction to SageMaker TensorFlow - Image Classification](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/image_classification_tensorflow/Amazon_TensorFlow_Image_Classification.ipynb) sample notebook.
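The following sketch parses a verbose JSON response and confirms that the predicted label is the class with the highest probability. The response body and its probability values are made up for illustration; a real response comes from your endpoint:

```python
import json

# Hypothetical response body from an endpoint invoked with
# accept: application/json;verbose (values are made up)
body = (
    '{"probabilities": [0.92, 0.05, 0.03],'
    ' "labels": ["dandelion", "roses", "sunflowers"],'
    ' "predicted_label": "dandelion"}'
)

response = json.loads(body)
probs = response["probabilities"]

# The predicted label corresponds to the class index with the highest probability
best = response["labels"][probs.index(max(probs))]
assert best == response["predicted_label"]
print(best)  # → dandelion
```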

## Amazon EC2 instance recommendation for the Image Classification - TensorFlow algorithm
<a name="IC-TF-instances"></a>

The Image Classification - TensorFlow algorithm supports all CPU and GPU instances for training, including:
+ `ml.p2.xlarge`
+ `ml.p2.16xlarge`
+ `ml.p3.2xlarge`
+ `ml.p3.16xlarge`
+ `ml.g4dn.xlarge`
+ `ml.g4dn.16xlarge`
+ `ml.g5.xlarge`
+ `ml.g5.48xlarge`

We recommend GPU instances with more memory for training with large batch sizes. Both CPU (such as M5) and GPU (P2, P3, G4dn, or G5) instances can be used for inference.

## Image Classification - TensorFlow sample notebooks
<a name="IC-TF-sample-notebooks"></a>

For more information about how to use the SageMaker Image Classification - TensorFlow algorithm for transfer learning on a custom dataset, see the [Introduction to SageMaker TensorFlow - Image Classification](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/image_classification_tensorflow/Amazon_TensorFlow_Image_Classification.ipynb) notebook.

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you create and open a notebook instance, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How Image Classification - TensorFlow Works
<a name="IC-TF-HowItWorks"></a>

The Image Classification - TensorFlow algorithm takes an image as input and classifies it into one of the output class labels. Various deep learning networks such as MobileNet, ResNet, Inception, and EfficientNet are highly accurate for image classification. There are also deep learning networks that are trained on large image datasets, such as ImageNet, which has over 11 million images and almost 11,000 classes. After a network is trained with ImageNet data, you can then fine-tune the network on a dataset with a particular focus to perform more specific classification tasks. The Amazon SageMaker Image Classification - TensorFlow algorithm supports transfer learning on many pretrained models that are available in the TensorFlow Hub.

According to the number of class labels in your training data, a classification layer is attached to the pretrained TensorFlow Hub model of your choice. The classification layer consists of a dropout layer followed by a dense (fully connected) layer with a 2-norm (L2) regularizer, initialized with random weights. The model has hyperparameters for the dropout rate of the dropout layer and the L2 regularization factor for the dense layer. You can then fine-tune either the entire network (including the pretrained model) or only the top classification layer on new training data. With this method of transfer learning, training with smaller datasets is possible.
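At inference time the dropout layer is inactive, so the attached head reduces to a dense layer followed by softmax. The following toy sketch, in plain Python rather than TensorFlow and with made-up dimensions and weights, illustrates that computation and the L2 penalty applied to the dense-layer weights during training:

```python
import math

# Made-up feature vector from the pretrained model and a 2-class head
features = [0.5, -1.2, 3.0]
weights = [[0.1, 0.4, -0.2],   # one row of weights per class
           [-0.3, 0.2, 0.5]]
biases = [0.0, 0.1]

# Dense (fully connected) layer: logits = W @ features + b
logits = [sum(w * x for w, x in zip(row, features)) + b
          for row, b in zip(weights, biases)]

# Softmax turns logits into class probabilities that sum to 1
exps = [math.exp(z) for z in logits]
probabilities = [e / sum(exps) for e in exps]

# During training, the L2 regularizer adds a penalty proportional to the
# sum of squared dense-layer weights (factor set by regularizers_l2)
l2_penalty = 0.0001 * sum(w * w for row in weights for w in row)

print(probabilities)
```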

# TensorFlow Hub Models
<a name="IC-TF-Models"></a>

The following pretrained models are available to use for transfer learning with the Image Classification - TensorFlow algorithm. 

The following models vary significantly in size, number of model parameters, training time, and inference latency for any given dataset. The best model for your use case depends on the complexity of your fine-tuning dataset and any requirements that you have on training time, inference latency, or model accuracy.


| Model Name | `model_id` | Source | 
| --- | --- | --- | 
| MobileNet V2 1.00 224 | `tensorflow-ic-imagenet-mobilenet-v2-100-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/classification/4) | 
| MobileNet V2 0.75 224 | `tensorflow-ic-imagenet-mobilenet-v2-075-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v2_075_224/classification/4) | 
| MobileNet V2 0.50 224 | `tensorflow-ic-imagenet-mobilenet-v2-050-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v2_050_224/classification/4) | 
| MobileNet V2 0.35 224 | `tensorflow-ic-imagenet-mobilenet-v2-035-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/classification/4) | 
| MobileNet V2 1.40 224 | `tensorflow-ic-imagenet-mobilenet-v2-140-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v2_140_224/classification/4) | 
| MobileNet V2 1.30 224 | `tensorflow-ic-imagenet-mobilenet-v2-130-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/4) | 
| MobileNet V2 | `tensorflow-ic-tf2-preview-mobilenet-v2-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/tf2-preview/mobilenet_v2/classification/4) | 
| Inception V3 | `tensorflow-ic-imagenet-inception-v3-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/inception_v3/classification/4) | 
| Inception V2 | `tensorflow-ic-imagenet-inception-v2-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/inception_v2/classification/4) | 
| Inception V1 | `tensorflow-ic-imagenet-inception-v1-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/inception_v1/classification/4) | 
| Inception V3 Preview | `tensorflow-ic-tf2-preview-inception-v3-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/tf2-preview/inception_v3/classification/4) | 
| Inception ResNet V2 | `tensorflow-ic-imagenet-inception-resnet-v2-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/inception_resnet_v2/classification/4) | 
| ResNet V2 50 | `tensorflow-ic-imagenet-resnet-v2-50-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/resnet_v2_50/classification/4) | 
| ResNet V2 101 | `tensorflow-ic-imagenet-resnet-v2-101-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/resnet_v2_101/classification/4) | 
| ResNet V2 152 | `tensorflow-ic-imagenet-resnet-v2-152-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/resnet_v2_152/classification/4) | 
| ResNet V1 50 | `tensorflow-ic-imagenet-resnet-v1-50-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/resnet_v1_50/classification/4) | 
| ResNet V1 101 | `tensorflow-ic-imagenet-resnet-v1-101-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/resnet_v1_101/classification/4) | 
| ResNet V1 152 | `tensorflow-ic-imagenet-resnet-v1-152-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/resnet_v1_152/classification/4) | 
| ResNet 50 | `tensorflow-ic-imagenet-resnet-50-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/resnet_50/classification/1) | 
| EfficientNet B0 | `tensorflow-ic-efficientnet-b0-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b0/classification/1) | 
| EfficientNet B1 | `tensorflow-ic-efficientnet-b1-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b1/classification/1) | 
| EfficientNet B2 | `tensorflow-ic-efficientnet-b2-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b2/classification/1) | 
| EfficientNet B3 | `tensorflow-ic-efficientnet-b3-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b3/classification/1) | 
| EfficientNet B4 | `tensorflow-ic-efficientnet-b4-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b4/classification/1) | 
| EfficientNet B5 | `tensorflow-ic-efficientnet-b5-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b5/classification/1) | 
| EfficientNet B6 | `tensorflow-ic-efficientnet-b6-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b6/classification/1) | 
| EfficientNet B7 | `tensorflow-ic-efficientnet-b7-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/efficientnet/b7/classification/1) | 
| EfficientNet B0 Lite | `tensorflow-ic-efficientnet-lite0-classification-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/efficientnet/lite0/classification/2) | 
| EfficientNet B1 Lite | `tensorflow-ic-efficientnet-lite1-classification-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/efficientnet/lite1/classification/2) | 
| EfficientNet B2 Lite | `tensorflow-ic-efficientnet-lite2-classification-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/efficientnet/lite2/classification/2) | 
| EfficientNet B3 Lite | `tensorflow-ic-efficientnet-lite3-classification-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/efficientnet/lite3/classification/2) | 
| EfficientNet B4 Lite | `tensorflow-ic-efficientnet-lite4-classification-2` | [TensorFlow Hub link](https://tfhub.dev/tensorflow/efficientnet/lite4/classification/2) | 
| MobileNet V1 1.00 224 | `tensorflow-ic-imagenet-mobilenet-v1-100-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_100_224/classification/4) | 
| MobileNet V1 1.00 192 | `tensorflow-ic-imagenet-mobilenet-v1-100-192-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_100_192/classification/4) | 
| MobileNet V1 1.00 160 | `tensorflow-ic-imagenet-mobilenet-v1-100-160-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_100_160/classification/4) | 
| MobileNet V1 1.00 128 | `tensorflow-ic-imagenet-mobilenet-v1-100-128-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_100_128/classification/4) | 
| MobileNet V1 0.75 224 | `tensorflow-ic-imagenet-mobilenet-v1-075-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_075_224/classification/4) | 
| MobileNet V1 0.75 192 | `tensorflow-ic-imagenet-mobilenet-v1-075-192-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_075_192/classification/4) | 
| MobileNet V1 0.75 160 | `tensorflow-ic-imagenet-mobilenet-v1-075-160-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_075_160/classification/4) | 
| MobileNet V1 0.75 128 | `tensorflow-ic-imagenet-mobilenet-v1-075-128-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_075_128/classification/4) | 
| MobileNet V1 0.50 224 | `tensorflow-ic-imagenet-mobilenet-v1-050-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_050_224/classification/4) | 
| MobileNet V1 0.50 192 | `tensorflow-ic-imagenet-mobilenet-v1-050-192-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_050_192/classification/4) | 
| MobileNet V1 0.50 160 | `tensorflow-ic-imagenet-mobilenet-v1-050-160-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_050_160/classification/4) | 
| MobileNet V1 0.50 128 | `tensorflow-ic-imagenet-mobilenet-v1-050-128-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_050_128/classification/4) | 
| MobileNet V1 0.25 224 | `tensorflow-ic-imagenet-mobilenet-v1-025-224-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_025_224/classification/4) | 
| MobileNet V1 0.25 192 | `tensorflow-ic-imagenet-mobilenet-v1-025-192-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_025_192/classification/4) | 
| MobileNet V1 0.25 160 | `tensorflow-ic-imagenet-mobilenet-v1-025-160-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_025_160/classification/4) | 
| MobileNet V1 0.25 128 | `tensorflow-ic-imagenet-mobilenet-v1-025-128-classification-4` | [TensorFlow Hub link](https://tfhub.dev/google/imagenet/mobilenet_v1_025_128/classification/4) | 
| BiT-S R50x1 | `tensorflow-ic-bit-s-r50x1-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/s-r50x1/ilsvrc2012_classification/1) | 
| BiT-S R50x3 | `tensorflow-ic-bit-s-r50x3-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/s-r50x3/ilsvrc2012_classification/1) | 
| BiT-S R101x1 | `tensorflow-ic-bit-s-r101x1-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/s-r101x1/ilsvrc2012_classification/1) | 
| BiT-S R101x3 | `tensorflow-ic-bit-s-r101x3-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/s-r101x3/ilsvrc2012_classification/1) | 
| BiT-M R50x1 | `tensorflow-ic-bit-m-r50x1-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r50x1/ilsvrc2012_classification/1) | 
| BiT-M R50x3 | `tensorflow-ic-bit-m-r50x3-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r50x3/ilsvrc2012_classification/1) | 
| BiT-M R101x1 | `tensorflow-ic-bit-m-r101x1-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r101x1/ilsvrc2012_classification/1) | 
| BiT-M R101x3 | `tensorflow-ic-bit-m-r101x3-ilsvrc2012-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r101x3/ilsvrc2012_classification/1) | 
| BiT-M R50x1 ImageNet-21k | `tensorflow-ic-bit-m-r50x1-imagenet21k-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r50x1/imagenet21k_classification/1) | 
| BiT-M R50x3 ImageNet-21k | `tensorflow-ic-bit-m-r50x3-imagenet21k-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r50x3/imagenet21k_classification/1) | 
| BiT-M R101x1 ImageNet-21k | `tensorflow-ic-bit-m-r101x1-imagenet21k-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r101x1/imagenet21k_classification/1) | 
| BiT-M R101x3 ImageNet-21k | `tensorflow-ic-bit-m-r101x3-imagenet21k-classification-1` | [TensorFlow Hub link](https://tfhub.dev/google/bit/m-r101x3/imagenet21k_classification/1) | 

# Image Classification - TensorFlow Hyperparameters
<a name="IC-TF-Hyperparameter"></a>

Hyperparameters are parameters that are set before a machine learning model begins learning. The following hyperparameters are supported by the Amazon SageMaker AI built-in Image Classification - TensorFlow algorithm. See [Tune an Image Classification - TensorFlow model](IC-TF-tuning.md) for information on hyperparameter tuning. 


| Parameter Name | Description | 
| --- | --- | 
| augmentation | Set to `"True"` to apply `augmentation_random_flip`, `augmentation_random_rotation`, and `augmentation_random_zoom` to the training data. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`. | 
| augmentation\_random\_flip | Indicates which flip mode to use for data augmentation when `augmentation` is set to `"True"`. For more information, see [RandomFlip](https://www.tensorflow.org/api_docs/python/tf/keras/layers/RandomFlip) in the TensorFlow documentation. Valid values: string, any of the following: (`"horizontal_and_vertical"`, `"vertical"`, or `"None"`). Default value: `"horizontal_and_vertical"`. | 
| augmentation\_random\_rotation | Indicates how much rotation to use for data augmentation when `augmentation` is set to `"True"`. Values represent a fraction of 2π. Positive values rotate counterclockwise while negative values rotate clockwise. `0` means no rotation. For more information, see [RandomRotation](https://www.tensorflow.org/api_docs/python/tf/keras/layers/RandomRotation) in the TensorFlow documentation. Valid values: float, range: [`-1.0`, `1.0`]. Default value: `0.2`. | 
| augmentation\_random\_zoom | Indicates how much vertical zoom to use for data augmentation when `augmentation` is set to `"True"`. Positive values zoom out while negative values zoom in. `0` means no zoom. For more information, see [RandomZoom](https://www.tensorflow.org/api_docs/python/tf/keras/layers/RandomZoom) in the TensorFlow documentation. Valid values: float, range: [`-1.0`, `1.0`]. Default value: `0.1`. | 
| batch\_size | The batch size for training. For training on instances with multiple GPUs, this batch size is used across the GPUs. Valid values: positive integer. Default value: `32`. | 
| beta\_1 | The beta1 for the `"adam"` optimizer. Represents the exponential decay rate for the first moment estimates. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.9`. | 
| beta\_2 | The beta2 for the `"adam"` optimizer. Represents the exponential decay rate for the second moment estimates. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.999`. | 
| binary\_mode | When `binary_mode` is set to `"True"`, the model returns a single probability number for the positive class and can use additional `eval_metric` options. Use only for binary classification problems. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`. | 
| dropout\_rate | The dropout rate for the dropout layer in the top classification layer. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.2`. | 
| early\_stopping | Set to `"True"` to use early stopping logic during training. If `"False"`, early stopping is not used. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`. | 
| early\_stopping\_min\_delta | The minimum change needed to qualify as an improvement. An absolute change less than the value of `early_stopping_min_delta` does not qualify as improvement. Used only when `early_stopping` is set to `"True"`. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0`. | 
| early\_stopping\_patience | The number of epochs to continue training with no improvement. Used only when `early_stopping` is set to `"True"`. Valid values: positive integer. Default value: `5`. | 
| epochs | The number of training epochs. Valid values: positive integer. Default value: `3`. | 
| epsilon | The epsilon for `"adam"`, `"rmsprop"`, `"adadelta"`, and `"adagrad"` optimizers. Usually set to a small value to avoid division by 0. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `1e-7`. | 
| eval\_metric | If `binary_mode` is set to `"False"`, `eval_metric` can only be `"accuracy"`. If `binary_mode` is `"True"`, select any of the valid values. For more information, see [Metrics](https://www.tensorflow.org/api_docs/python/tf/keras/metrics) in the TensorFlow documentation. Valid values: string, any of the following: (`"accuracy"`, `"precision"`, `"recall"`, `"auc"`, or `"prc"`). Default value: `"accuracy"`. | 
| image\_resize\_interpolation | Indicates the interpolation method used when resizing images. For more information, see [image.resize](https://www.tensorflow.org/api_docs/python/tf/image/resize) in the TensorFlow documentation. Valid values: string, any of the following: (`"bilinear"`, `"nearest"`, `"bicubic"`, `"area"`, `"lanczos3"`, `"lanczos5"`, `"gaussian"`, or `"mitchellcubic"`). Default value: `"bilinear"`. | 
| initial\_accumulator\_value | The starting value for the accumulators, or the per-parameter momentum values, for the `"adagrad"` optimizer. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0001`. | 
| label\_smoothing | Indicates how much to relax the confidence on label values. For example, if `label_smoothing` is `0.1`, then non-target labels are `0.1/num_classes` and target labels are `0.9+0.1/num_classes`. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.1`. | 
| learning\_rate | The optimizer learning rate. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.001`. | 
| momentum | The momentum for `"sgd"`, `"nesterov"`, and `"rmsprop"` optimizers. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.9`. | 
| optimizer | The optimizer type. For more information, see [Optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers) in the TensorFlow documentation. Valid values: string, any of the following: (`"adam"`, `"sgd"`, `"nesterov"`, `"rmsprop"`, `"adagrad"`, or `"adadelta"`). Default value: `"adam"`. | 
| regularizers\_l2 | The L2 regularization factor for the dense layer in the classification layer. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0001`. | 
| reinitialize\_top\_layer | If set to `"Auto"`, the top classification layer parameters are re-initialized during fine-tuning. For incremental training, top classification layer parameters are not re-initialized unless set to `"True"`. Valid values: string, any of the following: (`"Auto"`, `"True"`, or `"False"`). Default value: `"Auto"`. | 
| rho | The discounting factor for the gradient of the `"adadelta"` and `"rmsprop"` optimizers. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.95`. | 
| train\_only\_top\_layer | If `"True"`, only the top classification layer parameters are fine-tuned. If `"False"`, all model parameters are fine-tuned. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`. | 
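The `label_smoothing` arithmetic described in the table above can be verified with a short worked example. With smoothing `0.1` and five classes, the target label softens to `0.9 + 0.1/5 = 0.92`, each non-target label becomes `0.1/5 = 0.02`, and the softened labels still sum to 1:

```python
label_smoothing = 0.1
num_classes = 5

# Non-target labels get an equal share of the smoothing mass
non_target = label_smoothing / num_classes

# The target label keeps the remaining confidence plus its share
target = (1 - label_smoothing) + label_smoothing / num_classes

# Assume class 0 is the true class
soft_labels = [non_target] * num_classes
soft_labels[0] = target

assert abs(non_target - 0.02) < 1e-12
assert abs(target - 0.92) < 1e-12
assert abs(sum(soft_labels) - 1.0) < 1e-9
```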

# Tune an Image Classification - TensorFlow model
<a name="IC-TF-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics computed by the Image Classification - TensorFlow algorithm
<a name="IC-TF-metrics"></a>

The image classification algorithm is a supervised algorithm. It reports an accuracy metric that is computed during training. When tuning the model, choose this metric as the objective metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:accuracy | The ratio of the number of correct predictions to the total number of predictions made. | Maximize | 

## Tunable Image Classification - TensorFlow hyperparameters
<a name="IC-TF-tunable-hyperparameters"></a>

Tune an image classification model with the following hyperparameters. The hyperparameters that have the greatest impact on the image classification objective metric are `batch_size`, `learning_rate`, and `optimizer`. Tune the optimizer-related hyperparameters, such as `momentum`, `regularizers_l2`, `beta_1`, `beta_2`, and `eps`, based on the selected `optimizer`. For example, use `beta_1` and `beta_2` only when `adam` is the `optimizer`.

For more information about which hyperparameters are used for each `optimizer`, see [Image Classification - TensorFlow Hyperparameters](IC-TF-Hyperparameter.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| batch_size | IntegerParameterRanges | MinValue: 8, MaxValue: 512 | 
| beta_1 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| beta_2 | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| eps | ContinuousParameterRanges | MinValue: 1e-8, MaxValue: 1.0 | 
| learning_rate | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.5 | 
| momentum | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| optimizer | CategoricalParameterRanges | ['sgd', 'adam', 'rmsprop', 'nesterov', 'adagrad', 'adadelta'] | 
| regularizers_l2 | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| train_only_top_layer | CategoricalParameterRanges | ['True', 'False'] | 
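The recommended ranges above map directly onto the `ParameterRanges` section of a hyperparameter tuning job configuration. The following is a minimal sketch of that section as you might pass it to the low-level `CreateHyperParameterTuningJob` API (for example, through boto3); the surrounding job name, role, and training job definition are omitted.

```python
# Sketch: the ParameterRanges portion of a HyperparameterTuningJobConfig,
# built from the recommended ranges in the table above. The API expects
# Min/Max values as strings.
parameter_ranges = {
    "IntegerParameterRanges": [
        {"Name": "batch_size", "MinValue": "8", "MaxValue": "512"},
    ],
    "ContinuousParameterRanges": [
        {"Name": "learning_rate", "MinValue": "1e-6", "MaxValue": "0.5"},
        {"Name": "momentum", "MinValue": "0.0", "MaxValue": "0.999"},
        {"Name": "regularizers_l2", "MinValue": "0.0", "MaxValue": "0.999"},
    ],
    "CategoricalParameterRanges": [
        {"Name": "optimizer",
         "Values": ["sgd", "adam", "rmsprop", "nesterov", "adagrad", "adadelta"]},
    ],
}

# The objective metric comes from the metrics table: maximize validation:accuracy.
tuning_objective = {"Type": "Maximize", "MetricName": "validation:accuracy"}
```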

# Object Detection - MXNet
<a name="object-detection"></a>

The Amazon SageMaker AI Object Detection - MXNet algorithm detects and classifies objects in images using a single deep neural network. It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene. The object is categorized into one of the classes in a specified collection with a confidence score that it belongs to the class. Its location and scale in the image are indicated by a rectangular bounding box. It uses the [Single Shot multibox Detector (SSD)](https://arxiv.org/pdf/1512.02325.pdf) framework and supports two base networks: [VGG](https://arxiv.org/pdf/1409.1556.pdf) and [ResNet](https://arxiv.org/pdf/1603.05027.pdf). The network can be trained from scratch, or trained with models that have been pre-trained on the [ImageNet](http://www.image-net.org/) dataset.

**Topics**
+ [Input/Output Interface for the Object Detection Algorithm](#object-detection-inputoutput)
+ [EC2 Instance Recommendation for the Object Detection Algorithm](#object-detection-instances)
+ [Object Detection Sample Notebooks](#object-detection-sample-notebooks)
+ [How Object Detection Works](algo-object-detection-tech-notes.md)
+ [Object Detection Hyperparameters](object-detection-api-config.md)
+ [Tune an Object Detection Model](object-detection-tuning.md)
+ [Object Detection Request and Response Formats](object-detection-in-formats.md)

## Input/Output Interface for the Object Detection Algorithm
<a name="object-detection-inputoutput"></a>

The SageMaker AI Object Detection algorithm supports both RecordIO (`application/x-recordio`) and image (`image/png`, `image/jpeg`, and `application/x-image`) content types for training in file mode, and supports RecordIO (`application/x-recordio`) for training in pipe mode. However, you can also train in pipe mode using image files (`image/png`, `image/jpeg`, and `application/x-image`), without creating RecordIO files, by using the augmented manifest format. The recommended input format for the Amazon SageMaker AI object detection algorithms is [Apache MXNet RecordIO](https://mxnet.apache.org/api/architecture/note_data_loading). However, you can also use raw images in .jpg or .png format. The algorithm supports only `application/x-image` for inference.

**Note**  
To maintain better interoperability with existing deep learning frameworks, this differs from the protobuf data formats commonly used by other Amazon SageMaker AI algorithms.

See the [Object Detection Sample Notebooks](#object-detection-sample-notebooks) for more details on data formats.

### Train with the RecordIO Format
<a name="object-detection-recordio-training"></a>

If you use the RecordIO format for training, specify both train and validation channels as values for the `InputDataConfig` parameter of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. Specify one RecordIO (.rec) file in the train channel and one RecordIO file in the validation channel. Set the content type for both channels to `application/x-recordio`. An example of how to generate a RecordIO file can be found in the object detection sample notebook. You can also use tools from [MXNet's GluonCV](https://gluon-cv.mxnet.io/build/examples_datasets/recordio.html) to generate RecordIO files for popular datasets such as the [PASCAL Visual Object Classes](http://host.robots.ox.ac.uk/pascal/VOC/) and [Common Objects in Context (COCO)](http://cocodataset.org/#home).
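As a concrete sketch, the two channels might look like the following `InputDataConfig` entries in a `CreateTrainingJob` request. The bucket name and prefixes are placeholders.

```python
# Sketch: InputDataConfig entries for RecordIO training. Each channel points
# at one .rec file and uses the application/x-recordio content type.
def recordio_channel(name, s3_uri):
    """Build one channel definition for the CreateTrainingJob request."""
    return {
        "ChannelName": name,
        "ContentType": "application/x-recordio",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }

input_data_config = [
    recordio_channel("train", "s3://your_bucket/train/train.rec"),
    recordio_channel("validation", "s3://your_bucket/validation/val.rec"),
]
```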

### Train with the Image Format
<a name="object-detection-image-training"></a>

If you use the image format for training, specify `train`, `validation`, `train_annotation`, and `validation_annotation` channels as values for the `InputDataConfig` parameter of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. Specify the individual image data (.jpg or .png) files for the train and validation channels. For annotation data, you can use the JSON format. Specify the corresponding .json files in the `train_annotation` and `validation_annotation` channels. Set the content type for all four channels to `image/png` or `image/jpeg` based on the image type. You can also use the content type `application/x-image` when your dataset contains both .jpg and .png images. The following is an example of a .json file.

```
{
   "file": "your_image_directory/sample_image1.jpg",
   "image_size": [
      {
         "width": 500,
         "height": 400,
         "depth": 3
      }
   ],
   "annotations": [
      {
         "class_id": 0,
         "left": 111,
         "top": 134,
         "width": 61,
         "height": 128
      },
      {
         "class_id": 0,
         "left": 161,
         "top": 250,
         "width": 79,
         "height": 143
      },
      {
         "class_id": 1,
         "left": 101,
         "top": 185,
         "width": 42,
         "height": 130
      }
   ],
   "categories": [
      {
         "class_id": 0,
         "name": "dog"
      },
      {
         "class_id": 1,
         "name": "cat"
      }
   ]
}
```

Each image needs a .json file for annotation, and the .json file should have the same name as the corresponding image. The name of the above .json file should be "sample_image1.json". There are four properties in the annotation .json file. The property "file" specifies the relative path of the image file. For example, if your training images and corresponding .json files are stored in s3://*your_bucket*/train/sample_image and s3://*your_bucket*/train_annotation, specify the paths for your train and train_annotation channels as s3://*your_bucket*/train and s3://*your_bucket*/train_annotation, respectively. 

In the .json file, the relative path for an image named sample_image1.jpg should be sample_image/sample_image1.jpg. The `"image_size"` property specifies the overall image dimensions. The SageMaker AI object detection algorithm currently only supports 3-channel images. The `"annotations"` property specifies the categories and bounding boxes for objects within the image. Each object is annotated by a `"class_id"` index and by four bounding box coordinates (`"left"`, `"top"`, `"width"`, `"height"`). The `"left"` (x-coordinate) and `"top"` (y-coordinate) values represent the upper-left corner of the bounding box. The `"width"` (x-coordinate) and `"height"` (y-coordinate) values represent the dimensions of the bounding box. The origin (0, 0) is the upper-left corner of the entire image. If you have multiple objects within one image, all the annotations should be included in a single .json file. The `"categories"` property stores the mapping between the class index and class name. The class indices should be numbered successively, and the numbering should start with 0. The `"categories"` property is optional for the annotation .json file.
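If you generate annotation files programmatically, a small helper that follows the schema above can keep the files consistent. The following is a sketch; the helper name and directory layout are illustrative, not part of the algorithm's API.

```python
import json

# Sketch: build one annotation dict following the schema shown above.
def build_annotation(relative_path, width, height, boxes, categories):
    """boxes: list of (class_id, left, top, box_width, box_height) tuples,
    with (left, top) the upper-left corner in pixels and the origin (0, 0)
    at the upper-left corner of the image."""
    return {
        "file": relative_path,
        "image_size": [{"width": width, "height": height, "depth": 3}],
        "annotations": [
            {"class_id": c, "left": l, "top": t, "width": w, "height": h}
            for (c, l, t, w, h) in boxes
        ],
        "categories": [{"class_id": c, "name": n} for c, n in categories],
    }

ann = build_annotation(
    "sample_image/sample_image1.jpg", 500, 400,
    boxes=[(0, 111, 134, 61, 128), (1, 101, 185, 42, 130)],
    categories=[(0, "dog"), (1, "cat")],
)
# Write this dict as sample_image1.json, matching the image's file name:
annotation_json = json.dumps(ann)
```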

### Train with Augmented Manifest Image Format
<a name="object-detection-augmented-manifest-training"></a>

The augmented manifest format enables you to train in pipe mode using image files without needing to create RecordIO files. You need to specify both train and validation channels as values for the `InputDataConfig` parameter of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. To use this format, generate an S3 manifest file that contains the list of images and their corresponding annotations. The manifest file must be in [JSON Lines](http://jsonlines.org/) format, in which each line represents one sample. The images are specified using the `'source-ref'` tag that points to the S3 location of the image. The annotations are provided under the `"AttributeNames"` parameter value as specified in the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. A line can also contain additional metadata under the `metadata` tag, but this is ignored by the algorithm. In the following example, the `"AttributeNames"` are contained in the list `["source-ref", "bounding-box"]`:

```
{"source-ref": "s3://your_bucket/image1.jpg", "bounding-box":{"image_size":[{ "width": 500, "height": 400, "depth":3}], "annotations":[{"class_id": 0, "left": 111, "top": 134, "width": 61, "height": 128}, {"class_id": 5, "left": 161, "top": 250, "width": 80, "height": 50}]}, "bounding-box-metadata":{"class-map":{"0": "dog", "5": "horse"}, "type": "groundtruth/object-detection"}}
{"source-ref": "s3://your_bucket/image2.jpg", "bounding-box":{"image_size":[{ "width": 400, "height": 300, "depth":3}], "annotations":[{"class_id": 1, "left": 100, "top": 120, "width": 43, "height": 78}]}, "bounding-box-metadata":{"class-map":{"1": "cat"}, "type": "groundtruth/object-detection"}}
```

The order of `"AttributeNames"` in the input files matters when training the Object Detection algorithm. It accepts piped data in a specific order, with `image` first, followed by `annotations`. So the `"AttributeNames"` in this example are provided with `"source-ref"` first, followed by `"bounding-box"`. When using Object Detection with an augmented manifest, the value of the `RecordWrapperType` parameter must be set to `"RecordIO"`.
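A manifest like the one above can be generated with a few lines of Python. The sketch below is illustrative; the helper name is hypothetical, and it relies on Python 3.7+ preserving dict insertion order so that `"source-ref"` is emitted first, as the algorithm requires.

```python
import json

# Sketch: emit one augmented-manifest JSON Lines record matching the example
# above. "source-ref" is inserted first so the image precedes its annotations.
def manifest_line(s3_uri, width, height, annotations, class_map):
    record = {
        "source-ref": s3_uri,
        "bounding-box": {
            "image_size": [{"width": width, "height": height, "depth": 3}],
            "annotations": annotations,
        },
        "bounding-box-metadata": {
            "class-map": class_map,
            "type": "groundtruth/object-detection",
        },
    }
    return json.dumps(record)

line = manifest_line(
    "s3://your_bucket/image1.jpg", 500, 400,
    [{"class_id": 0, "left": 111, "top": 134, "width": 61, "height": 128}],
    {"0": "dog"},
)
# Write one such line per image to the manifest file in S3.
```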

For more information on augmented manifest files, see [Augmented Manifest Files for Training Jobs](augmented-manifest.md).

### Incremental Training
<a name="object-detection-incremental-training"></a>

You can also seed the training of a new model with the artifacts from a model that you trained previously with SageMaker AI. Incremental training saves training time when you want to train a new model with the same or similar data. SageMaker AI object detection models can be seeded only with another built-in object detection model trained in SageMaker AI.

To use a pretrained model, in the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, specify the `ChannelName` as "model" in the `InputDataConfig` parameter. Set the `ContentType` for the model channel to `application/x-sagemaker-model`. The input hyperparameters of both the new model and the pretrained model that you upload to the model channel must have the same settings for the `base_network` and `num_classes` input parameters. These parameters define the network architecture. For the pretrained model file, use the compressed model artifacts (in .tar.gz format) output by SageMaker AI. You can use either RecordIO or image formats for input data.
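For illustration, the additional model channel might look like the following `InputDataConfig` entry, added alongside the usual data channels. The S3 path is a placeholder for a previous training job's `model.tar.gz` output.

```python
# Sketch: the extra "model" channel that seeds incremental training with a
# previously trained SageMaker object detection model's artifacts.
model_channel = {
    "ChannelName": "model",
    "ContentType": "application/x-sagemaker-model",
    "DataSource": {
        "S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://your_bucket/previous-job/output/model.tar.gz",
            "S3DataDistributionType": "FullyReplicated",
        }
    },
}
```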

For more information on incremental training and for instructions on how to use it, see [Use Incremental Training in Amazon SageMaker AI](incremental-training.md). 

## EC2 Instance Recommendation for the Object Detection Algorithm
<a name="object-detection-instances"></a>

The object detection algorithm supports P2, P3, G4dn, and G5 GPU instance families. We recommend using GPU instances with more memory for training with large batch sizes. You can run the object detection algorithm on multi-GPU and multi-machine settings for distributed training.

You can use both CPU (such as C5 and M5) and GPU (such as P3 and G4dn) instances for inference.

## Object Detection Sample Notebooks
<a name="object-detection-sample-notebooks"></a>

For a sample notebook that shows how to use the SageMaker AI Object Detection algorithm to train and host a model on the [Caltech Birds (CUB 200 2011)](http://www.vision.caltech.edu/datasets/cub_200_2011/) dataset using the Single Shot multibox Detector algorithm, see [Amazon SageMaker AI Object Detection for Bird Species](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/object_detection_birds/object_detection_birds.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The object detection example notebook using the Object Detection algorithm is located in the **Introduction to Amazon Algorithms** section. To open a notebook, choose its **Use** tab and select **Create copy**.

For more information about the Amazon SageMaker AI Object Detection algorithm, see the following blog posts:
+ [Training the Amazon SageMaker AI object detection model and running it on AWS IoT Greengrass – Part 1 of 3: Preparing training data](https://aws.amazon.com/blogs/iot/sagemaker-object-detection-greengrass-part-1-of-3/)
+ [Training the Amazon SageMaker AI object detection model and running it on AWS IoT Greengrass – Part 2 of 3: Training a custom object detection model](https://aws.amazon.com/blogs/iot/sagemaker-object-detection-greengrass-part-2-of-3/)
+ [Training the Amazon SageMaker AI object detection model and running it on AWS IoT Greengrass – Part 3 of 3: Deploying to the edge](https://aws.amazon.com/blogs/iot/sagemaker-object-detection-greengrass-part-3-of-3/)

# How Object Detection Works
<a name="algo-object-detection-tech-notes"></a>

The object detection algorithm identifies and locates all instances of objects in an image from a known collection of object categories. The algorithm takes an image as input and outputs the category that the object belongs to, along with a confidence score that it belongs to the category. The algorithm also predicts the object's location and scale with a rectangular bounding box. Amazon SageMaker AI Object Detection uses the [Single Shot multibox Detector (SSD)](https://arxiv.org/pdf/1512.02325.pdf) algorithm that takes a convolutional neural network (CNN) pretrained for classification task as the base network. SSD uses the output of intermediate layers as features for detection. 

Various CNNs such as [VGG](https://arxiv.org/pdf/1409.1556.pdf) and [ResNet](https://arxiv.org/pdf/1603.05027.pdf) have achieved great performance on the image classification task. Object detection in Amazon SageMaker AI supports both VGG-16 and ResNet-50 as a base network for SSD. The algorithm can be trained in full training mode or in transfer learning mode. In full training mode, the base network is initialized with random weights and then trained on user data. In transfer learning mode, the base network weights are loaded from pretrained models.

The object detection algorithm uses standard data augmentation operations, such as flip, rescale, and jitter, on the fly internally to help avoid overfitting.

# Object Detection Hyperparameters
<a name="object-detection-api-config"></a>

In the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request, you specify the training algorithm that you want to use. You can also specify algorithm-specific hyperparameters that are used to help estimate the parameters of the model from a training dataset. The following table lists the hyperparameters provided by Amazon SageMaker AI for training the object detection algorithm. For more information about how object detection training works, see [How Object Detection Works](algo-object-detection-tech-notes.md).


| Parameter Name | Description | 
| --- | --- | 
| num_classes |  The number of output classes. This parameter defines the dimensions of the network output and is typically set to the number of classes in the dataset. **Required** Valid values: positive integer  | 
| num_training_samples |  The number of training examples in the input dataset.  If there is a mismatch between this value and the number of samples in the training set, then the behavior of the `lr_scheduler_step` parameter is undefined and distributed training accuracy might be affected.  **Required** Valid values: positive integer  | 
| base_network |  The base network architecture to use. **Optional** Valid values: 'vgg-16' or 'resnet-50' Default value: 'vgg-16'  | 
| early_stopping |  `True` to use early stopping logic during training. `False` not to use it. **Optional** Valid values: `True` or `False` Default value: `False`  | 
| early_stopping_min_epochs |  The minimum number of epochs that must be run before the early stopping logic can be invoked. It is used only when `early_stopping` = `True`. **Optional** Valid values: positive integer Default value: 10  | 
| early_stopping_patience |  The number of epochs to wait before ending training if no improvement, as defined by the `early_stopping_tolerance` hyperparameter, is made in the relevant metric. It is used only when `early_stopping` = `True`. **Optional** Valid values: positive integer Default value: 5  | 
| early_stopping_tolerance |  The tolerance value that the relative improvement in `validation:mAP`, the mean average precision (mAP), is required to exceed to avoid early stopping. If the ratio of the change in the mAP divided by the previous best mAP is smaller than the `early_stopping_tolerance` value set, early stopping considers that there is no improvement. It is used only when `early_stopping` = `True`. **Optional** Valid values: 0 ≤ float ≤ 1 Default value: 0.0  | 
| image_shape |  The image size for input images. We rescale the input image to a square image with this size. We recommend using 300 and 512 for better performance. **Optional** Valid values: positive integer ≥ 300 Default: 300  | 
| epochs |  The number of training epochs.  **Optional** Valid values: positive integer Default: 30  | 
| freeze_layer_pattern |  The regular expression (regex) for freezing layers in the base network. For example, if we set `freeze_layer_pattern` = `"^(conv1_\|conv2_).*"`, then any layers with a name that contains `"conv1_"` or `"conv2_"` are frozen, which means that the weights for these layers are not updated during training. The layer names can be found in the network symbol files [vgg16-symbol.json](http://data.mxnet.io/models/imagenet/vgg/vgg16-symbol.json) and [resnet-50-symbol.json](http://data.mxnet.io/models/imagenet/resnet/50-layers/resnet-50-symbol.json). Freezing a layer means that its weights cannot be modified further. This can reduce training time significantly in exchange for modest losses in accuracy. This technique is commonly used in transfer learning where the lower layers in the base network do not need to be retrained. **Optional** Valid values: string Default: No layers frozen.  | 
| kv_store |  The weight update synchronization mode used for distributed training. The weights can be updated either synchronously or asynchronously across machines. Synchronous updates typically provide better accuracy than asynchronous updates but can be slower. See the [Distributed Training](https://mxnet.apache.org/api/faq/distributed_training) MXNet tutorial for details.  This parameter is not applicable to single machine training.  **Optional** Valid values: `'dist_sync'` or `'dist_async'` [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/object-detection-api-config.html) Default: -  | 
| label_width |  The force padding label width used to sync across training and validation data. For example, if one image in the data contains at most 10 objects, and each object's annotation is specified with 5 numbers, [class_id, left, top, width, height], then the `label_width` should be no smaller than (10 \* 5 + header information length). The header information length is usually 2. We recommend using a slightly larger `label_width` for the training, such as 60 for this example. **Optional** Valid values: Positive integer large enough to accommodate the largest annotation information length in the data. Default: 350  | 
| learning_rate |  The initial learning rate. **Optional** Valid values: float in (0, 1] Default: 0.001  | 
| lr_scheduler_factor |  The ratio to reduce learning rate. Used in conjunction with the `lr_scheduler_step` parameter, defined as `lr_new` = `lr_old` \* `lr_scheduler_factor`. **Optional** Valid values: float in (0, 1) Default: 0.1  | 
| lr_scheduler_step |  The epochs at which to reduce the learning rate. The learning rate is reduced by `lr_scheduler_factor` at epochs listed in a comma-delimited string: "epoch1, epoch2, ...". For example, if the value is set to "10, 20" and the `lr_scheduler_factor` is set to 1/2, then the learning rate is halved after the 10th epoch and then halved again after the 20th epoch. **Optional** Valid values: string Default: empty string  | 
| mini_batch_size |  The batch size for training. In a single-machine multi-GPU setting, each GPU handles `mini_batch_size`/`num_gpu` training samples. For multi-machine training in `dist_sync` mode, the actual batch size is `mini_batch_size` \* the number of machines. A large `mini_batch_size` usually leads to faster training, but it can cause an out-of-memory problem. The memory usage is related to `mini_batch_size`, `image_shape`, and `base_network` architecture. For example, on a single p3.2xlarge instance, the largest `mini_batch_size` without an out-of-memory error is 32 with `base_network` set to "resnet-50" and an `image_shape` of 300. With the same instance, you can use 64 as the `mini_batch_size` with the base network `vgg-16` and an `image_shape` of 300. **Optional** Valid values: positive integer Default: 32  | 
| momentum |  The momentum for `sgd`. Ignored for other optimizers. **Optional** Valid values: float in (0, 1] Default: 0.9  | 
| nms_threshold |  The non-maximum suppression threshold. **Optional** Valid values: float in (0, 1] Default: 0.45  | 
| optimizer |  The optimizer type. For details on optimizer values, see [MXNet's API](https://mxnet.apache.org/api/python/docs/api/). **Optional** Valid values: ['sgd', 'adam', 'rmsprop', 'adadelta'] Default: 'sgd'  | 
| overlap_threshold |  The evaluation overlap threshold. **Optional** Valid values: float in (0, 1] Default: 0.5  | 
| use_pretrained_model |  Indicates whether to use a pretrained model for training. If set to 1, then the pretrained model with the corresponding architecture is loaded and used for training. Otherwise, the network is trained from scratch. **Optional** Valid values: 0 or 1 Default: 1  | 
| weight_decay |  The weight decay coefficient for `sgd` and `rmsprop`. Ignored for other optimizers. **Optional** Valid values: float in (0, 1) Default: 0.0005  | 
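To make the `lr_scheduler_step` / `lr_scheduler_factor` semantics concrete, the following small function reproduces the schedule as described in the table (the learning rate is multiplied by the factor at each listed epoch). This is a sketch of the documented behavior, not the algorithm's internal code.

```python
# Sketch of the documented learning-rate schedule: multiply the learning rate
# by lr_scheduler_factor at every epoch listed in lr_scheduler_step.
def learning_rate_at(epoch, base_lr=0.001, steps="10, 20", factor=0.1):
    boundaries = [int(s) for s in steps.split(",")] if steps else []
    lr = base_lr
    for boundary in boundaries:
        if epoch >= boundary:
            lr *= factor
    return lr

# With steps "10, 20" and factor 1/2, the rate is halved after the 10th
# epoch and halved again after the 20th, as in the table's example.
```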

# Tune an Object Detection Model
<a name="object-detection-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics Computed by the Object Detection Algorithm
<a name="object-detection-metrics"></a>

The object detection algorithm reports on a single metric during training: `validation:mAP`. When tuning a model, choose this metric as the objective metric.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:mAP |  Mean Average Precision (mAP) computed on the validation set.  |  Maximize  | 



## Tunable Object Detection Hyperparameters
<a name="object-detection-tunable-hyperparameters"></a>

Tune the Amazon SageMaker AI object detection model with the following hyperparameters. The hyperparameters that have the greatest impact on the object detection objective metric are: `mini_batch_size`, `learning_rate`, and `optimizer`.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| learning_rate |  ContinuousParameterRanges  |  MinValue: 1e-6, MaxValue: 0.5  | 
| mini_batch_size |  IntegerParameterRanges  |  MinValue: 8, MaxValue: 64  | 
| momentum |  ContinuousParameterRanges  |  MinValue: 0.0, MaxValue: 0.999  | 
| optimizer |  CategoricalParameterRanges  |  ['sgd', 'adam', 'rmsprop', 'adadelta']  | 
| weight_decay |  ContinuousParameterRanges  |  MinValue: 0.0, MaxValue: 0.999  | 

# Object Detection Request and Response Formats
<a name="object-detection-in-formats"></a>

The following page describes the inference request and response formats for the Amazon SageMaker AI Object Detection - MXNet model.

## Request Format
<a name="object-detection-json"></a>

Query a trained model by using the model's endpoint. The endpoint takes .jpg and .png image formats with the `image/jpeg` and `image/png` content types.
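As a sketch, an inference request can be issued through the SageMaker runtime's `invoke_endpoint` call (for example, with boto3). The endpoint name below is a placeholder, and the helper function is illustrative.

```python
# Sketch: build the request parameters for invoking an object detection
# endpoint with raw image bytes. Pass these to
# boto3.client("sagemaker-runtime").invoke_endpoint(**args).
def build_invoke_args(endpoint_name, image_bytes, content_type="image/jpeg"):
    return {
        "EndpointName": endpoint_name,
        "ContentType": content_type,
        "Body": image_bytes,
    }

# Placeholder bytes stand in for a real .jpg file read from disk.
args = build_invoke_args("my-object-detection-endpoint", b"\xff\xd8...")
```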

## Response Formats
<a name="object-detection-recordio"></a>

The response is the class index with a confidence score and bounding box coordinates for all objects within the image, encoded in JSON format. The following is an example of a response .json file:

```
{"prediction":[
  [4.0, 0.86419455409049988, 0.3088374733924866, 0.07030484080314636, 0.7110607028007507, 0.9345266819000244],
  [0.0, 0.73376623392105103, 0.5714187026023865, 0.40427327156066895, 0.827075183391571, 0.9712159633636475],
  [4.0, 0.32643985450267792, 0.3677481412887573, 0.034883320331573486, 0.6318609714508057, 0.5967587828636169],
  [8.0, 0.22552496790885925, 0.6152569651603699, 0.5722782611846924, 0.882301390171051, 0.8985623121261597],
  [3.0, 0.42260299175977707, 0.019305512309074402, 0.08386176824569702, 0.39093565940856934, 0.9574796557426453]
]}
```

Each row in this .json file contains an array that represents a detected object. Each of these object arrays consists of a list of six numbers. The first number is the predicted class label. The second number is the associated confidence score for the detection. The last four numbers represent the bounding box coordinates [xmin, ymin, xmax, ymax]. These output bounding box corner indices are normalized by the overall image size. Note that this encoding is different from that used by the input .json format. For example, in the first entry of the detection result, 0.3088374733924866 is the left coordinate (x-coordinate of the upper-left corner) of the bounding box as a ratio of the overall image width, 0.07030484080314636 is the top coordinate (y-coordinate of the upper-left corner) as a ratio of the overall image height, 0.7110607028007507 is the right coordinate (x-coordinate of the lower-right corner) as a ratio of the overall image width, and 0.9345266819000244 is the bottom coordinate (y-coordinate of the lower-right corner) as a ratio of the overall image height. 

To avoid unreliable detection results, you might want to filter out the detection results with low confidence scores. In the [object detection sample notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/object_detection_birds/object_detection_birds.ipynb), we provide examples of scripts that use a threshold to remove low confidence detections and to plot bounding boxes on the original images.
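The filtering and coordinate conversion described above can be sketched in a few lines. This is an illustrative helper (not the sample notebook's code): it drops detections below a confidence threshold and converts the normalized corners into pixel coordinates.

```python
# Sketch: filter detections by confidence and convert the normalized
# [xmin, ymin, xmax, ymax] corners into pixel coordinates.
def filter_detections(prediction, image_width, image_height, threshold=0.5):
    results = []
    for class_id, score, xmin, ymin, xmax, ymax in prediction:
        if score < threshold:
            continue  # drop low-confidence detections
        results.append({
            "class_id": int(class_id),
            "score": score,
            "box": (
                int(xmin * image_width), int(ymin * image_height),
                int(xmax * image_width), int(ymax * image_height),
            ),
        })
    return results

response = {"prediction": [
    [4.0, 0.864, 0.309, 0.070, 0.711, 0.935],
    [4.0, 0.326, 0.368, 0.035, 0.632, 0.597],
]}
kept = filter_detections(response["prediction"], 500, 400)
# Only the first detection survives the default 0.5 threshold.
```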

For batch transform, the response is in JSON format, identical to the JSON format described above. The detection results of each image are represented as a JSON file. For example:

```
{"prediction": [[label_id, confidence_score, xmin, ymin, xmax, ymax], [label_id, confidence_score, xmin, ymin, xmax, ymax]]}
```

For more details on training and inference, see the [Object Detection Sample Notebooks](object-detection.md#object-detection-sample-notebooks).

## OUTPUT: JSON Response Format
<a name="object-detection-output-json"></a>

`accept: application/json;annotation=1`

```
{
   "image_size": [
      {
         "width": 500,
         "height": 400,
         "depth": 3
      }
   ],
   "annotations": [
      {
         "class_id": 0,
         "score": 0.943,
         "left": 111,
         "top": 134,
         "width": 61,
         "height": 128
      },
      {
         "class_id": 0,
         "score": 0.0013,
         "left": 161,
         "top": 250,
         "width": 79,
         "height": 143
      },
      {
         "class_id": 1,
         "score": 0.0133,
         "left": 101,
         "top": 185,
         "width": 42,
         "height": 130
      }
   ]
}
```

# Object Detection - TensorFlow
<a name="object-detection-tensorflow"></a>

The Amazon SageMaker AI Object Detection - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the [TensorFlow Model Garden](https://github.com/tensorflow/models). Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of image data is not available. The object detection algorithm takes an image as input and outputs a list of bounding boxes. Training datasets must consist of images in `.jpg`, `.jpeg`, or `.png` format. This page includes information about Amazon EC2 instance recommendations and sample notebooks for Object Detection - TensorFlow.

**Topics**
+ [How to use the SageMaker AI Object Detection - TensorFlow algorithm](object-detection-tensorflow-how-to-use.md)
+ [Input and output interface for the Object Detection - TensorFlow algorithm](object-detection-tensorflow-inputoutput.md)
+ [Amazon EC2 instance recommendation for the Object Detection - TensorFlow algorithm](#object-detection-tensorflow-instances)
+ [Object Detection - TensorFlow sample notebooks](#object-detection-tensorflow-sample-notebooks)
+ [How Object Detection - TensorFlow Works](object-detection-tensorflow-HowItWorks.md)
+ [TensorFlow Models](object-detection-tensorflow-Models.md)
+ [Object Detection - TensorFlow Hyperparameters](object-detection-tensorflow-Hyperparameter.md)
+ [Tune an Object Detection - TensorFlow model](object-detection-tensorflow-tuning.md)

# How to use the SageMaker AI Object Detection - TensorFlow algorithm
<a name="object-detection-tensorflow-how-to-use"></a>

You can use Object Detection - TensorFlow as an Amazon SageMaker AI built-in algorithm. The following section describes how to use Object Detection - TensorFlow with the SageMaker AI Python SDK. For information on how to use Object Detection - TensorFlow from the Amazon SageMaker Studio Classic UI, see [SageMaker JumpStart pretrained models](studio-jumpstart.md).

The Object Detection - TensorFlow algorithm supports transfer learning using any of the compatible pretrained TensorFlow models. For a list of all available pretrained models, see [TensorFlow Models](object-detection-tensorflow-Models.md). Every pretrained model has a unique `model_id`. The following example uses ResNet50 (`model_id`: `tensorflow-od1-ssd-resnet50-v1-fpn-640x640-coco17-tpu-8`) to fine-tune on a custom dataset. The pretrained models are all pre-downloaded from the TensorFlow Hub and stored in Amazon S3 buckets so that training jobs can run in network isolation. Use these pre-generated model training artifacts to construct a SageMaker AI Estimator.

First, retrieve the Docker image URI, training script URI, and pretrained model URI. Then, change the hyperparameters as you see fit. You can see a Python dictionary of all available hyperparameters and their default values with `hyperparameters.retrieve_default`. For more information, see [Object Detection - TensorFlow Hyperparameters](object-detection-tensorflow-Hyperparameter.md). Use these values to construct a SageMaker AI Estimator.

**Note**  
Default hyperparameter values are different for different models. For example, for larger models, the default number of epochs is smaller. 

This example uses the [Penn-Fudan pedestrian detection](https://www.cis.upenn.edu/~jshi/ped_html/#pub1) dataset, which contains images of pedestrians in the street. We pre-downloaded the dataset and made it available in Amazon S3. To fine-tune your model, call `.fit` using the Amazon S3 location of your training dataset.

```
import sagemaker
from sagemaker import image_uris, model_uris, script_uris, hyperparameters
from sagemaker.estimator import Estimator

# Set up the SageMaker AI session, Region, and execution role used below
sess = sagemaker.Session()
aws_region = sess.boto_region_name
aws_role = sagemaker.get_execution_role()

model_id, model_version = "tensorflow-od1-ssd-resnet50-v1-fpn-640x640-coco17-tpu-8", "*"
training_instance_type = "ml.p3.2xlarge"

# Retrieve the Docker image
train_image_uri = image_uris.retrieve(
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
    region=None,
    framework=None,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(model_id=model_id, model_version=model_version, script_scope="training")

# Retrieve the pretrained model tarball for transfer learning
train_model_uri = model_uris.retrieve(model_id=model_id, model_version=model_version, model_scope="training")

# Retrieve the default hyperparameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "5"

# Sample training data is available in this bucket
training_data_bucket = f"jumpstart-cache-prod-{aws_region}"
training_data_prefix = "training-datasets/PennFudanPed_COCO_format/"

training_dataset_s3_path = f"s3://{training_data_bucket}/{training_data_prefix}"

output_bucket = sess.default_bucket()
output_prefix = "jumpstart-example-od-training"
s3_output_location = f"s3://{output_bucket}/{output_prefix}/output"

# Create an Estimator instance
tf_od_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
)

# Launch a training job
tf_od_estimator.fit({"training": training_dataset_s3_path}, logs=True)
```

For more information about how to use the SageMaker AI Object Detection - TensorFlow algorithm for transfer learning on a custom dataset, see the [Introduction to SageMaker TensorFlow - Object Detection](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/object_detection_tensorflow/Amazon_Tensorflow_Object_Detection.ipynb) notebook.

# Input and output interface for the Object Detection - TensorFlow algorithm
<a name="object-detection-tensorflow-inputoutput"></a>

Each of the pretrained models listed in TensorFlow Models can be fine-tuned to any dataset with any number of image classes. Be mindful of how to format your training data for input to the Object Detection - TensorFlow model.
+ **Training data input format:** Your training data should be a directory with an `images` subdirectory and an `annotations.json` file. 

The following is an example of an input directory structure. The input directory should be hosted in an Amazon S3 bucket with a path similar to the following: `s3://bucket_name/input_directory/`. Note that the trailing `/` is required.

```
input_directory
    |--images
        |--abc.png
        |--def.png
    |--annotations.json
```

The `annotations.json` file should contain information for bounding boxes and their class labels in the form of a dictionary with `"images"` and `"annotations"` keys. The value for the `"images"` key should be a list of dictionaries, one for each image, with the following information: `{"file_name": image_name, "height": height, "width": width, "id": image_id}`. The value for the `"annotations"` key should also be a list of dictionaries, one for each bounding box, with the following information: `{"image_id": image_id, "bbox": [xmin, ymin, xmax, ymax], "category_id": bbox_label}`.
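As a sketch of that layout, the following builds an `annotations.json` for a hypothetical two-image dataset (the file names, sizes, and boxes are illustrative, not taken from a real dataset):

```
import json

# Hypothetical two-image dataset illustrating the annotations.json layout
annotations = {
    "images": [
        {"file_name": "abc.png", "height": 400, "width": 500, "id": 0},
        {"file_name": "def.png", "height": 400, "width": 500, "id": 1},
    ],
    "annotations": [
        # bbox uses absolute pixel coordinates: [xmin, ymin, xmax, ymax]
        {"image_id": 0, "bbox": [111, 134, 172, 262], "category_id": 0},
        {"image_id": 1, "bbox": [101, 185, 143, 315], "category_id": 1},
    ],
}

with open("annotations.json", "w") as f:
    json.dump(annotations, f)
```

Each bounding box dictionary refers back to its image through `image_id`, so one image can carry any number of annotations.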

After training, a label mapping file and trained model are saved to your Amazon S3 bucket.

## Incremental training
<a name="object-detection-tensorflow-incremental-training"></a>

You can seed the training of a new model with artifacts from a model that you trained previously with SageMaker AI. Incremental training saves training time when you want to train a new model with the same or similar data.

**Note**  
You can only seed a SageMaker AI Object Detection - TensorFlow model with another Object Detection - TensorFlow model trained in SageMaker AI. 

You can use any dataset for incremental training, as long as the set of classes remains the same. The incremental training step is similar to the fine-tuning step, but instead of starting with a pretrained model, you start with an existing fine-tuned model. For more information about how to use incremental training with the SageMaker AI Object Detection - TensorFlow algorithm, see the [Introduction to SageMaker TensorFlow - Object Detection](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/object_detection_tensorflow/Amazon_Tensorflow_Object_Detection.ipynb) notebook.

## Inference with the Object Detection - TensorFlow algorithm
<a name="object-detection-tensorflow-inference"></a>

You can host the fine-tuned model that results from your TensorFlow Object Detection training for inference. Any input image for inference must be in `.jpg`, `.jpeg`, or `.png` format and have content type `application/x-image`. The Object Detection - TensorFlow algorithm resizes input images automatically. 

Running inference results in bounding boxes, predicted classes, and the scores of each prediction encoded in JSON format. The Object Detection - TensorFlow model processes a single image per request and outputs only one line. The following is an example of a JSON format response:

```
accept: application/json;verbose

{"normalized_boxes":[[xmin1, xmax1, ymin1, ymax1],....], 
    "classes":[classidx1, class_idx2,...], 
    "scores":[score_1, score_2,...], 
    "labels": [label1, label2, ...], 
    "tensorflow_model_output":<original output of the model>}
```

If `accept` is set to `application/json`, then the model only outputs normalized boxes, classes, and scores. 
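Because the boxes are normalized, a common post-processing step is to scale them back to pixel coordinates before drawing them. The following is a minimal sketch assuming the `[xmin, xmax, ymin, ymax]` coordinate order shown in the example response above; the image dimensions are illustrative:

```
# Scale normalized [xmin, xmax, ymin, ymax] boxes to pixel coordinates,
# preserving the coordinate order of the input.
def to_pixel_boxes(normalized_boxes, image_width, image_height):
    return [
        [int(xmin * image_width), int(xmax * image_width),
         int(ymin * image_height), int(ymax * image_height)]
        for xmin, xmax, ymin, ymax in normalized_boxes
    ]

boxes = to_pixel_boxes([[0.1, 0.5, 0.2, 0.8]], image_width=500, image_height=400)
print(boxes)  # [[50, 250, 80, 320]]
```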

## Amazon EC2 instance recommendation for the Object Detection - TensorFlow algorithm
<a name="object-detection-tensorflow-instances"></a>

The Object Detection - TensorFlow algorithm supports all GPU instances for training, including:
+ `ml.p2.xlarge`
+ `ml.p2.16xlarge`
+ `ml.p3.2xlarge`
+ `ml.p3.16xlarge`

We recommend GPU instances with more memory for training with large batch sizes. Both CPU (such as M5) and GPU (P2 or P3) instances can be used for inference. For a comprehensive list of SageMaker training and inference instances across AWS Regions, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

## Object Detection - TensorFlow sample notebooks
<a name="object-detection-tensorflow-sample-notebooks"></a>

For more information about how to use the SageMaker AI Object Detection - TensorFlow algorithm for transfer learning on a custom dataset, see the [Introduction to SageMaker TensorFlow - Object Detection](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/object_detection_tensorflow/Amazon_Tensorflow_Object_Detection.ipynb) notebook.

For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you have created a notebook instance and opened it, select the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. To open a notebook, choose its **Use** tab and choose **Create copy**.

# How Object Detection - TensorFlow Works
<a name="object-detection-tensorflow-HowItWorks"></a>

The Object Detection - TensorFlow algorithm takes an image as input and predicts bounding boxes and object labels. Various deep learning networks such as MobileNet, ResNet, Inception, and EfficientNet are highly accurate for object detection. There are also deep learning networks that are trained on large image datasets, such as Common Objects in Context (COCO), which has 328,000 images. After a network is trained with COCO data, you can then fine-tune the network on a dataset with a particular focus to perform more specific object detection tasks. The Amazon SageMaker AI Object Detection - TensorFlow algorithm supports transfer learning on many pretrained models that are available in the TensorFlow Model Garden.

An object detection layer, sized according to the number of class labels in your training data, is attached to the pretrained TensorFlow model of your choice. You can then fine-tune either the entire network (including the pretrained model) or only the top classification layer on new training data. With this method of transfer learning, training with smaller datasets is possible.

# TensorFlow Models
<a name="object-detection-tensorflow-Models"></a>

The following pretrained models are available to use for transfer learning with the Object Detection - TensorFlow algorithm. 

The following models vary significantly in size, number of model parameters, training time, and inference latency for any given dataset. The best model for your use case depends on the complexity of your fine-tuning dataset and any requirements that you have on training time, inference latency, or model accuracy.


| Model Name | `model_id` | Source | 
| --- | --- | --- | 
| ResNet50 V1 FPN 640 | `tensorflow-od1-ssd-resnet50-v1-fpn-640x640-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8.tar.gz) | 
| EfficientDet D0 512 | `tensorflow-od1-ssd-efficientdet-d0-512x512-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d0_coco17_tpu-32.tar.gz) | 
| EfficientDet D1 640 | `tensorflow-od1-ssd-efficientdet-d1-640x640-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d1_coco17_tpu-32.tar.gz) | 
| EfficientDet D2 768 | `tensorflow-od1-ssd-efficientdet-d2-768x768-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d2_coco17_tpu-32.tar.gz) | 
| EfficientDet D3 896 | `tensorflow-od1-ssd-efficientdet-d3-896x896-coco17-tpu-32` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/efficientdet_d3_coco17_tpu-32.tar.gz) | 
| MobileNet V1 FPN 640 | `tensorflow-od1-ssd-mobilenet-v1-fpn-640x640-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v1_fpn_640x640_coco17_tpu-8.tar.gz) | 
| MobileNet V2 FPNLite 320 | `tensorflow-od1-ssd-mobilenet-v2-fpnlite-320x320-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8.tar.gz) | 
| MobileNet V2 FPNLite 640 | `tensorflow-od1-ssd-mobilenet-v2-fpnlite-640x640-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz) | 
| ResNet50 V1 FPN 1024 | `tensorflow-od1-ssd-resnet50-v1-fpn-1024x1024-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet50_v1_fpn_1024x1024_coco17_tpu-8.tar.gz) | 
| ResNet101 V1 FPN 640 | `tensorflow-od1-ssd-resnet101-v1-fpn-640x640-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet101_v1_fpn_640x640_coco17_tpu-8.tar.gz) | 
| ResNet101 V1 FPN 1024 | `tensorflow-od1-ssd-resnet101-v1-fpn-1024x1024-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet101_v1_fpn_1024x1024_coco17_tpu-8.tar.gz) | 
| ResNet152 V1 FPN 640 | `tensorflow-od1-ssd-resnet152-v1-fpn-640x640-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet152_v1_fpn_640x640_coco17_tpu-8.tar.gz) | 
| ResNet152 V1 FPN 1024 | `tensorflow-od1-ssd-resnet152-v1-fpn-1024x1024-coco17-tpu-8` | [TensorFlow Model Garden link](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_resnet152_v1_fpn_1024x1024_coco17_tpu-8.tar.gz) | 

# Object Detection - TensorFlow Hyperparameters
<a name="object-detection-tensorflow-Hyperparameter"></a>

Hyperparameters are parameters that are set before a machine learning model begins learning. The following hyperparameters are supported by the Amazon SageMaker AI built-in Object Detection - TensorFlow algorithm. See [Tune an Object Detection - TensorFlow model](object-detection-tensorflow-tuning.md) for information on hyperparameter tuning. 


| Parameter Name | Description | 
| --- | --- | 
| `batch_size` |  The batch size for training.  Valid values: positive integer. Default value: `3`.  | 
| `beta_1` |  The beta1 for the `"adam"` optimizer. Represents the exponential decay rate for the first moment estimates. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.9`.  | 
| `beta_2` |  The beta2 for the `"adam"` optimizer. Represents the exponential decay rate for the second moment estimates. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.999`.  | 
| `early_stopping` |  Set to `"True"` to use early stopping logic during training. If `"False"`, early stopping is not used. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 
| `early_stopping_min_delta` |  The minimum change needed to qualify as an improvement. An absolute change less than the value of `early_stopping_min_delta` does not qualify as improvement. Used only when `early_stopping` is set to `"True"`. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.0`.  | 
| `early_stopping_patience` |  The number of epochs to continue training with no improvement. Used only when `early_stopping` is set to `"True"`. Valid values: positive integer. Default value: `5`.  | 
| `epochs` |  The number of training epochs. Valid values: positive integer. Default value: `5` for smaller models, `1` for larger models.  | 
| `epsilon` |  The epsilon for the `"adam"`, `"rmsprop"`, `"adadelta"`, and `"adagrad"` optimizers. Usually set to a small value to avoid division by 0. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `1e-7`.  | 
| `initial_accumulator_value` |  The starting value for the accumulators, or the per-parameter momentum values, for the `"adagrad"` optimizer. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.1`.  | 
| `learning_rate` |  The optimizer learning rate. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.001`.  | 
| `momentum` |  The momentum for the `"sgd"` and `"nesterov"` optimizers. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.9`.  | 
| `optimizer` |  The optimizer type. For more information, see [Optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers) in the TensorFlow documentation. Valid values: string, any of the following: (`"adam"`, `"sgd"`, `"nesterov"`, `"rmsprop"`, `"adagrad"`, `"adadelta"`). Default value: `"adam"`.  | 
| `reinitialize_top_layer` |  If set to `"Auto"`, the top classification layer parameters are re-initialized during fine-tuning. For incremental training, top classification layer parameters are not re-initialized unless set to `"True"`. Valid values: string, any of the following: (`"Auto"`, `"True"`, or `"False"`). Default value: `"Auto"`.  | 
| `rho` |  The discounting factor for the gradient of the `"adadelta"` and `"rmsprop"` optimizers. Ignored for other optimizers. Valid values: float, range: [`0.0`, `1.0`]. Default value: `0.95`.  | 
| `train_only_on_top_layer` |  If `"True"`, only the top classification layer parameters are fine-tuned. If `"False"`, all model parameters are fine-tuned. Valid values: string, either: (`"True"` or `"False"`). Default value: `"False"`.  | 

# Tune an Object Detection - TensorFlow model
<a name="object-detection-tensorflow-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

For more information about model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

## Metrics computed by the Object Detection - TensorFlow algorithm
<a name="object-detection-tensorflow-metrics"></a>

Refer to the following chart to find which metrics are computed by the Object Detection - TensorFlow algorithm.


| Metric Name | Description | Optimization Direction | Regex Pattern | 
| --- | --- | --- | --- | 
| `validation:localization_loss` | The localization loss for box prediction. | Minimize | `Val_localization=([0-9\\.]+)` | 

## Tunable Object Detection - TensorFlow hyperparameters
<a name="object-detection-tensorflow-tunable-hyperparameters"></a>

Tune an object detection model with the following hyperparameters. The hyperparameters that have the greatest impact on object detection objective metrics are: `batch_size`, `learning_rate`, and `optimizer`. Tune the optimizer-related hyperparameters, such as `momentum`, `regularizers_l2`, `beta_1`, `beta_2`, and `eps` based on the selected `optimizer`. For example, use `beta_1` and `beta_2` only when `adam` is the `optimizer`.

For more information about which hyperparameters are used for each `optimizer`, see [Object Detection - TensorFlow Hyperparameters](object-detection-tensorflow-Hyperparameter.md).


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| `batch_size` | IntegerParameterRanges | MinValue: 8, MaxValue: 512 | 
| `beta_1` | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| `beta_2` | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.999 | 
| `eps` | ContinuousParameterRanges | MinValue: 1e-8, MaxValue: 1.0 | 
| `learning_rate` | ContinuousParameterRanges | MinValue: 1e-6, MaxValue: 0.5 | 
| `momentum` | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| `optimizer` | CategoricalParameterRanges | ['sgd', 'adam', 'rmsprop', 'nesterov', 'adagrad', 'adadelta'] | 
| `regularizers_l2` | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 
| `train_only_on_top_layer` | CategoricalParameterRanges | ['True', 'False'] | 
| `initial_accumulator_value` | ContinuousParameterRanges | MinValue: 0.0, MaxValue: 0.999 | 

# Semantic Segmentation Algorithm
<a name="semantic-segmentation"></a>

The SageMaker AI semantic segmentation algorithm provides a fine-grained, pixel-level approach to developing computer vision applications. It tags every pixel in an image with a class label from a predefined set of classes. Tagging is fundamental for understanding scenes, which is critical to an increasing number of computer vision applications, such as self-driving vehicles, medical imaging diagnostics, and robot sensing. 

For comparison, the SageMaker AI [Image Classification - MXNet](image-classification.md) is a supervised learning algorithm that analyzes only whole images, classifying them into one of multiple output categories. The [Object Detection - MXNet](object-detection.md) is a supervised learning algorithm that detects and classifies all instances of an object in an image. It indicates the location and scale of each object in the image with a rectangular bounding box. 

Because the semantic segmentation algorithm classifies every pixel in an image, it also provides information about the shapes of the objects contained in the image. The segmentation output is represented as a grayscale image, called a *segmentation mask*, with the same shape as the input image.

The SageMaker AI semantic segmentation algorithm is built using the [MXNet Gluon framework and the Gluon CV toolkit](https://github.com/dmlc/gluon-cv). It provides you with a choice of three built-in algorithms to train a deep neural network. You can use the [Fully-Convolutional Network (FCN) algorithm ](https://arxiv.org/abs/1605.06211), [Pyramid Scene Parsing (PSP) algorithm](https://arxiv.org/abs/1612.01105), or [DeepLabV3](https://arxiv.org/abs/1706.05587). 

Each of the three algorithms has two distinct components: 
+ The *backbone* (or *encoder*)—A network that produces reliable activation maps of features.
+ The *decoder*—A network that constructs the segmentation mask from the encoded activation maps.

You also have a choice of backbones for the FCN, PSP, and DeepLabV3 algorithms: [ResNet50 or ResNet101](https://arxiv.org/abs/1512.03385). These backbones include pretrained artifacts that were originally trained on the [ImageNet](http://www.image-net.org/) classification task. You can fine-tune these backbones for segmentation using your own data. Or, you can initialize and train these networks from scratch using only your own data. The decoders are never pretrained. 

To deploy the trained model for inference, use the SageMaker AI hosting service. During inference, you can request the segmentation mask either as a PNG image or as a set of probabilities for each class for each pixel. You can use these masks as part of a larger pipeline that includes additional downstream image processing or other applications.

**Topics**
+ [Semantic Segmentation Sample Notebooks](#semantic-segmentation-sample-notebooks)
+ [Input/Output Interface for the Semantic Segmentation Algorithm](#semantic-segmentation-inputoutput)
+ [EC2 Instance Recommendation for the Semantic Segmentation Algorithm](#semantic-segmentation-instances)
+ [Semantic Segmentation Hyperparameters](segmentation-hyperparameters.md)
+ [Tuning a Semantic Segmentation Model](semantic-segmentation-tuning.md)

## Semantic Segmentation Sample Notebooks
<a name="semantic-segmentation-sample-notebooks"></a>

For a sample Jupyter notebook that uses the SageMaker AI semantic segmentation algorithm to train a model and deploy it to perform inferences, see the [Semantic Segmentation Example](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/semantic_segmentation_pascalvoc/semantic_segmentation_pascalvoc.html). For instructions on how to create and access Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). 

To see a list of all of the SageMaker AI samples, create and open a notebook instance, and choose the **SageMaker AI Examples** tab. The example semantic segmentation notebooks are located under **Introduction to Amazon algorithms**. To open a notebook, choose its **Use** tab, and choose **Create copy**.

## Input/Output Interface for the Semantic Segmentation Algorithm
<a name="semantic-segmentation-inputoutput"></a>

SageMaker AI semantic segmentation expects the customer's training dataset to be on [Amazon Simple Storage Service (Amazon S3)](https://aws.amazon.com/s3/). Once trained, it produces the resulting model artifacts on Amazon S3. The input interface format for the SageMaker AI semantic segmentation is similar to that of most standardized semantic segmentation benchmarking datasets. The dataset in Amazon S3 is expected to be presented in two channels, one for `train` and one for `validation` using four directories, two for images and two for annotations. Annotations are expected to be uncompressed PNG images. The dataset might also have a label map that describes how the annotation mappings are established. If not, the algorithm uses a default. It also supports the augmented manifest image format (`application/x-image`) for training in Pipe input mode straight from Amazon S3. For inference, an endpoint accepts images with an `image/jpeg` content type. 

### How Training Works
<a name="semantic-segmentation-inputoutput-training"></a>

The training data is split into four directories: `train`, `train_annotation`, `validation`, and `validation_annotation`. There is a channel for each of these directories. The dataset is also expected to have one `label_map.json` file per annotation channel, for `train_annotation` and `validation_annotation` respectively. If you don't provide these JSON files, SageMaker AI provides a default label map.

The dataset specifying these files should look similar to the following example:

```
s3://bucket_name
    |
    |- train
                 |
                 | - 0000.jpg
                 | - coffee.jpg
    |- validation
                 |
                 | - 00a0.jpg
                 | - banana.jpg
    |- train_annotation
                 |
                 | - 0000.png
                 | - coffee.png
    |- validation_annotation
                 |
                 | - 00a0.png
                 | - banana.png
    |- label_map
                 | - train_label_map.json
                 | - validation_label_map.json
```

Every JPG image in the train and validation directories has a corresponding PNG label image with the same name in the `train_annotation` and `validation_annotation` directories. This naming convention helps the algorithm to associate a label with its corresponding image during training. The `train`, `train_annotation`, `validation`, and `validation_annotation` channels are mandatory. The annotations are single-channel PNG images. The format works as long as the metadata (modes) in the image helps the algorithm read the annotation images into a single-channel 8-bit unsigned integer. For more information on our support for modes, see the [Python Image Library documentation](https://pillow.readthedocs.io/en/stable/handbook/concepts.html#modes). We recommend using the 8-bit pixel, palette (`P`) mode. 
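A quick way to catch naming mistakes before launching a training job is to check the pairing locally. The following is a small sketch (the helper name and directory layout follow the example above; it checks only the `train` channel):

```
from pathlib import Path

def check_annotation_pairs(dataset_root):
    """Return the names of JPG images in train/ that have no
    same-named PNG label image in train_annotation/."""
    root = Path(dataset_root)
    missing = []
    for image in (root / "train").glob("*.jpg"):
        annotation = root / "train_annotation" / (image.stem + ".png")
        if not annotation.exists():
            missing.append(image.name)
    return missing
```

Running the same check on the `validation` and `validation_annotation` directories is a straightforward variation.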

When using these modes, each encoded pixel value is a simple 8-bit integer. To get from this mapping to a map of a label, the algorithm uses one mapping file per channel, called the *label map*. The label map is used to map the values in the image to actual label indices. In the default label map, which is provided if you don't supply one, the pixel value in an annotation matrix (image) directly indexes the label. These images can be grayscale PNG files or 8-bit indexed PNG files. The label map file for the unscaled default case is the following: 

```
{
  "scale": "1"
}
```

To provide some contrast for viewing, some annotation software scales the label images by a constant amount. To support this, the SageMaker AI semantic segmentation algorithm provides a rescaling option to scale down the values to actual label values. When scaling down doesn't convert the value to an appropriate integer, the algorithm defaults to the greatest integer less than or equal to the scaled value. The following code shows how to set the scale value to rescale the label values:

```
{
  "scale": "3"
}
```

The following example shows how this `"scale"` value is used to rescale the `encoded_label` values of the input annotation image when they are mapped to the `mapped_label` values to be used in training. The label values in the input annotation image are 0, 3, 6, with scale 3, so they are mapped to 0, 1, 2 for training:

```
encoded_label = [0, 3, 6]
mapped_label = [0, 1, 2]
```
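The scale-based mapping can be sketched in a few lines: each encoded pixel value is divided by `scale`, with non-integer results rounded down as described above (the function name is illustrative):

```
import math

# Rescale encoded annotation values to label indices by dividing by
# "scale" and taking the floor of any non-integer result.
def rescale_labels(encoded_label, scale):
    return [math.floor(value / scale) for value in encoded_label]

print(rescale_labels([0, 3, 6], scale=3))  # [0, 1, 2]
```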

In some cases, you might need to specify a particular color mapping for each class. Use the map option in the label mapping as shown in the following example of a `label_map` file:

```
{
    "map": {
        "0": 5,
        "1": 0,
        "2": 2
    }
}
```

The label mapping for this example is:

```
encoded_label = [0, 5, 2]
mapped_label = [1, 0, 2]
```
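In other words, the `"map"` dictionary associates each label index with its encoded pixel value, so applying it to an annotation image means inverting that mapping. A minimal sketch of the lookup, using the example values above:

```
# The "map" option associates label index -> encoded pixel value.
label_map = {"map": {"0": 5, "1": 0, "2": 2}}

# Invert it: encoded pixel value -> label index.
encoded_to_label = {encoded: int(label) for label, encoded in label_map["map"].items()}

encoded_label = [0, 5, 2]
mapped_label = [encoded_to_label[value] for value in encoded_label]
print(mapped_label)  # [1, 0, 2]
```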

With label mappings, you can use different annotation systems and annotation software to obtain data without a lot of preprocessing. You can provide one label map per channel. The files for a label map in the `label_map` channel must follow the naming conventions for the four-directory structure. If you don't provide a label map, the algorithm assumes a scale of 1 (the default).

### Training with the Augmented Manifest Format
<a name="semantic-segmentation-inputoutput-training-augmented-manifest"></a>

The augmented manifest format enables you to do training in Pipe mode using image files without needing to create RecordIO files. The augmented manifest file contains data objects and should be in [JSON Lines](http://jsonlines.org/) format, as described in the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. Each line in the manifest is an entry containing the Amazon S3 URI for the image and the URI for the annotation image.

Each JSON object in the manifest file must contain a `source-ref` key. The `source-ref` key should contain the value of the Amazon S3 URI to the image. The labels are provided under the `AttributeNames` parameter value as specified in the [`CreateTrainingJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. An entry can also contain additional metadata under a metadata tag, but the algorithm ignores it. In the example below, the `AttributeNames` are contained in the list of image and annotation references `["source-ref", "city-streets-ref"]`. These names must have `-ref` appended to them. When using the Semantic Segmentation algorithm with Augmented Manifest, the value of the `RecordWrapperType` parameter must be `"RecordIO"` and the value of the `ContentType` parameter must be `application/x-recordio`.

```
{"source-ref": "S3 bucket location", "city-streets-ref": "S3 bucket location", "city-streets-metadata": {"job-name": "label-city-streets"}}
```

For more information on augmented manifest files, see [Augmented Manifest Files for Training Jobs](augmented-manifest.md).
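As a sketch, writing an augmented manifest is just emitting one complete JSON object per line. The S3 URIs, file name, and attribute names below are hypothetical placeholders; substitute your own:

```python
import json

# Hypothetical S3 URIs for illustration; substitute your own bucket and keys.
entries = [
    {
        "source-ref": "s3://amzn-s3-demo-bucket/images/0001.jpg",
        "city-streets-ref": "s3://amzn-s3-demo-bucket/labels/0001.png",
        "city-streets-metadata": {"job-name": "label-city-streets"},
    },
]

# An augmented manifest is JSON Lines: one complete JSON object per line.
with open("train.manifest", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```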

### Incremental Training
<a name="semantic-segmentation-inputoutput-incremental-training"></a>

You can also seed the training of a new model with a model that you trained previously using SageMaker AI. This incremental training saves training time when you want to train a new model with the same or similar data. Currently, incremental training is supported only for models trained with the built-in SageMaker AI Semantic Segmentation.

To use your own pre-trained model, specify the `ChannelName` as "model" in the `InputDataConfig` for the [`CreateTrainingJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. Set the `ContentType` for the model channel to `application/x-sagemaker-model`. The `backbone`, `algorithm`, `crop_size`, and `num_classes` input parameters that define the network architecture must be consistently specified in the input hyperparameters of the new model and the pre-trained model that you upload to the model channel. For the pretrained model file, you can use the compressed (.tar.gz) artifacts from SageMaker AI outputs. You can only use Image formats for input data. For more information on incremental training and for instructions on how to use it, see [Use Incremental Training in Amazon SageMaker AI](incremental-training.md). 
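A sketch of the model channel entry as it might appear in `InputDataConfig`, using the field names from the `CreateTrainingJob` API; the S3 URI is a hypothetical placeholder pointing at a previous job's compressed artifacts:

```python
# A sketch of the "model" channel entry for the CreateTrainingJob
# InputDataConfig; the S3 URI is a hypothetical placeholder.
model_channel = {
    "ChannelName": "model",
    "ContentType": "application/x-sagemaker-model",
    "DataSource": {
        "S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://amzn-s3-demo-bucket/previous-job/output/model.tar.gz",
            "S3DataDistributionType": "FullyReplicated",
        }
    },
}
```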

### Produce Inferences
<a name="semantic-segmentation-inputoutput-inference"></a>

To query a trained model that is deployed to an endpoint, you need to provide an image and an `AcceptType` that denotes the type of output required. The endpoint takes JPEG images with an `image/jpeg` content type. If you request an `AcceptType` of `image/png`, the algorithm outputs a PNG file with a segmentation mask in the same format as the labels themselves. If you request an accept type of `application/x-recordio-protobuf`, the algorithm returns class probabilities encoded in recordio-protobuf format. The latter format outputs a 3D tensor where the third dimension is the same size as the number of classes. This component denotes the probability of each class label for each pixel.
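As a sketch, these are the request parameters for the two output types; the endpoint name is hypothetical. With `runtime = boto3.client("sagemaker-runtime")`, you would pass one of these dictionaries along with the image bytes to `runtime.invoke_endpoint(Body=payload, **params)`:

```python
# Request parameters for a deployed semantic segmentation endpoint.
# The endpoint name is a hypothetical placeholder.
with_png_mask = {
    "EndpointName": "semseg-endpoint",   # hypothetical
    "ContentType": "image/jpeg",         # the endpoint takes JPEG input
    "Accept": "image/png",               # returns a PNG segmentation mask
}

# Same request, but asking for per-pixel class probabilities instead.
with_probabilities = dict(with_png_mask, Accept="application/x-recordio-protobuf")
```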

## EC2 Instance Recommendation for the Semantic Segmentation Algorithm
<a name="semantic-segmentation-instances"></a>

The SageMaker AI semantic segmentation algorithm only supports GPU instances for training, and we recommend using GPU instances with more memory for training with large batch sizes. The algorithm can be trained using P2, P3, G4dn, or G5 instances in single machine configurations.

For inference, you can use CPU instances (such as C5 and M5), GPU instances (such as P3 and G4dn), or both. For information about the instance types that provide varying combinations of CPU, GPU, memory, and networking capacity for inference, see [Amazon SageMaker AI ML Instance Types](https://aws.amazon.com/sagemaker/pricing/instance-types/).

# Semantic Segmentation Hyperparameters
<a name="segmentation-hyperparameters"></a>

The following tables list the hyperparameters supported by the Amazon SageMaker AI semantic segmentation algorithm for network architecture, data inputs, and training. You specify Semantic Segmentation for training in the `AlgorithmName` of the [`CreateTrainingJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request.

**Network Architecture Hyperparameters**


| Parameter Name | Description | 
| --- | --- | 
| backbone |  The backbone to use for the algorithm's encoder component. **Optional** Valid values: `resnet-50`, `resnet-101`  Default value: `resnet-50`  | 
| use\_pretrained\_model |  Whether a pretrained model is to be used for the backbone. **Optional** Valid values: `True`, `False` Default value: `True`  | 
| algorithm |  The algorithm to use for semantic segmentation.  **Optional** Valid values: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/segmentation-hyperparameters.html) Default value: `fcn`  | 

**Data Hyperparameters**


| Parameter Name | Description | 
| --- | --- | 
| num\_classes |  The number of classes to segment. **Required** Valid values: 2 ≤ positive integer ≤ 254  | 
| num\_training\_samples |  The number of samples in the training data. The algorithm uses this value to set up the learning rate scheduler. **Required** Valid values: positive integer  | 
| base\_size |  Defines how images are rescaled before cropping. Images are rescaled such that the long side is set to `base_size` multiplied by a random number from 0.5 to 2.0, and the short side is computed to preserve the aspect ratio. **Optional** Valid values: positive integer > 16 Default value: 520  | 
| crop\_size |  The image size for input during training. We randomly rescale the input image based on `base_size`, and then take a random square crop with side length equal to `crop_size`. The `crop_size` is automatically rounded up to a multiple of 8. **Optional** Valid values: positive integer > 16 Default value: 240  | 

**Training Hyperparameters**


| Parameter Name | Description | 
| --- | --- | 
| early\_stopping |  Whether to use early stopping logic during training. **Optional** Valid values: `True`, `False` Default value: `False`  | 
| early\_stopping\_min\_epochs |  The minimum number of epochs that must be run. **Optional** Valid values: integer Default value: 5  | 
| early\_stopping\_patience |  The number of epochs that meet the tolerance for lower performance before the algorithm enforces an early stop. **Optional** Valid values: integer Default value: 4  | 
| early\_stopping\_tolerance |  If the relative improvement of the score of the training job, the mIOU, is smaller than this value, early stopping considers the epoch as not improved. This is used only when `early_stopping` = `True`. **Optional** Valid values: 0 ≤ float ≤ 1 Default value: 0.0  | 
| epochs |  The number of epochs to train. **Optional** Valid values: positive integer Default value: 10  | 
| gamma1 |  The decay factor for the moving average of the squared gradient for `rmsprop`. Used only for `rmsprop`. **Optional** Valid values: 0 ≤ float ≤ 1 Default value: 0.9  | 
| gamma2 |  The momentum factor for `rmsprop`. **Optional** Valid values: 0 ≤ float ≤ 1 Default value: 0.9  | 
| learning\_rate |  The initial learning rate.  **Optional** Valid values: 0 < float ≤ 1 Default value: 0.001  | 
| lr\_scheduler |  The shape of the learning rate schedule that controls its decrease over time. **Optional** Valid values:  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/segmentation-hyperparameters.html) Default value: `poly`  | 
| lr\_scheduler\_factor |  If `lr_scheduler` is set to `step`, the ratio by which to reduce (multiply) the `learning_rate` after each of the epochs specified by the `lr_scheduler_step`. Otherwise, ignored. **Optional** Valid values: 0 ≤ float ≤ 1 Default value: 0.1  | 
| lr\_scheduler\_step |  A comma-delimited list of the epochs after which the `learning_rate` is reduced (multiplied) by an `lr_scheduler_factor`. For example, if the value is set to `"10, 20"`, then the `learning_rate` is reduced by `lr_scheduler_factor` after the 10th epoch and again by this factor after the 20th epoch. **Conditionally Required** if `lr_scheduler` is set to `step`. Otherwise, ignored. Valid values: string Default value: (No default, as the value is required when used.)  | 
| mini\_batch\_size |  The batch size for training. Using a large `mini_batch_size` usually results in faster training, but it might cause you to run out of memory. Memory usage is affected by the values of the `mini_batch_size` and `image_shape` parameters, and the backbone architecture. **Optional** Valid values: positive integer  Default value: 16  | 
| momentum |  The momentum for the `sgd` optimizer. When you use other optimizers, the semantic segmentation algorithm ignores this parameter. **Optional** Valid values: 0 < float ≤ 1 Default value: 0.9  | 
| optimizer |  The type of optimizer. For more information about an optimizer, choose the appropriate link: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/segmentation-hyperparameters.html) **Optional** Valid values: `adam`, `adagrad`, `nag`, `rmsprop`, `sgd`  Default value: `sgd`  | 
| syncbn |  If set to `True`, the batch normalization mean and variance are computed over all the samples processed across the GPUs. **Optional**  Valid values: `True`, `False`  Default value: `False`  | 
| validation\_mini\_batch\_size |  The batch size for validation. A large `mini_batch_size` usually results in faster training, but it might cause you to run out of memory. Memory usage is affected by the values of the `mini_batch_size` and `image_shape` parameters, and the backbone architecture.  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/segmentation-hyperparameters.html) **Optional** Valid values: positive integer Default value: 16  | 
| weight\_decay |  The weight decay coefficient for the `sgd` optimizer. When you use other optimizers, the algorithm ignores this parameter.  **Optional** Valid values: 0 < float < 1 Default value: 0.0001  | 

# Tuning a Semantic Segmentation Model
<a name="semantic-segmentation-tuning"></a>

*Automatic model tuning*, also known as hyperparameter tuning, finds the best version of a model by running many jobs that test a range of hyperparameters on your dataset. You choose the tunable hyperparameters, a range of values for each, and an objective metric. You choose the objective metric from the metrics that the algorithm computes. Automatic model tuning searches the hyperparameters chosen to find the combination of values that result in the model that optimizes the objective metric.

## Metrics Computed by the Semantic Segmentation Algorithm
<a name="semantic-segmentation-metrics"></a>

The semantic segmentation algorithm reports two validation metrics. When tuning hyperparameter values, choose one of these metrics as the objective.


| Metric Name | Description | Optimization Direction | 
| --- | --- | --- | 
| validation:mIOU |  The area of the intersection of the predicted segmentation and the ground truth divided by the area of union between them for images in the validation set. Also known as the Jaccard Index.  |  Maximize  | 
| validation:pixel\_accuracy | The percentage of pixels that are correctly classified in images from the validation set. |  Maximize  | 

## Tunable Semantic Segmentation Hyperparameters
<a name="semantic-segmentation-tunable-hyperparameters"></a>

You can tune the following hyperparameters for the semantic segmentation algorithm.


| Parameter Name | Parameter Type | Recommended Ranges | 
| --- | --- | --- | 
| learning\_rate |  ContinuousParameterRange  |  MinValue: 1e-4, MaxValue: 1e-1  | 
| mini\_batch\_size |  IntegerParameterRanges  |  MinValue: 1, MaxValue: 128  | 
| momentum |  ContinuousParameterRange  |  MinValue: 0.9, MaxValue: 0.999  | 
| optimizer |  CategoricalParameterRanges  |  ['sgd', 'adam', 'adadelta']  | 
| weight\_decay |  ContinuousParameterRange  |  MinValue: 1e-5, MaxValue: 1e-3  | 
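The table above can be expressed in the `ParameterRanges` shape used by the `CreateHyperParameterTuningJob` API. The following is a sketch using the recommended ranges and the `validation:mIOU` objective; note that the API expects numeric bounds as strings:

```python
# Hyperparameter ranges for an automatic model tuning job, following the
# CreateHyperParameterTuningJob ParameterRanges structure.
parameter_ranges = {
    "ContinuousParameterRanges": [
        {"Name": "learning_rate", "MinValue": "1e-4", "MaxValue": "1e-1"},
        {"Name": "momentum", "MinValue": "0.9", "MaxValue": "0.999"},
        {"Name": "weight_decay", "MinValue": "1e-5", "MaxValue": "1e-3"},
    ],
    "IntegerParameterRanges": [
        {"Name": "mini_batch_size", "MinValue": "1", "MaxValue": "128"},
    ],
    "CategoricalParameterRanges": [
        {"Name": "optimizer", "Values": ["sgd", "adam", "adadelta"]},
    ],
}

# The tuning objective: maximize mean intersection over union on validation.
objective = {"Type": "Maximize", "MetricName": "validation:mIOU"}
```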

# Use Reinforcement Learning with Amazon SageMaker AI
<a name="reinforcement-learning"></a>

Reinforcement learning (RL) combines fields such as computer science, neuroscience, and psychology to determine how to map situations to actions to maximize a numerical reward signal. This notion of a reward signal in RL stems from neuroscience research into how the human brain makes decisions about which actions maximize reward and minimize punishment. In most situations, humans are not given explicit instructions on which actions to take, but instead must learn both which actions yield the most immediate rewards, and how those actions influence future situations and consequences. 

The problem of RL is formalized using Markov decision processes (MDPs) that originate from dynamical systems theory. MDPs aim to capture high-level details of a real problem that a learning agent encounters over some period of time in attempting to achieve some ultimate goal. The learning agent should be able to determine the current state of its environment and identify possible actions that affect the learning agent’s current state. Furthermore, the learning agent’s goals should correlate strongly to the state of the environment. A solution to a problem formulated in this way is known as a reinforcement learning method. 

## What are the differences between reinforcement, supervised, and unsupervised learning paradigms?
<a name="rl-differences"></a>

Machine learning can be divided into three distinct learning paradigms: supervised, unsupervised, and reinforcement.

In supervised learning, an external supervisor provides a training set of labeled examples. Each example contains information about a situation, belongs to a category, and has a label identifying the category to which it belongs. The goal of supervised learning is to generalize in order to predict correctly in situations that are not present in the training data. 

In contrast, RL deals with interactive problems, making it infeasible to gather all possible examples of situations with correct labels that an agent might encounter. This type of learning is most promising when an agent is able to accurately learn from its own experience and adjust accordingly. 

In unsupervised learning, an agent learns by uncovering structure within unlabeled data. While a RL agent might benefit from uncovering structure based on its experiences, the sole purpose of RL is to maximize a reward signal. 

**Topics**
+ [What are the differences between reinforcement, supervised, and unsupervised learning paradigms?](#rl-differences)
+ [Why is Reinforcement Learning Important?](#rl-why)
+ [Markov Decision Process (MDP)](#rl-terms)
+ [Key Features of Amazon SageMaker AI RL](#sagemaker-rl)
+ [Reinforcement Learning Sample Notebooks](#sagemaker-rl-notebooks)
+ [Sample RL Workflow Using Amazon SageMaker AI RL](sagemaker-rl-workflow.md)
+ [RL Environments in Amazon SageMaker AI](sagemaker-rl-environments.md)
+ [Distributed Training with Amazon SageMaker AI RL](sagemaker-rl-distributed.md)
+ [Hyperparameter Tuning with Amazon SageMaker AI RL](sagemaker-rl-tuning.md)

## Why is Reinforcement Learning Important?
<a name="rl-why"></a>

RL is well-suited for solving large, complex problems, such as supply chain management, HVAC systems, industrial robotics, game artificial intelligence, dialog systems, and autonomous vehicles. Because RL models learn by a continuous process of receiving rewards and punishments for every action taken by the agent, it is possible to train systems to make decisions under uncertainty and in dynamic environments. 

## Markov Decision Process (MDP)
<a name="rl-terms"></a>

RL is based on models called Markov Decision Processes (MDPs). An MDP consists of a series of time steps. Each time step consists of the following:

Environment  
Defines the space in which the RL model operates. This can be either a real-world environment or a simulator. For example, if you train a physical autonomous vehicle on a physical road, that would be a real-world environment. If you train a computer program that models an autonomous vehicle driving on a road, that would be a simulator.

State  
Specifies all information about the environment and past steps that is relevant to the future. For example, in an RL model in which a robot can move in any direction at any time step, the position of the robot at the current time step is the state, because if we know where the robot is, it isn't necessary to know the steps it took to get there.

Action  
What the agent does. For example, the robot takes a step forward.

Reward  
A number that represents the value of the state that resulted from the last action that the agent took. For example, if the goal is for a robot to find treasure, the reward for finding treasure might be 5, and the reward for not finding treasure might be 0. The RL model attempts to find a strategy that optimizes the cumulative reward over the long term. This strategy is called a *policy*.

Observation  
Information about the state of the environment that is available to the agent at each step. This might be the entire state, or it might be just a part of the state. For example, the agent in a chess-playing model would be able to observe the entire state of the board at any step, but a robot in a maze might only be able to observe a small portion of the maze that it currently occupies.

Typically, training in RL consists of many *episodes*. An episode consists of all of the time steps in an MDP from the initial state until the environment reaches the terminal state.
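To make the terms above concrete, the following is a toy treasure-hunt MDP (all names and values here are illustrative, echoing the reward example above): states are grid positions, the action moves the agent forward or back, and reaching the treasure ends the episode with a reward of 5.

```python
import random

class TreasureGrid:
    """A toy MDP: the agent moves on positions 0..size-1 and the episode
    ends when it reaches the treasure position."""

    def __init__(self, size=5, treasure=4):
        self.size, self.treasure = size, treasure

    def reset(self):
        self.state = 0                 # initial state of each episode
        return self.state

    def step(self, action):            # action: +1 (forward) or -1 (back)
        self.state = max(0, min(self.size - 1, self.state + action))
        done = self.state == self.treasure
        reward = 5 if done else 0      # reward for finding the treasure
        return self.state, reward, done

# One episode: all time steps from the initial state to the terminal state.
env = TreasureGrid()
state, total_reward, done = env.reset(), 0, False
while not done:
    state, reward, done = env.step(random.choice([-1, 1]))
    total_reward += reward
```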

## Key Features of Amazon SageMaker AI RL
<a name="sagemaker-rl"></a>

To train RL models in SageMaker AI RL, use the following components: 
+ A deep learning (DL) framework. Currently, SageMaker AI supports RL in TensorFlow and Apache MXNet.
+ An RL toolkit. An RL toolkit manages the interaction between the agent and the environment and provides a wide selection of state of the art RL algorithms. SageMaker AI supports the Intel Coach and Ray RLlib toolkits. For information about Intel Coach, see [https://nervanasystems.github.io/coach/](https://nervanasystems.github.io/coach/). For information about Ray RLlib, see [https://ray.readthedocs.io/en/latest/rllib.html](https://ray.readthedocs.io/en/latest/rllib.html).
+ An RL environment. You can use custom environments, open-source environments, or commercial environments. For information, see [RL Environments in Amazon SageMaker AI](sagemaker-rl-environments.md).

The following diagram shows the RL components that are supported in SageMaker AI RL.

![\[The RL components that are supported in SageMaker AI RL.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-rl-support.png)


## Reinforcement Learning Sample Notebooks
<a name="sagemaker-rl-notebooks"></a>

For complete code examples, see the [reinforcement learning sample notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/main/reinforcement_learning) in the SageMaker AI Examples repository.

# Sample RL Workflow Using Amazon SageMaker AI RL
<a name="sagemaker-rl-workflow"></a>

The following example describes the steps for developing RL models using Amazon SageMaker AI RL.

1. **Formulate the RL problem**—First, formulate the business problem into an RL problem. For example, auto scaling enables services to dynamically increase or decrease capacity depending on conditions that you define. Currently, this requires setting up alarms, scaling policies, thresholds, and other manual steps. To solve this with RL, we define the components of the Markov Decision Process:

   1. **Objective**—Scale instance capacity so that it matches the desired load profile.

   1. **Environment**—A custom environment that includes the load profile. It generates a simulated load with daily and weekly variations and occasional spikes. The simulated system has a delay between when new resources are requested and when they become available for serving requests.

   1. **State**—The current load, number of failed jobs, and number of active machines.

   1. **Action**—Remove, add, or keep the same number of instances.

   1. **Reward**—A positive reward for successful transactions and a high penalty for failing transactions beyond a specified threshold.

1. **Define the RL environment**—The RL environment can be the real world where the RL agent interacts or a simulation of the real world. You can connect open source and custom environments developed using Gym interfaces and commercial simulation environments such as MATLAB and Simulink.

1. **Define the presets**—The presets configure the RL training jobs and define the hyperparameters for the RL algorithms.

1. **Write the training code**—Write training code as a Python script and pass the script to a SageMaker AI training job. In your training code, import the environment files and the preset files, and then define the `main()` function.

1. **Train the RL Model**—Use the SageMaker AI `RLEstimator` in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) to start an RL training job. If you are using local mode, the training job runs on the notebook instance. When you use SageMaker AI for training, you can select GPU or CPU instances. Store the output from the training job in a local directory if you train in local mode, or on Amazon S3 if you use SageMaker AI training.

   The `RLEstimator` requires the following information as parameters. 

   1. The source directory where the environment, presets, and training code are uploaded.

   1. The path to the training script.

   1. The RL toolkit and deep learning framework you want to use. This automatically resolves to the Amazon ECR path for the RL container.

   1. The training parameters, such as the instance count, job name, and S3 path for output.

   1. Metric definitions that you want to capture in your logs. These can also be visualized in CloudWatch and in SageMaker AI notebooks.

1. **Visualize training metrics and output**—After a training job that uses an RL model completes, you can view the metrics you defined in the training jobs in CloudWatch. You can also plot the metrics in a notebook by using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) analytics library. Visualizing metrics helps you understand how the performance of the model, as measured by the reward, improves over time.
**Note**  
If you train in local mode, you can't visualize metrics in CloudWatch.

1. **Evaluate the model**—Checkpointed data from the previously trained models can be passed on for evaluation and inference in the checkpoint channel. In local mode, use the local directory. In SageMaker AI training mode, you need to upload the data to S3 first.

1. **Deploy RL models**—Finally, deploy the trained model on an endpoint hosted on SageMaker AI containers or on an edge device by using AWS IoT Greengrass.

For more information on RL with SageMaker AI, see [Using RL with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/using_rl.html).
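The parameters that step 5 lists can be sketched as keyword arguments for the SageMaker Python SDK `RLEstimator`. All values below are hypothetical placeholders, and the toolkit and framework are shown as plain strings for illustration (the SDK also accepts enum values):

```python
# A sketch of RLEstimator-style arguments; every value is a hypothetical
# placeholder to substitute with your own.
rl_estimator_args = {
    "source_dir": "src",                 # environment, presets, and training code
    "entry_point": "train-coach.py",     # path to the training script
    "toolkit": "coach",                  # RL toolkit and DL framework; together
    "framework": "tensorflow",           # these resolve to the Amazon ECR path
    "instance_count": 1,                 # training parameters
    "instance_type": "ml.m5.xlarge",
    "base_job_name": "rl-autoscaling",
    "output_path": "s3://amzn-s3-demo-bucket/output",
    "metric_definitions": [              # metrics captured from the job logs
        {"Name": "episode_reward_mean",
         "Regex": "episode_reward_mean: ([0-9\\.]+)"},
    ],
}
```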

# RL Environments in Amazon SageMaker AI
<a name="sagemaker-rl-environments"></a>

Amazon SageMaker AI RL uses environments to mimic real-world scenarios. Given the current state of the environment and an action taken by the agent or agents, the simulator processes the impact of the action, and returns the next state and a reward. Simulators are useful in cases where it is not safe to train an agent in the real world (for example, flying a drone) or if the RL algorithm takes a long time to converge (for example, when playing chess).

The following diagram shows an example of the interactions with a simulator for a car racing game.

![\[An example of the interactions with a simulator for a car racing game.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-rl-flow.png)


The simulation environment consists of an agent and a simulator. Here, a convolutional neural network (CNN) consumes images from the simulator and generates actions to control the game controller. With multiple simulations, this environment generates training data of the form `state_t`, `action`, `state_t+1`, and `reward_t+1`. Defining the reward is not trivial and impacts the quality of the RL model. SageMaker AI RL provides a few examples of reward functions, and the reward remains user-configurable. 

**Topics**
+ [Use OpenAI Gym Interface for Environments in SageMaker AI RL](#sagemaker-rl-environments-gym)
+ [Use Open-Source Environments](#sagemaker-rl-environments-open)
+ [Use Commercial Environments](#sagemaker-rl-environments-commercial)

## Use OpenAI Gym Interface for Environments in SageMaker AI RL
<a name="sagemaker-rl-environments-gym"></a>

To use OpenAI Gym environments in SageMaker AI RL, use the following API elements. For more information about OpenAI Gym, see [Gym Documentation](https://www.gymlibrary.dev/).
+ `env.action_space`—Defines the actions the agent can take, specifies whether each action is continuous or discrete, and specifies the minimum and maximum if the action is continuous.
+ `env.observation_space`—Defines the observations the agent receives from the environment, as well as minimum and maximum for continuous observations.
+ `env.reset()`—Initializes a training episode. The `reset()` function returns the initial state of the environment, and the agent uses the initial state to take its first action. The action is then sent to `step()` repeatedly until the episode reaches a terminal state. When `step()` returns `done = True`, the episode ends. The RL toolkit re-initializes the environment by calling `reset()`.
+ `step()`—Takes the agent action as input and outputs the next state of the environment, the reward, whether the episode has terminated, and an `info` dictionary to communicate debugging information. It is the responsibility of the environment to validate the inputs.
+ `env.render()`—Used for environments that have visualization. The RL toolkit calls this function to capture visualizations of the environment after each call to the `step()` function.
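A minimal environment sketch following the Gym-style API elements above, written without the `gym` package so the shapes are explicit (the environment itself is a made-up coin-guessing game):

```python
class CoinFlipEnv:
    """A toy environment exposing the Gym-style reset()/step() contract."""

    action_space = [0, 1]        # discrete: guess heads (0) or tails (1)
    observation_space = [0, 1]   # observed outcome of the last flip

    def __init__(self, episode_len=3):
        self.episode_len = episode_len

    def reset(self):
        self.t = 0
        self.outcome = 0
        return self.outcome                     # initial state

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == self.outcome else 0.0
        self.outcome = (self.outcome + 1) % 2   # deterministic "flip"
        done = self.t >= self.episode_len       # terminal state reached
        info = {"step": self.t}                 # debugging information
        return self.outcome, reward, done, info

# The toolkit-driven loop: reset once, then step until done is True.
env = CoinFlipEnv()
obs, done = env.reset(), False
while not done:
    obs, reward, done, info = env.step(obs)     # naive policy: repeat last obs
```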

## Use Open-Source Environments
<a name="sagemaker-rl-environments-open"></a>

You can use open-source environments, such as EnergyPlus and RoboSchool, in SageMaker AI RL by building your own container. For more information about EnergyPlus, see [https://energyplus.net/](https://energyplus.net/). For more information about RoboSchool, see [https://github.com/openai/roboschool](https://github.com/openai/roboschool). The HVAC and RoboSchool examples in the [SageMaker AI examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning) show how to build a custom container to use with SageMaker AI RL.

## Use Commercial Environments
<a name="sagemaker-rl-environments-commercial"></a>

You can use commercial environments, such as MATLAB and Simulink, in SageMaker AI RL by building your own container. You need to manage your own licenses.

# Distributed Training with Amazon SageMaker AI RL
<a name="sagemaker-rl-distributed"></a>

Amazon SageMaker AI RL supports multi-core and multi-instance distributed training. Depending on your use case, training and/or environment rollout can be distributed. For example, SageMaker AI RL works for the following distributed scenarios:
+ Single training instance and multiple rollout instances of the same instance type. For an example, see the Neural Network Compression example in the [SageMaker AI examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning).
+ Single trainer instance and multiple rollout instances, where different instance types are used for training and rollouts. For an example, see the AWS DeepRacer / AWS RoboMaker example in the [SageMaker AI examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning).
+ Single trainer instance that uses multiple cores for rollout. For an example, see the Roboschool example in the [SageMaker AI examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning). This is useful if the simulation environment is light-weight and can run on a single thread. 
+ Multiple instances for training and rollouts. For an example, see the Roboschool example in the [SageMaker AI examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning).

# Hyperparameter Tuning with Amazon SageMaker AI RL
<a name="sagemaker-rl-tuning"></a>

You can run a hyperparameter tuning job to optimize hyperparameters for Amazon SageMaker AI RL. The Roboschool example in the sample notebooks in the [SageMaker AI examples repository](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning) shows how you can do this with RL Coach. The launcher script shows how you can abstract parameters from the Coach preset file and optimize them.

# Run your local code as a SageMaker training job
<a name="train-remote-decorator"></a>

You can run your local machine learning (ML) Python code as a large single-node Amazon SageMaker training job or as multiple parallel jobs. You can do this by annotating your code with an @remote decorator, as shown in the following code example. [Distributed training](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html) (across multiple instances) is not supported with remote functions.

```
from sagemaker.remote_function import remote

@remote(**settings)
def divide(x, y):
    return x / y
```

The SageMaker Python SDK will automatically translate your existing workspace environment and any associated data processing code and datasets into a SageMaker training job that runs on the SageMaker training platform. You can also activate a persistent cache feature, which will further reduce job start latency by caching previously downloaded dependency packages. This reduction in job latency is greater than the reduction in latency from using SageMaker AI managed warm pools alone. For more information, see [Using persistent cache](train-warm-pools.md#train-warm-pools-persistent-cache).

**Note**  
Distributed training jobs are not supported by remote functions.

The following sections show how to annotate your local ML code with an @remote decorator and tailor your experience for your use case. This includes customizing your environment and integrating with SageMaker Experiments.

**Topics**
+ [Set up your environment](#train-remote-decorator-env)
+ [Invoke a remote function](train-remote-decorator-invocation.md)
+ [Configuration file](train-remote-decorator-config.md)
+ [Customize your runtime environment](train-remote-decorator-customize.md)
+ [Container image compatibility](train-remote-decorator-container.md)
+ [Logging parameters and metrics with Amazon SageMaker Experiments](train-remote-decorator-experiments.md)
+ [Using modular code with the @remote decorator](train-remote-decorator-modular.md)
+ [Private repository for runtime dependencies](train-remote-decorator-private.md)
+ [Example notebooks](train-remote-decorator-examples.md)

## Set up your environment
<a name="train-remote-decorator-env"></a>

Choose one of the following three options to set up your environment.

### Run your code from Amazon SageMaker Studio Classic
<a name="train-remote-decorator-env-studio"></a>

You can annotate and run your local ML code from SageMaker Studio Classic by creating a SageMaker notebook and attaching any available SageMaker Studio Classic image. The following instructions help you create a SageMaker notebook, install the SageMaker Python SDK, and annotate your code with the decorator.

1. Create a SageMaker Notebook and attach an image in SageMaker Studio Classic as follows:

   1. Follow the instructions in [Launch Amazon SageMaker Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-launch.html) in the *Amazon SageMaker AI Developer Guide*.

   1. Select **Studio** from the left navigation pane. This opens a new window.

   1. In the **Get Started** dialog box, select a user profile from the down arrow. This opens a new window.

   1. Select **Open Studio Classic**.

   1. Select **Open Launcher** from the main working area. This opens a new page.

   1. Select **Create notebook** from the main working area.

   1. Select **Base Python 3.0** from the down arrow next to **Image** in the **Change environment** dialog box. 

      The @remote decorator automatically detects the image attached to the SageMaker Studio Classic notebook and uses it to run the SageMaker training job. If `image_uri` is specified either as an argument in the decorator or in the configuration file, then the value specified in `image_uri` will be used instead of the detected image.

      For more information about how to create a notebook in SageMaker Studio Classic, see the **Create a Notebook from the File Menu** section in [Create or Open an Amazon SageMaker Studio Classic Notebook](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-create-open.html#notebooks-create-file-menu).

      For a list of available images, see [Supported Docker images](https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator-container.html).

1. Install the SageMaker Python SDK.

   To annotate your code with the @remote function inside a SageMaker Studio Classic Notebook, you must have the SageMaker Python SDK installed. Install the SageMaker Python SDK, as shown in the following code example.

   ```
   !pip install sagemaker
   ```

1. Use @remote decorator to run functions in a SageMaker training job.

   To run your local ML code, first create a dependencies file to instruct SageMaker AI where to locate your local code. To do so, follow these steps:

   1. From the SageMaker Studio Classic Launcher main working area, in **Utilities and files**, choose **Text file**. This opens a new tab with a text file called `untitled.txt`. 

      For more information about the SageMaker Studio Classic user interface (UI), see [Amazon SageMaker Studio Classic UI Overview](https://docs.aws.amazon.com//sagemaker/latest/dg/studio-ui.html).

   1. Rename `untitled.txt` to `requirements.txt`.

   1. Add all the dependencies required for the code along with the SageMaker AI library to `requirements.txt`. 

      A minimal `requirements.txt` for the example `divide` function contains only the following.

      ```
      sagemaker
      ```

   1. Run your code with the remote decorator by passing the dependencies file, as follows.

      ```
      from sagemaker.remote_function import remote
      
      @remote(instance_type="ml.m5.xlarge", dependencies='./requirements.txt')
      def divide(x, y):
          return x / y
      
      divide(2, 3.0)
      ```

      For additional code examples, see the sample notebook [quick_start.ipynb](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-remote-function/quick_start/quick_start.ipynb).

      If you’re already running a SageMaker Studio Classic notebook, and you install the Python SDK as instructed in **2. Install the SageMaker Python SDK**, you must restart your kernel. For more information, see [Use the SageMaker Studio Classic Notebook Toolbar](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-menu.html) in the *Amazon SageMaker AI Developer Guide*.

### Run your code from an Amazon SageMaker notebook
<a name="train-remote-decorator-env-notebook"></a>

You can annotate your local ML code from a SageMaker notebook instance. The following instructions show how to create a notebook instance with a custom kernel, install the SageMaker Python SDK, and annotate your code with the decorator.

1. Create a notebook instance with a custom `conda` kernel.

   You can annotate your local ML code with an @remote decorator to run it inside a SageMaker training job. First you must create and customize a SageMaker notebook instance to use a kernel with a Python version of 3.7 or higher, up to 3.10.x. To do so, follow these steps:

   1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

   1. In the left navigation panel, choose **Notebook** to expand its options.

   1. Choose **Notebook Instances** from the expanded options.

   1. Choose the **Create Notebook Instance** button. This opens a new page.

   1. For **Notebook instance name**, enter a name with a maximum of 63 characters and no spaces. Valid characters: **A-Z**, **a-z**, **0-9**, and **.** **:** **+** **=** **@** **\_** **%** **-** (hyphen).

   1. In the **Notebook instance settings** dialog box, expand the right arrow next to **Additional Configuration**.

   1. Under **Lifecycle configuration - optional**, expand the down arrow and select **Create a new lifecycle configuration**. This opens a new dialog box.

   1. Under **Name**, enter a name for your configuration setting.

   1. In the **Scripts** dialog box, in the **Start notebook** tab, replace the existing contents of the text box with the following script.

      ```
      #!/bin/bash
      
      set -e
      
      sudo -u ec2-user -i <<'EOF'
      unset SUDO_UID
      WORKING_DIR=/home/ec2-user/SageMaker/custom-miniconda/
      source "$WORKING_DIR/miniconda/bin/activate"
      for env in $WORKING_DIR/miniconda/envs/*; do
          BASENAME=$(basename "$env")
          source activate "$BASENAME"
          python -m ipykernel install --user --name "$BASENAME" --display-name "Custom ($BASENAME)"
      done
      EOF
      
      echo "Restarting the Jupyter server.."
      # restart command is dependent on current running Amazon Linux and JupyterLab
      CURR_VERSION_AL=$(cat /etc/system-release)
      CURR_VERSION_JS=$(jupyter --version)
      
      if [[ $CURR_VERSION_JS == *$"jupyter_core     : 4.9.1"* ]] && [[ $CURR_VERSION_AL == *$" release 2018"* ]]; then
       sudo initctl restart jupyter-server --no-wait
      else
       sudo systemctl --no-block restart jupyter-server.service
      fi
      ```

   1. In the **Scripts** dialog box, in the **Create notebook** tab, replace the existing contents of the text box with the following script.

      ```
      #!/bin/bash
      
      set -e
      
      sudo -u ec2-user -i <<'EOF'
      unset SUDO_UID
      # Install a separate conda installation via Miniconda
      WORKING_DIR=/home/ec2-user/SageMaker/custom-miniconda
      mkdir -p "$WORKING_DIR"
      wget https://repo.anaconda.com/miniconda/Miniconda3-4.6.14-Linux-x86_64.sh -O "$WORKING_DIR/miniconda.sh"
      bash "$WORKING_DIR/miniconda.sh" -b -u -p "$WORKING_DIR/miniconda" 
      rm -rf "$WORKING_DIR/miniconda.sh"
      # Create a custom conda environment
      source "$WORKING_DIR/miniconda/bin/activate"
      KERNEL_NAME="custom_python310"
      PYTHON="3.10"
      conda create --yes --name "$KERNEL_NAME" python="$PYTHON" pip
      conda activate "$KERNEL_NAME"
      pip install --quiet ipykernel
      # Customize these lines as necessary to install the required packages
      EOF
      ```

   1. Choose the **Create configuration** button on the bottom right of the window.

   1. Choose the **Create notebook instance** button on the bottom right of the window.

   1. Wait for the notebook instance **Status** to change from **Pending** to **InService**.

1. Create a Jupyter notebook in the notebook instance.

   The following instructions show how to create a Jupyter notebook using Python 3.10 in your newly created SageMaker instance.

   1. After the notebook instance **Status** from the previous step is **InService**, do the following: 

      1. Select **Open Jupyter** under **Actions** in the row containing your newly created notebook instance **Name**. This opens a new Jupyter server.

   1. In the Jupyter server, select **New** from the top right menu. 

   1. From the down arrow, select **conda\_custom\_python310**. This creates a new Jupyter notebook that uses a Python 3.10 kernel. This new Jupyter notebook can now be used similarly to a local Jupyter notebook. 

1. Install the SageMaker Python SDK.

   After your virtual environment is running, install the SageMaker Python SDK by using the following code example.

   ```
   !pip install sagemaker
   ```

1. Use an @remote decorator to run functions in a SageMaker training job.

   When you annotate your local ML code with an @remote decorator inside the SageMaker notebook, SageMaker training will automatically interpret the function of your code and run it as a SageMaker training job. Set up your notebook by doing the following:

   1. Select the kernel name in the notebook menu from the SageMaker notebook instance that you created in step 1, **Create a SageMaker Notebook instance with a custom kernel**.

      For more information, see [Change an Image or a Kernel](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-run-and-manage-change-image.html). 

   1. From the down arrow, choose a custom `conda` kernel that uses a Python version of 3.7 or higher. 

      As an example, selecting `conda_custom_python310` chooses the kernel for Python 3.10.

   1. Choose **Select**.

   1. Wait for the kernel’s status to show as idle, which indicates that the kernel has started.

   1. In the Jupyter Server Home, select **New** from the top right menu.

   1. Next to the down arrow, select **Text file**. This creates a new text file called `untitled.txt`.

   1. Rename `untitled.txt` to `requirements.txt` and add any dependencies required for the code along with `sagemaker`.

   1. Run your code with the remote decorator by passing the dependencies file as shown below.

      ```
      from sagemaker.remote_function import remote
      
      @remote(instance_type="ml.m5.xlarge", dependencies='./requirements.txt')
      def divide(x, y):
          return x / y
      
      divide(2, 3.0)
      ```

      See the sample notebook [quick_start.ipynb](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-remote-function/quick_start/quick_start.ipynb) for additional code examples.

### Run your code from within your local IDE
<a name="train-remote-decorator-env-ide"></a>

You can annotate your local ML code with an @remote decorator inside your preferred local IDE. The following steps show the necessary prerequisites, how to install the Python SDK, and how to annotate your code with the @remote decorator.

1. Install prerequisites by setting up the AWS Command Line Interface (AWS CLI) and creating a role, as follows:
   + Onboard to a SageMaker AI domain following the instructions in the **AWS CLI Prerequisites** section of [Set Up Amazon SageMaker AI Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-set-up.html#gs-cli-prereq).
   + Create an IAM role following the **Create execution role** section of [SageMaker AI Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).

1. Create a virtual environment by using either PyCharm or `conda` and using Python version 3.7 or higher, up to 3.10.x.
   + Set up a virtual environment using PyCharm as follows:

     1. Select **File** from the main menu.

     1. Choose **New Project**.

     1. Choose **Conda** from the down arrow under **New environment using**.

     1. In the field for **Python version**, use the down arrow to select a Python version that is 3.7 or higher, up to 3.10.x, from the list.  
![PyCharm new project environment settings](http://docs.aws.amazon.com/sagemaker/latest/dg/images/training-pycharm-ide.png)
   + If you have Anaconda installed, you can set up a virtual environment using `conda`, as follows:
     + Open an Anaconda prompt terminal interface.
     + Create and activate a new `conda` environment using a Python version of 3.7 or higher, up to 3.10.x. The following code example shows how to create a `conda` environment using Python version 3.10.

       ```
       conda create -n sagemaker_jobs_quick_start python=3.10 pip
       conda activate sagemaker_jobs_quick_start
       ```

1. Install the SageMaker Python SDK.

   To package your code from your preferred IDE, you must have a virtual environment set up using Python 3.7 or higher, up to 3.10.x. You also need a compatible container image. Install the SageMaker Python SDK using the following code example.

   ```
   pip install sagemaker
   ```

1. Wrap your code inside the @remote decorator. The SageMaker Python SDK will automatically interpret the function of your code and run it as a SageMaker training job. The following code examples show how to import the necessary libraries, set up a SageMaker session, and annotate a function with the @remote decorator.

   You can run your code by either providing the dependencies needed directly, or by using dependencies from the active `conda` environment.
   + To provide the dependencies directly, do the following:
     + Create a `requirements.txt` file in the working directory that the code resides in.
     + Add all of the dependencies required for the code along with the SageMaker library. The following section provides a minimal code example for `requirements.txt` for the example `divide` function.

       ```
       sagemaker
       ```
     + Run your code with the @remote decorator by passing the dependencies file. In the following code example, replace `The IAM role name` with an AWS Identity and Access Management (IAM) role ARN that you would like SageMaker to use to run your job.

       ```
       import boto3
       import sagemaker
       from sagemaker.remote_function import remote
       
       sm_session = sagemaker.Session(boto_session=boto3.session.Session(region_name="us-west-2"))
       settings = dict(
           sagemaker_session=sm_session,
           role=<The IAM role name>,
           instance_type="ml.m5.xlarge",
           dependencies='./requirements.txt'
       )
       
       @remote(**settings)
       def divide(x, y):
           return x / y
       
       
       if __name__ == "__main__":
           print(divide(2, 3.0))
       ```
   + To use dependencies from the active `conda` environment, use the value `auto_capture` for the `dependencies` parameter, as shown in the following.

     ```
     import boto3
     import sagemaker
     from sagemaker.remote_function import remote
     
     sm_session = sagemaker.Session(boto_session=boto3.session.Session(region_name="us-west-2"))
     settings = dict(
         sagemaker_session=sm_session,
         role=<The IAM role name>,
         instance_type="ml.m5.xlarge",
         dependencies="auto_capture"
     )
     
     @remote(**settings)
     def divide(x, y):
         return x / y
     
     
     if __name__ == "__main__":
         print(divide(2, 3.0))
     ```
**Note**  
You can also implement the previous code inside a Jupyter notebook. PyCharm Professional Edition supports Jupyter natively. For more guidance, see [Jupyter notebook support](https://www.jetbrains.com/help/pycharm/ipython-notebook-support.html) in PyCharm's documentation.

# Invoke a remote function
<a name="train-remote-decorator-invocation"></a>

To invoke a function inside the @remote decorator, use either of the following methods:
+ [Use an @remote decorator to invoke a function](#train-remote-decorator-invocation-decorator).
+ [Use the `RemoteExecutor` API to invoke a function](#train-remote-decorator-invocation-api).

If you use the @remote decorator method to invoke a function, the training job will wait for the function to complete before starting a new task. However, if you use the `RemoteExecutor` API, you can run more than one job in parallel. The following sections show both ways of invoking a function.

## Use an @remote decorator to invoke a function
<a name="train-remote-decorator-invocation-decorator"></a>

You can use the @remote decorator to annotate a function. SageMaker AI will transform the code inside the decorator into a SageMaker training job. The training job will then invoke the function inside the decorator and wait for the job to complete. The following code example shows how to import the required libraries, start a SageMaker AI instance, and annotate a matrix multiplication with the @remote decorator.

```
from sagemaker.remote_function import remote
import numpy as np

@remote(instance_type="ml.m5.large")
def matrix_multiply(a, b):
    return np.matmul(a, b)
    
a = np.array([[1, 0],
             [0, 1]])
b = np.array([1, 2])

assert (matrix_multiply(a, b) == np.array([1,2])).all()
```

The decorator is defined as follows, with `**kwargs` standing in for its keyword-only parameters, such as `instance_type` and `dependencies`.

```
def remote(_func=None, **kwargs):
    ...
```
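
To see why both `@remote` and `@remote(**settings)` work, consider the following simplified pure-Python sketch of this dual-use decorator pattern. This is an illustration only, not the SDK's actual implementation; the real decorator submits the function as a training job rather than calling it locally, and the `add` and `mul` functions here are hypothetical examples.

```python
import functools

def remote(_func=None, **settings):
    # Simplified sketch: the real SDK decorator serializes the function
    # and runs it as a SageMaker training job instead of calling it locally.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        return wrapper
    if _func is not None:
        return decorator(_func)   # used as @remote without parentheses
    return decorator              # used as @remote(**settings)

@remote
def add(a, b):
    return a + b

@remote(instance_type="ml.m5.xlarge")  # settings accepted but unused in this sketch
def mul(a, b):
    return a * b
```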

When you invoke a decorated function, the SageMaker Python SDK loads the function's return value, or any exception the function raises, into local memory. In the following code example, the first call to the `divide` function completes successfully and the result is loaded into local memory. In the second call, the function raises an error, and that exception is loaded into local memory.

```
from sagemaker.remote_function import remote
import pytest

@remote()
def divide(a, b):
    return a/b

# the underlying job is completed successfully 
# and the function return is loaded
assert divide(10, 5) == 2

# the underlying job fails with "AlgorithmError" 
# and the function exception is loaded into local memory 
with pytest.raises(ZeroDivisionError):
    divide(10, 0)
```

**Note**  
The decorated function is run as a remote job. If the thread is interrupted, the underlying job will not be stopped.

### How to change the value of a local variable
<a name="train-remote-decorator-invocation-decorator-value"></a>

The decorated function is run on a remote machine. Changing a non-local variable or an input argument inside a decorated function does not change the local value.

In the following code example, a list and a dict are modified inside decorated functions. The local values do not change when the decorated functions are invoked.

```
a = []

@remote
def func():
    a.append(1)

# when func is invoked, a in the local memory is not modified        
func() 
func()

# a stays as []
    
a = {}
@remote
def func(a):
    # append new values to the input dictionary
    a["key-2"] = "value-2"
    
a = {"key": "value"}
func(a)

# a stays as {"key": "value"}
```

To change the value of a local variable declared inside of a decorator function, return the variable from the function. The following code example shows that the value of a local variable is changed when it is returned from the function.

```
a = {"key-1": "value-1"}

@remote
def func(a):
    a["key-2"] = "value-2"
    return a

a = func(a)

-> {"key-1": "value-1", "key-2": "value-2"}
```

### Data serialization and deserialization
<a name="train-remote-decorator-invocation-input-output"></a>

When you invoke a remote function, SageMaker AI automatically serializes your function arguments during the input and output stages. Function arguments and returns are serialized using [cloudpickle](https://github.com/cloudpipe/cloudpickle). SageMaker AI supports serializing the following Python objects and functions. 
+ Built-in Python objects including dicts, lists, floats, ints, strings, Boolean values, and tuples
+ NumPy arrays
+ Pandas DataFrames
+ Scikit-learn datasets and estimators
+ PyTorch models
+ TensorFlow models
+ The Booster class for XGBoost
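
The built-in types in the previous list round-trip through serialization without loss. As a minimal local illustration, the following sketch uses the standard library's `pickle` module (which cloudpickle extends) to serialize and restore a dictionary of arguments like those a remote function might receive. The payload contents here are placeholder values.

```python
import pickle

# Built-in Python objects like those listed above survive a
# serialize/deserialize round trip unchanged.
payload = {
    "hyperparameters": {"learning_rate": 0.1, "epochs": 10},
    "feature_names": ["age", "income"],
    "split_ratio": (0.7, 0.3),
    "shuffle": True,
}

blob = pickle.dumps(payload)    # what happens to your arguments on submission
restored = pickle.loads(blob)   # what happens on the training instance

assert restored == payload
```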

The following can be used with some limitations.
+ Dask DataFrames
+ The XGBoost DMatrix class
+ TensorFlow datasets and subclasses
+ PyTorch models

The following sections contain best practices for using these Python classes in your remote function, information about where SageMaker AI stores your serialized data, and how to manage access to it.

#### Best practices for Python classes with limited support for remote data serialization
<a name="train-remote-decorator-invocation-input-output-bestprac"></a>

You can use the Python classes listed in this section with limitations. The next sections discuss best practices for how to use the following Python classes.
+ [Dask](https://www.dask.org/) DataFrames
+ The XGBoost DMatrix class
+ TensorFlow datasets and subclasses
+ PyTorch models

##### Best practices for Dask
<a name="train-remote-decorator-invocation-input-output-bestprac-dask"></a>

[Dask](https://www.dask.org/) is an open-source library used for parallel computing in Python. This section shows the following.
+ How to pass a Dask DataFrame into your remote function
+ How to convert summary statistics from a Dask DataFrame into a Pandas DataFrame

##### How to pass a Dask DataFrame into your remote function
<a name="train-remote-decorator-invocation-input-output-bestprac-dask-pass"></a>

[Dask DataFrames](https://docs.dask.org/en/latest/dataframe.html) are often used to process large datasets, because they can hold datasets that require more memory than is available. This is because a Dask DataFrame does not load your local data into memory. If you pass a Dask DataFrame as an argument to your remote function, Dask may pass a reference to the data on your local disk or in cloud storage, instead of the data itself. The following code shows an example of passing a Dask DataFrame into your remote function; the remote function would operate on an empty DataFrame.

```
# Do not pass a Dask DataFrame to your remote function as follows
def clean(df: dask.DataFrame):
    cleaned = df[...]
```

Dask loads the data from a Dask DataFrame into memory only when you use the DataFrame. If you want to use a Dask DataFrame inside a remote function, provide the path to the data instead. Then Dask reads the dataset directly from the data path that you specify when the code runs.

The following code example shows how to use a Dask DataFrame inside the remote function `clean`. In the code example, `raw_data_path` is passed to `clean` instead of the Dask DataFrame. When the code runs, the dataset is read directly from the Amazon S3 location specified in `raw_data_path`. The `persist` function then keeps the dataset in memory to facilitate the subsequent `random_split` operation, and the results are written back to the output data path in an S3 bucket using Dask DataFrame API functions.

```
import os

import dask.dataframe as dd

@remote(
   instance_type='ml.m5.24xlarge',
   volume_size=300, 
   keep_alive_period_in_seconds=600)
#pass the data path to your remote function rather than the Dask DataFrame itself
def clean(raw_data_path: str, output_data_path: str, split_ratio: list[float]):
    df = dd.read_parquet(raw_data_path) #pass the path to your DataFrame
    cleaned = df[(df.column_a >= 1) & (df.column_a < 5)]\
        .drop(['column_b', 'column_c'], axis=1)\
        .persist() #keep the data in memory to facilitate the following random_split operation

    train_df, test_df = cleaned.random_split(split_ratio, random_state=10)

    train_df.to_parquet(os.path.join(output_data_path, 'train'))
    test_df.to_parquet(os.path.join(output_data_path, 'test'))
    
clean("s3://amzn-s3-demo-bucket/raw/", "s3://amzn-s3-demo-bucket/cleaned/", split_ratio=[0.7, 0.3])
```

##### How to convert summary statistics from a Dask DataFrame into a Pandas DataFrame
<a name="train-remote-decorator-invocation-input-output-bestprac-dask-pd"></a>

Summary statistics from a Dask DataFrame can be converted into a Pandas DataFrame by invoking the `compute` method, as shown in the following example code. In the example, the S3 bucket contains a large Dask DataFrame that cannot fit into memory or into a Pandas DataFrame. The remote function scans the dataset and returns the output statistics from `describe`, computed as a Pandas DataFrame.

```
import dask.dataframe as dd
from sagemaker.remote_function import RemoteExecutor

executor = RemoteExecutor(
    instance_type='ml.m5.24xlarge',
    volume_size=300, 
    keep_alive_period_in_seconds=600)

#compute() converts the Dask describe() output into a Pandas DataFrame
future = executor.submit(lambda: dd.read_parquet("s3://amzn-s3-demo-bucket/raw/").describe().compute())

future.result()
```

##### Best practices for the XGBoost DMatrix class
<a name="train-remote-decorator-invocation-input-output-bestprac-xgboost"></a>

DMatrix is an internal data structure that XGBoost uses to load data. A DMatrix object can't be pickled, so it can't be moved between compute sessions. Passing DMatrix instances directly fails with a `SerializationError`.

##### How to pass a data object to your remote function and train with XGBoost
<a name="train-remote-decorator-invocation-input-output-bestprac-xgboost-pass"></a>

Instead of passing a DMatrix instance, pass a Pandas DataFrame directly to the remote function and convert it into a DMatrix inside the function, as shown in the following code example.

```
import xgboost as xgb

@remote
def train(df, params):
    #Convert a Pandas DataFrame into a DMatrix inside the remote function and use it for training
    dtrain = xgb.DMatrix(df)
    return xgb.train(params, dtrain)
```

##### Best practices for TensorFlow datasets and subclasses
<a name="train-remote-decorator-invocation-input-output-bestprac-tf"></a>

TensorFlow datasets and their subclasses are internal objects that TensorFlow uses to load data during training. They can't be pickled, so they can't be moved between compute sessions. Passing TensorFlow datasets or their subclasses directly fails with a `SerializationError`. Use the TensorFlow I/O APIs to load data from storage, as shown in the following code example.

```
import tensorflow as tf
import tensorflow_io as tfio

@remote
def train(data_path: str, params):
    
    dataset = tf.data.TextLineDataset(tf.data.Dataset.list_files(f"{data_path}/*.txt"))
    ...
    
train("s3://amzn-s3-demo-bucket/data", {})
```

##### Best practices for PyTorch models
<a name="train-remote-decorator-invocation-input-output-bestprac-pytorch"></a>

PyTorch models are serializable and can be passed between your local environment and remote function. If your local environment and remote environment have different device types (such as GPUs and CPUs), you cannot return a trained model to your local environment. For example, if the following code is developed in a local environment without GPUs but run on an instance with GPUs, returning the trained model directly leads to a `DeserializationError`.

```
# Do not return a model trained on GPUs to a CPU-only environment as follows

@remote(instance_type='ml.g4dn.xlarge')
def train(...):
    if torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu") # a device without GPU capabilities
    
    model = Net().to(device)
    
    # train the model
    ...
    
    return model
    
model = train(...) #raises a DeserializationError when the GPU-trained model is deserialized in a CPU-only environment
```

To return a model trained in a GPU environment to one that contains only CPU capabilities, use the PyTorch model I/O APIs directly as shown in the code example below.

```
import os

import s3fs
import torch

model_path = "s3://amzn-s3-demo-bucket/folder/"

@remote(instance_type='ml.g4dn.xlarge')
def train(...):
    if torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")
    
    model = Net().to(device)
    
    # train the model
    ...
    
    fs = s3fs.S3FileSystem()
    with fs.open(os.path.join(model_path, 'model.pt'), 'wb') as file:
        torch.save(model.state_dict(), file) #this writes the model in a device-agnostic way (CPU vs GPU)
    
train(...) #train the model on either CPUs or GPUs

# load the model back into the local (CPU-only) environment
model = Net()
fs = s3fs.S3FileSystem()
with fs.open(os.path.join(model_path, 'model.pt'), 'rb') as file:
    model.load_state_dict(torch.load(file, map_location=torch.device('cpu')))
```

#### Where SageMaker AI stores your serialized data
<a name="train-remote-decorator-invocation-input-output-storage"></a>

When you invoke a remote function, SageMaker AI automatically serializes your function arguments and return values during the input and output stages. This serialized data is stored under a root directory in your S3 bucket. You specify the root directory, `<s3_root_uri>`, in a configuration file. The parameter `job_name` is automatically generated for you. 

Under the root directory, SageMaker AI creates a `<job_name>` folder, which holds your current working directory, your serialized function, the arguments for your serialized function, results, and any exceptions that arose from invoking the serialized function.

Under `<job_name>`, the directory `workdir` contains a zipped archive of your current working directory. The zipped archive includes any Python files in your working directory and the `requirements.txt` file, which specifies any dependencies needed to run your remote function.

The following is an example of the folder structure under an S3 bucket that you specify in your configuration file. 

```
<s3_root_uri>/ # specified by s3_root_uri or S3RootUri
    <job_name>/ #automatically generated for you
        workdir/workspace.zip # archive of the current working directory (workdir)
        function/ # serialized function
        arguments/ # serialized function arguments
        results/ # returned output from the serialized function including the model
        exception/ # any exceptions from invoking the serialized function
```
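
As an illustration of this layout, the following sketch composes the S3 URIs for a hypothetical job. The bucket name and job name here are placeholder values; in practice, `job_name` is generated for you.

```python
from posixpath import join  # S3 keys use forward slashes

s3_root_uri = "s3://amzn-s3-demo-bucket/remote-functions"  # from your configuration file
job_name = "divide-2023-08-01-12-00-00-000"                # hypothetical; generated by SageMaker AI

artifacts = {
    "workspace": join(s3_root_uri, job_name, "workdir", "workspace.zip"),
    "function": join(s3_root_uri, job_name, "function"),
    "arguments": join(s3_root_uri, job_name, "arguments"),
    "results": join(s3_root_uri, job_name, "results"),
    "exception": join(s3_root_uri, job_name, "exception"),
}
```

For example, `artifacts["results"]` resolves to `s3://amzn-s3-demo-bucket/remote-functions/divide-2023-08-01-12-00-00-000/results`.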

The root directory that you specify in your S3 bucket is not meant for long term storage. The serialized data are tightly tied to the Python version and machine learning (ML) framework version that were used during serialization. If you upgrade the Python version or ML framework, you may not be able to use your serialized data. Instead, do the following.
+ Store your model and model artifacts in a format that is agnostic to your Python version and ML framework.
+ If you upgrade your Python or ML framework, access your model results from your long-term storage.

**Important**  
To delete your serialized data after a specified amount of time, set a [lifecycle configuration](https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html) on your S3 bucket.

**Note**  
Files that are serialized with the Python [pickle](https://docs.python.org/3/library/pickle.html) module can be less portable than other data formats including CSV, Parquet and JSON. Be wary of loading pickled files from unknown sources.

For more information about what to include in a configuration file for a remote function, see [Configuration File](https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator-config.html).

#### Access to your serialized data
<a name="train-remote-decorator-invocation-input-output-access"></a>

Administrators can provide settings for your serialized data, including its location and any encryption settings in a configuration file. By default, the serialized data are encrypted with an AWS Key Management Service (AWS KMS) Key. Administrators can also restrict access to the root directory that you specify in your configuration file with a [bucket policy](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-bucket-policies.html). The configuration file can be shared and used across projects and jobs. For more information, see [Configuration File](https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator-config.html).

## Use the `RemoteExecutor` API to invoke a function
<a name="train-remote-decorator-invocation-api"></a>

You can use the `RemoteExecutor` API to invoke a function. The SageMaker Python SDK transforms the code inside the `RemoteExecutor` call into a SageMaker training job. The training job then invokes the function as an asynchronous operation and returns a future. If you use the `RemoteExecutor` API, you can run more than one training job in parallel. For more information about futures in Python, see [Futures](https://docs.python.org/3/library/asyncio-future.html).

The following code example shows how to import the required libraries, define a function, and use the `RemoteExecutor` API to submit a job that runs the function, with up to `2` jobs allowed in parallel.

```
import numpy as np

from sagemaker.remote_function import RemoteExecutor

def matrix_multiply(a, b):
    return np.matmul(a, b)


a = np.array([[1, 0],
             [0, 1]])
b = np.array([1, 2])

with RemoteExecutor(max_parallel_job=2, instance_type="ml.m5.large") as e:
    future = e.submit(matrix_multiply, a, b)

assert (future.result() == np.array([1,2])).all()
```

The `RemoteExecutor` class is an implementation of the [concurrent.futures.Executor](https://docs.python.org/3/library/concurrent.futures.html) abstract class.

The following code example shows how to define a function and call it using the `RemoteExecutor` API. In this example, the `RemoteExecutor` submits `4` jobs in total, but only `2` in parallel. The last two jobs reuse the clusters with minimal overhead.

```
from sagemaker.remote_function.client import RemoteExecutor

def divide(a, b):
    return a/b 

with RemoteExecutor(max_parallel_job=2, keep_alive_period_in_seconds=60) as e:
    futures = [e.submit(divide, a, 2) for a in [3, 5, 7, 9]]

for future in futures:
    print(future.result())
```

The `max_parallel_job` parameter serves only as a rate-limiting mechanism; it does not optimize compute resource allocation. In the previous code example, `RemoteExecutor` doesn't reserve compute resources for the two parallel jobs before any jobs are submitted. For more information about `max_parallel_job` and other parameters for the `RemoteExecutor` API, see [Remote function classes and methods specification](https://sagemaker.readthedocs.io/en/stable/remote_function/sagemaker.remote_function.html).

### Future class for the `RemoteExecutor` API
<a name="train-remote-decorator-invocation-api-future"></a>

The `Future` class is a public class that represents the return value of the function that the training job runs when it is invoked asynchronously. The `Future` class implements the [concurrent.futures.Future](https://docs.python.org/3/library/concurrent.futures.html) class. You can use this class to perform operations on the underlying job and load data into memory.
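Because the SageMaker `Future` follows the standard-library semantics, the submit-and-wait pattern can be sketched locally with `concurrent.futures` (here a thread pool stands in for the remote training cluster):

```
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# submit() returns a future immediately; result() blocks until the
# underlying job completes and the return value is loaded into memory
with ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(square, 6)
    result = future.result()

assert result == 36
assert future.done()
```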

# Configuration file
<a name="train-remote-decorator-config"></a>

The Amazon SageMaker Python SDK supports setting default values for AWS infrastructure primitives. After administrators configure these defaults, they are automatically passed when the SageMaker Python SDK calls supported APIs. You can put the arguments for the decorator function inside configuration files, so that settings related to the infrastructure stay separate from your code base. For more information about parameters and arguments for the remote function and methods, see [Remote function classes and methods specification](https://sagemaker.readthedocs.io/en/stable/remote_function/sagemaker.remote_function.html).

You can set infrastructure settings for the network configuration, IAM roles, Amazon S3 folder for input, output data, and tags inside the configuration file. The configuration file can be used when invoking a function using either the @remote decorator or the `RemoteExecutor` API.

The following example configuration file defines dependencies, resources, and other arguments. It applies to functions invoked with either the @remote decorator or the `RemoteExecutor` API.

```
SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        Dependencies: 'path/to/requirements.txt'
        EnableInterContainerTrafficEncryption: true
        EnvironmentVariables: {'EnvVarKey': 'EnvVarValue'}
        ImageUri: '366666666666.dkr.ecr.us-west-2.amazonaws.com/my-image:latest'
        IncludeLocalWorkDir: true
        CustomFileFilter: 
          IgnoreNamePatterns:
          - "*.ipynb"
          - "data"
        InstanceType: 'ml.m5.large'
        JobCondaEnvironment: 'your_conda_env'
        PreExecutionCommands:
            - 'command_1'
            - 'command_2'
        PreExecutionScript: 'path/to/script.sh'
        RoleArn: 'arn:aws:iam::366666666666:role/MyRole'
        S3KmsKeyId: 'yourkmskeyid'
        S3RootUri: 's3://amzn-s3-demo-bucket/my-project'
        VpcConfig:
            SecurityGroupIds: 
            - 'sg123'
            Subnets: 
            - 'subnet-1234'
        Tags: [{'Key': 'yourTagKey', 'Value':'yourTagValue'}]
        VolumeKmsKeyId: 'yourkmskeyid'
```

The @remote decorator and `RemoteExecutor` will look for `Dependencies` in the following configuration files:
+ An admin-defined configuration file.
+ A user-defined configuration file.

The default locations for these configuration files depend on, and are relative to, your environment. The following code example returns the default location of your admin and user configuration files. These commands must be run in the same environment where you're using the SageMaker Python SDK.

```
import os
from platformdirs import site_config_dir, user_config_dir

#Prints the location of the admin config file
print(os.path.join(site_config_dir("sagemaker"), "config.yaml"))

#Prints the location of the user config file
print(os.path.join(user_config_dir("sagemaker"), "config.yaml"))
```

You can override the default locations of these files by setting the `SAGEMAKER_ADMIN_CONFIG_OVERRIDE` and `SAGEMAKER_USER_CONFIG_OVERRIDE` environment variables for the admin-defined and user-defined configuration file paths, respectively. 

If a key exists in both the admin-defined and user-defined configuration files, the value in the user-defined file will be used.
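The precedence behaves like a dictionary update in which user-defined keys win. The following sketch uses hypothetical keys to illustrate the merge:

```
# Admin-defined defaults and user-defined overrides (hypothetical keys)
admin_config = {"InstanceType": "ml.m5.large", "VolumeKmsKeyId": "admin-key"}
user_config = {"InstanceType": "ml.m5.xlarge"}

# Keys in the user config override the admin config;
# admin keys absent from the user config are kept
effective_config = {**admin_config, **user_config}

assert effective_config["InstanceType"] == "ml.m5.xlarge"
assert effective_config["VolumeKmsKeyId"] == "admin-key"
```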

# Customize your runtime environment
<a name="train-remote-decorator-customize"></a>

You can customize your runtime environment to use your preferred local integrated development environments (IDEs), SageMaker notebooks, or SageMaker Studio Classic notebooks to write your ML code. SageMaker AI packages and submits your function and its dependencies as a SageMaker training job, giving you access to the capacity of the SageMaker training platform to run your training jobs.

Both ways of invoking a function, the @remote decorator and the `RemoteExecutor` API, let you define and customize your runtime environment. You can use either a `requirements.txt` file or a conda environment YAML file.

To customize a runtime environment using either a conda environment YAML file or a `requirements.txt` file, refer to the following code example.

```
import numpy as np
from sagemaker.remote_function import remote

# specify a conda environment inside a yaml file
@remote(instance_type="ml.m5.large",
        image_uri = "my_base_python:latest", 
        dependencies = "./environment.yml")
def matrix_multiply(a, b):
    return np.matmul(a, b)

# use a requirements.txt file to import dependencies
@remote(instance_type="ml.m5.large",
        image_uri = "my_base_python:latest", 
        dependencies = './requirements.txt')
def matrix_multiply(a, b):
    return np.matmul(a, b)
```

Alternatively, you can set `dependencies` to `auto_capture` to let the SageMaker Python SDK capture the installed dependencies in the active conda environment. The following are required for `auto_capture` to work reliably:
+ You must have an active conda environment. We recommend not using the `base` conda environment for remote jobs so that you can reduce potential dependency conflicts. Not using the `base` conda environment also allows for faster environment setup in the remote job.
+ You must not have any dependencies installed using pip with a value for the parameter `--extra-index-url`.
+ You must not have any dependency conflicts between packages installed with conda and packages installed with pip in the local development environment.
+ Your local development environment must not contain operating system-specific dependencies that are not compatible with Linux.

If `auto_capture` does not work, we recommend that you pass in your dependencies as a `requirements.txt` or conda `environment.yml` file, as described in the first code example in this section.

# Container image compatibility
<a name="train-remote-decorator-container"></a>

The following table shows a list of SageMaker training images that are compatible with the @remote decorator.


| Name | Python Version | Image URI - CPU | Image URI - GPU | 
| --- | --- | --- | --- | 
|  Data Science  |  3.7(py37)  |  For SageMaker Studio Classic Notebooks only. Python SDK automatically selects the image URI when used as SageMaker Studio Classic Notebook kernel image.  |  For SageMaker Studio Classic Notebooks only. Python SDK automatically selects the image URI when used as SageMaker Studio Classic Notebook kernel image.  | 
|  Data Science 2.0  |  3.8(py38)  |  For SageMaker Studio Classic Notebooks only. Python SDK automatically selects the image URI when used as SageMaker Studio Classic Notebook kernel image.  |  For SageMaker Studio Classic Notebooks only. Python SDK automatically selects the image URI when used as SageMaker Studio Classic Notebook kernel image.  | 
|  Data Science 3.0  |  3.10(py310)  |  For SageMaker Studio Classic Notebooks only. Python SDK automatically selects the image URI when used as SageMaker Studio Classic Notebook kernel image.  |  For SageMaker Studio Classic Notebooks only. Python SDK automatically selects the image URI when used as SageMaker Studio Classic Notebook kernel image.  | 
|  Base Python 2.0  |  3.8(py38)  |  The Python SDK selects this image when it detects that the development environment is using the Python 3.8 runtime. Otherwise, the Python SDK automatically selects this image when it is used as a SageMaker Studio Classic Notebook kernel image.  |  For SageMaker Studio Classic Notebooks only. Python SDK automatically selects the image URI when used as SageMaker Studio Classic Notebook kernel image.  | 
|  Base Python 3.0  |  3.10(py310)  |  The Python SDK selects this image when it detects that the development environment is using the Python 3.10 runtime. Otherwise, the Python SDK automatically selects this image when it is used as a SageMaker Studio Classic Notebook kernel image.  |  For SageMaker Studio Classic Notebooks only. Python SDK automatically selects the image URI when used as Studio Classic Notebook kernel image.  | 
|  DLC-TensorFlow 2.12.0 for SageMaker Training  |  3.10(py310)  |  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.12.0-cpu-py310-ubuntu20.04-sagemaker  |  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.12.0-gpu-py310-cu118-ubuntu20.04-sagemaker  | 
|  DLC-Tensorflow 2.11.0 for SageMaker training  |  3.9(py39)  |  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.11.0-cpu-py39-ubuntu20.04-sagemaker  |  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.11.0-gpu-py39-cu112-ubuntu20.04-sagemaker  | 
|  DLC-TensorFlow 2.10.1 for SageMaker training  |  3.9(py39)  |  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.10.1-cpu-py39-ubuntu20.04-sagemaker  |  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.10.1-gpu-py39-cu112-ubuntu20.04-sagemaker  | 
|  DLC-TensorFlow 2.9.2 for SageMaker training  |  3.9(py39)  |  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.9.2-cpu-py39-ubuntu20.04-sagemaker  |  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.9.2-gpu-py39-cu112-ubuntu20.04-sagemaker  | 
|  DLC-TensorFlow 2.8.3 for SageMaker training  |  3.9(py39)  |  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.8.3-cpu-py39-ubuntu20.04-sagemaker  |  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.8.3-gpu-py39-cu112-ubuntu20.04-sagemaker  | 
|  DLC-PyTorch 2.0.0 for SageMaker training  |  3.10(py310)  |  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-cpu-py310-ubuntu20.04-sagemaker  |  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker  | 
|  DLC-PyTorch 1.13.1 for SageMaker training  |  3.9(py39)  |  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-cpu-py39-ubuntu20.04-sagemaker  |  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker  | 
|  DLC-PyTorch 1.12.1 for SageMaker training  |  3.8(py38)  |  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-cpu-py38-ubuntu20.04-sagemaker  |  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker  | 
|  DLC-PyTorch 1.11.0 for SageMaker training  |  3.8(py38)  |  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-cpu-py38-ubuntu20.04-sagemaker  |  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker  | 
|  DLC-MXNet 1.9.0 for SageMaker training  |  3.8(py38)  |  763104351884.dkr.ecr.<region>.amazonaws.com/mxnet-training:1.9.0-cpu-py38-ubuntu20.04-sagemaker  |  763104351884.dkr.ecr.<region>.amazonaws.com/mxnet-training:1.9.0-gpu-py38-cu112-ubuntu20.04-sagemaker  | 

**Note**  
To run jobs locally using AWS Deep Learning Containers (DLC) images, use the image URIs found in the [DLC documentation](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). The DLC images do not support the `auto_capture` value for dependencies.  
Jobs with [SageMaker AI Distribution in SageMaker Studio](https://github.com/aws/sagemaker-distribution#amazon-sagemaker-studio) run in a container as a non-root user named `sagemaker-user`. This user needs full permission to access `/opt/ml` and `/tmp`. Grant this permission by adding `sudo chmod -R 777 /opt/ml /tmp` to the `pre_execution_commands` list, as shown in the following snippet:  

```
@remote(pre_execution_commands=["sudo chmod -R 777 /opt/ml /tmp"])
def func():
    pass
```

You can also run remote functions with your custom images. For compatibility with remote functions, custom images should be built with Python version 3.7.x-3.10.x. The following is a minimal Dockerfile example showing you how to use a Docker image with Python 3.10.

```
FROM python:3.10

#... Rest of the Dockerfile
```

To create `conda` environments in your image and use it to run jobs, set the environment variable `SAGEMAKER_JOB_CONDA_ENV` to the `conda` environment name. If your image has the `SAGEMAKER_JOB_CONDA_ENV` value set, the remote function cannot create a new conda environment during the training job runtime. Refer to the following Dockerfile example that uses a `conda` environment with Python version 3.10.

```
FROM continuumio/miniconda3:4.12.0  

ENV SHELL=/bin/bash \
    CONDA_DIR=/opt/conda \
    SAGEMAKER_JOB_CONDA_ENV=sagemaker-job-env

RUN conda create -n $SAGEMAKER_JOB_CONDA_ENV -y \
    && conda install -n $SAGEMAKER_JOB_CONDA_ENV python=3.10 -y \
    && conda clean --all -f -y
```

For SageMaker AI to use [mamba](https://mamba.readthedocs.io/en/latest/user_guide/mamba.html) to manage your Python virtual environment in the container image, install the [mamba toolkit from miniforge](https://github.com/conda-forge/miniforge). To use mamba, add the following code example to your Dockerfile. SageMaker AI then detects that `mamba` is available at runtime and uses it instead of `conda`.

```
#Mamba Installation
RUN curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh" \
    && bash Mambaforge-Linux-x86_64.sh -b -p "/opt/conda"  \
    && /opt/conda/bin/conda init bash
```

Using a custom conda channel on an Amazon S3 bucket is not compatible with mamba when using a remote function. If you choose to use mamba, make sure you are not using a custom conda channel on Amazon S3. For more information, see the **Prerequisites** section under **Custom conda repository using Amazon S3**.

The following is a complete Dockerfile example showing how to create a compatible Docker image.

```
FROM python:3.10

RUN apt-get update -y \
    # Needed for awscli to work
    # See: https://github.com/aws/aws-cli/issues/1957#issuecomment-687455928
    && apt-get install -y groff unzip curl \
    && pip install --upgrade \
        'boto3>1.0,<2' \
        'awscli>1.0,<2' \
        'ipykernel>6.0.0,<7.0.0' \
    # Use ipykernel with the --sys-prefix flag so that the absolute path
    # /usr/local/share/jupyter/kernels/python3/kernel.json is used
    # in the kernelspec.json file
    && python -m ipykernel install --sys-prefix

#Install Mamba
RUN curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh" \
    && bash Mambaforge-Linux-x86_64.sh -b -p "/opt/conda"  \
    && /opt/conda/bin/conda init bash

#cleanup
RUN apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf ${HOME}/.cache/pip \
    && rm Mambaforge-Linux-x86_64.sh

ENV SHELL=/bin/bash \
    PATH=$PATH:/opt/conda/bin
```

 The resulting image from running the previous Dockerfile example can also be used as a [SageMaker Studio Classic kernel image](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-byoi.html).

# Logging parameters and metrics with Amazon SageMaker Experiments
<a name="train-remote-decorator-experiments"></a>

This guide shows how to log parameters and metrics with Amazon SageMaker Experiments. A SageMaker AI experiment consists of runs, and each run consists of all the inputs, parameters, configurations, and results for a single model training iteration. 

You can log parameters and metrics from a remote function using either the @remote decorator or the `RemoteExecutor` API. 

To log parameters and metrics from a remote function, choose one of the following methods:
+ Instantiate a SageMaker AI experiment run inside a remote function using `Run` from the SageMaker Experiments library. For more information, see [Create an Amazon SageMaker AI Experiment](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments-create.html).
+ Use the `load_run` function inside a remote function from the SageMaker AI Experiments library. This will load a `Run` instance that is declared outside of the remote function.

The following sections show how to create and track lineage with SageMaker AI experiment runs by using the previously listed methods. The sections also describe cases that are not supported by SageMaker training.

## Use the @remote decorator to integrate with SageMaker Experiments
<a name="train-remote-decorator-experiments-remote"></a>

You can either instantiate an experiment in SageMaker AI, or load a current SageMaker AI experiment, from inside a remote function. The following sections show you how to use either method. 

### Create an experiment with SageMaker Experiments
<a name="train-remote-decorator-experiments-remote-create"></a>

You can create a SageMaker AI experiment run inside a remote function. To do this, pass your experiment name, run name, and other parameters into your remote function.

The following code example passes in the name of your experiment, the name of the run, and the parameters to log during each run. The parameters `param_1` and `param_2` are logged once per run; common parameters may include batch size or epochs. The metrics `metric_a` and `metric_b` are logged for a run over time inside a training loop. Other common metrics may include `accuracy` or `loss`. 

```
from sagemaker.remote_function import remote
from sagemaker.experiments.run import Run

# Define your remote function
@remote
def train(value_1, value_2, exp_name, run_name):
    ...
    ...
    #Creates the experiment
    with Run(
        experiment_name=exp_name,
        run_name=run_name,
    ) as run:
        ...
        #Define values for the parameters to log
        run.log_parameter("param_1", value_1)
        run.log_parameter("param_2", value_2) 
        ...
        #Define metrics to log
        run.log_metric("metric_a", 0.5)
        run.log_metric("metric_b", 0.1)


# Invoke your remote function        
train(1.0, 2.0, "my-exp-name", "my-run-name")
```

### Load current SageMaker Experiments with a job initiated by the @remote decorator
<a name="train-remote-decorator-experiments-remote-current"></a>

Use the `load_run()` function from the SageMaker Experiments library to load the current run object from the run context. You can also use the `load_run()` function within your remote function. It loads the run object that was initialized locally by the `with` statement, as shown in the following code example.

```
from sagemaker.remote_function import remote
from sagemaker.experiments.run import Run, load_run

# Define your remote function
@remote
def train(value_1, value_2):
    ...
    ...
    with load_run() as run:
        run.log_metric("metric_a", value_1)
        run.log_metric("metric_b", value_2)


# Invoke your remote function
with Run(
    experiment_name="my-exp-name",
    run_name="my-run-name",
) as run:
    train(0.5, 1.0)
```

## Load a current experiment run within a job initiated with the `RemoteExecutor` API
<a name="train-remote-decorator-experiments-api"></a>

You can also load a current SageMaker AI experiment run if your jobs were initiated with the `RemoteExecutor` API. The following code example shows how to use the `RemoteExecutor` API with the SageMaker Experiments `load_run` function to load a current experiment run and capture metrics in the job submitted by `RemoteExecutor`.

```
from sagemaker.remote_function import RemoteExecutor
from sagemaker.experiments.run import Run, load_run

def square(x):
    with load_run() as run:
        result = x * x
        run.log_metric("result", result)
    return result


with RemoteExecutor(
    max_parallel_job=2,
    instance_type="ml.m5.large"
) as e:
    with Run(
        experiment_name="my-exp-name",
        run_name="my-run-name",
    ):
        future_1 = e.submit(square, 2)
```

## Unsupported uses for SageMaker Experiments while annotating your code with an @remote decorator
<a name="train-remote-decorator-experiments-unsupported"></a>

SageMaker AI does not support passing a `Run` type object to an @remote function or using global `Run` objects. The following examples show code that will throw a `SerializationError`.

The following code example attempts to pass a `Run` type object to an @remote decorator, and it generates an error.

```
@remote
def func(run: Run):
    run.log_metric("metric_a", 1.0)
    
with Run(...) as run:
    func(run) ---> SerializationError caused by NotImplementedError
```

The following code example attempts to use a global `run` object instantiated outside of the remote function. In the code example, the `train()` function is defined inside the `with Run` context, referencing a global run object from within. When `train()` is called, it generates an error.

```
with Run(...) as run:
    @remote
    def train(metric_1, value_1, metric_2, value_2):
        run.log_parameter(metric_1, value_1)
        run.log_parameter(metric_2, value_2)
    
    train("p1", 1.0, "p2", 0.5) ---> SerializationError caused by NotImplementedError
```

# Using modular code with the @remote decorator
<a name="train-remote-decorator-modular"></a>

You can organize your code into modules for ease of workspace management during development and still use the @remote decorator to invoke a function. You can also replicate the local modules from your development environment to the remote job environment. To do so, set the parameter `include_local_workdir` to `True`, as shown in the following code example.

```
@remote(
  include_local_workdir=True,
)
```

**Note**  
The @remote decorator and parameter must appear in the main file, rather than in any of the dependent files.

When `include_local_workdir` is set to `True`, SageMaker AI packages all of the Python scripts while maintaining the directory structure in the process' current directory. It also makes the dependencies available in the job's working directory.

For example, suppose that your Python script that processes the MNIST dataset is divided into a `main.py` script and a dependent `pytorch_mnist.py` script, where `main.py` calls the dependent script. The `main.py` script contains code to import the dependency as shown in the following example.

```
from mnist_impl.pytorch_mnist import ...
```

The `main.py` file must also contain the `@remote` decorator, and it must set the `include_local_workdir` parameter to `True`.

The `include_local_workdir` parameter includes all Python scripts in the directory by default. You can customize which files are uploaded to the job by using this parameter in conjunction with the `custom_file_filter` parameter. You can either pass a function that filters job dependencies to be uploaded to S3, or a `CustomFileFilter` object that specifies the local directories and files to ignore in the remote function. You can use `custom_file_filter` only if `include_local_workdir` is set to `True`; otherwise, the parameter is ignored.

The following example uses `CustomFileFilter` to ignore all notebook files and folders or files named `data` when uploading files to S3.

```
@remote(
   include_local_workdir=True,
   custom_file_filter=CustomFileFilter(
      ignore_name_patterns=[ # files or directories to ignore
        "*.ipynb", # all notebook files
        "data", # folder or file named data
      ]
   )
)
```

The following example demonstrates how you can package an entire workspace.

```
@remote(
   include_local_workdir=True,
   custom_file_filter=CustomFileFilter(
      ignore_name_patterns=[] # package whole workspace
   )
)
```

The following example shows how you can use a function to filter files.

```
from typing import List

def my_filter(path: str, files: List[str]) -> List[str]:
    to_ignore = []
    for file in files:
        if file.endswith(".txt") or file.endswith(".ipynb"):
            to_ignore.append(file)
    return to_ignore

@remote(
   include_local_workdir=True,
   custom_file_filter=my_filter
)
```
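The filter callback receives a directory path and the file names in that directory, and returns the names to ignore, similar to the `ignore` callable accepted by `shutil.copytree`. Because it is a plain function, you can exercise it locally with hypothetical file names before using it in the decorator:

```
from typing import List

def my_filter(path: str, files: List[str]) -> List[str]:
    # Return the names in this directory that should not be uploaded
    return [f for f in files if f.endswith((".txt", ".ipynb"))]

ignored = my_filter(".", ["train.py", "notes.txt", "explore.ipynb"])
assert ignored == ["notes.txt", "explore.ipynb"]
```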

## Best practices in structuring your working directory
<a name="train-remote-decorator-modular-bestprac"></a>

The following best practices suggest how you can organize your directory structure while using the `@remote` decorator in your modular code.
+ Put the @remote decorator in a file that resides at the root level directory of the workspace.
+ Structure the local modules at the root level.

The following example shows the recommended directory structure. In this example structure, the `main.py` script is located at the root level directory.

```
.
├── config.yaml
├── data/
├── main.py <----------------- @remote used here 
├── mnist_impl
│ ├── __pycache__/
│ │ └── pytorch_mnist.cpython-310.pyc
│ ├── pytorch_mnist.py <-------- dependency of main.py
├── requirements.txt
```

The following example shows a directory structure that results in inconsistent behavior when you use it while annotating your code with an @remote decorator. 

In this example structure, the `main.py` script that contains the @remote decorator is **not** located at the root level directory. The following structure is **NOT** recommended.

```
.
├── config.yaml
├── entrypoint
│ ├── data
│ └── main.py <----------------- @remote used here
├── mnist_impl
│ ├── __pycache__
│ │ └── pytorch_mnist.cpython-310.pyc
│ └── pytorch_mnist.py <-------- dependency of main.py
├── requirements.txt
```

# Private repository for runtime dependencies
<a name="train-remote-decorator-private"></a>

You can use pre-execution commands or a pre-execution script to configure a dependency manager, such as pip or conda, in your job environment. To maintain network isolation, use either of these options to redirect your dependency managers to your private repositories and run remote functions within a VPC. The pre-execution commands or script run before your remote function runs. You can define them with the @remote decorator, with the `RemoteExecutor` API, or in a configuration file.

The following sections show you how to access a private Python Package Index (PyPI) repository managed with AWS CodeArtifact. The sections also show how to access a custom conda channel hosted on Amazon Simple Storage Service (Amazon S3).

## How to use a custom PyPI repository managed with AWS CodeArtifact
<a name="train-remote-decorator-private-pypi"></a>

To use CodeArtifact to manage a custom PyPI repository, the following prerequisites are required:
+ Your private PyPI repository must already be created. You can use AWS CodeArtifact to create and manage your private package repositories. To learn more about CodeArtifact, see the [CodeArtifact User Guide](https://docs.aws.amazon.com/codeartifact/latest/ug/welcome.html).
+ Your VPC should have access to your CodeArtifact repository. To allow a connection from your VPC to your CodeArtifact repository, you must do the following:
  + [Create VPC endpoints for CodeArtifact](https://docs.aws.amazon.com/codeartifact/latest/ug/create-vpc-endpoints.html).
  + [Create an Amazon S3 gateway endpoint](https://docs.aws.amazon.com/codeartifact/latest/ug/create-s3-gateway-endpoint.html) for your VPC, which allows CodeArtifact to store package assets.

The following pre-execution command example shows how to configure pip in the SageMaker AI training job to point to your CodeArtifact repository. For more information, see [Configure and use pip with CodeArtifact](https://docs.aws.amazon.com/codeartifact/latest/ug/python-configure-pip.html).

```
import numpy as np
from sagemaker.remote_function import remote

# use a requirements.txt file to import dependencies
@remote(
    instance_type="ml.m5.large",
    image_uri = "my_base_python:latest", 
    dependencies = './requirements.txt',
    pre_execution_commands=[
        "aws codeartifact login --tool pip --domain my-org --domain-owner <000000000000> --repository my-codeartifact-python-repo --endpoint-url https://vpce-xxxxx.api.codeartifact.us-east-1.vpce.amazonaws.com"
    ]
)
def matrix_multiply(a, b):
    return np.matmul(a, b)
```

## How to use a custom conda channel hosted on Amazon S3
<a name="train-remote-decorator-private-conda"></a>

To use Amazon S3 to manage a custom conda repository, the following prerequisites are required:
+ Your private conda channel must already be set up in your Amazon S3 bucket, and all dependent packages must be indexed and uploaded to your Amazon S3 bucket. For instructions on how to index your conda packages, see [Creating custom channels](https://conda.io/projects/conda/en/latest/user-guide/tasks/create-custom-channels.html).
+ Your VPC should have access to the Amazon S3 bucket. For more information, see [Endpoints for Amazon S3](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints-s3.html).
+ The base conda environment in your job image should have `boto3` installed. To check your environment, enter the following in your Anaconda prompt and verify that `boto3` appears in the resulting list.

  ```
  conda list -n base
  ```
+ Your job image must have conda installed, not [mamba](https://mamba.readthedocs.io/en/latest/installation.html). To check your environment, ensure that the output of the previous command does not include `mamba`.

The following pre-execution command example shows how to configure conda in the SageMaker training job to point to your private channel on Amazon S3. The pre-execution commands remove the defaults channel and add your custom channels to a `.condarc` conda configuration file.

```
import numpy as np

from sagemaker.remote_function import remote

# specify your dependencies inside a conda yaml file
@remote(
    instance_type="ml.m5.large",
    image_uri="my_base_python:latest",
    dependencies="./environment.yml",
    pre_execution_commands=[
        "conda config --remove channels 'defaults'",
        "conda config --add channels 's3://my_bucket/my-conda-repository/conda-forge/'",
        "conda config --add channels 's3://my_bucket/my-conda-repository/main/'"
    ]
)
def matrix_multiply(a, b):
    return np.matmul(a, b)
```
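After these pre-execution commands run, the job's `.condarc` conda configuration would look approximately like the following. Because `conda config --add channels` prepends each channel, the channel added last ends up with the highest priority; the bucket and prefix names are the hypothetical ones from the example above:

```
channels:
  - s3://my_bucket/my-conda-repository/main/
  - s3://my_bucket/my-conda-repository/conda-forge/
```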

# Example notebooks
<a name="train-remote-decorator-examples"></a>

You can transform training code in an existing workspace environment, along with any associated data processing code and datasets, into a SageMaker training job. The following notebooks show you how to customize your environment, job settings, and more for an image classification problem, using the XGBoost algorithm and Hugging Face.

The [quick\_start notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-remote-function/quick_start/quick_start.ipynb) contains the following code examples:
+ How to customize your job settings with a configuration file.
+ How to invoke Python functions as jobs, asynchronously.
+ How to customize the job runtime environment by bringing in additional dependencies.
+ How to use local dependencies with the @remote function method.

The following notebooks provide additional code examples for different ML problem types and implementations.
+ To see code examples that use the @remote decorator for an image classification problem, open the [pytorch\_mnist.ipynb](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-remote-function/pytorch_mnist_sample_notebook) notebook. This classification problem recognizes handwritten digits using the Modified National Institute of Standards and Technology (MNIST) sample dataset.
+ To see code examples that use the @remote decorator for the previous image classification problem with a script, see the PyTorch MNIST sample script, [train.py](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-remote-function/pytorch_mnist_sample_script).
+ To see how the XGBoost algorithm is implemented with the @remote decorator, open the [xgboost\_abalone.ipynb](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-remote-function/xgboost_abalone) notebook.
+ To see how Hugging Face is integrated with the @remote decorator, open the [huggingface.ipynb](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-remote-function/huggingface_text_classification) notebook.

# Accelerate generative AI development using managed MLflow on Amazon SageMaker AI
<a name="mlflow"></a>

Fully managed MLflow 3.0 on Amazon SageMaker AI helps you accelerate generative AI development by making it easier to track experiments and monitor the performance of models and AI applications using a single tool.

**Generative AI development with MLflow 3.0**

As customers across industries accelerate their generative AI development, they require capabilities to track experiments, observe behavior, and evaluate performance of models and AI applications. Data scientists and developers lack tools for analyzing the performance of models and AI applications from experimentation to production, making it hard to root cause and resolve issues. Teams spend more time integrating tools than improving their models or generative AI applications.

Training or fine-tuning generative AI and machine learning is an iterative process that requires experimenting with various combinations of data, algorithms, and parameters, while observing their impact on model accuracy. The iterative nature of experimentation results in numerous model training runs and versions, making it challenging to track the best performing models and their configurations. The complexity of managing and comparing iterative training runs increases with generative AI, where experimentation involves not only fine-tuning models but also exploring creative and diverse outputs. Researchers must adjust hyperparameters, select suitable model architectures, and curate diverse datasets to optimize both the quality and creativity of the generated content. Evaluating generative AI models requires both quantitative and qualitative metrics, adding another layer of complexity to the experimentation process. Experimentation tracking capabilities in MLflow 3.0 on Amazon SageMaker AI enable you to track, organize, view, analyze, and compare iterative ML experimentation to gain comparative insights, and to register and deploy your best performing models.

Tracing capabilities in fully managed MLflow 3.0 enable you to record the inputs, outputs, and metadata at every step of a generative AI application, helping you to quickly identify the source of bugs or unexpected behaviors. By maintaining records of each model and application version, fully managed MLflow 3.0 offers traceability to connect AI responses to their source components, allowing you to quickly trace an issue directly to the specific code, data, or parameters that generated it. This dramatically reduces troubleshooting time and enables teams to focus more on innovation.

## MLflow integrations
<a name="mlflow-integrations"></a>

Use MLflow while training and evaluating models to find the best candidates for your use case. You can compare model performance, parameters, and metrics across experiments in the MLflow UI, keep track of your best models in the MLflow Model Registry, automatically register them as a SageMaker AI model, and deploy registered models to SageMaker AI endpoints.

**Amazon SageMaker AI with MLflow**

Use MLflow to track and manage the experimentation phase of the machine learning (ML) lifecycle with AWS integrations for model development, management, deployment, and tracking. 

**Amazon SageMaker Studio**

Create and manage tracking servers, run notebooks to create experiments, and access the MLflow UI to view and compare experiment runs all through Studio. 

**SageMaker Model Registry**

Manage model versions and catalog models for production by automatically registering models from MLflow Model Registry to SageMaker Model Registry. For more information, see [Automatically register SageMaker AI models with SageMaker Model Registry](mlflow-track-experiments-model-registration.md).

**SageMaker AI Inference**

Prepare your best models for deployment on a SageMaker AI endpoint using `ModelBuilder`. For more information, see [Deploy MLflow models with `ModelBuilder`](mlflow-track-experiments-model-deployment.md).

**AWS Identity and Access Management**

Configure access to MLflow using role-based access control (RBAC) with IAM. Write IAM identity policies to authorize the MLflow APIs that can be called by a client of an MLflow tracking server. All MLflow REST APIs are represented as IAM actions under the `sagemaker-mlflow` service prefix. For more information, see [Set up IAM permissions for MLflow](mlflow-create-tracking-server-iam.md).

**AWS CloudTrail**

View logs in AWS CloudTrail to help you enable operational and risk auditing, governance, and compliance of your AWS account. For more information, see [AWS CloudTrail logs](#mlflow-create-tracking-server-cloudtrail).

**Amazon EventBridge**

Automate the model review and deployment lifecycle using MLflow events captured by Amazon EventBridge. For more information, see [Amazon EventBridge events](#mlflow-create-tracking-server-eventbridge).

## Supported AWS Regions
<a name="mlflow-regions"></a>

Amazon SageMaker AI with MLflow is generally available in all AWS commercial [Regions](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html) where Amazon SageMaker Studio is available, except the China Regions. SageMaker AI with MLflow is available using only the AWS CLI in the Europe (Zurich) Region, Asia Pacific (Hyderabad) Region, Asia Pacific (Melbourne) Region, and Canada West (Calgary) Region.

Tracking servers are launched in a single Availability Zone within their specified Region. 

## How it works
<a name="mlflow-create-tracking-server-how-it-works"></a>

An MLflow Tracking Server has three main components: compute, backend metadata storage, and artifact storage. The compute that hosts the tracking server and the backend metadata storage are securely hosted in the SageMaker AI service account. The artifact storage lives in an Amazon S3 bucket in your own AWS account.

![\[A diagram showing the compute and metadata store for an MLflow Tracking Server.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/mlflow/mlflow-diagram.png)


A tracking server has an Amazon Resource Name (ARN). You can use this ARN to connect the MLflow SDK to your tracking server and start logging your training runs to MLflow.
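As a sketch, the following shows how a client might validate a tracking server ARN and point the MLflow SDK at it. The ARN value shown is hypothetical; actually connecting also assumes the `mlflow` package and the AWS `sagemaker-mlflow` authentication plugin are installed in your environment.

```python
import re

def is_mlflow_tracking_server_arn(arn: str) -> bool:
    """Loosely check that a string has the shape of a tracking server ARN."""
    pattern = r"^arn:aws:sagemaker:[a-z0-9-]+:\d{12}:mlflow-tracking-server/.+$"
    return re.match(pattern, arn) is not None

def connect_to_tracking_server(arn: str) -> None:
    """Point the MLflow SDK at a SageMaker AI managed tracking server."""
    import mlflow  # deferred so the validator above works without MLflow installed

    if not is_mlflow_tracking_server_arn(arn):
        raise ValueError(f"Not an MLflow tracking server ARN: {arn}")
    mlflow.set_tracking_uri(arn)

# Hypothetical ARN; replace with the ARN of your own tracking server.
example_arn = (
    "arn:aws:sagemaker:us-east-1:123456789012:mlflow-tracking-server/my-server"
)
print(is_mlflow_tracking_server_arn(example_arn))  # True
```

Once the tracking URI is set, standard MLflow calls such as `mlflow.start_run()` and `mlflow.log_metric()` log to the managed server.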

Read on for more information about the following key concepts:
+ [Backend metadata storage](#mlflow-create-tracking-server-backend-store) 
+ [Artifact storage](#mlflow-create-tracking-server-artifact-store) 
+ [MLflow Tracking Server sizes](#mlflow-create-tracking-server-sizes) 
+ [Tracking server versions](#mlflow-create-tracking-server-versions) 
+ [AWS CloudTrail logs](#mlflow-create-tracking-server-cloudtrail) 
+ [Amazon EventBridge events](#mlflow-create-tracking-server-eventbridge) 

### Backend metadata storage
<a name="mlflow-create-tracking-server-backend-store"></a>

When you create an MLflow Tracking Server, a [backend store](https://mlflow.org/docs/latest/tracking/backend-stores.html), which persists various metadata for each [Run](https://mlflow.org/docs/latest/tracking.html#runs), such as run ID, start and end times, parameters, and metrics, is automatically configured within the SageMaker AI service account and fully managed for you. 

### Artifact storage
<a name="mlflow-create-tracking-server-artifact-store"></a>

To provide MLflow with persistent storage for the artifacts of each run, such as model weights, images, model files, and data files, you must create an artifact store using Amazon S3. The artifact store must be set up within your AWS account, and you must explicitly give MLflow access to Amazon S3 in order to access your artifact store. For more information, see [Artifact Stores](https://mlflow.org/docs/latest/tracking.html#artifact-stores) in the MLflow documentation.

**Note**  
SageMaker AI MLflow has a 200 MB download size limit.

### MLflow app versions
<a name="mlflow-create-mlflow-app-versions"></a>

The following MLflow versions are available to use with SageMaker AI MLflow Apps:


| MLflow version | Python version | 
| --- | --- | 
| [MLflow 3.4](https://mlflow.org/releases/3.4.0) (latest version) | [Python 3.9](https://www.python.org/downloads/release/python-390/) or later | 

The latest version of the MLflow App has the latest features, security patches, and bug fixes. When you create a new MLflow App, it is automatically updated to the latest supported version. For more information about creating an MLflow App, see [MLflow App Setup](mlflow-app-setup.md).

MLflow Apps use semantic versioning. Versions are in the following format: `major-version.minor-version.patch-version`.

### MLflow Tracking Server sizes
<a name="mlflow-create-tracking-server-sizes"></a>

You can optionally specify the size of your tracking server in the Studio UI or with the AWS CLI parameter `--tracking-server-size`. You can choose between `"Small"`, `"Medium"`, and `"Large"`. The default MLflow tracking server configuration size is `"Small"`. You can choose a size depending on the projected use of the tracking server such as the volume of data logged, number of users, and frequency of use.

We recommend using a small tracking server for teams of up to 25 users, a medium tracking server for teams of up to 50 users, and a large tracking server for teams of up to 100 users. These recommendations assume that all users make concurrent requests to your MLflow Tracking Server. Select your tracking server size based on your expected usage pattern and the transactions per second (TPS) supported by each size. 

**Note**  
The nature of your workload and the type of requests that you make to the tracking server dictate the TPS you see.


| Tracking server size | Sustained TPS | Burst TPS | 
| --- | --- | --- | 
| Small | Up to 25 | Up to 50 | 
| Medium | Up to 50 | Up to 100 | 
| Large | Up to 100 | Up to 200 | 
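As an illustration of reading this table, the following hypothetical helper picks the smallest size whose sustained and burst TPS limits both cover an expected load:

```python
# Sustained and burst TPS limits per tracking server size, from the table above.
SIZE_LIMITS = [
    ("Small", 25, 50),
    ("Medium", 50, 100),
    ("Large", 100, 200),
]

def recommend_size(sustained_tps, burst_tps):
    """Return the smallest size that covers both limits, or None if none does."""
    for name, sustained_limit, burst_limit in SIZE_LIMITS:
        if sustained_tps <= sustained_limit and burst_tps <= burst_limit:
            return name
    return None  # exceeds the Large limits; consider splitting workloads

print(recommend_size(10, 40))   # fits within the Small limits
print(recommend_size(60, 90))   # sustained load pushes this to Large
```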

### Tracking server versions
<a name="mlflow-create-tracking-server-versions"></a>

The following MLflow versions are available to use with SageMaker AI:


| MLflow version | Python version | 
| --- | --- | 
| [MLflow 3.0](https://mlflow.org/releases/3) (latest version) | [Python 3.9](https://www.python.org/downloads/release/python-390/) or later | 
| [MLflow 2.16](https://mlflow.org/releases/2.16.0) | [Python 3.8](https://www.python.org/downloads/release/python-380/) or later | 
| [MLflow 2.13](https://mlflow.org/releases/2.13.0) | [Python 3.8](https://www.python.org/downloads/release/python-380/) or later | 

The latest version of the tracking server has the latest features, security patches, and bug fixes. When you create a new tracking server, we recommend using the latest version. For more information about creating a tracking server, see [MLflow Tracking Servers](mlflow-create-tracking-server.md).

MLflow tracking servers use semantic versioning. Versions are in the following format: `major-version.minor-version.patch-version`.

The latest features, such as new UI elements and API functionality, are in the minor-version.
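Because versions follow the `major.minor.patch` format, version strings can be compared numerically. A minimal sketch:

```python
def parse_semver(version):
    """Split a semantic version string into numeric (major, minor, patch) parts."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

# Tuple comparison orders versions correctly, e.g. 2.16.0 is newer than 2.13.1.
print(parse_semver("2.16.0") > parse_semver("2.13.1"))  # True
```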

### AWS CloudTrail logs
<a name="mlflow-create-tracking-server-cloudtrail"></a>

AWS CloudTrail automatically logs activity related to your MLflow Tracking Server. The following control plane API calls are logged in CloudTrail:
+ CreateMlflowTrackingServer
+ DescribeMlflowTrackingServer
+ UpdateMlflowTrackingServer
+ DeleteMlflowTrackingServer
+ ListMlflowTrackingServers
+ CreatePresignedMlflowTrackingServer
+ StartMlflowTrackingServer
+ StopMlflowTrackingServer

AWS CloudTrail also automatically logs activity related to your MLflow data plane. The following data plane API calls are logged in CloudTrail. For event names, add the prefix `Mlflow` (for example, `MlflowCreateExperiment`).
+ CreateExperiment
+ CreateModelVersion
+ CreateRegisteredModel
+ CreateRun
+ DeleteExperiment
+ DeleteModelVersion
+ DeleteModelVersionTag
+ DeleteRegisteredModel
+ DeleteRegisteredModelAlias
+ DeleteRegisteredModelTag
+ DeleteRun
+ DeleteTag
+ GetDownloadURIForModelVersionArtifacts
+ GetExperiment
+ GetExperimentByName
+ GetLatestModelVersions
+ GetMetricHistory
+ GetModelVersion
+ GetModelVersionByAlias
+ GetRegisteredModel
+ GetRun
+ ListArtifacts
+ LogBatch
+ LogInputs
+ LogMetric
+ LogModel
+ LogParam
+ RenameRegisteredModel
+ RestoreExperiment
+ RestoreRun
+ SearchExperiments
+ SearchModelVersions
+ SearchRegisteredModels
+ SearchRuns
+ SetExperimentTag
+ SetModelVersionTag
+ SetRegisteredModelAlias
+ SetRegisteredModelTag
+ SetTag
+ TransitionModelVersionStage
+ UpdateExperiment
+ UpdateModelVersion
+ UpdateRegisteredModel
+ UpdateRun
+ FinalizeLoggedModel
+ GetLoggedModel
+ DeleteLoggedModel
+ SearchLoggedModels
+ SetLoggedModelTags
+ DeleteLoggedModelTag
+ ListLoggedModelArtifacts
+ LogLoggedModelParams
+ LogOutputs

For more information about CloudTrail, see the *[AWS CloudTrail User Guide](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html)*.
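To illustrate the naming rule above, the following sketch maps MLflow data plane API names to the event names you would filter on in CloudTrail. The record structures shown are simplified and hypothetical:

```python
def cloudtrail_event_name(mlflow_api):
    """Data plane MLflow API calls are logged with an 'Mlflow' prefix."""
    return "Mlflow" + mlflow_api

# Filter a batch of simplified CloudTrail records for MLflow run creations.
records = [
    {"eventName": "MlflowCreateRun"},
    {"eventName": "CreateMlflowTrackingServer"},  # control plane, no prefix
    {"eventName": "MlflowLogMetric"},
]
target = cloudtrail_event_name("CreateRun")
matches = [r for r in records if r["eventName"] == target]
print(len(matches))  # 1
```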

### Amazon EventBridge events
<a name="mlflow-create-tracking-server-eventbridge"></a>

Use EventBridge to route events from using MLflow with SageMaker AI to consumer applications across your organization. The following events are emitted to EventBridge:
+ "SageMaker Tracking Server Creating"
+ "SageMaker Tracking Server Created"
+ "SageMaker Tracking Server Create Failed"
+ "SageMaker Tracking Server Updating"
+ "SageMaker Tracking Server Updated"
+ "SageMaker Tracking Server Update Failed"
+ "SageMaker Tracking Server Deleting"
+ "SageMaker Tracking Server Deleted"
+ "SageMaker Tracking Server Delete Failed"
+ "SageMaker Tracking Server Starting"
+ "SageMaker Tracking Server Started"
+ "SageMaker Tracking Server Start Failed"
+ "SageMaker Tracking Server Stopping"
+ "SageMaker Tracking Server Stopped"
+ "SageMaker Tracking Server Stop Failed"
+ "SageMaker Tracking Server Maintenance In Progress"
+ "SageMaker Tracking Server Maintenance Complete"
+ "SageMaker Tracking Server Maintenance Failed"
+ "SageMaker MLFlow Tracking Server Creating Run"
+ "SageMaker MLFlow Tracking Server Creating RegisteredModel"
+ "SageMaker MLFlow Tracking Server Creating ModelVersion"
+ "SageMaker MLFlow Tracking Server Transitioning ModelVersion Stage"
+ "SageMaker MLFlow Tracking Server Setting Registered Model Alias"

For more information about EventBridge, see the *[Amazon EventBridge User Guide](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-what-is.html)*.
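As a sketch, an EventBridge rule pattern that matches one of these events might look like the following. The `source` value is an assumption to verify against the events your account actually receives:

```
{
  "source": ["aws.sagemaker"],
  "detail-type": ["SageMaker MLFlow Tracking Server Creating ModelVersion"]
}
```

You can attach such a rule to a target, such as an AWS Lambda function, to start a model review workflow automatically.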

**Topics**
+ [MLflow integrations](#mlflow-integrations)
+ [Supported AWS Regions](#mlflow-regions)
+ [How it works](#mlflow-create-tracking-server-how-it-works)
+ [MLflow App Setup](mlflow-app-setup.md)
+ [MLflow Tracking Servers](mlflow-create-tracking-server.md)
+ [Launch the MLflow UI using a presigned URL](mlflow-launch-ui.md)
+ [Integrate MLflow with your environment](mlflow-track-experiments.md)
+ [MLflow tutorials using example Jupyter notebooks](mlflow-tutorials.md)
+ [Troubleshoot common setup issues](mlflow-troubleshooting.md)
+ [Clean up MLflow resources](mlflow-cleanup.md)
+ [Amazon SageMaker Experiments in Studio Classic](experiments.md)

# MLflow App Setup
<a name="mlflow-app-setup"></a>

An [MLflow App](https://mlflow.org/docs/latest/tracking.html#mlflow-tracking-server-optional) is a stand-alone HTTP server that serves multiple REST API endpoints for tracking runs and experiments. An MLflow App is required to begin tracking your machine learning (ML) experiments with SageMaker AI and MLflow. You can create an MLflow App through the Studio UI, or through the AWS CLI for more granular security customization.

You must have the correct IAM permissions configured to create an MLflow App.

MLflow Apps are the latest managed MLflow offering on SageMaker AI and should be preferred over MLflow Tracking Servers. MLflow Apps offer faster startup times, cross-account sharing, integrations with other SageMaker AI features, and additional capabilities beyond those of MLflow Tracking Servers.

**Topics**
+ [MLflow App Setup Prerequisites](mlflow-app-setup-prerequisites.md)
+ [Create MLflow App](mlflow-app-setup-create-app.md)

# MLflow App Setup Prerequisites
<a name="mlflow-app-setup-prerequisites"></a>

# Set up IAM permissions for MLflow Apps
<a name="mlflow-app-setup-prerequisites-iam"></a>

You must configure the necessary IAM service roles to get started with MLflow Apps in Amazon SageMaker AI. 

If you create a new Amazon SageMaker AI domain to access your experiments in Studio, you can configure the necessary IAM permissions during domain setup. For more information, see [Set up MLflow IAM permissions when creating a new domain](mlflow-create-tracking-server-iam.md#mlflow-create-tracking-server-iam-role-manager).

To set up permissions using the IAM console, see [Create necessary IAM service roles in the IAM console](mlflow-create-tracking-server-iam.md#mlflow-create-tracking-server-iam-service-roles).

You must configure authorization controls for `sagemaker-mlflow` actions. You can optionally define more granular authorization controls to govern action-specific MLflow permissions. For more information, see [Create action-specific authorization controls](#mlflow-create-app-update-iam-actions).

## Set up MLflow IAM permissions when creating a new domain
<a name="mlflow-create-app-iam-role-manager"></a>

When setting up a new Amazon SageMaker AI domain for your organization, you can configure IAM permissions for your domain service role through the **Users and ML Activities** settings.

1. Set up a new domain using the SageMaker AI console. On the **Set up SageMaker AI domain** page, choose **Set up for organizations**. For more information, see [Custom setup using the console](onboard-custom.md#onboard-custom-instructions-console).

1. When setting up **Users and ML Activities**, choose from the following ML activities for MLflow: **Use MLflow**, **Manage MLflow Apps**, and **Access required to AWS services for MLflow Apps**. For more information about these activities, see the explanations that follow this procedure.

1. Complete the setup and creation of your new domain.

The following MLflow ML activities are available in Amazon SageMaker Role Manager:
+ **Use MLflow**: This ML activity grants the domain service role permission to call MLflow REST APIs in order to manage experiments, runs, and models in MLflow.
+ **Manage MLflow Apps**: This ML activity grants the domain service role permission to create, update, and delete MLflow Apps.
+ **Access required to AWS services for MLflow Apps**: This ML activity provides the domain service role permissions needed to access Amazon S3 and the SageMaker AI Model Registry. This allows you to use the domain service role as the tracking server service role.

For more information about ML activities in Role Manager, see [ML activity reference](role-manager-ml-activities.md).

## Create necessary IAM service roles in the IAM console
<a name="mlflow-create-app-iam-service-roles"></a>

If you did not create or update your domain service role, you must instead create the following service roles in the IAM console in order to create and use an MLflow App:
+ An MLflow App IAM service role that the App can use to access SageMaker AI resources
+ A SageMaker AI IAM service role that SageMaker AI can use to create and manage MLflow resources

### IAM policies for the MLflow App IAM service role
<a name="mlflow-create-app-iam-service-roles-ts"></a>

The MLflow App IAM service role is used by the app to access the resources it needs such as Amazon S3 and the SageMaker Model Registry.

When creating the app IAM service role, use the following IAM trust policy:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```


In the IAM console, add the following permissions policy to your app service role:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:Put*",
                "s3:List*",
                "sagemaker:AddTags",
                "sagemaker:CreateModelPackageGroup",
                "sagemaker:CreateModelPackage",
                "sagemaker:UpdateModelPackage",
                "sagemaker:DescribeModelPackageGroup"
            ],
            "Resource": "*"
        }
    ]
}
```


### IAM policy for the SageMaker AI IAM service role
<a name="mlflow-create-app-iam-service-roles-sm"></a>

The SageMaker AI service role is used by the client accessing the MLflow App and needs permissions to call MLflow REST APIs. The SageMaker AI service role also needs SageMaker API permissions to create, view, update, and delete apps. 

You can create a new role or update an existing role. The SageMaker AI service role needs the following policy: 

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker-mlflow:*",
                "sagemaker:CreateMlflowTrackingServer",
                "sagemaker:ListMlflowTrackingServers",
                "sagemaker:UpdateMlflowTrackingServer",
                "sagemaker:DeleteMlflowTrackingServer",
                "sagemaker:StartMlflowTrackingServer",
                "sagemaker:StopMlflowTrackingServer",
                "sagemaker:CreatePresignedMlflowTrackingServerUrl"
            ],
            "Resource": "*"
        }
    ]
}
```


## Create action-specific authorization controls
<a name="mlflow-create-app-update-iam-actions"></a>

You must set up authorization controls for `sagemaker-mlflow`, and you can optionally configure action-specific authorization controls to govern the more granular MLflow permissions that your users have on an MLflow App.

**Note**  
The following steps assume that you already have an ARN for an MLflow App available. 

### Data Plane IAM actions supported for MLflow Apps
<a name="mlflow-app-setup-iam-actions"></a>

The following SageMaker AI MLflow actions are supported for authorization access control:
+ sagemaker:CallMlflowAppApi
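For example, an identity policy that grants this action on a single app might look like the following (the app ARN shown is hypothetical):

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker:CallMlflowAppApi",
            "Resource": "arn:aws:sagemaker:us-east-1:123456789012:mlflow-app/app-name"
        }
    ]
}
```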

# Create MLflow App
<a name="mlflow-app-setup-create-app"></a>

# Create an app using the AWS CLI
<a name="mlflow-app-create-app-cli"></a>

You can create an app using the AWS CLI for more granular security customization.

## Prerequisites
<a name="mlflow-app-create-app-cli-prereqs"></a>

To create an app using the AWS CLI, you must have the following:
+ **Access to a terminal.** This can include local IDEs, an Amazon EC2 instance, or AWS CloudShell.
+ **Access to a development environment.** This can include local IDEs or a Jupyter notebook environment within Studio or Studio Classic.
+ **A configured AWS CLI installation**. For more information, see [Configure the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html). 
+ **An IAM role with appropriate permissions**. The following steps require your environment to have `iam:CreateRole`, `iam:CreatePolicy`, `iam:AttachRolePolicy`, and `iam:ListPolicies` permissions. These permissions are needed on the role that is being used to run the steps in this user guide. The instructions in this guide create an IAM role that is used as the execution role of the MLflow App so that it can access data in your Amazon S3 buckets. Additionally, a policy is created to give the IAM role of the user that is interacting with the App via the MLflow SDK permission to call MLflow APIs. For more information, see [Modifying a role permissions policy (console) ](https://docs.aws.amazon.com/IAM/latest/UserGuide/roles-managingrole-editing-console.html#roles-modify_permissions-policy). 

  If using a SageMaker Studio Notebook, update the service role for your Studio user profile with these IAM permissions. To update the service role, navigate to the SageMaker AI console and select the domain you are using. Then, under the domain, select the user profile you are using. You will see the service role listed there. Navigate to the IAM console, search for the service role under **Roles**, and update your role with a policy that allows the `iam:CreateRole`, `iam:CreatePolicy`, `iam:AttachRolePolicy`, and `iam:ListPolicies` actions. 

## Set up the AWS CLI
<a name="mlflow-app-create-app-cli-setup"></a>

Follow these command line steps within a terminal to set up the AWS CLI for Amazon SageMaker AI with MLflow.

1. Install an updated version of the AWS CLI. For more information, see [Install or update to the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) in the *AWS CLI User Guide*.

1. Verify that the AWS CLI is installed using the following command: 

   ```
   aws sagemaker help
   ```

   Press `q` to exit the prompt.

   For troubleshooting help, see [Troubleshoot common setup issues](mlflow-troubleshooting.md).

## Set up MLflow infrastructure
<a name="mlflow-create-app-cli-infra-setup"></a>

The following section shows you how to set up an MLflow App along with the Amazon S3 bucket and IAM role needed for the app.

### Create an S3 bucket
<a name="mlflow-infra-setup-s3-bucket"></a>

Within your terminal, use the following commands to create a general purpose Amazon S3 bucket: 

**Important**  
When you provide the Amazon S3 URI for your artifact store, ensure the Amazon S3 bucket is in the same AWS Region as your MLflow App. **Cross-region artifact storage is not supported**. In the US East (N. Virginia) Region (`us-east-1`), omit the `--create-bucket-configuration` option, because a `LocationConstraint` is not accepted there.

```
bucket_name=bucket-name
region=valid-region

aws s3api create-bucket \
  --bucket $bucket_name \
  --region $region \
  --create-bucket-configuration LocationConstraint=$region
```

The output should look similar to the following:

```
{
    "Location": "/bucket-name"
}
```

### Set up IAM trust policies
<a name="mlflow-app-create-app-cli-trust-policy"></a>

Use the following steps to create an IAM trust policy. For more information about roles and trust policies, see [Roles terms and concepts](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_terms-and-concepts.html) in the *AWS Identity and Access Management User Guide*.

1. Within your terminal, use the following command to create a file called `mlflow-trust-policy.json`.

   ```
   cat <<EOF > /tmp/mlflow-trust-policy.json
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "Service": [
                       "sagemaker.amazonaws.com"
                   ]
               },
               "Action": "sts:AssumeRole"
           }
       ]
   }
   EOF
   ```

1. Within your terminal, use the following command to create a file called `custom-policy.json`.

   ```
   cat <<EOF > /tmp/custom-policy.json
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "s3:Get*",
                   "s3:Put*",
                   "s3:List*",
                   "sagemaker:AddTags",
                   "sagemaker:CreateModelPackageGroup",
                   "sagemaker:CreateModelPackage",
                   "sagemaker:DescribeModelPackageGroup",
                   "sagemaker:UpdateModelPackage"
               ],
               "Resource": "*"
           }
       ]
   }
   EOF
   ```

1. Use the trust policy file to create a role. Then, attach IAM role policies that allow MLflow to access Amazon S3 and SageMaker Model Registry within your account. MLflow must have access to Amazon S3 for your app's artifact store and SageMaker Model Registry for automatic model registration. 
**Note**  
If you are updating an existing role, use the following command instead: `aws iam update-assume-role-policy --role-name $role_name --policy-document file:///tmp/mlflow-trust-policy.json`.

   ```
   role_name=role-name

   aws iam create-role \
     --role-name $role_name \
     --assume-role-policy-document file:///tmp/mlflow-trust-policy.json

   aws iam put-role-policy \
     --role-name $role_name \
     --policy-name custom-policy \
     --policy-document file:///tmp/custom-policy.json

   role_arn=$(aws iam get-role --role-name $role_name --query 'Role.Arn' --output text)
   ```

## Create MLflow App
<a name="mlflow-app-create-app-cli-create"></a>

Within your terminal, use the `create-mlflow-app` API to create an app in the AWS Region of your choice. This step normally takes approximately 2-3 minutes.

The following command creates a new app with automatic model registration enabled. To deactivate automatic model registration, specify `--no-automatic-model-registration`. 

After creating your app, you can launch the MLflow UI. For more information, see [Launch the MLflow UI using a presigned URL](mlflow-launch-ui.md).

**Note**  
It may take up to 2-3 minutes to complete app creation. If the app takes over 3 minutes to create, check that you have the necessary IAM permissions. When you successfully create an app, it automatically starts.

By default, the app that is created is the latest version and will be automatically updated.

```
app_name=app-name
region=valid-region

aws sagemaker create-mlflow-app \
  --name $app_name \
  --artifact-store-uri s3://$bucket_name \
  --role-arn $role_arn \
  --automatic-model-registration \
  --region $region
```

The output should be similar to the following: 

```
{
    "AppArn": "arn:aws:sagemaker:region:123456789012:mlflow-app/app-name"
}
```

**Important**  
**Take note of the app ARN for later use.** You will also need the `$bucket_name` for cleanup steps.

# MLflow Tracking Servers
<a name="mlflow-create-tracking-server"></a>

An [MLflow Tracking Server](https://mlflow.org/docs/latest/tracking.html#mlflow-tracking-server-optional) is a stand-alone HTTP server that serves multiple REST API endpoints for tracking runs and experiments. A tracking server is required to begin tracking your machine learning (ML) experiments with SageMaker AI and MLflow. You can create a tracking server through the Studio UI, or through the AWS CLI for more granular security customization.

You must have the correct IAM permissions configured to create an MLflow Tracking Server.

**Topics**
+ [Set up IAM permissions for MLflow](mlflow-create-tracking-server-iam.md)
+ [Create a tracking server using Studio](mlflow-create-tracking-server-studio.md)
+ [Create a tracking server using the AWS CLI](mlflow-create-tracking-server-cli.md)

# Set up IAM permissions for MLflow
<a name="mlflow-create-tracking-server-iam"></a>

You must configure the necessary IAM service roles to get started with MLflow in Amazon SageMaker AI. 

If you create a new Amazon SageMaker AI domain to access your experiments in Studio, you can configure the necessary IAM permissions during domain setup. For more information, see [Set up MLflow IAM permissions when creating a new domain](#mlflow-create-tracking-server-iam-role-manager).

To set up permissions using the IAM console, see [Create necessary IAM service roles in the IAM console](#mlflow-create-tracking-server-iam-service-roles).

You must configure authorization controls for `sagemaker-mlflow` actions. You can optionally define more granular authorization controls to govern action-specific MLflow permissions. For more information, see [Create action-specific authorization controls](#mlflow-create-tracking-server-update-iam-actions).

## Set up MLflow IAM permissions when creating a new domain
<a name="mlflow-create-tracking-server-iam-role-manager"></a>

When setting up a new Amazon SageMaker AI domain for your organization, you can configure IAM permissions for your domain service role through the **Users and ML Activities** settings.

**To configure IAM permissions for using MLflow with SageMaker AI when setting up a new domain**

1. Set up a new domain using the SageMaker AI console. On the **Set up SageMaker AI domain** page, choose **Set up for organizations**. For more information, see [Custom setup using the console](onboard-custom.md#onboard-custom-instructions-console).

1. When setting up **Users and ML Activities**, choose from the following ML activities for MLflow: **Use MLflow**, **Manage MLflow Tracking Servers**, and **Access required to AWS Services for MLflow**. For more information about these activities, see the explanations that follow this procedure.

1. Complete the setup and creation of your new domain.

The following MLflow ML activities are available in Amazon SageMaker Role Manager:
+ **Use MLflow**: This ML activity grants the domain service role permission to call MLflow REST APIs in order to manage experiments, runs, and models in MLflow.
+ **Manage MLflow Tracking Servers**: This ML activity grants the domain service role permission to create, update, start, stop, and delete tracking servers.
+ **Access required to AWS Services for MLflow**: This ML activity provides the domain service role permissions needed to access Amazon S3 and the SageMaker AI Model Registry. This allows you to use the domain service role as the tracking server service role.

For more information about ML activities in Role Manager, see [ML activity reference](role-manager-ml-activities.md).

## Create necessary IAM service roles in the IAM console
<a name="mlflow-create-tracking-server-iam-service-roles"></a>

If you did not create or update your domain service role, you must instead create the following service roles in the IAM console in order to create and use an MLflow Tracking Server:
+ A tracking server IAM service role that the tracking server can use to access SageMaker AI resources
+ A SageMaker AI IAM service role that SageMaker AI can use to create and manage MLflow resources

### IAM policies for the tracking server IAM service role
<a name="mlflow-create-tracking-server-iam-service-roles-ts"></a>

The tracking server IAM service role is used by the tracking server to access the resources it needs such as Amazon S3 and the SageMaker Model Registry.

When creating the tracking server IAM service role, use the following IAM trust policy:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```


In the IAM console, add the following permissions policy to your tracking server service role:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:Put*",
                "s3:List*",
                "sagemaker:AddTags",
                "sagemaker:CreateModelPackageGroup",
                "sagemaker:CreateModelPackage",
                "sagemaker:UpdateModelPackage",
                "sagemaker:DescribeModelPackageGroup"
            ],
            "Resource": "*"
        }
    ]
}
```


### IAM policy for the SageMaker AI IAM service role
<a name="mlflow-create-tracking-server-iam-service-roles-sm"></a>

The SageMaker AI service role is used by the client accessing the MLflow Tracking Server and needs permissions to call MLflow REST APIs. The SageMaker AI service role also needs SageMaker API permissions to create, view, update, start, stop, and delete tracking servers. 

You can create a new role or update an existing role. The SageMaker AI service role needs the following policy: 

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker-mlflow:*",
                "sagemaker:CreateMlflowTrackingServer",
                "sagemaker:ListMlflowTrackingServers",
                "sagemaker:UpdateMlflowTrackingServer",
                "sagemaker:DeleteMlflowTrackingServer",
                "sagemaker:StartMlflowTrackingServer",
                "sagemaker:StopMlflowTrackingServer",
                "sagemaker:CreatePresignedMlflowTrackingServerUrl"
            ],
            "Resource": "*"
        }
    ]
}
```


## Create action-specific authorization controls
<a name="mlflow-create-tracking-server-update-iam-actions"></a>

You must set up authorization controls for `sagemaker-mlflow`, and can optionally configure action-specific authorization controls to govern more granular MLflow permissions that your users have on an MLflow Tracking Server.

**Note**  
The following steps assume that you have an ARN for an MLflow Tracking Server already available. To learn how to create a tracking server, see [Create a tracking server using Studio](mlflow-create-tracking-server-studio.md) or [Create a tracking server using the AWS CLI](mlflow-create-tracking-server-cli.md).

The following command creates a file called `mlflow-policy.json` that provides your tracking server with IAM permissions for all available SageMaker AI MLflow actions. You can optionally limit the permissions a user has by choosing the specific actions you want that user to perform. For a list of available actions, see [IAM actions supported for MLflow](#mlflow-create-tracking-server-iam-actions).

```
# Replace "Resource":"*" with "Resource":"TrackingServerArn" 
# Replace "sagemaker-mlflow:*" with specific actions

printf '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sagemaker-mlflow:*",
            "Resource": "*"
        }
    ]
}' > mlflow-policy.json
```

Use the `mlflow-policy.json` file to create an IAM policy using the AWS CLI. 

```
aws iam create-policy \
  --policy-name MLflowPolicy \
  --policy-document file://mlflow-policy.json
```

Retrieve your account ID and attach the policy to your IAM role.

```
# Get your account ID
account_id=$(aws sts get-caller-identity --query Account --output text)

# Attach the IAM policy using your role name and account ID
aws iam attach-role-policy \
  --role-name $role_name \
  --policy-arn arn:aws:iam::$account_id:policy/MLflowPolicy
```

### IAM actions supported for MLflow
<a name="mlflow-create-tracking-server-iam-actions"></a>

The following SageMaker AI MLflow actions are supported for authorization access control:
+ sagemaker-mlflow:AccessUI
+ sagemaker-mlflow:CreateExperiment
+ sagemaker-mlflow:SearchExperiments
+ sagemaker-mlflow:GetExperiment
+ sagemaker-mlflow:GetExperimentByName
+ sagemaker-mlflow:DeleteExperiment
+ sagemaker-mlflow:RestoreExperiment
+ sagemaker-mlflow:UpdateExperiment
+ sagemaker-mlflow:CreateRun
+ sagemaker-mlflow:DeleteRun
+ sagemaker-mlflow:RestoreRun
+ sagemaker-mlflow:GetRun
+ sagemaker-mlflow:LogMetric
+ sagemaker-mlflow:LogBatch
+ sagemaker-mlflow:LogModel
+ sagemaker-mlflow:LogInputs
+ sagemaker-mlflow:SetExperimentTag
+ sagemaker-mlflow:SetTag
+ sagemaker-mlflow:DeleteTag
+ sagemaker-mlflow:LogParam
+ sagemaker-mlflow:GetMetricHistory
+ sagemaker-mlflow:SearchRuns
+ sagemaker-mlflow:ListArtifacts
+ sagemaker-mlflow:UpdateRun
+ sagemaker-mlflow:CreateRegisteredModel
+ sagemaker-mlflow:GetRegisteredModel
+ sagemaker-mlflow:RenameRegisteredModel
+ sagemaker-mlflow:UpdateRegisteredModel
+ sagemaker-mlflow:DeleteRegisteredModel
+ sagemaker-mlflow:GetLatestModelVersions
+ sagemaker-mlflow:CreateModelVersion
+ sagemaker-mlflow:GetModelVersion
+ sagemaker-mlflow:UpdateModelVersion
+ sagemaker-mlflow:DeleteModelVersion
+ sagemaker-mlflow:SearchModelVersions
+ sagemaker-mlflow:GetDownloadURIForModelVersionArtifacts
+ sagemaker-mlflow:TransitionModelVersionStage
+ sagemaker-mlflow:SearchRegisteredModels
+ sagemaker-mlflow:SetRegisteredModelTag
+ sagemaker-mlflow:DeleteRegisteredModelTag
+ sagemaker-mlflow:DeleteModelVersionTag
+ sagemaker-mlflow:DeleteRegisteredModelAlias
+ sagemaker-mlflow:SetRegisteredModelAlias
+ sagemaker-mlflow:GetModelVersionByAlias
+ sagemaker-mlflow:FinalizeLoggedModel
+ sagemaker-mlflow:GetLoggedModel
+ sagemaker-mlflow:DeleteLoggedModel
+ sagemaker-mlflow:SearchLoggedModels
+ sagemaker-mlflow:SetLoggedModelTags
+ sagemaker-mlflow:DeleteLoggedModelTag
+ sagemaker-mlflow:ListLoggedModelArtifacts
+ sagemaker-mlflow:LogLoggedModelParams
+ sagemaker-mlflow:LogOutputs
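For example, the actions above can be narrowed to a read-only subset. The following sketch writes a browse-only policy for a single tracking server; the file name and the particular action selection are illustrative choices, not a prescribed configuration:

```shell
# Hypothetical read-only policy: browse-only access to one tracking server.
# Replace the Resource ARN with your tracking server ARN.
printf '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker-mlflow:AccessUI",
                "sagemaker-mlflow:SearchExperiments",
                "sagemaker-mlflow:GetExperiment",
                "sagemaker-mlflow:GetRun",
                "sagemaker-mlflow:SearchRuns",
                "sagemaker-mlflow:GetMetricHistory",
                "sagemaker-mlflow:ListArtifacts"
            ],
            "Resource": "arn:aws:sagemaker:region:123456789012:mlflow-tracking-server/tracking-server-name"
        }
    ]
}' > mlflow-readonly-policy.json
```

Create and attach this file the same way as `mlflow-policy.json` in the steps above.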

# Create a tracking server using Studio
<a name="mlflow-create-tracking-server-studio"></a>

You can create a tracking server from the SageMaker Studio MLflow UI. If you created your SageMaker Studio domain following the **Set up for organizations** workflow, the service role for your SageMaker Studio domain has sufficient permissions to serve as both the SageMaker AI IAM service role and the tracking server IAM service role.

Create a tracking server from the SageMaker Studio MLflow UI with the following steps:

1. Navigate to Studio from the SageMaker AI console. Be sure that you are using the new Studio experience and have updated from Studio Classic. For more information, see [Migration from Amazon SageMaker Studio Classic](studio-updated-migrate.md).

1. Choose **MLflow** in the **Applications** pane of the Studio UI.

1. Choose **Create** in the **MLflow Tracking Servers** pane. The Studio domain IAM service role is used for the tracking server IAM service role.

1. Provide a unique name for your tracking server and an Amazon S3 URI for your tracking server artifact store. Your tracking server and the Amazon S3 bucket must be in the **same AWS Region**.
**Important**  
When you provide the Amazon S3 URI for your artifact store, ensure the Amazon S3 bucket is in the same AWS Region as your tracking server. **Cross-region artifact storage is not supported**. 

1. **(Optional)** Choose **Configure** to change default settings such as tracking server size, tags, and the IAM service role. 

1. Choose **Create**.
**Note**  
It may take up to 25 minutes to complete tracking server creation. If the tracking server takes over 25 minutes to create, check that you have the necessary IAM permissions. For more information on IAM permissions, see [Set up IAM permissions for MLflow](mlflow-create-tracking-server-iam.md). When you successfully create a tracking server, it automatically starts.

1. After creating your tracking server, you can launch the MLflow UI. For more information, see [Launch the MLflow UI using a presigned URL](mlflow-launch-ui.md).

![\[The Create MLflow Tracking Server prompt in the Studio UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/mlflow/mlflow-studio-create.png)


# Create a tracking server using the AWS CLI
<a name="mlflow-create-tracking-server-cli"></a>

You can create a tracking server using the AWS CLI for more granular security customization.

## Prerequisites
<a name="mlflow-create-tracking-server-cli-prereqs"></a>

To create a tracking server using the AWS CLI, you must have the following:
+ **Access to a terminal.** This can include local IDEs, an Amazon EC2 instance, or AWS CloudShell.
+ **Access to a development environment.** This can include local IDEs or a Jupyter notebook environment within Studio or Studio Classic.
+ **A configured AWS CLI installation**. For more information, see [Configure the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html). 
+ **An IAM role with appropriate permissions**. The following steps require your environment to have `iam:CreateRole`, `iam:CreatePolicy`, `iam:AttachRolePolicy`, and `iam:ListPolicies` permissions. These permissions are needed on the role that is being used to run the steps in this user guide. The instructions in this guide create an IAM role that is used as the execution role of the MLflow Tracking Server so that it can access data in your Amazon S3 buckets. Additionally, a policy is created to give the IAM role of the user that is interacting with the Tracking Server via the MLflow SDK permission to call MLflow APIs. For more information, see [Modifying a role permissions policy (console) ](https://docs.aws.amazon.com/IAM/latest/UserGuide/roles-managingrole-editing-console.html#roles-modify_permissions-policy). 

  If using a SageMaker Studio Notebook, update the service role for your Studio user profile with these IAM permissions. To update the service role, navigate to the SageMaker AI console and select the domain you are using. Then, under the domain, select the user profile you are using. You will see the service role listed there. Navigate to the IAM console, search for the service role under **Roles**, and update your role with a policy that allows the `iam:CreateRole`, `iam:CreatePolicy`, `iam:AttachRolePolicy`, and `iam:ListPolicies` actions. 

## Set up the AWS CLI
<a name="mlflow-create-tracking-server-cli-setup"></a>

Follow these command line steps within a terminal to set up the AWS CLI for Amazon SageMaker AI with MLflow.

1. Install an updated version of the AWS CLI. For more information, see [Install or update to the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) in the *AWS CLI User Guide*.

1. Verify that the AWS CLI is installed using the following command: 

   ```
   aws sagemaker help
   ```

   Press `q` to exit the prompt.

   For troubleshooting help, see [Troubleshoot common setup issues](mlflow-troubleshooting.md).

## Set up MLflow infrastructure
<a name="mlflow-create-tracking-server-cli-infra-setup"></a>

The following section shows you how to set up an MLflow Tracking Server along with the Amazon S3 bucket and IAM role needed for the tracking server.

### Create an S3 bucket
<a name="mlflow-infra-setup-s3-bucket"></a>

Within your terminal, use the following commands to create a general purpose Amazon S3 bucket: 

**Important**  
When you provide the Amazon S3 URI for your artifact store, ensure the Amazon S3 bucket is in the same AWS Region as your tracking server. **Cross-region artifact storage is not supported**.

```
bucket_name=bucket-name
region=valid-region

# Omit the --create-bucket-configuration flag if your Region is us-east-1
aws s3api create-bucket \
  --bucket $bucket_name \
  --region $region \
  --create-bucket-configuration LocationConstraint=$region
```

The output should look similar to the following:

```
{
    "Location": "/bucket-name"
}
```

### Set up IAM trust policies
<a name="mlflow-create-tracking-server-cli-trust-policy"></a>

Use the following steps to create an IAM trust policy. For more information about roles and trust policies, see [Roles terms and concepts](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_terms-and-concepts.html) in the *AWS Identity and Access Management User Guide*.

1. Within your terminal, use the following command to create a file called `mlflow-trust-policy.json`.

   ```
   cat <<EOF > /tmp/mlflow-trust-policy.json
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "Service": [
                       "sagemaker.amazonaws.com"
                   ]
               },
               "Action": "sts:AssumeRole"
           }
       ]
   }
   EOF
   ```

1. Within your terminal, use the following command to create a file called `custom-policy.json`.

   ```
   cat <<EOF > /tmp/custom-policy.json
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "s3:Get*",
                   "s3:Put*",
                   "sagemaker:AddTags",
                   "sagemaker:CreateModelPackageGroup",
                   "sagemaker:CreateModelPackage",
                   "sagemaker:DescribeModelPackageGroup",
                   "sagemaker:UpdateModelPackage",
                   "s3:List*"
               ],
               "Resource": "*"
           }
       ]
   }
   EOF
   ```

1. Use the trust policy file to create a role. Then, attach IAM role policies that allow MLflow to access Amazon S3 and SageMaker Model Registry within your account. MLflow must have access to Amazon S3 for your tracking server's artifact store and SageMaker Model Registry for automatic model registration. 
**Note**  
If you are updating an existing role, use the following command instead: `aws iam update-assume-role-policy --role-name $role_name --policy-document file:///tmp/mlflow-trust-policy.json`.

   ```
   role_name=role-name
   
   aws iam create-role \
     --role-name $role_name \
     --assume-role-policy-document file:///tmp/mlflow-trust-policy.json
   
   aws iam put-role-policy \
     --role-name $role_name \
     --policy-name custom-policy \
     --policy-document file:///tmp/custom-policy.json
   
   role_arn=$(aws iam get-role --role-name $role_name --query 'Role.Arn' --output text)
   ```
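Before creating the role, you can optionally confirm that the policy files are valid JSON; this catches stray heredoc indentation or a missing comma before IAM does. The stand-in file below is illustrative — in practice, point the same `python3 -m json.tool` command at `/tmp/mlflow-trust-policy.json` and `/tmp/custom-policy.json` from the steps above:

```shell
# Sketch: validate a policy document locally before handing it to IAM.
# python3 -m json.tool exits non-zero on malformed JSON.
cat <<'EOF' > /tmp/example-policy.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Principal": {"Service": "sagemaker.amazonaws.com"}
        }
    ]
}
EOF
python3 -m json.tool /tmp/example-policy.json > /dev/null && echo "policy OK"
```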

## Create MLflow tracking server
<a name="mlflow-create-tracking-server-cli-create"></a>

Within your terminal, use the `create-mlflow-tracking-server` API to create a tracking server in the AWS Region of your choice. This step can take up to 25 minutes.

You can optionally specify the size of your tracking server with the `--tracking-server-size` parameter. Choose between `"Small"`, `"Medium"`, and `"Large"`. The default MLflow Tracking Server size is `"Small"`. Choose a size based on the projected use of the tracking server, such as the volume of data logged, the number of users, and the frequency of use. For more information, see [MLflow Tracking Server sizes](mlflow.md#mlflow-create-tracking-server-sizes).

The following command creates a new tracking server with automatic model registration enabled. To deactivate automatic model registration, specify `--no-automatic-model-registration`. 

After creating your tracking server, you can launch the MLflow UI. For more information, see [Launch the MLflow UI using a presigned URL](mlflow-launch-ui.md).

**Note**  
It may take up to 25 minutes to complete tracking server creation. If the tracking server takes over 25 minutes to create, check that you have the necessary IAM permissions. For more information on IAM permissions, see [Set up IAM permissions for MLflow](mlflow-create-tracking-server-iam.md). When you successfully create a tracking server, it automatically starts.

By default, the tracking server that's created uses the latest MLflow version. However, we recommend always specifying the version explicitly, because the underlying MLflow APIs can change between versions. For information about the available versions, see [Tracking server versions](mlflow.md#mlflow-create-tracking-server-versions).

```
ts_name=tracking-server-name
region=valid-region
version=valid-version        


aws sagemaker create-mlflow-tracking-server \
 --tracking-server-name $ts_name \
 --artifact-store-uri s3://$bucket_name \
 --role-arn $role_arn \
 --automatic-model-registration \
 --region $region \
 --mlflow-version $version
```

The output should be similar to the following: 

```
{
    "TrackingServerArn": "arn:aws:sagemaker:region:123456789012:mlflow-tracking-server/tracking-server-name"
}
```

**Important**  
**Take note of the tracking server ARN for later use.** You will also need the `$bucket_name` for cleanup steps.

# Launch the MLflow UI using a presigned URL
<a name="mlflow-launch-ui"></a>

You can access the MLflow UI to view your experiments using a presigned URL. You can launch the MLflow UI either through Studio or using the AWS CLI in a terminal of your choice. 

## Launch the MLflow UI using Studio
<a name="mlflow-launch-ui-studio"></a>

After creating your tracking server, you can launch the MLflow UI directly from Studio. 

1. Navigate to Studio from the SageMaker AI console. Be sure that you are using the new Studio experience and have updated from Studio Classic. For more information, see [Migration from Amazon SageMaker Studio Classic](studio-updated-migrate.md).

1. Choose **MLflow** in the **Applications** pane of the Studio UI.

1. **(Optional)** If you have not already created a tracking server, or if you need to create a new one, choose **Create**. Then provide a unique tracking server name and an Amazon S3 URI for artifact storage, and create the tracking server. You can optionally choose **Configure** for more granular tracking server customization.

1. Find the tracking server of your choice in the **MLflow Tracking Servers** pane. If the tracking server is **Off**, start the tracking server.

1. Choose the vertical menu icon in the right corner of the tracking server pane. Then, choose **Open MLflow**. This launches a presigned URL in a new tab in your current browser. 

![\[The option to open a presigned URL through the MLflow Tracking Servers pane in the Studio UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/mlflow/mlflow-studio-ui.png)


## Launch the MLflow UI using the AWS CLI
<a name="mlflow-launch-ui-cli"></a>

Within your terminal, use the `create-presigned-mlflow-tracking-server-url` API to generate a presigned URL. 

```
aws sagemaker create-presigned-mlflow-tracking-server-url \
  --tracking-server-name $ts_name \
  --session-expiration-duration-in-seconds 1800 \
  --expires-in-seconds 300 \
  --region $region
```

The output should look similar to the following: 

```
{
    "AuthorizedUrl": "https://unique-key.us-west-2.experiments.sagemaker.aws.a2z.com/auth?authToken=example_token"
}
```

Press `q` to exit the output prompt, then copy the entire presigned URL into the browser of your choice. You can use a new tab or a new private window.

The `--session-expiration-duration-in-seconds` parameter determines the length of time that your MLflow UI session remains valid. The session duration time is the amount of time that the MLflow UI can be loaded in the browser before a new presigned URL must be created. The minimum session duration is 30 minutes (1800 seconds) and the maximum session duration is 12 hours (43200 seconds). The default session duration is 12 hours if no other duration is specified. 

The `--expires-in-seconds` parameter determines the length of time that your presigned URL remains valid. The minimum URL expiration length is 5 seconds and the maximum URL expiration length is 5 minutes (300 seconds). The default URL expiration length is 300 seconds. The presigned URL can be used only once. 

The window should look similar to the following. 

![\[The MLflow UI that launches after creating and using a presigned URL\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/mlflow/mlflow-ui.png)
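The session and URL expiration bounds above can be sketched as a client-side check. This is a hypothetical helper, not part of the AWS CLI or SDKs; it simply mirrors the documented limits before you call `create-presigned-mlflow-tracking-server-url`:

```python
# Hypothetical client-side validation of the presigned-URL parameters described
# above. The ranges mirror the documented limits.

def validate_presigned_url_params(session_expiration_s: int = 43200,
                                  expires_in_s: int = 300) -> None:
    """Raise ValueError if either parameter is outside its documented range."""
    if not 1800 <= session_expiration_s <= 43200:
        raise ValueError("session duration must be 1800-43200 seconds (30 min to 12 h)")
    if not 5 <= expires_in_s <= 300:
        raise ValueError("URL expiration must be 5-300 seconds")

# The values used in the example command above pass the check
validate_presigned_url_params(1800, 300)
```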


# Integrate MLflow with your environment
<a name="mlflow-track-experiments"></a>

The following page describes how to get started with the MLflow SDK and the AWS MLflow plugin within your development environment. This can include local IDEs or a Jupyter Notebook environment within Studio or Studio Classic.

Amazon SageMaker AI uses an MLflow plugin to customize the behavior of the MLflow Python client and integrate AWS tooling. The AWS MLflow plugin authenticates API calls made with MLflow using [AWS Signature Version 4](https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html). The AWS MLflow plugin allows you to connect to your MLflow tracking server using the tracking server ARN. For more information about plugins, see [AWS MLflow plugin](https://pypi.org/project/sagemaker-mlflow/) and [MLflow plugins](https://mlflow.org/docs/latest/plugins.html).

**Important**  
Your user IAM permissions within your development environment must have access to any relevant MLflow API actions to successfully run provided examples. For more information, see [Set up IAM permissions for MLflow](mlflow-create-tracking-server-iam.md).

For more information about using the MLflow SDK, see [Python API](https://mlflow.org/docs/2.13.2/python_api/index.html) in the MLflow documentation.

## Install MLflow and the AWS MLflow plugin
<a name="mlflow-track-experiments-install-plugin"></a>

Within your development environment, install both MLflow and the AWS MLflow plugin.

```
pip install sagemaker-mlflow
```

To ensure compatibility between your MLflow client and tracking server, use the corresponding MLflow version based on your tracking server version:
+ For tracking server 2.13.x, use `mlflow==2.13.2`
+ For tracking server 2.16.x, use `mlflow==2.16.2`
+ For tracking server 3.0.x, use `mlflow==3.0.0`

To see which versions of MLflow are available to use with SageMaker AI, see [Tracking server versions](mlflow.md#mlflow-create-tracking-server-versions).
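As a sketch, the client-server pairing above can be expressed as a lookup from the tracking server's `major.minor` version to a pip pin. The helper below is hypothetical and only encodes the list above; extend the table as new versions become available:

```python
# Map a tracking server version prefix to the matching mlflow client pin,
# per the compatibility list above.
CLIENT_PINS = {
    "2.13": "mlflow==2.13.2",
    "2.16": "mlflow==2.16.2",
    "3.0": "mlflow==3.0.0",
}

def client_pin(server_version: str) -> str:
    """Return the pip pin for a tracking server version such as '2.16.2'."""
    prefix = ".".join(server_version.split(".")[:2])
    try:
        return CLIENT_PINS[prefix]
    except KeyError:
        raise ValueError(f"no known client pin for tracking server {server_version}")

print(client_pin("2.16.2"))  # mlflow==2.16.2
```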

## Connect to your MLflow Tracking Server
<a name="mlflow-track-experiments-tracking-server-connect"></a>

Use [`mlflow.set_tracking_uri`](https://mlflow.org/docs/2.13.2/python_api/mlflow.html#mlflow.set_tracking_uri) to connect to your tracking server from your development environment using its ARN:

```
import mlflow

arn = "YOUR-TRACKING-SERVER-ARN"

mlflow.set_tracking_uri(arn)
```

# Log metrics, parameters, and MLflow models during training
<a name="mlflow-track-experiments-log-metrics"></a>

After connecting to your MLflow Tracking Server, you can use the MLflow SDK to log metrics, parameters, and MLflow models.

## Log training metrics
<a name="mlflow-track-experiments-log-metrics-example"></a>

Use `mlflow.log_metric` within an MLflow training run to track metrics. For more information about logging metrics using MLflow, see [`mlflow.log_metric`](https://mlflow.org/docs/2.13.2/python_api/mlflow.html#mlflow.log_metric).

```
with mlflow.start_run():
    mlflow.log_metric("foo", 1)
    
print(mlflow.search_runs())
```

This script should create an experiment run and print out an output similar to the following:

```
run_id experiment_id status artifact_uri ... tags.mlflow.source.name tags.mlflow.user tags.mlflow.source.type tags.mlflow.runName
0 607eb5c558c148dea176d8929bd44869 0 FINISHED s3://dddd/0/607eb5c558c148dea176d8929bd44869/a... ... file.py user-id LOCAL experiment-code-name
```

Within the MLflow UI, this example should look similar to the following: 

![\[An experiment shown in the top-level MLflow Experiments menu.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/mlflow/mlflow-ui-experiments.png)


Choose **Run Name** to see more run details.

![\[An experiment parameter shown on an experiment run page in the MLflow UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/mlflow/mlflow-ui-foo.png)


## Log parameters and models
<a name="mlflow-track-experiments-log-params-models"></a>

**Note**  
The following example requires your environment to have `s3:PutObject` permissions. This permission should be associated with the IAM Role that the MLflow SDK user assumes when they log into or federate into their AWS account. For more information, see [User and role policy examples](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-policies-s3.html).

The following example takes you through a basic model training workflow using SKLearn and shows you how to track that model in an MLflow experiment run. This example logs parameters, metrics, and model artifacts.

```
import mlflow

from mlflow.models import infer_signature

import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# This is the ARN of the MLflow Tracking Server you created
mlflow.set_tracking_uri("your-tracking-server-arn")
mlflow.set_experiment("some-experiment")

# Load the Iris dataset
X, y = datasets.load_iris(return_X_y=True)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the model hyperparameters
params = {"solver": "lbfgs", "max_iter": 1000, "multi_class": "auto", "random_state": 8888}

# Train the model
lr = LogisticRegression(**params)
lr.fit(X_train, y_train)

# Predict on the test set
y_pred = lr.predict(X_test)

# Calculate accuracy as a target loss metric
accuracy = accuracy_score(y_test, y_pred)

# Start an MLflow run and log parameters, metrics, and model artifacts
with mlflow.start_run():
    # Log the hyperparameters
    mlflow.log_params(params)

    # Log the loss metric
    mlflow.log_metric("accuracy", accuracy)

    # Set a tag that we can use to remind ourselves what this run was for
    mlflow.set_tag("Training Info", "Basic LR model for iris data")

    # Infer the model signature
    signature = infer_signature(X_train, lr.predict(X_train))

    # Log the model
    model_info = mlflow.sklearn.log_model(
        sk_model=lr,
        name="iris_model", # In MLflow 3.0 and later, use name instead of artifact_path
        signature=signature,
        input_example=X_train,
        registered_model_name="tracking-quickstart",
    )
```

Within the MLflow UI, choose the experiment name in the left navigation pane to explore all associated runs. Choose the **Run Name** to see more information about each run. For this example, your experiment run page for this run should look similar to the following. 

![\[Tracked parameters for an experiment run in the MLflow UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/mlflow/mlflow-ui-parameters.png)


This example logs the logistic regression model. Within the MLflow UI, you should also see the logged model artifacts.

![\[Tracked model artifacts for an experiment run in the MLflow UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/mlflow/mlflow-ui-model-artifacts.png)


# Automatically register SageMaker AI models with SageMaker Model Registry
<a name="mlflow-track-experiments-model-registration"></a>

You can log MLflow models and automatically register them with SageMaker Model Registry using either the Python SDK or directly through the MLflow UI. 

**Note**  
Do not use spaces in a model name. While MLflow supports model names with spaces, SageMaker AI Model Package does not. The auto-registration process fails if you use spaces in your model name.
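Because auto-registration fails on names that contain spaces, you may want to normalize a name before logging. The following helper is a hypothetical sketch (it is not part of the MLflow or SageMaker AI APIs):

```python
import re

def sanitize_model_name(name: str) -> str:
    """Replace runs of whitespace with hyphens so the name is accepted
    by both MLflow and SageMaker Model Registry."""
    return re.sub(r"\s+", "-", name.strip())

print(sanitize_model_name("my iris  model"))  # my-iris-model
```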

## Register models using the SageMaker Python SDK
<a name="mlflow-track-experiments-model-registration-sdk"></a>

Use `create_registered_model` within your MLflow client to automatically create a model package group in SageMaker AI that corresponds to an existing MLflow model of your choice.

```
import mlflow 
from mlflow import MlflowClient

mlflow.set_tracking_uri("your-tracking-server-arn")

client = MlflowClient()

mlflow_model_name = 'AutoRegisteredModel'
client.create_registered_model(mlflow_model_name, tags={"key1": "value1"})
```

Use `mlflow.register_model()` to automatically register a model with the SageMaker Model Registry during model training. When registering the MLflow model, a corresponding model package group and model package version are created in SageMaker AI. 

```
import mlflow.sklearn
from mlflow.models import infer_signature
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

mlflow.set_tracking_uri("your-tracking-server-arn")
params = {"n_estimators": 3, "random_state": 42}
X, y = make_regression(n_features=4, n_informative=2, random_state=0, shuffle=False)

# Log MLflow entities
with mlflow.start_run() as run:
    rfr = RandomForestRegressor(**params).fit(X, y)
    signature = infer_signature(X, rfr.predict(X))
    mlflow.log_params(params)
    mlflow.sklearn.log_model(rfr, artifact_path="sklearn-model", signature=signature)

model_uri = f"runs:/{run.info.run_id}/sklearn-model"
mv = mlflow.register_model(model_uri, "RandomForestRegressionModel")

print(f"Name: {mv.name}")
print(f"Version: {mv.version}")
```

## Register models using the MLflow UI
<a name="mlflow-track-experiments-model-registration-ui"></a>

You can alternatively register a model with the SageMaker Model Registry directly in the MLflow UI. Within the **Models** menu in the MLflow UI, choose **Create Model**. Any models newly created in this way are added to the SageMaker Model Registry.

![\[Model registry creation within the MLflow UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/mlflow/mlflow-ui-register-model.png)


After logging a model during experiment tracking, navigate to the run page in the MLflow UI. Choose the **Artifacts** pane and choose **Register model** in the upper right corner to register the model version in both MLflow and SageMaker Model Registry. 

![\[Model registry creation within the MLflow UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/mlflow/mlflow-ui-register-model-2.png)


## View registered models in Studio
<a name="mlflow-track-experiments-model-registration-ui-view"></a>

Within the SageMaker Studio landing page, choose **Models** on the left navigation pane to view your registered models. For more information on getting started with Studio, see [Launch Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-launch.html).

![\[MLflow models registered in SageMaker Model Registry in the Studio UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/mlflow/mlflow-studio-model-registry.png)


# Deploy MLflow models with `ModelBuilder`
<a name="mlflow-track-experiments-model-deployment"></a>

You can deploy MLflow models to a SageMaker AI endpoint using Amazon SageMaker AI Model Builder. For more information about Amazon SageMaker AI Model Builder, see [Create a model in Amazon SageMaker AI with ModelBuilder](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-modelbuilder-creation.html).

`ModelBuilder` is a Python class that takes a framework model or a user-specified inference specification and converts it to a deployable model. For more details about the `ModelBuilder` class, see [ModelBuilder](https://sagemaker.readthedocs.io/en/stable/api/inference/model_builder.html#sagemaker.serve.builder.model_builder.ModelBuilder).

To deploy your MLflow model using `ModelBuilder`, provide a path to your MLflow artifacts in the `model_metadata["MLFLOW_MODEL_PATH"]` attribute. Read on for more information about valid model path input formats:

**Note**  
If you provide your model artifact path in the form of an MLflow run ID or MLflow model registry path, then you must also specify your tracking server ARN through the `model_metadata["MLFLOW_TRACKING_ARN"]` attribute.
+ [Model paths that require an ARN in the `model_metadata`](#mlflow-track-experiments-model-deployment-with-arn)
+ [Model paths that do not require an ARN in the `model_metadata`](#mlflow-track-experiments-model-deployment-without-arn)

## Model paths that require an ARN in the `model_metadata`
<a name="mlflow-track-experiments-model-deployment-with-arn"></a>

The following model paths do require that you specify an ARN in the `model_metadata` for deployment:
+ MLflow [run ID](https://mlflow.org/docs/latest/python_api/mlflow.entities.html?highlight=mlflow%20info#mlflow.entities.RunInfo.run_id): `runs:/your-run-id/run-relative/path/to/model`
+ MLflow [model registry path](https://mlflow.org/docs/latest/model-registry.html#find-registered-models): `models:/model-name/model-version`

## Model paths that do not require an ARN in the `model_metadata`
<a name="mlflow-track-experiments-model-deployment-without-arn"></a>

The following model paths do not require that you specify an ARN in the `model_metadata` for deployment:
+ Local model path: `/Users/me/path/to/local/model`
+ Amazon S3 model path: `s3://amzn-s3-demo-bucket/path/to/model`
+ Model package ARN: `arn:aws:sagemaker:region:account-id:model-package/model-package-group-name/model-package-version`
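The rules above can be summarized in a small helper that decides whether a given `MLFLOW_MODEL_PATH` value must be paired with a tracking server ARN. This is an illustrative sketch based on the path prefixes listed here, not part of the SageMaker Python SDK:

```python
def needs_tracking_arn(model_path: str) -> bool:
    """Return True if the model path must be paired with
    model_metadata["MLFLOW_TRACKING_ARN"] for ModelBuilder deployment."""
    # Run IDs and model registry paths resolve against the tracking server,
    # so ModelBuilder needs the server's ARN to locate the artifacts.
    return model_path.startswith(("runs:/", "models:/"))

print(needs_tracking_arn("models:/sklearn-model/1"))                # True
print(needs_tracking_arn("s3://amzn-s3-demo-bucket/path/to/model")) # False
```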

For more information on how MLflow model deployment works with Amazon SageMaker AI, see [Deploy MLflow Model to Amazon SageMaker AI](https://mlflow.org/docs/latest/deployment/deploy-model-to-sagemaker.html) in the MLflow documentation.

If using an Amazon S3 path, you can find the path of your registered model with the following commands:

```
registered_model = client.get_registered_model(name='AutoRegisteredModel')
source_path = registered_model.latest_versions[0].source
```

The following sample is an overview of how to deploy your MLflow model using `ModelBuilder` and an MLflow model registry path. Because this sample provides the model artifact path in the form of an MLflow model registry path, the call to `ModelBuilder` must also specify a tracking server ARN through the `model_metadata["MLFLOW_TRACKING_ARN"]` attribute.

**Important**  
You must use version [2.224.0](https://pypi.org/project/sagemaker/2.224.0/) or later of the SageMaker Python SDK to use `ModelBuilder`.

**Note**  
Use the following code example for reference. For end-to-end examples that show you how to deploy registered MLflow models, see [MLflow tutorials using example Jupyter notebooks](mlflow-tutorials.md).

```
from sagemaker.serve import ModelBuilder
from sagemaker.serve.mode.function_pointers import Mode
from sagemaker.serve import SchemaBuilder

my_schema = SchemaBuilder(
    sample_input=sample_input, 
    sample_output=sample_output
)

model_builder = ModelBuilder(
    mode=Mode.SAGEMAKER_ENDPOINT,
    schema_builder=my_schema,
    role_arn="Your-service-role-ARN",
    model_metadata={
        # both model path and tracking server ARN are required if you use an mlflow run ID or mlflow model registry path as input
        "MLFLOW_MODEL_PATH": "models:/sklearn-model/1",
        "MLFLOW_TRACKING_ARN": "arn:aws:sagemaker:region:account-id:mlflow-tracking-server/tracking-server-name"
    }
)
model = model_builder.build()
predictor = model.deploy(initial_instance_count=1, instance_type="ml.c6i.xlarge")
```

To maintain [lineage tracking](https://docs.aws.amazon.com/sagemaker/latest/dg/lineage-tracking.html) for MLflow models deployed using `ModelBuilder`, you must have the following IAM permissions:
+ `sagemaker:CreateArtifact`
+ `sagemaker:ListArtifacts`
+ `sagemaker:AddAssociation`
+ `sagemaker:DescribeMLflowTrackingServer`
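The four permissions above can be expressed in an IAM policy statement similar to the following sketch. The wildcard `Resource` is a placeholder only; in practice, scope it down to your tracking server and artifact resources:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateArtifact",
        "sagemaker:ListArtifacts",
        "sagemaker:AddAssociation",
        "sagemaker:DescribeMLflowTrackingServer"
      ],
      "Resource": "*"
    }
  ]
}
```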

**Important**  
Lineage tracking is optional. Deployment succeeds without the permissions related to lineage tracking. If you do not have the permissions configured, you will see a lineage tracking permissions error when calling `model.deploy()`. However, the endpoint deployment still succeeds and you can directly interact with your model endpoint. If the permissions above are configured, lineage tracking information is automatically created and stored.

For more information and end-to-end examples, see [MLflow tutorials using example Jupyter notebooks](mlflow-tutorials.md).

# MLflow tutorials using example Jupyter notebooks
<a name="mlflow-tutorials"></a>

The following tutorials demonstrate how to integrate MLflow experiments into your training workflows. To clean up resources created by a notebook tutorial, see [Clean up MLflow resources](mlflow-cleanup.md). 

You can run SageMaker AI example notebooks using JupyterLab in Studio. For more information on JupyterLab, see [JupyterLab user guide](studio-updated-jl-user-guide.md).

Explore the following example notebooks:
+ [SageMaker Training with MLflow](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-mlflow/sagemaker_training_mlflow.html) — Train and register a Scikit-Learn model using SageMaker AI in script mode. Learn how to integrate MLflow experiments into your training script. For more information on model training, see [Train a Model with Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html).
+ [SageMaker AI HPO with MLflow](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-mlflow/sagemaker_hpo_mlflow.html) — Learn how to track your ML experiment in MLflow with Amazon SageMaker AI automatic model tuning (AMT) and the SageMaker AI Python SDK. Each training iteration is logged as a run within the same experiment. For more information about hyperparameter optimization (HPO), see [Perform Automatic Model Tuning with Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html).
+ [SageMaker Pipelines with MLflow](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-mlflow/sagemaker_pipelines_mlflow.html) — Use Amazon SageMaker Pipelines and MLflow to train, evaluate and register a model. This notebook uses the `@step` decorator to build a SageMaker AI Pipeline. For more information on pipelines and the `@step` decorator, see [Create a pipeline with `@step`-decorated functions](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-step-decorator-create-pipeline.html).
+ [Deploy an MLflow Model to SageMaker AI](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-mlflow/sagemaker_deployment_mlflow.html) — Train a decision tree model using SciKit-Learn. Then, use Amazon SageMaker AI `ModelBuilder` to deploy the model to a SageMaker AI endpoint and run inference using the deployed model. For more information about `ModelBuilder`, see [Deploy MLflow models with `ModelBuilder`](mlflow-track-experiments-model-deployment.md).

# Troubleshoot common setup issues
<a name="mlflow-troubleshooting"></a>

The following sections describe common setup issues and how to resolve them.

## Could not find executable named 'groff'
<a name="mlflow-troubleshooting-groff"></a>

When using the AWS CLI, you might encounter the following error: `Could not find executable named 'groff'`.

If using a Mac, you can resolve this issue with the following command:

```
brew install groff
```

On a Linux machine, use the following commands:

```
sudo apt-get update -y
sudo apt-get install groff -y
```

## Command not found: jq
<a name="mlflow-troubleshooting-jq"></a>

When creating your AuthZ permission policy JSON file, you might encounter the following error: `jq: command not found`.

If using a Mac, you can resolve this issue with the following command:

```
brew install jq
```

On a Linux machine, use the following commands:

```
sudo apt-get update -y
sudo apt-get install jq -y
```

## AWS MLflow plugin installation speeds
<a name="mlflow-troubleshooting-speeds"></a>

Installing the AWS MLflow plugin can take several minutes when using a Mac Python environment.

## UnsupportedModelRegistryStoreURIException
<a name="mlflow-troubleshooting-uri-exception"></a>

If you see the `UnsupportedModelRegistryStoreURIException`, do the following:

1. Restart your Jupyter notebook Kernel.

1. Reinstall the AWS MLflow plugin:

   ```
   !pip install --force-reinstall sagemaker-mlflow
   ```

# Clean up MLflow resources
<a name="mlflow-cleanup"></a>

We recommend deleting any resources when you no longer need them. You can delete tracking servers through Amazon SageMaker Studio or using the AWS CLI. You can delete additional resources such as Amazon S3 buckets, IAM roles, and IAM policies using the AWS CLI or directly in the AWS console.

**Important**  
Don't delete the IAM role that you used to create the tracking server until you've deleted the tracking server itself. Otherwise, you'll lose access to the tracking server.

## Stop tracking servers
<a name="mlflow-cleanup-stop-server"></a>

We recommend stopping your tracking server when it is no longer in use. You can stop a tracking server in Studio or using the AWS CLI.

### Stop a tracking server using Studio
<a name="mlflow-cleanup-stop-server-ui"></a>

To stop a tracking server in Studio: 

1. Navigate to Studio.

1. Choose **MLflow** in the **Applications** pane of the Studio UI.

1. Find the tracking server of your choice in the **MLflow Tracking Servers** pane. Choose the **Stop** icon in the right corner of the tracking server pane.
**Note**  
If your tracking server is **Off**, you see the **Start** icon. If the tracking server is **On**, you see the **Stop** icon.

### Stop a tracking server using the AWS CLI
<a name="mlflow-cleanup-stop-server-cli"></a>

To stop the tracking server using the AWS CLI, use the following command: 

```
aws sagemaker stop-mlflow-tracking-server \
  --tracking-server-name $ts_name \
  --region $region
```

To start the tracking server using the AWS CLI, use the following command: 

**Note**  
It may take up to 25 minutes to start your tracking server.

```
aws sagemaker start-mlflow-tracking-server \
  --tracking-server-name $ts_name \
  --region $region
```

## Delete tracking servers
<a name="mlflow-cleanup-delete-server"></a>

You can fully delete a tracking server in Studio or using the AWS CLI. 

### Delete a tracking server using Studio
<a name="mlflow-cleanup-delete-server-ui"></a>

To delete a tracking server in Studio: 

1. Navigate to Studio.

1. Choose **MLflow** in the **Applications** pane of the Studio UI.

1. Find the tracking server of your choice in the **MLflow Tracking Servers** pane. Choose the vertical menu icon in the right corner of the tracking server pane. Then, choose **Delete**. 

1. Choose **Delete** to confirm deletion.

![\[The deletion option on a tracking server card in the MLflow Tracking Servers pane of the Studio UI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/mlflow/mlflow-studio-delete.png)


### Delete a tracking server using the AWS CLI
<a name="mlflow-cleanup-delete-server-cli"></a>

Use the `DeleteMLflowTrackingServer` API to delete any tracking servers that you created. This may take some time.

```
aws sagemaker delete-mlflow-tracking-server \
  --tracking-server-name $ts_name \
  --region $region
```

To view the status of your tracking server, use the `DescribeMLflowTrackingServer` API and check the `TrackingServerStatus`. 

```
aws sagemaker describe-mlflow-tracking-server \
  --tracking-server-name $ts_name \
  --region $region
```

## Delete Amazon S3 buckets
<a name="mlflow-cleanup-delete-bucket"></a>

Delete any Amazon S3 bucket used as an artifact store for your tracking server using the following commands:

```
aws s3 rm s3://$bucket_name --recursive
aws s3 rb s3://$bucket_name
```

You can alternatively delete an Amazon S3 bucket associated with your tracking server directly in the AWS console. For more information, see [Deleting a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html) in the *Amazon S3 User Guide*.

## Delete registered models
<a name="mlflow-cleanup-delete-models"></a>

You can delete any model groups and model versions created with MLflow directly in Studio. For more information, see [Delete a Model Group](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-delete-model-group.html) and [Delete a Model Version](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-delete-model-version.html).

## Delete experiments or runs
<a name="mlflow-cleanup-delete-experiments"></a>

You can use the MLflow SDK to delete experiments or runs.
+ [mlflow.delete\_experiment](https://mlflow.org/docs/latest/python_api/mlflow.html?highlight=delete_experiment#mlflow.delete_experiment)
+ [mlflow.delete\_run](https://mlflow.org/docs/latest/python_api/mlflow.html?highlight=delete_experiment#mlflow.delete_run)

# Amazon SageMaker Experiments in Studio Classic
<a name="experiments"></a>

**Important**  
Experiment tracking using the SageMaker Experiments Python SDK is only available in Studio Classic. We recommend using the new Studio experience and creating experiments using the latest SageMaker AI integrations with MLflow. There is no MLflow UI integration with Studio Classic. If you want to use MLflow with Studio, you must launch the MLflow UI using the AWS CLI. For more information, see [Launch the MLflow UI using the AWS CLI](mlflow-launch-ui.md#mlflow-launch-ui-cli).

Amazon SageMaker Experiments Classic is a capability of Amazon SageMaker AI that lets you create, manage, analyze, and compare your machine learning experiments in Studio Classic. Use SageMaker Experiments to view, manage, analyze, and compare both custom experiments that you programmatically create and experiments automatically created from SageMaker AI jobs. 

Experiments Classic automatically tracks the inputs, parameters, configurations, and results of your iterations as *runs*. You can assign, group, and organize these runs into *experiments*. SageMaker Experiments is integrated with Amazon SageMaker Studio Classic, providing a visual interface to browse your active and past experiments, compare runs on key performance metrics, and identify the best performing models. SageMaker Experiments tracks all of the steps and artifacts that went into creating a model, and you can quickly revisit the origins of a model when you are troubleshooting issues in production, or auditing your models for compliance verifications.

## Migrate from Experiments Classic to Amazon SageMaker AI with MLflow
<a name="experiments-mlflow-migration"></a>

Past experiments created using Experiments Classic are still available to view in Studio Classic. If you want to maintain and use past experiment code with MLflow, you must update your training code to use the MLflow SDK and run the training experiments again. For more information on getting started with the MLflow SDK and the AWS MLflow plugin, see [Integrate MLflow with your environment](mlflow-track-experiments.md).

# Example notebooks for Experiments Classic
<a name="experiments-examples"></a>

The following example notebooks demonstrate how to track runs for various model training experiments. You can view the resulting experiments in Studio Classic after running the notebooks. For a tutorial that showcases additional features of Studio Classic, see [Amazon SageMaker Studio Classic Tour](gs-studio-end-to-end.md).

## Track experiments in a notebook environment
<a name="experiments-tutorials-notebooks"></a>

To learn more about tracking experiments in a notebook environment, see the following example notebooks:
+ [Track an experiment while training a Keras model locally](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-experiments/local_experiment_tracking/keras_experiment.html)
+ [Track an experiment while training a Pytorch model locally or in your notebook](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-experiments/local_experiment_tracking/pytorch_experiment.html)

## Track bias and explainability for your experiments with SageMaker Clarify
<a name="experiments-tutorials-clarify"></a>

For a step-by-step guide on tracking bias and explainability for your experiments, see the following example notebook:
+ [ Fairness and Explainability with SageMaker Clarify ](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-experiments/sagemaker_clarify_integration/tracking_bias_explainability.html)

## Track experiments for SageMaker training jobs using script mode
<a name="experiments-tutorials-scripts"></a>

For more information about tracking experiments for SageMaker training jobs, see the following example notebooks:
+ [Run a SageMaker AI Experiment with Pytorch Distributed Data Parallel - MNIST Handwritten Digits Classification](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-experiments/sagemaker_job_tracking/pytorch_distributed_training_experiment.html)
+ [Track an experiment while training a Pytorch model with a SageMaker Training Job](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-experiments/sagemaker_job_tracking/pytorch_script_mode_training_job.html)
+ [Train a TensorFlow model with a SageMaker training job and track it using SageMaker Experiments](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-experiments/sagemaker_job_tracking/tensorflow_script_mode_training_job.html)

# View experiments and runs
<a name="experiments-view-compare"></a>

Amazon SageMaker Studio Classic provides an experiments browser that you can use to view lists of experiments and runs. You can choose one of these entities to view detailed information about the entity or choose multiple entities for comparison. You can filter the list of experiments by entity name, type, and tags.

**To view experiments and runs**

1. To view the experiment in Studio Classic, in the left sidebar, choose **Experiments**.

   Select the name of the experiment to view all associated runs. You can search experiments by typing directly into the **Search** bar or filtering for experiment type. You can also choose which columns to display in your experiment or run list.

   It might take a moment for the list to refresh and display a new experiment or experiment run. You can choose **Refresh** to update the page. Your experiment list should look similar to the following:  
![\[A list of experiments in the SageMaker Experiments UI\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/experiments-classic/experiments-overview.png)

1. In the experiments list, double-click an experiment to display a list of the runs in the experiment.
**Note**  
Experiment runs that are automatically created by SageMaker AI jobs and containers are visible in the Experiments Studio Classic UI by default. To hide runs created by SageMaker AI jobs for a given experiment, choose the settings icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Settings_squid.png)) and toggle **Show jobs**.  
![\[A list of experiment runs in the SageMaker Experiments UI\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/experiments-classic/experiments-runs-overview.png)

1. Double-click a run to display information about a specific run.

   In the **Overview** pane, choose any of the following headings to see available information about each run:
   + **Metrics** – Metrics that are logged during a run.
   + **Charts** – Build your own charts to compare runs.
   + **Output artifacts** – Any resulting artifacts of the experiment run and the artifact locations in Amazon S3.
   + **Bias reports** – Pre-training or post-training bias reports generated using Clarify.
   + **Explainability** – Explainability reports generated using Clarify.
   + **Debugs** – A list of debugger rules and any issues found.

# Automatic model tuning with SageMaker AI
<a name="automatic-model-tuning"></a>

Amazon SageMaker AI automatic model tuning (AMT), also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset. To do this, AMT uses the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that create a model that performs the best, as measured by a metric that you choose.

For example, suppose you're solving a *[binary classification](https://docs.aws.amazon.com/glossary/latest/reference/glos-chap.html#binary-classification-model)* problem on a marketing dataset. Your goal is to maximize the *[area under the curve (AUC)](https://docs.aws.amazon.com/glossary/latest/reference/glos-chap.html#AUC)* metric by training an [XGBoost algorithm with Amazon SageMaker AI](xgboost.md) model. You want to find the values of the `eta`, `alpha`, `min_child_weight`, and `max_depth` hyperparameters that train the best model. Specify a range of values for these hyperparameters. Then, SageMaker AI hyperparameter tuning searches within the ranges to find a combination that creates a training job that creates a model with the highest AUC. To conserve resources or meet a specific model quality expectation, set up completion criteria to stop tuning after the criteria have been met.
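As a toy illustration of tuning with a completion criterion, the sketch below samples the four hyperparameters mentioned above at random and stops once a target metric is met or a job budget is exhausted. The ranges and the scoring function are synthetic stand-ins, not calls to SageMaker AI:

```python
import random

random.seed(0)

# Hypothetical ranges for the XGBoost hyperparameters discussed above.
ranges = {
    "eta": (0.01, 0.3),
    "alpha": (0.0, 1.0),
    "min_child_weight": (1, 10),
    "max_depth": (3, 10),
}

def synthetic_auc(cfg):
    # Stand-in for a real training job's validation AUC.
    return 0.7 + 0.2 * cfg["eta"] / 0.3 - 0.05 * abs(cfg["max_depth"] - 6) / 7

best = None
max_jobs, target_auc = 20, 0.85
for job in range(max_jobs):
    cfg = {
        "eta": random.uniform(*ranges["eta"]),
        "alpha": random.uniform(*ranges["alpha"]),
        "min_child_weight": random.randint(*ranges["min_child_weight"]),
        "max_depth": random.randint(*ranges["max_depth"]),
    }
    auc = synthetic_auc(cfg)
    if best is None or auc > best[0]:
        best = (auc, cfg)
    if best[0] >= target_auc:  # completion criterion: stop once the target is met
        break

print(f"best AUC {best[0]:.3f} after {job + 1} jobs")
```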

You can use SageMaker AI AMT with built-in algorithms, custom algorithms, or SageMaker AI pre-built containers for machine learning frameworks.

SageMaker AI AMT can use an Amazon EC2 Spot instance to optimize costs when running training jobs. For more information, see [Managed Spot Training in Amazon SageMaker AI](model-managed-spot-training.md).

Before you start using hyperparameter tuning, you should have a well-defined machine learning problem, including the following:
+ A dataset
+ An understanding of the type of algorithm that you need to train
+ A clear understanding of how you measure success

Prepare your dataset and algorithm so that they work in SageMaker AI and successfully run a training job at least once. For information about setting up and running a training job, see [Guide to getting set up with Amazon SageMaker AI](gs.md).

**Topics**
+ [Understand the hyperparameter tuning strategies available in Amazon SageMaker AI](automatic-model-tuning-how-it-works.md)
+ [Define metrics and environment variables](automatic-model-tuning-define-metrics-variables.md)
+ [Define Hyperparameter Ranges](automatic-model-tuning-define-ranges.md)
+ [Track and set completion criteria for your tuning job](automatic-model-tuning-progress.md)
+ [Tune Multiple Algorithms with Hyperparameter Optimization to Find the Best Model](multiple-algorithm-hpo.md)
+ [Example: Hyperparameter Tuning Job](automatic-model-tuning-ex.md)
+ [Stop Training Jobs Early](automatic-model-tuning-early-stopping.md)
+ [Run a Warm Start Hyperparameter Tuning Job](automatic-model-tuning-warm-start.md)
+ [Resource Limits for Automatic Model Tuning](automatic-model-tuning-limits.md)
+ [Best Practices for Hyperparameter Tuning](automatic-model-tuning-considerations.md)

# Understand the hyperparameter tuning strategies available in Amazon SageMaker AI
<a name="automatic-model-tuning-how-it-works"></a>

When you build complex machine learning systems like deep learning neural networks, exploring all of the possible combinations is impractical. Hyperparameter tuning can accelerate your productivity by trying many variations of a model. It looks for the best model automatically by focusing on the most promising combinations of hyperparameter values within the ranges that you specify. To get good results, you must choose the right ranges to explore. This page provides a brief explanation of the different hyperparameter tuning strategies that you can use with Amazon SageMaker AI.

Use the [API reference guide](https://docs.aws.amazon.com/sagemaker/latest/APIReference/Welcome.html?icmpid=docs_sagemaker_lp) to understand how to interact with hyperparameter tuning. You can use the tuning strategies described on this page with the [HyperParameterTuningJobConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTuningJobConfig.html) and [HyperbandStrategyConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperbandStrategyConfig.html) APIs.

**Note**  
Because the algorithm itself is stochastic, the hyperparameter tuning model may fail to converge on the best answer. This can occur even if the best possible combination of values is within the ranges that you choose.

## Grid search
<a name="automatic-tuning-grid-search"></a>

When using grid search, hyperparameter tuning chooses combinations of values from the range of categorical values that you specify when you create the job. Only categorical parameters are supported when using the grid search strategy. You do not need to specify `MaxNumberOfTrainingJobs`. The number of training jobs created by the tuning job is automatically calculated to be the total number of distinct categorical combinations possible. If specified, the value of `MaxNumberOfTrainingJobs` should equal the total number of distinct categorical combinations possible.
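For grid search, the total number of training jobs is the size of the Cartesian product of the categorical ranges. A quick sketch of that calculation, using hypothetical parameter values:

```python
from itertools import product
from math import prod

# Hypothetical categorical ranges for a grid search tuning job.
grid = {
    "optimizer": ["sgd", "adam"],
    "batch_size": ["64", "128", "256"],
    "activation": ["relu", "tanh"],
}

# Every distinct combination becomes one training job.
combinations = list(product(*grid.values()))
print(len(combinations))  # 2 * 3 * 2 = 12
assert len(combinations) == prod(len(v) for v in grid.values())
```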

## Random search
<a name="automatic-tuning-random-search"></a>

When using random search, hyperparameter tuning chooses a random combination of hyperparameter values in the ranges that you specify for each training job it launches. The choice of hyperparameter values doesn't depend on the results of previous training jobs. As a result, you can run the maximum number of concurrent training jobs without changing the performance of the tuning.

For an example notebook that uses random search, see the [ Random search and hyperparameter scaling with SageMaker XGBoost and Automatic Model Tuning](https://github.com/aws/amazon-sagemaker-examples-community/blob/215215eb25b40eadaf126d055dbb718a245d7603/training/sagemaker-automatic-model-tuning/hpo_xgboost_random_log.ipynb) notebook.
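Because each random-search draw is independent of earlier results, candidate configurations can be generated up front and evaluated concurrently. The following minimal sketch uses hypothetical ranges and a synthetic objective that stands in for a training job:

```python
import random
from concurrent.futures import ThreadPoolExecutor

random.seed(42)

def sample_config():
    # Draws are independent: no result from one job influences another.
    return {"eta": random.uniform(0.01, 0.3), "max_depth": random.randint(3, 10)}

def evaluate(cfg):
    # Stand-in for launching a training job and reading back its metric.
    return cfg["eta"] * (10 - cfg["max_depth"])

configs = [sample_config() for _ in range(8)]

# All eight "training jobs" can run concurrently without changing the result.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, configs))

best = max(zip(scores, range(len(configs))))
print(f"best score {best[0]:.3f} from config {configs[best[1]]}")
```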

## Bayesian optimization
<a name="automatic-tuning-bayesian-optimization"></a>

Bayesian optimization treats hyperparameter tuning like a *[regression](https://docs.aws.amazon.com/glossary/latest/reference/glos-chap.html)* problem. Given a set of input features (the hyperparameters), hyperparameter tuning optimizes a model for the metric that you choose. To solve a regression problem, hyperparameter tuning makes guesses about which hyperparameter combinations are likely to get the best results. It then runs training jobs to test these values. After testing a set of hyperparameter values, hyperparameter tuning uses regression to choose the next set of hyperparameter values to test.

Hyperparameter tuning uses an Amazon SageMaker AI implementation of Bayesian optimization.

When choosing the best hyperparameters for the next training job, hyperparameter tuning considers everything that it knows about this problem so far. Sometimes it chooses a combination of hyperparameter values close to the combination that resulted in the best previous training job to incrementally improve performance. This allows hyperparameter tuning to use the best known results. Other times, it chooses a set of hyperparameter values far removed from those it has tried. This allows it to explore the range of hyperparameter values to try to find new areas that are not yet well understood. The explore/exploit trade-off is common in many machine learning problems.
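The explore/exploit trade-off can be seen in a deliberately simplified sketch. This is not SageMaker AI's implementation; the surrogate here is a crude nearest-neighbor estimate, and exploration is a distance-based bonus for candidates far from anything already tried:

```python
import random

def objective(x):
    # Stand-in for a training job: returns a validation metric to minimize.
    return (x - 0.3) ** 2

def next_candidate(history, exploration_weight=0.1):
    """Pick the candidate that balances predicted value (exploit)
    against distance from already-tried points (explore)."""
    candidates = [random.uniform(0, 1) for _ in range(100)]

    def score(x):
        # Crude surrogate: predict the result of the nearest tried point.
        nearest = min(history, key=lambda h: abs(h[0] - x))
        predicted = nearest[1]
        # Exploration bonus grows with distance from anything tried.
        bonus = exploration_weight * abs(nearest[0] - x)
        return predicted - bonus

    return min(candidates, key=score)

random.seed(0)
history = [(x, objective(x)) for x in (0.0, 0.5, 1.0)]  # initial jobs
for _ in range(20):
    x = next_candidate(history)
    history.append((x, objective(x)))

best_x, best_y = min(history, key=lambda h: h[1])
print(best_x, best_y)
```

Each iteration either refines the region around the current best point or probes an unexplored area, which is the same trade-off described above, stripped of the Gaussian-process machinery a real Bayesian optimizer uses.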

For more information about Bayesian optimization, see the following:

**Basic Topics on Bayesian Optimization**
+ [A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning](https://arxiv.org/abs/1012.2599)
+ [Practical Bayesian Optimization of Machine Learning Algorithms](https://arxiv.org/abs/1206.2944)
+ [Taking the Human Out of the Loop: A Review of Bayesian Optimization](https://ieeexplore.ieee.org/document/7352306?reload=true)

**Speeding up Bayesian Optimization**
+ [Google Vizier: A Service for Black-Box Optimization](https://dl.acm.org/doi/10.1145/3097983.3098043)
+ [Learning Curve Prediction with Bayesian Neural Networks](https://openreview.net/forum?id=S11KBYclx)
+ [Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves](https://dl.acm.org/doi/10.5555/2832581.2832731)

**Advanced Modeling and Transfer Learning**
+ [Scalable Hyperparameter Transfer Learning](https://papers.nips.cc/paper_files/paper/2018/hash/14c879f3f5d8ed93a09f6090d77c2cc3-Abstract.html)
+ [Bayesian Optimization with Tree-structured Dependencies](http://proceedings.mlr.press/v70/jenatton17a.html)
+ [Bayesian Optimization with Robust Bayesian Neural Networks](https://papers.nips.cc/paper_files/paper/2016/hash/291597a100aadd814d197af4f4bab3a7-Abstract.html)
+ [Scalable Bayesian Optimization Using Deep Neural Networks](http://proceedings.mlr.press/v37/snoek15.pdf)
+ [Input Warping for Bayesian Optimization of Non-stationary Functions](https://arxiv.org/abs/1402.0929)

## Hyperband
<a name="automatic-tuning-hyperband"></a>

Hyperband is a multi-fidelity based tuning strategy that dynamically reallocates resources. Hyperband uses both intermediate and final results of training jobs to re-allocate epochs to well-utilized hyperparameter configurations and automatically stops those that underperform. It also seamlessly scales to using many parallel training jobs. These features can significantly speed up hyperparameter tuning over random search and Bayesian optimization strategies.

Hyperband should only be used to tune iterative algorithms that publish results at different resource levels. For example, Hyperband can be used to tune a neural network for image classification which publishes accuracy metrics after every epoch.

For more information about Hyperband, see the following links:
+ [Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization](http://arxiv.org/pdf/1603.06560)
+ [Massively Parallel Hyperparameter Tuning](https://liamcli.com/assets/pdf/asha_arxiv.pdf)
+ [BOHB: Robust and Efficient Hyperparameter Optimization at Scale](http://proceedings.mlr.press/v80/falkner18a/falkner18a.pdf)
+ [Model-based Asynchronous Hyperparameter and Neural Architecture Search](https://openreview.net/pdf?id=a2rFihIU7i)

### Hyperband with early stopping
<a name="automatic-tuning-hyperband-early-stopping"></a>

Training jobs can be stopped early when they are unlikely to improve the objective metric of the hyperparameter tuning job. This can help reduce compute time and avoid overfitting your model. Hyperband uses an advanced internal mechanism to apply early stopping. The parameter `TrainingJobEarlyStoppingType` in the `HyperParameterTuningJobConfig` API must be set to `OFF` when using the Hyperband internal early stopping feature.
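A minimal sketch of the relevant `HyperParameterTuningJobConfig` fields might look like the following (all other required fields, such as the objective, ranges, and resource limits, are omitted):

```python
# Sketch of the HyperParameterTuningJobConfig fields relevant to Hyperband.
tuning_config = {
    "Strategy": "Hyperband",
    # Hyperband applies its own internal early stopping, so the generic
    # early-stopping feature must be turned off.
    "TrainingJobEarlyStoppingType": "OFF",
}
print(tuning_config)
```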

**Note**  
Hyperparameter tuning might not improve your model. It is an advanced tool for building machine learning solutions. As such, it should be considered part of the scientific development process. 

# Define metrics and environment variables
<a name="automatic-model-tuning-define-metrics-variables"></a>

A tuning job optimizes hyperparameters for training jobs that it launches by using a metric to evaluate performance. This guide shows how to define metrics so that you can use a custom algorithm for training, or use a built-in algorithm from Amazon SageMaker AI. This guide also shows how to specify environment variables during an Automatic model tuning (AMT) job.

## Define metrics
<a name="automatic-model-tuning-define-metrics"></a>

Amazon SageMaker AI hyperparameter tuning parses your machine learning algorithm's `stdout` and `stderr` streams to find metrics, such as loss or validation-accuracy. The metrics show how well the model is performing on the dataset. 

The following sections describe how to use two types of algorithms for training: built-in and custom.

### Use a built-in algorithm for training
<a name="automatic-model-tuning-define-metrics-builtin"></a>

If you use one of the [SageMaker AI built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html), metrics are already defined for you. In addition, built-in algorithms automatically send metrics to hyperparameter tuning for optimization. These metrics are also written to Amazon CloudWatch logs. For more information, see [Log Amazon SageMaker AI Events with Amazon CloudWatch](https://docs.aws.amazon.com/sagemaker/latest/dg/logging-cloudwatch.html). 

For the objective metric for the tuning job, choose one of the metrics that the built-in algorithm emits. For a list of available metrics, see the model tuning section for the appropriate algorithm in [Use Amazon SageMaker AI Built-in Algorithms or Pre-trained Models](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html).

You can choose up to 40 metrics to monitor in your [tuning job](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterAlgorithmSpecification.html). Select one of those metrics to be the objective metric. The hyperparameter tuning job returns the [training job](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeHyperParameterTuningJob.html#sagemaker-DescribeHyperParameterTuningJob-response-BestTrainingJob) that performed the best against the objective metric.

**Note**  
Hyperparameter tuning automatically sends an additional hyperparameter `_tuning_objective_metric` to pass your objective metric to the tuning job for use during training.

### Use a custom algorithm for training
<a name="automatic-model-tuning-define-metrics-custom"></a>

This section shows how to define your own metrics to use your own custom algorithm for training. When doing so, make sure that your algorithm writes at least one metric to `stderr` or `stdout`. Hyperparameter tuning parses these streams to find algorithm metrics that show how well the model is performing on the dataset.

You can define custom metrics by specifying a name and regular expression for each metric that your tuning job monitors. Then, pass these metric definitions to the [CreateHyperParameterTuningJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html) API in the `MetricDefinitions` field of `AlgorithmSpecification`, within the `TrainingJobDefinition` parameter.

The following shows sample output from a log written to `stderr` or `stdout` by a training algorithm.

```
GAN_loss=0.138318;  Scaled_reg=2.654134; disc:[-0.017371,0.102429] real 93.3% gen 0.0% disc-combined=0.000000; disc_train_loss=1.374587;  Loss = 16.020744;  Iteration 0 took 0.704s;  Elapsed=0s
```

The following metric definitions use regular expressions (regex) to search the sample log output and capture the numeric values of four different metrics.

```
[
    {
        "Name": "ganloss",
        "Regex": "GAN_loss=(.*?);"
    },
    {
        "Name": "disc-combined",
        "Regex": "disc-combined=(.*?);"
    },
    {
        "Name": "discloss",
        "Regex": "disc_train_loss=(.*?);"
    },
    {
        "Name": "loss",
        "Regex": "Loss = (.*?);"
    }
]
```

In regular expressions, parentheses `()` group parts of the expression together and capture the matched text.
+ For the `loss` metric that is defined in the code example, the expression `(.*?);` captures any characters between the exact text `Loss = ` and the first semicolon (`;`) character.
+ The character `.` matches any single character.
+ The character `*` matches zero or more of the preceding character.
+ The character `?` makes the match non-greedy, so it stops at the first instance of the `;` character.

The `loss` metric defined in the code sample captures the value `16.020744` from the sample output.
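These metric definitions can be checked locally against the sample log line before submitting a tuning job. The following sketch applies each regex the way the tuning service would, taking the first captured group as the metric value:

```python
import re

# The sample log line emitted by the training algorithm.
log_line = (
    "GAN_loss=0.138318;  Scaled_reg=2.654134; disc:[-0.017371,0.102429] "
    "real 93.3% gen 0.0% disc-combined=0.000000; disc_train_loss=1.374587;  "
    "Loss = 16.020744;  Iteration 0 took 0.704s;  Elapsed=0s"
)

metric_definitions = [
    {"Name": "ganloss", "Regex": "GAN_loss=(.*?);"},
    {"Name": "disc-combined", "Regex": "disc-combined=(.*?);"},
    {"Name": "discloss", "Regex": "disc_train_loss=(.*?);"},
    {"Name": "loss", "Regex": "Loss = (.*?);"},
]

# Take the first captured group of each regex as the metric value.
values = {
    d["Name"]: float(re.search(d["Regex"], log_line).group(1))
    for d in metric_definitions
}
print(values["loss"])  # 16.020744
```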

Choose one of the metrics that you define as the objective metric for the tuning job. If you are using the SageMaker API, specify the metric's name as the value of the `MetricName` key in the `HyperParameterTuningJobObjective` field of the `HyperParameterTuningJobConfig` parameter that you send to the [CreateHyperParameterTuningJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html) operation.

## Specify environment variables
<a name="automatic-model-tuning-define-variables"></a>

SageMaker AI AMT optimizes hyperparameters within a tuning job to find the best parameters for model performance. You can use environment variables to configure your tuning job to change its behavior. You can also use environment variables that you used during training inside your tuning job.

If you want to use an environment variable from your tuning job or specify a new environment variable, pass key-value string pairs in the `Environment` field of the SageMaker AI [HyperParameterTrainingJobDefinition](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTrainingJobDefinition.html) API. Pass this training job definition to the [CreateHyperParameterTuningJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html) API.

For example, the environment variable `SM_LOG_LEVEL` can be set to the following values to tailor the output from a Python container.

```
NOTSET=0
DEBUG=10
INFO=20
WARN=30
ERROR=40
CRITICAL=50
```

As an example, to set the log level to `10` to debug your container logs, set the environment variable inside the [HyperParameterTrainingJobDefinition](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTrainingJobDefinition.html), as follows.

```
{
   "HyperParameterTuningJobConfig": { 
      ...
   },
   "TrainingJobDefinition": { 
      ...,
      "Environment": {
          "SM_LOG_LEVEL": "10"
      },
      ...
   },
   ...
}
```
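The following sketch shows where the `Environment` map sits in a training job definition built in Python. The extra variable name is hypothetical, and the request is constructed but not sent:

```python
# Sketch of a training job definition carrying environment variables.
# MY_FLAG is a hypothetical variable used only for illustration.
training_job_definition = {
    "StaticHyperParameters": {"objective": "reg:squarederror"},
    "Environment": {
        # Values in the Environment map must be strings.
        "SM_LOG_LEVEL": "10",
        "MY_FLAG": "enabled",
    },
    # ... AlgorithmSpecification, RoleArn, input/output config, resources ...
}

# A boto3 client call would then look like:
# sagemaker.create_hyper_parameter_tuning_job(
#     HyperParameterTuningJobName="my-tuning-job",
#     HyperParameterTuningJobConfig=tuning_config,
#     TrainingJobDefinition=training_job_definition,
# )
print(training_job_definition["Environment"]["SM_LOG_LEVEL"])
```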

# Define Hyperparameter Ranges
<a name="automatic-model-tuning-define-ranges"></a>

This guide shows how to use SageMaker APIs to define hyperparameter ranges. It also provides a list of hyperparameter scaling types that you can use.

Choosing hyperparameters and ranges significantly affects the performance of your tuning job. Hyperparameter tuning finds the best hyperparameter values for your model by searching over a [range](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTrainingJobDefinition.html#sagemaker-Type-HyperParameterTrainingJobDefinition-HyperParameterRanges) of values that you specify for each tunable hyperparameter. You can also specify up to 100 [static hyperparameters](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTrainingJobDefinition.html#sagemaker-Type-HyperParameterTrainingJobDefinition-StaticHyperParameters) that do not change over the course of the tuning job. You can use up to 100 hyperparameters in total (static + tunable). For guidance on choosing hyperparameters and ranges, see [Best Practices for Hyperparameter Tuning](automatic-model-tuning-considerations.md). You can also use autotune to find optimal tuning job settings. For more information, see the following **Autotune** section.

**Note**  
SageMaker AI Automatic Model Tuning (AMT) may add additional hyperparameter(s) that contribute to the limit of 100 total hyperparameters. Currently, to pass your objective metric to the tuning job for use during training, SageMaker AI adds `_tuning_objective_metric` automatically.

## Static hyperparameters
<a name="automatic-model-tuning-define-ranges-static"></a>

Use static hyperparameters in the following cases:
+ You have background knowledge that guides you to select a constant value.
+ You don't want to explore a range of values for the hyperparameter.

For example, you can use AMT to tune your model using `param1` (a tunable parameter) and `param2` (a static parameter). If you do, then use a search space for `param1` that lies between two values, and pass `param2` as a static hyperparameter, as follows.

```
param1: ["range_min","range_max"]
param2: "static_value"
```

Static hyperparameters have the following structure:

```
"StaticHyperParameters": {
    "objective" : "reg:squarederror",
    "dropout_rate": "0.3"
}
```

You can use the Amazon SageMaker API to specify key value pairs in the [StaticHyperParameters](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTrainingJobDefinition.html#sagemaker-Type-HyperParameterTrainingJobDefinition-StaticHyperParameters) field of the `HyperParameterTrainingJobDefinition` parameter that you pass to the [CreateHyperParameterTuningJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html) operation.

## Dynamic hyperparameters
<a name="automatic-model-tuning-define-ranges-dynamic"></a>

You can use the SageMaker API to define [hyperparameter ranges](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTrainingJobDefinition.html#sagemaker-Type-HyperParameterTrainingJobDefinition-HyperParameterRanges). Specify the names of hyperparameters and ranges of values in the `ParameterRanges` field of the `HyperParameterTuningJobConfig` parameter that you pass to the [CreateHyperParameterTuningJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html) operation. 

The `ParameterRanges` field has three subfields: categorical, integer, and continuous. You can define up to 30 total (categorical + integer + continuous) tunable hyperparameters to search over. 

**Note**  
Each categorical hyperparameter can have at most 30 different values.

Dynamic hyperparameters have the following structure:

```
"ParameterRanges": {
    "CategoricalParameterRanges": [
        {
            "Name": "tree_method",
            "Values": ["auto", "exact", "approx", "hist"]
        }
    ],
    "ContinuousParameterRanges": [
        {
            "Name": "eta",
            "MaxValue" : "0.5",
            "MinValue": "0",
            "ScalingType": "Auto"
        }
    ],
    "IntegerParameterRanges": [
        {
            "Name": "max_depth",
            "MaxValue": "10",
            "MinValue": "1",
            "ScalingType": "Auto"
        }
    ]
}
```
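The limits above can be validated locally before submitting a job. The following sketch is a hypothetical helper, not part of any SageMaker AI SDK; it checks the total count of tunable hyperparameters and the per-categorical-parameter limit:

```python
def validate_parameter_ranges(parameter_ranges):
    """Check the documented ParameterRanges limits locally."""
    categorical = parameter_ranges.get("CategoricalParameterRanges", [])
    integer = parameter_ranges.get("IntegerParameterRanges", [])
    continuous = parameter_ranges.get("ContinuousParameterRanges", [])

    total = len(categorical) + len(integer) + len(continuous)
    if total > 30:
        raise ValueError(f"{total} tunable hyperparameters; the limit is 30")
    for param in categorical:
        if len(param["Values"]) > 30:
            raise ValueError(
                f"{param['Name']} has {len(param['Values'])} values; the limit is 30"
            )
    return total

ranges = {
    "CategoricalParameterRanges": [
        {"Name": "tree_method", "Values": ["auto", "exact", "approx", "hist"]}
    ],
    "ContinuousParameterRanges": [
        {"Name": "eta", "MaxValue": "0.5", "MinValue": "0", "ScalingType": "Auto"}
    ],
    "IntegerParameterRanges": [
        {"Name": "max_depth", "MaxValue": "10", "MinValue": "1", "ScalingType": "Auto"}
    ],
}
print(validate_parameter_ranges(ranges))  # 3
```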

If you create a tuning job with a `Grid` strategy, you can only specify categorical values. You don't need to provide the `MaxNumberOfTrainingJobs`. This value is inferred from the total number of configurations that can be produced from your categorical parameters. If specified, the value of `MaxNumberOfTrainingJobs` should be equal to the total number of distinct categorical combinations possible.

## Autotune
<a name="automatic-model-tuning-define-ranges-autotune"></a>

To save the time and resources of searching for hyperparameter ranges, resource limits, or an objective metric, autotune can automatically guess optimal values for some hyperparameter tuning fields. Use autotune to find optimal values for the following fields:
+ **[ParameterRanges](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTuningJobConfig.html#sagemaker-Type-HyperParameterTuningJobConfig-ParameterRanges)** – The names and ranges of hyperparameters that a tuning job can optimize.
+ **[ResourceLimits](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ResourceLimits.html)** – The maximum resources to be used in a tuning job. These resources can include the maximum number of training jobs, maximum runtime of a tuning job, and the maximum number of training jobs that can be run at the same time.
+ **[TrainingJobEarlyStoppingType](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTuningJobConfig.html#sagemaker-Type-HyperParameterTuningJobConfig-TrainingJobEarlyStoppingType)** – A flag that stops a training job if a job is not significantly improving against an objective metric. Defaults to enabled. For more information, see [Stop Training Jobs Early](automatic-model-tuning-early-stopping.md).
+ **[RetryStrategy](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTrainingJobDefinition.html#sagemaker-Type-HyperParameterTrainingJobDefinition-RetryStrategy)** – The number of times to retry a training job. Non-zero values for `RetryStrategy` can increase the likelihood that your job will complete successfully.
+ **[Strategy](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTuningJobConfig.html#sagemaker-Type-HyperParameterTuningJobConfig-Strategy)** – Specifies how hyperparameter tuning chooses the combinations of hyperparameter values to use for the training job that it launches.
+ **[ConvergenceDetected](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ConvergenceDetected.html)** – A flag to indicate that Automatic Model Tuning (AMT) has detected model convergence.

To use autotune, do the following:

1. Specify the hyperparameter and an example value in the `AutoParameters` field of the [ParameterRanges](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ParameterRanges.html) API.

1. Enable autotune.

AMT will determine if your hyperparameters and example values are eligible for autotune. Hyperparameters that can be used in autotune are automatically assigned to the appropriate parameter range type. Then, AMT uses `ValueHint` to select an optimal range for you. You can use the `DescribeHyperParameterTuningJob` API to view these ranges.

The following example shows you how to configure a tuning job that uses autotune. In the configuration example, the hyperparameter `max_depth` has `ValueHint` containing an example value of `4`.

```
config = {
    'Autotune': {'Mode': 'Enabled'},
    'HyperParameterTuningJobName':'my-autotune-job',
    'HyperParameterTuningJobConfig': {
        'HyperParameterTuningJobObjective': {'Type': 'Minimize', 'MetricName': 'validation:rmse'},
        'ResourceLimits': {'MaxNumberOfTrainingJobs': 5, 'MaxParallelTrainingJobs': 1},
        'ParameterRanges': {       
            'AutoParameters': [
                {'Name': 'max_depth', 'ValueHint': '4'}
            ]
        }
    },
    'TrainingJobDefinition': {
        ....
    }
}
```

Continuing the previous example, a tuning job is created after the previous configuration is included in a call to the `CreateHyperParameterTuningJob` API. Then, autotune converts the `max_depth` hyperparameter in `AutoParameters` to an `IntegerParameterRanges` entry. The following response from the `DescribeHyperParameterTuningJob` API shows that the optimal `IntegerParameterRanges` for `max_depth` are between `2` and `8`.

```
{
    'HyperParameterTuningJobName':'my_job',
    'HyperParameterTuningJobConfig': {
        'ParameterRanges': {
            'IntegerParameterRanges': [
                {'Name': 'max_depth', 'MinValue': '2', 'MaxValue': '8'},
            ],
        }
    },
    'TrainingJobDefinition': {
        ...
    },
    'Autotune': {'Mode': 'Enabled'}
    
}
```

## Hyperparameter scaling types
<a name="scaling-type"></a>

For integer and continuous hyperparameter ranges, you can choose the scale that hyperparameter tuning uses to search the range of values. To do so, specify a value for the `ScalingType` field of the hyperparameter range. You can choose from the following hyperparameter scaling types:

Auto  
SageMaker AI hyperparameter tuning chooses the best scale for the hyperparameter.

Linear  
Hyperparameter tuning searches the values in the hyperparameter range by using a linear scale. Typically, you choose this if the range of all values from the lowest to the highest is relatively small (within one order of magnitude). Uniformly searching values from the range provides a reasonable exploration of the entire range.

Logarithmic  
Hyperparameter tuning searches the values in the hyperparameter range by using a logarithmic scale.  
Logarithmic scaling works only for ranges that have values greater than 0.  
Choose logarithmic scaling when you're searching a range that spans several orders of magnitude.   
For example, suppose you're tuning a [linear learner](linear-learner.md) model and you specify a range of values between .0001 and 1.0 for the `learning_rate` hyperparameter. Searching uniformly on a logarithmic scale gives you a better sample of the entire range than searching on a linear scale would. This is because searching on a linear scale would, on average, devote 90 percent of your training budget to the values between .1 and 1.0, leaving only 10 percent of your training budget for the values between .0001 and .1.

`ReverseLogarithmic`  
Hyperparameter tuning searches the values in the hyperparameter range by using a reverse logarithmic scale. Reverse logarithmic scaling is supported only for continuous hyperparameter ranges. It is not supported for integer hyperparameter ranges.  
Choose reverse logarithmic scaling when you are searching a range that is highly sensitive to small changes that are very close to 1.  
Reverse logarithmic scaling works only for ranges that are entirely within the range 0<=x<1.0.
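The budget argument for logarithmic scaling can be made concrete with a quick simulation. Sampling uniformly on a linear scale over the range .0001 to 1.0 puts roughly 90 percent of samples at or above .1, while sampling uniformly in log space spreads samples evenly over the four orders of magnitude, so only about a quarter land there:

```python
import math
import random

random.seed(0)
low, high = 1e-4, 1.0
n = 100_000

# Uniform sampling on a linear scale.
linear = [random.uniform(low, high) for _ in range(n)]
# Uniform sampling in log space (one sample per order of magnitude on average).
log_scale = [
    10 ** random.uniform(math.log10(low), math.log10(high)) for _ in range(n)
]

frac_linear = sum(x >= 0.1 for x in linear) / n     # ~0.90
frac_log = sum(x >= 0.1 for x in log_scale) / n     # ~0.25 (one of four decades)
print(frac_linear, frac_log)
```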

For an example notebook that uses hyperparameter scaling, see these [Amazon SageMaker AI hyperparameter examples on GitHub](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning).

# Track and set completion criteria for your tuning job
<a name="automatic-model-tuning-progress"></a>

You can use completion criteria to instruct Automatic model tuning (AMT) to stop your tuning job if certain conditions are met. With these conditions, you can set a minimum model performance or a maximum number of training jobs that don't improve when evaluated against the objective metric. You can also track the progress of your tuning job and decide to let it continue or to stop it manually. This guide shows you how to set completion criteria, check the progress of your tuning job, and stop it manually.

## Set completion criteria for your tuning job
<a name="automatic-model-tuning-progress-completion"></a>

During hyperparameter optimization, a tuning job launches several training jobs inside a loop. The tuning job does the following:
+ Checks your training jobs for completion and updates statistics accordingly.
+ Decides which combination of hyperparameters to evaluate next.

AMT will continuously check the training jobs that were launched from your tuning job to update statistics. These statistics include tuning job runtime and best training job. Then, AMT determines whether it should stop the job according to your completion criteria. You can also check these statistics and stop your job manually. For more information about stopping a job manually, see the [Stopping your tuning job manually](#automatic-model-tuning-progress-stop) section.

As an example, if your tuning job meets your objective, you can stop tuning early to conserve resources or ensure model quality. AMT checks your job performance against your completion criteria and stops the tuning job if any have been met. 

You can specify the following kinds of completion criteria:
+ `MaxNumberOfTrainingJobs` – The maximum number of training jobs to be run before tuning is stopped.
+ `MaxNumberOfTrainingJobsNotImproving` – The maximum number of training jobs that do not improve performance against the objective metric from the current best training job. For example, if the best training job returned an objective metric with an accuracy of `90%` and `MaxNumberOfTrainingJobsNotImproving` is set to `10`, then tuning stops after `10` training jobs fail to return an accuracy higher than `90%`.
+ `MaxRuntimeInSeconds` – The upper limit of wall clock time in seconds of how long a tuning job can run.
+ `TargetObjectiveMetricValue` – The value of the objective metric against which the tuning job is evaluated. Once this value is met, AMT stops the tuning job.
+ `CompleteOnConvergence` – A flag to stop tuning after an internal algorithm determines that the tuning job is unlikely to improve more than 1% over the objective metric from the best training job.
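A sketch of how these criteria might appear inside `HyperParameterTuningJobConfig` follows. The limits and the target value are illustrative:

```python
# Sketch of completion criteria inside HyperParameterTuningJobConfig.
# The counts and the target metric value are illustrative only.
tuning_config = {
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 200,
        "MaxParallelTrainingJobs": 10,
        "MaxRuntimeInSeconds": 3600,
    },
    "TuningJobCompletionCriteria": {
        # Stop as soon as any training job reaches this objective value.
        "TargetObjectiveMetricValue": 0.95,
        # Stop after 10 training jobs fail to improve on the best job so far.
        "BestObjectiveNotImproving": {"MaxNumberOfTrainingJobsNotImproving": 10},
        # Stop when AMT detects that the tuning job has converged.
        "ConvergenceDetected": {"CompleteOnConvergence": "Enabled"},
    },
}
print(tuning_config["TuningJobCompletionCriteria"])
```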

### Selecting completion criteria
<a name="automatic-model-tuning-progress-completion-how"></a>

You can choose one or multiple completion criteria to stop your hyperparameter tuning job after a condition has been met. The following instructions show you how to select completion criteria and how to decide which is the most appropriate for your use case.
+ Use `MaxNumberOfTrainingJobs` in the [ResourceLimits](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ResourceLimits.html) API to set an upper limit for the number of training jobs that can be run before your tuning job is stopped. Start with a large number and adjust it based on model performance against your tuning job objective. Most users input values of around `50` or more training jobs to find an optimal hyperparameter configuration. Users looking for higher levels of model performance will use `200` or more training jobs.
+ Use `MaxNumberOfTrainingJobsNotImproving` in the [BestObjectiveNotImproving](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_BestObjectiveNotImproving.html) API field to stop training if model performance fails to improve after a specified number of jobs. Model performance is evaluated against an objective function. After the `MaxNumberOfTrainingJobsNotImproving` is met, AMT will stop the tuning job. Tuning jobs tend to make the most progress in the beginning of the job. Improving model performance against an objective function will require a larger number of training jobs towards the end of tuning. Select a value for `MaxNumberOfTrainingJobsNotImproving` by checking the performance of similar training jobs against your objective metric.
+ Use `MaxRuntimeInSeconds` in the [ResourceLimits](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ResourceLimits.html) API to set an upper limit for the amount of wall clock time that the tuning job may take. Use this field to meet a deadline by which the tuning job must complete or to limit compute resources.

  To get an estimated total compute time in seconds for a tuning job, use the following formula:

  Estimated max compute time in seconds = `MaxRuntimeInSeconds` * `MaxParallelTrainingJobs` * `MaxInstancesPerTrainingJob` 
**Note**  
The actual duration of a tuning job may deviate slightly from the value specified in this field.
+ Use `TargetObjectiveMetricValue` in the [TuningJobCompletionCriteria](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TuningJobCompletionCriteria.html) API to stop your tuning job. You stop the tuning job after any training job that is launched by the tuning job reaches this objective metric value. Use this field if your use case depends on reaching a specific performance level, rather than spending compute resources to find the best possible model.
+ Use `CompleteOnConvergence` in the [TuningJobCompletionCriteria](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TuningJobCompletionCriteria.html) API to stop a tuning job after AMT has detected that the tuning job has converged and is unlikely to make further significant progress. Use this field when it is not clear what values for any of the other completion criteria should be used. AMT determines convergence based on an algorithm developed and tested on a wide range of diverse benchmarks. A tuning job is defined to have converged when none of the training jobs return significant improvement (1% or less). Improvement is measured against the objective metric returned by the highest performing job so far.
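The compute-time estimate for `MaxRuntimeInSeconds` can be written as a small helper. This is a hypothetical function with illustrative values, not part of any SageMaker AI SDK:

```python
def estimated_max_compute_seconds(
    max_runtime_in_seconds: int,
    max_parallel_training_jobs: int,
    max_instances_per_training_job: int,
) -> int:
    """Upper bound on total instance-seconds a tuning job can consume:
    wall-clock limit * parallel jobs * instances per training job."""
    return (
        max_runtime_in_seconds
        * max_parallel_training_jobs
        * max_instances_per_training_job
    )

# 1 hour of wall-clock time, 10 parallel jobs, 2 instances each:
print(estimated_max_compute_seconds(3600, 10, 2))  # 72000
```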

### Combining different completion criteria
<a name="automatic-model-tuning-progress-completion-combine"></a>

You can also combine any of the different completion criteria in the same tuning job. AMT will stop the tuning job when any one of the completion criteria is met. For example, if you want to tune your model until it meets an objective metric, but don't want to keep tuning if your job has converged, use the following guidance.
+ Specify `TargetObjectiveMetricValue` in the [TuningJobCompletionCriteria](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TuningJobCompletionCriteria.html) API to set a target objective metrics value to reach.
+ Set [CompleteOnConvergence](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ConvergenceDetected.html) to `Enabled` to stop a tuning job if AMT has determined that model performance is unlikely to improve.

## Track tuning job progress
<a name="automatic-model-tuning-progress-track"></a>

You can use the `DescribeHyperParameterTuningJob` API to track the progress of your tuning job at any time while it is running. You don't have to specify completion criteria to obtain tracking information for your tuning job. Use the following fields to obtain statistics about your tuning job.
+ [BestTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeHyperParameterTuningJob.html#sagemaker-DescribeHyperParameterTuningJob-response-BestTrainingJob) – An object that describes the best training job obtained so far, evaluated against your objective metric. Use this field to check your current model performance and the value of the objective metric of this best training job.
+ [ObjectiveStatusCounters](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeHyperParameterTuningJob.html#sagemaker-DescribeHyperParameterTuningJob-response-ObjectiveStatusCounters) – An object that specifies the total number of training jobs completed in a tuning job. To estimate the average duration of a training job, use `ObjectiveStatusCounters` together with the total runtime of the tuning job. You can use the average duration to estimate how much longer your tuning job will run.
+ `ConsumedResources` – The total resources, such as `RunTimeInSeconds`, consumed by your tuning job. Compare `ConsumedResources`, found in the [DescribeHyperParameterTuningJob ](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeHyperParameterTuningJob.html) API, against `BestTrainingJob` in the same API. You can also compare `ConsumedResources` against the response from the [ListTrainingJobsForHyperParameterTuningJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListTrainingJobsForHyperParameterTuningJob.html) API to assess if your tuning job is making satisfactory progress given the resources being consumed.
+ [TuningJobCompletionDetails](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTuningJobCompletionDetails.html) – Tuning job completion information that includes the following:
  + The timestamp of when convergence is detected if the job has converged.
  + The number of training jobs that have not improved model performance. Model performance is evaluated against the objective metric from the best training job.

  Use the tuning job completion details to assess how likely your tuning job is to improve your model performance. Model performance is evaluated against the objective metric of the best training job that ran to completion.
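
As a sketch, the fields above can be pulled out of a `DescribeHyperParameterTuningJob` response with a small helper like the following. The helper name and the sample response values are hypothetical.

```
def summarize_tuning_progress(desc):
    """Summarize a DescribeHyperParameterTuningJob response (hypothetical helper)."""
    best = desc.get("BestTrainingJob", {})
    counters = desc.get("ObjectiveStatusCounters", {})
    consumed = desc.get("ConsumedResources", {})
    completed = counters.get("Succeeded", 0) + counters.get("Failed", 0)
    runtime = consumed.get("RunTimeInSeconds", 0)
    return {
        "best_job": best.get("TrainingJobName"),
        "best_metric": best.get(
            "FinalHyperParameterTuningJobObjectiveMetric", {}
        ).get("Value"),
        "jobs_completed": completed,
        # Average training job duration, used to estimate remaining runtime.
        "avg_seconds_per_job": runtime / completed if completed else None,
    }

# In practice the response would come from the API, for example:
# desc = boto3.client("sagemaker").describe_hyper_parameter_tuning_job(
#     HyperParameterTuningJobName="my-tuning-job"
# )
```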

## Stopping your tuning job manually
<a name="automatic-model-tuning-progress-stop"></a>

You can determine whether to let the tuning job run until it completes or to stop it manually. To decide, use the information returned by the `DescribeHyperParameterTuningJob` API, as shown in the previous **Track tuning job progress** section. For example, if your model performance does not improve after several training jobs complete, you may choose to stop the tuning job. Model performance is evaluated against the best objective metric.

To stop the tuning job manually, use the [StopHyperParameterTuningJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StopHyperParameterTuningJob.html) API and provide the name of the tuning job to be stopped.

# Tune Multiple Algorithms with Hyperparameter Optimization to Find the Best Model
<a name="multiple-algorithm-hpo"></a>

To create a new hyperparameter optimization (HPO) job with Amazon SageMaker AI that tunes multiple algorithms, you must provide job settings that apply to all of the algorithms to be tested and a training definition for each of these algorithms. You must also specify the resources you want to use for the tuning job.
+ The **job settings** apply to all of the algorithms in the tuning job, and include warm starting, early stopping, and the tuning strategy. Warm starting and early stopping are available only when tuning a single algorithm.
+ The **training job definition** specifies the name, algorithm source, objective metric, and, when required, the range of values to configure the set of hyperparameter values for each training job. It configures the channels for data inputs, data output locations, and any checkpoint storage locations for each training job. The definition also configures the resources to deploy for each training job, including instance types and counts, managed spot training, and stopping conditions.
+ The **tuning job resources** include the maximum number of training jobs that the hyperparameter tuning job can run concurrently and the maximum total number of training jobs that it can run.

## Get Started
<a name="multiple-algorithm-hpo-get-started"></a>

You can create a new hyperparameter tuning job, clone a job, or add or edit tags on a job from the console. You can also use the search feature to find jobs by their name, creation time, or status. Alternatively, you can create hyperparameter tuning jobs with the SageMaker AI API.
+ **In the console**: To create a new job, open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/), choose **Hyperparameter tuning jobs** from the **Training** menu, and then choose **Create hyperparameter tuning job**. Then follow the configuration steps to create a training job definition for each algorithm that you want to use. These steps are documented in the [Create a Hyperparameter Optimization Tuning Job for One or More Algorithms (Console)](multiple-algorithm-hpo-create-tuning-jobs.md) topic. 
**Note**  
When you start the configuration steps, note that the warm start and early stopping features are not available to use with multi-algorithm HPO. If you want to use these features, you can only tune a single algorithm at a time. 
+ **With the API**: For instructions on using the SageMaker AI API to create a hyperparameter tuning job, see [Example: Hyperparameter Tuning Job](automatic-model-tuning-ex.html). When you call [CreateHyperParameterTuningJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html) to tune multiple algorithms, you must provide a list of training definitions using [TrainingJobDefinitions](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html#sagemaker-CreateHyperParameterTuningJob-request-TrainingJobDefinitions) instead of specifying a single [TrainingJobDefinition](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html#sagemaker-CreateHyperParameterTuningJob-request-TrainingJobDefinition). You must provide job settings that apply to all of the algorithms to be tested and a training definition for each of these algorithms. You must also specify the resources that you want to use for the tuning job. Choose only one of these definition types depending on the number of algorithms that are being tuned. 

**Topics**
+ [Get Started](#multiple-algorithm-hpo-get-started)
+ [Create a Hyperparameter Optimization Tuning Job for One or More Algorithms (Console)](multiple-algorithm-hpo-create-tuning-jobs.md)
+ [Manage Hyperparameter Tuning and Training Jobs](multiple-algorithm-hpo-manage-tuning-jobs.md)

# Create a Hyperparameter Optimization Tuning Job for One or More Algorithms (Console)
<a name="multiple-algorithm-hpo-create-tuning-jobs"></a>

This guide shows you how to create a new hyperparameter optimization (HPO) tuning job for one or more algorithms. To create an HPO job, define the settings for the tuning job, and create training job definitions for each algorithm being tuned. Next, configure the resources for and create the tuning job. The following sections provide details about how to complete each step. We provide an example of how to tune multiple algorithms using the SageMaker AI SDK for Python client at the end of this guide.

## Components of a tuning job
<a name="multiple-algorithm-hpo-create-tuning-jobs-define-settings"></a>

An HPO tuning job contains the following three components:
+ Tuning job settings
+ Training job definitions
+ Tuning job configuration

The way that these components are included in your HPO tuning job depends on whether your tuning job contains one or multiple training algorithms. The following guide describes each of the components and gives an example of both types of tuning jobs.

### Tuning job settings
<a name="multiple-algorithm-hpo-create-tuning-jobs-components-tuning-settings"></a>

Your tuning job settings are applied across all of the algorithms in the HPO tuning job. Warm start and early stopping are available only when you're tuning a single algorithm. After you define the job settings, you can create individual training definitions for each algorithm or variation that you want to tune. 

**Warm start**  
If you cloned this job, you can use the results from a previous tuning job to improve the performance of this new tuning job. This is the warm start feature, and it's only available when tuning a single algorithm. With the warm start option, you can choose up to five previous hyperparameter tuning jobs to use. Alternatively, you can use transfer learning to add additional data to the parent tuning job. When you select this option, you choose one previous tuning job as the parent. 

**Note**  
Warm start is compatible only with tuning jobs that were created after October 1, 2018. For more information, see [Run a warm start job](automatic-model-tuning-considerations.html).

**Early stopping**  
To reduce compute time and avoid overfitting your model, you can stop training jobs early. Early stopping is helpful when the training job is unlikely to improve the current best objective metric of the hyperparameter tuning job. Like warm start, this feature is only available when tuning a single algorithm. This is an automatic feature without configuration options, and it’s disabled by default. For more information about how early stopping works, the algorithms that support it, and how to use it with your own algorithms, see [Stop Training Jobs Early](automatic-model-tuning-early-stopping.html).

**Tuning strategy**  
The tuning strategy can be random, Bayesian, or Hyperband. These selections specify how automatic tuning algorithms search the hyperparameter ranges that you select in a later step. Random search chooses random combinations of values from the specified ranges and can be run sequentially or in parallel. Bayesian optimization chooses values based on what is likely to get the best result according to the known history of previous selections. Hyperband uses a multi-fidelity strategy that dynamically allocates resources toward well-utilized jobs and automatically stops those that underperform. The new configuration that starts after stopping other configurations is chosen randomly.

Hyperband can only be used with iterative algorithms, or algorithms that run steps in iterations, such as [XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) or [Random Cut Forest](https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html). Hyperband can't be used with non-iterative algorithms, such as decision trees or [k-Nearest Neighbors](https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html). For more information about search strategies, see [How Hyperparameter Tuning Works](automatic-model-tuning-how-it-works.html).

**Note**  
Hyperband uses an advanced internal mechanism to apply early stopping. Therefore, when you use the Hyperband internal early stopping feature, the parameter `TrainingJobEarlyStoppingType` in the `HyperParameterTuningJobConfig` API must be set to `OFF`.
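
For illustration, the relevant fields of a `HyperParameterTuningJobConfig` using Hyperband might look like the following sketch. The `MinResource` and `MaxResource` values (for example, training epochs) are hypothetical.

```
# Sketch of the Hyperband-related fields in a HyperParameterTuningJobConfig.
# The MinResource/MaxResource values are illustrative.
tuning_job_config = {
    "Strategy": "Hyperband",
    "StrategyConfig": {
        "HyperbandStrategyConfig": {"MinResource": 1, "MaxResource": 100},
    },
    # Hyperband applies its own internal early stopping, so the standard
    # early stopping feature must be turned off.
    "TrainingJobEarlyStoppingType": "Off",
}
```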

**Tags**  
To help you manage tuning jobs, you can enter tags as key-value pairs to assign metadata to tuning jobs. Values in the key-value pair are not required; you can use a key without a value. To see the keys associated with a job, choose the **Tags** tab on the details page for the tuning job. For more information about using tags for tuning jobs, see [Manage Hyperparameter Tuning and Training Jobs](multiple-algorithm-hpo-manage-tuning-jobs.md).

### Training job definitions
<a name="multiple-algorithm-hpo-create-tuning-jobs-training-definitions"></a>

To create a training job definition, you must configure the algorithm and parameters, define the data input and output, and configure resources. Provide at least one [TrainingJobDefinition](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TrainingJobDefinition.html) for each HPO tuning job. Each training definition specifies the configuration for an algorithm.

To create several definitions for your training job, you can clone a job definition. Cloning a job can save time because it copies all of the job settings, including data channels and Amazon S3 storage locations for output artifacts. You can edit a cloned job to change what you need for your use case.

**Topics**
+ [Configure algorithm and parameters](#multiple-algorithm-hpo-algorithm-configuration)
+ [Define data input and output](#multiple-algorithm-hpo-data)
+ [Configure training job resources](#multiple-algorithm-hpo-training-job-definition-resources)
+ [Add or clone a training job](#multiple-algorithm-hpo-add-training-job)

#### Configure algorithm and parameters
<a name="multiple-algorithm-hpo-algorithm-configuration"></a>

 The following list describes what you need to configure the set of hyperparameter values for each training job. 
+ A name for your tuning job
+ Permission to access services
+ Parameters for any algorithm options
+ An objective metric
+ The range of hyperparameter values, when required

**Name**  
 Provide a unique name for the training definition. 

**Permissions**  
 Amazon SageMaker AI requires permissions to call other services on your behalf. Choose an AWS Identity and Access Management (IAM) role, or let AWS create a role with the `AmazonSageMakerFullAccess` IAM policy attached. 

**Optional security settings**  
 The network isolation setting prevents the container from making any outbound network calls. This is required for AWS Marketplace machine learning offerings. 

 You can also choose to use a virtual private cloud (VPC).

**Note**  
 Inter-container encryption is only available when you create a job definition from the API. 

**Algorithm options**  
You can choose built-in algorithms, your own algorithm, your own container with an algorithm, or you can subscribe to an algorithm from AWS Marketplace. 
+ If you choose a built-in algorithm, the Amazon Elastic Container Registry (Amazon ECR) image information is pre-populated.
+ If you choose your own container, you must specify the Amazon ECR image information. You can select file or pipe as the input mode for the algorithm.
+ If you plan to supply your data using a CSV file from Amazon S3, select file mode.

**Metrics**  
When you choose a built-in algorithm, metrics are provided for you. If you choose your own algorithm, you must define your metrics. You can define up to 20 metrics for your tuning job to monitor. You must choose one metric as the objective metric. For more information about how to define a metric for a tuning job, see [Define metrics](automatic-model-tuning-define-metrics-variables.md#automatic-model-tuning-define-metrics).
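
To illustrate, a custom algorithm's metrics are defined by a name and a regular expression that SageMaker AI applies to the training logs to capture each metric value. The metric names and log format below are hypothetical.

```
import re

# Hypothetical metric definitions for a custom algorithm. SageMaker AI applies
# each Regex to the training job's log output to capture the metric value.
metric_definitions = [
    {"Name": "validation:accuracy", "Regex": "validation-accuracy: ([0-9\\.]+)"},
    {"Name": "train:loss", "Regex": "train-loss: ([0-9\\.]+)"},
]

# A log line that the first definition would match:
log_line = "epoch 10, validation-accuracy: 0.9137"
match = re.search(metric_definitions[0]["Regex"], log_line)
captured = match.group(1)  # "0.9137"
```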

**Objective metric**  
To find the best training job, set an objective metric and whether to maximize or minimize it. After the training job is complete, you can view the tuning job detail page. The detail page provides a summary of the best training job that is found using this objective metric. 

**Hyperparameter configuration**  
When you choose a built-in algorithm, the default values for its hyperparameters are set for you, using ranges that are optimized for the algorithm that's being tuned. You can change these values as you see fit. For example, instead of a range, you can set a fixed value for a hyperparameter by setting the parameter’s type to **static**. Each algorithm has different required and optional parameters. For more information, see [Best Practices for Hyperparameter Tuning](automatic-model-tuning-considerations.html) and [Define Hyperparameter Ranges](automatic-model-tuning-define-ranges.html). 

#### Define data input and output
<a name="multiple-algorithm-hpo-data"></a>

Each training job definition for a tuning job must configure the channels for data inputs, data output locations, and optionally, any checkpoint storage locations for each training job. 

**Input data configuration**  
Input data is defined by channels. Each channel has its own source location (Amazon S3 or Amazon Elastic File System), compression, and format options. You can define up to 20 channels of input sources. If the algorithm that you choose supports multiple input channels, you can specify those, too. For example, when you use the [XGBoost churn prediction notebook](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.html), you can add two channels: train and validation.

**Checkpoint configuration**  
Checkpoints are periodically generated during training. For the checkpoints to be saved, you must choose an Amazon S3 location. Checkpoints are used in metrics reporting, and are also used to resume managed spot training jobs. For more information, see [Checkpoints in Amazon SageMaker AI](model-checkpoints.md).

**Output data configuration**  
Define an Amazon S3 location for the artifacts of the training job to be stored. You have the option of adding encryption to the output using an AWS Key Management Service (AWS KMS) key. 

#### Configure training job resources
<a name="multiple-algorithm-hpo-training-job-definition-resources"></a>

Each training job definition for a tuning job must configure the resources to deploy, including instance types and counts, managed spot training, and stopping conditions.

**Resource configuration**  
Each training definition can have a different resource configuration. You choose the instance type and number of nodes. 

**Managed spot training**  
You can save compute costs for jobs if you have flexibility in start and end times by allowing SageMaker AI to use spare capacity to run jobs. For more information, see [Managed Spot Training in Amazon SageMaker AI](model-managed-spot-training.md).

**Stopping condition**  
The stopping condition specifies the maximum duration that's allowed for each training job. 

#### Add or clone a training job
<a name="multiple-algorithm-hpo-add-training-job"></a>

After you create a training job definition for a tuning job, you return to the **Training Job Definition(s)** panel. This panel is where you can create additional training job definitions to train additional algorithms. You can select **Add training job definition** and work through the steps to define a training job again. 

Alternatively, to replicate an existing training job definition and edit it for the new algorithm, choose **Clone** from the **Action** menu. The clone option can save time because it copies all of the job’s settings, including the data channels and Amazon S3 storage locations. For more information about cloning, see [Manage Hyperparameter Tuning and Training Jobs](multiple-algorithm-hpo-manage-tuning-jobs.md).

### Tuning job configuration
<a name="multiple-algorithm-hpo-resource-config"></a>

**Resource Limits**  
You can specify the maximum number of training jobs that a hyperparameter tuning job can run concurrently (10 at most). You can also specify the maximum total number of training jobs that the hyperparameter tuning job can run (500 at most). The number of parallel jobs should not exceed the number of nodes that you have requested across all of your training definitions. The total number of jobs can’t exceed the number of jobs that your definitions are expected to run.

Review the job settings, the training job definitions, and the resource limits. Then select **Create hyperparameter tuning job**.

## HPO tuning job example
<a name="multiple-algorithm-hpo-create-tuning-jobs-define-example"></a>

To run a hyperparameter optimization (HPO) training job, first create a training job definition for each algorithm that's being tuned. Next, define the tuning job settings and configure the resources for the tuning job. Finally, run the tuning job.

If your HPO tuning job contains a single training algorithm, the SageMaker AI tuning function will call the `HyperparameterTuner` API directly and pass in your parameters. If your HPO tuning job contains multiple training algorithms, your tuning function will call the `create` function of the `HyperparameterTuner` API. The `create` function tells the API to expect a dictionary containing one or more estimators.

In the following section, code examples show how to tune a job containing either a single training algorithm or multiple algorithms using the SageMaker AI Python SDK.

### Create training job definitions
<a name="multiple-algorithm-hpo-create-tuning-jobs-define-example-train"></a>

When you create a tuning job that includes multiple training algorithms, your tuning job configuration will include the estimators and metrics and other parameters for your training jobs. Therefore, you need to create the training job definition first, and then configure your tuning job. 

The following code example shows how to retrieve two SageMaker AI containers containing the built-in algorithms [XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) and [Linear Learner](https://docs.aws.amazon.com/sagemaker/latest/dg/linear-learner.html). If your tuning job contains only one training algorithm, omit one of the containers and one of the estimators.

```
import sagemaker
from sagemaker import image_uris

from sagemaker.estimator import Estimator

sess = sagemaker.Session()
region = sess.boto_region_name
role = sagemaker.get_execution_role()

bucket = sess.default_bucket()
prefix = "sagemaker/multi-algo-hpo"

# Define the training containers and initialize the estimators
xgb_container = image_uris.retrieve("xgboost", region, "latest")
ll_container = image_uris.retrieve("linear-learner", region, "latest")

xgb_estimator = Estimator(
    xgb_container,
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path="s3://{}/{}/xgb_output".format(bucket, prefix),
    sagemaker_session=sess,
)

ll_estimator = Estimator(
    ll_container,
    role,
    instance_count=1,
    instance_type="ml.c4.xlarge",
    output_path="s3://{}/{}/ll_output".format(bucket, prefix),
    sagemaker_session=sess,
)

# Set static hyperparameters
ll_estimator.set_hyperparameters(predictor_type="binary_classifier")
xgb_estimator.set_hyperparameters(
    eval_metric="auc",
    objective="binary:logistic",
    num_round=100,
    rate_drop=0.3,
    tweedie_variance_power=1.4,
)
```

Next, define your input data by specifying the training, validation, and testing datasets, as shown in the following code example. This example shows how to tune multiple training algorithms.

```
training_data = sagemaker.inputs.TrainingInput(
    s3_data="s3://{}/{}/train".format(bucket, prefix), content_type="csv"
)
validation_data = sagemaker.inputs.TrainingInput(
    s3_data="s3://{}/{}/validate".format(bucket, prefix), content_type="csv"
)
test_data = sagemaker.inputs.TrainingInput(
    s3_data="s3://{}/{}/test".format(bucket, prefix), content_type="csv"
)

train_inputs = {
    "estimator-1": {
        "train": training_data,
        "validation": validation_data,
        "test": test_data,
    },
    "estimator-2": {
        "train": training_data,
        "validation": validation_data,
        "test": test_data,
    },
}
```

If your tuning job contains only one training algorithm, your `train_inputs` should contain only one estimator.

You must upload the training, validation, and testing datasets to your Amazon S3 bucket before you use them in an HPO tuning job.

### Define resources and settings for your tuning job
<a name="multiple-algorithm-hpo-create-tuning-jobs-define-example-resources"></a>

This section shows how to initialize a tuner, define resources, and specify job settings for your tuning job. If your tuning job contains multiple training algorithms, these settings are applied to all of the algorithms that are contained inside your tuning job. This section provides two code examples to define a tuner. The code examples show you how to optimize a single training algorithm followed by an example of how to tune multiple training algorithms.

#### Tune a single training algorithm
<a name="multiple-algorithm-hpo-create-tuning-jobs-define-example-resources-single"></a>

The following code example shows how to initialize a tuner and set hyperparameter ranges for one SageMaker AI built-in algorithm, XGBoost.

```
from sagemaker.tuner import HyperparameterTuner
from sagemaker.parameter import ContinuousParameter, IntegerParameter

hyperparameter_ranges = {
    "max_depth": IntegerParameter(1, 10),
    "eta": ContinuousParameter(0.1, 0.3),
}

objective_metric_name = "validation:accuracy"

tuner = HyperparameterTuner(
    xgb_estimator,
    objective_metric_name,
    hyperparameter_ranges,
    objective_type="Maximize",
    max_jobs=5,
    max_parallel_jobs=2,
)
```

#### Tune multiple training algorithms
<a name="multiple-algorithm-hpo-create-tuning-jobs-define-example-resources-multiple"></a>

Each training job requires different configurations, and these are specified using a dictionary. The following code example shows how to initialize a tuner with configurations for two SageMaker AI built-in algorithms, XGBoost and Linear Learner. The code example also shows how to set a tuning strategy and other job settings, such as the compute resources for the tuning job. The following code example uses `metric_definitions_dict`, which is optional.

```
from sagemaker.tuner import HyperparameterTuner
from sagemaker.parameter import ContinuousParameter, IntegerParameter

# Initialize your tuner
tuner = HyperparameterTuner.create(
    estimator_dict={
        "estimator-1": xgb_estimator,
        "estimator-2": ll_estimator,
    },
    objective_metric_name_dict={
        "estimator-1": "validation:auc",
        "estimator-2": "test:binary_classification_accuracy",
    },
    hyperparameter_ranges_dict={
        "estimator-1": {"eta": ContinuousParameter(0.1, 0.3)},
        "estimator-2": {"learning_rate": ContinuousParameter(0.1, 0.3)},
    },
    metric_definitions_dict={
        "estimator-1": [
            {"Name": "validation:auc", "Regex": "Overall test accuracy: (.*?);"}
        ],
        "estimator-2": [
            {
                "Name": "test:binary_classification_accuracy",
                "Regex": "Overall test accuracy: (.*?);",
            }
        ],
    },
    strategy="Bayesian",
    max_jobs=10,
    max_parallel_jobs=3,
)
```

### Run your HPO tuning job
<a name="multiple-algorithm-hpo-create-tuning-jobs-define-example-run"></a>

Now you can run your tuning job by passing your training inputs to the `fit` function of the `HyperparameterTuner` class. The following code example shows how to pass the `train_inputs` parameter, which is defined in a previous code example, to your tuner.

```
tuner.fit(inputs=train_inputs, include_cls_metadata={}, estimator_kwargs={})
```

# Manage Hyperparameter Tuning and Training Jobs
<a name="multiple-algorithm-hpo-manage-tuning-jobs"></a>

A tuning job can contain many training jobs, and creating and managing these jobs and their definitions can become a complex and onerous task. SageMaker AI provides tools to help you manage these jobs. Tuning jobs that you have run can be accessed from the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/). Select **Hyperparameter tuning jobs** from the **Training** menu to see the list. This page is also where you start the procedure to create a new tuning job by selecting **Create hyperparameter tuning job**. 

To see the training jobs run as part of a tuning job, select one of the hyperparameter tuning jobs from the list. The tabs on the tuning job page allow you to inspect the training jobs, their definitions, the tags and configuration used for the tuning job, and the best training job found during tuning. You can select the best training job or any of the other training jobs that belong to the tuning job to see all of their settings. From here you can create a model that uses the hyperparameter values found by a training job by selecting **Create Model**, or you can clone the training job by selecting **Clone**.

**Cloning**  
You can save time by cloning a training job that belongs to a hyperparameter tuning job. Cloning copies all of the job’s settings, including data channels and S3 storage locations for output artifacts. You can do this for training jobs that you have already run from the tuning job page, as just described, or when you are creating additional training job definitions while creating a hyperparameter tuning job, as described in the [Add or clone a training job](multiple-algorithm-hpo-create-tuning-jobs.md#multiple-algorithm-hpo-add-training-job) step of that procedure. 

**Tagging**  
Automatic Model Tuning launches multiple training jobs within a single parent tuning job to discover the ideal combination of model hyperparameters. Tags can be added to the parent tuning job as described in the [Components of a tuning job](multiple-algorithm-hpo-create-tuning-jobs.md#multiple-algorithm-hpo-create-tuning-jobs-define-settings) section, and these tags are then propagated to the individual training jobs underneath. You can use these tags for purposes such as cost allocation or access control. To add tags using the SageMaker AI SDK, use the [AddTags](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AddTags.html) API. For more information about using tagging for AWS resources, see [Tagging AWS resources](https://docs.aws.amazon.com/general/latest/gr/aws_tagging.html).
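
As a sketch, tags are passed as a list of key-value pairs to the `AddTags` API. The resource ARN, tag keys, and values below are hypothetical.

```
# Hypothetical tags to assign to a parent tuning job; SageMaker AI propagates
# them to the individual training jobs that the tuning job launches.
tags = [
    {"Key": "team", "Value": "ml-research"},
    {"Key": "cost-center", "Value": "1234"},
]

# The call would then look like the following (the ARN is a placeholder):
# boto3.client("sagemaker").add_tags(
#     ResourceArn="arn:aws:sagemaker:us-east-1:123456789012:hyper-parameter-tuning-job/my-tuning-job",
#     Tags=tags,
# )
```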

# Example: Hyperparameter Tuning Job
<a name="automatic-model-tuning-ex"></a>

This example shows how to create a new notebook for configuring and launching a hyperparameter tuning job. The tuning job uses the [XGBoost algorithm with Amazon SageMaker AI](xgboost.md) to train a model to predict whether a customer will enroll for a term deposit at a bank after being contacted by phone.

You use the low-level SDK for Python (Boto3) to configure and launch the hyperparameter tuning job, and the AWS Management Console to monitor the status of hyperparameter tuning jobs. You can also use the Amazon SageMaker AI high-level [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) to configure, run, monitor, and analyze hyperparameter tuning jobs. For more information, see [https://github.com/aws/sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk).

## Prerequisites
<a name="automatic-model-tuning-ex-prereq"></a>

To run the code in this example, you need
+ [An AWS account and an administrator user](gs-set-up.md)
+ An Amazon S3 bucket for storing your training dataset and the model artifacts created during training
+ [A running SageMaker AI notebook instance](gs-setup-working-env.md)

**Topics**
+ [Prerequisites](#automatic-model-tuning-ex-prereq)
+ [Create a Notebook Instance](automatic-model-tuning-ex-notebook.md)
+ [Get the Amazon SageMaker AI Boto 3 Client](automatic-model-tuning-ex-client.md)
+ [Get the SageMaker AI Execution Role](automatic-model-tuning-ex-role.md)
+ [Use an Amazon S3 bucket for input and output](automatic-model-tuning-ex-bucket.md)
+ [Download, Prepare, and Upload Training Data](automatic-model-tuning-ex-data.md)
+ [Configure and Launch a Hyperparameter Tuning Job](automatic-model-tuning-ex-tuning-job.md)
+ [Clean up](automatic-model-tuning-ex-cleanup.md)

# Create a Notebook Instance
<a name="automatic-model-tuning-ex-notebook"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

Create a Jupyter notebook that contains a pre-installed environment with the default Anaconda installation and Python 3. 

**To create a Jupyter notebook**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Open a running notebook instance by choosing **Open** next to its name. The Jupyter notebook server page appears:

     
![\[Example Jupyter notebook server page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/notebook-dashboard.png)

1. To create a notebook, choose **Files**, **New**, and **conda\_python3**.

1. Name the notebook.

## Next Step
<a name="automatic-model-tuning-ex-next-client"></a>

[Get the Amazon SageMaker AI Boto 3 Client](automatic-model-tuning-ex-client.md)

# Get the Amazon SageMaker AI Boto 3 Client
<a name="automatic-model-tuning-ex-client"></a>

Import the Amazon SageMaker Python SDK, the AWS SDK for Python (Boto3), and other Python libraries. In a new Jupyter notebook, paste the following code into the first cell:

```
import sagemaker
import boto3

import numpy as np                                # For performing matrix operations and numerical processing
import pandas as pd                               # For manipulating tabular data
from time import gmtime, strftime
import os

region = boto3.Session().region_name
smclient = boto3.Session().client('sagemaker')
```

The preceding code cell defines `region` and `smclient` objects that you will use to call the built-in XGBoost algorithm and set the SageMaker AI hyperparameter tuning job.

## Next Step
<a name="automatic-model-tuning-ex-next-role"></a>

[Get the SageMaker AI Execution Role](automatic-model-tuning-ex-role.md)

# Get the SageMaker AI Execution Role
<a name="automatic-model-tuning-ex-role"></a>

Get the execution role for the notebook instance. This is the IAM role that you created for your notebook instance.

To find the ARN of the IAM execution role attached to a notebook instance:

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Notebook**, then **Notebook instances**.

1. From the list of notebooks, select the notebook that you want to view.

1. The ARN is in the **Permissions and encryption** section.

Alternatively, [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) users can retrieve the ARN of the execution role attached to their user profile or a notebook instance by running the following code:

```
from sagemaker import get_execution_role

role = get_execution_role()
print(role)
```

For more information about using `get_execution_role` in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable), see [Session](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html). For more information about roles, see [How to use SageMaker AI execution roles](sagemaker-roles.md).

## Next Step
<a name="automatic-model-tuning-ex-next-bucket"></a>

[Use an Amazon S3 bucket for input and output](automatic-model-tuning-ex-bucket.md)

# Use an Amazon S3 bucket for input and output
<a name="automatic-model-tuning-ex-bucket"></a>

Set up an S3 bucket to upload training datasets and save training output data for your hyperparameter tuning job.

**To use a default S3 bucket**

Use the following code to specify the default S3 bucket allocated for your SageMaker AI session. `prefix` is the path within the bucket where SageMaker AI stores the data for the current training job.

```
sess = sagemaker.Session()
bucket = sess.default_bucket() # Set a default S3 bucket
prefix = 'DEMO-automatic-model-tuning-xgboost-dm'
```

**To use a specific S3 bucket (Optional)**

If you want to use a specific S3 bucket, use the following code and replace the string with the exact name of the S3 bucket. The name of the bucket must contain **sagemaker** and be globally unique. The bucket must be in the same AWS Region as the notebook instance that you use for this example.

```
bucket = "sagemaker-your-preferred-s3-bucket"

sess = sagemaker.Session(
    default_bucket = bucket
)
```

**Note**  
The name of the bucket doesn't need to contain **sagemaker** if the IAM role that you use to run the hyperparameter tuning job has a policy that grants the `AmazonS3FullAccess` permission.

## Next Step
<a name="automatic-model-tuning-ex-next-data"></a>

[Download, Prepare, and Upload Training Data](automatic-model-tuning-ex-data.md)

# Download, Prepare, and Upload Training Data
<a name="automatic-model-tuning-ex-data"></a>

For this example, you use a training dataset of information about bank customers that includes the customer's job, marital status, and how they were contacted during the bank's direct marketing campaign. To use a dataset for a hyperparameter tuning job, you download it, transform the data, and then upload it to an Amazon S3 bucket.

For more information about the dataset and the data transformation that the example performs, see the *hpo\_xgboost\_direct\_marketing\_sagemaker\_APIs* notebook in the **Hyperparameter Tuning** section of the **SageMaker AI Examples** tab in your notebook instance.

## Download and Explore the Training Dataset
<a name="automatic-model-tuning-ex-data-download"></a>

To download and explore the dataset, run the following code in your notebook:

```
!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip
!unzip -o bank-additional.zip
data = pd.read_csv('./bank-additional/bank-additional-full.csv', sep=';')
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 5)         # Keep the output on one page
data
```

## Prepare and Upload Data
<a name="automatic-model-tuning-ex-data-transform"></a>

Before creating the hyperparameter tuning job, prepare the data and upload it to an S3 bucket where the hyperparameter tuning job can access it.

Run the following code in your notebook:

```
data['no_previous_contact'] = np.where(data['pdays'] == 999, 1, 0)                                 # Indicator variable to capture when pdays takes a value of 999
data['not_working'] = np.where(np.in1d(data['job'], ['student', 'retired', 'unemployed']), 1, 0)   # Indicator for individuals not actively employed
model_data = pd.get_dummies(data)                                                                  # Convert categorical variables to sets of indicators
model_data
model_data = model_data.drop(['duration', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed'], axis=1)

train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.9*len(model_data))])

pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([validation_data['y_yes'], validation_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)
pd.concat([test_data['y_yes'], test_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('test.csv', index=False, header=False)

boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')
```
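The `np.split` call above shuffles the rows (via `sample(frac=1)`) and then cuts them at the 70% and 90% marks, producing a 70/20/10 train/validation/test split. A minimal sketch of that logic, using a toy NumPy array standing in for the shuffled rows:

```python
import numpy as np

rng = np.random.default_rng(1729)
toy = rng.permutation(100)  # a shuffled stand-in for the dataset's rows

# Cut points at 70% and 90% of the data yield three pieces: 70/20/10.
train, validation, test = np.split(toy, [int(0.7 * len(toy)), int(0.9 * len(toy))])
print(len(train), len(validation), len(test))  # 70 20 10
```

The same cut points applied to `model_data` give the training, validation, and test frames used in the rest of the example.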

## Next Step
<a name="automatic-model-tuning-ex-next-tuning-job"></a>

[Configure and Launch a Hyperparameter Tuning Job](automatic-model-tuning-ex-tuning-job.md)

# Configure and Launch a Hyperparameter Tuning Job
<a name="automatic-model-tuning-ex-tuning-job"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

A hyperparameter is a high-level parameter that influences the learning process during model training. To get the best model predictions, you can optimize a hyperparameter configuration or set hyperparameter values. The process of finding an optimal configuration is called hyperparameter tuning. To configure and launch a hyperparameter tuning job, complete the steps in these guides.

**Topics**
+ [Settings for the hyperparameter tuning job](#automatic-model-tuning-ex-low-tuning-config)
+ [Configure the training jobs](#automatic-model-tuning-ex-low-training-def)
+ [Name and launch the hyperparameter tuning job](#automatic-model-tuning-ex-low-launch)
+ [Monitor the Progress of a Hyperparameter Tuning Job](automatic-model-tuning-monitor.md)
+ [View the Status of the Training Jobs](#automatic-model-tuning-monitor-training)
+ [View the Best Training Job](#automatic-model-tuning-best-training-job)

## Settings for the hyperparameter tuning job
<a name="automatic-model-tuning-ex-low-tuning-config"></a>

To specify settings for the hyperparameter tuning job, define a JSON object when you create the tuning job. Pass this JSON object as the value of the `HyperParameterTuningJobConfig` parameter to the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html) API.

In this JSON object, you specify the following:
+ `HyperParameterTuningJobObjective` – The objective metric used to evaluate the performance of the training jobs launched by the hyperparameter tuning job.
+ `ParameterRanges` – The range of values that a tunable hyperparameter can use during optimization. For more information, see [Define Hyperparameter Ranges](automatic-model-tuning-define-ranges.md).
+ `RandomSeed` – A value used to initialize a pseudo-random number generator. Setting a random seed allows the hyperparameter tuning search strategies to produce more consistent configurations for the same tuning job (optional).
+ `ResourceLimits` – The maximum number of training jobs and parallel training jobs that the hyperparameter tuning job can use.

**Note**  
If you use your own algorithm for hyperparameter tuning, rather than a SageMaker AI [built-in algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html), you must define metrics for your algorithm. For more information, see [Define metrics](automatic-model-tuning-define-metrics-variables.md#automatic-model-tuning-define-metrics).

The following code example shows how to configure a hyperparameter tuning job using the built-in [XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html). The code example shows how to define ranges for the `eta`, `alpha`, `min_child_weight`, and `max_depth` hyperparameters. For more information about these and other hyperparameters, see [XGBoost Parameters](https://xgboost.readthedocs.io/en/release_1.2.0/parameter.html). 

In this code example, the objective metric for the hyperparameter tuning job finds the hyperparameter configuration that maximizes `validation:auc`. SageMaker AI built-in algorithms automatically write the objective metric to CloudWatch Logs. The following code example also shows how to set a `RandomSeed`. 

```
tuning_job_config = {
    "ParameterRanges": {
      "CategoricalParameterRanges": [],
      "ContinuousParameterRanges": [
        {
          "MaxValue": "1",
          "MinValue": "0",
          "Name": "eta"
        },
        {
          "MaxValue": "2",
          "MinValue": "0",
          "Name": "alpha"
        },
        {
          "MaxValue": "10",
          "MinValue": "1",
          "Name": "min_child_weight"
        }
      ],
      "IntegerParameterRanges": [
        {
          "MaxValue": "10",
          "MinValue": "1",
          "Name": "max_depth"
        }
      ]
    },
    "ResourceLimits": {
      "MaxNumberOfTrainingJobs": 20,
      "MaxParallelTrainingJobs": 3
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "MetricName": "validation:auc",
      "Type": "Maximize"
    },
    "RandomSeed" : 123
  }
```

## Configure the training jobs
<a name="automatic-model-tuning-ex-low-training-def"></a>

The hyperparameter tuning job launches training jobs to find an optimal configuration of hyperparameters. Configure these training jobs using the SageMaker AI [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html) API. 

To configure the training jobs, define a JSON object and pass it as the value of the `TrainingJobDefinition` parameter inside `CreateHyperParameterTuningJob`.

In this JSON object, you can specify the following: 
+ `AlgorithmSpecification` – The [registry path](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html) of the Docker image containing the training algorithm and related metadata. To specify an algorithm, you can use your own [custom built algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms.html) inside a [Docker](https://docs.docker.com/get-started/overview/) container or a [SageMaker AI built-in algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html) (required).
+ `InputDataConfig` – The input configuration, including the `ChannelName`, `ContentType`, and data source for your training and test data (required).
+ `OutputDataConfig` – The storage location for the algorithm's output. Specify the S3 bucket where you want to store the output of the training jobs (required).
+ `RoleArn` – The [Amazon Resource Name](https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html) (ARN) of an AWS Identity and Access Management (IAM) role that SageMaker AI uses to perform tasks. Tasks include reading input data, downloading a Docker image, writing model artifacts to an S3 bucket, writing logs to Amazon CloudWatch Logs, and writing metrics to Amazon CloudWatch (required).
+ `StoppingCondition` – The maximum runtime in seconds that a training job can run before being stopped. This value should be greater than the time needed to train your model (required).
+ `MetricDefinitions` – The name and regular expression that defines any metrics that the training jobs emit. Define metrics only when you use a custom training algorithm. The example in the following code uses a built-in algorithm, which already has metrics defined. For information about defining metrics (optional), see [Define metrics](automatic-model-tuning-define-metrics-variables.md#automatic-model-tuning-define-metrics).
+ `TrainingImage` – The [Docker](https://docs.docker.com/get-started/overview/) container image that specifies the training algorithm (optional).
+ `StaticHyperParameters` – The name and values of hyperparameters that are not tuned in the tuning job (optional).

The following code example sets static values for the `eval_metric`, `num_round`, `objective`, `rate_drop`, and `tweedie_variance_power` parameters of the [XGBoost algorithm with Amazon SageMaker AI](xgboost.md) built-in algorithm.

------
#### [ SageMaker Python SDK v1 ]

```
from sagemaker.amazon.amazon_estimator import get_image_uri
training_image = get_image_uri(region, 'xgboost', repo_version='1.0-1')

s3_input_train = 's3://{}/{}/train'.format(bucket, prefix)
s3_input_validation ='s3://{}/{}/validation/'.format(bucket, prefix)

training_job_definition = {
    "AlgorithmSpecification": {
      "TrainingImage": training_image,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
      {
        "ChannelName": "train",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_train
          }
        }
      },
      {
        "ChannelName": "validation",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_validation
          }
        }
      }
    ],
    "OutputDataConfig": {
      "S3OutputPath": "s3://{}/{}/output".format(bucket,prefix)
    },
    "ResourceConfig": {
      "InstanceCount": 2,
      "InstanceType": "ml.c4.2xlarge",
      "VolumeSizeInGB": 10
    },
    "RoleArn": role,
    "StaticHyperParameters": {
      "eval_metric": "auc",
      "num_round": "100",
      "objective": "binary:logistic",
      "rate_drop": "0.3",
      "tweedie_variance_power": "1.4"
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 43200
    }
}
```

------
#### [ SageMaker Python SDK v2 ]

```
training_image = sagemaker.image_uris.retrieve('xgboost', region, '1.0-1')

s3_input_train = 's3://{}/{}/train'.format(bucket, prefix)
s3_input_validation ='s3://{}/{}/validation/'.format(bucket, prefix)

training_job_definition = {
    "AlgorithmSpecification": {
      "TrainingImage": training_image,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
      {
        "ChannelName": "train",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_train
          }
        }
      },
      {
        "ChannelName": "validation",
        "CompressionType": "None",
        "ContentType": "csv",
        "DataSource": {
          "S3DataSource": {
            "S3DataDistributionType": "FullyReplicated",
            "S3DataType": "S3Prefix",
            "S3Uri": s3_input_validation
          }
        }
      }
    ],
    "OutputDataConfig": {
      "S3OutputPath": "s3://{}/{}/output".format(bucket,prefix)
    },
    "ResourceConfig": {
      "InstanceCount": 2,
      "InstanceType": "ml.c4.2xlarge",
      "VolumeSizeInGB": 10
    },
    "RoleArn": role,
    "StaticHyperParameters": {
      "eval_metric": "auc",
      "num_round": "100",
      "objective": "binary:logistic",
      "rate_drop": "0.3",
      "tweedie_variance_power": "1.4"
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 43200
    }
}
```

------

## Name and launch the hyperparameter tuning job
<a name="automatic-model-tuning-ex-low-launch"></a>

After you configure the hyperparameter tuning job, you can launch it by calling the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html) API. The following code example uses the `tuning_job_config` and `training_job_definition` objects that were defined in the previous two code examples to create a hyperparameter tuning job.

```
tuning_job_name = "MyTuningJob"
smclient.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name,
                                           HyperParameterTuningJobConfig = tuning_job_config,
                                           TrainingJobDefinition = training_job_definition)
```
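Besides the console, you can also check progress programmatically with the `DescribeHyperParameterTuningJob` API, which returns an overall status plus per-state training job counters. The following is a hedged sketch: the helper function and the hand-built response fragment are illustrative, not part of the SDK.

```python
def summarize_tuning_job(desc):
    """Pull the overall status and selected training job counts
    from a DescribeHyperParameterTuningJob response."""
    counters = desc.get("TrainingJobStatusCounters", {})
    return {
        "status": desc["HyperParameterTuningJobStatus"],
        "completed": counters.get("Completed", 0),
        "in_progress": counters.get("InProgress", 0),
    }

# With real credentials, you would fetch the response like this:
# desc = smclient.describe_hyper_parameter_tuning_job(
#     HyperParameterTuningJobName=tuning_job_name
# )
# print(summarize_tuning_job(desc))

# Illustration with a hand-built response fragment:
sample = {
    "HyperParameterTuningJobStatus": "InProgress",
    "TrainingJobStatusCounters": {"Completed": 5, "InProgress": 3},
}
print(summarize_tuning_job(sample))
```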

# Monitor the Progress of a Hyperparameter Tuning Job
<a name="automatic-model-tuning-monitor"></a>

To monitor the progress of a hyperparameter tuning job and the training jobs that it launches, use the Amazon SageMaker AI console.

**Topics**
+ [View the Status of the Hyperparameter Tuning Job](#automatic-model-tuning-monitor-tuning)

## View the Status of the Hyperparameter Tuning Job
<a name="automatic-model-tuning-monitor-tuning"></a>

**To view the status of the hyperparameter tuning job**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **Hyperparameter tuning jobs**.  
![\[Hyperparameter tuning job console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/console-tuning-jobs.png)

1. In the list of hyperparameter tuning jobs, check the status of the hyperparameter tuning job you launched. A tuning job can be:
   + `Completed`—The hyperparameter tuning job successfully completed.
   + `InProgress`—The hyperparameter tuning job is in progress. One or more training jobs are still running.
   + `Failed`—The hyperparameter tuning job failed.
   + `Stopped`—The hyperparameter tuning job was manually stopped before it completed. All training jobs that the hyperparameter tuning job launched are stopped.
   + `Stopping`—The hyperparameter tuning job is in the process of stopping.

## View the Status of the Training Jobs
<a name="automatic-model-tuning-monitor-training"></a>

**To view the status of the training jobs that the hyperparameter tuning job launched**

1. In the list of hyperparameter tuning jobs, choose the job that you launched.

1. Choose **Training jobs**.  
![\[Location of Training jobs in the .\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperparameter-training-jobs.png)

1. View the status of each training job. To see more details about a job, choose it in the list of training jobs. To view a summary of the status of all of the training jobs that the hyperparameter tuning job launched, see **Training job status counter**.

   A training job can be:
   + `Completed`—The training job successfully completed.
   + `InProgress`—The training job is in progress.
   + `Stopped`—The training job was manually stopped before it completed.
   + `Failed (Retryable)`—The training job failed, but can be retried. A failed training job can be retried only if it failed because an internal service error occurred.
   + `Failed (Non-retryable)`—The training job failed and can't be retried. A failed training job can't be retried when a client error occurs.
**Note**  
Hyperparameter tuning jobs can be stopped and the underlying resources [deleted](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-ex-cleanup.html), but the jobs themselves cannot be deleted.

## View the Best Training Job
<a name="automatic-model-tuning-best-training-job"></a>

A hyperparameter tuning job uses the objective metric that each training job returns to evaluate training jobs. While the hyperparameter tuning job is in progress, the best training job is the one that has returned the best objective metric so far. After the hyperparameter tuning job is complete, the best training job is the one that returned the best objective metric.

To view the best training job, choose **Best training job**.

![\[Location of Best training job in the hyperparameter tuning job console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/best-training-job.png)


To deploy the best training job as a model that you can host at a SageMaker AI endpoint, choose **Create model**.

### Next Step
<a name="automatic-model-tuning-ex-next-cleanup"></a>

[Clean up](automatic-model-tuning-ex-cleanup.md)

# Clean up
<a name="automatic-model-tuning-ex-cleanup"></a>

To avoid incurring unnecessary charges, when you are done with the example, use the AWS Management Console to delete the resources that you created for it. 

**Note**  
If you plan to explore other examples, you might want to keep some of these resources, such as your notebook instance, S3 bucket, and IAM role.

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/) and delete the notebook instance. Stop the instance before deleting it.

1. Open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/) and delete the bucket that you created to store model artifacts and the training dataset. 

1. Open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/) and delete the IAM role. If you created permission policies, you can delete them, too.

1. Open the Amazon CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/) and delete all of the log groups that have names starting with `/aws/sagemaker/`.

# Stop Training Jobs Early
<a name="automatic-model-tuning-early-stopping"></a>

You can stop the training jobs that a hyperparameter tuning job launches early when they are not improving significantly, as measured by the objective metric. Stopping training jobs early can reduce compute time and helps you avoid overfitting your model. To configure a hyperparameter tuning job to stop training jobs early, do one of the following:
+ If you are using the AWS SDK for Python (Boto3), set the `TrainingJobEarlyStoppingType` field of the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTuningJobConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTuningJobConfig.html) object that you use to configure the tuning job to `AUTO`.
+ If you are using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable), set the `early_stopping_type` parameter of the [HyperParameterTuner](https://sagemaker.readthedocs.io/en/stable/tuner.html) object to `Auto`.
+ In the Amazon SageMaker AI console, in the **Create hyperparameter tuning job** workflow, under **Early stopping**, choose **Auto**.

For a sample notebook that demonstrates how to use early stopping, see [https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter\_tuning/image\_classification\_early\_stopping/hpo\_image\_classification\_early\_stopping.ipynb](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/image_classification_early_stopping/hpo_image_classification_early_stopping.ipynb) or open the `hpo_image_classification_early_stopping.ipynb` notebook in the **Hyperparameter Tuning** section of the **SageMaker AI Examples** in a notebook instance.

## How Early Stopping Works
<a name="automatic-tuning-early-stop-how"></a>

When you enable early stopping for a hyperparameter tuning job, SageMaker AI evaluates each training job the hyperparameter tuning job launches as follows:
+ After each epoch of training, get the value of the objective metric.
+ Compute the running average of the objective metric for all previous training jobs up to the same epoch, and then compute the median of all of the running averages.
+ If the value of the objective metric for the current training job is worse (higher when minimizing or lower when maximizing the objective metric) than the median value of running averages of the objective metric for previous training jobs up to the same epoch, SageMaker AI stops the current training job.
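The rule above can be sketched numerically. The following is an illustrative implementation of the median-stopping comparison, not SageMaker AI's internal code: for a maximized metric, the current job stops if its value at an epoch falls below the median of the running averages of earlier jobs up to that epoch.

```python
import statistics

def should_stop(current_value, previous_histories, epoch, maximize=True):
    """Median-stopping sketch: stop if the current job's objective metric at
    `epoch` is worse than the median of the running averages of previous
    jobs' metrics up to the same epoch."""
    running_averages = [
        sum(history[: epoch + 1]) / (epoch + 1)
        for history in previous_histories
        if len(history) > epoch
    ]
    if not running_averages:
        return False  # nothing to compare against yet
    median = statistics.median(running_averages)
    return current_value < median if maximize else current_value > median

# Three earlier jobs' per-epoch validation AUC histories (hypothetical values):
histories = [[0.60, 0.70, 0.75], [0.55, 0.65, 0.72], [0.62, 0.68, 0.74]]
# Running averages at epoch 1 are 0.65, 0.60, 0.65, so the median is 0.65.
print(should_stop(0.58, histories, epoch=1))  # True: 0.58 lags the 0.65 median
```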

## Algorithms That Support Early Stopping
<a name="automatic-tuning-early-stopping-algos"></a>

To support early stopping, an algorithm must emit objective metrics for each epoch. The following built-in SageMaker AI algorithms support early stopping:
+ [LightGBM](lightgbm.md)
+ [CatBoost](catboost.md)
+ [AutoGluon-Tabular](autogluon-tabular.md)
+ [TabTransformer](tabtransformer.md)
+ [Linear Learner Algorithm](linear-learner.md)—Supported only if you use `objective_loss` as the objective metric.
+ [XGBoost algorithm with Amazon SageMaker AI](xgboost.md)
+ [Image Classification - MXNet](image-classification.md)
+ [Object Detection - MXNet](object-detection.md)
+ [Sequence-to-Sequence Algorithm](seq-2-seq.md)
+ [IP Insights](ip-insights.md)

**Note**  
This list of built-in algorithms that support early stopping is current as of December 13, 2018. Other built-in algorithms might support early stopping in the future. If an algorithm emits a metric that can be used as an objective metric for a hyperparameter tuning job (preferably a validation metric), then it supports early stopping.

To use early stopping with your own algorithm, you must write your algorithm so that it emits the value of the objective metric after each epoch. The following list shows how you can do that in different frameworks:

TensorFlow  
Use the `tf.keras.callbacks.ProgbarLogger` class. For information, see the [tf.keras.callbacks.ProgbarLogger API](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ProgbarLogger).

MXNet  
Use the `mxnet.callback.LogValidationMetricsCallback`. For information, see the [mxnet.callback APIs](https://mxnet.apache.org/versions/master/api/python/docs/api/legacy/callback/index.html).

Chainer  
Extend chainer by using the `extensions.Evaluator` class. For information, see the [chainer.training.extensions.Evaluator API](https://docs.chainer.org/en/v1.24.0/reference/extensions.html#evaluator).

PyTorch and Spark  
There is no high-level support. You must explicitly write your training code so that it computes objective metrics and writes them to logs after each epoch.

# Run a Warm Start Hyperparameter Tuning Job
<a name="automatic-model-tuning-warm-start"></a>

Use warm start to start a hyperparameter tuning job using one or more previous tuning jobs as a starting point. The results of previous tuning jobs are used to inform which combinations of hyperparameters to search over in the new tuning job. Hyperparameter tuning uses either Bayesian or random search to choose combinations of hyperparameter values from ranges that you specify. For more information, see [Understand the hyperparameter tuning strategies available in Amazon SageMaker AI](automatic-model-tuning-how-it-works.md). Using information from previous hyperparameter tuning jobs can help increase the performance of the new hyperparameter tuning job by making the search for the best combination of hyperparameters more efficient.

**Note**  
Warm start tuning jobs typically take longer to start than standard hyperparameter tuning jobs, because the results from the parent jobs have to be loaded before the job can start. The increased time depends on the total number of training jobs launched by the parent jobs.

Reasons to consider warm start include the following:
+ To gradually increase the number of training jobs over several tuning jobs based on results after each iteration.
+ To tune a model using new data that you received.
+ To change hyperparameter ranges that you used in a previous tuning job, change static hyperparameters to tunable, or change tunable hyperparameters to static values.
+ To resume a previous hyperparameter tuning job that you stopped early or that stopped unexpectedly.

**Topics**
+ [Types of Warm Start Tuning Jobs](#tuning-warm-start-types)
+ [Warm Start Tuning Restrictions](#warm-start-tuning-restrictions)
+ [Warm Start Tuning Sample Notebook](#warm-start-tuning-sample-notebooks)
+ [Create a Warm Start Tuning Job](#warm-start-tuning-example)

## Types of Warm Start Tuning Jobs
<a name="tuning-warm-start-types"></a>

There are two different types of warm start tuning jobs:

`IDENTICAL_DATA_AND_ALGORITHM`  
The new hyperparameter tuning job uses the same input data and training image as the parent tuning jobs. You can change the hyperparameter ranges to search and the maximum number of training jobs that the hyperparameter tuning job launches. You can also change hyperparameters from tunable to static, and from static to tunable, but the total number of static plus tunable hyperparameters must remain the same as it is in all parent jobs. You cannot use a new version of the training algorithm, unless the changes in the new version do not affect the algorithm itself. For example, changes that improve logging or that add support for a different data format are allowed.  
Use identical data and algorithm when you use the same training data as you used in a previous hyperparameter tuning job, but you want to increase the total number of training jobs or change ranges or values of hyperparameters.  
When you run a warm start tuning job of type `IDENTICAL_DATA_AND_ALGORITHM`, the response to [DescribeHyperParameterTuningJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeHyperParameterTuningJob.html) contains an additional field named `OverallBestTrainingJob`. The value of this field is the [TrainingJobSummary](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TrainingJobSummary.html) for the training job with the best objective metric value of all training jobs launched by this tuning job and all parent jobs specified for the warm start tuning job.
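As a sketch, the following shows how you might read `OverallBestTrainingJob` from a `DescribeHyperParameterTuningJob` response. The response dict here is a trimmed, made-up example with hypothetical job names, not real API output; the field names follow the API reference.

```python
# Trimmed, illustrative response shape for DescribeHyperParameterTuningJob
# when WarmStartType is IdenticalDataAndAlgorithm. The field names follow
# the API reference; the job names and metric values here are made up.
sample_response = {
    "HyperParameterTuningJobName": "MyWarmStartTuningJob",
    "BestTrainingJob": {
        "TrainingJobName": "MyWarmStartTuningJob-007-aaaa",
        "FinalHyperParameterTuningJobObjectiveMetric": {
            "MetricName": "validation:accuracy", "Value": 0.91},
    },
    "OverallBestTrainingJob": {
        "TrainingJobName": "MyParentTuningJob-042-bbbb",
        "FinalHyperParameterTuningJobObjectiveMetric": {
            "MetricName": "validation:accuracy", "Value": 0.93},
    },
}

def overall_best(response):
    """Return (name, objective value) of the best training job across this
    tuning job and all of its parents, falling back to BestTrainingJob when
    OverallBestTrainingJob is absent."""
    summary = response.get("OverallBestTrainingJob") or response["BestTrainingJob"]
    metric = summary["FinalHyperParameterTuningJobObjectiveMetric"]
    return summary["TrainingJobName"], metric["Value"]

name, value = overall_best(sample_response)
```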

`TRANSFER_LEARNING`  
The new hyperparameter tuning job can include input data, hyperparameter ranges, maximum number of concurrent training jobs, and maximum number of training jobs that are different than those of its parent hyperparameter tuning jobs. You can also change hyperparameters from tunable to static, and from static to tunable, but the total number of static plus tunable hyperparameters must remain the same as it is in all parent jobs. The training algorithm image can also be a different version from the version used in the parent hyperparameter tuning job. When you use transfer learning, changes in the dataset or the algorithm that significantly affect the value of the objective metric might reduce the usefulness of using warm start tuning.

## Warm Start Tuning Restrictions
<a name="warm-start-tuning-restrictions"></a>

The following restrictions apply to all warm start tuning jobs:
+ A tuning job can have a maximum of 5 parent jobs, and all parent jobs must be in a terminal state (`Completed`, `Stopped`, or `Failed`) before you start the new tuning job.
+ The objective metric used in the new tuning job must be the same as the objective metric used in the parent jobs.
+ The total number of static plus tunable hyperparameters must remain the same between parent jobs and the new tuning job. Because of this, if you think you might want to use a hyperparameter as tunable in a future warm start tuning job, you should add it as a static hyperparameter when you create a tuning job.
+ The type of each hyperparameter (continuous, integer, categorical) must not change between parent jobs and the new tuning job.
+ The number of total changes from tunable hyperparameters in the parent jobs to static hyperparameters in the new tuning job, plus the number of changes in the values of static hyperparameters, cannot be more than 10. For example, if the parent job has a tunable categorical hyperparameter with the possible values `red` and `blue`, and you change that hyperparameter to static in the new tuning job, that counts as 2 changes against the allowed total of 10. If the same hyperparameter had a static value of `red` in the parent job, and you change the static value to `blue` in the new tuning job, it also counts as 2 changes.
+ Warm start tuning is not recursive. For example, if you create `MyTuningJob3` as a warm start tuning job with `MyTuningJob2` as a parent job, and `MyTuningJob2` is itself a warm start tuning job with a parent job `MyTuningJob1`, the information that was learned when running `MyTuningJob1` is not used for `MyTuningJob3`. If you want to use the information from `MyTuningJob1`, you must explicitly add it as a parent for `MyTuningJob3`.
+ The training jobs launched by every parent job in a warm start tuning job count against the 500 maximum training jobs for a tuning job.
+ Hyperparameter tuning jobs created before October 1, 2018 cannot be used as parent jobs for warm start tuning jobs.
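To illustrate the change-counting rule, here is a hypothetical helper (not an AWS utility) that tallies changes under one plausible reading of the examples in the restrictions above: moving a tunable categorical hyperparameter with N values to static counts as N changes, and changing a static value counts as two.

```python
def count_warm_start_changes(parent, new):
    """Hypothetical helper tallying changes against the limit of 10, under
    one plausible reading of the rule. Each dict maps a hyperparameter name
    either to a static value (a str) or to a set of tunable categorical
    values."""
    changes = 0
    for name, parent_spec in parent.items():
        new_spec = new[name]
        parent_tunable = isinstance(parent_spec, set)
        new_tunable = isinstance(new_spec, set)
        if parent_tunable and not new_tunable:
            # Tunable -> static: one change per value the parent could take.
            changes += len(parent_spec)
        elif not parent_tunable and not new_tunable and parent_spec != new_spec:
            # Static value changed: old value plus new value, two changes.
            changes += 2
    return changes

# The two examples from the restriction above, each counting as 2 changes:
tunable_to_static = count_warm_start_changes(
    {"color": {"red", "blue"}}, {"color": "red"})
static_value_change = count_warm_start_changes(
    {"color": "red"}, {"color": "blue"})
```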

## Warm Start Tuning Sample Notebook
<a name="warm-start-tuning-sample-notebooks"></a>

For a sample notebook that shows how to use warm start tuning, see [hpo_image_classification_warmstart.ipynb](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/image_classification_warmstart/hpo_image_classification_warmstart.ipynb).

## Create a Warm Start Tuning Job
<a name="warm-start-tuning-example"></a>

You can use either the low-level AWS SDK for Python (Boto 3) or the high-level SageMaker AI Python SDK to create a warm start tuning job.

**Topics**
+ [Create a Warm Start Tuning Job (Low-level SageMaker AI API for Python (Boto 3))](#warm-start-tuning-example-boto)
+ [Create a Warm Start Tuning Job (SageMaker AI Python SDK)](#warm-start-tuning-example-sdk)

### Create a Warm Start Tuning Job (Low-level SageMaker AI API for Python (Boto 3))
<a name="warm-start-tuning-example-boto"></a>

To use warm start tuning, you specify the values of a [HyperParameterTuningJobWarmStartConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTuningJobWarmStartConfig.html) object, and pass that as the `WarmStartConfig` field in a call to [CreateHyperParameterTuningJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html).

The following code shows how to create a [HyperParameterTuningJobWarmStartConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_HyperParameterTuningJobWarmStartConfig.html) object and pass it in a call to [CreateHyperParameterTuningJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html) by using the low-level SageMaker AI API for Python (Boto 3).

Create the `HyperParameterTuningJobWarmStartConfig` object:

```
warm_start_config = {
          "ParentHyperParameterTuningJobs" : [
          {"HyperParameterTuningJobName" : 'MyParentTuningJob'}
          ],
          "WarmStartType" : "IdenticalDataAndAlgorithm"
}
```

Create the warm start tuning job:

```
import boto3

smclient = boto3.Session().client('sagemaker')
smclient.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = 'MyWarmStartTuningJob',
   HyperParameterTuningJobConfig = tuning_job_config, # See notebook for tuning configuration
   TrainingJobDefinition = training_job_definition, # See notebook for job definition
   WarmStartConfig = warm_start_config)
```

### Create a Warm Start Tuning Job (SageMaker AI Python SDK)
<a name="warm-start-tuning-example-sdk"></a>

To use the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) to run a warm start tuning job, you:
+ Specify the parent jobs and the warm start type by using a `WarmStartConfig` object.
+ Pass the `WarmStartConfig` object as the value of the `warm_start_config` argument of a [HyperparameterTuner](https://sagemaker.readthedocs.io/en/stable/tuner.html) object.
+ Call the `fit` method of the `HyperparameterTuner` object.

For more information about using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) for hyperparameter tuning, see [SageMaker Automatic Model Tuning](https://github.com/aws/sagemaker-python-sdk#sagemaker-automatic-model-tuning).

This example uses an estimator that uses the [Image Classification - MXNet](image-classification.md) algorithm for training. The following code sets the hyperparameter ranges that the warm start tuning job searches within to find the best combination of values. For information about setting hyperparameter ranges, see [Define Hyperparameter Ranges](automatic-model-tuning-define-ranges.md).

```
from sagemaker.tuner import ContinuousParameter

hyperparameter_ranges = {'learning_rate': ContinuousParameter(0.0, 0.1),
                         'momentum': ContinuousParameter(0.0, 0.99)}
```

The following code configures the warm start tuning job by creating a `WarmStartConfig` object.

```
from sagemaker.tuner import WarmStartConfig, WarmStartTypes

parent_tuning_job_name = "MyParentTuningJob"
warm_start_config = WarmStartConfig(warm_start_type=WarmStartTypes.IDENTICAL_DATA_AND_ALGORITHM, parents={parent_tuning_job_name})
```

Now set the values for static hyperparameters, which are hyperparameters that keep the same value for every training job that the warm start tuning job launches. In the following code, `imageclassification` is an estimator that was created previously.

```
imageclassification.set_hyperparameters(num_layers=18,
                                        image_shape='3,224,224',
                                        num_classes=257,
                                        num_training_samples=15420,
                                        mini_batch_size=128,
                                        epochs=30,
                                        optimizer='sgd',
                                        top_k='2',
                                        precision_dtype='float32',
                                        augmentation_type='crop')
```

Now create the `HyperparameterTuner` object and pass the `WarmStartConfig` object that you previously created as the `warm_start_config` argument.

```
from sagemaker.tuner import HyperparameterTuner

tuner_warm_start = HyperparameterTuner(imageclassification,
                            'validation:accuracy',
                            hyperparameter_ranges,
                            objective_type='Maximize',
                            max_jobs=10,
                            max_parallel_jobs=2,
                            base_tuning_job_name='warmstart',
                            warm_start_config=warm_start_config)
```

Finally, call the `fit` method of the `HyperparameterTuner` object to launch the warm start tuning job.

```
tuner_warm_start.fit(
        {'train': s3_input_train, 'validation': s3_input_validation},
        include_cls_metadata=False)
```

# Resource Limits for Automatic Model Tuning
<a name="automatic-model-tuning-limits"></a>

SageMaker AI sets the following default limits for resources used by automatic model tuning:


| Resource | Regions | Default limits | Can be increased to | 
| --- | --- | --- | --- | 
|  Number of parallel (concurrent) hyperparameter tuning jobs  |  All  |  100  |  N/A  | 
|  Number of hyperparameters that can be searched \*  |  All  |  30  |  N/A  | 
|  Number of metrics defined per hyperparameter tuning job  |  All  |  20  |  N/A  | 
|  Number of parallel training jobs per hyperparameter tuning job  |  All  |  10  |  100  | 
|  [Bayesian optimization] Number of training jobs per hyperparameter tuning job  |  All  |  750  |  N/A  | 
|  [Random search] Number of training jobs per hyperparameter tuning job  |  All  |  750  |  10000  | 
|  [Hyperband] Number of training jobs per hyperparameter tuning job  |  All  |  750  |  N/A  | 
|  [Grid] Number of training jobs per hyperparameter tuning job, either specified explicitly or inferred from the search space  |  All  |  750  |  N/A  | 
|  Maximum run time for a hyperparameter tuning job  |  All  |  30 days  |  N/A  | 

\* Each categorical hyperparameter can have at most 30 different values.

## Resource limit example
<a name="automatic-model-tuning-limits-example"></a>

When you plan hyperparameter tuning jobs, you also have to take into account the limits on training resources. For information about the default resource limits for SageMaker AI training jobs, see [SageMaker AI Limits](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html#limits_sagemaker). Every training instance used by any of your concurrently running hyperparameter tuning jobs counts against the total number of training instances allowed. For example, suppose you run 10 concurrent hyperparameter tuning jobs, each of which runs 100 total training jobs and 20 concurrent training jobs, and each training job runs on one **ml.m4.xlarge** instance. The following limits apply: 
+ Number of concurrent hyperparameter tuning jobs: You don't need to increase the limit, because 10 tuning jobs is below the limit of 100.
+ Number of training jobs per hyperparameter tuning job: You don't need to increase the limit, because 100 training jobs is below the limit of 750.
+ Number of concurrent training jobs per hyperparameter tuning job: You need to request a limit increase to 20, because the default limit is 10.
+ SageMaker AI training **ml.m4.xlarge** instances: You need to request a limit increase to 200, because you have 10 hyperparameter tuning jobs, each of which is running 20 concurrent training jobs. The default limit is 20 instances.
+ SageMaker AI training total instance count: You need to request a limit increase to 200, because you have 10 hyperparameter tuning jobs, each of which is running 20 concurrent training jobs. The default limit is 20 instances.
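The arithmetic behind the two instance-count requests above is simply the peak number of concurrent training jobs times the instances per job:

```python
# Back-of-the-envelope check for the example above: 10 tuning jobs, each
# running up to 20 concurrent training jobs, each job on one ml.m4.xlarge.
tuning_jobs = 10
concurrent_training_jobs_per_tuning_job = 20
instances_per_training_job = 1

peak_instances = (tuning_jobs
                  * concurrent_training_jobs_per_tuning_job
                  * instances_per_training_job)
print(peak_instances)  # 200, hence the requested limit of 200 instances
```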

**To request a quota increase:**

1. Open the [AWS Support Center](https://console.aws.amazon.com/support/home#/) page, sign in if necessary, and then choose **Create case**. 

1. On the **Create case** page, choose **Service limit increase**.

1. On the **Case details** panel, select **SageMaker AI Automatic Model Tuning [Hyperparameter Optimization]** for the **Limit type**.

1. On the **Requests** panel for **Request 1**, select the **Region**, the resource **Limit** to increase, and the **New Limit value** that you are requesting. Select **Add another request** if you have additional requests for quota increases.  
![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hpo/hpo-quotas-service-linit-increase-request.PNG)

1. In the **Case description** panel, provide a description of your use case.

1. In the **Contact options** panel, select your preferred **Contact methods** (**Web**, **Chat** or **Phone**) and then choose **Submit**. 

# Best Practices for Hyperparameter Tuning
<a name="automatic-model-tuning-considerations"></a>

Hyperparameter optimization (HPO) is not a fully automated process. To improve optimization, follow these best practices for hyperparameter tuning.

**Topics**
+ [Choosing a tuning strategy](#automatic-model-tuning-strategy)
+ [Choosing the number of hyperparameters](#automatic-model-tuning-num-hyperparameters)
+ [Choosing hyperparameter ranges](#automatic-model-tuning-choosing-ranges)
+ [Using the correct scales for hyperparameters](#automatic-model-tuning-log-scales)
+ [Choosing the best number of parallel training jobs](#automatic-model-tuning-parallelism)
+ [Running training jobs on multiple instances](#automatic-model-tuning-distributed-metrics)
+ [Using a random seed to reproduce hyperparameter configurations](#automatic-model-tuning-random-seed)

## Choosing a tuning strategy
<a name="automatic-model-tuning-strategy"></a>

For large jobs, using the [Hyperband](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html#automatic-tuning-hyperband) tuning strategy can reduce computation time. Hyperband has an early stopping mechanism that stops under-performing jobs. Hyperband can also reallocate resources towards well-utilized hyperparameter configurations and run parallel jobs. For smaller training jobs with shorter runtimes, use either [random search](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html#automatic-tuning-random-search) or [Bayesian optimization](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html#automatic-tuning-bayesian-optimization.title). 

Use Bayesian optimization to make increasingly informed decisions about improving hyperparameter configurations in the next run. Bayesian optimization uses information gathered from prior runs to improve subsequent runs. Because of its sequential nature, Bayesian optimization cannot massively scale. 

Use random search to run a large number of parallel jobs. In random search, subsequent jobs do not depend on the results from prior jobs and can be run independently. Compared to other strategies, random search is able to run the largest number of parallel jobs. 

Use [grid search](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html#automatic-tuning-grid-search) to reproduce results of a tuning job, or if simplicity and transparency of the optimization algorithm are important. You can also use grid search to explore the entire hyperparameter search space evenly. Grid search methodically searches through every hyperparameter combination to find optimal hyperparameter values. Unlike grid search, Bayesian optimization, random search and Hyperband all draw hyperparameters randomly from the search space. Because grid search analyzes every combination of hyperparameters, optimal hyperparameter values will be identical between tuning jobs that use the same hyperparameters. 
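Because grid search runs every combination, the number of training jobs it needs is the product of the number of values per hyperparameter. The following standalone sketch (with made-up categorical values) shows how that count is computed:

```python
from itertools import product

# Grid search takes discrete (categorical) values; the number of training
# jobs it needs is the product of the value counts per hyperparameter.
# These example hyperparameters and values are illustrative only.
search_space = {
    "optimizer": ["sgd", "adam"],
    "mini_batch_size": ["64", "128", "256"],
    "num_layers": ["18", "34"],
}

# Enumerate every combination, exactly as grid search would.
grid = list(product(*search_space.values()))
print(len(grid))  # 2 * 3 * 2 = 12 training jobs
```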

## Choosing the number of hyperparameters
<a name="automatic-model-tuning-num-hyperparameters"></a>

During optimization, the computational complexity of a hyperparameter tuning job depends on the following:
+ The number of hyperparameters
+ The range of values that Amazon SageMaker AI has to search

Although you can simultaneously specify up to 30 hyperparameters, limiting your search to a smaller number can reduce computation time. Reducing computation time allows SageMaker AI to converge more quickly to an optimal hyperparameter configuration.

## Choosing hyperparameter ranges
<a name="automatic-model-tuning-choosing-ranges"></a>

The range of values that you choose to search can adversely affect hyperparameter optimization. For example, a range that covers every possible hyperparameter value can lead to large compute times and a model that doesn't generalize well to unseen data. If you know that using a subset of the largest possible range is appropriate for your use case, consider limiting the range to that subset.

## Using the correct scales for hyperparameters
<a name="automatic-model-tuning-log-scales"></a>

During hyperparameter tuning, SageMaker AI attempts to infer if your hyperparameters are log-scaled or linear-scaled. Initially, SageMaker AI assumes linear scaling for hyperparameters. If hyperparameters are log-scaled, choosing the correct scale will make your search more efficient. You can also select `Auto` for `ScalingType` in the [CreateHyperParameterTuningJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html) API if you want SageMaker AI to detect the scale for you.
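To see why the scale matters, compare linear-uniform and log-uniform sampling over a learning-rate range spanning several orders of magnitude. This is a standalone illustration, not SageMaker AI code:

```python
import math
import random

def sample_linear(low, high, rng):
    return rng.uniform(low, high)

def sample_log(low, high, rng):
    # Sample uniformly in log space, then map back to the original scale.
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
low, high = 1e-6, 1.0
linear = [sample_linear(low, high, rng) for _ in range(10000)]
logged = [sample_log(low, high, rng) for _ in range(10000)]

# On a linear scale almost no draws land below 1e-3, so small learning
# rates are barely explored; on a log scale about half of the draws do.
frac_linear = sum(x < 1e-3 for x in linear) / len(linear)
frac_log = sum(x < 1e-3 for x in logged) / len(logged)
```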

## Choosing the best number of parallel training jobs
<a name="automatic-model-tuning-parallelism"></a>

You can use the results of previous trials to improve the performance of subsequent trials. Choose the largest number of parallel jobs that would provide a meaningful incremental result that is also within your region and account compute constraints. Use the [MaxParallelTrainingJobs](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ResourceLimits.html#MaxParallelTrainingJobs) field to limit the number of training jobs that a hyperparameter tuning job can launch in parallel. For more information, see [Running multiple HPO jobs in parallel on Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/running-multiple-hpo-jobs-in-parallel-on-amazon-sagemaker).

## Running training jobs on multiple instances
<a name="automatic-model-tuning-distributed-metrics"></a>

When a training job runs on multiple machines in distributed mode, each machine emits an objective metric. HPO can use only one of these emitted objective metrics to evaluate model performance. In distributed mode, HPO uses the objective metric that was reported by the last running job across all instances. 

## Using a random seed to reproduce hyperparameter configurations
<a name="automatic-model-tuning-random-seed"></a>

You can specify an integer as a random seed for hyperparameter tuning and use that seed during hyperparameter generation. Later, you can use the same seed to reproduce hyperparameter configurations that are consistent with your previous results. For random search and Hyperband strategies, using the same random seed can provide up to 100% reproducibility of the previous hyperparameter configuration for the same tuning job. For Bayesian strategy, using the same random seed will improve reproducibility for the same tuning job.
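The reproducibility guarantee rests on the usual property of seeded pseudo-random generators, which the following standalone sketch illustrates (a stand-in for the tuner's internal generator, not SageMaker AI code):

```python
import random

def draw_configs(seed, n=3):
    """Illustrative stand-in for a tuner's hyperparameter generator:
    with the same seed, the same configurations come back."""
    rng = random.Random(seed)
    return [{"learning_rate": rng.uniform(0.0, 0.1),
             "momentum": rng.uniform(0.0, 0.99)}
            for _ in range(n)]

first = draw_configs(seed=42)
second = draw_configs(seed=42)
assert first == second  # identical seed, identical draws
```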

# Data refining during training with Amazon SageMaker smart sifting
<a name="train-smart-sifting"></a>

SageMaker smart sifting is a capability of SageMaker Training that helps improve the efficiency of your training datasets and reduce total training time and cost.

Modern deep learning models such as large language models (LLMs) or vision transformer models often require massive datasets to achieve acceptable accuracy. For example, LLMs often require trillions of tokens or petabytes of data to converge. The growing size of training datasets, along with the size of state-of-the-art models, can increase the compute time and cost of model training.

Invariably, samples in a dataset do not contribute equally to the learning process during model training. A significant proportion of computational resources provisioned during training might be spent on processing easy samples that do not contribute substantially to the overall accuracy of a model. Ideally, training datasets would only include samples that are actually improving the model convergence. Filtering out less helpful data can reduce training time and compute cost. However, identifying less helpful data can be challenging and risky. It is practically difficult to identify which samples are less informative before training, and model accuracy can be impacted if the wrong samples or too many samples are excluded.

Smart sifting of data with Amazon SageMaker AI can help reduce training time and cost by improving data efficiency. The SageMaker smart sifting algorithm evaluates the loss value of each data sample during the data loading stage of a training job and excludes samples that are less informative to the model. By training on this refined data, you eliminate unnecessary forward and backward passes on non-improving samples, reducing the total time and cost of training your model with minimal or no impact on its accuracy.

SageMaker smart sifting is available through SageMaker Training Deep Learning Containers (DLCs) and supports PyTorch workloads through the PyTorch `DataLoader`. Implementing SageMaker smart sifting requires changing only a few lines of code, and you do not need to change your existing training or data processing workflows.

**Topics**
+ [How SageMaker smart sifting works](train-smart-sifting-how-it-works.md)
+ [Supported frameworks and AWS Regions](train-smart-sifting-what-is-supported.md)
+ [SageMaker smart sifting within your training script](train-smart-sifting-apply-to-script.md)
+ [Troubleshooting](train-smart-sifting-best-prac-considerations-troubleshoot.md)
+ [Security in SageMaker smart sifting](train-smart-sifting-security.md)
+ [SageMaker smart sifting Python SDK reference](train-smart-sifting-pysdk-reference.md)
+ [SageMaker smart sifting release notes](train-smart-sifting-release-notes.md)

# How SageMaker smart sifting works
<a name="train-smart-sifting-how-it-works"></a>

The goal of SageMaker smart sifting is to sift through your training data during the training process and only feed more informative samples to the model. During typical training with PyTorch, data is iteratively sent in batches to the training loop and to accelerator devices (such as GPUs or Trainium chips) by the [PyTorch `DataLoader`](https://pytorch.org/docs/stable/data.html). SageMaker smart sifting is implemented at this data loading stage and is thus independent of any upstream data pre-processing in your training pipeline. SageMaker smart sifting uses your model and its user-specified loss function to do an evaluative forward pass of each data sample as it is loaded. Samples that return *low-loss* values have less of an impact on the model's learning and are thus excluded from training, because it is already *easy* for the model to make the right prediction about them with high confidence. Meanwhile, those relatively high-loss samples are what the model still needs to learn, so these are kept for training. A key input you can set for SageMaker smart sifting is the proportion of data to exclude. For example, by setting the proportion to 25%, samples distributed in the lowest quartile of the distribution of loss (taken from a user-specified number of previous samples) are excluded from training. High-loss samples are accumulated in a refined data batch. The refined data batch is sent to the training loop (forward and backward pass), and the model learns and trains on the refined data batch. 

The following diagram shows an overview of how the SageMaker smart sifting algorithm is designed.

![\[Architecture diagram of how SageMaker smart sifting operates during training as data is loaded.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/smartsifting-arch.png)


In short, SageMaker smart sifting operates during training as data is loaded. The SageMaker smart sifting algorithm runs loss calculation over the batches, and sifts non-improving data out before the forward and backward pass of each iteration. The refined data batch is then used for the forward and backward pass. 
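The sifting step can be sketched as a relative-threshold filter over a sliding window of recent losses. The following toy implementation illustrates the idea; it is not the actual SageMaker smart sifting algorithm, and the class name and parameters are invented for illustration.

```python
from collections import deque
import statistics

class LossBasedSifter:
    """Toy sketch of relative-threshold sifting: keep a window of recent
    per-sample losses and drop a sample when its loss falls in the lowest
    `drop_fraction` of that window. Not the actual SageMaker algorithm."""

    def __init__(self, drop_fraction=0.25, window=1000):
        self.drop_fraction = drop_fraction
        self.recent = deque(maxlen=window)

    def keep(self, loss):
        """Return True if this sample should go into the refined batch."""
        self.recent.append(loss)
        if len(self.recent) < 10:  # warm-up: keep everything at first
            return True
        # Percentile cut point of the recent-loss distribution, e.g. the
        # 25th percentile when drop_fraction is 0.25.
        threshold = statistics.quantiles(self.recent, n=100)[
            int(self.drop_fraction * 100) - 1]
        return loss > threshold
```

In a training loop, samples for which `keep` returns `True` would be accumulated into the refined batch that is sent through the forward and backward pass.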

**Note**  
Smart sifting of data on SageMaker AI uses additional forward passes to analyze and filter your training data. In turn, there are fewer backward passes because less impactful data is excluded from your training job. Because of this, models that have long or expensive backward passes see the greatest efficiency gains when using smart sifting. Meanwhile, if your model's forward pass takes longer than its backward pass, the overhead could increase total training time. To measure the time spent by each pass, you can run a pilot training job and collect logs that record the time spent on each process. Also consider using SageMaker Profiler, which provides profiling tools and a UI application. To learn more, see [Amazon SageMaker Profiler](train-use-sagemaker-profiler.md).

SageMaker smart sifting works for PyTorch-based training jobs with classic distributed data parallelism, which makes model replicas on each GPU worker and performs `AllReduce`. It works with PyTorch DDP and the SageMaker AI distributed data parallel library.

# Supported frameworks and AWS Regions
<a name="train-smart-sifting-what-is-supported"></a>

Before using SageMaker smart sifting data loader, check if your framework of choice is supported, that the instance types are available in your AWS account, and that your AWS account is in one of the supported AWS Regions.

**Note**  
SageMaker smart sifting supports PyTorch model training with traditional data parallelism and distributed data parallelism, which makes model replicas in all GPU workers and uses the `AllReduce` operation. It doesn’t work with model parallelism techniques, including sharded data parallelism. Because SageMaker smart sifting works for data parallelism jobs, make sure that the model you train fits in each GPU memory.

## Supported Frameworks
<a name="train-smart-sifting-supported-frameworks"></a>

SageMaker smart sifting supports the following deep learning frameworks and is available through AWS Deep Learning Containers.

**Topics**
+ [PyTorch](#train-smart-sifting-supported-frameworks-pytorch)

### PyTorch
<a name="train-smart-sifting-supported-frameworks-pytorch"></a>


| Framework | Framework version | Deep Learning Container URI | 
| --- | --- | --- | 
| PyTorch | 2.1.0 |  *763104351884*.dkr.ecr.*region*.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker  | 

For more information about the pre-built containers, see [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) in the *AWS Deep Learning Containers GitHub repository*.

## AWS Regions
<a name="train-smart-sifting-supported-regions"></a>

The [containers packaged with the SageMaker smart sifting library](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-training-compiler-containers) are available in the AWS Regions where [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) are in service.

## Instance types
<a name="train-smart-sifting-instance-types"></a>

You can use SageMaker smart sifting for any PyTorch training job on any instance type. We recommend that you use P4d, P4de, or P5 instances.

# SageMaker smart sifting within your training script
<a name="train-smart-sifting-apply-to-script"></a>

The SageMaker smart sifting library is packaged in the [SageMaker AI framework DLCs](train-smart-sifting-what-is-supported.md#train-smart-sifting-supported-frameworks) as a complementary library. It filters out training samples that have relatively low impact on model training, so your model can reach the desired accuracy with fewer training samples than training on the full dataset would require.

To learn how to implement the smart sifting tool into your training script, choose one of the following based on the framework you use.

**Topics**
+ [Apply SageMaker smart sifting to your PyTorch script](train-smart-sifting-apply-to-pytorch-script.md)
+ [Apply SageMaker smart sifting to your Hugging Face Transformers script](train-smart-sifting-apply-to-hugging-face-transformers-script.md)

# Apply SageMaker smart sifting to your PyTorch script
<a name="train-smart-sifting-apply-to-pytorch-script"></a>

These instructions demonstrate how to enable SageMaker smart sifting with your training script.

1. Configure the SageMaker smart sifting interface.

   The SageMaker smart sifting library implements a relative-threshold loss-based sampling technique that helps filter out samples with lower impact on reducing the loss value. The SageMaker smart sifting algorithm calculates the loss value of every input data sample using a forward pass, and calculates its relative percentile against the loss values of preceding data. 

   Specify the following two parameters to the `RelativeProbabilisticSiftConfig` class to create a sifting configuration object. 
   + Use the `beta_value` parameter to specify the proportion of data to keep for training.
   + Use the `loss_history_length` parameter to specify the number of preceding samples whose loss values are used in the comparison.

   The following code example demonstrates setting up an object of the `RelativeProbabilisticSiftConfig` class.

   ```
   from smart_sifting.sift_config.sift_configs import (
       RelativeProbabilisticSiftConfig,
       LossConfig,
       SiftingBaseConfig
   )
   
   sift_config=RelativeProbabilisticSiftConfig(
       beta_value=0.5,
       loss_history_length=500,
       loss_based_sift_config=LossConfig(
            sift_config=SiftingBaseConfig(sift_delay=0)
       )
   )
   ```

   For more information about the `loss_based_sift_config` parameter and related classes, see [SageMaker smart sifting configuration modules](train-smart-sifting-pysdk-reference.md#train-smart-sifting-pysdk-base-config-modules) in the SageMaker smart sifting Python SDK reference section.

   The `sift_config` object in the preceding code example is used in step 4 for setting up the `SiftingDataloader` class.

1. (Optional) Configure a SageMaker smart sifting batch transform class.

   Different training use cases require different training data formats. Given the variety of data formats, the SageMaker smart sifting algorithm needs to identify how to perform sifting on a particular batch. To address this, SageMaker smart sifting provides a batch transform module that helps convert batches into standardized formats that it can efficiently sift. 

   1. SageMaker smart sifting handles batch transformation of training data in the following formats: Python lists, dictionaries, tuples, and tensors. For these data formats, SageMaker smart sifting automatically handles the batch data format conversion, and you can skip the rest of this step. If you skip this step, in step 4 for configuring `SiftingDataloader`, leave the `batch_transforms` parameter of `SiftingDataloader` at its default value, which is `None`.

   1. If your dataset is not in one of these formats, proceed with the rest of this step to create a custom batch transform using `SiftingBatchTransform`. 

      In cases in which your dataset isn’t in one of the formats supported by SageMaker smart sifting, you might run into errors. You can resolve such data format errors by adding the `batch_format_index` or `batch_transforms` parameter to the `SiftingDataloader` class, which you set up in step 4. The following shows example errors due to an incompatible data format and resolutions for them.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/train-smart-sifting-apply-to-pytorch-script.html)

      To resolve the aforementioned issues, create a custom batch transform class using the `SiftingBatchTransform` module. A batch transform class should consist of a pair of transform and reverse-transform functions. The function pair converts your data format to a format that the SageMaker smart sifting algorithm can process. After you create a batch transform class, it returns a `SiftingBatch` object that you pass to the `SiftingDataloader` class in step 4.

      The following are examples of custom batch transform classes of the `SiftingBatchTransform` module.
      + An example of a custom list batch transform implementation with SageMaker smart sifting for cases where the dataloader chunk has inputs, masks, and labels.

        ```
        from typing import Any
        
        import torch
        
        from smart_sifting.data_model.data_model_interface import SiftingBatchTransform
        from smart_sifting.data_model.list_batch import ListBatch
        
        class ListBatchTransform(SiftingBatchTransform):
            def transform(self, batch: Any):
                inputs = batch[0].tolist()
                labels = batch[-1].tolist()  # assume the last one is the list of labels
                return ListBatch(inputs, labels)
        
            def reverse_transform(self, list_batch: ListBatch):
                a_batch = [torch.tensor(list_batch.inputs), torch.tensor(list_batch.labels)]
                return a_batch
        ```
      + An example of a custom list batch transform implementation with SageMaker smart sifting for cases where no labels are needed for reverse transformation.

        ```
        class ListBatchTransformNoLabels(SiftingBatchTransform):
            def transform(self, batch: Any):
                return ListBatch(batch[0].tolist())
        
            def reverse_transform(self, list_batch: ListBatch):
                a_batch = [torch.tensor(list_batch.inputs)]
                return a_batch
        ```
      + An example of a custom tensor batch implementation with SageMaker smart sifting for cases where the data loader chunk has inputs, masks, and labels.

        ```
        from typing import Any
        
        from smart_sifting.data_model.data_model_interface import SiftingBatchTransform
        from smart_sifting.data_model.tensor_batch import TensorBatch
        
        class TensorBatchTransform(SiftingBatchTransform):
            def transform(self, batch: Any):
                a_tensor_batch = TensorBatch(
                    batch[0], batch[-1]
                )  # assume the last one is the list of labels
                return a_tensor_batch
        
            def reverse_transform(self, tensor_batch: TensorBatch):
                a_batch = [tensor_batch.inputs, tensor_batch.labels]
                return a_batch
        ```

      After you create a `SiftingBatchTransform`-implemented batch transform class, use it in step 4 when setting up the `SiftingDataloader` class. The rest of this guide assumes that a `ListBatchTransform` class is created. In step 4, this class is passed to the `batch_transforms` parameter.

1. Create a class for implementing the SageMaker smart sifting `Loss` interface. This tutorial assumes that the class is named `SiftingImplementedLoss`. While setting up this class, we recommend that you use the same loss function in the model training loop. Go through the following substeps for creating a SageMaker smart sifting `Loss` implemented class.

   1. SageMaker smart sifting calculates a loss value for each training data sample, as opposed to calculating a single loss value for a batch. To ensure that SageMaker smart sifting uses the same loss calculation logic, create a smart-sifting-implemented loss function using the SageMaker smart sifting `Loss` module that uses your loss function and calculates loss per training sample. 
**Tip**  
The SageMaker smart sifting algorithm runs on every data sample, not on the entire batch, so you should add an initialization function that sets the PyTorch loss function without any reduction strategy.  

      ```
      class SiftingImplementedLoss(Loss):  
          def __init__(self):
              self.loss = torch.nn.CrossEntropyLoss(reduction='none')
      ```
      This is also shown in the full code example that follows.

   1. Define a loss function that accepts the `original_batch` (or `transformed_batch` if you have set up a batch transform in step 2) and the PyTorch model. Using the specified loss function with no reduction, SageMaker smart sifting runs a forward pass for each data sample to evaluate its loss value. 

   The following code is an example of a smart-sifting-implemented `Loss` interface named `SiftingImplementedLoss`.

   ```
   from typing import Any
   
   import torch
   import torch.nn as nn
   from torch import Tensor
   
   from smart_sifting.data_model.data_model_interface import SiftingBatch
   from smart_sifting.loss.abstract_sift_loss_module import Loss
   
   model=... # a PyTorch model based on torch.nn.Module
   
   class SiftingImplementedLoss(Loss):   
    # You should add the following initialization function 
       # to calculate loss per sample, not per batch.
       def __init__(self):
           self.loss_no_reduction = torch.nn.CrossEntropyLoss(reduction='none')
   
       def loss(
           self,
           model: torch.nn.Module,
           transformed_batch: SiftingBatch,
           original_batch: Any = None,
       ) -> torch.Tensor:
           device = next(model.parameters()).device
           batch = [t.to(device) for t in original_batch] # use this if you use original batch and skipped step 2
           # batch = [t.to(device) for t in transformed_batch] # use this if you transformed batches in step 2
   
           # compute loss
           outputs = model(batch)
           return self.loss_no_reduction(outputs.logits, batch[2])
   ```

   This sifting loss calculation runs during the data-loading phase, when a batch is fetched in each iteration, before the training loop runs the actual forward pass. The individual loss value is then compared to previous loss values, and its relative percentile is estimated according to the `RelativeProbabilisticSiftConfig` object you set up in step 1.

1. Wrap the PyTorch data loader with the SageMaker AI `SiftingDataloader` class.

   Finally, pass all the SageMaker smart sifting classes you configured in the previous steps to the SageMaker AI `SiftingDataloader` configuration class. This class is a wrapper for the PyTorch [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) class. By wrapping the PyTorch `DataLoader`, SageMaker smart sifting is registered to run as part of data loading in each iteration of a PyTorch training job. The following code example demonstrates implementing SageMaker AI data sifting with a PyTorch `DataLoader`.

   ```
   from smart_sifting.dataloader.sift_dataloader import SiftingDataloader
   from torch.utils.data import DataLoader
   
   train_dataloader = DataLoader(...) # PyTorch data loader
   
   # Wrap the PyTorch data loader by SiftingDataloder
   train_dataloader = SiftingDataloader(
       sift_config=sift_config, # config object of RelativeProbabilisticSiftConfig
       orig_dataloader=train_dataloader,
       batch_transforms=ListBatchTransform(), # Optional, this is the custom class from step 2
       loss_impl=SiftingImplementedLoss(), # PyTorch loss function wrapped by the Sifting Loss interface
       model=model,
       log_batch_data=False
   )
   ```
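As emphasized in step 3, SageMaker smart sifting needs one loss value per training sample, which is why the loss function is created with `reduction='none'`. The following dependency-free sketch (plain Python, no PyTorch required) shows what per-sample cross-entropy values look like, and why a mean-reduced scalar would not be enough to rank individual samples:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy loss for a single sample (what reduction='none' yields)."""
    return -math.log(softmax(logits)[label])

batch_logits = [[2.0, 0.5, 0.1], [0.2, 1.7, 0.4], [0.1, 0.3, 2.5]]
batch_labels = [0, 1, 2]

# reduction='none': one loss value per sample, so sifting can rank samples.
per_sample = [cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)]

# The default mean reduction collapses the batch into a single scalar,
# which carries no per-sample information for sifting.
mean_loss = sum(per_sample) / len(per_sample)
print(len(per_sample))  # 3
```

With `reduction='none'`, the sifting algorithm can compare each of the three loss values against its loss history individually; the mean-reduced scalar would only describe the batch as a whole.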

# Apply SageMaker smart sifting to your Hugging Face Transformers script
<a name="train-smart-sifting-apply-to-hugging-face-transformers-script"></a>

There are two ways to implement SageMaker smart sifting in the Transformers `Trainer` class.

**Note**  
If you use one of the DLCs for PyTorch with the SageMaker smart sifting package installed, note that you need to install the `transformers` library. You can install additional packages by [extending the DLCs](prebuilt-containers-extend.md) or by passing `requirements.txt` to the [PyTorch estimator class](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html), which launches the training job, in the SageMaker AI Python SDK.

## Simple setup
<a name="train-smart-sifting-apply-to-hugging-face-transformers-script-simple"></a>

The simplest way to implement SageMaker smart sifting into the Transformers `Trainer` class is to use the `enable_sifting` function. This function accepts an existing `Trainer` object, and wraps the existing `DataLoader` object with `SiftingDataloader`. You can continue using the same training object. See the following example usage.

```
from smart_sifting.integrations.trainer import enable_sifting
from smart_sifting.loss.abstract_sift_loss_module import Loss
from smart_sifting.sift_config.sift_configs import (
    RelativeProbabilisticSiftConfig,
    LossConfig,
    SiftingBaseConfig
)
from torch.nn import MSELoss

class SiftingImplementedLoss(Loss):
    def loss(self, model, transformed_batch, original_batch):
        loss_fct = MSELoss(reduction="none") # make sure to set reduction to "none"
        logits = model.bert(**original_batch)
        return loss_fct(logits, original_batch.get("labels"))

sift_config = RelativeProbabilisticSiftConfig(
    beta_value=0.5,
    loss_history_length=500,
    loss_based_sift_config=LossConfig(
         sift_config=SiftingBaseConfig(sift_delay=0)
    )
)

trainer = Trainer(...)
enable_sifting(trainer, sift_config, loss=SiftingImplementedLoss()) # updates the trainer with Sifting Loss and config
trainer.train()
```

The `SiftingDataloader` class is an iterable data loader. The exact size of the resulting dataset is not known beforehand due to the random sampling during sifting. As a result, the Hugging Face `Trainer` expects the [`max_steps` training argument](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.max_steps). Note that this argument overrides the epoch configuration parameter `num_train_epochs`. If your original data loader was also iterable, or your training uses `max_steps` and a single epoch, then the `SiftingDataloader` performs the same as the existing dataloader. If the original dataloader was not iterable or `max_steps` was not provided, the Hugging Face Trainer might throw an error message similar to the following. 

```
args.max_steps must be set to a positive value if dataloader does not have a length,
was -1
```
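The underlying cause is that an iterable-style data loader does not expose `__len__`, so the `Trainer` cannot derive steps per epoch from the dataset size. A minimal dependency-free sketch of the distinction (the class and generator names here are illustrative):

```python
# A map-style dataset has a length, so steps per epoch can be derived from it.
class MapStyleDataset:
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, i):
        return self.data[i]

# An iterable-style loader (like a sifted dataloader) has no predefined
# length, because sifting decides at runtime which samples survive.
def sifted_stream(data):
    for sample in data:
        yield sample  # in real sifting, some samples would be dropped here

print(len(MapStyleDataset([1, 2, 3])))  # 3
try:
    len(sifted_stream([1, 2, 3]))
except TypeError:
    print("iterable loader has no length")
```

Because the sifted stream has no length, the trainer needs an explicit step budget such as `max_steps` instead of inferring it from the dataset.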

To address this, the `enable_sifting` function provides an optional `set_epochs` parameter. This enables training with epochs, using the number of epochs provided by the [`num_train_epochs` argument](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.num_train_epochs) of the `Trainer` class, and sets `max_steps` to the maximum system integer, allowing training to progress until the specified epochs have completed.

## Custom setup
<a name="train-smart-sifting-apply-to-hugging-face-transformers-script-custom-trainer"></a>

For a custom integration of the SageMaker smart sifting dataloader, you can use a custom Hugging Face `Trainer` class. Within any subclass of `Trainer`, the `get_train_dataloader()` function can be overridden to return an object of the `SiftingDataloader` class instead. For cases with existing custom trainers, this approach might be less intrusive but requires more code changes than the simple setup option. The following is an example implementation of SageMaker smart sifting in a custom Hugging Face `Trainer` class.

```
from smart_sifting.sift_config.sift_configs import (
    RelativeProbabilisticSiftConfig,
    LossConfig,
    SiftingBaseConfig
)
from smart_sifting.dataloader.sift_dataloader import SiftingDataloader
from smart_sifting.loss.abstract_sift_loss_module import Loss
from smart_sifting.data_model.data_model_interface import SiftingBatch, SiftingBatchTransform
from smart_sifting.data_model.list_batch import ListBatch

from typing import Any

import torch
from transformers import Trainer

class SiftingListBatchTransform(SiftingBatchTransform):
    def transform(self, batch: Any):
        inputs = batch[0].tolist()
        labels = batch[-1].tolist()  # assume the last one is the list of labels
        return ListBatch(inputs, labels)

    def reverse_transform(self, list_batch: ListBatch):
        a_batch = [torch.tensor(list_batch.inputs), torch.tensor(list_batch.labels)]
        return a_batch

class SiftingImplementedLoss(Loss):
    # You should add the following initialization function 
    # to calculate loss per sample, not per batch.
    def __init__(self):
        self.celoss = torch.nn.CrossEntropyLoss(reduction='none')

    def loss(
        self,
        model: torch.nn.Module,
        transformed_batch: SiftingBatch,
        original_batch: Any = None,
    ) -> torch.Tensor:
        device = next(model.parameters()).device
        batch = [t.to(device) for t in original_batch]

        # compute loss
        outputs = model(batch)
        return self.celoss(outputs.logits, batch[2])

class SiftingImplementedTrainer(Trainer):
    def get_train_dataloader(self):
        dl = super().get_train_dataloader()

        sift_config = RelativeProbabilisticSiftConfig(
            beta_value=0.5,
            loss_history_length=500,
            loss_based_sift_config=LossConfig(
                sift_config=SiftingBaseConfig(sift_delay=0)
            )
        )

        return SiftingDataloader(
                sift_config=sift_config,
                orig_dataloader=dl,
                batch_transforms=SiftingListBatchTransform(),
                loss_impl=SiftingImplementedLoss(),
                model=self.model
        )
```

Using the wrapped `Trainer` class, create an object of it as follows.

```
trainer = SiftingImplementedTrainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset
)

trainer.train()
```

# Troubleshooting
<a name="train-smart-sifting-best-prac-considerations-troubleshoot"></a>

If you run into an error, use the following list to try to troubleshoot the issue. If you need further support, reach out to the SageMaker AI team at sm-smart-sifting-feedback@amazon.com.

**Exceptions from the SageMaker smart sifting library**

Use the following reference of exceptions raised by the SageMaker smart sifting library to troubleshoot errors and identify causes.


| Exception Name | Description | 
| --- | --- | 
| SiftConfigValidationException | Thrown by the SageMaker smart sifting library when a configuration key is missing or a sifting configuration value has an unsupported type | 
| UnsupportedDataFormatException | Thrown by the SageMaker smart sifting library when the data format is not supported by the sifting logic | 
| LossImplementationNotProvidedException | Thrown when the `Loss` interface is missing or not implemented | 

# Security in SageMaker smart sifting
<a name="train-smart-sifting-security"></a>

Because the SageMaker smart sifting library removes less valuable training samples, it requires full access to training datasets as they are produced by the data loader. This access is no different from the access already provided to PyTorch in a normal training scenario.

SageMaker smart sifting has built-in logging with security implications. By default, SageMaker smart sifting emits only application-level logs containing metrics, latencies, and user errors or warnings. You can, however, choose to enable verbose logs, which record full batch data to show which samples were removed from a given batch. These logs are emitted using Python loggers and are not uploaded or stored anywhere by the library. If logs are automatically uploaded to CloudWatch or similar services, note that using verbose logs may result in sensitive training data being uploaded off of the training instance.
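If your training environment ships instance logs to CloudWatch or a similar service automatically, one defensive option is to cap the level of the library's Python logger before training starts. The logger name below is an assumption for illustration; check the library's actual logger hierarchy before relying on it.

```python
import logging

# Hypothetical logger name for illustration; verify against the library.
sift_logger = logging.getLogger("smart_sifting")
sift_logger.setLevel(logging.INFO)  # filter out DEBUG-level verbose batch logs

# Records below INFO are now rejected before any handler can ship them.
print(sift_logger.isEnabledFor(logging.DEBUG))  # False
print(sift_logger.isEnabledFor(logging.INFO))   # True
```

Combined with leaving `log_batch_data=False` on `SiftingDataloader`, this keeps batch contents out of the application logs.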

Beyond the aforementioned logging, SageMaker smart sifting does not have any network functionality nor does it interact with the local file system. User data is stored as in-memory objects for the entirety of the time it is used by the library.

# SageMaker smart sifting Python SDK reference
<a name="train-smart-sifting-pysdk-reference"></a>

This page provides a reference of Python modules you need for applying SageMaker smart sifting to your training script.

## SageMaker smart sifting configuration modules
<a name="train-smart-sifting-pysdk-base-config-modules"></a>

**`class smart_sifting.sift_config.sift_configs.RelativeProbabilisticSiftConfig()`**

The SageMaker smart sifting configuration class.

**Parameters**
+ `beta_value` (float) – A beta (constant) value. It is used to calculate the probability of selecting a sample for training based on the percentile of its loss in the loss value history. Lowering the beta value results in a lower percentage of data sifted, and raising it results in a higher percentage of data sifted. The beta value has no minimum or maximum, other than that it must be a positive value. The following reference table gives information for sifting rates with respect to `beta_value`.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/train-smart-sifting-pysdk-reference.html)
+ `loss_history_length` (int) – The number of previous training losses to store for the relative-threshold, loss-based sampling.
+ `loss_based_sift_config` (dict or a `LossConfig` object) – Specify a `LossConfig` object that returns the SageMaker smart sifting `Loss` interface configuration.
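To build intuition for how `beta_value` and `loss_history_length` interact, the following dependency-free sketch mimics a relative-percentile, probabilistic selection rule. The formula used here is an illustrative assumption, not the library's internal implementation:

```python
import random
from collections import deque

def keep_sample(loss, history, beta, rng):
    """Probabilistically keep a sample based on the relative percentile
    of its loss within the recent loss history (illustrative rule only)."""
    if history:
        percentile = sum(1 for h in history if h <= loss) / len(history)
    else:
        percentile = 1.0  # no history yet: always keep
    history.append(loss)
    # Higher-loss samples (higher percentile) are more likely to be kept;
    # beta scales the overall proportion of data retained.
    return rng.random() < min(1.0, 2 * beta * percentile)

rng = random.Random(0)
history = deque(maxlen=500)  # plays the role of loss_history_length
losses = [x for x in (rng.gauss(1.0, 0.3) for _ in range(10000))]
kept = [x for x in losses if keep_sample(x, history, beta=0.5, rng=rng)]
```

Under this illustrative rule with `beta=0.5`, roughly half of the samples survive, and the survivors are biased toward higher loss values, which mirrors the intent of loss-based sifting.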

**`class smart_sifting.sift_config.sift_configs.LossConfig()`**

The configuration class for the `loss_based_sift_config` parameter of the `RelativeProbabilisticSiftConfig` class.

**Parameters**
+ `sift_config` (dict or a `SiftingBaseConfig` object) – Specify a `SiftingBaseConfig` object that returns a sifting base configuration dictionary.

**`class smart_sifting.sift_config.sift_configs.SiftingBaseConfig()`**

The configuration class for the `sift_config` parameter of `LossConfig`.

**Parameters**
+ `sift_delay` (int) – The number of training steps to wait for before starting sifting. We recommend that you start sifting after all the layers in the model have enough view of the training data. The default value is `1000`.
+ `repeat_delay_per_epoch` (bool) – Specify whether to delay sifting every epoch. The default value is `False`.

## SageMaker smart sifting data batch transform modules
<a name="train-smart-sifting-pysdk-batch-transform-modules"></a>

`class smart_sifting.data_model.data_model_interface.SiftingBatchTransform`

A SageMaker smart sifting Python module for defining how to perform batch transform. Using this, you can set up a batch transform class that converts the data format of your training data to `SiftingBatch` format. SageMaker smart sifting can sift and accumulate data in this format into a sifted batch.

`class smart_sifting.data_model.data_model_interface.SiftingBatch`

An interface to define a batch data type that can be sifted and accumulated.

`class smart_sifting.data_model.list_batch.ListBatch`

A module for keeping track of a list batch for sifting.

`class smart_sifting.data_model.tensor_batch.TensorBatch`

A module for keeping track of a tensor batch for sifting.

## SageMaker smart sifting loss implementation module
<a name="train-smart-sifting-pysdk-loss-interface-moddule"></a>

`class smart_sifting.loss.abstract_sift_loss_module.Loss`

A wrapper module for registering the SageMaker smart sifting interface to the loss function of a PyTorch-based model.

## SageMaker smart sifting data loader wrapper module
<a name="train-smart-sifting-pysdk-dataloader-wrapper-module"></a>

`class smart_sifting.dataloader.sift_dataloader.SiftingDataloader`

A wrapper module for registering the SageMaker smart sifting interface to the data loader of a PyTorch-based model.

The Main Sifting Dataloader iterator sifts out training samples from a dataloader based on a sift configuration.

**Parameters**
+ `sift_config` (dict or a `RelativeProbabilisticSiftConfig` object) – A `RelativeProbabilisticSiftConfig` object.
+ `orig_dataloader` (a PyTorch DataLoader object) – Specify the PyTorch Dataloader object to be wrapped.
+ `batch_transforms` (a `SiftingBatchTransform` object) – (Optional) If your data format is not supported by the SageMaker smart sifting library’s default transform, you must create a batch transform class using the `SiftingBatchTransform` module and pass it through this parameter. `SiftingDataloader` uses this class to convert the data into a format that the SageMaker smart sifting algorithm can accept. 
+ `model` (a PyTorch model object) – The original PyTorch model
+ `loss_impl` (a sifting loss function of `smart_sifting.loss.abstract_sift_loss_module.Loss`) – A sifting loss function that is configured with the `Loss` module and wraps the PyTorch loss function.
+ `log_batch_data` (bool) – Specify whether to log batch data. If set to `True`, SageMaker smart sifting logs the details of the batches that are kept or sifted. We recommend that you turn it on only for a pilot training job. When logging is on, the samples are loaded to GPU and transferred to CPU, which introduces overhead. The default value is `False`.

# SageMaker smart sifting release notes
<a name="train-smart-sifting-release-notes"></a>

See the following release notes to track the latest updates for the SageMaker smart sifting capability.

## SageMaker smart sifting release notes: November 29, 2023
<a name="train-smart-sifting-release-notes-20231129"></a>

**New Features**
+ Launched the Amazon SageMaker smart sifting library at AWS re:Invent 2023.

**Migration to AWS Deep Learning Containers**
+ The SageMaker smart sifting library passed integration testing and is available in AWS Deep Learning Containers. To find a complete list of the pre-built containers with the SageMaker smart sifting library, see [Supported frameworks and AWS Regions](train-smart-sifting-what-is-supported.md).

# Debugging and improving model performance
<a name="train-debug-and-improve-model-performance"></a>

The essence of training machine learning models, such as deep neural networks and transformer models, is achieving stable model convergence. State-of-the-art models have millions, billions, or even trillions of parameters, and the number of operations required to update them during each iteration can easily become astronomical. To identify model convergence issues, you need access to the model parameters, activations, and gradients computed during the optimization process. 

Amazon SageMaker AI provides two debugging tools to help identify such convergence issues and gain visibility into your models.

**Amazon SageMaker AI with TensorBoard**

To offer greater compatibility with open-source community tools within the SageMaker AI Training platform, SageMaker AI hosts TensorBoard as an application in [SageMaker AI domain](https://docs.aws.amazon.com/sagemaker/latest/dg/sm-domain.html). You can bring your training jobs to SageMaker AI and keep using the TensorBoard summary writer to collect the model output tensors. Because TensorBoard is implemented in SageMaker AI domain, it also gives you more options to manage user profiles under the SageMaker AI domain in your AWS account, and provides fine-grained control over the user profiles by granting access to specific actions and resources. To learn more, see [TensorBoard in Amazon SageMaker AI](tensorboard-on-sagemaker.md).

**Amazon SageMaker Debugger**

Amazon SageMaker Debugger is a capability of SageMaker AI that provides tools to register hooks and callbacks that extract model output tensors and save them in Amazon Simple Storage Service (Amazon S3). It provides [built-in rules](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html) for detecting model convergence issues, such as overfitting, saturated activation functions, vanishing gradients, and more. You can also set up the built-in rules with Amazon CloudWatch Events and AWS Lambda for taking automated actions against detected issues, and set up Amazon Simple Notification Service to receive email or text notifications. To learn more, see [Amazon SageMaker Debugger](train-debugger.md).

**Topics**
+ [TensorBoard in Amazon SageMaker AI](tensorboard-on-sagemaker.md)
+ [Amazon SageMaker Debugger](train-debugger.md)
+ [Access a training container through AWS Systems Manager for remote debugging](train-remote-debugging.md)
+ [Release notes for debugging capabilities of Amazon SageMaker AI](debugger-release-notes.md)

# TensorBoard in Amazon SageMaker AI
<a name="tensorboard-on-sagemaker"></a>

Amazon SageMaker AI with TensorBoard is a capability of Amazon SageMaker AI that brings the [TensorBoard](https://www.tensorflow.org/tensorboard) visualization tools to SageMaker AI, integrated with SageMaker Training and SageMaker AI domain. It provides options to administer your AWS account and the users belonging to it through [SageMaker AI domain](https://docs.aws.amazon.com/sagemaker/latest/dg/sm-domain.html), to give domain users access to the TensorBoard data with appropriate permissions to Amazon S3, and to help domain users perform model debugging tasks using the TensorBoard visualization plugins. SageMaker AI with TensorBoard is extended with the SageMaker AI Data Manager plugin, which lets domain users access multiple training jobs in one place within the TensorBoard application.

**Note**  
This feature is for debugging the training of deep learning models using PyTorch or TensorFlow.

**For data scientists**

Training large models can surface scientific problems that data scientists must debug and resolve in order to improve model convergence and stabilize gradient descent processes.

When you encounter model training issues, such as loss not converging, or vanishing or exploding weights and gradients, you need to access tensor data to dive deep and analyze the model parameters, scalars, and any custom metrics. Using SageMaker AI with TensorBoard, you can visualize model output tensors extracted from training jobs. As you experiment with different models, multiple training runs, and model hyperparameters, you can select multiple training jobs in TensorBoard and compare them in one place.

**For administrators**

Through the TensorBoard landing page in the SageMaker AI console or [SageMaker AI domain](https://docs.aws.amazon.com/sagemaker/latest/dg/sm-domain.html), you can manage TensorBoard application users if you are an administrator of an AWS account or SageMaker AI domain. Each domain user can access their own TensorBoard application given the granted permissions. As a SageMaker AI domain administrator and domain user, you can create and delete the TensorBoard application given the permission level you have.

**Note**  
You cannot share the TensorBoard application for collaboration purposes because SageMaker AI domain does not allow application sharing among users. Users can share the output tensors saved in an S3 bucket, if they have access to the bucket.

## Supported frameworks and AWS Regions
<a name="debugger-htb-support"></a>

The TensorBoard application in SageMaker AI is available for the following machine learning frameworks and AWS Regions.

**Frameworks**
+ PyTorch
+ TensorFlow
+ Hugging Face Transformers

**AWS Regions**
+ US East (N. Virginia) (`us-east-1`)
+ US East (Ohio) (`us-east-2`)
+ US West (Oregon) (`us-west-2`)
+ Europe (Frankfurt) (`eu-central-1`)
+ Europe (Ireland) (`eu-west-1`)

**Note**  
Amazon SageMaker AI with TensorBoard runs on an `ml.r5.large` instance and incurs charges after the SageMaker AI free tier or the free trial period of the feature. For more information, see [Amazon SageMaker AI Pricing](https://aws.amazon.com/sagemaker/pricing/).

**Topics**
+ [Supported frameworks and AWS Regions](#debugger-htb-support)
+ [Prepare a training job to collect TensorBoard output data](debugger-htb-prepare-training-job.md)
+ [Accessing the TensorBoard application on SageMaker AI](debugger-htb-access-tb.md)
+ [Load and visualize output tensors using the TensorBoard application](debugger-htb-access-tb-data.md)
+ [Delete unused TensorBoard applications](debugger-htb-delete-app.md)

# Prepare a training job to collect TensorBoard output data
<a name="debugger-htb-prepare-training-job"></a>

A typical training job for machine learning in SageMaker AI consists of two main steps: preparing a training script and configuring a SageMaker AI estimator object of the SageMaker AI Python SDK. In this section, you'll learn about the required changes to collect TensorBoard-compatible data from SageMaker training jobs.

## Prerequisites
<a name="debugger-htb-prerequisites"></a>

The following list shows the prerequisites to start using SageMaker AI with TensorBoard.
+ A SageMaker AI domain that's set up with Amazon VPC in your AWS account. 

  For instructions on setting up a domain, see [Onboard to Amazon SageMaker AI domain using quick setup](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html). You also need to add domain user profiles for individual users to access the TensorBoard on SageMaker AI. For more information, see [Add user profiles](domain-user-profile-add.md).
+ The following is the minimum set of permissions required to use TensorBoard on SageMaker AI.
  + `sagemaker:CreateApp`
  + `sagemaker:DeleteApp`
  + `sagemaker:DescribeTrainingJob`
  + `sagemaker:Search`
  + `s3:GetObject`
  + `s3:ListBucket`
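
  For example, an IAM identity-based policy granting this minimum set might look like the following sketch. The bucket name is a placeholder, and you should scope the `Resource` elements to fit your environment.

  ```json
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "sagemaker:CreateApp",
          "sagemaker:DeleteApp",
          "sagemaker:DescribeTrainingJob",
          "sagemaker:Search"
        ],
        "Resource": "*"
      },
      {
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
          "arn:aws:s3:::your-tensorboard-output-bucket",
          "arn:aws:s3:::your-tensorboard-output-bucket/*"
        ]
      }
    ]
  }
  ```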

## Step 1: Modify your training script with open-source TensorBoard helper tools
<a name="debugger-htb-prepare-training-job-1"></a>

Determine which output tensors and scalars to collect, and modify the code in your training script using any of the following tools: TensorBoardX, TensorFlow Summary Writer, PyTorch Summary Writer, or SageMaker Debugger.

Also make sure that you specify the TensorBoard data output path as the log directory (`log_dir`) for the callback in the training container.

For more information about callbacks per framework, see the following resources.
+ For PyTorch, use [torch.utils.tensorboard.SummaryWriter](https://pytorch.org/docs/stable/tensorboard.html#module-torch.utils.tensorboard). See also the [Using TensorBoard in PyTorch](https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html#using-tensorboard-in-pytorch) and [Log scalars](https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html#log-scalars) sections in the *PyTorch tutorials*. Alternatively, you can use [TensorBoardX Summary Writer](https://tensorboardx.readthedocs.io/en/latest/tutorial.html).

  ```
  LOG_DIR="/opt/ml/output/tensorboard"
  tensorboard_callback=torch.utils.tensorboard.writer.SummaryWriter(log_dir=LOG_DIR)
  ```
+ For TensorFlow, use the native callback for TensorBoard, [tf.keras.callbacks.TensorBoard](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/TensorBoard).

  ```
  LOG_DIR="/opt/ml/output/tensorboard"
  tensorboard_callback=tf.keras.callbacks.TensorBoard(
      log_dir=LOG_DIR, histogram_freq=1)
  ```
+ For Transformers with PyTorch, you can use [transformers.integrations.TensorBoardCallback](https://huggingface.co/docs/transformers/main/en/main_classes/callback#transformers.integrations.TensorBoardCallback). 

  For Transformers with TensorFlow, use the `tf.keras.callbacks.TensorBoard` callback and pass it to the Keras callbacks in Transformers.
**Tip**  
You can also use a different container local output path. However, in [Step 2: Create a SageMaker training estimator object with the TensorBoard output configuration](#debugger-htb-prepare-training-job-2), you must map the paths correctly for SageMaker AI to successfully search the local path and save the TensorBoard data to the S3 output bucket.
+ For guidance on modifying training scripts using the SageMaker Debugger Python library, see [Adapting your training script to register a hook](debugger-modify-script.md).
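
All of the summary writers above expose an `add_scalar(tag, value, global_step)`-style method, so a small framework-agnostic helper can keep your logging consistent across experiments. The `log_epoch` helper and its tag scheme below are illustrative assumptions, not part of any SageMaker AI or TensorBoard API.

```python
# Local path that SageMaker AI syncs to your S3 output bucket.
LOG_DIR = "/opt/ml/output/tensorboard"

def log_epoch(writer, epoch, metrics):
    """Write one scalar per metric and return the tags that were written."""
    tags = []
    for name, value in metrics.items():
        tag = f"{name}/epoch"  # e.g., "loss/epoch"
        writer.add_scalar(tag, value, global_step=epoch)
        tags.append(tag)
    return tags

# Example with PyTorch (requires torch):
# from torch.utils.tensorboard import SummaryWriter
# writer = SummaryWriter(log_dir=LOG_DIR)
# log_epoch(writer, epoch=0, metrics={"loss": 0.52, "accuracy": 0.81})
```

Because the helper only relies on the `add_scalar` method, the same call works with `torch.utils.tensorboard`, TensorBoardX, or any writer with a compatible interface.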

## Step 2: Create a SageMaker training estimator object with the TensorBoard output configuration
<a name="debugger-htb-prepare-training-job-2"></a>

Use the `sagemaker.debugger.TensorBoardOutputConfig` class while configuring a SageMaker AI framework estimator. This configuration API maps the S3 bucket that you specify for saving TensorBoard data to the local path in the training container (`/opt/ml/output/tensorboard`). Pass an object of the module to the `tensorboard_output_config` parameter of the estimator class. The following code snippet shows an example of preparing a TensorFlow estimator with the TensorBoard output configuration parameter.

**Note**  
This example assumes that you use the SageMaker Python SDK. If you use the low-level SageMaker API, you should include the following to the request syntax of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API.  

```
"TensorBoardOutputConfig": { 
  "LocalPath": "/opt/ml/output/tensorboard",
  "S3OutputPath": "s3_output_bucket"
}
```

```
import os

from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import TensorBoardOutputConfig

# Set variables for training job information, 
# such as s3_out_bucket and other unique tags.
... 

LOG_DIR="/opt/ml/output/tensorboard"

output_path = os.path.join(
    "s3_output_bucket", "sagemaker-output", "date_str", "your-training_job_name"
)

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output_path, 'tensorboard'),
    container_local_output_path=LOG_DIR
)

estimator = TensorFlow(
    entry_point="train.py",
    source_dir="src",
    role=role,
    image_uri=image_uri,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    base_job_name="your-training_job_name",
    tensorboard_output_config=tensorboard_output_config,
    hyperparameters=hyperparameters
)
```

**Note**  
The TensorBoard application does not provide out-of-the-box support for SageMaker AI hyperparameter tuning jobs, because the [CreateHyperParameterTuningJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateHyperParameterTuningJob.html) API is not integrated with the TensorBoard output configuration for the mapping. To use the TensorBoard application for hyperparameter tuning jobs, you need to write code in your training script that uploads metrics to Amazon S3. Once the metrics are uploaded to an Amazon S3 bucket, you can then load the bucket into the TensorBoard application on SageMaker AI.
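
As a hedged sketch of that manual upload, the following walks the local TensorBoard log directory and copies event files to S3 with boto3. The key layout and function names are assumptions for illustration, not a SageMaker AI convention.

```python
import os

def tensorboard_s3_key(prefix, job_name, relative_path):
    # Hypothetical layout: <prefix>/<job-name>/tensorboard/<event-file>
    return "/".join([prefix.strip("/"), job_name, "tensorboard", relative_path])

def upload_tensorboard_logs(bucket, prefix, job_name,
                            log_dir="/opt/ml/output/tensorboard"):
    # Lazy import so the key helper above works without boto3 installed.
    import boto3
    s3 = boto3.client("s3")
    for root, _, files in os.walk(log_dir):
        for file_name in files:
            local_path = os.path.join(root, file_name)
            rel = os.path.relpath(local_path, log_dir).replace(os.sep, "/")
            s3.upload_file(local_path, bucket,
                           tensorboard_s3_key(prefix, job_name, rel))
```

Call `upload_tensorboard_logs` periodically (for example, at the end of each epoch) so the TensorBoard application can track tuning-job metrics as they arrive.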

# Accessing the TensorBoard application on SageMaker AI
<a name="debugger-htb-access-tb"></a>

You can access TensorBoard in two ways: programmatically, using the `sagemaker.interactive_apps.tensorboard` module to generate an unsigned or presigned URL; or through the TensorBoard landing page in the SageMaker AI console. After you open TensorBoard, SageMaker AI runs the TensorBoard plugin and automatically finds all training job output data in a TensorBoard-compatible file format.

**Topics**
+ [Open TensorBoard using the `sagemaker.interactive_apps.tensorboard` module](debugger-htb-access-tb-url.md)
+ [Open TensorBoard using the `get_app_url` function as an `estimator` class method](debugger-htb-access-tb-get-app-url-estimator-method.md)
+ [Open TensorBoard through the SageMaker AI console](debugger-htb-access-tb-console.md)

# Open TensorBoard using the `sagemaker.interactive_apps.tensorboard` module
<a name="debugger-htb-access-tb-url"></a>

The `sagemaker.interactive_apps.tensorboard` module provides a function called `get_app_url` that generates unsigned or presigned URLs to open the TensorBoard application from any environment in SageMaker AI or on Amazon EC2, providing a unified experience for both Studio Classic and non-Studio Classic users. In the Studio Classic environment, you can open TensorBoard by running the `get_app_url()` function as is, or you can specify a job name to start tracking as the TensorBoard application opens. In non-Studio Classic environments, you can open TensorBoard by providing your domain and user profile information to the function. With this functionality, regardless of where or how you run training code and launch training jobs, you can access TensorBoard directly by running `get_app_url` in your Jupyter notebook or terminal.

**Note**  
This functionality is available in the SageMaker Python SDK v2.184.0 and later. To use this functionality, make sure that you upgrade the SDK by running `pip install sagemaker --upgrade`.

**Topics**
+ [Option 1: For SageMaker AI Studio Classic](#debugger-htb-access-tb-url-unsigned)
+ [Option 2: For non-Studio Classic environments](#debugger-htb-access-tb-url-presigned)

## Option 1: For SageMaker AI Studio Classic
<a name="debugger-htb-access-tb-url-unsigned"></a>

If you are using SageMaker Studio Classic, you can directly open the TensorBoard application or retrieve an unsigned URL by running the `get_app_url` function as follows. Because you are already within the Studio Classic environment and signed in as a domain user, `get_app_url()` generates an unsigned URL; there is no need to authenticate again.

**To open the TensorBoard application** 

The following code automatically opens the TensorBoard application in your environment's default web browser, from the unsigned URL that the `get_app_url()` function returns.

```
from sagemaker.interactive_apps import tensorboard

region = "us-west-2"
app = tensorboard.TensorBoardApp(region)

app.get_app_url(
    training_job_name="your-training_job_name" # Optional. Specify the job name to track a specific training job 
)
```

**To retrieve an unsigned URL and open the TensorBoard application manually**

The following code prints an unsigned URL that you can copy into a web browser to open the TensorBoard application.

```
from sagemaker.interactive_apps import tensorboard

region = "us-west-2"
app = tensorboard.TensorBoardApp(region)
print("Navigate to the following URL:")
print(
    app.get_app_url(
        training_job_name="your-training_job_name", # Optional. Specify the name of the job to track.
        open_in_default_web_browser=False           # Set to False to print the URL to terminal.
    )
)
```

Note that if you run the preceding two code samples outside the SageMaker AI Studio Classic environment, the function returns a URL to the TensorBoard landing page in the SageMaker AI console, because the environment does not have sign-in information for your domain and user profile. To create a presigned URL, see Option 2 in the following section.

## Option 2: For non-Studio Classic environments
<a name="debugger-htb-access-tb-url-presigned"></a>

If you use a non-Studio Classic environment, such as a SageMaker Notebook instance or Amazon EC2, and want to open TensorBoard directly from that environment, you need to generate a URL presigned with your domain and user profile information. A *presigned* URL is signed with your domain and user profile at the time it is created, and therefore grants access to all of the domain applications and files associated with your domain. To open TensorBoard through a presigned URL, use the `get_app_url` function with your domain and user profile name as follows.

Note that this option requires the domain user to have the `sagemaker:CreatePresignedDomainUrl` permission. Without the permission, the domain user will receive an exception error.

**Important**  
Do not share any presigned URLs. The `get_app_url` function creates presigned URLs, which automatically authenticates with your domain and user profile and gives access to any applications and files associated with your domain.

```
print(
    app.get_app_url(
        training_job_name="your-training_job_name", # Optional. Specify the name of the job to track.
        create_presigned_domain_url=True,           # Required. Set to True to create a presigned URL.
        domain_id="your-domain-id",                 # Required if creating a presigned URL (create_presigned_domain_url=True).
        user_profile_name="your-user-profile-name", # Required if creating a presigned URL (create_presigned_domain_url=True).
        open_in_default_web_browser=False,          # Optional. Set to False to print the URL to terminal.
        optional_create_presigned_url_kwargs={}     # Optional. Add any additional args for Boto3 create_presigned_domain_url
    )
)
```

**Tip**  
The `get_app_url` function runs the [`create_presigned_domain_url`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_presigned_domain_url.html) API in the AWS SDK for Python (Boto3) behind the scenes. Because the Boto3 `create_presigned_domain_url` API creates presigned domain URLs that expire in 300 seconds by default, presigned TensorBoard application URLs also expire in 300 seconds. If you want to extend the expiration time, pass the `ExpiresInSeconds` argument through the `optional_create_presigned_url_kwargs` argument of the `get_app_url` function as follows.  

```
optional_create_presigned_url_kwargs={"ExpiresInSeconds": 1500}
```

**Note**  
If any of the input that you pass to the arguments of `get_app_url` is invalid, the function outputs a URL to the TensorBoard landing page instead of opening the TensorBoard application. The output message is similar to the following.  

```
Navigate to the following URL:
https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/tensor-board-landing
```

# Open TensorBoard using the `get_app_url` function as an `estimator` class method
<a name="debugger-htb-access-tb-get-app-url-estimator-method"></a>

If you are in the process of running a training job using the `estimator` class of the SageMaker Python SDK and have an active object of the `estimator` class, you can also access the [`get_app_url` function as a class method](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.EstimatorBase.get_app_url) of the `estimator` class. Open the TensorBoard application or retrieve an unsigned URL by running the `get_app_url` method as follows. The `get_app_url` class method pulls the training job name from the estimator and opens the TensorBoard application with the specified job.

**Note**  
This functionality is available in the SageMaker Python SDK v2.184.0 and later. To use this functionality, make sure that you upgrade the SDK by running `pip install sagemaker --upgrade`.

**Topics**
+ [Option 1: For SageMaker Studio Classic](#debugger-htb-access-tb-get-app-url-estimator-method-studio)
+ [Option 2: For non-Studio Classic environments](#debugger-htb-access-tb-get-app-url-estimator-method-non-studio)

## Option 1: For SageMaker Studio Classic
<a name="debugger-htb-access-tb-get-app-url-estimator-method-studio"></a>

**To open the TensorBoard application** 

The following code automatically opens the TensorBoard application in your environment's default web browser, from the unsigned URL that the `get_app_url()` method returns.

```
estimator.get_app_url(
    app_type=SupportedInteractiveAppTypes.TENSORBOARD # Required.
)
```

**To retrieve an unsigned URL and open the TensorBoard application manually**

The following code prints an unsigned URL that you can copy into a web browser to open the TensorBoard application.

```
print(
    estimator.get_app_url(
        app_type=SupportedInteractiveAppTypes.TENSORBOARD, # Required.
        open_in_default_web_browser=False, # Optional. Set to False to print the URL to terminal.
    )
)
```

Note that if you run the preceding two code samples outside the SageMaker AI Studio Classic environment, the function returns a URL to the TensorBoard landing page in the SageMaker AI console, because the environment does not have sign-in information for your domain and user profile. To create a presigned URL, see Option 2 in the following section.

## Option 2: For non-Studio Classic environments
<a name="debugger-htb-access-tb-get-app-url-estimator-method-non-studio"></a>

If you use a non-Studio Classic environment, such as a SageMaker Notebook instance or Amazon EC2, and want to generate a presigned URL to open the TensorBoard application, use the `get_app_url` method with your domain and user profile information as follows.

Note that this option requires the domain user to have the `sagemaker:CreatePresignedDomainUrl` permission. Without the permission, the domain user will receive an exception error.

**Important**  
Do not share any presigned URLs. The `get_app_url` function creates presigned URLs, which automatically authenticates with your domain and user profile and gives access to any applications and files associated with your domain.

```
print(
    estimator.get_app_url(
        app_type=SupportedInteractiveAppTypes.TENSORBOARD, # Required
        create_presigned_domain_url=True,           # Required. Set to True to create a presigned URL.
        domain_id="your-domain-id",                 # Required if creating a presigned URL (create_presigned_domain_url=True).
        user_profile_name="your-user-profile-name", # Required if creating a presigned URL (create_presigned_domain_url=True).
        open_in_default_web_browser=False,            # Optional. Set to False to print the URL to terminal.
        optional_create_presigned_url_kwargs={}       # Optional. Add any additional args for Boto3 create_presigned_domain_url
    )
)
```

# Open TensorBoard through the SageMaker AI console
<a name="debugger-htb-access-tb-console"></a>

You can also use the SageMaker AI console UI to open the TensorBoard application. There are two options to open the TensorBoard application through the SageMaker AI console.

**Topics**
+ [Option 1: Launch TensorBoard from the domain details page](#debugger-htb-access-tb-console-domain-detail)
+ [Option 2: Launch TensorBoard from the TensorBoard landing page](#debugger-htb-access-tb-console-landing-pg)

## Option 1: Launch TensorBoard from the domain details page
<a name="debugger-htb-access-tb-console-domain-detail"></a>

**Navigate to the domain details page**

 The following procedure shows how to navigate to the domain details page. 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1. From the list of domains, select the domain in which you want to launch the TensorBoard application.

**Launch a user profile application**

The following procedure shows how to launch a TensorBoard application that is scoped to a user profile. 

1. On the domain details page, choose the **User profiles** tab. 

1. Identify the user profile for which you want to launch the TensorBoard application. 

1. Choose **Launch** for your selected user profile, then choose **TensorBoard**. 

## Option 2: Launch TensorBoard from the TensorBoard landing page
<a name="debugger-htb-access-tb-console-landing-pg"></a>

The following procedure describes how to launch a TensorBoard application from the TensorBoard landing page. 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **TensorBoard**.

1. Under **Get started**, select the domain in which you want to launch the TensorBoard application. If your user profile only belongs to one domain, you do not see the option for selecting a domain.

1. Select the user profile for which you want to launch the TensorBoard application. If there is no user profile in the domain, choose **Create user profile**. For more information, see [Add and Remove User Profiles](https://docs.aws.amazon.com/sagemaker/latest/dg/domain-user-profile-add.html).

1. Choose **Open TensorBoard**.

The following screenshot shows the location of TensorBoard in the left navigation pane of the SageMaker AI console and the SageMaker AI with TensorBoard landing page in the main pane.

![\[The TensorBoard landing page\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/htb-landing-page.png)


# Load and visualize output tensors using the TensorBoard application
<a name="debugger-htb-access-tb-data"></a>

You can conduct an online or offline analysis by loading collected output tensors from S3 buckets paired with training jobs during or after training.

When you open the TensorBoard application, TensorBoard opens with the **SageMaker AI Data Manager** tab. The following screenshot shows the full view of the SageMaker AI Data Manager tab in the TensorBoard application.

**Note**  
The visualization plugins might not appear when you first launch the TensorBoard application. After you select training jobs in the SageMaker AI Data Manager plugin, the TensorBoard application loads the TensorBoard data and populates the visualization plugins.

**Note**  
The TensorBoard application automatically shuts down after 1 hour of inactivity. To avoid paying for the instance hosting it, manually shut TensorBoard down when you are done using it. For instructions on deleting the application, see [Delete unused TensorBoard applications](debugger-htb-delete-app.md).

![\[The SageMaker AI Data Manager tab view.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/htb-sagemaker-manager-tab.png)


In the **SageMaker AI Data Manager** tab, you can select any training job and load TensorBoard-compatible training output data from Amazon S3. 

1. In the **Search training jobs** section, use the filters to narrow down the list of training jobs you want to find, load, and visualize.

1. In the **List of training jobs** section, use the check boxes to choose training jobs from which you want to pull data and visualize for debugging.

1. Choose **Add selected jobs**. The selected jobs should appear in the **Tracked training jobs** section, as shown in the following screenshot.   
![\[The Tracked training jobs section.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/htb-sagemaker-manager-tab-tracked-jobs.png)

**Note**  
The **SageMaker AI Data Manager** tab only shows training jobs configured with the `TensorBoardOutputConfig` parameter. Make sure you have configured the SageMaker AI estimator with this parameter. For more information, see [Step 2: Create a SageMaker training estimator object with the TensorBoard output configuration](debugger-htb-prepare-training-job.md#debugger-htb-prepare-training-job-2).

**Note**  
The visualization tabs might not appear if you are using SageMaker AI with TensorBoard for the first time or no data is loaded from a previous use. After adding training jobs, wait a few seconds and then refresh the viewer by choosing the clockwise circular arrow in the upper-right corner. The visualization tabs should appear after the job data loads successfully. You can also turn on auto-refresh using the **Settings** button next to the refresh button in the upper-right corner.

## Visualization of the output tensors in TensorBoard
<a name="debugger-htb-explore"></a>

In the graphics tabs, the left pane lists the loaded training jobs. You can use the training jobs' check boxes to show or hide visualizations. The TensorBoard plugins are activated dynamically, depending on how your training script registers summary writers and passes callbacks for tensor and scalar collection, so the graphics tabs also appear dynamically. The following screenshots show example views of each tab, visualizing two training jobs that collected metrics for the time series, scalar, graph, distribution, and histogram plugins.

**The TIME SERIES tab view**

![\[The TIME SERIES tab view.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/htb-time-series.png)


**The SCALARS tab view**

![\[The SCALARS tab view.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/htb-scalars.png)


**The GRAPHS tab view**

![\[The GRAPHS tab view.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/htb-graphs.png)


**The DISTRIBUTIONS tab view**

![\[The DISTRIBUTIONS tab view.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/htb-distribution.png)


**The HISTOGRAMS tab view**

![\[The HISTOGRAMS tab view.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/htb-histogram.png)


# Delete unused TensorBoard applications
<a name="debugger-htb-delete-app"></a>

After you are done with monitoring and experimenting with jobs in TensorBoard, shut the TensorBoard application down.

1. Open the SageMaker AI console.

1. On the left navigation pane, choose **Admin configurations**.

1. Under **Admin configurations**, choose **Domains**. 

1. Choose your domain.

1. Choose your user profile.

1. Under **Apps**, choose **Delete App** for the TensorBoard row.

1. Choose **Yes, delete app**.

1. Type **delete** in the text box, then choose **Delete**.

1. A blue message should appear at the top of the screen: **default is being deleted**.

# Amazon SageMaker Debugger
<a name="train-debugger"></a>

Debug model output tensors from machine learning training jobs in real time and detect non-converging issues using Amazon SageMaker Debugger.

## Amazon SageMaker Debugger features
<a name="debugger-features"></a>

A machine learning (ML) training job can have problems such as overfitting, saturated activation functions, and vanishing gradients, which can compromise model performance.

SageMaker Debugger provides tools to debug training jobs and resolve such problems to improve the performance of your model. Debugger also offers tools to send alerts when training anomalies are found, take actions against the problems, and identify the root cause of them by visualizing collected metrics and tensors.

SageMaker Debugger supports the Apache MXNet, PyTorch, TensorFlow, and XGBoost frameworks. For more information about available frameworks and versions supported by SageMaker Debugger, see [Supported frameworks and algorithms](debugger-supported-frameworks.md).

![\[Overview of how Amazon SageMaker Debugger works.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-main.png)


The high-level Debugger workflow is as follows:

1. Modify your training script with the `sagemaker-debugger` Python SDK if needed.

1. Configure a SageMaker training job with SageMaker Debugger.
   + Configure using the SageMaker AI Estimator API (for Python SDK).
   + Configure using the SageMaker AI [`CreateTrainingJob` request (for Boto3 or CLI)](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-createtrainingjob-api.html).
   + Configure [custom training containers](debugger-bring-your-own-container.md) with SageMaker Debugger.

1. Start a training job and monitor training issues in real time.
   + [List of Debugger built-in rules](debugger-built-in-rules.md).

1. Get alerts and take prompt actions against the training issues.
   + Receive texts and emails and stop training jobs when training issues are found using [Use Debugger built-in actions for rules](debugger-built-in-actions.md).
   + Set up your own actions using [Amazon CloudWatch Events and AWS Lambda](debugger-cloudwatch-lambda.md).

1. Explore deep analysis of the training issues.
   + For debugging model output tensors, see [Visualize Debugger Output Tensors in TensorBoard](debugger-enable-tensorboard-summaries.md).

1. Fix the issues, consider the suggestions provided by Debugger, and repeat steps 1–5 until you optimize your model and achieve target accuracy.
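
For the low-level API path in step 2, the Debugger-related portion of a `CreateTrainingJob` request can be sketched as a plain dictionary. The field names below come from the `CreateTrainingJob` API; the helper function and the rule-image placeholder are illustrative assumptions (look up the Region-specific Debugger rule evaluator image for your AWS Region).

```python
def debugger_request_fields(s3_output_path, rule_image_uri):
    """Return the Debugger-related fields of a CreateTrainingJob request.

    The hook saves the "losses" and "gradients" collections, and one
    built-in rule watches for vanishing gradients.
    """
    return {
        "DebugHookConfig": {
            "S3OutputPath": s3_output_path,
            "CollectionConfigurations": [
                {"CollectionName": "losses"},
                {"CollectionName": "gradients"},
            ],
        },
        "DebugRuleConfigurations": [
            {
                "RuleConfigurationName": "VanishingGradient",
                "RuleEvaluatorImage": rule_image_uri,
                "RuleParameters": {"rule_to_invoke": "VanishingGradient"},
            }
        ],
    }

fields = debugger_request_fields(
    "s3://your-bucket/debug-output",        # where the hook saves tensors
    "your-region-specific-rule-image-uri",  # placeholder
)
```

Merge these fields into the rest of your `CreateTrainingJob` request body when calling the API through Boto3 or the CLI.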

The SageMaker Debugger developer guide walks you through the following topics.

**Topics**
+ [Amazon SageMaker Debugger features](#debugger-features)
+ [Supported frameworks and algorithms](debugger-supported-frameworks.md)
+ [Amazon SageMaker Debugger architecture](debugger-how-it-works.md)
+ [Debugger tutorials](debugger-tutorial.md)
+ [Debugging training jobs using Amazon SageMaker Debugger](debugger-debug-training-jobs.md)
+ [List of Debugger built-in rules](debugger-built-in-rules.md)
+ [Creating custom rules using the Debugger client library](debugger-custom-rules.md)
+ [Use Debugger with custom training containers](debugger-bring-your-own-container.md)
+ [Configure Debugger using SageMaker API](debugger-createtrainingjob-api.md)
+ [Amazon SageMaker Debugger references](debugger-reference.md)

# Supported frameworks and algorithms
<a name="debugger-supported-frameworks"></a>

The following table shows SageMaker AI machine learning frameworks and algorithms supported by Debugger. 


| **SageMaker AI-supported frameworks and algorithms** | **Debugging output tensors** | 
| --- | --- |
|  [TensorFlow](https://sagemaker.readthedocs.io/en/stable/using_tf.html)   |  [AWS TensorFlow deep learning containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers) 1.15.4 or later  | 
|  [PyTorch](https://sagemaker.readthedocs.io/en/stable/using_pytorch.html)  |  [AWS PyTorch deep learning containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers) 1.5.0 or later  | 
|  [MXNet](https://sagemaker.readthedocs.io/en/stable/using_mxnet.html)   |  [AWS MXNet deep learning containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers) 1.6.0 or later  | 
|  [XGBoost](https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html)  |  1.0-1, 1.2-1, 1.3-1  | 
|  [SageMaker AI generic estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html)  |  [Custom training containers](debugger-bring-your-own-container.md) (available for TensorFlow, PyTorch, MXNet, and XGBoost with manual hook registration)  | 
+ **Debugging output tensors** – Track and debug model parameters, such as weights, gradients, biases, and scalar values of your training job. Supported frameworks and algorithms are Apache MXNet, TensorFlow, PyTorch, and XGBoost.
**Important**  
For the TensorFlow framework with Keras, SageMaker Debugger deprecates the zero code change support for debugging models built using the `tf.keras` modules of TensorFlow 2.6 and later. This is due to breaking changes announced in the [TensorFlow 2.6.0 release note](https://github.com/tensorflow/tensorflow/releases/tag/v2.6.0). For instructions on how to update your training script, see [Adapt your TensorFlow training script](debugger-modify-script-tensorflow.md).
**Important**  
In PyTorch v1.12.0 and later, SageMaker Debugger deprecates the zero code change support for debugging models.  
This is due to breaking changes that cause SageMaker Debugger to interfere with the `torch.jit` functionality. For instructions on how to update your training script, see [Adapt your PyTorch training script](debugger-modify-script-pytorch.md).

If the framework or algorithm that you want to train and debug is not listed in the table, go to the [AWS Discussion Forum](https://forums.aws.amazon.com/) and leave feedback on SageMaker Debugger.

## AWS Regions
<a name="debugger-support-aws-regions"></a>

Amazon SageMaker Debugger is available in all regions where Amazon SageMaker AI is in service except the following region.
+ Asia Pacific (Jakarta): `ap-southeast-3`

To find if Amazon SageMaker AI is in service in your AWS Region, see [AWS Regional Services](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/).

## Use Debugger with Custom Training Containers
<a name="debugger-byoc-intro"></a>

Bring your own training containers to SageMaker AI and gain insights into your training jobs using Debugger. Maximize your work efficiency by optimizing your model on Amazon EC2 instances using Debugger's monitoring and debugging features.

For more information about how to build your training container with the `sagemaker-debugger` client library, push it to the Amazon Elastic Container Registry (Amazon ECR), and monitor and debug, see [Use Debugger with custom training containers](debugger-bring-your-own-container.md).

## Debugger Open-Source GitHub Repositories
<a name="debugger-opensource"></a>

Debugger APIs are provided through the SageMaker Python SDK and designed to construct Debugger hook and rule configurations for the SageMaker AI [ CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) and [ DescribeTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html) API operations. The `sagemaker-debugger` client library provides tools to register *hooks* and access the training data through its *trial* feature, all through its flexible and powerful API operations. It supports the machine learning frameworks TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6 and later. 

For direct resources about the Debugger and `sagemaker-debugger` API operations, see the following links: 
+ [The Amazon SageMaker Python SDK documentation](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_debugger.html)
+ [The Amazon SageMaker Python SDK - Debugger APIs](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html)
+ [The `sagemaker-debugger` Python SDK documentation](https://sagemaker-debugger.readthedocs.io/en/website/index.html) for [the Amazon SageMaker Debugger open source client library](https://github.com/awslabs/sagemaker-debugger#amazon-sagemaker-debugger)
+ [The `sagemaker-debugger` PyPI](https://pypi.org/project/smdebug/)

If you use the SDK for Java to conduct SageMaker training jobs and want to configure Debugger APIs, see the following references:
+ [Amazon SageMaker Debugger APIs](debugger-reference.md#debugger-apis)
+ [Configure Debugger using SageMaker API](debugger-createtrainingjob-api.md)

# Amazon SageMaker Debugger architecture
<a name="debugger-how-it-works"></a>

This topic walks you through a high-level overview of the Amazon SageMaker Debugger workflow.

Debugger supports profiling functionality for *performance optimization* to identify computation issues, such as system bottlenecks and underutilization, and to help optimize hardware resource utilization at scale. 

Debugger's debugging functionality for *model optimization* is about analyzing non-converging training issues that can arise while minimizing the loss functions using optimization algorithms, such as gradient descent and its variations. 

The following diagram shows the architecture of SageMaker Debugger. The blocks with bold boundary lines are the components that Debugger manages in order to analyze your training job. 

![\[Overview of how Amazon SageMaker Debugger works.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger_new_diagram.png)


Debugger stores the following data from your training jobs in your secured Amazon S3 bucket:
+ **Output tensors** – Collections of scalars and model parameters that are continuously updated during the forward and backward passes while training ML models. The output tensors include scalar values (accuracy and loss) and matrices (weights, gradients, input layers, and output layers).
**Note**  
By default, Debugger monitors and debugs SageMaker training jobs without any Debugger-specific parameters configured in SageMaker AI estimators. Debugger collects system metrics every 500 milliseconds and basic output tensors (scalar outputs such as loss and accuracy) every 500 steps. It also runs the `ProfilerReport` rule to analyze the system metrics and aggregate the Studio Debugger insights dashboard and a profiling report. Debugger saves the output data in your secured Amazon S3 bucket.
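To put those default intervals in perspective, here is a rough back-of-the-envelope count of the data points collected for a hypothetical one-hour training job with 5,000 steps (the job duration and step count are illustrative assumptions, not values from this documentation):

```python
# Rough count of data points that Debugger's defaults would collect
# for a hypothetical 1-hour training job with 5,000 steps (illustrative).
job_seconds = 3600                              # assumed job duration
total_steps = 5000                              # assumed training steps
system_metric_samples = int(job_seconds / 0.5)  # one sample every 500 ms
tensor_saves = total_steps // 500               # one basic tensor save every 500 steps
print(system_metric_samples, tensor_saves)      # → 7200 10
```

This is only an order-of-magnitude sketch; actual volumes depend on your job length, step count, and any custom hook configuration.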

The Debugger built-in rules run on processing containers, which are designed to evaluate machine learning models by processing the training data collected in your S3 bucket (see [Process Data and Evaluate Models](https://docs.aws.amazon.com//sagemaker/latest/dg/processing-job.html)). The built-in rules are fully managed by Debugger. You can also create your own rules customized to your model to watch for any issues you want to monitor. 

# Debugger tutorials
<a name="debugger-tutorial"></a>

The following topics walk you through tutorials from the basics to advanced use cases of monitoring, profiling, and debugging SageMaker training jobs using Debugger. Explore the Debugger features and learn how you can debug and improve your machine learning models efficiently by using Debugger.

**Topics**
+ [Debugger tutorial videos](debugger-videos.md)
+ [Debugger example notebooks](debugger-notebooks.md)
+ [Debugger advanced demos and visualization](debugger-visualization.md)

# Debugger tutorial videos
<a name="debugger-videos"></a>

The following videos provide a tour of Amazon SageMaker Debugger capabilities using SageMaker Studio and SageMaker AI notebook instances. 

**Topics**
+ [Debugging models with Amazon SageMaker Debugger in Studio Classic](#debugger-video-get-started)
+ [Deep dive on Amazon SageMaker Debugger and SageMaker AI model monitor](#debugger-video-dive-deep)

## Debugging models with Amazon SageMaker Debugger in Studio Classic
<a name="debugger-video-get-started"></a>

*Julien Simon, AWS Technical Evangelist | Length: 14 minutes 17 seconds*

This tutorial video demonstrates how to use Amazon SageMaker Debugger to capture and inspect debugging information from a training model. The example training model used in this video is a simple convolutional neural network (CNN) based on Keras with the TensorFlow backend. The SageMaker AI TensorFlow framework and Debugger enable you to build an estimator directly from the training script and debug the training job.

[![AWS Videos](http://img.youtube.com/vi/MqPdTj0Znwg/0.jpg)](http://www.youtube.com/watch?v=MqPdTj0Znwg)


You can find the example notebook in the video in [ this Studio Demo repository](https://gitlab.com/juliensimon/amazon-studio-demos/-/tree/master) provided by the author. You need to clone the `debugger.ipynb` notebook file and the `mnist_keras_tf.py` training script to your SageMaker Studio or a SageMaker notebook instance. After you clone the two files, specify the path `keras_script_path` to the `mnist_keras_tf.py` file inside the `debugger.ipynb` notebook. For example, if you cloned the two files in the same directory, set it as `keras_script_path = "mnist_keras_tf.py"`.

## Deep dive on Amazon SageMaker Debugger and SageMaker AI model monitor
<a name="debugger-video-dive-deep"></a>

*Julien Simon, AWS Technical Evangelist | Length: 44 minutes 34 seconds*

This video session explores advanced features of Debugger and SageMaker Model Monitor that help boost productivity and the quality of your models. First, this video shows how to detect and fix training issues, visualize tensors, and improve models with Debugger. Next, at 22:41, the video shows how to monitor models in production and identify prediction issues such as missing features or data drift using SageMaker AI Model Monitor. Finally, it offers cost optimization tips to help you make the most of your machine learning budget.

[![AWS Videos](http://img.youtube.com/vi/0zqoeZxakOI/0.jpg)](http://www.youtube.com/watch?v=0zqoeZxakOI)


You can find the example notebook in the video in [ this AWS Dev Days 2020 repository](https://gitlab.com/juliensimon/awsdevdays2020/-/tree/master/mls1) offered by the author.

# Debugger example notebooks
<a name="debugger-notebooks"></a>

[SageMaker Debugger example notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/) are provided in the [aws/amazon-sagemaker-examples](https://github.com/aws/amazon-sagemaker-examples) repository. The Debugger example notebooks walk you through basic to advanced use cases of debugging and profiling training jobs. 

We recommend that you run the example notebooks on SageMaker Studio or a SageMaker Notebook instance because most of the examples are designed for training jobs in the SageMaker AI ecosystem, including Amazon EC2, Amazon S3, and Amazon SageMaker Python SDK. 

To clone the example repository to SageMaker Studio, follow the instructions at [Amazon SageMaker Studio Tour](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-end-to-end.html).

**Important**  
To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the `SMDebug` client library. In your iPython kernel, Jupyter Notebook, or JupyterLab environment, run the following code to install the latest versions of the libraries and restart the kernel.  

```
import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)
```

## Debugger example notebooks for profiling training jobs
<a name="debugger-notebooks-profiling"></a>

The following list shows Debugger example notebooks introducing Debugger's adaptability to monitor and profile training jobs for various machine learning models, datasets, and frameworks.


| Notebook Title | Framework | Model | Dataset | Description | 
| --- | --- | --- | --- | --- | 
|  [Amazon SageMaker Debugger Profiling Data Analysis](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/debugger_interactive_analysis_profiling/interactive_analysis_profiling_data.html)  |  TensorFlow  |  Keras ResNet50  | Cifar-10 |  This notebook provides an introduction to interactive analysis of profiled data captured by SageMaker Debugger. Explore the full functionality of the `SMDebug` interactive analysis tools.  | 
|  [Profile machine learning training with Amazon SageMaker Debugger ](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_nlp_sentiment_analysis/sentiment-analysis-tf-distributed-training-bringyourownscript.html)  |  TensorFlow  |  1-D Convolutional Neural Network  |  IMDB dataset  |  Profile a TensorFlow 1-D CNN for sentiment analysis of IMDB data that consists of movie reviews labeled as having positive or negative sentiment. Explore the Studio Debugger insights and Debugger profiling report.  | 
|  [Profiling TensorFlow ResNet model training with various distributed training settings](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow_profiling)  |  TensorFlow  | ResNet50 | Cifar-10 |  Run TensorFlow training jobs with various distributed training settings, monitor system resource utilization, and profile model performance using Debugger.  | 
|  [Profiling PyTorch ResNet model training with various distributed training settings](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/pytorch_profiling)   | PyTorch |  ResNet50  | Cifar-10 |  Run PyTorch training jobs with various distributed training settings, monitor system resource utilization, and profile model performance using Debugger.  | 

## Debugger example notebooks for analyzing model parameters
<a name="debugger-notebooks-debugging"></a>

The following list shows Debugger example notebooks introducing Debugger's adaptability to debug training jobs for various machine learning models, datasets, and frameworks.


| Notebook Title | Framework | Model | Dataset | Description | 
| --- | --- | --- | --- | --- | 
|  [Amazon SageMaker Debugger - Use built-in rule](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow_builtin_rule)  |  TensorFlow  |  Convolutional Neural Network  | MNIST |  Use the Amazon SageMaker Debugger built-in rules for debugging a TensorFlow model.  | 
|  [Amazon SageMaker Debugger - Tensorflow 2.1](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow2)  |  TensorFlow  |  ResNet50  | Cifar-10 |  Use the Amazon SageMaker Debugger hook configuration and built-in rules for debugging a model with the TensorFlow 2.1 framework.  | 
|  [Visualizing Debugging Tensors of MXNet training](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mnist_tensor_plot)  |  MXNet  |  Gluon Convolutional Neural Network  | Fashion MNIST |  Run a training job and configure SageMaker Debugger to store all tensors from this job, then visualize those tensors in a notebook.  | 
|  [Enable Spot Training with Amazon SageMaker Debugger](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mxnet_spot_training)   | MXNet |  Gluon Convolutional Neural Network  | Fashion MNIST |  Learn how Debugger collects tensor data from a training job on a spot instance, and how to use the Debugger built-in rules with managed spot training.  | 
| [Explain an XGBoost model that predicts an individual’s income with Amazon SageMaker Debugger](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/xgboost_census_explanations/xgboost-census-debugger-rules.html) | XGBoost |  XGBoost Regression  |  [Adult Census dataset](https://archive.ics.uci.edu/ml/datasets/adult)  | Learn how to use the Debugger hook and built-in rules for collecting and visualizing tensor data from an XGBoost regression model, such as loss values, features, and SHAP values. | 

To find advanced visualizations of model parameters and use cases, see the next topic at [Debugger advanced demos and visualization](debugger-visualization.md).

# Debugger advanced demos and visualization
<a name="debugger-visualization"></a>

The following demos walk you through advanced use cases and visualization scripts using Debugger.

**Topics**
+ [Training and pruning models with Amazon SageMaker Experiments and Debugger](#debugger-visualization-video-model-pruning)
+ [Using SageMaker Debugger to monitor a convolutional autoencoder model training](#debugger-visualization-autoencoder_mnist)
+ [Using SageMaker Debugger to monitor attentions in BERT model training](#debugger-visualization-bert_attention_head_view)
+ [Using SageMaker Debugger to visualize class activation maps in convolutional neural networks (CNNs)](#debugger-visualization-cnn_class_activation_maps)

## Training and pruning models with Amazon SageMaker Experiments and Debugger
<a name="debugger-visualization-video-model-pruning"></a>

*Dr. Nathalie Rauschmayr, AWS Applied Scientist | Length: 49 minutes 26 seconds*

[![AWS Videos](http://img.youtube.com/vi/Tnv6HsT1r4I/0.jpg)](http://www.youtube.com/watch?v=Tnv6HsT1r4I)


Find out how Amazon SageMaker Experiments and Debugger can simplify the management of your training jobs. Amazon SageMaker Debugger provides transparent visibility into training jobs and saves training metrics into your Amazon S3 bucket. SageMaker Experiments enables you to call the training information as *trials* through SageMaker Studio and supports visualization of the training job. This helps you keep model quality high while pruning less important parameters based on importance rank.

This video demonstrates a *model pruning* technique that makes pre-trained ResNet50 and AlexNet models lighter and more affordable while maintaining high standards for model accuracy.

A SageMaker AI estimator trains these models, supplied from the PyTorch model zoo, in AWS Deep Learning Containers with the PyTorch framework, and Debugger extracts training metrics from the training process.

The video also demonstrates how to set up a Debugger custom rule to watch the accuracy of a pruned model, to trigger an Amazon CloudWatch event and an AWS Lambda function when the accuracy hits a threshold, and to automatically stop the pruning process to avoid redundant iterations. 

Learning objectives are as follows: 
+  Learn how to use SageMaker AI to accelerate ML model training and improve model quality. 
+  Understand how to manage training iterations with SageMaker Experiments by automatically capturing input parameters, configurations, and results. 
+  Discover how Debugger makes the training process transparent by automatically capturing real-time tensor data from metrics such as weights, gradients, and activation outputs of convolutional neural networks.
+ Use CloudWatch to trigger Lambda when Debugger catches issues.
+  Master the SageMaker training process using SageMaker Experiments and Debugger.

You can find the notebooks and training scripts used in this video from [SageMaker Debugger PyTorch Iterative Model Pruning](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/pytorch_iterative_model_pruning).

The following image shows how the iterative model pruning process reduces the size of AlexNet by cutting out the 100 least significant filters based on importance rank evaluated by activation outputs and gradients.

The pruning process reduced the initial 50 million parameters to 18 million. It also reduced the estimated model size from 201 MB to 73 MB. 
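Those reported sizes are roughly consistent with storing each parameter as a 32-bit float, which a quick check confirms (the 4-bytes-per-parameter figure is our assumption, not a value from this documentation):

```python
# Sanity-check the reported model sizes against the parameter counts,
# assuming 4 bytes (float32) per parameter.
bytes_per_param = 4
before_mb = 50_000_000 * bytes_per_param / 1e6   # ~200 MB (reported: 201 MB)
after_mb = 18_000_000 * bytes_per_param / 1e6    # ~72 MB (reported: 73 MB)
print(before_mb, after_mb)  # → 200.0 72.0
```

The small gaps between the computed and reported sizes are plausibly serialization overhead in the saved model files.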

![\[An image containing model pruning result output visualizations\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-model-pruning-results-alexnet.gif)


You also need to track model accuracy, and the following image shows how you can plot the model pruning process to visualize changes in model accuracy based on the number of parameters in SageMaker Studio.

![\[An image of tensor visualization using Debugger in SageMaker Studio\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-model-pruning-studio.png)


In SageMaker Studio, choose the **Experiments** tab, select a list of tensors saved by Debugger from the pruning process, and then compose a **Trial Component List** panel. Select all ten iterations and then choose **Add chart** to create a **Trial Component Chart**. After you decide on a model to deploy, choose the trial component and choose a menu to perform an action or choose **Deploy model**.

**Note**  
To deploy a model through SageMaker Studio using the following notebook example, add a line at the end of the `train` function in the `train.py` script.  

```
# In the train.py script, look for the train function in line 58.
def train(epochs, batch_size, learning_rate):
    ...
        print('acc:{:.4f}'.format(correct/total))
        hook.save_scalar("accuracy", correct/total, sm_metric=True)

    # Add the following code to line 128 of the train.py script to save the pruned models
    # under the current SageMaker Studio model directory
    torch.save(model.state_dict(), os.environ['SM_MODEL_DIR'] + '/model.pt')
```

## [Using SageMaker Debugger to monitor a convolutional autoencoder model training](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/model_specific_realtime_analysis/autoencoder_mnist/autoencoder_mnist.html)
<a name="debugger-visualization-autoencoder_mnist"></a>

This notebook demonstrates how SageMaker Debugger visualizes tensors from an unsupervised (or self-supervised) learning process on an MNIST image dataset of handwritten numbers.

The training model in this notebook is a convolutional autoencoder with the MXNet framework. The convolutional autoencoder has a bottleneck-shaped convolutional neural network that consists of an encoder part and a decoder part. 

The encoder in this example has two convolution layers to produce compressed representation (latent variables) of the input images. In this case, the encoder produces a latent variable of size (1, 20) from an original input image of size (28, 28) and significantly reduces the size of data for training by 40 times.
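The factor of roughly 40 follows directly from the tensor shapes (a quick check using flattened sizes only):

```python
# Compression factor of the encoder: flattened input image vs. latent variable.
input_size = 28 * 28   # flattened MNIST image of size (28, 28)
latent_size = 1 * 20   # latent variable of size (1, 20)
ratio = input_size / latent_size
print(f"{ratio:.1f}x")  # → 39.2x
```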

The decoder has two *deconvolutional* layers and ensures that the latent variables preserve key information by reconstructing output images.

The convolutional encoder reduces the input data size for clustering algorithms and thereby improves the performance of algorithms such as k-means, k-NN, and t-Distributed Stochastic Neighbor Embedding (t-SNE).

This notebook example demonstrates how to visualize the latent variables using Debugger, as shown in the following animation. It also demonstrates how the t-SNE algorithm classifies the latent variables into ten clusters and projects them into a two-dimensional space. The scatter plot color scheme on the right side of the image reflects the true values to show how well the autoencoder model and t-SNE algorithm organize the latent variables into the clusters.

![\[A conceptual image of convolutional autoencoder\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-cnn-autoencoder-plot.gif)


## [Using SageMaker Debugger to monitor attentions in BERT model training](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/model_specific_realtime_analysis/bert_attention_head_view/bert_attention_head_view.html)
<a name="debugger-visualization-bert_attention_head_view"></a>

Bidirectional Encoder Representations from Transformers (BERT) is a language representation model. As the model's name reflects, the BERT model builds on *transfer learning* and the *Transformer model* for natural language processing (NLP).

The BERT model is pre-trained on unsupervised tasks such as predicting missing words in a sentence or predicting the next sentence that naturally follows a previous sentence. The training data contains 3.3 billion words (tokens) of English text, from sources such as Wikipedia and electronic books. For a simple example, the BERT model can give a high *attention* to appropriate verb tokens or pronoun tokens from a subject token.

The pre-trained BERT model can be fine-tuned with an additional output layer to achieve state-of-the-art model training in NLP tasks, such as automated responses to questions, text classification, and many others. 

Debugger collects tensors from the fine-tuning process. In the context of NLP, the weight of neurons is called *attention*. 

This notebook demonstrates how to use the [ pre-trained BERT model from the GluonNLP model zoo](https://gluon-nlp.mxnet.io/model_zoo/bert/index.html) on the Stanford Question and Answering dataset and how to set up SageMaker Debugger to monitor the training job.

Plotting *attention scores* and individual neurons in the query and key vectors can help to identify causes of incorrect model predictions. With SageMaker AI Debugger, you can retrieve the tensors and plot the *attention-head view* in real time as training progresses and understand what the model is learning.

The following animation shows the attention scores of the first 20 input tokens for ten iterations in the training job provided in the notebook example.

![\[An animation of the attention scores\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-attention_scores.gif)


## [Using SageMaker Debugger to visualize class activation maps in convolutional neural networks (CNNs)](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/model_specific_realtime_analysis/cnn_class_activation_maps/cnn_class_activation_maps.html)
<a name="debugger-visualization-cnn_class_activation_maps"></a>

This notebook demonstrates how to use SageMaker Debugger to plot class activation maps for image detection and classification in convolutional neural networks (CNNs). In deep learning, a *convolutional neural network (CNN or ConvNet)* is a class of deep neural networks, most commonly applied to analyzing visual imagery. One of the applications that adopts the class activation maps is self-driving cars, which require instantaneous detection and classification of images such as traffic signs, roads, and obstacles.

In this notebook, the PyTorch ResNet model is trained on [the German Traffic Sign Dataset](http://benchmark.ini.rub.de/), which contains more than 40 classes of traffic-related objects and more than 50,000 images in total.

![\[An animation of CNN class activation maps\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-cnn-class-activation-maps.gif)


During the training process, SageMaker Debugger collects tensors to plot the class activation maps in real time. As shown in the animated image, the class activation map (also called a *saliency map*) highlights regions with high activation in red. 

Using tensors captured by Debugger, you can visualize how the activation map evolves during the model training. The model starts by detecting the edge on the lower-left corner at the beginning of the training job. As the training progresses, the focus shifts to the center and detects the speed limit sign, and the model successfully predicts the input image as Class 3, which is a class of speed limit 60km/h signs, with a 97% confidence level.

# Debugging training jobs using Amazon SageMaker Debugger
<a name="debugger-debug-training-jobs"></a>

To prepare your training script and run training jobs with SageMaker Debugger to debug model training progress, you follow the typical two-step process: modify your training script using the `sagemaker-debugger` Python SDK, and construct a SageMaker AI estimator using the SageMaker Python SDK. Go through the following topics to learn how to use SageMaker Debugger's debugging functionality.
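As a minimal sketch of the second step, a PyTorch estimator with a Debugger hook configuration and one built-in rule might look like the following. This is a configuration sketch, not a definitive implementation: the entry point, instance type, framework version, and save interval are illustrative choices, and it assumes the SageMaker Python SDK v2 running in an environment with an execution role.

```
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import (
    Rule, rule_configs, DebuggerHookConfig, CollectionConfig
)

estimator = PyTorch(
    entry_point="train.py",            # your adapted training script
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",     # illustrative instance type
    framework_version="1.12.0",        # illustrative framework version
    py_version="py38",
    # Save the "losses" collection every 50 steps (illustrative interval).
    debugger_hook_config=DebuggerHookConfig(
        collection_configs=[
            CollectionConfig(name="losses", parameters={"save_interval": "50"})
        ]
    ),
    # Run a built-in rule against the saved tensors.
    rules=[Rule.sagemaker(rule_configs.loss_not_decreasing())],
)

estimator.fit()
```

Passing any of the Debugger configuration objects is what causes SageMaker AI to place the JSON configuration file on the training instance that your adapted training script's hook picks up.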

**Topics**
+ [Adapting your training script to register a hook](debugger-modify-script.md)
+ [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md)
+ [SageMaker Debugger interactive report for XGBoost](debugger-report-xgboost.md)
+ [Action on Amazon SageMaker Debugger rules](debugger-action-on-rules.md)
+ [Visualize Amazon SageMaker Debugger output tensors in TensorBoard](debugger-enable-tensorboard-summaries.md)

# Adapting your training script to register a hook
<a name="debugger-modify-script"></a>

Amazon SageMaker Debugger comes with a client library called the [`sagemaker-debugger` Python SDK](https://sagemaker-debugger.readthedocs.io/en/website). The `sagemaker-debugger` Python SDK provides tools for adapting your training script before training and analysis tools after training. On this page, you'll learn how to adapt your training script using the client library. 

The `sagemaker-debugger` Python SDK provides wrapper functions that help register a hook to extract model tensors, without altering your training script. To get started with collecting model output tensors and debug them to find training issues, make the following modifications in your training script.

**Tip**  
While you're following this page, use the [`sagemaker-debugger` open source SDK documentation](https://sagemaker-debugger.readthedocs.io/en/website/index.html) for API references.

**Topics**
+ [Adapt your PyTorch training script](debugger-modify-script-pytorch.md)
+ [Adapt your TensorFlow training script](debugger-modify-script-tensorflow.md)

# Adapt your PyTorch training script
<a name="debugger-modify-script-pytorch"></a>

To start collecting model output tensors and debug training issues, make the following modifications to your PyTorch training script.

**Note**  
SageMaker Debugger cannot collect model output tensors from the [torch.nn.functional](https://pytorch.org/docs/stable/nn.functional.html) API operations. When you write a PyTorch training script, we recommend using the [torch.nn](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) modules instead.

## For PyTorch 1.12.0
<a name="debugger-modify-script-pytorch-1-12-0"></a>

If you bring a PyTorch training script, you can run the training job and extract model output tensors with a few additional code lines in your training script. You need to use the [hook APIs](https://sagemaker-debugger.readthedocs.io/en/website/hook-api.html) in the `sagemaker-debugger` client library. Walk through the following instructions that break down the steps with code examples.

1. Create a hook.

   **(Recommended) For training jobs within SageMaker AI**

   ```
   import smdebug.pytorch as smd
   hook=smd.get_hook(create_if_not_exists=True)
   ```

   When you launch a training job in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md) with any of the `DebuggerHookConfig`, `TensorBoardConfig`, or `Rules` parameters in your estimator, SageMaker AI adds a JSON configuration file to your training instance that is picked up by the `get_hook` function. Note that if you do not include any of the configuration APIs in your estimator, there is no configuration file for the hook to find, and the function returns `None`.

   **(Optional) For training jobs outside SageMaker AI**

   If you run training jobs in local mode, directly on SageMaker Notebook instances, Amazon EC2 instances, or your own local devices, use the `smd.Hook` class to create a hook. However, this approach can only store the tensor collections for TensorBoard visualization. SageMaker Debugger's built-in rules don't work with local mode because the rules require SageMaker AI ML training instances and Amazon S3 to store outputs from the remote instances in real time. The `smd.get_hook` API returns `None` in this case. 

   If you want to create a manual hook to save tensors in local mode, use the following code snippet with the logic to check if the `smd.get_hook` API returns `None` and create a manual hook using the `smd.Hook` class. Note that you can specify any output directory in your local machine.

   ```
   import smdebug.pytorch as smd
   hook=smd.get_hook(create_if_not_exists=True)
   
   if hook is None:
       hook=smd.Hook(
           out_dir='/path/to/your/local/output/',
           export_tensorboard=True
       )
   ```

1. Wrap your model with the hook’s class methods.

   The `hook.register_module()` method takes your model and iterates through each layer, looking for any tensors that match the regular expressions that you provide through the configuration in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md). The tensors that this hook method can collect are weights, biases, activations, gradients, inputs, and outputs.

   ```
   hook.register_module(model)
   ```
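Conceptually, the hook keeps only the tensors whose names match the configured regular expressions. The following sketch illustrates that filtering step with hypothetical tensor names; it is not the `smdebug` implementation:

```
import re

def select_tensors(tensor_names, include_regex):
    # Keep only the tensor names that match any of the configured
    # patterns, mirroring the regex-based filtering described above.
    patterns = [re.compile(p) for p in include_regex]
    return [n for n in tensor_names if any(p.search(n) for p in patterns)]

names = ["conv1.weight", "conv1.bias", "fc.weight", "relu_output_0"]
print(select_tensors(names, [r"weight", r"relu"]))
# -> ['conv1.weight', 'fc.weight', 'relu_output_0']
```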
**Tip**  
If you collect the entire output tensors from a large deep learning model, the total size of those collections can grow quickly and might cause bottlenecks. If you want to save specific tensors, you can also use the `hook.save_tensor()` method. This method helps you pick the variable for the specific tensor and save it to a custom collection with a name of your choice. For more information, see [step 7](#debugger-modify-script-pytorch-save-custom-tensor) of this instruction.

1. Wrap the loss function with the hook’s class methods.

   Use the `hook.register_loss` method to wrap the loss function. It extracts loss values at every `save_interval` that you set during configuration in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md), and saves them to the `"losses"` collection.

   ```
   hook.register_loss(loss_function)
   ```

1. Add `hook.set_mode(ModeKeys.TRAIN)` in the train block. This indicates that tensors are collected during the training phase.

   ```
   def train():
       ...
       hook.set_mode(ModeKeys.TRAIN)
   ```

1. Add `hook.set_mode(ModeKeys.EVAL)` in the validation block. This indicates that tensors are collected during the validation phase.

   ```
   def validation():
       ...
       hook.set_mode(ModeKeys.EVAL)
   ```

1. Use [`hook.save_scalar()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-constructor.html#smdebug.core.hook.BaseHook.save_scalar) to save custom scalars. You can save scalar values that aren’t in your model. For example, if you want to record the accuracy values computed during evaluation, add the following line of code below the line where you calculate accuracy.

   ```
   hook.save_scalar("accuracy", accuracy)
   ```

   Note that you need to provide a string as the first argument to name the custom scalar collection. This name is used for visualizing the scalar values in TensorBoard, and it can be any string you want.

1. <a name="debugger-modify-script-pytorch-save-custom-tensor"></a>Use [`hook.save_tensor()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-constructor.html#smdebug.core.hook.BaseHook.save_tensor) to save custom tensors. Similar to [`hook.save_scalar()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-constructor.html#smdebug.core.hook.BaseHook.save_scalar), you can save additional tensors by defining your own tensor collection. For example, you can extract input image data that is passed into the model and save it as a custom tensor by adding the following line of code, where `"images"` is an example name for the custom tensor and `image_inputs` is an example variable for the input image data.

   ```
   hook.save_tensor("images", image_inputs)
   ```

   Note that you must provide a string as the first argument to name the custom tensor. `hook.save_tensor()` has a third argument, `collections_to_write`, to specify the tensor collection in which to save the custom tensor. The default is `collections_to_write="default"`. If you don't explicitly specify the third argument, the custom tensor is saved to the `"default"` tensor collection.
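Conceptually, each call to `save_tensor()` appends the value to one or more named collections. The following toy sketch illustrates that grouping behavior; it is not the `smdebug` implementation:

```
from collections import defaultdict

class ToyHook:
    """Toy model of how saved tensors land in named collections."""
    def __init__(self):
        self.collections = defaultdict(list)

    def save_tensor(self, name, value, collections_to_write="default"):
        # A single collection name or a list of names is accepted;
        # the value is appended to every named collection.
        if isinstance(collections_to_write, str):
            collections_to_write = [collections_to_write]
        for coll in collections_to_write:
            self.collections[coll].append((name, value))

hook = ToyHook()
hook.save_tensor("images", [[0.1, 0.9]])          # goes to "default"
hook.save_tensor("probs", [0.7], ["my_outputs"])  # goes to a custom collection
print(sorted(hook.collections))  # -> ['default', 'my_outputs']
```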

After you have completed adapting your training script, proceed to [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md).

# Adapt your TensorFlow training script
<a name="debugger-modify-script-tensorflow"></a>

To start collecting model output tensors and debug training issues, make the following modifications to your TensorFlow training script.

**Create a hook for training jobs within SageMaker AI**

```
import smdebug.tensorflow as smd

hook=smd.get_hook(hook_type="keras", create_if_not_exists=True)
```

This creates a hook when you start a SageMaker training job. When you launch a training job in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md) with any of the `DebuggerHookConfig`, `TensorBoardConfig`, or `Rules` in your estimator, SageMaker AI adds a JSON configuration file to your training instance that is picked up by the `smd.get_hook` method. Note that if you do not include any of the configuration APIs in your estimator, there will be no configuration file for the hook to find, and the function returns `None`.

**(Optional) Create a hook for training jobs outside SageMaker AI**

If you run training jobs in local mode, directly on SageMaker Notebook instances, Amazon EC2 instances, or your own local devices, use a hook class such as `smd.KerasHook` to create a hook. However, this approach can only store the tensor collections, which are usable for TensorBoard visualization. SageMaker Debugger’s built-in Rules don’t work with local mode. The `smd.get_hook` method also returns `None` in this case.

If you want to create a manual hook, use the following code snippet, which checks whether `smd.get_hook` returns `None` and creates a manual hook using the `smd.KerasHook` class.

```
import smdebug.tensorflow as smd

hook=smd.get_hook(hook_type="keras", create_if_not_exists=True) 

if hook is None:
    hook=smd.KerasHook(
        out_dir='/path/to/your/local/output/',
        export_tensorboard=True
    )
```

After adding the hook creation code, proceed to the following topic for TensorFlow Keras.

**Note**  
SageMaker Debugger currently supports TensorFlow Keras only.

## Register the hook in your TensorFlow Keras training script
<a name="debugger-modify-script-tensorflow-keras"></a>

The following procedure walks you through how to use the hook and its methods to collect output scalars and tensors from your model and optimizer.

1. Wrap your Keras model and optimizer with the hook’s class methods.

   The `hook.register_model()` method takes your model and iterates through each layer, looking for any tensors that match with regular expressions that you’ll provide through the configuration in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md). The collectable tensors through this hook method are weights, biases, and activations.

   ```
   model=tf.keras.Model(...)
   hook.register_model(model)
   ```

1. Wrap the optimizer with the `hook.wrap_optimizer()` method.

   ```
   optimizer=tf.keras.optimizers.Adam(...)
   optimizer=hook.wrap_optimizer(optimizer)
   ```

1. Compile the model in eager mode in TensorFlow.

   To collect tensors from the model, such as the input and output tensors of each layer, you must run the training in eager mode. Otherwise, SageMaker AI Debugger will not be able to collect the tensors. However, other tensors, such as model weights, biases, and the loss, can be collected without explicitly running in eager mode.

   ```
   model.compile(
       loss="categorical_crossentropy", 
       optimizer=optimizer, 
       metrics=["accuracy"],
       # Required for collecting tensors of each layer
       run_eagerly=True
   )
   ```

1. Register the hook to the Keras [`model.fit()`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) method.

   To collect the tensors from the hooks that you registered, add `callbacks=[hook]` to the Keras `model.fit()` method. This passes the `sagemaker-debugger` hook as a Keras callback.

   ```
   model.fit(
       X_train, Y_train,
       batch_size=batch_size,
       epochs=epoch,
       validation_data=(X_valid, Y_valid),
       shuffle=True, 
       callbacks=[hook]
   )
   ```

1. TensorFlow 2.x provides only symbolic gradient variables that do not provide access to their values. To collect gradients, wrap `tf.GradientTape` with the [`hook.wrap_tape()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-methods.html#tensorflow-specific-hook-api) method, which requires you to write your own training step as follows.

   ```
   def training_step(model, data, labels):
       with hook.wrap_tape(tf.GradientTape()) as tape:
           pred=model(data)
           loss_value=loss_fn(labels, pred)
       grads=tape.gradient(loss_value, model.trainable_variables)
       optimizer.apply_gradients(zip(grads, model.trainable_variables))
   ```

   By wrapping the tape, the `sagemaker-debugger` hook can identify output tensors such as gradients, parameters, and losses. The `hook.wrap_tape()` method wraps functions of the tape object, such as `push_tape()`, `pop_tape()`, and `gradient()`, so that SageMaker Debugger can set up its writers and save the tensors that are provided as input to `gradient()` (trainable variables and loss) and returned as output of `gradient()` (gradients).
**Note**  
To collect tensors with a custom training loop, make sure that you use eager mode. Otherwise, SageMaker Debugger is not able to collect any tensors.

For a full list of actions that the `sagemaker-debugger` hook APIs offer to construct hooks and save tensors, see [Hook Methods](https://sagemaker-debugger.readthedocs.io/en/website/hook-methods.html) in the *`sagemaker-debugger` Python SDK documentation*.

After you have completed adapting your training script, proceed to [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md).

# Launch training jobs with Debugger using the SageMaker Python SDK
<a name="debugger-configuration-for-debugging"></a>

To configure a SageMaker AI estimator with SageMaker Debugger, use [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) and specify Debugger-specific parameters. To fully utilize the debugging functionality, there are three parameters you need to configure: `debugger_hook_config`, `tensorboard_output_config`, and `rules`.

**Important**  
Before constructing and running the estimator fit method to launch a training job, make sure that you adapt your training script following the instructions at [Adapting your training script to register a hook](debugger-modify-script.md).

## Constructing a SageMaker AI Estimator with Debugger-specific parameters
<a name="debugger-configuration-structure"></a>

The code examples in this section show how to construct a SageMaker AI estimator with the Debugger-specific parameters.

**Note**  
The following code examples are templates for constructing the SageMaker AI framework estimators and not directly executable. You need to proceed to the next sections and configure the Debugger-specific parameters.

------
#### [ PyTorch ]

```
# An example of constructing a SageMaker AI PyTorch estimator
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=PyTorch(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12.0",
    py_version="py37",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ TensorFlow ]

```
# An example of constructing a SageMaker AI TensorFlow estimator
import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, ProfilerRule, Rule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport())
]

estimator=TensorFlow(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ MXNet ]

```
# An example of constructing a SageMaker AI MXNet estimator
import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=MXNet(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.7.0",
    py_version="py37",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ XGBoost ]

```
# An example of constructing a SageMaker AI XGBoost estimator
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=XGBoost(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.5-1",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ Generic estimator ]

```
# An example of constructing a SageMaker AI generic estimator using the XGBoost algorithm base image
import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

region=boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
```

------

Configure the following parameters to activate SageMaker Debugger:
+ `debugger_hook_config` (an object of [DebuggerHookConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig)) – Required to activate the hook in the training script that you adapted in [Adapting your training script to register a hook](debugger-modify-script.md), to configure the SageMaker training launcher (estimator) to collect output tensors from your training job, and to save the tensors to your secured S3 bucket or local machine. To learn how to configure the `debugger_hook_config` parameter, see [Configuring SageMaker Debugger to save tensors](debugger-configure-hook.md).
+ `rules` (a list of [Rule](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.Rule) objects) – Configure this parameter to activate SageMaker Debugger built-in rules that you want to run in real time. The built-in rules automatically analyze the output tensors saved in your secured S3 bucket to monitor the training progress of your model and find training issues. To learn how to configure the `rules` parameter, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md). To find a complete list of built-in rules for debugging output tensors, see [Debugger rule](debugger-built-in-rules.md#debugger-built-in-rules-Rule). If you want to create your own logic to detect training issues, see [Creating custom rules using the Debugger client library](debugger-custom-rules.md).
**Note**  
The built-in rules are available only through SageMaker training instances. You cannot use them in local mode.
+ `tensorboard_output_config` (an object of [TensorBoardOutputConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.TensorBoardOutputConfig)) – Configure SageMaker Debugger to collect output tensors in the TensorBoard-compatible format and save them to the S3 output path specified in the `TensorBoardOutputConfig` object. To learn more, see [Visualize Amazon SageMaker Debugger output tensors in TensorBoard](debugger-enable-tensorboard-summaries.md).
**Note**  
The `tensorboard_output_config` must be configured with the `debugger_hook_config` parameter, which also requires you to adapt your training script by adding the `sagemaker-debugger` hook.

**Note**  
SageMaker Debugger securely saves output tensors in subfolders of your S3 bucket. For example, the format of the default S3 bucket URI in your account is `s3://amzn-s3-demo-bucket-sagemaker-<region>-<12digit_account_id>/<base-job-name>/<debugger-subfolders>/`. SageMaker Debugger creates two subfolders: `debug-output` and `rule-output`. If you add the `tensorboard_output_config` parameter, you'll also find a `tensorboard-output` folder.
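For illustration only, the following helper sketches how the subfolder layout described in the preceding note maps to S3 URIs. The helper function and its arguments are hypothetical, not part of the SageMaker SDK:

```
def debugger_output_prefixes(bucket, base_job_name, tensorboard=False):
    # Build the Debugger output prefixes described above: debug-output
    # and rule-output always exist; tensorboard-output appears only
    # when tensorboard_output_config is configured.
    root = f"s3://{bucket}/{base_job_name}"
    prefixes = {
        "debug-output": f"{root}/debug-output/",
        "rule-output": f"{root}/rule-output/",
    }
    if tensorboard:
        prefixes["tensorboard-output"] = f"{root}/tensorboard-output/"
    return prefixes

# Hypothetical bucket and job names for illustration
print(debugger_output_prefixes(
    "amzn-s3-demo-bucket-sagemaker-us-east-1-111122223333",
    "debugger-demo"))
```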

See the following topics to find more examples of how to configure the Debugger-specific parameters in detail.

**Topics**
+ [Constructing a SageMaker AI Estimator with Debugger-specific parameters](#debugger-configuration-structure)
+ [Configuring SageMaker Debugger to save tensors](debugger-configure-hook.md)
+ [How to configure Debugger built-in rules](use-debugger-built-in-rules.md)
+ [Turn off Debugger](debugger-turn-off.md)
+ [Useful SageMaker AI estimator class methods for Debugger](debugger-estimator-classmethods.md)

# Configuring SageMaker Debugger to save tensors
<a name="debugger-configure-hook"></a>

*Tensors* are data collections of updated parameters from the backward and forward pass of each training iteration. SageMaker Debugger collects the output tensors to analyze the state of a training job. SageMaker Debugger's [CollectionConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig) and [DebuggerHookConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig) API operations provide methods for grouping tensors into *collections* and saving them to a target S3 bucket. The following topics show how to use the `CollectionConfig` and `DebuggerHookConfig` API operations, followed by examples of how to use the Debugger hook to save, access, and visualize output tensors.

While constructing a SageMaker AI estimator, activate SageMaker Debugger by specifying the `debugger_hook_config` parameter. The following topics include examples of how to set up the `debugger_hook_config` using the `CollectionConfig` and `DebuggerHookConfig` API operations to pull tensors out of your training jobs and save them.

**Note**  
After it is properly configured and activated, SageMaker Debugger saves the output tensors in a default S3 bucket, unless otherwise specified. The format of the default S3 bucket URI is `s3://amzn-s3-demo-bucket-sagemaker-<region>-<12digit_account_id>/<training-job-name>/debug-output/`.

**Topics**
+ [Configure tensor collections using the `CollectionConfig` API](debugger-configure-tensor-collections.md)
+ [Configure the `DebuggerHookConfig` API to save tensors](debugger-configure-tensor-hook.md)
+ [Example notebooks and code samples to configure Debugger hook](debugger-save-tensors.md)

# Configure tensor collections using the `CollectionConfig` API
<a name="debugger-configure-tensor-collections"></a>

Use the `CollectionConfig` API operation to configure tensor collections. Debugger provides pre-built tensor collections that cover a variety of parameter regular expressions (regex) when you use Debugger-supported deep learning frameworks and machine learning algorithms. As shown in the following example code, add the built-in tensor collections you want to debug.

```
from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(name="weights"),
    CollectionConfig(name="gradients")
]
```

The preceding collections set up the Debugger hook to save the tensors every 500 steps based on the default `save_interval` value.
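Conceptually, a `save_interval` of N means the hook records tensors at every step whose index is a multiple of N, including step 0. A minimal sketch of that schedule (the actual `smdebug` scheduling is richer, for example with per-mode intervals):

```
def saved_steps(total_steps, save_interval=500):
    # Steps at which the hook would record tensors, assuming the simple
    # "every N-th step" schedule described above (step 0 included).
    return [s for s in range(total_steps) if s % save_interval == 0]

print(saved_steps(2000))  # -> [0, 500, 1000, 1500]
```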

For a full list of available Debugger built-in collections, see [Debugger Built-in Collections](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#collection).

If you want to customize the built-in collections, such as changing the save intervals and tensor regex, use the following `CollectionConfig` template to adjust parameters.

```
from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(
        name="tensor_collection",
        parameters={
            "key_1": "value_1",
            "key_2": "value_2",
            ...
            "key_n": "value_n"
        }
    )
]
```

For more information about available parameter keys, see [CollectionConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). For example, the following code shows how you can adjust the save intervals of the `losses` tensor collection at different phases of training: save the training loss every 100 steps during the training phase and the validation loss every 10 steps during the validation phase.

```
from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(
        name="losses",
        parameters={
            "train.save_interval": "100",
            "eval.save_interval": "10"
        }
    )
]
```

**Tip**  
This tensor collection configuration object can be used for both [DebuggerHookConfig](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configure-hook.html#debugger-configure-tensor-hook) and [Rule](https://docs.aws.amazon.com/sagemaker/latest/dg/use-debugger-built-in-rules.html#debugger-built-in-rules-configuration-param-change) API operations.

# Configure the `DebuggerHookConfig` API to save tensors
<a name="debugger-configure-tensor-hook"></a>

Use the [DebuggerHookConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig) API to create a `debugger_hook_config` object using the `collection_configs` object you created in the previous step.

```
from sagemaker.debugger import DebuggerHookConfig

debugger_hook_config=DebuggerHookConfig(
    collection_configs=collection_configs
)
```

Debugger saves the model training output tensors into the default S3 bucket. The format of the default S3 bucket URI is `s3://amzn-s3-demo-bucket-sagemaker-<region>-<12digit_account_id>/<training-job-name>/debug-output/`.

If you want to specify an exact S3 bucket URI, use the following code example:

```
from sagemaker.debugger import DebuggerHookConfig

debugger_hook_config=DebuggerHookConfig(
    s3_output_path="specify-uri",
    collection_configs=collection_configs
)
```

For more information, see [DebuggerHookConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

# Example notebooks and code samples to configure Debugger hook
<a name="debugger-save-tensors"></a>

The following sections provide notebooks and code examples of how to use Debugger hook to save, access, and visualize output tensors.

**Topics**
+ [Tensor visualization example notebooks](#debugger-tensor-visualization-notebooks)
+ [Save tensors using Debugger built-in collections](#debugger-save-built-in-collections)
+ [Save tensors by modifying Debugger built-in collections](#debugger-save-modified-built-in-collections)
+ [Save tensors using Debugger custom collections](#debugger-save-custom-collections)

## Tensor visualization example notebooks
<a name="debugger-tensor-visualization-notebooks"></a>

The following notebook examples show advanced use of Amazon SageMaker Debugger for visualizing tensors. Debugger provides a transparent view into training deep learning models.
+ [Interactive Tensor Analysis in SageMaker Studio Notebook with MXNet](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mnist_tensor_analysis)

  This notebook example shows how to visualize saved tensors using Amazon SageMaker Debugger. By visualizing the tensors, you can see how the tensor values change while training deep learning algorithms. This notebook includes a training job with a poorly configured neural network and uses Amazon SageMaker Debugger to aggregate and analyze tensors, including gradients, activation outputs, and weights. For example, the following plot shows the distribution of gradients of a convolutional layer that is suffering from a vanishing gradient problem.  
![\[A graph plotting the distribution of gradients.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-vanishing-gradient.gif)

  This notebook also illustrates how a good initial hyperparameter setting improves the training process by generating the same tensor distribution plots. 
+ [Visualizing and Debugging Tensors from MXNet Model Training](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mnist_tensor_plot)

  This notebook example shows how to save and visualize tensors from an MXNet Gluon model training job using Amazon SageMaker Debugger. It illustrates how Debugger is configured to save all tensors to an Amazon S3 bucket and retrieve ReLU activation outputs for visualization. The following figure shows a three-dimensional visualization of the ReLU activation outputs. The color scheme is set to blue to indicate values close to 0 and yellow to indicate values close to 1.  
![\[A visualization of the ReLU activation outputs\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/tensorplot.gif)

  In this notebook, the `TensorPlot` class imported from `tensor_plot.py` is designed to plot convolutional neural networks (CNNs) that take two-dimensional images for inputs. The `tensor_plot.py` script provided with the notebook retrieves tensors using Debugger and visualizes the CNN. You can run this notebook on SageMaker Studio to reproduce the tensor visualization and implement your own convolutional neural network model. 
+ [Real-time Tensor Analysis in a SageMaker Notebook with MXNet](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mxnet_realtime_analysis)

  This example guides you through installing the required components for emitting tensors in an Amazon SageMaker training job and using the Debugger API operations to access those tensors while training is running. A Gluon CNN model is trained on the Fashion MNIST dataset. While the job is running, you can see how Debugger retrieves activation outputs of the first convolutional layer from each of 100 batches and visualizes them. It also shows how to visualize weights after the job is done.

## Save tensors using Debugger built-in collections
<a name="debugger-save-built-in-collections"></a>

You can specify built-in collections of tensors using the `CollectionConfig` API and save them using the `DebuggerHookConfig` API. The following example shows how to use the default settings of Debugger hook configurations to construct a SageMaker AI TensorFlow estimator. You can also use this for MXNet, PyTorch, and XGBoost estimators.

**Note**  
In the following example code, the `s3_output_path` parameter for `DebuggerHookConfig` is optional. If you do not specify it, Debugger saves the tensors at `s3://<output_path>/debug-output/`, where the `<output_path>` is the default output path of SageMaker training jobs. For example:  

```
"s3://sagemaker-us-east-1-111122223333/sagemaker-debugger-training-YYYY-MM-DD-HH-MM-SS-123/debug-output"
```

```
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# use Debugger CollectionConfig to call built-in collections
collection_configs=[
        CollectionConfig(name="weights"),
        CollectionConfig(name="gradients"),
        CollectionConfig(name="losses"),
        CollectionConfig(name="biases")
    ]

# configure Debugger hook
# set a target S3 bucket as you want
sagemaker_session=sagemaker.Session()
BUCKET_NAME=sagemaker_session.default_bucket()
LOCATION_IN_BUCKET='debugger-built-in-collections-hook'

hook_config=DebuggerHookConfig(
    s3_output_path='s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.
                    format(BUCKET_NAME=BUCKET_NAME, 
                           LOCATION_IN_BUCKET=LOCATION_IN_BUCKET),
    collection_configs=collection_configs
)

# construct a SageMaker TensorFlow estimator
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-demo-job',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    
    # debugger-specific hook argument below
    debugger_hook_config=hook_config
)

sagemaker_estimator.fit()
```

To see a list of Debugger built-in collections, see [Debugger Built-in Collections](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#collection).

## Save tensors by modifying Debugger built-in collections
<a name="debugger-save-modified-built-in-collections"></a>

You can modify the Debugger built-in collections using the `CollectionConfig` API operation. The following example shows how to tweak the built-in `losses` collection and construct a SageMaker AI TensorFlow estimator. You can also use this for MXNet, PyTorch, and XGBoost estimators.

```
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# use Debugger CollectionConfig to call and modify built-in collections
collection_configs=[
    CollectionConfig(
                name="losses", 
                parameters={"save_interval": "50"})]

# configure Debugger hook
# set a target S3 bucket as you want
sagemaker_session=sagemaker.Session()
BUCKET_NAME=sagemaker_session.default_bucket()
LOCATION_IN_BUCKET='debugger-modified-collections-hook'

hook_config=DebuggerHookConfig(
    s3_output_path='s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.
                    format(BUCKET_NAME=BUCKET_NAME, 
                           LOCATION_IN_BUCKET=LOCATION_IN_BUCKET),
    collection_configs=collection_configs
)

# construct a SageMaker TensorFlow estimator
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-demo-job',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    
    # debugger-specific hook argument below
    debugger_hook_config=hook_config
)

sagemaker_estimator.fit()
```

For a full list of `CollectionConfig` parameters, see [ Debugger CollectionConfig API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk).
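The `save_interval` parameter controls which training steps write the collection. As a rough sketch (an illustration assuming zero-based step numbering, not the hook's exact internal logic), an interval of 50 produces the following step schedule:

```python
# Hypothetical illustration of a save_interval schedule, not smdebug internals
SAVE_INTERVAL = 50
TOTAL_STEPS = 200

# Steps at which a collection with save_interval=50 would be written
saved_steps = [step for step in range(TOTAL_STEPS + 1) if step % SAVE_INTERVAL == 0]
print(saved_steps)
```

With `save_interval: "50"`, loss values are therefore recorded at only a fraction of the steps, which keeps the volume of data written to Amazon S3 small.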

## Save tensors using Debugger custom collections
<a name="debugger-save-custom-collections"></a>

You can also save a reduced number of tensors instead of the full set of tensors (for example, if you want to reduce the amount of data saved in your Amazon S3 bucket). The following example shows how to customize the Debugger hook configuration to specify target tensors that you want to save. You can use this for TensorFlow, MXNet, PyTorch, and XGBoost estimators.

```
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# use Debugger CollectionConfig to create a custom collection
collection_configs=[
        CollectionConfig(
            name="custom_activations_collection",
            parameters={
                "include_regex": "relu|tanh", # Required
                "reductions": "mean,variance,max,abs_mean,abs_variance,abs_max"
            })
    ]
    
# configure the Debugger hook and set a target S3 bucket of your choice
sagemaker_session=sagemaker.Session()
BUCKET_NAME=sagemaker_session.default_bucket()
LOCATION_IN_BUCKET='debugger-custom-collections-hook'

hook_config=DebuggerHookConfig(
    s3_output_path=f's3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}',
    collection_configs=collection_configs
)

# construct a SageMaker TensorFlow estimator
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-demo-job',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    
    # debugger-specific hook argument below
    debugger_hook_config=hook_config
)

sagemaker_estimator.fit()
```

For a full list of `CollectionConfig` parameters, see [ Debugger CollectionConfig](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk).
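The `include_regex` parameter behaves like a regular-expression search over tensor names. The following sketch, using hypothetical tensor names, illustrates which tensors a pattern such as `relu|tanh` would select:

```python
import re

# Hypothetical tensor names emitted by a training job
tensor_names = [
    "dense_1/relu_output",
    "dense_2/tanh_output",
    "dense_3/sigmoid_output",
    "conv_1/weights",
]

# A pattern like the one passed to include_regex above
pattern = re.compile("relu|tanh")
selected = [name for name in tensor_names if pattern.search(name)]
print(selected)  # only the relu and tanh activations match
```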

# How to configure Debugger built-in rules
<a name="use-debugger-built-in-rules"></a>

In the following topics, you'll learn how to use the SageMaker Debugger built-in rules. The built-in rules analyze tensors emitted during model training, and SageMaker Debugger offers the `Rule` API operation to monitor training job progress and detect errors that affect the success of training your model. For example, the rules can detect whether gradients are getting too large or too small, whether a model is overfitting or overtraining, and whether a training job's loss function stops decreasing and improving. To see a full list of available built-in rules, see [List of Debugger built-in rules](debugger-built-in-rules.md).

**Topics**
+ [Use Debugger built-in rules with the default parameter settings](debugger-built-in-rules-configuration.md)
+ [Use Debugger built-in rules with custom parameter values](debugger-built-in-rules-configuration-param-change.md)
+ [Example notebooks and code samples to configure Debugger rules](debugger-built-in-rules-example.md)

# Use Debugger built-in rules with the default parameter settings
<a name="debugger-built-in-rules-configuration"></a>

To specify Debugger built-in rules in an estimator, you need to configure a list object. The following example code shows the basic structure of listing the Debugger built-in rules:

```
from sagemaker.debugger import Rule, rule_configs

rules=[
    Rule.sagemaker(rule_configs.built_in_rule_name_1()),
    Rule.sagemaker(rule_configs.built_in_rule_name_2()),
    ...
    Rule.sagemaker(rule_configs.built_in_rule_name_n()),
    ... # You can also append more profiler rules in the ProfilerRule.sagemaker(rule_configs.*()) format.
]
```

For more information about default parameter values and descriptions of the built-in rule, see [List of Debugger built-in rules](debugger-built-in-rules.md).

To find the SageMaker Debugger API reference, see [https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.sagemaker.debugger.rule_configs](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.sagemaker.debugger.rule_configs) and [https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.Rule](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.Rule).

For example, to inspect the overall training performance and progress of your model, construct a SageMaker AI estimator with the following built-in rule configuration. 

```
from sagemaker.debugger import Rule, rule_configs

rules=[
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.stalled_training_rule())
]
```

When you start the training job, Debugger collects system resource utilization data every 500 milliseconds and the loss and accuracy values every 500 steps by default. Debugger analyzes the resource utilization to identify whether your model is having bottleneck problems. The `loss_not_decreasing`, `overfit`, `overtraining`, and `stalled_training_rule` rules monitor whether your model is optimizing the loss function without those training issues. If the rules detect training anomalies, the rule evaluation status changes to `IssueFound`. You can set up automated actions, such as sending notifications about training issues and stopping training jobs, using Amazon CloudWatch Events and AWS Lambda. For more information, see [Action on Amazon SageMaker Debugger rules](debugger-action-on-rules.md).



# Use Debugger built-in rules with custom parameter values
<a name="debugger-built-in-rules-configuration-param-change"></a>

If you want to adjust the built-in rule parameter values and customize tensor collection regexes, configure the `base_config` and `rule_parameters` parameters for the `ProfilerRule.sagemaker` and `Rule.sagemaker` class methods. For the `Rule.sagemaker` class method, you can also customize tensor collections through the `collections_to_save` parameter. Instructions for using the `CollectionConfig` class are provided at [Configure tensor collections using the `CollectionConfig` API](debugger-configure-tensor-collections.md). 

Use the following configuration template to customize built-in rule parameter values. By changing the rule parameters, you can adjust how sensitive the rules are to being triggered. 
+ The `base_config` argument is where you call the built-in rule methods.
+ The `rule_parameters` argument is to adjust the default key values of the built-in rules listed in [List of Debugger built-in rules](debugger-built-in-rules.md).
+ The `collections_to_save` argument takes in a tensor configuration through the `CollectionConfig` API, which requires `name` and `parameters` arguments. 
  + To find available tensor collections for `name`, see [ Debugger Built-in Tensor Collections ](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#built-in-collections). 
  + For a full list of adjustable `parameters`, see [ Debugger CollectionConfig API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk).

For more information about the Debugger rule class, methods, and parameters, see [SageMaker AI Debugger Rule class](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

```
from sagemaker.debugger import Rule, ProfilerRule, rule_configs, CollectionConfig

rules=[
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule_name(),
        rule_parameters={
                "key": "value"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="tensor_collection_name", 
                parameters={
                    "key": "value"
                } 
            )
        ]
    )
]
```

The parameter descriptions and value customization examples are provided for each rule at [List of Debugger built-in rules](debugger-built-in-rules.md).

# Example notebooks and code samples to configure Debugger rules
<a name="debugger-built-in-rules-example"></a>

The following sections provide notebooks and code samples that show how to use Debugger rules to monitor SageMaker training jobs.

**Topics**
+ [Debugger built-in rules example notebooks](#debugger-built-in-rules-notebook-example)
+ [Debugger built-in rules example code](#debugger-deploy-built-in-rules)
+ [Use Debugger built-in rules with parameter modifications](#debugger-deploy-modified-built-in-rules)

## Debugger built-in rules example notebooks
<a name="debugger-built-in-rules-notebook-example"></a>

The following example notebooks show how to use Debugger built-in rules when running training jobs with Amazon SageMaker AI: 
+ [Using a SageMaker Debugger built-in rule with TensorFlow](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow_builtin_rule)
+ [Using a SageMaker Debugger built-in rule with Managed Spot Training and MXNet](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mxnet_spot_training)
+ [Using a SageMaker Debugger built-in rule with parameter modifications for a real-time training job analysis with XGBoost](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/xgboost_realtime_analysis)

While running the example notebooks in SageMaker Studio, you can find the training job trial created on the **Studio Experiment List** tab. For example, as shown in the following screenshot, you can find and open a **Describe Trial Component** window of your current training job. On the Debugger tab, you can check whether the Debugger rules, `vanishing_gradient()` and `loss_not_decreasing()`, are monitoring the training session in parallel. For full instructions on how to find your training job trial components in the Studio UI, see [SageMaker Studio - View Experiments, Trials, and Trial Components](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-tasks.html#studio-tasks-experiments).

![\[An image of running a training job with Debugger built-in rules activated in SageMaker Studio\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-built-in-rule-studio.png)


There are two ways of using the Debugger built-in rules in the SageMaker AI environment: deploy the built-in rules as they are prepared, or adjust their parameters. The following topics show you how to use the built-in rules with example code.

## Debugger built-in rules example code
<a name="debugger-deploy-built-in-rules"></a>

The following code sample shows how to set the Debugger built-in rules using the `Rule.sagemaker` method. To specify built-in rules that you want to run, use the `rule_configs` API operation to call the built-in rules. To find a full list of Debugger built-in rules and default parameter values, see [List of Debugger built-in rules](debugger-built-in-rules.md).

```
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

# call built-in rules that you want to use.
built_in_rules=[
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.loss_not_decreasing())
]

# construct a SageMaker AI estimator with the Debugger built-in rules
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-built-in-rules-demo',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",

    # debugger-specific arguments below
    rules=built_in_rules
)
sagemaker_estimator.fit()
```

**Note**  
The Debugger built-in rules run in parallel with your training job. The maximum number of built-in rule containers for a training job is 20. 

For more information about the Debugger rule class, methods, and parameters, see the [SageMaker Debugger Rule class](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). 

To find an example of how to adjust the Debugger rule parameters, see the following [Use Debugger built-in rules with parameter modifications](#debugger-deploy-modified-built-in-rules) section.

## Use Debugger built-in rules with parameter modifications
<a name="debugger-deploy-modified-built-in-rules"></a>

The following code example shows the structure of built-in rules with adjusted parameters. In this example, the `stalled_training_rule` collects the `losses` tensor collection from the training stage every 50 steps and from the evaluation stage every 10 steps. If the training process stalls and stops emitting tensor outputs for 120 seconds, the `stalled_training_rule` stops the training job. 

```
import time

import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

# call the built-in rules and modify the CollectionConfig parameters

base_job_name_prefix='smdebug-stalled-demo-' + str(int(time.time()))

built_in_rules_modified=[
    Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
                'threshold': '120',
                'training_job_name_prefix': base_job_name_prefix,
                'stop_training_on_fire' : 'True'
        },
        collections_to_save=[ 
            CollectionConfig(
                name="losses", 
                parameters={
                      "train.save_interval": "50",
                      "eval.save_interval": "10"
                } 
            )
        ]
    )
]

# construct a SageMaker AI estimator with the modified Debugger built-in rule
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name=base_job_name_prefix,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",

    # debugger-specific arguments below
    rules=built_in_rules_modified
)
sagemaker_estimator.fit()
```
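Conceptually, the `stalled_training_rule` configured with `threshold: '120'` fires when no new tensor output has been saved within the threshold window. A minimal sketch of that check, using hypothetical timestamps in seconds:

```python
# Illustration only; the actual rule runs in a separate rule container
THRESHOLD_SECONDS = 120

# Hypothetical wall-clock times (seconds) for the last saved tensor output
last_tensor_saved_at = 1_000.0
current_time = 1_150.0

# Stalled if the time since the last tensor output exceeds the threshold
stalled = (current_time - last_tensor_saved_at) > THRESHOLD_SECONDS
print(stalled)
```

Because `stop_training_on_fire` is set to `'True'` in the example above, a detected stall results in the training job being stopped.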

For an advanced configuration of the Debugger built-in rules using the `CreateTrainingJob` API, see [Configure Debugger using SageMaker API](debugger-createtrainingjob-api.md).

# Turn off Debugger
<a name="debugger-turn-off"></a>

If you want to completely turn off Debugger, do one of the following:
+ Before starting a training job, do the following:

  To stop both monitoring and profiling, add the `disable_profiler` parameter to your estimator and set it to `True`.
**Warning**  
If you disable it, you won't be able to view the comprehensive Studio Debugger insights dashboard and the autogenerated profiling report.

  To stop debugging, set the `debugger_hook_config` parameter to `False`.
**Warning**  
If you disable it, you won't be able to collect output tensors and cannot debug your model parameters.

  ```
  estimator=Estimator(
      ...
      disable_profiler=True,
      debugger_hook_config=False
  )
  ```

  For more information about the Debugger-specific parameters, see [SageMaker AI Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).
+ While a training job is running, do the following:

  To disable both monitoring and profiling while your training job is running, use the following estimator classmethod:

  ```
  estimator.disable_profiling()
  ```

  To disable framework profiling only and keep system monitoring, use the `update_profiler` method:

  ```
  estimator.update_profiler(disable_framework_metrics=True)
  ```

  For more information about the estimator extension methods, see the [estimator.disable\_profiling](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.disable_profiling) and [estimator.update\_profiler](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.update_profiler) class methods in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) documentation.

# Useful SageMaker AI estimator class methods for Debugger
<a name="debugger-estimator-classmethods"></a>

The following estimator class methods are useful for accessing your SageMaker training job information and retrieving output paths of training data collected by Debugger. The following methods are executable after you initiate a training job with the `estimator.fit()` method.
+ To check the base S3 bucket URI of a SageMaker training job:

  ```
  estimator.output_path
  ```
+ To check the base job name of a SageMaker training job:

  ```
  estimator.latest_training_job.job_name
  ```
+ To see a full `CreateTrainingJob` API operation configuration of a SageMaker training job:

  ```
  estimator.latest_training_job.describe()
  ```
+ To check a full list of the Debugger rules while a SageMaker training job is running:

  ```
  estimator.latest_training_job.rule_job_summary()
  ```
+ To check the S3 bucket URI where the model parameter data (output tensors) are saved:

  ```
  estimator.latest_job_debugger_artifacts_path()
  ```
+ To check the S3 bucket URI where the model performance data (system and framework metrics) are saved:

  ```
  estimator.latest_job_profiler_artifacts_path()
  ```
+ To check the Debugger rule configuration for debugging output tensors:

  ```
  estimator.debugger_rule_configs
  ```
+ To check the list of the Debugger rules for debugging while a SageMaker training job is running:

  ```
  estimator.debugger_rules
  ```
+ To check the Debugger rule configuration for monitoring and profiling system and framework metrics:

  ```
  estimator.profiler_rule_configs
  ```
+ To check the list of the Debugger rules for monitoring and profiling while a SageMaker training job is running:

  ```
  estimator.profiler_rules
  ```
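These attributes compose into the S3 prefixes where Debugger writes its data. For example, the debug-output path returned by `latest_job_debugger_artifacts_path()` typically combines the output path and job name; a sketch with hypothetical values:

```python
# Hypothetical values, shaped like estimator.output_path and
# estimator.latest_training_job.job_name
output_path = "s3://sagemaker-us-west-2-111122223333"
job_name = "debugger-demo-job-2023-01-01-00-00-00-000"

# Default layout: <output-path>/<job-name>/debug-output
debug_output_uri = f"{output_path}/{job_name}/debug-output"
print(debug_output_uri)
```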

For more information about the SageMaker AI estimator class and its methods, see [Estimator API](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

# SageMaker Debugger interactive report for XGBoost
<a name="debugger-report-xgboost"></a>

Receive training reports autogenerated by Debugger. The Debugger reports provide insights into your training jobs and suggest recommendations to improve your model performance. For SageMaker AI XGBoost training jobs, use the Debugger [CreateXgboostReport](debugger-built-in-rules.md#create-xgboost-report) rule to receive a comprehensive training report of the training progress and results. Following this guide, specify the [CreateXgboostReport](debugger-built-in-rules.md#create-xgboost-report) rule while constructing an XGBoost estimator, download the report using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) or the Amazon S3 console, and gain insights into the training results.

**Note**  
You can download a Debugger report while your training job is running or after the job has finished. During training, Debugger concurrently updates the report to reflect the current rule evaluation status. You can download a complete Debugger report only after the training job has completed.

**Important**  
In the report, plots and recommendations are provided for informational purposes and are not definitive. You are responsible for making your own independent assessment of the information.

**Topics**
+ [Construct a SageMaker AI XGBoost estimator with the Debugger XGBoost Report rule](debugger-training-xgboost-report-estimator.md)
+ [Download the Debugger XGBoost training report](debugger-training-xgboost-report-download.md)
+ [Debugger XGBoost training report walkthrough](debugger-training-xgboost-report-walkthrough.md)

# Construct a SageMaker AI XGBoost estimator with the Debugger XGBoost Report rule
<a name="debugger-training-xgboost-report-estimator"></a>

The [CreateXgboostReport](debugger-built-in-rules.md#create-xgboost-report) rule collects the following output tensors from your training job: 
+ `hyperparameters` – Saves at the first step.
+ `metrics` – Saves loss and accuracy every 5 steps.
+ `feature_importance` – Saves every 5 steps.
+ `predictions` – Saves every 5 steps.
+ `labels` – Saves every 5 steps.

The output tensors are saved in a default S3 bucket, for example: `s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/debug-output/`.

When you construct a SageMaker AI estimator for an XGBoost training job, specify the rule as shown in the following example code.

------
#### [ Using the SageMaker AI generic estimator ]

```
import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import Rule, rule_configs

rules=[
    Rule.sagemaker(rule_configs.create_xgboost_report())
]

region = boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-xgboost-report-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    
    # Add the Debugger XGBoost report rule
    rules=rules
)

estimator.fit(wait=False)
```

------

# Download the Debugger XGBoost training report
<a name="debugger-training-xgboost-report-download"></a>

Download the Debugger XGBoost training report while your training job is running or after the job has finished using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) and AWS Command Line Interface (CLI).

------
#### [ Download using the SageMaker Python SDK and AWS CLI ]

1. Check the current job's default S3 output base URI.

   ```
   estimator.output_path
   ```

1. Check the current job name.

   ```
   estimator.latest_training_job.job_name
   ```

1. The Debugger XGBoost report is stored under `<default-s3-output-base-uri>/<training-job-name>/rule-output`. Configure the rule output path as follows:

   ```
   rule_output_path = estimator.output_path + "/" + estimator.latest_training_job.job_name + "/rule-output"
   ```

1. To check if the report is generated, list directories and files recursively under the `rule_output_path` using `aws s3 ls` with the `--recursive` option.

   ```
   ! aws s3 ls {rule_output_path} --recursive
   ```

   This should return a complete list of files under autogenerated folders that are named `CreateXgboostReport` and `ProfilerReport-1234567890`. The XGBoost training report is stored in the `CreateXgboostReport` folder, and the profiling report is stored in the `ProfilerReport-1234567890` folder. To learn more about the profiling report generated by default with the XGBoost training job, see [SageMaker Debugger interactive report](debugger-profiling-report.md).  
![\[An example of rule output.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-xgboost-report-ls.png)

   The `xgboost_report.html` file is an XGBoost training report autogenerated by Debugger. The `xgboost_report.ipynb` file is a Jupyter notebook that's used to aggregate training results into the report. You can download all of the files, browse the HTML report file, and modify the report using the notebook.

1. Download the files recursively using `aws s3 cp`. The following command saves all of the rule output files, including the `CreateXgboostReport` and `ProfilerReport-1234567890` folders, under the current working directory.

   ```
   ! aws s3 cp {rule_output_path} ./ --recursive
   ```
**Tip**  
If you are using a Jupyter notebook server, run `!pwd` to verify the current working directory.

1. Under the `/CreateXgboostReport` directory, open `xgboost_report.html`. If you are using JupyterLab, choose **Trust HTML** to see the autogenerated Debugger training report.  
![\[An example of rule output.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-xgboost-report-open-trust.png)

1. Open the `xgboost_report.ipynb` file to explore how the report is generated. You can customize and extend the training report using the Jupyter notebook file.

------
#### [ Download using the Amazon S3 console ]

1. Sign in to the AWS Management Console and open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

1. Search for the base S3 bucket. For example, if you haven't specified any base job name, the base S3 bucket name should be in the following format: `sagemaker-<region>-111122223333`. Look up the base S3 bucket through the **Find bucket by name** field.  
![\[The Find bucket by name field in the Amazon S3 console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-0.png)

1. In the base S3 bucket, look up the training job name by entering your job name prefix in **Find objects by prefix** and then choosing the training job name.  
![\[The Find objects by prefix field in the Amazon S3 console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-1.png)

1. In the training job's S3 bucket, choose the **rule-output/** subfolder. There are three subfolders for training data collected by Debugger: **debug-output/**, **profiler-output/**, and **rule-output/**.   
![\[An example to the rule output S3 bucket URI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-2.png)

1. In the **rule-output/** folder, choose the **CreateXgboostReport/** folder. The folder contains **xgboost\_report.html** (the autogenerated report in HTML) and **xgboost\_report.ipynb** (a Jupyter notebook with scripts that are used for generating the report).

1. Choose the **xgboost\_report.html** file, choose **Download actions**, and then choose **Download**.  
![\[An example to the rule output S3 bucket URI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-xgboost-report-s3-download.png)

1. Open the downloaded **xgboost\_report.html** file in a web browser.

------

# Debugger XGBoost training report walkthrough
<a name="debugger-training-xgboost-report-walkthrough"></a>

This section walks you through the Debugger XGBoost training report. The report is automatically aggregated depending on the output tensor regex, recognizing whether your training job is a binary classification, multiclass classification, or regression job.

**Important**  
In the report, plots and recommendations are provided for informational purposes and are not definitive. You are responsible for making your own independent assessment of the information.

**Topics**
+ [Distribution of true labels of the dataset](#debugger-training-xgboost-report-walkthrough-dist-label)
+ [Loss versus step graph](#debugger-training-xgboost-report-walkthrough-loss-vs-step)
+ [Feature importance](#debugger-training-xgboost-report-walkthrough-feature-importance)
+ [Confusion matrix](#debugger-training-xgboost-report-walkthrough-confusion-matrix)
+ [Evaluation of the confusion matrix](#debugger-training-xgboost-report-walkthrough-eval-conf-matrix)
+ [Accuracy rate of each diagonal element over iteration](#debugger-training-xgboost-report-walkthrough-accuracy-rate)
+ [Receiver operating characteristic curve](#debugger-training-xgboost-report-walkthrough-rec-op-char)
+ [Distribution of residuals at the last saved step](#debugger-training-xgboost-report-walkthrough-dist-residual)
+ [Absolute validation error per label bin over iteration](#debugger-training-xgboost-report-walkthrough-val-error-per-label-bin)

## Distribution of true labels of the dataset
<a name="debugger-training-xgboost-report-walkthrough-dist-label"></a>

This histogram shows the distribution of labeled classes (for classification) or values (for regression) in your original dataset. Skewness in your dataset could contribute to inaccuracies. This visualization is available for the following model types: binary classification, multiclass classification, and regression.

![\[An example of a distribution of true labels of the dataset graph.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-dist-label.png)
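You can reproduce the gist of this histogram with a simple class count. The following sketch, using hypothetical labels, flags the kind of skew the report visualizes:

```python
from collections import Counter

# Hypothetical binary labels; 80% belong to class 0
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

counts = Counter(labels)
# Share of the most frequent class; a value near 1.0 indicates heavy skew
majority_share = max(counts.values()) / len(labels)
print(counts, majority_share)
```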


## Loss versus step graph
<a name="debugger-training-xgboost-report-walkthrough-loss-vs-step"></a>

This is a line chart that shows the progression of loss on training data and validation data throughout the training steps. The loss is what you defined in your objective function, such as mean squared error. You can gauge whether the model is overfitting or underfitting from this plot. This section also provides insights that you can use to determine how to resolve overfitting and underfitting problems. This visualization is available for the following model types: binary classification, multiclass classification, and regression. 

![\[An example of a loss versus step graph.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-loss-vs-step.png)
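As a rough illustration of the overfitting signal this plot surfaces, the following sketch (with hypothetical loss values) checks whether the validation loss has started rising while the training loss is still falling:

```python
# Hypothetical losses recorded at successive evaluation steps
train_loss = [0.90, 0.60, 0.40, 0.25, 0.15]
val_loss   = [0.95, 0.70, 0.55, 0.60, 0.70]

# Classic overfitting signal: training loss keeps decreasing
# while validation loss increases
overfitting = train_loss[-1] < train_loss[-2] and val_loss[-1] > val_loss[-2]
print(overfitting)
```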


## Feature importance
<a name="debugger-training-xgboost-report-walkthrough-feature-importance"></a>

Three different types of feature importance visualizations are provided: Weight, Gain, and Coverage. Detailed definitions for each of the three are provided in the report. Feature importance visualizations help you learn what features in your training dataset contributed to the predictions. Feature importance visualizations are available for the following model types: binary classification, multiclass classification, and regression. 

![\[An example of a feature importance graph.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-feature-importance.png)


## Confusion matrix
<a name="debugger-training-xgboost-report-walkthrough-confusion-matrix"></a>

This visualization is only applicable to binary and multiclass classification models. Accuracy alone might not be sufficient for evaluating the model performance. For some use cases, such as healthcare and fraud detection, it’s also important to know the false positive rate and false negative rate. A confusion matrix gives you the additional dimensions for evaluating your model performance.

![\[An example of confusion matrix.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-confusion-matrix.png)
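For reference, a confusion matrix is built by counting (true class, predicted class) pairs. A minimal sketch for a hypothetical binary classifier:

```python
# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# matrix[i][j] counts samples whose true class is i and predicted class is j
n_classes = 2
matrix = [[0] * n_classes for _ in range(n_classes)]
for t, p in zip(y_true, y_pred):
    matrix[t][p] += 1

# Row 0: [true negatives, false positives]; row 1: [false negatives, true positives]
print(matrix)
```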


## Evaluation of the confusion matrix
<a name="debugger-training-xgboost-report-walkthrough-eval-conf-matrix"></a>

This section provides you with more insights on the micro, macro, and weighted metrics on precision, recall, and F1-score for your model.

![\[Evaluation of the confusion matrix.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-eval-conf-matrix.png)
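The distinction between micro and macro averaging is easiest to see on a small example. The following sketch computes per-class precision and both averages from a hypothetical 3-class confusion matrix:

```python
# Hypothetical 3-class confusion matrix: matrix[i][j] = true class i predicted as j
matrix = [
    [50,  5,  5],
    [10, 20,  0],
    [ 0,  5,  5],
]
n = len(matrix)

# Per-class precision: correct predictions of class j / all predictions of class j
col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
precision = [matrix[j][j] / col_sums[j] for j in range(n)]

# Macro: unweighted mean of per-class precision; micro: pooled over all predictions
macro_precision = sum(precision) / n
micro_precision = sum(matrix[j][j] for j in range(n)) / sum(col_sums)
print(precision, macro_precision, micro_precision)
```

Micro averaging is dominated by the large class, while macro averaging weights every class equally, which is why the two can diverge on imbalanced data.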


## Accuracy rate of each diagonal element over iteration
<a name="debugger-training-xgboost-report-walkthrough-accuracy-rate"></a>

This visualization is only applicable to binary classification and multiclass classification models. This is a line chart that plots the diagonal values in the confusion matrix throughout the training steps for each class. This plot shows you how the accuracy of each class progresses throughout the training steps. You can identify the under-performing classes from this plot. 

![\[An example of an accuracy rate of each diagonal element over iteration graph.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-accuracy-rate.gif)


## Receiver operating characteristic curve
<a name="debugger-training-xgboost-report-walkthrough-rec-op-char"></a>

This visualization is only applicable to binary classification models. The receiver operating characteristic (ROC) curve is commonly used to evaluate binary classification model performance. The y-axis of the curve is the true positive rate (TPR) and the x-axis is the false positive rate (FPR). The plot also displays the value of the area under the curve (AUC). The higher the AUC value, the more predictive your classifier. You can also use the ROC curve to understand the trade-off between TPR and FPR and to identify the optimum classification threshold for your use case. The classification threshold can be adjusted to tune the behavior of the model toward reducing one or the other type of error (false positives or false negatives).

![\[An example of a receiver operating characteristic curve graph.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-rec-op-char.png)
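
The ROC computation can be sketched in plain Python: sweep the threshold down through the sorted scores, record an (FPR, TPR) point at each step, and integrate with the trapezoidal rule. The scores and labels below are hypothetical:

```python
def roc_points(y_true, scores):
    """Sweep the threshold down the sorted scores, emitting (FPR, TPR) points."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lowering the threshold admits one example at a time (assumes no tied scores).
    for score, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the (FPR, TPR) curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [1, 1, 0, 1, 0, 0]                # hypothetical labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]    # hypothetical predicted scores
points = roc_points(y_true, scores)
```

Each point on the curve corresponds to one possible classification threshold, which is why the curve lets you read off the TPR/FPR trade-off directly.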


## Distribution of residuals at the last saved step
<a name="debugger-training-xgboost-report-walkthrough-dist-residual"></a>

This visualization is a column chart that shows the residual distribution at the last step Debugger captures. In this visualization, you can check whether the residual distribution is close to a normal distribution centered at zero. If the residuals are skewed, your features may not be sufficient for predicting the labels.

![\[An example of a distribution of residuals at the last saved step graph.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-dist-residual.png)
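
A quick numerical version of this check, using only the standard library, computes the residuals and their sample skewness (the third standardized moment; values near zero indicate a roughly symmetric distribution). The actual and predicted values below are hypothetical:

```python
import statistics

actual    = [3.0, 5.0, 2.5, 7.0, 4.5, 6.0]   # hypothetical target values
predicted = [2.8, 5.4, 2.9, 6.5, 4.4, 6.1]   # hypothetical predictions
residuals = [a - p for a, p in zip(actual, predicted)]

mean = statistics.mean(residuals)
stdev = statistics.stdev(residuals)
# Sample skewness: near zero means roughly symmetric; a large magnitude
# suggests the features may not fully explain the labels.
skew = sum(((r - mean) / stdev) ** 3 for r in residuals) / len(residuals)
```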


## Absolute validation error per label bin over iteration
<a name="debugger-training-xgboost-report-walkthrough-val-error-per-label-bin"></a>

This visualization is only applicable to regression models. The actual target values are split into 10 intervals. This visualization shows, in line plots, how the validation errors progress for each interval throughout the training steps. Absolute validation error is the absolute value of the difference between the predicted and actual values during validation. You can identify the underperforming intervals from this visualization.

![\[An example of an absolute validation error per label bin over iteration graph.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-val-error-per-label-bin.png)
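
The binning behind this plot can be sketched as follows: split the range of actual target values into 10 equal-width intervals and average the absolute error within each. The values below are hypothetical; empty bins are reported as `None`:

```python
def error_per_bin(actual, predicted, n_bins=10):
    """Mean absolute error per equal-width interval of the actual target values.
    Assumes the actual values span a nonzero range; empty bins yield None."""
    lo, hi = min(actual), max(actual)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for a, p in zip(actual, predicted):
        # Map each actual value to its interval; clamp the maximum into the last bin.
        idx = min(int((a - lo) / width), n_bins - 1)
        bins[idx].append(abs(a - p))
    return [sum(b) / len(b) if b else None for b in bins]

actual    = [0.0, 1.0, 2.0, 5.0, 9.0, 10.0]   # hypothetical target values
predicted = [0.5, 1.2, 1.0, 6.0, 8.0, 12.0]   # hypothetical predictions
errors = error_per_bin(actual, predicted)
```

Intervals whose mean error stays high across training steps are the underperforming label ranges this visualization highlights.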


# Action on Amazon SageMaker Debugger rules
<a name="debugger-action-on-rules"></a>

Based on the Debugger rule evaluation status, you can set up automated actions such as stopping a training job and sending notifications using Amazon Simple Notification Service (Amazon SNS). You can also create your own actions using Amazon CloudWatch Events and AWS Lambda. To learn how to set up automated actions based on the Debugger rule evaluation status, see the following topics.

**Topics**
+ [Use Debugger built-in actions for rules](debugger-built-in-actions.md)
+ [Actions on rules using Amazon CloudWatch and AWS Lambda](debugger-cloudwatch-lambda.md)

# Use Debugger built-in actions for rules
<a name="debugger-built-in-actions"></a>

Use Debugger built-in actions to respond to issues found by [Debugger rules](debugger-built-in-rules.md#debugger-built-in-rules-Rule). The Debugger `rule_configs` class provides tools to configure a list of actions, including automatically stopping training jobs and sending notifications using Amazon Simple Notification Service (Amazon SNS) when the Debugger rules find training issues. The following topics take you through the steps to accomplish these tasks.

**Topics**
+ [Set up Amazon SNS, create an `SMDebugRules` topic, and subscribe to the topic](#debugger-built-in-actions-sns)
+ [Set up your IAM role to attach required policies](#debugger-built-in-actions-iam)
+ [Configure Debugger rules with the built-in actions](#debugger-built-in-actions-on-rule)
+ [Considerations for using the Debugger built-in actions](#debugger-built-in-actions-considerations)

## Set up Amazon SNS, create an `SMDebugRules` topic, and subscribe to the topic
<a name="debugger-built-in-actions-sns"></a>

This section walks you through how to set up an Amazon SNS **SMDebugRules** topic, subscribe to it, and confirm the subscription to receive notifications from the Debugger rules.

**Note**  
For more information about billing for Amazon SNS, see [Amazon SNS pricing](https://aws.amazon.com/sns/pricing/) and [Amazon SNS FAQs](https://aws.amazon.com/sns/faqs/).

**To create an SMDebugRules topic**

1. Sign in to the AWS Management Console and open the Amazon SNS console at [https://console.aws.amazon.com/sns/v3/home](https://console.aws.amazon.com/sns/v3/home).

1. In the left navigation pane, choose **Topics**. 

1. On the **Topics** page, choose **Create topic**.

1. On the **Create topic** page, in the **Details** section, do the following:

   1. For **Type**, choose **Standard** for topic type.

   1. In **Name**, enter **SMDebugRules**.

1. Skip all other optional settings and choose **Create topic**. If you want to learn more about the optional settings, see [Creating an Amazon SNS topic](https://docs.aws.amazon.com/sns/latest/dg/sns-create-topic.html).

**To subscribe to the SMDebugRules topic**

1. Open the Amazon SNS console at [https://console.aws.amazon.com/sns/v3/home](https://console.aws.amazon.com/sns/v3/home).

1. In the left navigation pane, choose **Subscriptions**. 

1. On the **Subscriptions** page, choose **Create subscription**.

1. On the **Create subscription** page, in the **Details** section, do the following: 

   1. For **Topic ARN**, choose the **SMDebugRules** topic ARN. The ARN should be in the format `arn:aws:sns:<region-id>:111122223333:SMDebugRules`.

   1. For **Protocol**, choose **Email** or **SMS**. 

   1. For **Endpoint**, enter the endpoint value, such as the email address or phone number at which you want to receive notifications.
**Note**  
Make sure you enter the correct email address and phone number. Phone numbers must include `+`, a country code, and the phone number, with no special characters or spaces. For example, the phone number +1 (222) 333-4444 is formatted as **+12223334444**.

1. Skip all other optional settings and choose **Create subscription**. If you want to learn more about the optional settings, see [Subscribing to an Amazon SNS topic](https://docs.aws.amazon.com/sns/latest/dg/sns-create-subscribe-endpoint-to-topic.html).

After you subscribe to the **SMDebugRules** topic, you receive the following confirmation message in email or by phone:

![\[A subscription confirmation email message for the Amazon SNS SMDebugRules topic.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-built-in-action-subscription-confirmation.png)


For more information about Amazon SNS, see [Mobile text messaging (SMS)](https://docs.aws.amazon.com/sns/latest/dg/sns-mobile-phone-number-as-subscriber.html) and [Email notifications](https://docs.aws.amazon.com/sns/latest/dg/sns-email-notifications.html) in the *Amazon SNS Developer Guide*.

## Set up your IAM role to attach required policies
<a name="debugger-built-in-actions-iam"></a>

In this step, you add the required policies to your IAM role.

**To add the required policies to your IAM role**

1. Sign in to the AWS Management Console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. In the left navigation pane, choose **Policies**, and choose **Create policy**.

1. On the **Create policy** page, do the following to create a new sns-access policy:

   1. Choose the **JSON** tab.

   1. Paste the following JSON policy document into the policy editor, replacing the 12-digit AWS account ID with your AWS account ID.


      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Sid": "VisualEditor0",
                  "Effect": "Allow",
                  "Action": [
                      "sns:Publish",
                      "sns:CreateTopic",
                      "sns:Subscribe"
                  ],
                  "Resource": "arn:aws:sns:*:111122223333:SMDebugRules"
              }
          ]
      }
      ```


   1. At the bottom of the page, choose **Review policy**.

   1. On the **Review policy** page, for **Name**, enter **sns-access**.

   1. At the bottom of the page, choose **Create policy**.

1. Go back to the IAM console, and choose **Roles** in the left navigation pane.

1. Look up the IAM role that you use for SageMaker AI model training and choose that IAM role.

1. On the **Permissions** tab of the **Summary** page, choose **Attach policies**.

1. Search for the **sns-access** policy, select the check box next to the policy, and then choose **Attach policy**.

For more examples of setting up IAM policies for Amazon SNS, see [Example cases for Amazon SNS access control](https://docs.aws.amazon.com/sns/latest/dg/sns-access-policy-use-cases.html).

## Configure Debugger rules with the built-in actions
<a name="debugger-built-in-actions-on-rule"></a>

After completing the required settings in the preceding steps, you can configure the Debugger built-in actions for debugging rules as shown in the following example script. You can choose which built-in actions to use while building the `actions` list object. `rule_configs` is a helper module that provides high-level tools to configure Debugger built-in rules and actions. The following built-in actions are available for Debugger:
+ `rule_configs.StopTraining()` – Stops a training job when the Debugger rule finds an issue.
+ `rule_configs.Email("abc@abc.com")` – Sends a notification via email when the Debugger rule finds an issue. Use the email address that you used when you set up your SNS topic subscription.
+ `rule_configs.SMS("+1234567890")` – Sends a notification via text message when the Debugger rule finds an issue. Use the phone number that you used when you set up your SNS topic subscription.
**Note**  
Make sure you enter the correct email address and phone number. Phone numbers must include `+`, a country code, and the phone number, with no special characters or spaces. For example, the phone number +1 (222) 333-4444 is formatted as **+12223334444**.

You can use all of the built-in actions or a subset of them by wrapping them with the `rule_configs.ActionList()` method, which takes the built-in actions and configures them as a list.

**To add all three built-in actions to a single rule**

If you want to assign all three built-in actions to a single rule, configure a Debugger built-in action list while constructing an estimator. Use the following template to construct the estimator, and Debugger stops training jobs and sends notifications through email and text for any rules that you use to monitor your training job progress.

```
from sagemaker.debugger import Rule, rule_configs

# Configure an action list object for Debugger rules
actions = rule_configs.ActionList(
    rule_configs.StopTraining(), 
    rule_configs.Email("abc@abc.com"), 
    rule_configs.SMS("+1234567890")
)

# Configure rules for debugging with the actions parameter
rules = [
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule(),         # Required
        rule_parameters={"paramter_key": value },        # Optional
        actions=actions
    )
]

estimator = Estimator(
    ...
    rules = rules
)

estimator.fit(wait=False)
```

**To create multiple built-in action objects to assign different actions to a single rule**

If you want the built-in actions to be triggered at different threshold values of a single rule, you can create multiple built-in action objects, as shown in the following script. To avoid a conflict error from running the same rule twice, you must submit different rule job names (specify different strings for the rules' `name` attribute), as shown in the following example script template. This example shows how to set up the [StalledTrainingRule](debugger-built-in-rules.md#stalled-training) to take two different actions: send an email to `abc@abc.com` when a training job stalls for 60 seconds, and stop the training job if it stalls for 120 seconds.

```
from sagemaker.debugger import Rule, rule_configs
import time

base_job_name_prefix = 'smdebug-stalled-demo-' + str(int(time.time()))

# Configure an action object for StopTraining
action_stop_training = rule_configs.ActionList(
    rule_configs.StopTraining()
)

# Configure an action object for Email
action_email = rule_configs.ActionList(
    rule_configs.Email("abc@abc.com")
)

# Configure a rule with the Email built-in action to trigger if a training job stalls for 60 seconds
stalled_training_job_rule_email = Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
                "threshold": "60", 
                "training_job_name_prefix": base_job_name_prefix
        },
        actions=action_email
)
stalled_training_job_rule_email.name="StalledTrainingJobRuleEmail"

# Configure a rule with the StopTraining built-in action to trigger if a training job stalls for 120 seconds
stalled_training_job_rule = Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
                "threshold": "120", 
                "training_job_name_prefix": base_job_name_prefix
        },
        actions=action_stop_training
)
stalled_training_job_rule.name="StalledTrainingJobRuleStopTraining"

estimator = Estimator(
    ...
    rules = [stalled_training_job_rule_email, stalled_training_job_rule]
)

estimator.fit(wait=False)
```

While the training job is running, the Debugger built-in actions send notification emails and text messages whenever a rule finds issues with your training job. The following screenshot shows an example email notification for a training job that has a stalled training issue. 

![\[An example email notification sent by Debugger when it detects a StalledTraining issue.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-built-in-action-email.png)


The following screenshot shows an example text notification that Debugger sends when the rule finds a StalledTraining issue.

![\[An example text notification sent by Debugger when it detects a StalledTraining issue.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-built-in-action-text.png)


## Considerations for using the Debugger built-in actions
<a name="debugger-built-in-actions-considerations"></a>
+ To use the Debugger built-in actions, an internet connection is required. This feature is not supported in the network isolation mode provided by Amazon SageMaker AI or Amazon VPC.
+ The built-in actions cannot be used for [Profiler rules](debugger-built-in-profiler-rules.md#debugger-built-in-profiler-rules-ProfilerRule).
+ The built-in actions cannot be used on training jobs with spot training interruptions.
+ In email or text notifications, `None` appears at the end of messages. This does not have any meaning, so you can disregard the text `None`.

# Actions on rules using Amazon CloudWatch and AWS Lambda
<a name="debugger-cloudwatch-lambda"></a>

Amazon CloudWatch collects Amazon SageMaker AI model training job logs and Amazon SageMaker Debugger rule processing job logs. Configure Debugger with Amazon CloudWatch Events and AWS Lambda to take action based on Debugger rule evaluation status. 

## Example notebooks
<a name="debugger-test-stop-training"></a>

You can run the following example notebooks, which are prepared for experimenting with stopping a training job through actions on Debugger's built-in rules using Amazon CloudWatch and AWS Lambda.
+ [Amazon SageMaker Debugger - Reacting to CloudWatch Events from Rules](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_action_on_rule/tf-mnist-stop-training-job.html)

  This example notebook runs a training job that has a vanishing gradient issue. The Debugger [VanishingGradient](debugger-built-in-rules.md#vanishing-gradient) built-in rule is used while constructing the SageMaker AI TensorFlow estimator. When the Debugger rule detects the issue, the training job is terminated.
+ [Detect Stalled Training and Invoke Actions Using SageMaker Debugger Rule](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_action_on_rule/detect_stalled_training_job_and_actions.html)

  This example notebook runs a training script with a code line that forces it to sleep for 10 minutes. The Debugger [StalledTrainingRule](debugger-built-in-rules.md#stalled-training) built-in rule detects the stalled training issue and stops the training job.

**Topics**
+ [Example notebooks](#debugger-test-stop-training)
+ [Access CloudWatch logs for Debugger rules and training jobs](debugger-cloudwatch-metric.md)
+ [Set up Debugger for automated training job termination using CloudWatch and Lambda](debugger-stop-training.md)
+ [Disable the CloudWatch Events rule to stop using the automated training job termination](debugger-disable-cw.md)

# Access CloudWatch logs for Debugger rules and training jobs
<a name="debugger-cloudwatch-metric"></a>

You can use the training and Debugger rule job status in the CloudWatch logs to take further actions when there are training issues. The following procedure shows how to access the related CloudWatch logs. For more information about monitoring training jobs using CloudWatch, see [Monitor Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-overview.html).

**To access training job logs and Debugger rule job logs**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the left navigation pane under the **Log** node, choose **Log Groups**.

1. In the log groups list, do the following:
   + Choose **/aws/sagemaker/TrainingJobs** for training job logs.
   + Choose **/aws/sagemaker/ProcessingJobs** for Debugger rule job logs.

# Set up Debugger for automated training job termination using CloudWatch and Lambda
<a name="debugger-stop-training"></a>

The Debugger rules monitor training job status, and a CloudWatch Events rule watches the Debugger rule training job evaluation status. The following sections outline the process needed to automate training job termination using CloudWatch and Lambda.

**Topics**
+ [Step 1: Create a Lambda function](#debugger-lambda-function-create)
+ [Step 2: Configure the Lambda function](#debugger-lambda-function-configure)
+ [Step 3: Create a CloudWatch events rule and link to the Lambda function for Debugger](#debugger-cloudwatch-events)

## Step 1: Create a Lambda function
<a name="debugger-lambda-function-create"></a>

**To create a Lambda function**

1. Open the AWS Lambda console at [https://console.aws.amazon.com/lambda/](https://console.aws.amazon.com/lambda/).

1. In the left navigation pane, choose **Functions** and then choose **Create function**.

1. On the **Create function** page, choose the **Author from scratch** option.

1. In the **Basic information** section, enter a **Function name** (for example, **debugger-rule-stop-training-job**).

1. For **Runtime**, choose **Python 3.7**.

1. For **Permissions**, expand the dropdown, and choose **Change default execution role**.

1. For **Execution role**, choose **Use an existing role** and choose the IAM role that you use for training jobs on SageMaker AI.
**Note**  
Make sure you use the execution role with `AmazonSageMakerFullAccess` and `AWSLambdaBasicExecutionRole` attached. Otherwise, the Lambda function won't properly react to the Debugger rule status changes of the training job. If you are unsure which execution role is being used, run the following code in a Jupyter notebook cell to retrieve the execution role output:  

   ```
   import sagemaker
   sagemaker.get_execution_role()
   ```

1. At the bottom of the page, choose **Create function**.

The following figure shows an example of the **Create function** page with the input fields and selections completed.

![\[Create Function page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-lambda-create.png)


## Step 2: Configure the Lambda function
<a name="debugger-lambda-function-configure"></a>

**To configure the Lambda function**

1. In the **Function code** section of the configuration page, paste the following Python script in the Lambda code editor pane. The `lambda_handler` function monitors the Debugger rule evaluation status collected by CloudWatch and triggers the `StopTrainingJob` API operation. The AWS SDK for Python (Boto3) `client` for SageMaker AI provides a high-level method, `stop_training_job`, which triggers the `StopTrainingJob` API operation.

   ```
   import json
   import boto3
   import logging
   
   logger = logging.getLogger()
   logger.setLevel(logging.INFO)
   
   def lambda_handler(event, context):
       training_job_name = event.get("detail").get("TrainingJobName")
       logging.info(f'Evaluating Debugger rules for training job: {training_job_name}')
       eval_statuses = event.get("detail").get("DebugRuleEvaluationStatuses", None)
   
       if eval_statuses is None or len(eval_statuses) == 0:
           logging.info("Couldn't find any debug rule statuses, skipping...")
           return {
               'statusCode': 200,
               'body': json.dumps('Nothing to do')
           }
   
       # should only attempt stopping jobs with InProgress status
       training_job_status = event.get("detail").get("TrainingJobStatus", None)
       if training_job_status != 'InProgress':
           logging.debug(f"Current Training job status({training_job_status}) is not 'InProgress'. Exiting")
           return {
               'statusCode': 200,
               'body': json.dumps('Nothing to do')
           }
   
       client = boto3.client('sagemaker')
   
       for status in eval_statuses:
           logging.info(status.get("RuleEvaluationStatus") + ', RuleEvaluationStatus=' + str(status))
           if status.get("RuleEvaluationStatus") == "IssuesFound":
               secondary_status = event.get("detail").get("SecondaryStatus", None)
               logging.info(
                   f'About to stop training job, since evaluation of rule configuration {status.get("RuleConfigurationName")} resulted in "IssuesFound". ' +
                   f'\ntraining job "{training_job_name}" status is "{training_job_status}", secondary status is "{secondary_status}"' +
                   f'\nAttempting to stop training job "{training_job_name}"'
               )
               try:
                   client.stop_training_job(
                       TrainingJobName=training_job_name
                   )
               except Exception as e:
                   logging.error(
                       "Encountered error while trying to "
                       "stop training job {}: {}".format(
                           training_job_name, str(e)
                       )
                   )
                   raise e
       return None
   ```

   For more information about the Lambda code editor interface, see [Creating functions using the AWS Lambda console editor](https://docs.aws.amazon.com/lambda/latest/dg/code-editor.html).

1. Skip all other settings and choose **Save** at the top of the configuration page.

## Step 3: Create a CloudWatch events rule and link to the Lambda function for Debugger
<a name="debugger-cloudwatch-events"></a>

**To create a CloudWatch Events rule and link to the Lambda function for Debugger**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the left navigation pane, choose **Rules** under the **Events** node.

1. Choose **Create rule**.

1. In the **Event Source** section of the **Step 1: Create rule** page, choose **SageMaker AI** for **Service Name**, and choose **SageMaker AI Training Job State Change** for **Event Type**. The Event Pattern Preview should look like the following example JSON strings: 

   ```
   {
       "source": [
           "aws.sagemaker"
       ],
       "detail-type": [
           "SageMaker Training Job State Change"
       ]
   }
   ```

1. In the **Targets** section, choose **Add target**, and choose the **debugger-rule-stop-training-job** Lambda function that you created. This step links the CloudWatch Events rule with the Lambda function.

1. Choose **Configure details** and go to the **Step 2: Configure rule details** page.

1. Specify the CloudWatch rule definition name. For example, **debugger-cw-event-rule**.

1. Choose **Create rule** to finish.

1. Go back to the Lambda function configuration page and refresh the page. Confirm that it's configured correctly in the **Designer** panel. The CloudWatch Events rule should be registered as a trigger for the Lambda function. The configuration design should look like the following example:  
<a name="lambda-designer-example"></a>![\[Designer panel for the CloudWatch configuration.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-lambda-designer.png)

# Disable the CloudWatch Events rule to stop using the automated training job termination
<a name="debugger-disable-cw"></a>

If you want to disable the automated training job termination, you need to disable the CloudWatch Events rule. In the Lambda **Designer** panel, choose the **EventBridge (CloudWatch Events)** block linked to the Lambda function. This shows an **EventBridge** panel below the **Designer** panel (for example, see the previous screenshot). Select the check box next to **EventBridge (CloudWatch Events): debugger-cw-event-rule**, and then choose **Disable**. If you want to use the automated termination functionality later, you can enable the CloudWatch Events rule again.

# Visualize Amazon SageMaker Debugger output tensors in TensorBoard
<a name="debugger-enable-tensorboard-summaries"></a>

**Important**  
This page is deprecated in favor of Amazon SageMaker AI with TensorBoard, which provides a comprehensive TensorBoard experience integrated with SageMaker Training and the access control functionalities of a SageMaker AI domain. To learn more, see [TensorBoard in Amazon SageMaker AI](tensorboard-on-sagemaker.md).

Use SageMaker Debugger to create output tensor files that are compatible with TensorBoard. Load the files to visualize in TensorBoard and analyze your SageMaker training jobs. Debugger automatically generates output tensor files that are compatible with TensorBoard. For any hook configuration you customize for saving output tensors, Debugger has the flexibility to create scalar summaries, distributions, and histograms that you can import to TensorBoard. 

![\[An architecture diagram of the Debugger output tensor saving mechanism.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-tensorboard-concept.png)


You can enable this by passing `DebuggerHookConfig` and `TensorBoardOutputConfig` objects to an `estimator`.

The following procedure explains how to save scalars, weights, and biases as full tensors, histograms, and distributions that can be visualized with TensorBoard. Debugger saves them to the training container's local path (the default path is `/opt/ml/output/tensors`) and syncs to the Amazon S3 locations passed through the Debugger output configuration objects.

**To save TensorBoard compatible output tensor files using Debugger**

1. Set up a `tensorboard_output_config` configuration object to save TensorBoard output using the Debugger `TensorBoardOutputConfig` class. For the `s3_output_path` parameter, specify the default S3 bucket of the current SageMaker AI session or a preferred S3 bucket. This example does not add the `container_local_output_path` parameter; instead, it is set to the default local path `/opt/ml/output/tensors`.

   ```
   import sagemaker
   from sagemaker.debugger import TensorBoardOutputConfig
   
   bucket = sagemaker.Session().default_bucket()
   tensorboard_output_config = TensorBoardOutputConfig(
       s3_output_path='s3://{}'.format(bucket)
   )
   ```

   For additional information, see the Debugger `[TensorBoardOutputConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.TensorBoardOutputConfig)` API in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

1. Configure the Debugger hook and customize the hook parameter values. For example, the following code configures a Debugger hook to save all scalar outputs every 100 steps in training phases and 10 steps in validation phases, the `weights` parameters every 500 steps (the default `save_interval` value for saving tensor collections is 500), and the `bias` parameters every 10 global steps until the global step reaches 500.

   ```
   from sagemaker.debugger import CollectionConfig, DebuggerHookConfig
   
   hook_config = DebuggerHookConfig(
       hook_parameters={
           "train.save_interval": "100",
           "eval.save_interval": "10"
       },
       collection_configs=[
           CollectionConfig("weights"),
           CollectionConfig(
               name="biases",
               parameters={
                   "save_interval": "10",
                   "end_step": "500",
                   "save_histogram": "True"
               }
           ),
       ]
   )
   ```

   For more information about the Debugger configuration APIs, see the Debugger `[CollectionConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig)` and `[DebuggerHookConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig)` APIs in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

1. Construct a SageMaker AI estimator with the Debugger parameters passing the configuration objects. The following example template shows how to create a generic SageMaker AI estimator. You can replace `estimator` and `Estimator` with other SageMaker AI frameworks' estimator parent classes and estimator classes. Available SageMaker AI framework estimators for this functionality are `[TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#create-an-estimator)`, `[PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#create-an-estimator)`, and `[MXNet](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html#create-an-estimator)`.

   ```
   from sagemaker.estimator import Estimator
   
   estimator = Estimator(
       ...
       # Debugger parameters
       debugger_hook_config=hook_config,
       tensorboard_output_config=tensorboard_output_config
   )
   estimator.fit()
   ```

   The `estimator.fit()` method starts a training job, and Debugger writes the output tensor files in real time to the Debugger S3 output path and to the TensorBoard S3 output path. To retrieve the output paths, use the following estimator methods:
   + For the Debugger S3 output path, use `estimator.latest_job_debugger_artifacts_path()`.
   + For the TensorBoard S3 output path, use `estimator.latest_job_tensorboard_artifacts_path()`.

1. After the training has completed, check the names of saved output tensors:

   ```
   from smdebug.trials import create_trial
   trial = create_trial(estimator.latest_job_debugger_artifacts_path())
   trial.tensor_names()
   ```

1. Check the TensorBoard output data in Amazon S3:

   ```
   tensorboard_output_path=estimator.latest_job_tensorboard_artifacts_path()
   print(tensorboard_output_path)
   !aws s3 ls {tensorboard_output_path}/
   ```

1. Download the TensorBoard output data to your notebook instance. For example, the following AWS CLI command downloads the TensorBoard files to `/logs/fit` under the current working directory of your notebook instance.

   ```
   !aws s3 cp --recursive {tensorboard_output_path} ./logs/fit
   ```

1. Compress the file directory to a TAR file to download to your local machine.

   ```
   !tar -cf logs.tar logs
   ```

1. Download and extract the TensorBoard TAR file to a directory on your device, launch a Jupyter notebook server, open a new notebook, and run the TensorBoard app.

   ```
   !tar -xf logs.tar
   %load_ext tensorboard
   %tensorboard --logdir logs/fit
   ```

The following animated screenshot illustrates steps 5 through 8. It demonstrates how to download the Debugger TensorBoard TAR file and load the file in a Jupyter notebook on your local device.

![\[Animation on how to download and load the Debugger TensorBoard file on your local machine.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-tensorboard.gif)


# List of Debugger built-in rules
<a name="debugger-built-in-rules"></a>

You can use the built-in rules provided by Amazon SageMaker Debugger to analyze metrics and tensors collected while training your models. The following sections list the Debugger built-in rules, with information and an example of how to configure and deploy each one.

The Debugger built-in rules monitor various common conditions that are critical for the success of a training job. You can call the built-in rules using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) or the low-level SageMaker API operations. 

There's no additional cost for using the built-in rules. For more information about billing, see the [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) page.

**Note**  
The maximum number of built-in rules that you can attach to a training job is 20. SageMaker Debugger fully manages the built-in rules and analyzes your training job synchronously.

**Important**  
To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the SMDebug client library. In your IPython kernel, Jupyter notebook, or JupyterLab environment, run the following code to install the latest versions of the libraries and restart the kernel.  

```
import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)
```

## Debugger rule
<a name="debugger-built-in-rules-Rule"></a>

The following rules are the Debugger built-in rules that are callable using the `Rule.sagemaker` classmethod.

Debugger built-in rules for generating training reports


| Scope of Validity | Built-in Rules | 
| --- | --- | 
| Training Report for SageMaker AI XGBoost training job |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html)  | 

Debugger built-in rules for debugging model training data (output tensors)


| Scope of Validity | Built-in Rules | 
| --- | --- | 
| Deep learning frameworks (TensorFlow, MXNet, and PyTorch) |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html)  | 
| Deep learning frameworks (TensorFlow, MXNet, and PyTorch) and the XGBoost algorithm  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html)  | 
| Deep learning applications |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html)  | 
| XGBoost algorithm |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html)  | 

**To use the built-in rules with default parameter values** – use the following configuration format:

```
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules = [
    Rule.sagemaker(rule_configs.built_in_rule_name_1()),
    Rule.sagemaker(rule_configs.built_in_rule_name_2()),
    ...
    Rule.sagemaker(rule_configs.built_in_rule_name_n())
]
```

**To use the built-in rules with customized parameter values** – use the following configuration format:

```
from sagemaker.debugger import Rule, ProfilerRule, rule_configs, CollectionConfig

rules = [
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule_name(),
        rule_parameters={
                "key": "value"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="tensor_collection_name", 
                parameters={
                    "key": "value"
                } 
            )
        ]
    )
]
```

To find available keys for the `rule_parameters` parameter, see the parameter description tables.

Sample rule configuration code is provided for each built-in rule below its parameter description table.
+ For full instructions and examples of using the Debugger built-in rules, see [Debugger built-in rules example code](debugger-built-in-rules-example.md#debugger-deploy-built-in-rules).
+ For full instructions on using the built-in rules with the low-level SageMaker API operations, see [Configure Debugger using SageMaker API](debugger-createtrainingjob-api.md).

## CreateXgboostReport
<a name="create-xgboost-report"></a>

The CreateXgboostReport rule collects output tensors from an XGBoost training job and autogenerates a comprehensive training report. You can download the report while a training job is running or after it completes, to check the progress of training or the final result of the training job. The CreateXgboostReport rule collects the following output tensors by default: 
+ `hyperparameters` – Saves at the first step
+ `metrics` – Saves loss and accuracy every 5 steps
+ `feature_importance` – Saves every 5 steps
+ `predictions` – Saves every 5 steps
+ `labels` – Saves every 5 steps

Parameter Descriptions for the CreateXgboostReport Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 

```
rules=[
    Rule.sagemaker(
        rule_configs.create_xgboost_report()
    )  
]
```

## DeadRelu
<a name="dead-relu"></a>

This rule detects rectified linear unit (ReLU) activation functions in a trial that are considered dead because their activation activity has dropped below a threshold. If the percentage of inactive ReLUs in a layer is greater than the `threshold_layer` value, the rule returns `True`.
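
As a rough illustration only (not Debugger's actual implementation), the layer-level check can be sketched in plain Python; the per-unit activity percentages are made-up inputs:

```python
# Illustrative sketch of the DeadRelu check: a unit is treated as dead when
# the percentage of steps on which it was active falls below
# threshold_inactivity; the layer is flagged when the percentage of dead
# units exceeds threshold_layer.

def dead_relu_check(activity_pct_per_unit, threshold_inactivity=1.0, threshold_layer=50.0):
    """activity_pct_per_unit: percent of steps on which each ReLU unit
    produced a nonzero output. Returns True when the layer looks dead."""
    dead = [a for a in activity_pct_per_unit if a < threshold_inactivity]
    pct_dead = 100.0 * len(dead) / len(activity_pct_per_unit)
    return pct_dead > threshold_layer

print(dead_relu_check([0.0, 0.2, 5.0, 80.0]))  # 2 of 4 units dead (50%) -> False
print(dead_relu_check([0.0, 0.2, 0.5, 80.0]))  # 3 of 4 units dead (75%) -> True
```

The actual rule works on the saved `relu_output` tensors rather than precomputed activity percentages.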

Parameter Descriptions for the DeadRelu Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| tensor\_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string  Default value: `".*relu_output"`  | 
| threshold\_inactivity |  Defines a level of activity below which a ReLU is considered to be dead. A ReLU might be active in the beginning of a trial and then slowly die during the training process. If the ReLU is active less than the `threshold_inactivity`, it is considered to be dead. **Optional** Valid values: Float Default values: `1.0` (in percentage)  | 
| threshold\_layer |  Returns `True` if the percentage of inactive ReLUs in a layer is greater than `threshold_layer`. Returns `False` if the percentage of inactive ReLUs in a layer is less than `threshold_layer`. **Optional** Valid values: Float Default values: `50.0` (in percentage)  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.dead_relu(),
        rule_parameters={
                "tensor_regex": ".*relu_output|.*ReLU_output",
                "threshold_inactivity": "1.0",
                "threshold_layer": "50.0"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="custom_relu_collection", 
                parameters={
                    "include_regex": ".*relu_output|.*ReLU_output",
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
This rule is not available for the XGBoost algorithm.

## ExplodingTensor
<a name="exploding-tensor"></a>

This rule detects whether the tensors emitted during training have non-finite values, either infinite or NaN (not a number). If a non-finite value is detected, the rule returns `True`.
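
As a rough illustration only (not Debugger's actual implementation), the check amounts to scanning a tensor's values for non-finite entries:

```python
import math

# Illustrative sketch of the ExplodingTensor check on a tensor flattened
# to a list of floats. With only_nan=True, infinities are tolerated and
# only NaN values trigger the rule.

def exploding_tensor_check(values, only_nan=False):
    if only_nan:
        return any(math.isnan(v) for v in values)
    return any(not math.isfinite(v) for v in values)

print(exploding_tensor_check([0.5, 2.0, float("inf")]))            # True
print(exploding_tensor_check([0.5, float("inf")], only_nan=True))  # False
```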

Parameter Descriptions for the ExplodingTensor Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| collection\_names |  The list of collection names whose tensors the rule inspects. **Optional** Valid values: String Default value: `None`  | 
| tensor\_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: String  Default value: `None`  | 
| only\_nan |   `True` to monitor the `base_trial` tensors only for `NaN` values and not for infinity.  `False` to treat both `NaN` and infinity as exploding values and to monitor for both. **Optional** Default value: `False`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.exploding_tensor(),
        rule_parameters={
                "tensor_regex": ".*gradient",
                "only_nan": "False"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="gradients", 
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
```

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
This rule is not available for the XGBoost algorithm.

## PoorWeightInitialization
<a name="poor-weight-initialization"></a>

 This rule detects if your model parameters have been poorly initialized. 

Good initialization breaks the symmetry of the weights and gradients in a neural network and maintains commensurate activation variances across layers. Otherwise, the neural network doesn't learn effectively. Initializers like Xavier aim to keep variance constant across activations, which is especially relevant for training very deep neural nets. Too small an initialization can lead to vanishing gradients. Too large an initialization can lead to exploding gradients. This rule checks the variance of activation inputs across layers, the distribution of gradients, and the loss convergence for the initial steps to determine if a neural network has been poorly initialized.
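
The variance part of that check can be sketched roughly as follows. This is an illustrative simplification, not Debugger's implementation, and the per-layer activation input lists are made-up data:

```python
from statistics import pvariance

# Illustrative sketch of the PoorWeightInitialization variance check:
# compute the variance of activation inputs for each layer and flag the
# model when the largest per-layer variance is more than `threshold`
# times the smallest.

def poor_init_check(layer_activation_inputs, threshold=10.0):
    variances = [pvariance(layer) for layer in layer_activation_inputs]
    return max(variances) / min(variances) > threshold

balanced = [[0.1, -0.2, 0.3], [0.2, -0.1, 0.25]]
skewed   = [[0.01, -0.02, 0.015], [5.0, -4.0, 6.0]]
print(poor_init_check(balanced))  # variances are commensurate -> False
print(poor_init_check(skewed))    # variance ratio far above 10 -> True
```

The actual rule also inspects the gradient distribution and loss convergence over the initial steps.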

Parameter Descriptions for the PoorWeightInitialization Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| activation\_inputs\_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: String Default value: `".*relu_input"`  | 
| threshold |  If the ratio between minimum and maximum variance of weights per layer exceeds the `threshold` at a step, the rule returns `True`. **Optional** Valid values: Float Default value: `10.0`  | 
| distribution\_range |  If the minimum difference between 5th and 95th percentiles of the gradient distribution is less than the `distribution_range`, the rule returns `True`. **Optional** Valid values: Float Default value: `0.001`  | 
| patience |  The number of steps to wait until the loss is considered to be no longer decreasing. **Optional** Valid values: Integer Default value: `5`  | 
| steps |  The number of steps this rule analyzes. You typically need to check only the first few iterations. **Optional** Valid values: Float Default value: `10`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.poor_weight_initialization(),
        rule_parameters={
                "activation_inputs_regex": ".*relu_input|.*ReLU_input",
                "threshold": "10.0",
                "distribution_range": "0.001",
                "patience": "5",
                "steps": "10"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="custom_relu_collection", 
                parameters={
                    "include_regex": ".*relu_input|.*ReLU_input",
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
This rule is not available for the XGBoost algorithm.

## SaturatedActivation
<a name="saturated-activation"></a>

This rule detects if the tanh and sigmoid activation layers are becoming saturated. An activation layer is saturated when the input of the layer is close to the maximum or minimum of the activation function. The minimum and maximum of the tanh and sigmoid activation functions are defined by their respective `min_threshold` and `max_threshold` values. If the activity of a node drops below the `threshold_inactivity` percentage, it is considered saturated. If more than a `threshold_layer` percent of the nodes are saturated, the rule returns `True`.
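
As a rough illustration only (not Debugger's actual implementation), the layer-level part of the check for a tanh layer can be sketched as follows; the input values are made-up data:

```python
# Illustrative sketch of the SaturatedActivation layer check: count the
# fraction of a tanh layer's input values that fall at or beyond the
# saturation thresholds, and flag the layer when that fraction exceeds
# threshold_layer percent.

def saturated_activation_check(inputs, t_min=-9.4999, t_max=9.4999, threshold_layer=50.0):
    saturated = [x for x in inputs if x <= t_min or x >= t_max]
    return 100.0 * len(saturated) / len(inputs) > threshold_layer

print(saturated_activation_check([0.5, -1.2, 2.0, 11.0]))   # 25% saturated -> False
print(saturated_activation_check([10.0, -12.0, 9.6, 0.1]))  # 75% saturated -> True
```

The actual rule additionally tracks per-node activity over time against `threshold_inactivity` and applies the sigmoid thresholds to sigmoid layers.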

Parameter Descriptions for the SaturatedActivation Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| collection\_names |  The list of collection names whose tensors the rule inspects. **Optional** Valid values: List of strings or a comma-separated string Default value: `None`  | 
| tensor\_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: String  Default value: `".*tanh_input\|.*sigmoid_input"`  | 
| threshold\_tanh\_min |  The minimum and maximum thresholds that define the extremes of the input for a tanh activation function, defined as: `(min_threshold, max_threshold)`. The default values are determined based on a vanishing gradient threshold of 0.0000001. **Optional** Valid values: Float Default values: `-9.4999`  | 
| threshold\_tanh\_max |  The minimum and maximum thresholds that define the extremes of the input for a tanh activation function, defined as: `(min_threshold, max_threshold)`. The default values are determined based on a vanishing gradient threshold of 0.0000001. **Optional** Valid values: Float Default values: `9.4999`  | 
| threshold\_sigmoid\_min |  The minimum and maximum thresholds that define the extremes of the input for a sigmoid activation function, defined as: `(min_threshold, max_threshold)`. The default values are determined based on a vanishing gradient threshold of 0.0000001. **Optional** Valid values: Float Default values: `-23`  | 
| threshold\_sigmoid\_max |  The minimum and maximum thresholds that define the extremes of the input for a sigmoid activation function, defined as: `(min_threshold, max_threshold)`. The default values are determined based on a vanishing gradient threshold of 0.0000001. **Optional** Valid values: Float Default values: `16.99999`  | 
| threshold\_inactivity |  The percentage of inactivity below which the activation layer is considered to be saturated. The activation might be active in the beginning of a trial and then slowly become less active during the training process. **Optional** Valid values: Float Default values: `1.0`  | 
| threshold\_layer |  Returns `True` if the number of saturated activations in a layer is greater than the `threshold_layer` percentage. Returns `False` if the number of saturated activations in a layer is less than the `threshold_layer` percentage. **Optional** Valid values: Float Default values: `50.0`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.saturated_activation(),
        rule_parameters={
                "tensor_regex": ".*tanh_input|.*sigmoid_input",
                "threshold_tanh_min": "-9.4999",
                "threshold_tanh_max": "9.4999",
                "threshold_sigmoid_min": "-23",
                "threshold_sigmoid_max": "16.99999",
                "threshold_inactivity": "1.0",
                "threshold_layer": "50.0"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="custom_activations_collection",
                parameters={
                    "include_regex": ".*tanh_input|.*sigmoid_input",
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
This rule is not available for the XGBoost algorithm.

## VanishingGradient
<a name="vanishing-gradient"></a>

This rule detects if the gradients in a trial become extremely small or drop to a zero magnitude. If the mean of the absolute values of the gradients drops below a specified `threshold`, the rule returns `True`.
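
As a rough illustration only (not Debugger's actual implementation), the check reduces to a mean-absolute-value comparison; the gradient lists are made-up data:

```python
# Illustrative sketch of the VanishingGradient check: the rule fires when
# the mean of the absolute gradient values drops below `threshold`.

def vanishing_gradient_check(gradients, threshold=0.0000001):
    mean_abs = sum(abs(g) for g in gradients) / len(gradients)
    return mean_abs < threshold

print(vanishing_gradient_check([1e-3, -2e-3, 5e-4]))   # healthy gradients -> False
print(vanishing_gradient_check([1e-9, -2e-9, 3e-10]))  # near-zero gradients -> True
```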

Parameter Descriptions for the VanishingGradient Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| threshold |  The value at which the gradient is determined to be vanishing. **Optional** Valid values: Float Default value: `0.0000001`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.vanishing_gradient(),
        rule_parameters={
                "threshold": "0.0000001"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="gradients", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
This rule is not available for the XGBoost algorithm.

## WeightUpdateRatio
<a name="weight-update-ratio"></a>

This rule keeps track of the ratio of updates to weights during training and detects if that ratio gets too large or too small. If the ratio of updates to weights is larger than the `large_threshold` value or smaller than the `small_threshold` value, the rule returns `True`.

Conditions for training are best when the updates are commensurate to gradients. Excessively large updates can push the weights away from optimal values, and very small updates result in very slow convergence. This rule requires weights to be available for two training steps, and `train.save_interval` needs to be set equal to `num_steps`.
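
As a rough illustration only (not Debugger's actual implementation), the ratio check can be sketched with mean absolute values over two saved weight snapshots; the weight lists are made-up data:

```python
# Illustrative sketch of the WeightUpdateRatio check: compare the average
# size of a weight update to the average size of the weights, and flag
# ratios outside the (small_threshold, large_threshold) band.

def weight_update_ratio_check(w_prev, w_curr, small_threshold=0.00000001,
                              large_threshold=10.0, epsilon=0.000000001):
    update = sum(abs(c - p) for p, c in zip(w_prev, w_curr)) / len(w_prev)
    weight = sum(abs(p) for p in w_prev) / len(w_prev)
    ratio = update / (weight + epsilon)  # epsilon guards against division by zero
    return ratio > large_threshold or ratio < small_threshold

print(weight_update_ratio_check([1.0, -2.0], [1.001, -2.002]))  # modest updates -> False
print(weight_update_ratio_check([0.001, 0.002], [1.0, -2.0]))   # huge updates -> True
```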

Parameter Descriptions for the WeightUpdateRatio Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| num\_steps |  The number of steps across which you want to compare the weight ratios. If you pass no value, the rule runs by default against the current step and the immediately previous saved step. If you override the default by passing a value for this parameter, the comparison is done between weights at step `s` and at a step `>= s - num_steps`. **Optional** Valid values: Integer Default value: `None`  | 
| large\_threshold |  The maximum value that the ratio of updates to weight can take before the rule returns `True`.  **Optional** Valid values: Float Default value: `10.0`  | 
| small\_threshold |  The minimum value that the ratio of updates to weight can take, below which the rule returns `True`. **Optional** Valid values: Float Default value: `0.00000001`  | 
| epsilon |  A small constant used to ensure that Debugger does not divide by zero when computing the ratio of updates to weights. **Optional** Valid values: Float Default value: `0.000000001`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.weight_update_ratio(),
        rule_parameters={
                "num_steps": "100",
                "large_threshold": "10.0",
                "small_threshold": "0.00000001",
                "epsilon": "0.000000001"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="weights", 
                parameters={
                    "train.save_interval": "100"
                } 
            )
        ]
    )
]
```

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
This rule is not available for the XGBoost algorithm.

## AllZero
<a name="all-zero"></a>

This rule detects if all or a specified percentage of the tensor values are zero.
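
As a rough illustration only (not Debugger's actual implementation), the check counts zeros against a percentage threshold:

```python
# Illustrative sketch of the AllZero check: fire when at least `threshold`
# percent of a tensor's values are zero.

def all_zero_check(values, threshold=100.0):
    pct_zero = 100.0 * sum(1 for v in values if v == 0) / len(values)
    return pct_zero >= threshold

print(all_zero_check([0.0, 0.0, 0.0]))                   # all zeros -> True
print(all_zero_check([0.0, 0.0, 1.5]))                   # below the default 100% -> False
print(all_zero_check([0.0, 0.0, 1.5], threshold=50.0))   # 66.7% >= 50% -> True
```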

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either the `collection_names` or `tensor_regex` parameter. If both the parameters are specified, the rule inspects the union of tensors from both sets.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the AllZero Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| collection\_names |  The list of collection names whose tensors the rule inspects. **Optional** Valid values: List of strings or a comma-separated string Default value: `None`  | 
| tensor\_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string Default value: `None`  | 
| threshold |  Specifies the percentage of values in the tensor that needs to be zero for this rule to be invoked.  **Optional** Valid values: Float Default value: 100 (in percentage)  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.all_zero(),
        rule_parameters={
                "tensor_regex": ".*",
                "threshold": "100"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="all", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## ClassImbalance
<a name="class-imbalance"></a>

This rule measures sampling imbalances between classes and throws errors if the imbalance exceeds a threshold or if too many mispredictions for underrepresented classes occur as a result of the imbalance.

Classification models require well-balanced classes in the training dataset or a proper weighting/sampling of classes during training. The rule performs the following checks:
+  It counts the occurrences per class. If the ratio of number of samples between smallest and largest class is larger than the `threshold_imbalance`, an error is thrown.
+  It checks the prediction accuracy per class. If resampling or weighting has not been correctly applied, then the model can reach high accuracy for the class with many training samples, but low accuracy for the classes with few training samples. If a fraction of mispredictions for a certain class is above `threshold_misprediction`, an error is thrown.
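
As a rough illustration only (not Debugger's actual implementation), the first check (class occurrence counting) can be sketched as follows; the label lists are made-up data:

```python
from collections import Counter

# Illustrative sketch of the ClassImbalance occurrence check: fire when the
# largest class has more than threshold_imbalance times as many samples as
# the smallest class.

def class_imbalance_check(labels, threshold_imbalance=10.0):
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values()) > threshold_imbalance

print(class_imbalance_check([0] * 50 + [1] * 45))   # ratio ~1.1 -> False
print(class_imbalance_check([0] * 500 + [1] * 20))  # ratio 25 -> True
```

The actual rule additionally tracks per-class prediction accuracy against `threshold_misprediction`.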

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the ClassImbalance Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| threshold\_imbalance |  The acceptable imbalance between the number of samples in the smallest class and in the largest class. Exceeding this threshold value throws an error. **Optional** Valid values: Float Default value: `10`  | 
| threshold\_misprediction |  A limit on the fraction of mispredictions allowed for each class. Exceeding this threshold throws an error. The underrepresented classes are most at risk of crossing this threshold.  **Optional** Valid values: Float Default value: `0.7`  | 
| samples |  The number of labels that have to be processed before an imbalance is evaluated. The rule might not be triggered until it has seen sufficient samples across several steps. The more classes that your dataset contains, the larger this `samples` number should be.  **Optional** Valid values: Integer Default value: `500` (assuming a dataset like MNIST with 10 classes)  | 
| argmax |  If `True`, [np.argmax](https://docs.scipy.org/doc/numpy-1.9.3/reference/generated/numpy.argmax.html) is applied to the prediction tensor. Required when you have a vector of probabilities for each class. It is used to determine which class has the highest probability. **Conditional** Valid values: Boolean Default value: `False`  | 
| labels\_regex |  The name of the tensor that contains the labels. **Optional** Valid values: String Default value: `".*labels"`  | 
| predictions\_regex |  The name of the tensor that contains the predictions. **Optional** Valid values: String Default value: `".*predictions"`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.class_imbalance(),
        rule_parameters={
                "threshold_imbalance": "10",
                "threshold_misprediction": "0.7",
                "samples": "500",
                "argmax": "False",
                "labels_regex": ".*labels",
                "predictions_regex": ".*predictions"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="custom_output_collection",
                parameters={
                    "include_regex": ".*labels|.*predictions",
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## LossNotDecreasing
<a name="loss-not-decreasing"></a>

This rule detects when the loss is not decreasing in value at an adequate rate. These losses must be scalars. 
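
As a rough illustration only (not Debugger's actual implementation), the core comparison between a loss saved `num_steps` earlier and the current loss can be sketched as follows; the loss values are made-up data:

```python
# Illustrative sketch of the LossNotDecreasing check: fire when the loss
# has not dropped by at least diff_percent between two compared steps.

def loss_not_decreasing_check(loss_old, loss_new, diff_percent=0.1):
    drop_pct = 100.0 * (loss_old - loss_new) / loss_old
    return drop_pct < diff_percent

print(loss_not_decreasing_check(0.80, 0.60))    # 25% drop -> False
print(loss_not_decreasing_check(0.80, 0.7995))  # 0.0625% drop -> True
```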

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either the `collection_names` or `tensor_regex` parameter. If both the parameters are specified, the rule inspects the union of tensors from both sets.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the LossNotDecreasing Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| collection\_names |  The list of collection names whose tensors the rule inspects. **Optional** Valid values: List of strings or a comma-separated string Default value: `None`  | 
| tensor\_regex |  A list of regex patterns that is used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string  Default value: `None`  | 
| use\_losses\_collection |  If set to `True`, looks for losses in the collection named "losses" when the collection is present. **Optional** Valid values: Boolean Default value: `True`  | 
| num\_steps |  The minimum number of steps after which the rule checks if the loss has decreased. Rule evaluation happens every `num_steps`. The rule compares the loss for this step with the loss at a step which is at least `num_steps` behind the current step. For example, suppose that the loss is being saved every three steps, but `num_steps` is set to 10. At step 21, loss for step 21 is compared with loss for step 9. The next step at which loss is checked is step 33, because ten steps after step 21 is step 31, and at step 31 and step 32 loss is not saved.  **Optional** Valid values: Integer Default value: `10`  | 
| diff\_percent |  The minimum percentage difference by which the loss should decrease between `num_steps`. **Optional** Valid values: `0.0` < float < `100` Default value: `0.1` (in percentage)  | 
| increase\_threshold\_percent |  The maximum threshold percent that loss is allowed to increase in case loss has been increasing. **Optional** Valid values: `0` < float < `100` Default value: `5` (in percentage)  | 
| mode |  The name of the Debugger mode to query tensor values for rule checking. If this is not passed, the rule checks in order by default for the `mode.EVAL`, then `mode.TRAIN`, and then `mode.GLOBAL`.  **Optional** Valid values: String (`EVAL`, `TRAIN`, or `GLOBAL`) Default value: `GLOBAL`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.loss_not_decreasing(),
        rule_parameters={
                "tensor_regex": ".*",
                "use_losses_collection": "True",
                "num_steps": "10",
                "diff_percent": "0.1",
                "increase_threshold_percent": "5",
                "mode": "GLOBAL"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="losses", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## Overfit
<a name="overfit"></a>

This rule detects if your model is being overfit to the training data by comparing the validation and training losses.

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
A standard way to prevent overfitting is to regularize your model.

Parameter Descriptions for the Overfit Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| tensor\_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string  Default value: None  | 
| start\_step |  The step from which to start comparing the validation and training loss. **Optional** Valid values: Integer Default value: `0`  | 
| patience |  The number of steps for which the `ratio_threshold` is allowed to exceed the value set before the model is considered to be overfit. **Optional** Valid values: Integer Default value: `1`  | 
| ratio\_threshold |  The maximum ratio of the difference between the mean validation loss and mean training loss to the mean training loss. If this threshold is exceeded for a `patience` number of steps, the model is being overfit and the rule returns `True`. **Optional** Valid values: Float Default value: `0.1`  | 
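
The core ratio the rule evaluates can be sketched as follows. This is an illustrative reimplementation of the comparison described above, not Debugger's actual code:

```python
def is_overfit(train_losses, val_losses, ratio_threshold=0.1):
    """Flag overfitting when (mean validation loss - mean training loss)
    exceeds ratio_threshold times the mean training loss."""
    mean_train = sum(train_losses) / len(train_losses)
    mean_val = sum(val_losses) / len(val_losses)
    return (mean_val - mean_train) / mean_train > ratio_threshold
```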

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.overfit(),
        rule_parameters={
                "tensor_regex": ".*",
                "start_step": "0",
                "patience": "1",
                "ratio_threshold": "0.1"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses", 
                parameters={
                    "train.save_interval": "100",
                    "eval.save_interval": "10"
                } 
            )
        ]
    )
]
```

## Overtraining
<a name="overtraining"></a>

This rule detects if a model is being overtrained. After a number of training iterations on a well-behaved model (both training and validation loss decrease), the model approaches a minimum of the loss function and no longer improves. If the model continues training, the validation loss can start increasing because the model starts overfitting. This rule sets up thresholds and conditions to determine whether the model is no longer improving, and prevents overfitting problems due to overtraining.
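
A patience-based check of this kind can be sketched as follows. This is an illustrative sketch of the idea (not Debugger's implementation), using the `patience` and `delta` parameters described in the table below:

```python
def is_overtraining(val_losses, patience=10, delta=0.01):
    """Flag overtraining when the validation loss has not improved by at
    least delta for more than patience consecutive steps."""
    best = float("inf")
    steps_without_improvement = 0
    for loss in val_losses:
        if loss < best - delta:
            best = loss
            steps_without_improvement = 0
        else:
            steps_without_improvement += 1
            if steps_without_improvement > patience:
                return True
    return False
```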

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
Overtraining can be avoided by early stopping. For information on early stopping, see [Stop Training Jobs Early](automatic-model-tuning-early-stopping.md). For an example that shows how to use spot training with Debugger, see [Enable Spot Training with Amazon SageMaker Debugger](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/mxnet_spot_training/mxnet-spot-training-with-sagemakerdebugger.html). 

Parameter Descriptions for the Overtraining Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| patience\_train |  The number of steps to wait before the training loss is considered to no longer be improving. **Optional** Valid values: Integer Default value: `5`  | 
| patience\_validation |  The number of steps to wait before the validation loss is considered to no longer be improving. **Optional** Valid values: Integer Default value: `10`  | 
| delta |  The minimum threshold by which the error should improve before it is considered a new optimum. **Optional** Valid values: Float Default value: `0.01`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.overtraining(),
        rule_parameters={
                "patience_train": "5",
                "patience_validation": "10",
                "delta": "0.01"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="losses", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## SimilarAcrossRuns
<a name="similar-across-runs"></a>

This rule compares tensors gathered from a base trial with tensors from another trial. 

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the SimilarAcrossRuns Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| other\_trials |  A completed training job name whose tensors you want to compare with the tensors gathered from the current `base_trial`. **Required** Valid values: String  | 
| collection\_names |  The list of collection names whose tensors the rule inspects. **Optional** Valid values: List of strings or a comma-separated string Default value: None  | 
| tensor\_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string  Default value: None  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.similar_across_runs(),
        rule_parameters={
                "other_trials": "<specify-another-job-name>",
                "collection_names": "losses",
                "tensor_regex": ".*"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="losses", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## StalledTrainingRule
<a name="stalled-training"></a>

StalledTrainingRule detects if no progress is being made on a training job, and stops the training job if the rule fires. This rule requires tensors to be saved periodically within a time interval defined by its `threshold` parameter. The rule continually monitors for new tensors, and fires if no new tensor has been emitted within the threshold interval. 
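
The stall condition itself reduces to a simple timestamp check. The following is an illustrative sketch of that condition only; the actual rule monitors the trial's tensor stream:

```python
import time

def training_is_stalled(last_tensor_time, threshold=1800, now=None):
    """Fire when no new tensor has been emitted for threshold seconds."""
    if now is None:
        now = time.time()
    return (now - last_tensor_time) > threshold
```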

Parameter Descriptions for the StalledTrainingRule Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| threshold |  A threshold that defines how much time, in seconds, the rule waits for a tensor output before it fires a stalled training issue. The default value is 1800 seconds. **Optional** Valid values: Integer Default value: `1800`  | 
| stop\_training\_on\_fire |  If set to `True`, watches whether the base training job outputs tensors within `threshold` seconds. **Optional** Valid values: Boolean Default value: `False`  | 
| training\_job\_name\_prefix |  The prefix of the base training job name. If `stop_training_on_fire` is true, the rule searches for SageMaker training jobs with this prefix in the same account. If inactivity is found, the rule takes a `StopTrainingJob` action. Note that if multiple jobs are found with the same prefix, the rule skips termination. It is important that the prefix is unique for each training job. **Optional** Valid values: String  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
                "threshold": "1800",
                "stop_training_on_fire": "True",
                "training_job_name_prefix": "<specify-training-base-job-name>"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="losses", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## TensorVariance
<a name="tensor-variance"></a>

This rule detects if you have tensors with very high or low variances. Very high or low variances in a tensor could lead to neuron saturation, which reduces the learning ability of the neural network. Very high variance in tensors can also eventually lead to exploding tensors. Use this rule to detect such issues early.
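
The bounds check can be sketched as follows. This is an illustrative reimplementation mirroring the `min_threshold` and `max_threshold` parameters described below, not Debugger's code:

```python
import numpy as np

def variance_out_of_bounds(tensor, min_threshold=None, max_threshold=None):
    """Flag a tensor whose variance falls outside the configured bounds.
    Both bounds are optional, as in the rule."""
    var = float(np.var(tensor))
    if max_threshold is not None and var > max_threshold:
        return True
    if min_threshold is not None and var < min_threshold:
        return True
    return False
```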

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either the `collection_names` or `tensor_regex` parameter. If both the parameters are specified, the rule inspects the union of tensors from both sets.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the TensorVariance Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| collection\_names |  The list of collection names whose tensors the rule inspects. **Optional** Valid values: List of strings or a comma-separated string Default value: None  | 
| tensor\_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string  Default value: None  | 
| max\_threshold |  The threshold for the upper bound of tensor variance. **Optional** Valid values: Float Default value: None  | 
| min\_threshold |  The threshold for the lower bound of tensor variance. **Optional** Valid values: Float Default value: None  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.tensor_variance(),
        rule_parameters={
                "collection_names": "weights",
                "max_threshold": "10",
                "min_threshold": "0.00001",
        },
        collections_to_save=[ 
            CollectionConfig(
                name="weights", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## UnchangedTensor
<a name="unchanged-tensor"></a>

This rule detects whether a tensor is no longer changing across steps. 

This rule runs the [numpy.allclose](https://docs.scipy.org/doc/numpy/reference/generated/numpy.allclose.html) method to check if the tensor isn't changing.
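
The comparison reduces to a tolerance check between two snapshots of the same tensor. The following illustrative helper mirrors the rule's `rtol`, `atol`, and `equal_nan` parameters described below:

```python
import numpy as np

def tensor_unchanged(a, b, rtol=1e-05, atol=1e-08, equal_nan=False):
    """Two snapshots count as "unchanged" when numpy.allclose holds
    within the given tolerances."""
    return np.allclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan)
```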

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either the `collection_names` or `tensor_regex` parameter. If both the parameters are specified, the rule inspects the union of tensors from both sets.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the UnchangedTensor Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| collection\_names |  The list of collection names whose tensors the rule inspects. **Optional** Valid values: List of strings or a comma-separated string Default value: None  | 
| tensor\_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string  Default value: None  | 
| num\_steps |  The number of steps across which the rule checks to determine if the tensor has changed.  The rule checks the last `num_steps` that are available; they don't need to be consecutive. If `num_steps` is 2, at step s it doesn't necessarily check steps s-1 and s. If s-1 isn't available, it checks the last available step along with s. **Optional** Valid values: Integer Default value: `3`  | 
| rtol |  The relative tolerance parameter to be passed to the [`numpy.allclose`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.allclose.html) method.  **Optional** Valid values: Float Default value: `1e-05`  | 
| atol |  The absolute tolerance parameter to be passed to the [`numpy.allclose`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.allclose.html) method. **Optional** Valid values: Float Default value: `1e-08`  | 
| equal\_nan |  Whether to compare NaNs as equal. If `True`, NaNs in input array a are considered equal to NaNs in input array b in the output array. This parameter is passed to the [`numpy.allclose`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.allclose.html) method. **Optional** Valid values: Boolean Default value: `False`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.unchanged_tensor(),
        rule_parameters={
                "collection_names": "losses",
                "tensor_regex": "",
                "num_steps": "3",
                "rtol": "1e-05",
                "atol": "1e-08",
                "equal_nan": "False"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="losses", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## CheckInputImages
<a name="checkinput-mages"></a>

This rule checks if input images have been correctly normalized. Specifically, it detects if the mean of the sample data differs by more than a threshold value from zero. Many computer vision models require that input data has a zero mean and unit variance.
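
The normalization check can be sketched as follows. This illustrative helper mirrors the `threshold_mean` parameter described below; it is not Debugger's implementation:

```python
import numpy as np

def input_mean_off_center(image_batch, threshold_mean=0.2):
    """Flag input data whose sample mean deviates from zero by more
    than threshold_mean."""
    return abs(float(np.mean(image_batch))) > threshold_mean
```

Raw 0–255 pixel values fail this check, while data normalized to zero mean passes it.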

This rule is applicable to deep learning applications.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the CheckInputImages Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| threshold\_mean |  A threshold that defines by how much the mean of the input data can differ from 0. **Optional** Valid values: Float Default value: `0.2`  | 
| threshold\_samples |  The number of images that have to be sampled before an error can be thrown. If the value is too low, the estimation of the dataset mean will be inaccurate. **Optional** Valid values: Integer Default value: `500`  | 
| regex |  The name of the input data tensor. **Optional** Valid values: String Default value: `".*hybridsequential0_input_0"` (the name of the input tensor for Apache MXNet models using HybridSequential)  | 
| channel |  The position of the color channel in the input tensor shape array.  **Optional** Valid values: Integer Default value: `1` (for example, MXNet expects input data in the form of (batch\_size, channel, height, width))  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.check_input_images(),
        rule_parameters={
                "threshold_mean": "0.2",
                "threshold_samples": "500",
                "regex": ".*hybridsequential0_input_0",
                "channel": "1"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="custom_inputs_collection", 
                parameters={
                    "include_regex": ".*hybridsequential0_input_0",
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## NLPSequenceRatio
<a name="nlp-sequence-ratio"></a>

This rule calculates the ratio of specific tokens to the rest of the input sequence, which is useful for optimizing performance. For example, you can calculate the percentage of padding end-of-sentence (EOS) tokens in your input sequence. If the number of EOS tokens is too high, an alternate bucketing strategy should be used. You can also calculate the percentage of unknown tokens in your input sequence. If the number of unknown words is too high, an alternate vocabulary could be used.
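
The ratio check can be sketched as follows. This illustrative helper mirrors the `token_values` and `token_thresholds_percent` parameters described below, assuming token value 0 stands for padding/EOS:

```python
def token_ratio_exceeded(sequence, token_values=(0,), thresholds_percent=(50.0,)):
    """For each token value, check whether its share of the sequence
    exceeds the paired threshold percentage."""
    n = len(sequence)
    for value, threshold in zip(token_values, thresholds_percent):
        percent = 100.0 * sum(1 for t in sequence if t == value) / n
        if percent > threshold:
            return True
    return False
```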

This rule is applicable to deep learning applications.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the NLPSequenceRatio Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| tensor\_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string  Default value: `".*embedding0_input_0"` (assuming an embedding as the initial layer of the network)  | 
| token\_values |  A string of a list of the numerical values of the tokens. For example, "3, 0". **Optional** Valid values: Comma-separated string of numerical values Default value: `0`  | 
| token\_thresholds\_percent |  A string of a list of thresholds (in percentages) that correspond to each of the `token_values`. For example, "50.0, 50.0". **Optional** Valid values: Comma-separated string of floats Default value: `"50"`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.nlp_sequence_ratio(),
        rule_parameters={
                "tensor_regex": ".*embedding0_input_0",
                "token_values": "0",
                "token_thresholds_percent": "50"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="custom_inputs_collection", 
                parameters={
                    "include_regex": ".*embedding0_input_0"
                } 
            )
        ]
    )
]
```

## Confusion
<a name="confusion"></a>

This rule evaluates the goodness of a confusion matrix for a classification problem.

It creates a matrix of size `category_no*category_no` and populates it with data coming from (`labels`, `predictions`) pairs. For each (`labels`, `predictions`) pair, the count in `confusion[labels][predictions]` is incremented by 1. When the matrix is fully populated, the ratios of on-diagonal and off-diagonal values are evaluated as follows:
+ For elements on the diagonal: `confusion[i][i]/sum_j(confusion[j][j])>=min_diag`
+ For elements off the diagonal: `confusion[j][i]/sum_j(confusion[j][i])<=max_off_diag`
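
The population step described above can be sketched as follows; the sample labels and predictions are illustrative:

```python
import numpy as np

category_no = 3
labels      = [0, 0, 1, 2, 2, 2]
predictions = [0, 1, 1, 2, 2, 0]

# Increment confusion[label][prediction] for each (label, prediction) pair.
confusion = np.zeros((category_no, category_no), dtype=int)
for y, p in zip(labels, predictions):
    confusion[y][p] += 1
```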

This rule can be applied to the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the Confusion Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| category\_no |  The number of categories. **Optional** Valid values: Integer ≥2 Default value: `"None"`  | 
| labels |  The `labels` tensor collection or a 1-d vector of true labels.  **Optional** Valid values: String Default value: `"labels"`  | 
| predictions |  The `predictions` tensor collection or a 1-d vector of estimated labels.  **Optional** Valid values: String Default value: `"predictions"`  | 
| labels\_collection |  The rule inspects the tensors in this collection for `labels`. **Optional** Valid values: String Default value: `"labels"`  | 
| predictions\_collection |  The rule inspects the tensors in this collection for `predictions`. **Optional** Valid values: String Default value: `"predictions"`  | 
| min\_diag |  The minimum threshold for the ratio of data on the diagonal. **Optional** Valid values: `0`≤float≤`1` Default value: `0.9`  | 
| max\_off\_diag |  The maximum threshold for the ratio of data off the diagonal. **Optional** Valid values: `0`≤float≤`1` Default value: `0.1`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.confusion(),
        rule_parameters={
                "category_no": "10",
                "labels": "labels",
                "predictions": "predictions",
                "labels_collection": "labels",
                "predictions_collection": "predictions",
                "min_diag": "0.9",
                "max_off_diag": "0.1"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="labels",
                parameters={
                    "save_interval": "500"
                } 
            ),
            CollectionConfig(
                name="predictions",
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

**Note**  
This rule infers default values for the optional parameters if their values aren't specified.

## FeatureImportanceOverweight
<a name="feature_importance_overweight"></a>

This rule accumulates the weights of the n largest feature importance values per step and ensures that they do not exceed the threshold. For example, you can set the threshold for the top 3 features to not hold more than 80 percent of the total weights of the model.
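
The check can be sketched as follows. This illustrative helper mirrors the `nfeatures` and `threshold` parameters described below:

```python
def top_features_overweight(importances, nfeatures=3, threshold=0.8):
    """Do the nfeatures largest importance values account for more than
    threshold (as a fraction) of the total weight?"""
    top = sorted(importances, reverse=True)[:nfeatures]
    return sum(top) / sum(importances) > threshold
```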

This rule is valid only for the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the FeatureImportanceOverweight Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| threshold |  Defines the threshold for the proportion of the cumulative sum of the `n` largest features. The number `n` is defined by the `nfeatures` parameter. **Optional** Valid values: Float Default value: `0.8`  | 
| nfeatures |  The number of largest features. **Optional** Valid values: Integer Default value: `3`  | 
| tensor\_regex |  Regular expression (regex) of the tensor names the rule analyzes. **Optional** Valid values: String Default value: `".*feature_importance/weight"`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.feature_importance_overweight(),
        rule_parameters={
                "threshold": "0.8",
                "nfeatures": "3",
                "tensor_regex": ".*feature_importance/weight"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="feature_importance", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## TreeDepth
<a name="tree-depth"></a>

This rule measures the depth of trees in an XGBoost model. XGBoost rejects splits if they do not improve loss. This regularizes the training. As a result, the tree might not grow as deep as defined by the `depth` parameter.
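
The depth computation described above can be sketched as follows. This is an illustrative helper; it assumes XGBoost's breadth-first node numbering, where the root is node 0:

```python
import math

def tree_depth(node_ids):
    """Compute tree depth as the floored base-2 logarithm of the
    largest node ID, as described above."""
    return math.floor(math.log2(max(node_ids)))
```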

This rule is valid only for the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the TreeDepth Rule


| Parameter Name | Description | 
| --- | --- | 
| base\_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| depth |  The depth of the tree. The depth of the tree is obtained by computing the base 2 logarithm of the largest node ID. **Optional** Valid values: Float Default value: `4`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.tree_depth(),
        rule_parameters={
                "depth": "4"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="tree", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

# Creating custom rules using the Debugger client library
<a name="debugger-custom-rules"></a>

You can create custom rules to monitor your training job using the Debugger rule APIs and the open source [`smdebug` Python library](https://github.com/awslabs/sagemaker-debugger/) that provide tools to build your own rule containers.

## Prerequisites for creating a custom rule
<a name="debugger-custom-rules-prerequisite"></a>

To create Debugger custom rules, you need the following prerequisites.
+ [SageMaker Debugger Rule.custom API](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.Rule.custom)
+ [The open source smdebug Python library](https://github.com/awslabs/sagemaker-debugger/)
+ Your own custom rule Python script
+ [Amazon SageMaker Debugger image URIs for custom rule evaluators](debugger-reference.md#debuger-custom-rule-registry-ids)

**Topics**
+ [Prerequisites for creating a custom rule](#debugger-custom-rules-prerequisite)
+ [Use the `smdebug` client library to create a custom rule as a Python script](debugger-custom-rules-python-script.md)
+ [Use the Debugger APIs to run your own custom rules](debugger-custom-rules-python-sdk.md)

# Use the `smdebug` client library to create a custom rule as a Python script
<a name="debugger-custom-rules-python-script"></a>

The `smdebug` Rule API provides an interface to set up your own custom rules. The following Python script is a sample of how to construct a custom rule, `CustomGradientRule`. This tutorial custom rule watches whether the gradients are getting too large and sets the default threshold to 10. The custom rule takes a base trial created by a SageMaker AI estimator when it initiates a training job. 

```
from smdebug.rules.rule import Rule

class CustomGradientRule(Rule):
    def __init__(self, base_trial, threshold=10.0):
        super().__init__(base_trial)
        self.threshold = float(threshold)

    def invoke_at_step(self, step):
        for tname in self.base_trial.tensor_names(collection="gradients"):
            t = self.base_trial.tensor(tname)
            abs_mean = t.reduction_value(step, "mean", abs=True)
            if abs_mean > self.threshold:
                return True
        return False
```

You can add as many custom rule classes as you want in the same Python script and deploy them to any training job trials by constructing custom rule objects as described in the following section.

# Use the Debugger APIs to run your own custom rules
<a name="debugger-custom-rules-python-sdk"></a>

The following code sample shows how to configure a custom rule with the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). This example assumes that the custom rule script you created in the previous step is located at '*path/to/my\_custom\_rule.py*'.

```
from sagemaker.debugger import Rule, CollectionConfig

custom_rule = Rule.custom(
    name='MyCustomRule',
    image_uri='759209512951.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rule-evaluator:latest', 
    instance_type='ml.t3.medium',     
    source='path/to/my_custom_rule.py', 
    rule_to_invoke='CustomGradientRule',     
    collections_to_save=[CollectionConfig("gradients")], 
    rule_parameters={"threshold": "20.0"}
)
```

The following list explains the Debugger `Rule.custom` API arguments.
+ `name` (str): Specify a name for your custom rule.
+ `image_uri` (str): This is the image of the container that has the logic of understanding your custom rule. It sources and evaluates the specified tensor collections you save in the training job. You can find the list of open source SageMaker AI rule evaluator images in [Amazon SageMaker Debugger image URIs for custom rule evaluators](debugger-reference.md#debuger-custom-rule-registry-ids).
+ `instance_type` (str): Specify the type of instance on which to build the rule Docker container. The instance is spun up in parallel with the training container.
+ `source` (str): This is the local path or the Amazon S3 URI to your custom rule script.
+ `rule_to_invoke` (str): This specifies the particular Rule class implementation in your custom rule script. SageMaker AI supports only one rule to be evaluated at a time in a rule job.
+ `collections_to_save` (list): This specifies which tensor collections you save for the rule to run.
+ `rule_parameters` (dictionary): This accepts parameter inputs in a dictionary format. You can adjust the parameters that you configured in the custom rule script.

After you set up the `custom_rule` object, you can use it to build a SageMaker AI estimator for any training job. Specify the `entry_point` to your training script. You do not need to make any changes to your training script.

```
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
                role=sagemaker.get_execution_role(),
                base_job_name='smdebug-custom-rule-demo-tf-keras',
                entry_point='path/to/your_training_script.py',
                train_instance_type='ml.p2.xlarge',
                ...
                
                # debugger-specific arguments below
                rules = [custom_rule]
)

estimator.fit()
```

For more variations and advanced examples of using Debugger custom rules, see the following example notebooks.
+ [Monitor your training job with Amazon SageMaker Debugger custom rules](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_keras_custom_rule/tf-keras-custom-rule.html)
+ [PyTorch iterative model pruning of ResNet and AlexNet](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/pytorch_iterative_model_pruning)
+ [Trigger Amazon CloudWatch Events using Debugger Rules to Take an Action Based on Training Status with TensorFlow](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow_action_on_rule)

# Use Debugger with custom training containers
<a name="debugger-bring-your-own-container"></a>

Amazon SageMaker Debugger is available for any deep learning models that you bring to Amazon SageMaker AI. The AWS CLI, SageMaker AI `Estimator` API, and the Debugger APIs enable you to use any Docker base image to build and customize containers to train your models. To use Debugger with customized containers, you need to make a minimal change to your training script to implement the Debugger hook callback and retrieve tensors from training jobs. The following sections walk you through how to use Debugger with custom training containers.

You need the following resources to build a customized container with Debugger.
+ [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable)
+ [The SMDebug open source client library](https://github.com/awslabs/sagemaker-debugger)
+ A Docker base image of your choice
+ Your training script with a Debugger hook registered – For more information about registering a Debugger hook to your training script, see [Register Debugger hook to your training script](#debugger-script-mode).

For an end-to-end example of using Debugger with a custom training container, see the following example notebook.
+ [Build a Custom Training Container and Debug Training Jobs with Debugger](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/build_your_own_container_with_debugger/debugger_byoc.html)

**Tip**  
This custom container with Debugger guide is an extension of the [Adapting your own training container](adapt-training-container.md) guide, which walks you through how to build and push your custom training container to Amazon ECR.

## Prepare to build a custom training container
<a name="debugger-bring-your-own-container-1"></a>

To build a Docker container, the basic file structure should look like the following:

```
├── debugger_custom_container_test_notebook.ipynb      # a notebook to run python snippet codes
└── debugger_custom_container_test_folder              # this is a docker folder
    ├──  your-training-script.py                       # your training script with Debugger hook
    └──  Dockerfile                                    # a Dockerfile to build your own container
```

## Register Debugger hook to your training script
<a name="debugger-script-mode"></a>

To debug your model training, you need to add a Debugger hook to your training script.

**Note**  
This step is required to collect model parameters (output tensors) for debugging your model training. If you only want to monitor and profile, you can skip this hook registration step and exclude the `debugger_hook_config` parameter when constructing an estimator.

The following example code shows the structure of a training script using the Keras ResNet50 model and how to pass the Debugger hook as a Keras callback for debugging. To find a complete training script, see [TensorFlow training script with SageMaker Debugger hook](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-debugger/build_your_own_container_with_debugger/docker/tf_keras_resnet_byoc.py).

```
# An example of training script (your-training-script.py)
import tensorflow.compat.v2 as tf
from tensorflow.keras.applications.resnet50 import ResNet50
import smdebug.tensorflow as smd

def train(batch_size, epoch, model, hook):

    ...
    model.fit(X_train, Y_train,
              batch_size=batch_size,
              epochs=epoch,
              validation_data=(X_valid, Y_valid),
              shuffle=True,

              # smdebug modification: Pass the Debugger hook in the main() as a Keras callback
              callbacks=[hook])

def main():
    parser=argparse.ArgumentParser(description="Train resnet50 cifar10")

    # hyperparameter settings
    parser.add_argument(...)
    
    args = parser.parse_args()

    model=ResNet50(weights=None, input_shape=(32,32,3), classes=10)

    # Add the following line to register the Debugger hook for Keras.
    hook=smd.KerasHook.create_from_json_file()

    # Start the training.
    train(args.batch_size, args.epoch, model, hook)

if __name__ == "__main__":
    main()
```

For more information about registering the Debugger hook for the supported frameworks and algorithms, see the following links in the SMDebug client library:
+ [SMDebug TensorFlow hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/tensorflow.md)
+ [SMDebug PyTorch hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/pytorch.md)
+ [SMDebug MXNet hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/mxnet.md)
+ [SMDebug XGBoost hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/xgboost.md)

The training scripts in the following example notebooks show in more detail how to add Debugger hooks to training scripts and collect output tensors:
+ [ Debugger in script mode with the TensorFlow 2.1 framework](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow2/tensorflow2_keras_custom_container/tf2-keras-custom-container.html)

  To see the difference between using Debugger in a Deep Learning Container and in script mode, open this notebook side by side with [ the previous Debugger in a Deep Learning Container TensorFlow v2.1 notebook example](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow2/tensorflow2_zero_code_change/tf2-keras-default-container.html).

   In script mode, the hook configuration part is removed from the script in which you set the estimator. Instead, the Debugger hook feature is merged into the training script, [ TensorFlow Keras ResNet training script in script mode](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/tensorflow2/tensorflow2_keras_custom_container/src/tf_keras_resnet_byoc.py). The training script imports the `smdebug` library in the required TensorFlow Keras environment to communicate with the TensorFlow ResNet50 algorithm. It also manually implements the `smdebug` hook functionality by adding the `callbacks=[hook]` argument inside the `train` function (in line 49), and by adding the manual hook configuration (in line 89) provided through SageMaker Python SDK.

  This script mode example runs the training job in the TF 2.1 framework for direct comparison with the zero script change in the TF 2.1 example. The benefit of setting up Debugger in script mode is the flexibility to choose framework versions not covered by AWS Deep Learning Containers. 
+ [ Using Amazon SageMaker Debugger in a PyTorch Container in Script Mode ](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/pytorch_custom_container)

  This notebook enables Debugger in script mode in PyTorch v1.3.1 framework. PyTorch v1.3.1 is supported by SageMaker AI containers, and this example shows details of how to modify a training script. 

  The SageMaker AI PyTorch estimator is already in script mode by default. In the notebook, the line to activate `script_mode` is not included in the estimator configuration.

  This notebook shows detailed steps to change [the original PyTorch training script](https://github.com/pytorch/examples/blob/master/mnist/main.py) to a modified version to enable Debugger. Additionally, this example shows how you can use Debugger built-in rules to detect training issues such as the vanishing gradients problem, and the Debugger trial features to call and analyze the saved tensors. 

## Create and configure a Dockerfile
<a name="debugger-bring-your-own-container-2"></a>

Open your SageMaker AI JupyterLab and create a new folder, `debugger_custom_container_test_folder` in this example, to save your training script and `Dockerfile`. The following code example is a `Dockerfile` that includes essential docker build commands. Paste the following code into the `Dockerfile` text file and save it. Upload your training script to the same folder.

```
# Specify a docker base image
FROM tensorflow/tensorflow:2.2.0rc2-gpu-py3
RUN /usr/bin/python3 -m pip install --upgrade pip
RUN pip install --upgrade protobuf

# Install required packages to enable the SageMaker Python SDK and the smdebug library
RUN pip install sagemaker-training
RUN pip install smdebug
CMD ["bin/bash"]
```

If you want to use a pre-built AWS Deep Learning Container image, see [Available AWS Deep Learning Containers Images](https://aws.amazon.com/releasenotes/available-deep-learning-containers-images/).

## Build and push the custom training image to Amazon ECR
<a name="debugger-bring-your-own-container-3"></a>

Create a test notebook, `debugger_custom_container_test_notebook.ipynb`, and run the following code in the notebook cell. This accesses the `debugger_custom_container_test_folder` directory, builds the Docker image with the specified `ecr_repository` name, and pushes the image to your Amazon ECR repository.

```
import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'sagemaker-debugger-mnist-byoc-tf2'
tag = ':latest'

region = boto3.session.Session().region_name

uri_suffix = 'amazonaws.com'
if region in ['cn-north-1', 'cn-northwest-1']:
    uri_suffix = 'amazonaws.com.cn'
byoc_image_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)

!docker build -t $ecr_repository debugger_custom_container_test_folder
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $byoc_image_uri
!docker push $byoc_image_uri
```

**Tip**  
If you use one of the AWS Deep Learning Container base images, run the following code to log in to Amazon ECR and access the Deep Learning Container image repository.  

```
! aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
```

## Run and debug training jobs using the custom training container
<a name="debugger-bring-your-own-container-4"></a>

After you build and push your Docker container to Amazon ECR, configure a SageMaker AI estimator with your training script and the Debugger-specific parameters. After you run `estimator.fit()`, Debugger collects output tensors, monitors them, and detects training issues. Using the saved tensors, you can further analyze the training job with the `smdebug` core features and tools. By configuring a Debugger rule monitoring workflow with Amazon CloudWatch Events and AWS Lambda, you can also automate stopping a training job whenever the Debugger rules spot training issues.

```
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.debugger import (
    Rule, ProfilerRule, DebuggerHookConfig, ProfilerConfig, CollectionConfig, rule_configs
)

profiler_config=ProfilerConfig(...)
debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule()),
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=Estimator(
    image_uri=byoc_image_uri,
    entry_point="./debugger_custom_container_test_folder/your-training-script.py",
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-custom-container-test',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    
    # Debugger-specific parameters
    profiler_config=profiler_config,
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

# start training
estimator.fit()
```
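As noted above, you can pair Debugger rule statuses with Amazon CloudWatch Events and AWS Lambda to stop a job automatically. The following is a minimal sketch of such a Lambda handler, assuming the forwarded event carries the training job name and its `DebugRuleEvaluationStatuses`; the event shape is abbreviated, and the injectable `sagemaker_client` argument is for illustration and testing only.

```python
def lambda_handler(event, context, sagemaker_client=None):
    """Stop a training job when any Debugger rule reports IssuesFound.

    Sketch only: assumes a CloudWatch Events / EventBridge rule forwards
    SageMaker training-job state-change events to this function.
    """
    detail = event.get("detail", {})
    job_name = detail.get("TrainingJobName")
    statuses = detail.get("DebugRuleEvaluationStatuses", [])

    if any(s.get("RuleEvaluationStatus") == "IssuesFound" for s in statuses):
        if sagemaker_client is None:
            import boto3  # deferred so the handler is testable without AWS
            sagemaker_client = boto3.client("sagemaker")
        sagemaker_client.stop_training_job(TrainingJobName=job_name)
        return {"stopped": job_name}
    return {"stopped": None}
```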

# Configure Debugger using SageMaker API
<a name="debugger-createtrainingjob-api"></a>

 The preceding topics focus on using Debugger through the Amazon SageMaker Python SDK, which is a wrapper around the AWS SDK for Python (Boto3) and SageMaker API operations, and offers a high-level experience of accessing them. If you need to manually configure the SageMaker API operations using AWS Boto3 or the AWS Command Line Interface (AWS CLI), or through other SDKs such as Java, Go, and C++, this section covers how to configure the following low-level API operations.

**Topics**
+ [JSON (AWS CLI)](debugger-built-in-rules-api.CLI.md)
+ [SDK for Python (Boto3)](debugger-built-in-rules-api.Boto3.md)

# JSON (AWS CLI)
<a name="debugger-built-in-rules-api.CLI"></a>

Amazon SageMaker Debugger built-in rules can be configured for a training job using the [DebugHookConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugHookConfig.html), [DebugRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugRuleConfiguration.html), [ProfilerConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerConfig.html), and [ProfilerRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerRuleConfiguration.html) objects through the SageMaker AI [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API operation. You need to specify the right image URI in the `RuleEvaluatorImage` parameter, and the following examples walk you through how to set up the JSON strings to request [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html).

The following code shows a complete JSON template to run a training job with required settings and Debugger configurations. Save the template as a JSON file in your working directory and run the training job using AWS CLI. For example, save the following code as `debugger-training-job-cli.json`.

**Note**  
Ensure that you use the correct Docker container images. To find AWS Deep Learning Container images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules).

```
{
   "TrainingJobName": "debugger-aws-cli-test",
   "RoleArn": "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-YYYYMMDDT123456",
   "AlgorithmSpecification": {
      // Specify a training Docker container image URI (Deep Learning Container or your own training container) to TrainingImage.
      "TrainingImage": "763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04",
      "TrainingInputMode": "File",
      "EnableSageMakerMetricsTimeSeries": false
   },
   "HyperParameters": {
      "sagemaker_program": "entry_point/tf-hvd-train.py",
      "sagemaker_submit_directory": "s3://sagemaker-us-west-2-111122223333/debugger-boto3-profiling-test/source.tar.gz"
   },
   "OutputDataConfig": { 
      "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/output"
   },
   "DebugHookConfig": { 
      "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/debug-output",
      "CollectionConfigurations": [
         {
            "CollectionName": "losses",
            "CollectionParameters" : {
                "train.save_interval": "50"
            }
         }
      ]
   },
   "DebugRuleConfigurations": [ 
      { 
         "RuleConfigurationName": "LossNotDecreasing",
         "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
         "RuleParameters": {"rule_to_invoke": "LossNotDecreasing"}
      }
   ],
   "ProfilerConfig": { 
      "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/profiler-output",
      "ProfilingIntervalInMilliseconds": 500,
      "ProfilingParameters": {
          "DataloaderProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"MetricsRegex\": \".*\"}",
          "DetailedProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3}",
          "PythonProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName\": \"cProfile\", \"cProfileTimer\": \"total_time\"}",
          "LocalPath": "/opt/ml/output/profiler/" 
      }
   },
   "ProfilerRuleConfigurations": [ 
      { 
         "RuleConfigurationName": "ProfilerReport",
         "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
         "RuleParameters": {"rule_to_invoke": "ProfilerReport"}
      }
   ],
   "ResourceConfig": { 
      "InstanceType": "ml.p3.8xlarge",
      "InstanceCount": 1,
      "VolumeSizeInGB": 30
   },
   
   "StoppingCondition": { 
      "MaxRuntimeInSeconds": 86400
   }
}
```

After saving the JSON file, run the following command in your terminal. (Use `!` at the beginning of the line if you use a Jupyter notebook.)

```
aws sagemaker create-training-job --cli-input-json file://debugger-training-job-cli.json
```
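Before submitting the file, a few lines of Python can sanity-check that the template parses as JSON and contains the required top-level keys. This is a sketch; the required-key list below covers only the fields shown in the template above.

```python
import json

REQUIRED_KEYS = [
    "TrainingJobName", "RoleArn", "AlgorithmSpecification",
    "OutputDataConfig", "ResourceConfig", "StoppingCondition",
]

def missing_keys(config):
    """Return the required CreateTrainingJob keys absent from the config dict."""
    return [key for key in REQUIRED_KEYS if key not in config]

# In practice, load the file you saved:
#   with open("debugger-training-job-cli.json") as f:
#       config = json.load(f)
config = json.loads('{"TrainingJobName": "debugger-aws-cli-test"}')
print(missing_keys(config))
```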

## To configure a Debugger rule for debugging model parameters
<a name="debugger-built-in-rules-api-debug.CLI"></a>

The following code samples show how to configure a built-in `VanishingGradient` rule using this SageMaker API. 

**To enable Debugger to collect output tensors**

Specify the Debugger hook configuration as follows:

```
"DebugHookConfig": {
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
    "CollectionConfigurations": [
        {
            "CollectionName": "gradients",
            "CollectionParameters" : {
                "save_interval": "500"
            }
        }
    ]
}
```

This makes the training job save the `gradients` tensor collection every 500 steps, as set by `save_interval`. To find available `CollectionName` values, see [Debugger Built-in Collections](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#built-in-collections) in the *SMDebug client library documentation*. To find available `CollectionParameters` parameter keys and values, see the [sagemaker.debugger.CollectionConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig) class in the *SageMaker Python SDK documentation*.
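With `save_interval` set to 500, the hook records the collection at steps 0, 500, 1000, and so on, which makes it easy to reason about how many snapshots a run produces. A quick sketch, assuming saves happen at every multiple of the interval starting from step 0:

```python
def saved_steps(total_steps, save_interval):
    # Steps at which the hook writes the collection, assuming a save at
    # every multiple of save_interval beginning with step 0.
    return [step for step in range(total_steps) if step % save_interval == 0]

print(saved_steps(2000, 500))  # [0, 500, 1000, 1500]
```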

**To enable Debugger rules for debugging the output tensors**

The following `DebugRuleConfigurations` API example shows how to run the built-in `VanishingGradient` rule on the saved `gradients` collection.

```
"DebugRuleConfigurations": [
    {
        "RuleConfigurationName": "VanishingGradient",
        "RuleEvaluatorImage": "503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "VanishingGradient",
            "threshold": "20.0"
        }
    }
]
```

With a configuration like the one in this sample, Debugger starts a rule evaluation job for your training job using the `VanishingGradient` rule on the collection of `gradients` tensor. To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules). To find the key-value pairs for `RuleParameters`, see [List of Debugger built-in rules](debugger-built-in-rules.md).

## To configure a Debugger built-in rule for profiling system and framework metrics
<a name="debugger-built-in-rules-api-profile.CLI"></a>

The following example code shows how to specify the ProfilerConfig API operation to enable collecting system and framework metrics.

**To enable Debugger profiling to collect system and framework metrics**

------
#### [ Target Step ]

```
"ProfilerConfig": { 
    // Optional. Path to an S3 bucket to save profiling outputs
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output", 
    // Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    "ProfilingIntervalInMilliseconds": 500, 
    "ProfilingParameters": {
        "DataloaderProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3, \"MetricsRegex\": \".*\" }",
        "DetailedProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3 }",
        // For PythonProfilingConfig,
        // available ProfilerName options: cProfile, Pyinstrument
        // available cProfileTimer options only when using cProfile: cpu, off_cpu, total_time
        "PythonProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName\": \"cProfile\", \"cProfileTimer\": \"total_time\" }",
        // Optional. Local path for profiling outputs
        "LocalPath": "/opt/ml/output/profiler/" 
    }
}
```

------
#### [ Target Time Duration ]

```
"ProfilerConfig": { 
    // Optional. Path to an S3 bucket to save profiling outputs
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output", 
    // Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    "ProfilingIntervalInMilliseconds": 500,
    "ProfilingParameters": {
        "DataloaderProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10, \"MetricsRegex\": \".*\" }",
        "DetailedProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10 }",
        // For PythonProfilingConfig,
        // available ProfilerName options: cProfile, Pyinstrument
        // available cProfileTimer options only when using cProfile: cpu, off_cpu, total_time
        "PythonProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10, \"ProfilerName\": \"cProfile\", \"cProfileTimer\": \"total_time\" }",
        // Optional. Local path for profiling outputs
        "LocalPath": "/opt/ml/output/profiler/"  
    }
}
```

------
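Because each `ProfilingParameters` value is itself a JSON-encoded string, hand-writing the escaped quotes is error-prone; a stray trailing comma, for example, makes the embedded JSON invalid. One way to build the values safely is with `json.dumps`. This sketch uses the target-step values from the tabs above:

```python
import json

def profiling_parameters(start_step=5, num_steps=3):
    # Each value must be a JSON *string*, not a nested JSON object.
    return {
        "DataloaderProfilingConfig": json.dumps(
            {"StartStep": start_step, "NumSteps": num_steps, "MetricsRegex": ".*"}),
        "DetailedProfilingConfig": json.dumps(
            {"StartStep": start_step, "NumSteps": num_steps}),
        "PythonProfilingConfig": json.dumps(
            {"StartStep": start_step, "NumSteps": num_steps,
             "ProfilerName": "cProfile", "cProfileTimer": "total_time"}),
        "LocalPath": "/opt/ml/output/profiler/",  # Optional
    }

params = profiling_parameters()
print(json.loads(params["DetailedProfilingConfig"]))
```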

**To enable Debugger rules for profiling the metrics**

The following example code shows how to configure the `ProfilerReport` rule.

```
"ProfilerRuleConfigurations": [ 
    {
        "RuleConfigurationName": "ProfilerReport",
        "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "ProfilerReport",
            "CPUBottleneck_cpu_threshold": "90",
            "IOBottleneck_threshold": "90"
        }
    }
]
```

To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules). To find the key-value pairs for `RuleParameters`, see [List of Debugger built-in rules](debugger-built-in-rules.md).

## Update Debugger profiling configuration using the `UpdateTrainingJob` API
<a name="debugger-updatetrainingjob-api.CLI"></a>

Debugger profiling configuration can be updated while your training job is running by using the [UpdateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateTrainingJob.html) API operation. Configure new [ProfilerConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerConfig.html) and [ProfilerRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerRuleConfiguration.html) objects, and specify the training job name to the `TrainingJobName` parameter.

```
{
    "ProfilerConfig": { 
        "DisableProfiler": boolean,
        "ProfilingIntervalInMilliseconds": number,
        "ProfilingParameters": { 
            "string" : "string" 
        }
    },
    "ProfilerRuleConfigurations": [ 
        { 
            "RuleConfigurationName": "string",
            "RuleEvaluatorImage": "string",
            "RuleParameters": { 
                "string" : "string" 
            }
        }
    ],
    "TrainingJobName": "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS"
}
```
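The same request maps directly onto the Boto3 `update_training_job` call. The following sketch builds the request as a Python dictionary; the job name, interval, and image URI are placeholders, and the final call (commented out) requires boto3 and AWS credentials.

```python
# Sketch only: the job name, interval, and image URI are placeholders.
update_request = {
    "TrainingJobName": "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS",
    "ProfilerConfig": {
        "DisableProfiler": False,
        "ProfilingIntervalInMilliseconds": 1000,
        "ProfilingParameters": {},
    },
    "ProfilerRuleConfigurations": [
        {
            "RuleConfigurationName": "ProfilerReport",
            "RuleEvaluatorImage": "<debugger-rules-image-uri-for-your-region>",
            "RuleParameters": {"rule_to_invoke": "ProfilerReport"},
        }
    ],
}

# With boto3 and credentials configured:
# import boto3
# sagemaker_client = boto3.client("sagemaker")
# sagemaker_client.update_training_job(**update_request)
print(update_request["ProfilerConfig"]["ProfilingIntervalInMilliseconds"])
```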

## Add Debugger custom rule configuration to the `CreateTrainingJob` API
<a name="debugger-custom-rules-api.CLI"></a>

A custom rule can be configured for a training job using the [ DebugHookConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugHookConfig.html) and [ DebugRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugRuleConfiguration.html) objects in the [ CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API operation. The following code sample shows how to configure a custom `ImproperActivation` rule written with the *smdebug* library using this SageMaker API operation. This example assumes that you’ve written the custom rule in a `custom_rules.py` file and uploaded it to an Amazon S3 bucket. Debugger provides pre-built Docker images that you can use to run your custom rules; these are listed at [Amazon SageMaker Debugger image URIs for custom rule evaluators](debugger-reference.md#debuger-custom-rule-registry-ids). Specify the registry URL of the pre-built Docker image in the `RuleEvaluatorImage` parameter.

```
"DebugHookConfig": {
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
    "CollectionConfigurations": [
        {
            "CollectionName": "relu_activations",
            "CollectionParameters": {
                "include_regex": "relu",
                "save_interval": "500",
                "end_step": "5000"
            }
        }
    ]
},
"DebugRuleConfigurations": [
    {
        "RuleConfigurationName": "improper_activation_job",
        "RuleEvaluatorImage": "552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest",
        "InstanceType": "ml.c4.xlarge",
        "VolumeSizeInGB": 400,
        "RuleParameters": {
           "source_s3_uri": "s3://bucket/custom_rules.py",
           "rule_to_invoke": "ImproperActivation",
           "collection_names": "relu_activations"
        }
    }
]
```

To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules). To find the key-value pairs for `RuleParameters`, see [List of Debugger built-in rules](debugger-built-in-rules.md).

# SDK for Python (Boto3)
<a name="debugger-built-in-rules-api.Boto3"></a>

Amazon SageMaker Debugger built-in rules can be configured for a training job using the [create_training_job()](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_training_job) function of the AWS Boto3 SageMaker AI client. You need to specify the right image URI in the `RuleEvaluatorImage` parameter, and the following examples walk you through how to set up the request body for the `create_training_job()` function.

The following code shows a complete example of how to configure Debugger for the `create_training_job()` request body and start a training job in `us-west-2`, assuming that a training script `entry_point/train.py` is prepared using TensorFlow. To find an end-to-end example notebook, see [Profiling TensorFlow Multi GPU Multi Node Training Job with Amazon SageMaker Debugger (Boto3)](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_profiling/tf-resnet-profiling-multi-gpu-multi-node-boto3.html).

**Note**  
Ensure that you use the correct Docker container images. To find available AWS Deep Learning Container images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules).

```
import sagemaker, boto3
import datetime, tarfile

# Start setting up a SageMaker session and a Boto3 SageMaker client
session = sagemaker.Session()
region = session.boto_region_name
bucket = session.default_bucket()

# Upload a training script to a default Amazon S3 bucket of the current SageMaker session
source = 'source.tar.gz'
project = 'debugger-boto3-test'

tar = tarfile.open(source, 'w:gz')
tar.add('entry_point/train.py') # Specify the directory and name of your training script
tar.close()

s3 = boto3.client('s3')
s3.upload_file(source, bucket, project+'/'+source)

# Set up a Boto3 session client for SageMaker
sm = boto3.Session(region_name=region).client("sagemaker")

# Start a training job
sm.create_training_job(
    TrainingJobName='debugger-boto3-'+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S'),
    HyperParameters={
        'sagemaker_submit_directory': 's3://'+bucket+'/'+project+'/'+source,
        'sagemaker_program': 'entry_point/train.py' # training script file location and name under the sagemaker_submit_directory
    },
    AlgorithmSpecification={
        # Specify a training Docker container image URI (Deep Learning Container or your own training container) to TrainingImage.
        'TrainingImage': '763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04',
        'TrainingInputMode': 'File',
        'EnableSageMakerMetricsTimeSeries': False
    },
    RoleArn='arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-20201014T161125',
    OutputDataConfig={'S3OutputPath': 's3://'+bucket+'/'+project+'/output'},
    ResourceConfig={
        'InstanceType': 'ml.p3.8xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 30
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 86400
    },
    DebugHookConfig={
        'S3OutputPath': 's3://'+bucket+'/'+project+'/debug-output',
        'CollectionConfigurations': [
            {
                'CollectionName': 'losses',
                'CollectionParameters' : {
                    'train.save_interval': '500',
                    'eval.save_interval': '50'
                }
            }
        ]
    },
    DebugRuleConfigurations=[
        {
            'RuleConfigurationName': 'LossNotDecreasing',
            'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
            'RuleParameters': {'rule_to_invoke': 'LossNotDecreasing'}
        }
    ],
    ProfilerConfig={
        'S3OutputPath': 's3://'+bucket+'/'+project+'/profiler-output',
        'ProfilingIntervalInMilliseconds': 500,
        'ProfilingParameters': {
            'DataloaderProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}',
            'DetailedProfilingConfig': '{"StartStep": 5, "NumSteps": 3}',
            'PythonProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "ProfilerName": "cProfile", "cProfileTimer": "total_time"}',
            'LocalPath': '/opt/ml/output/profiler/' # Optional. Local path for profiling outputs
        }
    },
    ProfilerRuleConfigurations=[
        {
            'RuleConfigurationName': 'ProfilerReport',
            'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
            'RuleParameters': {'rule_to_invoke': 'ProfilerReport'}
        }
    ]
)
```

## To configure a Debugger rule for debugging model parameters
<a name="debugger-built-in-rules-api-debug.Boto3"></a>

The following code samples show how to configure a built-in `VanishingGradient` rule using this SageMaker API. 

**To enable Debugger to collect output tensors**

Specify the Debugger hook configuration as follows:

```
DebugHookConfig={
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/debug-output',
    'CollectionConfigurations': [
        {
            'CollectionName': 'gradients',
            'CollectionParameters' : {
                'train.save_interval': '500',
                'eval.save_interval': '50'
            }
        }
    ]
}
```

With this configuration, the training job saves the `gradients` tensor collection every 500 training steps and every 50 evaluation steps, as set by the `save_interval` parameters. To find available `CollectionName` values, see [Debugger Built-in Collections](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#built-in-collections) in the *SMDebug client library documentation*. To find available `CollectionParameters` parameter keys and values, see the [https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig) class in the *SageMaker Python SDK documentation*.
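Every value in `CollectionParameters` must be a string. A minimal sketch of building this hook configuration programmatically (the `collection_config` helper is illustrative, not part of any SDK):

```python
def collection_config(name, **params):
    # The CreateTrainingJob API expects every CollectionParameters value as a string.
    return {
        "CollectionName": name,
        "CollectionParameters": {key: str(value) for key, value in params.items()},
    }

debug_hook_config = {
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
    "CollectionConfigurations": [
        collection_config(
            "gradients",
            **{"train.save_interval": 500, "eval.save_interval": 50},
        )
    ],
}
```

Converting the numeric values with `str` keeps the request shape valid even when the intervals come from variables elsewhere in your code.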

**To enable Debugger rules for debugging the output tensors**

The following `DebugRuleConfigurations` API example shows how to run the built-in `VanishingGradient` rule on the saved `gradients` collection.

```
DebugRuleConfigurations=[
    {
        'RuleConfigurationName': 'VanishingGradient',
        'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
        'RuleParameters': {
            'rule_to_invoke': 'VanishingGradient',
            'threshold': '20.0'
        }
    }
]
```

With a configuration like the one in this sample, Debugger starts a rule evaluation job for your training job using the `VanishingGradient` rule on the saved `gradients` tensor collection. To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules). To find the key-value pairs for `RuleParameters`, see [List of Debugger built-in rules](debugger-built-in-rules.md).
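Like `CollectionParameters`, all `RuleParameters` values are passed as strings, and `rule_to_invoke` selects which built-in rule the evaluator container runs. A minimal sketch of building such an entry (the `built_in_rule` helper and the image URI placeholder are illustrative, not part of any SDK):

```python
def built_in_rule(rule_name, evaluator_image, **params):
    # rule_to_invoke selects the built-in rule; every other parameter
    # is forwarded as a string in RuleParameters.
    rule_params = {"rule_to_invoke": rule_name}
    rule_params.update({key: str(value) for key, value in params.items()})
    return {
        "RuleConfigurationName": rule_name,
        "RuleEvaluatorImage": evaluator_image,
        "RuleParameters": rule_params,
    }

debug_rule_configurations = [
    built_in_rule("VanishingGradient", "<debugger-rules-image-uri>", threshold=20.0)
]
```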

## To configure a Debugger built-in rule for profiling system and framework metrics
<a name="debugger-built-in-rules-api-profile.Boto3"></a>

The following example code shows how to configure the `ProfilerConfig` parameter to enable collecting system and framework metrics.

**To enable Debugger profiling to collect system and framework metrics**

------
#### [ Target Step ]

```
ProfilerConfig={ 
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/profiler-output', # Optional. Path to an S3 bucket to save profiling outputs
    # Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    'ProfilingIntervalInMilliseconds': 500, 
    'ProfilingParameters': {
        'DataloaderProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}',
        'DetailedProfilingConfig': '{"StartStep": 5, "NumSteps": 3}',
        # ProfilerName options: cprofile, pyinstrument.
        # cProfileTimer options (include only when using cprofile): cpu, off_cpu, total_time.
        'PythonProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
        'LocalPath': '/opt/ml/output/profiler/' # Optional. Local path for profiling outputs
    }
}
```

------
#### [ Target Time Duration ]

```
ProfilerConfig={ 
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/profiler-output', # Optional. Path to an S3 bucket to save profiling outputs
    # Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    'ProfilingIntervalInMilliseconds': 500,
    'ProfilingParameters': {
        'DataloaderProfilingConfig': '{"StartTimeInSecSinceEpoch": 12345567789, "DurationInSeconds": 10, "MetricsRegex": ".*"}',
        'DetailedProfilingConfig': '{"StartTimeInSecSinceEpoch": 12345567789, "DurationInSeconds": 10}',
        # ProfilerName options: cprofile, pyinstrument.
        # cProfileTimer options (include only when using cprofile): cpu, off_cpu, total_time.
        'PythonProfilingConfig': '{"StartTimeInSecSinceEpoch": 12345567789, "DurationInSeconds": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
        'LocalPath': '/opt/ml/output/profiler/' # Optional. Local path for profiling outputs
    }
}
```

------
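Each nested profiling config is passed to the API as a JSON-encoded string rather than a nested object. One way to avoid quoting and trailing-comma mistakes is to build those strings with `json.dumps`, as in this sketch:

```python
import json

# Build each nested profiling config as a dict, then JSON-encode it,
# since the ProfilingParameters values must be strings.
profiler_config = {
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output",
    "ProfilingIntervalInMilliseconds": 500,
    "ProfilingParameters": {
        "DataloaderProfilingConfig": json.dumps(
            {"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}
        ),
        "DetailedProfilingConfig": json.dumps({"StartStep": 5, "NumSteps": 3}),
        "PythonProfilingConfig": json.dumps(
            {"StartStep": 5, "NumSteps": 3,
             "ProfilerName": "cprofile", "cProfileTimer": "total_time"}
        ),
        "LocalPath": "/opt/ml/output/profiler/",
    },
}
```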

**To enable Debugger rules for profiling the metrics**

The following example code shows how to configure the `ProfilerReport` rule.

```
ProfilerRuleConfigurations=[ 
    {
        'RuleConfigurationName': 'ProfilerReport',
        'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
        'RuleParameters': {
            'rule_to_invoke': 'ProfilerReport',
            'CPUBottleneck_cpu_threshold': '90',
            'IOBottleneck_threshold': '90'
        }
    }
]
```

To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules). To find the key-value pairs for `RuleParameters`, see [List of Debugger built-in rules](debugger-built-in-rules.md).

## Update Debugger Profiling Configuration Using the `UpdateTrainingJob` API Operation
<a name="debugger-updatetrainingjob-api.Boto3"></a>

You can update the Debugger profiling configuration while your training job is running by using the [https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.update_training_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.update_training_job) function of the AWS Boto3 SageMaker AI client. Configure new [ProfilerConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerConfig.html) and [ProfilerRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerRuleConfiguration.html) objects, and specify the training job name in the `TrainingJobName` parameter.

```
ProfilerConfig={ 
    'DisableProfiler': boolean,
    'ProfilingIntervalInMilliseconds': number,
    'ProfilingParameters': { 
        'string' : 'string' 
    }
},
ProfilerRuleConfigurations=[ 
    { 
        'RuleConfigurationName': 'string',
        'RuleEvaluatorImage': 'string',
        'RuleParameters': { 
            'string' : 'string' 
        }
    }
],
TrainingJobName='your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS'
```

## Add Debugger Custom Rule Configuration to the CreateTrainingJob API Operation
<a name="debugger-custom-rules-api.Boto3"></a>

A custom rule can be configured for a training job using the [ DebugHookConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugHookConfig.html) and [ DebugRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugRuleConfiguration.html) objects with the AWS Boto3 SageMaker AI client's [https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_training_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_training_job) function. The following code sample shows how to configure a custom `ImproperActivation` rule written with the *smdebug* library using this SageMaker API operation. This example assumes that you’ve written the custom rule in a *custom\_rules.py* file and uploaded it to an Amazon S3 bucket. SageMaker AI provides pre-built Docker images that you can use to run your custom rules; these are listed at [Amazon SageMaker Debugger image URIs for custom rule evaluators](debugger-reference.md#debuger-custom-rule-registry-ids). Specify the registry URL of the pre-built Docker image in the `RuleEvaluatorImage` parameter.

```
DebugHookConfig={
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/debug-output',
    'CollectionConfigurations': [
        {
            'CollectionName': 'relu_activations',
            'CollectionParameters': {
                'include_regex': 'relu',
                'save_interval': '500',
                'end_step': '5000'
            }
        }
    ]
},
DebugRuleConfigurations=[
    {
        'RuleConfigurationName': 'improper_activation_job',
        'RuleEvaluatorImage': '552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest',
        'InstanceType': 'ml.c4.xlarge',
        'VolumeSizeInGB': 400,
        'RuleParameters': {
           'source_s3_uri': 's3://bucket/custom_rules.py',
           'rule_to_invoke': 'ImproperActivation',
           'collection_names': 'relu_activations'
        }
    }
]
```

To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules). To find the key-value pairs for `RuleParameters`, see [List of Debugger built-in rules](debugger-built-in-rules.md).

# Amazon SageMaker Debugger references
<a name="debugger-reference"></a>

Find more information and references about using Amazon SageMaker Debugger in the following topics.

**Topics**
+ [Amazon SageMaker Debugger APIs](#debugger-apis)
+ [Docker images for Debugger rules](#debugger-docker-images-rules)
+ [Amazon SageMaker Debugger exceptions](#debugger-exceptions)
+ [Distributed training supported by Amazon SageMaker Debugger](#debugger-considerations)

## Amazon SageMaker Debugger APIs
<a name="debugger-apis"></a>

Amazon SageMaker Debugger has API operations in several locations that are used to implement its monitoring and analysis of model training.

Amazon SageMaker Debugger also provides the open source [`sagemaker-debugger` Python SDK](https://github.com/awslabs/sagemaker-debugger/tree/master/smdebug) that is used to configure built-in rules, define custom rules, and register hooks to collect output tensor data from training jobs.

The [Amazon SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/) is a high-level SDK focused on machine learning experimentation. The SDK can be used to deploy built-in or custom rules defined with the `SMDebug` Python library to monitor and analyze these tensors using SageMaker AI estimators.

Debugger has added operations and types to the Amazon SageMaker API that enable the platform to use Debugger when training a model and to manage the configuration of inputs and outputs. 
+ [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) and [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateTrainingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateTrainingJob.html) use the following Debugger APIs to configure tensor collections, rules, rule images, and profiling options:
  + [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CollectionConfiguration.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CollectionConfiguration.html)
  + [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugHookConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugHookConfig.html)
  + [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugRuleConfiguration.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugRuleConfiguration.html)
  + [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TensorBoardOutputConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TensorBoardOutputConfig.html)
  + [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerConfig.html)
  + [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerRuleConfiguration.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerRuleConfiguration.html)
+ [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html) provides a full description of a training job, including the following Debugger configurations and rule evaluation statuses:
  + [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugHookConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugHookConfig.html)
  + [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugRuleConfiguration.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugRuleConfiguration.html)
  + [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugRuleEvaluationStatus.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugRuleEvaluationStatus.html)
  + [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerConfig.html)
  + [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerRuleConfiguration.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerRuleConfiguration.html)
  + [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerRuleEvaluationStatus.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerRuleEvaluationStatus.html)

The rule configuration API operations use the SageMaker Processing functionality when analyzing model training. For more information about SageMaker Processing, see [Data transformation workloads with SageMaker Processing](processing-job.md).

## Docker images for Debugger rules
<a name="debugger-docker-images-rules"></a>

Amazon SageMaker AI provides two sets of Docker images for rules: one set for evaluating rules provided by SageMaker AI (built-in rules) and one set for evaluating custom rules provided in Python source files. 

If you use the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable), you can use the high-level Debugger API operations with the SageMaker AI Estimator API operations, without having to manually retrieve the Debugger Docker images or configure the `CreateTrainingJob` API. 

If you are not using the SageMaker Python SDK, you have to retrieve a relevant pre-built container base image for the Debugger rules. Amazon SageMaker Debugger provides pre-built Docker images for built-in and custom rules, and the images are stored in Amazon Elastic Container Registry (Amazon ECR). To pull an image from an Amazon ECR repository (or to push an image to one), specify the image's full registry URL in the `CreateTrainingJob` API. SageMaker AI uses the following URL pattern for the Debugger rule container image registry addresses. 

```
<account_id>.dkr.ecr.<Region>.amazonaws.com/<ECR repository name>:<tag>
```

For the account ID in each AWS Region, the Amazon ECR repository name, and the tag value, see the following topics.

**Topics**
+ [Amazon SageMaker Debugger image URIs for built-in rule evaluators](#debuger-built-in-registry-ids)
+ [Amazon SageMaker Debugger image URIs for custom rule evaluators](#debuger-custom-rule-registry-ids)

### Amazon SageMaker Debugger image URIs for built-in rule evaluators
<a name="debuger-built-in-registry-ids"></a>

Use the following values for the components of the registry URLs for the images that provide built-in rules for Amazon SageMaker Debugger. For account IDs, see the following table.

**ECR Repository Name**: sagemaker-debugger-rules 

**Tag**: latest 

**Example of a full registry URL**: 

`904829902805.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rules:latest`

Account IDs for Built-in Rules Container Images by AWS Region


| Region | Account ID | 
| --- | --- | 
| af-south-1 |  314341159256  | 
| ap-east-1 |  199566480951  | 
| ap-northeast-1 |  430734990657   | 
| ap-northeast-2 |  578805364391  | 
| ap-south-1 |  904829902805  | 
| ap-southeast-1 |  972752614525  | 
| ap-southeast-2 |  184798709955  | 
| ca-central-1 |  519511493484  | 
| cn-north-1 |  618459771430  | 
| cn-northwest-1 |  658757709296  | 
| eu-central-1 |  482524230118  | 
| eu-north-1 |  314864569078  | 
| eu-south-1 |  563282790590  | 
| eu-west-1 |  929884845733  | 
| eu-west-2 |  250201462417  | 
| eu-west-3 |  447278800020  | 
| me-south-1 |  986000313247  | 
| sa-east-1 |  818342061345  | 
| us-east-1 |  503895931360  | 
| us-east-2 |  915447279597  | 
| us-west-1 |  685455198987  | 
| us-west-2 |  895741380848  | 
| us-gov-west-1 |  515509971035  | 
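Given the account IDs in the table, a registry URL can be assembled mechanically. The following sketch uses the standard Amazon ECR URI form with a subset of the account IDs above (extend the mapping with the Regions you need):

```python
# A few of the built-in rules account IDs from the table above.
BUILT_IN_RULE_ACCOUNTS = {
    "us-east-1": "503895931360",
    "us-west-2": "895741380848",
    "eu-west-1": "929884845733",
    "ap-south-1": "904829902805",
}

def built_in_rule_image_uri(region, repo="sagemaker-debugger-rules", tag="latest"):
    # Standard ECR registry URL pattern:
    # <account_id>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>
    account_id = BUILT_IN_RULE_ACCOUNTS[region]
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"
```

You can pass the returned URI to the `RuleEvaluatorImage` parameter of a rule configuration.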

### Amazon SageMaker Debugger image URIs for custom rule evaluators
<a name="debuger-custom-rule-registry-ids"></a>

Use the following values for the components of the registry URL for the images that provide custom rule evaluators for Amazon SageMaker Debugger. For account IDs, see the following table.

**ECR Repository Name**: sagemaker-debugger-rule-evaluator 

**Tag**: latest 

**Example of a full registry URL**: 

`552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest`

Account IDs for Custom Rules Container Images by AWS Region


| Region | Account ID | 
| --- | --- | 
| af-south-1 |  515950693465  | 
| ap-east-1 |  645844755771  | 
| ap-northeast-1 |  670969264625   | 
| ap-northeast-2 |  326368420253  | 
| ap-south-1 |  552407032007  | 
| ap-southeast-1 |  631532610101  | 
| ap-southeast-2 |  445670767460  | 
| ca-central-1 |  105842248657  | 
| cn-north-1 |  617202126805  | 
| cn-northwest-1 |  658559488188  | 
| eu-central-1 |  691764027602  | 
| eu-north-1 |  091235270104  | 
| eu-south-1 |  335033873580  | 
| eu-west-1 |  606966180310  | 
| eu-west-2 |  074613877050  | 
| eu-west-3 |  224335253976  | 
| me-south-1 |  050406412588  | 
| sa-east-1 |  466516958431  | 
| us-east-1 |  864354269164  | 
| us-east-2 |  840043622174  | 
| us-west-1 |  952348334681  | 
| us-west-2 |  759209512951  | 
| us-gov-west-1 |  515361955729  | 

## Amazon SageMaker Debugger exceptions
<a name="debugger-exceptions"></a>

Amazon SageMaker Debugger is designed to be aware that tensors required to evaluate a rule might not be available at every step. As a result, it raises a few exceptions that let you control what happens when a tensor is missing. These exceptions are available in the [smdebug.exceptions module](https://github.com/awslabs/sagemaker-debugger/blob/master/smdebug/exceptions.py). You can import them as follows:

```
from smdebug.exceptions import *
```

The following exceptions are available:
+ `TensorUnavailableForStep` – The requested tensor is not available for the step. This can mean that the hook did not save anything at this step, or that the step was saved but the requested tensor is not part of it. When you see this exception, the tensor can never become available for this step in the future. If reductions of the tensor were saved for the step, the exception notifies you that they can be queried.
+ `TensorUnavailable` – This tensor is not being saved or has not been saved by the `smdebug` API. This means that this tensor is never seen for any step in `smdebug`.
+ `StepUnavailable` – The step was not saved and Debugger has no data from the step.
+ `StepNotYetAvailable` – The step has not yet been seen by `smdebug`. It might be available in the future if the training is still going on. Debugger automatically loads new data as it becomes available.
+ `NoMoreData` – Raised when the training ends. Once you see this, you know that there are no more steps and no more tensors to be saved.
+ `IndexReaderException` – The index reader is not valid.
+ `InvalidWorker` – A worker was invoked that was not valid.
+ `RuleEvaluationConditionMet` – Evaluation of the rule at the step resulted in the condition being met.
+ `InsufficientInformationForRuleInvocation` – Insufficient information was provided to invoke the rule.

## Distributed training supported by Amazon SageMaker Debugger
<a name="debugger-considerations"></a>

The following list shows the scope of validity and considerations for using Debugger on training jobs with deep learning frameworks and various distributed training options.
+ **Horovod**

  Scope of validity of using Debugger for training jobs with Horovod    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-reference.html)
+ **SageMaker AI distributed data parallel**

  Scope of validity of using Debugger for training jobs with SageMaker AI distributed data parallel    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-reference.html)

  \* Debugger does not support framework profiling for TensorFlow 2.x.

  \*\* SageMaker AI distributed data parallel does not support TensorFlow 2.x with Keras implementation.
+ **SageMaker AI distributed model parallel** – Debugger does not support SageMaker AI distributed model parallel training.
+ **Distributed training with SageMaker AI checkpoints** – Debugger is not available for training jobs when both the distributed training option and SageMaker AI checkpoints are enabled. You might see an error that looks like the following: 

  ```
  SMDebug Does Not Currently Support Distributed Training Jobs With Checkpointing Enabled
  ```

  To use Debugger for training jobs with distributed training options, you need to disable SageMaker AI checkpointing and add manual checkpointing functions to your training script. For more information about using Debugger with distributed training options and checkpoints, see [Using SageMaker AI distributed data parallel with Amazon SageMaker Debugger and checkpoints](distributed-troubleshooting-data-parallel.md#distributed-ts-data-parallel-debugger) and [Saving Checkpoints](distributed-troubleshooting-model-parallel.md#distributed-ts-model-parallel-checkpoints).
+ **Parameter Server** – Debugger does not support parameter server-based distributed training.
+ Profiling distributed training framework operations, such as the `AllReduce` operation of SageMaker AI distributed data parallel and [Horovod operations](https://horovod.readthedocs.io/en/stable/timeline_include.html), is not available.

# Access a training container through AWS Systems Manager for remote debugging
<a name="train-remote-debugging"></a>

You can securely connect to SageMaker training containers through AWS Systems Manager (SSM). This gives you shell-level access to debug training jobs that are running within the container. You can also log commands and responses that are streamed to Amazon CloudWatch. If you use your own Amazon Virtual Private Cloud (VPC) to train a model, you can use AWS PrivateLink to set up a VPC endpoint for SSM and connect to containers privately through SSM.

You can connect to [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) or connect to your own training container set up with the SageMaker Training environment. 

## Set up IAM permissions
<a name="train-remote-debugging-iam"></a>

To enable SSM in your SageMaker training container, you need to set up an IAM role for the container. For you or users in your AWS account to access the training containers through SSM, you need to set up IAM users with permissions to use SSM.

### IAM role
<a name="train-remote-debugging-iam-role"></a>

For a SageMaker training container to start with the SSM agent, provide an IAM role with SSM permissions.

To enable remote debugging for your training job, SageMaker AI needs to start the [SSM agent](https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent.html) in the training container when the training job starts. To allow the SSM agent to communicate with the SSM service, add the following policy to the IAM role that you use to run your training job. 

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	             
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssmmessages:CreateControlChannel",
                "ssmmessages:CreateDataChannel",
                "ssmmessages:OpenControlChannel",
                "ssmmessages:OpenDataChannel"
            ],
            "Resource": "*"    
        }
    ]
}
```

------

### IAM user
<a name="train-remote-debugging-iam-user"></a>

Add the following policy to provide an IAM user with SSM session permissions to connect to an SSM target. In this case, the SSM target is a SageMaker training container.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	             
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": "*"    
        }
    ]
}
```

------

 You can restrict IAM users to connect only to containers for specific training jobs by adding the `Condition` key, as shown in the following policy sample. 

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	             
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": [
                "*"
            ],
            "Condition": {
                "StringLike": {
                    "ssm:resourceTag/aws:ssmmessages:target-id": [
                        "sagemaker-training-job:*"
                    ]
                }
            } 
        }
    ]
}
```

------

You can also explicitly use the `sagemaker:EnableRemoteDebug` condition key to restrict remote debugging. The following is an example policy for IAM users to restrict remote debugging.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "DenyRemoteDebugInTrainingJob",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateTrainingJob",
                "sagemaker:UpdateTrainingJob"
            ],
            "Resource": "*",
            "Condition": {
                "BoolIfExists": {
                    "sagemaker:EnableRemoteDebug": false
                }
            }
        }
    ]
}
```

------

For more information, see [Condition keys for Amazon SageMaker AI](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonsagemaker.html#amazonsagemaker-policy-keys) in the *AWS Service Authorization Reference*.

## How to enable remote debugging for a SageMaker training job
<a name="train-remote-debugging-how-to-use"></a>

In this section, learn how to enable remote debugging when starting or updating a training job in Amazon SageMaker AI.

------
#### [ SageMaker Python SDK ]

Using the estimator class in the SageMaker Python SDK, you can turn remote debugging on or off using the `enable_remote_debug` parameter or the `enable_remote_debug()` and `disable_remote_debug()` methods.

**To enable remote debugging when you create a training job**

To enable remote debugging when you create a new training job, set the `enable_remote_debug` parameter to `True`. The default value is `False`, so if you don’t set this parameter at all, or you explicitly set it to `False`, remote debugging functionality is disabled.

```
import sagemaker

session = sagemaker.Session()

estimator = sagemaker.estimator.Estimator(
    ...,
    sagemaker_session=session,
    image_uri="<your_image_uri>", # must be an image owned by your organization or an AWS Deep Learning Container
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path=output_path,
    max_run=1800,
    enable_remote_debug=True
)
```

**To enable remote debugging by updating a training job**

Using the following estimator class methods, you can enable or disable remote debugging while a training job is running when the `SecondaryStatus` of the job is `Downloading` or `Training`.

```
# Enable RemoteDebug
estimator.enable_remote_debug()

# Disable RemoteDebug
estimator.disable_remote_debug()
```

------
#### [ AWS SDK for Python (Boto3) ]

**To enable remote debugging when you create a training job**

To enable remote debugging when you create a new training job, set the value for the `EnableRemoteDebug` key to `True` in the `RemoteDebugConfig` parameter. 

```
import boto3

sm = boto3.Session(region_name=region).client("sagemaker")

# Start a training job
sm.create_training_job(
    ...,
    TrainingJobName=job_name,
    AlgorithmSpecification={
        # Specify a training Docker container image URI
        # (Deep Learning Container or your own training container) to TrainingImage.
        "TrainingImage": "<your_image_uri>",
        "TrainingInputMode": "File"
    },
    RoleArn=iam_role_arn,
    OutputDataConfig={"S3OutputPath": output_path},
    ResourceConfig={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30
    },
    StoppingCondition={
        "MaxRuntimeInSeconds": 86400
    },
    RemoteDebugConfig={
        "EnableRemoteDebug": True
    }
)
```

**To enable remote debugging by updating a training job**

Using the `update_training_job` API, you can enable or disable remote debugging while a training job is running when the `SecondaryStatus` of the job is `Downloading` or `Training`.

```
# Update a training job
sm.update_training_job(
    TrainingJobName=job_name,
    RemoteDebugConfig={
        "EnableRemoteDebug": True     # True | False
    }
)
```

------
#### [ AWS Command Line Interface (CLI) ]

**To enable remote debugging when you create a training job**

Prepare a `CreateTrainingJob` request file in JSON format, as follows.

```
{
    "TrainingJobName": "your-training-job-name",
    "RoleArn": "your-iam-role-arn",
    "AlgorithmSpecification": {
        "TrainingImage": "<your_image_uri>",
        "TrainingInputMode": "File"
    },
    "OutputDataConfig": {
        "S3OutputPath": "s3://your-bucket/your-output-path"
    },
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 30
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 86400
    },
    "RemoteDebugConfig": {
        "EnableRemoteDebug": true
    }
}
```

After saving the JSON file, run the following command in the terminal where you submit the training job. The following example command assumes that the JSON file is named `train-with-remote-debug.json`. If you run it from a Jupyter notebook, add an exclamation point (`!`) to the beginning of the line.

```
aws sagemaker create-training-job \
    --cli-input-json file://train-with-remote-debug.json
```

**To enable remote debugging by updating a training job**

Prepare an `UpdateTrainingJob` request file in JSON format, as follows.

```
{
    "TrainingJobName": "<job-name>",
    "RemoteDebugConfig": {
        "EnableRemoteDebug": true
    }
}
```

After saving the JSON file, run the following command in the terminal where you submit the training job. The following example command assumes that the JSON file is named `update-training-job-with-remote-debug-config.json`. If you run it from a Jupyter notebook, add an exclamation point (`!`) to the beginning of the line.

```
aws sagemaker update-training-job \
    --cli-input-json file://update-training-job-with-remote-debug-config.json
```

------

## Access your training container
<a name="train-remote-debugging-access-container"></a>

You can access a training container when the `SecondaryStatus` of the corresponding training job is `Training`. The following code examples demonstrate how to check the status of your training job using the `DescribeTrainingJob` API, how to check the training job logs in CloudWatch, and how to log in to the training container.

**To check the status of a training job**

------
#### [ SageMaker Python SDK ]

To check the `SecondaryStatus` of a training job, run the following SageMaker Python SDK code.

```
import sagemaker

session = sagemaker.Session()

# Describe the job status
training_job_info = session.describe_training_job(job_name)
print(training_job_info)
```

------
#### [ AWS SDK for Python (Boto3) ]

To check the `SecondaryStatus` of a training job, run the following SDK for Python (Boto3) code.

```
import boto3

session = boto3.session.Session()
region = session.region_name
sm = boto3.Session(region_name=region).client("sagemaker")

# Describe the job status
sm.describe_training_job(TrainingJobName=job_name)
```

------
#### [ AWS Command Line Interface (CLI) ]

To check the `SecondaryStatus` of a training job, run the following AWS CLI command for SageMaker AI.

```
aws sagemaker describe-training-job \
    --training-job-name job_name
```

------

**To find the host name of a training container**

To connect to the training container through SSM, use this format for the target ID: `sagemaker-training-job:<training-job-name>_algo-<n>`, where `algo-<n>` is the name of the container host. If your job is running on a single instance, the host is always `algo-1`. If you run a distributed training job on multiple instances, SageMaker AI creates an equal number of hosts and log streams. For example, if you use 4 instances, SageMaker AI creates `algo-1`, `algo-2`, `algo-3`, and `algo-4`. You must determine which log stream you want to debug, and its host number. To access log streams that are associated with a training job, do the following.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Training**, then choose **Training jobs**.

1. From the **Training jobs** list, choose the training job that you want to debug. The training job details page opens.

1. In the **Monitor** section, choose **View logs**. The related training job log stream list opens in the CloudWatch console.

1. Log stream names appear in `<training-job-name>/algo-<n>-<time-stamp>` format, with `algo-<n>` representing the host name. 

To learn more about how SageMaker AI manages configuration information for multi-instance distributed training, see [Distributed Training Configuration](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html#your-algorithms-training-algo-running-container-dist-training).
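If you work with log stream names programmatically, the host name can be parsed out of the stream name. The following is a minimal sketch; the helper name is our own, not part of any SageMaker AI API.

```python
import re

def host_from_log_stream(stream_name: str) -> str:
    """Extract the algo-<n> host name from a log stream named
    in the <training-job-name>/algo-<n>-<timestamp> format."""
    match = re.fullmatch(r"[^/]+/(algo-\d+)-\d+", stream_name)
    if match is None:
        raise ValueError(f"Unexpected log stream name: {stream_name}")
    return match.group(1)

print(host_from_log_stream("training-job-test-remote-debug/algo-1-1680535238"))
# algo-1
```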

**To access the training container**

Use the following [`aws ssm start-session`](https://docs.aws.amazon.com/cli/latest/reference/ssm/start-session.html) command in a terminal to start the SSM session and connect to the training container.

```
aws ssm start-session --target sagemaker-training-job:<training-job-name>_algo-<n>
```

For example, if the training job name is `training-job-test-remote-debug` and the host name is `algo-1`, the target ID becomes `sagemaker-training-job:training-job-test-remote-debug_algo-1`. If the output of this command is similar to `Starting session with SessionId:xxxxx`, the connection is successful.
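Because the target ID follows a fixed format, you can construct it from the training job name and host number. The following minimal sketch uses a helper name of our own; it is not a SageMaker AI API.

```python
def ssm_target_id(training_job_name: str, host_number: int = 1) -> str:
    # SSM target IDs for SageMaker remote debugging follow the format
    # sagemaker-training-job:<training-job-name>_algo-<n>
    return f"sagemaker-training-job:{training_job_name}_algo-{host_number}"

print(ssm_target_id("training-job-test-remote-debug"))
# sagemaker-training-job:training-job-test-remote-debug_algo-1
```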

### SSM access with AWS PrivateLink
<a name="train-remote-debugging-access-container-vpc"></a>

If your training containers run within an Amazon Virtual Private Cloud (VPC) that is not connected to the public internet, you can use AWS PrivateLink to enable SSM. AWS PrivateLink restricts all network traffic between your endpoint instances, SSM, and Amazon EC2 to the Amazon network. For more information on how to set up SSM access with AWS PrivateLink, see [Set up an Amazon VPC endpoint for Session Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-privatelink.html). 

## Log SSM session commands and results
<a name="train-remote-debugging-log-ssm"></a>

After following the instructions at [Create a Session Manager preferences document (command line)](https://docs.aws.amazon.com/systems-manager/latest/userguide/getting-started-create-preferences-cli.html), you can create SSM documents that define your preferences for SSM sessions. You can use SSM documents to configure session options, including data encryption, session duration, and logging. For example, you can specify whether to store session log data in an Amazon Simple Storage Service (Amazon S3) bucket or in an Amazon CloudWatch Logs group. You can create documents that define general preferences for all sessions for an AWS account and AWS Region, or documents that define preferences for individual sessions.
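As an illustration, a Session Manager preferences document that sends session logs to both Amazon S3 and CloudWatch Logs might look like the following sketch. The bucket name, key prefix, and log group name are placeholders; see the linked Session Manager documentation for the authoritative input schema.

```json
{
    "schemaVersion": "1.0",
    "description": "Session Manager preferences with S3 and CloudWatch logging",
    "sessionType": "Standard_Stream",
    "inputs": {
        "s3BucketName": "<your-log-bucket>",
        "s3KeyPrefix": "ssm-session-logs/",
        "s3EncryptionEnabled": true,
        "cloudWatchLogGroupName": "<your-log-group>",
        "cloudWatchEncryptionEnabled": true
    }
}
```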

## Troubleshooting issues by checking error logs from SSM
<a name="train-remote-debugging-checking-ssm-agent-logs"></a>

Amazon SageMaker AI uploads errors from the SSM agent to your CloudWatch Logs in the `/aws/sagemaker/TrainingJobs` log group. SSM agent log streams are named in this format: `<job-name>/algo-<n>-<timestamp>/ssm`. For example, if you create a two-node training job named `training-job-test-remote-debug`, the training job log `training-job-test-remote-debug/algo-<n>-<timestamp>` and multiple SSM agent error logs `training-job-test-remote-debug/algo-<n>-<timestamp>/ssm` are uploaded to your CloudWatch Logs. In this example, you can review the `*/ssm` log streams to troubleshoot SSM issues.

```
training-job-test-remote-debug/algo-1-1680535238
training-job-test-remote-debug/algo-2-1680535238
training-job-test-remote-debug/algo-1-1680535238/ssm
training-job-test-remote-debug/algo-2-1680535238/ssm
```
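When a multi-node job produces many log streams, you can separate the SSM agent streams from the training streams by their `/ssm` suffix. This is a minimal sketch; the helper name is our own.

```python
def split_ssm_log_streams(stream_names):
    """Partition log stream names into training streams and SSM agent streams,
    using the /ssm suffix that SageMaker AI appends to agent log streams."""
    ssm_streams = [name for name in stream_names if name.endswith("/ssm")]
    training_streams = [name for name in stream_names if not name.endswith("/ssm")]
    return training_streams, ssm_streams

streams = [
    "training-job-test-remote-debug/algo-1-1680535238",
    "training-job-test-remote-debug/algo-1-1680535238/ssm",
]
training, ssm = split_ssm_log_streams(streams)
print(ssm)
# ['training-job-test-remote-debug/algo-1-1680535238/ssm']
```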

## Considerations
<a name="train-remote-debugging-considerations"></a>

Consider the following when using SageMaker AI remote debugging.
+ Remote debugging isn't supported for [SageMaker AI algorithm containers](https://docs.aws.amazon.com/sagemaker/latest/dg/algorithms-choose.html) or containers from SageMaker AI on AWS Marketplace.
+ You can't start an SSM session for containers that have network isolation enabled because the isolation prevents outbound network calls.

# Release notes for debugging capabilities of Amazon SageMaker AI
<a name="debugger-release-notes"></a>

See the following release notes to track the latest updates for debugging capabilities of Amazon SageMaker AI.

## December 21, 2023
<a name="debugger-release-notes-20231221"></a>

**New features**

Released a remote debugging functionality, a new debugging capability of SageMaker AI that gives you shell-level access to training containers. With this release, you can debug training jobs by logging in to the job containers running on SageMaker AI ML instances. To learn more, see [Access a training container through AWS Systems Manager for remote debugging](train-remote-debugging.md).

## September 7, 2023
<a name="debugger-release-notes-20230907"></a>

**New features**

Added a new utility module, `sagemaker.interactive_apps.tensorboard.TensorBoardApp`, that provides a function called `get_app_url()`. The `get_app_url()` function generates unsigned or presigned URLs to open the TensorBoard application in any environment in SageMaker AI or Amazon EC2. This provides a unified experience for both Studio Classic and non-Studio Classic users. In the Studio Classic environment, you can open TensorBoard by running the `get_app_url()` function as is, or you can specify a job name to start tracking as the TensorBoard application opens. In non-Studio Classic environments, you can open TensorBoard by providing your domain information to the utility function. With this functionality, regardless of where or how you run training code and launch training jobs, you can directly access TensorBoard by running the `get_app_url()` function in your Jupyter notebook or terminal. This functionality is available in the SageMaker Python SDK v2.184.0 and later. For more information, see [Accessing the TensorBoard application on SageMaker AI](debugger-htb-access-tb.md).

## April 4, 2023
<a name="debugger-release-notes-20230404"></a>

**New features**

Released SageMaker AI with TensorBoard, a capability that hosts TensorBoard on SageMaker AI. TensorBoard is available as an application through the SageMaker AI domain, and the SageMaker AI Training platform supports collecting TensorBoard output data to Amazon S3 and loading it automatically into the hosted TensorBoard on SageMaker AI. With this capability, you can run training jobs set up with TensorBoard summary writers in SageMaker AI, save the TensorBoard output files in Amazon S3, open the TensorBoard application directly from the SageMaker AI console, and load the output files using the SageMaker AI Data Manager plugin implemented in the hosted TensorBoard interface. You don't need to install TensorBoard manually or host it locally on the SageMaker AI IDEs or your local machine. To learn more, see [TensorBoard in Amazon SageMaker AI](tensorboard-on-sagemaker.md).

## March 16, 2023
<a name="debugger-release-notes-20230315"></a>

**Deprecation notes**

SageMaker Debugger deprecates the framework profiling feature starting from TensorFlow 2.11 and PyTorch 2.0. You can still use the feature in previous versions of the frameworks and SDK, as follows. 
+ SageMaker Python SDK <= v2.130.0
+ PyTorch >= v1.6.0, < v2.0
+ TensorFlow >= v2.3.1, < v2.11

With the deprecation, SageMaker Debugger also discontinues support for the following three `ProfilerRules` for framework profiling.
+ [MaxInitializationTime](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#max-initialization-time)
+ [OverallFrameworkMetrics](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#overall-framework-metrics)
+ [StepOutlier](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#step-outlier)

## February 21, 2023
<a name="debugger-release-notes-20230221"></a>

**Other changes**
+ The XGBoost report tab has been removed from the SageMaker Debugger profiler dashboard. You can still access the XGBoost report by downloading it as a Jupyter notebook or an HTML file. For more information, see [SageMaker Debugger XGBoost Training Report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-report-xgboost.html).
+ Starting from this release, the built-in profiler rules are not activated by default. To use the SageMaker Debugger profiler rules to detect certain computational problems, you need to add the rules when you configure a SageMaker training job launcher.

## December 1, 2020
<a name="debugger-release-notes-20201201"></a>

Amazon SageMaker Debugger launched deep profiling features at re:Invent 2020.

## December 3, 2019
<a name="debugger-release-notes-20191203"></a>

Amazon SageMaker Debugger initially launched at re:Invent 2019.

# Profile and optimize computational performance
<a name="train-profile-computational-performance"></a>

State-of-the-art deep learning models are rapidly growing in size. When training such models, scaling the training job to a large GPU cluster and identifying computational performance issues among the billions or trillions of operations and communications in every iteration of the gradient descent process become a challenge.

SageMaker AI provides profiling tools to visualize and diagnose such complex computation issues arising from running training jobs on AWS cloud computing resources. SageMaker AI offers two profiling options: Amazon SageMaker Profiler and a resource utilization monitor in Amazon SageMaker Studio Classic. See the following introductions of the two functionalities to gain quick insights and learn which one to use depending on your needs.

**Amazon SageMaker Profiler**

Amazon SageMaker Profiler is a profiling capability of SageMaker AI with which you can deep dive into compute resources provisioned while training deep learning models, and gain visibility into operation-level details. SageMaker Profiler provides Python modules for adding annotations throughout PyTorch or TensorFlow training scripts and activating SageMaker Profiler. You can access the modules through the SageMaker Python SDK and AWS Deep Learning Containers. 

With SageMaker Profiler, you can track all activities on CPUs and GPUs, such as CPU and GPU utilizations, kernel runs on GPUs, kernel launches on CPUs, sync operations, memory operations across CPUs and GPUs, latencies between kernel launches and corresponding runs, and data transfer between CPUs and GPUs. 

SageMaker Profiler also offers a user interface (UI) that visualizes the *profile*, a statistical summary of profiled events, and the timeline of a training job for tracking and understanding the time relationship of the events between GPUs and CPUs.

To learn more about SageMaker Profiler, see [Amazon SageMaker Profiler](train-use-sagemaker-profiler.md).

**Monitoring AWS compute resources in Amazon SageMaker Studio Classic**

SageMaker AI also provides a user interface in Studio Classic for monitoring resource utilization at a high level, but with more granularity than the default utilization metrics collected from SageMaker AI to CloudWatch.

For any training job you run in SageMaker AI using the SageMaker Python SDK, SageMaker AI starts profiling basic resource utilization metrics, such as CPU utilization, GPU utilization, GPU memory utilization, network, and I/O wait time. It collects these resource utilization metrics every 500 milliseconds. 

Compared to Amazon CloudWatch metrics, which collect metrics at intervals of 1 minute, the monitoring functionality of SageMaker AI provides finer granularity into the resource utilization metrics, down to 100-millisecond (0.1 second) intervals, so you can dive deep into the metrics at the level of an operation or a step.

To access the dashboard for monitoring the resource utilization metrics of a training job, see the [SageMaker AI Debugger UI in SageMaker Studio Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio.html).



**Topics**
+ [Amazon SageMaker Profiler](train-use-sagemaker-profiler.md)
+ [Monitor AWS compute resource utilization in Amazon SageMaker Studio Classic](debugger-profile-training-jobs.md)
+ [Release notes for profiling capabilities of Amazon SageMaker AI](profiler-release-notes.md)

# Amazon SageMaker Profiler
<a name="train-use-sagemaker-profiler"></a>


**Important**  
Amazon SageMaker Profiler is currently in preview release and available at no cost in supported AWS Regions. The generally available version of Amazon SageMaker Profiler (if any) may include features and pricing that are different than those offered in preview.

Amazon SageMaker Profiler is a capability of Amazon SageMaker AI that provides a detailed view into the AWS compute resources provisioned during training deep learning models on SageMaker AI. It focuses on profiling the CPU and GPU usage, kernel runs on GPUs, kernel launches on CPUs, sync operations, memory operations across CPUs and GPUs, latencies between kernel launches and corresponding runs, and data transfer between CPUs and GPUs. SageMaker Profiler also offers a user interface (UI) that visualizes the *profile*, a statistical summary of profiled events, and the timeline of a training job for tracking and understanding the time relationship of the events between GPUs and CPUs.

**Note**  
SageMaker Profiler supports PyTorch and TensorFlow and is available in [AWS Deep Learning Containers for SageMaker AI](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only). To learn more, see [Supported framework images, AWS Regions, and instance types](profiler-support.md).

**For data scientists**

Training deep learning models on a large compute cluster often involves computational optimization problems, such as bottlenecks, kernel launch latencies, memory limits, and low resource utilization.

To identify such computational performance issues, you need to profile deeper into the compute resources to understand which kernels introduce latencies and which operations cause bottlenecks. Data scientists can benefit from using the SageMaker Profiler UI to visualize the detailed profile of training jobs. The UI provides a dashboard furnished with summary charts and a timeline interface to track every event on the compute resources. Data scientists can also add custom annotations to track certain parts of the training job using the SageMaker Profiler Python modules.

**For administrators**

Through the Profiler landing page in the SageMaker AI console or [SageMaker AI domain](https://docs.aws.amazon.com/sagemaker/latest/dg/sm-domain.html), you can manage Profiler application users if you are an administrator of an AWS account or a SageMaker AI domain. Each domain user can access their own Profiler application, given the granted permissions. As a SageMaker AI domain administrator or domain user, you can create and delete the Profiler application, depending on your permission level.

**Topics**
+ [Supported framework images, AWS Regions, and instance types](profiler-support.md)
+ [Prerequisites for SageMaker Profiler](profiler-prereq.md)
+ [Prepare and run a training job with SageMaker Profiler](profiler-prepare.md)
+ [Open the SageMaker Profiler UI application](profiler-access-smprofiler-ui.md)
+ [Explore the profile output data visualized in the SageMaker Profiler UI](profiler-explore-viz.md)
+ [Troubleshooting for SageMaker Profiler](profiler-faq.md)

# Supported framework images, AWS Regions, and instance types
<a name="profiler-support"></a>

This feature supports the following machine learning frameworks and AWS Regions.

**Note**  
To use this feature, make sure that you have installed the SageMaker Python SDK [version 2.180.0](https://pypi.org/project/sagemaker/2.180.0/) or later.

## SageMaker AI framework images pre-installed with SageMaker Profiler
<a name="profiler-support-frameworks"></a>

SageMaker Profiler is pre-installed in the following [AWS Deep Learning Containers for SageMaker AI](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).

### PyTorch images
<a name="profiler-support-frameworks-pytorch"></a>


| PyTorch versions | AWS DLC image URI | 
| --- | --- | 
| 2.2.0 |  *763104351884*.dkr.ecr.*<region>*.amazonaws.com/pytorch-training:2.2.0-gpu-py310-cu121-ubuntu20.04-sagemaker  | 
| 2.1.0 |  *763104351884*.dkr.ecr.*<region>*.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker  | 
| 2.0.1 |  *763104351884*.dkr.ecr.*<region>*.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker *763104351884*.dkr.ecr.*<region>*.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu121-ubuntu20.04-sagemaker  | 
| 1.13.1 |  *763104351884*.dkr.ecr.*<region>*.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker  | 

### TensorFlow images
<a name="profiler-support-frameworks-tensorflow"></a>


| TensorFlow versions | AWS DLC image URI | 
| --- | --- | 
| 2.13.0 |  *763104351884*.dkr.ecr.*<region>*.amazonaws.com/tensorflow-training:2.13.0-gpu-py310-cu118-ubuntu20.04-sagemaker  | 
| 2.12.0 |  *763104351884*.dkr.ecr.*<region>*.amazonaws.com/tensorflow-training:2.12.0-gpu-py310-cu118-ubuntu20.04-sagemaker  | 
| 2.11.0 |  *763104351884*.dkr.ecr.*<region>*.amazonaws.com/tensorflow-training:2.11.0-gpu-py39-cu112-ubuntu20.04-sagemaker  | 

**Important**  
Distribution and maintenance of the framework containers in the preceding tables are governed by the [Framework Support Policy](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/support-policy.html) managed by the AWS Deep Learning Containers service. If you are using prior framework versions that are no longer supported, we highly recommend that you upgrade to the [currently supported framework versions](https://aws.amazon.com/releasenotes/dlc-support-policy/).

**Note**  
If you want to use SageMaker Profiler for other framework images or your own Docker images, you can install SageMaker Profiler using the SageMaker Profiler Python package binary files provided in the following section.

## SageMaker Profiler Python package binary files
<a name="profiler-python-package"></a>

If you want to configure your own Docker container, use SageMaker Profiler in other pre-built containers for PyTorch and TensorFlow, or install the SageMaker Profiler Python package locally, choose one of the following binary files, depending on the Python and CUDA versions in your environment.

### PyTorch
<a name="profiler-python-package-for-pytorch"></a>
+ Python3.8, CUDA 11.3: [https://smppy.s3.amazonaws.com/pytorch/cu113/smprof-0.3.334-cp38-cp38-linux_x86_64.whl](https://smppy.s3.amazonaws.com/pytorch/cu113/smprof-0.3.334-cp38-cp38-linux_x86_64.whl)
+ Python3.9, CUDA 11.7: [https://smppy.s3.amazonaws.com/pytorch/cu117/smprof-0.3.334-cp39-cp39-linux_x86_64.whl](https://smppy.s3.amazonaws.com/pytorch/cu117/smprof-0.3.334-cp39-cp39-linux_x86_64.whl)
+ Python3.10, CUDA 11.8: [https://smppy.s3.amazonaws.com/pytorch/cu118/smprof-0.3.334-cp310-cp310-linux_x86_64.whl](https://smppy.s3.amazonaws.com/pytorch/cu118/smprof-0.3.334-cp310-cp310-linux_x86_64.whl)
+ Python3.10, CUDA 12.1: [https://smppy.s3.amazonaws.com/pytorch/cu121/smprof-0.3.334-cp310-cp310-linux_x86_64.whl](https://smppy.s3.amazonaws.com/pytorch/cu121/smprof-0.3.334-cp310-cp310-linux_x86_64.whl)

### TensorFlow
<a name="profiler-python-package-for-tensorflow"></a>
+ Python3.9, CUDA 11.2: [https://smppy.s3.amazonaws.com/tensorflow/cu112/smprof-0.3.334-cp39-cp39-linux_x86_64.whl](https://smppy.s3.amazonaws.com/tensorflow/cu112/smprof-0.3.334-cp39-cp39-linux_x86_64.whl)
+ Python3.10, CUDA 11.8: [https://smppy.s3.amazonaws.com/tensorflow/cu118/smprof-0.3.334-cp310-cp310-linux_x86_64.whl](https://smppy.s3.amazonaws.com/tensorflow/cu118/smprof-0.3.334-cp310-cp310-linux_x86_64.whl)

For more information about how to install SageMaker Profiler using the binary files, see [(Optional) Install the SageMaker Profiler Python package](profiler-prepare.md#profiler-install-python-package).

## Supported AWS Regions
<a name="profiler-support-regions"></a>

SageMaker Profiler is available in the following AWS Regions.
+ US East (N. Virginia) (`us-east-1`)
+ US East (Ohio) (`us-east-2`)
+ US West (Oregon) (`us-west-2`)
+ Europe (Frankfurt) (`eu-central-1`)
+ Europe (Ireland) (`eu-west-1`)

## Supported instance types
<a name="profiler-support-instance-types"></a>

SageMaker Profiler supports profiling of training jobs on the following instance types.

**CPU and GPU profiling**
+ `ml.g4dn.12xlarge`
+ `ml.g5.24xlarge`
+ `ml.g5.48xlarge`
+ `ml.p3dn.24xlarge`
+ `ml.p4de.24xlarge`
+ `ml.p4d.24xlarge`
+ `ml.p5.48xlarge`

**GPU profiling only**
+ `ml.g5.2xlarge`
+ `ml.g5.4xlarge`
+ `ml.g5.8xlarge`
+ `ml.g5.16xlarge`

# Prerequisites for SageMaker Profiler
<a name="profiler-prereq"></a>

The following list shows the prerequisites to start using SageMaker Profiler.
+ A SageMaker AI domain set up with Amazon VPC in your AWS account. 

  For instructions on setting up a domain, see [Onboard to Amazon SageMaker AI domain using quick setup](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html). You also need to add domain user profiles for individual users to access the Profiler UI application. For more information, see [Add user profiles](https://docs.aws.amazon.com/sagemaker/latest/dg/domain-user-profile-add.html).
+ The following list is the minimum set of permissions for using the Profiler UI application.
  + `sagemaker:CreateApp`
  + `sagemaker:DeleteApp`
  + `sagemaker:DescribeTrainingJob`
  + `sagemaker:Search`
  + `s3:GetObject`
  + `s3:ListBucket`
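As an illustration only, the minimum permissions above could be attached as an IAM policy similar to the following sketch. The S3 bucket name is a placeholder; you should scope the `Resource` elements down to your own buckets and resources.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateApp",
                "sagemaker:DeleteApp",
                "sagemaker:DescribeTrainingJob",
                "sagemaker:Search"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<your-output-bucket>",
                "arn:aws:s3:::<your-output-bucket>/*"
            ]
        }
    ]
}
```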

# Prepare and run a training job with SageMaker Profiler
<a name="profiler-prepare"></a>

Setting up and running a training job with SageMaker Profiler consists of two steps: adapting the training script and configuring the SageMaker training job launcher.

**Topics**
+ [Step 1: Adapt your training script using the SageMaker Profiler Python modules](#profiler-prepare-training-script)
+ [Step 2: Create a SageMaker AI framework estimator and activate SageMaker Profiler](#profiler-profilerconfig)
+ [(Optional) Install the SageMaker Profiler Python package](#profiler-install-python-package)

## Step 1: Adapt your training script using the SageMaker Profiler Python modules
<a name="profiler-prepare-training-script"></a>

To start capturing kernel runs on GPUs while the training job is running, modify your training script using the SageMaker Profiler Python modules. Import the library and add the `start_profiling()` and `stop_profiling()` methods to define the beginning and the end of profiling. You can also use optional custom annotations to add markers in the training script to visualize hardware activities during particular operations in each step.

Note that the annotations capture operations on GPUs. To profile operations on CPUs, you don't need to add any additional annotations. CPU profiling is activated when you specify the profiling configuration, which you'll practice in [Step 2: Create a SageMaker AI framework estimator and activate SageMaker Profiler](#profiler-profilerconfig).

**Note**  
Profiling an entire training job is not the most efficient use of resources. We recommend profiling at most 300 steps of a training job.
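One simple way to honor this guidance is to gate your annotations on the step index, so only an initial window of steps is profiled. The helper below is our own sketch, not part of the SageMaker Profiler API.

```python
MAX_PROFILED_STEPS = 300  # recommended upper bound from the note above

def should_profile(step: int, max_steps: int = MAX_PROFILED_STEPS) -> bool:
    # Profile only the first max_steps training steps.
    return step < max_steps

print([step for step in (0, 299, 300, 1000) if should_profile(step)])
# [0, 299]
```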

**Important**  
The release on [December 14, 2023](profiler-release-notes.md#profiler-release-notes-20231214) involves a breaking change. The SageMaker Profiler Python package name is changed from `smppy` to `smprof`. This is effective in the [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) for TensorFlow v2.12 and later.  
If you use one of the previous versions of the [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only), such as TensorFlow v2.11.0, the SageMaker Profiler Python package is still available as `smppy`. If you are uncertain about which version or package name to use, replace the import statement of the SageMaker Profiler package with the following code snippet.  

```
try:
    import smprof 
except ImportError:
    # backward compatibility for TF 2.11 and PT 1.13.1 images
    import smppy as smprof
```

**Approach 1.** Use the context manager `smprof.annotate` to annotate full functions

You can wrap full functions with the `smprof.annotate()` context manager. This wrapper is recommended if you want to profile by functions instead of code lines. The following example script shows how to implement the context manager to wrap the training loop and full functions in each iteration.

```
import smprof

SMProf = smprof.SMProfiler.instance()
config = smprof.Config()
config.profiler = {
    "EnableCuda": "1",
}
SMProf.configure(config)
SMProf.start_profiling()

for epoch in range(args.epochs):
    if world_size > 1:
        sampler.set_epoch(epoch)
    tstart = time.perf_counter()
    for i, data in enumerate(trainloader, 0):
        with smprof.annotate("step_"+str(i)):
            inputs, labels = data
            inputs = inputs.to("cuda", non_blocking=True)
            labels = labels.to("cuda", non_blocking=True)
    
            optimizer.zero_grad()
    
            with smprof.annotate("Forward"):
                outputs = net(inputs)
            with smprof.annotate("Loss"):
                loss = criterion(outputs, labels)
            with smprof.annotate("Backward"):
                loss.backward()
            with smprof.annotate("Optimizer"):
                optimizer.step()

SMProf.stop_profiling()
```

**Approach 2.** Use `smprof.annotation_begin()` and `smprof.annotation_end()` to annotate specific code lines in functions

You can also define annotations to profile specific code lines. You can set the exact start and end points of profiling at the level of individual code lines, rather than by function. For example, in the following script, the `step_annotator` is defined at the beginning of each iteration and ends at the end of the iteration. Meanwhile, other detailed annotators for each operation are defined and wrapped around the target operations throughout each iteration.

```
import smprof

SMProf = smprof.SMProfiler.instance()
config = smprof.Config()
config.profiler = {
    "EnableCuda": "1",
}
SMProf.configure(config)
SMProf.start_profiling()

for epoch in range(args.epochs):
    if world_size > 1:
        sampler.set_epoch(epoch)
    tstart = time.perf_counter()
    for i, data in enumerate(trainloader, 0):
        step_annotator = smprof.annotation_begin("step_" + str(i))

        inputs, labels = data
        inputs = inputs.to("cuda", non_blocking=True)
        labels = labels.to("cuda", non_blocking=True)
        optimizer.zero_grad()

        forward_annotator = smprof.annotation_begin("Forward")
        outputs = net(inputs)
        smprof.annotation_end(forward_annotator)

        loss_annotator = smprof.annotation_begin("Loss")
        loss = criterion(outputs, labels)
        smprof.annotation_end(loss_annotator)

        backward_annotator = smprof.annotation_begin("Backward")
        loss.backward()
        smprof.annotation_end(backward_annotator)

        optimizer_annotator = smprof.annotation_begin("Optimizer")
        optimizer.step()
        smprof.annotation_end(optimizer_annotator)

        smprof.annotation_end(step_annotator)

SMProf.stop_profiling()
```

After annotating and setting up the profiler initiation modules, save the script to submit using a SageMaker training job launcher in the following Step 2. The sample launcher assumes that the training script is named `train_with_profiler_demo.py`.

## Step 2: Create a SageMaker AI framework estimator and activate SageMaker Profiler
<a name="profiler-profilerconfig"></a>

The following procedure shows how to prepare a SageMaker AI framework estimator for training using the SageMaker Python SDK.

1. Set up a `profiler_config` object using the `ProfilerConfig` and `Profiler` modules as follows.

   ```
   from sagemaker import ProfilerConfig, Profiler
   profiler_config = ProfilerConfig(
       profile_params = Profiler(cpu_profiling_duration=3600)
   )
   ```

   The following is the description of the `Profiler` module and its argument.
   +  `Profiler`: The module for activating SageMaker Profiler with the training job.
     +  `cpu_profiling_duration` (int): Specify the time duration in seconds for profiling on CPUs. Default is 3600 seconds. 

1. Create a SageMaker AI framework estimator with the `profiler_config` object created in the previous step. The following code shows an example of creating a PyTorch estimator. If you want to create a TensorFlow estimator, import `sagemaker.tensorflow.TensorFlow` instead, and specify one of the [TensorFlow versions](profiler-support.md#profiler-support-frameworks-tensorflow) supported by SageMaker Profiler. For more information about supported frameworks and instance types, see [SageMaker AI framework images pre-installed with SageMaker Profiler](profiler-support.md#profiler-support-frameworks).

   ```
   import sagemaker
   from sagemaker.pytorch import PyTorch
   
   estimator = PyTorch(
       framework_version="2.0.0",
       role=sagemaker.get_execution_role(),
       entry_point="train_with_profiler_demo.py", # your training job entry point
       source_dir=source_dir, # source directory for your training script
       output_path=output_path,
       base_job_name="sagemaker-profiler-demo",
       hyperparameters=hyperparameters, # if any
       instance_count=1, # Recommended to test with < 8
       instance_type="ml.p4d.24xlarge",
       profiler_config=profiler_config
   )
   ```

1. Start the training job by running the `fit` method. With `wait=False`, you can silence the training job logs and let it run in the background.

   ```
   estimator.fit(wait=False)
   ```

While running the training job or after the job has completed, you can go to the next topic at [Open the SageMaker Profiler UI application](profiler-access-smprofiler-ui.md) and start exploring and visualizing the saved profiles.

If you want to directly access the profile data saved in the Amazon S3 bucket, use the following script to retrieve the S3 URI.

```
import os
# This is an ad-hoc function to get the S3 URI
# to where the profile output data is saved
def get_detailed_profiler_output_uri(estimator):
    config_name = None
    for processing in estimator.profiler_rule_configs:
        params = processing.get("RuleParameters", dict())
        rule = params.get("rule_to_invoke", "")
        if rule == "DetailedProfilerProcessing":
            config_name = processing.get("RuleConfigurationName")
            break
    return os.path.join(
        estimator.output_path, 
        estimator.latest_training_job.name, 
        "rule-output",
        config_name,
    )

print(
    "Profiler output S3 bucket:",
    get_detailed_profiler_output_uri(estimator)
)
```
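For reference, the helper above composes the URI by joining the estimator output path, the training job name, the literal `rule-output` folder, and the rule configuration name. The following minimal sketch illustrates the resulting shape with hypothetical placeholder values (the bucket, job name, and configuration name are for illustration only, not values from your account).

```
import os

# Hypothetical placeholder values for illustration only
output_path = "s3://amzn-s3-demo-bucket/output"            # estimator.output_path
job_name = "sagemaker-profiler-demo-2023-01-01-00-00-00"   # estimator.latest_training_job.name
config_name = "DetailedProfilerProcessingJobConfig"        # RuleConfigurationName

# Same composition as get_detailed_profiler_output_uri
uri = os.path.join(output_path, job_name, "rule-output", config_name)
```

With these placeholder values, `uri` would be `s3://amzn-s3-demo-bucket/output/sagemaker-profiler-demo-2023-01-01-00-00-00/rule-output/DetailedProfilerProcessingJobConfig`.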

## (Optional) Install the SageMaker Profiler Python package
<a name="profiler-install-python-package"></a>

To use SageMaker Profiler on PyTorch or TensorFlow framework images not listed in [SageMaker AI framework images pre-installed with SageMaker Profiler](profiler-support.md#profiler-support-frameworks), or on your own custom Docker container for training, you can install SageMaker Profiler by using one of the [SageMaker Profiler Python package binary files](profiler-support.md#profiler-python-package).

**Option 1: Install the SageMaker Profiler package while launching a training job**

If you want to use SageMaker Profiler for training jobs using PyTorch or TensorFlow images not listed in [SageMaker AI framework images pre-installed with SageMaker Profiler](profiler-support.md#profiler-support-frameworks), create a `requirements.txt` file and place it under the path you specify for the `source_dir` parameter of the SageMaker AI framework estimator in [Step 2](#profiler-profilerconfig). For more information about setting up a `requirements.txt` file in general, see [Using third-party libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#using-third-party-libraries) in the *SageMaker Python SDK documentation*. In the `requirements.txt` file, add one of the S3 bucket paths for the [SageMaker Profiler Python package binary files](profiler-support.md#profiler-python-package).

```
# requirements.txt
https://smppy.s3.amazonaws.com/tensorflow/cu112/smprof-0.3.332-cp39-cp39-linux_x86_64.whl
```
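For example, assuming your training script from Step 1 is named `train_with_profiler_demo.py`, the directory you pass to `source_dir` might be laid out as follows (a hypothetical layout for illustration).

```
source_dir/
├── train_with_profiler_demo.py   # training script specified as entry_point
└── requirements.txt              # contains the smprof wheel URL shown above
```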

**Option 2: Install the SageMaker Profiler package in your custom Docker containers**

If you use a custom Docker container for training, add one of the [SageMaker Profiler Python package binary files](profiler-support.md#profiler-python-package) to your Dockerfile.

```
# Install the smprof package version compatible with your CUDA version
RUN pip install https://smppy.s3.amazonaws.com/tensorflow/cu112/smprof-0.3.332-cp39-cp39-linux_x86_64.whl
```

For guidance on running a custom Docker container for training on SageMaker AI in general, see [Adapting your own training container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html).

# Open the SageMaker Profiler UI application
<a name="profiler-access-smprofiler-ui"></a>

You can access the SageMaker Profiler UI application through the following options.

**Topics**
+ [Option 1: Launch the SageMaker Profiler UI from the domain details page](#profiler-access-smprofiler-ui-console-smdomain)
+ [Option 2: Launch the SageMaker Profiler UI application from the SageMaker Profiler landing page in the SageMaker AI console](#profiler-access-smprofiler-ui-console-profiler-landing-page)
+ [Option 3: Use the application launcher function in the SageMaker AI Python SDK](#profiler-access-smprofiler-ui-app-launcher-function)

## Option 1: Launch the SageMaker Profiler UI from the domain details page
<a name="profiler-access-smprofiler-ui-console-smdomain"></a>

If you have access to the SageMaker AI console, you can use this option.

**Navigate to the domain details page**

 The following procedure shows how to navigate to the domain details page. 

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Domains**. 

1. From the list of domains, select the domain in which you want to launch the SageMaker Profiler application.

**Launch the SageMaker Profiler UI application**

The following procedure shows how to launch the SageMaker Profiler application that is scoped to a user profile. 

1. On the domain details page, choose the **User profiles** tab. 

1. Identify the user profile for which you want to launch the SageMaker Profiler UI application. 

1. Choose **Launch** for the selected user profile, and choose **Profiler**. 

## Option 2: Launch the SageMaker Profiler UI application from the SageMaker Profiler landing page in the SageMaker AI console
<a name="profiler-access-smprofiler-ui-console-profiler-landing-page"></a>

The following procedure describes how to launch the SageMaker Profiler UI application from the SageMaker Profiler landing page in the SageMaker AI console. If you have access to the SageMaker AI console, you can use this option.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, choose **Profiler**.

1. Under **Get started**, select the domain in which you want to launch the SageMaker Profiler UI application. If your user profile only belongs to one domain, you do not see the option for selecting a domain.

1. Select the user profile for which you want to launch the SageMaker Profiler UI application. If there is no user profile in the domain, choose **Create user profile**. For more information about creating a new user profile, see [Add user profiles](https://docs.aws.amazon.com/sagemaker/latest/dg/domain-user-profile-add.html).

1. Choose **Open Profiler**.

## Option 3: Use the application launcher function in the SageMaker AI Python SDK
<a name="profiler-access-smprofiler-ui-app-launcher-function"></a>

If you are a SageMaker AI domain user and have access only to SageMaker Studio, you can access the SageMaker Profiler UI application through SageMaker Studio Classic by running the `DetailProfilerApp` function from the [`sagemaker.interactive_apps.detail_profiler_app`](https://sagemaker.readthedocs.io/en/stable/api/utility/interactive_apps.html#module-sagemaker.interactive_apps.detail_profiler_app) module.

Note that SageMaker Studio Classic is the previous Studio UI experience from before re:Invent 2023, and it was migrated as an application into the newly designed Studio UI at re:Invent 2023. The SageMaker Profiler UI application is available at the SageMaker AI domain level, and thus requires your domain ID and user profile name. Currently, the `DetailProfilerApp` function only works within the SageMaker Studio Classic application; the function properly takes in the domain and user profile information from SageMaker Studio Classic.

For domains, domain users, and Studio environments created before re:Invent 2023, Studio Classic is the default experience unless you have updated it following the instructions at [Migrating from Amazon SageMaker Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-migrate.html). If this is your case, there's no further action needed, and you can directly launch the SageMaker Profiler UI application by running the `DetailProfilerApp` function.

If you created a new domain and Studio after re:Invent 2023, launch the Studio Classic application within the Studio UI, and then run the `DetailProfilerApp` function to launch the SageMaker Profiler UI application.

Note that the `DetailProfilerApp` function doesn’t work in other SageMaker AI machine learning IDEs, such as the SageMaker Studio JupyterLab application, the SageMaker Studio Code Editor application, and SageMaker Notebook instances. If you run the `DetailProfilerApp` function in those IDEs, it returns a URL to the Profiler landing page in the SageMaker AI console, instead of a direct link to open the Profiler UI application.

# Explore the profile output data visualized in the SageMaker Profiler UI
<a name="profiler-explore-viz"></a>

This section walks through the SageMaker Profiler UI and provides tips for how to use and gain insights from it.

## Load profile
<a name="profiler-explore-viz-load"></a>

When you open the SageMaker Profiler UI, the **Load profile** page opens up. To load and generate the **Dashboard** and **Timeline**, go through the following procedure.<a name="profiler-explore-viz-load-procedure"></a>

**To load the profile of a training job**

1. From the **List of training jobs** section, use the check box to choose the training job for which you want to load the profile.

1. Choose **Load**. The job name should appear in the **Loaded profile** section at the top.

1. Choose the radio button to the left of the **Job name** to generate the **Dashboard** and **Timeline**. Note that when you choose the radio button, the UI automatically opens the **Dashboard**. Note also that if you generate the visualizations while the job status and loading status still appear to be in progress, the SageMaker Profiler UI generates **Dashboard** plots and a **Timeline** up to the most recent profile data collected from the ongoing training job or the partially loaded profile data.

**Tip**  
You can load and visualize one profile at a time. To load another profile, you must first unload the previously loaded profile. To unload a profile, use the trash bin icon on the right end of the profile in the **Loaded profile** section.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/profiler/sagemaker-profiler-ui-load-data.png)


## Dashboard
<a name="profiler-explore-viz-overview"></a>

After you finish loading and selecting the training job, the UI opens the **Dashboard** page furnished with the following panels by default.
+ **GPU active time** – This pie chart shows the percentage of GPU active time versus GPU idle time. You can check if your GPUs are more active than idle throughout the entire training job. GPU active time is based on the profile data points with a utilization rate greater than 0%, whereas GPU idle time is the profiled data points with 0% utilization.
+ **GPU utilization over time** – This timeline graph shows the average GPU utilization rate over time per node, aggregating all of the nodes in a single chart. You can check if the GPUs have an unbalanced workload, under-utilization issues, bottlenecks, or idle issues during certain time intervals. To track the utilization rate at the individual GPU level and related kernel runs, use the [Timeline interface](#profiler-explore-viz-timeline). Note that the GPU activity collection starts from where you added the profiler starter function `SMProf.start_profiling()` in your training script, and stops at `SMProf.stop_profiling()`.
+ **CPU active time** – This pie chart shows the percentage of CPU active time versus CPU idle time. You can check if your CPUs are more active than idle throughout the entire training job. CPU active time is based on the profiled data points with a utilization rate greater than 0%, whereas CPU idle time is the profiled data points with 0% utilization.
+ **CPU utilization over time** – This timeline graph shows the average CPU utilization rate over time per node, aggregating all of the nodes in a single chart. You can check if the CPUs are bottlenecked or underutilized during certain time intervals. To track the utilization rate of the CPUs aligned with the individual GPU utilization and kernel runs, use the [Timeline interface](#profiler-explore-viz-timeline). Note that the utilization metrics start from the job initialization.
+ **Time spent by all GPU kernels** – This pie chart shows all GPU kernels operated throughout the training job. It shows the top 15 GPU kernels by default as individual sectors and all other kernels in one sector. Hover over the sectors to see more detailed information. The value shows the total time of the GPU kernels operated in seconds, and the percentage is based on the entire time of the profile. 
+ **Time spent by top 15 GPU kernels** – This pie chart shows all GPU kernels operated throughout the training job. It shows the top 15 GPU kernels as individual sectors. Hover over the sectors to see more detailed information. The value shows the total time of the GPU kernels operated in seconds, and the percentage is based on the entire time of the profile. 
+ **Launch counts of all GPU kernels** – This pie chart shows the number of counts for every GPU kernel launched throughout the training job. It shows the top 15 GPU kernels as individual sectors and all other kernels in one sector. Hover over the sectors to see more detailed information. The value shows the total count of the launched GPU kernels, and the percentage is based on the entire count of all kernels. 
+ **Launch counts of top 15 GPU kernels** – This pie chart shows the number of counts of every GPU kernel launched throughout the training job. It shows the top 15 GPU kernels. Hover over the sectors to see more detailed information. The value shows the total count of the launched GPU kernels, and the percentage is based on the entire count of all kernels. 
+ **Step time distribution** – This histogram shows the distribution of step durations on GPUs. This plot is generated only after you add the step annotator in your training script.
+ **Kernel precision distribution** – This pie chart shows the percentage of time spent on running kernels in different data types such as FP32, FP16, INT32, and INT8. 
+ **GPU activity distribution** – This pie chart shows the percentage of time spent on GPU activities, such as running kernels, memory (`memcpy` and `memset`), and synchronization (`sync`).
+ **GPU memory operations distribution** – This pie chart shows the percentage of time spent on GPU memory operations. This visualizes the `memcpy` activities and helps identify if your training job is spending excessive time on certain memory operations.
+ **Create a new histogram** – Create a new diagram of a custom metric you annotated manually during [Step 1: Adapt your training script using the SageMaker Profiler Python modules](profiler-prepare.md#profiler-prepare-training-script). When adding a custom annotation to a new histogram, select or type the name of the annotation you added in the training script. For example, in the demo training script in Step 1, `step`, `Forward`, `Loss`, `Backward`, and `Optimizer` are the custom annotations. While creating a new histogram, these annotation names should appear in the drop-down menu for metric selection. If you choose `Backward`, the UI adds a histogram of the time spent on backward passes throughout the profiled time to the **Dashboard**. This type of histogram is useful for checking whether there are outliers that take an abnormally long time and cause bottleneck problems.

The following screenshots show the GPU and CPU active time ratio and the average GPU and CPU utilization rate with respect to time per compute node.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/profiler/sagemaker-profiler-ui-dashboard-1.png)


The following screenshot shows an example of pie charts for comparing how many times the GPU kernels are launched and measuring the time spent on running them. In the **Time spent by all GPU kernels** and **Launch counts of all GPU kernels** panels, you can also specify an integer in the input field for *k* to adjust the number of legend items shown in the plots. For example, if you specify 10, the plots show the top 10 most-run and most-launched kernels, respectively.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/profiler/sagemaker-profiler-ui-dashboard-2.png)


The following screenshot shows an example of the step time distribution histogram, and pie charts for the kernel precision distribution, GPU activity distribution, and GPU memory operations distribution.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/profiler/sagemaker-profiler-ui-dashboard-3.png)


## Timeline interface
<a name="profiler-explore-viz-timeline"></a>

To gain a detailed view into the compute resources at the level of operations and kernels scheduled on the CPUs and run on the GPUs, use the **Timeline** interface.

You can zoom in and out and pan left or right in the timeline interface using your mouse, the `[w, a, s, d]` keys, or the four arrow keys on the keyboard.

**Tip**  
For more tips on the keyboard shortcuts to interact with the **Timeline** interface, choose **Keyboard shortcuts** in the left pane.

The timeline tracks are organized in a tree structure, giving you information from the host level to the device level. For example, if you run `N` instances with eight GPUs in each, the timeline structure of each instance would be as follows.
+ **algo-inode** – This is the tag SageMaker AI uses to assign jobs to provisioned instances. The digit inode is randomly assigned. For example, if you use 4 instances, this section expands from **algo-1** to **algo-4**.
  + **CPU** – In this section, you can check the average CPU utilization rate and performance counters.
  + **GPUs** – In this section, you can check the average GPU utilization rate, individual GPU utilization rate, and kernels.
    + **SUM Utilization** – The average GPU utilization rates per instance.
    + **HOST-0 PID-123** – A unique name assigned to each process track. The acronym PID is the process ID, and the number appended to it is the process ID number that's recorded during data capture from the process. This section shows the following information from the process.
      + **GPU-inum_gpu utilization** – The utilization rate of the inum_gpu-th GPU over time.
      + **GPU-inum_gpu device** – The kernel runs on the inum_gpu-th GPU device.
        + **stream icuda_stream** – CUDA streams showing kernel runs on the GPU device. To learn more about CUDA streams, see the slides in PDF at [CUDA C/C++ Streams and Concurrency](https://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf) provided by NVIDIA.
      + **GPU-inum_gpu host** – The kernel launches on the inum_gpu-th GPU host.

The following several screenshots show the **Timeline** of the profile of a training job run on `ml.p4d.24xlarge` instances, each of which is equipped with 8 NVIDIA A100 Tensor Core GPUs.

The following is a zoomed-out view of the profile, showing a dozen steps, including an intermittent data loader between `step_232` and `step_233` that fetches the next data batch.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/profiler/sagemaker-profiler-ui-timeline-1.png)


For each CPU, you can track the CPU utilization and performance counters, such as `"clk_unhalted_ref.tsc"` and `"itlb_misses.miss_causes_a_walk"`, which are indicative of instructions run on the CPU.

For each GPU, you can see a host timeline and a device timeline. Kernel launches appear on the host timeline, and kernel runs appear on the device timeline. You can also see annotations (such as `Forward`, `Backward`, and `Optimizer`) in the GPU host timeline if you have added them in your training script.

In the timeline view, you can also track kernel launch-and-run pairs. This helps you understand how a kernel launch scheduled on a host (CPU) is run on the corresponding GPU device.

**Tip**  
Press the `f` key to zoom into the selected kernel.

The following screenshot is a zoomed-in view into `step_233` and `step_234` from the previous screenshot. The timeline interval selected in the following screenshot is the `AllReduce` operation, an essential communication and synchronization step in distributed training, run on the GPU-0 device. In the screenshot, note that the kernel launch in the GPU-0 host connects to the kernel run in the GPU-0 device stream 1, indicated with the arrow in cyan color.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/profiler/sagemaker-profiler-ui-timeline-2.png)


Also, two information tabs appear in the bottom pane of the UI when you select a timeline interval, as shown in the previous screenshot. The **Current Selection** tab shows the details of the selected kernel and the connected kernel launch from the host. The connection direction is always from host (CPU) to device (GPU) since each GPU kernel is always called from a CPU. The **Connections** tab shows the chosen kernel launch and run pair. You can select either of them to move it to the center of the **Timeline** view.

The following screenshot zooms in further into the `AllReduce` operation launch and run pair. 

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/profiler/sagemaker-profiler-ui-timeline-3.png)


## Information
<a name="profiler-expore-viz-information"></a>

In **Information**, you can access information about the loaded training job, such as the instance type, Amazon Resource Names (ARNs) of compute resources provisioned for the job, node names, and hyperparameters.

## Settings
<a name="profiler-expore-viz-settings"></a>

The SageMaker AI Profiler UI application instance is configured to shut down after 2 hours of idle time by default. In **Settings**, use the following settings to adjust the auto shutdown timer.
+ **Enable app auto shutdown** – Choose and set to **Enabled** to let the application automatically shut down after the specified number of hours of idle time. To turn off the auto-shutdown functionality, choose **Disabled**.
+ **Auto shutdown threshold in hours** – If you choose **Enabled** for **Enable app auto shutdown**, you can set the threshold time in hours for the application to shut down automatically. This is set to 2 by default.

# Troubleshooting for SageMaker Profiler
<a name="profiler-faq"></a>

Use the following question-and-answer pairs to troubleshoot problems while using SageMaker Profiler.

**Q. I’m getting an error message, `ModuleNotFoundError: No module named 'smppy'`**

Since December 2023, the name of the SageMaker Profiler Python package has changed from `smppy` to `smprof` to resolve a duplicate package name issue; `smppy` is already used by an open source package.

Therefore, if you have been using `smppy` since before December 2023 and are experiencing this `ModuleNotFoundError` issue, it might be due to the outdated package name in your training script while you have the latest `smprof` package installed or are using one of the latest [SageMaker AI framework images pre-installed with SageMaker Profiler](profiler-support.md#profiler-support-frameworks). In this case, make sure that you replace all mentions of `smppy` with `smprof` throughout your training script.

While updating the SageMaker Profiler Python package name in your training scripts, to avoid confusion around which version of the package name you should use, consider using a conditional import statement as shown in the following code snippet.

```
try:
    import smprof 
except ImportError:
    # backward compatibility for TF 2.11 and PT 1.13.1 images
    import smppy as smprof
```

Also note that if you have been using `smppy` while upgrading to the latest PyTorch or TensorFlow versions, make sure that you install the latest `smprof` package by following instructions at [(Optional) Install the SageMaker Profiler Python package](profiler-prepare.md#profiler-install-python-package).

**Q. I’m getting an error message, `ModuleNotFoundError: No module named 'smprof'`**

First, make sure that you use one of the officially supported SageMaker AI Framework Containers. If you don’t use one of those, you can install the `smprof` package by following instructions at [(Optional) Install the SageMaker Profiler Python package](profiler-prepare.md#profiler-install-python-package).

**Q. I’m not able to import `ProfilerConfig`**

If you are unable to import `ProfilerConfig` in your job launcher script using the SageMaker Python SDK, your local environment or the Jupyter kernel might have a significantly outdated version of the SageMaker Python SDK. Make sure that you upgrade the SDK to the latest version.

```
$ pip install --upgrade sagemaker
```

**Q. I’m getting an error message, `aborted: core dumped`, when importing `smprof` into my training script**

In an earlier version of `smprof`, this issue occurs with PyTorch 2.0+ and PyTorch Lightning. To resolve this issue, install the latest `smprof` package by following the instructions at [(Optional) Install the SageMaker Profiler Python package](profiler-prepare.md#profiler-install-python-package).

**Q. I cannot find the SageMaker Profiler UI from SageMaker Studio. How can I find it?**

If you have access to the SageMaker AI console, choose one of the following options.
+ [Option 1: Launch the SageMaker Profiler UI from the domain details page](profiler-access-smprofiler-ui.md#profiler-access-smprofiler-ui-console-smdomain)
+ [Option 2: Launch the SageMaker Profiler UI application from the SageMaker Profiler landing page in the SageMaker AI console](profiler-access-smprofiler-ui.md#profiler-access-smprofiler-ui-console-profiler-landing-page)

If you are a domain user and don't have access to the SageMaker AI console, you can access the application through SageMaker Studio Classic. If this is your case, choose the following option.
+ [Option 3: Use the application launcher function in the SageMaker AI Python SDK](profiler-access-smprofiler-ui.md#profiler-access-smprofiler-ui-app-launcher-function)

# Monitor AWS compute resource utilization in Amazon SageMaker Studio Classic
<a name="debugger-profile-training-jobs"></a>

To track compute resource utilization of your training job, use the monitoring tools offered by Amazon SageMaker Debugger. 

For any training job you run in SageMaker AI using the SageMaker Python SDK, Debugger collects basic resource utilization metrics, such as CPU utilization, GPU utilization, GPU memory utilization, network, and I/O wait time, every 500 milliseconds. To see the dashboard of the resource utilization metrics of your training job, simply use the [SageMaker Debugger UI in SageMaker Studio Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio.html).

Deep learning operations and steps might operate in intervals of milliseconds. Compared to Amazon CloudWatch metrics, which collect metrics at intervals of 1 second, Debugger provides finer granularity into the resource utilization metrics down to 100-millisecond (0.1 second) intervals so you can dive deep into the metrics at the level of an operation or a step. 

If you want to change the metric collection interval, you can add a parameter for the profiling configuration to your training job launcher. For example, if you're using the SageMaker AI Python SDK, you pass the `profiler_config` parameter when you create an estimator object. To learn how to adjust the resource utilization metric collection interval, see [Code template for configuring a SageMaker AI estimator object with the SageMaker Debugger Python modules in the SageMaker AI Python SDK](debugger-configuration-for-profiling.md#debugger-configuration-structure-profiler) and then [Configure settings for basic profiling of system resource utilization](debugger-configure-system-monitoring.md).

Additionally, you can add issue detecting tools called *built-in profiling rules* provided by SageMaker Debugger. The built-in profiling rules run analysis against the resource utilization metrics and detect computational performance issues. For more information, see [Use built-in profiler rules managed by Amazon SageMaker Debugger](use-debugger-built-in-profiler-rules.md). You can receive rule analysis results through the [SageMaker Debugger UI in SageMaker Studio Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio.html) or the [SageMaker Debugger Profiling Report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html). You can also create custom profiling rules using the SageMaker Python SDK. 

To learn more about monitoring functionalities provided by SageMaker Debugger, see the following topics.

**Topics**
+ [Estimator configuration with parameters for basic profiling using the Amazon SageMaker Debugger Python modules](debugger-configuration-for-profiling.md)
+ [Use built-in profiler rules managed by Amazon SageMaker Debugger](use-debugger-built-in-profiler-rules.md)
+ [List of Debugger built-in profiler rules](debugger-built-in-profiler-rules.md)
+ [Amazon SageMaker Debugger UI in Amazon SageMaker Studio Classic Experiments](debugger-on-studio.md)
+ [SageMaker Debugger interactive report](debugger-profiling-report.md)
+ [Analyze data using the Debugger Python client library](debugger-analyze-data.md)

# Estimator configuration with parameters for basic profiling using the Amazon SageMaker Debugger Python modules
<a name="debugger-configuration-for-profiling"></a>

SageMaker Debugger basic profiling is on by default and monitors resource utilization metrics, such as CPU utilization, GPU utilization, GPU memory utilization, network, and I/O wait time, for all SageMaker training jobs submitted using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). SageMaker Debugger collects these resource utilization metrics every 500 milliseconds. You don't need to make any additional changes to your code, training script, or job launcher to track basic resource utilization. If you want to change the metric collection interval for basic profiling, you can specify Debugger-specific parameters when creating a SageMaker training job launcher using the SageMaker Python SDK, AWS SDK for Python (Boto3), or AWS Command Line Interface (AWS CLI). In this guide, we focus on how to change profiling options using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). This page gives reference templates for configuring the estimator object.

If you want to access the resource utilization metrics dashboard for your training job in SageMaker Studio, go to the [Amazon SageMaker Debugger UI in Amazon SageMaker Studio Classic Experiments](debugger-on-studio.md).

If you want to activate rules that automatically detect system resource utilization problems, add the `rules` parameter to the estimator object.

**Important**  
To use the latest SageMaker Debugger features, you need to upgrade the SageMaker Python SDK and the `SMDebug` client library. In your IPython kernel, Jupyter notebook, or JupyterLab environment, run the following code to install the latest versions of the libraries and restart the kernel.  

```
import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)
```

## Code template for configuring a SageMaker AI estimator object with the SageMaker Debugger Python modules in the SageMaker AI Python SDK
<a name="debugger-configuration-structure-profiler"></a>

To adjust the basic profiling configuration (`profiler_config`) or add the profiler rules (`rules`), choose one of the tabs to get the template for setting up a SageMaker AI estimator. In the subsequent pages, you can find more information about how to configure the two parameters.

**Note**  
The following code examples are not directly executable. Proceed to the next sections to learn how to configure each parameter.

------
#### [ PyTorch ]

```
# An example of constructing a SageMaker AI PyTorch estimator
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

session=boto3.session.Session()
region=session.region_name

profiler_config=ProfilerConfig(...)
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=PyTorch(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-profiling-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12.0",
    py_version="py37",
    
    # SageMaker Debugger parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ TensorFlow ]

```
# An example of constructing a SageMaker AI TensorFlow estimator
import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

session=boto3.session.Session()
region=session.region_name

profiler_config=ProfilerConfig(...)
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=TensorFlow(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-profiling-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.8.0",
    py_version="py37",
    
    # SageMaker Debugger parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ MXNet ]

```
# An example of constructing a SageMaker AI MXNet estimator
import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

profiler_config=ProfilerConfig(...)
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=MXNet(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-profiling-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.7.0",
    py_version="py37",
    
    # SageMaker Debugger parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
```

**Note**  
For MXNet, when configuring the `profiler_config` parameter, you can only configure for system monitoring. Profiling framework metrics is not supported for MXNet.

------
#### [ XGBoost ]

```
# An example of constructing a SageMaker AI XGBoost estimator
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.debugger import ProfilerConfig, ProfilerRule, rule_configs

profiler_config=ProfilerConfig(...)
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=XGBoost(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-profiling-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.5-1",

    # Debugger-specific parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
```

**Note**  
For XGBoost, when configuring the `profiler_config` parameter, you can only configure for system monitoring. Profiling framework metrics is not supported for XGBoost.

------
#### [ Generic estimator ]

```
# An example of constructing a SageMaker AI generic estimator using the XGBoost algorithm base image
import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import ProfilerConfig, DebuggerHookConfig, Rule, ProfilerRule, rule_configs

profiler_config=ProfilerConfig(...)
rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

region=boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    
    # Debugger-specific parameters
    profiler_config=profiler_config,
    rules=rules
)

estimator.fit(wait=False)
```

------

The following provides brief descriptions of the parameters.
+ `profiler_config` – Configure Debugger to collect system metrics and framework metrics from your training job and save them to your secured S3 bucket URI or local machine. You can set how frequently to collect the system metrics. To learn how to configure the `profiler_config` parameter, see [Configure settings for basic profiling of system resource utilization](debugger-configure-system-monitoring.md) and [Estimator configuration for framework profiling](debugger-configure-framework-profiling.md).
+ `rules` – Configure this parameter to activate SageMaker Debugger built-in rules that you want to run in parallel. The rules run on processing containers and automatically analyze your training job to find computational and operational performance issues. The [ProfilerReport](debugger-built-in-profiler-rules.md#profiler-report) rule is the most comprehensive rule; it runs all built-in profiling rules and saves the profiling results as a report in your secured S3 bucket. Make sure that your training job has access to this S3 bucket. To learn how to configure the `rules` parameter, see [Use built-in profiler rules managed by Amazon SageMaker Debugger](use-debugger-built-in-profiler-rules.md).

**Note**  
Debugger securely saves output data in subfolders of your default S3 bucket. For example, the format of the default S3 bucket URI is `s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/<debugger-subfolders>/`. There are three subfolders created by Debugger: `debug-output`, `profiler-output`, and `rule-output`. You can also retrieve the default S3 bucket URIs using the [SageMaker AI estimator classmethods](debugger-estimator-classmethods.md).
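As an illustration of the path convention in the note above, the following sketch composes the default Debugger output URIs from their parts. The Region, account ID, and job name are hypothetical placeholders.

```
# Sketch: compose the default Debugger output URIs described above.
# The Region, account ID, and job name are hypothetical placeholders.
region = "us-east-1"
account_id = "111122223333"          # 12-digit AWS account ID
base_job_name = "debugger-profiling-demo"

default_bucket = f"s3://sagemaker-{region}-{account_id}"
debugger_output_uris = {
    subfolder: f"{default_bucket}/{base_job_name}/{subfolder}/"
    for subfolder in ("debug-output", "profiler-output", "rule-output")
}
print(debugger_output_uris["profiler-output"])
```

In practice, you can retrieve these URIs from a trained estimator with the classmethods linked above instead of composing them by hand.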

See the following topics to find out how to configure the Debugger-specific parameters in detail.

**Topics**
+ [Code template for configuring a SageMaker AI estimator object with the SageMaker Debugger Python modules in the SageMaker AI Python SDK](#debugger-configuration-structure-profiler)
+ [Configure settings for basic profiling of system resource utilization](debugger-configure-system-monitoring.md)
+ [Estimator configuration for framework profiling](debugger-configure-framework-profiling.md)
+ [Updating Debugger system monitoring and framework profiling configuration while a training job is running](debugger-update-monitoring-profiling.md)
+ [Turn off Debugger](debugger-turn-off-profiling.md)

# Configure settings for basic profiling of system resource utilization
<a name="debugger-configure-system-monitoring"></a>

To adjust the time interval for collecting the utilization metrics, use the `ProfilerConfig` class to create a parameter object while constructing a SageMaker AI framework or generic estimator, depending on your preference.

**Note**  
By default, Debugger collects resource utilization metrics from Amazon EC2 instances every 500 milliseconds for system monitoring across all SageMaker training jobs, without requiring any Debugger-specific parameters in SageMaker AI estimators.  
Debugger saves the system metrics in the default S3 bucket. The format of the default S3 bucket URI is `s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/profiler-output/`.

The following code example shows how to set up the `profiler_config` parameter with a system monitoring time interval of 1000 milliseconds.

```
from sagemaker.debugger import ProfilerConfig

profiler_config=ProfilerConfig(
    system_monitor_interval_millis=1000
)
```
+  `system_monitor_interval_millis` (int) – Specify the monitoring intervals in milliseconds to record system metrics. Available values are 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds. The default value is 500 milliseconds.
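Because an unsupported interval value causes the job configuration to fail, a small client-side check can catch typos before a training job is launched. The helper below is a hypothetical convenience, not part of the SageMaker SDK; it only encodes the allowed values listed above.

```
# Hypothetical helper (not part of the SageMaker SDK): validate the
# system monitoring interval against the supported values listed above.
ALLOWED_INTERVALS_MS = {100, 200, 500, 1000, 5000, 60000}

def validated_interval(interval_ms: int) -> int:
    """Return interval_ms if it is a supported value; otherwise raise ValueError."""
    if interval_ms not in ALLOWED_INTERVALS_MS:
        raise ValueError(
            f"{interval_ms} ms is not supported; choose one of "
            f"{sorted(ALLOWED_INTERVALS_MS)}"
        )
    return interval_ms

print(validated_interval(1000))
```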

To see the progress of system monitoring, see [Open the Amazon SageMaker Debugger Insights dashboard](debugger-on-studio-insights.md).

# Estimator configuration for framework profiling
<a name="debugger-configure-framework-profiling"></a>

**Warning**  
In favor of [Amazon SageMaker Profiler](train-use-sagemaker-profiler.md), SageMaker AI Debugger deprecates the framework profiling feature starting from TensorFlow 2.11 and PyTorch 2.0. You can still use the feature in the previous versions of the frameworks and SDKs as follows.   
SageMaker Python SDK <= v2.130.0  
PyTorch >= v1.6.0, < v2.0  
TensorFlow >= v2.3.1, < v2.11  
See also [March 16, 2023](debugger-release-notes.md#debugger-release-notes-20230315).

To enable Debugger framework profiling, configure the `framework_profile_params` parameter when you construct an estimator. Debugger framework profiling collects framework metrics, such as data from the initialization stage, data loader processes, and Python operators of deep learning frameworks and training scripts, and provides detailed profiling within and between steps with the cProfile or Pyinstrument options. Using the `FrameworkProfile` class, you can configure custom framework profiling options.

**Note**  
Before getting started with Debugger framework profiling, verify that the framework used to build your model is supported by Debugger for framework profiling. For more information, see [Supported frameworks and algorithms](debugger-supported-frameworks.md).   
Debugger saves the framework metrics in a default S3 bucket. The format of the default S3 bucket URI is `s3://sagemaker-<region>-<12digit_account_id>/<training-job-name>/profiler-output/`.

**Topics**
+ [Default framework profiling](debugger-configure-framework-profiling-basic.md)
+ [Default system monitoring and customized framework profiling for target steps or a target time range](debugger-configure-framework-profiling-range.md)
+ [Default system monitoring and customized framework profiling with different profiling options](debugger-configure-framework-profiling-options.md)

# Default framework profiling
<a name="debugger-configure-framework-profiling-basic"></a>

Debugger framework default profiling includes the following options: detailed profiling, data loader profiling, and Python profiling. The following example code is the simplest `profiler_config` parameter setting to start the default system monitoring and the default framework profiling. The `FrameworkProfile` class in the following example code initiates the default framework profiling when a training job starts. 

```
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
    
profiler_config=ProfilerConfig(
    framework_profile_params=FrameworkProfile()
)
```

With this `profiler_config` parameter configuration, Debugger uses the default settings for monitoring and profiling: it monitors system metrics every 500 milliseconds, profiles the fifth step with the detailed profiling option, the seventh step with the data loader profiling option, and the ninth, tenth, and eleventh steps with the Python profiling option.
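The default schedule just described can be summarized as data. The mapping below is only a restatement of the paragraph above for quick reference; it is not a SageMaker SDK structure.

```
# The default framework profiling schedule described above, restated as data.
# Illustrative only; this dictionary is not a SageMaker SDK structure.
DEFAULT_FRAMEWORK_PROFILING_STEPS = {
    "detailed_profiling": [5],
    "dataloader_profiling": [7],
    "python_profiling": [9, 10, 11],
}

for option, steps in DEFAULT_FRAMEWORK_PROFILING_STEPS.items():
    print(f"{option}: steps {steps}")
```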

To find available profiling configuration options, the default parameter settings, and examples of how to configure them, see [Default system monitoring and customized framework profiling with different profiling options](debugger-configure-framework-profiling-options.md) and [SageMaker Debugger APIs – FrameworkProfile](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.FrameworkProfile) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

If you want to change the system monitoring interval and enable the default framework profiling, you can specify the `system_monitor_interval_millis` parameter explicitly with the `framework_profile_params` parameter. For example, to monitor every 1000 milliseconds and enable the default framework profiling, use the following example code.

```
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
    
profiler_config=ProfilerConfig(
    system_monitor_interval_millis=1000,
    framework_profile_params=FrameworkProfile()
)
```

For more information about the `FrameworkProfile` class, see [SageMaker Debugger APIs – FrameworkProfile](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.FrameworkProfile) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

# Default system monitoring and customized framework profiling for target steps or a target time range
<a name="debugger-configure-framework-profiling-range"></a>

If you want to specify target steps or target time intervals to profile your training job, you need to specify parameters for the `FrameworkProfile` class. The following code examples show how to specify the target ranges for profiling along with system monitoring.
+ **For a target step range**

  With the following example configuration, Debugger monitors the entire training job every 500 milliseconds (the default monitoring) and profiles a target step range from step 5 to step 15 (for 10 steps).

  ```
  from sagemaker.debugger import ProfilerConfig, FrameworkProfile
      
  profiler_config=ProfilerConfig(
      framework_profile_params=FrameworkProfile(start_step=5, num_steps=10)
  )
  ```

  With the following example configuration, Debugger monitors the entire training job every 1000 milliseconds and profiles a target step range from step 5 to step 15 (for 10 steps).

  ```
  from sagemaker.debugger import ProfilerConfig, FrameworkProfile
      
  profiler_config=ProfilerConfig(
      system_monitor_interval_millis=1000,
      framework_profile_params=FrameworkProfile(start_step=5, num_steps=10)
  )
  ```
+ **For a target time range**

  With the following example configuration, Debugger monitors the entire training job every 500 milliseconds (the default monitoring) and profiles a target time range from the current Unix time for 600 seconds.

  ```
  import time
  from sagemaker.debugger import ProfilerConfig, FrameworkProfile
  
  profiler_config=ProfilerConfig(
      framework_profile_params=FrameworkProfile(start_unix_time=int(time.time()), duration=600)
  )
  ```

  With the following example configuration, Debugger monitors the entire training job every 1000 milliseconds and profiles a target time range from the current Unix time for 600 seconds.

  ```
  import time
  from sagemaker.debugger import ProfilerConfig, FrameworkProfile
  
  profiler_config=ProfilerConfig(
      system_monitor_interval_millis=1000,
      framework_profile_params=FrameworkProfile(start_unix_time=int(time.time()), duration=600)
  )
  ```

  The framework profiling is performed for all of the profiling options at the target step or time range. 

  To find more information about available profiling options, see [SageMaker Debugger APIs – FrameworkProfile](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.FrameworkProfile) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

  The next section shows you how to script the available profiling options.

# Default system monitoring and customized framework profiling with different profiling options
<a name="debugger-configure-framework-profiling-options"></a>

This section gives information about the supported profiling configuration classes, as well as an example configuration. You can use the following profiling configuration classes to manage the framework profiling options:
+ [DetailedProfilingConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DetailedProfilingConfig) – Specify a target step or time range to profile framework operations using the native framework profilers (TensorFlow profiler and PyTorch profiler). For example, if using TensorFlow, the Debugger hooks enable the TensorFlow profiler to collect TensorFlow-specific framework metrics. Detailed profiling enables you to profile all framework operators at a pre-step (before the first step), within steps, and between steps of a training job.
**Note**  
Detailed profiling might significantly increase GPU memory consumption. We do not recommend enabling detailed profiling for more than a couple of steps.
+ [DataloaderProfilingConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DataloaderProfilingConfig) – Specify a target step or time range to profile deep learning framework data loader processes. Debugger collects every data loader event of the frameworks.
**Note**  
Data loader profiling might lower the training performance while collecting information from data loaders. We don't recommend enabling data loader profiling for more than a couple of steps.  
Debugger is preconfigured to annotate data loader processes only for the AWS deep learning containers. Debugger cannot profile data loader processes from any other custom or external training containers.
+ [PythonProfilingConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.PythonProfilingConfig) – Specify a target step or time range to profile Python functions. You can also choose between two Python profilers: cProfile and Pyinstrument.
  + *cProfile* – The standard Python profiler. cProfile collects information for every Python operator called during training. With cProfile, Debugger saves cumulative time and annotation for each function call, providing complete detail about Python functions. In deep learning, for example, the most frequently called functions might be the convolutional filters and backward pass operators, and cProfile profiles every single one of them. For the cProfile option, you can further select a timer option: total time, CPU time, or off-CPU time. While you can profile every function call executing on processors (both CPU and GPU) in CPU time, you can also identify I/O or network bottlenecks with the off-CPU time option. The default is total time, with which Debugger profiles both CPU and off-CPU time. With cProfile, you can drill down to every single function when analyzing the profile data.
  + *Pyinstrument* – Pyinstrument is a low-overhead Python profiler based on sampling. With the Pyinstrument option, Debugger samples profiling events every millisecond. Because Pyinstrument measures elapsed wall-clock time instead of CPU time, the Pyinstrument option can be a better choice than the cProfile option for reducing profiling noise (filtering out irrelevant function calls that are cumulatively fast) and capturing operators that are actually compute intensive (cumulatively slow) for training your model. With Pyinstrument, you can see a tree of function calls and better understand the structure and root cause of the slowness.
**Note**  
Enabling Python profiling might slow down the overall training time. cProfile profiles the most frequently called Python operators at every call, so the processing time on profiling increases with respect to the number of calls. For Pyinstrument, the cumulative profiling time increases with respect to time because of its sampling mechanism.

The following example configuration shows the full structure when you use the different profiling options with specified values.

```
import time
from sagemaker.debugger import (ProfilerConfig, 
                                FrameworkProfile, 
                                DetailedProfilingConfig, 
                                DataloaderProfilingConfig, 
                                PythonProfilingConfig,
                                PythonProfiler, cProfileTimer)

profiler_config=ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(
        detailed_profiling_config=DetailedProfilingConfig(
            start_step=5, 
            num_steps=1
        ),
        dataloader_profiling_config=DataloaderProfilingConfig(
            start_step=7, 
            num_steps=1
        ),
        python_profiling_config=PythonProfilingConfig(
            start_step=9, 
            num_steps=1, 
            python_profiler=PythonProfiler.CPROFILE, 
            cprofile_timer=cProfileTimer.TOTAL_TIME
        )
    )
)
```

For more information about available profiling options, see [DetailedProfilingConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DetailedProfilingConfig), [DataloaderProfilingConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DataloaderProfilingConfig), and [PythonProfilingConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.PythonProfilingConfig) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

# Updating Debugger system monitoring and framework profiling configuration while a training job is running
<a name="debugger-update-monitoring-profiling"></a>

If you want to activate or update the Debugger monitoring configuration for a training job that is currently running, use the following SageMaker AI estimator extension methods:
+ To activate Debugger system monitoring for a running training job and receive a Debugger profiling report, use the following:

  ```
  estimator.enable_default_profiling()
  ```

  When you use the `enable_default_profiling` method, Debugger initiates the default system monitoring and the `ProfilerReport` built-in rule, which generates a comprehensive profiling report at the end of the training job. This method can be called only if the current training job is running without Debugger monitoring and profiling.

  For more information, see [estimator.enable_default_profiling](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.enable_default_profiling) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).
+ To update system monitoring configuration, use the following:

  ```
  estimator.update_profiler(
      system_monitor_interval_millis=500
  )
  ```

  For more information, see [estimator.update_profiler](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.update_profiler) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

# Turn off Debugger
<a name="debugger-turn-off-profiling"></a>

If you want to completely turn off Debugger, do one of the following:
+ Before starting a training job, do the following:

  To turn off profiling, add the `disable_profiler` parameter to your estimator and set it to `True`.
**Warning**  
If you disable profiling, you won't be able to view the comprehensive Studio Debugger insights dashboard or the autogenerated profiling report.

  To turn off debugging, set the `debugger_hook_config` parameter to `False`.
**Warning**  
If you disable debugging, you won't be able to collect output tensors or debug your model parameters.

  ```
  estimator=Estimator(
      ...
      disable_profiler=True,
      debugger_hook_config=False
  )
  ```

  For more information about the Debugger-specific parameters, see [SageMaker AI Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).
+ While a training job is running, do the following:

  To disable both monitoring and profiling while your training job is running, use the following estimator classmethod:

  ```
  estimator.disable_profiling()
  ```

  To disable framework profiling only and keep system monitoring, use the `update_profiler` method:

  ```
  estimator.update_profiler(disable_framework_metrics=True)
  ```

  For more information about the estimator extension methods, see the [estimator.disable_profiling](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.disable_profiling) and [estimator.update_profiler](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.update_profiler) classmethods in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) documentation.

# Use built-in profiler rules managed by Amazon SageMaker Debugger
<a name="use-debugger-built-in-profiler-rules"></a>

The Amazon SageMaker Debugger built-in profiler rules analyze system metrics and framework operations collected during the training of a model. Debugger offers the `ProfilerRule` API operation that helps configure the rules to monitor training compute resources and operations and to detect anomalies. For example, the profiling rules can help you detect whether there are computational problems such as CPU bottlenecks, excessive I/O wait time, imbalanced workload across GPU workers, and compute resource underutilization. To see a full list of available built-in profiling rules, see [List of Debugger built-in profiler rules](debugger-built-in-profiler-rules.md). The following topics show how to use the Debugger built-in rules with default parameter settings and custom parameter values.

**Note**  
The built-in rules are provided through Amazon SageMaker processing containers and fully managed by SageMaker Debugger at no additional cost. For more information about billing, see the [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) page.

**Topics**
+ [Use SageMaker Debugger built-in profiler rules with their default parameter settings](#debugger-built-in-profiler-rules-configuration)
+ [Use Debugger built-in profiler rules with custom parameter values](#debugger-built-in-profiler-rules-configuration-param-change)

## Use SageMaker Debugger built-in profiler rules with their default parameter settings
<a name="debugger-built-in-profiler-rules-configuration"></a>

To add SageMaker Debugger built-in rules in your estimator, you need to configure a `rules` list object. The following example code shows the basic structure of listing the SageMaker Debugger built-in rules.

```
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules=[
    ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_1()),
    ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_2()),
    ...
    ProfilerRule.sagemaker(rule_configs.BuiltInProfilerRuleName_n()),
    ... # You can also append more debugging rules in the Rule.sagemaker(rule_configs.*()) format.
]

estimator=Estimator(
    ...
    rules=rules
)
```

For a complete list of available built-in rules, see [List of Debugger built-in profiler rules](debugger-built-in-profiler-rules.md).

To use the profiling rules and inspect the computational performance and progress of your training job, add the [ProfilerReport](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-profiler-rules.html#profiler-report) rule of SageMaker Debugger. This rule activates all built-in rules in the [Debugger `ProfilerRule`](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-profiler-rules.html#debugger-built-in-profiler-rules-ProfilerRule) family and generates an aggregated profiling report. For more information, see [Profiling Report Generated Using SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html). You can use the following code to add the profiling report rule to your training estimator.

```
from sagemaker.debugger import ProfilerRule, rule_configs

rules=[
    ProfilerRule.sagemaker(rule_configs.ProfilerReport())
]
```

When you start the training job with the `ProfilerReport` rule, Debugger collects resource utilization data every 500 milliseconds. Debugger analyzes the resource utilization to identify whether your model is having bottleneck problems. If the rules detect training anomalies, the rule evaluation status changes to `IssueFound`. You can set up automated actions, such as sending notifications about training issues and stopping training jobs, using Amazon CloudWatch Events and AWS Lambda. For more information, see [Action on Amazon SageMaker Debugger rules](debugger-action-on-rules.md).
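You can inspect rule evaluation statuses for a job, for example through `estimator.latest_training_job.rule_job_summary()` in the SageMaker Python SDK or the `RuleEvaluationStatuses` field of the `DescribeTrainingJob` API response. The sketch below scans summaries of that general shape for `IssueFound` statuses; the sample data is made up for illustration.

```
# Sketch: flag rules whose evaluation status is "IssueFound" in a rule job
# summary. The summary shape mirrors DescribeTrainingJob's
# RuleEvaluationStatuses entries; the sample data below is made up.
def rules_with_issues(rule_summaries):
    return [
        summary["RuleConfigurationName"]
        for summary in rule_summaries
        if summary.get("RuleEvaluationStatus") == "IssueFound"
    ]

sample_summaries = [
    {"RuleConfigurationName": "ProfilerReport", "RuleEvaluationStatus": "NoIssuesFound"},
    {"RuleConfigurationName": "LowGPUUtilization", "RuleEvaluationStatus": "IssueFound"},
]
print(rules_with_issues(sample_summaries))
```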

## Use Debugger built-in profiler rules with custom parameter values
<a name="debugger-built-in-profiler-rules-configuration-param-change"></a>

If you want to adjust the built-in rule parameter values and customize tensor collection regex, configure the `base_config` and `rule_parameters` parameters for the `ProfilerRule.sagemaker` and `Rule.sagemaker` class methods. For the `Rule.sagemaker` class method, you can also customize tensor collections through the `collections_to_save` parameter. For instructions on how to use the `CollectionConfig` class, see [Configure tensor collections using the `CollectionConfig` API](debugger-configure-tensor-collections.md).

Use the following configuration template to customize the built-in rule parameter values. By changing the rule parameters, you can adjust how sensitive each rule is to being initiated. 
+ The `base_config` argument is where you call the built-in rule method.
+ The `rule_parameters` argument adjusts the default key values of the built-in rules listed in [List of Debugger built-in profiler rules](debugger-built-in-profiler-rules.md).

For more information about the Debugger rule class, methods, and parameters, see [SageMaker AI Debugger Rule class](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

```
from sagemaker.debugger import Rule, ProfilerRule, rule_configs, CollectionConfig

rules=[
    ProfilerRule.sagemaker(
        base_config=rule_configs.BuiltInProfilerRuleName(),
        rule_parameters={
                "key": "value"
        }
    )
]
```

The parameter descriptions and value customization examples are provided for each rule at [List of Debugger built-in profiler rules](debugger-built-in-profiler-rules.md).

For a low-level JSON configuration of the Debugger built-in rules using the `CreateTrainingJob` API, see [Configure Debugger using SageMaker API](debugger-createtrainingjob-api.md).

# List of Debugger built-in profiler rules
<a name="debugger-built-in-profiler-rules"></a>

Use the Debugger built-in profiler rules provided by Amazon SageMaker Debugger and analyze metrics collected while training your models. The Debugger built-in rules monitor various common conditions that are critical for the success of running a performant training job. You can call the built-in profiler rules using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) or the low-level SageMaker API operations. There's no additional cost for using the built-in rules. For more information about billing, see the [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) page.

**Note**  
The maximum number of built-in profiler rules that you can attach to a training job is 20. SageMaker Debugger fully manages the built-in rules and analyzes your training job synchronously.

**Important**  
To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the SMDebug client library. In your IPython kernel, Jupyter notebook, or JupyterLab environment, run the following code to install the latest versions of the libraries and restart the kernel.  

```
import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)
```

## Profiler rules
<a name="debugger-built-in-profiler-rules-ProfilerRule"></a>

The following rules are the Debugger built-in rules that are callable using the `ProfilerRule.sagemaker` class method.

Debugger built-in rule for generating the profiling report


| Scope of Validity | Built-in Rules | 
| --- | --- | 
| Profiling Report for any SageMaker training job |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-profiler-rules.html)  | 

Debugger built-in rules for profiling hardware system resource utilization (system metrics)


| Scope of Validity | Built-in Rules | 
| --- | --- | 
| Generic system monitoring rules for any SageMaker training job |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-profiler-rules.html)  | 

Debugger built-in rules for profiling framework metrics


| Scope of Validity | Built-in Rules | 
| --- | --- | 
| Profiling rules for deep learning frameworks (TensorFlow and PyTorch) |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-profiler-rules.html)  | 

**Warning**  
SageMaker AI Debugger deprecates the framework profiling feature in favor of [Amazon SageMaker Profiler](train-use-sagemaker-profiler.md), starting from TensorFlow 2.11 and PyTorch 2.0. You can still use the feature with the following previous versions of the frameworks and SDKs:
+ SageMaker Python SDK <= v2.130.0
+ PyTorch >= v1.6.0, < v2.0
+ TensorFlow >= v2.3.1, < v2.11
See also [March 16, 2023](debugger-release-notes.md#debugger-release-notes-20230315).

**To use the built-in rules with default parameter values** – use the following configuration format:

```
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules = [
    ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_1()),
    ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_2()),
    ...
    ProfilerRule.sagemaker(rule_configs.BuiltInRuleName_n())
]
```

**To use the built-in rules with customized parameter values** – use the following configuration format:

```
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules = [
    ProfilerRule.sagemaker(
        base_config=rule_configs.BuiltInRuleName(),
        rule_parameters={
                "key": "value"
        }
    )
]
```

To find available keys for the `rule_parameters` parameter, see the parameter description tables.

Sample rule configuration codes are provided for each built-in rule below the parameter description tables.
+ For full instructions and examples of using the Debugger built-in rules, see [Debugger built-in rules example code](debugger-built-in-rules-example.md#debugger-deploy-built-in-rules).
+ For full instructions on using the built-in rules with the low-level SageMaker API operations, see [Configure Debugger using SageMaker API](debugger-createtrainingjob-api.md).

## ProfilerReport
<a name="profiler-report"></a>

The ProfilerReport rule invokes all of the built-in rules for monitoring and profiling. It creates a profiling report and updates it as the individual rules are triggered. You can download a comprehensive profiling report while a training job is running or after it completes. You can adjust the rule parameter values to customize the sensitivity of the built-in monitoring and profiling rules. The following example code shows the basic format for adjusting the built-in rule parameters through the ProfilerReport rule.

```
rules=[
    ProfilerRule.sagemaker(
        rule_configs.ProfilerReport(
            <BuiltInRuleName>_<parameter_name> = value
        )
    )  
]
```

If you add the ProfilerReport rule without any customized parameters, as shown in the following example code, the rule triggers all of the built-in rules for monitoring and profiling with their default parameter values.

```
rules=[ProfilerRule.sagemaker(rule_configs.ProfilerReport())]
```

The following example code shows how to specify and adjust the CPUBottleneck rule's `cpu_threshold` parameter and the IOBottleneck rule's `threshold` parameter.

```
rules=[
    ProfilerRule.sagemaker(
        rule_configs.ProfilerReport(
            CPUBottleneck_cpu_threshold = 90,
            IOBottleneck_threshold = 90
        )
    )  
]
```

To explore what's in the profiler report, see [SageMaker Debugger Profiling Report](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html). Because this rule activates all of the profiling rules, you can also check the rule analysis status using the [SageMaker Debugger UI in SageMaker Studio Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio.html).

Parameter Descriptions for the ProfilerReport Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| <BuiltInRuleName>_<parameter_name> |  Customizable parameter to adjust thresholds of other built-in monitoring and profiling rules.  **Optional** Default value: `None`  | 

## BatchSize
<a name="batch-size-rule"></a>

The BatchSize rule helps detect whether the GPU is underutilized because of a small batch size. To detect this issue, the rule monitors the average CPU utilization, GPU utilization, and GPU memory utilization. If utilization on CPU, GPU, and GPU memory is low on average, it may indicate that the training job can either run on a smaller instance type or run with a bigger batch size. This analysis does not work for frameworks that heavily overallocate memory. However, increasing the batch size can lead to processing or data loading bottlenecks, because more data preprocessing time is required in each iteration.

Parameter Descriptions for the BatchSize Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| cpu_threshold_p95 |  Defines the threshold for the 95th quantile of CPU utilization in percentage. **Optional** Valid values: Integer Default value: `70` (in percentage)  | 
| gpu_threshold_p95 |  Defines the threshold for the 95th quantile of GPU utilization in percentage. **Optional** Valid values: Integer Default value: `70` (in percentage)  | 
| gpu_memory_threshold_p95 | Defines the threshold for the 95th quantile of GPU memory utilization in percentage. **Optional** Valid values: Integer Default value: `70` (in percentage)  | 
| patience | Defines the number of data points to skip until the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so use this parameter to keep the rule from being invoked too soon. **Optional** Valid values: Integer Default value: `100`  | 
| window |  Window size for computing quantiles. **Optional** Valid values: Integer Default value: `500`  | 
| scan_interval_us |  Time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 
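To make the criterion concrete, the following standalone sketch reproduces the p95-threshold check on lists of utilization samples. This is an illustration of the logic described above, not the actual rule implementation; the `percentile` helper and function names are hypothetical.

```python
# Standalone sketch of the BatchSize criterion: flag a possibly too-small
# batch size when the 95th percentiles of CPU, GPU, and GPU memory
# utilization are all below their thresholds. Not the actual rule code.

def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list (q in 0..100)."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, round(q / 100.0 * (len(ordered) - 1)))
    return ordered[index]

def batch_size_issue(cpu, gpu, gpu_mem,
                     cpu_threshold_p95=70, gpu_threshold_p95=70,
                     gpu_memory_threshold_p95=70):
    # All three resources persistently underutilized suggests the batch
    # size (or instance type) leaves compute idle.
    return (percentile(cpu, 95) < cpu_threshold_p95
            and percentile(gpu, 95) < gpu_threshold_p95
            and percentile(gpu_mem, 95) < gpu_memory_threshold_p95)
```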

## CPUBottleneck
<a name="cpu-bottleneck"></a>

The CPUBottleneck rule helps detect whether the GPU is underutilized because of CPU bottlenecks. The rule returns `True` if the number of CPU bottlenecks exceeds a predefined threshold.

Parameter Descriptions for the CPUBottleneck Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| threshold |  Defines the threshold for the proportion of bottlenecked time to the total training time. If the proportion exceeds the percentage specified in the threshold parameter, the rule switches its status to `True`. **Optional** Valid values: Integer Default value: `50` (in percentage)  | 
| gpu_threshold |  A threshold that defines low GPU utilization. **Optional** Valid values: Integer Default value: `10` (in percentage)  | 
| cpu_threshold | A threshold that defines high CPU utilization. **Optional** Valid values: Integer Default value: `90` (in percentage)  | 
| patience | Defines the number of data points to skip until the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so use this parameter to keep the rule from being invoked too soon. **Optional** Valid values: Integer Default value: `100`  | 
| scan_interval_us | Time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 
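The following standalone sketch shows the shape of this check: a sample counts as a CPU bottleneck when CPU utilization is high while GPU utilization is low, and the rule fires when the share of bottlenecked samples exceeds `threshold` percent. This is an illustration of the logic described above, not the actual rule implementation.

```python
# Standalone sketch of the CPUBottleneck criterion. Not the actual rule
# code; samples are paired (CPU %, GPU %) utilization measurements.

def cpu_bottleneck_issue(cpu_samples, gpu_samples,
                         threshold=50, gpu_threshold=10, cpu_threshold=90):
    # A bottlenecked sample: CPU busy (above cpu_threshold) while the GPU
    # sits idle (below gpu_threshold).
    bottlenecked = sum(
        1 for cpu, gpu in zip(cpu_samples, gpu_samples)
        if cpu > cpu_threshold and gpu < gpu_threshold
    )
    return 100.0 * bottlenecked / len(cpu_samples) > threshold
```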

## GPUMemoryIncrease
<a name="gpu-memory-increase"></a>

The GPUMemoryIncrease rule helps detect a large increase in memory usage on GPUs.

Parameter Descriptions for the GPUMemoryIncrease Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| increase |  Defines the threshold for absolute memory increase. **Optional** Valid values: Integer Default value: `10` (in percentage)  | 
| patience |  Defines the number of data points to skip until the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so use this parameter to keep the rule from being invoked too soon. **Optional** Valid values: Integer Default value: `100`  | 
| window |  Window size for computing quantiles. **Optional** Valid values: Integer Default value: `500`  | 
| scan_interval_us |  Time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 
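As a minimal sketch of the criterion, assuming the `increase` threshold is compared against the jump between consecutive GPU memory utilization measurements (an assumption for illustration; the actual rule operates on profiling timelines):

```python
# Standalone sketch of the GPUMemoryIncrease criterion: fire when GPU
# memory utilization rises by more than `increase` percentage points
# between consecutive measurements. Not the actual rule implementation.

def gpu_memory_increase_issue(mem_samples, increase=10):
    return any(later - earlier > increase
               for earlier, later in zip(mem_samples, mem_samples[1:]))
```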

## IOBottleneck
<a name="io-bottleneck"></a>

This rule helps detect whether the GPU is underutilized because of data I/O bottlenecks. The rule returns `True` if the number of I/O bottlenecks exceeds a predefined threshold.

Parameter Descriptions for the IOBottleneck Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| threshold | Defines the threshold at which the rule returns `True`. **Optional** Valid values: Integer Default value: `50` (in percentage) | 
| gpu_threshold |  A threshold that defines when the GPU is considered underutilized. **Optional** Valid values: Integer Default value: `70` (in percentage)  | 
| io_threshold | A threshold that defines high I/O wait time. **Optional** Valid values: Integer Default value: `50` (in percentage) | 
| patience | Defines the number of data points to skip until the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so use this parameter to keep the rule from being invoked too soon. **Optional** Valid values: Integer Default value: `1000` | 
| scan_interval_us |  Time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 
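The shape of this check mirrors the CPUBottleneck criterion: a sample counts as an I/O bottleneck when I/O wait time is high while GPU utilization is low, and the rule fires when the share of such samples exceeds `threshold` percent. The following is an illustrative sketch, not the actual rule implementation.

```python
# Standalone sketch of the IOBottleneck criterion. Not the actual rule
# code; samples are paired (I/O wait %, GPU %) measurements.

def io_bottleneck_issue(io_wait_samples, gpu_samples,
                        threshold=50, gpu_threshold=70, io_threshold=50):
    # A bottlenecked sample: high I/O wait (above io_threshold) while the
    # GPU is underutilized (below gpu_threshold).
    bottlenecked = sum(
        1 for io, gpu in zip(io_wait_samples, gpu_samples)
        if io > io_threshold and gpu < gpu_threshold
    )
    return 100.0 * bottlenecked / len(io_wait_samples) > threshold
```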

## LoadBalancing
<a name="load-balancing"></a>

The LoadBalancing rule helps detect issues in workload balancing among multiple GPUs.

Parameter Descriptions for the LoadBalancing Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| threshold |  Defines the workload percentage. **Optional** Valid values: Float Default value: `0.5` (unitless proportion)  | 
| patience |  Defines the number of data points to skip until the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so use this parameter to keep the rule from being invoked too soon. **Optional** Valid values: Integer Default value: `10`  | 
| scan_interval_us |  Time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 

## LowGPUUtilization
<a name="low-gpu-utilization"></a>

The LowGPUUtilization rule helps detect whether GPU utilization is low or suffers from fluctuations. This is checked for each GPU on each worker. The rule returns `True` if the 95th quantile is below `threshold_p95`, which indicates underutilization. The rule also returns `True` if the 95th quantile is above `threshold_p95` and the 5th quantile is below `threshold_p5`, which indicates fluctuation.

Parameter Descriptions for the LowGPUUtilization Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| threshold_p95 |  A threshold for the 95th quantile, below which the GPU is considered underutilized. **Optional** Valid values: Integer Default value: `70` (in percentage)  | 
| threshold_p5 | A threshold for the 5th quantile. **Optional** Valid values: Integer Default value: `10` (in percentage) | 
| patience |  Defines the number of data points to skip until the rule starts evaluation. The first several steps of a training job usually show a high volume of data processing, so use this parameter to keep the rule from being invoked too soon. **Optional** Valid values: Integer Default value: `1000`  | 
| window |  Window size for computing quantiles. **Optional** Valid values: Integer Default value: `500`  | 
| scan_interval_us |  Time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 
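The two criteria described above can be sketched as a single quantile check on a window of GPU utilization samples. This is an illustration of the stated logic, not the actual rule implementation; the nearest-rank quantile calculation is an assumption.

```python
# Standalone sketch of the LowGPUUtilization criteria: "underutilized"
# when the 95th quantile is below threshold_p95, "fluctuating" when the
# 95th quantile is above threshold_p95 while the 5th quantile is below
# threshold_p5. Not the actual rule implementation.

def low_gpu_utilization_issue(samples, threshold_p95=70, threshold_p5=10):
    ordered = sorted(samples)
    last = len(ordered) - 1
    p95 = ordered[min(last, round(0.95 * last))]
    p5 = ordered[round(0.05 * last)]
    if p95 < threshold_p95:
        return "underutilized"
    if p5 < threshold_p5:
        return "fluctuating"
    return None
```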

## OverallSystemUsage
<a name="overall-system-usage"></a>

The OverallSystemUsage rule measures overall system usage per worker node. The rule currently only aggregates values per node and computes their percentiles.

Parameter Descriptions for the OverallSystemUsage Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| scan_interval_us |  Time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 

## MaxInitializationTime
<a name="max-initialization-time"></a>

The MaxInitializationTime rule helps detect if the training initialization is taking too much time. The rule waits until the first step is available.

Parameter Descriptions for the MaxInitializationTime Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| threshold |  Defines the threshold in minutes to wait for the first step to become available. **Optional** Valid values: Integer Default value: `20` (in minutes)  | 
| scan_interval_us |  Time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 

## OverallFrameworkMetrics
<a name="overall-framework-metrics"></a>

The OverallFrameworkMetrics rule summarizes the time spent on framework metrics, such as forward and backward pass, and data loading.

Parameter Descriptions for the OverallFrameworkMetrics Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| scan_interval_us |  Time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 

## StepOutlier
<a name="step-outlier"></a>

The StepOutlier rule helps detect outliers in step durations. This rule returns `True` if there are outliers with step durations larger than `stddev` sigmas of the entire step durations in a time range.

Parameter Descriptions for the StepOutlier Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 
| stddev |  Defines a factor by which to multiply the standard deviation. For example, by default the rule is invoked when a step duration is larger or smaller than 5 times the standard deviation.  **Optional** Valid values: Integer Default value: `5`  | 
| mode | The mode under which steps were saved and on which the rule runs. By default, the rule runs on steps from the `EVAL` and `TRAIN` phases. **Optional** Valid values: String | 
| n_outliers | The number of outliers to ignore before the rule returns `True`. **Optional** Valid values: Integer Default value: `10` | 
| scan_interval_us |  Time interval with which timeline files are scanned. **Optional** Valid values: Integer Default value: `60000000` (in microseconds)  | 
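The stddev-based criterion can be sketched as follows: count step durations that deviate from the mean by more than `stddev` standard deviations, and fire once more than `n_outliers` such steps are seen. This is an illustrative sketch of the described logic, not the actual rule implementation.

```python
# Standalone sketch of the StepOutlier criterion. Not the actual rule
# code; durations are per-step wall-clock times in seconds.
import statistics

def step_outlier_issue(durations, stddev=5, n_outliers=10):
    mean = statistics.mean(durations)
    sd = statistics.pstdev(durations)
    if sd == 0:
        # Perfectly uniform step durations: no outliers by definition.
        return False
    outliers = sum(1 for d in durations if abs(d - mean) > stddev * sd)
    return outliers > n_outliers
```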

# Amazon SageMaker Debugger UI in Amazon SageMaker Studio Classic Experiments
<a name="debugger-on-studio"></a>

Use the Amazon SageMaker Debugger Insights dashboard in Amazon SageMaker Studio Classic Experiments to analyze your model performance and system bottlenecks while running training jobs on Amazon Elastic Compute Cloud (Amazon EC2) instances. Gain insights into your training jobs and improve your model training performance and accuracy with the Debugger dashboards. By default, Debugger monitors system metrics (CPU, GPU, GPU memory, network, and data I/O) every 500 milliseconds and basic output tensors (loss and accuracy) every 500 iterations for training jobs. You can also further customize Debugger configuration parameter values and adjust the saving intervals through the Studio Classic UI or using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). 

**Important**  
If you're using an existing Studio Classic app, delete the app and restart to use the latest Studio Classic features. For instructions on how to restart and update your Studio Classic environment, see [Update Amazon SageMaker AI Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-tasks-update.html). 

**Topics**
+ [Open the Amazon SageMaker Debugger Insights dashboard](debugger-on-studio-insights.md)
+ [Amazon SageMaker Debugger Insights dashboard controller](debugger-on-studio-insights-controllers.md)
+ [Explore the Amazon SageMaker Debugger Insights dashboard](debugger-on-studio-insights-walkthrough.md)
+ [Shut down the Amazon SageMaker Debugger Insights instance](debugger-on-studio-insights-close.md)

# Open the Amazon SageMaker Debugger Insights dashboard
<a name="debugger-on-studio-insights"></a>

In the SageMaker Debugger Insights dashboard in Studio Classic, you can see the compute resource utilization and system bottleneck information of your training job running on Amazon EC2 instances, both in real time and after training completes.

**Note**  
The SageMaker Debugger Insights dashboard runs a Studio Classic application on an `ml.m5.4xlarge` instance to process and render the visualizations. Each SageMaker Debugger Insights tab runs one Studio Classic kernel session. Multiple kernel sessions for multiple SageMaker Debugger Insights tabs run on the single instance. When you close a SageMaker Debugger Insights tab, the corresponding kernel session is also closed. The Studio Classic application remains active and accrues charges for the `ml.m5.4xlarge` instance usage. For information about pricing, see the [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) page.

**Important**  
When you are done using the SageMaker Debugger Insights dashboard, you must shut down the `ml.m5.4xlarge` instance to avoid accruing charges. For instructions on how to shut down the instance, see [Shut down the Amazon SageMaker Debugger Insights instance](debugger-on-studio-insights-close.md).

**To open the SageMaker Debugger Insights dashboard**

1. On the Studio Classic **Home** page, choose **Experiments** in the left navigation pane.

1. Search your training job in the **Experiments** page. If your training job is set up with an Experiments run, the job should appear in the **Experiments** tab; if you didn't set up an Experiments run, the job should appear in the **Unassigned runs** tab.

1. Choose (click) the link of the training job name to see the job details.

1. Under the **OVERVIEW** menu, choose **Debugger**. This shows the following two sections.
   + In the **Debugger rules** section, you can browse the status of the Debugger built-in rules associated with the training job.
   + In the **Debugger insights** section, you can find links to open SageMaker Debugger Insights on the dashboard.

1. In the **SageMaker Debugger Insights** section, choose the link of the training job name to open the SageMaker Debugger Insights dashboard. This opens a **Debug [your-training-job-name]** window. In this window, Debugger provides an overview of the computational performance of your training job on Amazon EC2 instances and helps you identify issues in compute resource utilization.

You can also download an aggregated profiling report by adding the built-in [ProfilerReport](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#profiler-report) rule of SageMaker Debugger. For more information, see [Configure Built-in Profiler Rules](https://docs.aws.amazon.com/sagemaker/latest/dg/use-debugger-built-in-profiler-rules.html) and [Profiling Report Generated Using SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html).

# Amazon SageMaker Debugger Insights dashboard controller
<a name="debugger-on-studio-insights-controllers"></a>

There are different components of the Debugger controller for monitoring and profiling. In this guide, you learn about the Debugger controller components.

**Note**  
The SageMaker Debugger Insights dashboard runs a Studio Classic app on an `ml.m5.4xlarge` instance to process and render the visualizations. Each SageMaker Debugger Insights tab runs one Studio Classic kernel session. Multiple kernel sessions for multiple SageMaker Debugger Insights tabs run on the single instance. When you close a SageMaker Debugger Insights tab, the corresponding kernel session is also closed. The Studio Classic app remains active and accrues charges for the `ml.m5.4xlarge` instance usage. For information about pricing, see the [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) page.

**Important**  
When you are done using the SageMaker Debugger Insights dashboard, shut down the `ml.m5.4xlarge` instance to avoid accruing charges. For instructions on how to shut down the instance, see [Shut down the Amazon SageMaker Debugger Insights instance](debugger-on-studio-insights-close.md).

## SageMaker Debugger Insights controller UI
<a name="debugger-on-studio-insights-controller"></a>

Using the Debugger controller located at the upper-left corner of the Insights dashboard, you can refresh the dashboard, configure or update Debugger settings for monitoring system metrics, stop a training job, and download a Debugger profiling report.

![\[SageMaker Debugger Insights Dashboard Controllers\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-studio-insights-refresh.png)

+ If you want to manually refresh the dashboard, choose the refresh button (the round arrow at the upper-left corner) as shown in the preceding screenshot. 
+ The **Monitoring** toggle button is on by default for any SageMaker training job initiated using the SageMaker Python SDK. If not activated, you can use the toggle button to start monitoring. During monitoring, Debugger only collects resource utilization metrics to detect computational problems such as CPU bottlenecks and GPU underutilization. For a complete list of resource utilization problems that Debugger monitors, see [Debugger built-in rules for profiling hardware system resource utilization (system metrics)](debugger-built-in-profiler-rules.md#built-in-rules-monitoring).
+ The **Configure monitoring** button opens a pop-up window that you can use to set or update the data collection frequency and the S3 path to save the data.   
![\[The pop-up window for configuring Debugger monitoring settings\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-studio-enable-profiling-2.png)

  You can specify values for the following fields.
  + **S3 bucket URI**: Specify the base S3 bucket URI.
  + **Collect monitoring data every**: Select a time interval to collect system metrics. You can choose one of the monitoring intervals from the dropdown list. Available intervals are 100 milliseconds, 200 milliseconds, 500 milliseconds (default), 1 second, 5 seconds, and 1 minute. 
**Note**  
If you choose one of the lower time intervals, you increase the granularity of resource utilization metrics, so you can capture spikes and anomalies with a higher time resolution. However, the higher the resolution, the larger the volume of system metrics to process. This might introduce additional overhead and impact the overall training and processing time.
+ Using the **Stop training** button, you can stop the training job when you find anomalies in resource utilization.
+ Using the **Download report** button, you can download an aggregated profiling report by using the built-in [ProfilerReport](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#profiler-report) rule of SageMaker Debugger. The button is activated when you add the built-in [ProfilerReport](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#profiler-report) rule to the estimator. For more information, see [Configure Built-in Profiler Rules](https://docs.aws.amazon.com/sagemaker/latest/dg/use-debugger-built-in-profiler-rules.html) and [Profiling Report Generated Using SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html).
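The monitoring-interval tradeoff noted above can be quantified with simple arithmetic. The following sketch computes how many system-metric samples each available interval yields per hour of training (per metric and per node; the function name is for illustration):

```python
# Samples collected per hour of training at each monitoring interval.
# Intervals are expressed in seconds (500 ms = 0.5 s).

def samples_per_hour(interval_seconds):
    return round(3600 / interval_seconds)

intervals = {"100 ms": 0.1, "200 ms": 0.2, "500 ms": 0.5,
             "1 s": 1, "5 s": 5, "1 min": 60}
counts = {name: samples_per_hour(s) for name, s in intervals.items()}
# The 500 ms default yields 7200 samples per hour; dropping to 100 ms
# quintuples that to 36000, which is the overhead the note warns about.
```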

# Explore the Amazon SageMaker Debugger Insights dashboard
<a name="debugger-on-studio-insights-walkthrough"></a>

When you initiate a SageMaker training job, SageMaker Debugger starts monitoring the resource utilization of the Amazon EC2 instances by default. You can track the system utilization rates, statistics overview, and built-in rule analysis through the Insights dashboard. This guide walks you through the content of the SageMaker Debugger Insights dashboard under the following tabs: **System Metrics** and **Rules**. 

**Note**  
The SageMaker Debugger Insights dashboard runs a Studio Classic application on an `ml.m5.4xlarge` instance to process and render the visualizations. Each SageMaker Debugger Insights tab runs one Studio Classic kernel session. Multiple kernel sessions for multiple SageMaker Debugger Insights tabs run on the single instance. When you close a SageMaker Debugger Insights tab, the corresponding kernel session is also closed. The Studio Classic application remains active and accrues charges for the `ml.m5.4xlarge` instance usage. For information about pricing, see the [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) page.

**Important**  
When you are done using the SageMaker Debugger Insights dashboard, shut down the `ml.m5.4xlarge` instance to avoid accruing charges. For instructions on how to shut down the instance, see [Shut down the Amazon SageMaker Debugger Insights instance](debugger-on-studio-insights-close.md).

**Important**  
In the reports, plots and recommendations are provided for informational purposes and are not definitive. You are responsible for making your own independent assessment of the information.

**Topics**
+ [System metrics](#debugger-insights-system-metrics-tab)
+ [Rules](#debugger-on-studio-insights-rules)

## System metrics
<a name="debugger-insights-system-metrics-tab"></a>

In the **System Metrics** tab, you can use the summary table and timeseries plots to understand resource utilization.

### Resource utilization summary
<a name="debugger-on-studio-insights-sys-resource-summary"></a>

This summary table shows the statistics of compute resource utilization metrics of all nodes (denoted as algo-*n*). The resource utilization metrics include the total CPU utilization, the total GPU utilization, the total CPU memory utilization, the total GPU memory utilization, the total I/O wait time, and the total network in bytes. The table shows the minimum and the maximum values, and p99, p90, and p50 percentiles.

![\[A summary table of resource utilization\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-studio-insights-resource-util-summary.png)


### Resource utilization time series plots
<a name="debugger-on-studio-insights-sys-controller"></a>

Use the time series graphs to see more details of resource utilization and to identify the time intervals in which an instance shows an undesired utilization pattern, such as low GPU utilization or CPU bottlenecks, which can waste expensive instance time.

**The time series graph controller UI**

The following screenshot shows the UI controller for adjusting the time series graphs.

![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-insights-graph-controller.png)

+ **algo-1**: Use this dropdown menu to choose the node that you want to look into.
+ **Zoom In**: Use this button to zoom in the time series graphs and view shorter time intervals.
+ **Zoom Out**: Use this button to zoom out the time series graphs and view wider time intervals.
+ **Pan Left**: Move the time series graphs to an earlier time interval.
+ **Pan Right**: Move the time series graphs to a later time interval.
+ **Fix Timeframe**: Use this check box to fix the time series graphs or bring them back to the whole view, from the first data point to the last data point.

**CPU utilization and I/O wait time**

The first two graphs show CPU utilization and I/O wait time over time. By default, the graphs show the average CPU utilization rate and I/O wait time across the CPU cores. You can select one or more CPU cores by selecting their labels to graph them on a single chart and compare utilization across cores. You can drag and zoom in and out to take a closer look at specific time intervals.

![\[debugger-studio-insight-mockup\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-insights-node-cpu.png)


**GPU utilization and GPU memory utilization**

The following graphs show GPU utilization and GPU memory utilization over time. By default, the graphs show the mean utilization rate over time. You can select the GPU core labels to see the utilization rate of each core. Taking the mean of the utilization rate over the total number of GPU cores shows the mean utilization of the entire hardware system resource. By looking at the mean utilization rate, you can check the overall system resource usage of an Amazon EC2 instance. The following figure shows an example training job on an `ml.p3.16xlarge` instance with 8 GPU cores. You can monitor whether the training job is well distributed, fully utilizing all GPUs.

![\[debugger-studio-insight-mockup\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-studio-insights-node-gpu.gif)


**Overall system utilization over time**

The following heatmap shows an example of the entire system utilization of an `ml.p3.16xlarge` instance over time, projected onto the two-dimensional plot. Every CPU and GPU core is listed in the vertical axis, and the utilization is recorded over time with a color scheme, where the bright colors represent low utilization and the darker colors represent high utilization. See the labeled color bar on the right side of the plot to find out which color level corresponds to which utilization rate.

![\[debugger-studio-insight-mockup\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-studio-insights-node-heatmap.png)


## Rules
<a name="debugger-on-studio-insights-rules"></a>

Use the **Rules** tab to find a summary of the profiling rule analysis of your training job. If a profiling rule is activated for the training job, its name appears in solid white text. Inactive rules are dimmed in gray text. To activate these rules, follow the instructions at [Use built-in profiler rules managed by Amazon SageMaker Debugger](use-debugger-built-in-profiler-rules.md).

![\[The Rules tab in the SageMaker Debugger Insights dashboard\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-insights-rules.png)


# Shut down the Amazon SageMaker Debugger Insights instance
<a name="debugger-on-studio-insights-close"></a>

When you are not using the SageMaker Debugger Insights dashboard, you should shut down the app instance to avoid incurring additional fees.

**To shut down the SageMaker Debugger Insights app instance in Studio Classic**

![\[An animated screenshot that shows how to shut down a SageMaker Debugger Insights dashboard instance.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-studio-insights-shut-down.png)


1. In Studio Classic, select the **Running Instances and Kernels** icon (![\[Square icon with a white outline of a cloud on a dark blue background.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Running_squid.png)). 

1. Under the **RUNNING APPS** list, look for the **sagemaker-debugger-1.0** app. Select the shutdown icon (![\[Power button icon with a circular shape and vertical line symbol.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Shutdown_light.png)) next to the app. The SageMaker Debugger Insights dashboards run on an `ml.m5.4xlarge` instance. This instance also disappears from the **RUNNING INSTANCES** list when you shut down the **sagemaker-debugger-1.0** app.

# SageMaker Debugger interactive report
<a name="debugger-profiling-report"></a>

Receive profiling reports autogenerated by Debugger. The Debugger report provides insights into your training jobs and suggests recommendations to improve your model performance. The following screenshot shows a collage of the Debugger profiling report.

**Note**  
You can download a Debugger report while your training job is running or after the job has finished. During training, Debugger concurrently updates the report to reflect the current rules' evaluation status. You can download a complete Debugger report only after the training job has completed.

**Important**  
In the reports, plots and recommendations are provided for informational purposes and are not definitive. You are responsible for making your own independent assessment of the information.

![\[An example of a Debugger training job summary report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profile-report.jpg)


For any SageMaker training job, the SageMaker Debugger [ProfilerReport](debugger-built-in-profiler-rules.md#profiler-report) rule invokes all of the [monitoring and profiling rules](debugger-built-in-profiler-rules.md#built-in-rules-monitoring) and aggregates the rule analysis into a comprehensive report. Follow this guide to download the report using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) or the S3 console, and learn what you can interpret from the profiling results.

**Important**  
In the report, plots and recommendations are provided for informational purposes and are not definitive. You are responsible for making your own independent assessment of the information.

# Download the SageMaker Debugger profiling report
<a name="debugger-profiling-report-download"></a>

Download the SageMaker Debugger profiling report while your training job is running or after the job has finished using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) and AWS Command Line Interface (CLI).

**Note**  
To get the profiling report generated by SageMaker Debugger, you must use the built-in [ProfilerReport](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html#profiler-report) rule offered by SageMaker Debugger. To activate the rule with your training job, see [Configure Built-in Profiler Rules](https://docs.aws.amazon.com/sagemaker/latest/dg/use-debugger-built-in-profiler-rules.html).

**Tip**  
You can also download the report with a single click in the SageMaker Studio Debugger insights dashboard. This doesn't require any additional scripting to download the report. To find out how to download the report from Studio, see [Open the Amazon SageMaker Debugger Insights dashboard](debugger-on-studio-insights.md).

------
#### [ Download using SageMaker Python SDK and AWS CLI ]

1. Check the current job's default S3 output base URI.

   ```
   estimator.output_path
   ```

1. Check the current job name.

   ```
   estimator.latest_training_job.job_name
   ```

1. The Debugger profiling report is stored under `<default-s3-output-base-uri>/<training-job-name>/rule-output`. Configure the rule output path as follows:

   ```
   rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
   ```

1. To check whether the report is generated, list the directories and files recursively under the `rule_output_path` using `aws s3 ls` with the `--recursive` option.

   ```
   ! aws s3 ls {rule_output_path} --recursive
   ```

   This should return a complete list of files under an autogenerated folder named `ProfilerReport-1234567890`. The folder name combines the string `ProfilerReport` with a unique 10-digit tag based on the Unix timestamp at which the ProfilerReport rule was initiated.   
![\[An example of rule output\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-rule-output-ls-example.png)

   The `profiler-report.html` file is the profiling report autogenerated by Debugger. The remaining files are the built-in rule analysis components, stored in JSON, and a Jupyter notebook that is used to aggregate them into the report.

1. Download the files recursively using `aws s3 cp`. The following command saves all of the rule output files to the `ProfilerReport-1234567890` folder under the current working directory.

   ```
   ! aws s3 cp {rule_output_path} ./ --recursive
   ```
**Tip**  
If using a Jupyter notebook server, run `!pwd` to double-check the current working directory.

1. Under the `/ProfilerReport-1234567890/profiler-output` directory, open `profiler-report.html`. If using JupyterLab, choose **Trust HTML** to see the autogenerated Debugger profiling report.  
![\[An example of rule output\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-rule-output-open-html.png)

1. Open the `profiler-report.ipynb` file to explore how the report is generated. You can also customize and extend the profiling report using the Jupyter notebook file.

------
#### [ Download using Amazon S3 Console ]

1. Sign in to the AWS Management Console and open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

1. Search for the base S3 bucket. For example, if you haven't specified any base job name, the base S3 bucket name should be in the following format: `sagemaker-<region>-111122223333`. Look up the base S3 bucket through the *Find bucket by name* field.  
![\[An example to the rule output S3 bucket URI\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-0.png)

1. In the base S3 bucket, look up the training job name by specifying your job name prefix into the *Find objects by prefix* input field. Choose the training job name.  
![\[An example to the rule output S3 bucket URI\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-1.png)

1. In the training job's S3 bucket, there are three subfolders for training data collected by Debugger: **debug-output/**, **profiler-output/**, and **rule-output/**. Choose **rule-output/**.   
![\[An example to the rule output S3 bucket URI\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-2.png)

1. In the **rule-output/** folder, choose **ProfilerReport-1234567890**, and then choose the **profiler-output/** folder. The **profiler-output/** folder contains **profiler-report.html** (the autogenerated profiling report in HTML), **profiler-report.ipynb** (a Jupyter notebook with the scripts used to generate the report), and a **profiler-report/** folder (which contains the rule analysis JSON files used as components of the report).

1. Select the **profiler-report.html** file, choose **Actions**, and **Download**.  
![\[An example to the rule output S3 bucket URI\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-3.png)

1. Open the downloaded **profiler-report.html** file in a web browser.

------

**Note**  
If you started your training job without configuring the Debugger-specific parameters, Debugger generates the report based only on the system monitoring rules because the Debugger parameters are not configured to save framework metrics. To enable framework metrics profiling and receive an extended Debugger profiling report, configure the `profiler_config` parameter when constructing or updating SageMaker AI estimators.  
To learn how to configure the `profiler_config` parameter before starting a training job, see [Estimator configuration for framework profiling](debugger-configure-framework-profiling.md).  
To update the current training job and enable framework metrics profiling, see [Update Debugger Framework Profiling Configuration](debugger-update-monitoring-profiling.md).
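
For example, a minimal sketch of enabling framework profiling with the SageMaker Python SDK might look like the following. The image URI, role, and step range are illustrative placeholders, and the remaining estimator arguments are elided:

```python
from sagemaker.debugger import ProfilerConfig, FrameworkProfile
from sagemaker.estimator import Estimator

# Collect system metrics every 500 ms, and framework metrics for a
# 10-step window starting at step 5 (the window here is illustrative).
profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    framework_profile_params=FrameworkProfile(start_step=5, num_steps=10),
)

estimator = Estimator(
    image_uri="<your-training-image-uri>",  # placeholder
    role="<your-iam-role-arn>",             # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    profiler_config=profiler_config,
)
```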

# Debugger profiling report walkthrough
<a name="debugger-profiling-report-walkthrough"></a>

This section walks you through the Debugger profiling report section by section. The profiling report is generated based on the built-in rules for monitoring and profiling. The report shows result plots only for the rules that found issues.

**Important**  
In the report, plots and recommendations are provided for informational purposes and are not definitive. You are responsible for making your own independent assessment of the information.

**Topics**
+ [Training job summary](#debugger-profiling-report-walkthrough-summary)
+ [System usage statistics](#debugger-profiling-report-walkthrough-system-usage)
+ [Framework metrics summary](#debugger-profiling-report-walkthrough-framework-metrics)
+ [Rules summary](#debugger-profiling-report-walkthrough-rules-summary)
+ [Analyzing the training loop – step durations](#debugger-profiling-report-walkthrough-step-durations)
+ [GPU utilization analysis](#debugger-profiling-report-walkthrough-gpu-utilization)
+ [Batch size](#debugger-profiling-report-walkthrough-batch-size)
+ [CPU bottlenecks](#debugger-profiling-report-walkthrough-cpu-bottlenecks)
+ [I/O bottlenecks](#debugger-profiling-report-walkthrough-io-bottlenecks)
+ [Load balancing in multi-GPU training](#debugger-profiling-report-walkthrough-workload-balancing)
+ [GPU memory analysis](#debugger-profiling-report-walkthrough-gpu-memory)

## Training job summary
<a name="debugger-profiling-report-walkthrough-summary"></a>

At the beginning of the report, Debugger provides a summary of your training job. In this section, you can review the durations and timestamps of the different training phases.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-summary.gif)


The summary table contains the following information:
+ **start_time** – The exact time when the training job started.
+ **end_time** – The exact time when the training job finished.
+ **job_duration_in_seconds** – The total training time from the **start_time** to the **end_time**.
+ **training_loop_start** – The exact time when the first step of the first epoch started.
+ **training_loop_end** – The exact time when the last step of the last epoch finished.
+ **training_loop_duration_in_seconds** – The total time between the training loop start time and the training loop end time.
+ **initialization_in_seconds** – Time spent initializing the training job. The initialization phase covers the period from the **start_time** to the **training_loop_start** time. Initialization time is spent on compiling the training script, starting the training script, creating and initializing the model, launching EC2 instances, and downloading training data.
+ **finalization_in_seconds** – Time spent finalizing the training job, such as finishing the model training, updating the model artifacts, and shutting down the EC2 instances. The finalization phase covers the period from the **training_loop_end** time to the **end_time**.
+ **initialization (%)** – The percentage of time spent on **initialization** over the total **job_duration_in_seconds**.
+ **training loop (%)** – The percentage of time spent on the **training loop** over the total **job_duration_in_seconds**.
+ **finalization (%)** – The percentage of time spent on **finalization** over the total **job_duration_in_seconds**.
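
The percentage rows above are simple ratios of phase durations to the total job duration. The following sketch computes them from hypothetical timestamps (all values are illustrative):

```python
# Hypothetical timestamps, in seconds since the job started (illustrative).
start_time = 0.0
training_loop_start = 130.0
training_loop_end = 1030.0
end_time = 1060.0

job_duration = end_time - start_time                     # total job time
initialization = training_loop_start - start_time        # before the loop
training_loop = training_loop_end - training_loop_start  # inside the loop
finalization = end_time - training_loop_end              # after the loop

init_pct = 100 * initialization / job_duration
loop_pct = 100 * training_loop / job_duration
final_pct = 100 * finalization / job_duration
```

The three percentages always sum to 100; a large initialization share often points to slow instance startup or data download.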

## System usage statistics
<a name="debugger-profiling-report-walkthrough-system-usage"></a>

In this section, you can see an overview of system utilization statistics.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-system-usage.png)


The Debugger profiling report includes the following information:
+ **node** – Lists the names of the nodes. If using distributed training on multiple nodes (multiple EC2 instances), the node names are in the format `algo-n`.
+ **metric** – The system metrics collected by Debugger: CPU, GPU, CPU memory, GPU memory, I/O, and Network metrics.
+ **unit** – The unit of the system metrics.
+ **max** – The maximum value of each system metric.
+ **p99** – The 99th percentile of each system utilization.
+ **p95** – The 95th percentile of each system utilization.
+ **p50** – The 50th percentile (median) of each system utilization.
+ **min** – The minimum value of each system metric.
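
The values in the table are order statistics over the utilization samples collected during the job. As a sketch of how they relate, using NumPy and made-up samples (not how Debugger computes them internally):

```python
import numpy as np

# Hypothetical per-timestep GPU utilization samples, in percent.
gpu_util = np.array([88, 91, 87, 95, 15, 90, 89, 92, 93, 94], dtype=float)

stats = {
    "max": gpu_util.max(),
    "p99": np.percentile(gpu_util, 99),
    "p95": np.percentile(gpu_util, 95),
    "p50": np.percentile(gpu_util, 50),
    "min": gpu_util.min(),
}
```

Note how a single idle sample (15%) barely moves the median but shows up immediately in the minimum, which is why the report lists both.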

## Framework metrics summary
<a name="debugger-profiling-report-walkthrough-framework-metrics"></a>

In this section, the following pie charts show the breakdown of framework operations on CPUs and GPUs.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-framework-metrics-summary.gif)


Each of the pie charts analyzes the collected framework metrics in various aspects as follows:
+ **Ratio between TRAIN/EVAL phase and others** – Shows the ratio between time durations spent on different training phases.
+ **Ratio between forward and backward pass** – Shows the ratio between time durations spent on forward and backward pass in the training loop.
+ **Ratio between CPU/GPU operators** – Shows the ratio between time spent on operators running on CPU or GPU, such as convolutional operators.
+ **General metrics recorded in framework** – Shows the ratio between time spent on major framework metrics, such as data loading, forward and backward pass.

### Overview: CPU Operators
<a name="debugger-profiling-report-walkthrough-cpu-operators"></a>

This section provides detailed information about the CPU operators. The table shows the percentage of time and the absolute cumulative time spent on the most frequently called CPU operators.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-framework-cpu-operators.gif)


### Overview: GPU operators
<a name="debugger-profiling-report-walkthrough-gpu-operators"></a>

This section provides detailed information about the GPU operators. The table shows the percentage of time and the absolute cumulative time spent on the most frequently called GPU operators.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-framework-gpu-operators.gif)


## Rules summary
<a name="debugger-profiling-report-walkthrough-rules-summary"></a>

In this section, Debugger aggregates all of the rule evaluation results, analysis, rule descriptions, and suggestions.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-rules-summary.png)


## Analyzing the training loop – step durations
<a name="debugger-profiling-report-walkthrough-step-durations"></a>

In this section, you can find detailed statistics of step durations on each GPU core of each node. Debugger evaluates the mean, maximum, p99, p95, p50, and minimum values of step durations, and evaluates step outliers. The following histogram shows the step durations captured on different worker nodes and GPUs. You can enable or disable the histogram of each worker by choosing the legends on the right side. You can check whether a particular GPU is causing step duration outliers.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-framework-step-duration.gif)
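
As an illustration of step-duration outliers, the following sketch flags steps that take far longer than the typical step. The 3x-median criterion and the durations are illustrative, not the rule's actual formula:

```python
# Hypothetical per-step durations (seconds) for one GPU worker.
step_durations = [0.10, 0.11, 0.10, 0.12, 0.55, 0.10, 0.11]

ordered = sorted(step_durations)
median = ordered[len(ordered) // 2]  # middle element of the odd-length list
outliers = [d for d in step_durations if d > 3 * median]
```

A single slow step like this often corresponds to a data-loading stall or a synchronization barrier on one worker.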


## GPU utilization analysis
<a name="debugger-profiling-report-walkthrough-gpu-utilization"></a>

This section shows detailed statistics about GPU core utilization based on the LowGPUUtilization rule. It also summarizes the GPU utilization statistics (mean, p95, and p5) to determine whether the training job is underutilizing GPUs.

## Batch size
<a name="debugger-profiling-report-walkthrough-batch-size"></a>

This section shows the detailed statistics of total CPU utilization, individual GPU utilizations, and GPU memory footprints. The BatchSize rule determines whether you need to change the batch size to better utilize the GPUs. You can check whether the batch size is too small, resulting in underutilization, or too large, causing overutilization and out-of-memory issues. In the plot, the boxes show the p25 to p75 percentile ranges (filled with dark purple and bright yellow, respectively) around the median (p50), and the error bars show the 5th percentile for the lower bound and the 95th percentile for the upper bound.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-batch-size.png)


## CPU bottlenecks
<a name="debugger-profiling-report-walkthrough-cpu-bottlenecks"></a>

In this section, you can drill down into the CPU bottlenecks that the CPUBottleneck rule detected from your training job. The rule checks if the CPU utilization is above `cpu_threshold` (90% by default) and also if the GPU utilization is below `gpu_threshold` (10% by default).

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-cpu-bottlenecks.png)


The pie charts show the following information:
+ **Low GPU usage caused by CPU bottlenecks** – Shows the ratio of data points with GPU utilization above and below the threshold, and, of those below, how many match the CPU bottleneck criteria.
+ **Ratio between TRAIN/EVAL phase and others** – Shows the ratio between time durations spent on different training phases.
+ **Ratio between forward and backward pass** – Shows the ratio between time durations spent on forward and backward pass in the training loop.
+ **Ratio between CPU/GPU operators** – Shows the ratio between time durations spent on GPUs and CPUs by Python operators, such as data loader processes and forward and backward pass operators.
+ **General metrics recorded in framework** – Shows major framework metrics and the ratio between time durations spent on the metrics.
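
The bottleneck criterion described above can be sketched as a simple filter over aligned CPU and GPU utilization samples. The sample values are made up; the thresholds are the rule defaults:

```python
cpu_threshold = 90.0  # rule default: CPU utilization above this, and
gpu_threshold = 10.0  # GPU utilization below this, counts as a bottleneck

# Hypothetical aligned (CPU %, GPU %) utilization samples.
samples = [(95.0, 5.0), (50.0, 85.0), (92.0, 8.0), (30.0, 90.0)]

bottlenecks = [
    (cpu, gpu) for cpu, gpu in samples
    if cpu > cpu_threshold and gpu < gpu_threshold
]
```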

## I/O bottlenecks
<a name="debugger-profiling-report-walkthrough-io-bottlenecks"></a>

In this section, you can find a summary of I/O bottlenecks. The rule evaluates the I/O wait time and GPU utilization rates, and monitors whether the time spent on I/O requests exceeds a threshold percentage of the total training time. Exceeding the threshold might indicate I/O bottlenecks where GPUs are waiting for data to arrive from storage.

## Load balancing in multi-GPU training
<a name="debugger-profiling-report-walkthrough-workload-balancing"></a>

In this section, you can identify workload balancing issues across GPUs.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-workload-balancing.gif)


## GPU memory analysis
<a name="debugger-profiling-report-walkthrough-gpu-memory"></a>

In this section, you can analyze the GPU memory utilization collected by the GPUMemoryIncrease rule. In the plot, the boxes show the p25 to p75 percentile ranges (filled with dark purple and bright yellow, respectively) around the median (p50), and the error bars show the 5th percentile for the lower bound and the 95th percentile for the upper bound.

![\[An example of Debugger profiling report\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-profiling-report-gpu-memory-utilization.png)


# Opt out of the collection of Amazon SageMaker Debugger usage statistics
<a name="debugger-telemetry"></a>

For all SageMaker training jobs, Amazon SageMaker Debugger runs the [ProfilerReport](debugger-built-in-profiler-rules.md#profiler-report) rule and autogenerates a [SageMaker Debugger interactive report](debugger-profiling-report.md). The `ProfilerReport` rule provides a Jupyter notebook file (`profiler-report.ipynb`) that generates a corresponding HTML file (`profiler-report.html`). 

Debugger collects profiling report usage statistics by including code in the Jupyter notebook that collects the unique `ProfilerReport` rule's processing job ARN if the user opens the final `profiler-report.html` file.

Debugger only collects information about whether a user opens the final HTML report. It **DOES NOT** collect any information from training jobs, training data, training scripts, processing jobs, logs, or the content of the profiling report itself.

You can opt out of the collection of usage statistics using one of the following options.

## (Recommended) Option 1: Opt out before running a training job
<a name="debugger-telemetry-profiler-report-opt-out-1"></a>

To opt out, add the following Debugger `ProfilerReport` rule configuration to your training job request.

------
#### [ SageMaker Python SDK ]

```
import sagemaker
from sagemaker.debugger import ProfilerRule, rule_configs

estimator=sagemaker.estimator.Estimator(
    ...

    rules=[
        ProfilerRule.sagemaker(
            base_config=rule_configs.ProfilerReport(),
            rule_parameters={"opt_out_telemetry": "True"}
        )
    ]
)
```

------
#### [ AWS CLI ]

```
"ProfilerRuleConfigurations": [ 
    { 
        "RuleConfigurationName": "ProfilerReport-1234567890",
        "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "ProfilerReport", 
            "opt_out_telemetry": "True"
        }
    }
]
```

------
#### [ AWS SDK for Python (Boto3) ]

```
ProfilerRuleConfigurations=[ 
    {
        'RuleConfigurationName': 'ProfilerReport-1234567890',
        'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
        'RuleParameters': {
            'rule_to_invoke': 'ProfilerReport',
            'opt_out_telemetry': 'True'
        }
    }
]
```

------

## Option 2: Opt out after a training job has completed
<a name="debugger-telemetry-profiler-report-opt-out-2"></a>

To opt out after training has completed, you need to modify the `profiler-report.ipynb` file. 

**Note**  
HTML reports autogenerated without **Option 1** already added to your training job request still report the usage statistics even after you opt out using **Option 2**.

1. Follow the instructions on downloading the Debugger profiling report files in the [Download the SageMaker Debugger profiling report](debugger-profiling-report-download.md) page.

1. In the `/ProfilerReport-1234567890/profiler-output` directory, open `profiler-report.ipynb`. 

1. Add **opt_out=True** to the `setup_profiler_report()` function call in the fifth code cell, as shown in the following example code:

   ```
   setup_profiler_report(processing_job_arn, opt_out=True)
   ```

1. Run the code cell to finish opting out.

# Analyze data using the Debugger Python client library
<a name="debugger-analyze-data"></a>

While your training job is running or after it has completed, you can access the training data collected by Debugger using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) and the [SMDebug client library](https://github.com/awslabs/sagemaker-debugger/). The Debugger Python client library provides analysis and visualization tools that enable you to drill down into your training job data.

**To install the library and use its analysis tools (in a JupyterLab notebook or an IPython kernel)**

```
! pip install -U smdebug
```

The following topics walk you through how to use the Debugger Python tools to visualize and analyze the training data collected by Debugger.

**Analyze system and framework metrics**
+ [Access the profile data](debugger-analyze-data-profiling.md)
+ [Plot the system metrics and framework metrics data](debugger-access-data-profiling-default-plot.md)
+ [Access the profiling data using the pandas data parsing tool](debugger-access-data-profiling-pandas-frame.md)
+ [Access the Python profiling stats data](debugger-access-data-python-profiling.md)
+ [Merge timelines of multiple profile trace files](debugger-merge-timeline.md)
+ [Profiling data loaders](debugger-data-loading-time.md)

# Access the profile data
<a name="debugger-analyze-data-profiling"></a>

The SMDebug `TrainingJob` class reads data from the S3 bucket where the system and framework metrics are saved. 

**To set up a `TrainingJob` object and retrieve profiling event files of a training job**

```
from smdebug.profiler.analysis.notebook_utils.training_job import TrainingJob
tj = TrainingJob(training_job_name, region)
```

**Tip**  
You need to specify the `training_job_name` and `region` parameters to connect to a training job. There are two ways to specify the training job information:   
Use the SageMaker Python SDK while the estimator is still attached to the training job.  

  ```
  import sagemaker
  training_job_name=estimator.latest_training_job.job_name
  region=sagemaker.Session().boto_region_name
  ```
Pass strings directly.  

  ```
  training_job_name="your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS"
  region="us-west-2"
  ```

**Note**  
By default, SageMaker Debugger collects system metrics to monitor hardware resource utilization and system bottlenecks. If you run the following functions, you might receive error messages regarding the unavailability of framework metrics. To retrieve framework profiling data and gain insights into framework operations, you must enable framework profiling.  
If you use the SageMaker Python SDK to manipulate your training job request, pass `framework_profile_params` to the `profiler_config` argument of your estimator. To learn more, see [Configure SageMaker Debugger Framework Profiling](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configure-framework-profiling.html).
If you use Studio Classic, turn on profiling using the **Profiling** toggle button in the Debugger insights dashboard. To learn more, see [SageMaker Debugger Insights Dashboard Controller](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio-insights-controllers.html).

**To retrieve a description of the training job and the S3 bucket URI where the metric data is saved**

```
tj.describe_training_job()
tj.get_config_and_profiler_s3_output_path()
```

**To check if the system and framework metrics are available from the S3 URI**

```
tj.wait_for_sys_profiling_data_to_be_available()
tj.wait_for_framework_profiling_data_to_be_available()
```

**To create system and framework reader objects after the metric data become available**

```
system_metrics_reader = tj.get_systems_metrics_reader()
framework_metrics_reader = tj.get_framework_metrics_reader()
```

**To refresh and retrieve the latest training event files**

The reader objects have an extended method, `refresh_event_file_list()`, to retrieve the latest training event files.

```
system_metrics_reader.refresh_event_file_list()
framework_metrics_reader.refresh_event_file_list()
```

# Plot the system metrics and framework metrics data
<a name="debugger-access-data-profiling-default-plot"></a>

You can use the system and framework metrics reader objects with the following visualization classes to plot timeline graphs and histograms.

**Note**  
To visualize the data with narrowed-down metrics in the following visualization object plot methods, specify `select_dimensions` and `select_events` parameters. For example, if you specify `select_dimensions=["GPU"]`, the plot methods filter the metrics that include the "GPU" keyword. If you specify `select_events=["total"]`, the plot methods filter the metrics that include the "total" event tags at the end of the metric names. If you enable these parameters and give the keyword strings, the visualization classes return the charts with filtered metrics.
+ The `MetricsHistogram` class

  ```
  from smdebug.profiler.analysis.notebook_utils.metrics_histogram import MetricsHistogram
  
  metrics_histogram = MetricsHistogram(system_metrics_reader)
  metrics_histogram.plot(
      starttime=0, 
      endtime=system_metrics_reader.get_timestamp_of_latest_available_file(), 
      select_dimensions=["CPU", "GPU", "I/O"], # optional
      select_events=["total"]                  # optional
  )
  ```
+ The `StepTimelineChart` class

  ```
  from smdebug.profiler.analysis.notebook_utils.step_timeline_chart import StepTimelineChart
  
  view_step_timeline_chart = StepTimelineChart(framework_metrics_reader)
  ```
+ The `StepHistogram` class

  ```
  from smdebug.profiler.analysis.notebook_utils.step_histogram import StepHistogram
  
  step_histogram = StepHistogram(framework_metrics_reader)
  step_histogram.plot(
      starttime=step_histogram.last_timestamp - 5 * 1000 * 1000, 
      endtime=step_histogram.last_timestamp, 
      show_workers=True
  )
  ```
+ The `TimelineCharts` class

  ```
  from smdebug.profiler.analysis.notebook_utils.timeline_charts import TimelineCharts
  
  view_timeline_charts = TimelineCharts(
      system_metrics_reader, 
      framework_metrics_reader,
      select_dimensions=["CPU", "GPU", "I/O"], # optional
      select_events=["total"]                  # optional 
  )
  
  view_timeline_charts.plot_detailed_profiler_data([700,710])
  ```
+ The `Heatmap` class

  ```
  from smdebug.profiler.analysis.notebook_utils.heatmap import Heatmap
  
  view_heatmap = Heatmap(
      system_metrics_reader,
      framework_metrics_reader,
      select_dimensions=["CPU", "GPU", "I/O"], # optional
      select_events=["total"],                 # optional
      plot_height=450
  )
  ```

# Access the profiling data using the pandas data parsing tool
<a name="debugger-access-data-profiling-pandas-frame"></a>

The following `PandasFrame` class provides tools to convert the collected profiling data to a pandas DataFrame. 

```
from smdebug.profiler.analysis.utils.profiler_data_to_pandas import PandasFrame
```

The `PandasFrame` class takes the `tj` object's S3 bucket output path, and its methods `get_all_system_metrics()` and `get_all_framework_metrics()` return system metrics and framework metrics as pandas DataFrames.

```
pf = PandasFrame(tj.profiler_s3_output_path)
system_metrics_df = pf.get_all_system_metrics()
framework_metrics_df = pf.get_all_framework_metrics(
    selected_framework_metrics=[
        'Step:ModeKeys.TRAIN', 
        'Step:ModeKeys.GLOBAL'
    ]
)
```
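Because the returned objects are plain pandas DataFrames, you can narrow them down with standard pandas operations. The following is a minimal sketch of that pattern using hypothetical rows and column names (`timestamp_us`, `dimension`, `value`); the actual column names in your profiling output may differ.

```python
import pandas as pd

# Hypothetical rows mimicking the shape of a system-metrics DataFrame;
# the actual column names in your profiling output may differ.
system_metrics_df = pd.DataFrame({
    "timestamp_us": [0, 1_000_000, 2_000_000, 3_000_000],
    "dimension": ["GPUUtilization", "GPUUtilization", "CPUUtilization", "GPUUtilization"],
    "value": [85.0, 92.0, 40.0, 88.0],
})

# Narrow the metrics down to GPU utilization and summarize the window.
gpu = system_metrics_df[system_metrics_df["dimension"] == "GPUUtilization"]
gpu_mean = gpu["value"].mean()
```

The same filter-then-aggregate pattern applies to the framework metrics DataFrame, for example to compare step durations across training phases.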

# Access the Python profiling stats data
<a name="debugger-access-data-python-profiling"></a>

Python profiling provides framework metrics related to Python functions and operators in your training scripts and in the SageMaker AI deep learning frameworks. 

<a name="debugger-access-data-python-profiling-modes"></a>**Training Modes and Phases for Python Profiling**

To profile specific intervals during training and partition statistics for each interval, Debugger provides tools to set modes and phases. 

For training modes, use the following `PythonProfileModes` class:

```
from smdebug.profiler.python_profile_utils import PythonProfileModes
```

This class provides the following options:
+ `PythonProfileModes.TRAIN` – Use if you want to profile the target steps in the training phase. This mode option is available only for TensorFlow.
+ `PythonProfileModes.EVAL` – Use if you want to profile the target steps in the evaluation phase. This mode option is available only for TensorFlow.
+ `PythonProfileModes.PREDICT` – Use if you want to profile the target steps in the prediction phase. This mode option is available only for TensorFlow.
+ `PythonProfileModes.GLOBAL` – Use if you want to profile the target steps in the global phase, which includes the previous three phases. This mode option is available only for PyTorch.
+ `PythonProfileModes.PRE_STEP_ZERO` – Use if you want to profile the target steps in the initialization stage, before the first training step of the first epoch starts. This phase includes the initial job submission, uploading the training scripts to EC2 instances, preparing the EC2 instances, and downloading input data. This mode option is available for both TensorFlow and PyTorch.
+ `PythonProfileModes.POST_HOOK_CLOSE` – Use if you want to profile the target steps in the finalization stage, after the training job has finished and the Debugger hook is closed. This phase includes profiling data while the training jobs are finalized and completed. This mode option is available for both TensorFlow and PyTorch.

<a name="debugger-access-data-python-profiling-phases"></a>For training phases, use the following `StepPhase` class:

```
from smdebug.profiler.analysis.utils.python_profile_analysis_utils import StepPhase
```

This class provides the following options:
+ `StepPhase.START` – Use to specify the start point of the initialization phase.
+ `StepPhase.STEP_START` – Use to specify the start step of the training phase.
+ `StepPhase.FORWARD_PASS_END` – Use to specify the steps where the forward pass ends. This option is available only for PyTorch.
+ `StepPhase.STEP_END` – Use to specify the end steps in the training phase. This option is available only for TensorFlow.
+ `StepPhase.END` – Use to specify the ending point of the finalization (post-hook-close) phase. If the callback hook is not closed, the finalization phase profiling does not occur.

**Python Profiling Analysis Tools**

Debugger supports the Python profiling with two profiling tools:
+ cProfile – The standard Python profiler. cProfile collects framework metrics on CPU time for every function called while profiling is enabled.
+ Pyinstrument – A low-overhead Python profiler that samples profiling events every millisecond.

To learn more about the Python profiling options and what's collected, see [Default system monitoring and customized framework profiling with different profiling options](debugger-configure-framework-profiling-options.md).
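Independent of Debugger, you can see the kind of per-function CPU statistics that cProfile collects with a few lines of standard-library code. This is a minimal sketch with a hypothetical toy workload (`matmul`), not the stats format Debugger saves:

```python
import cProfile
import io
import pstats

def matmul(n):
    # A deliberately slow pure-Python workload to profile.
    a = [[i * j for j in range(n)] for i in range(n)]
    return sum(sum(row) for row in a)

profiler = cProfile.Profile()
profiler.enable()
matmul(100)
profiler.disable()

# Print the five functions with the highest cumulative CPU time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
```

Every profiled function appears in the report with its call count and CPU time, which is the kind of data the analysis methods below fetch from S3.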

The following methods of the `PythonProfileAnalysis`, `cProfileAnalysis`, and `PyinstrumentAnalysis` classes are provided to fetch and analyze the Python profiling data. Each function loads the latest data from the default S3 URI.

```
from smdebug.profiler.analysis.python_profile_analysis import PythonProfileAnalysis, cProfileAnalysis, PyinstrumentAnalysis
```

To set up a Python profiling object for analysis, use the `cProfileAnalysis` or `PyinstrumentAnalysis` class as shown in the following example code. The example sets up a `cProfileAnalysis` object; to use `PyinstrumentAnalysis` instead, replace the class name.

```
python_analysis = cProfileAnalysis(
    local_profile_dir=tf_python_stats_dir, 
    s3_path=tj.profiler_s3_output_path
)
```

The following methods are available for the `cProfileAnalysis` and `PyinstrumentAnalysis` classes to fetch the Python profiling stats data:
+ `python_analysis.fetch_python_profile_stats_by_time(start_time_since_epoch_in_secs, end_time_since_epoch_in_secs)` – Takes in a start time and end time, and returns the function stats of the steps whose start or end times overlap with the provided interval.
+ `python_analysis.fetch_python_profile_stats_by_step(start_step, end_step, mode, start_phase, end_phase)` – Takes in a start step and end step, and returns the function stats of all steps whose profiled `step` satisfies `start_step <= step < end_step`. 
  + `start_step` and `end_step` (str) – Specify the start step and end step to fetch the Python profiling stats data.
  + `mode` (str) – Specify the mode of the training job using the `PythonProfileModes` enumerator class. The default is `PythonProfileModes.TRAIN`. Available options are provided in the [Training Modes and Phases for Python Profiling](#debugger-access-data-python-profiling-modes) section.
  + `start_phase` (str) – Specify the start phase in the target steps using the `StepPhase` enumerator class. This parameter enables profiling between different phases of training. The default is `StepPhase.STEP_START`. Available options are provided in the [Training Modes and Phases for Python Profiling](#debugger-access-data-python-profiling-phases) section.
  + `end_phase` (str) – Specify the end phase in the target steps using the `StepPhase` enumerator class. This parameter sets the end phase of training. The default is `StepPhase.STEP_END`. Available options are the same as for the `start_phase` parameter.
+ `python_analysis.fetch_profile_stats_between_modes(start_mode, end_mode)` – Fetches stats from the Python profiling between the start and end modes.
+ `python_analysis.fetch_pre_step_zero_profile_stats()` – Fetches the stats from the Python profiling until step 0.
+ `python_analysis.fetch_post_hook_close_profile_stats()` – Fetches stats from the Python profiling after the hook is closed.
+ `python_analysis.list_profile_stats()` – Returns a DataFrame of the Python profiling stats. Each row holds the metadata for each instance of profiling and the corresponding stats file (one per step).
+ `python_analysis.list_available_node_ids()` – Returns a list of the available node IDs for the Python profiling stats.

The `cProfileAnalysis` class specific methods:
+  `fetch_profile_stats_by_training_phase()` – Fetches and aggregates the Python profiling stats for every possible combination of start and end modes. For example, if training and validation phases run while detailed profiling is enabled, the combinations are `(PRE_STEP_ZERO, TRAIN)`, `(TRAIN, TRAIN)`, `(TRAIN, EVAL)`, `(EVAL, EVAL)`, and `(EVAL, POST_HOOK_CLOSE)`. All stats files within each of these combinations are aggregated.
+  `fetch_profile_stats_by_job_phase()` – Fetches and aggregates the Python profiling stats by job phase. The job phases are `initialization` (profiling until step 0), `training_loop` (training and validation), and `finalization` (profiling after the hook is closed).

# Merge timelines of multiple profile trace files
<a name="debugger-merge-timeline"></a>

The SMDebug client library provides profiling analysis and visualization tools for merging timelines of system metrics, framework metrics, and Python profiling data collected by Debugger. 

**Tip**  
Before proceeding, you need to set up a `TrainingJob` object that is used throughout the examples on this page. For more information about setting up a `TrainingJob` object, see [Access the profile data](debugger-analyze-data-profiling.md).

The `MergedTimeline` class provides tools to integrate and correlate different profiling information in a single timeline. After Debugger captures profiling data and annotations from different phases of a training job, JSON files of trace events are saved in a default `tracefolder` directory.
+ For annotations in the Python layers, the trace files are saved in `*pythontimeline.json`. 
+ For annotations in the TensorFlow C++ layers, the trace files are saved in `*model_timeline.json`. 
+ The TensorFlow profiler saves events in a `*trace.json.gz` file. 

**Tip**  
If you want to list all of the JSON trace files, use the following AWS CLI command:  

```
! aws s3 ls {tj.profiler_s3_output_path} --recursive | grep '\.json$'
```

As shown in the following animated screenshot, aligning the trace events captured from the different profiling sources in a single plot provides an overview of all the events occurring in different phases of the training job.

![\[An example of merged timeline\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-merged-timeline.gif)


**Tip**  
To interact with the merged timeline in the tracing app using a keyboard, use the `W` key to zoom in, the `A` key to shift left, the `S` key to zoom out, and the `D` key to shift right.

The multiple event trace JSON files can be merged into one trace event JSON file using the following `MergedTimeline` API operation and class method from the `smdebug.profiler.analysis.utils.merge_timelines` module.

```
from smdebug.profiler.analysis.utils.merge_timelines import MergedTimeline

combined_timeline = MergedTimeline(path, file_suffix_filter, output_directory)
combined_timeline.merge_timeline(start, end, unit)
```

The `MergedTimeline` class takes the following parameters:
+ `path` (str) – Specify a root folder (`/profiler-output`) that contains system and framework profiling trace files. You can locate the `profiler-output` using the SageMaker AI estimator classmethod or the TrainingJob object. For example, `estimator.latest_job_profiler_artifacts_path()` or `tj.profiler_s3_output_path`.
+ `file_suffix_filter` (list) – Specify a list of file suffix filters to merge timelines. Available suffix filters are `["model_timeline.json", "pythontimeline.json", "trace.json.gz"]`. If this parameter is not specified, all of the trace files are merged by default.
+ `output_directory` (str) – Specify a path to save the merged timeline JSON file. The default is to the directory specified for the `path` parameter.

The `merge_timeline()` method takes the following parameters to run the merging process:
+ `start` (int) – Specify start time (in microseconds and in Unix time format) or start step to merge timelines.
+ `end` (int) – Specify end time (in microseconds and in Unix time format) or end step to merge timelines.
+ `unit` (str) – Choose between `"time"` and `"step"`. The default is `"time"`.

Using the following example code, run the `merge_timeline()` method and download the merged JSON file. 
+ Merge timeline with the `"time"` unit option. The following example code merges all available trace files between the Unix start time (the absolute zero Unix time) and the current Unix time, which means that you can merge the timelines for the entire training duration.

  ```
  import time
  from smdebug.profiler.analysis.utils.merge_timelines import MergedTimeline
  from smdebug.profiler.profiler_constants import CONVERT_TO_MICROSECS
  
  combined_timeline = MergedTimeline(tj.profiler_s3_output_path, output_directory="./")
  combined_timeline.merge_timeline(0, int(time.time() * CONVERT_TO_MICROSECS))
  ```
+ Merge timeline with the `"step"` unit option. The following example code merges all available timelines between step 3 and step 9.

  ```
  from smdebug.profiler.analysis.utils.merge_timelines import MergedTimeline
  
  combined_timeline = MergedTimeline(tj.profiler_s3_output_path, output_directory="./")
  combined_timeline.merge_timeline(3, 9, unit="step")
  ```

Open the Chrome tracing app at `chrome://tracing` in a Chrome browser, and open the JSON file. You can explore the output to examine the merged timeline. 
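The merged file follows the Chrome Trace Event format that `chrome://tracing` reads. As a rough illustration of what the merged events look like, the following sketch writes a minimal, hand-made trace file; the event names and values here are hypothetical:

```python
import json

# Two complete events ("ph": "X") in the Chrome Trace Event format;
# "ts" (start) and "dur" (duration) are in microseconds.
events = [
    {"name": "Step:ModeKeys.TRAIN", "ph": "X", "ts": 0, "dur": 500_000,
     "pid": 0, "tid": 0, "args": {"step_num": "1"}},
    {"name": "DataLoaderIter", "ph": "X", "ts": 50_000, "dur": 120_000,
     "pid": 0, "tid": 1},
]

with open("minimal_trace.json", "w") as f:
    json.dump({"traceEvents": events}, f)
```

Loading this file at `chrome://tracing` renders the two events as nested bars on a shared time axis, which is how the merged profiler timelines line up as well.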

# Profiling data loaders
<a name="debugger-data-loading-time"></a>

In PyTorch, data loader iterators, such as `SingleProcessingDataLoaderIter` and `MultiProcessingDataLoaderIter`, are initiated at the beginning of every iteration over a dataset. During the initialization phase, PyTorch spins up worker processes depending on the configured number of workers, establishes data queues to fetch data, and starts `pin_memory` threads.

To use the PyTorch data loader profiling analysis tool, import the following `PT_dataloader_analysis` class:

```
from smdebug.profiler.analysis.utils.pytorch_dataloader_analysis import PT_dataloader_analysis
```

Pass the profiling data retrieved as a pandas DataFrame object in the [Access the profiling data using the pandas data parsing tool](debugger-access-data-profiling-pandas-frame.md) section:

```
pt_analysis = PT_dataloader_analysis(pf)
```

The following functions are available for the `pt_analysis` object:

+ `pt_analysis.analyze_dataloaderIter_initialization()`

  The analysis outputs the median and maximum duration for these initializations. If there are outliers (that is, durations greater than 2 × the median), the function prints the start and end times for those durations. You can use these to inspect system metrics during those time intervals.

  The following list shows what analysis is available from this class method:
  + Which type of data loader iterators were initialized.
  + The number of workers per iterator.
  + Inspect whether the iterator was initialized with or without `pin_memory`.
  + Number of times the iterators were initialized during training.
+ `pt_analysis.analyze_dataloaderWorkers()`

  The following list shows what analysis is available from this class method:
  + The number of worker processes that were spun off during the entire training. 
  + Median and maximum duration for the worker processes. 
  + Start and end time for the worker processes that are outliers. 
+ `pt_analysis.analyze_dataloader_getnext()`

  The following list shows what analysis is available from this class method:
  + Number of GetNext calls made during the training. 
  + Median and maximum duration in microseconds for GetNext calls. 
  + Start time, end time, duration, and worker ID for the outlier GetNext call durations. 
+ `pt_analysis.analyze_batchtime(start_timestamp, end_timestamp, select_events=[".*"], select_dimensions=[".*"])`

  Debugger collects the start and end times of all the GetNext calls. You can find the amount of time spent by the training script on one batch of data. Within the specified time window, you can identify the calls that are not directly contributing to the training. These calls can be from the following operations: computing the accuracy, adding the losses for debugging or logging purposes, and printing the debugging information. Operations like these can be compute intensive or time consuming. We can identify such operations by correlating the Python profiler, system metrics, and framework metrics.

  The following list shows what analysis is available from this class method:
  + Profile time spent on each data batch, `BatchTime_in_seconds`, by finding the difference between start times of current and subsequent GetNext calls. 
  + Find the outliers in `BatchTime_in_seconds` and start and end time for those outliers.
  + Obtain the system and framework metrics during those `BatchTime_in_seconds` timestamps. This indicates where the time was spent.
+ `pt_analysis.plot_the_window()`

  Plots timeline charts between a start timestamp and an end timestamp.
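The outlier rule these methods describe (a duration greater than 2 × the median) is straightforward to sketch in plain Python; the duration values below are hypothetical:

```python
from statistics import median

def find_outliers(durations_us, factor=2):
    """Return (median, durations above factor * median) -- the same
    heuristic the analysis methods above describe."""
    med = median(durations_us)
    return med, [d for d in durations_us if d > factor * med]

# Hypothetical GetNext durations in microseconds.
durations = [900, 1_000, 1_100, 950, 8_000, 1_050]
med, outliers = find_outliers(durations)
```

The flagged durations give you the time windows in which to inspect system and framework metrics for the cause of the slowdown.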

# Release notes for profiling capabilities of Amazon SageMaker AI
<a name="profiler-release-notes"></a>

See the following release notes to track the latest updates for profiling capabilities of Amazon SageMaker AI.

## March 21, 2024
<a name="profiler-release-notes-20240321"></a>

**Currency updates**

[SageMaker Profiler](train-use-sagemaker-profiler.md) has added support for PyTorch v2.2.0, v2.1.0, and v2.0.1.

**AWS Deep Learning Containers pre-installed with SageMaker Profiler**

[SageMaker Profiler](train-use-sagemaker-profiler.md) is packaged in the following [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).
+ SageMaker AI Framework Container for PyTorch v2.2.0
+ SageMaker AI Framework Container for PyTorch v2.1.0
+ SageMaker AI Framework Container for PyTorch v2.0.1

## December 14, 2023
<a name="profiler-release-notes-20231214"></a>

**Currency updates**

[SageMaker Profiler](train-use-sagemaker-profiler.md) has added support for TensorFlow v2.13.0.

**Breaking changes**

This release involves a breaking change. The SageMaker Profiler Python package name is changed from `smppy` to `smprof`. If you have been using the previous version of the package while you have started using the latest [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) for TensorFlow listed in the following section, make sure that you update the package name from `smppy` to `smprof` in the import statement in your training script.

**AWS Deep Learning Containers pre-installed with SageMaker Profiler**

[SageMaker Profiler](train-use-sagemaker-profiler.md) is packaged in the following [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).
+ SageMaker AI Framework Container for TensorFlow v2.13.0
+ SageMaker AI Framework Container for TensorFlow v2.12.0

If you use previous versions of the [framework containers](profiler-support.md#profiler-support-frameworks), such as TensorFlow v2.11.0, the SageMaker Profiler Python package is still available as `smppy`. If you are uncertain which version or package name to use, replace the import statement of the SageMaker Profiler package with the following code snippet.

```
try:
    import smprof 
except ImportError:
    # Backward compatibility for TF 2.11 and PT 1.13.1 images
    import smppy as smprof
```

## August 24, 2023
<a name="profiler-release-notes-20230824"></a>

**New features**

Released Amazon SageMaker Profiler, a profiling and visualization capability of SageMaker AI to deep dive into compute resources provisioned while training deep learning models and gain visibility into operation-level details. SageMaker Profiler provides Python modules (`smppy`) for adding annotations throughout PyTorch or TensorFlow training scripts and activating SageMaker Profiler. You can access the modules through the SageMaker AI Python SDK and AWS Deep Learning Containers. For any jobs run with the SageMaker Profiler Python modules, you can load the profile data in the SageMaker Profiler UI application that provides a summary dashboard and a detailed timeline. To learn more, see [Amazon SageMaker Profiler](train-use-sagemaker-profiler.md).

This release of the SageMaker Profiler Python package is integrated into the following [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) for PyTorch and TensorFlow.
+ PyTorch v2.0.0
+ PyTorch v1.13.1
+ TensorFlow v2.12.0
+ TensorFlow v2.11.0

# Distributed training in Amazon SageMaker AI
<a name="distributed-training"></a>

SageMaker AI provides distributed training libraries and supports various distributed training options for deep learning tasks such as computer vision (CV) and natural language processing (NLP). With SageMaker AI’s distributed training libraries, you can run highly scalable and cost-effective custom data parallel and model parallel deep learning training jobs. You can also use other distributed training frameworks and packages such as PyTorch DistributedDataParallel (DDP), `torchrun`, MPI (`mpirun`), and parameter server. The following section gives information about fundamental distributed training concepts. Throughout the documentation, instructions and examples focus on how to set up the distributed training options for deep learning tasks using the SageMaker Python SDK.

**Tip**  
To learn best practices for distributed computing of machine learning (ML) training and processing jobs in general, see [Distributed computing with SageMaker AI best practices](distributed-training-options.md).

## Distributed training concepts
<a name="distributed-training-basic-concepts"></a>

 SageMaker AI’s distributed training libraries use the following distributed training terms and features. 

**Datasets and Batches**
+ **Training Dataset**: All of the data you use to train the model.
+ **Global batch size**: The number of records selected from the training dataset in each iteration to send to the GPUs in the cluster. This is the number of records over which the gradient is computed at each iteration. If data parallelism is used, it is equal to the total number of model replicas multiplied by the per-replica batch size: `global batch size = (the number of model replicas) * (per-replica batch size)`. A single batch of global batch size is often referred to as the *mini-batch* in machine learning literature.
+ **Per-replica batch size:** When data parallelism is used, this is the number of records sent to each model replica. Each model replica performs a forward and backward pass with this batch to calculate weight updates. The resulting weight updates are synchronized (averaged) across all replicas before the next set of per-replica batches are processed. 
+ **Micro-batch**: A subset of the mini-batch or, if hybrid model and data parallelism is used, a subset of the per-replica batch. When you use SageMaker AI’s distributed model parallelism library, each micro-batch is fed into the training pipeline one-by-one and follows an [execution schedule](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html#model-parallel-pipeline-execution) defined by the library's runtime.

**Training**
+ **Epoch**: One training cycle through the entire dataset. It is common to have multiple iterations per epoch. The number of epochs you use in training depends on your model and use case.
+ **Iteration**: A single forward and backward pass performed using a mini-batch (a batch of global batch size) of training data. The number of iterations performed during training is determined by the global batch size and the number of epochs used for training. For example, if a dataset includes 5,000 samples, and you use a global batch size of 500, it takes 10 iterations to complete a single epoch.
+ **Learning rate**: A variable that influences the amount that weights are changed in response to the calculated error of the model. The learning rate plays an important role in the model’s ability to converge as well as the speed and optimality of convergence.
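The relationships between these quantities can be checked with quick arithmetic. The following sketch restates the global batch size formula and the iterations-per-epoch example from the definitions above:

```python
def global_batch_size(num_replicas, per_replica_batch_size):
    # global batch size = (number of model replicas) * (per-replica batch size)
    return num_replicas * per_replica_batch_size

def iterations_per_epoch(dataset_size, global_batch):
    # One epoch makes a full pass over the dataset, one global batch per iteration.
    return dataset_size // global_batch

gbs = global_batch_size(num_replicas=8, per_replica_batch_size=64)
iters = iterations_per_epoch(dataset_size=5_000, global_batch=500)
```

Note the trade-off this exposes: adding replicas grows the global batch size, which in turn reduces the number of iterations per epoch and often calls for adjusting the learning rate.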

**Instances and GPUs**
+ **Instances**: An AWS [machine learning compute instance](https://aws.amazon.com/sagemaker/pricing/). These are also referred to as *nodes*.
+ **Cluster size**: When using SageMaker AI's distributed training library, this is the number of instances multiplied by the number of GPUs in each instance. For example, if you use two ml.p3.8xlarge instances in a training job, which have 4 GPUs each, the cluster size is 8. While increasing cluster size can lead to faster training times, communication between instances must be optimized; otherwise, communication between the nodes can add overhead and lead to slower training times. The SageMaker AI distributed training library is designed to optimize communication between Amazon EC2 ML compute instances, leading to higher device utilization and faster training times.

**Distributed Training Solutions**
+ **Data parallelism**: A strategy in distributed training where a training dataset is split up across multiple GPUs in a compute cluster, which consists of multiple Amazon EC2 ML Instances. Each GPU contains a *replica* of the model, receives different batches of training data, performs a forward and backward pass, and shares weight updates with the other nodes for synchronization before moving on to the next batch and ultimately another epoch.
+ **Model parallelism**: A strategy in distributed training where the model is partitioned across multiple GPUs in a compute cluster, which consists of multiple Amazon EC2 ML Instances. The model might be complex and have a large number of hidden layers and weights, making it unable to fit in the memory of a single instance. Each GPU carries a subset of the model, through which the data flows and the transformations are shared and compiled. The efficiency of model parallelism, in terms of GPU utilization and training time, is heavily dependent on how the model is partitioned and the execution schedule used to perform forward and backward passes.
+ **Pipeline Execution Schedule** (**Pipelining**): The pipeline execution schedule determines the order in which computations (micro-batches) are made and data is processed across devices during model training. Pipelining is a technique to achieve true parallelization in model parallelism and overcome the performance loss due to sequential computation by having the GPUs compute simultaneously on different data samples. To learn more, see [Pipeline Execution Schedule](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html#model-parallel-pipeline-execution). 

### Advanced concepts
<a name="distributed-training-advanced-concepts"></a>

Machine Learning (ML) practitioners commonly face two scaling challenges when training models: *scaling model size* and *scaling training data*. While model size and complexity can result in better accuracy, there is a limit to the model size you can fit into a single CPU or GPU. Furthermore, scaling model size may result in more computations and longer training times.

Not all models scale equally well with training data, because some need to ingest the entire training dataset *in memory* for training. Such models scale only vertically, to bigger and bigger instance types. In most cases, scaling training data results in longer training times.

Deep Learning (DL) is a specific family of ML algorithms consisting of several layers of artificial neural networks. The most common training method is mini-batch Stochastic Gradient Descent (SGD). In mini-batch SGD, the model is trained by making small iterative changes to its coefficients in the direction that reduces its error. Those iterations are conducted on equally sized subsamples of the training dataset called *mini-batches*. For each mini-batch, the model is run on each record of the mini-batch, its error is measured, and the gradient of the error is estimated. Then the average gradient is computed across all the records of the mini-batch and provides an update direction for each model coefficient. One full pass over the training dataset is called an *epoch*. Model training commonly consists of dozens to hundreds of epochs. Mini-batch SGD has several benefits: first, its iterative design makes training time theoretically linear in dataset size. Second, in a given mini-batch each record is processed individually by the model, with no inter-record communication needed other than the final gradient average. The processing of a mini-batch is consequently particularly suitable for parallelization and distribution.  

Parallelizing SGD training by distributing the records of a mini-batch over different computing devices is called *data parallel distributed training*, and is the most commonly used DL distribution paradigm. Data parallel training is a relevant distribution strategy to scale the mini-batch size and process each mini-batch faster. However, data parallel training comes with the extra complexity of having to compute the mini-batch gradient average with gradients coming from all the workers and communicating it to all the workers, a step called *allreduce* that can represent a growing overhead as the training cluster is scaled, and that can also drastically penalize training time if improperly implemented or implemented over an improper hardware substrate.  
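
The following sketch simulates the data-parallel gradient averaging step (a toy "allreduce") in plain Python; it is illustrative only and not the SageMaker library implementation. Each simulated worker computes gradients on its shard of the mini-batch, and the averaged result matches what a single device would compute over the whole mini-batch.

```python
# d/dw of the per-record squared error (w*x - y)^2
def grad(w, record):
    x, y = record
    return 2 * (w * x - y) * x

w = 0.5
mini_batch = [(0.1, 0.3), (0.2, 0.6), (0.4, 1.2), (0.8, 2.4)]

# Single-device reference: average gradient over the whole mini-batch
reference = sum(grad(w, r) for r in mini_batch) / len(mini_batch)

# Data parallel: shard the mini-batch across two simulated workers
shards = [mini_batch[:2], mini_batch[2:]]
partial_sums = [sum(grad(w, r) for r in shard) for shard in shards]

# "Allreduce": combine the partial sums across workers, then average
allreduced = sum(partial_sums) / len(mini_batch)

assert abs(allreduced - reference) < 1e-12
```

In real systems the combine step runs over the network, which is why its implementation and the underlying hardware matter so much at scale.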

Data parallel SGD still requires developers to be able to fit at least the model and a single record in a computing device, such as a single CPU or GPU. When training very large models such as large transformers in Natural Language Processing (NLP), or segmentation models over high-resolution images, there may be situations in which this is not feasible. An alternative way to break up the workload is to partition the model over multiple computing devices, an approach called *model-parallel distributed training*. 

# Get started with distributed training in Amazon SageMaker AI
<a name="distributed-training-get-started"></a>

The following page gives information about the steps needed to get started with distributed training in Amazon SageMaker AI. If you’re already familiar with distributed training, choose one of the following options that matches your preferred strategy or framework to get started. If you want to learn about distributed training in general, see [Distributed training concepts](distributed-training.md#distributed-training-basic-concepts).

The SageMaker AI distributed training libraries are optimized for the SageMaker training environment, help adapt your distributed training jobs to SageMaker AI, and improve training speed and throughput. The libraries offer both data parallel and model parallel training strategies. They combine software and hardware technologies to improve inter-GPU and inter-node communications, and extend SageMaker AI’s training capabilities with built-in options that require minimal code changes to your training scripts. 

## Before you get started
<a name="distributed-training-before-getting-started"></a>

SageMaker Training supports distributed training on a single instance as well as multiple instances, so you can run training at any scale. We recommend using the framework estimator classes such as [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator) and [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) in the SageMaker Python SDK, which are the training job launchers with various distributed training options. When you create an estimator object, the object sets up the distributed training infrastructure, runs the `CreateTrainingJob` API in the backend, finds the Region where your current session is running, and pulls one of the pre-built AWS Deep Learning Containers, prepackaged with a number of libraries including deep learning frameworks, distributed training frameworks, and the [EFA](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) driver. If you want to mount an FSx file system to the training instances, you need to pass your VPC subnet and security group ID to the estimator. Before running your distributed training job in SageMaker AI, read the following general guidance on the basic infrastructure setup.

### Availability zones and network backplane
<a name="availability-zones"></a>

When using multiple instances (also called *nodes*), it’s important to understand the network that connects the instances, how they read the training data, and how they share information between themselves. For example, when you run a distributed data-parallel training job, a number of factors, such as communication between the nodes of a compute cluster for running the `AllReduce` operation and data transfer between the nodes and data storage in Amazon Simple Storage Service or Amazon FSx for Lustre, play a crucial role to achieve an optimal use of compute resources and a faster training speed. To reduce communication overhead, make sure that you configure instances, VPC subnet, and data storage in the same AWS Region and Availability Zone.

### GPU instances with faster network and high-throughput storage
<a name="optimized-GPU"></a>

You can technically use any instances for distributed training. For cases where you need to run multi-node distributed training jobs for training large models, such as large language models (LLMs) and diffusion models, which require faster inter-node communication, we recommend [EFA-enabled GPU instances supported by SageMaker AI](http://aws.amazon.com/about-aws/whats-new/2021/05/amazon-sagemaker-supports-elastic-fabric-adapter-distributed-training/). In particular, to achieve the most performant distributed training jobs in SageMaker AI, we recommend [P4d and P4de instances equipped with NVIDIA A100 GPUs](http://aws.amazon.com/ec2/instance-types/p4/). These are also equipped with high-throughput, low-latency local instance storage and faster intra-node networking. For data storage, we recommend [Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html), which provides high throughput for storing training datasets and model checkpoints.

**Use the SageMaker AI distributed data parallelism (SMDDP) library**

The SMDDP library improves communication between nodes with implementations of `AllReduce` and `AllGather` collective communication operations that are optimized for AWS network infrastructure and Amazon SageMaker AI ML instance topology. You can use the [SMDDP library as the backend of PyTorch-based distributed training packages](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html): [PyTorch distributed data parallel (DDP)](https://pytorch.org/docs/stable/notes/ddp.html), [PyTorch fully sharded data parallelism (FSDP)](https://pytorch.org/docs/stable/fsdp.html), [DeepSpeed](https://github.com/microsoft/DeepSpeed), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed). The following code example shows how to set a `PyTorch` estimator for launching a distributed training job on two `ml.p4d.24xlarge` instances.

```
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    ...,
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    # Activate distributed training with SMDDP
    distribution={ "pytorchddp": { "enabled": True } }  # mpirun, activates SMDDP AllReduce OR AllGather
    # distribution={ "torch_distributed": { "enabled": True } }  # torchrun, activates SMDDP AllGather
    # distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }  # mpirun, activates SMDDP AllReduce OR AllGather
)
```

To learn how to prepare your training script and launch a distributed data-parallel training job on SageMaker AI, see [Run distributed training with the SageMaker AI distributed data parallelism library](data-parallel.md).

**Use the SageMaker AI model parallelism library (SMP)**

SageMaker AI provides the SMP library and supports various distributed training techniques, such as sharded data parallelism, pipelining, tensor parallelism, optimizer state sharding, and more. To learn more about what the SMP library offers, see [Core Features of the SageMaker Model Parallelism Library](model-parallel-core-features.md).

To use SageMaker AI's model parallelism library, configure the `distribution` parameter of the SageMaker AI framework estimators. Supported framework estimators are [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator) and [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator). The following code example shows how to construct a framework estimator for distributed training with the model parallelism library on two `ml.p4d.24xlarge` instances.

```
from sagemaker.framework import Framework

distribution={
    "smdistributed": {
        "modelparallel": {
            "enabled":True,
            "parameters": {
                ...   # enter parameter key-value pairs here
            }
        },
    },
    "mpi": {
        "enabled" : True,
        ...           # enter parameter key-value pairs here
    }
}

estimator = Framework(
    ...,
    instance_count=2,
    instance_type="ml.p4d.24xlarge",
    distribution=distribution
)
```

To learn how to adapt your training script, configure distribution parameters in the `estimator` class, and launch a distributed training job, see [SageMaker AI's model parallelism library](model-parallel.md) (see also [Distributed Training APIs](https://sagemaker.readthedocs.io/en/stable/api/training/distributed.html#the-sagemaker-distributed-model-parallel-library) in the *SageMaker Python SDK documentation*).

**Use open source distributed training frameworks**

SageMaker AI also supports the following options to operate `mpirun` and `torchrun` in the backend.
+ To use [PyTorch DistributedDataParallel (DDP)](https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html) in SageMaker AI with the `mpirun` backend, add `distribution={"pytorchddp": {"enabled": True}}` to your PyTorch estimator. For more information, see also [PyTorch Distributed Training](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training) and [SageMaker AI PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator)'s `distribution` argument in the *SageMaker Python SDK documentation*.
**Note**  
This option is available for PyTorch 1.12.0 and later.

  ```
  from sagemaker.pytorch import PyTorch
  
  estimator = PyTorch(
      ...,
      instance_count=2,
      instance_type="ml.p4d.24xlarge",
      distribution={"pytorchddp": {"enabled": True}}  # runs mpirun in the backend
  )
  ```
+ SageMaker AI supports the [PyTorch `torchrun` launcher](https://pytorch.org/docs/stable/elastic/run.html) for distributed training on GPU-based Amazon EC2 instances, such as P3 and P4, as well as Trn1 powered by the [AWS Trainium](https://aws.amazon.com/machine-learning/trainium/) device. 

  To use [PyTorch DistributedDataParallel (DDP)](https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html) in SageMaker AI with the `torchrun` backend, add `distribution={"torch_distributed": {"enabled": True}}` to the PyTorch estimator.
**Note**  
This option is available for PyTorch 1.13.0 and later.

  The following code snippet shows an example of constructing a SageMaker AI PyTorch estimator to run distributed training on two `ml.p4d.24xlarge` instances with the `torch_distributed` distribution option.

  ```
  from sagemaker.pytorch import PyTorch
  
  estimator = PyTorch(
      ...,
      instance_count=2,
      instance_type="ml.p4d.24xlarge",
      distribution={"torch_distributed": {"enabled": True}}   # runs torchrun in the backend
  )
  ```

  For more information, see [Distributed PyTorch Training](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training) and [SageMaker AI PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator)'s `distribution` argument in the *SageMaker Python SDK documentation*.

  **Notes for distributed training on Trn1**

  A Trn1 instance consists of up to 16 Trainium devices, and each Trainium device consists of two [NeuronCores](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/neuroncores-arch.html#neuroncores-v2-arch). For specs of the AWS Trainium devices, see [Trainium Architecture](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trn1-arch.html#id2) in the *AWS Neuron Documentation*.

  To train on the Trainium-powered instances, you only need to specify the Trn1 instance code, `ml.trn1.*`, as a string to the `instance_type` argument of the SageMaker AI PyTorch estimator class. To find available Trn1 instance types, see [AWS Trn1 Architecture](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trn1-arch.html#aws-trn1-arch) in the *AWS Neuron documentation*.
**Note**  
SageMaker Training on Amazon EC2 Trn1 instances is currently available only for the PyTorch framework in the AWS Deep Learning Containers for PyTorch Neuron starting v1.11.0. To find a complete list of supported versions of PyTorch Neuron, see [Neuron Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers) in the *AWS Deep Learning Containers GitHub repository*.

  When you launch a training job on Trn1 instances using the SageMaker Python SDK, SageMaker AI automatically picks up and runs the right container from [Neuron Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers) provided by AWS Deep Learning Containers. The Neuron Containers are prepackaged with training environment settings and dependencies for easier adaptation of your training job to the SageMaker Training platform and Amazon EC2 Trn1 instances.
**Note**  
To run your PyTorch training job on Trn1 instances with SageMaker AI, you should modify your training script to initialize process groups with the `xla` backend and use [PyTorch/XLA](https://pytorch.org/xla/release/1.12/index.html). To support the XLA adoption process, the AWS Neuron SDK provides PyTorch Neuron, which uses XLA to convert PyTorch operations to Trainium instructions. To learn how to modify your training script, see [Developer Guide for Training with PyTorch Neuron (`torch-neuronx`)](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-programming-guide.html) in the *AWS Neuron Documentation*.

  For more information, see [Distributed Training with PyTorch Neuron on Trn1 instances](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#id24) and [SageMaker AI PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator)'s `distribution` argument in the *SageMaker Python SDK documentation*.
+ To use MPI in SageMaker AI, add `distribution={"mpi": {"enabled": True}}` to your estimator. The MPI distribution option is available for the following frameworks: MXNet, PyTorch, and TensorFlow.
+ To use a parameter server in SageMaker AI, add `distribution={"parameter_server": {"enabled": True}}` to your estimator. The parameter server option is available for the following frameworks: MXNet, PyTorch, and TensorFlow. 
**Tip**  
For more information about using the MPI and parameter server options per framework, use the following links to the *SageMaker Python SDK documentation*.  
[MXNet Distributed Training](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html#distributed-training) and [SageMaker AI MXNet Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/sagemaker.mxnet.html#mxnet-estimator)'s `distribution` argument
[PyTorch Distributed Training](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#distributed-pytorch-training) and [SageMaker AI PyTorch Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator)'s `distribution` argument
[TensorFlow Distributed Training](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#distributed-training) and [SageMaker AI TensorFlow Estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator)'s `distribution` argument.

# Strategies for distributed training
<a name="distributed-training-strategies"></a>

Distributed training is usually split into two approaches: data parallel and model parallel. *Data parallel* is the most common approach to distributed training: You have a lot of data, batch it up, and send blocks of data to multiple CPUs or GPUs (nodes) to be processed by the neural network or ML algorithm, then combine the results. The neural network is the same on each node. A *model parallel* approach is used with large models that won’t fit in a node’s memory in one piece; it breaks up the model and places different parts on different nodes. In this situation, you need to send your batches of data out to each node so that the data is processed on all parts of the model. 

The terms *network* and *model* are often used interchangeably: A large model is really a large network with many layers and parameters. Training with a large network produces a large model, and loading the model back onto the network with all your pre-trained parameters and their weights loads a large model into memory. When you break apart a model to split it across nodes, you’re also breaking apart the underlying network. A network consists of layers, and to split up the network, you put layers on different compute devices.

A common pitfall of naively splitting layers across devices is severe GPU under-utilization. Training is inherently sequential in both the forward and backward passes: at any given time, only one GPU can actively compute, while the others wait for activations to be sent. Modern model parallel libraries solve this problem by using pipeline execution schedules to improve device utilization. However, only Amazon SageMaker AI's distributed model parallel library includes automatic model splitting. The two core features of the library, automatic model splitting and pipeline execution scheduling, simplify the process of implementing model parallelism by making automated decisions that lead to efficient device utilization.
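
A toy time-step count illustrates why pipelining raises utilization. With `D` pipeline stages (devices) and `M` micro-batches, a naive schedule keeps only one device busy at a time, while a pipelined schedule overlaps micro-batches. This is a simplified model for intuition only, not a description of any library's actual schedule.

```python
# Naive schedule: micro-batches traverse the D stages one at a time,
# so only one device is ever busy.
def naive_steps(devices, micro_batches):
    return devices * micro_batches

# Pipelined schedule: once the pipeline fills, all devices work
# concurrently on different micro-batches.
def pipelined_steps(devices, micro_batches):
    return devices + micro_batches - 1

D, M = 4, 8
print(naive_steps(D, M), pipelined_steps(D, M))  # prints: 32 11
```

With 4 devices and 8 micro-batches, pipelining cuts the step count from 32 to 11, and the gap widens as the number of micro-batches grows.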

## Train with data parallel and model parallel
<a name="distributed-training-data-model-parallel"></a>

If you are training with a large dataset, start with a data parallel approach. If you run out of memory during training, you may want to switch to a model parallel approach, or try hybrid model and data parallelism. You can also try the following to improve performance with data parallelism:
+ Change your model’s hyperparameters. 
+ Reduce the batch size.
+ Keep reducing the batch size until it fits. If you reduce the batch size to 1 and still run out of memory, then try model-parallel training. 

Try gradient compression (FP16, INT8):
+ On NVIDIA TensorCore-equipped hardware, using [mixed precision training](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) creates both speed-up and memory consumption reduction.
+ SageMaker AI's distributed data parallelism library supports Automatic Mixed Precision (AMP) out of the box. No extra action is needed to enable AMP other than the framework-level modifications to your training script. If gradients are in FP16, the SageMaker AI data parallelism library runs its `AllReduce` operation in FP16. For more information about implementing AMP APIs to your training script, see the following resources:
  + [Frameworks - PyTorch](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#pytorch) in the *NVIDIA Deep Learning Performance documentation*
  + [Frameworks - TensorFlow](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#tensorflow) in the *NVIDIA Deep Learning Performance documentation*
  + [Automatic Mixed Precision for Deep Learning](https://developer.nvidia.com/automatic-mixed-precision) in the *NVIDIA Developer Docs*
  + [Introducing native PyTorch automatic mixed precision for faster training on NVIDIA GPUs](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/) in the *PyTorch Blog*
  + [TensorFlow mixed precision APIs](https://www.tensorflow.org/guide/mixed_precision) in the *TensorFlow documentation*

Try reducing the input size:
+ Reduce the NLP sequence length. If you increase the sequence length, you need to adjust the batch size down, or scale up the number of GPUs to spread the batch. 
+ Reduce image resolution. 

Check whether you use batch normalization, since this can impact convergence. When you use distributed training, your batch is split across GPUs, and the effect of a much lower per-GPU batch size can be a higher error rate, which can prevent the model from converging. For example, if you prototyped your network on a single GPU with a batch size of 64, then scaled up to four p3dn.24xlarge instances, you now have 32 GPUs and your per-GPU batch size drops from 64 to 2. This will likely break the convergence you saw with a single node. 
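
The arithmetic in the example above can be sketched with a small hypothetical helper (this is not a SageMaker API, just an illustration of how the global batch is split):

```python
# Per-GPU batch size when a fixed global batch is split across the cluster
def per_gpu_batch(global_batch, instances, gpus_per_instance):
    return global_batch // (instances * gpus_per_instance)

print(per_gpu_batch(64, 1, 1))   # prototype: 1 GPU -> 64 per GPU
print(per_gpu_batch(64, 4, 8))   # 4 instances x 8 GPUs -> 2 per GPU
```

To preserve the effective per-GPU batch size you would instead grow the global batch with the cluster, which in turn usually requires hyperparameter adjustments such as learning-rate scaling.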

Start with model-parallel training when: 
+  Your model does not fit on a single device. 
+ Due to your model size, you’re facing limitations in choosing larger batch sizes, such as if your model weights take up most of your GPU memory and you are forced to choose a smaller, suboptimal batch size.  

To learn more about the SageMaker AI distributed libraries, see the following:
+  [Run distributed training with the SageMaker AI distributed data parallelism library](data-parallel.md) 
+  [(Archived) SageMaker model parallelism library v1.x](model-parallel.md) 

# Distributed training optimization
<a name="distributed-training-optimize"></a>

Customize hyperparameters for your use case and your data to get the best scaling efficiency. In the following discussion, we highlight some of the most impactful training variables and provide references to state-of-the-art implementations so you can learn more about your options. Also, we recommend that you refer to your preferred framework’s distributed training documentation. 
+  [Apache MXNet distributed training](https://mxnet.apache.org/versions/1.7/api/faq/distributed_training) 
+  [PyTorch distributed training](https://pytorch.org/tutorials/beginner/dist_overview.html) 
+  [TensorFlow distributed training](https://www.tensorflow.org/guide/distributed_training) 

## Batch Size
<a name="batch-size-intro"></a>

SageMaker AI distributed toolkits generally allow you to train on bigger batches. For example, if a model fits within a single device but can only be trained with a small batch size, using either model-parallel training or data parallel training enables you to experiment with larger batch sizes. 

Be aware that batch size directly influences model accuracy by controlling the amount of noise in the model update at each iteration. Increasing batch size reduces the amount of noise in the gradient estimation, which can be beneficial when increasing from very small batch sizes, but can result in degraded model accuracy as the batch size increases to large values.  

**Tip**  
Adjust your hyperparameters to ensure that your model trains to a satisfying convergence as you increase its batch size.

A number of techniques have been developed to maintain good model convergence when the batch size is increased.
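
One widely used example is the linear learning-rate scaling rule from Goyal et al. ("Accurate, Large Minibatch SGD"): when you multiply the batch size by a factor, multiply the learning rate by the same factor. The helper below is a hypothetical illustration, not part of any SageMaker library:

```python
# Linear scaling rule: lr grows in proportion to the batch size
def scaled_lr(base_lr, base_batch, new_batch):
    return base_lr * new_batch / base_batch

# Scaling the batch from 256 to 8192 (32x) scales lr from 0.1 to 3.2
print(scaled_lr(0.1, 256, 8192))
```

In practice this rule is typically combined with a learning-rate warmup phase at the start of training, as described in the papers listed in the next section.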

## Mini-batch size
<a name="distributed-training-mini-batch"></a>

In SGD, the mini-batch size quantifies the amount of noise present in the gradient estimation. A small mini-batch results in a very noisy mini-batch gradient, which is not representative of the true gradient over the dataset. A large mini-batch results in a mini-batch gradient close to the true gradient over the dataset and potentially not noisy enough, making it likely to stay locked in irrelevant local minima. 
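
A quick simulation makes the noise relationship concrete: treating per-record gradients as noisy samples of the true gradient, averaging a mini-batch of `B` of them shrinks the standard error roughly as `1/sqrt(B)`. This is a statistical illustration only, unrelated to any SageMaker API.

```python
import random
import statistics

random.seed(42)
true_grad = 1.0

def record_grad():
    # noisy per-record gradient around the true gradient
    return true_grad + random.gauss(0, 1)

def minibatch_grad(batch_size):
    return sum(record_grad() for _ in range(batch_size)) / batch_size

def spread(batch_size, trials=2000):
    # empirical standard deviation of the mini-batch gradient estimate
    return statistics.stdev(minibatch_grad(batch_size) for _ in range(trials))

small, large = spread(4), spread(64)
print(small, large)  # the 64-record mini-batch is far less noisy
```

The standard deviation for the batch of 64 comes out roughly a quarter of that for the batch of 4, matching the `1/sqrt(B)` scaling.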

To learn more about these techniques, see the following papers:
+ [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/pdf/1706.02677.pdf), Goyal et al. 
+ [PowerAI DDL](https://arxiv.org/pdf/1708.02188.pdf), Cho et al. 
+ [Scale Out for Large Minibatch SGD: Residual Network Training on ImageNet-1K with Improved Accuracy and Reduced Time to Train](https://arxiv.org/pdf/1711.04291.pdf), Codreanu et al. 
+ [ImageNet Training in Minutes](https://arxiv.org/pdf/1709.05011.pdf), You et al. 
+ [Large Batch Training of Convolutional Networks](https://arxiv.org/pdf/1708.03888.pdf), You et al. 
+ [Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes](https://arxiv.org/pdf/1904.00962.pdf), You et al. 
+ [Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes](https://arxiv.org/pdf/2006.13484.pdf), Zheng et al. 
+ [Deep Gradient Compression](https://arxiv.org/abs/1712.01887), Lin et al. 

# Scaling training
<a name="distributed-training-scenarios"></a>

The following sections cover scenarios in which you may want to scale up training, and how you can do so using AWS resources. You may want to scale training in one of the following situations:
+ Scaling from a single GPU to many GPUs
+ Scaling from a single instance to multiple instances
+ Using custom training scripts

## Scaling from a single GPU to many GPUs
<a name="scaling-from-one-GPU"></a>

The amount of data or the size of the model used in machine learning can create situations in which the time to train a model is longer than you are willing to wait. Sometimes, the training doesn’t work at all because the model or the training data is too large. One solution is to increase the number of GPUs you use for training. On an instance with multiple GPUs, like a `p3.16xlarge` that has eight GPUs, the data and processing are split across the eight GPUs. When you use distributed training libraries, this can result in a near-linear speedup in the time it takes to train your model: it takes slightly over 1/8 the time it would have taken on a `p3.2xlarge` with one GPU.


|  Instance type  |  GPUs  | 
| --- | --- | 
|  p3.2xlarge  |  1  | 
|  p3.8xlarge  |  4  | 
|  p3.16xlarge  |  8  | 
|  p3dn.24xlarge  |  8  | 

**Note**  
The ml instance types used by SageMaker training have the same number of GPUs as the corresponding p3 instance types. For example, `ml.p3.8xlarge` has the same number of GPUs as `p3.8xlarge` (four). 

## Scaling from a single instance to multiple instances
<a name="scaling-from-one-instance"></a>

If you want to scale your training even further, you can use more instances. However, you should choose a larger instance type before you add more instances. Review the previous table to see how many GPUs are in each p3 instance type. 

If you have made the jump from a single GPU on a `p3.2xlarge` to four GPUs on a `p3.8xlarge`, but decide that you require more processing power, you may see better performance and incur lower costs if you choose a `p3.16xlarge` before trying to increase instance count. Depending on the libraries you use, keeping your training on a single instance typically yields better performance and lower costs than using multiple instances.

When you are ready to scale the number of instances, you can do so with the SageMaker Python SDK `estimator` class by setting the `instance_count` parameter. For example, you can set `instance_type="ml.p3.16xlarge"` and `instance_count=2`. Instead of the eight GPUs on a single `p3.16xlarge`, you have 16 GPUs across two identical instances. The following chart shows [scaling and throughput starting with eight GPUs](https://aws.amazon.com/blogs/machine-learning/scalable-multi-node-training-with-tensorflow/) on a single instance and increasing to 64 instances for a total of 256 GPUs. 
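
A back-of-the-envelope sketch of the cluster size described above (a hypothetical helper, not a SageMaker API):

```python
# Total GPUs in a homogeneous cluster; p3.16xlarge has 8 GPUs per instance
def total_gpus(instance_count, gpus_per_instance=8):
    return instance_count * gpus_per_instance

print(total_gpus(2))  # two ml.p3.16xlarge instances -> prints: 16
```

With near-linear scaling, the ideal training time on those 16 GPUs would be about 1/16 of the single-GPU time; real jobs fall somewhat short of that ideal because of inter-node communication overhead.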

 ![\[Chart showing how throughput increases and time to train decreases with more GPUs.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/Distributed-Training-in-SageMaker-image.png) 

## Custom training scripts
<a name="custom-training-scripts"></a>

While SageMaker AI makes it simple to deploy and scale the number of instances and GPUs, depending on your framework of choice, managing the data and results can be very challenging, which is why external supporting libraries are often used. This most basic form of distributed training requires modification of your training script to manage the data distribution. 

SageMaker AI also supports Horovod and implementations of distributed training native to each major deep learning framework. If you choose to use examples from these frameworks, you can follow SageMaker AI’s [container guide](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers.html) for Deep Learning Containers, and various [example notebooks](https://sagemaker-examples.readthedocs.io/en/latest/training/bring_your_own_container.html) that demonstrate implementations. 

# Run distributed training with the SageMaker AI distributed data parallelism library
<a name="data-parallel"></a>

The SageMaker AI distributed data parallelism (SMDDP) library extends SageMaker training capabilities on deep learning models with near-linear scaling efficiency by providing implementations of collective communication operations optimized for AWS infrastructure.

When training large machine learning (ML) models, such as large language models (LLM) and diffusion models, on a huge training dataset, ML practitioners use clusters of accelerators and distributed training techniques to reduce the time to train or resolve memory constraints for models that cannot fit in a single GPU's memory. ML practitioners often start with multiple accelerators on a single instance and then scale to clusters of instances as their workload requirements increase. As the cluster size increases, so does the communication overhead between nodes, which leads to a drop in overall computational performance.

To address such overhead and memory problems, the SMDDP library offers the following.
+ The SMDDP library optimizes training jobs for AWS network infrastructure and Amazon SageMaker AI ML instance topology.
+ The SMDDP library improves communication between nodes with implementations of `AllReduce` and `AllGather` collective communication operations that are optimized for AWS infrastructure. 

To learn more about the details of the SMDDP library offerings, proceed to [Introduction to the SageMaker AI distributed data parallelism library](data-parallel-intro.md).

For more information about training with the model-parallel strategy offered by SageMaker AI, see also [(Archived) SageMaker model parallelism library v1.x](model-parallel.md).

**Topics**
+ [Introduction to the SageMaker AI distributed data parallelism library](data-parallel-intro.md)
+ [Supported frameworks, AWS Regions, and instance types](distributed-data-parallel-support.md)
+ [Distributed training with the SageMaker AI distributed data parallelism library](data-parallel-modify-sdp.md)
+ [Amazon SageMaker AI data parallelism library examples](distributed-data-parallel-v2-examples.md)
+ [Configuration tips for the SageMaker AI distributed data parallelism library](data-parallel-config.md)
+ [Amazon SageMaker AI distributed data parallelism library FAQ](data-parallel-faq.md)
+ [Troubleshooting for distributed training in Amazon SageMaker AI](distributed-troubleshooting-data-parallel.md)
+ [SageMaker AI data parallelism library release notes](data-parallel-release-notes.md)

# Introduction to the SageMaker AI distributed data parallelism library
<a name="data-parallel-intro"></a>

The SageMaker AI distributed data parallelism (SMDDP) library is a collective communication library that improves compute performance of distributed data parallel training. The SMDDP library addresses communications overhead of the key collective communication operations by offering the following.

1. The library offers `AllReduce` optimized for AWS. `AllReduce` is a key operation used for synchronizing gradients across GPUs at the end of each training iteration during distributed data training.

1. The library offers `AllGather` optimized for AWS. `AllGather` is another key operation used in sharded data parallel training, which is a memory-efficient data parallelism technique offered by popular libraries such as the SageMaker AI model parallelism (SMP) library, DeepSpeed Zero Redundancy Optimizer (ZeRO), and PyTorch Fully Sharded Data Parallelism (FSDP).

1. The library performs optimized node-to-node communication by fully utilizing AWS network infrastructure and the Amazon EC2 instance topology. 

The SMDDP library can increase training speed as you scale your training cluster, achieving near-linear scaling efficiency.
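To make the idea of near-linear scaling efficiency concrete, the following is an illustrative calculation (the throughput numbers are hypothetical, not benchmark results):

```python
def scaling_efficiency(t_single: float, t_cluster: float, num_nodes: int) -> float:
    """Ratio of achieved speedup to ideal (linear) speedup."""
    speedup = t_single / t_cluster
    return speedup / num_nodes

# Hypothetical numbers: 1 node takes 100 min/epoch, an 8-node cluster 13 min/epoch.
eff = scaling_efficiency(100.0, 13.0, 8)
print(f"{eff:.0%}")  # → 96%
```

An efficiency of 100% would mean the 8-node cluster is exactly 8 times faster than a single node; values close to 100% are what "near-linear" refers to.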

**Note**  
The SageMaker AI distributed training libraries are available through the AWS deep learning containers for PyTorch and Hugging Face within the SageMaker Training platform. To use the libraries, you must use the SageMaker Python SDK or the SageMaker APIs through SDK for Python (Boto3) or AWS Command Line Interface. Throughout the documentation, instructions and examples focus on how to use the distributed training libraries with the SageMaker Python SDK.
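As an illustration of using the libraries through the SageMaker Python SDK, the following is a minimal job-submission sketch. It requires AWS credentials to actually run; the entry-point script name, IAM role, and S3 paths are placeholders, and the `distribution` argument is what enables the SMDDP library.

```python
from sagemaker.pytorch import PyTorch

# Hypothetical job submission; replace the placeholders with your own values.
estimator = PyTorch(
    entry_point="train.py",                  # your adapted training script
    role="<your-sagemaker-execution-role>",
    framework_version="2.2.0",               # a PyTorch version supported by SMDDP
    py_version="py310",
    instance_type="ml.p4d.24xlarge",         # an SMDDP-supported instance type
    instance_count=2,
    # Enables the SMDDP library for this training job.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit("s3://<your-bucket>/<your-training-data>")
```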

## SMDDP collective communication operations optimized for AWS compute resources and network infrastructure
<a name="data-parallel-collective-operations"></a>

The SMDDP library provides implementations of the `AllReduce` and `AllGather` collective operations that are optimized for AWS compute resources and network infrastructure.

### SMDDP `AllReduce` collective operation
<a name="data-parallel-allreduce"></a>

The SMDDP library achieves optimal overlapping of the `AllReduce` operation with the backward pass, significantly improving GPU utilization. It achieves near-linear scaling efficiency and faster training speed by optimizing kernel operations between CPUs and GPUs. The library performs `AllReduce` in parallel while the GPU is computing gradients, without taking away additional GPU cycles, which enables faster training.
+ *Leverages CPUs*: The library uses CPUs to `AllReduce` gradients, offloading this task from the GPUs.
+ *Improved GPU usage*: The cluster’s GPUs focus on computing gradients, improving their utilization throughout training.

The following is the high-level workflow of the SMDDP `AllReduce` operation.

1. The library assigns ranks to GPUs (workers).

1. At each iteration, the library divides each global batch by the total number of workers (world size) and assigns small batches (batch shards) to the workers.
   + The size of the global batch is `(number of nodes in a cluster) * (number of GPUs per node) * (size of each batch shard)`. 
   + A batch shard (small batch) is the subset of the dataset assigned to each GPU (worker) per iteration. 

1. The library launches a training script on each worker.

1. The library manages copies of model weights and gradients from the workers at the end of every iteration.

1. The library synchronizes model weights and gradients across the workers to aggregate a single trained model.
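The batch-division and synchronization steps above can be sketched as a toy simulation in plain Python (illustrative only; the real library operates on GPU tensors and runs one process per worker):

```python
# Toy simulation of one data-parallel iteration across 4 workers.
world_size = 4                                  # (nodes) * (GPUs per node)
global_batch = list(range(16))                  # 16 samples in the global batch

# Step 2: divide the global batch into one shard per worker (rank).
shards = [global_batch[rank::world_size] for rank in range(world_size)]

# Steps 3-4: each worker computes a local "gradient" from its own shard.
# A stand-in float replaces the real per-shard gradient tensor.
local_grads = [sum(shard) / len(shard) for shard in shards]

# Step 5: AllReduce averages the local gradients so that every worker
# applies the same update and ends the iteration with identical weights.
synced_grad = sum(local_grads) / world_size
print(synced_grad)  # → 7.5, the same value on every worker
```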

The following architecture diagram shows an example of how the library sets up data parallelism for a cluster of 3 nodes. 

 

![\[SMDDP AllReduce and data parallelism architecture diagram\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/data-parallel/sdp-architecture.png)


### SMDDP `AllGather` collective operation
<a name="data-parallel-allgather"></a>

`AllGather` is a collective operation where each worker starts with an input buffer, and then concatenates or *gathers* the input buffers from all other workers into an output buffer.
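The semantics can be illustrated with plain Python lists (a toy model of the operation, not the library's implementation):

```python
# Toy illustration of AllGather semantics.
# Each worker starts with its own input buffer; after AllGather, every
# worker holds the concatenation of all input buffers, ordered by rank.
def all_gather(input_buffers):
    output = [x for buf in input_buffers for x in buf]  # concatenate by rank
    return [list(output) for _ in input_buffers]        # every worker gets a copy

workers = [[0, 1], [2, 3], [4, 5]]       # input buffers of ranks 0, 1, 2
print(all_gather(workers)[0])            # → [0, 1, 2, 3, 4, 5] on every rank
```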

**Note**  
The SMDDP `AllGather` collective operation is available in `smdistributed-dataparallel>=2.0.1` and AWS Deep Learning Containers (DLC) for PyTorch v2.0.1 and later.

`AllGather` is heavily used in distributed training techniques such as sharded data parallelism where each individual worker holds a fraction of a model, or a sharded layer. The workers call `AllGather` before forward and backward passes to reconstruct the sharded layers. The forward and backward passes continue onward after the parameters are *all gathered*. During the backward pass, each worker also calls `ReduceScatter` to collect (reduce) gradients and break (scatter) them into gradient shards to update the corresponding sharded layer. For more details on the role of these collective operations in sharded data parallelism, see the [SMP library's implementation of sharded data parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-sharded-data-parallelism.html), [ZeRO](https://deepspeed.readthedocs.io/en/latest/zero3.html#) in the DeepSpeed documentation, and the blog about [PyTorch Fully Sharded Data Parallelism](https://engineering.fb.com/2021/07/15/open-source/fsdp/).
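The `ReduceScatter` step can likewise be sketched with plain Python (a toy model of the operation's semantics):

```python
# Toy illustration of ReduceScatter: sum (reduce) the workers' gradient
# buffers elementwise, then scatter one shard of the result to each rank.
def reduce_scatter(grad_buffers):
    world_size = len(grad_buffers)
    reduced = [sum(vals) for vals in zip(*grad_buffers)]   # elementwise reduce
    shard_len = len(reduced) // world_size
    return [reduced[r * shard_len:(r + 1) * shard_len]     # rank r's shard
            for r in range(world_size)]

grads = [[1, 1, 1, 1], [2, 2, 2, 2]]     # two ranks, four gradient elements
print(reduce_scatter(grads))             # → [[3, 3], [3, 3]]
```

Each rank keeps only the gradient shard it needs to update its own sharded layer, which is what makes sharded data parallelism memory-efficient.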

Because collective operations like `AllGather` are called in every iteration, they are the main contributors to GPU communication overhead. Faster computation of these collective operations directly translates to a shorter training time with no side effects on convergence. To achieve this, the SMDDP library offers `AllGather` optimized for [P4d instances](https://aws.amazon.com/ec2/instance-types/p4/).

SMDDP `AllGather` uses the following techniques to improve computational performance on P4d instances.

1. It transfers data between instances (inter-node) through the [Elastic Fabric Adapter (EFA)](https://aws.amazon.com/hpc/efa/) network with a mesh topology. EFA is the AWS low-latency and high-throughput network solution. A mesh topology for inter-node network communication is more tailored to the characteristics of EFA and AWS network infrastructure. Compared to the NCCL ring or tree topology that involves multiple packet hops, SMDDP avoids accumulating latency from multiple hops as it only needs one hop. SMDDP implements a network rate control algorithm that balances the workload to each communication peer in a mesh topology and achieves a higher global network throughput.

1. It adopts [low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology (GDRCopy)](https://github.com/NVIDIA/gdrcopy) to coordinate local NVLink and EFA network traffic. GDRCopy, a low-latency GPU memory copy library offered by NVIDIA, provides low-latency communication between CPU processes and GPU CUDA kernels. With this technology, the SMDDP library is able to pipeline the intra-node and inter-node data movement.

1. It reduces the usage of GPU streaming multiprocessors to increase compute power for running model kernels. P4d and P4de instances are equipped with NVIDIA A100 GPUs, which each have 108 streaming multiprocessors. While NCCL takes up to 24 streaming multiprocessors to run collective operations, SMDDP uses fewer than 9 streaming multiprocessors. Model compute kernels pick up the saved streaming multiprocessors for faster computation.
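The single-hop advantage of the mesh topology described in the first technique can be made concrete with a rough hop count (a simplification for intuition only; real NCCL topologies and pipelining vary):

```python
# Rough per-message hop counts (a simplification for intuition only).
# Mesh: every pair of nodes communicates over a direct EFA path -- one hop.
# Ring: a message forwarded around a ring of n nodes crosses up to n - 1 links.
def mesh_hops(n: int) -> int:
    return 1

def ring_worst_case_hops(n: int) -> int:
    return n - 1

for n in (4, 16, 64):
    print(n, mesh_hops(n), ring_worst_case_hops(n))
```

The latency gap grows with cluster size, which is why avoiding multi-hop forwarding matters more as you scale out.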

# Supported frameworks, AWS Regions, and instance types
<a name="distributed-data-parallel-support"></a>

Before using the SageMaker AI distributed data parallelism (SMDDP) library, check which ML frameworks and instance types are supported, and whether your AWS account has sufficient quotas in your AWS Region.

## Supported frameworks
<a name="distributed-data-parallel-supported-frameworks"></a>

The following tables show the deep learning frameworks and their versions that SageMaker AI and SMDDP support. The SMDDP library is available in [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only), integrated in [Docker containers distributed by the SageMaker model parallelism (SMP) library v2](distributed-model-parallel-support-v2.md#distributed-model-parallel-supported-frameworks-v2), or downloadable as a binary file.

**Note**  
To check the latest updates and release notes of the SMDDP library, see the [SageMaker AI data parallelism library release notes](data-parallel-release-notes.md).

**Topics**
+ [PyTorch](#distributed-data-parallel-supported-frameworks-pytorch)
+ [PyTorch Lightning](#distributed-data-parallel-supported-frameworks-lightning)
+ [Hugging Face Transformers](#distributed-data-parallel-supported-frameworks-transformers)
+ [TensorFlow (deprecated)](#distributed-data-parallel-supported-frameworks-tensorflow)

### PyTorch
<a name="distributed-data-parallel-supported-frameworks-pytorch"></a>


| PyTorch version | SMDDP library version | SageMaker AI Framework Container images pre-installed with SMDDP | SMP Docker images pre-installed with SMDDP | URL of the binary file\*\* | 
| --- | --- | --- | --- | --- | 
| v2.4.1 | smdistributed-dataparallel==v2.5.0 | Not available | 658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.4.1/cu121/2024-10-09/smdistributed\_dataparallel-2.5.0-cp311-cp311-linux\_x86\_64.whl | 
| v2.3.0 | smdistributed-dataparallel==v2.3.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker | Currently not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed\_dataparallel-2.3.0-cp311-cp311-linux\_x86\_64.whl | 
| v2.2.0 | smdistributed-dataparallel==v2.2.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.2.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed\_dataparallel-2.2.0-cp310-cp310-linux\_x86\_64.whl | 
| v2.1.0 | smdistributed-dataparallel==v2.1.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed\_dataparallel-2.1.0-cp310-cp310-linux\_x86\_64.whl | 
| v2.0.1 | smdistributed-dataparallel==v2.0.1 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed\_dataparallel-2.0.2-cp310-cp310-linux\_x86\_64.whl | 
| v2.0.0 | smdistributed-dataparallel==v1.8.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.0/cu118/2023-03-20/smdistributed\_dataparallel-1.8.0-cp310-cp310-linux\_x86\_64.whl | 
| v1.13.1 | smdistributed-dataparallel==v1.7.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.13.1/cu117/2023-01-09/smdistributed\_dataparallel-1.7.0-cp39-cp39-linux\_x86\_64.whl | 
| v1.12.1 | smdistributed-dataparallel==v1.6.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.1/cu113/2022-12-05/smdistributed\_dataparallel-1.6.0-cp38-cp38-linux\_x86\_64.whl | 
| v1.12.0 | smdistributed-dataparallel==v1.5.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed\_dataparallel-1.5.0-cp38-cp38-linux\_x86\_64.whl | 
| v1.11.0 | smdistributed-dataparallel==v1.4.1 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.11.0/cu113/2022-04-14/smdistributed\_dataparallel-1.4.1-cp38-cp38-linux\_x86\_64.whl | 

\*\* The URLs of the binary files are for installing the SMDDP library in custom containers. For more information, see [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md).

**Note**  
The SMDDP library is available in AWS Regions where the [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) and the [SMP Docker images](distributed-model-parallel-support-v2.md) are in service.

**Note**  
The SMDDP library v1.4.0 and later works as a backend of PyTorch distributed (torch.distributed) data parallelism (torch.nn.parallel.DistributedDataParallel). In accordance with the change, the following [smdistributed APIs](https://sagemaker.readthedocs.io/en/stable/api/training/sdp_versions/latest/smd_data_parallel_pytorch.html#pytorch-api) for the PyTorch distributed package have been deprecated.  
`smdistributed.dataparallel.torch.distributed` is deprecated. Use the [torch.distributed](https://pytorch.org/docs/stable/distributed.html) package instead.
`smdistributed.dataparallel.torch.parallel.DistributedDataParallel` is deprecated. Use the [torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) API instead.
If you need to use the previous versions of the library (v1.3.0 or before), see the [archived SageMaker AI distributed data parallelism documentation](https://sagemaker.readthedocs.io/en/stable/api/training/sdp_versions/latest.html#documentation-archive) in the *SageMaker AI Python SDK documentation*.

### PyTorch Lightning
<a name="distributed-data-parallel-supported-frameworks-lightning"></a>

The SMDDP library is available for PyTorch Lightning in the following SageMaker AI Framework Containers for PyTorch and the SMP Docker containers.

**PyTorch Lightning v2**


| PyTorch Lightning version | PyTorch version | SMDDP library version | SageMaker AI Framework Container images pre-installed with SMDDP | SMP Docker images pre-installed with SMDDP | URL of the binary file\*\* | 
| --- | --- | --- | --- | --- | --- | 
| 2.2.5 | 2.3.0 | smdistributed-dataparallel==v2.3.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker | Currently not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed\_dataparallel-2.3.0-cp311-cp311-linux\_x86\_64.whl | 
| 2.2.0 | 2.2.0 | smdistributed-dataparallel==v2.2.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.2.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed\_dataparallel-2.2.0-cp310-cp310-linux\_x86\_64.whl | 
| 2.1.2 | 2.1.0 | smdistributed-dataparallel==v2.1.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker | 658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121 | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed\_dataparallel-2.1.0-cp310-cp310-linux\_x86\_64.whl | 
| 2.1.0 | 2.0.1 | smdistributed-dataparallel==v2.0.1 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker | Not available | https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed\_dataparallel-2.0.2-cp310-cp310-linux\_x86\_64.whl | 

**PyTorch Lightning v1**


| PyTorch Lightning version | PyTorch version | SMDDP library version | SageMaker AI Framework Container images pre-installed with SMDDP | URL of the binary file\*\* | 
| --- | --- | --- | --- | --- | 
| 1.7.2, 1.7.0, 1.6.4, 1.6.3, 1.5.10 | 1.12.0 | smdistributed-dataparallel==v1.5.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker | https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.12.0/cu113/2022-07-01/smdistributed\_dataparallel-1.5.0-cp38-cp38-linux\_x86\_64.whl | 

\*\* The URLs of the binary files are for installing the SMDDP library in custom containers. For more information, see [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md).

**Note**  
PyTorch Lightning and its utility libraries such as Lightning Bolts are not preinstalled in the PyTorch DLCs. When you construct a SageMaker AI PyTorch estimator and submit a training job request in [Step 2](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-framework-estimator), you need to provide `requirements.txt` to install `pytorch-lightning` and `lightning-bolts` in the SageMaker AI PyTorch training container.  

```
# requirements.txt
pytorch-lightning
lightning-bolts
```
For more information about specifying the source directory to place the `requirements.txt` file along with your training script and a job submission, see [Using third-party libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#id12) in the *Amazon SageMaker AI Python SDK documentation*.

### Hugging Face Transformers
<a name="distributed-data-parallel-supported-frameworks-transformers"></a>

The AWS Deep Learning Containers for Hugging Face use the SageMaker Training Containers for PyTorch and TensorFlow as their base images. To look up the Hugging Face Transformers library versions and paired PyTorch and TensorFlow versions, see the latest [Hugging Face Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers) and the [Prior Hugging Face Container Versions](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#prior-hugging-face-container-versions).

### TensorFlow (deprecated)
<a name="distributed-data-parallel-supported-frameworks-tensorflow"></a>

**Important**  
The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow later than v2.11.0. The following table lists previous DLCs for TensorFlow with the SMDDP library installed.


| TensorFlow version | SMDDP library version | 
| --- | --- | 
| 2.9.1, 2.10.1, 2.11.0 |  smdistributed-dataparallel==v1.4.1  | 
| 2.8.3 |  smdistributed-dataparallel==v1.3.0  | 

## AWS Regions
<a name="distributed-data-parallel-availablity-zone"></a>

The SMDDP library is available in all of the AWS Regions where the [AWS Deep Learning Containers for SageMaker AI](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) and the [SMP Docker images](distributed-model-parallel-support-v2.md) are in service.

## Supported instance types
<a name="distributed-data-parallel-supported-instance-types"></a>

The SMDDP library requires one of the following instance types.


| Instance type | 
| --- | 
| ml.p3dn.24xlarge\* | 
| ml.p4d.24xlarge | 
| ml.p4de.24xlarge | 

**Tip**  
To properly run distributed training on the EFA-enabled instance types, you should enable traffic between the instances by setting up the security group of your VPC to allow all inbound and outbound traffic to and from the security group itself. To learn how to set up the security group rules, see [Step 1: Prepare an EFA-enabled security group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security) in the *Amazon EC2 User Guide*.

**Important**  
\* The SMDDP library has discontinued support for optimizing its collective communication operations on P3 instances. While you can still utilize the SMDDP optimized `AllReduce` collective on `ml.p3dn.24xlarge` instances, there will be no further development support to enhance performance on this instance type. Note that the SMDDP optimized `AllGather` collective is only available for P4 instances.

For specs of the instance types, see the **Accelerated Computing** section in the [Amazon EC2 Instance Types page](https://aws.amazon.com/ec2/instance-types/). For information about instance pricing, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

If you encounter an error message similar to the following, follow the instructions at [Request a service quota increase for SageMaker AI resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure).

```
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling
the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge
for training job usage' is 0 Instances, with current utilization of 0 Instances
and a request delta of 1 Instances.
Please contact AWS support to request an increase for this limit.
```

# Distributed training with the SageMaker AI distributed data parallelism library
<a name="data-parallel-modify-sdp"></a>

The SageMaker AI distributed data parallelism (SMDDP) library is designed for ease of use and to provide seamless integration with PyTorch.

When training a deep learning model with the SMDDP library on SageMaker AI, you can focus on writing your training script and model training. 

To get started, import the SMDDP library to use its collective operations optimized for AWS. The following topics provide instructions on what to add to your training script depending on which collective operation you want to optimize.

**Topics**
+ [Adapting your training script to use the SMDDP collective operations](data-parallel-modify-sdp-select-framework.md)
+ [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md)

# Adapting your training script to use the SMDDP collective operations
<a name="data-parallel-modify-sdp-select-framework"></a>

The training script examples provided in this section are simplified and highlight only the required changes to enable the SageMaker AI distributed data parallelism (SMDDP) library in your training script. For end-to-end Jupyter notebook examples that demonstrate how to run a distributed training job with the SMDDP library, see [Amazon SageMaker AI data parallelism library examples](distributed-data-parallel-v2-examples.md).

**Topics**
+ [Use the SMDDP library in your PyTorch training script](data-parallel-modify-sdp-pt.md)
+ [Use the SMDDP library in your PyTorch Lightning training script](data-parallel-modify-sdp-pt-lightning.md)
+ [Use the SMDDP library in your TensorFlow training script (deprecated)](data-parallel-modify-sdp-tf2.md)

# Use the SMDDP library in your PyTorch training script
<a name="data-parallel-modify-sdp-pt"></a>

Starting from the SageMaker AI distributed data parallelism (SMDDP) library v1.4.0, you can use the library as a backend option for the [PyTorch distributed package](https://pytorch.org/tutorials/beginner/dist_overview.html). To use the SMDDP `AllReduce` and `AllGather` collective operations, you only need to import the SMDDP library at the beginning of your training script and set SMDDP as the backend of PyTorch distributed modules during process group initialization. With this single line of backend specification, you can keep all the native PyTorch distributed modules and the entire training script unchanged. The following code snippets show how to use the SMDDP library as the backend of PyTorch-based distributed training packages: [PyTorch distributed data parallel (DDP)](https://pytorch.org/docs/stable/notes/ddp.html), [PyTorch fully sharded data parallelism (FSDP)](https://pytorch.org/docs/stable/fsdp.html), [DeepSpeed](https://github.com/microsoft/DeepSpeed), and [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed).

## For PyTorch DDP or FSDP
<a name="data-parallel-enable-for-ptddp-ptfsdp"></a>

Initialize the process group as follows.

```
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp

dist.init_process_group(backend="smddp")
```

**Note**  
(For PyTorch DDP jobs only) The `smddp` backend currently does not support creating subprocess groups with the `torch.distributed.new_group()` API. You also cannot use the `smddp` backend concurrently with other process group backends such as `NCCL` and `Gloo`.

## For DeepSpeed or Megatron-DeepSpeed
<a name="data-parallel-enable-for-deepspeed"></a>

Initialize the process group as follows.

```
import deepspeed
import smdistributed.dataparallel.torch.torch_smddp

deepspeed.init_distributed(dist_backend="smddp")
```

**Note**  
To use SMDDP `AllGather` with the `mpirun`-based launchers (`smdistributed` and `pytorchddp`) in [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md), you also need to set the following environment variable in your training script.  

```
export SMDATAPARALLEL_OPTIMIZE_SDP=true
```

For general guidance on writing a PyTorch FSDP training script, see [Advanced Model Training with Fully Sharded Data Parallel (FSDP)](https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html) in the PyTorch documentation.

For general guidance on writing a PyTorch DDP training script, see [Getting started with distributed data parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) in the PyTorch documentation.

After you have completed adapting your training script, proceed to [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md).

# Use the SMDDP library in your PyTorch Lightning training script
<a name="data-parallel-modify-sdp-pt-lightning"></a>

If you want to bring your [PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/latest/starter/introduction.html) training script and run a distributed data parallel training job in SageMaker AI, you can run the training job with minimal changes to your training script. The necessary changes include the following: import the `smdistributed.dataparallel` library’s PyTorch modules, set up the environment variables for PyTorch Lightning to accept the SageMaker AI environment variables preset by the SageMaker training toolkit, and activate the SMDDP library by setting the process group backend to `"smddp"`. To learn more, walk through the following instructions, which break down the steps with code examples.

**Note**  
The PyTorch Lightning support is available in the SageMaker AI data parallel library v1.5.0 and later.

## PyTorch Lightning == v2.1.0 and PyTorch == 2.0.1
<a name="smddp-pt-201-lightning-210"></a>

1. Import the PyTorch Lightning library (`lightning`) and the `smdistributed.dataparallel.torch` modules.

   ```
   import lightning as pl
   import smdistributed.dataparallel.torch.torch_smddp
   ```

1. Instantiate the [LightningEnvironment](https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.plugins.environments.LightningEnvironment.html).

   ```
   import os
   
   from lightning.fabric.plugins.environments.lightning import LightningEnvironment
   
   env = LightningEnvironment()
   env.world_size = lambda: int(os.environ["WORLD_SIZE"])
   env.global_rank = lambda: int(os.environ["RANK"])
   ```

1. **For PyTorch DDP** – Create an object of the [DDPStrategy](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.DDPStrategy.html) class with `"smddp"` for `process_group_backend` and `"gpu"` for `accelerator`, and pass that to the [Trainer](https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html) class.

   ```
   import lightning as pl
   from lightning.pytorch.strategies import DDPStrategy
   
   ddp = DDPStrategy(
       cluster_environment=env, 
       process_group_backend="smddp", 
       accelerator="gpu"
   )
   
   trainer = pl.Trainer(
       max_epochs=200, 
       strategy=ddp, 
       devices=num_gpus, 
       num_nodes=num_nodes
   )
   ```

   **For PyTorch FSDP** – Create an object of the [FSDPStrategy](https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.strategies.FSDPStrategy.html) class (with [wrapping policy](https://pytorch.org/docs/stable/fsdp.html) of choice) with `"smddp"` for `process_group_backend` and `"gpu"` for `accelerator`, and pass that to the [Trainer](https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html) class.

   ```
   import lightning as pl
   from lightning.pytorch.strategies import FSDPStrategy
   
   from functools import partial
   from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
   
   policy = partial(
       size_based_auto_wrap_policy, 
       min_num_params=10000
   )
   
   fsdp = FSDPStrategy(
       auto_wrap_policy=policy,
       process_group_backend="smddp", 
       cluster_environment=env
   )
   
   trainer = pl.Trainer(
       max_epochs=200, 
       strategy=fsdp, 
       devices=num_gpus, 
       num_nodes=num_nodes
   )
   ```

After you have completed adapting your training script, proceed to [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md). 

**Note**  
When you construct a SageMaker AI PyTorch estimator and submit a training job request in [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md), you need to provide `requirements.txt` to install `pytorch-lightning` and `lightning-bolts` in the SageMaker AI PyTorch training container.  

```
# requirements.txt
pytorch-lightning
lightning-bolts
```
For more information about specifying the source directory to place the `requirements.txt` file along with your training script and a job submission, see [Using third-party libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#id12) in the *Amazon SageMaker AI Python SDK documentation*.

# Use the SMDDP library in your TensorFlow training script (deprecated)
<a name="data-parallel-modify-sdp-tf2"></a>

**Important**  
The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow later than v2.11.0. To find previous TensorFlow DLCs with the SMDDP library installed, see [Supported frameworks](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks).

The following steps show you how to modify a TensorFlow training script to utilize SageMaker AI's distributed data parallel library.  

The library APIs are designed to be similar to Horovod APIs. For additional details on each API that the library offers for TensorFlow, see the [SageMaker AI distributed data parallel TensorFlow API documentation](https://sagemaker.readthedocs.io/en/stable/api/training/smd_data_parallel.html#api-documentation).

**Note**  
SageMaker AI distributed data parallel is adaptable to TensorFlow training scripts composed of `tf` core modules except `tf.keras` modules. SageMaker AI distributed data parallel does not support the Keras implementation of TensorFlow.

**Note**  
The SageMaker AI distributed data parallelism library supports Automatic Mixed Precision (AMP) out of the box. No extra action is needed to enable AMP other than the framework-level modifications to your training script. If gradients are in FP16, the SageMaker AI data parallelism library runs its `AllReduce` operation in FP16. For more information about implementing AMP APIs to your training script, see the following resources:  
[Frameworks - TensorFlow](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#tensorflow) in the *NVIDIA Deep Learning Performance documentation*
[Automatic Mixed Precision for Deep Learning](https://developer.nvidia.com/automatic-mixed-precision) in the *NVIDIA Developer Docs*
[TensorFlow mixed precision APIs](https://www.tensorflow.org/guide/mixed_precision) in the *TensorFlow documentation*

1. Import the library's TensorFlow client and initialize it.

   ```
   import smdistributed.dataparallel.tensorflow as sdp 
   sdp.init()
   ```

1. Pin each GPU to a single `smdistributed.dataparallel` process with `local_rank`, which refers to the relative rank of the process within a given node. The `sdp.tensorflow.local_rank()` API provides you with the local rank of the device. The leader node is rank 0, and the worker nodes are rank 1, 2, 3, and so on. This is invoked in the following code block as `sdp.local_rank()`. `set_memory_growth` is not directly related to the SageMaker AI distributed library, but must be set for distributed training with TensorFlow. 

   ```
   gpus = tf.config.experimental.list_physical_devices('GPU')
   for gpu in gpus:
       tf.config.experimental.set_memory_growth(gpu, True)
   if gpus:
       tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')
   ```

1. Scale the learning rate by the number of workers. The `sdp.tensorflow.size()` API provides you the number of workers in the cluster. This is invoked in the following code block as `sdp.size()`. 

   ```
   learning_rate = learning_rate * sdp.size()
   ```

1. Use the library’s `DistributedGradientTape` to optimize `AllReduce` operations during training. This wraps `tf.GradientTape`.  

   ```
   with tf.GradientTape() as tape:
       output = model(input)
       loss_value = loss(label, output)

   # SageMaker AI data parallel: Wrap tf.GradientTape with the library's DistributedGradientTape
   tape = sdp.DistributedGradientTape(tape)
   ```

1. Broadcast the initial model variables from the leader node (rank 0) to all the worker nodes (ranks 1 through n). This is needed to ensure a consistent initialization across all the worker ranks. Use the `sdp.tensorflow.broadcast_variables` API after the model and optimizer variables are initialized. This is invoked in the following code block as `sdp.broadcast_variables()`. 

   ```
   sdp.broadcast_variables(model.variables, root_rank=0)
   sdp.broadcast_variables(opt.variables(), root_rank=0)
   ```

1. Finally, modify your script to save checkpoints only on the leader node. The leader node has a synchronized model. This also prevents worker nodes from overwriting and possibly corrupting the checkpoints. 

   ```
   if sdp.rank() == 0:
       checkpoint.save(checkpoint_dir)
   ```

The following is an example TensorFlow training script for distributed training with the library.

```
import tensorflow as tf

# SageMaker AI data parallel: Import the library TF API
import smdistributed.dataparallel.tensorflow as sdp

# SageMaker AI data parallel: Initialize the library
sdp.init()

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    # SageMaker AI data parallel: Pin GPUs to a single library process
    tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')

# Prepare Dataset
dataset = tf.data.Dataset.from_tensor_slices(...)

# Define Model
mnist_model = tf.keras.Sequential(...)
loss = tf.losses.SparseCategoricalCrossentropy()

# SageMaker AI data parallel: Scale Learning Rate
# LR for 8 node run : 0.000125
# LR for single node run : 0.001
opt = tf.optimizers.Adam(0.000125 * sdp.size())

@tf.function
def training_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        probs = mnist_model(images, training=True)
        loss_value = loss(labels, probs)

    # SageMaker AI data parallel: Wrap tf.GradientTape with the library's DistributedGradientTape
    tape = sdp.DistributedGradientTape(tape)

    grads = tape.gradient(loss_value, mnist_model.trainable_variables)
    opt.apply_gradients(zip(grads, mnist_model.trainable_variables))

    if first_batch:
       # SageMaker AI data parallel: Broadcast model and optimizer variables
       sdp.broadcast_variables(mnist_model.variables, root_rank=0)
       sdp.broadcast_variables(opt.variables(), root_rank=0)

    return loss_value

...

# SageMaker AI data parallel: Save checkpoints only from the leader node.
if sdp.rank() == 0:
    checkpoint.save(checkpoint_dir)
```
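Conceptually, `DistributedGradientTape` averages the gradients computed on each worker with an `AllReduce` operation before they are applied, so every rank performs an identical update. The following plain-Python sketch illustrates only that averaging semantics; it is not the library's implementation:

```python
# Illustrative sketch of gradient averaging across workers (what an
# AllReduce-based data parallel step achieves conceptually). Each inner
# list holds one worker's gradients for the model's parameters.
def allreduce_average(grads_per_worker):
    n_workers = len(grads_per_worker)
    n_params = len(grads_per_worker[0])
    # Sum each parameter's gradient across workers, then divide by the
    # number of workers so all ranks apply the same averaged update.
    return [
        sum(worker[i] for worker in grads_per_worker) / n_workers
        for i in range(n_params)
    ]

# Two workers, two parameters:
avg = allreduce_average([[1.0, 2.0], [3.0, 6.0]])
# → [2.0, 4.0]
```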

After you have completed adapting your training script, move on to [Launching distributed training jobs with SMDDP using the SageMaker Python SDK](data-parallel-use-api.md). 

# Launching distributed training jobs with SMDDP using the SageMaker Python SDK
<a name="data-parallel-use-api"></a>

To run a distributed training job with your adapted script from [Adapting your training script to use the SMDDP collective operations](data-parallel-modify-sdp-select-framework.md), use the SageMaker Python SDK's framework or generic estimators by specifying the prepared training script as an entry point script and the distributed training configuration.

This page walks you through how to use the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/api/training/index.html) in two ways.
+ If you want to quickly get your distributed training job running in SageMaker AI, configure a SageMaker AI [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#sagemaker.pytorch.estimator.PyTorch) or [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) framework estimator class. The framework estimator picks up your training script and automatically matches the right image URI of the [pre-built PyTorch or TensorFlow Deep Learning Containers (DLC)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only), given the value specified for the `framework_version` parameter.
+ If you want to extend one of the pre-built containers or build a custom container to create your own ML environment with SageMaker AI, use the SageMaker AI generic `Estimator` class and specify the image URI of the custom Docker container hosted in your Amazon Elastic Container Registry (Amazon ECR).

Your training datasets should be stored in Amazon S3 or [Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html) in the AWS Region in which you are launching your training job. If you use Jupyter notebooks, you should have a SageMaker notebook instance or a SageMaker Studio Classic app running in the same AWS Region. For more information about storing your training data, see the [SageMaker Python SDK data inputs](https://sagemaker.readthedocs.io/en/stable/overview.html#use-file-systems-as-training-input) documentation. 

**Tip**  
We recommend that you use Amazon FSx for Lustre instead of Amazon S3 to improve training performance. Amazon FSx has higher throughput and lower latency than Amazon S3.
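
If you use the low-level `CreateTrainingJob` API (for example, through boto3's `create_training_job`), an FSx for Lustre dataset is specified through a `FileSystemDataSource` entry in `InputDataConfig`. The following sketch shows the shape of such a channel; the file system ID and directory path are placeholders:

```python
# Sketch of an InputDataConfig channel backed by Amazon FSx for Lustre,
# as accepted by the CreateTrainingJob API. IDs and paths are placeholders.
train_channel = {
    "ChannelName": "training",
    "DataSource": {
        "FileSystemDataSource": {
            "FileSystemId": "fs-0123456789abcdef0",  # your FSx file system ID
            "FileSystemType": "FSxLustre",
            "DirectoryPath": "/fsx/training-data",   # path within the file system
            "FileSystemAccessMode": "ro",            # read-only access for training
        }
    },
}
```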

**Tip**  
To properly run distributed training on the EFA-enabled instance types, you should enable traffic between the instances by setting up the security group of your VPC to allow all inbound and outbound traffic to and from the security group itself. To learn how to set up the security group rules, see [Step 1: Prepare an EFA-enabled security group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security) in the *Amazon EC2 User Guide*.
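
For reference, the self-referencing rule that the tip above describes has the following shape when expressed as an EC2 `IpPermissions` entry (as used by the `AuthorizeSecurityGroupIngress` and `AuthorizeSecurityGroupEgress` API calls; the group ID is a placeholder):

```python
# Self-referencing security group rule for EFA traffic: allow all protocols
# and ports to and from members of the same security group.
security_group_id = "sg-0123456789abcdef0"  # placeholder

self_referencing_rule = [
    {
        "IpProtocol": "-1",  # -1 means all protocols and all ports
        "UserIdGroupPairs": [{"GroupId": security_group_id}],
    }
]
# Pass this list as IpPermissions to both AuthorizeSecurityGroupIngress
# and AuthorizeSecurityGroupEgress for the same security group.
```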

Choose one of the following topics for instructions on how to run a distributed training job of your training script. After you launch a training job, you can monitor system utilization and model performance using [Amazon SageMaker Debugger](train-debugger.md) or Amazon CloudWatch.

While following the instructions in these topics to learn the technical details, we also recommend that you try the [Amazon SageMaker AI data parallelism library examples](distributed-data-parallel-v2-examples.md) to get started.

**Topics**
+ [Use the PyTorch framework estimators in the SageMaker Python SDK](data-parallel-framework-estimator.md)
+ [Use the SageMaker AI generic estimator to extend pre-built DLC containers](data-parallel-use-python-skd-api.md)
+ [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md)

# Use the PyTorch framework estimators in the SageMaker Python SDK
<a name="data-parallel-framework-estimator"></a>

You can launch distributed training by adding the `distribution` argument to the SageMaker AI framework estimators, [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html#sagemaker.pytorch.estimator.PyTorch) or [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator). For more details, choose one of the frameworks supported by the SageMaker AI distributed data parallelism (SMDDP) library from the following selections.

------
#### [ PyTorch ]

The following launcher options are available for launching PyTorch distributed training.
+ `pytorchddp` – This option runs `mpirun` and sets up environment variables needed for running PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the `distribution` parameter.

  ```
  { "pytorchddp": { "enabled": True } }
  ```
+ `torch_distributed` – This option runs `torchrun` and sets up environment variables needed for running PyTorch distributed training on SageMaker AI. To use this option, pass the following dictionary to the `distribution` parameter.

  ```
  { "torch_distributed": { "enabled": True } }
  ```
+ `smdistributed` – This option also runs `mpirun`, but with `smddprun`, which sets up the environment variables needed for running PyTorch distributed training on SageMaker AI.

  ```
  { "smdistributed": { "dataparallel": { "enabled": True } } }
  ```

If you choose to replace NCCL `AllGather` with SMDDP `AllGather`, you can use any of the three options. Choose the one that best fits your use case.

If you choose to replace NCCL `AllReduce` with SMDDP `AllReduce`, you should use one of the `mpirun`-based options, `smdistributed` or `pytorchddp`. You can also add MPI options as follows.

```
{ 
    "pytorchddp": {
        "enabled": True, 
        "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
    }
}
```

```
{ 
    "smdistributed": { 
        "dataparallel": {
            "enabled": True, 
            "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"
        }
    }
}
```

The following code sample shows the basic structure of a PyTorch estimator with distributed training options.

```
from sagemaker.pytorch import PyTorch

pt_estimator = PyTorch(
    base_job_name="training_job_name_prefix",
    source_dir="subdirectory-to-your-code",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    py_version="py310",
    framework_version="2.0.1",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker AI data parallel library: 
    # ml.p4d.24xlarge, ml.p4de.24xlarge
    instance_type="ml.p4d.24xlarge",

    # Activate distributed training with SMDDP
    distribution={ "pytorchddp": { "enabled": True } }  # mpirun, activates SMDDP AllReduce OR AllGather
    # distribution={ "torch_distributed": { "enabled": True } }  # torchrun, activates SMDDP AllGather
    # distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }  # mpirun, activates SMDDP AllReduce OR AllGather
)

pt_estimator.fit("s3://bucket/path/to/training/data")
```

**Note**  
PyTorch Lightning and its utility libraries such as Lightning Bolts are not preinstalled in the SageMaker AI PyTorch DLCs. Create the following `requirements.txt` file and save it in the source directory where you keep your training script.  

```
# requirements.txt
pytorch-lightning
lightning-bolts
```
For example, your directory tree should look like the following.  

```
├── pytorch_training_launcher_jupyter_notebook.ipynb
└── sub-folder-for-your-code
    ├──  adapted-training-script.py
    └──  requirements.txt
```
For more information about specifying the source directory to place the `requirements.txt` file along with your training script and a job submission, see [Using third-party libraries](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#id12) in the *Amazon SageMaker AI Python SDK documentation*.

**Considerations for activating SMDDP collective operations and using the right distributed training launcher options**
+ SMDDP `AllReduce` and SMDDP `AllGather` are not mutually compatible at present.
+ SMDDP `AllReduce` is activated by default when you use the `mpirun`-based launchers, `smdistributed` or `pytorchddp`; in this case, `AllGather` uses NCCL.
+ SMDDP `AllGather` is activated by default when you use the `torch_distributed` launcher; in this case, `AllReduce` falls back to NCCL.
+ SMDDP `AllGather` can also be activated when using the `mpirun`-based launchers with an additional environment variable set as follows.

  ```
  export SMDATAPARALLEL_OPTIMIZE_SDP=true
  ```
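
The considerations above can be condensed into a small helper that builds the `distribution` dictionary and, for the `mpirun`-based launchers, the environment variable that opts into SMDDP `AllGather` (which you can pass through the estimator's `environment` parameter). This is an illustrative sketch, not part of the SageMaker SDK:

```python
# Illustrative helper that maps a launcher choice to the estimator's
# distribution configuration and any extra environment variables, following
# the SMDDP launcher considerations above.
def smddp_launch_config(launcher, optimize_allgather=False):
    if launcher == "torch_distributed":
        # torchrun-based: SMDDP AllGather by default; AllReduce falls back to NCCL.
        return {"torch_distributed": {"enabled": True}}, {}
    if launcher == "pytorchddp":
        distribution = {"pytorchddp": {"enabled": True}}
    elif launcher == "smdistributed":
        distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}
    else:
        raise ValueError(f"Unknown launcher: {launcher}")
    # mpirun-based: SMDDP AllReduce by default; set SMDATAPARALLEL_OPTIMIZE_SDP
    # to activate SMDDP AllGather instead.
    env = {"SMDATAPARALLEL_OPTIMIZE_SDP": "true"} if optimize_allgather else {}
    return distribution, env

distribution, environment = smddp_launch_config("smdistributed", optimize_allgather=True)
```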

------
#### [ TensorFlow ]

**Important**  
The SMDDP library discontinued support for TensorFlow and is no longer available in DLCs for TensorFlow later than v2.11.0. To find previous TensorFlow DLCs with the SMDDP library installed, see [TensorFlow (deprecated)](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks-tensorflow).

```
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
    base_job_name = "training_job_name_prefix",
    entry_point="adapted-training-script.py",
    role="SageMakerRole",
    framework_version="2.11.0",
    py_version="py38",

    # For running a multi-node distributed training job, specify a value greater than 1
    # Example: 2,3,4,..8
    instance_count=2,

    # Instance types supported by the SageMaker AI data parallel library: 
    # ml.p4d.24xlarge, ml.p3dn.24xlarge, and ml.p3.16xlarge
    instance_type="ml.p3.16xlarge",

    # Training using the SageMaker AI data parallel distributed training strategy
    distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }
)

tf_estimator.fit("s3://bucket/path/to/training/data")
```

------

# Use the SageMaker AI generic estimator to extend pre-built DLC containers
<a name="data-parallel-use-python-skd-api"></a>

You can customize SageMaker AI prebuilt containers or extend them to handle any additional functional requirements for your algorithm or model that the prebuilt SageMaker AI Docker image doesn't support. For an example of how you can extend a pre-built container, see [Extend a Prebuilt Container](https://docs.aws.amazon.com/sagemaker/latest/dg/prebuilt-containers-extend.html).

To extend a prebuilt container or adapt your own container to use the library, you must use one of the images listed in [Supported frameworks](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks).

**Note**  
Starting with TensorFlow 2.4.1 and PyTorch 1.8.1, SageMaker AI framework DLCs support EFA-enabled instance types. We recommend that you use DLC images that contain TensorFlow 2.4.1 or later or PyTorch 1.8.1 or later. 

For example, if you use PyTorch, your Dockerfile should contain a `FROM` statement similar to the following:

```
# SageMaker AI PyTorch image
FROM 763104351884.dkr.ecr.<aws-region>.amazonaws.com/pytorch-training:<image-tag>

ENV PATH="/opt/ml/code:${PATH}"

# this environment variable is used by the SageMaker AI PyTorch container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# /opt/ml and all subdirectories are utilized by SageMaker AI, use the /code subdirectory to store your user code.
COPY train.py /opt/ml/code/train.py

# Defines train.py as the script entry point
ENV SAGEMAKER_PROGRAM train.py
```

You can further customize your own Docker container to work with SageMaker AI using the [SageMaker Training toolkit](https://github.com/aws/sagemaker-training-toolkit) and the binary file of the SageMaker AI distributed data parallel library. To learn more, see the instructions in the following section.
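
The `image_uri` that you pass to the generic `Estimator` follows the standard Amazon ECR image URI layout. The following is a small sketch of assembling it; the account ID, Region, repository name, and tag below are placeholders:

```python
# Build an Amazon ECR image URI for the generic Estimator's image_uri
# parameter: <account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
def ecr_image_uri(account_id, region, repository, tag="latest"):
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"

image_uri = ecr_image_uri("123456789012", "us-west-2", "smddp-custom-training")
# → "123456789012.dkr.ecr.us-west-2.amazonaws.com/smddp-custom-training:latest"
```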

# Create your own Docker container with the SageMaker AI distributed data parallel library
<a name="data-parallel-bring-your-own-container"></a>

To build your own Docker container for training and use the SageMaker AI data parallel library, you must include the correct dependencies and the binary files of the SageMaker AI distributed data parallel library in your Dockerfile. This section provides instructions on how to create a complete Dockerfile with the minimum set of dependencies for distributed training in SageMaker AI using the data parallel library.

**Note**  
This custom Docker option with the SageMaker AI data parallel library as a binary is available only for PyTorch.

**To create a Dockerfile with the SageMaker training toolkit and the data parallel library**

1. Start with a Docker image from [NVIDIA CUDA](https://hub.docker.com/r/nvidia/cuda). Use the cuDNN developer versions, which contain the CUDA runtime and development tools (headers and libraries) needed to build from the [PyTorch source code](https://github.com/pytorch/pytorch#from-source).

   ```
   FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
   ```
**Tip**  
The official AWS Deep Learning Container (DLC) images are built from the [NVIDIA CUDA base images](https://hub.docker.com/r/nvidia/cuda). If you want to use the prebuilt DLC images as references while following the rest of the instructions, see the [AWS Deep Learning Containers for PyTorch Dockerfiles](https://github.com/aws/deep-learning-containers/tree/master/pytorch). 

1. Add the following arguments to specify versions of PyTorch and other packages. Also indicate the Amazon S3 bucket paths to the SageMaker AI data parallel library and other software that uses AWS resources, such as the Amazon S3 plug-in. 

   To use versions of the third-party libraries other than the ones provided in the following code example, we recommend that you look at the [official Dockerfiles of AWS Deep Learning Containers for PyTorch](https://github.com/aws/deep-learning-containers/tree/master/pytorch/training/docker) to find versions that are tested, compatible, and suitable for your application. 

   To find URLs for the `SMDATAPARALLEL_BINARY` argument, see the lookup tables at [Supported frameworks](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks).

   ```
   ARG PYTORCH_VERSION=1.10.2
   ARG PYTHON_SHORT_VERSION=3.8
   ARG EFA_VERSION=1.14.1
   ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
   ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
   ARG CONDA_PREFIX="/opt/conda"
   ARG BRANCH_OFI=1.1.3-aws
   ```

1. Set the following environment variables to properly build SageMaker training components and run the data parallel library. You use these variables for the components in the subsequent steps.

   ```
   # Set ENV variables required to build PyTorch
   ENV TORCH_CUDA_ARCH_LIST="7.0+PTX 8.0"
   ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
   ENV NCCL_VERSION=2.10.3
   
   # Add OpenMPI to the path.
   ENV PATH /opt/amazon/openmpi/bin:$PATH
   
   # Add Conda to path
   ENV PATH $CONDA_PREFIX/bin:$PATH
   
   # Set this environment variable for SageMaker AI to launch SMDDP correctly.
   ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main
   
   # Add environment variable so processes can call fork()
   ENV RDMAV_FORK_SAFE=1
   
   # Indicate the container type
   ENV DLC_CONTAINER_TYPE=training
   
   # Add EFA and SMDDP to LD library path
   ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH"
   ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH
   ```

1. Install or update `curl`, `wget`, and `git` to download and build packages in the subsequent steps.

   ```
   RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
       apt-get update && apt-get install -y  --no-install-recommends \
           curl \
           wget \
           git \
       && rm -rf /var/lib/apt/lists/*
   ```

1. Install [Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) software for Amazon EC2 network communication.

   ```
   RUN DEBIAN_FRONTEND=noninteractive apt-get update
   RUN mkdir /tmp/efa \
       && cd /tmp/efa \
       && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \
       && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
       && cd aws-efa-installer \
       && ./efa_installer.sh -y --skip-kmod -g \
       && rm -rf /tmp/efa
   ```

1. Install [Conda](https://docs.conda.io/en/latest/) to handle package management. 

   ```
   RUN curl -fsSL -v -o ~/miniconda.sh -O  https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh  && \
       chmod +x ~/miniconda.sh && \
       ~/miniconda.sh -b -p $CONDA_PREFIX && \
       rm ~/miniconda.sh && \
       $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \
       $CONDA_PREFIX/bin/conda clean -ya
   ```

1. Get, build, and install PyTorch and its dependencies. We build [PyTorch from the source code](https://github.com/pytorch/pytorch#from-source) because we need to have control of the NCCL version to guarantee compatibility with the [AWS OFI NCCL plug-in](https://github.com/aws/aws-ofi-nccl).

   1. Following the steps in the [PyTorch official dockerfile](https://github.com/pytorch/pytorch/blob/master/Dockerfile), install build dependencies and set up [ccache](https://ccache.dev/) to speed up recompilation.

      ```
      RUN DEBIAN_FRONTEND=noninteractive \
          apt-get install -y --no-install-recommends \
              build-essential \
              ca-certificates \
              ccache \
              cmake \
              git \
              libjpeg-dev \
              libpng-dev \
          && rm -rf /var/lib/apt/lists/*
        
      # Setup ccache
      RUN /usr/sbin/update-ccache-symlinks
      RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
      ```

   1. Install [PyTorch’s common and Linux dependencies](https://github.com/pytorch/pytorch#install-dependencies).

      ```
      # Common dependencies for PyTorch
      RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
      
      # Linux specific dependency for PyTorch
      RUN conda install -c pytorch magma-cuda113
      ```

   1. Clone the [PyTorch GitHub repository](https://github.com/pytorch/pytorch).

      ```
      RUN --mount=type=cache,target=/opt/ccache \
          cd / \
          && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
      ```

1. Install and build a specific [NCCL](https://developer.nvidia.com/nccl) version. To do this, replace the content of PyTorch’s default NCCL folder (`/pytorch/third_party/nccl`) with the specific NCCL version from the NVIDIA repository. The NCCL version was set in step 3 of this guide.

      ```
      RUN cd /pytorch/third_party/nccl \
          && rm -rf nccl \
          && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
          && cd nccl \
          && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
          && make pkg.txz.build \
          && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1
      ```

   1. Build and install PyTorch. This process usually takes slightly more than an hour to complete. PyTorch is built with the NCCL version installed in the previous step.

      ```
      RUN cd /pytorch \
          && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
          python setup.py install \
          && rm -rf /pytorch
      ```

1. Build and install [AWS OFI NCCL plugin](https://github.com/aws/aws-ofi-nccl). This enables [libfabric](https://github.com/ofiwg/libfabric) support for the SageMaker AI data parallel library.

   ```
   RUN DEBIAN_FRONTEND=noninteractive apt-get update \
       && apt-get install -y --no-install-recommends \
           autoconf \
           automake \
           libtool
   RUN mkdir /tmp/efa-ofi-nccl \
       && cd /tmp/efa-ofi-nccl \
       && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
       && cd aws-ofi-nccl \
       && ./autogen.sh \
       && ./configure --with-libfabric=/opt/amazon/efa \
       --with-mpi=/opt/amazon/openmpi \
       --with-cuda=/usr/local/cuda \
       --with-nccl=$CONDA_PREFIX \
       && make \
       && make install \
       && rm -rf /tmp/efa-ofi-nccl
   ```

1. Build and install [TorchVision](https://github.com/pytorch/vision.git).

   ```
   RUN pip install --no-cache-dir -U \
       packaging \
       mpi4py==3.0.3
   RUN cd /tmp \
       && git clone https://github.com/pytorch/vision.git -b v0.9.1 \
       && cd vision \
       && BUILD_VERSION="0.9.1+cu111" python setup.py install \
       && cd /tmp \
       && rm -rf vision
   ```

1. Install and configure OpenSSH. OpenSSH is required for MPI to communicate between containers. Allow OpenSSH to talk to containers without asking for confirmation.

   ```
   RUN apt-get update \
       && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
           openssh-client openssh-server \
       && mkdir -p /var/run/sshd \
       && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
       && echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
       && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
       && rm -rf /var/lib/apt/lists/*
   
   # Configure OpenSSH so that nodes can communicate with each other
   RUN mkdir -p /var/run/sshd && \
    sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
   RUN rm -rf /root/.ssh/ && \
    mkdir -p /root/.ssh/ && \
    ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
    cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
    && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config
   ```

1. Install the PT S3 plug-in to efficiently access datasets in Amazon S3.

   ```
   RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}
   RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt
   ```

1. Install the [libboost](https://www.boost.org/) library. This package is needed for the asynchronous IO functionality of the SageMaker AI data parallel library.

   ```
   WORKDIR /
   RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
       && tar -xzf boost_1_73_0.tar.gz \
       && cd boost_1_73_0 \
       && ./bootstrap.sh \
       && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
       && cd .. \
       && rm -rf boost_1_73_0.tar.gz \
       && rm -rf boost_1_73_0 \
       && cd ${CONDA_PREFIX}/include/boost
   ```

1. Install the following SageMaker AI tools for PyTorch training.

   ```
   WORKDIR /root
   RUN pip install --no-cache-dir -U \
       smclarify \
       "sagemaker>=2,<3" \
       sagemaker-experiments==0.* \
       sagemaker-pytorch-training
   ```

1. Finally, install the SageMaker AI data parallel binary and the remaining dependencies.

   ```
   RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
     apt-get update && apt-get install -y  --no-install-recommends \
     jq \
     libhwloc-dev \
     libnuma1 \
     libnuma-dev \
     libssl1.1 \
     libtool \
     hwloc \
     && rm -rf /var/lib/apt/lists/*
   
   RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
   ```

1. After you finish creating the Dockerfile, see [Adapting Your Own Training Container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html) to learn how to build the Docker container, host it in Amazon ECR, and run a training job using the SageMaker Python SDK.

The following example code shows a complete Dockerfile after combining all the previous code blocks.

```
# This file creates a docker image with minimum dependencies to run SageMaker AI data parallel training
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04

# Set appropriate versions and location for components
ARG PYTORCH_VERSION=1.10.2
ARG PYTHON_SHORT_VERSION=3.8
ARG EFA_VERSION=1.14.1
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
ARG CONDA_PREFIX="/opt/conda"
ARG BRANCH_OFI=1.1.3-aws

# Set ENV variables required to build PyTorch
ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV NCCL_VERSION=2.10.3

# Add OpenMPI to the path.
ENV PATH /opt/amazon/openmpi/bin:$PATH

# Add Conda to path
ENV PATH $CONDA_PREFIX/bin:$PATH

# Set this environment variable for SageMaker AI to launch SMDDP correctly.
ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main

# Add environment variable so processes can call fork()
ENV RDMAV_FORK_SAFE=1

# Indicate the container type
ENV DLC_CONTAINER_TYPE=training

# Add EFA and SMDDP to LD library path
ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH"
ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH

# Install basic dependencies to download and build other dependencies
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
  apt-get update && apt-get install -y  --no-install-recommends \
  curl \
  wget \
  git \
  && rm -rf /var/lib/apt/lists/*

# Install EFA.
# This is required for SMDDP backend communication
RUN DEBIAN_FRONTEND=noninteractive apt-get update
RUN mkdir /tmp/efa \
    && cd /tmp/efa \
    && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \
    && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
    && cd aws-efa-installer \
    && ./efa_installer.sh -y --skip-kmod -g \
    && rm -rf /tmp/efa

# Install Conda
RUN curl -fsSL -v -o ~/miniconda.sh -O  https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh  && \
    chmod +x ~/miniconda.sh && \
    ~/miniconda.sh -b -p $CONDA_PREFIX && \
    rm ~/miniconda.sh && \
    $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \
    $CONDA_PREFIX/bin/conda clean -ya

# Install PyTorch.
# Start with dependencies listed in official PyTorch dockerfile
# https://github.com/pytorch/pytorch/blob/master/Dockerfile
RUN DEBIAN_FRONTEND=noninteractive \
    apt-get install -y --no-install-recommends \
        build-essential \
        ca-certificates \
        ccache \
        cmake \
        git \
        libjpeg-dev \
        libpng-dev && \
    rm -rf /var/lib/apt/lists/*

# Setup ccache
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache

# Common dependencies for PyTorch
RUN conda install -y astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses

# Linux specific dependency for PyTorch
RUN conda install -y -c pytorch magma-cuda113

# Clone PyTorch
RUN --mount=type=cache,target=/opt/ccache \
    cd / \
    && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
# Note that we need to use the same NCCL version for PyTorch and OFI plugin.
# To enforce that, install NCCL from source before building PT and OFI plugin.

# Install NCCL.
# Required for building OFI plugin (OFI requires NCCL's header files and library)
RUN cd /pytorch/third_party/nccl \
    && rm -rf nccl \
    && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
    && cd nccl \
    && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
    && make pkg.txz.build \
    && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1

# Build and install PyTorch.
RUN cd /pytorch \
    && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
    python setup.py install \
    && rm -rf /pytorch

RUN ccache -C

# Build and install OFI plugin.
# It is required to use libfabric.
RUN DEBIAN_FRONTEND=noninteractive apt-get update \
    && apt-get install -y --no-install-recommends \
        autoconf \
        automake \
        libtool
RUN mkdir /tmp/efa-ofi-nccl \
    && cd /tmp/efa-ofi-nccl \
    && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
    && cd aws-ofi-nccl \
    && ./autogen.sh \
    && ./configure --with-libfabric=/opt/amazon/efa \
        --with-mpi=/opt/amazon/openmpi \
        --with-cuda=/usr/local/cuda \
        --with-nccl=$CONDA_PREFIX \
    && make \
    && make install \
    && rm -rf /tmp/efa-ofi-nccl

# Build and install Torchvision
RUN pip install --no-cache-dir -U \
    packaging \
    mpi4py==3.0.3
RUN cd /tmp \
    && git clone https://github.com/pytorch/vision.git -b v0.9.1 \
    && cd vision \
    && BUILD_VERSION="0.9.1+cu111" python setup.py install \
    && cd /tmp \
    && rm -rf vision

# Install OpenSSH.
# Required for MPI to communicate between containers, allow OpenSSH to talk to containers without asking for confirmation
RUN apt-get update \
    && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends openssh-client openssh-server \
    && mkdir -p /var/run/sshd \
    && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
    && echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
    && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
    && rm -rf /var/lib/apt/lists/*
# Configure OpenSSH so that nodes can communicate with each other
RUN mkdir -p /var/run/sshd && \
    sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
RUN rm -rf /root/.ssh/ && \
    mkdir -p /root/.ssh/ && \
    ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
    cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
    && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config

# Install PT S3 plugin.
# Required to efficiently access datasets in Amazon S3
RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}
RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt

# Install libboost from source.
# This package is needed for smdataparallel functionality (for networking asynchronous IO).
WORKDIR /
RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
    && tar -xzf boost_1_73_0.tar.gz \
    && cd boost_1_73_0 \
    && ./bootstrap.sh \
    && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
    && cd .. \
    && rm -rf boost_1_73_0.tar.gz \
    && rm -rf boost_1_73_0 \
    && cd ${CONDA_PREFIX}/include/boost

# Install SageMaker AI PyTorch training.
WORKDIR /root
RUN pip install --no-cache-dir -U \
    smclarify \
    "sagemaker>=2,<3" \
    sagemaker-experiments==0.* \
    sagemaker-pytorch-training

# Install SageMaker AI data parallel binary (SMDDP)
# Start with dependencies
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
    apt-get update && apt-get install -y  --no-install-recommends \
        jq \
        libhwloc-dev \
        libnuma1 \
        libnuma-dev \
        libssl1.1 \
        libtool \
        hwloc \
    && rm -rf /var/lib/apt/lists/*

# Install SMDDP
RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
```

**Tip**  
For more general information about creating a custom Dockerfile for training in SageMaker AI, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).

**Tip**  
If you want to extend the custom Dockerfile to incorporate the SageMaker AI model parallel library, see [Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library](model-parallel-sm-sdk.md#model-parallel-bring-your-own-container).

# Amazon SageMaker AI data parallelism library examples
<a name="distributed-data-parallel-v2-examples"></a>

This page provides Jupyter notebooks that present examples of implementing the SageMaker AI distributed data parallelism (SMDDP) library to run distributed training jobs on SageMaker AI.

## Blogs and Case Studies
<a name="distributed-data-parallel-v2-examples-blog"></a>

The following blogs discuss case studies about using the SMDDP library.

**SMDDP v2 blogs**
+ [Enable faster training with Amazon SageMaker AI data parallel library](https://aws.amazon.com/blogs/machine-learning/enable-faster-training-with-amazon-sagemaker-data-parallel-library/), *AWS Machine Learning Blog* (December 05, 2023)

**SMDDP v1 blogs**
+ [How I trained 10TB for Stable Diffusion on SageMaker AI](https://medium.com/@emilywebber/how-i-trained-10tb-for-stable-diffusion-on-sagemaker-39dcea49ce32) in *Medium* (November 29, 2022)
+ [Run PyTorch Lightning and native PyTorch DDP on Amazon SageMaker Training, featuring Amazon Search ](https://aws.amazon.com/blogs/machine-learning/run-pytorch-lightning-and-native-pytorch-ddp-on-amazon-sagemaker-training-featuring-amazon-search/), *AWS Machine Learning Blog* (August 18, 2022)
+ [Training YOLOv5 on AWS with PyTorch and the SageMaker AI distributed data parallel library](https://medium.com/@sitecao/training-yolov5-on-aws-with-pytorch-and-sagemaker-distributed-data-parallel-library-a196ab01409b), *Medium* (May 6, 2022)
+ [Speed up EfficientNet model training on SageMaker AI with PyTorch and the SageMaker AI distributed data parallel library](https://medium.com/@dangmz/speed-up-efficientnet-model-training-on-amazon-sagemaker-with-pytorch-and-sagemaker-distributed-dae4b048c01a), *Medium* (March 21, 2022)
+ [Speed up EfficientNet training on AWS with the SageMaker AI distributed data parallel library](https://towardsdatascience.com/speed-up-efficientnet-training-on-aws-by-up-to-30-with-sagemaker-distributed-data-parallel-library-2dbf6d1e18e8), *Towards Data Science* (January 12, 2022)
+ [Hyundai reduces ML model training time for autonomous driving models using Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/hyundai-reduces-training-time-for-autonomous-driving-models-using-amazon-sagemaker/), *AWS Machine Learning Blog* (June 25, 2021)
+ [Distributed Training: Train BART/T5 for Summarization using Transformers and Amazon SageMaker AI](https://huggingface.co/blog/sagemaker-distributed-training-seq2seq), the *Hugging Face website* (April 8, 2021)

## Example notebooks
<a name="distributed-data-parallel-v2-examples-pytorch"></a>

Example notebooks are provided in the [SageMaker AI examples GitHub repository](https://github.com/aws/amazon-sagemaker-examples/tree/master/training/distributed_training/). To download the examples, run the following command to clone the repository and go to `training/distributed_training/pytorch/data_parallel`.

**Note**  
Clone and run the example notebooks in one of the following SageMaker AI ML IDEs:
+ [SageMaker AI JupyterLab](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl.html) (available in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
+ [SageMaker AI Code Editor](https://docs.aws.amazon.com/sagemaker/latest/dg/code-editor.html) (available in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
+ [Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) (available as an application in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
+ [SageMaker Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html)

```
git clone https://github.com/aws/amazon-sagemaker-examples.git
cd amazon-sagemaker-examples/training/distributed_training/pytorch/data_parallel
```

**SMDDP v2 examples**
+ [Train Llama 2 using the SageMaker AI distributed data parallel library (SMDDP) and DeepSpeed](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/deepspeed/llama2/smddp_deepspeed_example.ipynb)
+ [Train Falcon using the SageMaker AI distributed data parallel library (SMDDP) and PyTorch Fully Sharded Data Parallelism (FSDP)](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/fully_sharded_data_parallel/falcon/smddp_fsdp_example.ipynb)

**SMDDP v1 examples**
+ [CNN with PyTorch and the SageMaker AI data parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/mnist/pytorch_smdataparallel_mnist_demo.ipynb)
+ [BERT with PyTorch and the SageMaker AI data parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/data_parallel/bert/pytorch_smdataparallel_bert_demo.ipynb)
+ [CNN with TensorFlow 2.3.1 and the SageMaker AI data parallelism library](https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/tensorflow/data_parallel/mnist/tensorflow2_smdataparallel_mnist_demo.html)
+ [BERT with TensorFlow 2.3.1 and the SageMaker AI data parallelism library](https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/tensorflow/data_parallel/bert/tensorflow2_smdataparallel_bert_demo.html)
+ [HuggingFace Distributed Data Parallel Training in PyTorch on SageMaker AI - Distributed Question Answering](https://github.com/huggingface/notebooks/blob/master/sagemaker/03_distributed_training_data_parallelism/sagemaker-notebook.ipynb)
+ [HuggingFace Distributed Data Parallel Training in PyTorch on SageMaker AI - Distributed Text Summarization](https://github.com/huggingface/notebooks/blob/master/sagemaker/08_distributed_summarization_bart_t5/sagemaker-notebook.ipynb)
+ [HuggingFace Distributed Data Parallel Training in TensorFlow on SageMaker AI](https://github.com/huggingface/notebooks/blob/master/sagemaker/07_tensorflow_distributed_training_data_parallelism/sagemaker-notebook.ipynb)

# Configuration tips for the SageMaker AI distributed data parallelism library
<a name="data-parallel-config"></a>

Review the following tips before using the SageMaker AI distributed data parallelism (SMDDP) library. This list includes tips that are applicable across frameworks.

**Topics**
+ [Data preprocessing](#data-parallel-config-dataprep)
+ [Single versus multiple nodes](#data-parallel-config-multi-node)
+ [Debug scaling efficiency with Debugger](#data-parallel-config-debug)
+ [Batch size](#data-parallel-config-batch-size)
+ [Custom MPI options](#data-parallel-config-mpi-custom)
+ [Use Amazon FSx and set up an optimal storage and throughput capacity](#data-parallel-config-fxs)

## Data preprocessing
<a name="data-parallel-config-dataprep"></a>

If you preprocess data during training using an external library that utilizes the CPU, you may run into a CPU bottleneck because SageMaker AI distributed data parallel uses the CPU for `AllReduce` operations. You may be able to improve training time by moving preprocessing steps to a library that uses GPUs or by completing all preprocessing before training.
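The payoff of moving preprocessing out of the training loop can be sketched in plain Python. This is an illustrative sketch, not a SageMaker AI API; `preprocess` and `train_step` are hypothetical stand-ins for a CPU-heavy transform and the GPU-bound training work:

```python
# Sketch: run CPU-bound preprocessing once, before training, so the CPU
# stays free for the AllReduce operations that SMDDP runs during training.
# `preprocess` and `train_step` are hypothetical placeholders.

calls = {"preprocess": 0}

def preprocess(sample):
    """Stand-in for a CPU-heavy transform (decoding, tokenizing, augmenting)."""
    calls["preprocess"] += 1
    return sample * 2

def train_step(batch):
    """Stand-in for the GPU-bound training work."""
    return sum(batch)

raw_data = [1, 2, 3, 4]
epochs = 3

# Preprocess once, up front, instead of inside the epoch loop.
cached = [preprocess(s) for s in raw_data]

for _ in range(epochs):
    loss = train_step(cached)

# Each sample was transformed once, not once per epoch.
assert calls["preprocess"] == len(raw_data)
```

If the transform ran inside the epoch loop instead, it would cost `len(raw_data) * epochs` CPU-bound calls, competing with the library's `AllReduce` work.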

## Single versus multiple nodes
<a name="data-parallel-config-multi-node"></a>

We recommend that you use this library with multiple nodes. The library can be used with a single-host, multi-device setup (for example, a single ML compute instance with multiple GPUs); however, when you use two or more nodes, the library’s `AllReduce` operation gives you significant performance improvement. Also, on a single host, NVLink already contributes to in-node `AllReduce` efficiency.

## Debug scaling efficiency with Debugger
<a name="data-parallel-config-debug"></a>

You can use Amazon SageMaker Debugger to monitor and visualize CPU and GPU utilization and other metrics of interest during training. You can use Debugger [built-in rules](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html) to monitor computational performance issues, such as `CPUBottleneck`, `LoadBalancing`, and `LowGPUUtilization`. You can specify these rules with [Debugger configurations](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configuration-for-debugging.html) when you define an Amazon SageMaker Python SDK estimator. If you use AWS CLI and AWS SDK for Python (Boto3) for training on SageMaker AI, you can enable Debugger as shown in [Configure SageMaker Debugger Using Amazon SageMaker API](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-createtrainingjob-api.html).

To see an example using Debugger in a SageMaker training job, you can reference one of the notebook examples in the [SageMaker Notebook Examples GitHub repository](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger). To learn more about Debugger, see [Amazon SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html).

## Batch size
<a name="data-parallel-config-batch-size"></a>

In distributed training, as more nodes are added, batch sizes should increase proportionally. To improve convergence speed as you add more nodes to your training job and increase the global batch size, increase the learning rate.

One way to achieve this is by using a gradual learning rate warmup where the learning rate is ramped up from a small to a large value as the training job progresses. This ramp avoids a sudden increase of the learning rate, allowing healthy convergence at the start of training. For example, you can use a *Linear Scaling Rule* where each time the mini-batch size is multiplied by k, the learning rate is also multiplied by k. To learn more about this technique, see the research paper, [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/pdf/1706.02677.pdf), Sections 2 and 3.
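The two rules above (scale the learning rate by k, then ramp up to it gradually) can be sketched as small schedule functions. This is a generic illustration; the function names and the 5-step warmup length are arbitrary choices, not a SageMaker AI API:

```python
def scaled_lr(base_lr, base_batch, global_batch):
    """Linear Scaling Rule: when the mini-batch size is multiplied by k,
    multiply the learning rate by k as well."""
    return base_lr * (global_batch / base_batch)

def warmup_lr(target_lr, base_lr, step, warmup_steps):
    """Ramp the learning rate linearly from base_lr to target_lr over
    warmup_steps, then hold it at target_lr."""
    if step >= warmup_steps:
        return target_lr
    return base_lr + (target_lr - base_lr) * step / warmup_steps

# Example: scaling from 1 node (batch 256) to 4 nodes (global batch 1024).
target = scaled_lr(base_lr=0.1, base_batch=256, global_batch=1024)  # 0.4
schedule = [warmup_lr(target, base_lr=0.1, step=s, warmup_steps=5)
            for s in range(7)]
# Rises from 0.1 to 0.4 over the first 5 steps, then stays at 0.4.
```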

## Custom MPI options
<a name="data-parallel-config-mpi-custom"></a>

The SageMaker AI distributed data parallel library employs Message Passing Interface (MPI), a popular standard for managing communication between nodes in a high-performance cluster, and uses NVIDIA’s NCCL library for GPU-level communication. When you use the data parallel library with a TensorFlow or PyTorch `Estimator`, the respective container sets up the MPI environment and executes the `mpirun` command to start jobs on the cluster nodes.

You can set custom MPI operations using the `custom_mpi_options` parameter in the `Estimator`. Any `mpirun` flags passed in this field are added to the `mpirun` command and executed by SageMaker AI for training. For example, you may define the `distribution` parameter of an `Estimator` using the following to use the [`NCCL_DEBUG`](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-debug) variable to print the NCCL version at the start of the program:

```
distribution = {'smdistributed':{'dataparallel':{'enabled': True, "custom_mpi_options": "-verbose -x NCCL_DEBUG=VERSION"}}}
```

## Use Amazon FSx and set up an optimal storage and throughput capacity
<a name="data-parallel-config-fxs"></a>

When training a model on multiple nodes with distributed data parallelism, we highly recommend that you use [FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html). Amazon FSx is a scalable, high-performance storage service that supports shared file storage with high throughput. Using Amazon FSx storage at scale, you can achieve faster data loading across the compute nodes.

Typically, with distributed data parallelism, you would expect that the total training throughput scales near-linearly with the number of GPUs. However, if you use suboptimal Amazon FSx storage, the training performance might slow down due to a low Amazon FSx throughput. 

For example, if you use the [**SCRATCH\_2** deployment type of the Amazon FSx file system](https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html#fsx-aggregate-perf) with the minimum 1.2 TiB storage capacity, the I/O throughput capacity is 240 MB/s. Amazon FSx storage works in a way that you can assign physical storage devices, and the more devices assigned, the larger the throughput you get. The smallest storage increment for the SCRATCH\_2 type is 1.2 TiB, and the corresponding throughput gain is 240 MB/s.

Assume that you have a model to train on a 4-node cluster over a 100 GB dataset. With a given batch size that’s optimized to the cluster, assume that the model can complete one epoch in about 30 seconds. In this case, the minimum required I/O speed is approximately 3 GB/s (100 GB / 30 s), which is a much higher throughput requirement than 240 MB/s. With such a limited Amazon FSx capacity, scaling your distributed training job up to larger clusters might aggravate I/O bottleneck problems; model training throughput might improve in later epochs as the cache builds up, but Amazon FSx throughput can still be a bottleneck.

To alleviate such I/O bottleneck problems, you should increase the Amazon FSx storage size to obtain a higher throughput capacity. Typically, to find an optimal I/O throughput, you may experiment with different Amazon FSx throughput capacities, starting at or slightly below your estimate, until you find one that is sufficient to resolve the I/O bottleneck problems. In the preceding example, Amazon FSx storage with 2.4 GB/s throughput and 67 GB of RAM cache would be sufficient. If the file system has optimal throughput, the model training throughput should reach its maximum either immediately or after the first epoch, once the cache has built up.
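The sizing estimate in the preceding paragraphs reduces to simple arithmetic, sketched below. The 200 MB/s-per-TiB figure is an assumption matching the SCRATCH\_2 deployment type at the time of writing; verify the current rates in the Amazon FSx performance documentation:

```python
# Sketch: estimate whether an FSx for Lustre file system can keep up with
# a distributed training job. The per-TiB throughput rate is an assumption
# (SCRATCH_2-like, 200 MB/s per TiB); check the FSx documentation.

def required_gbps(dataset_gb, epoch_seconds):
    """Minimum I/O speed (GB/s) to stream the full dataset once per epoch."""
    return dataset_gb / epoch_seconds

def fsx_gbps(storage_tib, mbps_per_tib=200):
    """Aggregate throughput (GB/s) for a given storage capacity."""
    return storage_tib * mbps_per_tib / 1000

need = required_gbps(dataset_gb=100, epoch_seconds=30)  # ~3.3 GB/s
have = fsx_gbps(storage_tib=1.2)                        # 0.24 GB/s

# Storage needed to close the gap at the assumed per-TiB rate:
min_tib = need * 1000 / 200                             # ~16.7 TiB
```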

To learn more about how to increase Amazon FSx storage and deployment types, see the following pages in the *Amazon FSx for Lustre documentation*:
+  [How to increase storage capacity](https://docs.aws.amazon.com/fsx/latest/LustreGuide/managing-storage-capacity.html#increase-storage-capacity) 
+  [Aggregate file system performance](https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html#fsx-aggregate-perf) 

# Amazon SageMaker AI distributed data parallelism library FAQ
<a name="data-parallel-faq"></a>

Use the following to find answers to commonly asked questions about the SMDDP library.

**Q: When using the library, how are the `allreduce`-supporting CPU instances managed? Do I have to create heterogeneous CPU-GPU clusters, or does the SageMaker AI service create extra C5s for jobs that use the SMDDP library?**

The SMDDP library supports only GPU instances, specifically P4d and P4de instances with NVIDIA A100 GPUs and EFA. No additional C5 or CPU instances are launched; if your SageMaker AI training job runs on an 8-node P4d cluster, only the 8 `ml.p4d.24xlarge` instances are used.

**Q: I have a training job taking 5 days on a single `ml.p3.24xlarge` instance with a set of hyperparameters H1 (learning rate, batch size, optimizer, and so on). Is using the SageMaker AI data parallelism library with a cluster five times larger enough to achieve an approximately five-times speedup? Or do I have to revisit the training hyperparameters after activating the SMDDP library?**

The library changes the overall batch size: the new overall batch size scales linearly with the number of training instances used. As a result, hyperparameters such as the learning rate have to be adjusted to ensure convergence.

**Q: Does the SMDDP library support Spot Instances?**

Yes. You can use managed spot training. You specify the path to the checkpoint file in the SageMaker training job, and save and restore checkpoints in your training script as described in the last steps of [Use the SMDDP library in your TensorFlow training script (deprecated)](data-parallel-modify-sdp-tf2.md) and [Use the SMDDP library in your PyTorch training script](data-parallel-modify-sdp-pt.md).

**Q: Is the SMDDP library relevant in a single-host, multi-device setup?**

The library can be used in single-host, multi-device training, but it offers performance improvements only in multi-host training.

**Q: Where should the training dataset be stored?**

The training dataset can be stored in an Amazon S3 bucket or on an Amazon FSx file system. For the input file systems supported for a training job, see [`sagemaker.inputs.FileSystemInput`](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.FileSystemInput) in the *SageMaker Python SDK documentation*.

**Q: When using the SMDDP library, is it mandatory to have training data in FSx for Lustre? Can Amazon EFS and Amazon S3 be used?**

We generally recommend you use Amazon FSx because of its lower latency and higher throughput. If you prefer, you can use Amazon EFS or Amazon S3.

**Q: Can the library be used with CPU nodes?** 

No. To find instance types supported by the SMDDP library, see [Supported instance types](distributed-data-parallel-support.md#distributed-data-parallel-supported-instance-types).

**Q: What frameworks and framework versions are currently supported by the SMDDP library?**

The SMDDP library currently supports PyTorch v1.6.0 or later and TensorFlow v2.3.0 or later. It doesn't support TensorFlow 1.x. For more information about which version of the SMDDP library is packaged within AWS Deep Learning Containers, see [Release Notes for Deep Learning Containers](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html).

**Q: Does the library support AMP?**

Yes, the SMDDP library supports Automatic Mixed Precision (AMP) out of the box. No extra action is needed to use AMP other than the framework-level modifications to your training script. If gradients are in FP16, the SageMaker AI data parallelism library runs its `AllReduce` operation in FP16. For more information about implementing AMP APIs to your training script, see the following resources:
+ [Frameworks - PyTorch](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#pytorch) in the *NVIDIA Deep Learning Performance documentation*
+ [Frameworks - TensorFlow](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#tensorflow) in the *NVIDIA Deep Learning Performance documentation*
+ [Automatic Mixed Precision for Deep Learning](https://developer.nvidia.com/automatic-mixed-precision) in the *NVIDIA Developer Docs*
+ [Introducing native PyTorch automatic mixed precision for faster training on NVIDIA GPUs](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/) in the *PyTorch Blog*
+ [TensorFlow mixed precision APIs](https://www.tensorflow.org/guide/mixed_precision) in the *TensorFlow documentation*

**Q: How do I identify if my distributed training job is slowed down due to I/O bottleneck?**

With a larger cluster, the training job requires more I/O throughput, so the training throughput might take longer (more epochs) to ramp up to the maximum performance. This indicates that I/O is bottlenecked and that the cache is harder to build up as you scale nodes up (due to the higher throughput requirement and the more complex network topology). For more information about monitoring the Amazon FSx throughput on CloudWatch, see [Monitoring FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/monitoring_overview.html) in the *FSx for Lustre User Guide*.

**Q: How do I resolve I/O bottlenecks when running a distributed training job with data parallelism?**

We highly recommend that you use Amazon FSx as your data channel if you are using Amazon S3. If you are already using Amazon FSx but still having I/O bottleneck problems, you might have set up your Amazon FSx file system with a low I/O throughput and a small storage capacity. For more information about how to estimate and choose the right size of I/O throughput capacity, see [Use Amazon FSx and set up an optimal storage and throughput capacity](data-parallel-config.md#data-parallel-config-fxs).

**Q: (For the SMDDP library v1.4.0 or later) How do I resolve the `Invalid backend` error while initializing a process group?**

If you encounter the error message `ValueError: Invalid backend: 'smddp'` when calling `init_process_group`, this is due to the breaking change in the SMDDP library v1.4.0 and later. You must import the PyTorch client of the library, `smdistributed.dataparallel.torch.torch_smddp`, which registers `smddp` as a backend for PyTorch. To learn more, see [Use the SMDDP library in your PyTorch training script](data-parallel-modify-sdp-pt.md).

**Q: (For the SMDDP library v1.4.0 or later) I would like to call the collective primitives of the [`torch.distributed`](https://pytorch.org/docs/stable/distributed.html) interface. Which primitives does the `smddp` backend support?**

In v1.4.0, the SMDDP library supports `all_reduce`, `broadcast`, `reduce`, `all_gather`, and `barrier` of the `torch.distributed` interface.

**Q: (For the SMDDP library v1.4.0 or later) Does this new API work with other custom DDP classes or libraries like Apex DDP?**

The SMDDP library is tested with other third-party distributed data parallel libraries and framework implementations that use the `torch.distributed` modules. Using the SMDDP library with custom DDP classes works as long as the collective operations used by the custom DDP classes are supported by the library. See the preceding question for a list of supported collectives. If you have these use cases and need further support, reach out to the SageMaker AI team through the [AWS Support Center](https://console.aws.amazon.com/support/) or [AWS Developer Forums for Amazon SageMaker AI](https://forums.aws.amazon.com/forum.jspa?forumID=285).

**Q: Does the SMDDP library support the bring-your-own-container (BYOC) option? If so, how do I install the library and run a distributed training job by writing a custom Dockerfile?**

If you want to integrate the SMDDP library and its minimum dependencies into your own Docker container, BYOC is the right approach. You can build your own container using the binary file of the library. The recommended process is to write a custom Dockerfile with the library and its dependencies, build the Docker container, host it in Amazon ECR, and use the ECR image URI to launch a training job using the SageMaker AI generic estimator class. For more instructions on how to prepare a custom Dockerfile for distributed training in SageMaker AI with the SMDDP library, see [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md).

# Troubleshooting for distributed training in Amazon SageMaker AI
<a name="distributed-troubleshooting-data-parallel"></a>

If you have problems running a training job when you use the library, use the following list to try to troubleshoot. If you need further support, reach out to the SageMaker AI team through [AWS Support Center](https://console.aws.amazon.com/support/) or [AWS Developer Forums for Amazon SageMaker AI](https://forums.aws.amazon.com/forum.jspa?forumID=285).

**Topics**
+ [Using SageMaker AI distributed data parallel with Amazon SageMaker Debugger and checkpoints](#distributed-ts-data-parallel-debugger)
+ [An unexpected prefix attached to model parameter keys](#distributed-ts-data-parallel-pytorch-prefix)
+ [SageMaker AI distributed training job stalling during initialization](#distributed-ts-data-parallel-efa-sg)
+ [SageMaker AI distributed training job stalling at the end of training](#distributed-ts-data-parallel-stall-at-the-end)
+ [Observing scaling efficiency degradation due to Amazon FSx throughput bottlenecks](#distributed-ts-data-parallel-fxs-bottleneck)
+ [SageMaker AI distributed training job with PyTorch returns deprecation warnings](#distributed-ts-data-parallel-deprecation-warnings)

## Using SageMaker AI distributed data parallel with Amazon SageMaker Debugger and checkpoints
<a name="distributed-ts-data-parallel-debugger"></a>

To monitor system bottlenecks, profile framework operations, and debug model output tensors for training jobs with SageMaker AI distributed data parallel, use Amazon SageMaker Debugger. 

However, when you use SageMaker Debugger, SageMaker AI distributed data parallel, and SageMaker AI checkpoints, you might see an error that looks like the following example. 

```
SMDebug Does Not Currently Support Distributed Training Jobs With Checkpointing Enabled
```

This is due to an internal error between Debugger and checkpoints, which occurs when you enable SageMaker AI distributed data parallel. 
+ If you enable all three features, SageMaker Python SDK automatically turns off Debugger by passing `debugger_hook_config=False`, which is equivalent to the following framework `estimator` example.

  ```
  import sagemaker
  from sagemaker.tensorflow import TensorFlow

  bucket=sagemaker.Session().default_bucket()
  base_job_name="sagemaker-checkpoint-test"
  checkpoint_in_bucket="checkpoints"

  # The S3 URI to store the checkpoints
  checkpoint_s3_bucket="s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)

  estimator = TensorFlow(
      ...

      distribution={"smdistributed": {"dataparallel": { "enabled": True }}},
      checkpoint_s3_uri=checkpoint_s3_bucket,
      checkpoint_local_path="/opt/ml/checkpoints",
      debugger_hook_config=False
  )
  ```
+ If you want to keep using both SageMaker AI distributed data parallel and SageMaker Debugger, a workaround is to manually add checkpointing functions to your training script instead of specifying the `checkpoint_s3_uri` and `checkpoint_local_path` parameters in the estimator. For more information about setting up manual checkpointing in a training script, see [Saving Checkpoints](distributed-troubleshooting-model-parallel.md#distributed-ts-model-parallel-checkpoints).

## An unexpected prefix attached to model parameter keys
<a name="distributed-ts-data-parallel-pytorch-prefix"></a>

For PyTorch distributed training jobs, an unexpected prefix (for example, `model`) might be attached to the `state_dict` keys (model parameters). The SageMaker AI data parallel library does not directly alter or prepend any model parameter names when PyTorch training jobs save model artifacts; rather, PyTorch's distributed training changes the names in the `state_dict` so that parameters can go over the network, prepending the prefix. If you encounter model failures due to mismatched parameter names while you are using the SageMaker AI data parallel library and checkpointing for PyTorch training, adapt the following example code to remove the prefix at the step where you load checkpoints in your training script.

```
state_dict = {k.partition('model.')[2]:state_dict[k] for k in state_dict.keys()}
```

This takes each `state_dict` key as a string, splits it at the first occurrence of `'model.'`, and keeps the third element (index 2) of the tuple that `partition` returns, which is the portion of the key after the prefix.
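
As a quick illustration, the following sketch applies the same dictionary comprehension to a hypothetical checkpoint (the keys and values are made up for demonstration):

```python
# Hypothetical checkpoint keys as PyTorch distributed training might save them.
state_dict = {
    "model.layer1.weight": 0.1,
    "model.layer1.bias": 0.2,
}

# Strip the 'model.' prefix so the keys match the unwrapped model again.
state_dict = {k.partition('model.')[2]: state_dict[k] for k in state_dict.keys()}

print(sorted(state_dict))  # ['layer1.bias', 'layer1.weight']
```

Note that a key that does not contain `'model.'` is mapped to an empty string by `partition`, so apply this only when every key carries the prefix.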

For more information about the prefix issue, see a discussion thread at [Prefix parameter names in saved model if trained by multi-GPU?](https://discuss.pytorch.org/t/prefix-parameter-names-in-saved-model-if-trained-by-multi-gpu/494) in the *PyTorch discussion forum*.

For more information about the PyTorch methods for saving and loading models, see [Saving & Loading Model Across Devices](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-across-devices) in the *PyTorch documentation*.

## SageMaker AI distributed training job stalling during initialization
<a name="distributed-ts-data-parallel-efa-sg"></a>

If your SageMaker AI distributed data parallel training job stalls during initialization when using EFA-enabled instances, this might be due to a misconfiguration in the security group of the VPC subnet that's used for the training job. EFA requires a proper security group configuration to enable traffic between the nodes.

**To configure inbound and outbound rules for the security group**

1. Sign in to the AWS Management Console and open the Amazon VPC console at [https://console.aws.amazon.com/vpc/](https://console.aws.amazon.com/vpc/).

1. Choose **Security Groups** in the left navigation pane.

1. Select the security group that's tied to the VPC subnet you use for training. 

1. In the **Details** section, copy the **Security group ID**.

1. On the **Inbound rules** tab, choose **Edit inbound rules**.

1. On the **Edit inbound rules** page, do the following: 

   1. Choose **Add rule**.

   1. For **Type**, choose **All traffic**.

   1. For **Source**, choose **Custom**, paste the security group ID into the search box, and select the security group that pops up.

1. Choose **Save rules** to finish configuring the inbound rule for the security group.

1. On the **Outbound rules** tab, choose **Edit outbound rules**.

1. Repeat steps 6 and 7 to add the same rule as an outbound rule.

After you complete the preceding steps for configuring the security group with the inbound and outbound rules, re-run the training job and verify whether the stalling issue is resolved.

For more information about configuring security groups for VPC and EFA, see [Security groups for your VPC](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html) and [Elastic Fabric Adapter](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html).

## SageMaker AI distributed training job stalling at the end of training
<a name="distributed-ts-data-parallel-stall-at-the-end"></a>

One of the root causes of stalling issues at the end of training is a mismatch in the number of batches that are processed per epoch across different ranks. All workers (GPUs) synchronize their local gradients in the backward pass to ensure they all have the same copy of the model at the end of the batch iteration. If the batch sizes are unevenly assigned to different worker groups during the final epoch of training, the training job stalls. For example, while a group of workers (group A) finishes processing all batches and exits the training loop, another group of workers (group B) starts processing another batch and still expects communication from group A to synchronize the gradients. This causes group B to wait for group A, which already completed training and does not have any gradients to synchronize. 

Therefore, when setting up your training dataset, make sure that each worker (rank) receives the same number of data samples, so that every rank processes the same number of batches and no rank is left waiting for gradient synchronization at the end of training.
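
To see how a stall can arise, the following sketch (plain Python with hypothetical numbers) computes how many batches each rank processes when the samples don't divide evenly across ranks:

```python
import math

def batches_per_rank(num_samples, world_size, batch_size):
    # Split samples across ranks; the first (num_samples % world_size)
    # ranks receive one extra sample when the split is uneven.
    extra = num_samples % world_size
    per_rank = [num_samples // world_size + (1 if r < extra else 0)
                for r in range(world_size)]
    return [math.ceil(n / batch_size) for n in per_rank]

print(batches_per_rank(1024, 8, 16))  # every rank runs 8 batches: no stall
print(batches_per_rank(1025, 8, 16))  # rank 0 runs 9 batches, the rest run 8: stall
```

In the second case, rank 0 enters a ninth batch iteration and waits for gradients from ranks that have already exited the training loop.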

## Observing scaling efficiency degradation due to Amazon FSx throughput bottlenecks
<a name="distributed-ts-data-parallel-fxs-bottleneck"></a>

One potential cause of lowered scaling efficiency is the FSx throughput limit. If you observe a sudden drop in scaling efficiency when you switch to a larger training cluster, try using a larger FSx for Lustre file system with a higher throughput limit. For more information, see [Aggregate file system performance](https://docs.aws.amazon.com/fsx/latest/LustreGuide/performance.html#fsx-aggregate-perf) and [Managing storage and throughput capacity](https://docs.aws.amazon.com/fsx/latest/LustreGuide/managing-storage-capacity.html) in the *Amazon FSx for Lustre User Guide*.
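
As a rough sizing aid, FSx for Lustre aggregate throughput scales linearly with provisioned storage capacity. The following sketch uses hypothetical capacity and per-unit throughput values; check the FSx for Lustre documentation for the actual throughput tiers of your deployment type:

```python
def fsx_aggregate_throughput_mbs(storage_tib, per_unit_mbs_per_tib):
    # Aggregate throughput grows linearly with storage capacity.
    return storage_tib * per_unit_mbs_per_tib

# Hypothetical example: a 4.8-TiB file system at 250 MB/s per TiB.
print(fsx_aggregate_throughput_mbs(4.8, 250))  # 1200.0 MB/s
```

If the training cluster's aggregate read demand exceeds this figure, increasing the storage capacity or the per-unit throughput tier raises the ceiling.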

## SageMaker AI distributed training job with PyTorch returns deprecation warnings
<a name="distributed-ts-data-parallel-deprecation-warnings"></a>

Starting with v1.4.0, the SageMaker AI distributed data parallelism library works as a backend of PyTorch distributed. Because of this breaking change in how you use the library with PyTorch, you might encounter a warning message that the `smdistributed` APIs for the PyTorch distributed package are deprecated. The warning message should be similar to the following:

```
smdistributed.dataparallel.torch.dist is deprecated in the SageMaker AI distributed data parallel library v1.4.0+.
Please use torch.distributed and specify 'smddp' as a backend when initializing process group as follows:
torch.distributed.init_process_group(backend='smddp')
For more information, see the library's API documentation at
https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html
```

In v1.4.0 and later, the library only needs to be imported once at the top of your training script and set as the backend during PyTorch distributed initialization. With this single line of backend specification, you can keep your PyTorch training script unchanged and directly use the PyTorch distributed modules. See [Use the SMDDP library in your PyTorch training script](data-parallel-modify-sdp-pt.md) to learn about the breaking changes and the new way to use the library with PyTorch.

# SageMaker AI data parallelism library release notes
<a name="data-parallel-release-notes"></a>

See the following release notes to track the latest updates for the SageMaker AI distributed data parallelism (SMDDP) library.

## The SageMaker AI distributed data parallelism library v2.5.0
<a name="data-parallel-release-notes-20241017"></a>

*Date: October 17, 2024*

**New features**
+ Added support for PyTorch v2.4.1 with CUDA v12.1.

**Integration into Docker containers distributed by the SageMaker AI model parallelism (SMP) library**

This version of the SMDDP library is migrated to [The SageMaker model parallelism library v2.6.0](model-parallel-release-notes.md#model-parallel-release-notes-20241017).

```
658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121
```

For Regions where the SMP Docker images are available, see [AWS Regions](distributed-model-parallel-support-v2.md#distributed-model-parallel-availablity-zone-v2).

**Binary file of this release**

You can download or install the library using the following URL.

```
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.4.1/cu121/2024-10-09/smdistributed_dataparallel-2.5.0-cp311-cp311-linux_x86_64.whl
```

## The SageMaker AI distributed data parallelism library v2.3.0
<a name="data-parallel-release-notes-20240611"></a>

*Date: June 11, 2024*

**New features**
+ Added support for PyTorch v2.3.0 with CUDA v12.1 and Python v3.11.
+ Added support for PyTorch Lightning v2.2.5. This is integrated into the SageMaker AI framework container for PyTorch v2.3.0.
+ Added instance type validation during import to prevent loading the SMDDP library on unsupported instance types. For a list of instance types compatible with the SMDDP library, see [Supported frameworks, AWS Regions, and instances types](distributed-data-parallel-support.md).

**Integration into SageMaker AI Framework Containers**

This version of the SMDDP library is migrated to the following [SageMaker AI Framework Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).
+ PyTorch v2.3.0

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker
  ```

For a complete list of versions of the SMDDP library and the pre-built containers, see [Supported frameworks, AWS Regions, and instances types](distributed-data-parallel-support.md).

**Binary file of this release**

You can download or install the library using the following URL.

```
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.3.0/cu121/2024-05-23/smdistributed_dataparallel-2.3.0-cp311-cp311-linux_x86_64.whl
```

**Other changes**
+ The SMDDP library v2.2.0 is integrated into the SageMaker AI framework container for PyTorch v2.2.0.

## The SageMaker AI distributed data parallelism library v2.2.0
<a name="data-parallel-release-notes-20240304"></a>

*Date: March 4, 2024*

**New features**
+ Added support for PyTorch v2.2.0 with CUDA v12.1.

**Integration into Docker containers distributed by the SageMaker AI model parallelism (SMP) library**

This version of the SMDDP library is migrated to [The SageMaker model parallelism library v2.2.0](model-parallel-release-notes.md#model-parallel-release-notes-20240307).

```
658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121
```

For Regions where the SMP Docker images are available, see [AWS Regions](distributed-model-parallel-support-v2.md#distributed-model-parallel-availablity-zone-v2).

**Binary file of this release**

You can download or install the library using the following URL.

```
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.2.0/cu121/2024-03-04/smdistributed_dataparallel-2.2.0-cp310-cp310-linux_x86_64.whl
```

## The SageMaker AI distributed data parallelism library v2.1.0
<a name="data-parallel-release-notes-20240301"></a>

*Date: March 1, 2024*

**New features**
+ Added support for PyTorch v2.1.0 with CUDA v12.1.

**Bug fixes**
+ Fixed the CPU memory leak issue in [SMDDP v2.0.1](#data-parallel-release-notes-20231207).

**Integration into SageMaker AI Framework Containers**

This version of the SMDDP library passed benchmark testing and is migrated to the following [SageMaker AI Framework Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).
+ PyTorch v2.1.0

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-sagemaker
  ```

**Integration into Docker containers distributed by the SageMaker AI model parallelism (SMP) library**

This version of the SMDDP library is migrated to [The SageMaker model parallelism library v2.1.0](model-parallel-release-notes.md#model-parallel-release-notes-20240206).

```
658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121
```

For Regions where the SMP Docker images are available, see [AWS Regions](distributed-model-parallel-support-v2.md#distributed-model-parallel-availablity-zone-v2).

**Binary file of this release**

You can download or install the library using the following URL.

```
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.1.0/cu121/2024-02-04/smdistributed_dataparallel-2.1.0-cp310-cp310-linux_x86_64.whl
```

## The SageMaker AI distributed data parallelism library v2.0.1
<a name="data-parallel-release-notes-20231207"></a>

*Date: December 7, 2023*

**New features**
+ Added a new SMDDP implementation of the `AllGather` collective operation, optimized for AWS compute resources and network infrastructure. To learn more, see [SMDDP `AllGather` collective operation](data-parallel-intro.md#data-parallel-allgather).
+ The SMDDP `AllGather` collective operation is compatible with PyTorch FSDP and DeepSpeed. To learn more, see [Use the SMDDP library in your PyTorch training script](data-parallel-modify-sdp-pt.md).
+ Added support for PyTorch v2.0.1.

**Known issues**
+ There's a CPU memory leak issue: CPU memory gradually increases while training with SMDDP `AllReduce` in DDP mode.

**Integration into SageMaker AI Framework Containers**

This version of the SMDDP library passed benchmark testing and is migrated to the following [SageMaker AI Framework Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).
+ PyTorch v2.0.1

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker
  ```

**Binary file of this release**

You can download or install the library using the following URL.

```
https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl
```

**Other changes**
+ Starting from this release, documentation for the SMDDP library is fully available in this *Amazon SageMaker AI Developer Guide*. In favor of this complete developer guide for SMDDP v2, the [additional reference for SMDDP v1.x](https://sagemaker.readthedocs.io/en/stable/api/training/smd_data_parallel.html) in the *SageMaker AI Python SDK documentation* is no longer supported. If you still need the SMDDP v1.x documentation, see the following snapshot of the documentation at [SageMaker Python SDK v2.212.0 documentation](https://sagemaker.readthedocs.io/en/v2.212.0/api/training/distributed.html#the-sagemaker-distributed-data-parallel-library).

# SageMaker model parallelism library v2
<a name="model-parallel-v2"></a>

**Note**  
Since the release of the SageMaker model parallelism (SMP) library v2.0.0 on December 19, 2023, this documentation is renewed for the SMP library v2. For previous versions of the SMP library, see [(Archived) SageMaker model parallelism library v1.x](model-parallel.md).

The Amazon SageMaker AI model parallelism library is a capability of SageMaker AI that enables high performance and optimized large scale training on SageMaker AI accelerated compute instances. The [Core features of the SageMaker model parallelism library v2](model-parallel-core-features-v2.md) include techniques and optimizations to accelerate and simplify large model training, such as hybrid sharded data parallelism, tensor parallelism, activation checkpointing, and activation offloading. You can use the SMP library to accelerate the training and fine-tuning of large language models (LLMs), large vision models (LVMs), and foundation models (FMs) with hundreds of billions of parameters.

The SageMaker model parallelism library v2 (SMP v2) aligns the library’s APIs and methods with open source PyTorch Fully Sharded Data Parallelism (FSDP), which gives you the benefit of SMP performance optimizations with minimal code changes. With SMP v2, you can improve the computational performance of training a state-of-the-art large model on SageMaker AI by bringing your PyTorch FSDP training scripts to SageMaker AI.

You can use SMP v2 for the general [SageMaker Training](train-model.md) jobs and distributed training workloads on [Amazon SageMaker HyperPod](sagemaker-hyperpod.md) clusters.

**Topics**
+ [Model parallelism concepts](model-parallel-intro-v2.md)
+ [Supported frameworks and AWS Regions](distributed-model-parallel-support-v2.md)
+ [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md)
+ [Core features of the SageMaker model parallelism library v2](model-parallel-core-features-v2.md)
+ [Amazon SageMaker AI model parallelism library v2 examples](distributed-model-parallel-v2-examples.md)
+ [SageMaker distributed model parallelism best practices](model-parallel-best-practices-v2.md)
+ [The SageMaker model parallel library v2 reference](distributed-model-parallel-v2-reference.md)
+ [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md)
+ [(Archived) SageMaker model parallelism library v1.x](model-parallel.md)

# Model parallelism concepts
<a name="model-parallel-intro-v2"></a>

Model parallelism is a distributed training method in which the deep learning (DL) model is partitioned across multiple GPUs and instances. The SageMaker model parallel library v2 (SMP v2) is compatible with the native PyTorch APIs and capabilities, which makes it convenient to adapt your PyTorch Fully Sharded Data Parallel (FSDP) training script to the SageMaker Training platform and take advantage of the performance improvements that SMP v2 provides. This introduction page provides a high-level overview of model parallelism and describes how it can help overcome issues that arise when training DL models, which are typically very large. It also provides examples of what the SageMaker model parallel library offers to help you manage model parallel strategies and memory consumption.

## What is model parallelism?
<a name="model-parallel-what-is-v2"></a>

Increasing the size of deep learning models (layers and parameters) yields better accuracy for complex tasks such as computer vision and natural language processing. However, there is a limit to the maximum model size you can fit in the memory of a single GPU. When training DL models, GPU memory limitations can be bottlenecks in the following ways:
+ They limit the size of the model that you can train, because the memory footprint of a model scales proportionally to the number of parameters.
+ They limit the per-GPU batch size during training, driving down GPU utilization and training efficiency.

To overcome the limitations associated with training a model on a single GPU, SageMaker AI provides the model parallel library to help distribute and train DL models efficiently on multiple compute nodes. Furthermore, with the library, you can achieve optimized distributed training using EFA-supported devices, which enhance the performance of inter-node communication with low latency, high throughput, and OS bypass.

## Estimate memory requirements before using model parallelism
<a name="model-parallel-intro-estimate-memory-requirements-v2"></a>

Before you use the SageMaker model parallel library, consider the following to get a sense of the memory requirements of training large DL models.

For a training job that uses automatic mixed precision such as `float16` (FP16) or `bfloat16` (BF16) and Adam optimizers, the required GPU memory per parameter is about 20 bytes, which we can break down as follows:
+ An FP16 or BF16 parameter: 2 bytes
+ An FP16 or BF16 gradient: 2 bytes
+ An FP32 optimizer state: 8 bytes, based on the Adam optimizers
+ An FP32 copy of the parameter: 4 bytes (needed for the `optimizer apply` (OA) operation)
+ An FP32 copy of the gradient: 4 bytes (needed for the OA operation)

Even a relatively small DL model with 10 billion parameters can require at least 200 GB of memory, which is much larger than the GPU memory available on a single device (for example, an NVIDIA A100 with 40 GB or 80 GB of memory). On top of the memory requirements for model and optimizer states, there are other memory consumers, such as the activations generated in the forward pass, so the memory required can be much greater than 200 GB.
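
The 20-bytes-per-parameter estimate above can be turned into a quick back-of-the-envelope check. This is a sketch of the lower bound only; real jobs also need memory for activations, buffers, and CUDA context:

```python
def min_training_memory_gb(num_params):
    # Per-parameter cost from the breakdown above:
    # 2 (FP16/BF16 param) + 2 (FP16/BF16 grad) + 8 (Adam states)
    # + 4 (FP32 param copy) + 4 (FP32 grad copy)
    bytes_per_param = 2 + 2 + 8 + 4 + 4  # 20 bytes
    return num_params * bytes_per_param / 1e9

print(min_training_memory_gb(10e9))  # 200.0 GB for a 10-billion-parameter model
```

Comparing this figure against the per-GPU memory of your instance type gives a first indication of whether model parallelism is required.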

For distributed training, we recommend that you use Amazon EC2 P4 and P5 instances that have NVIDIA A100 and H100 Tensor Core GPUs respectively. For more details about specifications such as CPU cores, RAM, attached storage volume, and network bandwidth, see the *Accelerated Computing* section in the [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/) page. For instance types that SMP v2 supports, see [Supported instance types](distributed-model-parallel-support-v2.md#distributed-model-parallel-supported-instance-types-v2).

Even with the accelerated computing instances, models with about 10 billion parameters, such as Megatron-LM and T5, and even larger models with hundreds of billions of parameters, such as GPT-3, cannot fit a model replica in the memory of a single GPU device.

## How the library employs model parallelism and memory saving techniques
<a name="model-parallel-intro-features-v2"></a>

The library consists of various types of model parallelism features and memory-saving features such as optimizer state sharding, activation checkpointing, and activation offloading. All these techniques can be combined to efficiently train large models that consist of hundreds of billions of parameters.

**Topics**
+ [Sharded data parallelism](#model-parallel-intro-sdp-v2)
+ [Expert parallelism](#model-parallel-intro-expert-parallelism-v2)
+ [Tensor parallelism](#model-parallel-intro-tp-v2)
+ [Activation checkpointing and offloading](#model-parallel-intro-activation-offloading-checkpointing-v2)
+ [Choosing the right techniques for your model](#model-parallel-intro-choosing-techniques-v2)

### Sharded data parallelism
<a name="model-parallel-intro-sdp-v2"></a>

*Sharded data parallelism* is a memory-saving distributed training technique that splits the state of a model (model parameters, gradients, and optimizer states) across GPUs within a data-parallel group.

SMP v2 implements sharded data parallelism through FSDP, and extends it to implement the scale-aware hybrid sharding strategy discussed in the blog post [Near-linear scaling of gigantic-model training on AWS](https://www.amazon.science/blog/near-linear-scaling-of-gigantic-model-training-on-aws).

You can apply sharded data parallelism to your model as a standalone strategy. Furthermore, if you are using the most performant GPU instances equipped with NVIDIA A100 Tensor Core GPUs, `ml.p4d.24xlarge` and `ml.p4de.24xlarge`, you can take advantage of the improved training speed from the `AllGather` operation offered by the [SageMaker data parallelism (SMDDP) library](data-parallel.md).

To dive deep into sharded data parallelism and learn how to set it up or use a combination of sharded data parallelism with other techniques like tensor parallelism and mixed precision training, see [Hybrid sharded data parallelism](model-parallel-core-features-v2-sharded-data-parallelism.md).
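
Conceptually, sharding assigns each rank ownership of a slice of the flattened parameter list. The following plain-Python sketch is illustrative only, not the library's implementation, and shows one contiguous-shard scheme:

```python
def shard_range(num_params, world_size, rank):
    # Each rank owns a contiguous shard; the last shard may be shorter.
    per = -(-num_params // world_size)  # ceiling division
    start = rank * per
    return start, min(start + per, num_params)

# Hypothetical example: 10 parameters sharded across 4 ranks.
print([shard_range(10, 4, r) for r in range(4)])  # [(0, 3), (3, 6), (6, 9), (9, 10)]
```

Because each rank stores only its shard of the parameters, gradients, and optimizer states, per-GPU memory drops roughly in proportion to the sharding degree, at the cost of extra collective communication.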

### Expert parallelism
<a name="model-parallel-intro-expert-parallelism-v2"></a>

SMP v2 integrates with [NVIDIA Megatron](https://github.com/NVIDIA/Megatron-LM) for implementing *expert parallelism* on top of its support for the native PyTorch FSDP APIs. You can keep your PyTorch FSDP training code as is and apply SMP expert parallelism for training *Mixture of Experts* (MoE) models within SageMaker AI.

An MoE model is a type of transformer model that consists of multiple *experts*, each of which is a neural network, typically a feed-forward network (FFN). A gate network called a *router* determines which tokens are sent to which expert. The experts specialize in processing specific aspects of the input data, which enables the model to train faster and at lower compute cost while achieving the same performance quality as its dense counterpart. *Expert parallelism* is a parallelism technique that splits the experts of an MoE model across GPU devices.
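
A router can be sketched in a few lines of plain Python. This shows top-1 routing with made-up gate scores; real MoE routers are learned networks that also include load-balancing terms:

```python
def route_tokens(gate_scores):
    # Top-1 routing: send each token to the expert with the highest gate score.
    return [max(range(len(scores)), key=scores.__getitem__)
            for scores in gate_scores]

# Hypothetical gate scores for 3 tokens over 2 experts.
print(route_tokens([[0.1, 2.0], [3.0, -1.0], [0.5, 0.4]]))  # [1, 0, 0]
```

Under expert parallelism, the tokens assigned to each expert index would then be dispatched to the GPU that hosts that expert.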

To learn how to train MoE models with SMP v2, see [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md).

### Tensor parallelism
<a name="model-parallel-intro-tp-v2"></a>

*Tensor parallelism* splits individual layers, or `nn.Modules`, across devices to run in parallel. The following figure shows the simplest example of how the SMP library splits a model with four layers to achieve two-way tensor parallelism (`"tensor_parallel_degree": 2`). In the following figure, the notations for model parallel group, tensor parallel group, and data parallel group are `MP_GROUP`, `TP_GROUP`, and `DP_GROUP` respectively. The layers of each model replica are bisected and distributed into two GPUs. The library manages communication across the tensor-distributed model replicas.

![\[Simplest example of how the SMP library splits a model with four layers to achieve two-way tensor parallelism ("tensor_parallel_degree": 2).\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/smp-v2-tensor-parallel.png)


To dive deep into tensor parallelism and other memory-saving features for PyTorch, and to learn how to set a combination of the core features, see [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md).
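
The idea of splitting a layer across devices can be illustrated with a column-wise split of a weight matrix. This is a conceptual sketch in plain Python; SMP performs the split on GPU tensors and manages the communication for you:

```python
def split_columns(weight, tp_degree):
    # Column-wise split of a layer's weight matrix across tp_degree devices.
    cols = len(weight[0]) // tp_degree
    return [[row[p * cols:(p + 1) * cols] for row in weight]
            for p in range(tp_degree)]

# Hypothetical 2x4 weight matrix split two ways ("tensor_parallel_degree": 2).
shards = split_columns([[1, 2, 3, 4], [5, 6, 7, 8]], 2)
print(shards)  # [[[1, 2], [5, 6]], [[3, 4], [7, 8]]]
```

Each device then multiplies its shard against the (replicated) input and the partial outputs are combined with a collective operation.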

### Activation checkpointing and offloading
<a name="model-parallel-intro-activation-offloading-checkpointing-v2"></a>

To save GPU memory, the library supports activation checkpointing to avoid storing internal activations in the GPU memory for user-specified modules during the forward pass. The library recomputes these activations during the backward pass. In addition, with activation offloading, it offloads the stored activations to CPU memory and fetches them back to GPU during the backward pass to further reduce the activation memory footprint. For more information about how to use these features, see [Activation checkpointing](model-parallel-core-features-v2-pytorch-activation-checkpointing.md) and [Activation offloading](model-parallel-core-features-v2-pytorch-activation-offloading.md).
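
The bookkeeping behind activation checkpointing can be sketched as follows. This is illustrative only; the library recomputes the discarded activations during the backward pass, which this toy forward pass does not show:

```python
def forward_with_checkpoints(layers, x, checkpoint_every):
    # Keep inputs only at checkpoint boundaries; everything in between
    # would be recomputed from the nearest checkpoint in the backward pass.
    saved = {}
    for i, layer in enumerate(layers):
        if i % checkpoint_every == 0:
            saved[i] = x
        x = layer(x)
    return x, saved

# Hypothetical 4-layer "model" of increments, checkpointing every 2 layers.
out, saved = forward_with_checkpoints([lambda v: v + 1] * 4, 0, 2)
print(out, sorted(saved))  # 4 [0, 2]
```

Only two of the four intermediate activations are retained; activation offloading goes a step further by moving the `saved` entries to CPU memory until the backward pass needs them.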

### Choosing the right techniques for your model
<a name="model-parallel-intro-choosing-techniques-v2"></a>

For more information about choosing the right techniques and configurations, see [SageMaker distributed model parallelism best practices](model-parallel-best-practices-v2.md).

# Supported frameworks and AWS Regions
<a name="distributed-model-parallel-support-v2"></a>

Before using the SageMaker model parallelism library v2 (SMP v2), check the supported frameworks and instance types and determine if there are enough quotas in your AWS account and AWS Region.

**Note**  
To check the latest updates and release notes of the library, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md).

## Supported frameworks
<a name="distributed-model-parallel-supported-frameworks-v2"></a>

SMP v2 supports the following deep learning frameworks, which are available through SMP Docker containers and an SMP Conda channel. When you use the framework estimator classes in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers. To use SMP v2, we recommend that you always keep the SageMaker Python SDK up to date in your development environment.

**PyTorch versions that the SageMaker model parallelism library supports**

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/distributed-model-parallel-support-v2.html)

**SMP Conda channel**

The following Amazon S3 bucket is a public Conda channel hosted by the SMP service team. If you want to install the SMP v2 library in an environment such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.

```
https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/
```

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.

**Note**  
To find previous versions of the SMP library v1.x and pre-packaged DLCs, see [Supported Frameworks](distributed-model-parallel-support.md#distributed-model-parallel-supported-frameworks) in the *SMP v1 documentation*.

### Use SMP v2 with open source libraries
<a name="distributed-model-parallel-supported-frameworks-v2-open-source"></a>

The SMP v2 library works with other PyTorch-based open source libraries such as PyTorch Lightning, Hugging Face Transformers, and Hugging Face Accelerate, because SMP v2 is compatible with the PyTorch FSDP APIs. If you have further questions on using the SMP library with other third party libraries, contact the SMP service team at `sm-model-parallel-feedback@amazon.com`.

## AWS Regions
<a name="distributed-model-parallel-availablity-zone-v2"></a>

SMP v2 is available in the following AWS Regions. If you'd like to use the SMP Docker image URIs or the SMP Conda channel, check the following list, choose the AWS Region that matches yours, and update the image URI or the channel URL accordingly.
+ ap-northeast-1
+ ap-northeast-2
+ ap-northeast-3
+ ap-south-1
+ ap-southeast-1
+ ap-southeast-2
+ ca-central-1
+ eu-central-1
+ eu-north-1
+ eu-west-1
+ eu-west-2
+ eu-west-3
+ sa-east-1
+ us-east-1
+ us-east-2
+ us-west-1
+ us-west-2

## Supported instance types
<a name="distributed-model-parallel-supported-instance-types-v2"></a>

SMP v2 requires one of the following ML instance types.


| Instance type | 
| --- | 
| ml.p4d.24xlarge | 
| ml.p4de.24xlarge | 
| ml.p5.48xlarge | 
| ml.p5e.48xlarge | 

**Tip**  
Starting with SMP v2.2.0, which supports PyTorch v2.2.0 and later, [Mixed precision training with FP8 on P5 instances using Transformer Engine](model-parallel-core-features-v2-mixed-precision.md#model-parallel-core-features-v2-mixed-precision-fp8-training-on-p5) is available.

For specs of the SageMaker machine learning instance types in general, see the **Accelerated Computing** section in the [Amazon EC2 Instance Types page](https://aws.amazon.com/ec2/instance-types/). For information about instance pricing, see [Amazon SageMaker AI Pricing](https://aws.amazon.com/sagemaker/pricing/).

If you encounter an error message similar to the following, follow the instructions at [Requesting a quota increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html) in the *AWS Service Quotas User Guide*.

```
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling
    the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge
    for training job usage' is 0 Instances, with current utilization of 0 Instances
    and a request delta of 1 Instances.
    Please contact AWS support to request an increase for this limit.
```

# Use the SageMaker model parallelism library v2
<a name="model-parallel-use-api-v2"></a>

On this page, you'll learn how to use the SageMaker model parallelism library v2 APIs and get started with running a PyTorch Fully Sharded Data Parallel (FSDP) training job in the SageMaker Training platform or on a SageMaker HyperPod cluster.

There are various scenarios for running a PyTorch training job with SMP v2.

1. For SageMaker training, use one of the pre-built SageMaker Framework Containers for PyTorch v2.0.1 and later, which are pre-packaged with SMP v2.

1. Use the SMP v2 binary file to set up a Conda environment for running a distributed training workload on a SageMaker HyperPod cluster.

1. Extend the pre-built SageMaker Framework Containers for PyTorch v2.0.1 and later to install any additional functional requirements for your use case. To learn how to extend a pre-built container, see [Extend a Pre-built Container](prebuilt-containers-extend.md).

1. You can also bring your own Docker container, manually set up the SageMaker Training environment using the [SageMaker Training toolkit](https://github.com/aws/sagemaker-training-toolkit), and install the SMP v2 binary file. This is the least recommended option due to the complexity of dependencies. To learn how to run your own Docker container, see [Adapting Your Own Training Container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html).

This getting started guide covers the first two scenarios.

**Topics**
+ [Step 1: Adapt your PyTorch FSDP training script](#model-parallel-adapt-pytorch-script-v2)
+ [Step 2: Launch a training job](#model-parallel-launch-a-training-job-v2)

## Step 1: Adapt your PyTorch FSDP training script
<a name="model-parallel-adapt-pytorch-script-v2"></a>

To activate and configure the SMP v2 library, start by importing and adding the `torch.sagemaker.init()` module at the top of the script. This module takes in the SMP configuration dictionary of [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config) that you'll prepare in [Step 2: Launch a training job](#model-parallel-launch-a-training-job-v2). Also, to use the various core features offered by SMP v2, you might need to make a few more changes to adapt your training script. More detailed instructions on adapting your training script to use the SMP v2 core features are provided at [Core features of the SageMaker model parallelism library v2](model-parallel-core-features-v2.md).

------
#### [ SageMaker Training ]

In your training script, add the following two lines of code, which are the minimal requirement to start training with SMP v2. In [Step 2: Launch a training job](#model-parallel-launch-a-training-job-v2), you'll set up an object of the SageMaker `PyTorch` estimator class with an SMP configuration dictionary through the `distribution` argument of the estimator class.

```
import torch.sagemaker as tsm
tsm.init()
```

**Note**  
You can also directly pass a configuration dictionary of the [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config) to the `torch.sagemaker.init()` module. However, the parameters passed to the PyTorch estimator in [Step 2: Launch a training job](#model-parallel-launch-a-training-job-v2) take priority and override the ones specified to the `torch.sagemaker.init()` module.

------
#### [ SageMaker HyperPod ]

In your training script, add the following two lines of code. In [Step 2: Launch a training job](#model-parallel-launch-a-training-job-v2), you'll set up an `smp_config.json` file that defines the SMP configurations in JSON format, and upload it to a storage location or file system mapped to your SageMaker HyperPod cluster. We recommend that you keep the configuration file in the same directory where you upload your training script.

```
import torch.sagemaker as tsm
tsm.init("/dir_to_training_files/smp_config.json")
```

**Note**  
You can also directly pass a configuration dictionary of the [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config) into the `torch.sagemaker.init()` module.

------

## Step 2: Launch a training job
<a name="model-parallel-launch-a-training-job-v2"></a>

Learn how to configure SMP distribution options for launching a PyTorch FSDP training job with SMP core features.

------
#### [ SageMaker Training ]

When you set up a training job launcher object of the [PyTorch framework estimator](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html) class in the SageMaker Python SDK, configure [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config) through `distribution` argument as follows.

**Note**  
The `distribution` configuration for SMP v2 is integrated in the SageMaker Python SDK starting from v2.200. Make sure that you use the SageMaker Python SDK v2.200 or later.

**Note**  
In SMP v2, you should configure `smdistributed` with `torch_distributed` for the `distribution` argument of the SageMaker `PyTorch` estimator. With `torch_distributed`, SageMaker AI runs `torchrun`, which is the default multi-node job launcher of [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html).

```
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    framework_version="2.2.0",
    py_version="py310",
    # image_uri="<smp-docker-image-uri>" # For using prior versions, specify the SMP image URI directly.
    entry_point="your-training-script.py", # Pass the training script you adapted with SMP from Step 1.
    ... # Configure other required and optional parameters
    distribution={
        "torch_distributed": { "enabled": True },
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "hybrid_shard_degree": Integer,
                    "sm_activation_offloading": Boolean,
                    "activation_loading_horizon": Integer,
                    "fsdp_cache_flush_warnings": Boolean,
                    "allow_empty_shards": Boolean,
                    "tensor_parallel_degree": Integer,
                    "expert_parallel_degree": Integer,
                    "random_seed": Integer
                }
            }
        }
    }
)
```

**Important**  
To use one of the prior versions of PyTorch or SMP instead of the latest, specify the SMP Docker image directly using the `image_uri` argument instead of the `framework_version` and `py_version` pair, as in the following example.

```
estimator = PyTorch(
    ...,
    image_uri="658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121"
)
```
To find SMP Docker image URIs, see [Supported frameworks](distributed-model-parallel-support-v2.md#distributed-model-parallel-supported-frameworks-v2).

------
#### [ SageMaker HyperPod ]

Before you start, make sure that the following prerequisites are met.
+ An Amazon FSx shared directory mounted to your HyperPod cluster (`/fsx`).
+ Conda installed in the FSx shared directory. To learn how to install Conda, use the instructions at [Installing on Linux](https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html) in the *Conda User Guide*.
+ `cuda11.8` or `cuda12.1` installed on the head and compute nodes of your HyperPod cluster.

If the prerequisites are all met, proceed to the following instructions on launching a workload with SMP v2 on a HyperPod cluster.

1. Prepare an `smp_config.json` file that contains a dictionary of [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config). Make sure that you upload this JSON file to where you store your training script, or the path you specified to the `torch.sagemaker.init()` module in [Step 1](#model-parallel-adapt-pytorch-script-v2). If you’ve already passed the configuration dictionary to the `torch.sagemaker.init()` module in the training script in [Step 1](#model-parallel-adapt-pytorch-script-v2), you can skip this step. 

   ```
   // smp_config.json
   {
       "hybrid_shard_degree": Integer,
       "sm_activation_offloading": Boolean,
       "activation_loading_horizon": Integer,
       "fsdp_cache_flush_warnings": Boolean,
       "allow_empty_shards": Boolean,
       "tensor_parallel_degree": Integer,
       "expert_parallel_degree": Integer,
       "random_seed": Integer
   }
   ```

1. Upload the `smp_config.json` file to a directory in your file system. The directory path must match with the path you specified in [Step 1](#model-parallel-adapt-pytorch-script-v2). If you’ve already passed the configuration dictionary to the `torch.sagemaker.init()` module in the training script, you can skip this step.

1. On the compute nodes of your cluster, start a terminal session with the following command.

   ```
   sudo su -l ubuntu
   ```

1. Create a Conda environment on the compute nodes. The following code is an example script of creating a Conda environment and installing SMP, [SMDDP](data-parallel.md), CUDA, and other dependencies.

   ```
   # Run on compute nodes
   SMP_CUDA_VER=<11.8 or 12.1>
   
   source /fsx/<path_to_miniconda>/miniconda3/bin/activate
   
   export ENV_PATH=/fsx/<path_to_miniconda>/miniconda3/envs/<ENV_NAME>
   conda create -p ${ENV_PATH} python=3.10
   
   conda activate ${ENV_PATH}
   
   # Verify aws-cli is installed: Expect something like "aws-cli/2.15.0*"
   aws --version
   # Install aws-cli if not already installed
   # https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html#cliv2-linux-install
   
   # Install the SMP library
   conda install pytorch="2.0.1=sm_py3.10_cuda${SMP_CUDA_VER}*" packaging --override-channels \
     -c https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-2.0.0-pt-2.0.1/2023-12-11/smp-v2/ \
     -c pytorch -c numba/label/dev \
     -c nvidia -c conda-forge
   
   # Install dependencies of the script as below
   python -m pip install packaging transformers==4.31.0 accelerate ninja tensorboard h5py datasets \
       && python -m pip install expecttest hypothesis \
       && python -m pip install "flash-attn>=2.0.4" --no-build-isolation
   
   # Install the SMDDP wheel
   SMDDP_WHL="smdistributed_dataparallel-2.0.2-cp310-cp310-linux_x86_64.whl" \
     && wget -q https://smdataparallel.s3.amazonaws.com/binary/pytorch/2.0.1/cu118/2023-12-07/${SMDDP_WHL} \
     && pip install --force-reinstall ${SMDDP_WHL} \
     && rm ${SMDDP_WHL}
   
   # cuDNN installation for Transformer Engine installation for CUDA 11.8
   # Please download from the link below; you need to agree to terms
   # https://developer.nvidia.com/downloads/compute/cudnn/secure/8.9.5/local_installers/11.x/cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz
   
   tar xf cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz \
       && rm -rf /usr/local/cuda-$SMP_CUDA_VER/include/cudnn* /usr/local/cuda-$SMP_CUDA_VER/lib/cudnn* \
       && cp ./cudnn-linux-x86_64-8.9.5.30_cuda11-archive/include/* /usr/local/cuda-$SMP_CUDA_VER/include/ \
       && cp ./cudnn-linux-x86_64-8.9.5.30_cuda11-archive/lib/* /usr/local/cuda-$SMP_CUDA_VER/lib/ \
       && rm -rf cudnn-linux-x86_64-8.9.5.30_cuda11-archive.tar.xz \
       && rm -rf cudnn-linux-x86_64-8.9.5.30_cuda11-archive/
   
   # cuDNN installation for Transformer Engine installation for CUDA 12.1
   # Please download from the link below; you need to agree to terms
   # https://developer.download.nvidia.com/compute/cudnn/secure/8.9.7/local_installers/12.x/cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz
   tar xf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz \
       && rm -rf /usr/local/cuda-$SMP_CUDA_VER/include/cudnn* /usr/local/cuda-$SMP_CUDA_VER/lib/cudnn* \
       && cp ./cudnn-linux-x86_64-8.9.7.29_cuda12-archive/include/* /usr/local/cuda-$SMP_CUDA_VER/include/ \
       && cp ./cudnn-linux-x86_64-8.9.7.29_cuda12-archive/lib/* /usr/local/cuda-$SMP_CUDA_VER/lib/ \
       && rm -rf cudnn-linux-x86_64-8.9.7.29_cuda12-archive.tar.xz \
       && rm -rf cudnn-linux-x86_64-8.9.7.29_cuda12-archive/
       
   # TransformerEngine installation
   export CUDA_HOME=/usr/local/cuda-$SMP_CUDA_VER
   export CUDNN_PATH=/usr/local/cuda-$SMP_CUDA_VER/lib
   export CUDNN_LIBRARY=/usr/local/cuda-$SMP_CUDA_VER/lib
   export CUDNN_INCLUDE_DIR=/usr/local/cuda-$SMP_CUDA_VER/include
   export PATH=/usr/local/cuda-$SMP_CUDA_VER/bin:$PATH
   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-$SMP_CUDA_VER/lib
   
   python -m pip install --no-build-isolation git+https://github.com/NVIDIA/TransformerEngine.git@v1.0
   ```

1. Run a test training job.

   1. In the shared file system (`/fsx`), clone the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/), and go to the `3.test_cases/11.modelparallel` folder.

      ```
      git clone https://github.com/aws-samples/awsome-distributed-training/
      cd awsome-distributed-training/3.test_cases/11.modelparallel
      ```

   1. Submit a job using `sbatch` as follows.

      ```
      conda activate <ENV_PATH>
      sbatch -N 16 conda_launch.sh
      ```

      If the job submission is successful, the output message of this `sbatch` command should be similar to `Submitted batch job ABCDEF`.

   1. Check the log file in the current directory under `logs/`.

      ```
      tail -f ./logs/fsdp_smp_ABCDEF.out
      ```

------

# Core features of the SageMaker model parallelism library v2
<a name="model-parallel-core-features-v2"></a>

The Amazon SageMaker AI model parallelism library v2 (SMP v2) offers distribution strategies and memory-saving techniques, such as sharded data parallelism, tensor parallelism, and checkpointing. The model parallelism strategies and techniques offered by SMP v2 help distribute large models across multiple devices while optimizing training speed and memory consumption. SMP v2 also provides the Python package `torch.sagemaker`, which helps you adapt your training script with just a few lines of code change.

This guide follows the basic two-step flow introduced in [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md). To dive deep into the core features of SMP v2 and how to use them, see the following topics.

**Note**  
These core features are available in SMP v2.0.0 and later and the SageMaker Python SDK v2.200.0 and later, and work with PyTorch v2.0.1 and later. To check the versions of the packages, see [Supported frameworks and AWS Regions](distributed-model-parallel-support-v2.md).

**Topics**
+ [Hybrid sharded data parallelism](model-parallel-core-features-v2-sharded-data-parallelism.md)
+ [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md)
+ [Context parallelism](model-parallel-core-features-v2-context-parallelism.md)
+ [Compatibility with the SMDDP library optimized for AWS infrastructure](model-parallel-core-features-v2-smddp-allgather.md)
+ [Mixed precision training](model-parallel-core-features-v2-mixed-precision.md)
+ [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md)
+ [Activation checkpointing](model-parallel-core-features-v2-pytorch-activation-checkpointing.md)
+ [Activation offloading](model-parallel-core-features-v2-pytorch-activation-offloading.md)
+ [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md)
+ [Fine-tuning](model-parallel-core-features-v2-fine-tuning.md)
+ [FlashAttention](model-parallel-core-features-v2-flashattention.md)
+ [Checkpointing using SMP](model-parallel-core-features-v2-checkpoints.md)

# Hybrid sharded data parallelism
<a name="model-parallel-core-features-v2-sharded-data-parallelism"></a>

*Sharded data parallelism* is a memory-saving distributed training technique that splits the state of a model (model parameters, gradients, and optimizer states) across devices. This helps you fit a larger model or increase the batch size using the freed-up GPU memory. The SMP library offers the capability of running sharded data parallelism with PyTorch Fully Sharded Data Parallel (FSDP). By default, PyTorch FSDP shards across the whole set of GPUs being used. In SMP v2, the library offers this sharded data parallelism on top of PyTorch FSDP by extending PyTorch hybrid sharding (`HYBRID_SHARD`), which is one of the [sharding strategies provided by PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy): `FULL_SHARD`, `SHARD_GRAD_OP`, `HYBRID_SHARD`, `_HYBRID_SHARD_ZERO2`. Extending hybrid sharding in this manner helps implement *scale-aware sharding* as described in the blog [Near-linear scaling of gigantic-model training on AWS](https://www.amazon.science/blog/near-linear-scaling-of-gigantic-model-training-on-aws) for PyTorch FSDP.

The SMP library makes it easy to use `HYBRID_SHARD` and `_HYBRID_SHARD_ZERO2` across any configurable number of GPUs, extending the native PyTorch FSDP that supports sharding across a single node (`HYBRID_SHARD`) or all GPUs (`FULL_SHARD`). PyTorch FSDP calls can stay as is, and you only need to add the `hybrid_shard_degree` argument to the SMP configuration, as shown in the following code example. You don't need to change the value of the `sharding_strategy` argument in the PyTorch FSDP wrapper around your PyTorch model. You can pass `ShardingStrategy.HYBRID_SHARD` as the value. Alternatively, the SMP library overrides the strategy in the script and sets it to `ShardingStrategy.HYBRID_SHARD` if you specify a value equal to or greater than 2 to the `hybrid_shard_degree` parameter.

The following code snippets show how to add the SMP initialization module `torch.sagemaker.init()` to your training script and set up the SMP configuration dictionary in JSON format for training job launcher while following the two-step process introduced in [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md). You don’t need to make any changes to your PyTorch model or [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp) configuration. For more information about the `hybrid_shard_degree` parameter, see [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config).

**SMP configuration dictionary**

```
{ "hybrid_shard_degree": 16 }
```

**In training script**

```
import torch.sagemaker as tsm
tsm.init()

# Set up a PyTorch model
model = ...

# Wrap the PyTorch model using the PyTorch FSDP module
model = FSDP(
    model,
    ...
)

# Optimizer needs to be created after FSDP wrapper
optimizer = ...
```
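To build intuition for what `hybrid_shard_degree` does, the following sketch (illustrative only, not SMP internals; the `hybrid_shard_groups` helper is hypothetical) computes how ranks are grouped: model states are sharded within each group of `hybrid_shard_degree` ranks, and the groups replicate each other.

```python
# Illustrative sketch: how hybrid sharding groups ranks.
# With hybrid_shard_degree=16 on 64 GPUs, model states are sharded
# within groups of 16 ranks and replicated across the 4 groups.

def hybrid_shard_groups(world_size: int, hybrid_shard_degree: int):
    """Return the list of sharding groups (each group holds one full model replica)."""
    if world_size % hybrid_shard_degree != 0:
        raise ValueError("hybrid_shard_degree must evenly divide world_size")
    return [
        list(range(start, start + hybrid_shard_degree))
        for start in range(0, world_size, hybrid_shard_degree)
    ]

groups = hybrid_shard_groups(world_size=64, hybrid_shard_degree=16)
print(len(groups))    # 4 replication groups
print(groups[0][:4])  # first ranks of the first sharding group: [0, 1, 2, 3]
```

Setting `hybrid_shard_degree` to the world size recovers `FULL_SHARD` behavior, while setting it to the GPUs per node corresponds to PyTorch's single-node `HYBRID_SHARD`.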

# Expert parallelism
<a name="model-parallel-core-features-v2-expert-parallelism"></a>

A *Mixture of Experts* (MoE) model is a type of transformer model that employs a *sparse* approach, making it lighter to train than traditional dense models. In this MoE neural network architecture, only a subset of the model's components, called *experts*, is used for each input. This approach offers several advantages, including more efficient training and faster inference, even with a larger model size. In other words, with the same compute budget as for training a full dense model, you can fit a larger model or dataset when using MoE.

An MoE model consists of multiple *experts*, each consisting of a neural network, typically a feed-forward network (FFN). A gate network called a *router* determines which tokens are sent to which expert. These experts specialize in processing specific aspects of the input data, enabling the model to train faster and at lower compute cost while achieving the same performance quality as its dense counterpart. To learn more about Mixture of Experts in general, refer to the blog [Applying Mixture of Experts in LLM Architectures](https://developer.nvidia.com/blog/applying-mixture-of-experts-in-llm-architectures/) on the *NVIDIA developer website*.
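The routing idea can be sketched in a few lines. This is a toy top-1 router for illustration only; real routers (such as the sinkhorn-based load balancing SMP supports) use learned gate weights and more sophisticated balancing.

```python
# Toy top-1 MoE router (illustrative only). Each token's score vector
# decides which single expert processes it, so only a subset of experts
# runs per token -- the "sparse" part of MoE.

def route_top1(token_scores):
    """token_scores: per-token list of scores, one score per expert.
    Returns the chosen expert index for each token."""
    return [max(range(len(scores)), key=scores.__getitem__) for scores in token_scores]

# 4 tokens, 2 experts: each token goes to its highest-scoring expert.
scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
print(route_top1(scores))  # [0, 1, 0, 1]
```

With expert parallelism, the tokens routed to a given expert are then dispatched to whichever device hosts that expert.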

*Expert parallelism* is a type of parallelism that handles splitting experts of an MoE model across GPU devices.

SMP v2 integrates with [NVIDIA Megatron](https://github.com/NVIDIA/Megatron-LM) for implementing expert parallelism to support training MoE models, and runs on top of PyTorch FSDP APIs. You keep using your PyTorch FSDP training code as is and activate SMP expert parallelism for training MoE models.

## Hugging Face Transformer models compatible with SMP expert parallelism
<a name="model-parallel-core-features-v2-expert-parallelism-supported-models"></a>

SMP v2 currently offers expert parallelism support for the following Hugging Face transformer models.
+ [Mixtral](https://huggingface.co/docs/transformers/en/model_doc/mixtral)

## Configure expert parallelism
<a name="model-parallel-core-features-v2-expert-parallelism-configure"></a>

For `expert_parallel_degree`, you select a value for the degree of expert parallelism. The value must evenly divide the number of GPUs in your cluster. For example, to shard your model while using an instance with 8 GPUs, choose 2, 4, or 8. We recommend that you start with a small number, and gradually increase it until the model fits in the GPU memory.
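The divisibility requirement can be checked with a few lines of Python (an illustrative helper, not part of the SMP API):

```python
# Valid expert_parallel_degree values are the divisors of the GPU count,
# because the degree must evenly divide the number of GPUs.

def valid_degrees(num_gpus: int):
    return [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]

print(valid_degrees(8))  # [1, 2, 4, 8] -- e.g. choose 2, 4, or 8 on an 8-GPU instance
```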

The following code snippets show how to add the SMP initialization module `torch.sagemaker.init()` to your training script and set up the SMP configuration dictionary in JSON format for training job launcher while following the two-step process introduced in [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md). You don’t need to make any changes to your PyTorch model or [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp) configuration. For more information about the `expert_parallel_degree` parameter, see [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config).

**Note**  
You can use expert parallelism with [Hybrid sharded data parallelism](model-parallel-core-features-v2-sharded-data-parallelism.md). Note that expert parallelism is currently not compatible with tensor parallelism.

**Note**  
This expert parallelism training feature is available in the following combination of libraries of SageMaker and the PyTorch library:  
SMP v2.3.0 and later
The SageMaker Python SDK v2.214.4 and later
PyTorch v2.2.0 and later

### In your training script
<a name="model-parallel-core-features-v2-expert-parallelism-configure-in-script"></a>

As part of [Step 1](model-parallel-use-api-v2.md#model-parallel-adapt-pytorch-script-v2), initialize your script with `torch.sagemaker.init()` to activate SMP v2 and wrap your model with the [`torch.sagemaker.transform`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-transform) API, adding the `config` parameter to the API to activate MoE. The following code snippet shows how to activate SMP MoE for the generic model class `AutoModelForCausalLM` pulling an MoE transformer model configuration using the `from_config` method for training from scratch, or the `from_pretrained` method for fine-tuning. To learn more about the SMP `MoEConfig` class, see [`torch.sagemaker.moe.moe_config.MoEConfig`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-moe).

```
# Import the torch.sagemaker.transform API and initialize.
import torch.sagemaker as tsm
tsm.init()

# Import transformers AutoModelForCausalLM class.
from transformers import AutoModelForCausalLM

# Import the SMP-implementation of MoE configuration class.
from torch.sagemaker.moe.moe_config import MoEConfig

# Define a transformer model with an MoE model configuration
model = AutoModelForCausalLM.from_config(MoEModelConfig)

# Wrap it by torch.sagemaker.transform with the SMP MoE configuration.
model = tsm.transform(
    model, 
    config=MoEConfig(
        smp_moe=True,
        random_seed=12345,
        moe_load_balancing="sinkhorn",
        global_token_shuffle=False,
        moe_all_to_all_dispatcher=True,
        moe_aux_loss_coeff=0.001,
        moe_z_loss_coeff=0.001
    )
)
```

### SMP configuration
<a name="model-parallel-core-features-v2-expert-parallelism-configure-in-estimator-config"></a>

As part of [Step 2](model-parallel-use-api-v2.md#model-parallel-launch-a-training-job-v2), add the following parameter to the SMP configuration dictionary for the SageMaker PyTorch estimator.

```
{   
    ..., # other SMP config parameters
    "expert_parallel_degree": 8
}
```

# Context parallelism
<a name="model-parallel-core-features-v2-context-parallelism"></a>

*Context parallelism* is a type of model parallelism that partitions the model activations along the sequence dimension. Unlike other [sequence parallelism](https://arxiv.org/abs/2205.05198) techniques, which partition only the `LayerNorm` and `RMSNorm` activations, context parallelism partitions the network inputs and all intermediate activations along the sequence dimension.

SMP v2 integrates with [Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine/index.html) for context parallelism and can be used in conjunction with PyTorch FSDP and SMP [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md). You can enable all three parallelisms simultaneously for model training. Context parallelism is beneficial for training models with large activation sizes and long sequence lengths. It accelerates the computation of attention scores and attention outputs by allowing each device to compute only a part of the scores and outputs along the sequence dimension. While tensor parallelism also accelerates computation through partitioning along the hidden dimension, the advantage of context parallelism is more substantial because computational requirements increase quadratically with sequence length.
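The partitioning itself is simple to picture. The following framework-free sketch (illustrative only, not SMP's implementation) shows how activations of a sequence are split along the sequence dimension across `cp_degree` devices:

```python
# Illustrative sketch: context parallelism splits activations along the
# sequence dimension, so each of cp_degree devices holds only
# seq_len / cp_degree positions of every activation tensor.

def split_sequence(activations, cp_degree):
    """activations: list of per-position vectors (length = seq_len).
    Returns one contiguous chunk per device."""
    seq_len = len(activations)
    assert seq_len % cp_degree == 0, "cp_degree must evenly divide seq_len"
    chunk = seq_len // cp_degree
    return [activations[i * chunk:(i + 1) * chunk] for i in range(cp_degree)]

# A sequence of 8 positions split across 2 devices: 4 positions each.
acts = [[p] for p in range(8)]
shards = split_sequence(acts, cp_degree=2)
print([len(s) for s in shards])  # [4, 4]
```

During attention, each device then computes scores only for its own positions and exchanges the key-value state it is missing with its peers.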

## Hugging Face Transformer models compatible with SMP context parallelism
<a name="model-parallel-core-features-v2-context-parallelism-supported-models"></a>

SMP v2 currently offers context parallelism support for the following Hugging Face transformer models.
+ GPT-NeoX
+ Llama 2 and Llama 3
+ [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.3)

## Configure context parallelism
<a name="model-parallel-core-features-v2-context-parallelism-configure"></a>

Set an integer value to the `context_parallel_degree` parameter that evenly divides the number of GPUs in your cluster. For example, if you have an 8-GPU instance, use 2, 4, or 8 for `context_parallel_degree`. We recommend starting with a small `context_parallel_degree` value and gradually increasing it until the model fits in the GPU memory with the required input sequence length.

The following code snippets show how to add the SMP initialization module `torch.sagemaker.init()` to your training script and set up the SMP configuration dictionary in JSON format for training job launcher while following the two-step process introduced in [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md). You don’t need to make any changes to your PyTorch model or [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp) configuration. For more information about the `context_parallel_degree` parameter, see [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config).

### In your training script
<a name="model-parallel-core-features-v2-context-parallelism-configure-in-script"></a>

As part of [Step 1](model-parallel-use-api-v2.md#model-parallel-adapt-pytorch-script-v2), initialize your script with `torch.sagemaker.init()` to activate SMP v2 and wrap your model with the [`torch.sagemaker.transform`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-transform) API. 

Starting from SMP v2.6.0, you can use the argument `cp_comm_type` to determine which context parallelism implementation to use. The SMP library currently supports two implementations: `p2p` and `all_gather`. The `p2p` implementation uses peer-to-peer send-receive calls for key-value accumulation during the attention computation and runs asynchronously, allowing overlap with compute. The `all_gather` implementation, in contrast, uses the `AllGather` collective operation and runs synchronously.

```
import torch.sagemaker as tsm
tsm.init()

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_config(..)
model = tsm.transform(model, cp_comm_type="p2p")
```

### SMP configuration
<a name="model-parallel-core-features-v2-context-parallelism-configure-in-estimator"></a>

As part of [Step 2](model-parallel-use-api-v2.md#model-parallel-launch-a-training-job-v2), add the following parameter to the SMP configuration dictionary for the SageMaker PyTorch estimator.

```
{   
    ..., # other SMP config parameters
    "context_parallel_degree": 2
}
```

# Compatibility with the SMDDP library optimized for AWS infrastructure
<a name="model-parallel-core-features-v2-smddp-allgather"></a>

You can use the SageMaker model parallelism library v2 (SMP v2) in conjunction with the [SageMaker distributed data parallelism (SMDDP) library](data-parallel.md), which offers the `AllGather` collective communication operation optimized for AWS infrastructure. In distributed training, collective communication operations are designed to synchronize multiple GPU workers and exchange information between them. `AllGather` is one of the core collective communication operations typically used in sharded data parallelism. To learn more about the SMDDP `AllGather` operation, see [SMDDP `AllGather` collective operation](data-parallel-intro.md#data-parallel-allgather). Optimizing such collective communication operations directly contributes to faster end-to-end training without side effects on convergence.

**Note**  
The SMDDP library supports P4 and P4de instances (see also [Supported frameworks, AWS Regions, and instance types](distributed-data-parallel-support.md) by the SMDDP library).

The SMDDP library integrates natively with PyTorch through the [process group](https://pytorch.org/docs/stable/distributed.html) layer. To use the SMDDP library, you only need to add two lines of code to your training script. It supports training frameworks such as the SageMaker model parallelism library, PyTorch FSDP, and DeepSpeed.

To activate SMDDP and use its `AllGather` operation, you need to add two lines of code to your training script as part of [Step 1: Adapt your PyTorch FSDP training script](model-parallel-use-api-v2.md#model-parallel-adapt-pytorch-script-v2). Note that you need to initialize PyTorch Distributed with the SMDDP backend first, and then run the SMP initialization.

```
import torch.distributed as dist

# Initialize with SMDDP
import smdistributed.dataparallel.torch.torch_smddp
dist.init_process_group(backend="smddp") # Replacing "nccl"

# Initialize with SMP
import torch.sagemaker as tsm
tsm.init()
```
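To build intuition for what the `AllGather` collective does, here is a conceptual, framework-free sketch (illustrative only, not SMDDP code):

```python
# Conceptual illustration of the AllGather collective (not SMDDP code):
# each rank contributes its local shard, and every rank ends up with the
# full, concatenated tensor. Sharded data parallelism uses this to
# reassemble full parameters before a layer's forward/backward pass.

def all_gather(shards_per_rank):
    """shards_per_rank[r] is rank r's local shard.
    Returns what every rank holds after the collective."""
    full = [x for shard in shards_per_rank for x in shard]
    return [list(full) for _ in shards_per_rank]

# 4 ranks each holding 2 parameters; afterwards all ranks hold all 8.
result = all_gather([[0, 1], [2, 3], [4, 5], [6, 7]])
print(result[0])  # every rank: [0, 1, 2, 3, 4, 5, 6, 7]
```

Because `AllGather` runs on the critical path of every sharded layer, the AWS-optimized implementation in SMDDP directly reduces end-to-end step time.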

[SageMaker Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) for PyTorch (see also [Supported frameworks and AWS Regions](distributed-model-parallel-support-v2.md) by SMP v2 and [Supported frameworks, AWS Regions, and instance types](distributed-data-parallel-support.md) by the SMDDP library) are pre-packaged with the SMP binary and the SMDDP binary. To learn more about the SMDDP library, see [Run distributed training with the SageMaker AI distributed data parallelism library](data-parallel.md).

# Mixed precision training
<a name="model-parallel-core-features-v2-mixed-precision"></a>

The SageMaker model parallelism (SMP) library v2 supports mixed precision training out of the box by integrating with open source frameworks such as PyTorch FSDP and Transformer Engine. To learn more, see the following topics.

**Topics**
+ [Mixed precision training with FP8 on P5 instances using Transformer Engine](#model-parallel-core-features-v2-mixed-precision-fp8-training-on-p5)
+ [Mixed precision training with half-precision data types using PyTorch FSDP](#model-parallel-core-features-v2-mixed-precision-half-precision)

## Mixed precision training with FP8 on P5 instances using Transformer Engine
<a name="model-parallel-core-features-v2-mixed-precision-fp8-training-on-p5"></a>

Starting from the SageMaker model parallelism (SMP) library v2.2.0, the SMP library integrates with [Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine/index.html) and supports [FP8 mixed precision training](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html) out of the box, keeping compatibility with [PyTorch FSDP `MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision). This means that you can use both PyTorch FSDP for mixed precision training and Transformer Engine for FP8 training. Model layers not supported by Transformer Engine's FP8 training feature fall back to PyTorch FSDP mixed precision.

**Note**  
SMP v2 offers FP8 support for the following Hugging Face Transformer models:  
GPT-NeoX (available in SMP v2.2.0 and later)
Llama 2 (available in SMP v2.2.0 and later)
Mixtral 8x7b and Mixtral 8x22b (available in SMP v2.5.0 and later)

**Note**  
This FP8 training on the P5 feature is available in the following combination of libraries of SageMaker and the PyTorch library:  
The SageMaker Python SDK v2.212.0 and later
PyTorch v2.2.0 and later

*FP8* (8-bit floating point precision) is a data type that has emerged as another paradigm to accelerate deep learning training of LLMs. With the release of NVIDIA H100 GPUs supporting FP8 data types, you can benefit from the performance improvements on P5 instances equipped with H100 GPUs, while accelerating distributed training with FP8 mixed precision training.

The FP8 data type further branches down to E4M3 and E5M2 formats. *E4M3* offers better precision with a limited dynamic range and is ideal for the forward pass in model training. *E5M2* has a broader dynamic range but reduced precision, and is better suited for the backward pass, where precision is less critical and a wider dynamic range becomes beneficial. Hence, we recommend that you use the [hybrid FP8 strategy recipe](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#FP8-recipe) to leverage these characteristics effectively.
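As a back-of-envelope illustration of this trade-off (an informal sketch, not part of the SMP or Transformer Engine APIs), the largest finite magnitudes of the two formats follow directly from their bit layouts:

```python
# Back-of-envelope sketch: largest finite magnitudes of the two FP8 formats.
# E4M3 (4 exponent bits, 3 mantissa bits) has no infinity and reserves only
# the all-ones code for NaN, so its maximum is (1 + 6/8) * 2^8. E5M2
# (5 exponent bits, 2 mantissa bits) follows the usual IEEE-style layout,
# where the top exponent code is reserved for inf/NaN.
E4M3_MAX = (1 + 6 / 8) * 2 ** 8    # = 448.0: limited range, finer mantissa
E5M2_MAX = (1 + 3 / 4) * 2 ** 15   # = 57344.0: far wider range, coarser mantissa
```

The roughly 128x gap in representable range is why E5M2 suits gradients in the backward pass while E4M3 suits the forward pass.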

For half-precision data types (FP16 and BF16), global loss-scaling techniques such as static loss-scaling or dynamic loss-scaling handle convergence issues that arise from information loss due to rounding gradients in half-precision. However, the dynamic range of FP8 is even narrower, and the global loss-scaling techniques are not sufficient, so a finer-grained per-tensor scaling technique is needed. *Delayed scaling* is a strategy that selects a scaling factor based on the maximum absolute values observed in a number of tensors from previous iterations. There's a trade-off in this strategy; it retains the full performance benefits of FP8 computation but requires memory to keep the maximum value history of the tensors. To learn more about the delayed scaling strategy in general, see the paper [https://arxiv.org/pdf/2209.05433.pdf](https://arxiv.org/pdf/2209.05433.pdf).
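The delayed scaling strategy can be sketched in a few lines of plain Python (an illustrative toy, not the Transformer Engine implementation; `update_scale`, the window size, and the use of the E4M3 maximum are assumptions for the sketch):

```python
# Illustrative toy of delayed scaling: keep a sliding window of observed
# max absolute values (amax) and derive the FP8 scaling factor from the
# window's maximum, so the observed range maps into the E4M3 range.

E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format

def update_scale(amax_history, new_amax, window=16):
    """Record the latest amax and compute the scale from the window maximum."""
    amax_history.append(new_amax)
    if len(amax_history) > window:
        amax_history.pop(0)      # drop the oldest entry beyond the window
    amax = max(amax_history)     # corresponds to amax_compute_algo="max"
    return E4M3_MAX / amax

history = []
scale = update_scale(history, 2.0)   # 448 / 2 -> 224.0
scale = update_scale(history, 8.0)   # larger observed values shrink the scale
assert scale == 56.0
```

The per-tensor `amax_history` is the memory cost the trade-off above refers to; the `DelayedScaling` recipe in the FP8 code example below configures the same idea with `amax_history_len` and `amax_compute_algo`.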

In practice, using FP8 is helpful in all training scenarios on P5 instances. We strongly recommend enabling FP8 whenever possible for enhancing training performance.

SMP v2 supports Transformer Engine out of the box. Therefore, when running FP8 training with SMP v2 on P5 instances of SageMaker AI (`ml.p5.48xlarge`), the only thing you need to do is to import `torch.sagemaker` in your training script and keep using the native Transformer Engine Python package. To learn more about using Transformer Engine for FP8 training in general, see [Using FP8 with Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html) in the *NVIDIA Transformer Engine documentation*. The following code snippet shows how the code lines for importing the SMP library and setting up FP8 in your training script should look.

```
import torch.sagemaker as tsm
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Initialize the SMP torch.sagemaker API.
tsm.init()

# Define a transformer model and wrap it with the torch.sagemaker.transform API.
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_config(ModelConfig)
model = tsm.transform(model)

# Enable E4M3 during forward pass, E5M2 during backward pass.
fp8_format = Format.HYBRID

# Create an FP8 recipe.
fp8_recipe = DelayedScaling(fp8_format=fp8_format, amax_history_len=32, amax_compute_algo="max")

# Enable FP8 autocasting.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe, fp8_group=tsm.state.world_process_group):
    out = model(inp)

loss = out.sum()
loss.backward()
```

To find a practical example of FP8 training with SMP v2 on P5 instances, see the example notebook at [Accelerate SageMaker PyTorch FSDP Training of Llama-v2 (or GPT-NeoX) with FP8 on P5 instances](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel_v2/llama_v2/smp-train-llama-fsdp-tp-fp8.ipynb).

## Mixed precision training with half-precision data types using PyTorch FSDP
<a name="model-parallel-core-features-v2-mixed-precision-half-precision"></a>

SMP v2 supports [PyTorch FSDP `MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision) for training jobs on P4 and P5 instances. PyTorch FSDP provides various configurations for mixed precision for both performance improvement and memory reduction. 

**Note**  
This mixed precision training with the PyTorch FSDP feature is available in the following combination of libraries of SageMaker and the PyTorch library.  
SMP v2.0.0 and later
The SageMaker Python SDK v2.200.0 and later
PyTorch v2.0.1 and later

The standard way to configure a model for mixed precision is to create the model in `float32`, and then allow FSDP to cast the parameters to `float16` or `bfloat16` on the fly by passing a `MixedPrecision` policy, as shown in the following code snippet. For more information about options to change the `dtype` for parameters, reduction, or buffers for mixed precision in PyTorch, see [PyTorch FSDP `MixedPrecision` API](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision) in the *PyTorch documentation*.

```
# Native PyTorch API
from torch.distributed.fsdp import MixedPrecision

dtype = torch.bfloat16
mixed_precision_policy = MixedPrecision(
    param_dtype=dtype, reduce_dtype=dtype, buffer_dtype=dtype
)

model = FSDP(
    model,
    ...,
    mixed_precision=mixed_precision_policy
)
```

Note that certain models (such as the Hugging Face Transformers Llama model) expect buffers as `float32`. To use `float32`, replace `torch.bfloat16` with `torch.float32` in the line defining the `dtype` object.

# Delayed parameter initialization
<a name="model-parallel-core-features-v2-delayed-param-init"></a>

Initializing a large model for training is not always possible with limited GPU memory. To resolve this problem of insufficient GPU memory, you can initialize the model in CPU memory. However, for larger models with more than 20 or 40 billion parameters, even CPU memory might not be enough. In such cases, we recommend that you initialize the model on what PyTorch calls a *meta device*, which allows the creation of tensors without any data attached to them. A tensor on a meta device needs only the shape information, which makes it possible to create a large model with its parameters on meta devices. [Hugging Face Accelerate](https://huggingface.co/docs/accelerate/index) provides the context manager `init_empty_weights` to help create such a model on meta devices while initializing the buffers on a regular device. Before training starts, PyTorch FSDP initializes the model parameters. The delayed parameter initialization feature of SMP v2 delays this creation of model parameters until after PyTorch FSDP performs parameter sharding. PyTorch FSDP accepts a parameter initialization function (`param_init_fn`) when sharding the modules, and it calls `param_init_fn` for each module. The `param_init_fn` API takes a module as an argument and initializes all the parameters in it, excluding the parameters of any child module. Note that this behavior *differs* from native PyTorch v2.0.1, which has a bug that causes parameters to be initialized multiple times.

SMP v2 provides the [`torch.sagemaker.delayed_param.DelayedParamIniter`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-delayed-param-init) API for applying delayed parameter initialization.

The following code snippets show how to apply the `torch.sagemaker.delayed_param.DelayedParamIniter` API to your training script.

Assume that you have a PyTorch FSDP training script as follows.

```
# Creation of the model on the meta device
import torch
import torch.nn as nn
from accelerate import init_empty_weights
from transformers.pytorch_utils import Conv1D

with init_empty_weights():
    model = create_model()

# Define a param init fn. The following is an example for Hugging Face GPTNeoX.
def init_weights(module):
    d = torch.cuda.current_device()
    # Note that the following call doesn't work if the model has buffers;
    # buffers need to be reinitialized after this call.
    module.to_empty(device=d, recurse=False)
    if isinstance(module, (nn.Linear, Conv1D)):
        module.weight.data.normal_(mean=0.0, std=args.initializer_range)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, nn.Embedding):
        module.weight.data.normal_(mean=0.0, std=args.initializer_range)
        if module.padding_idx is not None:
            module.weight.data[module.padding_idx].zero_()
    elif isinstance(module, nn.LayerNorm):
        module.bias.data.zero_()
        module.weight.data.fill_(1.0)

# Changes to the FSDP wrapper.
model = FSDP(
    model,
    ...,
    param_init_fn=init_weights
)

# At this point, the model is initialized and sharded for sharded data parallelism.
```

Note that the delayed parameter initialization approach is not model agnostic. You need to write an `init_weights` function, as shown in the preceding example, that matches the initialization in the original model definition and covers all the parameters of the model. To simplify the process of preparing such an `init_weights` function, SMP v2 implements this initialization function for the following models: GPT-2, GPT-J, GPT-NeoX, and Llama from Hugging Face Transformers. The `torch.sagemaker.delayed_param.DelayedParamIniter` API also works with the SMP tensor parallel implementation, the `torch.sagemaker.tensor_parallel.transformer.TransformerLMHead` model, which you can call after the [`torch.sagemaker.transform`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-transform) API call.

Using the `torch.sagemaker.delayed_param.DelayedParamIniter` API, you can adapt your PyTorch FSDP script as follows. After creating a model with empty weights, create a `torch.sagemaker.delayed_param.DelayedParamIniter` object with the model, and pass its parameter initialization function to the `param_init_fn` argument of the PyTorch FSDP class.

```
from torch.sagemaker.delayed_param import DelayedParamIniter
from accelerate import init_empty_weights

with init_empty_weights():
    model = create_model()
    
delayed_initer = DelayedParamIniter(model)

with delayed_initer.validate_params_and_buffers_inited():
    model = FSDP(
        model,
        ...,
        param_init_fn=delayed_initer.get_param_init_fn()
    )
```

**Notes on tied weights**

When training models with tied weights, you need to take special care to tie the weights after initializing them with delayed parameter initialization. PyTorch FSDP does not have a mechanism to tie the weights after initializing them with `param_init_fn` as shown above. To address such cases, the SMP library provides a `post_init_hook_fn` argument, which you can use to tie the weights. You can pass any function that accepts the module as an argument, but `DelayedParamIniter` also has a predefined `post_param_init_fn` that calls the `tie_weights` method of the module if it exists. Note that it's always safe to pass in `post_param_init_fn`, even if the module has no `tie_weights` method.

```
with delayed_initer.validate_params_and_buffers_inited():
    model = FSDP(
        model,
        ...,
        param_init_fn=delayed_initer.get_param_init_fn(),
        post_param_init_fn=delayed_initer.get_post_param_init_fn()
    )
```

# Activation checkpointing
<a name="model-parallel-core-features-v2-pytorch-activation-checkpointing"></a>

*Activation checkpointing* is a technique to reduce memory usage by clearing activations of certain layers and recomputing them during the backward pass. Effectively, this trades extra computation time for reducing memory usage. If a module is checkpointed, at the end of a forward pass, only the initial inputs to the module and final outputs from the module stay in memory. PyTorch releases any intermediate tensors that are part of the computation inside that module during the forward pass. During the backward pass of the checkpointed modules, PyTorch recomputes these tensors. At this point, the layers beyond this checkpointed module have finished their backward pass, so the peak memory usage with checkpointing becomes lower.
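The memory trade-off can be made concrete with a back-of-envelope count of live activations for a chain of layers (an illustrative estimate only; `peak_stored_activations` is a hypothetical helper, and real memory usage depends on the model and framework):

```python
# Back-of-envelope count of live activations for a chain of layers.
# Without checkpointing, the input and every layer output stay alive for
# the backward pass. With checkpointing every k layers, only segment
# boundaries survive the forward pass, plus one segment's recomputed
# intermediates at a time during the backward pass.

def peak_stored_activations(num_layers, checkpoint_every=None):
    if checkpoint_every is None:
        return num_layers + 1                          # input + every layer output
    num_segments = -(-num_layers // checkpoint_every)  # ceiling division
    boundaries = num_segments + 1                      # segment inputs/outputs
    return boundaries + (checkpoint_every - 1)         # + one recomputed segment

# 16 layers: 17 live activations without checkpointing, 8 when
# checkpointing every 4 layers.
assert peak_stored_activations(16) == 17
assert peak_stored_activations(16, checkpoint_every=4) == 8
```

The saved activations are paid for by one extra forward pass over each checkpointed segment during the backward pass.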

SMP v2 supports the PyTorch activation checkpointing module (see [Activation Checkpointing](https://pytorch.org/blog/scaling-multimodal-foundation-models-in-torchmultimodal-with-pytorch-distributed/#activation-checkpointing) in the PyTorch blog). The following are examples of activation checkpointing for the Hugging Face GPT-NeoX model.

**Checkpointing Transformer layers of the Hugging Face GPT-NeoX model**

```
from transformers.models.gpt_neox import GPTNeoXLayer
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing
)
    
# check_fn receives a module as the arg, 
# and it needs to return whether the module is to be checkpointed
def is_transformer_layer(module):
    return isinstance(module, GPTNeoXLayer)
    
apply_activation_checkpointing(model, check_fn=is_transformer_layer)
```

**Checkpointing every other Transformer layer of the Hugging Face GPT-NeoX model**

```
# check_fn receives a module as arg, 
# and it needs to return whether the module is to be checkpointed
# here we define that function based on global variable (transformer_layers)
from transformers.models.gpt_neox import GPTNeoXLayer
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing
)

transformer_layers = [
    m for m in model.modules() if isinstance(m, GPTNeoXLayer)
]

def is_odd_transformer_layer(module):
    return transformer_layers.index(module) % 2 == 0
    
apply_activation_checkpointing(model, check_fn=is_odd_transformer_layer)
```

Alternatively, PyTorch also has the `torch.utils.checkpoint` module for checkpointing, which is used by a subset of Hugging Face Transformers models. This module also works with SMP v2. However, it requires you to have access to the model definition to add the checkpoint wrapper. Therefore, we recommend that you use the `apply_activation_checkpointing` method.

# Activation offloading
<a name="model-parallel-core-features-v2-pytorch-activation-offloading"></a>

**Important**  
In SMP v2.2.0, the activation offloading functionality of the SMP library doesn't work. Use the native PyTorch activation offloading instead.

Typically, the forward pass computes activations at each layer and keeps them in GPU memory until the backward pass for the corresponding layer finishes. Offloading these tensors to CPU memory after forward pass and fetching them back to GPU when they are needed can save substantial GPU memory usage. PyTorch supports offloading activations, but the implementation causes GPUs to be idle while activations are fetched back from CPU during backward pass. This causes a major performance degradation when using activation offloading.

SMP v2 improves on this activation offloading. It pre-fetches activations ahead of time, before the GPU needs them to start the backward pass on those activations. The pre-fetching feature helps training progress more efficiently without idle GPUs, which delivers the benefit of lower memory usage without performance degradation.

You can keep the native PyTorch modules for offloading activations in your training script. The following is an example structure of applying the SMP activation offloading feature in your script. Note that activation offloading is applicable *only* when used together with [Activation checkpointing](model-parallel-core-features-v2-pytorch-activation-checkpointing.md). To learn more about the native PyTorch checkpoint tools for activation offloading, see:
+ [checkpoint_wrapper.py](https://github.com/pytorch/pytorch/blob/v2.0.1/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py#L171) in the *PyTorch GitHub repository*
+ [Activation Checkpointing](https://pytorch.org/blog/scaling-multimodal-foundation-models-in-torchmultimodal-with-pytorch-distributed/#activation-checkpointing) in the PyTorch blog *Scaling Multi-modal Foundation Models in TorchMultimodal with PyTorch Distributed*.

You can apply the SMP activation offloading feature on [PyTorch activation checkpointing](https://pytorch.org/blog/scaling-multimodal-foundation-models-in-torchmultimodal-with-pytorch-distributed/#activation-checkpointing). This is done by adding the `sm_activation_offloading` and `activation_loading_horizon` parameters to the SMP configuration dictionary during [Step 2: Launch a training job](model-parallel-use-api-v2.md#model-parallel-launch-a-training-job-v2). 

The following code snippets show how to add the SMP initialization module `torch.sagemaker.init()` to your training script and set up the SMP configuration dictionary in JSON format for training job launcher while following the two-step process introduced in [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md). You don’t need to make any changes to your PyTorch model or [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp) configuration. For more information about the `sm_activation_offloading` and `activation_loading_horizon` parameters, see [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config).

**SMP configuration**

```
{
    "activation_loading_horizon": 2,
    "sm_activation_offloading": True
}
```

**In training script**

**Note**  
While activating the SMP activation offloading feature, make sure that you also use the PyTorch `offload_wrapper` function and apply it to the root module. The SMP activation offloading feature uses the root module to determine when the forward pass is done so it can start pre-fetching.

```
import torch.sagemaker as tsm
tsm.init()

# Native PyTorch module for activation offloading
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing, 
    offload_wrapper,
)

model = FSDP(...)

# Activation offloading requires activation checkpointing.
apply_activation_checkpointing(
    model,
    check_fn=checkpoint_transformer_layers_policy,
)

model = offload_wrapper(model)
```

# Tensor parallelism
<a name="model-parallel-core-features-v2-tensor-parallelism"></a>

*Tensor parallelism* is a type of model parallelism in which specific model weights, gradients, and optimizer states are split across devices. In contrast to pipeline parallelism, which keeps individual weights intact but partitions the *set* of weights, gradients, or optimizer across devices, tensor parallelism shards *individual* weights. This typically involves distributed computation of specific operations, modules, or layers of the model.
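To make the idea of sharding an *individual* weight concrete, the following pure-Python sketch (illustrative only, not the SMP implementation) splits one weight matrix column-wise across two hypothetical devices and shows that concatenating the partial outputs reproduces the full matrix multiplication:

```python
# Illustrative sketch of tensor parallelism: shard a single weight matrix
# column-wise across two "devices"; each computes a partial output, and
# concatenating the partials reproduces the full matmul.

def matmul(x, w):
    # x: list of rows, w: list of rows -> x @ w
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]

x = [[1.0, 2.0]]                 # batch of one input vector
w = [[1.0, 2.0, 3.0, 4.0],       # full 2x4 weight
     [5.0, 6.0, 7.0, 8.0]]

# Shard the weight column-wise across 2 devices.
w_shard0 = [row[:2] for row in w]
w_shard1 = [row[2:] for row in w]

# Each device computes its slice of the output independently.
out0 = matmul(x, w_shard0)
out1 = matmul(x, w_shard1)
out = [out0[0] + out1[0]]        # concatenate along the feature dimension

assert out == matmul(x, w)
```

Because each shard holds only a slice of the weight, no single device ever needs to store or compute with the full parameter, which is exactly what makes very large individual parameters tractable.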

Tensor parallelism is required in cases in which a single parameter consumes most of the GPU memory (such as large embedding tables with a large vocabulary size or a large softmax layer with a large number of classes). In this case, treating this large tensor or operation as an atomic unit is inefficient and impedes balance of the memory load.

SMP v2 integrates with [Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine/index.html) for its implementation of tensor parallelism, and runs on top of PyTorch FSDP APIs. You can enable PyTorch FSDP and SMP tensor parallelism simultaneously, and determine the model parallelism configuration that delivers the best performance.

In practice, tensor parallelism is especially helpful in the following scenarios.
+ When training with long context lengths as that leads to high activation memory with FSDP alone.
+ When training with really large clusters on which the global batch size exceeds desired limits.

## Hugging Face Transformer models compatible with the SMP tensor parallelism
<a name="model-parallel-core-features-v2-tensor-parallelism-supported-models"></a>

SMP v2 currently offers tensor parallelism support for the following Hugging Face transformer models.
+ GPT-NeoX
+ Llama 2
+ Llama 3
+ [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.3)
+ [Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)
+ [Mixtral 8x22B](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1)

For reference configuration for applying tensor parallelism on these models, see [Configuration tips](model-parallel-best-practices-v2.md#model-parallel-best-practices-v2-config-tips).

## Configure tensor parallelism
<a name="model-parallel-core-features-v2-tensor-parallelism-configuration"></a>

For `tensor_parallel_degree`, you select a value for the degree of tensor parallelism. The value must evenly divide the number of GPUs in your cluster. For example, to shard your model while using an instance with 8 GPUs, choose 2, 4, or 8. We recommend that you start with a small number, and gradually increase it until the model fits in the GPU memory.

The following code snippets show how to add the SMP initialization module `torch.sagemaker.init()` to your training script and set up the SMP configuration dictionary in JSON format for training job launcher while following the two-step process introduced in [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md). You don’t need to make any changes to your PyTorch model or [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp) configuration. For more information about the `tensor_parallel_degree` and `random_seed` parameters, see [SMP v2 core feature configuration parameters](distributed-model-parallel-v2-reference.md#distributed-model-parallel-v2-reference-init-config).

**SMP configuration**

```
{
    "tensor_parallel_degree": 8,
    "random_seed": 0 
}
```

**In your training script**

Initialize with `torch.sagemaker.init()` to activate SMP v2 and wrap your model with the [`torch.sagemaker.transform`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-transform) API.

```
import torch.sagemaker as tsm
tsm.init()

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_config(..)
model = tsm.transform(model)
```

## Saving and loading Hugging Face Transformer checkpoints
<a name="model-parallel-core-features-v2-tensor-parallelism-checkpoints"></a>

After the SMP library transforms a model, it changes the state dictionary (`state_dict`) of the model. This means that the model becomes incompatible with the original Hugging Face Transformer checkpointing functionalities. To handle this, the SMP library provides APIs to save checkpoints from a transformed model in Hugging Face Transformer representation, and the `torch.sagemaker.transform` API to load a Hugging Face Transformer model checkpoint for fine-tuning.

For more information about saving checkpoints while using the tensor parallelism feature of SMP v2, see [Checkpointing using SMP](model-parallel-core-features-v2-checkpoints.md).

For more information about fine-tuning a model applying the tensor parallelism feature of SMP v2, see [Fine-tuning](model-parallel-core-features-v2-fine-tuning.md).

# Fine-tuning
<a name="model-parallel-core-features-v2-fine-tuning"></a>

Fine-tuning is the process of continuing to train a pre-trained model to improve its performance for specific use cases.

Fine-tuning small models, which fit fully on a single GPU or whose eight copies fit fully in CPU memory, is straightforward and requires no special change to regular FSDP training. For models larger than this, you need to consider using the delayed parameter initialization functionality, which can be tricky.

To address this, the SMP library loads the full model on one of the ranks while the rest of the ranks create models with empty weights on a meta device. Then, PyTorch FSDP initializes the weights on non-zero ranks using the `init_weights` function, and synchronizes the weights on all ranks to the weights on the 0th rank with `sync_module_states` set to `True`. The following code snippet shows how you should set it up in your training script.

```
import torch.distributed as dist
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights
from torch.sagemaker.delayed_param import DelayedParamIniter

if dist.get_rank() == 0:
    model = AutoModelForCausalLM.from_pretrained(..., low_cpu_mem_usage=True)
else:
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(...))
    delayed_initer = DelayedParamIniter(model)

model = FSDP(
    model,
    ...,
    sync_module_states=True,
    param_init_fn=delayed_initer.get_param_init_fn() if dist.get_rank() > 0 else None
)
```

## Fine-tuning a pre-trained Hugging Face Transformer model with SMP tensor parallelism
<a name="model-parallel-core-features-v2-tensor-parallelism-fine-tuning-hf-transformer-with-tp"></a>

This section discusses loading Transformer models for two use cases: fine-tuning small Transformer models and fine-tuning large Transformer models. For smaller models without delayed parameter initialization, wrap the model with the `torch.sagemaker.transform` API before wrapping it with PyTorch FSDP.

```
import functools
from transformers import AutoModelForCausalLM
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.sagemaker import transform

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", low_cpu_mem_usage=True)

# Transform model while loading state dictionary from rank 0.
tp_model = transform(model, load_state_dict_from_rank0=True)

# Wrap with FSDP.
model = FSDP(
    tp_model, 
    ...
    sync_module_states=True,
)
```

For larger models, the preceding approach can cause you to run out of CPU memory. We recommend that you use delayed parameter initialization to avoid such CPU memory issues. In this case, you can apply the `torch.sagemaker.transform` API and the `torch.sagemaker.delayed_param.DelayedParamIniter` API as shown in the following code example.

```
import torch.distributed as dist
from contextlib import nullcontext
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM
from torch.sagemaker import transform
from torch.sagemaker.delayed_param import DelayedParamIniter

# Create one instance of the model without delayed param
# on CPU, on one rank.
if dist.get_rank() == 0:
    model = AutoModelForCausalLM.from_pretrained(..., low_cpu_mem_usage=True)
else:
    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(...))

# Transform the model while loading the state dictionary from rank 0.
model = transform(model, load_state_dict_from_rank0=True)

if dist.get_rank() != 0: # For fine-tuning, delay parameters on non-zero ranks
    delayed_initer = DelayedParamIniter(model)
else:
    delayed_initer = None

with (
    delayed_initer.validate_params_and_buffers_inited() if delayed_initer else nullcontext()
):
    # Wrap the model with FSDP
    model = FSDP(
        model, 
        ..., 
        sync_module_states=True,
        param_init_fn=delayed_initer.get_param_init_fn() if delayed_initer else None
    )
```

# FlashAttention
<a name="model-parallel-core-features-v2-flashattention"></a>

SMP v2 supports [FlashAttention](https://github.com/HazyResearch/flash-attention) kernels and makes it easy to apply them to various scenarios for Hugging Face Transformer models. Note that if you use FlashAttention package v2.0 or later, SMP uses FlashAttention v2. However, the Triton flash attention defaults to the flash attention kernel in FlashAttention v1.x, so it is supported exclusively in FlashAttention v1. 

The module (`nn.Module`) is a low-level API that defines the attention layers of a model. It should be applied right after model creation (for example, with the `AutoModelForCausalLM.from_config()` API) and before the model is transformed or wrapped with FSDP.

## Use FlashAttention kernels for self attention
<a name="model-parallel-core-features-v2-flashattention-self"></a>

The following code snippet shows how to use the [`torch.sagemaker.nn.attn.FlashSelfAttention`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-flashselfattention) API provided by SMP v2.

```
import functools

import torch
import torch.sagemaker.nn.attn

def new_attn(self, q, k, v, attention_mask=None, head_mask=None):
    return (
        self.flash_mod((q, k, v), causal=True, cast_dtype=torch.bfloat16, layout="b h s d"),
        None,
    )

for layer in model.gpt_neox.layers:
    layer.attention.flash_mod = torch.sagemaker.nn.attn.FlashSelfAttention()
    layer.attention._attn = functools.partial(new_attn, layer.attention)
```

## Use FlashAttention kernels for grouped-query attention
<a name="model-parallel-core-features-v2-flashattention-grouped-query"></a>

SMP v2 also supports [FlashAttention](https://github.com/HazyResearch/flash-attention) kernels for grouped-query attention (GQA) and makes it easy to apply them to various scenarios for Hugging Face Transformer models. Different from the original attention architecture, GQA equally partitions query heads into groups, and query heads in the same group share the same key and value heads. Therefore, the q and kv heads are passed into the forward call separately. Note that the number of q heads needs to be divisible by the number of kv heads.
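The head grouping described above can be sketched in plain Python (an illustrative toy, not the SMP API; `kv_head_for_query_head` is a hypothetical helper):

```python
# Illustrative sketch of grouped-query attention (GQA): query heads are
# split into equal groups, and each group shares one key/value head.
# Query head q maps to kv head q // group_size.

def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    assert num_q_heads % num_kv_heads == 0, "q heads must be divisible by kv heads"
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

# 8 query heads sharing 2 kv heads: heads 0-3 use kv head 0, heads 4-7 use kv head 1.
mapping = [kv_head_for_query_head(q, 8, 2) for q in range(8)]
assert mapping == [0, 0, 0, 0, 1, 1, 1, 1]
```

Sharing kv heads across groups shrinks the key/value projections and the KV cache, which is the main motivation for GQA.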

**Example of using FlashGroupedQueryAttention**

The following code snippet shows how to use the [`torch.sagemaker.nn.attn.FlashGroupedQueryAttention`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-flashGroupedQueryAttn) API provided by SMP v2.

```
from typing import Optional

import torch
from transformers.models.llama.configuration_llama import LlamaConfig
from transformers.models.llama.modeling_llama import LlamaAttention
from torch.sagemaker.nn.attn import FlashGroupedQueryAttention

class LlamaFlashAttention(LlamaAttention):
    def __init__(self, config: LlamaConfig):
        super().__init__(config)

        self.flash_attn = FlashGroupedQueryAttention(
            attention_dropout_prob=0.0,
        )
        
    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        ...
    ):
        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)
        ...
        kv = (key_states, value_states)
        attn_output = self.flash_attn(
            query_states,
            kv,
            attn_mask=attention_mask,
            causal=True,
            layout="b h s d",
        )
        ...
        attn_output = self.o_proj(attn_output)
        ...
        return attn_output
```

The SMP library also provides [`torch.sagemaker.nn.huggingface.llama_flashattn.LlamaFlashAttention`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-llamaFlashAttn), which uses the [`torch.sagemaker.nn.attn.FlashGroupedQueryAttention`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-flashGroupedQueryAttn) API at a low level. Hugging Face Transformers has a similar implementation called [`LlamaFlashAttention2`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py), available from v4.36.0. The following code snippet shows how to use the SMP v2 `LlamaFlashAttention` API or the Transformers `LlamaFlashAttention2` API to replace the attention layers of an existing Llama model.

```
from torch.sagemaker.nn.huggingface.llama_flashattn import LlamaFlashAttention
from transformers.models.llama.modeling_llama import LlamaFlashAttention2

flash_attn_class = LlamaFlashAttention # or flash_attn_class = LlamaFlashAttention2

attn_name = "self_attn"
for layer in model.model.layers:
    prev_layer = getattr(layer, attn_name)
    setattr(layer, attn_name, flash_attn_class(model.config))
```

# Checkpointing using SMP
<a name="model-parallel-core-features-v2-checkpoints"></a>

The SageMaker model parallelism (SMP) library supports PyTorch APIs for checkpoints and provides APIs that help you save and load checkpoints correctly while using the SMP library. 

PyTorch FSDP (Fully Sharded Data Parallelism) supports three types of checkpoints: full, sharded, and local, each serving different purposes. Full checkpoints are used when exporting the model after training is completed, as generating a full checkpoint is a computationally expensive process. Sharded checkpoints help save and load the state of a model sharded for each individual rank. With sharded checkpoints, you can resume training with different hardware configurations, such as a different number of GPUs. However, loading sharded checkpoints can be slow due to the communication involved among multiple devices. The SMP library provides local checkpointing functionalities, which allow faster retrieval of the model's state without additional communication overhead. Note that checkpoints created by FSDP require writing to a shared network file system such as Amazon FSx.
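The trade-offs among the three checkpoint types can be condensed into a simple rule of thumb. The helper below is purely illustrative (it is not an SMP or PyTorch API): it picks a checkpoint type from the purpose of the save and whether the hardware configuration may change before resuming.

```python
def pick_checkpoint_type(purpose: str, hardware_may_change: bool) -> str:
    """Illustrative decision rule distilled from the descriptions above."""
    if purpose == "export":
        # Full checkpoints combine all shards into one file; computationally
        # expensive, so typically used once training is complete.
        return "full"
    if hardware_may_change:
        # Sharded checkpoints can be loaded with a different number of GPUs,
        # at the cost of extra communication when loading.
        return "sharded"
    # Local checkpoints avoid additional communication overhead entirely,
    # but require resuming with the same shard degree.
    return "local"

print(pick_checkpoint_type("resume", hardware_may_change=True))   # -> sharded
print(pick_checkpoint_type("resume", hardware_may_change=False))  # -> local
```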

## Async local checkpoints
<a name="w2aac25c25c19c19c33b7"></a>

When training machine learning models, there is no need for subsequent iterations to wait for the checkpoint files to be saved to disk. With the release of SMP v2.5, the library supports saving checkpoint files asynchronously. This means that a subsequent training iteration can run simultaneously with the input and output (I/O) operations for creating checkpoints, without being slowed down or held back by those I/O operations. Also, the process of retrieving sharded model and optimizer parameters in PyTorch can be time-consuming due to the additional collective communication required to exchange distributed tensor metadata across ranks. Even when using `StateDictType.LOCAL_STATE_DICT` to save local checkpoints for each rank, PyTorch still invokes hooks that perform collective communication. To mitigate this issue and reduce the time required for checkpoint retrieval, SMP introduces `SMStateDictType.SM_LOCAL_STATE_DICT`, which allows for faster retrieval of model and optimizer checkpoints by bypassing the collective communication overhead. 
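To make the overlap concrete, the following self-contained sketch imitates the asynchronous-save pattern with a plain background thread: `async_save` returns immediately so the next training step can proceed, and `maybe_finalize` blocks until the pending write finishes before another save starts. The class and method names are hypothetical stand-ins, not the SMP APIs.

```python
import pickle
import threading

class AsyncCheckpointer:
    """Toy illustration of asynchronous checkpointing (not the SMP API)."""

    def __init__(self):
        self._pending = None

    def maybe_finalize(self):
        """Block until the previous asynchronous save, if any, completes."""
        if self._pending is not None:
            self._pending.join()
            self._pending = None

    def async_save(self, state_dict, path):
        """Start writing the checkpoint on a background thread and return."""
        self.maybe_finalize()  # never overlap two writes to the same files

        def _write():
            with open(path, "wb") as f:
                pickle.dump(state_dict, f)

        self._pending = threading.Thread(target=_write)
        self._pending.start()
        # Control returns here immediately; the training loop continues
        # while the checkpoint I/O happens in the background.
```

A real implementation must also handle write failures and ensure that the next training step does not mutate the state dictionary while the write is still in flight.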

**Note**  
Using `SMStateDictType.SM_LOCAL_STATE_DICT` requires a consistent FSDP `SHARD_DEGREE`. Ensure that the `SHARD_DEGREE` remains unchanged: while the number of model replicas can vary, the model shard degree must be identical to the previous training setup when resuming from a checkpoint.

```
import os
import torch.distributed as dist
import torch.sagemaker as tsm
from torch.sagemaker import state
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.sagemaker.distributed.checkpoint.state_dict_saver import (
    async_save,
    maybe_finalize_async_calls,
)
from torch.sagemaker.distributed.checkpoint.state_dict_utils import (
    sm_state_dict_type,
    SMStateDictType,
)

global_rank = dist.get_rank()
save_dir = "/opt/ml/checkpoints"
sub_dir = f"tp{state.tp_rank}_ep{state.ep_rank}_fsdp{model.rank}"

# 1. Get replication ranks and group
current_replication_group = None
current_replication_ranks = None
for replication_ranks in state.ranker.get_rep_groups():
    rep_group = dist.new_group(replication_ranks)
    if global_rank in replication_ranks:
        current_replication_group = rep_group
        current_replication_ranks = replication_ranks

coordinator_rank = min(current_replication_ranks)

# 2. Wait for the previous checkpointing done
maybe_finalize_async_calls(
    blocking=True, process_group=current_replication_group
)

# 3. Get model local checkpoint
with sm_state_dict_type(model, SMStateDictType.SM_LOCAL_STATE_DICT):
    state_dict = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        # Potentially add more customized state dicts.
    }

# 4. Save a local checkpoint 
async_save(
    state_dict,
    checkpoint_id=os.path.join(save_dir, sub_dir),
    process_group=current_replication_group,
    coordinator_rank=coordinator_rank,
)
```

The following code snippet demonstrates how to load a checkpoint utilizing `SMStateDictType.SM_LOCAL_STATE_DICT`.

```
import os
import torch.distributed as dist
import torch.sagemaker as tsm
from torch.sagemaker import state
from torch.sagemaker.distributed.checkpoint.state_dict_loader import load
from torch.sagemaker.distributed.checkpoint.state_dict_utils import (
    sm_state_dict_type,
    SMStateDictType,
    init_optim_state
)
from torch.sagemaker.distributed.checkpoint.filesystem import (
    DistributedFileSystemReader,
)

load_dir = "/opt/ml/checkpoints"
sub_dir = f"tp{state.tp_rank}_ep{state.ep_rank}_fsdp{model.rank}"
global_rank = dist.get_rank()
checkpoint_id = os.path.join(load_dir, sub_dir)
storage_reader = DistributedFileSystemReader(checkpoint_id)

# 1. Get replication ranks and group
current_replication_group = None
current_replication_ranks = None
for replication_ranks in state.ranker.get_rep_groups():
    rep_group = dist.new_group(replication_ranks)
    if global_rank in replication_ranks:
        current_replication_group = rep_group
        current_replication_ranks = replication_ranks

coordinator_rank = min(current_replication_ranks)

# 2. Create local state_dict
with sm_state_dict_type(model, SMStateDictType.SM_LOCAL_STATE_DICT):
    state_dict = {
        "model": model.state_dict(),
        # Potentially add more customized state dicts.
    }
 
    # Init optimizer state_dict states by setting zero grads and step.
    init_optim_state(optimizer, skip_empty_param=True)
    state_dict["optimizer"] = optimizer.state_dict()
 
# 3. Load a checkpoint
load(
    state_dict=state_dict,
    process_group=current_replication_group,
    coordinator_rank=coordinator_rank,
    storage_reader=storage_reader,
)
```

Storing checkpoints for large language models (LLMs) can be expensive because it often requires creating a large file system volume. To reduce costs, you can save checkpoints directly to Amazon S3 without additional file system services such as Amazon FSx. You can adapt the previous example with the following code snippet to save checkpoints to S3 by specifying an S3 URL as the destination. 

```
key = os.path.join(checkpoint_dir, sub_dir)
checkpoint_id = f"s3://{your_s3_bucket}/{key}"
async_save(state_dict, checkpoint_id=checkpoint_id, **kw)
load(state_dict, checkpoint_id=checkpoint_id, **kw)
```

## Async sharded checkpoints
<a name="w2aac25c25c19c19c33b9"></a>

There may be situations where you need to continue training with a different hardware configuration, such as a different number of GPUs. In these cases, your training processes must load checkpoints while resharding, which means resuming subsequent training with a different `SHARD_DEGREE`. To support resuming with a different `SHARD_DEGREE`, you must save your model checkpoints using the sharded state dictionary type, represented by `StateDictType.SHARDED_STATE_DICT`. Saving checkpoints in this format allows you to properly handle the resharding process when continuing training with a modified hardware configuration. The following code snippet illustrates how to use the `tsm` API to asynchronously save sharded checkpoints, enabling a more efficient and streamlined training process.

```
import os
import torch.distributed as dist
import torch.sagemaker as tsm
from torch.sagemaker import state
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType
from torch.sagemaker.utils.process_group_utils import get_global_ranks
from torch.sagemaker.distributed.checkpoint.state_dict_saver import (
    async_save,
    maybe_finalize_async_calls,
)

save_dir = "/opt/ml/checkpoints"
sub_dir = f"tp{state.tp_rank}_ep{state.ep_rank}"
checkpoint_id = os.path.join(save_dir, sub_dir)

# Determine whether the current rank needs to take part in checkpointing.
global_rank = dist.get_rank()
action_rank = state.ranker.get_rep_rank(global_rank) == 0
process_group = model.process_group
coordinator_rank = min(get_global_ranks(process_group))

# 1. wait for the previous checkpointing done
maybe_finalize_async_calls(blocking=True, process_group=process_group)

# 2. retrieve model & optimizer sharded state_dict
with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    state_dict = {
        "model": model.state_dict(),
        "optimizer": FSDP.optim_state_dict(model, optimizer),
        # Potentially add more customized state dicts.
    }
 
# 3. save checkpoints asynchronously using async_save
if action_rank:
    async_save(
        state_dict,
        checkpoint_id=checkpoint_id,
        process_group=process_group,
        coordinator_rank=coordinator_rank,
    )
```

The process of loading sharded checkpoints is similar to the previous section, but it involves using the `torch.sagemaker.distributed.checkpoint.filesystem.DistributedFileSystemReader` class and its `load` method. The `load` method of this class allows you to load the sharded checkpoint data, following a process analogous to the one described earlier.

```
import os

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType
from torch.distributed.checkpoint.optimizer import load_sharded_optimizer_state_dict
from torch.sagemaker import state
from torch.sagemaker.distributed.checkpoint.state_dict_loader import load
from torch.sagemaker.utils.process_group_utils import get_global_ranks
from torch.sagemaker.distributed.checkpoint.filesystem import (
    DistributedFileSystemReader,
)

load_dir = "/opt/ml/checkpoints"
sub_dir = f"tp{state.tp_rank}_ep{state.ep_rank}"
checkpoint_id = os.path.join(load_dir, sub_dir)
reader = DistributedFileSystemReader(checkpoint_id)

process_group = model.process_group
coordinator_rank = min(get_global_ranks(process_group))

with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    # 1. Load model and everything else except the optimizer.
    state_dict = {
        "model": model.state_dict()
        # Potentially more customized state dicts.
    }
    load(
        state_dict,
        storage_reader=reader,
        process_group=process_group,
        coordinator_rank=coordinator_rank,
    )
    model.load_state_dict(state_dict["model"])

    # 2. Load the optimizer.
    optim_state = load_sharded_optimizer_state_dict(
        model_state_dict=state_dict["model"],
        optimizer_key="optimizer",
        storage_reader=reader,
        process_group=process_group,
    )
    flattened_optimizer_state = FSDP.optim_state_dict_to_load(
        optim_state["optimizer"], model, optimizer,
        group=model.process_group,
    )
    optimizer.load_state_dict(flattened_optimizer_state)
```

## Full model checkpoints
<a name="model-parallel-core-features-v2-checkpoints-full"></a>

At the end of training, you can save a full checkpoint that combines all shards of a model into a single model checkpoint file. The SMP library fully supports the PyTorch full model checkpoints API, so you don't need to make any changes.

Note that if you use the SMP [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md), the SMP library transforms the model. When checkpointing the full model in this case, the SMP library translates the model back to the Hugging Face Transformers checkpoint format by default.

When you train with SMP tensor parallelism, you can use the `translate_on_save` argument of the PyTorch `FullStateDictConfig` API to switch the SMP auto-translation on or off as needed. For example, if you are focusing on training a model, you don't need the translation process, which adds overhead; in that case, we recommend setting `translate_on_save=False`. Also, if you plan to keep training the model with SMP in the future, you can switch the translation off so that the checkpoint is saved in the SMP-transformed format for later use. Translating the model back to the Hugging Face Transformers model checkpoint format is needed when you wrap up the training of your model and use it for inference.

```
import os

import torch
import torch.distributed as dist
import torch.sagemaker as tsm
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

# Save checkpoints.
with FSDP.state_dict_type(
    model, 
    StateDictType.FULL_STATE_DICT, 
    FullStateDictConfig(
        rank0_only=True, offload_to_cpu=True,
        # Default value is to translate back to Hugging Face Transformers format,
        # when saving full checkpoints for models trained with SMP tensor parallelism.
        # translate_on_save=True
    ),
):
    state_dict = model.state_dict()
    if dist.get_rank() == 0:
        logger.info("Processed state dict to save. Starting write to disk now.")
        os.makedirs(save_dir, exist_ok=True)
        # This name is needed for HF from_pretrained API to work.
        torch.save(state_dict, os.path.join(save_dir, "pytorch_model.bin"))
        hf_model_config.save_pretrained(save_dir)
    dist.barrier()
```

Note that the option `FullStateDictConfig(rank0_only=True, offload_to_cpu=True)` is to gather the model on the CPU of the 0th rank device to save memory when training large models.

To load the model back for inference, you do so as shown in the following code example. Note that the class `AutoModelForCausalLM` might need to change to another factory class in Hugging Face Transformers, such as `AutoModelForSeq2SeqLM`, depending on your model. For more information, see the [Hugging Face Transformers documentation](https://huggingface.co/docs/transformers/v4.36.1/en/model_doc/auto#natural-language-processing).

```
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(save_dir)
```

# Amazon SageMaker AI model parallelism library v2 examples
<a name="distributed-model-parallel-v2-examples"></a>

This page provides a list of blogs and Jupyter notebooks that present practical examples of implementing the SageMaker model parallelism (SMP) library v2 to run distributed training jobs on SageMaker AI.

## Blogs and Case Studies
<a name="distributed-model-parallel-v2-examples-blog"></a>

The following blogs discuss case studies about using SMP v2.
+ [Amazon SageMaker AI model parallel library now accelerates PyTorch FSDP workloads by up to 20%](https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-model-parallel-library-now-accelerates-pytorch-fsdp-workloads-by-up-to-20/)

## PyTorch example notebooks
<a name="distributed-model-parallel-examples-v2-pytorch"></a>

Example notebooks are provided in the [SageMaker AI examples GitHub repository](https://github.com/aws/amazon-sagemaker-examples/tree/master/training/distributed_training/). To download the examples, run the following command to clone the repository and go to `training/distributed_training/pytorch/model_parallel_v2`.

**Note**  
Clone and run the example notebooks in one of the following SageMaker AI ML IDEs.  
+ [SageMaker JupyterLab](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl.html) (available in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
+ [SageMaker Code Editor](https://docs.aws.amazon.com/sagemaker/latest/dg/code-editor.html) (available in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
+ [Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) (available as an application in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
+ [SageMaker Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html)

```
git clone https://github.com/aws/amazon-sagemaker-examples.git
cd amazon-sagemaker-examples/training/distributed_training/pytorch/model_parallel_v2
```

**SMP v2 example notebooks**
+ [Accelerate training of Llama v2 with SMP v2, PyTorch FSDP, and Transformer Engine by running FP8 training on P5 instances](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel_v2/llama_v2/smp-train-llama-fsdp-tp-fp8.ipynb)
+ [Fine-tune Llama v2 with SMP v2 and PyTorch FSDP at large-scale using tensor parallelism, hybrid sharding, and activation offloading](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel_v2/llama_v2/smp-finetuning-llama-fsdp-tp.ipynb)
+ [Train GPT-NeoX with SMP v2 and PyTorch FSDP at large scale](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel_v2/gpt-neox/smp-train-gpt-neox-fsdp-tp.ipynb)
+ [Fine-tune GPT-NeoX with SMP v2 and PyTorch FSDP at large-scale using tensor parallelism, hybrid sharding, and activation offloading](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel_v2/gpt-neox/smp-finetuning-gpt-neox-fsdp-tp.ipynb)

# SageMaker distributed model parallelism best practices
<a name="model-parallel-best-practices-v2"></a>

Use the following guidelines when you run a distributed training job with the SageMaker model parallel library v2 (SMP v2).

## Setting up the right configuration for distributed training
<a name="model-parallel-best-practices-configuration-v2"></a>

To estimate and find the best starting point to apply distributed training techniques that SMP v2 provides, review the following list. Each list item discusses the advantage of using the [Core features of the SageMaker model parallelism library v2](model-parallel-core-features-v2.md) along with potential tradeoffs. 

### Configuration tips
<a name="model-parallel-best-practices-v2-config-tips"></a>

This section provides guidelines on how to decide on the best model configurations for optimal throughput with global batch size requirements.

First, we recommend the following setups regardless of the size of your model.

1. Use the most powerful instance type that is available to you.

1. Turn on [mixed precision](model-parallel-core-features-v2-mixed-precision.md) all the time, as it provides substantial benefits for performance and memory reduction. We recommend using `bfloat16` because it has a wider dynamic range than `float16`, which makes training more numerically stable.

1. Turn on the [SageMaker distributed data parallelism library](data-parallel.md) (instead of using NCCL) whenever it’s applicable, as shown in [Compatibility with the SMDDP library optimized for AWS infrastructure](model-parallel-core-features-v2-smddp-allgather.md). One exception is tensor-parallelism-only use cases (`hybrid_shard_degree = 1` and `tensor_parallel_degree > 1`).

1. If your model has more than about 60 billion parameters, we recommend using [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md). You can also use delayed parameter initialization to speed up the initialization for any model.

1. We recommend enabling [Activation checkpointing](model-parallel-core-features-v2-pytorch-activation-checkpointing.md). 

Depending on the size of your model, we recommend that you start with the following guidance.

1. Use sharded data parallelism.

   1. Depending on the batch size you intend to fit in the GPU memory, choose the appropriate sharded data parallel degree. Normally, you should start with the lowest degree to fit your model in the GPU memory while minimizing overhead from network communication. If you see a warning that cache flushes are happening, we recommend that you increase the sharding degree. 

   1. Determine `world_size` based on the maximum local batch size and required global batch size, if any.

   1. You can experiment with activation offloading. Depending on scenarios, it can address your memory needs without having to increase the sharding degree, which means less communication. 

1. Use sharded data parallelism of PyTorch FSDP and tensor parallelism of SMP v2 simultaneously, as introduced in [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md).

   1. When training on large clusters, with FSDP alone the global batch size can become too large, causing convergence issues for the model. Typically, most research work keeps the batch size under 4 million tokens. In this case, you can resolve the problem by composing PyTorch FSDP with tensor parallelism of SMP v2 to reduce the batch size.

      For example, if you have 256 nodes and a sequence length of 4096, even a batch size of 1 per GPU leads to a global batch size of about 8 million tokens. However, when you use tensor parallelism with degree 2 and a batch size of 1 per tensor parallel group, the effective batch size per GPU is halved, which translates to about 4 million tokens.

   1. When training with long context lengths such as 8k or 16k, activation memory can become very high. FSDP doesn't shard activations, and activations can cause GPUs to go out of memory. In such scenarios, you can train efficiently by composing PyTorch FSDP with tensor parallelism of SMP v2.
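The global batch size arithmetic in the example above can be written out explicitly. The following sketch (an illustrative helper, not an SMP function, assuming 8 GPUs per node as on `ml.p4d.24xlarge`) computes the global batch size in tokens, accounting for the fact that all GPUs in one tensor-parallel group jointly process a single batch.

```python
def global_batch_tokens(num_nodes, gpus_per_node, batch_per_group,
                        seq_len, tensor_parallel_degree=1):
    """Global batch size in tokens for one training step.

    GPUs in the same tensor-parallel group share one batch, so the number
    of data-parallel workers is the total GPU count divided by the TP degree.
    """
    total_gpus = num_nodes * gpus_per_node
    data_parallel_workers = total_gpus // tensor_parallel_degree
    return data_parallel_workers * batch_per_group * seq_len

# 256 nodes x 8 GPUs, batch size 1 per GPU, sequence length 4096:
print(global_batch_tokens(256, 8, 1, 4096))                            # -> 8388608 (~8M tokens)
# Adding tensor parallelism of degree 2 halves the global batch size:
print(global_batch_tokens(256, 8, 1, 4096, tensor_parallel_degree=2))  # -> 4194304 (~4M tokens)
```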

### Reference configurations
<a name="model-parallel-best-practices-configuration-reference-v2"></a>

The SageMaker model parallelism training team provides the following reference points based on experiments with the Llama 2 model transformed to the SMP transformer model using [`torch.sagemaker.transform`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-transform), and trained on `ml.p4d.24xlarge` instance(s) with sequence length 4096 and mixed precision (FP16 or BF16).

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-best-practices-v2.html)

You can extrapolate from the preceding configurations to estimate GPU memory usage for your model configuration. For example, if you increase the sequence length for a 10-billion-parameter model or increase the size of the model to 20 billion, you might want to lower batch size first. If the model still doesn’t fit, try increasing the degree of tensor parallelism.

## Monitoring and logging a training job using the SageMaker AI console and Amazon CloudWatch
<a name="model-parallel-best-practices-monitoring-v2"></a>

To monitor system-level metrics such as CPU memory utilization, GPU memory utilization, and GPU utilization, use visualization provided through the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Training**.

1. Choose **Training jobs**.

1. In the main pane, choose the training job name for which you want to see more details.

1. Browse the main pane and find the **Monitor** section to see the automated visualization.

1. To see training job logs, choose **View logs** in the **Monitor** section. You can access the distributed training job logs of the training job in CloudWatch. If you launched multi-node distributed training, you should see multiple log streams with tags in the format of **algo-n-1234567890**. The **algo-1** log stream tracks training logs from the main (0th) node.

For more information, see [Amazon CloudWatch Metrics for Monitoring and Analyzing Training Jobs](training-metrics.md).

## Permissions
<a name="model-parallel-best-practices-permissions-v2"></a>

To run a SageMaker training job with model parallelism, make sure you have the right permissions in your IAM role, such as the following:
+ To use [FSx for Lustre](https://aws.amazon.com/fsx/), add [AmazonFSxFullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonFSxFullAccess).
+ To use Amazon S3 as a data channel, add [AmazonS3FullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonS3FullAccess).
+ To use Docker, build your own container, and push it to Amazon ECR, add [AmazonEC2ContainerRegistryFullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonEC2ContainerRegistryFullAccess).
+ To have full access to the entire suite of SageMaker AI features, add [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonSageMakerFullAccess). 

# The SageMaker model parallel library v2 reference
<a name="distributed-model-parallel-v2-reference"></a>

The following are references for the SageMaker model parallel library v2 (SMP v2).

**Topics**
+ [SMP v2 core feature configuration parameters](#distributed-model-parallel-v2-reference-init-config)
+ [Reference for the SMP v2 `torch.sagemaker` package](#model-parallel-v2-torch-sagemaker-reference)
+ [Upgrade from SMP v1 to SMP v2](#model-parallel-v2-upgrade-from-v1)

## SMP v2 core feature configuration parameters
<a name="distributed-model-parallel-v2-reference-init-config"></a>

The following is a complete list of parameters to activate and configure the [Core features of the SageMaker model parallelism library v2](model-parallel-core-features-v2.md). These must be written in JSON format and passed to the PyTorch estimator in the SageMaker Python SDK or saved as a JSON file for SageMaker HyperPod.

```
{
    "hybrid_shard_degree": Integer,
    "sm_activation_offloading": Boolean,
    "activation_loading_horizon": Integer,
    "fsdp_cache_flush_warnings": Boolean,
    "allow_empty_shards": Boolean,
    "tensor_parallel_degree": Integer,
    "context_parallel_degree": Integer,
    "expert_parallel_degree": Integer,
    "random_seed": Integer
}
```
+ `hybrid_shard_degree` (Integer) – Specifies a sharded parallelism degree. The value must be an integer between `0` and `world_size`. The default value is `0`.
  + If set to `0`, it falls back to the native PyTorch implementation and API in the script when `tensor_parallel_degree` is 1. Otherwise, it computes the largest possible `hybrid_shard_degree` based on `tensor_parallel_degree` and `world_size`. When falling back to the native PyTorch FSDP use cases, if `FULL_SHARD` is the strategy you use, it shards across the whole cluster of GPUs. If `HYBRID_SHARD` or `_HYBRID_SHARD_ZERO2` is the strategy, it is equivalent to a `hybrid_shard_degree` of 8. When tensor parallelism is enabled, it shards based on the revised `hybrid_shard_degree`.
  + If set to `1`, it falls back to the native PyTorch implementation and API for `NO_SHARD` in the script when `tensor_parallel_degree` is 1. Otherwise, it's equivalent to `NO_SHARD` within any given tensor parallel groups.
  + If set to an integer between 2 and `world_size`, sharding happens across the specified number of GPUs. If you don't set up `sharding_strategy` in the FSDP script, it gets overridden to `HYBRID_SHARD`. If you set `_HYBRID_SHARD_ZERO2`, the `sharding_strategy` you specify is used.
+ `sm_activation_offloading` (Boolean) – Specifies whether to enable the SMP activation offloading implementation. If `False`, offloading uses the native PyTorch implementation. If `True`, it uses the SMP activation offloading implementation. You also need to use the PyTorch activation offload wrapper (`torch.distributed.algorithms._checkpoint.checkpoint_wrapper.offload_wrapper`) in your script. To learn more, see [Activation offloading](model-parallel-core-features-v2-pytorch-activation-offloading.md). The default value is `True`.
+ `activation_loading_horizon` (Integer) – An integer specifying the activation offloading horizon type for FSDP. This is the maximum number of checkpointed or offloaded layers whose inputs can be in the GPU memory simultaneously. To learn more, see [Activation offloading](model-parallel-core-features-v2-pytorch-activation-offloading.md). The input value must be a positive integer. The default value is `2`.
+ `fsdp_cache_flush_warnings` (Boolean) – Detects and warns if cache flushes happen in the PyTorch memory manager, because they can degrade computational performance. The default value is `True`.
+ `allow_empty_shards` (Boolean) – Whether to allow empty shards when sharding tensors if the tensor is not evenly divisible. This is an experimental fix for a crash during checkpointing in certain scenarios. Disabling this falls back to the original PyTorch behavior. The default value is `False`.
+ `tensor_parallel_degree` (Integer) – Specifies a tensor parallelism degree. The value must be between `1` and `world_size`. The default value is `1`. Note that passing a value greater than 1 does not enable tensor parallelism automatically; you also need to use the [`torch.sagemaker.transform`](#model-parallel-v2-torch-sagemaker-reference-transform) API to wrap the model in your training script. To learn more, see [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md).
+ `context_parallel_degree` (Integer) – Specifies the context parallelism degree. The value must be between `1` and `world_size`, and must be `<= hybrid_shard_degree`. The default value is `1`. Note that passing a value greater than 1 does not enable context parallelism automatically; you also need to use the [`torch.sagemaker.transform`](#model-parallel-v2-torch-sagemaker-reference-transform) API to wrap the model in your training script. To learn more, see [Context parallelism](model-parallel-core-features-v2-context-parallelism.md).
+ `expert_parallel_degree` (Integer) – Specifies a expert parallelism degree. The value must be between 1 and `world_size`. The default value is `1`. Note that passing a value greater than 1 does not enable context parallelism automatically; you also need to use the [`torch.sagemaker.transform`](#model-parallel-v2-torch-sagemaker-reference-transform) API to wrap the model in your training script. To learn more, see [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md).
+ `random_seed` (Integer) – A seed number for the random operations in distributed modules by SMP tensor parallelism or expert parallelism. This seed is added to tensor-parallel or expert-parallel ranks to set the actual seed for each rank. It is unique for each tensor-parallel and expert-parallel rank. SMP v2 makes sure that the random number generated across tensor-parallel and expert-parallel ranks matches the non-tensor-parallelism and non-expert-parallelism cases respectively.
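As a quick reference, the following sketch collects several of the parameters above into a plain Python dictionary of the kind passed to `torch.sagemaker.init()` or through the estimator's distribution configuration. The specific values are illustrative only; tune them for your cluster and model.

```
# Illustrative SMP v2 configuration combining the parameters above.
# The values here are examples only, not recommendations.
smp_config = {
    "sm_activation_offloading": True,
    "activation_loading_horizon": 2,
    "tensor_parallel_degree": 2,     # > 1 also requires torch.sagemaker.transform in the script
    "context_parallel_degree": 1,
    "expert_parallel_degree": 1,
    "random_seed": 12345,
}

# A degree of 1 means the corresponding parallelism type is effectively disabled.
assert all(
    smp_config[key] >= 1
    for key in ("tensor_parallel_degree", "context_parallel_degree", "expert_parallel_degree")
)
```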

## Reference for the SMP v2 `torch.sagemaker` package
<a name="model-parallel-v2-torch-sagemaker-reference"></a>

This section is a reference for the `torch.sagemaker` package provided by SMP v2.

**Topics**
+ [`torch.sagemaker.delayed_param.DelayedParamIniter`](#model-parallel-v2-torch-sagemaker-reference-delayed-param-init)
+ [`torch.sagemaker.distributed.checkpoint.state_dict_saver.async_save`](#model-parallel-v2-torch-sagemaker-reference-checkpoint-async-save)
+ [`torch.sagemaker.distributed.checkpoint.state_dict_saver.maybe_finalize_async_calls`](#model-parallel-v2-torch-sagemaker-reference-checkpoint-state-dict-saver)
+ [`torch.sagemaker.distributed.checkpoint.state_dict_saver.save`](#model-parallel-v2-torch-sagemaker-reference-checkpoint-save)
+ [`torch.sagemaker.distributed.checkpoint.state_dict_loader.load`](#model-parallel-v2-torch-sagemaker-reference-checkpoint-load)
+ [`torch.sagemaker.moe.moe_config.MoEConfig`](#model-parallel-v2-torch-sagemaker-reference-moe)
+ [`torch.sagemaker.nn.attn.FlashSelfAttention`](#model-parallel-v2-torch-sagemaker-reference-flashselfattention)
+ [`torch.sagemaker.nn.attn.FlashGroupedQueryAttention`](#model-parallel-v2-torch-sagemaker-reference-flashGroupedQueryAttn)
+ [`torch.sagemaker.nn.huggingface.llama_flashattn.LlamaFlashAttention`](#model-parallel-v2-torch-sagemaker-reference-llamaFlashAttn)
+ [`torch.sagemaker.transform`](#model-parallel-v2-torch-sagemaker-reference-transform)
+ [`torch.sagemaker` util functions and properties](#model-parallel-v2-torch-sagemaker-reference-utils)

### `torch.sagemaker.delayed_param.DelayedParamIniter`
<a name="model-parallel-v2-torch-sagemaker-reference-delayed-param-init"></a>

An API for applying [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md) to a PyTorch model.

```
class torch.sagemaker.delayed_param.DelayedParamIniter(
    model: nn.Module,
    init_method_using_config : Callable = None,
    verbose: bool = False,
)
```

**Parameters**
+ `model` (`nn.Module`) – A PyTorch model to wrap and apply the delayed parameter initialization functionality of SMP v2.
+ `init_method_using_config` (Callable) – If you use the tensor parallel implementation of SMP v2 or supported [Hugging Face Transformer models compatible with the SMP tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md#model-parallel-core-features-v2-tensor-parallelism-supported-models), keep this parameter at its default value of `None`; in that case, the `DelayedParamIniter` API determines how to initialize the given model correctly. For any other model, you need to create a custom parameter initialization function, add it to your script, and pass it to the `init_method_using_config` parameter of the SMP `DelayedParamIniter` API. The following code snippet is the default `init_method_using_config` function that SMP v2 implements for the [Hugging Face Transformer models compatible with the SMP tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md#model-parallel-core-features-v2-tensor-parallelism-supported-models); use it as a reference for writing your own initialization configuration function.

  ```
  import torch
  import torch.nn as nn

  # Conv1D and LlamaRMSNorm come from Hugging Face Transformers.
  from transformers.pytorch_utils import Conv1D
  from transformers.models.llama.modeling_llama import LlamaRMSNorm

  from torch.sagemaker.utils.module_utils import empty_module_params, move_buffers_to_device

  # Define a custom init config function.
  # `config` refers to the Hugging Face configuration of the model being initialized.
  def custom_init_method_using_config(module):
      d = torch.cuda.current_device()
      empty_module_params(module, device=d)
      if isinstance(module, (nn.Linear, Conv1D)):
          module.weight.data.normal_(mean=0.0, std=config.initializer_range)
          if module.bias is not None:
              module.bias.data.zero_()
      elif isinstance(module, nn.Embedding):
          module.weight.data.normal_(mean=0.0, std=config.initializer_range)
          if module.padding_idx is not None:
              module.weight.data[module.padding_idx].zero_()
      elif isinstance(module, nn.LayerNorm):
          module.weight.data.fill_(1.0)
          module.bias.data.zero_()
      elif isinstance(module, LlamaRMSNorm):
          module.weight.data.fill_(1.0)
      move_buffers_to_device(module, device=d)

  delayed_initer = DelayedParamIniter(model, init_method_using_config=custom_init_method_using_config)
  ```

  For more information about the `torch.sagemaker.module_util` functions in the preceding code snippet, see [`torch.sagemaker` util functions and properties](#model-parallel-v2-torch-sagemaker-reference-utils).
+ `verbose` (Boolean) – Whether to enable more detailed logging during initialization and validation. The default value is `False`.

**Methods**
+ `get_param_init_fn()` – Returns the parameter initialization function that you can pass to the `param_init_fn` argument of the PyTorch FSDP wrapper class.
+ `get_post_param_init_fn()` – Returns the parameter initialization function that you can pass to the `post_param_init_fn` argument of the PyTorch FSDP wrapper class. This is needed when you have tied weights in the model. The model must implement the method `tie_weights`. For more information, see the **Notes on tied weight** in [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md).
+ `count_num_params` (`module: nn.Module, *args: Tuple[nn.Parameter]`) – Tracks how many parameters are being initialized by the parameter initialization function. This helps implement the following `validate_params_and_buffers_inited` method. You usually don’t need to call this function explicitly, because the `validate_params_and_buffers_inited` method implicitly calls this method in the backend.
+ `validate_params_and_buffers_inited` (`enabled: bool=True`) – This is a context manager that helps validate that the number of parameters initialized matches the total number of parameters in the model. It also validates that all parameters and buffers are now on GPU devices instead of meta devices. It raises `AssertionErrors` if these conditions are not met. This context manager is only optional and you're not required to use this context manager to initialize parameters.
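To illustrate how these methods plug into the PyTorch FSDP wrapper, the following minimal sketch wires `get_param_init_fn` and the optional validation context manager together. It assumes an initialized SMP v2 training job and a model created with delayed parameter initialization, so it is not runnable outside that environment; the FSDP keyword arguments shown are the standard PyTorch ones.

```
# Sketch: wiring DelayedParamIniter into the PyTorch FSDP wrapper.
# Assumes an initialized SMP v2 job; illustrative only.
def wrap_with_delayed_init(model):
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.sagemaker.delayed_param import DelayedParamIniter

    initer = DelayedParamIniter(model)
    # The context manager validates that all parameters were initialized
    # and moved off the meta device; it is optional.
    with initer.validate_params_and_buffers_inited():
        model = FSDP(
            model,
            param_init_fn=initer.get_param_init_fn(),
            # Only needed for models with tied weights:
            # post_param_init_fn=initer.get_post_param_init_fn(),
        )
    return model
```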

### `torch.sagemaker.distributed.checkpoint.state_dict_saver.async_save`
<a name="model-parallel-v2-torch-sagemaker-reference-checkpoint-async-save"></a>

Entry API for asynchronous save. Use this method to save a `state_dict` asynchronously to a specified `checkpoint_id`. 

```
def async_save(
    state_dict: STATE_DICT_TYPE,
    *,
    checkpoint_id: Union[str, os.PathLike, None] = None,
    storage_writer: Optional[StorageWriter] = None,
    planner: Optional[SavePlanner] = None,
    process_group: Optional[dist.ProcessGroup] = None,
    coordinator_rank: int = 0,
    queue : AsyncCallsQueue = None,
    sharded_strategy: Union[SaveShardedStrategy, Tuple[str, int], None] = None,
    wait_error_handling: bool = True,
    force_check_all_plans: bool = True,
    s3_region: Optional[str] = None,
    s3client_config: Optional[S3ClientConfig] = None
) -> None:
```

**Parameters**
+ `state_dict` (dict) - Required. The state dict to save.
+ `checkpoint_id` (str) - Required. The storage path to save checkpoints to.
+ `storage_writer` (StorageWriter) - Optional. An instance of PyTorch [`StorageWriter`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.StorageWriter) to perform write operations. If this is not specified, the default `StorageWriter` configuration is used.
+ `planner` (SavePlanner) - Optional. An instance of PyTorch [`SavePlanner`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.SavePlanner). If this is not specified, the default `SavePlanner` configuration is used.
+ `process_group` (ProcessGroup) - Optional. The process group to work on. If `None`, the default (global) process group is used.
+ `coordinator_rank` (int) - Optional. The rank of the coordinator when performing collective communication operators such as `AllReduce`.
+ `queue` (AsyncRequestQueue) - Optional. The async scheduler to use. By default, it takes the global parameter `DEFAULT_ASYNC_REQUEST_QUEUE`.
+ `sharded_strategy` (PyTorchDistSaveShardedStrategy) - Optional. The sharded strategy to use for saving checkpoints. If this is not specified, `torch.sagemaker.distributed.checkpoint.state_dict_saver.PyTorchDistSaveShardedStrategy` is used by default.
+ `wait_error_handling` (bool) - Optional. A flag specifying whether to wait for all ranks to finish error handling. The default value is `True`.
+ `force_check_all_plans` (bool) - Optional. A flag that determines whether to forcibly synchronize plans across ranks, even in the case of a cache hit. The default value is `True`.
+ `s3_region` (str) - Optional. The region where the S3 bucket is located. If not specified, the region is inferred from the `checkpoint_id`.
+ `s3client_config` (S3ClientConfig) - Optional. The dataclass exposing configurable parameters for the S3 client. If not provided, the default configuration of [S3ClientConfig](https://github.com/awslabs/s3-connector-for-pytorch/blob/main/s3torchconnector/src/s3torchconnector/_s3client/s3client_config.py#L7) is used. The `part_size` parameter is set to 64MB by default.

### `torch.sagemaker.distributed.checkpoint.state_dict_saver.maybe_finalize_async_calls`
<a name="model-parallel-v2-torch-sagemaker-reference-checkpoint-state-dict-saver"></a>

Use this function from a training process to monitor and finalize pending asynchronous save requests.

```
def maybe_finalize_async_calls(
    blocking=True, 
    process_group=None
) -> List[int]:
```

**Parameters**
+ `blocking` (bool) - Optional. If `True`, it will wait until all active requests are completed. Otherwise, it finalizes only the asynchronous requests that have already finished. The default value is `True`.
+ `process_group` (ProcessGroup) - Optional. The process group to operate on. If set to `None`, the default (global) process group is utilized.

**Returns**
+ A list containing the indices of the asynchronous calls that were successfully finalized.
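In practice, `async_save` and `maybe_finalize_async_calls` are used together: a training loop finalizes any completed requests before queueing the next asynchronous save. The following minimal sketch assumes an initialized SMP v2 training job, so the SMP calls are not runnable outside that environment; `make_checkpoint_id` is a hypothetical helper for illustration, not part of the SMP API.

```
# Hypothetical helper (not an SMP API) to build a per-step checkpoint path.
def make_checkpoint_id(base_uri: str, step: int) -> str:
    return f"{base_uri}/steps/{step:08d}"

# Sketch of one asynchronous checkpointing step; assumes an initialized
# SMP v2 training job.
def checkpoint_step(state_dict, base_uri, step):
    from torch.sagemaker.distributed.checkpoint.state_dict_saver import (
        async_save,
        maybe_finalize_async_calls,
    )

    # Reap requests that already completed, without blocking training.
    maybe_finalize_async_calls(blocking=False)
    async_save(state_dict, checkpoint_id=make_checkpoint_id(base_uri, step))
```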

### `torch.sagemaker.distributed.checkpoint.state_dict_saver.save`
<a name="model-parallel-v2-torch-sagemaker-reference-checkpoint-save"></a>

Use this method to save a `state_dict` synchronously to a specified `checkpoint_id`.

```
def save(
    state_dict: STATE_DICT_TYPE,
    *,
    checkpoint_id: Union[str, os.PathLike, None] = None,
    storage_writer: Optional[StorageWriter] = None,
    planner: Optional[SavePlanner] = None,
    process_group: Optional[dist.ProcessGroup] = None,
    coordinator_rank: int = 0,
    wait_error_handling: bool = True,
    force_check_all_plans: bool = True,
    s3_region: Optional[str] = None,
    s3client_config: Optional[S3ClientConfig] = None
) -> None:
```

**Parameters**
+ `state_dict` (dict) - Required. The state dict to save.
+ `checkpoint_id` (str) - Required. The storage path to save checkpoints to.
+ `storage_writer` (StorageWriter) - Optional. An instance of PyTorch [`StorageWriter`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.StorageWriter) to perform write operations. If this is not specified, the default `StorageWriter` configuration is used.
+ `planner` (SavePlanner) - Optional. An instance of PyTorch [`SavePlanner`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.SavePlanner). If this is not specified, the default `SavePlanner` configuration is used.
+ `process_group` (ProcessGroup) - Optional. The process group to work on. If `None`, the default (global) process group is used.
+ `coordinator_rank` (int) - Optional. The rank of the coordinator when performing collective communication operators such as `AllReduce`.
+ `wait_error_handling` (bool) - Optional. A flag specifying whether to wait for all ranks to finish error handling. The default value is `True`.
+ `force_check_all_plans` (bool) - Optional. A flag that determines whether to forcibly synchronize plans across ranks, even in the case of a cache hit. The default value is `True`.
+ `s3_region` (str) - Optional. The region where the S3 bucket is located. If not specified, the region is inferred from the `checkpoint_id`.
+ `s3client_config` (S3ClientConfig) - Optional. The dataclass exposing configurable parameters for the S3 client. If not provided, the default configuration of [S3ClientConfig](https://github.com/awslabs/s3-connector-for-pytorch/blob/main/s3torchconnector/src/s3torchconnector/_s3client/s3client_config.py#L7) is used. The `part_size` parameter is set to 64MB by default.

### `torch.sagemaker.distributed.checkpoint.state_dict_loader.load`
<a name="model-parallel-v2-torch-sagemaker-reference-checkpoint-load"></a>

Load the state dictionary of a distributed model (`state_dict`).

```
def load(
    state_dict: Dict[str, Any],
    *,
    checkpoint_id: Union[str, os.PathLike, None] = None,
    storage_reader: Optional[StorageReader] = None,
    planner: Optional[LoadPlanner] = None,
    process_group: Optional[dist.ProcessGroup] = None,
    check_keys_matched: bool = True,
    coordinator_rank: int = 0,
    s3_region: Optional[str] = None,
    s3client_config: Optional[S3ClientConfig] = None
) -> None:
```

**Parameters**
+ `state_dict` (dict) - Required. The `state_dict` to load.
+ `checkpoint_id` (str) - Required. The ID of a checkpoint. The meaning of the `checkpoint_id` depends on the storage. It can be a path to a folder or to a file. It can also be a key if the storage is a key-value store.
+ `storage_reader` (StorageReader) - Optional. An instance of PyTorch [`StorageReader`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.StorageReader) to perform read operations. If not specified, distributed checkpointing automatically infers the reader based on the `checkpoint_id`. If `checkpoint_id` is also `None`, an exception is raised.
+ `planner` (LoadPlanner) - Optional. An instance of PyTorch [`LoadPlanner`](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.LoadPlanner). If not specified, the default `LoadPlanner` configuration is used.
+ `check_keys_matched` (bool) - Optional. If enabled, checks whether the `state_dict` keys of all ranks are matched using `AllGather`.
+ `s3_region` (str) - Optional. The region where the S3 bucket is located. If not specified, the region is inferred from the `checkpoint_id`.
+ `s3client_config` (S3ClientConfig) - Optional. The dataclass exposing configurable parameters for the S3 client. If not provided, the default configuration of [S3ClientConfig](https://github.com/awslabs/s3-connector-for-pytorch/blob/main/s3torchconnector/src/s3torchconnector/_s3client/s3client_config.py#L7) is used. The `part_size` parameter is set to 64MB by default.

### `torch.sagemaker.moe.moe_config.MoEConfig`
<a name="model-parallel-v2-torch-sagemaker-reference-moe"></a>

A configuration class for setting up the SMP implementation of Mixture-of-Experts (MoE). You can specify MoE configuration values through this class and pass it to the [`torch.sagemaker.transform`](#model-parallel-v2-torch-sagemaker-reference-transform) API call. To learn more about the usage of this class for training MoE models, see [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md).

```
class torch.sagemaker.moe.moe_config.MoEConfig(
    smp_moe=True,
    random_seed=12345,
    moe_load_balancing="sinkhorn",
    global_token_shuffle=False,
    moe_all_to_all_dispatcher=True,
    moe_aux_loss_coeff=0.001,
    moe_z_loss_coeff=0.001
)
```

**Parameters**
+ `smp_moe` (Boolean) - Whether to use the SMP implementation of MoE. The default value is `True`.
+ `random_seed` (Integer) - A seed number for the random operations in expert-parallel distributed modules. This seed is added to the expert parallel rank to set the actual seed for each rank. It is unique for each expert parallel rank. The default value is `12345`.
+ `moe_load_balancing` (String) - Specifies the load balancing type of the MoE router. Valid options are `aux_loss`, `sinkhorn`, `balanced`, and `none`. The default value is `sinkhorn`.
+ `global_token_shuffle` (Boolean) - Whether to shuffle tokens across EP ranks within the same EP group. The default value is `False`.
+ `moe_all_to_all_dispatcher` (Boolean) - Whether to use all-to-all dispatcher for the communications in MoE. The default value is `True`.
+ `moe_aux_loss_coeff` (Float) - A coefficient for auxiliary load balancing loss. The default value is `0.001`.
+ `moe_z_loss_coeff` (Float) - Coefficient for z-loss. The default value is `0.001`.
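As a usage illustration, the following hedged sketch builds an `MoEConfig` and hands it to the `torch.sagemaker.transform` API. It assumes an initialized SMP v2 training job and a supported Hugging Face MoE model; the exact wiring of the config into `transform()` shown here is illustrative, so see [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md) for the definitive usage.

```
# Sketch: configure the SMP MoE implementation and transform the model.
# Assumes an initialized SMP v2 job; not runnable outside that environment.
def transform_moe_model(model):
    import torch.sagemaker as tsm
    from torch.sagemaker.moe.moe_config import MoEConfig

    moe_config = MoEConfig(
        smp_moe=True,
        moe_load_balancing="sinkhorn",
        moe_all_to_all_dispatcher=True,
    )
    # Illustrative: the MoE configuration is passed through the transform call.
    return tsm.transform(model, config=moe_config)
```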

### `torch.sagemaker.nn.attn.FlashSelfAttention`
<a name="model-parallel-v2-torch-sagemaker-reference-flashselfattention"></a>

An API for using [FlashAttention](model-parallel-core-features-v2-flashattention.md) with SMP v2.

```
class torch.sagemaker.nn.attn.FlashSelfAttention(
   attention_dropout_prob: float = 0.0,
   scale: Optional[float] = None,
   triton_flash_attention: bool = False,
   use_alibi: bool = False,
)
```

**Parameters**
+ `attention_dropout_prob` (float) – The dropout probability to apply to attention. The default value is `0.0`.
+ `scale` (float) – If passed, this scale factor is applied for softmax. If set to `None` (which is also the default value), the scale factor is `1 / sqrt(attention_head_size)`. The default value is `None`.
+ `triton_flash_attention` (bool) – If passed, the Triton implementation of flash attention is used. This is necessary to support Attention with Linear Biases (ALiBi) (see the following `use_alibi` parameter). This version of the kernel doesn’t support dropout. The default value is `False`.
+ `use_alibi` (bool) – If passed, it enables Attention with Linear Biases (ALiBi) using the provided mask. When using ALiBi, an attention mask prepared as follows is required. The default value is `False`.

  ```
  def generate_alibi_attn_mask(attention_mask, batch_size, seq_length, 
      num_attention_heads, alibi_bias_max=8):
      device, dtype = attention_mask.device, attention_mask.dtype
      alibi_attention_mask = torch.zeros(
          1, num_attention_heads, 1, seq_length, dtype=dtype, device=device
      )
  
      alibi_bias = torch.arange(1 - seq_length, 1, dtype=dtype, device=device).view(
          1, 1, 1, seq_length
      )
      m = torch.arange(1, num_attention_heads + 1, dtype=dtype, device=device)
      m.mul_(alibi_bias_max / num_attention_heads)
      alibi_bias = alibi_bias * (1.0 / (2 ** m.view(1, num_attention_heads, 1, 1)))
  
      alibi_attention_mask.add_(alibi_bias)
      alibi_attention_mask = alibi_attention_mask[..., :seq_length, :seq_length]
      if attention_mask is not None and attention_mask.bool().any():
          # Use the in-place variant so the fill actually modifies the mask.
          alibi_attention_mask.masked_fill_(
              attention_mask.bool().view(batch_size, 1, 1, seq_length), float("-inf")
          )
  
      return alibi_attention_mask
  ```

**Methods**
+ `forward(self, qkv, attn_mask=None, causal=False, cast_dtype=None, layout="b h s d")` – A regular PyTorch module function. When `module(x)` is called, SMP runs this function automatically.
  + `qkv` – A `torch.Tensor` of shape `(batch_size x seqlen x (3 x num_heads) x head_size)` or `(batch_size x (3 x num_heads) x seqlen x head_size)`, or a tuple of `torch.Tensor`s each of which might be of shape `(batch_size x seqlen x num_heads x head_size)` or `(batch_size x num_heads x seqlen x head_size)`. An appropriate `layout` argument must be passed based on the shape.
  + `attn_mask` – `torch.Tensor` of the following form `(batch_size x 1 x 1 x seqlen)`. To enable this attention mask parameter, it requires `triton_flash_attention=True` and `use_alibi=True`. To learn how to generate an attention mask using this method, see the code examples at [FlashAttention](model-parallel-core-features-v2-flashattention.md). The default value is `None`.
  + `causal` – When set to `False`, which is the default value of the argument, no mask is applied. When set to `True`, the `forward` method uses the standard lower triangular mask. The default value is `False`.
  + `cast_dtype` – When set to a particular `dtype`, it casts the `qkv` tensors to that `dtype` before `attn`. This is useful for implementations such as the Hugging Face Transformer GPT-NeoX model, which has `q` and `k` with `fp32` after rotary embeddings. If set to `None`, no cast is applied. The default value is `None`.
  + `layout` (string) – Available values are `b h s d` or `b s h d`. This should be set to the layout of `qkv` tensors passed, so appropriate transformations can be applied for `attn`. The default value is `b h s d`.

**Returns**

A single `torch.Tensor` with shape `(batch_size x num_heads x seq_len x head_size)`.
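As a usage illustration, the following hedged sketch constructs the module and calls it through the standard `module(x)` convention. It assumes an initialized SMP v2 environment, and the tensor shape in the comment is an assumption based on the parameter descriptions above.

```
# Minimal sketch of calling FlashSelfAttention; assumes an initialized
# SMP v2 environment, so it is not runnable outside of one.
def run_flash_self_attention(qkv):
    from torch.sagemaker.nn.attn import FlashSelfAttention

    # With layout "b h s d", qkv is (batch_size, 3 * num_heads, seqlen, head_size).
    attn = FlashSelfAttention(attention_dropout_prob=0.0)
    # causal=True applies the standard lower-triangular mask.
    return attn(qkv, causal=True, layout="b h s d")
```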

### `torch.sagemaker.nn.attn.FlashGroupedQueryAttention`
<a name="model-parallel-v2-torch-sagemaker-reference-flashGroupedQueryAttn"></a>

An API for using `FlashGroupedQueryAttention` with SMP v2. To learn more about the usage of this API, see [Use FlashAttention kernels for grouped-query attention](model-parallel-core-features-v2-flashattention.md#model-parallel-core-features-v2-flashattention-grouped-query).

```
class torch.sagemaker.nn.attn.FlashGroupedQueryAttention(
    attention_dropout_prob: float = 0.0,
    scale: Optional[float] = None,
)
```

**Parameters**
+ `attention_dropout_prob` (float) – The dropout probability to apply to attention. The default value is `0.0`.
+ `scale` (float) – If passed, this scale factor is applied for softmax. If set to `None`, `1 / sqrt(attention_head_size)` is used as the scale factor. The default value is `None`.

**Methods**
+ `forward(self, q, kv, causal=False, cast_dtype=None, layout="b s h d")` – A regular PyTorch module function. When `module(x)` is called, SMP runs this function automatically.
  + `q` – A `torch.Tensor` of shape `(batch_size x seqlen x num_heads x head_size)` or `(batch_size x num_heads x seqlen x head_size)`. An appropriate `layout` argument must be passed based on the shape.
  + `kv` – A `torch.Tensor` of shape `(batch_size x seqlen x (2 x num_heads) x head_size)` or `(batch_size x (2 x num_heads) x seqlen x head_size)`, or a tuple of two `torch.Tensor`s, each of which might be of shape `(batch_size x seqlen x num_heads x head_size)` or `(batch_size x num_heads x seqlen x head_size)`. An appropriate `layout` argument must also be passed based on the shape.
  + `causal` – When set to `False`, which is the default value of the argument, no mask is applied. When set to `True`, the `forward` method uses the standard lower triangular mask. The default value is `False`.
  + `cast_dtype` – When set to a particular dtype, it casts the `qkv` tensors to that dtype before `attn`. This is useful for implementations such as Hugging Face Transformers GPT-NeoX, which has `q` and `k` with `fp32` after rotary embeddings. If set to `None`, no cast is applied. The default value is `None`.
  + `layout` (string) – Available values are `"b h s d"` or `"b s h d"`. This should be set to the layout of the `qkv` tensors passed, so appropriate transformations can be applied for `attn`. The default value is `"b s h d"`, matching the signature above.

**Returns**

Returns a single `torch.Tensor` of shape `(batch_size x num_heads x seq_len x head_size)` that represents the output of the attention computation.

### `torch.sagemaker.nn.huggingface.llama_flashattn.LlamaFlashAttention`
<a name="model-parallel-v2-torch-sagemaker-reference-llamaFlashAttn"></a>

An API that supports FlashAttention for the Llama model. This API uses the [`torch.sagemaker.nn.attn.FlashGroupedQueryAttention`](#model-parallel-v2-torch-sagemaker-reference-flashGroupedQueryAttn) API at low level. To learn how to use this, see [Use FlashAttention kernels for grouped-query attention](model-parallel-core-features-v2-flashattention.md#model-parallel-core-features-v2-flashattention-grouped-query).

```
class torch.sagemaker.nn.huggingface.llama_flashattn.LlamaFlashAttention(
    config: LlamaConfig
)
```

**Parameters**
+ `config` – A FlashAttention configuration for the Llama model.

**Methods**
+ `forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)`
  + `hidden_states` (`torch.Tensor`) – Hidden states in the form of `(batch_size x seq_len x num_heads x head_size)`.
  + `attention_mask` (`torch.LongTensor`) – Mask to avoid performing attention on padding token indices, in the form of `(batch_size x seqlen)`. The default value is `None`.
  + `position_ids` (`torch.LongTensor`) – When not `None`, a tensor in the form of `(batch_size x seqlen)`, indicating the indices of the position of each input sequence token in the position embeddings. The default value is `None`.
  + `past_key_value` (Cache) – Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks). The default value is `None`. 
  + `output_attentions` (bool) – Indicates whether to return the attentions tensors of all attention layers. The default value is `False`. 
  + `use_cache` (bool) – Indicates whether to return `past_key_values` key value states. The default value is `False`. 

**Returns**

Returns a single `torch.Tensor` of shape `(batch_size x num_heads x seq_len x head_size)` that represents the output of the attention computation.

### `torch.sagemaker.transform`
<a name="model-parallel-v2-torch-sagemaker-reference-transform"></a>

SMP v2 provides this `torch.sagemaker.transform()` API for transforming Hugging Face Transformer models to SMP model implementations and enabling the SMP tensor parallelism.

```
torch.sagemaker.transform(
    model: nn.Module, 
    device: Optional[torch.device] = None, 
    dtype: Optional[torch.dtype] = None, 
    config: Optional[Dict] = None, 
    load_state_dict_from_rank0: bool = False,
    cp_comm_type: str = "p2p"
)
```

SMP v2 maintains transformation policies for the [Hugging Face Transformer models compatible with the SMP tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md#model-parallel-core-features-v2-tensor-parallelism-supported-models) by converting the configuration of the Hugging Face Transformer models to the SMP transformer configuration.

**Parameters**
+ `model` (`torch.nn.Module`) – A model from [Hugging Face Transformer models compatible with the SMP tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md#model-parallel-core-features-v2-tensor-parallelism-supported-models) to transform and apply the tensor parallelism feature of the SMP library.
+ `device` (`torch.device`) – If passed, a new model is created on this device. If the original module has any parameter on meta device (see [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md)), then the transformed module will also be created on meta device, ignoring the argument passed here. The default value is `None`.
+ `dtype` (`torch.dtype`) – If passed, sets this as the dtype context manager for the creation of the model and creates a model with this dtype. This is typically unnecessary, as we want to create the model with `fp32` when using `MixedPrecision`, and `fp32` is the default dtype in PyTorch. The default value is `None`.
+ `config` (dict) – This is a dictionary for configuring the SMP transformer. The default value is `None`.
+ `load_state_dict_from_rank0` (Boolean) – By default, this module creates a new instance of the model with new weights. When this argument is set to `True`, SMP tries to load the state dictionary of the original PyTorch model from the 0th rank into transformed model for the tensor parallel group that the 0th rank is part of. When this is set to `True`, rank 0 can’t have any parameters on meta device. Only the first tensor parallel group populates the weights from the 0th rank after this transform call. You need to set `sync_module_states` to `True` in the FSDP wrapper to get these weights from the first tensor parallel group to all other processes. With this activated, the SMP library loads the state dictionary from the original model. The SMP library takes the `state_dict` of the model before transform, converts it to match the structure of the transformed model, shards it for each tensor parallel rank, communicates this state from the 0th rank to other ranks in the tensor parallel group that the 0th rank is part of, and loads it. The default value is `False`.
+ `cp_comm_type` (str) – Determines the context parallelism implementation and is only applicable when the `context_parallel_degree` is greater than 1. Available values for this parameter are `p2p` and `all_gather`. The `p2p` implementation utilizes peer-to-peer send-receive calls for key-and-value (KV) tensor accumulation during the attention computation, running asynchronously and allowing communication to overlap with computation. On the other hand, the `all_gather` implementation employs the `AllGather` communication collective operation for KV tensor accumulation. The default value is `"p2p"`.

**Returns**

Returns a transformed model that you can wrap with PyTorch FSDP. When `load_state_dict_from_rank0` is set to `True`, the tensor parallel group that involves rank 0 has weights loaded from the original state dictionary on rank 0. When using [Delayed parameter initialization](model-parallel-core-features-v2-delayed-param-init.md) on the original model, only these ranks have the actual tensors on CPUs for the parameters and buffers of the transformed model. The rest of the ranks continue to have the parameters and buffers on the meta device to save memory.
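To illustrate, here is a minimal sketch (not a definitive recipe) of transforming a supported Hugging Face model and wrapping the result with PyTorch FSDP. It assumes an initialized SMP v2 training job, so it is not runnable outside that environment.

```
# Sketch: transform a supported Hugging Face model and wrap it with FSDP.
# Assumes an initialized SMP v2 training job; illustrative only.
def prepare_model(model):
    import torch.sagemaker as tsm
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    model = tsm.transform(model, load_state_dict_from_rank0=True)
    # sync_module_states=True broadcasts the weights loaded into the first
    # tensor parallel group to all other processes, as required when
    # load_state_dict_from_rank0=True.
    return FSDP(model, sync_module_states=True)
```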

### `torch.sagemaker` util functions and properties
<a name="model-parallel-v2-torch-sagemaker-reference-utils"></a>

**torch.sagemaker util functions**
+ `torch.sagemaker.init(config: Optional[Union[str, Dict[str, Any]]] = None) -> None` – Initializes the PyTorch training job with SMP.
+ `torch.sagemaker.is_initialized() -> bool` – Checks whether the training job is initialized with SMP. When falling back to native PyTorch while the job is initialized with SMP, some of the properties are not relevant and become `None`, as indicated in the following **Properties** list.
+ `torch.sagemaker.utils.module_utils.empty_module_params(module: nn.Module, device: Optional[torch.device] = None, recurse: bool = False) -> nn.Module` – Creates empty parameters for the module on the given `device`, if specified, and applies recursively to all nested modules if `recurse` is `True`.
+ `torch.sagemaker.utils.module_utils.move_buffers_to_device(module: nn.Module, device: torch.device, recurse: bool = False) -> nn.Module` – Moves the module buffers to the given `device`, and applies recursively to all nested modules if `recurse` is `True`.

**Properties**

`torch.sagemaker.state` holds multiple useful properties after the initialization of SMP with `torch.sagemaker.init`.
+ `torch.sagemaker.state.hybrid_shard_degree` (int) – The sharded data parallelism degree, a copy from user input in the SMP configuration passed to `torch.sagemaker.init()`. To learn more, see [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md).
+ `torch.sagemaker.state.rank` (int) – The global rank for the device, in the range of `[0, world_size)`.
+ `torch.sagemaker.state.rep_rank_process_group` (`torch.distributed.ProcessGroup`) – The process group including all devices with the same replication rank. Note the subtle but fundamental difference with `torch.sagemaker.state.tp_process_group`. When falling back to native PyTorch, it returns `None`.
+ `torch.sagemaker.state.tensor_parallel_degree` (int) – The tensor parallelism degree, a copy from user input in the SMP configuration passed to `torch.sagemaker.init()`. To learn more, see [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md).
+ `torch.sagemaker.state.tp_size` (int) – An alias to `torch.sagemaker.state.tensor_parallel_degree`.
+ `torch.sagemaker.state.tp_rank` (int) – The tensor parallelism rank for the device in the range of `[0, tp_size)`, determined by the tensor parallelism degree and the ranking mechanism.
+ `torch.sagemaker.state.tp_process_group` (`torch.distributed.ProcessGroup`) – The tensor parallel process group including all devices with the same rank in other dimensions (for example, sharded data parallelism and replication) but unique tensor parallel ranks. When falling back to native PyTorch, it returns `None`.
+ `torch.sagemaker.state.world_size` (int) – The total number of devices used in training.
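To make the relationship between these properties concrete, the following is an illustrative decomposition of a global rank into a tensor parallel rank and group index. This assumes contiguous tensor parallel groups and is not necessarily SMP's exact ranking mechanism:

```python
def decompose_rank(global_rank: int, tp_size: int) -> tuple[int, int]:
    """Illustrative only: split a global rank into (tp_rank, tp_group_index),
    assuming tensor parallel groups are formed from contiguous ranks."""
    tp_rank = global_rank % tp_size    # position inside the tensor parallel group
    tp_group = global_rank // tp_size  # which tensor parallel group this rank is in
    return tp_rank, tp_group

# With world_size=8 and tp_size=4, ranks 0-3 form one tensor parallel
# group and ranks 4-7 form another.
world_size, tp_size = 8, 4
groups = {}
for rank in range(world_size):
    tp_rank, tp_group = decompose_rank(rank, tp_size)
    groups.setdefault(tp_group, []).append(rank)
# groups == {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```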

## Upgrade from SMP v1 to SMP v2
<a name="model-parallel-v2-upgrade-from-v1"></a>

To move from SMP v1 to SMP v2, you must make script changes to remove the SMP v1 APIs and apply the SMP v2 APIs. Instead of starting from your SMP v1 script, we recommend you start from a PyTorch FSDP script, and follow the instructions at [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md).

To bring SMP v1 *models* to SMP v2, first collect the full model state dictionary in SMP v1 and apply a translation function to convert it into the Hugging Face Transformers model checkpoint format. Then in SMP v2, as discussed in [Checkpointing using SMP](model-parallel-core-features-v2-checkpoints.md), you can load the Hugging Face Transformers model checkpoints and continue using the PyTorch checkpoint APIs with SMP v2. To use SMP with your PyTorch FSDP model, make sure that you move to SMP v2 and update your training script to use PyTorch FSDP and the latest features.

```
import smdistributed.modelparallel.torch as smp

# Create model
model = ...
model = smp.DistributedModel(model)

# Run training
...

# Save v1 full checkpoint
if smp.rdp_rank() == 0:
    state_dict = model.state_dict(gather_to_rank0=True) # gather the full model state dict on rank 0
    # Get the corresponding translation function in SMP v1 and translate
    if model_type == "gpt_neox":
        from smdistributed.modelparallel.torch.nn.huggingface.gptneox import translate_state_dict_to_hf_gptneox
        translated_state_dict = translate_state_dict_to_hf_gptneox(state_dict, max_seq_len=None)
    
    # Save the checkpoint
    checkpoint_path = "checkpoint.pt"
    if smp.rank() == 0:
        smp.save(
            {"model_state_dict": translated_state_dict},
            checkpoint_path,
            partial=False,
        )
```

To find available translation functions in SMP v1, see [Support for Hugging Face Transformer Models](model-parallel-extended-features-pytorch-hugging-face.md).
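Conceptually, a translation function only remaps parameter names from one checkpoint layout to another. The following toy sketch uses hypothetical key names (not the real GPT-NeoX mapping) to show the idea:

```python
# Hypothetical name mapping; the real translation functions in SMP v1
# handle the model-specific parameter layout for you.
NAME_MAP = {
    "transformer.layers.0.attn.weight": "gpt_neox.layers.0.attention.weight",
    "transformer.layers.0.mlp.weight": "gpt_neox.layers.0.mlp.weight",
}

def translate_state_dict(state_dict: dict) -> dict:
    """Return a new state dict with keys renamed for the target checkpoint format."""
    return {NAME_MAP.get(key, key): value for key, value in state_dict.items()}

smp_style = {"transformer.layers.0.attn.weight": [0.1, 0.2]}
hf_style = translate_state_dict(smp_style)
# hf_style keys now follow the Hugging Face naming convention.
```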

For instructions on saving and loading model checkpoints in SMP v2, see [Checkpointing using SMP](model-parallel-core-features-v2-checkpoints.md).

# Release notes for the SageMaker model parallelism library
<a name="model-parallel-release-notes"></a>

See the following release notes to track the latest updates for the SageMaker model parallelism (SMP) library. If you have further questions about the SMP library, contact the SMP service team at `sm-model-parallel-feedback@amazon.com`.

## The SageMaker model parallelism library v2.8.0
<a name="model-parallel-release-notes-20250306"></a>

*Date: April 01, 2025*

### SMP library updates
<a name="model-parallel-release-notes-20250306-smp-lib"></a>

**Bug fixes**
+ SMP gradient norm clipping now supports activation offloading.

### SMP Docker and Enroot containers
<a name="model-parallel-release-notes-20250306-smp-docker"></a>

The SMP library team distributes Docker containers that replace the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to `v2.243.0` or later.
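For reference, the distribution configuration is a plain dictionary passed to the PyTorch estimator. The sketch below shows the general shape; the `"smp_options"` key, parameter values, and estimator arguments are assumptions for illustration, so check them against your SDK version:

```python
# SMP v2 configuration parameters (names from the SMP v2 configuration reference;
# the values here are placeholders).
smp_options = {
    "enabled": True,
    "parameters": {
        "hybrid_shard_degree": 8,
        "tensor_parallel_degree": 2,
    },
}

def make_estimator(role: str):
    # Deferred import: the sagemaker SDK is only needed at launch time.
    from sagemaker.pytorch import PyTorch
    return PyTorch(
        entry_point="train.py",
        role=role,
        instance_count=2,
        instance_type="ml.p4d.24xlarge",
        framework_version="2.4.1",
        py_version="py311",
        distribution={
            "torch_distributed": {"enabled": True},
            "smp_options": smp_options,  # assumption: key name may vary by SDK version
        },
    )
```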

**Currency updates**
+ Added support for PyTorch v2.5.1
+ Upgraded CUDA support to v12.4
+ Upgraded NCCL support to v2.23.4
+ Upgraded SMDDP library to 2.6.0

**Container details**
+ SMP Docker container for PyTorch v2.5.1 with CUDA v12.4

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.5.1-gpu-py311-cu124
  ```
+ SMP Enroot container for PyTorch v2.5.1 with CUDA v12.4

  ```
  https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/enroot/2.5.1-gpu-py311-cu124.sqsh
  ```
+ Pre-installed packages
  + The SMP library v2.8.0
  + The SMDDP library v2.6.0
  + CUDNN v9.4.0
  + FlashAttention v2.5.8
  + TransformerEngine v1.10
  + Megatron v0.8.0
  + Hugging Face Transformers v4.44.2
  + Hugging Face Datasets library v2.19.0
  + EFA v1.36.0
  + NCCL v2.23.4
  + AWS-OFI-NCCL v1.13.2

### SMP Conda channel
<a name="model-parallel-release-notes-20250306-smp-conda-channel"></a>

The following S3 bucket is the public Conda channel of the SMP library hosted by the SMP service team. If you want to install the SMP v2 library in an environment such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.
+ `https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/`

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.

## The SageMaker model parallelism library v2.7.0
<a name="model-parallel-release-notes-20241204"></a>

*Date: December 04, 2024*

### SMP library updates
<a name="model-parallel-release-notes-20241204-smp-lib"></a>

**New features**
+ Added support for [SageMaker HyperPod recipes](sagemaker-hyperpod-recipes.md).

### SMP Docker and Enroot containers
<a name="model-parallel-release-notes-20241204-smp-docker"></a>

The SMP library team distributes Docker and Enroot containers that replace the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to `v2.237.0` or later.

**Container details**
+ SMP Docker container for PyTorch v2.4.1 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121
  ```
+ SMP Enroot container for PyTorch v2.4.1 with CUDA v12.1

  ```
  https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/enroot/2.4.1-gpu-py311-cu121.sqsh
  ```
+ Pre-installed packages
  + The SMP library v2.7.0
  + The SMDDP library v2.5.0
  + CUDNN v9.4.0
  + FlashAttention v2.5.8
  + TransformerEngine v1.10
  + Megatron v0.8.0
  + Hugging Face Transformers v4.44.2
  + Hugging Face Datasets library v2.19.0
  + EFA v1.32.0
  + NCCL v2.21.5

### SMP Conda channel
<a name="model-parallel-release-notes-20241204-smp-conda-channel"></a>

The following S3 bucket is the public Conda channel of the SMP library hosted by the SMP service team. If you want to install the SMP v2 library in a Conda environment such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.
+ `https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/`

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.

## The SageMaker model parallelism library v2.6.1
<a name="model-parallel-release-notes-20241031"></a>

*Date: October 31, 2024*

### SMP library updates
<a name="model-parallel-release-notes-20241031-smp-lib"></a>

**Bug fixes**
+ Fixed an `ImportError` issue that occurred when using older training scripts with SMP v2.6.0. This fixes the backward incompatibility with SMP v2.6.0.
+ Added a `DeprecationWarning` for `torch.sagemaker.distributed.fsdp.checkpoint`. This module will be deprecated and removed in SMP v2.7.0. If you're currently using `torch.sagemaker.distributed.fsdp.checkpoint` in your code, you should plan to update your scripts before the release of SMP v2.7.0 to avoid issues in the future.
+ Fixed a backward compatibility issue identified in SMP v2.6.0. This issue was related to the deprecation of the `USE_PG_WITH_UTIL` checkpoint method in SMP v2.6.0, which broke backward compatibility with previous versions of training scripts. To resolve this issue, re-run your PyTorch training jobs to pick up the latest SMP container packaged with SMP v2.6.1.

### SMP Docker container
<a name="model-parallel-release-notes-20241031-smp-docker"></a>

The SMP library team distributes Docker containers that replace the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers.

**Container details**
+ SMP Docker container for PyTorch v2.4.1 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121
  ```
+ Pre-installed packages
  + The SMP library v2.6.1
  + The SMDDP library v2.5.0
  + CUDNN v9.4.0
  + FlashAttention v2.5.8
  + TransformerEngine v1.10
  + Megatron v0.8.0
  + Hugging Face Transformers v4.44.2
  + Hugging Face Datasets library v2.19.0
  + EFA v1.32.0
  + NCCL v2.21.5

### SMP Conda channel
<a name="model-parallel-release-notes-20241031-smp-conda-channel"></a>

The following S3 bucket is the public Conda channel of the SMP library hosted by the SMP service team. If you want to install the SMP v2 library in an environment of highly customizable compute resources such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.
+ `https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/`

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.

## The SageMaker model parallelism library v2.6.0
<a name="model-parallel-release-notes-20241017"></a>

*Date: October 17, 2024*

### SMP library updates
<a name="model-parallel-release-notes-20241017-smp-lib"></a>

**New features**
+ Added support for the following LLM model configurations. You can start using [Context parallelism](model-parallel-core-features-v2-context-parallelism.md) and [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md).
  + [Llama3.1 8B](https://huggingface.co/meta-llama/Llama-3.1-8B)
  + [Llama3.1 70B](https://huggingface.co/meta-llama/Llama-3.1-70B)
  + [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.3)
+ Added [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md) support for the following Mixtral model configurations.
  + [Mixtral 8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)
  + [Mixtral 8x22B](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1)
+ Added support for an AllGather-based context parallelism implementation that utilizes the `AllGather` communication collective to obtain the full sequence of key-and-value tensors. Available implementations are `p2p` and `all_gather`. The `p2p` implementation utilizes peer-to-peer send-receive calls for key-and-value (KV) tensor accumulation during the attention computation, running asynchronously and allowing communication to overlap with computation. The `all_gather` implementation instead employs the `AllGather` communication collective operation for KV tensor accumulation. To learn how to apply these context parallelism implementations, see [Context parallelism](model-parallel-core-features-v2-context-parallelism.md).
+ Added support for tuning the Rotary Position Embedding (RoPE) theta value.
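Based on the parameter names described in the reference, selecting the AllGather-based implementation amounts to two entries in the SMP configuration dictionary. The following is a minimal sketch, with placeholder values:

```python
# SMP v2 configuration sketch: enable context parallelism with the
# AllGather-based KV accumulation instead of the default "p2p".
smp_config = {
    "context_parallel_degree": 2,  # must be greater than 1 for cp_comm_type to apply
    "cp_comm_type": "all_gather",  # or "p2p" (the default)
}
```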

**Bug fixes**
+ Fixed a bug where Rotary Position Embedding (RoPE) wasn’t properly initialized during pre-training when delayed parameter initialization is enabled.

**Known issues**
+ Transformer Engine does not currently support context parallelism or FP8 with sliding window attention enabled. As a result, the SMP version of the Mistral transformer doesn’t support context parallelism or FP8 training when the sliding window configuration is set to a non-null value.

### SMP Docker container
<a name="model-parallel-release-notes-20241017-smp-docker"></a>

The SMP library team distributes Docker containers that replace the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers.

**Currency updates**
+ Upgraded PyTorch to v2.4.1
+ Upgraded Megatron to v0.8.0
+ Upgraded the TransformerEngine library to v1.10
+ Upgraded Transformers to v4.44.2
+ Upgraded cuDNN to v9.4.0.58

**Container details**
+ SMP Docker container for PyTorch v2.4.1 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121
  ```
+ Pre-installed packages
  + The SMP library v2.6.0
  + The SMDDP library v2.5.0
  + CUDNN v9.4.0
  + FlashAttention v2.5.8
  + TransformerEngine v1.10
  + Megatron v0.8.0
  + Hugging Face Transformers v4.44.2
  + Hugging Face Datasets library v2.19.0
  + EFA v1.32.0
  + NCCL v2.21.5

### SMP Conda channel
<a name="model-parallel-release-notes-20241017-smp-conda-channel"></a>

The following S3 bucket is the public Conda channel of the SMP library hosted by the SMP service team. If you want to install the SMP v2 library in an environment of highly customizable compute resources such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.
+ `https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/`

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.

## The SageMaker model parallelism library v2.5.0
<a name="model-parallel-release-notes-20240828"></a>

*Date: August 28, 2024*

### SMP library updates
<a name="model-parallel-release-notes-20240828-smp-lib"></a>

**New features**
+ Added support for mixed-precision training using FP8 data format on P5 instances for the Mixtral model.
  + Supported Mixtral configurations are 8x7B and 8x22B. To learn more, see [Mixed precision training with FP8 on P5 instances using Transformer Engine](model-parallel-core-features-v2-mixed-precision.md#model-parallel-core-features-v2-mixed-precision-fp8-training-on-p5).
+ Added support for [Context parallelism](model-parallel-core-features-v2-context-parallelism.md) for the following model configurations.
  + Llama-v2: 7B and 70B
  + Llama-v3: 8B and 70B
  + GPT-NeoX: 20B
+ Added support for saving checkpoints asynchronously. To learn more, see [Checkpointing using SMP](model-parallel-core-features-v2-checkpoints.md).
  + Support for saving checkpoints to S3 directly without using Amazon EBS or file servers.

**Bug fixes**
+ Resolved an issue that caused unexpectedly high initial loss during Llama fine-tuning when loading a pre-trained model checkpoint and utilizing tensor parallelism.

**Notes**
+ To use activation checkpointing for Mixtral with FP8 mixed precision, you will need to checkpoint the attention and expert layers separately. For an example of setting it up properly, see the [example training script](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel_v2/shared-scripts/train_utils.py) in the *Amazon SageMaker AI Examples repository*.

**Known issues**
+ The balanced load balancing type in the MoE configuration ([`torch.sagemaker.moe.moe_config.MoEConfig`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-moe)) is currently incompatible with activation checkpointing.
+ With context parallelism, GPT-NeoX shows performance regression in both pre-training and fine-tuning.
+ For GPT-NeoX on P4 instances, directly loading weights from a delayed parameter initialized transformed model into a Hugging Face transformer model leads to a loss mismatch on the first step.

### SMP Docker container
<a name="model-parallel-release-notes-20240828-smp-docker"></a>

The SMP library team distributes Docker containers that replace the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.224.0 or later.

**Currency updates**
+ Upgraded the FlashAttention library to v2.5.8
+ Upgraded the Transformer Engine library to v1.8
  + If you want to install Transformer Engine in a Conda environment, you need to build from the source and cherry-pick the specific upstream fixes ([744624d](https://github.com/NVIDIA/TransformerEngine/commit/744624d004f4514ffbaa90ac83e214311c86c607), [27c6342](https://github.com/NVIDIA/TransformerEngine/commit/27c6342ea8ad88034bf04b587dd13cb6088d2474), [7669bf3](https://github.com/NVIDIA/TransformerEngine/commit/7669bf3da68074517b134cd6acebd04b221fd545)).

**Container details**
+ SMP Docker container for PyTorch v2.3.1 with CUDA v12.1

  ```
  658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.3.1-gpu-py311-cu121
  ```

  For a complete list of supported regions, see [AWS Regions](distributed-data-parallel-support.md#distributed-data-parallel-availablity-zone).
+ Pre-installed packages
  + The SMP library v2.5.0
  + The SMDDP library v2.3.0
  + CUDNN v8.9.7.29
  + FlashAttention v2.5.8
  + TransformerEngine v1.8
  + Megatron v0.7.0
  + Hugging Face Transformers v4.40.1
  + Hugging Face Datasets library v2.19.0
  + EFA v1.32.0
  + NCCL v2.21.5

### SMP Conda channel
<a name="model-parallel-release-notes-20240828-smp-conda-channel"></a>

The following S3 bucket is the public Conda channel of the SMP library hosted by the SMP service team. If you want to install the SMP v2 library in an environment of highly customizable compute resources such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.
+ `https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/`

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.

## The SageMaker model parallelism library v2.4.0
<a name="model-parallel-release-notes-20240620"></a>

*Date: June 20, 2024*

### SMP library updates
<a name="model-parallel-release-notes-20240620-lib"></a>

**Bug fixes**
+ Fixed a bug that caused incorrect logit shapes when labels are not passed in the forward pass while using the SMP Transformer.

**Currency updates**
+ Added support for PyTorch v2.3.1.
+ Added support for Python v3.11.
+ Added support for the Hugging Face Transformers library v4.40.1.

**Deprecations**
+ Discontinued support for Python v3.10.
+ Discontinued support for the Hugging Face Transformers library versions before v4.40.1.

**Other changes**
+ Included a patch to toggle saving de-duplicated tensors on different ranks. To learn more, see the [discussion thread](https://github.com/pytorch/pytorch/pull/126569) in the PyTorch GitHub repository.

**Known issues**
+ There is a known issue that the loss might spike and then resume at a higher loss value while fine-tuning Llama-3 70B with tensor parallelism.

### SMP Docker container
<a name="model-parallel-release-notes-20240620-container"></a>

The SMP library team distributes Docker containers that replace the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.224.0 or later.

**Currency updates**
+ Upgraded the SMDDP library to v2.3.0.
+ Upgraded the NCCL library to v2.21.5.
+ Upgraded the EFA software to v1.32.0.

**Deprecations**
+ Discontinued the installation of the [Torch Distributed Experimental (torchdistX) library](https://pytorch.org/torchdistx/latest/index.html).

**Container details**
+ SMP Docker container for PyTorch v2.3.1 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.3.1-gpu-py311-cu121
  ```
+ Pre-installed packages
  + The SMP library v2.4.0
  + The SMDDP library v2.3.0
  + CUDNN v8.9.7.29
  + FlashAttention v2.3.3
  + TransformerEngine v1.2.1
  + Hugging Face Transformers v4.40.1
  + Hugging Face Datasets library v2.19.0
  + EFA v1.32.0
  + NCCL v2.21.5

### SMP Conda channel
<a name="model-parallel-release-notes-20240620-conda-channel"></a>

The following S3 bucket is the public Conda channel of the SMP library hosted by the SMP service team. If you want to install the SMP v2 library in an environment of highly customizable compute resources such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.
+ `https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/`

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.

## The SageMaker model parallelism library v2.3.1
<a name="model-parallel-release-notes-20240509"></a>

*Date: May 9, 2024*

**Bug fixes**
+ Fixed an `ImportError` issue when using `moe_load_balancing=balanced` in [`torch.sagemaker.moe.moe_config.MoEConfig`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-moe) for expert parallelism.
+ Fixed a fine-tuning issue where the [`torch.sagemaker.transform`](distributed-model-parallel-v2-reference.md#model-parallel-v2-torch-sagemaker-reference-transform) call raised `KeyError` when `load_state_dict_from_rank0` is enabled.
+ Fixed an out-of-memory (OOM) error raised when loading large Mixture of Experts (MoE) models, such as Mixtral 8x22B, for fine-tuning.

**SMP Docker container**

The SMP library team distributes Docker containers that replace the SageMaker PyTorch framework containers. This release incorporates the aforementioned bug fixes into the following SMP Docker image.
+ SMP Docker container for PyTorch v2.2.0 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121
  ```

## The SageMaker model parallelism library v2.3.0
<a name="model-parallel-release-notes-20240409"></a>

*Date: April 11, 2024*

**New features**
+ Added a new core feature, *expert parallelism*, to support Mixture of Experts transformer models. To learn more, see [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md).

**SMP Docker container**

The SMP library team distributes Docker containers that replace the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.214.4 or later.
+ SMP Docker container for PyTorch v2.2.0 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121
  ```
  + Pre-installed packages in this Docker container
    + The SMDDP library v2.2.0
    + CUDNN v8.9.5.29
    + FlashAttention v2.3.3
    + TransformerEngine v1.2.1
    + Hugging Face Transformers v4.37.1
    + Hugging Face Datasets library v2.16.1
    + Megatron-core 0.5.0
    + EFA v1.30.0
    + NCCL v2.19.4

## The SageMaker model parallelism library v2.2.0
<a name="model-parallel-release-notes-20240307"></a>

*Date: March 7, 2024*

**New Features**
+ Added support for [FP8 training](model-parallel-core-features-v2-mixed-precision.md#model-parallel-core-features-v2-mixed-precision-fp8-training-on-p5) of the following Hugging Face transformer models on P5 instances with Transformer Engine integration:
  + GPT-NeoX
  + Llama 2

**Bug Fixes**
+ Fixed a bug where tensors were not guaranteed to be contiguous before the `AllGather` collective call during tensor parallelism training.

**Currency Updates**
+ Added support for PyTorch v2.2.0.
+ Upgraded the SMDDP library to v2.2.0. 
+ Upgraded the FlashAttention library to v2.3.3.
+ Upgraded the NCCL library to v2.19.4.

**Deprecation**
+ Discontinued support for Transformer Engine versions before v1.2.0.

**Known issues**
+ The SMP [Activation offloading](model-parallel-core-features-v2-pytorch-activation-offloading.md) feature currently does not work. Use the native PyTorch activation offloading instead.

**Other changes**
+ Included a patch to fix the performance regression discussed in the issue thread at [https://github.com/pytorch/pytorch/issues/117748](https://github.com/pytorch/pytorch/issues/117748) in the PyTorch GitHub repository.

**SMP Docker container**

The SMP library team distributes Docker containers that replace the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.212.0 or later.
+ SMP Docker container for PyTorch v2.2.0 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.2.0-gpu-py310-cu121
  ```
  + Available for P4d, P4de, and P5 instances
  + Pre-installed packages in this Docker container
    + The SMDDP library v2.2.0
    + CUDNN v8.9.5.29
    + FlashAttention v2.3.3
    + TransformerEngine v1.2.1
    + Hugging Face Transformers v4.37.1
    + Hugging Face Datasets library v2.16.1
    + EFA v1.30.0
    + NCCL v2.19.4

## The SageMaker model parallelism library v2.1.0
<a name="model-parallel-release-notes-20240206"></a>

*Date: February 6, 2024*

**Currency Updates**
+ Added support for PyTorch v2.1.2.

**Deprecation**
+ Discontinued support for Hugging Face Transformers v4.31.0.

**Known issues**
+ An issue was discovered where fine-tuning the Hugging Face Llama 2 model with `attn_implementation=flash_attention_2` and FSDP causes the model to diverge. For reference, see the [issue ticket](https://github.com/huggingface/transformers/issues/28826) in the *Hugging Face Transformers GitHub repository*. To avoid the divergence issue, use `attn_implementation=sdpa`. Alternatively, use the SMP transformer model implementation by setting `use_smp_implementation=True`.
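The attention workaround can be sketched as follows. This is a hedged example, with the import deferred because the `transformers` library is assumed to be available only in the training container:

```python
def load_llama2_for_finetuning(model_id: str):
    """Sketch of the divergence workaround: request SDPA attention instead of
    flash_attention_2 when fine-tuning Llama 2 with FSDP."""
    from transformers import AutoModelForCausalLM  # deferred import
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        attn_implementation="sdpa",  # avoids the flash_attention_2 divergence issue
    )
```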

**SMP Docker container**

The SMP library team distributes Docker containers that replace the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.207.0 or later.
+ SMP Docker container for PyTorch v2.1.2 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.1.2-gpu-py310-cu121
  ```
  + Available for P4d, P4de, and P5 instances
  + Pre-installed packages in this Docker container
    + The SMDDP library v2.1.0
    + CUDNN v8.9.5.29
    + FlashAttention v2.3.3
    + TransformerEngine v1.2.1
    + Hugging Face Transformers v4.37.1
    + Hugging Face Datasets library v2.16.1
    + EFA v1.30.0

**SMP Conda channel**

The following S3 bucket is a public Conda channel hosted by the SMP service team. If you want to install the SMP v2 library in an environment of highly customizable compute resources such as SageMaker HyperPod clusters, use this Conda channel to properly install the SMP library.
+ `https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/smp-v2/`

For more information about Conda channels in general, see [Channels](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html) in the *Conda documentation*.

## The SageMaker model parallelism library v2.0.0
<a name="model-parallel-release-notes-20231219"></a>

*Date: December 19, 2023*

**New features**

Released the SageMaker model parallelism (SMP) library v2.0.0 with the following new offerings.
+ A new `torch.sagemaker` package, entirely revamped from the previous `smdistributed.modelparallel.torch` package in SMP v1.x. 
+ Support for PyTorch 2.0.1.
+ Support for PyTorch FSDP.
+ Tensor parallelism implementation by integrating with the [Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine/index.html) library.
+ Support for both [SageMaker Training](train-model.md) and [SageMaker HyperPod](sagemaker-hyperpod.md).

**Breaking changes**
+ SMP v2 revamped the APIs entirely and provides the `torch.sagemaker` package. In most cases, you only need to call `torch.sagemaker.init()` and pass the model parallel configuration parameters. With this new package, you can significantly simplify code modifications in your training script. To learn more about adapting your training script to use SMP v2, see [Use the SageMaker model parallelism library v2](model-parallel-use-api-v2.md).
+ If you've used SMP v1 for training Hugging Face Transformer models and want to reuse the models in SMP v2, see [Upgrade from SMP v1 to SMP v2](distributed-model-parallel-v2-reference.md#model-parallel-v2-upgrade-from-v1).
+ For PyTorch FSDP training, you should use SMP v2.

**Known issues**
+ Activation checkpointing currently only works with the following wrapping policies with FSDP.
  + `auto_wrap_policy = functools.partial(transformer_auto_wrap_policy, ...)`
+ To use [Activation offloading](model-parallel-core-features-v2-pytorch-activation-offloading.md), FSDP activation checkpointing type must be [REENTRANT](https://pytorch.org/docs/stable/checkpoint.html).
+ When running with tensor parallel enabled with the sharded data parallel degree set to `1`, you must use `backend = nccl`. The `smddp` backend option is not supported in this scenario.
+ [Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine/index.html) is required to use PyTorch with the SMP library even when not using tensor parallelism.

**Other changes**
+ Starting from this release, the documentation for the SageMaker model parallelism library is fully available in this *Amazon SageMaker AI Developer Guide*. In favor of this complete developer guide for SMP v2, the [additional reference for SMP v1.x](https://sagemaker.readthedocs.io/en/stable/api/training/distributed.html#the-sagemaker-distributed-model-parallel-library) in the *SageMaker Python SDK documentation* is deprecated. If you still need the documentation for SMP v1.x, the developer guide for SMP v1.x is available at [(Archived) SageMaker model parallelism library v1.x](model-parallel.md), and the SMP Python library v1.x reference is available in the [SageMaker Python SDK v2.199.0 documentation](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.html).

**Deprecations**
+ Discontinued support for TensorFlow.
+ There is no pipeline parallelism support in SMP v2.
+ The DeepSpeed library is not supported, in favor of native PyTorch FSDP.

**SMP Docker container**

The SMP library team distributes Docker containers that replace the SageMaker PyTorch framework containers. If you use the PyTorch estimator class in the SageMaker Python SDK and specify a distribution configuration to use SMP v2, SageMaker AI automatically picks up the SMP Docker containers. To use this release of SMP v2, upgrade your SageMaker Python SDK to v2.207.0 or later.
+ SMP Docker container for PyTorch v2.0.1 with CUDA v12.1

  ```
  658645717510.dkr.ecr.us-west-2.amazonaws.com/smdistributed-modelparallel:2.0.1-gpu-py310-cu121
  ```

# (Archived) SageMaker model parallelism library v1.x
<a name="model-parallel"></a>

**Important**  
As of December 19, 2023, the SageMaker model parallelism (SMP) library v2 is released. In favor of the SMP library v2, the SMP v1 capabilities will no longer be supported in future releases. The following section and topics are archived and specific to using the SMP library v1. For information about using the SMP library v2, see [SageMaker model parallelism library v2](model-parallel-v2.md).

Use Amazon SageMaker AI's model parallel library to train large deep learning (DL) models that are difficult to train due to GPU memory limitations. The library automatically and efficiently splits a model across multiple GPUs and instances. Using the library, you can achieve a target prediction accuracy faster by efficiently training larger DL models with billions or trillions of parameters.

You can use the library to automatically partition your own TensorFlow and PyTorch models across multiple GPUs and multiple nodes with minimal code changes. You can access the library's API through the SageMaker Python SDK.

Use the following sections to learn more about model parallelism and the SageMaker model parallel library. This library's API documentation is located at [Distributed Training APIs](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html) in the *SageMaker Python SDK v2.199.0 documentation*. 

**Topics**
+ [Introduction to Model Parallelism](model-parallel-intro.md)
+ [Supported Frameworks and AWS Regions](distributed-model-parallel-support.md)
+ [Core Features of the SageMaker Model Parallelism Library](model-parallel-core-features.md)
+ [Run a SageMaker Distributed Training Job with Model Parallelism](model-parallel-use-api.md)
+ [Checkpointing and Fine-Tuning a Model with Model Parallelism](distributed-model-parallel-checkpointing-and-finetuning.md)
+ [Amazon SageMaker AI model parallelism library v1 examples](distributed-model-parallel-examples.md)
+ [SageMaker Distributed Model Parallelism Best Practices](model-parallel-best-practices.md)
+ [The SageMaker Distributed Model Parallelism Library Configuration Tips and Pitfalls](model-parallel-customize-tips-pitfalls.md)
+ [Model Parallel Troubleshooting](distributed-troubleshooting-model-parallel.md)

# Introduction to Model Parallelism
<a name="model-parallel-intro"></a>

Model parallelism is a distributed training method in which the deep learning model is partitioned across multiple devices, within or across instances. This introduction page provides a high-level overview about model parallelism, a description of how it can help overcome issues that arise when training DL models that are typically very large in size, and examples of what the SageMaker model parallel library offers to help manage model parallel strategies as well as memory consumption.

## What is Model Parallelism?
<a name="model-parallel-what-is"></a>

Increasing the size of deep learning models (layers and parameters) yields better accuracy for complex tasks such as computer vision and natural language processing. However, there is a limit to the maximum model size you can fit in the memory of a single GPU. When training DL models, GPU memory limitations can be bottlenecks in the following ways:
+ They limit the size of the model you can train, since the memory footprint of a model scales proportionally to the number of parameters.
+ They limit the per-GPU batch size during training, driving down GPU utilization and training efficiency.

To overcome the limitations associated with training a model on a single GPU, SageMaker provides the model parallel library to help distribute and train DL models efficiently on multiple compute nodes. Furthermore, with the library, you can achieve highly optimized distributed training using EFA-supported devices, which enhance the performance of inter-node communication with low latency, high throughput, and OS bypass.

## Estimate Memory Requirements Before Using Model Parallelism
<a name="model-parallel-intro-estimate-memory-requirements"></a>

Before you use the SageMaker model parallel library, consider the following to get a sense of the memory requirements of training large DL models.

For a training job that uses AMP (FP16) and the Adam optimizer, the required GPU memory per parameter is about 20 bytes, which we can break down as follows:
+ An FP16 parameter: 2 bytes
+ An FP16 gradient: 2 bytes
+ An FP32 optimizer state: 8 bytes based on the Adam optimizer
+ An FP32 copy of the parameter: 4 bytes (needed for the `optimizer apply` (OA) operation)
+ An FP32 copy of the gradient: 4 bytes (needed for the OA operation)

Even a relatively small DL model with 10 billion parameters can require at least 200 GB of memory, which is much larger than the typical GPU memory available on a single GPU (for example, NVIDIA A100 with 40 GB or 80 GB of memory and V100 with 16 GB or 32 GB). Note that on top of the memory requirements for model and optimizer states, there are other memory consumers, such as the activations generated in the forward pass. The total memory required can be far greater than 200 GB.
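The per-parameter breakdown above translates into a quick back-of-the-envelope calculation. The following sketch (the helper name `estimate_state_memory_gb` is ours, not part of any SageMaker API) sums the five per-parameter costs:

```python
def estimate_state_memory_gb(num_params: int) -> float:
    """Estimate GPU memory (in GB) for model and optimizer states when
    training with AMP (FP16) and the Adam optimizer.

    Per-parameter costs: FP16 parameter (2) + FP16 gradient (2)
    + FP32 Adam optimizer state (8) + FP32 parameter copy (4)
    + FP32 gradient copy (4) = 20 bytes. Activations are NOT included.
    """
    bytes_per_param = 2 + 2 + 8 + 4 + 4  # 20 bytes per parameter
    return num_params * bytes_per_param / 1e9

# A 10-billion-parameter model needs roughly 200 GB for states alone,
# far more than a single A100 (40/80 GB) or V100 (16/32 GB) provides.
print(estimate_state_memory_gb(10_000_000_000))  # 200.0
```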

For distributed training, we recommend that you use Amazon EC2 P3 and P4 instances that have NVIDIA V100 and A100 Tensor Core GPUs respectively. For more details about specifications such as CPU cores, RAM, attached storage volume, and network bandwidth, see the *Accelerated Computing* section in the [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/) page.

Even with these accelerated computing instances, models with about 10 billion parameters, such as Megatron-LM and T5, and even larger models with hundreds of billions of parameters, such as GPT-3, cannot fit a model replica in each GPU device.

## How the Library Employs Model Parallelism and Memory Saving Techniques
<a name="model-parallel-intro-features"></a>

The library consists of various types of model parallelism features and memory-saving features such as optimizer state sharding, activation checkpointing, and activation offloading. All these techniques can be combined to efficiently train large models that consist of hundreds of billions of parameters.

**Topics**
+ [Sharded data parallelism (available for PyTorch)](#model-parallel-intro-sdp)
+ [Pipeline parallelism (available for PyTorch and TensorFlow)](#model-parallel-intro-pp)
+ [Tensor parallelism (available for PyTorch)](#model-parallel-intro-tp)
+ [Optimizer state sharding (available for PyTorch)](#model-parallel-intro-oss)
+ [Activation offloading and checkpointing (available for PyTorch)](#model-parallel-intro-activation-offloading-checkpointing)
+ [Choosing the right techniques for your model](#model-parallel-intro-choosing-techniques)

### Sharded data parallelism (available for PyTorch)
<a name="model-parallel-intro-sdp"></a>

*Sharded data parallelism* is a memory-saving distributed training technique that splits the state of a model (model parameters, gradients, and optimizer states) across GPUs within a data-parallel group.

SageMaker AI implements sharded data parallelism through MiCS, a library that **mi**nimizes **c**ommunication **s**cale, which is discussed in the blog post [Near-linear scaling of gigantic-model training on AWS](https://www.amazon.science/blog/near-linear-scaling-of-gigantic-model-training-on-aws).

You can apply sharded data parallelism to your model as a stand-alone strategy. Furthermore, if you are using the most performant GPU instances equipped with NVIDIA A100 Tensor Core GPUs, `ml.p4d.24xlarge`, you can take advantage of the improved training speed from the `AllGather` operation offered by SMDDP Collectives.

To dive deep into sharded data parallelism and learn how to set it up or use a combination of sharded data parallelism with other techniques like tensor parallelism and FP16 training, see [Sharded Data Parallelism](model-parallel-extended-features-pytorch-sharded-data-parallelism.md).

### Pipeline parallelism (available for PyTorch and TensorFlow)
<a name="model-parallel-intro-pp"></a>

*Pipeline parallelism* partitions the set of layers or operations across the set of devices, leaving each operation intact. When you specify a value for the number of model partitions (`pipeline_parallel_degree`), the total number of GPUs (`processes_per_host`) must be divisible by the number of the model partitions. To set this up properly, you have to specify the correct values for the `pipeline_parallel_degree` and `processes_per_host` parameters. The simple math is as follows:

```
(pipeline_parallel_degree) x (data_parallel_degree) = processes_per_host
```

The library takes care of calculating the number of model replicas (also called `data_parallel_degree`) given the two input parameters you provide. 
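A minimal sketch of this calculation (the function name is hypothetical, not part of the library):

```python
def data_parallel_degree(processes_per_host: int,
                         pipeline_parallel_degree: int) -> int:
    """Compute the number of model replicas (the data parallelism
    degree) that the library derives from the two user-supplied
    parameters, enforcing the divisibility requirement."""
    if processes_per_host % pipeline_parallel_degree != 0:
        raise ValueError(
            "processes_per_host must be divisible by pipeline_parallel_degree"
        )
    return processes_per_host // pipeline_parallel_degree

# Eight GPU workers with two-way pipeline parallelism
# yield four-way data parallelism.
print(data_parallel_degree(8, 2))  # 4
```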

For example, if you set `"pipeline_parallel_degree": 2` and `"processes_per_host": 8` to use an ML instance with eight GPU workers, such as `ml.p3.16xlarge`, the library automatically sets up two-way pipeline parallelism and four-way data parallelism across the GPUs. The following image illustrates how a model is distributed across the eight GPUs, achieving four-way data parallelism and two-way pipeline parallelism. Each model replica, which we define as a *pipeline parallel group* and label as `PP_GROUP`, is partitioned across two GPUs. Each partition of the model is assigned to four GPUs, and the four partition replicas are in a *data parallel group*, labeled as `DP_GROUP`. Without tensor parallelism, the pipeline parallel group is essentially the model parallel group.

![\[How a model is distributed across the eight GPUs achieving four-way data parallelism and two-way pipeline parallelism.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/smdmp-pipeline-parallel-only.png)


To dive deep into pipeline parallelism, see [Core Features of the SageMaker Model Parallelism Library](model-parallel-core-features.md). 

To get started with running your model using pipeline parallelism, see [Run a SageMaker Distributed Training Job with the SageMaker Model Parallel Library](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html).

### Tensor parallelism (available for PyTorch)
<a name="model-parallel-intro-tp"></a>

*Tensor parallelism* splits individual layers, or `nn.Modules`, across devices, to be run in parallel. The following figure shows the simplest example of how the library splits a model with four layers to achieve two-way tensor parallelism (`"tensor_parallel_degree": 2`). The layers of each model replica are bisected and distributed across two GPUs. In this example case, the model parallel configuration also includes `"pipeline_parallel_degree": 1` and `"ddp": True` (which uses the PyTorch DistributedDataParallel package in the background), so the degree of data parallelism becomes eight. The library manages communication across the tensor-distributed model replicas.

![\[The simplest example of how the library splits a model with four layers to achieve two-way tensor parallelism ("tensor_parallel_degree": 2).\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/smdmp-tensor-parallel-only.png)


This feature is useful because you can select specific layers, or a subset of layers, to apply tensor parallelism to. To dive deep into tensor parallelism and other memory-saving features for PyTorch, and to learn how to set a combination of pipeline and tensor parallelism, see [Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism.md).
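To build intuition for why splitting a layer across devices preserves the result, the following toy sketch (plain Python, not the library's implementation) splits a linear layer's weight matrix column-wise across two hypothetical devices and verifies that concatenating the partial outputs reproduces the unsplit output:

```python
def matmul(A, B):
    """Minimal dense matrix multiply for illustration."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

# A toy linear layer y = x @ W, with W split column-wise across two
# "devices" (two-way tensor parallelism). Each device computes a slice
# of the output; concatenating the slices gives the full result.
x = [[1.0, 2.0]]
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
W0 = [row[:2] for row in W]   # columns 0-1 held by device 0
W1 = [row[2:] for row in W]   # columns 2-3 held by device 1

y_full = matmul(x, W)
y_sharded = [matmul(x, W0)[0] + matmul(x, W1)[0]]  # concatenate slices
assert y_full == y_sharded
```

In the real library, this column-wise (and row-wise) splitting is applied to supported `nn.Modules`, and the cross-device concatenation is handled by collective communication.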

### Optimizer state sharding (available for PyTorch)
<a name="model-parallel-intro-oss"></a>

To understand how the library performs *optimizer state sharding*, consider a simple example model with four layers. The key idea of optimizer state sharding is that you don't need to replicate your optimizer state in all of your GPUs. Instead, a single replica of the optimizer state is sharded across data-parallel ranks, with no redundancy across devices. For example, GPU 0 holds the optimizer state for layer one, GPU 1 holds the optimizer state for layer two, and so on. The following animated figure shows a backward propagation with the optimizer state sharding technique. At the end of the backward propagation, there's compute and network time for the `optimizer apply` (OA) operation to update the optimizer states and the `all-gather` (AG) operation to update the model parameters for the next iteration. Most importantly, the `reduce` operation can overlap with the compute on GPU 0, resulting in a more memory-efficient and faster backward propagation. In the current implementation, the AG and OA operations do not overlap with compute. This can result in extended computation during the AG operation, so there might be a tradeoff.

![\[A backward propagation with the optimizer state sharding technique.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/smdmp-optimizer-state-sharding.gif)


For more information about how to use this feature, see [Optimizer State Sharding](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-optimizer-state-sharding.html).
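The layer-to-rank mapping described above can be pictured with a toy sketch (illustrative only; this is not the library's API):

```python
def shard_optimizer_state(num_layers: int, num_ranks: int) -> dict:
    """Assign each layer's optimizer state to exactly one data-parallel
    rank, round-robin, so no state is replicated across devices."""
    return {layer: layer % num_ranks for layer in range(num_layers)}

# Four layers sharded over four GPUs: GPU 0 holds layer 0's state,
# GPU 1 holds layer 1's state, and so on.
print(shard_optimizer_state(4, 4))  # {0: 0, 1: 1, 2: 2, 3: 3}
```

Compared with replicating the full optimizer state on every GPU, each rank now stores only its own shard, cutting the optimizer-state memory per GPU by the number of ranks.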

### Activation offloading and checkpointing (available for PyTorch)
<a name="model-parallel-intro-activation-offloading-checkpointing"></a>

To save GPU memory, the library supports activation checkpointing to avoid storing internal activations in the GPU memory for user-specified modules during the forward pass. The library recomputes these activations during the backward pass. In addition, the activation offloading feature offloads the stored activations to CPU memory and fetches back to GPU during the backward pass to further reduce activation memory footprint. For more information about how to use these features, see [Activation Checkpointing](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-checkpointing.html) and [Activation Offloading](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-activation-offloading.html).
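The trade of recomputation for memory can be illustrated with a minimal toy sketch (plain Python, not the library's implementation): the forward pass keeps only the layer input and recomputes the activation when the backward pass needs it.

```python
class CheckpointedLayer:
    """Toy illustration of activation checkpointing: the forward pass
    stores only its input; the activation is recomputed on demand
    during the backward pass instead of being kept in GPU memory."""

    def __init__(self, fn):
        self.fn = fn
        self.saved_input = None

    def forward(self, x):
        self.saved_input = x   # store the (small) input...
        return self.fn(x)      # ...but do not retain the activation

    def recompute_for_backward(self):
        # Re-run the forward function to rebuild the activation.
        return self.fn(self.saved_input)

layer = CheckpointedLayer(lambda x: x * 2)
out = layer.forward(3)
assert layer.recompute_for_backward() == out  # recomputed, not stored
```

Activation offloading goes one step further: instead of recomputing, the stored activations are moved to CPU memory and fetched back to the GPU when the backward pass needs them.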

### Choosing the right techniques for your model
<a name="model-parallel-intro-choosing-techniques"></a>

For more information about choosing the right techniques and configurations, see [SageMaker Distributed Model Parallel Best Practices](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-best-practices.html) and [Configuration Tips and Pitfalls](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-tips-pitfalls.html).

# Supported Frameworks and AWS Regions
<a name="distributed-model-parallel-support"></a>

Before using the SageMaker model parallelism library, check the supported frameworks and instance types, and determine if there are enough quotas in your AWS account and AWS Region.

**Note**  
To check the latest updates and release notes of the library, see the [SageMaker Model Parallel Release Notes](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.html) in the *SageMaker Python SDK documentation*.

## Supported Frameworks
<a name="distributed-model-parallel-supported-frameworks"></a>

The SageMaker model parallelism library supports the following deep learning frameworks and is available in AWS Deep Learning Containers (DLC) or downloadable as a binary file.

PyTorch versions supported by SageMaker AI and the SageMaker model parallelism library


| PyTorch version | SageMaker model parallelism library version | `smdistributed-modelparallel` integrated DLC image URI | URL of the binary file\*\* | 
| --- | --- | --- | --- | 
| v2.0.0 | smdistributed-modelparallel==v1.15.0 |  `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker`  | https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-2.0.0/build-artifacts/2023-04-14-20-14/smdistributed_modelparallel-1.15.0-cp310-cp310-linux_x86_64.whl | 
| v1.13.1 | smdistributed-modelparallel==v1.15.0 |  `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker`  | https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-04-17-15-49/smdistributed_modelparallel-1.15.0-cp39-cp39-linux_x86_64.whl | 
| v1.12.1 | smdistributed-modelparallel==v1.13.0 |  `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker`  | https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.1/build-artifacts/2022-12-08-21-34/smdistributed_modelparallel-1.13.0-cp38-cp38-linux_x86_64.whl | 
| v1.12.0 | smdistributed-modelparallel==v1.11.0 |  `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker`   | https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.0/build-artifacts/2022-08-12-16-58/smdistributed_modelparallel-1.11.0-cp38-cp38-linux_x86_64.whl | 
| v1.11.0 | smdistributed-modelparallel==v1.10.0 |  `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker`  | https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.11.0/build-artifacts/2022-07-11-19-23/smdistributed_modelparallel-1.10.0-cp38-cp38-linux_x86_64.whl | 
| v1.10.2 |  smdistributed-modelparallel==v1.7.0 |  `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker`  | - | 
| v1.10.0 |  smdistributed-modelparallel==v1.5.0 |  `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.10.0-gpu-py38-cu113-ubuntu20.04-sagemaker`  | - | 
| v1.9.1 |  smdistributed-modelparallel==v1.4.0 |  `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.9.1-gpu-py38-cu111-ubuntu20.04`  | - | 
| v1.8.1\* |  smdistributed-modelparallel==v1.6.0 |  `763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.8.1-gpu-py36-cu111-ubuntu18.04`  | - | 

**Note**  
The SageMaker model parallelism library v1.6.0 and later provides extended features for PyTorch. For more information, see [Core Features of the SageMaker Model Parallelism Library](model-parallel-core-features.md).

\*\* The URLs of the binary files are for installing the SageMaker model parallelism library in custom containers. For more information, see [Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library](model-parallel-sm-sdk.md#model-parallel-bring-your-own-container).

TensorFlow versions supported by SageMaker AI and the SageMaker model parallelism library


| TensorFlow version | SageMaker model parallelism library version | `smdistributed-modelparallel` integrated DLC image URI | 
| --- | --- | --- | 
| v2.6.0 | smdistributed-modelparallel==v1.4.0 | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.6.0-gpu-py38-cu112-ubuntu20.04 | 
| v2.5.1 | smdistributed-modelparallel==v1.4.0  | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.5.1-gpu-py37-cu112-ubuntu18.04  | 

**Hugging Face Transformers versions supported by SageMaker AI and the SageMaker model parallelism library**

The AWS Deep Learning Containers for Hugging Face use the SageMaker Training Containers for PyTorch and TensorFlow as their base images. To look up the Hugging Face Transformers library versions and paired PyTorch and TensorFlow versions, see the latest [Hugging Face Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-training-containers) and the [Prior Hugging Face Container Versions](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#prior-hugging-face-container-versions).

## AWS Regions
<a name="distributed-model-parallel-availablity-zone"></a>

The SageMaker model parallelism library is available in all of the AWS Regions where the [AWS Deep Learning Containers for SageMaker](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) are in service. For more information, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#available-deep-learning-containers-images).

## Supported Instance Types
<a name="distributed-model-parallel-supported-instance-types"></a>

The SageMaker model parallelism library requires one of the following ML instance types.


| Instance type | 
| --- | 
| ml.g4dn.12xlarge | 
| ml.p3.16xlarge | 
| ml.p3dn.24xlarge  | 
| ml.p4d.24xlarge | 
| ml.p4de.24xlarge | 

For specs of the instance types, see the **Accelerated Computing** section in the [Amazon EC2 Instance Types page](https://aws.amazon.com/ec2/instance-types/). For information about instance pricing, see [Amazon SageMaker AI Pricing](https://aws.amazon.com/sagemaker/pricing/).

If you encounter an error message similar to the following, follow the instructions at [Request a service quota increase for SageMaker AI resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure).

```
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling
    the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge
    for training job usage' is 0 Instances, with current utilization of 0 Instances
    and a request delta of 1 Instances.
    Please contact AWS support to request an increase for this limit.
```

# Core Features of the SageMaker Model Parallelism Library
<a name="model-parallel-core-features"></a>

Amazon SageMaker AI's model parallelism library offers distribution strategies and memory-saving techniques, such as sharded data parallelism, tensor parallelism, model partitioning by layers for pipeline scheduling, and checkpointing. The model parallelism strategies and techniques help distribute large models across multiple devices while optimizing training speed and memory consumption. The library also provides Python helper functions, context managers, and wrapper functions to adapt your training script for automated or manual partitioning of your model.

When you implement model parallelism in your training job, you keep the same two-step workflow shown in the [Run a SageMaker Distributed Training Job with Model Parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html) section. To adapt your training script, you add zero or a few additional lines of code. To launch a training job with the adapted training script, you set the distribution configuration parameters to activate the memory-saving features or to pass values for the degrees of parallelism.

To get started with examples, see the following Jupyter notebooks that demonstrate how to use the SageMaker model parallelism library.
+ [PyTorch example notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/main/training/distributed_training/pytorch/model_parallel)
+ [TensorFlow example notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/main/training/distributed_training/tensorflow/model_parallel/mnist)

To dive deep into the core features of the library, see the following topics.

**Note**  
The SageMaker distributed training libraries are available through the AWS deep learning containers for PyTorch, Hugging Face, and TensorFlow within the SageMaker Training platform. To use the features of the distributed training libraries, we recommend that you use the SageMaker Python SDK. You can also configure them manually in JSON request syntax if you use the SageMaker APIs through the SDK for Python (Boto3) or the AWS Command Line Interface. Throughout the documentation, instructions and examples focus on how to use the distributed training libraries with the SageMaker Python SDK.

**Important**  
The SageMaker model parallelism library supports all the core features for PyTorch, and supports pipeline parallelism for TensorFlow.

**Topics**
+ [Sharded Data Parallelism](model-parallel-extended-features-pytorch-sharded-data-parallelism.md)
+ [Pipelining a Model](model-parallel-core-features-pipieline-parallelism.md)
+ [Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism.md)
+ [Optimizer State Sharding](model-parallel-extended-features-pytorch-optimizer-state-sharding.md)
+ [Activation Checkpointing](model-parallel-extended-features-pytorch-activation-checkpointing.md)
+ [Activation Offloading](model-parallel-extended-features-pytorch-activation-offloading.md)
+ [FP16 Training with Model Parallelism](model-parallel-extended-features-pytorch-fp16.md)
+ [Support for FlashAttention](model-parallel-attention-head-size-for-flash-attention.md)

# Sharded Data Parallelism
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism"></a>

*Sharded data parallelism* is a memory-saving distributed training technique that splits the state of a model (model parameters, gradients, and optimizer states) across GPUs in a data parallel group. 

**Note**  
Sharded data parallelism is available for PyTorch in the SageMaker model parallelism library v1.11.0 and later.

When scaling up your training job to a large GPU cluster, you can reduce the per-GPU memory footprint of the model by sharding the training state of the model over multiple GPUs. This provides two benefits: you can fit larger models, which would otherwise run out of memory with standard data parallelism, or you can increase the batch size using the freed-up GPU memory.

The standard data parallelism technique replicates the training states across the GPUs in the data parallel group, and performs gradient aggregation based on the `AllReduce` operation. Sharded data parallelism modifies the standard data-parallel distributed training procedure to account for the sharded nature of the optimizer states. A group of ranks over which the model and optimizer states are sharded is called a *sharding group*. The sharded data parallelism technique shards the trainable parameters of a model and corresponding gradients and optimizer states across the GPUs in the *sharding group*.

SageMaker AI achieves sharded data parallelism through the implementation of MiCS, which is discussed in the AWS blog post [Near-linear scaling of gigantic-model training on AWS](https://www.amazon.science/blog/near-linear-scaling-of-gigantic-model-training-on-aws). In this implementation, you can set the sharding degree as a configurable parameter, which must be less than the data parallelism degree. During each forward and backward pass, MiCS temporarily recombines the model parameters in all GPUs through the `AllGather` operation. After the forward or backward pass of each layer, MiCS shards the parameters again to save GPU memory. During the backward pass, MiCS reduces gradients and simultaneously shards them across GPUs through the `ReduceScatter` operation. Finally, MiCS applies the local reduced and sharded gradients to their corresponding local parameter shards, using the local shards of optimizer states. To bring down communication overhead, the SageMaker model parallelism library prefetches the upcoming layers in the forward or backward pass, and overlaps the network communication with the computation.

The training state of the model is replicated across the sharding groups. This means that before gradients are applied to the parameters, the `AllReduce` operation must take place across the sharding groups, in addition to the `ReduceScatter` operation that takes place within the sharding group.

In effect, sharded data parallelism introduces a tradeoff between communication overhead and GPU memory efficiency. Using sharded data parallelism increases the communication cost, but the memory footprint per GPU (excluding the memory usage due to activations) is divided by the sharded data parallelism degree, so larger models can fit in the GPU cluster.
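This tradeoff can be quantified with the 20-bytes-per-parameter estimate from earlier in this guide. The following sketch (hypothetical helper name; activation memory excluded) divides the state memory by the sharding degree:

```python
def per_gpu_state_memory_gb(num_params: int, shard_degree: int,
                            bytes_per_param: int = 20) -> float:
    """Per-GPU memory (in GB) for model and optimizer states when the
    training state is sharded over `shard_degree` GPUs. Uses the
    20-bytes-per-parameter estimate for AMP (FP16) with Adam;
    activation memory is excluded."""
    return num_params * bytes_per_param / 1e9 / shard_degree

# 10 billion parameters: 200 GB of state unsharded,
# 25 GB per GPU with 8-way sharded data parallelism.
print(per_gpu_state_memory_gb(10_000_000_000, 8))  # 25.0
```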

**Selecting the degree of sharded data parallelism**

When you select a value for the degree of sharded data parallelism, the value must evenly divide the degree of data parallelism. For example, for an 8-way data parallelism job, choose 2, 4, or 8 for the sharded data parallelism degree. When choosing the sharded data parallelism degree, we recommend that you start with a small number and gradually increase it until the model fits in memory together with the desired batch size.
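A quick sketch of the divisibility rule (the helper is illustrative only, not part of the library):

```python
def valid_sharding_degrees(data_parallel_degree: int) -> list:
    """List the sharded data parallelism degrees that evenly divide
    the data parallelism degree (excluding 1, which means no
    sharding)."""
    return [d for d in range(2, data_parallel_degree + 1)
            if data_parallel_degree % d == 0]

# For an 8-way data parallelism job, the valid choices are 2, 4, or 8.
print(valid_sharding_degrees(8))  # [2, 4, 8]
```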

**Selecting the batch size**

After setting up sharded data parallelism, find the optimal training configuration that runs successfully on your GPU cluster. For training large language models (LLMs), start from a batch size of 1, and gradually increase it until you hit an out-of-memory (OOM) error. If you encounter an OOM error even with the smallest batch size, apply a higher degree of sharded data parallelism or a combination of sharded data parallelism and tensor parallelism.
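This search amounts to growing the batch size until the first OOM failure. A hedged sketch of the procedure, where `try_training` is a placeholder for launching a short training run (it is not a SageMaker API):

```python
# Illustrative batch-size search; `try_training` stands in for running
# a short training job and raising MemoryError on OOM.
def find_max_batch_size(try_training, upper_limit=1024):
    best = None
    batch_size = 1
    while batch_size <= upper_limit:
        try:
            try_training(batch_size)
            best = batch_size
            batch_size *= 2   # double and retry
        except MemoryError:
            break
    return best  # None means even batch size 1 ran out of memory
```

If the result is `None`, even a batch size of 1 hit OOM, which is the cue to raise the sharded data parallelism degree or add tensor parallelism.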

**Topics**
+ [How to apply sharded data parallelism to your training job](#model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use)
+ [Reference configurations](#model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use-config-sample)
+ [Sharded data parallelism with SMDDP Collectives](#model-parallel-extended-features-pytorch-sharded-data-parallelism-smddp-collectives)
+ [Mixed precision training with sharded data parallelism](#model-parallel-extended-features-pytorch-sharded-data-parallelism-16bits-training)
+ [Sharded data parallelism with tensor parallelism](#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism)
+ [Tips and considerations for using sharded data parallelism](#model-parallel-extended-features-pytorch-sharded-data-parallelism-considerations)

## How to apply sharded data parallelism to your training job
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use"></a>

To get started with sharded data parallelism, apply the required modifications to your training script, and set up the SageMaker PyTorch estimator with the sharded-data-parallelism-specific parameters. Consider using the reference values and example notebooks as a starting point.

### Adapt your PyTorch training script
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use-modify-script"></a>

Follow the instructions at [Step 1: Modify a PyTorch Training Script](model-parallel-customize-training-script-pt.md) to wrap the model and optimizer objects with the `smdistributed.modelparallel.torch` wrappers of the `torch.nn.parallel` and `torch.distributed` modules.

**(Optional) Additional modification to register external model parameters**

If your model is built with `torch.nn.Module` and uses parameters that are not defined within the module class, you should register them to the module manually so that SMP can gather the full parameters while sharded data parallelism is active. To register parameters to a module, use `smp.register_parameter(module, parameter)`.

```
import torch
import smdistributed.modelparallel.torch as smp

class Module(torch.nn.Module):
    def __init__(self, *args):
        super().__init__(*args)
        self.layer1 = Layer1()
        self.layer2 = Layer2()
        smp.register_parameter(self, self.layer1.weight)

    def forward(self, input):
        x = self.layer1(input)
        # self.layer1.weight is required by self.layer2.forward
        y = self.layer2(x, self.layer1.weight)
        return y
```

### Set up the SageMaker PyTorch estimator
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use-set-estimator"></a>

When configuring a SageMaker PyTorch estimator in [Step 2: Launch a Training Job Using the SageMaker Python SDK](model-parallel-sm-sdk.md), add the parameters for sharded data parallelism. 

To turn on sharded data parallelism, add the `sharded_data_parallel_degree` parameter to the SageMaker PyTorch estimator. This parameter specifies the number of GPUs over which the training state is sharded. The value for `sharded_data_parallel_degree` must be an integer between one and the data parallelism degree, and must evenly divide the data parallelism degree. Note that the library automatically detects the number of GPUs, and thus the data parallelism degree. The following additional parameters are available for configuring sharded data parallelism.
+ `"sdp_reduce_bucket_size"` *(int, default: 5e8)* – Specifies the size of [PyTorch DDP gradient buckets](https://pytorch.org/docs/stable/notes/ddp.html#internal-design) in number of elements of the default dtype.
+ `"sdp_param_persistence_threshold"` *(int, default: 1e6)* – Specifies the size of a parameter tensor in number of elements that can persist at each GPU. Sharded data parallelism splits each parameter tensor across GPUs of a data parallel group. If the number of elements in the parameter tensor is smaller than this threshold, the parameter tensor is not split; this helps reduce communication overhead because the parameter tensor is replicated across data-parallel GPUs.
+ `"sdp_max_live_parameters"` *(int, default: 1e9)* – Specifies the maximum number of parameters that can simultaneously be in a recombined training state during the forward and backward pass. Parameter fetching with the `AllGather` operation pauses when the number of active parameters reaches the given threshold. Note that increasing this parameter increases the memory footprint.
+ `"sdp_hierarchical_allgather"` *(bool, default: True)* – If set to `True`, the `AllGather` operation runs hierarchically: it runs within each node first, and then runs across nodes. For multi-node distributed training jobs, the hierarchical `AllGather` operation is automatically activated.
+ `"sdp_gradient_clipping"` *(float, default: 1.0)* – Specifies a threshold for clipping the L2 norm of the gradients before propagating them backward through the model parameters. When sharded data parallelism is activated, gradient clipping is also activated. The default threshold is `1.0`. Adjust this parameter if you observe the exploding gradients problem.

The following code shows an example of how to configure sharded data parallelism.

```
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        # "pipeline_parallel_degree": 1,    # Optional, default is 1
        # "tensor_parallel_degree": 1,      # Optional, default is 1
        "ddp": True,
        # parameters for sharded data parallelism
        "sharded_data_parallel_degree": 2,              # Add this to activate sharded data parallelism
        "sdp_reduce_bucket_size": int(5e8),             # Optional
        "sdp_param_persistence_threshold": int(1e6),    # Optional
        "sdp_max_live_parameters": int(1e9),            # Optional
        "sdp_hierarchical_allgather": True,             # Optional
        "sdp_gradient_clipping": 1.0                    # Optional
    }
}

mpi_options = {
    "enabled" : True,                      # Required
    "processes_per_host" : 8               # Required
}

smp_estimator = PyTorch(
    entry_point="your_training_script.py", # Specify your train script
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='1.13.1',
    py_version='py3',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="sharded-data-parallel-job"
)

smp_estimator.fit('s3://my_bucket/my_training_data/')
```

## Reference configurations
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use-config-sample"></a>

The SageMaker distributed training team provides the following reference configurations that you can use as a starting point. You can extrapolate from the following configurations to experiment and estimate the GPU memory usage for your model configuration. 

**Sharded data parallelism with SMDDP Collectives**


| Model (number of parameters) | Number of instances | Instance type | Sequence length | Global batch size | Mini batch size | Sharded data parallel degree | 
| --- | --- | --- | --- | --- | --- | --- | 
| GPT-NEOX-20B | 2 | ml.p4d.24xlarge | 2048 | 64 | 4 | 16 | 
| GPT-NEOX-20B | 8 | ml.p4d.24xlarge | 2048 | 768 | 12 | 32 | 

For example, if you increase the sequence length for the 20-billion-parameter model or increase the size of the model to 65 billion parameters, try reducing the batch size first. If the model still doesn’t fit with the smallest batch size (a batch size of 1), try increasing the degree of model parallelism.

**Sharded data parallelism with tensor parallelism and NCCL Collectives**


| Model (number of parameters) | Number of instances | Instance type | Sequence length | Global batch size | Mini batch size | Sharded data parallel degree | Tensor parallel degree | Activation offloading | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| GPT-NEOX-65B | 64 | ml.p4d.24xlarge | 2048 | 512 | 8 | 16 | 8 | Y | 
| GPT-NEOX-65B | 64 | ml.p4d.24xlarge | 4096 | 512 | 2 | 64 | 2 | Y | 

The combined use of sharded data parallelism and tensor parallelism is useful when you want to fit a large language model (LLM) into a large-scale cluster while training on text data with a longer sequence length. Longer sequences require a smaller batch size, and combining the two techniques helps manage the GPU memory usage so you can train LLMs on longer text sequences. To learn more, see [Sharded data parallelism with tensor parallelism](#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism).

For case studies, benchmarks, and more configuration examples, see the blog post [New performance improvements in Amazon SageMaker AI model parallel library](https://aws.amazon.com/blogs/machine-learning/new-performance-improvements-in-amazon-sagemaker-model-parallel-library/).

## Sharded data parallelism with SMDDP Collectives
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-smddp-collectives"></a>

The SageMaker data parallelism library offers collective communication primitives (SMDDP collectives) optimized for the AWS infrastructure. It achieves optimization by adopting an all-to-all-type communication pattern by making use of [Elastic Fabric Adapter (EFA)](https://aws.amazon.com/hpc/efa/), resulting in high-throughput and less latency-sensitive collectives, offloading the communication-related processing to the CPU, and freeing up GPU cycles for computation. On large clusters, SMDDP Collectives can offer improvements in distributed training performance by up to 40% compared to NCCL. For case studies and benchmark results, see the blog [New performance improvements in the Amazon SageMaker AI model parallelism library](https://aws.amazon.com/blogs/machine-learning/new-performance-improvements-in-amazon-sagemaker-model-parallel-library/).

**Note**  
Sharded data parallelism with SMDDP Collectives is available in the SageMaker model parallelism library v1.13.0 and later, and the SageMaker data parallelism library v1.6.0 and later. See also [Supported configurations](#sharded-data-parallelism-smddp-collectives-supported-config) to use sharded data parallelism with SMDDP Collectives.

In sharded data parallelism, a commonly used technique in large-scale distributed training, the `AllGather` collective reconstitutes the sharded layer parameters for forward and backward pass computations, in parallel with GPU computation. For large models, performing the `AllGather` operation efficiently is critical to avoid GPU bottlenecks and slowed training. When sharded data parallelism is activated, SMDDP Collectives replace these performance-critical `AllGather` collectives, improving training throughput.

**Train with SMDDP Collectives**

When your training job has sharded data parallelism activated and meets the [Supported configurations](#sharded-data-parallelism-smddp-collectives-supported-config), SMDDP Collectives are automatically activated. Internally, SMDDP Collectives optimize the `AllGather` collective to be performant on the AWS infrastructure and falls back to NCCL for all other collectives. Furthermore, under unsupported configurations, all collectives, including `AllGather`, automatically use the NCCL backend.

Since the SageMaker model parallelism library version 1.13.0, the `"ddp_dist_backend"` parameter is added to the `modelparallel` options. The default value for this configuration parameter is `"auto"`, which uses SMDDP Collectives whenever possible, and falls back to NCCL otherwise. To force the library to always use NCCL, specify `"nccl"` to the `"ddp_dist_backend"` configuration parameter. 

The following code example shows how to set up a PyTorch estimator using the sharded data parallelism with the `"ddp_dist_backend"` parameter, which is set to `"auto"` by default and, therefore, optional to add. 

```
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 1,
        "ddp": True,
        "sharded_data_parallel_degree": 64,
        "bf16": True,
        "ddp_dist_backend": "auto"  # Specify "nccl" to force the NCCL backend.
    }
}

mpi_options = {
    "enabled" : True,                      # Required
    "processes_per_host" : 8               # Required
}

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=8,
    instance_type='ml.p4d.24xlarge',
    framework_version='1.13.1',
    py_version='py3',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="sharded-data-parallel-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

**Supported configurations**

The `AllGather` operation with SMDDP Collectives is activated in training jobs when all of the following configuration requirements are met.
+ The sharded data parallelism degree is greater than 1
+ `instance_count` is greater than 1
+ `instance_type` is equal to `ml.p4d.24xlarge`
+ SageMaker training container for PyTorch v1.12.1 or later
+ The SageMaker data parallelism library v1.6.0 or later
+ The SageMaker model parallelism library v1.13.0 or later
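For illustration only, the job-configuration side of these requirements can be expressed as a predicate. This helper is hypothetical and not part of any SageMaker library; the container and library version requirements are not modeled here.

```python
# Hypothetical check mirroring the SMDDP AllGather requirements listed
# above (configuration side only; library versions are not modeled).
def smddp_allgather_eligible(sharded_degree, instance_count, instance_type):
    return (sharded_degree > 1
            and instance_count > 1
            and instance_type == "ml.p4d.24xlarge")

print(smddp_allgather_eligible(64, 8, "ml.p4d.24xlarge"))   # True
print(smddp_allgather_eligible(64, 1, "ml.p4d.24xlarge"))   # False
```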

**Performance and memory tuning**

SMDDP Collectives utilize additional GPU memory. There are two environment variables to configure the GPU memory usage depending on different model training use cases.
+ `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES` – During the SMDDP `AllGather` operation, the `AllGather` input buffer is copied into a temporary buffer for inter-node communication. The `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES` variable controls the size (in bytes) of this temporary buffer. If the size of the temporary buffer is smaller than the `AllGather` input buffer size, the `AllGather` collective falls back to using NCCL.
  + Default value: 16 * 1024 * 1024 (16 MB)
  + Acceptable values: any multiple of 8192
+ `SMDDP_AG_SORT_BUFFER_SIZE_BYTES` – The `SMDDP_AG_SORT_BUFFER_SIZE_BYTES` variable sizes the temporary buffer (in bytes) that holds data gathered from inter-node communication. If the size of this temporary buffer is smaller than `1/8 * sharded_data_parallel_degree * AllGather input size`, the `AllGather` collective falls back to using NCCL.
  + Default value: 128 * 1024 * 1024 (128 MB)
  + Acceptable values: any multiple of 8192
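The two fallback conditions above can be checked ahead of time. The following sketch is an assumed helper (not a SageMaker API) that applies the documented default buffer sizes:

```python
# Returns True if the SMDDP AllGather collective would fall back to
# NCCL, per the two buffer-size conditions described above.
def allgather_falls_back_to_nccl(allgather_input_bytes, sharded_degree,
                                 scratch_bytes=16 * 1024 * 1024,
                                 sort_bytes=128 * 1024 * 1024):
    return (scratch_bytes < allgather_input_bytes
            or sort_bytes < sharded_degree * allgather_input_bytes / 8)

# An 8 MB AllGather input at degree 64 fits the default buffers;
# a 32 MB input exceeds the default 16 MB scratch buffer.
print(allgather_falls_back_to_nccl(8 * 1024 * 1024, 64))    # False
print(allgather_falls_back_to_nccl(32 * 1024 * 1024, 64))   # True
```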

**Tuning guidance on the buffer size variables**

The default values for the environment variables should work well for most use cases. We recommend tuning these variables only if training runs into the out-of-memory (OOM) error. 

The following list discusses some tuning tips to reduce the GPU memory footprint of SMDDP Collectives while retaining the performance gain from them.
+ Tuning `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES`
  + The `AllGather` input buffer size is smaller for smaller models. Hence, the required size for `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES` can be smaller for models with fewer parameters.
  + The `AllGather` input buffer size decreases as `sharded_data_parallel_degree` increases, because the model gets sharded across more GPUs. Hence, the required size for `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES` can be smaller for training jobs with large values for `sharded_data_parallel_degree`.
+ Tuning `SMDDP_AG_SORT_BUFFER_SIZE_BYTES`
  + The amount of data gathered from inter-node communication is smaller for models with fewer parameters. Hence, the required size for `SMDDP_AG_SORT_BUFFER_SIZE_BYTES` can be smaller for such models.

Some collectives might fall back to using NCCL; in that case, you don't get the performance gain from the optimized SMDDP Collectives. If additional GPU memory is available, consider increasing the values of `SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES` and `SMDDP_AG_SORT_BUFFER_SIZE_BYTES` to benefit from the performance gain.

The following code shows how you can configure the environment variables by passing them through the `custom_mpi_options` entry of `mpi_options` in the distribution parameter for the PyTorch estimator.

```
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    .... # All modelparallel configuration options go here
}

mpi_options = {
    "enabled" : True,                      # Required
    "processes_per_host" : 8,              # Required
    # Pass the buffer size environment variables as custom MPI options
    "custom_mpi_options" : "-x SMDDP_AG_SCRATCH_BUFFER_SIZE_BYTES=8192 -x SMDDP_AG_SORT_BUFFER_SIZE_BYTES=8192"
}

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=8,
    instance_type='ml.p4d.24xlarge',
    framework_version='1.13.1',
    py_version='py3',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="sharded-data-parallel-demo-with-tuning",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

## Mixed precision training with sharded data parallelism
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-16bits-training"></a>

To further save GPU memory with half-precision floating point numbers and sharded data parallelism, you can activate 16-bit floating point format (FP16) or [Brain floating point format](https://en.wikichip.org/wiki/brain_floating-point_format) (BF16) by adding one additional parameter to the distributed training configuration.

**Note**  
Mixed precision training with sharded data parallelism is available in the SageMaker model parallelism library v1.11.0 and later.

**For FP16 Training with Sharded Data Parallelism**

To run FP16 training with sharded data parallelism, add `"fp16": True` to the `smp_options` configuration dictionary. In your training script, you can choose between the static and dynamic loss scaling options through the `smp.DistributedOptimizer` module. For more information, see [FP16 Training with Model Parallelism](model-parallel-extended-features-pytorch-fp16.md).

```
smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "sharded_data_parallel_degree": 2,
        "fp16": True
    }
}
```

**For BF16 Training with Sharded Data Parallelism**

The sharded data parallelism feature of SageMaker AI supports training in the BF16 data type. The BF16 data type uses 8 bits to represent the exponent of a floating point number, while the FP16 data type uses 5 bits. Preserving 8 bits for the exponent keeps the same exponent representation as a 32-bit single-precision floating point (FP32) number. This makes the conversion between FP32 and BF16 simpler and significantly less prone to the overflow and underflow issues that arise often in FP16 training, especially when training larger models. While both data types use 16 bits in total, the increased representation range for the exponent in the BF16 format comes at the expense of reduced precision. For training large models, this reduced precision is often an acceptable trade-off for the range and training stability.
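The bit budgets can be summarized as follows. This is a standalone illustration of the IEEE 754 and bfloat16 layouts, not tied to any SageMaker API:

```python
# (sign bits, exponent bits, mantissa bits) for each format
FORMATS = {
    "fp32": (1, 8, 23),
    "fp16": (1, 5, 10),
    "bf16": (1, 8, 7),
}

# BF16 keeps the FP32 exponent width, so it covers roughly the same
# dynamic range; FP16's 5-bit exponent caps its largest finite value.
fp16_max = (2 - 2 ** -10) * 2 ** 15
print(fp16_max)   # 65504.0
```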

**Note**  
Currently, BF16 training works only when sharded data parallelism is activated.

To run BF16 training with sharded data parallelism, add `"bf16": True` to the `smp_options` configuration dictionary.

```
smp_options = {
    "enabled": True,
    "parameters": {
        "ddp": True,
        "sharded_data_parallel_degree": 2,
        "bf16": True
    }
}
```

## Sharded data parallelism with tensor parallelism
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism"></a>

If you use sharded data parallelism and also need to reduce the global batch size, consider using [tensor parallelism](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism.html) with sharded data parallelism. When training a large model with sharded data parallelism on a very large compute cluster (typically 128 nodes or beyond), even a small batch size per GPU results in a very large global batch size. This might lead to convergence issues or low computational performance. Reducing the batch size per GPU is sometimes not possible with sharded data parallelism alone when a single batch is already large and cannot be reduced further. In such cases, using sharded data parallelism in combination with tensor parallelism helps reduce the global batch size.

Choosing the optimal sharded data parallel and tensor parallel degrees depends on the scale of the model, the instance type, and the global batch size that is reasonable for the model to converge. We recommend that you start from a low tensor parallel degree to fit the global batch size into the compute cluster to resolve CUDA out-of-memory errors and achieve the best performance. See the following two example cases to learn how the combination of tensor parallelism and sharded data parallelism helps you adjust the global batch size by grouping GPUs for model parallelism, resulting in a lower number of model replicas and a smaller global batch size.

**Note**  
This feature is available from the SageMaker model parallelism library v1.15, and supports PyTorch v1.13.1.

**Note**  
This feature is available for the supported models by the tensor parallelism functionality of the library. To find the list of the supported models, see [Support for Hugging Face Transformer Models](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-hugging-face.html). Also note that you need to pass `tensor_parallelism=True` to the `smp.model_creation` argument while modifying your training script. To learn more, see the training script [https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/train_gpt_simple.py#L793](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/train_gpt_simple.py#L793) in the *SageMaker AI Examples GitHub repository*.

### Example 1
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism-ex1"></a>

Assume that we want to train a model over a cluster of 1536 GPUs (192 nodes with 8 GPUs in each), setting the degree of sharded data parallelism to 32 (`sharded_data_parallel_degree=32`) and the batch size per GPU to 1, where each batch has a sequence length of 4096 tokens. In this case, there are 1536 model replicas, the global batch size becomes 1536, and each global batch contains about 6 million tokens. 

```
(1536 GPUs) * (1 batch per GPU) = (1536 global batches)
(1536 batches) * (4096 tokens per batch) = (6,291,456 tokens)
```

Adding tensor parallelism to it can lower the global batch size. One configuration example can be setting the tensor parallel degree to 8 and the batch size per GPU to 4. This forms 192 tensor parallel groups or 192 model replicas, where each model replica is distributed across 8 GPUs. The batch size of 4 is the amount of training data per iteration and per tensor parallel group; that is, each model replica consumes 4 batches per iteration. In this case, the global batch size becomes 768, and each global batch contains about 3 million tokens. Hence, the global batch size is reduced by half compared to the previous case with sharded data parallelism only.

```
(1536 GPUs) / (8 tensor parallel degree) = (192 tensor parallelism groups)
(192 tensor parallelism groups) * (4 batches per tensor parallelism group) = (768 global batches)
(768 batches) * (4096 tokens per batch) = (3,145,728 tokens)
```
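The arithmetic in both configurations can be captured in a small sketch (a plain calculation, not a SageMaker API):

```python
# Global batch size and tokens per global batch, given a cluster size,
# a tensor parallel degree, a per-replica batch size, and a sequence
# length. Model replicas = total GPUs / tensor parallel degree.
def global_batch(total_gpus, tp_degree, batch_per_replica, seq_len):
    replicas = total_gpus // tp_degree
    batches = replicas * batch_per_replica
    return batches, batches * seq_len

print(global_batch(1536, 1, 1, 4096))   # Example 1: (1536, 6291456)
print(global_batch(1536, 8, 4, 4096))   # With tensor parallelism: (768, 3145728)
```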

### Example 2
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism-ex2"></a>

When both sharded data parallelism and tensor parallelism are activated, the library first applies tensor parallelism and shards the model across this dimension. For each tensor parallel rank, the data parallelism is applied as per `sharded_data_parallel_degree`.

For example, assume that we want to set 32 GPUs with a tensor parallel degree of 4 (forming groups of 4 GPUs), a sharded data parallel degree of 4, ending up with a replication degree of 2. The assignment creates eight GPU groups based on the tensor parallel degree as follows: `(0,1,2,3)`, `(4,5,6,7)`, `(8,9,10,11)`, `(12,13,14,15)`, `(16,17,18,19)`, `(20,21,22,23)`, `(24,25,26,27)`, `(28,29,30,31)`. That is, four GPUs form one tensor parallel group. In this case, the reduced data parallel group for the 0th rank GPUs of the tensor parallel groups would be `(0,4,8,12,16,20,24,28)`. The reduced data parallel group is sharded based on the sharded data parallel degree of 4, resulting in two replication groups for data parallelism. GPUs `(0,4,8,12)` form one sharding group, which collectively hold a complete copy of all parameters for the 0th tensor parallel rank, and GPUs `(16,20,24,28)` form another such group. Other tensor parallel ranks also have similar sharding and replication groups.
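The group assignment described above can be reproduced with a short sketch, assuming ranks 0–31 are assigned to tensor parallel groups contiguously as in the example:

```python
# Tensor parallel groups: consecutive blocks of tp_degree ranks.
def tensor_parallel_groups(num_gpus, tp_degree):
    return [list(range(i, i + tp_degree))
            for i in range(0, num_gpus, tp_degree)]

# Reduced data parallel group for a given tensor parallel rank: the
# GPUs holding that rank across all tensor parallel groups.
def reduced_dp_group(num_gpus, tp_degree, tp_rank):
    return list(range(tp_rank, num_gpus, tp_degree))

# Sharding groups split the reduced group by the sharded degree.
def sharding_groups(num_gpus, tp_degree, tp_rank, sdp_degree):
    rdp = reduced_dp_group(num_gpus, tp_degree, tp_rank)
    return [rdp[i:i + sdp_degree] for i in range(0, len(rdp), sdp_degree)]

print(tensor_parallel_groups(32, 4)[0])   # [0, 1, 2, 3]
print(reduced_dp_group(32, 4, 0))         # [0, 4, 8, 12, 16, 20, 24, 28]
print(sharding_groups(32, 4, 0, 4))       # [[0, 4, 8, 12], [16, 20, 24, 28]]
```

The two sharding groups printed last match the example: GPUs `(0,4,8,12)` and `(16,20,24,28)` each hold a complete copy of the 0th tensor parallel rank's parameters.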

![\[Figure 1: Tensor parallelism groups.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/sdp_tp_group_tp.jpg)


Figure 1: Tensor parallelism groups for (nodes, sharded data parallel degree, tensor parallel degree) = (4, 4, 4), where each rectangle represents a GPU with indices from 0 to 31. The GPUs form tensor parallelism groups from TPG0 to TPG7. Replication groups are ({TPG0, TPG4}, {TPG1, TPG5}, {TPG2, TPG6}, and {TPG3, TPG7}); each replication group pair shares the same color but is filled differently.

![\[Figure 2: Sharded data parallelism groups.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/sdp_tp_group_sdp.jpg)


Figure 2: Sharded data parallelism groups for (nodes, sharded data parallel degree, tensor parallel degree) = (4, 4, 4), where each rectangle represents a GPU with indices from 0 to 31. The GPUs form sharded data parallelism groups from SDPG0 to SDPG7. Replication groups are ({SDPG0, SDPG4}, {SDPG1, SDPG5}, {SDPG2, SDPG6}, and {SDPG3, SDPG7}); each replication group pair shares the same color but is filled differently.

### How to activate sharded data parallelism with tensor parallelism
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism-activate"></a>

To use sharded data parallelism with tensor parallelism, you need to set both `sharded_data_parallel_degree` and `tensor_parallel_degree` in the configuration for `distribution` while creating an object of the SageMaker PyTorch estimator class. 

You also need to activate `prescaled_batch`. This means that, instead of each GPU reading its own batch of data, each tensor parallel group collectively reads a combined batch of the chosen batch size. Effectively, instead of dividing the dataset into parts equal to the number of GPUs (or data parallel size, `smp.dp_size()`), it divides into parts equal to the number of GPUs divided by `tensor_parallel_degree` (also called reduced data parallel size, `smp.rdp_size()`). For more details on prescaled batch, see [Prescaled Batch](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#prescaled-batch) in the *SageMaker Python SDK documentation*. See also the example training script [https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/train_gpt_simple.py#L164](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/train_gpt_simple.py#L164) for GPT-2 in the *SageMaker AI Examples GitHub repository*.

The following code snippet shows an example of creating a PyTorch estimator object based on the aforementioned scenario in [Example 2](#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism-ex2).

```
mpi_options = "-verbose --mca orte_base_help_aggregate 0 "
smp_parameters = {
    "ddp": True,
    "fp16": True,
    "prescaled_batch": True,
    "sharded_data_parallel_degree": 4,
    "tensor_parallel_degree": 4
}

pytorch_estimator = PyTorch(
    entry_point="your_training_script.py",
    role=role,
    instance_type="ml.p4d.24xlarge",
    volume_size=200,
    instance_count=4,
    sagemaker_session=sagemaker_session,
    py_version="py3",
    framework_version="1.13.1",
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled": True, 
                "parameters": smp_parameters,
            }
        },
        "mpi": {
            "enabled": True,
            "processes_per_host": 8,
            "custom_mpi_options": mpi_options,
        },
    },
    source_dir="source_directory_of_your_code",
    output_path=s3_output_location
)
```

## Tips and considerations for using sharded data parallelism
<a name="model-parallel-extended-features-pytorch-sharded-data-parallelism-considerations"></a>

Consider the following when using the SageMaker model parallelism library's sharded data parallelism.
+ Sharded data parallelism is compatible with FP16 training. To run FP16 training, see the [FP16 Training with Model Parallelism](model-parallel-extended-features-pytorch-fp16.md) section.
+ Sharded data parallelism is compatible with tensor parallelism. The following items are what you might need to consider for using sharded data parallelism with tensor parallelism.
  + When using sharded data parallelism with tensor parallelism, the embedding layers are also automatically distributed across the tensor parallel group. In other words, the `distribute_embedding` parameter is automatically set to `True`. For more information about tensor parallelism, see [Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism.md).
  + Note that sharded data parallelism with tensor parallelism currently uses the NCCL collectives as the backend of the distributed training strategy.

  To learn more, see the [Sharded data parallelism with tensor parallelism](#model-parallel-extended-features-pytorch-sharded-data-parallelism-with-tensor-parallelism) section.
+ Sharded data parallelism currently is not compatible with [pipeline parallelism](model-parallel-intro.md#model-parallel-intro-pp) or [optimizer state sharding](model-parallel-extended-features-pytorch-optimizer-state-sharding.md). To activate sharded data parallelism, turn off optimizer state sharding and set the pipeline parallel degree to 1.
+ The [activation checkpointing](model-parallel-extended-features-pytorch-activation-checkpointing.md) and [activation offloading](model-parallel-extended-features-pytorch-activation-offloading.md) features are compatible with sharded data parallelism.
+ To use sharded data parallelism with gradient accumulation, set the `backward_passes_per_step` argument to the number of accumulation steps while wrapping your model with the [https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.DistributedModel](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.DistributedModel) module. This ensures that the gradient `AllReduce` operation across the model replication groups (sharding groups) takes place at the boundary of gradient accumulation.
+ You can checkpoint your models trained with sharded data parallelism using the library's checkpointing APIs, `smp.save_checkpoint` and `smp.resume_from_checkpoint`. For more information, see [Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library v1.10.0 and later)](distributed-model-parallel-checkpointing-and-finetuning.md#model-parallel-extended-features-pytorch-checkpoint).
+ The behavior of the [smdistributed.modelparallel.torch.delay_param_initialization](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.delay_param_initialization) configuration parameter changes under sharded data parallelism. When sharded data parallelism and delayed parameter initialization are turned on at the same time, parameters are immediately initialized upon model creation in a sharded manner instead of delaying the parameter initialization, so that each rank initializes and stores its own shard of parameters.
+ When sharded data parallelism is activated, the library performs gradient clipping internally when the `optimizer.step()` call runs. You don't need to use utility APIs for gradient clipping, such as [torch.nn.utils.clip_grad_norm_](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html). To adjust the threshold value for gradient clipping, you can set it through the `sdp_gradient_clipping` parameter for the distribution parameter configuration when you construct the SageMaker PyTorch estimator, as shown in the [How to apply sharded data parallelism to your training job](#model-parallel-extended-features-pytorch-sharded-data-parallelism-how-to-use) section.
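
Several of the considerations above surface as estimator options. As a rough sketch (the degrees and threshold shown are illustrative values, not recommendations), a distribution configuration that activates sharded data parallelism and sets the gradient-clipping threshold might look like the following:

```python
# Illustrative smp_options for the SageMaker PyTorch estimator's
# distribution={"smdistributed": {"modelparallel": smp_options}} argument.
smp_options = {
    "enabled": True,
    "parameters": {
        "sharded_data_parallel_degree": 8,  # activates sharded data parallelism
        "pipeline_parallel_degree": 1,      # must be 1 with sharded data parallelism
        "sdp_gradient_clipping": 1.0,       # threshold for the internal gradient clipping
    },
}

# With gradient accumulation, you would also pass the accumulation step
# count when wrapping the model in your training script, for example:
#   model = smp.DistributedModel(model, backward_passes_per_step=4)
```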

# Pipelining a Model
<a name="model-parallel-core-features-pipieline-parallelism"></a>

One of the core features of SageMaker's model parallelism library is *pipeline parallelism*, which determines the order in which computations are made and data is processed across devices during model training. Pipelining is a technique for achieving true parallelization in model parallelism: the GPUs compute simultaneously on different data samples, overcoming the performance loss of sequential computation. When you use pipeline parallelism, the training job runs in a pipelined fashion over microbatches to maximize GPU usage.

**Note**  
Pipeline parallelism, also called model partitioning, is available for both PyTorch and TensorFlow. For supported versions of the frameworks, see [Supported Frameworks and AWS Regions](distributed-model-parallel-support.md).

## Pipeline Execution Schedule
<a name="model-parallel-pipeline-execution"></a>

Pipelining is based on splitting a mini-batch into microbatches, which are fed into the training pipeline one-by-one and follow an execution schedule defined by the library runtime. A *microbatch* is a smaller subset of a given training mini-batch. The pipeline schedule determines which microbatch is executed by which device for every time slot. 

For example, depending on the pipeline schedule and the model partition, GPU `i` might perform (forward or backward) computation on microbatch `b` while GPU `i+1` performs computation on microbatch `b+1`, thereby keeping both GPUs active at the same time. During a single forward or backward pass, execution flow for a single microbatch might visit the same device multiple times, depending on the partitioning decision. For instance, an operation that is at the beginning of the model might be placed on the same device as an operation at the end of the model, while the operations in between are on different devices, which means this device is visited twice.

The library offers two different pipeline schedules, *simple* and *interleaved*, which can be configured using the `pipeline` parameter in the SageMaker Python SDK. In most cases, interleaved pipeline can achieve better performance by utilizing the GPUs more efficiently.

### Interleaved Pipeline
<a name="model-parallel-pipeline-execution-interleaved"></a>

In an interleaved pipeline, backward execution of the microbatches is prioritized whenever possible. This allows quicker release of the memory used for activations, using memory more efficiently. It also allows for scaling the number of microbatches higher, reducing the idle time of the GPUs. At steady-state, each device alternates between running forward and backward passes. This means that the backward pass of one microbatch may run before the forward pass of another microbatch finishes.

![\[Example execution schedule for the interleaved pipeline over 2 GPUs.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/interleaved-pipeline-execution.png)


The preceding figure illustrates an example execution schedule for the interleaved pipeline over 2 GPUs. In the figure, F0 represents the forward pass for microbatch 0, and B1 represents the backward pass for microbatch 1. **Update** represents the optimizer update of the parameters. GPU0 always prioritizes backward passes whenever possible (for instance, executes B0 before F2), which allows for clearing of the memory used for activations earlier.

### Simple Pipeline
<a name="model-parallel-pipeline-execution-simple"></a>

A simple pipeline, by contrast, finishes running the forward pass for each microbatch before starting the backward pass. This means that it only pipelines the forward pass and backward pass stages within themselves. The following figure illustrates an example of how this works, over 2 GPUs.

![\[Example on a pipeline running the forward pass for each microbatch before starting the backward pass.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/simple-pipeline-execution.png)

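Ignoring communication costs and the library's real scheduler, the simple schedule can be modeled as a toy grid of time slots, which makes the forward pipelining and the delayed backward phase visible:

```python
def simple_pipeline_schedule(num_devices, num_microbatches):
    """Toy model of a simple (non-interleaved) pipeline schedule.

    Forward passes are fully pipelined; every backward pass starts only
    after all forward passes finish, traversing devices in reverse order.
    Returns a dict mapping (time_slot, device) -> task label.
    """
    schedule = {}
    # Forward: microbatch b reaches device d at slot b + d.
    for b in range(num_microbatches):
        for d in range(num_devices):
            schedule[(b + d, d)] = f"F{b}"
    # Backward: begins after the last forward slot, last device first.
    forward_end = num_microbatches + num_devices - 1
    for b in range(num_microbatches):
        for d in range(num_devices):
            slot = forward_end + b + (num_devices - 1 - d)
            schedule[(slot, d)] = f"B{b}"
    return schedule

# Two devices, two microbatches: F0 starts on GPU0, reaches GPU1 one slot
# later, and B0 begins on GPU1 only after both forwards complete.
sched = simple_pipeline_schedule(num_devices=2, num_microbatches=2)
```

An interleaved schedule would instead move each `B{b}` as early as its dependencies allow, which is what shrinks the idle gap between the forward and backward phases.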

### Pipelining Execution in Specific Frameworks
<a name="model-parallel-pipeline-frameworks"></a>

Use the following sections to learn about the framework-specific pipeline scheduling decisions SageMaker's model parallelism library makes for TensorFlow and PyTorch. 

#### Pipeline Execution with TensorFlow
<a name="model-parallel-pipeline-execution-interleaved-tf"></a>

The following image is an example of a TensorFlow graph partitioned by the model parallelism library, using automated model splitting. When a graph is split, each resulting subgraph is replicated B times (except for the variables), where B is the number of microbatches. In this figure, each subgraph is replicated 2 times (B=2). An `SMPInput` operation is inserted at each input of a subgraph, and an `SMPOutput` operation is inserted at each output. These operations communicate with the library backend to transfer tensors to and from each other.

![\[Example of a TensorFlow graph partitioned by the model parallelism library, using automated model splitting.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/interleaved-pipeline-tf.png)


The following image is an example of 2 subgraphs split with B=2 with gradient operations added. The gradient of a `SMPInput` op is a `SMPOutput` op, and vice versa. This enables the gradients to flow backwards during back-propagation.

![\[Example of 2 subgraphs split with B=2 with gradient operations added.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/interleaved-pipeline-tf.gif)


This GIF demonstrates an example interleaved pipeline execution schedule with B=2 microbatches and 2 subgraphs. Each device sequentially executes one of the subgraph replicas to improve GPU utilization. As B grows larger, the fraction of idle time slots goes to zero. Whenever it is time to do (forward or backward) computation on a specific subgraph replica, the pipeline layer signals to the corresponding blue `SMPInput` operations to start executing.

Once the gradients from all microbatches in a single mini-batch are computed, the library combines the gradients across microbatches, which can then be applied to the parameters. 
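
Stripped of the distributed machinery, that combination step amounts to averaging the per-microbatch gradients (assuming each per-microbatch loss is a mean over its samples), as in this toy sketch:

```python
def combine_microbatch_grads(grads_per_microbatch):
    """Average gradients computed on each microbatch of one mini-batch.

    grads_per_microbatch: list of flat gradient vectors (lists of floats),
    one per microbatch. When the per-microbatch loss is a mean, averaging
    the microbatch gradients reproduces the full mini-batch gradient.
    """
    num_mb = len(grads_per_microbatch)
    return [sum(g) / num_mb for g in zip(*grads_per_microbatch)]

combined = combine_microbatch_grads([[1.0, 2.0], [3.0, 4.0]])
# combined == [2.0, 3.0]
```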

#### Pipeline Execution with PyTorch
<a name="model-parallel-pipeline-execution-interleaved-pt"></a>

Conceptually, pipelining follows a similar idea in PyTorch. However, because PyTorch does not rely on static graphs, the model parallelism library's PyTorch feature uses a more dynamic pipelining paradigm. 

As in TensorFlow, each batch is split into a number of microbatches, which are executed one at a time on each device. However, the execution schedule is handled via execution servers launched on each device. Whenever the output of a submodule that is placed on another device is needed on the current device, an execution request is sent to the execution server of the remote device along with the input tensors to the submodule. The server then executes this module with the given inputs and returns the response to the current device.

Since the current device is idle during the remote submodule execution, the local execution for the current microbatch pauses, and the library runtime switches execution to another microbatch which the current device can actively work on. The prioritization of microbatches is determined by the chosen pipeline schedule. For an interleaved pipeline schedule, microbatches that are in the backward stage of the computation are prioritized whenever possible.

# Tensor Parallelism
<a name="model-parallel-extended-features-pytorch-tensor-parallelism"></a>

*Tensor parallelism* is a type of model parallelism in which specific model weights, gradients, and optimizer states are split across devices. In contrast to pipeline parallelism, which keeps individual weights intact but partitions the *set* of weights, tensor parallelism splits individual weights. This typically involves distributed computation of specific operations, modules, or layers of the model.

Tensor parallelism is required in cases in which a single parameter consumes most of the GPU memory (such as large embedding tables with a large vocabulary size or a large softmax layer with a large number of classes). In this case, treating this large tensor or operation as an atomic unit is inefficient and impedes balance of the memory load. 

Tensor parallelism is also useful for extremely large models in which a pure pipelining is simply not enough. For example, with GPT-3-scale models that require partitioning over tens of instances, a pure microbatch pipelining is inefficient because the pipeline depth becomes too high and the overhead becomes prohibitively large.

**Note**  
Tensor parallelism is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.

**Topics**
+ [How Tensor Parallelism Works](model-parallel-extended-features-pytorch-tensor-parallelism-how-it-works.md)
+ [Run a SageMaker Distributed Model Parallel Training Job with Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism-examples.md)
+ [Support for Hugging Face Transformer Models](model-parallel-extended-features-pytorch-hugging-face.md)
+ [Ranking Mechanism when Using a Combination of Pipeline Parallelism and Tensor Parallelism](model-parallel-extended-features-pytorch-ranking-mechanism.md)

# How Tensor Parallelism Works
<a name="model-parallel-extended-features-pytorch-tensor-parallelism-how-it-works"></a>

Tensor parallelism takes place at the level of `nn.Modules`; it partitions specific modules in the model across tensor parallel ranks. This is in addition to the existing partition of the *set of modules* used in pipeline parallelism.

When a module is partitioned through tensor parallelism, its forward and backward propagation are distributed. The library handles the necessary communication across devices to implement the distributed execution of these modules. The modules are partitioned across multiple data parallel ranks. Contrary to the traditional distribution of workloads, each data parallel rank does **not** have the complete model replica when the library’s tensor parallelism is used. Instead, each data parallel rank may have only a partition of the distributed modules, in addition to the entirety of the modules that are not distributed.

**Example:** Consider tensor parallelism across data parallel ranks, where the degree of data parallelism is 4 and the degree of tensor parallelism is 2. Assume that you have a data parallel group that holds the following module tree, after partitioning the set of modules.

```
A
├── B
|   ├── E
|   ├── F
├── C
└── D
    ├── G
    └── H
```

Assume that tensor parallelism is supported for the modules B, G, and H. One possible outcome of tensor parallel partition of this model could be:

```
dp_rank 0 (tensor parallel rank 0): A, B:0, C, D, G:0, H
dp_rank 1 (tensor parallel rank 1): A, B:1, C, D, G:1, H
dp_rank 2 (tensor parallel rank 0): A, B:0, C, D, G:0, H
dp_rank 3 (tensor parallel rank 1): A, B:1, C, D, G:1, H
```

Each line represents the set of modules stored in that `dp_rank`, and the notation `X:y` represents the `y`th fraction of the module `X`. Note the following:

1. Partitioning takes place across subsets of data parallel ranks, which we call `TP_GROUP`, not the entire `DP_GROUP`, so that the exact model partition is replicated across `dp_rank` 0 and `dp_rank` 2, and similarly across `dp_rank` 1 and `dp_rank` 3.

1. The modules `E` and `F` are no longer part of the model, since their parent module `B` is partitioned, and any execution that is normally a part of `E` and `F` takes place within the (partitioned) `B` module.

1. Even though `H` is supported for tensor parallelism, in this example it is not partitioned, which highlights that whether to partition a module depends on user input. The fact that a module is supported for tensor parallelism does not necessarily mean it is partitioned.
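
The assignment above can be reproduced with a toy sketch. The `assign_modules` helper is hypothetical, and mapping `dp_rank % tp_size` to the tensor parallel rank is just one possible layout, assumed here purely for illustration:

```python
def assign_modules(dp_size, tp_size, modules, tp_partitioned):
    """Toy reconstruction of the example partition above.

    Every dp_rank keeps all non-partitioned modules in full; a module
    chosen for tensor parallelism is split into tp_size fractions, and
    each rank stores the fraction matching its tensor parallel rank.
    """
    assignment = {}
    for dp_rank in range(dp_size):
        tp_rank = dp_rank % tp_size  # assumed layout, for illustration only
        held = [f"{m}:{tp_rank}" if m in tp_partitioned else m
                for m in modules]
        assignment[dp_rank] = held
    return assignment

# E and F are absent because their parent B is partitioned; H is supported
# for tensor parallelism but, per user choice, left unpartitioned here.
layout = assign_modules(4, 2, ["A", "B", "C", "D", "G", "H"], {"B", "G"})
```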

## How the library adapts tensor parallelism to PyTorch `nn.Linear` module
<a name="model-parallel-extended-for-pytorch-adapt-to-module"></a>

When tensor parallelism is performed over data parallel ranks, a subset of the parameters, gradients, and optimizer states are partitioned across the tensor parallel devices *for the modules that are partitioned*. For the rest of the modules, the tensor parallel devices operate in a regular data parallel manner. To execute the partitioned module, a device first collects the necessary parts of *all data samples* across peer devices in the same tensor parallelism group. The device then runs the local fraction of the module on all these data samples, followed by another round of synchronization which both combines the parts of the output for each data sample and returns the combined data samples to the GPUs from which the data sample first originated. The following figure shows an example of this process over a partitioned `nn.Linear` module. 

![\[Two figures showing two tensor parallel concepts.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/tensor-parallel-concept.png)


The first figure shows a small model with a large `nn.Linear` module using data parallelism over two tensor parallel ranks. The `nn.Linear` module is replicated across both ranks. 

The second figure shows tensor parallelism applied on a larger model while splitting the `nn.Linear` module. Each `tp_rank` holds half the linear module, and the entirety of the rest of the operations. While the linear module runs, each `tp_rank` collects the relevant half of all data samples and passes it through their half of the `nn.Linear` module. The result needs to be reduce-scattered (with summation as the reduction operation) so that each rank has the final linear output for their own data samples. The rest of the model runs in the typical data parallel manner.
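
Reduced to the core linear algebra (and leaving out the gathering and scattering of data samples across ranks), the split computation can be verified numerically: summing the two partial products reproduces the full `nn.Linear` output.

```python
def matmul(x, w):
    # Plain-Python product of an (n x k) matrix with a (k x m) matrix.
    return [[sum(xi[t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for xi in x]

x = [[1.0, 2.0, 3.0, 4.0]]          # one data sample with 4 input features
w = [[1.0], [0.0], [2.0], [1.0]]    # weight: 4 input features -> 1 output

full = matmul(x, w)                 # the undistributed linear output

# Each of 2 ranks holds half the input features of each sample and the
# matching half of the weight rows, and computes a partial output.
partials = [matmul([row[:2] for row in x], w[:2]),
            matmul([row[2:] for row in x], w[2:])]

# The reduction step (summation across ranks) recovers the full output.
combined = [[sum(vals) for vals in zip(*rows)]
            for rows in zip(*partials)]
```

In the library, the summation and the redistribution of per-sample results happen in one reduce-scatter collective rather than the explicit sum shown here.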

# Run a SageMaker Distributed Model Parallel Training Job with Tensor Parallelism
<a name="model-parallel-extended-features-pytorch-tensor-parallelism-examples"></a>

In this section, you learn:
+ How to configure a SageMaker PyTorch estimator and the SageMaker model parallelism option to use tensor parallelism.
+ How to adapt your training script using the extended `smdistributed.modelparallel` modules for tensor parallelism.

To learn more about the `smdistributed.modelparallel` modules, see the [SageMaker model parallel APIs](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html) in the *SageMaker Python SDK documentation*.

**Topics**
+ [Tensor parallelism alone](#model-parallel-extended-features-pytorch-tensor-parallelism-alone)
+ [Tensor parallelism combined with pipeline parallelism](#model-parallel-extended-features-pytorch-tensor-and-pipeline-parallelism)

## Tensor parallelism alone
<a name="model-parallel-extended-features-pytorch-tensor-parallelism-alone"></a>

The following is an example of a distributed training option to activate tensor parallelism alone, without pipeline parallelism. Configure the `mpi_options` and `smp_options` dictionaries to specify distributed training options to the SageMaker `PyTorch` estimator.

**Note**  
Extended memory-saving features are available through Deep Learning Containers for PyTorch, which include the SageMaker model parallelism library v1.6.0 or later.

**Configure a SageMaker PyTorch estimator**

```
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,               # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}
               
smp_options = {
    "enabled":True,
    "parameters": {
        "pipeline_parallel_degree": 1,    # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 4,      # tp over 4 devices
        "ddp": True
    }
}
              
smp_estimator = PyTorch(
    entry_point='your_training_script.py', # Specify
    role=role,
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    framework_version='1.13.1',
    py_version='py39',
    instance_count=1,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smp_estimator.fit('s3://my_bucket/my_training_data/')
```

**Tip**  
To find a complete list of parameters for `distribution`, see [Configuration Parameters for Model Parallelism](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html) in the SageMaker Python SDK documentation.

**Adapt your PyTorch training script**

The following example training script shows how to adapt the SageMaker model parallelism library to a training script. In this example, it is assumed that the script is named `your_training_script.py`. 

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        return F.log_softmax(x, 1)

def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # smdistributed: Move input tensors to the GPU ID used by
        # the current process, based on the set_device call.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target, reduction="mean")
        loss.backward()
        optimizer.step()

# smdistributed: Initialize the backend
smp.init()

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Download only on a single process per instance.
# When this is not present, the file is corrupted by multiple processes trying
# to download and extract at the same time
if smp.local_rank() == 0:
    dataset = datasets.MNIST("../data", train=True, download=True)
smp.barrier()

# smdistributed: Shard the dataset based on data parallel ranks
if smp.dp_size() > 1:
    partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())}
    dataset = SplitDataset(dataset, partitions=partitions_dict)
    dataset.select(f"{smp.dp_rank()}")

train_loader = torch.utils.data.DataLoader(dataset, batch_size=64)

# smdistributed: Enable tensor parallelism for all supported modules in the model
# i.e., nn.Linear in this case. Alternatively, we can use
# smp.set_tensor_parallelism(model.fc1, True)
# to enable it only for model.fc1
with smp.tensor_parallelism():
    model = Net()

# smdistributed: Use the DistributedModel wrapper to distribute the
# modules for which tensor parallelism is enabled
model = smp.DistributedModel(model)

optimizer = optim.Adadelta(model.parameters(), lr=4.0)
optimizer = smp.DistributedOptimizer(optimizer)

train(model, device, train_loader, optimizer)
```

## Tensor parallelism combined with pipeline parallelism
<a name="model-parallel-extended-features-pytorch-tensor-and-pipeline-parallelism"></a>

The following is an example of a distributed training option that enables tensor parallelism combined with pipeline parallelism. Set up the `mpi_options` and `smp_options` parameters to specify model parallel options with tensor parallelism when you configure a SageMaker `PyTorch` estimator.

**Note**  
Extended memory-saving features are available through Deep Learning Containers for PyTorch, which include the SageMaker model parallelism library v1.6.0 or later.

**Configure a SageMaker PyTorch estimator**

```
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,               # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}
               
smp_options = {
    "enabled":True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,    # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,      # tp over 2 devices
        "ddp": True
    }
}
              
smp_estimator = PyTorch(
    entry_point='your_training_script.py', # Specify
    role=role,
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    framework_version='1.13.1',
    py_version='py39',
    instance_count=1,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smp_estimator.fit('s3://my_bucket/my_training_data/')  
```

<a name="model-parallel-extended-features-pytorch-tensor-and-pipeline-parallelism-script"></a>**Adapt your PyTorch training script**

The following example training script shows how to adapt the SageMaker model parallelism library to a training script. Note that the training script now includes the `smp.step` decorator: 

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        return F.log_softmax(x, 1)


# smdistributed: Define smp.step. Return any tensors needed outside.
@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)
    return output, loss

def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # smdistributed: Move input tensors to the GPU ID used by
        # the current process, based on the set_device call.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        # Return value, loss_mb is a StepOutput object
        _, loss_mb = train_step(model, data, target)

        # smdistributed: Average the loss across microbatches.
        loss = loss_mb.reduce_mean()

        optimizer.step()

# smdistributed: Initialize the backend
smp.init()

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Download only on a single process per instance.
# When this is not present, the file is corrupted by multiple processes trying
# to download and extract at the same time
if smp.local_rank() == 0:
    dataset = datasets.MNIST("../data", train=True, download=True)
smp.barrier()

# smdistributed: Shard the dataset based on data parallel ranks
if smp.dp_size() > 1:
    partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())}
    dataset = SplitDataset(dataset, partitions=partitions_dict)
    dataset.select(f"{smp.dp_rank()}")

# smdistributed: Set drop_last=True to ensure that batch size is always divisible
# by the number of microbatches
train_loader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)

model = Net()

# smdistributed: enable tensor parallelism only for model.fc1
smp.set_tensor_parallelism(model.fc1, True)

# smdistributed: Use the DistributedModel container to provide the model
# to be partitioned across different ranks. For the rest of the script,
# the returned DistributedModel object should be used in place of
# the model provided for DistributedModel class instantiation.
model = smp.DistributedModel(model)

optimizer = optim.Adadelta(model.parameters(), lr=4.0)
optimizer = smp.DistributedOptimizer(optimizer)

train(model, device, train_loader, optimizer)
```

# Support for Hugging Face Transformer Models
<a name="model-parallel-extended-features-pytorch-hugging-face"></a>

The SageMaker model parallelism library's tensor parallelism offers out-of-the-box support for the following Hugging Face Transformer models:
+ GPT-2, BERT, and RoBERTa (Available in the SageMaker model parallelism library v1.7.0 and later)
+ GPT-J (Available in the SageMaker model parallelism library v1.8.0 and later)
+ GPT-Neo (Available in the SageMaker model parallelism library v1.10.0 and later)

**Note**  
For any other Transformers models, you need to use the [smdistributed.modelparallel.torch.tp_register_with_module()](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.html#smdistributed.modelparallel.torch.tp_register_with_module) API to apply tensor parallelism.

**Note**  
To use tensor parallelism for training Hugging Face Transformer models, make sure you use Hugging Face Deep Learning Containers for PyTorch that include the SageMaker model parallelism library v1.7.0 or later. For more information, see the [SageMaker model parallelism library release notes](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.html).

## Supported Models Out of the Box
<a name="model-parallel-extended-features-pytorch-hugging-face-out-of-the-box"></a>

For the Hugging Face transformer models supported by the library out of the box, you don't need to manually implement hooks to translate Transformer APIs to `smdistributed` transformer layers. You can activate tensor parallelism by using the context manager [smdistributed.modelparallel.torch.tensor_parallelism()](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch_tensor_parallel.html#smdistributed.modelparallel.torch.tensor_parallelism) and wrapping the model with [smdistributed.modelparallel.torch.DistributedModel()](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.DistributedModel). You don't need to manually register hooks for tensor parallelism using the `smp.tp_register` API.

The `state_dict` translation functions between Hugging Face Transformers and `smdistributed.modelparallel` can be accessed as follows.
+  `smdistributed.modelparallel.torch.nn.huggingface.gpt2.translate_state_dict_to_hf_gpt2(state_dict, max_seq_len=None)`
+  `smdistributed.modelparallel.torch.nn.huggingface.gpt2.translate_hf_state_dict_to_smdistributed_gpt2(state_dict)` 
+  `smdistributed.modelparallel.torch.nn.huggingface.bert.translate_state_dict_to_hf_bert(state_dict, max_seq_len=None)` 
+  `smdistributed.modelparallel.torch.nn.huggingface.bert.translate_hf_state_dict_to_smdistributed_bert(state_dict)` 
+  `smdistributed.modelparallel.torch.nn.huggingface.roberta.translate_state_dict_to_hf_roberta(state_dict, max_seq_len=None)` 
+  `smdistributed.modelparallel.torch.nn.huggingface.roberta.translate_hf_state_dict_to_smdistributed_roberta(state_dict)` 
+ `smdistributed.modelparallel.torch.nn.huggingface.gptj.translate_state_dict_to_hf_gptj(state_dict, max_seq_len=None)` (Available in the SageMaker model parallelism library v1.8.0 and later)
+ `smdistributed.modelparallel.torch.nn.huggingface.gptj.translate_hf_gptj_state_dict_to_smdistributed_gptj(state_dict)` (Available in the SageMaker model parallelism library v1.8.0 and later)
+ `smdistributed.modelparallel.torch.nn.huggingface.gptneo.translate_state_dict_to_hf_gptneo(state_dict, max_seq_len=None)` (Available in the SageMaker model parallelism library v1.10.0 and later)
+ `smdistributed.modelparallel.torch.nn.huggingface.gptneo.translate_hf_state_dict_to_smdistributed_gptneo(state_dict)` (Available in the SageMaker model parallelism library v1.10.0 and later)

**Example usage of the GPT-2 translation function**

Start with wrapping the model as shown in the following code.

```
import smdistributed.modelparallel.torch as smp
from transformers import AutoModelForCausalLM

with smp.tensor_parallelism():
    model = AutoModelForCausalLM.from_config(hf_gpt2_config)

model = smp.DistributedModel(model)
```

Given a `state_dict` from the `DistributedModel` object, you can load the weights into the original Hugging Face GPT-2 model using the `translate_state_dict_to_hf_gpt2` function as shown in the following code.

```
from smdistributed.modelparallel.torch.nn.huggingface.gpt2 \
                                      import translate_state_dict_to_hf_gpt2
max_seq_len = 1024

# [... code block for training ...]

if smp.rdp_rank() == 0:
    state_dict = model.state_dict()
    hf_state_dict = translate_state_dict_to_hf_gpt2(state_dict, max_seq_len)

    # You can now call model.load_state_dict(hf_state_dict) on the original HF model
```

**Example usage of the RoBERTa translation function**

Similarly, given a supported Hugging Face model `state_dict`, you can use the corresponding `translate_hf_state_dict_to_smdistributed_*` function to convert it to a format readable by `smp.DistributedModel`. This can be useful in transfer learning use cases, where a pre-trained model is loaded into an `smp.DistributedModel` for model-parallel fine-tuning:

```
import smdistributed.modelparallel.torch as smp
from smdistributed.modelparallel.torch.nn.huggingface.roberta \
                                      import translate_hf_state_dict_to_smdistributed_roberta
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_config(roberta_config)
model = smp.DistributedModel(model)

pretrained_model = AutoModelForMaskedLM.from_pretrained("roberta-large")
translated_state_dict = \
        translate_hf_state_dict_to_smdistributed_roberta(pretrained_model.state_dict())

# load the translated pretrained weights into the smp.DistributedModel
model.load_state_dict(translated_state_dict)

# start fine-tuning...
```

# Ranking Mechanism when Using a Combination of Pipeline Parallelism and Tensor Parallelism
<a name="model-parallel-extended-features-pytorch-ranking-mechanism"></a>

This section explains how the ranking mechanism of model parallelism works with tensor parallelism. This is extended from the [Ranking Basics](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#ranking-basics) for [Core Features of the SageMaker Model Parallelism Library](model-parallel-core-features.md). With tensor parallelism, the library introduces three types of ranking and process group APIs: `smp.tp_rank()` for tensor parallel rank, `smp.pp_rank()` for pipeline parallel rank, and `smp.rdp_rank()` for reduced-data parallel rank. The corresponding communication process groups are tensor parallel group (`TP_GROUP`), pipeline parallel group (`PP_GROUP`), and reduced-data parallel group (`RDP_GROUP`). These groups are defined as follows:
+ A *tensor parallel group* (`TP_GROUP`) is an evenly divisible subset of the data parallel group, over which tensor parallel distribution of modules takes place. When the degree of pipeline parallelism is 1, `TP_GROUP` is the same as *model parallel group* (`MP_GROUP`). 
+ A *pipeline parallel group* (`PP_GROUP`) is the group of processes over which pipeline parallelism takes place. When the tensor parallelism degree is 1, `PP_GROUP` is the same as `MP_GROUP`. 
+ A *reduced-data parallel group* (`RDP_GROUP`) is a set of processes that hold both the same pipeline parallelism partitions and the same tensor parallel partitions, and perform data parallelism among themselves. It is called the reduced data parallel group because it is a subset of the entire data parallelism group, `DP_GROUP`. For the model parameters that are distributed within the `TP_GROUP`, the gradient `allreduce` operation is performed only within the reduced-data parallel group, while for the parameters that are not distributed, the gradient `allreduce` takes place over the entire `DP_GROUP`. 
+ A *model parallel group* (`MP_GROUP`) refers to a group of processes that collectively store the entire model. It consists of the union of the `PP_GROUP`s of all the ranks that are in the `TP_GROUP` of the current process. When the degree of tensor parallelism is 1, `MP_GROUP` is equivalent to `PP_GROUP`. It is also consistent with the existing definition of `MP_GROUP` from previous `smdistributed` releases. Note that the current `TP_GROUP` is a subset of both the current `DP_GROUP` and the current `MP_GROUP`. 

To learn more about the communication process APIs in the SageMaker model parallelism library, see the [Common API](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_common_api.html#) and the [PyTorch-specific APIs](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html) in the *SageMaker Python SDK documentation*.

![\[Ranking mechanism, parameter distribution, and associated AllReduce operations of tensor parallelism.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/distributed/model-parallel/tensor-parallel-ranking-mechanism.png)


For example, consider process groups for a single node with 8 GPUs, where the degree of tensor parallelism is 2, the degree of pipeline parallelism is 2, and the degree of data parallelism is 4. The upper center part of the preceding figure shows an example of a model with 4 layers. The lower left and lower right parts of the figure illustrate the 4-layer model distributed across 4 GPUs using both pipeline parallelism and tensor parallelism, where tensor parallelism is used for the middle two layers. These two lower figures are simple copies to illustrate different group boundary lines. The partitioned model is replicated for data parallelism across GPUs 0-3 and 4-7. The lower left figure shows the definitions of `MP_GROUP`, `PP_GROUP`, and `TP_GROUP`. The lower right figure shows `RDP_GROUP`, `DP_GROUP`, and `WORLD` over the same set of GPUs. The gradients for the layers and layer slices that have the same color are `allreduce`d together for data parallelism. For example, the first layer (light blue) gets the `allreduce` operations across `DP_GROUP`, whereas the dark orange slice in the second layer only gets the `allreduce` operations within the `RDP_GROUP` of its process. The bold dark red arrows represent tensors that carry the batch for their entire `TP_GROUP`.

```
GPU0: pp_rank 0, tp_rank 0, rdp_rank 0, dp_rank 0, mp_rank 0
GPU1: pp_rank 1, tp_rank 0, rdp_rank 0, dp_rank 0, mp_rank 1
GPU2: pp_rank 0, tp_rank 1, rdp_rank 0, dp_rank 1, mp_rank 2
GPU3: pp_rank 1, tp_rank 1, rdp_rank 0, dp_rank 1, mp_rank 3
GPU4: pp_rank 0, tp_rank 0, rdp_rank 1, dp_rank 2, mp_rank 0
GPU5: pp_rank 1, tp_rank 0, rdp_rank 1, dp_rank 2, mp_rank 1
GPU6: pp_rank 0, tp_rank 1, rdp_rank 1, dp_rank 3, mp_rank 2
GPU7: pp_rank 1, tp_rank 1, rdp_rank 1, dp_rank 3, mp_rank 3
```

In this example, pipeline parallelism occurs across the GPU pairs (0,1); (2,3); (4,5) and (6,7). In addition, data parallelism (`allreduce`) takes place across GPUs 0, 2, 4, 6, and independently over GPUs 1, 3, 5, 7. Tensor parallelism happens over subsets of `DP_GROUP`s, across the GPU pairs (0,2); (1,3); (4,6) and (5,7).
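The rank layout above follows directly from the rank index and the parallelism degrees. The following sketch is illustrative only (the function `assign_ranks` is not part of the library); it reproduces the 8-GPU layout under the placement shown, where pipeline ranks vary fastest, then tensor ranks, then reduced-data ranks:

```python
def assign_ranks(world_size, pp_degree, tp_degree):
    """Reconstruct the example rank layout: pipeline ranks vary fastest,
    then tensor ranks, then reduced-data ranks. Returns a list of
    (pp_rank, tp_rank, rdp_rank, dp_rank, mp_rank) tuples, one per GPU."""
    layout = []
    for r in range(world_size):
        pp_rank = r % pp_degree
        tp_rank = (r // pp_degree) % tp_degree
        rdp_rank = r // (pp_degree * tp_degree)
        dp_rank = r // pp_degree                   # DP spans TP x RDP
        mp_rank = r % (pp_degree * tp_degree)      # MP spans PP x TP
        layout.append((pp_rank, tp_rank, rdp_rank, dp_rank, mp_rank))
    return layout

for gpu, ranks in enumerate(assign_ranks(8, 2, 2)):
    print(f"GPU{gpu}: pp_rank {ranks[0]}, tp_rank {ranks[1]}, "
          f"rdp_rank {ranks[2]}, dp_rank {ranks[3]}, mp_rank {ranks[4]}")
```

Running this prints the same assignments as the listing above, which can help sanity-check a configuration before launching a job.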

For this kind of hybrid pipeline and tensor parallelism, the data parallel degree is still calculated as `data_parallel_degree = number_of_GPUs / pipeline_parallel_degree`. The library then derives the reduced data parallel degree from the relation `reduced_data_parallel_degree * tensor_parallel_degree = data_parallel_degree`.

# Optimizer State Sharding
<a name="model-parallel-extended-features-pytorch-optimizer-state-sharding"></a>

*Optimizer state sharding* is a useful memory-saving technique that shards the optimizer state (the set of weights that describes the state of optimizer) across data parallel device groups. You can use optimizer state sharding whenever you use a stateful optimizer (such as Adam) or an FP16 optimizer (which stores both FP16 and FP32 copies of the parameters).

**Note**  
Optimizer state sharding is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.

## How to Use Optimizer State Sharding
<a name="model-parallel-extended-features-pytorch-optimizer-state-sharding-how-to-use"></a>

You can turn on *optimizer state sharding* by setting `"shard_optimizer_state": True` in the `modelparallel` configuration. 

When this feature is turned on, the library partitions the set of model parameters based on the data parallelism degree. The gradients corresponding to the `i`th partition get reduced only at the `i`th data parallel rank. At the end of the first call to an `smp.step` decorator function, the optimizer wrapped by `smp.DistributedOptimizer` redefines its parameters to be only limited to those parameters corresponding to the partition of the current data parallel rank. The redefined parameters are called *virtual parameters* and share underlying storage with the original parameters. During the first call to `optimizer.step`, the optimizer states are created based on these redefined parameters, which are sharded because of the original partition. After the optimizer update, the AllGather operation (as part of the `optimizer.step` call) runs across the data parallel ranks to achieve consistent parameter states.
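The mechanics can be sketched conceptually as follows. This is not the library implementation; it is only an illustration that assumes a plain SGD update and contiguous partitions, showing how each data parallel rank updates only its own parameter partition before an allgather restores a full, consistent copy on every rank:

```python
def sharded_optimizer_step(params, grads, lr, dp_degree):
    """Conceptual sketch of optimizer state sharding (illustrative only):
    each data parallel rank applies the optimizer update to its own
    contiguous partition of the parameters, then an allgather restores a
    full parameter copy on every rank."""
    n = len(params)
    updated_shards = []
    for rank in range(dp_degree):
        # contiguous partition owned by this data parallel rank
        lo = rank * n // dp_degree
        hi = (rank + 1) * n // dp_degree
        updated_shards.append(
            [p - lr * g for p, g in zip(params[lo:hi], grads[lo:hi])]
        )
    # "allgather": concatenate the shards so every rank ends with full params
    return [p for shard in updated_shards for p in shard]
```

In the real library, only the optimizer *state* (for example, Adam moments) is sharded this way; the point of the sketch is that each rank's update touches only its own partition.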

**Tip**  
Optimizer state sharding can be useful when the degree of data parallelism is greater than 1 and the model has more than a billion parameters.   
The degree of data parallelism is calculated by `(processes_per_host * instance_count / pipeline_parallel_degree)`, and the `smp.dp_size()` function handles the sizing in the background.

**Configure a SageMaker PyTorch estimator**

```
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,               # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}

smp_options = {
    "enabled":True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,    # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,      # tp over 2 devices
        "ddp": True,
        "shard_optimizer_state": True
    }
}
```
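As a quick arithmetic check of this configuration (assuming a single training instance, which is an assumption not stated in the configuration itself), the degrees work out as follows:

```python
# Degrees implied by the configuration above, assuming instance_count = 1.
processes_per_host = 8
instance_count = 1
pipeline_parallel_degree = 2
tensor_parallel_degree = 2

# From the Tip: dp = processes_per_host * instance_count / pp
data_parallel_degree = (processes_per_host * instance_count
                        // pipeline_parallel_degree)            # 4

# From rdp * tp = dp
reduced_data_parallel_degree = (data_parallel_degree
                                // tensor_parallel_degree)      # 2

print(data_parallel_degree, reduced_data_parallel_degree)
```

With `"shard_optimizer_state": True`, the optimizer state is therefore partitioned across the data parallel degree of 4.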

**Adapt your PyTorch training script**

See [Adapt your PyTorch training script](model-parallel-extended-features-pytorch-tensor-parallelism-examples.md#model-parallel-extended-features-pytorch-tensor-and-pipeline-parallelism-script) in the *Tensor parallelism combined with pipeline parallelism* section. There’s no additional modification required for the script.

# Activation Checkpointing
<a name="model-parallel-extended-features-pytorch-activation-checkpointing"></a>

*Activation checkpointing* (or *gradient checkpointing*) is a technique to reduce memory usage by clearing activations of certain layers and recomputing them during a backward pass. Effectively, this trades extra computation time for reduced memory usage. If a module is checkpointed, at the end of a forward pass, the inputs to and outputs from the module stay in memory. Any intermediate tensors that would have been part of the computation inside that module are freed up during the forward pass. During the backward pass of checkpointed modules, these tensors are recomputed. At this point, the layers beyond this checkpointed module have finished their backward pass, so the peak memory usage with checkpointing can be lower.

**Note**  
This feature is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.

## How to Use Activation Checkpointing
<a name="model-parallel-extended-for-pytorch-activation-checkpointing-how-to-use"></a>

With `smdistributed.modelparallel`, you can use activation checkpointing at the granularity of a module. For all `torch.nn` modules except `torch.nn.Sequential`, you can only checkpoint a module tree if it lies within one partition from the perspective of pipeline parallelism. In the case of the `torch.nn.Sequential` module, each module tree inside the sequential module must lie completely within one partition for activation checkpointing to work. When you use manual partitioning, be aware of these restrictions.

When you use [automated model partitioning](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html#model-parallel-automated-model-splitting), you can find the partitioning assignment logs starting with `Partition assignments:` in the training job logs. If a module is partitioned across multiple ranks (for example, with one descendant on one rank and another descendant on a different rank), the library ignores the attempt to checkpoint the module and raises a warning message that the module won't be checkpointed.

**Note**  
The SageMaker model parallelism library supports both overlapping and non-overlapping `allreduce` operations in combination with checkpointing. 

**Note**  
PyTorch’s native checkpointing API is not compatible with `smdistributed.modelparallel`.

**Example 1:** The following sample code shows how to use activation checkpointing when you have a model definition in your script.

```
import torch
import torch.nn as nn
import torch.nn.functional as F

from smdistributed.modelparallel.torch.patches.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        # This call of fc1 will be checkpointed
        x = checkpoint(self.fc1, x)
        x = self.fc2(x)
        return F.log_softmax(x, 1)
```

**Example 2:** The following sample code shows how to use activation checkpointing when you have a sequential model in your script.

```
import torch.nn as nn
import torch.nn.functional as F

from smdistributed.modelparallel.torch.patches.checkpoint import checkpoint_sequential

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.seq = nn.Sequential(
            nn.Conv2d(1,20,5),
            nn.ReLU(),
            nn.Conv2d(20,64,5),
            nn.ReLU()
        )

    def forward(self, x):
        # This call of self.seq will be checkpointed
        x = checkpoint_sequential(self.seq, x)
        return F.log_softmax(x, 1)
```

**Example 3:** The following sample code shows how to use activation checkpointing when you import a prebuilt model from a library, such as PyTorch and Hugging Face Transformers. Whether you checkpoint sequential modules or not, do the following: 

1. Wrap the model by `smp.DistributedModel()`.

1. Define an object for sequential layers.

1. Pass the sequential layer object to `smp.set_activation_checkpointing()`.

```
import smdistributed.modelparallel.torch as smp
from transformers import AutoModelForCausalLM

smp.init()
# config is a transformers model configuration object defined elsewhere
model = AutoModelForCausalLM.from_config(config)
model = smp.DistributedModel(model)

# Call the set_activation_checkpointing API
transformer_layers = model.module.module.module.transformer.seq_layers
smp.set_activation_checkpointing(
    transformer_layers, pack_args_as_tuple=True, strategy='each')
```

# Activation Offloading
<a name="model-parallel-extended-features-pytorch-activation-offloading"></a>

When activation checkpointing and pipeline parallelism are turned on and the number of microbatches is greater than one, *activation offloading* is an additional feature that can further reduce memory usage. Activation offloading asynchronously moves the checkpointed activations for microbatches that are not currently running to the CPU. Right before the GPU needs the activations for a microbatch's backward pass, this functionality prefetches the offloaded activations back from the CPU.

**Note**  
This feature is available for PyTorch in the SageMaker model parallelism library v1.6.0 and later.

## How to Use Activation Offloading
<a name="model-parallel-extended-for-pytorch-activation-offloading"></a>

Use activation offloading to reduce memory usage when **the number of microbatches is greater than 1, and activation checkpointing is turned on** (see [Activation Checkpointing](model-parallel-extended-features-pytorch-activation-checkpointing.md)). When activation checkpointing is not used, activation offloading has no effect. When it is used with only one microbatch, it does not save memory.

To use activation offloading, set `"offload_activations": True` in the `modelparallel` configuration.

Activation offloading moves the checkpointed activations in `nn.Sequential` modules to CPU asynchronously. The data transfer over the PCIe link overlaps with GPU computation. The offloading happens immediately, as soon as the forward pass for a particular checkpointed layer is computed. The activations are loaded back to the GPU shortly before they are needed for the backward pass of a particular microbatch. The CPU-GPU transfer similarly overlaps with computation. 

To adjust how early the activations are loaded back into the GPU, you can use the configuration parameter `"activation_loading_horizon"` (the default is 4; the value must be an `int` greater than 0). A larger activation loading horizon causes the activations to be loaded back to the GPU earlier. If the horizon is too large, the memory-saving impact of activation offloading might be diminished. If the horizon is too small, the activations may not be loaded back in time, reducing the amount of overlap and degrading performance.

**Tip**  
Activation offloading can be useful for large models with over a hundred billion parameters.

**Configure a SageMaker PyTorch estimator**

```
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 8,               # 8 processes
    "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none "
}

smp_options = {
    "enabled":True,
    "parameters": {
        "microbatches": 4,
        "pipeline_parallel_degree": 2,    # alias for "partitions"
        "placement_strategy": "cluster",
        "tensor_parallel_degree": 2,      # tp over 2 devices
        "ddp": True,
        "offload_activations": True,
        "activation_loading_horizon": 4   # optional. default is 4.
    }
}
```

# FP16 Training with Model Parallelism
<a name="model-parallel-extended-features-pytorch-fp16"></a>

For FP16 training, apply the following modifications to your training script and estimator.

**Note**  
This feature is available for PyTorch in the SageMaker model parallelism library v1.10.0 and later.

**Adapt your PyTorch training script**

1. Wrap your model using the [smdistributed.modelparallel.torch.model_creation()](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.model_creation) context manager.

   ```
   # fp16_training_script.py
   
   import torch
   import smdistributed.modelparallel.torch as smp
   
   with smp.model_creation(
       dtype=torch.float16 if args.fp16 else torch.get_default_dtype()
   ):
       model = ...
   ```
**Tip**  
If you are using tensor parallelism, add `tensor_parallelism=smp.tp_size() > 1` to the `smp.model_creation` context manager. With this expression, the model creation context activates tensor parallelism only when the tensor parallel degree is greater than 1.  

   ```
   with smp.model_creation(
       ... ,
       tensor_parallelism=smp.tp_size() > 1
   ):
       model = ...
   ```

1. When you wrap the optimizer with `smdistributed.modelparallel.torch.DistributedOptimizer`, set either the `static_loss_scale` or `dynamic_loss_scale` argument. By default, `static_loss_scale` is set to `1.0`, and `dynamic_loss_scale` is set to `False`. If you set `dynamic_loss_scale=True`, you can feed dynamic loss scaling options as a dictionary through the `dynamic_loss_args` argument. In most cases, we recommend that you use dynamic loss scaling with the default options. For more information, options, and examples of the optimizer wrapper function, see the [smdistributed.modelparallel.torch.DistributedOptimizer](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed-modelparallel-torch-distributedoptimizer) API.

   The following code is an example of wrapping an `Adadelta` optimizer object with dynamic loss scaling for FP16 training.

   ```
   optimizer = torch.optim.Adadelta(...)
   optimizer = smp.DistributedOptimizer(
       optimizer,
       static_loss_scale=None,
       dynamic_loss_scale=True,
       dynamic_loss_args={
           "scale_window": 1000,
           "min_scale": 1,
           "delayed_shift": 2
       }
   )
   ```

**Configure a SageMaker PyTorch estimator**

Add the FP16 parameter (`"fp16"`) to the distribution configuration for model parallelism when creating a SageMaker PyTorch estimator object. For a complete list of the configuration parameters for model parallelism, see [Parameters for `smdistributed`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#parameters-for-smdistributed).

```
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters":  {
        "microbatches":  4,
        "pipeline_parallel_degree":  2,
        "tensor_parallel_degree":  2,
        ...,

        "fp16": True
    }
}

fp16_estimator = PyTorch(
    entry_point="fp16_training_script.py", # Specify your train script
    ...,

    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {...}
    }
)

fp16_estimator.fit(...)
```

When FP16 training starts, the model and the optimizer are wrapped by `FP16_Module` and `FP16_Optimizer` respectively, which are modified `smdistributed` versions of the [Apex utils](https://nvidia.github.io/apex/fp16_utils.html#apex-fp16-utils). `FP16_Module` converts the model to FP16 dtype and deals with the forward pass in FP16.

**Tip**  
You can apply gradient clipping by calling `clip_master_grads` before `optimizer.step`.  

```
optimizer.clip_master_grads(max_norm)     # max_norm(float or int): max norm of the gradients
```

**Tip**  
When using `torch.optim.lr_scheduler` and FP16 training, you need to pass `optimizer.optimizer` to the LR scheduler rather than the optimizer. See the following example code.  

```
from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(
    optimizer.optimizer if smp.state.cfg.fp16 else optimizer,
    step_size=1,
    gamma=args.gamma
)
```

# Support for FlashAttention
<a name="model-parallel-attention-head-size-for-flash-attention"></a>

Support for FlashAttention is a library feature that applies only to the *distributed transformer* model, which is a Transformer model wrapped by [smdistributed.modelparallel.torch.DistributedModel](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed-modelparallel-torch-distributedmodel) for model-parallel training. This feature is also compatible with [Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism.md). 

The [FlashAttention](https://github.com/HazyResearch/flash-attention) library only supports models whose `attention_head_size` is a multiple of 8 and less than 128. Therefore, when you train a distributed transformer, adjust your model parameters so that the attention head size complies with these requirements; otherwise, FlashAttention does not work properly. For more information, see also [Installation and features](https://github.com/HazyResearch/flash-attention#installation-and-features) in the *FlashAttention GitHub repository*.

For example, assume that you configure a Transformer model with `hidden_width=864` and `num_heads=48`. The head size of FlashAttention is calculated as `attention_head_size = hidden_width / num_heads = 864 / 48 = 18`. To enable FlashAttention, you need to adjust the `num_heads` parameter to `54`, so that `attention_head_size = hidden_width / num_heads = 864 / 54 = 16`, which is a multiple of 8.
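The head-size arithmetic above can be captured in a small helper. The function name `flash_attention_head_ok` is hypothetical, used only for illustration of the rule stated in this section:

```python
def flash_attention_head_ok(hidden_width, num_heads):
    """Check the FlashAttention requirement described above:
    attention_head_size = hidden_width / num_heads must be an integer
    that is a multiple of 8 and less than 128.
    (Hypothetical helper, not part of the library API.)"""
    if hidden_width % num_heads != 0:
        return False
    head_size = hidden_width // num_heads
    return head_size % 8 == 0 and head_size < 128
```

For the configuration in this example, `flash_attention_head_ok(864, 48)` is `False` (head size 18), while `flash_attention_head_ok(864, 54)` is `True` (head size 16).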

# Run a SageMaker Distributed Training Job with Model Parallelism
<a name="model-parallel-use-api"></a>

Learn how to run a model-parallel training job of your own training script using the SageMaker Python SDK with the SageMaker model parallelism library.

There are three use-case scenarios for running a SageMaker training job.

1. You can use one of the pre-built AWS Deep Learning Containers for TensorFlow and PyTorch. This option is recommended if this is your first time using the model parallelism library. To find a tutorial for how to run a SageMaker model parallel training job, see the example notebooks at [PyTorch training with Amazon SageMaker AI's model parallelism library](https://github.com/aws/amazon-sagemaker-examples/tree/main/training/distributed_training/pytorch/model_parallel).

1. You can extend the pre-built containers to handle any additional functional requirements for your algorithm or model that the pre-built SageMaker Docker image doesn't support. To find an example of how you can extend a pre-built container, see [Extend a Pre-built Container](prebuilt-containers-extend.md).

1. You can adapt your own Docker container to work with SageMaker AI using the [SageMaker Training toolkit](https://github.com/aws/sagemaker-training-toolkit). For an example, see [Adapting Your Own Training Container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html).

For options 2 and 3 in the preceding list, refer to [Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel Library](model-parallel-sm-sdk.md#model-parallel-customize-container) to learn how to install the model parallel library in an extended or customized Docker container. 

In all cases, you launch your training job by configuring a SageMaker `TensorFlow` or `PyTorch` estimator to activate the library. To learn more, see the following topics.

**Topics**
+ [Step 1: Modify Your Own Training Script Using SageMaker's Distributed Model Parallel Library](model-parallel-customize-training-script.md)
+ [Step 2: Launch a Training Job Using the SageMaker Python SDK](model-parallel-sm-sdk.md)

# Step 1: Modify Your Own Training Script Using SageMaker's Distributed Model Parallel Library
<a name="model-parallel-customize-training-script"></a>

Use this section to learn how to customize your training script to use the core features of the Amazon SageMaker AI model parallelism library. To use the library-specific API functions and parameters, we recommend you use this documentation alongside the [SageMaker model parallel library APIs](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html) in the *SageMaker Python SDK documentation*.

The training script examples provided in these sections are simplified and designed to highlight the required changes you must make to use the library. For end-to-end, runnable notebook examples that demonstrate how to use a TensorFlow or PyTorch training script with the SageMaker model parallelism library, see [Amazon SageMaker AI model parallelism library v2 examples](distributed-model-parallel-v2-examples.md).

**Topics**
+ [Split the model of your training script using the SageMaker model parallelism library](#model-parallel-model-splitting-using-smp-lib)
+ [Modify a TensorFlow training script](model-parallel-customize-training-script-tf.md)
+ [Modify a PyTorch Training Script](model-parallel-customize-training-script-pt.md)

## Split the model of your training script using the SageMaker model parallelism library
<a name="model-parallel-model-splitting-using-smp-lib"></a>

There are two ways to modify your training script to set up model splitting: automated splitting or manual splitting.

### Automated model splitting
<a name="model-parallel-automated-model-splitting"></a>

When you use SageMaker's model parallelism library, you can take advantage of *automated model splitting*, also referred to as *automated model partitioning*. The library uses a partitioning algorithm that balances memory, minimizes communication between devices, and optimizes performance. You can configure the automated partitioning algorithm to optimize for speed or memory. 

Alternatively, you can use manual model splitting. We recommend automated model splitting, unless you are very familiar with the model architecture and have a good idea of how to efficiently partition your model.

#### How it works
<a name="model-parallel-automated-model-splitting-how-it-works"></a>

Auto-partitioning occurs during the first training step, when the `smp.step`-decorated function is first called. During this call, the library first constructs a version of the model on the CPU RAM (to avoid GPU memory limitations), and then analyzes the model graph and makes a partitioning decision. Based on this decision, each model partition is loaded on a GPU, and only then the first step is executed. Because of these analysis and partitioning steps, the first training step might take longer. 

In either framework, the library manages the communication between devices through its own backend, which is optimized for AWS infrastructure.

The auto-partition design adapts to the characteristics of each framework, and the library partitions at the granularity level that is most natural for that framework. For instance, in TensorFlow, each specific operation can be assigned to a different device, whereas in PyTorch, the assignment is done at the module level, where each module consists of multiple operations. The following sections review the specifics of the design in each framework.

##### Automated model splitting with PyTorch
<a name="model-parallel-auto-model-split-pt"></a>

During the first training step, the model parallelism library internally runs a tracing step that is meant to construct the model graph and determine the tensor and parameter shapes. After this tracing step, the library constructs a tree, which consists of the nested `nn.Module` objects in the model, as well as additional data gathered from tracing, such as the amount of stored `nn.Parameters`, and execution time for each `nn.Module`. 

Next, the library traverses this tree from the root and runs a partitioning algorithm that assigns each `nn.Module` to a device, which balances computational load (measured by module execution time) and memory use (measured by the total stored `nn.Parameter` size and activations). If multiple `nn.Modules` share the same `nn.Parameter`, then these modules are placed on the same device to avoid maintaining multiple versions of the same parameter. Once the partitioning decision is made, the assigned modules and weights are loaded to their devices.
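As a rough sketch of this balancing idea (not the library's actual algorithm, which also weighs memory use and the module tree structure), a greedy assignment that keeps modules sharing a parameter on one device might look like the following. The function `partition_modules` and its inputs are hypothetical:

```python
def partition_modules(modules, num_devices):
    """Toy illustration of the balancing idea described above: greedily
    assign each module to the least-loaded device, while keeping modules
    that share a parameter on the same device.

    `modules` is a list of (name, cost, shared_param_id_or_None) tuples,
    where cost stands in for measured execution time."""
    load = [0.0] * num_devices
    placement = {}
    param_device = {}
    for name, cost, shared in modules:
        if shared is not None and shared in param_device:
            # co-locate modules that share a parameter
            dev = param_device[shared]
        else:
            dev = min(range(num_devices), key=lambda d: load[d])
        load[dev] += cost
        placement[name] = dev
        if shared is not None:
            param_device[shared] = dev
    return placement
```

For example, with modules `a` and `c` independent and `b` and `d` sharing a parameter `w`, `b` and `d` always land on the same device regardless of load balance.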

For instructions on how to register the `smp.step` decorator to your PyTorch training script, see [Automated splitting with PyTorch](model-parallel-customize-training-script-pt.md#model-parallel-customize-training-script-pt-16).

##### Automated model splitting with TensorFlow
<a name="model-parallel-auto-model-split-tf"></a>

The model parallelism library analyzes the sizes of the trainable variables and the graph structure, and internally uses a graph partitioning algorithm. This algorithm comes up with a device assignment for each operation, with the objective of minimizing the amount of communication needed across devices, subject to two constraints: 
+ Balancing the number of variables stored in each device
+ Balancing the number of operations executed in each device

If you specify `speed` for `optimize` (in the model parallelism parameters in the Python SDK), the library tries to balance the number of operations and `tf.Variable` objects in each device. Otherwise, it tries to balance the total size of `tf.Variables`.

Once the partitioning decision is made, the library creates a serialized representation of the subgraph that each device needs to execute and imports them onto each device. While partitioning, the library places operations that consume the same `tf.Variable` and operations that are part of the same Keras layer onto the same device. It also respects the colocation constraints imposed by TensorFlow. This means that, for example, if there are two Keras layers that share a `tf.Variable`, then all operations that are part of these layers are placed on a single device.

For instructions on how to register the `smp.step` decorator to your TensorFlow training script, see [Automated splitting with TensorFlow](model-parallel-customize-training-script-tf.md#model-parallel-customize-training-script-tf-23).

##### Comparison of automated model splitting between frameworks
<a name="model-parallel-auto-model-split-comparison"></a>

In TensorFlow, the fundamental unit of computation is a `tf.Operation`, and TensorFlow represents the model as a directed acyclic graph (DAG) of `tf.Operation`s. The model parallelism library therefore partitions this DAG so that each node goes to one device. Crucially, `tf.Operation` objects are sufficiently rich with customizable attributes, and they are universal in the sense that every model is guaranteed to consist of a graph of such objects. 

PyTorch, on the other hand, does not have an equivalent notion of an operation that is sufficiently rich and universal. The closest unit of computation in PyTorch that has these characteristics is an `nn.Module`, which is at a much higher granularity level, and this is why the library does its partitioning at the module level in PyTorch.

### Manual Model Splitting
<a name="model-parallel-manual-model-splitting"></a>

If you want to manually specify how to partition your model across devices, use the `smp.partition` context manager. For instructions on how to set the context manager for manual partitioning, see the following pages.
+ [Manual splitting with TensorFlow](model-parallel-customize-training-script-tf.md#model-parallel-customize-training-script-tf-manual)
+ [Manual splitting with PyTorch](model-parallel-customize-training-script-pt.md#model-parallel-customize-training-script-pt-16-hvd)

To use this option, in Step 2 you'll need to set `auto_partition` to `False` and define a `default_partition` in the framework estimator class of the SageMaker Python SDK. Any operation that is not explicitly placed on a partition through the `smp.partition` context manager runs on the `default_partition`. In this case, the automated splitting logic is bypassed, and each operation is placed based on your specification. Based on the resulting graph structure, the model parallelism library automatically creates a pipelined execution schedule.
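A sketch of what those two settings look like in the model parallelism parameters of the estimator configuration (the `partitions` value is illustrative):

```python
# Model parallelism options for manual partitioning; passed to the
# framework estimator's `distribution` argument in Step 2.
smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,           # illustrative
        "auto_partition": False,   # bypass the automated splitting logic
        "default_partition": 0,    # operations not placed via smp.partition run here
    },
}
```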

# Modify a TensorFlow training script
<a name="model-parallel-customize-training-script-tf"></a>

In this section, you learn how to modify TensorFlow training scripts to configure the SageMaker model parallelism library for auto-partitioning and manual partitioning. This selection of examples also includes an example integrated with Horovod for hybrid model and data parallelism.

**Note**  
To find which TensorFlow versions are supported by the library, see [Supported Frameworks and AWS Regions](distributed-model-parallel-support.md).

The required modifications you must make to your training script to use the library are listed in [Automated splitting with TensorFlow](#model-parallel-customize-training-script-tf-23).

To learn how to modify your training script to use hybrid model and data parallelism with Horovod, see [Automated splitting with TensorFlow and Horovod for hybrid model and data parallelism](#model-parallel-customize-training-script-tf-2.3).

If you want to use manual partitioning, also review [Manual splitting with TensorFlow](#model-parallel-customize-training-script-tf-manual). 

The following topics show examples of training scripts that you can use to configure the SageMaker model parallelism library for auto-partitioning and manual partitioning of TensorFlow models.

**Note**  
Auto-partitioning is enabled by default. Unless otherwise specified, the example scripts use auto-partitioning.

**Topics**
+ [Automated splitting with TensorFlow](#model-parallel-customize-training-script-tf-23)
+ [Automated splitting with TensorFlow and Horovod for hybrid model and data parallelism](#model-parallel-customize-training-script-tf-2.3)
+ [Manual splitting with TensorFlow](#model-parallel-customize-training-script-tf-manual)
+ [Unsupported framework features](#model-parallel-tf-unsupported-features)

## Automated splitting with TensorFlow
<a name="model-parallel-customize-training-script-tf-23"></a>

The following training script changes are required to run a TensorFlow model with SageMaker's model parallelism library:

1. Import and initialize the library with [`smp.init`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#smp.init).

1. Define a Keras model by inheriting from [`smp.DistributedModel`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_tensorflow.html) instead of the Keras Model class. Return the model outputs from the `call` method of the `smp.DistributedModel` object. Be mindful that any tensors returned from the `call` method are broadcast across model-parallel devices, incurring communication overhead, so tensors that are not needed outside the `call` method (such as intermediate activations) should not be returned.

1. Set `drop_remainder=True` in the `tf.data.Dataset.batch()` method. This ensures that the batch size is always divisible by the number of microbatches.

1. Seed the random operations in the data pipeline using `smp.dp_rank()` (for example, `shuffle(ds, seed=smp.dp_rank())`) to ensure consistency of data samples across GPUs that hold different model partitions.

1. Put the forward and backward logic in a step function and decorate it with `smp.step`.

1. Perform post-processing on the outputs across microbatches using [`StepOutput`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#StepOutput) methods such as `reduce_mean`. The [`smp.step`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#smp.step) function must have a return value that depends on the output of `smp.DistributedModel`.

1. If there is an evaluation step, similarly place the forward logic inside an `smp.step`-decorated function and post-process the outputs using [`StepOutput` API](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#StepOutput).

To learn more about the SageMaker model parallelism library API, see the [API documentation](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html).

The following Python script is an example of a training script after the changes are made.

```
import tensorflow as tf

# smdistributed: Import TF2.x API
import smdistributed.modelparallel.tensorflow as smp

# smdistributed: Initialize
smp.init()

# Download and load MNIST dataset.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(
    "MNIST-data-%d" % smp.rank()
)
x_train, x_test = x_train / 255.0, x_test / 255.0

# Add a channels dimension
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]

# smdistributed: If needed, seed the shuffle with smp.dp_rank(), and drop_remainder
# in batching to make sure batch size is always divisible by number of microbatches
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(10000, seed=smp.dp_rank())
    .batch(256, drop_remainder=True)
)

# smdistributed: Define smp.DistributedModel the same way as Keras sub-classing API 
class MyModel(smp.DistributedModel):
    def __init__(self):
        super(MyModel, self).__init__()
        # define layers

    def call(self, x, training=None):
        # define forward pass and return the model output

model = MyModel()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")

# smdistributed: Define smp.step. Return any tensors needed outside
@smp.step
def get_grads(images, labels):
    predictions = model(images, training=True)
    loss = loss_object(labels, predictions)

    grads = optimizer.get_gradients(loss, model.trainable_variables)
    return grads, loss, predictions


@tf.function
def train_step(images, labels):
    gradients, loss, predictions = get_grads(images, labels)

    # smdistributed: Accumulate the gradients across microbatches
    gradients = [g.accumulate() for g in gradients]
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # smdistributed: Merge predictions and average losses across microbatches
    train_accuracy(labels, predictions.merge())
    return loss.reduce_mean()


for epoch in range(5):
    # Reset the metrics at the start of the next epoch
    train_accuracy.reset_states()
    for images, labels in train_ds:
        loss = train_step(images, labels)
    accuracy = train_accuracy.result()
```

If you are done preparing your training script, proceed to [Step 2: Launch a Training Job Using the SageMaker Python SDK](model-parallel-sm-sdk.md). If you want to run a hybrid model and data parallel training job, continue to the next section.

## Automated splitting with TensorFlow and Horovod for hybrid model and data parallelism
<a name="model-parallel-customize-training-script-tf-2.3"></a>

You can use the SageMaker model parallelism library with Horovod for hybrid model and data parallelism. To read more about how the library splits a model for hybrid parallelism, see [Pipeline parallelism (available for PyTorch and TensorFlow)](model-parallel-intro.md#model-parallel-intro-pp).

In this step, we focus on how to modify your training script to adapt the SageMaker model parallelism library.

To properly set up your training script to pick up the hybrid parallelism configuration that you'll set in [Step 2: Launch a Training Job Using the SageMaker Python SDK](model-parallel-sm-sdk.md), use the library's helper functions, `smp.dp_rank()` and `smp.mp_rank()`, which automatically detect the data parallel rank and model parallel rank respectively. 

To find all MPI primitives the library supports, see [MPI Basics](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#mpi-basics) in the SageMaker Python SDK documentation. 

The required changes in the script are:
+ Adding `hvd.allreduce`
+ Broadcasting variables after the first batch, as required by Horovod
+ Seeding shuffling and/or sharding operations in the data pipeline with `smp.dp_rank()`

**Note**  
When you use Horovod, you must not directly call `hvd.init` in your training script. Instead, you'll have to set `"horovod"` to `True` in the SageMaker Python SDK `modelparallel` parameters in [Step 2: Launch a Training Job Using the SageMaker Python SDK](model-parallel-sm-sdk.md). This allows the library to internally initialize Horovod based on the device assignments of model partitions. Calling `hvd.init()` directly in your training script can cause problems.
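A sketch of the corresponding `modelparallel` parameters with Horovod enabled (the other values are illustrative):

```python
# Model parallelism options with Horovod initialization delegated to the
# library. Do not call hvd.init() in the training script itself.
smp_options = {
    "enabled": True,
    "parameters": {
        "partitions": 2,      # illustrative
        "microbatches": 4,    # illustrative
        "horovod": True,      # the library initializes Horovod internally
    },
}
```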

**Note**  
Using the `hvd.DistributedOptimizer` API directly in your training script can degrade training performance and speed, because the API implicitly places the `AllReduce` operation inside `smp.step`. We recommend using the model parallelism library with Horovod by directly calling `hvd.allreduce` after calling `accumulate()` or `reduce_mean()` on the gradients returned from `smp.step`, as shown in the following example.

To learn more about the SageMaker model parallelism library API, see the [API documentation](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html).

```
import tensorflow as tf
import horovod.tensorflow as hvd

# smdistributed: Import TF2.x API 
import smdistributed.modelparallel.tensorflow as smp

# smdistributed: Initialize
smp.init()

# Download and load MNIST dataset.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(
    "MNIST-data-%d" % smp.rank()
)
x_train, x_test = x_train / 255.0, x_test / 255.0

# Add a channels dimension
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]

# smdistributed: Seed the shuffle with smp.dp_rank(), and drop_remainder
# in batching to make sure batch size is always divisible by number of microbatches
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(10000, seed=smp.dp_rank())
    .batch(256, drop_remainder=True)
)

# smdistributed: Define smp.DistributedModel the same way as Keras sub-classing API 
class MyModel(smp.DistributedModel):
    def __init__(self):
        super(MyModel, self).__init__()
        # define layers

    def call(self, x, training=None):
        # define forward pass and return model outputs


model = MyModel()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")

# smdistributed: Define smp.step. Return any tensors needed outside
@smp.step
def get_grads(images, labels):
    predictions = model(images, training=True)
    loss = loss_object(labels, predictions)

    grads = optimizer.get_gradients(loss, model.trainable_variables)
    return grads, loss, predictions


@tf.function
def train_step(images, labels, first_batch):
    gradients, loss, predictions = get_grads(images, labels)

    # smdistributed: Accumulate the gradients across microbatches
    # Horovod: AllReduce the accumulated gradients
    gradients = [hvd.allreduce(g.accumulate()) for g in gradients]
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # Horovod: Broadcast the variables after first batch 
    if first_batch:
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(optimizer.variables(), root_rank=0)

    # smdistributed: Merge predictions across microbatches
    train_accuracy(labels, predictions.merge())
    return loss.reduce_mean()


for epoch in range(5):
    # Reset the metrics at the start of the next epoch
    train_accuracy.reset_states()

    for batch, (images, labels) in enumerate(train_ds):
        loss = train_step(images, labels, tf.constant(batch == 0))
```

## Manual splitting with TensorFlow
<a name="model-parallel-customize-training-script-tf-manual"></a>

Use `smp.partition` context managers to place operations in a specific partition. Any operation not placed in an `smp.partition` context is placed in the `default_partition`. To learn more about the SageMaker model parallelism library API, see the [API documentation](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html).

```
import tensorflow as tf

# smdistributed: Import TF2.x API.
import smdistributed.modelparallel.tensorflow as smp

# smdistributed: Initialize
smp.init()

# Download and load MNIST dataset.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data(
    "MNIST-data-%d" % smp.rank()
)
x_train, x_test = x_train / 255.0, x_test / 255.0

# Add a channels dimension
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]

# smdistributed: If needed, seed the shuffle with smp.dp_rank(), and drop_remainder
# in batching to make sure batch size is always divisible by number of microbatches.
train_ds = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))
    .shuffle(10000, seed=smp.dp_rank())
    .batch(256, drop_remainder=True)
)

# smdistributed: Define smp.DistributedModel the same way as Keras sub-classing API.
class MyModel(smp.DistributedModel):
    def __init__(self):
        super(MyModel, self).__init__()
        # define layers

    def call(self, x):
        with smp.partition(0):
            x = self.layer0(x)
        with smp.partition(1):
            return self.layer1(x)


model = MyModel()

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")

# smdistributed: Define smp.step. Return any tensors needed outside
@smp.step
def get_grads(images, labels):
    predictions = model(images, training=True)
    loss = loss_object(labels, predictions)

    grads = optimizer.get_gradients(loss, model.trainable_variables)
    return grads, loss, predictions


@tf.function
def train_step(images, labels):
    gradients, loss, predictions = get_grads(images, labels)

    # smdistributed: Accumulate the gradients across microbatches
    gradients = [g.accumulate() for g in gradients]
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # smdistributed: Merge predictions and average losses across microbatches
    train_accuracy(labels, predictions.merge())
    return loss.reduce_mean()


for epoch in range(5):
    # Reset the metrics at the start of the next epoch
    train_accuracy.reset_states()
    for images, labels in train_ds:
        loss = train_step(images, labels)
    accuracy = train_accuracy.result()
```

## Unsupported framework features
<a name="model-parallel-tf-unsupported-features"></a>

The following TensorFlow features are not supported by the library:
+ `tf.GradientTape()` is currently not supported. You can use `Optimizer.get_gradients()` or `Optimizer.compute_gradients()` instead to compute gradients.
+ The `tf.train.Checkpoint.restore()` API is currently not supported. For checkpointing, use `smp.CheckpointManager` instead, which provides the same API and functionality. Note that checkpoint restores with `smp.CheckpointManager` should take place after the first step.

# Modify a PyTorch Training Script
<a name="model-parallel-customize-training-script-pt"></a>

In this section, you learn how to modify PyTorch training scripts to configure the SageMaker model parallelism library for auto-partitioning and manual partitioning.

**Note**  
To find which PyTorch versions are supported by the library, see [Supported Frameworks and AWS Regions](distributed-model-parallel-support.md).

**Tip**  
For end-to-end notebook examples that demonstrate how to use a PyTorch training script with the SageMaker model parallelism library, see [Amazon SageMaker AI model parallelism library v1 examples](distributed-model-parallel-examples.md).

Note that auto-partitioning is enabled by default. Unless otherwise specified, the following scripts use auto-partitioning. 

**Topics**
+ [Automated splitting with PyTorch](#model-parallel-customize-training-script-pt-16)
+ [Manual splitting with PyTorch](#model-parallel-customize-training-script-pt-16-hvd)
+ [Considerations](#model-parallel-pt-considerations)
+ [Unsupported framework features](#model-parallel-pt-unsupported-features)

## Automated splitting with PyTorch
<a name="model-parallel-customize-training-script-pt-16"></a>

The following training script changes are required to run a PyTorch training script with SageMaker's model parallelism library:

1. Import and initialize the library with [`smp.init`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#smp.init).

1. Wrap the model with [`smp.DistributedModel`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.html#smp.DistributedModel). Be mindful that any tensors returned from the `forward` method of the underlying `nn.Module` object are broadcast across model-parallel devices, incurring communication overhead, so tensors that are not needed outside the `forward` method (such as intermediate activations) should not be returned.
**Note**  
For FP16 training, you need to use the [smdistributed.modelparallel.torch.model_creation()](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html) context manager to wrap the model. For more information, see [FP16 Training with Model Parallelism](model-parallel-extended-features-pytorch-fp16.md).

1. Wrap the optimizer with [`smp.DistributedOptimizer`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.html#smp.DistributedOptimizer).
**Note**  
For FP16 training, you need to set up static or dynamic loss scaling. For more information, see [FP16 Training with Model Parallelism](model-parallel-extended-features-pytorch-fp16.md).

1. Use the returned `DistributedModel` object instead of a user model.

1. Put the forward and backward logic in a step function and decorate it with [`smp.step`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#smp.step).

1. Restrict each process to its own device through `torch.cuda.set_device(smp.local_rank())`.

1. Move the input tensors to the GPU using the `.to()` API before the `smp.step` call (see example below).

1. Replace `torch.Tensor.backward` and `torch.autograd.backward` with `DistributedModel.backward`.

1. Perform post-processing on the outputs across microbatches using [`StepOutput`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#StepOutput) methods such as `reduce_mean`.

1. If there is an evaluation step, similarly place the forward logic inside an `smp.step`-decorated function and post-process the outputs using [`StepOutput` API](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.html#StepOutput).

1. Set `drop_last=True` in `DataLoader`. Alternatively, manually skip a batch in the training loop if the batch size is not divisible by the number of microbatches.

To learn more about the SageMaker model parallelism library API, see the [API documentation](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html).

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

class GroupedNet(nn.Module):
    def __init__(self):
        super(GroupedNet, self).__init__()
        # define layers

    def forward(self, x):
        # define forward pass and return model outputs


# smdistributed: Define smp.step. Return any tensors needed outside.
@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)
    return output, loss


def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # smdistributed: Move input tensors to the GPU ID used by the current process,
        # based on the set_device call.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        # Return value, loss_mb is a StepOutput object
        _, loss_mb = train_step(model, data, target)

        # smdistributed: Average the loss across microbatches.
        loss = loss_mb.reduce_mean()

        optimizer.step()

# smdistributed: initialize the backend
smp.init()

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Download only on a single process per instance.
# When this is not present, the file is corrupted by multiple processes trying
# to download and extract at the same time
dataset = datasets.MNIST("../data", train=True, download=False)

# smdistributed: Shard the dataset based on data-parallel ranks
if smp.dp_size() > 1:
    partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())}
    dataset = SplitDataset(dataset, partitions=partitions_dict)
    dataset.select(f"{smp.dp_rank()}")

# smdistributed: Set drop_last=True to ensure that batch size is always divisible
# by the number of microbatches
train_loader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)

model = GroupedNet()
optimizer = optim.Adadelta(model.parameters(), lr=4.0)

# smdistributed: Use the DistributedModel container to provide the model
# to be partitioned across different ranks. For the rest of the script,
# the returned DistributedModel object should be used in place of
# the model provided for DistributedModel class instantiation.
model = smp.DistributedModel(model)
optimizer = smp.DistributedOptimizer(optimizer)

train(model, device, train_loader, optimizer)
```

## Manual splitting with PyTorch
<a name="model-parallel-customize-training-script-pt-16-hvd"></a>

Use [`smp.partition`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.html#smp.partition) context managers to place modules on specific devices. Any module not placed in an `smp.partition` context is placed in the `default_partition`. The `default_partition` must be provided if `auto_partition` is set to `False`. Modules created within a specific `smp.partition` context are placed on the corresponding partition.

To learn more about the SageMaker model parallelism library API, see the [API documentation](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel.html).

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchnet.dataset import SplitDataset
from torchvision import datasets

import smdistributed.modelparallel.torch as smp

class GroupedNet(nn.Module):
    def __init__(self):
        super(GroupedNet, self).__init__()
        with smp.partition(0):
            # define child modules on device 0
        with smp.partition(1):
            # define child modules on device 1

    def forward(self, x):
        # define forward pass and return model outputs


# smdistributed: Define smp.step. Return any tensors needed outside.
@smp.step
def train_step(model, data, target):
    output = model(data)
    loss = F.nll_loss(output, target, reduction="mean")
    model.backward(loss)
    return output, loss


def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # smdistributed: Move input tensors to the GPU ID used by the current process,
        # based on the set_device call.
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        # Return value, loss_mb is a StepOutput object
        _, loss_mb = train_step(model, data, target)

        # smdistributed: Average the loss across microbatches.
        loss = loss_mb.reduce_mean()

        optimizer.step()

# smdistributed: initialize the backend
smp.init()

# smdistributed: Set the device to the GPU ID used by the current process.
# Input tensors should be transferred to this device.
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda")

# smdistributed: Download only on a single process per instance.
# When this is not present, the file is corrupted by multiple processes trying
# to download and extract at the same time
dataset = datasets.MNIST("../data", train=True, download=False)

# smdistributed: Shard the dataset based on data-parallel ranks
if smp.dp_size() > 1:
    partitions_dict = {f"{i}": 1 / smp.dp_size() for i in range(smp.dp_size())}
    dataset = SplitDataset(dataset, partitions=partitions_dict)
    dataset.select(f"{smp.dp_rank()}")

# smdistributed: Set drop_last=True to ensure that batch size is always divisible
# by the number of microbatches
train_loader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)

model = GroupedNet()
optimizer = optim.Adadelta(model.parameters(), lr=4.0)

# smdistributed: Use the DistributedModel container to provide the model
# to be partitioned across different ranks. For the rest of the script,
# the returned DistributedModel object should be used in place of
# the model provided for DistributedModel class instantiation.
model = smp.DistributedModel(model)
optimizer = smp.DistributedOptimizer(optimizer)

train(model, device, train_loader, optimizer)
```

## Considerations
<a name="model-parallel-pt-considerations"></a>

When you configure a PyTorch training script using SageMaker's model parallelism library, you should be aware of the following:
+ If you are using an optimization technique that relies on global gradient norms (for example, the gradient norm of the entire model, as in some variants of the LAMB optimizer or in global gradient clipping), you need to gather all the norms across the model partitions for correctness. You can use the library's communication basic data types to do this.
+ All `torch.Tensor` arguments to the forward methods of the `nn.Modules` in your model must be used in the computation of the module output. In other words, the library does not support the case where there is a `torch.Tensor` argument to a module on which the module output does not depend.
+ The argument to the `smp.DistributedModel.backward()` call must depend on all model outputs. In other words, there cannot be an output from the `smp.DistributedModel.forward` call that is not used in the computation of the tensor that is fed into the `smp.DistributedModel.backward` call.
+ If there are `torch.cuda.synchronize()` calls in your code, you might need to call `torch.cuda.set_device(smp.local_rank())` immediately before the synchronize call. Otherwise, unnecessary CUDA contexts might be created on device 0, which needlessly consumes memory.
+ Since the library places `nn.Modules` on different devices, the modules in the model must not depend on any global state that is modified inside `smp.step`. Any state that remains fixed throughout training, or that is modified outside `smp.step` in a way that is visible to all processes, is allowed.
+ You don’t need to move the model to GPU (for example, using `model.to(device)`) when using the library. If you try to move the model to GPU before the model is partitioned (before the first `smp.step` call), the move call is ignored. The library automatically moves the part of the model assigned to a rank to its GPU. Once training with the library starts, don’t move the model to CPU and use it, as it won’t have correct parameters for modules not assigned to the partition held by the process. If you want to retrain a model or use it for inference without the library after it was trained using the model parallelism library, the recommended way is to save the full model using our checkpointing API and load it back to a regular PyTorch Module.
+ If you have a list of modules such that output of one feeds into another, replacing that list with `nn.Sequential` can significantly improve performance.
+ The weight update (`optimizer.step()`) needs to happen outside of `smp.step` because that is when the entire backward pass is done and gradients are ready. When using a hybrid model with model and data parallelism, at this point, AllReduce of gradients is also guaranteed to finish.
+ When using the library in combination with data parallelism, make sure that the number of batches on all data parallel ranks is the same so that AllReduce does not hang waiting for a rank which is not participating in the step.
+ If you launch a training job using an ml.p4d instance type (such as ml.p4d.24xlarge), you must set `num_workers=0` in the data loader. For example, you can define your `DataLoader` as follows:

  ```
  dataloader = torch.utils.data.DataLoader(
              data,
              batch_size=batch_size,
              num_workers=0,
              pin_memory=True,
              drop_last=True,
              shuffle=shuffle,
          )
  ```
+ The inputs to `smp.step` must be the model inputs generated by `DataLoader`. This is because `smp.step` internally splits the input tensors along the batch dimension and pipelines them. This means that passing `DataLoader` itself to the `smp.step` function to generate the model inputs inside does not work. 

  For example, if you define a `DataLoader` as follows:

  ```
  train_loader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)
  ```

  You should access the model inputs generated by `train_loader` and pass those to an `smp.step` decorated function. Do not pass `train_loader` directly to the `smp.step` function.

  ```
  def train(model, device, train_loader, optimizer):
      model.train()
      for batch_idx, (data, target) in enumerate(train_loader):
          ...
          _, loss_mb = train_step(model, data, target)
          ...
  
  @smp.step
  def train_step(model, data, target):
      ...
      return output, loss
  ```
+ The input tensors to `smp.step` must be moved to the current device using the `.to()` API, which must take place after the `torch.cuda.set_device(local_rank())` call.

  For example, you may define the `train` function as follows. This function moves `data` and `target` to the current device using the `.to()` API before using those input tensors to call `train_step`.

  ```
  def train(model, device, train_loader, optimizer):
      model.train()
      for batch_idx, (data, target) in enumerate(train_loader):
          # smdistributed: Move input tensors to the GPU ID used by the current process,
          # based on the set_device call.
          data, target = data.to(device), target.to(device)
          optimizer.zero_grad()
          # Return value, loss_mb is a StepOutput object
          _, loss_mb = train_step(model, data, target)
  
          # smdistributed: Average the loss across microbatches.
          loss = loss_mb.reduce_mean()
  
          optimizer.step()
  ```

  The input tensors to this `smp.step`-decorated function have been moved to the current device in the `train` function above. The model does *not* need to be moved to the current device. The library automatically moves the part of the model assigned to a rank to its GPU.

  ```
  @smp.step
  def train_step(model, data, target):
      output = model(data)
      loss = F.nll_loss(output, target, reduction="mean")
      model.backward(loss)
      return output, loss
  ```
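To build intuition for how `smp.step` handles the inputs passed to it, the following framework-free sketch mimics two behaviors described above: splitting the model inputs along the batch dimension into pipelined microbatches, and averaging per-microbatch losses as `StepOutput.reduce_mean()` does. The function names here are illustrative, not part of the library API.

```python
# Illustrative sketch (NOT the library API): how model inputs can be split
# along the batch dimension into pipelined microbatches, and how
# per-microbatch losses are averaged, similar in spirit to what smp.step
# and StepOutput.reduce_mean() do internally.

def split_into_microbatches(batch, num_microbatches):
    """Split a batch (a list of samples) into equally sized microbatches."""
    if len(batch) % num_microbatches != 0:
        raise ValueError("batch size must be divisible by the microbatch count")
    mb_size = len(batch) // num_microbatches
    return [batch[i * mb_size:(i + 1) * mb_size] for i in range(num_microbatches)]

def reduce_mean(per_microbatch_losses):
    """Average scalar losses across microbatches."""
    return sum(per_microbatch_losses) / len(per_microbatch_losses)

batch = list(range(8))                                     # a batch of 8 samples
print(split_into_microbatches(batch, num_microbatches=4))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(round(reduce_mean([0.5, 0.3, 0.4, 0.2]), 2))         # 0.35
```

This is also why the batch size must be divisible by the number of microbatches configured for the training job.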

## Unsupported framework features
<a name="model-parallel-pt-unsupported-features"></a>

The following PyTorch features are unsupported by SageMaker's model parallelism library:
+ If you use data parallelism with the native [PyTorch DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html), the [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) wrapper module is not supported by the library. The library internally manages integration with PyTorch DDP, including parameter broadcast and gradient AllReduce. When using the library, module buffers are broadcast only once at the start of training. If your model has module buffers that need to be synchronized across data parallel groups at each step, you can do so through the `torch.distributed` API, using the process group obtained from `smp.get_dp_process_group()`.
+ For mixed precision training, the `apex.amp` module is not supported. The recommended way to use the library with automatic mixed-precision is to use `torch.cuda.amp`, with the exception of using `smp.amp.GradScaler` instead of the implementation in torch.
+ `torch.jit.ScriptModules` or `ScriptFunctions` are not supported by `smp.DistributedModel`.
+ `apex`: `FusedLayerNorm`, `FusedAdam`, `FusedLAMB`, and `FusedNovoGrad` from `apex` are not supported. You can use the library implementations of these through the `smp.optimizers` and `smp.nn` APIs instead.

# Step 2: Launch a Training Job Using the SageMaker Python SDK
<a name="model-parallel-sm-sdk"></a>

The SageMaker Python SDK supports managed training of models with ML frameworks such as TensorFlow and PyTorch. To launch a training job using one of these frameworks, you define a SageMaker [TensorFlow estimator](https://sagemaker.readthedocs.io/en/v2.199.0/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator), a SageMaker [PyTorch estimator](https://sagemaker.readthedocs.io/en/v2.199.0/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator), or a SageMaker generic [Estimator](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/estimators.html#sagemaker.estimator.Estimator) to use the modified training script and model parallelism configuration.

**Topics**
+ [Using the SageMaker TensorFlow and PyTorch Estimators](#model-parallel-using-sagemaker-pysdk)
+ [Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel Library](#model-parallel-customize-container)
+ [Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library](#model-parallel-bring-your-own-container)

## Using the SageMaker TensorFlow and PyTorch Estimators
<a name="model-parallel-using-sagemaker-pysdk"></a>

The TensorFlow and PyTorch estimator classes contain the `distribution` parameter, which you can use to specify configuration parameters for using distributed training frameworks. The SageMaker model parallel library internally uses MPI for hybrid data and model parallelism, so you must use the MPI option with the library.

The following template of a TensorFlow or PyTorch estimator shows how to configure the `distribution` parameter for using the SageMaker model parallel library with MPI.

------
#### [ Using the SageMaker TensorFlow estimator ]

```
import sagemaker
from sagemaker.tensorflow import TensorFlow

smp_options = {
    "enabled":True,              # Required
    "parameters": {
        "partitions": 2,         # Required
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "horovod": True,         # Use this for hybrid model and data parallelism
    }
}

mpi_options = {
    "enabled" : True,            # Required
    "processes_per_host" : 8,    # Required
    # "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none"
}

smd_mp_estimator = TensorFlow(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='2.6.3',
    py_version='py38',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

------
#### [ Using the SageMaker PyTorch estimator ]

```
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled":True,
    "parameters": {                        # Required
        "pipeline_parallel_degree": 2,     # Required
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "ddp": True,
    }
}

mpi_options = {
    "enabled" : True,                      # Required
    "processes_per_host" : 8,              # Required
    # "custom_mpi_options" : "--mca btl_vader_single_copy_mechanism none"
}

smd_mp_estimator = PyTorch(
    entry_point="your_training_script.py", # Specify your train script
    source_dir="location_to_your_script",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.p3.16xlarge',
    framework_version='1.13.1',
    py_version='py38',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

------

To enable the library, you need to pass configuration dictionaries to the `"smdistributed"` and `"mpi"` keys through the `distribution` argument of the SageMaker estimator constructors.
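The estimator templates above rely on a specific nesting of these dictionaries. As a minimal sketch (the parameter values are placeholders, not tuning recommendations), the `distribution` argument can be assembled as follows; note that `smp_options` sits one level down, under the `"modelparallel"` key:

```python
# Sketch of the nested distribution dictionary the estimator expects.
# The parameter values are placeholders, not tuning recommendations.
smp_options = {
    "enabled": True,                       # Required
    "parameters": {
        "pipeline_parallel_degree": 2,     # Required for PyTorch ("partitions" before v1.6.0)
        "microbatches": 4,
    },
}

mpi_options = {
    "enabled": True,                       # Required
    "processes_per_host": 8,               # Required
}

# smp_options goes one level down, under "modelparallel".
distribution = {
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options,
}
```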

**Configuration parameters for SageMaker model parallelism**
+ For the `"smdistributed"` key, pass a dictionary with the `"modelparallel"` key and the following inner dictionaries. 
**Note**  
Using `"modelparallel"` and `"dataparallel"` in one training job is not supported. 
  + `"enabled"` – Required. To enable model parallelism, set `"enabled": True`.
  + `"parameters"` – Required. Specify a set of parameters for SageMaker model parallelism.
    + For a complete list of common parameters, see [Parameters for `smdistributed`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#smdistributed-parameters) in the *SageMaker Python SDK documentation*.

      For TensorFlow, see [TensorFlow-specific Parameters](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#tensorflow-specific-parameters).

      For PyTorch, see [PyTorch-specific Parameters](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#pytorch-specific-parameters).
    + `"pipeline_parallel_degree"` (or `"partitions"` in `smdistributed-modelparallel<v1.6.0`) – Required. Among the [parameters for `smdistributed`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#smdistributed-parameters), this parameter specifies the number of model partitions you want to split the model into.
**Important**  
There is a breaking change in the parameter name. The `"pipeline_parallel_degree"` parameter replaces `"partitions"` as of `smdistributed-modelparallel` v1.6.0. For more information, see [Common Parameters](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#common-parameters) for SageMaker model parallelism configuration and [SageMaker Distributed Model Parallel Release Notes](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.html) in the *SageMaker Python SDK documentation*.
+ For the `"mpi"` key, pass a dictionary that contains the following:
  + `"enabled"` – Required. Set `True` to launch the distributed training job with MPI.
  + `"processes_per_host"` – Required. Specify the number of processes MPI should launch on each host. In SageMaker AI, a host is a single Amazon EC2 ML instance. The SageMaker Python SDK maintains a one-to-one mapping between processes and GPUs across model and data parallelism. This means that SageMaker AI schedules each process on a single, separate GPU and no GPU contains more than one process. If you are using PyTorch, you must restrict each process to its own device through `torch.cuda.set_device(smp.local_rank())`. To learn more, see [Automated splitting with PyTorch](model-parallel-customize-training-script-pt.md#model-parallel-customize-training-script-pt-16).
**Important**  
 `processes_per_host` *must* not be greater than the number of GPUs per instance, and typically equals the number of GPUs per instance.
  + `"custom_mpi_options"` (optional) – Use this key to pass any custom MPI options you might need. If you do not pass any MPI custom options to the key, the MPI option is set by default to the following flag.

    ```
    --mca btl_vader_single_copy_mechanism none
    ```
**Note**  
You do not need to explicitly specify this default flag to the key. If you explicitly specify it, your distributed model parallel training job might fail with the following error:  

    ```
    The following MCA parameter has been listed multiple times on the command line: 
    MCA param: btl_vader_single_copy_mechanism MCA parameters can only be listed once 
    on a command line to ensure there is no ambiguity as to its value. 
    Please correct the situation and try again.
    ```
**Tip**  
If you launch a training job using an EFA-enabled instance type, such as `ml.p4d.24xlarge` and `ml.p3dn.24xlarge`, use the following flag for best performance:  

    ```
    -x FI_EFA_USE_DEVICE_RDMA=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1
    ```
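As a sanity check on the `processes_per_host` constraint and the EFA flags above, you can validate the process count against the instance's GPU count before launching. The helper below is a hypothetical illustration, not a SageMaker API; the GPU counts shown are for the instance types mentioned in this section.

```python
# Hypothetical helper: validate processes_per_host against GPUs per instance,
# and attach the EFA flags recommended above for EFA-enabled instance types.
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}
EFA_INSTANCE_TYPES = {"ml.p4d.24xlarge", "ml.p3dn.24xlarge"}
EFA_FLAGS = "-x FI_EFA_USE_DEVICE_RDMA=1 -x FI_PROVIDER=efa -x RDMAV_FORK_SAFE=1"

def build_mpi_options(instance_type, processes_per_host):
    gpus = GPUS_PER_INSTANCE[instance_type]
    if processes_per_host > gpus:
        raise ValueError(
            f"processes_per_host ({processes_per_host}) must not exceed "
            f"the {gpus} GPUs on {instance_type}"
        )
    options = {"enabled": True, "processes_per_host": processes_per_host}
    if instance_type in EFA_INSTANCE_TYPES:
        options["custom_mpi_options"] = EFA_FLAGS
    return options

print(build_mpi_options("ml.p4d.24xlarge", 8))
```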

To launch the training job using the estimator and your SageMaker model parallel configured training script, run the `estimator.fit()` function.

Use the following resources to learn more about using the model parallelism features in the SageMaker Python SDK:
+ [Use TensorFlow with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/v2.199.0/frameworks/tensorflow/using_tf.html)
+ [Use PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/v2.199.0/frameworks/pytorch/using_pytorch.html)
+ If you are new to SageMaker AI, we recommend using a SageMaker notebook instance. To see an example of how to launch a training job from a SageMaker notebook instance, see [Amazon SageMaker AI model parallelism library v2 examples](distributed-model-parallel-v2-examples.md).
+ You can also submit a distributed training job from your machine using the AWS CLI. To set up the AWS CLI on your machine, see [Set up your AWS credentials and Region for development](https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html).

## Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel Library
<a name="model-parallel-customize-container"></a>

To extend a pre-built container and use SageMaker's model parallelism library, you must use one of the available AWS Deep Learning Containers (DLC) images for PyTorch or TensorFlow. The SageMaker model parallelism library is included in the TensorFlow (2.3.0 and later) and PyTorch (1.6.0 and later) DLC images with CUDA (`cuxyz`). For a complete list of DLC images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) in the *AWS Deep Learning Containers GitHub repository*.

**Tip**  
We recommend that you use the image that contains the latest version of TensorFlow or PyTorch to access the most up-to-date version of the SageMaker model parallelism library.

For example, your Dockerfile should contain a `FROM` statement similar to the following:

```
# Use the SageMaker DLC image URI for TensorFlow or PyTorch
FROM aws-dlc-account-id.dkr.ecr.aws-region.amazonaws.com/framework-training:{framework-version-tag}

# Add your dependencies here
RUN ...

ENV PATH="/opt/ml/code:${PATH}"

# this environment variable is used by the SageMaker AI container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
```

Additionally, when you define a PyTorch or TensorFlow estimator, you must specify the `entry_point` for your training script. This should be the same path identified with `ENV SAGEMAKER_SUBMIT_DIRECTORY` in your Dockerfile.

**Tip**  
You must push this Docker container to Amazon Elastic Container Registry (Amazon ECR) and use the image URI (`image_uri`) to define a SageMaker estimator for training. For more information, see [Extend a Pre-built Container](prebuilt-containers-extend.md). 

After you finish hosting the Docker container and retrieving the image URI of the container, create a SageMaker `PyTorch` estimator object as follows. This example assumes that you have already defined `smp_options` and `mpi_options`. 

```
smd_mp_estimator = Estimator(
    entry_point="your_training_script.py",
    role=sagemaker.get_execution_role(),
    instance_type='ml.p3.16xlarge',
    sagemaker_session=sagemaker_session,
    image_uri='your_aws_account_id.dkr.ecr.region.amazonaws.com/name:tag',
    instance_count=1,
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="SMD-MP-demo",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

## Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library
<a name="model-parallel-bring-your-own-container"></a>

To build your own Docker container for training and use the SageMaker model parallel library, you must include the correct dependencies and the binary files of the SageMaker distributed parallel libraries in your Dockerfile. This section provides the minimum set of code blocks you must include to properly prepare a SageMaker training environment and the model parallel library in your own Docker container.

**Note**  
This custom Docker option with the SageMaker model parallel library as a binary is available only for PyTorch.

**To create a Dockerfile with the SageMaker training toolkit and the model parallel library**

1. Start with one of the [NVIDIA CUDA base images](https://hub.docker.com/r/nvidia/cuda).

   ```
   FROM <cuda-cudnn-base-image>
   ```
**Tip**  
The official AWS Deep Learning Container (DLC) images are built from the [NVIDIA CUDA base images](https://hub.docker.com/r/nvidia/cuda). We recommend you look into the [official Dockerfiles of AWS Deep Learning Container for PyTorch](https://github.com/aws/deep-learning-containers/tree/master/pytorch/training/docker) to find which versions of the libraries you need to install and how to configure them. The official Dockerfiles are complete, benchmark tested, and managed by the SageMaker and Deep Learning Container service teams. In the provided link, choose the PyTorch version you use, choose the CUDA (`cuxyz`) folder, and choose the Dockerfile ending with `.gpu` or `.sagemaker.gpu`.

1. To set up a distributed training environment, you need to install software for communication and network devices, such as [Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html), [NVIDIA Collective Communications Library (NCCL)](https://developer.nvidia.com/nccl), and [Open MPI](https://www.open-mpi.org/). Depending on the PyTorch and CUDA versions you choose, you must install compatible versions of the libraries.
**Important**  
Because the SageMaker model parallel library requires the SageMaker data parallel library in the subsequent steps, we highly recommend that you follow the instructions at [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md) to properly set up a SageMaker training environment for distributed training.

   For more information about setting up EFA with NCCL and Open MPI, see [Get started with EFA and MPI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html) and [Get started with EFA and NCCL](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl.html).

1. Add the following arguments to specify the URLs of the SageMaker distributed training packages for PyTorch. The SageMaker model parallel library requires the SageMaker data parallel library to use cross-node Remote Direct Memory Access (RDMA).

   ```
   ARG SMD_MODEL_PARALLEL_URL=https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-02-21-19-26/smdistributed_modelparallel-1.7.0-cp38-cp38-linux_x86_64.whl
   ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.10.2/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
   ```

1. Install dependencies that the SageMaker model parallel library requires.

   1. Install the [METIS](http://glaros.dtc.umn.edu/gkhome/metis/metis/overview) library.

      ```
      ARG METIS=metis-5.1.0
      
      RUN rm /etc/apt/sources.list.d/* \
        && wget -nv http://glaros.dtc.umn.edu/gkhome/fetch/sw/metis/${METIS}.tar.gz \
        && gunzip -f ${METIS}.tar.gz \
        && tar -xvf ${METIS}.tar \
        && cd ${METIS} \
        && apt-get update \
        && make config shared=1 \
        && make install \
        && cd .. \
        && rm -rf ${METIS}.tar* \
        && rm -rf ${METIS} \
        && rm -rf /var/lib/apt/lists/* \
        && apt-get clean
      ```

   1. Install the [RAPIDS Memory Manager library](https://github.com/rapidsai/rmm#rmm-rapids-memory-manager). This requires [CMake](https://cmake.org/) 3.14 or later.

      ```
      ARG RMM_VERSION=0.15.0
      
      RUN  wget -nv https://github.com/rapidsai/rmm/archive/v${RMM_VERSION}.tar.gz \
        && tar -xvf v${RMM_VERSION}.tar.gz \
        && cd rmm-${RMM_VERSION} \
        && INSTALL_PREFIX=/usr/local ./build.sh librmm \
        && cd .. \
        && rm -rf v${RMM_VERSION}.tar* \
        && rm -rf rmm-${RMM_VERSION}
      ```

1. Install the SageMaker model parallel library.

   ```
   RUN pip install --no-cache-dir -U ${SMD_MODEL_PARALLEL_URL}
   ```

1. Install the SageMaker data parallel library.

   ```
   RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
   ```

1. Install the [sagemaker-training toolkit](https://github.com/aws/sagemaker-training-toolkit). The toolkit contains the common functionality that's necessary to create a container compatible with the SageMaker training platform and the SageMaker Python SDK.

   ```
   RUN pip install sagemaker-training
   ```

1. After you finish creating the Dockerfile, see [Adapting Your Own Training Container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html) to learn how to build the Docker container and host it in Amazon ECR.

**Tip**  
For more general information about creating a custom Dockerfile for training in SageMaker AI, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).

# Checkpointing and Fine-Tuning a Model with Model Parallelism
<a name="distributed-model-parallel-checkpointing-and-finetuning"></a>

The SageMaker model parallelism library provides checkpointing APIs to save the model state and the optimizer state split by the various model parallelism strategies, and to load checkpoints for continuous training from where you want to restart training and fine-tune. The APIs also support options to save the model and optimizer states partially or fully.

**Topics**
+ [Checkpointing a distributed model](#distributed-model-parallel-checkpoint)
+ [Fine-tuning a distributed model](#distributed-model-parallel-fine-tuning)

## Checkpointing a distributed model
<a name="distributed-model-parallel-checkpoint"></a>

Choose one of the following topics, depending on whether you use PyTorch or TensorFlow and which version of the SageMaker model parallelism library you use.

**Topics**
+ [Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library v1.10.0 and later)](#model-parallel-extended-features-pytorch-checkpoint)
+ [Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library between v1.6.0 and v1.9.0)](#model-parallel-extended-features-pytorch-saving-loading-checkpoints)
+ [Checkpointing a distributed TensorFlow model](#distributed-model-parallel-checkpoint-tensorflow)

### Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library v1.10.0 and later)
<a name="model-parallel-extended-features-pytorch-checkpoint"></a>

The SageMaker model parallelism library provides checkpoint APIs to save and load full or partial checkpoints of the distributed model state and its optimizer state.

**Note**  
This checkpointing method is recommended if you use PyTorch and the SageMaker model parallelism library v1.10.0 or later.

**Partial checkpointing**

To save checkpoints of a model trained with model parallelism, use the [`smdistributed.modelparallel.torch.save_checkpoint`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.save_checkpoint) API with the partial checkpointing option set to true (`partial=True`). This saves each model partition individually. In addition to the model and the optimizer state, you can also save any additional custom data through the `user_content` argument. The checkpointed model, optimizer, and user content are saved as separate files. The `save_checkpoint` API call creates checkpoint folders in the following structure.

```
- path
  - ${tag}_partial (folder for partial checkpoints)
    - model_rankinfo.pt
    - optimizer_rankinfo.pt
    - fp16_states_rankinfo.pt
    - user_content.pt
  - $tag (checkpoint file for full checkpoints)
  - user_content_$tag (user_content file for full checkpoints)
  - newest (a file that indicates the newest checkpoint)
```
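To make that directory layout concrete, the following standalone sketch creates the partial-checkpoint structure on disk and resolves the most recent tag through the `newest` marker file. This mimics what the library's checkpoint APIs do for you automatically; the helper names are illustrative, not library APIs.

```python
import os
import tempfile

# Illustrative sketch (NOT the library API): build the partial-checkpoint
# layout shown above and resolve the latest tag through the "newest" file.

def write_partial_checkpoint(path, tag):
    ckpt_dir = os.path.join(path, f"{tag}_partial")
    os.makedirs(ckpt_dir, exist_ok=True)
    for name in ("model_rankinfo.pt", "optimizer_rankinfo.pt", "user_content.pt"):
        open(os.path.join(ckpt_dir, name), "wb").close()  # empty placeholder files
    with open(os.path.join(path, "newest"), "w") as f:
        f.write(tag)                                      # record the newest tag

def newest_tag(path):
    with open(os.path.join(path, "newest")) as f:
        return f.read().strip()

root = tempfile.mkdtemp()
write_partial_checkpoint(root, "total_steps100")
write_partial_checkpoint(root, "total_steps200")
print(newest_tag(root))   # total_steps200
```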

To resume training from partial checkpoints, use the [`smdistributed.modelparallel.torch.resume_from_checkpoint`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.resume_from_checkpoint) API with `partial=True`, and specify the checkpoint directory and the tag used while saving the partial checkpoints. Note that the actual loading of model weights happens after model partitioning, during the first run of the `smdistributed.modelparallel.torch.step`-decorated training step function.

When saving a partial checkpoint, the library also saves the model partition decision as files with the `.pt` file extension. Likewise, when resuming from a partial checkpoint, the library loads the partition decision files along with the model files. After the partition decision is loaded, you can't change the partition.

The following code snippet shows how to set the checkpoint APIs in a PyTorch training script.

```
import smdistributed.modelparallel.torch as smp

model = ...
model = smp.DistributedModel(model)
optimizer = ...
optimizer = smp.DistributedOptimizer(optimizer)
user_content = ...     # additional custom data
checkpoint_path = "/opt/ml/checkpoint/model_parallel"

# Save a checkpoint.
smp.save_checkpoint(
    path=checkpoint_path,
    tag=f"total_steps{total_steps}",
    partial=True,
    model=model,
    optimizer=optimizer,
    user_content=user_content,
    num_kept_partial_checkpoints=5
)

# Load a checkpoint.
# This automatically loads the most recently saved checkpoint.
smp_checkpoint = smp.resume_from_checkpoint(
    path=checkpoint_path, 
    partial=True
)
```

**Full checkpointing**

To save the final model artifact for inference purposes, use the `smdistributed.modelparallel.torch.save_checkpoint` API with `partial=False`, which combines the model partitions to create a single model artifact. Note that this does not combine the optimizer states.

To initialize training with particular weights, given a full model checkpoint, you can use the `smdistributed.modelparallel.torch.resume_from_checkpoint` API with `partial=False`. Note that this does not load optimizer states.

**Note**  
With tensor parallelism, in general, the `state_dict` must be translated between the original model implementation and the `DistributedModel` implementation. Optionally, you can provide the `state_dict` translation function as an argument to the `smdistributed.modelparallel.torch.resume_from_checkpoint`. However, for [Supported Models Out of the Box](model-parallel-extended-features-pytorch-hugging-face.md#model-parallel-extended-features-pytorch-hugging-face-out-of-the-box), the library takes care of this translation automatically.

The following code shows an example of how to use the checkpoint APIs for fully checkpointing a PyTorch model trained with model parallelism.

```
import smdistributed.modelparallel.torch as smp

model = ...
model = smp.DistributedModel(model)
optimizer = ...
optimizer = smp.DistributedOptimizer(optimizer)
user_content = ...     # additional custom data
checkpoint_path = "/opt/ml/checkpoint/model_parallel"

# Save a checkpoint.
smp.save_checkpoint(
    path=checkpoint_path,
    tag=f"total_steps{total_steps}",
    partial=False,
    model=model,
    optimizer=optimizer,
    user_content=user_content,
    num_kept_partial_checkpoints=5
)

# Load a checkpoint.
# This automatically loads the most recently saved checkpoint.
smp_checkpoint = smp.resume_from_checkpoint(
    path=checkpoint_path, 
    partial=False
)
```

### Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library between v1.6.0 and v1.9.0)
<a name="model-parallel-extended-features-pytorch-saving-loading-checkpoints"></a>

The SageMaker model parallelism library provides Python functions for saving partial or full checkpoints for training jobs with tensor parallelism. The following procedure shows how to use [`smdistributed.modelparallel.torch.save`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.save) and [`smdistributed.modelparallel.torch.load`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.load) to save and load a checkpoint when you use tensor parallelism.

**Note**  
This checkpointing method is recommended if you use PyTorch, [Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism.md), and the SageMaker model parallelism library between v1.6.0 and v1.9.0.

1. Prepare a model object and wrap it with the library's wrapper function `smp.DistributedModel()`.

   ```
   model = MyModel(...)
   model = smp.DistributedModel(model)
   ```

1. Prepare an optimizer for the model. A set of model parameters is an iterable argument required by optimizer functions. To prepare a set of model parameters, you must process `model.parameters()` to assign unique IDs to individual model parameters. 

   If there are parameters with duplicated IDs in the model parameter iterable, loading the checkpointed optimizer state fails. To create an iterable of model parameters with unique IDs for your optimizer, see the following:

   ```
   unique_params = []
   unique_params_set = set()
   for p in model.parameters():
     if p not in unique_params_set:
       unique_params.append(p)
       unique_params_set.add(p)
   del unique_params_set
   
   optimizer = MyOpt(unique_params, ...)
   ```

1. Wrap the optimizer using the library's wrapper function `smp.DistributedOptimizer()`.

   ```
   optimizer = smp.DistributedOptimizer(optimizer)
   ```

1. Save the model and the optimizer state using [`smdistributed.modelparallel.torch.save`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.save). Depending on how you want to save checkpoints, choose one of the following two options:
   + **Option 1:** Save a partial model on each `mp_rank` for a single `MP_GROUP`.

     ```
     model_dict = model.local_state_dict() # save a partial model
     opt_dict = optimizer.local_state_dict() # save a partial optimizer state
     # Save the dictionaries at rdp_rank 0 as a checkpoint
     if smp.rdp_rank() == 0:
         smp.save(
             {"model_state_dict": model_dict, "optimizer_state_dict": opt_dict},
             f"/checkpoint.pt",
             partial=True,
         )
     ```

     With tensor parallelism, the library saves checkpointed files named in the following format: `checkpoint.pt_{pp_rank}_{tp_rank}`.
**Note**  
With tensor parallelism, make sure you set the if statement as `if smp.rdp_rank() == 0` instead of `if smp.dp_rank() == 0`. When the optimizer state is sharded with tensor parallelism, all reduced-data parallel ranks must save their own partition of the optimizer state. Using a wrong *if* statement for checkpointing might result in a stalling training job. For more information about using `if smp.dp_rank() == 0` without tensor parallelism, see [General Instruction for Saving and Loading](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#general-instruction-for-saving-and-loading) in the *SageMaker Python SDK documentation*. 
   + **Option 2:** Save the full model.

     ```
     if smp.rdp_rank() == 0:
         model_dict = model.state_dict(gather_to_rank0=True) # save the full model
         if smp.rank() == 0:
             smp.save(
                 {"model_state_dict": model_dict},
                 "/checkpoint.pt",
                 partial=False,
             )
     ```
**Note**  
Consider the following for full checkpointing:  
+ If you set `gather_to_rank0=True`, all ranks other than `0` return empty dictionaries.
+ For full checkpointing, you can only checkpoint the model. Full checkpointing of optimizer states is currently not supported.
+ The full model only needs to be saved at `smp.rank() == 0`.

1. Load the checkpoints using [smp.load](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.load). Depending on how you checkpointed in the previous step, choose one of the following two options:
   + **Option 1:** Load the partial checkpoints.

     ```
     checkpoint = smp.load("/checkpoint.pt", partial=True)
     model.load_state_dict(checkpoint["model_state_dict"], same_partition_load=False)
     optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
     ```

     You can set `same_partition_load=True` in `model.load_state_dict()` for a faster load, if you know that the partition will not change.
   + **Option 2:** Load the full checkpoints.

     ```
     if smp.rdp_rank() == 0:
         checkpoint = smp.load("/checkpoint.pt", partial=False)
         model.load_state_dict(checkpoint["model_state_dict"])
     ```

     The `if smp.rdp_rank() == 0` condition is not required, but it can help avoid redundant loading among different `MP_GROUP`s. Full checkpointing optimizer state dict is currently not supported with tensor parallelism.

### Checkpointing a distributed TensorFlow model
<a name="distributed-model-parallel-checkpoint-tensorflow"></a>

To save a TensorFlow model while training with model parallelism, use the following functions provided by the SageMaker model parallelism library.
+ [smp.DistributedModel.save_model](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_tensorflow.html#smp.DistributedModel.save_model)
+ [smp.CheckpointManager](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_tensorflow.html#smp.CheckpointManager)

## Fine-tuning a distributed model
<a name="distributed-model-parallel-fine-tuning"></a>

Fine-tuning must be configured in your training script. The following code snippet shows an example structure of a training script that uses the [AutoModelForCausalLM](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForCausalLM) class of Hugging Face Transformers, with modifications for registering the `smdistributed.modelparallel.torch` modules and settings for fine-tuning.

**Note**  
Fine-tuning a distributed transformer (a Transformer model wrapped by `smp.DistributedModel()`) with the [smp.delay_param_initialization](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#smdistributed.modelparallel.torch.delay_param_initialization) function activated requires the fine-tuning job to be configured with an FSx for Lustre file system. If you want to fine-tune a large-scale model with the delayed parameter initialization option, set up an FSx for Lustre file system first.

```
import argparse
import os

import torch
from transformers import AutoModelForCausalLM
import smdistributed.modelparallel.torch as smp

def parse_args():

    parser = argparse.ArgumentParser()

    # set an arg group for model
    model_grp = parser.add_argument_group(
        title="model", description="arguments to describe model configuration"
    )

    ... # set up numerous args to parse from the configuration dictionary to the script for training

    # add arg for activating fine-tuning
    model_grp.add_argument(
        "--fine_tune",
        type=int,
        default=0,
        help="Fine-tune model from checkpoint or pretrained model",
    )

def main():
    """Main function to train GPT."""
    args = parse_args()

    ... # parse numerous args

    if args.fine_tune > 0 and args.delayed_param > 0 and smp.rank() == 0:
        pretrained_model = AutoModelForCausalLM.from_pretrained(
            args.model_name or args.model_dir
        )
        model_state_dict = pretrained_model.state_dict()
        path = os.path.join(args.model_dir, "fullmodel.pt")
        torch.save(model_state_dict, path)

    # create a Transformer model and wrap by smp.model_creation() 
    # with options to configure model parallelism parameters offered by SageMaker AI
    with smp.model_creation(
        tensor_parallelism=smp.tp_size() > 1 or args.use_distributed_transformer > 0,
        zero_init=args.use_distributed_transformer == 0,
        dtype=dtype,
        distribute_embedding=args.sharded_data_parallel_degree > 1 and smp.tp_size() > 1,
        use_alibi=args.alibi > 0,
        attention_in_fp32=args.attention_in_fp32 > 0,
        fp32_residual_addition=args.residual_addition_in_fp32 > 0,
        query_key_layer_scaling=args.query_key_layer_scaling > 0 and args.bf16 < 1,
        fused_softmax=args.fused_softmax > 0,
        fused_dropout=args.fused_dropout > 0,
        fused_bias_gelu=args.fused_bias_gelu > 0,
        flash_attention=args.flash_attention > 0,
    ):
        if args.fine_tune > 0 and args.delayed_param == 0:
            model = AutoModelForCausalLM.from_pretrained(
                args.model_name or args.model_dir
            )
        else:
            model = AutoModelForCausalLM.from_config(model_config)

    # wrap the model by smp.DistributedModel() to apply SageMaker model parallelism
    model = smp.DistributedModel(
        model, trace_device="gpu", backward_passes_per_step=args.gradient_accumulation
    )

    # wrap the optimizer by smp.DistributedOptimizer() to apply SageMaker model parallelism
    optimizer = ... # define an optimizer
    optimizer = smp.DistributedOptimizer(
        optimizer,
        static_loss_scale=None,
        dynamic_loss_scale=True,
        dynamic_loss_args={"scale_window": 1000, "min_scale": 1, "delayed_shift": 2},
    )

    # for fine-tuning, use smp.resume_from_checkpoint() to load a pre-trained model
    if args.fine_tune > 0 and args.delayed_param > 0:
        smp.resume_from_checkpoint(args.model_dir, tag="fullmodel.pt", partial=False)
```

For a complete example of training scripts and Jupyter notebooks, see the [GPT-2 examples for PyTorch](https://github.com/aws/amazon-sagemaker-examples/tree/main/training/distributed_training/pytorch/model_parallel/gpt2) in the *SageMaker AI Examples GitHub repository*. 

# Amazon SageMaker AI model parallelism library v1 examples
<a name="distributed-model-parallel-examples"></a>

This page provides a list of blogs and Jupyter notebooks that present practical examples of implementing the SageMaker model parallelism (SMP) library v1 to run distributed training jobs on SageMaker AI.

## Blogs and Case Studies
<a name="distributed-model-parallel-examples-blog"></a>

The following blogs discuss case studies about using SMP v1.
+ [New performance improvements in the Amazon SageMaker AI model parallelism library](https://aws.amazon.com/blogs/machine-learning/new-performance-improvements-in-amazon-sagemaker-model-parallel-library/), *AWS Machine Learning Blog* (December 16, 2022)
+ [Train gigantic models with near-linear scaling using sharded data parallelism on Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/train-gigantic-models-with-near-linear-scaling-using-sharded-data-parallelism-on-amazon-sagemaker/), *AWS Machine Learning Blog* (October 31, 2022)

## Example notebooks
<a name="distributed-model-parallel-examples-pytorch"></a>

Example notebooks are provided in the [SageMaker AI examples GitHub repository](https://github.com/aws/amazon-sagemaker-examples/tree/master/training/distributed_training/). To download the examples, run the following commands to clone the repository and change into `training/distributed_training/pytorch/model_parallel`.

**Note**  
Clone and run the example notebooks in one of the following SageMaker AI ML IDEs:  
+ [SageMaker JupyterLab](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-jl.html) (available in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
+ [SageMaker Code Editor](https://docs.aws.amazon.com/sagemaker/latest/dg/code-editor.html) (available in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
+ [Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) (available as an application in [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) created after December 2023)
+ [SageMaker Notebook Instances](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html)

```
git clone https://github.com/aws/amazon-sagemaker-examples.git
cd amazon-sagemaker-examples/training/distributed_training/pytorch/model_parallel
```

**SMP v1 example notebooks for PyTorch**
+ [Train GPT-2 with near-linear scaling using the sharded data parallelism technique in the SageMaker model parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/smp-train-gpt-sharded-data-parallel.ipynb)
+ [Fine-tune GPT-2 with near-linear scaling using sharded data parallelism technique in the SageMaker model parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/smp-fine-tune-gpt-sharded-data-parallel.ipynb)
+ [Train GPT-NeoX-20B with near-linear scaling using the sharded data parallelism technique in the SageMaker model parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt-neox/smp-train-gpt-neox-sharded-data-parallel.ipynb)
+ [Train GPT-J 6B using the sharded data parallelism and tensor parallelism techniques in the SageMaker model parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt-j/smp-train-gptj-sharded-data-parallel-tp.ipynb)
+ [Train FLAN-T5 with near-linear scaling using sharded data parallelism technique in the SageMaker model parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/flan-t5/smp-train-t5-sharded-data-parallel.ipynb)
+ [Train Falcon with near-linear scaling using sharded data parallelism technique in the SageMaker model parallelism library](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/falcon/smp-train-falcon-sharded-data-parallel.ipynb)

**SMP v1 example notebooks for TensorFlow**
+ [CNN with TensorFlow 2.3.1 and the SageMaker model parallelism library](https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/tensorflow/model_parallel/mnist/tensorflow_smmodelparallel_mnist.html)
+ [HuggingFace with TensorFlow Distributed model parallelism library Training on SageMaker AI](https://github.com/huggingface/notebooks/blob/master/sagemaker/04_distributed_training_model_parallelism/sagemaker-notebook.ipynb)

# SageMaker Distributed Model Parallelism Best Practices
<a name="model-parallel-best-practices"></a>

Use the following guidelines when you run a distributed training job with the SageMaker model parallel library.

## Setting Up the Right Configuration for a Given Model
<a name="model-parallel-best-practices-configuration"></a>

When scaling up a model, we recommend that you go through the following list in order. Each item discusses the advantages of using the library's techniques along with the tradeoffs that might arise. 

**Tip**  
If a model can fit well using a subset of the library's features, adding more model parallelism or memory saving features does not usually improve performance.

**Using large GPU instance types**
+ In the realm of model parallelism, it is best to use powerful instances with large GPU memories to handle overhead from model parallelism operations such as partitioning models across multiple GPUs. We recommend using `ml.p4d` or `ml.p3dn` instances for training large DL models. These instances are also equipped with Elastic Fabric Adapter (EFA), which provides higher network bandwidth and enables large-scale training with model parallelism.

**Sharding optimizer state**
+ The impact of sharding optimizer state depends on the number of data parallel ranks. Typically, a higher degree of data parallelism (proportional to the size of the compute cluster) can improve the efficiency of memory usage.

  When you want to downsize a cluster, make sure you check the optimizer state sharding configuration. For example, a large DL model with optimizer state sharding that fits on a compute cluster with 16 GPUs (for example, two P4d or P4de instances) might not always fit on a node with 8 GPUs (for example, a single P4d or P4de instance). This is because the combined memory of 8 GPUs is lower than the combined memory of 16 GPUs, and the required memory per GPU for sharding over 8 GPUs is also higher than the memory per GPU for sharding over the 16-GPU scenario. As a result, the increased memory requirement might not fit into the smaller cluster.

  For more information, see [Optimizer State Sharding](model-parallel-extended-features-pytorch-optimizer-state-sharding.md).
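
  The memory arithmetic behind this caveat can be sketched in a few lines. The state size and the helper function below are hypothetical, chosen only to illustrate the comparison; real footprints depend on the model, the optimizer, and the precision settings.

  ```
  def optimizer_state_per_gpu_gb(total_state_gb, num_gpus):
      """Per-GPU share of the optimizer state when sharded across num_gpus."""
      return total_state_gb / num_gpus

  total_state_gb = 400  # assumed total optimizer state for a large model

  # Sharding over 16 GPUs (for example, two P4d instances) vs. 8 GPUs (one P4d instance)
  print(optimizer_state_per_gpu_gb(total_state_gb, 16))  # 25.0 GB per GPU
  print(optimizer_state_per_gpu_gb(total_state_gb, 8))   # 50.0 GB per GPU
  ```

  Even though the model is unchanged, halving the cluster doubles each GPU's share of the sharded state, which is why a configuration that fits on 16 GPUs might not fit on 8.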

**Activation checkpointing**
+ Memory efficiency can be improved by using activation checkpointing for a group of modules. The more you group the modules, the more efficient the memory usage. When checkpointing sequential modules for layers, the `strategy` argument of the `smp.set_activation_checkpointing` function groups the layers together for checkpointing. For example, grouping two or more layers together for checkpointing is more memory efficient than checkpointing one layer at a time, and this trades extra computation time for reduced memory usage.

  For more information, see [Activation Checkpointing](model-parallel-extended-features-pytorch-activation-checkpointing.md).

**Tensor parallelism**
+ The degree of tensor parallelism should be a power of two (2, 4, 8, ..., 2^n), where the maximum degree must be equal to the number of GPUs per node. For example, if you use a node with 8 GPUs, possible numbers for the degree of tensor parallelism are 2, 4, and 8. We don’t recommend arbitrary numbers (such as 3, 5, 6, and 7) for the degree of tensor parallelism. When you use multiple nodes, misconfiguring the degree of tensor parallelism might result in running tensor parallelism across the nodes; this adds significant overhead from communication of activations across the nodes and can become computationally expensive.

  For more information, see [Tensor Parallelism](model-parallel-extended-features-pytorch-tensor-parallelism.md).<a name="model-parallel-best-practices-configuration-pipeline-across-nodes"></a>
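
  As a quick sanity check, the set of valid tensor parallelism degrees for a node can be sketched as follows (`valid_tp_degrees` is a hypothetical helper, not part of the library):

  ```
  def valid_tp_degrees(gpus_per_node):
      """Powers of two from 2 up to the number of GPUs per node."""
      degree, degrees = 2, []
      while degree <= gpus_per_node:
          degrees.append(degree)
          degree *= 2
      return degrees

  print(valid_tp_degrees(8))  # [2, 4, 8]
  ```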

**Pipeline parallelism across nodes**
+ You can run pipeline parallelism both within a single node and across multiple nodes. When you use pipeline parallelism in combination with tensor parallelism, we recommend running pipeline parallelism across multiple nodes and keeping tensor parallelism within individual nodes. 
+ Pipeline parallelism comes with the following three knobs: `microbatches`, `active_microbatches`, and `prescaled_batch`.
  + When you use tensor parallelism with pipeline parallelism, we recommend activating `prescaled_batch` so that the batch size per model parallel group can be increased for efficient pipelining. With `prescaled_batch` activated, the batch size set in the training script becomes `tp_size` times the batch size set for each rank without `prescaled_batch`.
  + Increasing the number of `microbatches` helps achieve efficient pipelining and better performance. Note that the effective microbatch size is the batch size divided by number of microbatches. If you increase the number of microbatches while keeping batch size constant, each microbatch processes fewer samples.
  + The number of `active_microbatches` is the maximum number of microbatches that are simultaneously in process during pipelining. For each active microbatch in process, its activations and gradients take up GPU memory. Therefore, increasing `active_microbatches` takes up more GPU memory.
+ If both GPU and GPU memory are underutilized, increase `active_microbatches` for better parallelization during pipelining.
+ For more information about how to use tensor parallelism with pipeline parallelism, see [Tensor parallelism combined with pipeline parallelism](model-parallel-extended-features-pytorch-tensor-parallelism-examples.md#model-parallel-extended-features-pytorch-tensor-and-pipeline-parallelism).
+ To find descriptions of the aforementioned parameters, see [Parameters for `smdistributed`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#parameters-for-smdistributed) in the *SageMaker Python SDK documentation*.
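
The interaction of these knobs can be sketched with plain arithmetic. The helper functions and the numbers below are hypothetical, for illustration only:

```
def effective_microbatch_size(batch_size, microbatches):
    """Each microbatch processes batch_size / microbatches samples."""
    assert batch_size % microbatches == 0, "batch size must be divisible by microbatches"
    return batch_size // microbatches

def script_batch_size(per_rank_batch_size, tp_size, prescaled_batch):
    """With prescaled_batch, the script-level batch size is tp_size times larger."""
    return per_rank_batch_size * tp_size if prescaled_batch else per_rank_batch_size

print(effective_microbatch_size(64, 4))                       # 16 samples per microbatch
print(script_batch_size(8, tp_size=8, prescaled_batch=True))  # 64
```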

**Offloading activations to CPU**
+ Make sure that activation offloading is used in combination with activation checkpointing and pipeline parallelism. To ensure that the offloading and preloading happen in the background, specify a value greater than 1 for the `microbatches` parameter. 
+ When offloading activations, you might be able to increase `active_microbatches` and sometimes match with the total number of microbatches. This depends on which modules are checkpointed and how the model is partitioned.

  For more information, see [Activation Offloading](model-parallel-extended-features-pytorch-activation-offloading.md).

### Reference configurations
<a name="model-parallel-best-practices-configuration-reference"></a>

The SageMaker model parallelism training team provides the following reference points based on experiments with the GPT-2 model, a sequence length of 512, and a vocabulary size of 50,000. 


| The number of model parameters | Instance type | Pipeline parallelism | Tensor parallelism | Optimizer state sharding | Activation checkpointing | Prescaled batch | Batch size | 
| --- | --- | --- | --- | --- | --- | --- | --- | 
| 10 billion | 16 ml.p4d.24xlarge | 1 | 4 | True | Each transformer layer | True | batch_size=40 | 
| 30 billion | 16 ml.p4d.24xlarge | 1 | 8 | True | Each transformer layer | True | batch_size=32 | 
| 60 billion | 32 ml.p4d.24xlarge | 2 | 8 | True | Each transformer layer | True | batch_size=56, microbatches=4, active_microbatches=2 | 

You can extrapolate from the preceding configurations to estimate GPU memory usage for your model configuration. For example, if you increase the sequence length for a 10-billion-parameter model or increase the size of the model to 20 billion, you might want to lower batch size first. If the model still doesn’t fit, try increasing the degree of tensor parallelism.

## Modifying Your Training Script
<a name="model-parallel-best-practices-modify-training-script"></a>
+ Before you use the SageMaker model parallel library’s features in your training script, review [The SageMaker Distributed Model Parallelism Library Configuration Tips and Pitfalls](model-parallel-customize-tips-pitfalls.md).
+ To launch a training job faster, use the [SageMaker AI local mode](https://sagemaker.readthedocs.io/en/v2.199.0/overview.html?highlight=local%20mode#local-mode). This helps you quickly run a training job locally on a SageMaker notebook instance. Depending on the scale of the ML instance on which your SageMaker notebook instance is running, you might need to adjust the size of your model by changing the model configurations, such as the hidden width, number of transformer layers, and attention heads. Validate if the reduced model runs well on the notebook instance before using a large cluster for training the full model. 

## Monitoring and Logging a Training Job Using the SageMaker AI Console and Amazon CloudWatch
<a name="model-parallel-best-practices-monitoring"></a>

To monitor system-level metrics such as CPU memory utilization, GPU memory utilization, and GPU utilization, use visualization provided through the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Training**.

1. Choose **Training jobs**.

1. In the main pane, choose the training job name for which you want to see more details.

1. Browse the main pane and find the **Monitor** section to see the automated visualization.

1. To see training job logs, choose **View logs** in the **Monitor** section. You can access the distributed training job logs of the training job in CloudWatch. If you launched multi-node distributed training, you should see multiple log streams with tags in the format of **algo-n-1234567890**. The **algo-1** log stream tracks training logs from the main (0th) node.

For more information, see [Amazon CloudWatch Metrics for Monitoring and Analyzing Training Jobs](training-metrics.md).

## Permissions
<a name="model-parallel-best-practices-permissions"></a>

To run a SageMaker training job with model parallelism or the [SageMaker distributed training example notebooks](https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html), make sure you have the right permissions in your IAM role, such as the following:
+ To use [FSx for Lustre](https://aws.amazon.com/fsx/), add [AmazonFSxFullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonFSxFullAccess).
+ To use Amazon S3 as a data channel, add [AmazonS3FullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonS3FullAccess).
+ To use Docker, build your own container, and push it to Amazon ECR, add [AmazonEC2ContainerRegistryFullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonEC2ContainerRegistryFullAccess).
+ To have full access to the entire suite of SageMaker AI features, add [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home#/policies/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonSageMakerFullAccess). 

# The SageMaker Distributed Model Parallelism Library Configuration Tips and Pitfalls
<a name="model-parallel-customize-tips-pitfalls"></a>

Review the following tips and pitfalls before using Amazon SageMaker AI's model parallelism library. This list includes tips that are applicable across frameworks. For TensorFlow and PyTorch specific tips, see [Modify a TensorFlow training script](model-parallel-customize-training-script-tf.md) and [Modify a PyTorch Training Script](model-parallel-customize-training-script-pt.md), respectively. 

## Batch Size and Number of Microbatches
<a name="model-parallel-customize-tips-pitfalls-batch-size"></a>
+ The library is most efficient with large batch sizes. For use cases where the model fits within a single device but can only be trained with a small batch size, the batch size can and should be increased after the library is integrated. Model parallelism saves memory for large models, enabling you to train using batch sizes that previously did not fit in memory.
+ Choosing a number of microbatches that is too small or too large can lower performance. The library executes each microbatch sequentially in each device, so microbatch size (batch size divided by number of microbatches) must be large enough to fully utilize each GPU. At the same time, pipeline efficiency increases with the number of microbatches, so striking the right balance is important. Typically, a good starting point is to try 2 or 4 microbatches, increasing the batch size to the memory limit, and then experiment with larger batch sizes and numbers of microbatches. As the number of microbatches is increased, larger batch sizes might become feasible if an interleaved pipeline is used.
+ Your batch size must always be divisible by the number of microbatches. Note that depending on the size of the dataset, sometimes the last batch of every epoch can be smaller than the rest, and this smaller batch needs to be divisible by the number of microbatches as well. If it is not, you can set `drop_remainder=True` in the `tf.Dataset.batch()` call (in TensorFlow), or set `drop_last=True` in `DataLoader` (in PyTorch), so that this last, small batch is not used. If you are using a different API for the data pipeline, you might need to manually skip the last batch whenever it is not divisible by the number of microbatches.
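
A minimal sketch of this divisibility check follows. The `check_batches` helper is hypothetical and framework-agnostic; it only models the batch sizes an epoch would produce:

```
def check_batches(dataset_size, batch_size, microbatches, drop_last):
    """Return True if every batch of the epoch is divisible by microbatches."""
    full_batches, remainder = divmod(dataset_size, batch_size)
    batches = [batch_size] * full_batches
    if remainder and not drop_last:
        batches.append(remainder)  # the smaller final batch
    return all(b % microbatches == 0 for b in batches)

# The final batch of 42 samples (1002 - 15 * 64) is not divisible by 4
print(check_batches(1002, 64, 4, drop_last=False))  # False
# Dropping the last batch resolves the problem
print(check_batches(1002, 64, 4, drop_last=True))   # True
```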

## Manual Partitioning
<a name="model-parallel-customize-tips-pitfalls-manual-partitioning"></a>
+ If you use manual partitioning, be mindful of the parameters that are consumed by multiple operations and modules in your model, such as the embedding table in transformer architectures. Modules that share the same parameter must be placed in the same device for correctness. When auto-partitioning is used, the library automatically enforces this constraint.

## Data Preparation
<a name="model-parallel-customize-tips-pitfalls-data-preparation"></a>
+ If the model takes multiple inputs, make sure you seed the random operations in your data pipeline (e.g., shuffling) with `smp.dp_rank()`. If the dataset is being deterministically sharded across data parallel devices, make sure that the shard is indexed by `smp.dp_rank()`. This is to make sure that the order of the data seen on all ranks that form a model partition is consistent.
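
A sketch of both recommendations, using plain Python in place of a real data pipeline. `ranked_shuffle` and `ranked_shard` are hypothetical helpers, and the `dp_rank` and `dp_size` arguments stand in for `smp.dp_rank()` and `smp.dp_size()`:

```
import random

def ranked_shuffle(samples, dp_rank):
    """Seed the shuffle with dp_rank so that every rank holding a partition of
    the same model replica sees the data in the same order."""
    rng = random.Random(dp_rank)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    return shuffled

def ranked_shard(samples, dp_rank, dp_size):
    """Deterministic sharding indexed by dp_rank."""
    return list(samples)[dp_rank::dp_size]

data = range(10)
print(ranked_shard(data, dp_rank=0, dp_size=2))  # [0, 2, 4, 6, 8]
print(ranked_shard(data, dp_rank=1, dp_size=2))  # [1, 3, 5, 7, 9]
```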

## Returning Tensors from `smp.DistributedModel`
<a name="model-parallel-customize-tips-pitfalls-return-tensors"></a>
+ Any tensor that is returned from the `smp.DistributedModel.call` (for TensorFlow) or `smp.DistributedModel.forward` (for PyTorch) function is broadcast to all other ranks, from the rank that computed that particular tensor. As a result, any tensor that is not needed outside the call and forward methods (intermediate activations, for example) should not be returned, as this causes needless communication and memory overhead and hurts performance.

## The `@smp.step` Decorator
<a name="model-parallel-customize-tips-pitfalls-smp-step-decorator"></a>
+ If an `smp.step`-decorated function has a tensor argument that does not have a batch dimension, the argument name must be provided in the `non_split_inputs` list when calling `smp.step`. This prevents the library from attempting to split the tensor into microbatches. For more information, see the [common API documentation](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_common_api.html) in the *SageMaker Python SDK documentation*.

## Delaying Parameter Initialization
<a name="model-parallel-customize-tips-pitfalls-delaying-param-initialization"></a>

For very large models of over 100 billion parameters, weight initialization through the CPU memory might result in an out-of-memory error. To get around this, the library offers the `smp.delay_param_initialization` context manager. This delays the physical allocation of parameters until they are moved to the GPU during the first execution of an `smp.step`-decorated function, which avoids unnecessary CPU memory usage during the initialization of training. Use the context manager when you create a model object, as shown in the following code.

```
with smp.delay_param_initialization(enabled=True):    
    model = MyModel()
```

## Tensor Parallelism for PyTorch
<a name="model-parallel-customize-tips-pitfalls-tensor-parallelism-pytorch"></a>
+ If you are using a seed for deterministic results, set the seed based on `smp.dp_rank()` (for example, `torch.manual_seed(42 + smp.dp_rank())`). If you do not do this, different partitions of an `nn.Parameter` are initialized in the same way, impacting convergence. 
+ SageMaker’s model parallelism library uses NCCL to implement collectives needed for the distribution of the modules. Especially for smaller models, if too many NCCL calls are scheduled on the GPU at the same time, memory usage might increase because of additional space used by NCCL. To counteract this, `smp` throttles the NCCL calls so that the number of ongoing NCCL operations at any given time is less than or equal to a given limit. The default limit is 8, but this can be adjusted using the environment variable `SMP_NCCL_THROTTLE_LIMIT`. If you observe more memory usage than you expect while using tensor parallelism, you can try reducing this limit. However, choosing a limit that is too small might cause throughput loss. To disable throttling altogether, you can set `SMP_NCCL_THROTTLE_LIMIT=-1`. 
+ The following identity, which holds when the degree of tensor parallelism is 1, does not hold when the degree of tensor parallelism is greater than 1: `smp.mp_size() * smp.dp_size() == smp.size()`. This is because the tensor parallel group is part of both the model parallelism group and the data parallelism group. If your code has existing references to `mp_rank`, `mp_size`, `MP_GROUP`, and so on, and if you want to work with only the pipeline parallel group, you might need to replace the references with `smp.pp_size()`. The following identities are always true: 
  +  `smp.mp_size() * smp.rdp_size() == smp.size()` 
  +  `smp.pp_size() * smp.dp_size() == smp.size()` 
  +  `smp.pp_size() * smp.tp_size() * smp.rdp_size() == smp.size()` 
+ Since the `smp.DistributedModel` wrapper modifies the model parameters when tensor parallelism is enabled, the optimizer should be created after calling `smp.DistributedModel`, with the distributed parameters. For example, the following does not work: 

  ```
  ## WRONG
  model = MyModel()
  optimizer = SomeOptimizer(model.parameters())
  model = smp.DistributedModel(model)  # optimizer now has outdated parameters! 
  ```

  Instead, the optimizer should be created with the parameters of the `smp.DistributedModel` as follows:

  ```
  ## CORRECT
  model = smp.DistributedModel(MyModel())
  optimizer = SomeOptimizer(model.optimizers())
  ```
+ When a module is replaced with its distributed counterpart through tensor parallelism, the distributed module does not inherit its weights from the original module, and initializes new weights. This means that, for instance, if the weights need to be initialized in a particular call (for example, through a `load_state_dict` call), this needs to happen after the `smp.DistributedModel` call, once the module distribution takes place. 
+ When accessing the parameters of distributed modules directly, note that the weight does not have the same shape as the original module. For instance,  

  ```
  import torch.nn as nn
  # smp refers to the model parallelism library (smdistributed.modelparallel.torch)
  
  with smp.tensor_parallelism():
      linear = nn.Linear(60, 60)
  
  # Passes: the original module keeps its full weight shape.
  assert tuple(linear.weight.shape) == (60, 60)
  
  distributed_linear = smp.DistributedModel(linear)
  
  # Fails: the number of input channels has been divided by smp.tp_size().
  assert tuple(distributed_linear.module.weight.shape) == (60, 60)
  ```
+ Using `torch.utils.data.distributed.DistributedSampler` is strongly recommended for tensor parallelism. This ensures that every data parallel rank receives the same number of data samples, which prevents hangs that might result from different `dp_rank`s taking a different number of steps. 
+ If you use the `join` API of PyTorch's `DistributedDataParallel` class to handle cases in which different data parallel ranks have different numbers of batches, you still need to make sure that ranks that are in the same `TP_GROUP` have the same number of batches; otherwise the communication collectives used in distributed execution of modules may hang. Ranks that are in different `TP_GROUP`s can have different numbers of batches, as long as `join` API is used. 
+ If you want to checkpoint your model and use tensor parallelism, consider the following: 
  + To avoid stalling and race conditions while saving and loading models when you use tensor parallelism, make sure that you call the appropriate saving and loading functions for the model and optimizer states from within a reduced-data parallelism rank.
  + If you are transitioning an existing pipeline parallel script and enabling tensor parallelism for the script, ensure that you replace any `if smp.dp_rank() == 0` blocks used for saving and loading with `if smp.rdp_rank() == 0` blocks. Otherwise, your training job might stall. 

  For more information about checkpointing a model with tensor parallelism, see [Checkpointing a distributed model](distributed-model-parallel-checkpointing-and-finetuning.md#distributed-model-parallel-checkpoint).
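
The rank-guard pattern described in the checkpointing notes above can be sketched as a small helper. This function is illustrative only (it is not part of the library); `dp_rank` and `rdp_rank` stand in for the values returned by `smp.dp_rank()` and `smp.rdp_rank()`:

```
def should_save_checkpoint(tensor_parallel_enabled, dp_rank, rdp_rank):
    """Return True on ranks that should perform checkpoint I/O.

    With tensor parallelism enabled, guard on the reduced-data-parallel
    rank; otherwise, guard on the data-parallel rank.
    """
    return rdp_rank == 0 if tensor_parallel_enabled else dp_rank == 0
```

In a training script, the guard would wrap the `state_dict()` save call, for example: `if should_save_checkpoint(True, smp.dp_rank(), smp.rdp_rank()): ...`.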

# Model Parallel Troubleshooting
<a name="distributed-troubleshooting-model-parallel"></a>

If you run into an error, you can use the following list to try to troubleshoot your training job. If the problem persists, contact [AWS Support](https://aws.amazon.com/premiumsupport). 

**Topics**
+ [Considerations for Using SageMaker Debugger with the SageMaker Model Parallelism Library](#distributed-ts-model-parallel-debugger)
+ [Saving Checkpoints](#distributed-ts-model-parallel-checkpoints)
+ [Convergence Using Model Parallel and TensorFlow](#distributed-ts-model-parallel-tf-convergence)
+ [Stalling or Crashing Distributed Training Jobs](#distributed-ts-model-parallel-training-issues)
+ [Receiving NCCL Error for a PyTorch Training Job](#distributed-ts-model-parallel-nccl-error)
+ [Receiving `RecursionError` for a PyTorch Training Job](#distributed-ts-model-parallel-super-forward-not-supported)

## Considerations for Using SageMaker Debugger with the SageMaker Model Parallelism Library
<a name="distributed-ts-model-parallel-debugger"></a>

SageMaker Debugger is not available for the SageMaker model parallelism library. Debugger is enabled by default for all SageMaker TensorFlow and PyTorch training jobs, and you might see an error that looks like the following: 

```
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/checkpoints/metadata.json.sagemaker-uploading
```

To fix this issue, disable Debugger by passing `debugger_hook_config=False` when creating a framework `estimator` as shown in the following example.

```
bucket=sagemaker.Session().default_bucket()
base_job_name="sagemaker-checkpoint-test"
checkpoint_in_bucket="checkpoints"

# The S3 URI to store the checkpoints
checkpoint_s3_bucket="s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)

estimator = TensorFlow(
    ...

    distribution={"smdistributed": {"modelparallel": { "enabled": True }}},
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path="/opt/ml/checkpoints",
    debugger_hook_config=False
)
```

## Saving Checkpoints
<a name="distributed-ts-model-parallel-checkpoints"></a>

You might run into the following error when saving checkpoints of a large model on SageMaker AI: 

```
InternalServerError: We encountered an internal error. Please try again
```

This could be caused by a SageMaker AI limitation while uploading the local checkpoint to Amazon S3 during training. If you run into the preceding error, do not pass `checkpoint_s3_uri` to the SageMaker `estimator` call. When saving checkpoints for larger models, we recommend saving them to a custom directory and uploading them to Amazon S3 explicitly, passing that directory as the `local_path` argument to the helper functions in the following example.

```
import os

def aws_s3_sync(source, destination):
    """aws s3 sync in quiet mode and time profile"""
    import time, subprocess
    cmd = ["aws", "s3", "sync", "--quiet", source, destination]
    print(f"Syncing files from {source} to {destination}")
    start_time = time.time()
    # check=True raises if the sync fails instead of failing silently
    subprocess.run(cmd, check=True, capture_output=True)
    end_time = time.time()
    print("Time Taken to Sync: ", (end_time-start_time))

def sync_local_checkpoints_to_s3(local_path="/opt/ml/checkpoints", s3_uri=os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', '')))+'/checkpoints'):
    """ sample function to sync checkpoints from local path to s3 """

    import boto3
    #check if local path exists
    if not os.path.exists(local_path):
        raise RuntimeError(f"Provided local path {local_path} does not exist.")

    #check if s3 bucket exists
    s3 = boto3.resource('s3')
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Provided s3 uri {s3_uri} is not valid.")

    s3_bucket = s3_uri.replace('s3://','').split('/')[0]
    print(f"S3 Bucket: {s3_bucket}")
    # head_bucket raises if the bucket does not exist or is inaccessible
    s3.meta.client.head_bucket(Bucket=s3_bucket)
    aws_s3_sync(local_path, s3_uri)
    return

def sync_s3_checkpoints_to_local(local_path="/opt/ml/checkpoints", s3_uri=os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', '')))+'/checkpoints'):
    """ sample function to sync checkpoints from s3 to local path """

    import boto3
    #try to create local path if it does not exist
    if not os.path.exists(local_path):
        print(f"Provided local path {local_path} does not exist. Creating...")
        try:
            os.makedirs(local_path)
        except Exception as e:
            raise RuntimeError(f"Failed to create {local_path}")

    #check if s3 bucket exists
    s3 = boto3.resource('s3')
    if not s3_uri.startswith("s3://"):
        raise ValueError(f"Provided s3 uri {s3_uri} is not valid.")

    s3_bucket = s3_uri.replace('s3://','').split('/')[0]
    print(f"S3 Bucket: {s3_bucket}")
    # head_bucket raises if the bucket does not exist or is inaccessible
    s3.meta.client.head_bucket(Bucket=s3_bucket)
    aws_s3_sync(s3_uri, local_path)
    return
```

Usage of helper functions:

```
#base_s3_uri - user input s3 uri or save to model directory (default)
#curr_host - to save checkpoints of current host
#iteration - current step/epoch during which checkpoint is saved

# save checkpoints on every node using local_rank
if smp.local_rank() == 0:
    base_s3_uri = os.path.dirname(os.path.dirname(os.getenv('SM_MODULE_DIR', '')))
    curr_host = os.environ['SM_CURRENT_HOST']
    full_s3_uri = f'{base_s3_uri}/checkpoints/{curr_host}/{iteration}'
    sync_local_checkpoints_to_s3(local_path=checkpoint_dir, s3_uri=full_s3_uri)
```

## Convergence Using Model Parallel and TensorFlow
<a name="distributed-ts-model-parallel-tf-convergence"></a>

When you use SageMaker AI multi-node training with TensorFlow and the model parallelism library, the loss may not converge as expected because the order of training input files may be different on each node. This may cause different ranks in the same model parallel group to work on different input files, causing inconsistencies. To prevent this, ensure the input files are ordered the same way in all the ranks before they get converted to TensorFlow datasets. One way to achieve this is to sort the input file names in the training script.
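
For example, if your training script lists the input files of a channel directory before building the dataset, a deterministic sort keeps the order identical across all ranks. The directory layout and file pattern below are illustrative; in a real script, `tf.data.TFRecordDataset` (or similar) would consume the sorted list:

```
import glob
import os

def ordered_input_files(data_dir, pattern="*.tfrecord"):
    """Return input file paths in a deterministic order, identical on every rank."""
    files = glob.glob(os.path.join(data_dir, pattern))
    # Sorting by name guarantees all ranks see the same sequence of files.
    return sorted(files)

# Every rank then builds its dataset from the same ordered list, for example:
# dataset = tf.data.TFRecordDataset(ordered_input_files("/opt/ml/input/data/train"))
```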

## Stalling or Crashing Distributed Training Jobs
<a name="distributed-ts-model-parallel-training-issues"></a>

If your training job stalls, crashes, or stops responding, read the following troubleshooting items to identify the cause of the issue. If you need further support, reach out to the SageMaker distributed training team through [AWS Support](https://aws.amazon.com/premiumsupport).
+  If you see **a distributed training job stalling at the NCCL initialization step**, consider the following: 
  + If you are using one of the EFA-enabled instances (`ml.p4d` or `ml.p3dn` instances) with a custom VPC and its subnet, ensure that the security group you use has inbound and outbound rules that allow all ports to and from the same security group. You also generally need an outbound rule to any IP as a separate rule (for internet access). To find instructions on how to add inbound and outbound rules for EFA communication, refer to [SageMaker AI distributed training job stalling during initialization](distributed-troubleshooting-data-parallel.md#distributed-ts-data-parallel-efa-sg).
+ If you see a **distributed training job stalling when checkpointing** the full model, this might be because the `state_dict()` call on the model or optimizer was not made on all ranks with `rdp_rank()==0` (when using tensor parallelism) or `dp_rank()==0` (when using only pipeline parallelism). These ranks need to communicate to construct the checkpoint to be saved. Similar stalling issues can also happen when checkpointing a partial optimizer state if `shard_optimizer_state` is enabled. 

  For more information about checkpointing a model with model parallelism, see [General Instruction for Saving and Loading](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smp_versions/latest/smd_model_parallel_pytorch.html#general-instruction-for-saving-and-loading) and [Checkpointing a distributed PyTorch model (for the SageMaker model parallelism library between v1.6.0 and v1.9.0)](distributed-model-parallel-checkpointing-and-finetuning.md#model-parallel-extended-features-pytorch-saving-loading-checkpoints).
+ If the training job crashes with a **CUDA Out of Memory error**, this means that the distributed training configuration needs to be adjusted to fit the model on the GPU cluster. For more information and best practices, see [Setting Up the Right Configuration for a Given Model](model-parallel-best-practices.md#model-parallel-best-practices-configuration).
+ If the training job crashes with an **uncorrectable [ECC error](https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/index.html)**, this means that one of the GPUs in the cluster has gone bad. If you need technical support, share the job ARN with the AWS team and restart your training job from a checkpoint if possible.
+ In rare cases, a job configuration that worked previously but is close to the limits of GPU memory might fail later with a different cluster due to a **CUDA Out of Memory error**. This could be because some GPU has lower available memory than usual due to ECC errors.
+ A **network timeout crash** might happen when running a multi-node job that doesn't use all GPUs in each node. To work around this, use all GPUs on the node by ensuring that the `processes_per_host` parameter is set to the number of GPUs in each instance. For example, this is `processes_per_host=8` for `ml.p3.16xlarge`, `ml.p3dn.24xlarge`, and `ml.p4d.24xlarge` instances.
+ If you find that your training job takes a long time during the data downloading stage, make sure the Amazon S3 path you provided to `checkpoint_s3_uri` for the SageMaker `Estimator` class is unique for the current training job. If this path is reused across multiple training jobs running simultaneously, all those checkpoints are uploaded and downloaded to the same Amazon S3 path and might significantly increase checkpoint loading time.
+ Use FSx for Lustre when you deal with large data and models.
  + If your dataset is large and fetching it takes a long time, we recommend keeping your dataset in [FSx for Lustre](https://aws.amazon.com/fsx/lustre/).
  + When you train a model with more than 10 billion parameters, we recommend using FSx for Lustre for checkpointing.
  + After you create a file system, make sure to wait for the status to become **available** before starting a training job using it. 
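
To apply the `processes_per_host` guidance above, you pass the value through the estimator's `distribution` argument. The following sketch builds that argument from an instance-to-GPU-count mapping; the mapping and helper function are assumptions for this example (verify GPU counts against the current instance specifications), not a SageMaker API:

```
# Illustrative GPU counts per instance type; verify against current EC2 specs.
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}

def mpi_distribution(instance_type):
    """Build a `distribution` dict that launches one process per GPU on each node."""
    return {
        "mpi": {
            "enabled": True,
            "processes_per_host": GPUS_PER_INSTANCE[instance_type],
        }
    }

# Passed to the estimator, for example:
# estimator = PyTorch(..., distribution=mpi_distribution("ml.p4d.24xlarge"))
```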

## Receiving NCCL Error for a PyTorch Training Job
<a name="distributed-ts-model-parallel-nccl-error"></a>

If you encounter the following error, it might be because a process ran out of GPU memory.

```
NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
```

You can resolve this by reducing the batch size or `active_microbatches`. If auto-partitioning does not result in a well-balanced partitioning, you might have to consider manual partitioning. For more information, see [Pipeline parallelism across nodes](model-parallel-best-practices.md#model-parallel-best-practices-configuration-pipeline-across-nodes).

## Receiving `RecursionError` for a PyTorch Training Job
<a name="distributed-ts-model-parallel-super-forward-not-supported"></a>

The library does not support calling `super.forward()` inside a module's forward call. If you use `super.forward()`, you might receive the following error message. 

```
RecursionError: maximum recursion depth exceeded
```

To fix the error, instead of calling `super.forward()`, you should call `super()._orig_forward()`. 

# Distributed computing with SageMaker AI best practices
<a name="distributed-training-options"></a>

This best practices page presents various flavors of distributed computing for machine learning (ML) jobs in general. The term *distributed computing* in this page encompasses distributed training for machine learning tasks and parallel computing for data processing, data generation, feature engineering, and reinforcement learning. In this page, we discuss common challenges in distributed computing and the options available in SageMaker Training and SageMaker Processing. For additional reading materials about distributed computing, see [What Is Distributed Computing?](https://aws.amazon.com/what-is/distributed-computing/).

You can configure ML tasks to run in a distributed manner across multiple nodes (instances), accelerators (NVIDIA GPUs, AWS Trainium chips), and vCPU cores. By running distributed computation, you can achieve a variety of goals such as computing operations faster, handling large datasets, or training large ML models.

The following list covers common challenges that you might face when you run an ML training job at scale.
+ You need to make decisions on how to distribute computation depending on ML tasks, software libraries you want to use, and compute resources.
+ Not all ML tasks are straightforward to distribute. Also, not all ML libraries support distributed computation.
+ Distributed computation might not always result in a linear increase in compute efficiency. In particular, you need to identify whether data I/O and inter-GPU communication become bottlenecks or cause overhead. 
+ Distributed computation might disturb numerical processes and change model accuracy. Specifically, in data-parallel neural network training, when you change the global batch size while scaling up to a larger compute cluster, you also need to adjust the learning rate accordingly.

SageMaker AI provides distributed training solutions to ease such challenges for various use cases. Choose one of the following options that best fits your use case.

**Topics**
+ [Option 1: Use a SageMaker AI built-in algorithm that supports distributed training](#distributed-training-options-1)
+ [Option 2: Run custom ML code in the SageMaker AI managed training or processing environment](#distributed-training-options-2)
+ [Option 3: Write your own custom distributed training code](#distributed-training-options-3)
+ [Option 4: Launch multiple jobs in parallel or sequentially](#distributed-training-options-4)

## Option 1: Use a SageMaker AI built-in algorithm that supports distributed training
<a name="distributed-training-options-1"></a>

SageMaker AI provides [built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html) that you can use out of the box through the SageMaker AI console or the SageMaker Python SDK. With the built-in algorithms, you don't need to spend time on code customization, understanding the science behind the models, or running Docker on provisioned Amazon EC2 instances. 

A subset of the SageMaker AI built-in algorithms support distributed training. To check if the algorithm of your choice supports distributed training, see the **Parallelizable** column in the [Common Information About Built-in Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/common-info-all-im-models.html) table. Some of the algorithms support multi-instance distributed training, while the rest of the parallelizable algorithms support parallelization across multiple GPUs in a single instance, as indicated in the **Parallelizable** column.

## Option 2: Run custom ML code in the SageMaker AI managed training or processing environment
<a name="distributed-training-options-2"></a>

SageMaker AI jobs can instantiate a distributed training environment for specific use cases and frameworks. This environment acts as a ready-to-use whiteboard, where you can bring and run your own ML code. 

### If your ML code uses a deep learning framework
<a name="distributed-training-options-2-1"></a>

You can launch distributed training jobs using the [Deep Learning Containers (DLC)](https://github.com/aws/deep-learning-containers) for SageMaker Training, which you can orchestrate either through the dedicated Python modules in the [SageMaker AI Python SDK](http://sagemaker.readthedocs.io/), or through the SageMaker APIs with the [AWS CLI](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/index.html) or the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html). SageMaker AI provides training containers for machine learning frameworks, including [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/index.html), [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/index.html), [Hugging Face Transformers](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/index.html), and [Apache MXNet](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/index.html). You have two options to write deep learning code for distributed training.
+ **The SageMaker AI distributed training libraries**

  The SageMaker AI distributed training libraries provide AWS-managed code for neural network data parallelism and model parallelism. SageMaker AI distributed training also comes with launcher clients built into the SageMaker Python SDK, so you don't need to author parallel launch code. To learn more, see [SageMaker AI's data parallelism library](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html) and [SageMaker AI's model parallelism library](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).
+ **Open-source distributed training libraries** 

  Open source frameworks have their own distribution mechanisms such as [DistributedDataParallelism (DDP) in PyTorch](https://pytorch.org/docs/stable/notes/ddp.html) or `tf.distribute` modules in TensorFlow. You can choose to run these distributed training frameworks in the SageMaker AI-managed framework containers. For example, the sample code for [training MaskRCNN in SageMaker AI](https://github.com/aws-samples/amazon-sagemaker-cv) shows how to use both PyTorch DDP in the SageMaker AI PyTorch framework container and [Horovod](https://horovod.readthedocs.io/en/stable/) in the SageMaker TensorFlow framework container.

SageMaker AI ML containers also come with [MPI](https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/mpi_on_sagemaker/intro/mpi_demo.ipynb) preinstalled, so you can parallelize your entry point script using [mpi4py](https://mpi4py.readthedocs.io/en/stable/). Using the MPI integrated training containers is a great option when you launch a third-party distributed training launcher or write ad-hoc parallel code in the SageMaker AI managed training environment.

*Notes for data-parallel neural network training on GPUs*
+ **Scale to multi-GPU and multi-machine parallelism when appropriate**

  We often run neural network training jobs on multi-CPU or multi-GPU instances. Each GPU-based instance usually contains multiple GPU devices. Consequently, distributed GPU computing can happen either within a single GPU instance with multiple GPUs (single-node multi-GPU training), or across multiple GPU instances with multiple GPU cores in each (multi-node multi-GPU training). It is easier to write and debug code for single-instance training, and intra-node GPU-to-GPU throughput is usually faster than inter-node GPU-to-GPU throughput. Therefore, it is a good idea to scale data parallelism vertically first (use one GPU instance with multiple GPUs) and expand to multiple GPU instances if needed. This might not apply to cases where the CPU budget is high (for example, a massive workload for data pre-processing) and when the CPU-to-GPU ratio of a multi-GPU instance is too low. In all cases, you need to experiment with different combinations of instance types based on your own ML training needs and workload. 
+ **Monitor the quality of convergence**

  When training a neural network with data parallelism, increasing the number of GPUs while keeping the mini-batch size per GPU constant increases the size of the global mini-batch for the mini-batch stochastic gradient descent (MSGD) process. The size of the mini-batches for MSGD is known to impact the descent noise and convergence. To scale properly while preserving accuracy, you need to adjust other hyperparameters, such as the learning rate [[Goyal et al.](https://arxiv.org/abs/1706.02677) (2017)].
+ **Monitor I/O bottlenecks**

  As you increase the number of GPUs, the throughput for reading and writing storage should also increase. Make sure that your data source and pipeline don’t become bottlenecks.
+ **Modify your training script as needed**

  Training scripts written for single-GPU training must be modified for multi-node multi-GPU training. In most data parallelism libraries, script modification is required to do the following.
  + Assign batches of training data to each GPU.
  + Use an optimizer that can deal with gradient computation and parameter updates across multiple GPUs.
  + Assign responsibility of checkpointing to a specific host and GPU. 
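
The learning-rate adjustment mentioned under the convergence note above is commonly done with the linear scaling rule from Goyal et al. (2017): scale the base learning rate by the ratio of the new global batch size to the baseline batch size. A minimal sketch:

```
def scaled_learning_rate(base_lr, base_global_batch, per_gpu_batch, num_gpus):
    """Linear scaling rule: the learning rate grows with the global batch size."""
    global_batch = per_gpu_batch * num_gpus
    return base_lr * global_batch / base_global_batch

# Example: a baseline lr of 0.1 at global batch 256, scaled to 32 per GPU on
# 64 GPUs (global batch 2048), gives 0.1 * 2048 / 256 = 0.8
```

A learning-rate warmup over the first few epochs is typically combined with this rule to keep early training stable.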

   

### If your ML code involves tabular data processing
<a name="distributed-training-options-2-2"></a>

PySpark is a Python frontend of Apache Spark, which is an open-source distributed computing framework. PySpark has been widely adopted for distributed tabular data processing for large-scale production workloads. If you want to run tabular data processing code, consider using the [SageMaker Processing PySpark containers](https://docs.aws.amazon.com/sagemaker/latest/dg/use-spark-processing-container.html) and running parallel jobs. You can also run data processing jobs in parallel using SageMaker Training and SageMaker Processing APIs in Amazon SageMaker Studio Classic, which is integrated with [Amazon EMR](https://aws.amazon.com/blogs/machine-learning/part-1-create-and-manage-amazon-emr-clusters-from-sagemaker-studio-to-run-interactive-spark-and-ml-workloads/) and [AWS Glue](https://aws.amazon.com/about-aws/whats-new/2022/09/sagemaker-studio-supports-glue-interactive-sessions/?nc1=h_ls).

## Option 3: Write your own custom distributed training code
<a name="distributed-training-options-3"></a>

When you submit a training or processing job to SageMaker AI, the SageMaker Training and SageMaker AI Processing APIs launch Amazon EC2 compute instances. You can customize the training and processing environments on the instances by running your own Docker container or by installing additional libraries in the AWS managed containers. For more information about Docker with SageMaker Training, see [Adapting your own Docker container to work with SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-adapt-your-own.html) and [Create a container with your own algorithms and models](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-create.html). For more information about Docker with SageMaker AI Processing, see [Use Your Own Processing Code](https://docs.aws.amazon.com/sagemaker/latest/dg/use-your-own-processing-code.html).

Every SageMaker training job environment contains a configuration file at `/opt/ml/input/config/resourceconfig.json`, and every SageMaker Processing job environment contains a similar configuration file at `/opt/ml/config/resourceconfig.json`. Your code can read this file to find `hostnames` and establish inter-node communications. To learn more, including the schema of the JSON file, see [Distributed Training Configuration](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html#your-algorithms-training-algo-running-container-dist-training) and [How Amazon SageMaker Processing Configures Your Processing Container](https://docs.aws.amazon.com/sagemaker/latest/dg/build-your-own-processing-container.html#byoc-config). You can also install and use third-party distributed computing libraries such as [Ray](https://github.com/aws-samples/aws-samples-for-ray/tree/main/sagemaker) or DeepSpeed in SageMaker AI.
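
For instance, a script can read the resource configuration to discover the cluster layout and assign a leader role deterministically. The helper below takes the file path as a parameter for clarity; the `current_host` and `hosts` keys follow the documented training-job schema:

```
import json

def read_cluster_layout(path="/opt/ml/input/config/resourceconfig.json"):
    """Return (current_host, sorted list of all hosts) from the resource config."""
    with open(path) as f:
        config = json.load(f)
    return config["current_host"], sorted(config["hosts"])

# For example, the first host in sorted order can act as the rendezvous leader:
# current, hosts = read_cluster_layout()
# is_leader = current == hosts[0]
```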

You can also use SageMaker Training and SageMaker Processing to run custom distributed computations that do not require inter-worker communication. In the computing literature, those tasks are often described as *embarrassingly parallel* or *share-nothing*. Examples include parallel processing of data files, training models in parallel on different configurations, or running batch inference on a collection of records. You can trivially parallelize such share-nothing use cases with Amazon SageMaker AI. When you launch a SageMaker Training or SageMaker Processing job on a cluster with multiple nodes, SageMaker AI by default replicates and launches your training code (in Python or Docker) on all the nodes. Tasks that require spreading input data across multiple nodes can be facilitated by setting `S3DataDistributionType=ShardedByS3Key` in the data input configuration of the SageMaker AI `TrainingInput` API. 
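
Conceptually, `ShardedByS3Key` gives each node a disjoint subset of the S3 objects, instead of the full copy that the default `FullyReplicated` setting provides. The round-robin assignment below only illustrates that property; it is not SageMaker AI's actual placement algorithm:

```
def shard_by_key(object_keys, num_hosts):
    """Illustrative sharding: each S3 object key is assigned to exactly one host."""
    shards = [[] for _ in range(num_hosts)]
    for i, key in enumerate(sorted(object_keys)):
        shards[i % num_hosts].append(key)
    return shards

# Example: four objects spread over two hosts
# shard_by_key(["d", "a", "b", "c"], 2) -> [["a", "c"], ["b", "d"]]
```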

## Option 4: Launch multiple jobs in parallel or sequentially
<a name="distributed-training-options-4"></a>

You can also distribute an ML compute workflow into smaller parallel or sequential compute tasks, each represented by its own SageMaker Training or SageMaker Processing job. Splitting a task into multiple jobs can be beneficial for the following situations or tasks:
+ When you have specific [data channels](https://docs.aws.amazon.com/sagemaker/latest/dg/model-train-storage.html) and metadata entries (such as hyperparameters, model configuration, or instance types) for each sub-task.
+ When you implement retry steps at a sub-task level.
+ When you vary the configuration of the sub-tasks over the course of the workload, such as when training on increasing batch sizes.
+ When you need to run an ML task that takes longer than the maximum training time allowed for a single training job (28 days maximum).
+ When different steps of a compute workflow require different instance types.

For the specific case of hyperparameter search, use [SageMaker AI Automated Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html). SageMaker AI Automated Model Tuning is a serverless parameter search orchestrator that launches multiple training jobs on your behalf, according to a search logic that can be random, Bayesian, or HyperBand. 

Additionally, to orchestrate multiple training jobs, you can also consider workflow orchestration tools, such as [SageMaker Pipelines](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-pipelines/index.html), [AWS Step Functions](https://docs.aws.amazon.com/step-functions/latest/dg/connect-sagemaker.html), and Apache Airflow supported by [Amazon Managed Workflows for Apache Airflow (MWAA)](https://aws.amazon.com/managed-workflows-for-apache-airflow/) and [SageMaker AI Workflows](https://sagemaker.readthedocs.io/en/stable/workflows/airflow/using_workflow.html). 

# Amazon SageMaker Training Compiler
<a name="training-compiler"></a>

**Important**  
Amazon Web Services (AWS) announces that there will be no new releases or versions of SageMaker Training Compiler. You can continue to utilize SageMaker Training Compiler through the existing AWS Deep Learning Containers (DLCs) for SageMaker Training. It is important to note that while the existing DLCs remain accessible, they will no longer receive patches or updates from AWS, in accordance with the [AWS Deep Learning Containers Framework Support Policy](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/support-policy.html).

Use Amazon SageMaker Training Compiler to train deep learning (DL) models faster on scalable GPU instances managed by SageMaker AI.

## What Is SageMaker Training Compiler?
<a name="training-compiler-what-is"></a>

State-of-the-art deep learning (DL) models consist of complex multi-layered neural networks with billions of parameters that can take thousands of GPU hours to train. Optimizing such models on training infrastructure requires extensive knowledge of DL and systems engineering; this is challenging even for narrow use cases. Although there are open-source implementations of compilers that optimize the DL training process, they can lack the flexibility to integrate DL frameworks with some hardware such as GPU instances.

SageMaker Training Compiler is a capability of SageMaker AI that makes these hard-to-implement optimizations to reduce training time on GPU instances. The compiler optimizes DL models to accelerate training by more efficiently using SageMaker AI machine learning (ML) GPU instances. SageMaker Training Compiler is available at no additional charge within SageMaker AI and can help reduce total billable time as it accelerates training.

![\[A conceptual diagram of how SageMaker Training Compiler works with SageMaker AI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/training-compiler-marketing-diagram.png)


SageMaker Training Compiler is integrated into the AWS Deep Learning Containers (DLCs). Using the SageMaker Training Compiler–enabled AWS DLCs, you can compile and optimize training jobs on GPU instances with minimal changes to your code. Bring your deep learning models to SageMaker AI and enable SageMaker Training Compiler to accelerate the speed of your training job on SageMaker AI ML instances for accelerated computing.

## How It Works
<a name="training-compiler-how-it-works"></a>

SageMaker Training Compiler converts DL models from their high-level language representation to hardware-optimized instructions. Specifically, SageMaker Training Compiler applies graph-level optimizations, dataflow-level optimizations, and backend optimizations to produce an optimized model that efficiently uses hardware resources. As a result, you can train your models faster than when you train them without compilation.

Activating SageMaker Training Compiler for your training job is a two-step process:

1. Bring your own DL script and, if needed, adapt it to compile and train with SageMaker Training Compiler. To learn more, see [Bring Your Own Deep Learning Model](training-compiler-modify-scripts.md).

1. Create a SageMaker AI estimator object with the compiler configuration parameter using the SageMaker Python SDK.

   1. Turn on SageMaker Training Compiler by adding `compiler_config=TrainingCompilerConfig()` to the SageMaker AI estimator class.

   1. Adjust hyperparameters (`batch_size` and `learning_rate`) to maximize the benefit that SageMaker Training Compiler provides.

      Compilation through SageMaker Training Compiler changes the memory footprint of the model. Most commonly, this manifests as a reduction in memory utilization and a consequent increase in the largest batch size that can fit on the GPU. In some cases, the compiler intelligently promotes caching, which leads to a decrease in the largest batch size that can fit on the GPU.

      For reference `batch_size` values tested for popular models, see [Tested Models](training-compiler-support.md#training-compiler-tested-models).

      When you adjust the batch size, you also have to adjust the `learning_rate` appropriately. For best practices for adjusting the learning rate along with the change in batch size, see [SageMaker Training Compiler Best Practices and Considerations](training-compiler-tips-pitfalls.md).

   1. When you run the `estimator.fit()` method, SageMaker AI compiles your model and starts the training job.

   For instructions on how to launch a training job, see [Enable SageMaker Training Compiler](training-compiler-enable.md).
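The two steps above can be sketched with the SageMaker Python SDK as follows. This is a minimal sketch, not a complete job definition: the entry point script name, IAM role ARN, instance type, framework versions, and hyperparameter values are placeholder assumptions that you must replace with values valid for your account.

```
# Minimal sketch of enabling SageMaker Training Compiler (placeholder values).
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

estimator = HuggingFace(
    entry_point="train.py",                    # your adapted training script (assumed name)
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role ARN
    instance_type="ml.p3.2xlarge",             # a supported GPU instance type
    instance_count=1,
    transformers_version="4.21",               # pick a tested framework combination
    pytorch_version="1.11",
    py_version="py38",
    compiler_config=TrainingCompilerConfig(),  # turns on SageMaker Training Compiler
    hyperparameters={"batch_size": 24, "learning_rate": 5e-5},
)

estimator.fit()  # compiles the model and starts the training job
```

Running `estimator.fit()` requires an AWS environment with SageMaker AI permissions; the snippet illustrates only the shape of the configuration.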

SageMaker Training Compiler does not alter the final trained model; it accelerates the training job by using GPU memory more efficiently and fitting a larger batch size per iteration. The final trained model from a compiler-accelerated training job is identical to the one from an ordinary training job.

**Tip**  
SageMaker Training Compiler only compiles DL models for training on [supported GPU instances](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html#training-compiler-supported-instance-types) managed by SageMaker AI. To compile your model for inference and deploy it to run anywhere in the cloud and at the edge, use [SageMaker Neo compiler](https://docs.aws.amazon.com/sagemaker/latest/dg/neo.html).

**Topics**
+ [What Is SageMaker Training Compiler?](#training-compiler-what-is)
+ [How It Works](#training-compiler-how-it-works)
+ [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md)
+ [Bring Your Own Deep Learning Model](training-compiler-modify-scripts.md)
+ [Enable SageMaker Training Compiler](training-compiler-enable.md)
+ [SageMaker Training Compiler Example Notebooks and Blogs](training-compiler-examples-and-blogs.md)
+ [SageMaker Training Compiler Best Practices and Considerations](training-compiler-tips-pitfalls.md)
+ [SageMaker Training Compiler FAQ](training-compiler-faq.md)
+ [SageMaker Training Compiler Troubleshooting](training-compiler-troubleshooting.md)
+ [Amazon SageMaker Training Compiler Release Notes](training-compiler-release-notes.md)

# Supported Frameworks, AWS Regions, Instance Types, and Tested Models
<a name="training-compiler-support"></a>

**Important**  
Amazon Web Services (AWS) has announced that there will be no new releases or versions of SageMaker Training Compiler. You can continue to use SageMaker Training Compiler through the existing AWS Deep Learning Containers (DLCs) for SageMaker Training. Note that while the existing DLCs remain accessible, they no longer receive patches or updates from AWS, in accordance with the [AWS Deep Learning Containers Framework Support Policy](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/support-policy.html).

Before using SageMaker Training Compiler, check that your framework of choice is supported, that the instance types are available in your AWS account, and that your AWS account is in one of the supported AWS Regions.

**Note**  
SageMaker Training Compiler is available in the SageMaker Python SDK v2.70.0 or later.

## Supported Frameworks
<a name="training-compiler-supported-frameworks"></a>

SageMaker Training Compiler supports the following deep learning frameworks and is available through AWS Deep Learning Containers.

**Topics**
+ [PyTorch](#training-compiler-supported-frameworks-pytorch)
+ [TensorFlow](#training-compiler-supported-frameworks-tensorflow)

### PyTorch
<a name="training-compiler-supported-frameworks-pytorch"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html)

### TensorFlow
<a name="training-compiler-supported-frameworks-tensorflow"></a>

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html)

For more information, see [Available Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) in the *AWS Deep Learning Containers GitHub repository*.

## AWS Regions
<a name="training-compiler-availablity-zone"></a>

The [SageMaker Training Compiler Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-training-compiler-containers) are available in the AWS Regions where [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) are in service, except for the China Regions.

## Supported Instance Types
<a name="training-compiler-supported-instance-types"></a>

SageMaker Training Compiler is tested on and supports the following ML instance types.
+ P4 instances
+ P3 instances
+ G4dn instances
+ G5 instances

For specs of the instance types, see the **Accelerated Computing** section in the [Amazon EC2 Instance Types page](https://aws.amazon.com/ec2/instance-types/). For information about instance pricing, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

If you encounter an error message similar to the following, follow the instructions at [Request a service quota increase for SageMaker AI resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure).

```
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling
the CreateTrainingJob operation: The account-level service limit 'ml.p3dn.24xlarge
for training job usage' is 0 Instances, with current utilization of 0 Instances
and a request delta of 1 Instances.
Please contact AWS support to request an increase for this limit.
```

## Tested Models
<a name="training-compiler-tested-models"></a>

The following tables list the models that have been tested with SageMaker Training Compiler. For reference, the largest batch size that fits into memory is also included alongside other training parameters. SageMaker Training Compiler can change the memory footprint of the model training process; as a result, a larger batch size can often be used during training, further decreasing total training time. In some cases, SageMaker Training Compiler intelligently promotes caching, which leads to a decrease in the largest batch size that can fit on the GPU. You must retune your model hyperparameters and find an optimal batch size for your case. To save time, use the following reference tables to look up a batch size that can be a good starting point for your use case.

**Note**  
The batch sizes are local batch sizes that fit into each individual GPU in the respective instance type. You should also adjust the learning rate when changing the batch size.
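The note above can be made concrete with a small helper. The scaling rules below are common heuristics, not a SageMaker AI requirement; which rule works best depends on your model and optimizer, so validate the result empirically.

```python
import math

def scale_learning_rate(lr, old_batch_size, new_batch_size, rule="linear"):
    """Scale a learning rate when the batch size changes.

    'linear' multiplies lr by the batch-size ratio; 'sqrt' multiplies it
    by the square root of the ratio. Both are heuristics to validate.
    """
    ratio = new_batch_size / old_batch_size
    if rule == "linear":
        return lr * ratio
    if rule == "sqrt":
        return lr * math.sqrt(ratio)
    raise ValueError(f"unknown rule: {rule}")

# Example: native batch size 80 at lr 5e-5, compiler-enabled batch size 192.
new_lr = scale_learning_rate(5e-5, 80, 192, rule="linear")  # scales 5e-5 by 2.4
```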

### PyTorch 1.13.1
<a name="training-compiler-tested-models-pt1131"></a>

**Natural language processing (NLP) models**

The following models were tested for training jobs with all combinations of single-node and multi-node, single-GPU and multi-GPU configurations, with Automatic Mixed Precision (AMP) as indicated.


| Model | Dataset | Instance type | Precision | Sequence Length | Batch size for native frameworks | Batch size for SageMaker Training Compiler | 
| --- | --- | --- | --- | --- | --- | --- | 
| albert-base-v2 | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 80 | 192 | 
| albert-base-v2 | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 128 | 332 | 
| albert-base-v2 | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 80 | 224 | 
| bert-base-uncased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 160 | 288 | 
| camembert-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 160 | 280 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 240 | 472 | 
| distilgpt2 | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 77 | 128 | 
| distilgpt2 | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 138 | 390 | 
| distilgpt2 | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 96 | 256 | 
| distilroberta-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 96 | 192 | 
| distilroberta-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 171 | 380 | 
| distilroberta-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 112 | 256 | 
| gpt2 | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 52 | 152 | 
| gpt2 | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 84 | 240 | 
| gpt2 | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 58 | 164 | 
| microsoft/deberta-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 48 | 128 | 
| microsoft/deberta-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 84 | 207 | 
| microsoft/deberta-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 53 | 133 | 
| roberta-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 125 | 224 | 
| xlm-roberta-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 16 | 31 | 
| xlm-roberta-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 18 | 50 | 
| xlnet-base-cased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 128 | 240 | 
| bert-base-uncased | wikitext-103-v1 | g5.48xlarge | float16 | 512 | 29 | 50 | 
| distilbert-base-uncased | wikitext-103-v1 | g5.48xlarge | float16 | 512 | 45 | 64 | 
| gpt2 | wikitext-103-v1 | g5.48xlarge | float16 | 512 | 18 | 45 | 
| roberta-base | wikitext-103-v1 | g5.48xlarge | float16 | 512 | 23 | 44 | 
| gpt2 | wikitext-103-v1 | p4d.24xlarge | float16 | 512 | 36 | 64 | 

**Computer Vision (CV) models**

Tested using [TensorFlow Model Garden](https://github.com/tensorflow/models) with Automatic Mixed Precision (AMP) as indicated.


**Single/multi-node single/multi-GPU**

| Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for SageMaker Training Compiler | 
| --- | --- | --- | --- | --- | --- | 
| ResNet152 | food101 | g4dn.16xlarge | float16 | 128 | 144 | 
| ResNet152 | food101 | g5.4xlarge | float16 | 128 | 192 | 
| ResNet152 | food101 | p3.2xlarge | float16 | 152 | 156 | 
| ViT | food101 | g4dn.16xlarge | float16 | 512 | 512 | 
| ViT | food101 | g5.4xlarge | float16 | 992 | 768 | 
| ViT | food101 | p3.2xlarge | float16 | 848 | 768 | 

### PyTorch 1.12.0
<a name="training-compiler-tested-models-pt1120"></a>

**Natural language processing (NLP) models**

The following models were tested for training jobs with all combinations of single-node and multi-node, single-GPU and multi-GPU configurations, with Automatic Mixed Precision (AMP) as indicated.


| Model | Dataset | Instance type | Precision | Sequence Length | Batch size for native frameworks | Batch size for SageMaker Training Compiler | 
| --- | --- | --- | --- | --- | --- | --- | 
| albert-base-v2 | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 128 | 248 | 
| bert-base-uncased | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 160 | 288 | 
| camembert-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 160 | 279 | 
| camembert-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 105 | 164 | 
| distilgpt2 | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 136 | 256 | 
| distilgpt2 | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 80 | 118 | 
| gpt2 | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 84 | 240 | 
| gpt2 | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 80 | 119 | 
| microsoft/deberta-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 93 | 197 | 
| microsoft/deberta-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 113 | 130 | 
| roberta-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 125 | 224 | 
| roberta-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 78 | 112 | 
| xlnet-base-cased | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 138 | 240 | 
| bert-base-uncased | wikitext-103-v1 | ml.p4d.24xlarge | float16 | 512 |  | 52 | 
| distilbert-base-uncased | wikitext-103-v1 | ml.p4d.24xlarge | float16 | 512 |  | 160 | 
| gpt2 | wikitext-103-v1 | ml.p4d.24xlarge | float16 | 512 |  | 25 | 
| roberta-base | wikitext-103-v1 | ml.p4d.24xlarge | float16 | 512 |  | 64 | 

### TensorFlow 2.11.0
<a name="training-compiler-tested-models-tf2110"></a>

**Computer Vision (CV) models**

Tested using [TensorFlow Model Garden](https://github.com/tensorflow/models) with Automatic Mixed Precision (AMP) as indicated.


**Single/multi-node single/multi-GPU**

| Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for SageMaker Training Compiler | 
| --- | --- | --- | --- | --- | --- | 
| MaskRCNN-ResNet50-FPN | COCO-2017 | ml.g5.2xlarge | float16 | 6 | 8 | 
| MaskRCNN-ResNet50-FPN | COCO-2017 | ml.p3.2xlarge | float16 | 4 | 6 | 
| ResNet50 | ImageNet | ml.g5.2xlarge | float16 | 192 | 256 | 
| ResNet50 | ImageNet | ml.p3.2xlarge | float16 | 256 | 256 | 
| ResNet101 | ImageNet | ml.g5.2xlarge | float16 | 128 | 256 | 
| ResNet101 | ImageNet | ml.p3.2xlarge | float16 | 128 | 128 | 
| ResNet152 | ImageNet | ml.g5.2xlarge | float16 | 128 | 224 | 
| ResNet152 | ImageNet | ml.p3.2xlarge | float16 | 128 | 128 | 
| VisionTransformer | ImageNet | ml.g5.2xlarge | float16 | 112 | 144 | 
| VisionTransformer | ImageNet | ml.p3.2xlarge | float16 | 96 | 128 | 

**Natural Language Processing (NLP) models**

Tested using [Transformer models](https://github.com/huggingface/transformers) with `Sequence_Len=128` and Automatic Mixed Precision (AMP) as indicated.


**Single/multi-node single/multi-GPU**

| Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for SageMaker Training Compiler | 
| --- | --- | --- | --- | --- | --- | 
| albert-base-v2 | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 160 | 197 | 
| albert-base-v2 | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 95 | 127 | 
| bert-base-uncased | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 160 | 128 | 
| bert-base-uncased | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 104 | 111 | 
| bert-large-uncased | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 65 | 48 | 
| bert-large-uncased | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 40 | 35 | 
| camembert-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 162 | 
| camembert-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 105 | 111 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 256 | 264 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 128 | 169 | 
| gpt2 | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 120 | 
| gpt2 | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 80 | 83 | 
| jplu/tf-xlm-roberta-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 32 | 32 | 
| jplu/tf-xlm-roberta-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 32 | 36 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 144 | 160 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 106 | 110 | 
| roberta-base | wikitext-2-raw-v1 | ml.g5.2xlarge | float16 | 128 | 128 | 
| roberta-base | wikitext-2-raw-v1 | ml.p3.2xlarge | float16 | 72 | 98 | 
| albert-base-v2 | wikitext-2-raw-v1 | ml.g5.48xlarge | float16 | 128 | 192 | 
| albert-base-v2 | wikitext-2-raw-v1 | ml.p3.16xlarge | float16 | 95 | 96 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | ml.g5.48xlarge | float16 | 256 | 256 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | ml.p3.16xlarge | float16 | 140 | 184 | 
| google/electra-small-discriminator | wikitext-2-raw-v1 | ml.g5.48xlarge | float16 | 256 | 384 | 
| google/electra-small-discriminator | wikitext-2-raw-v1 | ml.p3.16xlarge | float16 | 256 | 268 | 
| gpt2 | wikitext-2-raw-v1 | ml.g5.48xlarge | float16 | 116 | 116 | 
| gpt2 | wikitext-2-raw-v1 | ml.p3.16xlarge | float16 | 85 | 83 | 
| gpt2 | wikitext-2-raw-v1 | ml.p4d.24xlarge | float16 | 94 | 110 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | ml.g5.48xlarge | float16 | 187 | 164 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | ml.p3.16xlarge | float16 | 106 | 111 | 

### TensorFlow 2.10.0
<a name="training-compiler-tested-models-tf2100"></a>

**Computer Vision (CV) models**

Tested using [TensorFlow Model Garden](https://github.com/tensorflow/models) with Automatic Mixed Precision (AMP) as indicated.


**Single-node single-GPU/multi-GPU**

| Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for SageMaker Training Compiler | 
| --- | --- | --- | --- | --- | --- | 
| DetectionTransformer-ResNet50 | COCO-2017 | ml.g4dn.2xlarge | float32 | 2 | 4 | 
| DetectionTransformer-ResNet50 | COCO-2017 | ml.g5.2xlarge | float32 | 3 | 6 | 
| DetectionTransformer-ResNet50 | COCO-2017 | ml.p3.2xlarge | float32 | 2 | 4 | 
| MaskRCNN-ResNet50-FPN | COCO-2017 | ml.g4dn.2xlarge | float16 | 4 | 6 | 
| MaskRCNN-ResNet50-FPN | COCO-2017 | ml.g5.2xlarge | float16 | 6 | 8 | 
| MaskRCNN-ResNet50-FPN | COCO-2017 | ml.g5.48xlarge | float16 | 48 | 64 | 
| MaskRCNN-ResNet50-FPN | COCO-2017 | ml.p3.2xlarge | float16 | 4 | 6 | 
| ResNet50 | ImageNet | ml.g4dn.2xlarge | float16 | 224 | 256 | 
| ResNet50 | ImageNet | ml.g5.2xlarge | float16 | 192 | 160 | 
| ResNet50 | ImageNet | ml.g5.48xlarge | float16 | 2048 | 2048 | 
| ResNet50 | ImageNet | ml.p3.2xlarge | float16 | 224 | 160 | 
| ResNet101 | ImageNet | ml.g4dn.2xlarge | float16 | 160 | 128 | 
| ResNet101 | ImageNet | ml.g5.2xlarge | float16 | 192 | 256 | 
| ResNet101 | ImageNet | ml.g5.48xlarge | float16 | 2048 | 2048 | 
| ResNet101 | ImageNet | ml.p3.2xlarge | float16 | 160 | 224 | 
| ResNet152 | ImageNet | ml.g4dn.2xlarge | float16 | 128 | 128 | 
| ResNet152 | ImageNet | ml.g5.2xlarge | float16 | 192 | 224 | 
| ResNet152 | ImageNet | ml.g5.48xlarge | float16 | 1536 | 1792 | 
| ResNet152 | ImageNet | ml.p3.2xlarge | float16 | 128 | 160 | 
| VisionTransformer | ImageNet | ml.g4dn.2xlarge | float16 | 80 | 128 | 
| VisionTransformer | ImageNet | ml.g5.2xlarge | float16 | 112 | 144 | 
| VisionTransformer | ImageNet | ml.g5.48xlarge | float16 | 896 | 1152 | 
| VisionTransformer | ImageNet | ml.p3.2xlarge | float16 | 80 | 128 | 

**Natural Language Processing (NLP) models**

Tested using [Transformer models](https://github.com/huggingface/transformers) with `Sequence_Len=128` and Automatic Mixed Precision (AMP) as indicated.


**Single-node single-GPU/multi-GPU**

| Model | Dataset | Instance type | Precision | Batch size for native frameworks | Batch size for SageMaker Training Compiler | 
| --- | --- | --- | --- | --- | --- | 
| albert-base-v2 | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 128 | 112 | 
| albert-base-v2 | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 128 | 
| albert-base-v2 | wikitext-2-raw-v1 | p3.8xlarge | float16 | 128 | 135 | 
| albert-base-v2 | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 191 | 
| bert-base-uncased | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 64 | 94 | 
| bert-base-uncased | wikitext-2-raw-v1 | p3.2xlarge | float16 | 96 | 101 | 
| bert-base-uncased | wikitext-2-raw-v1 | p3.8xlarge | float16 | 96 | 96 | 
| bert-base-uncased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 128 | 
| bert-large-uncased | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 35 | 21 | 
| bert-large-uncased | wikitext-2-raw-v1 | p3.2xlarge | float16 | 39 | 26 | 
| bert-large-uncased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 60 | 50 | 
| camembert-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 96 | 90 | 
| camembert-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 96 | 98 | 
| camembert-base | wikitext-2-raw-v1 | p3.8xlarge | float16 | 96 | 96 | 
| camembert-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 128 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 256 | 160 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | p3.2xlarge | float16 | 128 | 176 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | p3.8xlarge | float16 | 128 | 160 | 
| distilbert-base-uncased | wikitext-2-raw-v1 | g5.4xlarge | float16 | 256 | 258 | 
| google/electra-small-discriminator | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 256 | 216 | 
| google/electra-small-discriminator | wikitext-2-raw-v1 | p3.2xlarge | float16 | 256 | 230 | 
| google/electra-small-discriminator | wikitext-2-raw-v1 | p3.8xlarge | float16 | 256 | 224 | 
| google/electra-small-discriminator | wikitext-2-raw-v1 | g5.4xlarge | float16 | 256 | 320 | 
| gpt2 | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 80 | 64 | 
| gpt2 | wikitext-2-raw-v1 | p3.2xlarge | float16 | 80 | 77 | 
| gpt2 | wikitext-2-raw-v1 | p3.8xlarge | float16 | 80 | 72 | 
| gpt2 | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 120 | 
| jplu/tf-xlm-roberta-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 28 | 24 | 
| jplu/tf-xlm-roberta-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 32 | 24 | 
| jplu/tf-xlm-roberta-base | wikitext-2-raw-v1 | p3.8xlarge | float16 | 32 | 26 | 
| jplu/tf-xlm-roberta-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 66 | 52 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 96 | 92 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 96 | 101 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | p3.8xlarge | float16 | 96 | 101 | 
| microsoft/mpnet-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 152 | 
| roberta-base | wikitext-2-raw-v1 | g4dn.16xlarge | float16 | 64 | 72 | 
| roberta-base | wikitext-2-raw-v1 | p3.2xlarge | float16 | 64 | 84 | 
| roberta-base | wikitext-2-raw-v1 | p3.8xlarge | float16 | 64 | 86 | 
| roberta-base | wikitext-2-raw-v1 | g5.4xlarge | float16 | 128 | 128 | 

### TensorFlow 2.9.1
<a name="training-compiler-tested-models-tf291"></a>

Tested using [TensorFlow Model Garden](https://github.com/tensorflow/models) with Automatic Mixed Precision (AMP).

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html)

\* The batch sizes marked with the asterisk symbol (\*) indicate the largest batch size tested by the SageMaker Training Compiler developer team. For the marked cells, the instance may be able to fit a larger batch size than what is indicated.

### Transformers 4.21.1 with PyTorch 1.11.0
<a name="training-compiler-tested-models-hf421-pt111"></a>

Tested with `Sequence_Len=512` and Automatic Mixed Precision (AMP).

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html)

### Transformers 4.17.0 with PyTorch 1.10.2
<a name="training-compiler-tested-models-hf417-pt110"></a>

Tested with `Sequence_Len=512` and Automatic Mixed Precision (AMP).

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html)

### Transformers 4.11.0 with PyTorch 1.9.0
<a name="training-compiler-tested-models-hf411-pt190"></a>

Tested with `Sequence_Len=512` and Automatic Mixed Precision (AMP).


**Single-node single-GPU**

| Model | Instance type | Batch size for native | Batch size for Training Compiler | 
| --- | --- | --- | --- | 
| albert-base-v2  | ml.p3.2xlarge | 12 | 32 | 
| bert-base-cased  | ml.p3.2xlarge | 14 | 24 | 
| bert-base-chinese | ml.p3.2xlarge | 16 | 24 | 
| bert-base-multilingual-cased  | ml.p3.2xlarge | 4 | 16 | 
| bert-base-multilingual-uncased  | ml.p3.2xlarge | 8 | 16 | 
| bert-base-uncased  | ml.p3.2xlarge | 12 | 24 | 
| cl-tohoku/bert-base-japanese-whole-word-masking | ml.p3.2xlarge | 12 | 24 | 
| cl-tohoku/bert-base-japanese  | ml.p3.2xlarge | 12 | 24 | 
| distilbert-base-uncased  | ml.p3.2xlarge | 28 | 32 | 
| distilbert-base-uncased-finetuned-sst-2-english | ml.p3.2xlarge | 28 | 32 | 
| distilgpt2  | ml.p3.2xlarge | 16 | 32 | 
| facebook/bart-base  | ml.p3.2xlarge | 4 | 8 | 
| gpt2 | ml.p3.2xlarge | 6 | 20 | 
| nreimers/MiniLMv2-L6-H384-distilled-from-RoBERTa-Large  | ml.p3.2xlarge | 20 | 32 | 
| roberta-base  | ml.p3.2xlarge | 12 | 20 | 


**Single-node multi-GPU**

| Model | Instance type | Batch size for native | Batch size for Training Compiler | 
| --- | --- | --- | --- | 
| bert-base-chinese  | ml.p3.8xlarge | 16 | 26 | 
| bert-base-multilingual-cased  | ml.p3.8xlarge | 6 | 16 | 
| bert-base-multilingual-uncased | ml.p3.8xlarge | 6 | 16 | 
| bert-base-uncased  | ml.p3.8xlarge | 14 | 24 | 
| distilbert-base-uncased  | ml.p3.8xlarge | 14 | 32 | 
| distilgpt2 | ml.p3.8xlarge | 6 | 32 | 
| facebook/bart-base | ml.p3.8xlarge | 8 | 16 | 
| gpt2  | ml.p3.8xlarge | 8 | 20 | 
| roberta-base  | ml.p3.8xlarge | 12 | 20 | 

### Transformers 4.17.0 with TensorFlow 2.6.3
<a name="training-compiler-tested-models-hf417-tf263"></a>

Tested with `Sequence_Len=128` and Automatic Mixed Precision (AMP).


| Model  | Instance type | Batch size for native frameworks | Batch size for Training Compiler | 
| --- | --- | --- | --- | 
| albert-base-v2 | ml.g4dn.16xlarge | 136 | 208 | 
| albert-base-v2 | ml.g5.4xlarge | 219 | 312 | 
| albert-base-v2 | ml.p3.2xlarge | 152 | 208 | 
| albert-base-v2 | ml.p3.8xlarge | 152 | 192 | 
| bert-base-uncased | ml.g4dn.16xlarge | 120 | 101 | 
| bert-base-uncased | ml.g5.4xlarge | 184 | 160 | 
| bert-base-uncased | ml.p3.2xlarge | 128 | 108 | 
| bert-large-uncased | ml.g4dn.16xlarge | 37 | 28 | 
| bert-large-uncased | ml.g5.4xlarge | 64 | 55 | 
| bert-large-uncased | ml.p3.2xlarge | 40 | 32 | 
| camembert-base | ml.g4dn.16xlarge | 96 | 100 | 
| camembert-base | ml.g5.4xlarge | 190 | 160 | 
| camembert-base | ml.p3.2xlarge | 129 | 108 | 
| camembert-base | ml.p3.8xlarge | 128 | 104 | 
| distilbert-base-uncased | ml.g4dn.16xlarge | 210 | 160 | 
| distilbert-base-uncased | ml.g5.4xlarge | 327 | 288 | 
| distilbert-base-uncased | ml.p3.2xlarge | 224 | 196 | 
| distilbert-base-uncased | ml.p3.8xlarge | 192 | 182 | 
| google/electra-small-discriminator | ml.g4dn.16xlarge | 336 | 288 | 
| google/electra-small-discriminator | ml.g5.4xlarge | 504 | 384 | 
| google/electra-small-discriminator | ml.p3.2xlarge | 352 | 323 | 
| gpt2 | ml.g4dn.16xlarge | 89 | 64 | 
| gpt2 | ml.g5.4xlarge | 140 | 146 | 
| gpt2 | ml.p3.2xlarge | 94 | 96 | 
| gpt2 | ml.p3.8xlarge | 96 | 88 | 
| jplu/tf-xlm-roberta-base | ml.g4dn.16xlarge | 52 | 16 | 
| jplu/tf-xlm-roberta-base | ml.g5.4xlarge | 64 | 44 | 
| microsoft/mpnet-base | ml.g4dn.16xlarge | 120 | 100 | 
| microsoft/mpnet-base | ml.g5.4xlarge | 192 | 160 | 
| microsoft/mpnet-base | ml.p3.2xlarge | 128 | 104 | 
| microsoft/mpnet-base | ml.p3.8xlarge | 130 | 92 | 
| roberta-base | ml.g4dn.16xlarge | 108 | 64 | 
| roberta-base | ml.g5.4xlarge | 176 | 142 | 
| roberta-base | ml.p3.2xlarge | 118 | 100 | 
| roberta-base | ml.p3.8xlarge | 112 | 88 | 

### Transformers 4.11.0 with TensorFlow 2.5.1
<a name="training-compiler-tested-models-hf411-tf251"></a>

Tested with `Sequence_Len=128` and Automatic Mixed Precision (AMP).


**Single-node single-GPU**

| Model | Instance type | Batch size for native | Batch size for Training Compiler | 
| --- | --- | --- | --- | 
| albert-base-v2  | ml.p3.2xlarge | 128 | 128 | 
| bart-base  | ml.p3.2xlarge | 12 | 64 | 
| bart-large  | ml.p3.2xlarge | 4 | 28 | 
| bert-base-cased  | ml.p3.2xlarge | 16 | 128 | 
| bert-base-chinese | ml.p3.2xlarge | 16 | 128 | 
| bert-base-multilingual-cased  | ml.p3.2xlarge | 12 | 64 | 
| bert-base-multilingual-uncased  | ml.p3.2xlarge | 16 | 96 | 
| bert-base-uncased | ml.p3.2xlarge | 16 | 96 | 
| bert-large-uncased  | ml.p3.2xlarge | 4 | 24 | 
| cl-tohoku/bert-base-japanese  | ml.p3.2xlarge | 16 | 128 | 
| cl-tohoku/bert-base-japanese-whole-word-masking  | ml.p3.2xlarge | 16 | 128 | 
| distilbert-base-sst2  | ml.p3.2xlarge | 32 | 128 | 
| distilbert-base-uncased  | ml.p3.2xlarge | 32 | 128 | 
| distilgpt2 | ml.p3.2xlarge | 32 | 128 | 
| gpt2  | ml.p3.2xlarge | 12 | 64 | 
| gpt2-large  | ml.p3.2xlarge | 2 | 24 | 
| jplu/tf-xlm-roberta-base  | ml.p3.2xlarge | 12 | 32 | 
| roberta-base  | ml.p3.2xlarge | 4 | 64 | 
| roberta-large  | ml.p3.2xlarge | 4 | 64 | 
| t5-base  | ml.p3.2xlarge | 64 | 64 | 
| t5-small  | ml.p3.2xlarge | 128 | 128 | 

# Bring Your Own Deep Learning Model
<a name="training-compiler-modify-scripts"></a>

**Important**  
Amazon Web Services (AWS) has announced that there will be no new releases or versions of SageMaker Training Compiler. You can continue to use SageMaker Training Compiler through the existing AWS Deep Learning Containers (DLCs) for SageMaker Training. Note that while the existing DLCs remain accessible, they no longer receive patches or updates from AWS, in accordance with the [AWS Deep Learning Containers Framework Support Policy](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/support-policy.html).

This guide walks you through how to adapt your training script for a compiler-accelerated training job. The preparation of your training script depends on the following:
+ Training settings such as single-core or distributed training.
+ Frameworks and libraries that you use to create the training script.

Choose one of the following topics depending on the framework you use.

**Topics**
+ [PyTorch](training-compiler-pytorch-models.md)
+ [TensorFlow](training-compiler-tensorflow.md)

**Note**  
After you finish preparing your training script, you can run a SageMaker training job using the SageMaker AI framework estimator classes. For more information, see the previous topic at [Enable SageMaker Training Compiler](training-compiler-enable.md).

# PyTorch
<a name="training-compiler-pytorch-models"></a>

Bring your own PyTorch model to SageMaker AI, and run the training job with SageMaker Training Compiler.

**Topics**
+ [PyTorch Models with Hugging Face Transformers](#training-compiler-pytorch-models-transformers)

## PyTorch Models with Hugging Face Transformers
<a name="training-compiler-pytorch-models-transformers"></a>

PyTorch models with [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) are based on PyTorch's [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) API. Hugging Face Transformers also provides [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) and pretrained model classes for PyTorch to help reduce the effort of configuring natural language processing (NLP) models. After preparing your training script, you can launch a training job using the SageMaker AI `PyTorch` or `HuggingFace` estimator with the SageMaker Training Compiler configuration when you proceed to the next topic, [Enable SageMaker Training Compiler](training-compiler-enable.md).

**Tip**  
When you create a tokenizer for an NLP model using Transformers in your training script, make sure that you use a static input tensor shape by specifying `padding='max_length'`. Do not use `padding='longest'` because padding to the longest sequence in the batch can change the tensor shape for each training batch. The dynamic input shape can trigger recompilation of the model and might increase total training time. For more information about padding options of the Transformers tokenizers, see [Padding and truncation](https://huggingface.co/docs/transformers/pad_truncation) in the *Hugging Face Transformers documentation*.
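As a plain-Python illustration of why static padding matters (a toy sketch, not the Transformers tokenizer API), padding every batch to a fixed `max_length` keeps the input shape identical across batches, while padding to the longest sequence in each batch produces varying shapes that can trigger recompilation:

```python
# Toy illustration only; real tokenizers come from the transformers library.
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad token-id lists to max_length (static) or to the longest sequence (dynamic)."""
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    return [s + [pad_id] * (target - len(s)) for s in sequences]

batch1 = [[101, 7592, 102], [101, 102]]
batch2 = [[101, 7592, 2088, 999, 102], [101, 102]]

# Static shape: every batch is padded to the same length (like padding='max_length').
static1 = pad_batch(batch1, max_length=8)
static2 = pad_batch(batch2, max_length=8)

# Dynamic shape: each batch is padded to its own longest sequence (like padding='longest').
dynamic1 = pad_batch(batch1)
dynamic2 = pad_batch(batch2)

print(len(static1[0]), len(static2[0]))    # same length across batches
print(len(dynamic1[0]), len(dynamic2[0]))  # lengths differ across batches
```

With the static scheme, the compiled graph can be reused for every batch; with the dynamic scheme, each new shape may force a recompilation.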

**Topics**
+ [Large Language Models Using the Hugging Face Transformers `Trainer` Class](#training-compiler-pytorch-models-transformers-trainer)
+ [Large Language Models Using PyTorch Directly (without the Hugging Face Transformers Trainer API)](#training-compiler-pytorch-models-non-trainer)

### Large Language Models Using the Hugging Face Transformers `Trainer` Class
<a name="training-compiler-pytorch-models-transformers-trainer"></a>

If you use the Hugging Face Transformers library's `Trainer` class, you don't need to make any additional changes to your training script. SageMaker Training Compiler automatically compiles your `Trainer` model if you enable it through the estimator class. The following code shows the basic form of a PyTorch training script with the Hugging Face `Trainer` API.

```
from transformers import Trainer, TrainingArguments

training_args=TrainingArguments(**kwargs)
trainer=Trainer(args=training_args, **kwargs)
```

**Topics**
+ [For single GPU training](#training-compiler-pytorch-models-transformers-trainer-single-gpu)
+ [For distributed training](#training-compiler-pytorch-models-transformers-trainer-distributed)
+ [Best Practices to Use SageMaker Training Compiler with `Trainer`](#training-compiler-pytorch-models-transformers-trainer-best-practices)

#### For single GPU training
<a name="training-compiler-pytorch-models-transformers-trainer-single-gpu"></a>

You don't need to change your code when you use the [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class.

#### For distributed training
<a name="training-compiler-pytorch-models-transformers-trainer-distributed"></a>

**PyTorch v1.11.0 and later**

To run distributed training with SageMaker Training Compiler, you must add the following `_mp_fn()` function in your training script and wrap the `main()` function. It redirects the `_mp_fn(index)` function calls from the SageMaker AI distributed runtime for PyTorch (`pytorchxla`) to the `main()` function of your training script. 

```
def _mp_fn(index):
    main()
```

This function accepts the `index` argument to indicate the rank of the current GPU in the cluster for distributed training. To find more example scripts, see the [Hugging Face Transformers language modeling example scripts](https://github.com/huggingface/transformers/blob/v4.21.1/examples/pytorch/language-modeling).

**For Transformers v4.17 and before with PyTorch v1.10.2 and before**

SageMaker Training Compiler uses an alternate mechanism for launching a distributed training job, and you don't need to make any modifications to your training script. Instead, SageMaker Training Compiler requires you to pass a SageMaker AI distributed training launcher script to the `entry_point` argument and your training script to the `hyperparameters` argument of the SageMaker AI Hugging Face estimator.

#### Best Practices to Use SageMaker Training Compiler with `Trainer`
<a name="training-compiler-pytorch-models-transformers-trainer-best-practices"></a>
+ Make sure that you use SyncFree optimizers by setting the `optim` argument to `adamw_torch_xla` while setting up [transformers.TrainingArgument](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments). See also [Optimizer](https://huggingface.co/docs/transformers/v4.23.1/en/perf_train_gpu_one#optimizer) in the *Hugging Face Transformers documentation*.
+ Ensure that the throughput of the data processing pipeline is higher than the training throughput. You can tweak the `dataloader_num_workers` and `preprocessing_num_workers` arguments of the [transformers.TrainingArgument](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) class to achieve this. Typically, these need to be greater than or equal to the number of GPUs but less than the number of CPUs.
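The data-loading guidance above can be turned into a rough sizing heuristic (an illustration only, not an official SageMaker AI formula): keep the worker count at least equal to the number of GPUs, but leave CPU headroom for the main process. The function name below is hypothetical.

```python
def suggest_num_workers(num_gpus, num_cpus, requested=None):
    """Rough heuristic: at least one data-loading worker per GPU,
    but never consume every CPU on the instance."""
    upper = max(1, num_cpus - 1)   # leave at least one CPU for the main process
    lower = min(num_gpus, upper)   # one worker per GPU when the CPUs allow it
    if requested is None:
        return lower
    return min(max(requested, lower), upper)

print(suggest_num_workers(num_gpus=4, num_cpus=16))               # 4
print(suggest_num_workers(num_gpus=4, num_cpus=16, requested=8))  # 8
```

You would pass the result to `dataloader_num_workers` (and similarly size `preprocessing_num_workers`), then profile to confirm the input pipeline keeps the GPUs busy.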

After you have completed adapting your training script, proceed to [Run PyTorch Training Jobs with SageMaker Training Compiler](training-compiler-enable-pytorch.md).

### Large Language Models Using PyTorch Directly (without the Hugging Face Transformers Trainer API)
<a name="training-compiler-pytorch-models-non-trainer"></a>

If you have a training script that uses PyTorch directly, you need to make additional changes to your PyTorch training script to implement PyTorch/XLA. Follow the instructions to modify your script to properly set up the PyTorch/XLA primitives.

**Topics**
+ [For single GPU training](#training-compiler-pytorch-models-non-trainer-single-gpu)
+ [For distributed training](#training-compiler-pytorch-models-non-trainer-distributed)
+ [Best Practices to Use SageMaker Training Compiler with PyTorch/XLA](#training-compiler-pytorch-models-best-practices)

#### For single GPU training
<a name="training-compiler-pytorch-models-non-trainer-single-gpu"></a>

1. Import the optimization libraries.

   ```
   import torch_xla
   import torch_xla.core.xla_model as xm
   ```

1. Change the target device to XLA instead of `torch.device("cuda")`:

   ```
   device=xm.xla_device()
   ```

1. If you're using PyTorch's [Automatic Mixed Precision](https://pytorch.org/docs/stable/amp.html) (AMP), do the following:

   1. Replace `torch.cuda.amp` with the following:

      ```
      import torch_xla.amp
      ```

   1. Replace `torch.optim.SGD` and `torch.optim.Adam` with the following:

      ```
      from torch_xla.amp.syncfree import SGD, Adam
      ```

   1. Replace `torch.cuda.amp.GradScaler` with the following:

      ```
      from torch_xla.amp import GradScaler as grad_scaler
      ```

1. If you're not using AMP, replace `optimizer.step()` with the following:

   ```
   xm.optimizer_step(optimizer)
   ```

1. If you're using a distributed dataloader, wrap your dataloader in the PyTorch/XLA's `ParallelLoader` class:

   ```
   import torch_xla.distributed.parallel_loader as pl
   parallel_loader=pl.ParallelLoader(dataloader, [device]).per_device_loader(device)
   ```

1. Add `mark_step` at the end of the training loop when you're not using `parallel_loader`:

   ```
   xm.mark_step()
   ```

1. To checkpoint your training, use the PyTorch/XLA's model checkpoint method:

   ```
   xm.save(model.state_dict(), path_to_save)
   ```

After you have completed adapting your training script, proceed to [Run PyTorch Training Jobs with SageMaker Training Compiler](training-compiler-enable-pytorch.md).

#### For distributed training
<a name="training-compiler-pytorch-models-non-trainer-distributed"></a>

In addition to the changes listed in the previous [For single GPU training](#training-compiler-pytorch-models-non-trainer-single-gpu) section, add the following changes to properly distribute workload across GPUs.

1. If you're using AMP, add `all_reduce` after `scaler.scale(loss).backward()`:

   ```
   gradients=xm._fetch_gradients(optimizer)
   xm.all_reduce('sum', gradients, scale=1.0/xm.xrt_world_size())
   ```

1. If you need to set variables for `local_ranks` and `world_size`, use similar code to the following:

   ```
   local_rank=xm.get_local_ordinal()
   world_size=xm.xrt_world_size()
   ```

1. For any `world_size` (`num_gpus_per_node*num_nodes`) greater than `1`, you must define a train sampler which should look similar to the following:

   ```
   import torch_xla.core.xla_model as xm
   
   if xm.xrt_world_size() > 1:
       train_sampler=torch.utils.data.distributed.DistributedSampler(
           train_dataset,
           num_replicas=xm.xrt_world_size(),
           rank=xm.get_ordinal(),
           shuffle=True
       )
   
   train_loader=torch.utils.data.DataLoader(
       train_dataset, 
       batch_size=args.batch_size,
       sampler=train_sampler,
       drop_last=args.drop_last,
       shuffle=False if train_sampler else True,
       num_workers=args.num_workers
   )
   ```

1. Make the following changes to make sure you use the parallel loader provided by the `torch_xla.distributed` module.

   ```
   import torch_xla.distributed.parallel_loader as pl
   train_device_loader=pl.MpDeviceLoader(train_loader, device)
   ```

   The `train_device_loader` functions like a regular PyTorch loader as follows: 

   ```
   for step, (data, target) in enumerate(train_device_loader):
       optimizer.zero_grad()
       output=model(data)
       loss=torch.nn.functional.nll_loss(output, target)
       loss.backward()
       xm.optimizer_step(optimizer)
   ```

   With all of these changes, you should be able to launch distributed training with any PyTorch model without the Transformers Trainer API. Note that these instructions can be used for both single-node multi-GPU and multi-node multi-GPU training.

1. **For PyTorch v1.11.0 and later**

   To run distributed training with SageMaker Training Compiler, you must add the following `_mp_fn()` function in your training script and wrap the `main()` function. It redirects the `_mp_fn(index)` function calls from the SageMaker AI distributed runtime for PyTorch (`pytorchxla`) to the `main()` function of your training script. 

   ```
   def _mp_fn(index):
       main()
   ```

   This function accepts the `index` argument to indicate the rank of the current GPU in the cluster for distributed training. To find more example scripts, see the [Hugging Face Transformers language modeling example scripts](https://github.com/huggingface/transformers/blob/v4.21.1/examples/pytorch/language-modeling).

   **For Transformers v4.17 and before with PyTorch v1.10.2 and before**

   SageMaker Training Compiler uses an alternate mechanism for launching a distributed training job and requires you to pass a SageMaker AI distributed training launcher script to the `entry_point` argument and pass your training script to the `hyperparameters` argument in the SageMaker AI Hugging Face estimator.

After you have completed adapting your training script, proceed to [Run PyTorch Training Jobs with SageMaker Training Compiler](training-compiler-enable-pytorch.md).

#### Best Practices to Use SageMaker Training Compiler with PyTorch/XLA
<a name="training-compiler-pytorch-models-best-practices"></a>

If you want to use SageMaker Training Compiler with your native PyTorch training script, first get familiar with [PyTorch on XLA devices](https://pytorch.org/xla/release/1.9/index.html). The following sections list some best practices for enabling XLA for PyTorch.

**Note**  
This section for best practices assumes that you use the following PyTorch/XLA modules:  

```
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
```

##### Understand the lazy mode in PyTorch/XLA
<a name="training-compiler-pytorch-models-best-practices-lazy-mode"></a>

One significant difference between PyTorch/XLA and native PyTorch is that the PyTorch/XLA system runs in lazy mode while the native PyTorch runs in eager mode. Tensors in lazy mode are placeholders for building the computational graph until they are materialized after the compilation and evaluation are complete. The PyTorch/XLA system builds the computational graph on the fly when you call PyTorch APIs to build the computation using tensors and operators. The computational graph gets compiled and executed when `xm.mark_step()` is called explicitly or implicitly by `pl.MpDeviceLoader/pl.ParallelLoader`, or when you explicitly request the value of a tensor such as by calling `loss.item()` or `print(loss)`. 
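As a rough plain-Python analogy (a toy sketch, not the torch_xla API), a lazy value only records pending operations; the recorded graph runs when someone asks for the result, which is what happens when you call `loss.item()` or `print(loss)` on a lazy tensor:

```python
class LazyValue:
    """Toy stand-in for a lazy tensor: operations are recorded, not executed."""
    executions = 0  # counts how many times a recorded graph was materialized

    def __init__(self, value):
        self._ops = [lambda: value]

    def add(self, x):
        # Record the operation; nothing is computed yet.
        prev = self._ops[-1]
        self._ops.append(lambda: prev() + x)
        return self

    def item(self):
        # Materialization point: the whole recorded graph executes here,
        # analogous to xm.mark_step() or printing a lazy tensor.
        LazyValue.executions += 1
        return self._ops[-1]()

loss = LazyValue(1.0).add(2.0).add(3.0)  # graph built, nothing has executed
assert LazyValue.executions == 0
print(loss.item())                        # the graph runs only now
```

The performance implication is the same as in PyTorch/XLA: every materialization point costs a graph execution, so you want as few of them as possible per training iteration.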

##### Minimize the number of *compilation-and-executions* using `pl.MpDeviceLoader/pl.ParallelLoader` and `xm.step_closure`
<a name="training-compiler-pytorch-models-best-practices-minimize-comp-exec"></a>

For the best performance, keep in mind the ways that *compilation-and-executions* can be initiated, as described in [Understand the lazy mode in PyTorch/XLA](#training-compiler-pytorch-models-best-practices-lazy-mode), and try to minimize their number. Ideally, only one compilation-and-execution is necessary per training iteration, and it is initiated automatically by `pl.MpDeviceLoader/pl.ParallelLoader`. The `MpDeviceLoader` is optimized for XLA and should always be used if possible for the best performance. During training, you might want to examine some intermediate results such as loss values. In that case, wrap the printing of lazy tensors using `xm.add_step_closure()` to avoid unnecessary compilation-and-executions.

##### Use AMP and `syncfree` optimizers
<a name="training-compiler-pytorch-models-best-practices-amp-optimizers"></a>

Training in Automatic Mixed Precision (AMP) mode significantly accelerates your training speed by leveraging the Tensor cores of NVIDIA GPUs. SageMaker Training Compiler provides `syncfree` optimizers that are optimized for XLA to improve AMP performance. Currently, the following three `syncfree` optimizers are available and should be used if possible for best performance.

```
torch_xla.amp.syncfree.SGD
torch_xla.amp.syncfree.Adam
torch_xla.amp.syncfree.AdamW
```

These `syncfree` optimizers should be paired with `torch_xla.amp.GradScaler` for gradient scaling/unscaling.

**Tip**  
Starting with PyTorch v1.13.1, SageMaker Training Compiler improves performance by letting PyTorch/XLA automatically override the optimizers (such as SGD, Adam, and AdamW) in `torch.optim` or `transformers.optimization` with their syncfree versions in `torch_xla.amp.syncfree` (such as `torch_xla.amp.syncfree.SGD`, `torch_xla.amp.syncfree.Adam`, and `torch_xla.amp.syncfree.AdamW`). You don't need to change the lines of code that define optimizers in your training script.

# TensorFlow
<a name="training-compiler-tensorflow"></a>

Bring your own TensorFlow model to SageMaker AI, and run the training job with SageMaker Training Compiler.

## TensorFlow Models
<a name="training-compiler-tensorflow-models"></a>

SageMaker Training Compiler automatically optimizes model training workloads that are built on top of the native TensorFlow API or the high-level Keras API.

**Tip**  
For preprocessing your input dataset, ensure that you use a static input shape. Dynamic input shape can initiate recompilation of the model and might increase total training time. 

### Using Keras (Recommended)
<a name="training-compiler-tensorflow-models-keras"></a>

For the best compiler acceleration, we recommend using models that are subclasses of TensorFlow Keras ([tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model)).

#### For single GPU training
<a name="training-compiler-tensorflow-models-keras-single-gpu"></a>

There's no additional change you need to make in the training script.

### Without Keras
<a name="training-compiler-tensorflow-models-no-keras"></a>

SageMaker Training Compiler does not support eager execution in TensorFlow. Accordingly, you should wrap your model and training loops with the TensorFlow function decorator (`@tf.function`) to leverage compiler acceleration.

SageMaker Training Compiler performs a graph-level optimization, and uses the decorator to make sure your TensorFlow functions are set to run in [graph mode](https://www.tensorflow.org/guide/intro_to_graphs).

#### For single GPU training
<a name="training-compiler-tensorflow-models-no-keras-single-gpu"></a>

TensorFlow 2.0 and later has eager execution enabled by default, so you should add the `@tf.function` decorator in front of every function that you use for constructing a TensorFlow model.

## TensorFlow Models with Hugging Face Transformers
<a name="training-compiler-tensorflow-models-transformers"></a>

TensorFlow models with [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) are based on TensorFlow's [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) API. Hugging Face Transformers also provides pretrained model classes for TensorFlow to help reduce the effort for configuring natural language processing (NLP) models. After creating your own training script using the Transformers library, you can run the training script using the SageMaker AI `HuggingFace` estimator with the SageMaker Training Compiler configuration class as shown in the previous topic at [Run TensorFlow Training Jobs with SageMaker Training Compiler](training-compiler-enable-tensorflow.md).

SageMaker Training Compiler automatically optimizes model training workloads that are built on top of the native TensorFlow API or the high-level Keras API, such as the TensorFlow transformer models.

**Tip**  
When you create a tokenizer for an NLP model using Transformers in your training script, make sure that you use a static input tensor shape by specifying `padding='max_length'`. Do not use `padding='longest'` because padding to the longest sequence in the batch can change the tensor shape for each training batch. The dynamic input shape can initiate recompilation of the model and might increase total training time. For more information about padding options of the Transformers tokenizers, see [Padding and truncation](https://huggingface.co/docs/transformers/pad_truncation) in the *Hugging Face Transformers documentation*.

**Topics**
+ [Using Keras](#training-compiler-tensorflow-models-transformers-keras)
+ [Without Keras](#training-compiler-tensorflow-models-transformers-no-keras)

### Using Keras
<a name="training-compiler-tensorflow-models-transformers-keras"></a>

For the best compiler acceleration, we recommend using models that are subclasses of TensorFlow Keras ([tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model)). As noted in the [Quick tour](https://huggingface.co/docs/transformers/quicktour) page in the *Hugging Face Transformers documentation*, you can use the models as regular TensorFlow Keras models.

#### For single GPU training
<a name="training-compiler-tensorflow-models-transformers-keras-single-gpu"></a>

There's no additional change you need to make in the training script.

#### For distributed training
<a name="training-compiler-tensorflow-models-transformers-keras-distributed"></a>

SageMaker Training Compiler acceleration works transparently for multi-GPU workloads when the model is constructed and trained using Keras APIs within the scope of a [tf.distribute.Strategy](https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy) call.

1. Choose the right distributed training strategy.

   1. For single-node multi-GPU, use `tf.distribute.MirroredStrategy` to set the strategy.

      ```
      strategy = tf.distribute.MirroredStrategy()
      ```

   1. For multi-node multi-GPU, add the following code to properly set the TensorFlow distributed training configuration before creating the strategy.

      ```
      def set_sm_dist_config():
          DEFAULT_PORT = '8890'
          DEFAULT_CONFIG_FILE = '/opt/ml/input/config/resourceconfig.json'
          with open(DEFAULT_CONFIG_FILE) as f:
              config = json.loads(f.read())
              current_host = config['current_host']
          tf_config = {
              'cluster': {
                  'worker': []
              },
              'task': {'type': 'worker', 'index': -1}
          }
          for i, host in enumerate(config['hosts']):
              tf_config['cluster']['worker'].append("%s:%s" % (host, DEFAULT_PORT))
              if current_host == host:
                  tf_config['task']['index'] = i
          os.environ['TF_CONFIG'] = json.dumps(tf_config)
      
      set_sm_dist_config()
      ```

       Use `tf.distribute.MultiWorkerMirroredStrategy` to set the strategy.

      ```
      strategy = tf.distribute.MultiWorkerMirroredStrategy()
      ```

1. Using the strategy of your choice, wrap the model.

   ```
   with strategy.scope():
       # create a model and do fit
   ```

### Without Keras
<a name="training-compiler-tensorflow-models-transformers-no-keras"></a>

If you want to bring custom models with custom training loops using TensorFlow without Keras, you should wrap the model and the training loop with the TensorFlow function decorator (`@tf.function`) to leverage compiler acceleration.

SageMaker Training Compiler performs a graph-level optimization, and uses the decorator to make sure your TensorFlow functions are set to run in graph mode. 

#### For single GPU training
<a name="training-compiler-tensorflow-models-transformers-no-keras-single-gpu"></a>

TensorFlow 2.0 and later has eager execution enabled by default, so you should add the `@tf.function` decorator in front of every function that you use for constructing a TensorFlow model.

#### For distributed training
<a name="training-compiler-tensorflow-models-transformers-no-keras-distributed"></a>

In addition to the changes needed for [Using Keras for distributed training](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-tensorflow-models.html#training-compiler-tensorflow-models-transformers-keras), you need to ensure that functions to be run on each GPU are annotated with `@tf.function`, while cross-GPU communication functions are not annotated. Example training code should look like the following:

```
@tf.function()
def compiled_step(inputs, outputs):
    with tf.GradientTape() as tape:
        pred=model(inputs, training=True)
        total_loss=loss_object(outputs, pred)/args.batch_size
    gradients=tape.gradient(total_loss, model.trainable_variables)
    return total_loss, pred, gradients

def train_step(inputs, outputs):
    total_loss, pred, gradients=compiled_step(inputs, outputs)
    if args.weight_decay > 0.:
        gradients=[g+v*args.weight_decay for g,v in zip(gradients, model.trainable_variables)]

    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss.update_state(total_loss)
    train_accuracy.update_state(outputs, pred)

@tf.function()
def train_step_dist(inputs, outputs):
    strategy.run(train_step, args= (inputs, outputs))
```

Note that these instructions can be used for both single-node multi-GPU and multi-node multi-GPU training.

# Enable SageMaker Training Compiler
<a name="training-compiler-enable"></a>

**Important**  
Amazon Web Services (AWS) has announced that there will be no new releases or versions of SageMaker Training Compiler. You can continue to use SageMaker Training Compiler through the existing AWS Deep Learning Containers (DLCs) for SageMaker Training. Note that while the existing DLCs remain accessible, they will no longer receive patches or updates from AWS, in accordance with the [AWS Deep Learning Containers Framework Support Policy](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/support-policy.html).

SageMaker Training Compiler is built into the SageMaker Python SDK and AWS Deep Learning Containers so that you don’t need to change your workflows to enable Training Compiler. Choose one of the following topics that matches with your use case.

**Topics**
+ [Run PyTorch Training Jobs with SageMaker Training Compiler](training-compiler-enable-pytorch.md)
+ [Run TensorFlow Training Jobs with SageMaker Training Compiler](training-compiler-enable-tensorflow.md)

# Run PyTorch Training Jobs with SageMaker Training Compiler
<a name="training-compiler-enable-pytorch"></a>

You can use any of the SageMaker AI interfaces to run a training job with SageMaker Training Compiler: Amazon SageMaker Studio Classic, Amazon SageMaker notebook instances, AWS SDK for Python (Boto3), and AWS Command Line Interface.

**Topics**
+ [Using the SageMaker Python SDK](#training-compiler-enable-pytorch-pysdk)
+ [Using the SageMaker AI `CreateTrainingJob` API Operation](#training-compiler-enable-pytorch-api)

## Using the SageMaker Python SDK
<a name="training-compiler-enable-pytorch-pysdk"></a>

SageMaker Training Compiler for PyTorch is available through the SageMaker AI [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html) and [HuggingFace](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-estimator) framework estimator classes. To turn on SageMaker Training Compiler, add the `compiler_config` parameter to the SageMaker AI estimator. Import the `TrainingCompilerConfig` class and pass an instance of it to the `compiler_config` parameter. The following code examples show the structure of the SageMaker AI estimator classes with SageMaker Training Compiler turned on.

**Tip**  
To get started with prebuilt models provided by PyTorch or Transformers, try using the batch sizes provided in the reference table at [Tested Models](training-compiler-support.md#training-compiler-tested-models).

**Note**  
The native PyTorch support is available in the SageMaker Python SDK v2.121.0 and later. Make sure that you update the SageMaker Python SDK accordingly.

**Note**  
Starting with PyTorch v1.12.0, SageMaker Training Compiler containers for PyTorch are available. Note that the SageMaker Training Compiler containers for PyTorch are not prepackaged with Hugging Face Transformers. If you need to install the library in the container, make sure that you add the `requirements.txt` file under the source directory when submitting a training job.  
For PyTorch v1.11.0 and before, use the previous versions of the SageMaker Training Compiler containers for Hugging Face and PyTorch.  
For a complete list of framework versions and corresponding container information, see [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).

For information that fits your use case, see one of the following options.

### For single GPU training
<a name="training-compiler-estimator-pytorch-single"></a>

------
#### [ PyTorch v1.12.0 and later ]

To compile and train a PyTorch model, configure a SageMaker AI PyTorch estimator with SageMaker Training Compiler as shown in the following code example.

**Note**  
This native PyTorch support is available in the SageMaker AI Python SDK v2.120.0 and later. Make sure that you update the SageMaker AI Python SDK.

```
from sagemaker.pytorch import PyTorch, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_estimator=PyTorch(
    entry_point='train.py',
    source_dir='path-to-requirements-file', # Optional. Add this if you need to install additional packages.
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='1.13.1',
    py_version='py3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_estimator.fit()
```
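The learning-rate update in the example above follows a linear scaling rule: when the compiler frees enough GPU memory to grow the batch size, scale the learning rate by the same factor so the effective step size per sample stays roughly constant. In isolation:

```python
# Linear scaling of the learning rate with batch size (as in the estimator example).
batch_size_native = 12      # max batch size that fits without the compiler
learning_rate_native = 5e-5
batch_size = 64             # max batch size that fits with the compiler

learning_rate = learning_rate_native / batch_size_native * batch_size
print(learning_rate)  # roughly 2.67e-4
```

For distributed jobs, the same rule multiplies in the GPU and instance counts, as the distributed training examples later in this section show. Treat the scaled value as a starting point and validate it against your model's convergence.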

------
#### [ Hugging Face Transformers with PyTorch v1.11.0 and before ]

To compile and train a transformer model with PyTorch, configure a SageMaker AI Hugging Face estimator with SageMaker Training Compiler as shown in the following code example.

```
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_huggingface_estimator=HuggingFace(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    transformers_version='4.21.1',
    pytorch_version='1.11.0',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_huggingface_estimator.fit()
```

To prepare your training script, see the following pages.
+ [For single GPU training](training-compiler-pytorch-models.md#training-compiler-pytorch-models-transformers-trainer-single-gpu) of a PyTorch model using Hugging Face Transformers' [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer)
+ [For single GPU training](training-compiler-pytorch-models.md#training-compiler-pytorch-models-non-trainer-single-gpu) of a PyTorch model without Hugging Face Transformers' [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html)

To find end-to-end examples, see the following notebooks:
+ [Compile and Train a Hugging Face Transformers Trainer Model for Question and Answering with the SQuAD dataset ](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/huggingface/pytorch_single_gpu_single_node/albert-base-v2/albert-base-v2.html) 
+ [Compile and Train a Hugging Face Transformer `BERT` Model with the SST Dataset using SageMaker Training Compiler](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/huggingface/pytorch_single_gpu_single_node/bert-base-cased/bert-base-cased-single-node-single-gpu.html) 
+ [Compile and Train a Binary Classification Trainer Model with the SST2 Dataset for Single-Node Single-GPU Training ](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/huggingface/pytorch_single_gpu_single_node/roberta-base/roberta-base.html)

------

### For distributed training
<a name="training-compiler-estimator-pytorch-distributed"></a>

------
#### [ PyTorch v1.12 ]

For PyTorch v1.12, you can run distributed training with SageMaker Training Compiler by adding the `pytorch_xla` option to the `distribution` parameter of the SageMaker AI PyTorch estimator class.

**Note**  
This native PyTorch support is available in the SageMaker AI Python SDK v2.121.0 and later. Make sure that you update the SageMaker AI Python SDK.

```
from sagemaker.pytorch import PyTorch, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit to GPU memory without compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit to GPU memory with compiler
batch_size=26

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_estimator=PyTorch(
    entry_point='your_training_script.py',
    source_dir='path-to-requirements-file', # Optional. Add this if you need to install additional packages.
    instance_count=instance_count,
    instance_type=instance_type,
    framework_version='1.13.1',
    py_version='py3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    distribution={'pytorchxla': {'enabled': True}},
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_estimator.fit()
```
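The learning-rate update in the example above applies linear scaling: the learning rate grows by the same factor as the effective batch size (per-GPU batch size × GPUs per instance × instance count). The following standalone sketch shows that rule; the helper function name is illustrative and not part of the SageMaker AI Python SDK.

```python
def scale_learning_rate(lr_native, batch_size_native, batch_size,
                        num_gpus=1, instance_count=1):
    """Linearly scale the learning rate with the effective batch size."""
    scale = (batch_size * num_gpus * instance_count) / batch_size_native
    return lr_native * scale

# Values from the estimator example above
lr = scale_learning_rate(5e-5, batch_size_native=16, batch_size=26,
                         num_gpus=4, instance_count=1)
print(lr)  # ~3.25e-4
```

As noted in the best practices, multiplying by the square root of the scale factor instead of the scale factor itself is another common choice when linear scaling destabilizes training.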

**Tip**  
To prepare your training script, see [PyTorch](training-compiler-pytorch-models.md).

------
#### [ Transformers v4.21 with PyTorch v1.11 ]

For PyTorch v1.11 and later, SageMaker Training Compiler is available for distributed training when the `pytorchxla` option is specified in the `distribution` parameter.

```
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit into GPU memory without the compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with the compiler
batch_size=26

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

pytorch_huggingface_estimator=HuggingFace(
    entry_point='your_training_script.py',
    instance_count=instance_count,
    instance_type=instance_type,
    transformers_version='4.21.1',
    pytorch_version='1.11.0',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    distribution={'pytorchxla': {'enabled': True}},
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_huggingface_estimator.fit()
```

**Tip**  
To prepare your training script, see the following pages.  
[For distributed training](training-compiler-pytorch-models.md#training-compiler-pytorch-models-transformers-trainer-distributed) of a PyTorch model using Hugging Face Transformers' [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html)
[For distributed training](training-compiler-pytorch-models.md#training-compiler-pytorch-models-non-trainer-distributed) of a PyTorch model without Hugging Face Transformers' [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html)

------
#### [ Transformers v4.17 with PyTorch v1.10.2 and before ]

For supported versions of PyTorch v1.10.2 and earlier, SageMaker Training Compiler requires an alternate mechanism for launching a distributed training job: pass a SageMaker AI distributed training launcher script to the `entry_point` argument, and pass your training script through the `hyperparameters` argument. The following code example shows how to configure a SageMaker AI Hugging Face estimator with the required changes.

```
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit into GPU memory without the compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with the compiler
batch_size=26

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

training_script="your_training_script.py"

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate,
    "training_script": training_script     # Specify the file name of your training script.
}

pytorch_huggingface_estimator=HuggingFace(
    entry_point='distributed_training_launcher.py',    # Specify the distributed training launcher script.
    instance_count=instance_count,
    instance_type=instance_type,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

pytorch_huggingface_estimator.fit()
```

The launcher script should look like the following. It wraps your training script and configures the distributed training environment based on the training instance type that you choose.

```
#!/bin/python
# distributed_training_launcher.py

import subprocess
import sys

if __name__ == "__main__":
    # Join the arguments that SageMaker AI passes to this launcher
    # so they can be forwarded to the training script.
    arguments_command = " ".join(sys.argv[1:])
    # The following line sets up inter-node communication and
    # manages the intra-node workers for each GPU.
    subprocess.check_call("python -m torch_xla.distributed.sm_dist " + arguments_command, shell=True)
```
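For context on why forwarding `sys.argv[1:]` works: SageMaker AI script mode passes each entry of the `hyperparameters` dictionary to the entry point as a pair of command-line tokens (`--name value`). The following simplified sketch illustrates that serialization; the helper name is ours, and the actual container logic handles additional cases (such as JSON-encoded values).

```python
def hyperparameters_to_args(hyperparameters):
    """Mimic, in simplified form, how SageMaker AI script mode turns
    the hyperparameters dict into command-line flags for the entry point."""
    args = []
    for name, value in hyperparameters.items():
        args += ["--" + name, str(value)]
    return args

argv = hyperparameters_to_args({
    "n_gpus": 4,
    "batch_size": 26,
    "training_script": "your_training_script.py",
})
print(" ".join(argv))
# --n_gpus 4 --batch_size 26 --training_script your_training_script.py
```

This is why the launcher can read the `training_script` value from its arguments and hand the remaining flags to your training script unchanged.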

**Tip**  
To prepare your training script, see the following pages.  
[For distributed training](training-compiler-pytorch-models.md#training-compiler-pytorch-models-transformers-trainer-distributed) of a PyTorch model using Hugging Face Transformers' [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html)
[For distributed training](training-compiler-pytorch-models.md#training-compiler-pytorch-models-non-trainer-distributed) of a PyTorch model without Hugging Face Transformers' [Trainer API](https://huggingface.co/transformers/main_classes/trainer.html)

**Tip**  
To find end-to-end examples, see the following notebooks:  
[Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2 Dataset for Single-Node Multi-GPU Training](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_single_node/language-modeling-multi-gpu-single-node.html)
[Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2 Dataset for Multi-Node Multi-GPU Training](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/language-modeling-multi-gpu-multi-node.html)

------

The following list describes the minimal set of parameters required to run a SageMaker training job with the compiler.

**Note**  
When using the SageMaker AI Hugging Face estimator, you must specify the `transformers_version`, `pytorch_version`, `hyperparameters`, and `compiler_config` parameters to enable SageMaker Training Compiler. You cannot use `image_uri` to manually specify the Training Compiler integrated Deep Learning Containers that are listed at [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).
+ `entry_point` (str) – Required. Specify the file name of your training script.
**Note**  
To run a distributed training with SageMaker Training Compiler and PyTorch v1.10.2 and before, specify the file name of a launcher script to this parameter. The launcher script should be prepared to wrap your training script and configure the distributed training environment. For more information, see the following example notebooks:  
[Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2 Dataset for Single-Node Multi-GPU Training](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_single_node/language-modeling-multi-gpu-single-node.html)
[Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2 Dataset for Multi-Node Multi-GPU Training](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/language-modeling-multi-gpu-multi-node.html)
+ `source_dir` (str) – Optional. Add this if you need to install additional packages. To install packages, prepare a `requirements.txt` file under this directory.
+ `instance_count` (int) – Required. Specify the number of instances.
+ `instance_type` (str) – Required. Specify the instance type.
+ `transformers_version` (str) – Required only when using the SageMaker AI Hugging Face estimator. Specify the Hugging Face Transformers library version supported by SageMaker Training Compiler. To find available versions, see [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).
+ `framework_version` or `pytorch_version` (str) – Required. Specify the PyTorch version supported by SageMaker Training Compiler. To find available versions, see [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).
**Note**  
When using the SageMaker AI Hugging Face estimator, you must specify both `transformers_version` and `pytorch_version`.
+ `hyperparameters` (dict) – Optional. Specify hyperparameters for the training job, such as `n_gpus`, `batch_size`, and `learning_rate`. When you enable SageMaker Training Compiler, try larger batch sizes and adjust the learning rate accordingly. To find case studies of using the compiler and adjusted batch sizes to improve training speed, see [Tested Models](training-compiler-support.md#training-compiler-tested-models) and [SageMaker Training Compiler Example Notebooks and Blogs](training-compiler-examples-and-blogs.md).
**Note**  
To run a distributed training with SageMaker Training Compiler and PyTorch v1.10.2 and before, you need to add an additional parameter, `"training_script"`, to specify your training script, as shown in the preceding code example.
+ `compiler_config` (TrainingCompilerConfig object) – Required. Include this parameter to turn on SageMaker Training Compiler. The following are parameters for the `TrainingCompilerConfig` class.
  + `enabled` (bool) – Optional. Specify `True` or `False` to turn on or turn off SageMaker Training Compiler. The default value is `True`.
  + `debug` (bool) – Optional. To receive more detailed training logs from your compiler-accelerated training jobs, change it to `True`. However, the additional logging might add overhead and slow down the compiled training job. The default value is `False`.
+ `distribution` (dict) – Optional. To run a distributed training job with SageMaker Training Compiler, add `distribution = { 'pytorchxla' : { 'enabled': True }}`.

**Warning**  
If you turn on SageMaker Debugger, it might impact the performance of SageMaker Training Compiler. We recommend that you turn off Debugger when running SageMaker Training Compiler to make sure there's no impact on performance. For more information, see [Considerations](training-compiler-tips-pitfalls.md#training-compiler-tips-pitfalls-considerations). To turn the Debugger functionalities off, add the following two arguments to the estimator:  

```
disable_profiler=True,
debugger_hook_config=False
```

If the training job with the compiler is launched successfully, you receive the following logs during the job initialization phase: 
+ With `TrainingCompilerConfig(debug=False)`

  ```
  Found configuration for Training Compiler
  Configuring SM Training Compiler...
  ```
+ With `TrainingCompilerConfig(debug=True)`

  ```
  Found configuration for Training Compiler
  Configuring SM Training Compiler...
  Training Compiler set to debug mode
  ```

## Using the SageMaker AI `CreateTrainingJob` API Operation
<a name="training-compiler-enable-pytorch-api"></a>

SageMaker Training Compiler configuration options must be specified through the `AlgorithmSpecification` and `HyperParameters` fields in the request syntax for the [`CreateTrainingJob` API operation](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html).

```
"AlgorithmSpecification": {
    "TrainingImage": "<sagemaker-training-compiler-enabled-dlc-image>"
},

"HyperParameters": {
    "sagemaker_training_compiler_enabled": "true",
    "sagemaker_training_compiler_debug_mode": "false",
    "sagemaker_pytorch_xla_multi_worker_enabled": "false"    // set to "true" for distributed training
}
```

To find a complete list of deep learning container image URIs that have SageMaker Training Compiler implemented, see [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).
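To make the request shape concrete, the following sketch assembles these compiler-related fields as a plain Python dictionary, as you might before sending the request through the AWS SDK for Python (Boto3). The job name and image URI are placeholders, and a complete request also requires fields such as `RoleArn`, `InputDataConfig`, `OutputDataConfig`, `ResourceConfig`, and `StoppingCondition`.

```python
# Sketch only: the image URI below is a placeholder, and a real
# CreateTrainingJob request requires additional fields (RoleArn,
# OutputDataConfig, ResourceConfig, StoppingCondition, and so on).
request = {
    "TrainingJobName": "compiler-enabled-training-job",
    "AlgorithmSpecification": {
        "TrainingImage": "<sagemaker-training-compiler-enabled-dlc-image>",
        "TrainingInputMode": "File",
    },
    # All hyperparameter values must be strings in the API request.
    "HyperParameters": {
        "sagemaker_training_compiler_enabled": "true",
        "sagemaker_training_compiler_debug_mode": "false",
        "sagemaker_pytorch_xla_multi_worker_enabled": "false",  # "true" for distributed training
    },
}

# The completed request would then be sent with:
#   import boto3
#   boto3.client("sagemaker").create_training_job(**request)
```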

# Run TensorFlow Training Jobs with SageMaker Training Compiler
<a name="training-compiler-enable-tensorflow"></a>

You can use any of the SageMaker AI interfaces to run a training job with SageMaker Training Compiler: Amazon SageMaker Studio Classic, Amazon SageMaker notebook instances, AWS SDK for Python (Boto3), and AWS Command Line Interface.

**Topics**
+ [Using the SageMaker Python SDK](#training-compiler-enable-tensorflow-pysdk)
+ [Using the SageMaker AI Python SDK and Extending SageMaker AI Framework Deep Learning Containers](#training-compiler-enable-tensorflow-sdk-extend-container)
+ [Enable SageMaker Training Compiler Using the SageMaker AI `CreateTrainingJob` API Operation](#training-compiler-enable-tensorflow-api)

## Using the SageMaker Python SDK
<a name="training-compiler-enable-tensorflow-pysdk"></a>

To turn on SageMaker Training Compiler, add the `compiler_config` parameter to the SageMaker AI TensorFlow or Hugging Face estimator. Import the `TrainingCompilerConfig` class and pass an instance of it to the `compiler_config` parameter. The following code examples show the structure of the SageMaker AI estimator classes with SageMaker Training Compiler turned on.

**Tip**  
To get started with prebuilt models provided by the TensorFlow and Transformers libraries, try using the batch sizes provided in the reference table at [Tested Models](training-compiler-support.md#training-compiler-tested-models).

**Note**  
SageMaker Training Compiler for TensorFlow is available through the SageMaker AI [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) and [Hugging Face](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-estimator) framework estimators.

For information that fits your use case, see one of the following options.

### For single GPU training
<a name="training-compiler-estimator-tensorflow-single"></a>

------
#### [ TensorFlow ]

```
from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64    

# update the global learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_estimator=TensorFlow(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.9.1',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_estimator.fit()
```

To prepare your training script, see the following pages.
+ [For single GPU training](training-compiler-tensorflow.md#training-compiler-tensorflow-models-keras-single-gpu) of a model constructed using TensorFlow Keras (`tf.keras.*`).
+ [For single GPU training](training-compiler-tensorflow.md#training-compiler-tensorflow-models-no-keras-single-gpu) of a model constructed using TensorFlow modules (`tf.*` excluding the TensorFlow Keras modules).

------
#### [ Hugging Face Estimator with TensorFlow ]

```
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# the original max batch size that can fit into GPU memory without compiler
batch_size_native=12
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with compiler
batch_size=64

# update the global learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size

hyperparameters={
    "n_gpus": 1,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_huggingface_estimator=HuggingFace(
    entry_point='train.py',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    transformers_version='4.21.1',
    tensorflow_version='2.6.3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_huggingface_estimator.fit()
```

To prepare your training script, see the following pages.
+ [For single GPU training](training-compiler-tensorflow.md#training-compiler-tensorflow-models-transformers-keras-single-gpu) of a TensorFlow Keras model with Hugging Face Transformers
+ [For single GPU training](training-compiler-tensorflow.md#training-compiler-tensorflow-models-transformers-no-keras-single-gpu) of a TensorFlow model with Hugging Face Transformers

------

### For distributed training
<a name="training-compiler-estimator-tensorflow-distributed"></a>

------
#### [ Hugging Face Estimator with TensorFlow ]

```
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

# choose an instance type, specify the number of instances you want to use,
# and set the num_gpus variable to the number of GPUs per instance.
instance_count=1
instance_type='ml.p3.8xlarge'
num_gpus=4

# the original max batch size that can fit into GPU memory without the compiler
batch_size_native=16
learning_rate_native=float('5e-5')

# an updated max batch size that can fit into GPU memory with the compiler
batch_size=26

# update learning rate
learning_rate=learning_rate_native/batch_size_native*batch_size*num_gpus*instance_count

hyperparameters={
    "n_gpus": num_gpus,
    "batch_size": batch_size,
    "learning_rate": learning_rate
}

tensorflow_huggingface_estimator=HuggingFace(
    entry_point='train.py',
    instance_count=instance_count,
    instance_type=instance_type,
    transformers_version='4.21.1',
    tensorflow_version='2.6.3',
    hyperparameters=hyperparameters,
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

tensorflow_huggingface_estimator.fit()
```

**Tip**  
To prepare your training script, see the following pages.  
[For distributed training](training-compiler-tensorflow.md#training-compiler-tensorflow-models-transformers-keras-distributed) of a TensorFlow Keras model with Hugging Face Transformers
[For distributed training](training-compiler-tensorflow.md#training-compiler-tensorflow-models-transformers-no-keras-distributed) of a TensorFlow model with Hugging Face Transformers

------

The following list describes the minimal set of parameters required to run a SageMaker training job with the compiler.

**Note**  
When using the SageMaker AI Hugging Face estimator, you must specify the `transformers_version`, `tensorflow_version`, `hyperparameters`, and `compiler_config` parameters to enable SageMaker Training Compiler. You cannot use `image_uri` to manually specify the Training Compiler integrated Deep Learning Containers that are listed at [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).
+ `entry_point` (str) – Required. Specify the file name of your training script.
+ `instance_count` (int) – Required. Specify the number of instances.
+ `instance_type` (str) – Required. Specify the instance type.
+ `transformers_version` (str) – Required only when using the SageMaker AI Hugging Face estimator. Specify the Hugging Face Transformers library version supported by SageMaker Training Compiler. To find available versions, see [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).
+ `framework_version` or `tensorflow_version` (str) – Required. Specify the TensorFlow version supported by SageMaker Training Compiler. To find available versions, see [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).
**Note**  
When using the SageMaker AI TensorFlow estimator, you must specify `framework_version`.  
When using the SageMaker AI Hugging Face estimator, you must specify both `transformers_version` and `tensorflow_version`.
+ `hyperparameters` (dict) – Optional. Specify hyperparameters for the training job, such as `n_gpus`, `batch_size`, and `learning_rate`. When you enable SageMaker Training Compiler, try larger batch sizes and adjust the learning rate accordingly. To find case studies of using the compiler and adjusted batch sizes to improve training speed, see [Tested Models](training-compiler-support.md#training-compiler-tested-models) and [SageMaker Training Compiler Example Notebooks and Blogs](training-compiler-examples-and-blogs.md).
+ `compiler_config` (TrainingCompilerConfig object) – Required. Include this parameter to turn on SageMaker Training Compiler. The following are parameters for the `TrainingCompilerConfig` class.
  + `enabled` (bool) – Optional. Specify `True` or `False` to turn on or turn off SageMaker Training Compiler. The default value is `True`.
  + `debug` (bool) – Optional. To receive more detailed training logs from your compiler-accelerated training jobs, change it to `True`. However, the additional logging might add overhead and slow down the compiled training job. The default value is `False`.

**Warning**  
If you turn on SageMaker Debugger, it might impact the performance of SageMaker Training Compiler. We recommend that you turn off Debugger when running SageMaker Training Compiler to make sure there's no impact on performance. For more information, see [Considerations](training-compiler-tips-pitfalls.md#training-compiler-tips-pitfalls-considerations). To turn the Debugger functionalities off, add the following two arguments to the estimator:  

```
disable_profiler=True,
debugger_hook_config=False
```

If the training job with the compiler is launched successfully, you receive the following logs during the job initialization phase: 
+ With `TrainingCompilerConfig(debug=False)`

  ```
  Found configuration for Training Compiler
  Configuring SM Training Compiler...
  ```
+ With `TrainingCompilerConfig(debug=True)`

  ```
  Found configuration for Training Compiler
  Configuring SM Training Compiler...
  Training Compiler set to debug mode
  ```

## Using the SageMaker AI Python SDK and Extending SageMaker AI Framework Deep Learning Containers
<a name="training-compiler-enable-tensorflow-sdk-extend-container"></a>

AWS Deep Learning Containers (DLC) for TensorFlow use adapted versions of TensorFlow that include changes on top of the open-source TensorFlow framework. The [SageMaker AI Framework Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) are optimized for the underlying AWS infrastructure and Amazon SageMaker AI. On top of the advantages of using the DLCs, the SageMaker Training Compiler integration adds further performance improvements over native TensorFlow. Furthermore, you can create a custom training container by extending the DLC image.

**Note**  
This Docker customization feature is currently available only for TensorFlow.

To extend and customize the SageMaker AI TensorFlow DLCs for your use case, use the following instructions.

### Create a Dockerfile
<a name="training-compiler-enable-tensorflow-sdk-extend-container-create-dockerfile"></a>

Use the following Dockerfile template to extend the SageMaker AI TensorFlow DLC. You must use the SageMaker AI TensorFlow DLC image as the base image of your Docker container. To find the SageMaker AI TensorFlow DLC image URIs, see [Supported Frameworks](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html#training-compiler-supported-frameworks).

```
# SageMaker AI TensorFlow Deep Learning Container image
FROM 763104351884.dkr.ecr.<aws-region>.amazonaws.com/tensorflow-training:<image-tag>

ENV PATH="/opt/ml/code:${PATH}"

# This environment variable is used by the SageMaker AI container 
# to determine user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# Add more code lines to customize for your use case
...
```

For more information, see [Step 2: Create and upload the Dockerfile and Python training scripts](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html#byoc-training-step2).

Consider the following pitfalls when extending SageMaker AI Framework DLCs:
+ Do not explicitly uninstall or change the version of TensorFlow packages in SageMaker AI containers. Doing so causes the AWS optimized TensorFlow packages to be overwritten by open-source TensorFlow packages, which might result in performance degradation.
+ Watch out for packages that have a particular TensorFlow version or flavor as a dependency. These packages might implicitly uninstall the AWS optimized TensorFlow and install open-source TensorFlow packages.

For example, there’s a known issue that the [tensorflow/models](https://github.com/tensorflow/models) and [tensorflow/text](https://github.com/tensorflow/text) libraries always attempt to [reinstall open source TensorFlow](https://github.com/tensorflow/models/issues/9267). If you need to install these libraries to choose a specific version for your use case, we recommend that you look into the SageMaker AI TensorFlow DLC Dockerfiles for v2.9 or later. The paths to the Dockerfiles are typically in the following format: `tensorflow/training/docker/<tensorflow-version>/py3/<cuda-version>/Dockerfile.gpu`. In the Dockerfiles, you should find the code lines that reinstall the AWS-managed TensorFlow binary (specified by the `TF_URL` environment variable) and the other dependencies in order. The reinstallation section should look like the following example:

```
# tf-models does not respect existing installations of TensorFlow 
# and always installs open source TensorFlow

RUN pip3 install --no-cache-dir -U \
    tf-models-official==x.y.z

RUN pip3 uninstall -y tensorflow tensorflow-gpu \
  ; pip3 install --no-cache-dir -U \
    ${TF_URL} \
    tensorflow-io==x.y.z \
    tensorflow-datasets==x.y.z
```

### Build and push to ECR
<a name="training-compiler-enable-tensorflow-sdk-extend-container-build-and-push"></a>

To build and push your Docker container to Amazon ECR, follow the instructions in the following links:
+ [Step 3: Build the container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html#byoc-training-step3)
+ [Step 4: Test the container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html#byoc-training-step4)
+ [Step 5: Push the container to Amazon ECR](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html#byoc-training-step5)

### Run using the SageMaker Python SDK Estimator
<a name="training-compiler-enable-tensorflow-sdk-extend-container-run-job"></a>

Use the SageMaker AI TensorFlow framework estimator as usual. You must specify `image_uri` to use the new container you hosted in Amazon ECR.

```
import sagemaker, boto3
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'tf-custom-container-test'
tag = ':latest'

region = boto3.session.Session().region_name

uri_suffix = 'amazonaws.com'

byoc_image_uri = '{}.dkr.ecr.{}.{}/{}'.format(
    account_id, region, uri_suffix, ecr_repository + tag
)

byoc_image_uri
# This should return something like
# 111122223333.dkr.ecr.us-east-2.amazonaws.com/tf-custom-container-test:latest

estimator = TensorFlow(
    image_uri=byoc_image_uri,
    role=get_execution_role(),
    base_job_name='tf-custom-container-test-job',
    instance_count=1,
    instance_type='ml.p3.8xlarge',
    compiler_config=TrainingCompilerConfig(),
    disable_profiler=True,
    debugger_hook_config=False
)

# Start training
estimator.fit()
```

## Enable SageMaker Training Compiler Using the SageMaker AI `CreateTrainingJob` API Operation
<a name="training-compiler-enable-tensorflow-api"></a>

SageMaker Training Compiler configuration options must be specified through the `AlgorithmSpecification` and `HyperParameters` fields in the request syntax for the [`CreateTrainingJob` API operation](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html).

```
"AlgorithmSpecification": {
    "TrainingImage": "<sagemaker-training-compiler-enabled-dlc-image>"
},

"HyperParameters": {
    "sagemaker_training_compiler_enabled": "true",
    "sagemaker_training_compiler_debug_mode": "false"
}
```

To find a complete list of deep learning container image URIs that have SageMaker Training Compiler implemented, see [Supported Frameworks](training-compiler-support.md#training-compiler-supported-frameworks).

# SageMaker Training Compiler Example Notebooks and Blogs
<a name="training-compiler-examples-and-blogs"></a>

**Important**  
Amazon Web Services (AWS) announces that there will be no new releases or versions of SageMaker Training Compiler. You can continue to utilize SageMaker Training Compiler through the existing AWS Deep Learning Containers (DLCs) for SageMaker Training. It is important to note that while the existing DLCs remain accessible, they will no longer receive patches or updates from AWS, in accordance with the [AWS Deep Learning Containers Framework Support Policy](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/support-policy.html).

The following blogs, case studies, and notebooks provide examples of how to implement SageMaker Training Compiler.

Example notebooks are provided in the [SageMaker AI examples GitHub repository](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-training-compiler), and you can also browse them on the [SageMaker AI examples website](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/index.html).

## Blogs and Case Studies
<a name="training-compiler-blogs"></a>

The following blogs discuss case studies about using SageMaker Training Compiler.
+ [New – Introducing SageMaker Training Compiler](https://aws.amazon.com/blogs/aws/new-introducing-sagemaker-training-compiler/)
+ [Hugging Face Transformers BERT fine-tuning using Amazon SageMaker Training Compiler](https://www.philschmid.de/huggingface-amazon-sagemaker-training-compiler)
+ [Speed up Hugging Face Training Jobs on AWS by Up to 50% with SageMaker Training Compiler](https://towardsdatascience.com/speed-up-hugging-face-training-jobs-on-aws-by-up-to-50-with-sagemaker-training-compiler-9ad2ac5b0eb)

## Example Notebooks
<a name="training-compiler-example-notebooks"></a>

To find examples of using SageMaker Training Compiler, see the [Training Compiler page](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-training-compiler/index.html) in the *Amazon SageMaker AI Example Read the Docs website*.

# SageMaker Training Compiler Best Practices and Considerations
<a name="training-compiler-tips-pitfalls"></a>

**Important**  
Amazon Web Services (AWS) announces that there will be no new releases or versions of SageMaker Training Compiler. You can continue to utilize SageMaker Training Compiler through the existing AWS Deep Learning Containers (DLCs) for SageMaker Training. It is important to note that while the existing DLCs remain accessible, they will no longer receive patches or updates from AWS, in accordance with the [AWS Deep Learning Containers Framework Support Policy](https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/support-policy.html).

Review the following best practices and considerations when using SageMaker Training Compiler.

## Best Practices
<a name="training-compiler-tips-pitfalls-best-practices"></a>

Use the following guidelines to achieve the best results when you run training jobs with SageMaker Training Compiler.

**General Best Practices**
+ Make sure that you use one of the [Supported Instance Types](training-compiler-support.md#training-compiler-supported-instance-types) and [Tested Models](training-compiler-support.md#training-compiler-tested-models). 
+ When you create a tokenizer for an NLP model using the Hugging Face Transformers library in your training script, make sure that you use a static input tensor shape by specifying `padding='max_length'`. Do not use `padding='longest'` because padding to the longest sequence in the batch can change the tensor shape for each training batch. The dynamic input shape can initiate recompilation of the model and might increase total training time. For more information about padding options of the Transformers tokenizers, see [Padding and truncation](https://huggingface.co/docs/transformers/pad_truncation) in the *Hugging Face Transformers documentation*.
+ Measure GPU memory utilization to make sure that you use the maximum batch size that can fit into the GPU memory. Amazon SageMaker Training Compiler reduces the memory footprint of your model during training, which typically allows you to fit a larger `batch_size` in the GPU memory. Using a larger `batch_size` results in better GPU utilization and reduces the total training time. 

  When you adjust the batch size, you also have to adjust the `learning_rate` appropriately. For example, if you increase the batch size by a factor of `k`, you need to adjust the `learning_rate` linearly (simple multiplication by `k`) or multiply it by the square root of `k`. This achieves the same or similar convergence behavior within the reduced training time. For reference `batch_size` values tested for popular models, see [Tested Models](training-compiler-support.md#training-compiler-tested-models).
+ To debug the compiler-accelerated training job, enable the `debug` flag in the `compiler_config` parameter. This enables SageMaker AI to write debugging logs to the SageMaker training job logs.

  ```
  huggingface_estimator=HuggingFace(
      ...
      compiler_config=TrainingCompilerConfig(debug=True)
  )
  ```

  Note that if you enable full debugging of the training job with the compiler, this might add some overhead.
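The batch-size-to-learning-rate adjustment described in the list above can be expressed as a small helper. This is a toy illustration (the function name is ours), covering the two rules mentioned: linear scaling and square-root scaling.

```python
import math

def scale_learning_rate(base_lr, base_batch_size, new_batch_size, rule="linear"):
    """Scale the learning rate when the batch size changes by a factor k."""
    k = new_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * k            # linear scaling: multiply by k
    return base_lr * math.sqrt(k)     # square-root scaling: multiply by sqrt(k)

# Doubling the batch size from 32 to 64 (k = 2):
linear_lr = scale_learning_rate(5e-5, 32, 64)             # 5e-5 * 2
sqrt_lr = scale_learning_rate(5e-5, 32, 64, rule="sqrt")  # 5e-5 * sqrt(2)
```

Which rule converges better depends on the model and optimizer, so treat the scaled value as a starting point for re-tuning rather than a final setting.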

**Best Practices for PyTorch**
+ If you bring a PyTorch model and want to checkpoint it, make sure you use PyTorch/XLA's model save function to properly checkpoint your model. For more information about the function, see [https://pytorch.org/xla/release/1.9/index.html#torch_xla.core.xla_model.save](https://pytorch.org/xla/release/1.9/index.html#torch_xla.core.xla_model.save) in the *PyTorch on XLA Devices documentation*. 

  To learn how to add the modifications to your PyTorch script, see [Large Language Models Using PyTorch Directly (without the Hugging Face Transformers Trainer API)](training-compiler-pytorch-models.md#training-compiler-pytorch-models-non-trainer).

  For more information about the actual application of using the model save function, see [Checkpoint Writing and Loading](https://huggingface.co/blog/pytorch-xla#checkpoint-writing-and-loading) in the *Hugging Face on PyTorch/XLA TPUs: Faster and cheaper training blog*.
+ To achieve the best training time for distributed training, consider the following.
  + Use instances with multiple GPUs instead of single-GPU instances. For example, a single `ml.p3dn.24xlarge` instance has faster training time compared to 8 x `ml.p3.2xlarge` instances.
  + Use instances with EFA support such as `ml.p3dn.24xlarge` and `ml.p4d.24xlarge`. These instance types have accelerated networking speed and reduce training time.
  + Tune the `preprocessing_num_workers` parameter for datasets, so that model training is not delayed by slow preprocessing.
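For the checkpointing guidance above, the save call is a short sketch like the following (not runnable as-is; it assumes `torch_xla` is installed and a `model` object exists, and the output path shown is SageMaker AI's commonly used local checkpoint directory):

```
import torch_xla.core.xla_model as xm

# xm.save() synchronizes the workers and, by default, writes only from the master worker
xm.save(model.state_dict(), "/opt/ml/checkpoints/model.pt")
```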

## Considerations
<a name="training-compiler-tips-pitfalls-considerations"></a>

Consider the following when using SageMaker Training Compiler.

### Performance degradation due to logging, checkpointing, and profiling
<a name="training-compiler-considerations-performance-degradation"></a>
+ Avoid logging, checkpointing, and profiling model tensors that lead to explicit evaluations. To understand what an explicit evaluation is, consider the following code compiling example.

  ```
  a = b+c
  e = a+d
  ```

  A compiler interprets the code as follows and reduces the memory footprint for the variable `a`:

  ```
  e = b+c+d
  ```

  Now consider the following case in which the code is changed to add a print function for the variable `a`.

  ```
  a = b+c
  e = a+d
  print(a)
  ```

  The compiler makes an explicit evaluation of the variable `a` as follows.

  ```
  e = b+c+d
  a = b+c    # Explicit evaluation
  print(a)
  ```

  In PyTorch, for example, avoid using [torch.Tensor.item()](https://pytorch.org/docs/stable/generated/torch.Tensor.item.html), which might introduce explicit evaluations. In deep learning, such explicit evaluations can cause overhead because they break fused operations in a compilation graph of a model and lead to recomputation of the tensors. 

  If you still want to periodically evaluate the model during training while using SageMaker Training Compiler, we recommend logging and checkpointing at a lower frequency to reduce overhead due to explicit evaluations. For example, log every 10 epochs instead of every epoch.
+ Graph compilation runs during the first few steps of training. As a result, the first few steps are expected to be exceptionally slow. However, this is a one-time compilation cost and can be amortized by training for a longer duration because compilation makes future steps much faster. The initial compilation overhead depends on the size of the model, the size of the input tensors, and the distribution of input tensor shapes.
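To see why a `print` or `.item()` call is costly, the deferred-execution behavior behind explicit evaluations can be modeled with a toy class. This is purely illustrative (the class and counter are ours, not XLA internals): arithmetic only builds a graph, and only an explicit request computes values.

```python
class LazyTensor:
    """Toy deferred tensor: arithmetic builds a graph; values are computed
    only when explicitly requested (like printing or .item() on XLA)."""
    materializations = 0  # counts explicit evaluations across all tensors

    def __init__(self, compute):
        self._compute = compute

    def __add__(self, other):
        # Graph building only -- nothing is computed here.
        return LazyTensor(lambda: self._compute() + other._compute())

    def item(self):
        LazyTensor.materializations += 1   # an explicit evaluation
        return self._compute()

b, c, d = LazyTensor(lambda: 2), LazyTensor(lambda: 3), LazyTensor(lambda: 4)
a = b + c            # no computation yet
e = a + d            # still just graph building
result = e.item()    # one explicit evaluation runs the fused graph
logged = a.item()    # "printing" a forces a second, separate evaluation
```

Without the `a.item()` call, only one materialization happens; each print or log of an intermediate tensor adds another, which is why the guidance above recommends logging at a lower frequency.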

### Incorrect use of the PyTorch/XLA APIs when using PyTorch directly
<a name="training-compiler-considerations-incorrect-api-use"></a>

PyTorch/XLA defines a set of APIs to replace some of the existing PyTorch training APIs. Failing to use them properly causes PyTorch training to fail.
+ One of the most typical errors when compiling a PyTorch model is due to a wrong device type for operators and tensors. To properly compile a PyTorch model, make sure you use XLA devices ([https://pytorch.org/xla/release/1.9/index.html](https://pytorch.org/xla/release/1.9/index.html)) instead of using CUDA or mixing CUDA devices and XLA devices.
+ `mark_step()` is a barrier just for XLA. Failing to set it correctly causes a training job to stall.
+ PyTorch/XLA provides additional distributed training APIs. Failing to program the APIs properly causes gradients to be collected incorrectly, which causes a training convergence failure.

To properly set up your PyTorch script and avoid the aforementioned incorrect API uses, see [Large Language Models Using PyTorch Directly (without the Hugging Face Transformers Trainer API)](training-compiler-pytorch-models.md#training-compiler-pytorch-models-non-trainer).
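As a reference for the points above, a minimal single-device training loop sketch looks like the following (not runnable as-is; it assumes `torch_xla` is installed, and `model`, `optimizer`, `loss_fn`, and `train_loader` are placeholders):

```
import torch_xla.core.xla_model as xm

device = xm.xla_device()      # use the XLA device, not "cuda"
model = model.to(device)

for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs.to(device)), targets.to(device))
    loss.backward()
    optimizer.step()
    xm.mark_step()            # barrier: runs the accumulated graph once per step
```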

# SageMaker Training Compiler FAQ
<a name="training-compiler-faq"></a>

Use the following FAQ items to find answers to commonly asked questions about SageMaker Training Compiler.

**Q. How do I know SageMaker Training Compiler is working?**

If you successfully launched your training job with SageMaker Training Compiler, you receive the following log messages:
+ With `TrainingCompilerConfig(debug=False)`

  ```
  Found configuration for Training Compiler
  Configuring SM Training Compiler...
  ```
+ With `TrainingCompilerConfig(debug=True)`

  ```
  Found configuration for Training Compiler
  Configuring SM Training Compiler...
  Training Compiler set to debug mode
  ```

**Q. Which models does SageMaker Training Compiler accelerate?**

SageMaker Training Compiler supports the most popular deep learning models from the Hugging Face transformers library. With most of the operators that the compiler supports, these models can be trained faster with SageMaker Training Compiler. Compilable models include but are not limited to the following: `bert-base-cased`, `bert-base-chinese`, `bert-base-uncased`, `distilbert-base-uncased`, `distilbert-base-uncased-finetuned-sst-2-english`, `gpt2`, `roberta-base`, `roberta-large`, `t5-base`, and `xlm-roberta-base`. The compiler works with most DL operators and data structures and can accelerate many other DL models beyond those that have been tested.

**Q. What happens if I enable SageMaker Training Compiler with a model that isn't tested?**

For an untested model, you might need to first modify the training script to be compatible with SageMaker Training Compiler. For more information, see [Bring Your Own Deep Learning Model](training-compiler-modify-scripts.md) and follow the instructions on how to prepare your training script.

Once you have updated your training script, you can start the training job. The compiler proceeds to compile the model. However, training speed may not increase and might even decrease relative to the baseline with an untested model. You might need to retune training parameters such as `batch_size` and `learning_rate` to achieve any speedup benefits.

If compilation of the untested model fails, the compiler returns an error. See [SageMaker Training Compiler Troubleshooting](training-compiler-troubleshooting.md) for detailed information about the failure types and error messages.

**Q. Will I always get a faster training job with SageMaker Training Compiler?**

No, not necessarily. First, SageMaker Training Compiler adds some compilation overhead before the ongoing training process can be accelerated. The optimized training job must run sufficiently long to amortize and make up for this incremental compilation overhead at the beginning of the training job.

Additionally, as with any model training process, training with suboptimal parameters can increase training time. SageMaker Training Compiler can change the characteristics of the training job by, for example, changing the memory footprint of the job. Because of these differences, you might need to retune your training job parameters to speed up training. A reference table specifying the best performing parameters for training jobs with different instance types and models can be found at [Tested Models](training-compiler-support.md#training-compiler-tested-models).

Finally, some code in a training script might add additional overhead or disrupt the compiled computation graph and slow training. If working with a customized or untested model, see the instructions at [Best Practices to Use SageMaker Training Compiler with PyTorch/XLA](training-compiler-pytorch-models.md#training-compiler-pytorch-models-best-practices).

**Q. Can I always use a larger batch size with SageMaker Training Compiler?**

You can use a larger batch size in most, but not all, cases. The optimizations made by SageMaker Training Compiler can change the characteristics of your training job, such as the memory footprint. Typically, a Training Compiler job occupies less memory than an uncompiled training job with the native framework, which allows for a larger batch size during training. A larger batch size, and a corresponding adjustment to the learning rate, increases training throughput and can decrease total training time.

However, there could be cases where SageMaker Training Compiler might actually increase memory footprint based on its optimization scheme. The compiler uses an analytical cost model to predict the execution schedule with the lowest cost of execution for any compute-intensive operator. This model could find an optimal schedule that increases memory use. In this case, you won’t be able to increase batch sizes, but your sample throughput is still higher.

**Q. Does SageMaker Training Compiler work with other SageMaker training features, such as the SageMaker AI distributed training libraries and SageMaker Debugger?**

SageMaker Training Compiler is currently not compatible with SageMaker AI’s distributed training libraries.

SageMaker Training Compiler is compatible with SageMaker Debugger, but Debugger might degrade computational performance by adding overhead.

**Q. Does SageMaker Training Compiler support custom containers (bring your own container)?**

SageMaker Training Compiler is provided through AWS Deep Learning Containers, and you can extend a subset of the containers to customize for your use case. Containers that are extended from AWS DLCs are supported by SageMaker Training Compiler. For more information, see [Supported Frameworks](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-support.html#training-compiler-supported-frameworks) and [Using the SageMaker AI Python SDK and Extending SageMaker AI Framework Deep Learning Containers](training-compiler-enable-tensorflow.md#training-compiler-enable-tensorflow-sdk-extend-container). If you need further support, reach out to the SageMaker AI team through [AWS Support](https://console.aws.amazon.com/support/) or [AWS Developer Forums for Amazon SageMaker AI](https://forums.aws.amazon.com/forum.jspa?forumID=285).
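As a sketch, a minimal Dockerfile extending one of the Training Compiler DLCs could look like the following. The base image URI is the PyTorch v1.12.0 container listed in the release notes in this guide; `<region>` and the package name are placeholders you would replace.

```
# Sketch of a Dockerfile extending a SageMaker Training Compiler DLC
FROM 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-trcomp-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker

# Add the extra dependencies your training script needs (placeholder package name)
RUN pip install --no-cache-dir <your-extra-package>
```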

# SageMaker Training Compiler Troubleshooting
<a name="training-compiler-troubleshooting"></a>

If you run into an error, you can use the following list to try to troubleshoot your training job. If you need further support, reach out to the SageMaker AI team through [AWS Support](https://console.aws.amazon.com/support/) or [AWS Developer Forums for Amazon SageMaker AI](https://forums.aws.amazon.com/forum.jspa?forumID=285).

## Training job is not converging as expected when compared to the native framework training job
<a name="training-compiler-troubleshooting-convergence-issue"></a>

Convergence issues range from “the model is not learning when SageMaker Training Compiler is turned on” to “the model is learning but slower than the native framework”. In this troubleshooting guide, we assume your convergence is fine without SageMaker Training Compiler (in the native framework) and consider this the baseline.

When faced with such convergence issues, the first step is to identify whether the issue is limited to distributed training or stems from single-GPU training. Distributed training with SageMaker Training Compiler is an extension of single-GPU training with additional steps.

1. Set up a cluster with multiple instances or GPUs.

1. Distribute input data to all workers.

1. Synchronize the model updates from all workers.

Therefore, any convergence issue in single-GPU training propagates to distributed training with multiple workers.

![\[A flow chart to troubleshoot convergence issues in training jobs when using SageMaker Training Compiler.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/training-compiler-troubleshooting-convergence-flow.jpg)


### Convergence issues occurring in single-GPU training
<a name="training-compiler-troubleshooting-convergence-issue-single-gpu"></a>

If your convergence issue stems from single-GPU training, this is likely due to improper settings for hyperparameters or the `torch_xla` APIs.

**Check the hyperparameters**

Training with SageMaker Training Compiler changes the memory footprint of a model. The compiler intelligently arbitrates between re-use and re-computation, leading to a corresponding increase or decrease in memory consumption. To take advantage of this, it is essential to re-tune the batch size and associated hyperparameters when migrating a training job to SageMaker Training Compiler. However, incorrect hyperparameter settings often cause oscillation in training loss and possibly slower convergence as a result. In rare cases, aggressive hyperparameters might result in the model not learning (the training loss metric doesn’t decrease or returns `NaN`). To identify whether the convergence issue is due to the hyperparameters, run a side-by-side test of two training jobs, with and without SageMaker Training Compiler, while keeping all the hyperparameters the same.

**Check if the `torch_xla` APIs are properly set up for single-GPU training**

If the convergence issue persists with the baseline hyperparameters, you need to check whether there’s any improper usage of the `torch_xla` APIs, specifically the ones for updating the model. Fundamentally, `torch_xla` continues to accumulate instructions (deferring execution) in the form of a graph until it is explicitly instructed to run the accumulated graph. The `torch_xla.core.xla_model.mark_step()` function facilitates the execution of the accumulated graph. The graph execution should be synchronized using this function ***after each model update*** and ***before printing and logging any variables***. Without this synchronization step, the model might use stale values from memory during prints, logs, and subsequent forward passes, instead of the most recent values that have to be synchronized after every iteration and model update.

It can be more complicated when using SageMaker Training Compiler with gradient scaling (possibly from the use of AMP) or gradient clipping techniques. The appropriate order of gradient computation with AMP is as follows.

1. Gradient computation with scaling

1. Gradient un-scaling, gradient clipping, and then scaling

1. Model update

1. Synchronizing the graph execution with `mark_step()`

To find the right APIs for the operations mentioned in the list, see the guide for [migrating your training script to SageMaker Training Compiler](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-pytorch-models.html).
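As a non-runnable sketch, the four steps above map onto gradient-scaler calls roughly as follows (this assumes a PyTorch/XLA-compatible gradient scaler bound to `scaler`, `import torch_xla.core.xla_model as xm`, and the usual `loss`, `optimizer`, and `model` objects; the exact scaler class depends on your framework versions):

```
scaler.scale(loss).backward()       # 1. gradient computation with scaling
scaler.unscale_(optimizer)          # 2. un-scale, then clip the gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
scaler.step(optimizer)              # 3. model update (handles scaling internally)
scaler.update()
xm.mark_step()                      # 4. synchronize the graph execution
```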

**Consider using Automatic Model Tuning**

If the convergence issue arises when re-tuning the batch size and associated hyperparameters such as the learning rate while using SageMaker Training Compiler, consider using [Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html) to tune your hyperparameters. You can refer to the [example notebook on tuning hyperparameters with SageMaker Training Compiler](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-training-compiler/tensorflow/single_gpu_single_node/hyper-parameter-tuning.ipynb). 

### Convergence issues occurring in distributed training
<a name="training-compiler-troubleshooting-convergence-issue-distributed-training"></a>

If your convergence issue persists in distributed training, this is likely due to improper settings for weight initialization or the `torch_xla` APIs. 

**Check weight initialization across the workers**

If the convergence issue arises when running a distributed training job with multiple workers, ensure that there is uniform, deterministic behavior across all workers by setting a constant seed where applicable. Be aware of techniques that involve randomization, such as weight initialization. Each worker might end up training a different model in the absence of a constant seed.
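As a toy illustration of the constant-seed point (this is plain Python, not the torch API — in a real script you would call `torch.manual_seed()` and any framework-specific seeding on every worker):

```python
import random

def init_weights(n, seed):
    """Toy weight initialization; a constant seed makes it deterministic."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(n)]

# Two workers given the same seed start from identical weights:
worker_0 = init_weights(4, seed=42)
worker_1 = init_weights(4, seed=42)
```

With different (or unset) seeds, each worker would start from different weights and effectively train a different model.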

**Check if the `torch_xla` APIs are properly set up for distributed training**

If the issue still persists, this is likely due to improper use of the `torch_xla` APIs for distributed training. Make sure that you add the following in your estimator to set up a cluster for distributed training with SageMaker Training Compiler.

```
distribution={'torchxla': {'enabled': True}}
```

This should be accompanied by a function `_mp_fn(index)` in your training script, which is invoked once per worker. Without the `_mp_fn(index)` function, you might end up letting each of the workers train the model independently without sharing model updates.

Next, make sure that you use the `torch_xla.distributed.parallel_loader.MpDeviceLoader` API along with the distributed data sampler, as guided in the documentation about [migrating your training script to SageMaker Training Compiler](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler-pytorch-models.html), as in the following example.

```
torch.utils.data.distributed.DistributedSampler()
```

This ensures that the input data is properly distributed across all workers.
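Putting the sampler and loader together, a sketch of the input pipeline might look like the following (not runnable as-is; it assumes `torch` and `torch_xla` are installed, and `train_dataset` and `batch_size` are placeholders):

```
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset,
    num_replicas=xm.xrt_world_size(),  # total number of workers
    rank=xm.get_ordinal(),             # this worker's index
)
loader = torch.utils.data.DataLoader(train_dataset, sampler=sampler, batch_size=batch_size)
device = xm.xla_device()
train_loader = pl.MpDeviceLoader(loader, device)  # streams batches onto the XLA device
```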

Finally, to synchronize model updates from all workers, use `torch_xla.core.xla_model._fetch_gradients` to gather gradients from all workers and `torch_xla.core.xla_model.all_reduce` to combine all the gathered gradients into a single update. 

It can be more complicated when using SageMaker Training Compiler with gradient scaling (possibly from use of AMP) or gradient clipping techniques. The appropriate order of gradient computation with AMP is as follows.

1. Gradient computation with scaling

1. Gradient synchronization across all workers

1. Gradient un-scaling, gradient clipping, and then gradient scaling

1. Model update

1. Synchronizing the graph execution with `mark_step()`

Note that this checklist has an additional item for synchronizing all workers, compared to the checklist for single-GPU training.

## Training job fails due to missing PyTorch/XLA configuration
<a name="training-compiler-troubleshooting-missing-xla-config"></a>

If a training job fails with the `Missing XLA configuration` error message, it might be due to a misconfiguration in the number of GPUs per instance that you use.

XLA requires additional environment variables to compile the training job. The most common missing environment variable is `GPU_NUM_DEVICES`. For the compiler to work properly, you must set this environment variable equal to the number of GPUs per instance.

There are three approaches to set the `GPU_NUM_DEVICES` environment variable:
+ **Approach 1** – Use the `environment` argument of the SageMaker AI estimator class. For example, if you use an `ml.p3.8xlarge` instance that has four GPUs, do the following:

  ```
  # Using the SageMaker Python SDK's HuggingFace estimator
  
  hf_estimator=HuggingFace(
      ...
      instance_type="ml.p3.8xlarge",
      hyperparameters={...},
      environment={
          ...
          "GPU_NUM_DEVICES": "4" # corresponds to number of GPUs on the specified instance
      },
  )
  ```
+ **Approach 2** – Use the `hyperparameters` argument of the SageMaker AI estimator class and parse it in your training script.

  1. To specify the number of GPUs, add a key-value pair to the `hyperparameters` argument.

     For example, if you use an `ml.p3.8xlarge` instance that has four GPUs, do the following:

     ```
     # Using the SageMaker Python SDK's HuggingFace estimator
     
     hf_estimator=HuggingFace(
         ...
         entry_point = "train.py",
         instance_type= "ml.p3.8xlarge",
         hyperparameters = {
             ...
             "n_gpus": 4 # corresponds to number of GPUs on specified instance
         }
     )
     hf_estimator.fit()
     ```

  1. In your training script, parse the `n_gpus` hyperparameter and specify it as an input for the `GPU_NUM_DEVICES` environment variable.

     ```
     # train.py
     import os, argparse
     
     if __name__ == "__main__":
         parser = argparse.ArgumentParser()
         ...
         # Data, model, and output directories
         parser.add_argument("--output_data_dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
         parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
         parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
         parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])
         parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])
     
         args, _ = parser.parse_known_args()
     
         os.environ["GPU_NUM_DEVICES"] = args.n_gpus
     ```
+ **Approach 3** – Hard-code the `GPU_NUM_DEVICES` environment variable in your training script. For example, add the following to your script if you use an instance that has four GPUs.

  ```
  # train.py
  
  import os
  os.environ["GPU_NUM_DEVICES"] = "4"  # environment variable values must be strings
  ```

**Tip**  
To find the number of GPU devices on machine learning instances that you want to use, see [Accelerated Computing](https://aws.amazon.com/ec2/instance-types/#Accelerated_Computing) on the *Amazon EC2 Instance Types* page.

## SageMaker Training Compiler doesn't reduce the total training time
<a name="training-compiler-troubleshooting-no-improved-training-time"></a>

If the total training time does not decrease with SageMaker Training Compiler, we highly recommend that you review the [SageMaker Training Compiler Best Practices and Considerations](training-compiler-tips-pitfalls.md) page to check your training configuration, your padding strategy for the input tensor shape, and your hyperparameters.

# Amazon SageMaker Training Compiler Release Notes
<a name="training-compiler-release-notes"></a>

See the following release notes to track the latest updates for Amazon SageMaker Training Compiler.

## SageMaker Training Compiler Release Notes: February 13, 2023
<a name="training-compiler-release-notes-20230213"></a>

**Currency Updates**
+ Added support for PyTorch v1.13.1

**Bug Fixes**
+ Fixed a race condition issue on GPU that was causing `NaN` loss in some models, such as vision transformer (ViT) models.

**Other Changes**
+ SageMaker Training Compiler improves performance by letting PyTorch/XLA automatically override the optimizers (such as SGD, Adam, and AdamW) in `torch.optim` or `transformers.optimization` with their syncfree versions in `torch_xla.amp.syncfree` (such as `torch_xla.amp.syncfree.SGD`, `torch_xla.amp.syncfree.Adam`, and `torch_xla.amp.syncfree.AdamW`). You don't need to change the lines of code where you define optimizers in your training script.

**Migration to AWS Deep Learning Containers**

This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:
+ PyTorch v1.13.1

  ```
  763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-trcomp-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
  ```

  To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).

## SageMaker Training Compiler Release Notes: January 9, 2023
<a name="training-compiler-release-notes-20230109"></a>

**Breaking Changes**
+ `tf.keras.optimizers.Optimizer` points to a new optimizer in TensorFlow 2.11.0 and later. The old optimizers are moved to `tf.keras.optimizers.legacy`. You might encounter job failure due to the breaking change when you do the following. 
  + Load checkpoints from an old optimizer. We recommend that you switch to the legacy optimizers.
  + Use TensorFlow v1. We recommend that you migrate to TensorFlow v2, or switch to the legacy optimizers if you need to continue using TensorFlow v1.

  For a more detailed list of breaking changes from the optimizer changes, see the [official TensorFlow v2.11.0 release notes](https://github.com/tensorflow/tensorflow/releases/tag/v2.11.0) in the TensorFlow GitHub repository.

**Migration to AWS Deep Learning Containers**

This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:
+ TensorFlow v2.11.0

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.11.0-gpu-py39-cu112-ubuntu20.04-sagemaker
  ```

  To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).

## SageMaker Training Compiler Release Notes: December 8, 2022
<a name="training-compiler-release-notes-20221208"></a>

**Bug Fixes**
+ Fixed the seed for PyTorch training jobs starting with PyTorch v1.12 to ensure that there is no discrepancy in model initialization across different processes. See also [PyTorch Reproducibility](https://pytorch.org/docs/stable/notes/randomness.html).
+ Fixed the issue causing PyTorch distributed training jobs on G4dn and G5 instances to not default to communication through [PCIe](https://en.wikipedia.org/wiki/PCI_Express).

**Known Issues**
+ Improper use of PyTorch/XLA APIs in Hugging Face’s vision transformers might lead to convergence issues.

**Other Changes**
+ When using the Hugging Face Transformers `Trainer` class, make sure that you use SyncFree optimizers by setting the `optim` argument to `adamw_torch_xla`. For more information, see [Large Language Models Using the Hugging Face Transformers `Trainer` Class](training-compiler-pytorch-models.md#training-compiler-pytorch-models-transformers-trainer). See also [Optimizer](https://huggingface.co/docs/transformers/v4.23.1/en/perf_train_gpu_one#optimizer) in the *Hugging Face Transformers documentation*.

**Migration to AWS Deep Learning Containers**

This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:
+ PyTorch v1.12.0

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-trcomp-training:1.12.0-gpu-py38-cu113-ubuntu20.04-sagemaker
  ```

  To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).

## SageMaker Training Compiler Release Notes: October 4, 2022
<a name="training-compiler-release-notes-20221004"></a>

**Currency Updates**
+ Added support for TensorFlow v2.10.0.

**Other Changes**
+ Added Hugging Face NLP models using the Transformers library to TensorFlow framework tests. To find the tested Transformer models, see [Tested Models](training-compiler-support.md#training-compiler-tested-models).

**Migration to AWS Deep Learning Containers**

This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:
+ TensorFlow v2.10.0

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.10.0-gpu-py39-cu112-ubuntu20.04-sagemaker
  ```

  To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).

## SageMaker Training Compiler Release Notes: September 1, 2022
<a name="training-compiler-release-notes-20220825"></a>

**Currency Updates**
+ Added support for Hugging Face Transformers v4.21.1 with PyTorch v1.11.0.

**Improvements**
+ Implemented a new distributed training launcher mechanism to activate SageMaker Training Compiler for Hugging Face Transformer models with PyTorch. To learn more, see [Run PyTorch Training Jobs with SageMaker Training Compiler for Distributed Training](training-compiler-enable-pytorch.md#training-compiler-estimator-pytorch-distributed).
+ Integrated with EFA to improve the collective communication in distributed training.
+ Added support for G5 instances for PyTorch training jobs. For more information, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).

**Migration to AWS Deep Learning Containers**

This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:
+ [HuggingFace v4.21.1 with PyTorch v1.11.0](https://github.com/aws/deep-learning-containers/releases/tag/v1.0-trcomp-hf-4.21.1-pt-1.11.0-tr-gpu-py38)

  ```
  763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-trcomp-training:1.11.0-transformers4.21.1-gpu-py38-cu113-ubuntu20.04
  ```

  To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).

## SageMaker Training Compiler Release Notes: June 14, 2022
<a name="training-compiler-release-notes-20220614"></a>

**New Features**
+ Added support for TensorFlow v2.9.1. SageMaker Training Compiler fully supports compiling TensorFlow modules (`tf.*`) and TensorFlow Keras modules (`tf.keras.*`).
+ Added support for custom containers created by extending AWS Deep Learning Containers for TensorFlow. For more information, see [Enable SageMaker Training Compiler Using the SageMaker Python SDK and Extending SageMaker AI Framework Deep Learning Containers](training-compiler-enable-tensorflow.md#training-compiler-enable-tensorflow-sdk-extend-container).
+ Added support for G5 instances for TensorFlow training jobs.

**Migration to AWS Deep Learning Containers**

This release passed benchmark testing and is migrated to the following AWS Deep Learning Container:
+ TensorFlow 2.9.1

  ```
  763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.9.1-gpu-py39-cu112-ubuntu20.04-sagemaker
  ```

  To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).

## SageMaker Training Compiler Release Notes: April 26, 2022
<a name="training-compiler-release-notes-20220426"></a>

**Improvements**
+ Added support for all of the AWS Regions where [AWS Deep Learning Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) are in service except the China regions.

## SageMaker Training Compiler Release Notes: April 12, 2022
<a name="training-compiler-release-notes-20220412"></a>

**Currency Updates**
+ Added support for Hugging Face Transformers v4.17.0 with TensorFlow v2.6.3 and PyTorch v1.10.2.

## SageMaker Training Compiler Release Notes: February 21, 2022
<a name="training-compiler-release-notes-20220221"></a>

**Improvements**
+ Completed benchmark test and confirmed training speed-ups on the `ml.g4dn` instance types. To find a complete list of tested `ml` instances, see [Supported Instance Types](training-compiler-support.md#training-compiler-supported-instance-types).

## SageMaker Training Compiler Release Notes: December 01, 2021
<a name="training-compiler-release-notes-20211201"></a>

**New Features**
+ Launched Amazon SageMaker Training Compiler at AWS re:Invent 2021.

**Migration to AWS Deep Learning Containers**
+ Amazon SageMaker Training Compiler passed benchmark testing and is migrated to AWS Deep Learning Containers. To find a complete list of the prebuilt containers with Amazon SageMaker Training Compiler, see [Supported Frameworks, AWS Regions, Instance Types, and Tested Models](training-compiler-support.md).

# Setting up training jobs to access datasets
<a name="model-access-training-data"></a>

When creating a training job, you specify the location of training datasets in a data storage of your choice and the data input mode for the job. Amazon SageMaker AI supports Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), and Amazon FSx for Lustre. You can choose one of the input modes to stream the dataset in real time or download the whole dataset at the start of the training job.

**Note**  
Your dataset must reside in the same AWS Region as the training job.

## SageMaker AI input modes and AWS cloud storage options
<a name="model-access-training-data-input-modes"></a>

This section provides an overview of the data input modes supported by SageMaker AI for data stored in Amazon S3 and in file systems in Amazon EFS and Amazon FSx for Lustre.

![\[Summary of the SageMaker AI input modes for Amazon S3 and file systems in Amazon EFS and Amazon FSx for Lustre.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-training-input-mode.png)

+ *File mode* presents a file system view of the dataset to the training container. This is the default input mode if you don't explicitly specify one of the other two options. If you use file mode, SageMaker AI downloads the training data from the storage location to a local directory in the Docker container. Training starts after the full dataset has been downloaded. In file mode, the training instance must have enough storage space to fit the entire dataset. File mode download speed depends on the size of the dataset, the average file size, and the number of files. You can configure the dataset for file mode by providing an Amazon S3 prefix, a manifest file, or an augmented manifest file. You should use an S3 prefix when all your dataset files are located within a common S3 prefix. File mode is compatible with [SageMaker AI local mode](https://sagemaker.readthedocs.io/en/stable/overview.html#local-mode) (starting a SageMaker training container interactively in seconds). For distributed training, you can shard the dataset across multiple instances with the `ShardedByS3Key` option.
+ *Fast file mode* provides file system access to an Amazon S3 data source while leveraging the performance advantage of pipe mode. At the start of training, fast file mode identifies the data files but does not download them, so training can start without waiting for the entire dataset to download. Startup time is lower when there are fewer files under the Amazon S3 prefix provided.

  In contrast to pipe mode, fast file mode works with random access to the data. However, it works best when data is read sequentially. Fast file mode doesn't support augmented manifest files.

  Fast file mode exposes S3 objects using a POSIX-compliant file system interface, as if the files are available on the local disk of your training instance. It streams S3 content on demand as your training script consumes data. This means that your dataset no longer needs to fit into the training instance storage space as a whole, and you don't need to wait for the dataset to be downloaded to the training instance before training starts. Fast file currently supports S3 prefixes only (it does not support manifest and augmented manifest). Fast file mode is compatible with SageMaker AI local mode.
**Note**  
Using fast file mode might lead to increased CloudTrail costs due to additional logging of:  
Amazon S3 data events (if enabled in CloudTrail).
AWS KMS decryption events when accessing Amazon S3 objects encrypted with AWS KMS keys.
Management events related to AWS KMS operations.
Review your CloudTrail configuration and monitoring costs if you have CloudTrail logging enabled for these event types.
+ *Pipe mode* streams data directly from an Amazon S3 data source. Streaming can provide faster start times and better throughput than file mode.

  When you stream the data directly, you can reduce the size of the Amazon EBS volumes used by the training instance. Pipe mode needs only enough disk space to store the final model artifacts.

  Pipe mode is an older streaming mode that has largely been superseded by the newer and simpler-to-use fast file mode. In pipe mode, data is pre-fetched from Amazon S3 at high concurrency and throughput, and streamed into a named pipe, also known as a First-In-First-Out (FIFO) pipe for its behavior. Each pipe may only be read by a single process. A SageMaker AI-specific extension to TensorFlow conveniently [integrates pipe mode into the native TensorFlow data loader](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#training-with-pipe-mode-using-pipemodedataset) for streaming text, TFRecords, or RecordIO file formats. Pipe mode also supports managed sharding and shuffling of data.
+ *Amazon S3 Express One Zone* is a high-performance, single Availability Zone storage class that can deliver consistent, single-digit millisecond data access for the most latency-sensitive applications including SageMaker model training. Amazon S3 Express One Zone allows customers to collocate their object storage and compute resources in a single AWS Availability Zone, optimizing both compute performance and costs with increased data processing speed. To further increase access speed and support hundreds of thousands of requests per second, data is stored in a new bucket type, an Amazon S3 directory bucket.

  SageMaker AI model training supports high-performance Amazon S3 Express One Zone directory buckets as a data input location for file mode, fast file mode, and pipe mode. To use Amazon S3 Express One Zone, input the location of the Amazon S3 Express One Zone directory bucket instead of an Amazon S3 bucket. Provide the ARN for the IAM role with the required access control and permissions policy. Refer to the [AmazonSageMakerFullAccess policy](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) for details. You can only encrypt your SageMaker AI output data in directory buckets with server-side encryption with Amazon S3 managed keys (SSE-S3). Server-side encryption with AWS KMS keys (SSE-KMS) is not currently supported for storing SageMaker AI output data in directory buckets. For more information, see [Amazon S3 Express One Zone](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-one-zone.html).
+ Amazon FSx for Lustre – FSx for Lustre can scale to hundreds of gigabytes per second of throughput and millions of IOPS with low-latency file retrieval. When starting a training job, SageMaker AI mounts the FSx for Lustre file system to the training instance file system, then starts your training script. Mounting itself is a relatively fast operation that doesn't depend on the size of the dataset stored in FSx for Lustre.

  To access FSx for Lustre, your training job must connect to an Amazon Virtual Private Cloud (VPC), which requires DevOps setup and involvement. To avoid data transfer costs, the file system uses a single Availability Zone, and you need to specify a VPC subnet that maps to this Availability Zone ID when running the training job.
+ Amazon EFS – To use Amazon EFS as a data source, the data must already reside in Amazon EFS prior to training. SageMaker AI mounts the specified Amazon EFS file system to the training instance, then starts your training script. Your training job must connect to a VPC to access Amazon EFS.
**Tip**  
To learn more about how to specify your VPC configuration to SageMaker AI estimators, see [Use File Systems as Training Inputs](https://sagemaker.readthedocs.io/en/stable/overview.html?highlight=VPC#use-file-systems-as-training-inputs) in the *SageMaker AI Python SDK documentation*.
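The `ShardedByS3Key` option mentioned for file mode assigns each training instance roughly 1/*n* of the S3 objects so that each instance downloads only its own shard. As a rough illustration of that idea (a sketch only, not SageMaker AI's exact partitioning, which may differ), a round-robin split over sorted keys looks like this:

```python
def shard_by_s3_key(keys, num_instances, instance_index):
    """Illustrative round-robin shard: instance i receives every
    num_instances-th key from the sorted key list."""
    ordered = sorted(keys)
    return [k for i, k in enumerate(ordered) if i % num_instances == instance_index]

keys = [f"train/shard-{i:03d}.tfrecord" for i in range(6)]
print(shard_by_s3_key(keys, num_instances=2, instance_index=0))
# ['train/shard-000.tfrecord', 'train/shard-002.tfrecord', 'train/shard-004.tfrecord']
```

With `FullyReplicated` (the default), every instance would instead receive the full key list.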

# Configure data input mode using the SageMaker Python SDK
<a name="model-access-training-data-using-pysdk"></a>

The SageMaker Python SDK provides the generic [Estimator class](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) and its [variations for ML frameworks](https://sagemaker.readthedocs.io/en/stable/frameworks/index.html) for launching training jobs. You can specify one of the data input modes while configuring the SageMaker AI `Estimator` class or the `Estimator.fit` method. The following code templates show the two ways to specify input modes.

**To specify the input mode using the Estimator class**

```
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    checkpoint_s3_uri='s3://amzn-s3-demo-bucket/checkpoint-destination/',
    output_path='s3://amzn-s3-demo-bucket/output-path/',
    base_job_name='job-name',
    input_mode='File',  # Available options: File | Pipe | FastFile
    ...
)

# Run the training job
estimator.fit(
    inputs=TrainingInput(s3_data="s3://amzn-s3-demo-bucket/my-data/train")
)
```

For more information, see the [sagemaker.estimator.Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) class in the *SageMaker Python SDK documentation*.

**To specify the input mode through the `estimator.fit()` method**

```
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

estimator = Estimator(
    checkpoint_s3_uri='s3://amzn-s3-demo-bucket/checkpoint-destination/',
    output_path='s3://amzn-s3-demo-bucket/output-path/',
    base_job_name='job-name',
    ...
)

# Run the training job
estimator.fit(
    inputs=TrainingInput(
        s3_data="s3://amzn-s3-demo-bucket/my-data/train",
        input_mode='File'  # Available options: File | Pipe | FastFile
    )
)
```

For more information, see the [sagemaker.estimator.Estimator.fit](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.fit) class method and the [sagemaker.inputs.TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html#sagemaker.inputs.TrainingInput) class in the *SageMaker Python SDK documentation*.

**Tip**  
To learn more about how to configure Amazon FSx for Lustre or Amazon EFS with your VPC configuration using the SageMaker Python SDK estimators, see [Use File Systems as Training Inputs](https://sagemaker.readthedocs.io/en/stable/overview.html?highlight=VPC#use-file-systems-as-training-inputs) in the *SageMaker AI Python SDK documentation*.

**Tip**  
The data input mode integrations with Amazon S3, Amazon EFS, and FSx for Lustre are the recommended ways to configure a data source, but you aren't strictly constrained to them. Although the SageMaker AI managed storage options and input modes can strategically improve data loading performance, you can also write your own data reading logic directly in your training container. For example, you can read from a different data source, write your own S3 data loader class, or use a third-party framework's data loading functions within your training script. In that case, make sure that you specify paths that SageMaker AI can recognize.

**Tip**  
If you use a custom training container, make sure you install the [SageMaker training toolkit](https://github.com/aws/sagemaker-training-toolkit) that helps set up the environment for SageMaker training jobs. Otherwise, you must specify the environment variables explicitly in your Dockerfile. For more information, see [Create a container with your own algorithms and models](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-create.html).
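Inside the training container, the environment set up by the toolkit (or by your Dockerfile) exposes data channels and the output location through environment variables such as `SM_CHANNEL_<CHANNEL_NAME>` and `SM_MODEL_DIR`. A minimal sketch of a training script reading those variables follows; the fallback paths are the standard container locations, used here only so the sketch also runs outside SageMaker AI:

```python
import os

# SageMaker AI sets these variables inside the training container; the
# fallbacks below are only for running this sketch locally.
train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")

print(f"Reading training data from: {train_dir}")
print(f"Writing model artifacts to: {model_dir}")
```

Your training logic can then iterate over files under `train_dir` and save the final model under `model_dir`, which SageMaker AI uploads to your S3 output path.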

For more information about how to set the data input modes using the low-level SageMaker APIs, see [How Amazon SageMaker AI Provides Training Information](your-algorithms-training-algo-running-container.md), the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API, and the `TrainingInputMode` in [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AlgorithmSpecification.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AlgorithmSpecification.html).

# Configure data input channel to use Amazon FSx for Lustre
<a name="model-access-training-data-fsx"></a>

Learn how to use Amazon FSx for Lustre as your data source for higher throughput and faster training by reducing the time for data loading.

**Note**  
When you use EFA-enabled instances such as P4d and P3dn, make sure that you set appropriate inbound and outbound rules in the security group. Specifically, opening up these ports is necessary for SageMaker AI to access the Amazon FSx file system in the training job. To learn more, see [File System Access Control with Amazon VPC](https://docs.aws.amazon.com/fsx/latest/LustreGuide/limit-access-security-groups.html).

## Sync Amazon S3 and Amazon FSx for Lustre
<a name="model-access-training-data-fsx-sync-s3"></a>

To link your Amazon S3 bucket to Amazon FSx for Lustre and upload your training datasets, do the following.

1. Prepare your dataset and upload it to an Amazon S3 bucket. For example, assume that the Amazon S3 paths for a training dataset and a test dataset are in the following format.

   ```
   s3://amzn-s3-demo-bucket/data/train
   s3://amzn-s3-demo-bucket/data/test
   ```

1. To create an FSx for Lustre file system linked with the Amazon S3 bucket with the training data, follow the steps at [Linking your file system to an Amazon S3 bucket](https://docs.aws.amazon.com/fsx/latest/LustreGuide/create-dra-linked-data-repo.html) in the *Amazon FSx for Lustre User Guide*. Make sure that you add an endpoint to your VPC allowing Amazon S3 access. For more information, see [Create an Amazon S3 VPC Endpoint](train-vpc.md#train-vpc-s3). When you specify **Data repository path**, provide the Amazon S3 bucket URI of the folder that contains your datasets. For example, based on the example S3 paths in step 1, the data repository path should be the following.

   ```
   s3://amzn-s3-demo-bucket/data
   ```

1. After the FSx for Lustre file system is created, check the configuration information by running the following commands.

   ```
   aws fsx describe-file-systems && \
   aws fsx describe-data-repository-associations
   ```

   These commands return `FileSystemId`, `MountName`, `FileSystemPath`, and `DataRepositoryPath`. For example, the outputs should look like the following.

   ```
   # Output of aws fsx describe-file-systems
   "FileSystemId": "fs-0123456789abcdef0"
   "MountName": "1234abcd"
   
   # Output of aws fsx describe-data-repository-associations
   "FileSystemPath": "/ns1",
   "DataRepositoryPath": "s3://amzn-s3-demo-bucket/data/"
   ```

   After the sync between Amazon S3 and Amazon FSx has completed, your datasets are saved in Amazon FSx in the following directories.

   ```
   /ns1/train  # synced with s3://amzn-s3-demo-bucket/data/train
   /ns1/test   # synced with s3://amzn-s3-demo-bucket/data/test
   ```
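The directory path you later pass to SageMaker AI combines the `MountName` and `FileSystemPath` values returned by the commands above (see the tips in the next section). A small helper to build it, shown as a convenience sketch rather than part of any SDK:

```python
def fsx_directory_path(mount_name, file_system_path):
    """Join MountName and FileSystemPath into the directory path that
    SageMaker AI expects, for example '/1234abcd/ns1'."""
    return "/" + mount_name + "/" + file_system_path.strip("/")

print(fsx_directory_path("1234abcd", "/ns1"))  # /1234abcd/ns1
```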

## Set the Amazon FSx file system path as the data input channel for SageMaker training
<a name="model-access-training-data-fsx-set-as-input-channel"></a>

The following procedures walk you through the process of setting the Amazon FSx file system as the data source for SageMaker training jobs.

------
#### [ Using the SageMaker Python SDK ]

To properly set the Amazon FSx file system as the data source, configure the SageMaker AI estimator classes and `FileSystemInput` using the following instructions.

1. Configure a `FileSystemInput` class object.

   ```
   from sagemaker.inputs import FileSystemInput
   
   train_fs = FileSystemInput(
       file_system_id="fs-0123456789abcdef0",
       file_system_type="FSxLustre",
       directory_path="/1234abcd/ns1/",
       file_system_access_mode="ro",
   )
   ```
**Tip**  
When you specify `directory_path`, make sure that you provide the Amazon FSx file system path starting with `MountName`.

1. Configure a SageMaker AI estimator with the VPC configuration used for the Amazon FSx file system.

   ```
   from sagemaker.estimator import Estimator
   
   estimator = Estimator(
       ...
       role="your-iam-role-with-access-to-your-fsx",
       subnets=["subnet-id"],  # Should be the same as the subnet used for Amazon FSx
       security_group_ids=["security-group-id"]  # security_group_ids expects a list
   )
   ```

   Make sure that the IAM role for the SageMaker training job has the permissions to access and read from Amazon FSx.

1. Launch the training job by running the `estimator.fit` method with the Amazon FSx file system.

   ```
   estimator.fit(train_fs)
   ```

To find more code examples, see [Use File Systems as Training Inputs](https://sagemaker.readthedocs.io/en/stable/overview.html#use-file-systems-as-training-inputs) in the *SageMaker Python SDK documentation*.

------
#### [ Using the SageMaker AI CreateTrainingJob API ]

As part of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request JSON, configure `InputDataConfig` as follows.

```
"InputDataConfig": [ 
    { 
        "ChannelName": "string",
        "DataSource": { 
            "FileSystemDataSource": { 
                "DirectoryPath": "/1234abcd/ns1/",
                "FileSystemAccessMode": "ro",
                "FileSystemId": "fs-0123456789abcdef0",
                "FileSystemType": "FSxLustre"
            }
        }
    }
],
```

**Tip**  
When you specify `DirectoryPath`, make sure that you provide the Amazon FSx file system path starting with `MountName`.

------

# Choosing an input mode and a storage unit
<a name="model-access-training-data-best-practices"></a>

The best data source for your training job depends on workload characteristics such as the size of the dataset, the file format, the average size of files, the training duration, a sequential or random data loader read pattern, and how fast your model can consume the training data. The following best practices provide guidelines to get started with the most suitable input mode and data storage service for your use case.

![\[Flowchart summarizing best practices of choosing the best storage as the data source and input file mode.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-training-choose-mode-and-storage.png)


## When to use Amazon EFS
<a name="model-access-training-data-best-practices-efs"></a>

If your dataset is already stored in Amazon Elastic File System (for example, because a preprocessing or annotation application writes its output to Amazon EFS), you can run a training job configured with a data channel that points directly to the Amazon EFS file system. For more information, see [Speed up training on Amazon SageMaker AI using Amazon FSx for Lustre and Amazon EFS file systems](https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems/). If performance isn't as fast as you expect, check your optimization options by following the [Amazon Elastic File System performance guide](https://docs.aws.amazon.com/efs/latest/ug/performance.html#performance-overview), or consider using a different input mode or data storage.

## Use file mode for small datasets
<a name="model-access-training-data-best-practices-file-mode"></a>

If the dataset is stored in Amazon Simple Storage Service and its overall volume is relatively small (for example, less than 50-100 GB), try using file mode. The overhead of downloading a 50 GB dataset can vary based on the total number of files. For example, it takes about 5 minutes if a dataset is chunked into 100 MB shards. Whether this startup overhead is acceptable primarily depends on the overall duration of your training job, because a longer training phase means a proportionally smaller download phase.
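To gauge whether the file mode download phase matters for your job, you can estimate it from the dataset size and an assumed aggregate download throughput. The ~170 MB/s figure below is an illustrative assumption consistent with the 5-minute example above; actual throughput varies with instance type, file count, and file layout:

```python
def estimated_download_minutes(dataset_gb, throughput_mb_per_s=170):
    """Rough file mode startup overhead: dataset size divided by an
    assumed aggregate S3 download throughput (illustrative only)."""
    return dataset_gb * 1024 / throughput_mb_per_s / 60

# A 50 GB dataset at ~170 MB/s downloads in roughly 5 minutes.
print(round(estimated_download_minutes(50)))
```

Compare that estimate against your expected total training time; if the download phase is a small fraction of it, file mode is likely acceptable.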

## Serializing many small files
<a name="model-access-training-data-best-practices-serialize"></a>

If your dataset size is small (less than 50-100 GB), but is made up of many small files (less than 50 MB per file), the file mode download overhead grows, because each file needs to be downloaded individually from Amazon Simple Storage Service to the training instance volume. To reduce this overhead and data traversal time in general, consider serializing groups of such small files into fewer larger file containers (such as 150 MB per file) by using file formats, such as [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord) for TensorFlow, [ WebDataset](https://webdataset.github.io/webdataset/) for PyTorch, and [RecordIO](https://mxnet.apache.org/versions/1.8.0/api/faq/recordio) for MXNet.
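As a minimal illustration of the serialization idea, the following sketch packs many small files into sequentially numbered tar shards of roughly 150 MB using only the Python standard library (WebDataset, for example, reads plain POSIX tar shards; the shard size and naming here are illustrative assumptions):

```python
import io
import os
import tarfile

def write_tar_shards(files, out_dir, max_shard_bytes=150 * 1024 * 1024):
    """Pack (name, bytes) pairs into numbered tar shards, starting a new
    shard once the current one reaches max_shard_bytes."""
    os.makedirs(out_dir, exist_ok=True)
    shard_paths, tar, current_bytes = [], None, 0
    for name, data in files:
        if tar is None or current_bytes >= max_shard_bytes:
            if tar:
                tar.close()
            path = os.path.join(out_dir, f"shard-{len(shard_paths):05d}.tar")
            shard_paths.append(path)
            tar = tarfile.open(path, "w")
            current_bytes = 0
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
        current_bytes += len(data)
    if tar:
        tar.close()
    return shard_paths
```

Each shard can then be uploaded to Amazon S3 and consumed sequentially by your data loader, reducing the per-file download overhead described above.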

## When to use fast file mode
<a name="model-access-training-data-best-practices-fastfile"></a>

For larger datasets with larger files (more than 50 MB per file), the first option is to try fast file mode, which is more straightforward to use than FSx for Lustre because it doesn't require creating a file system or connecting to a VPC. Fast file mode is ideal for large file containers (more than 150 MB), and might also do well with files more than 50 MB. Because fast file mode provides a POSIX interface, it supports random reads (reading non-sequential byte-ranges). However, random access is not the ideal use case, and throughput might be lower than with sequential reads. Even so, if you have a relatively large and computationally intensive ML model, fast file mode might still be able to saturate the effective bandwidth of the training pipeline without creating an I/O bottleneck. You'll need to experiment to see. To switch from file mode to fast file mode (and back), just add (or remove) the `input_mode='FastFile'` parameter while defining your input channel using the SageMaker Python SDK:

```
sagemaker.inputs.TrainingInput(S3_INPUT_FOLDER, input_mode='FastFile')
```

## When to use Amazon FSx for Lustre
<a name="model-access-training-data-best-practices-fsx"></a>

If your dataset is too large for file mode, has many small files that you can't serialize easily, or uses a random read access pattern, FSx for Lustre is a good option to consider. Its file system scales to hundreds of gigabytes per second (GB/s) of throughput and millions of IOPS, which is ideal when you have many small files. However, note that there might be a cold-start issue due to lazy loading, as well as the overhead of setting up and initializing the FSx for Lustre file system.

**Tip**  
To learn more, see [Choose the best data source for your Amazon SageMaker training job](https://aws.amazon.com/blogs/machine-learning/choose-the-best-data-source-for-your-amazon-sagemaker-training-job/). This AWS machine learning blog post further discusses case studies and performance benchmarks of data sources and input modes.

# Use attribute-based access control (ABAC) for multi-tenancy training
<a name="model-access-training-data-abac"></a>

In a multi-tenant environment, it is crucial to ensure that each tenant's data is isolated and accessible only to authorized entities. SageMaker AI supports the use of [attribute-based access control (ABAC)](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction_attribute-based-access-control.html) to achieve this isolation for training jobs. Instead of creating multiple IAM roles for each tenant, you can use the same IAM role for all tenants by configuring a session chaining configuration that uses AWS Security Token Service (AWS STS) session tags to request temporary, limited-privilege credentials for your training job to access specific tenants. For more information about session tags, see [Passing session tags in AWS STS](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_session-tags.html).

When creating a training job, your session chaining configuration uses AWS STS to request temporary security credentials. This request generates a session, which is tagged. Each SageMaker training job can only access a specific tenant using a single role shared by all training jobs. By implementing ABAC with session chaining, you can ensure that each training job has access only to the tenant specified by the session tag, effectively isolating and securing each tenant. The following section guides you through the steps to set up and use ABAC for multi-tenant training job isolation using the SageMaker Python SDK. 

## Prerequisites
<a name="model-access-training-data-abac-prerequisites"></a>

To get started with ABAC for multi-tenant training job isolation, you must have the following:
+ Tenants with consistent naming across locations. For example, if an input data Amazon S3 URI for a tenant is `s3://your-input-s3-bucket/example-tenant`, the Amazon FSx directory for that same tenant should be `/fsx-train/train/example-tenant` and the output data Amazon S3 URI should be `s3://your-output-s3-bucket/example-tenant`.
+ A SageMaker AI job creation role. You can create a SageMaker AI job creation role using Amazon SageMaker AI Role Manager. For information, see [Using the role manager](https://docs.aws.amazon.com/sagemaker/latest/dg/role-manager-tutorial.html).
+ A SageMaker AI execution role that has `sts:AssumeRole`, and `sts:TagSession` permissions in its trust policy. For more information on SageMaker AI execution roles, see [SageMaker AI Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). 

  The execution role should also have a policy that limits each tenant in an attribute-based multi-tenancy architecture to reading from the prefix attached through a principal tag. The following is an example policy that limits the SageMaker AI execution role to access the value associated with the `tenant-id` key. For more information on naming tag keys, see [Rules for tagging in IAM and STS](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_tags.html#id_tags_rules).

------
#### [ JSON ]


  ```
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Action": [
                  "s3:GetObject",
                  "s3:PutObject"
              ],
              "Resource": [
                  "arn:aws:s3:::your-input-s3-bucket/${aws:PrincipalTag/tenant-id}/*"
              ],
              "Effect": "Allow"
          },
          {
              "Action": [
                  "s3:PutObject"
              ],
              "Resource": "arn:aws:s3:::your-output-s3-bucket/${aws:PrincipalTag/tenant-id}/*",
              "Effect": "Allow"
          },
          {
              "Action": "s3:ListBucket",
              "Resource": "*",
              "Effect": "Allow"
          }
      ]
  }
  ```

------

## Create a training job with session tag chaining enabled
<a name="model-access-training-data-abac-create-training-job"></a>

The following procedure shows you how to create a training job with session tag chaining using the SageMaker Python SDK for ABAC-enabled multi-tenancy training.

**Note**  
In addition to multi-tenancy data storage, you can also use the ABAC workflow to pass session tags to your execution role for Amazon VPC, AWS Key Management Service, and any other services you allow SageMaker AI to call.

**Enable session tag chaining for ABAC**

1. Import `boto3` and the SageMaker Python SDK. ABAC-enabled training job isolation is only available in version [2.217](https://pypi.org/project/sagemaker/2.217.0/) or later of the SageMaker AI Python SDK. 

   ```
   import boto3
   import sagemaker
   
   from sagemaker.estimator import Estimator
   from sagemaker.inputs import TrainingInput
   ```

1. Set up an AWS STS and SageMaker AI client to use the tenant-labeled session tags. You can change the tag value to specify a different tenant.

   ```
   # Start an AWS STS client
   sts_client = boto3.client('sts')
   
   # Define your tenants using tags
   # The session tag key must match the principal tag key in your execution role policy
   tags = []
   tag = {}
   tag['Key'] = "tenant-id"
   tag['Value'] = "example-tenant"
   tags.append(tag)
   
   # Have AWS STS assume your ABAC-enabled job creation role
   response = sts_client.assume_role(
       RoleArn="arn:aws:iam::<account-id>:role/<your-training-job-creation-role>",
       RoleSessionName="SessionName",
       Tags=tags)
   credentials = response['Credentials']
   
   # Create a client with your job creation role (which was assumed with tags)
   sagemaker_client = boto3.client(
       'sagemaker',
       aws_access_key_id=credentials['AccessKeyId'],
       aws_secret_access_key=credentials['SecretAccessKey'],
       aws_session_token=credentials['SessionToken']
   )
   sagemaker_session = sagemaker.Session(sagemaker_client=sagemaker_client)
   ```

    When you append the tag `tenant-id=example-tenant` to the job creation role session, the execution role extracts the tag, and the policy resolves to the following: 

------
#### [ JSON ]


   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Action": [
                   "s3:GetObject",
                   "s3:PutObject"
               ],
               "Resource": [
                   "arn:aws:s3:::your-input-s3-bucket/example-tenant/*"
               ],
               "Effect": "Allow"
           },
           {
               "Action": [
                   "s3:PutObject"
               ],
               "Resource": "arn:aws:s3:::your-output-s3-bucket/example-tenant/*",
               "Effect": "Allow"
           },
           {
               "Action": "s3:ListBucket",
               "Resource": "*",
               "Effect": "Allow"
           }
       ]
   }
   ```

------
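Note that for AWS STS to accept the `Tags` parameter when assuming the job creation role, that role's trust policy must allow the `sts:TagSession` action alongside `sts:AssumeRole`. The following is a minimal sketch; the principal ARN is a placeholder for whichever principal submits the job.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<account-id>:role/<your-job-submitting-principal>"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ]
        }
    ]
}
```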

1. Define an estimator to create a training job using the SageMaker Python SDK. Set `enable_session_tag_chaining` to `True` to allow your SageMaker AI training execution role to retrieve the tags from your job creation role.

   ```
   # Specify your training input
   trainingInput = TrainingInput(
       s3_data='s3://<your-input-bucket>/example-tenant',
       distribution='ShardedByS3Key',
       s3_data_type='S3Prefix'
   )
   
   # Specify your training job execution role 
   execution_role_arn = "arn:aws:iam::<account-id>:role/<your-training-job-execution-role>"
   
    # Define your estimator with session tag chaining enabled
   estimator = Estimator(
       image_uri="<your-training-image-uri>",
       role=execution_role_arn,
       instance_count=1,
       instance_type='ml.m4.xlarge',
       volume_size=20,
       max_run=3600,
       sagemaker_session=sagemaker_session,
       output_path="s3://<your-output-bucket>/example-tenant",
       enable_session_tag_chaining=True
   )
   
   estimator.fit(inputs=trainingInput, job_name="abac-demo")
   ```

SageMaker AI can only read tags provided in the training job request and does not add any tags to resources on your behalf.

ABAC for SageMaker training is compatible with SageMaker AI managed warm pools. To use ABAC with warm pools, matching training jobs must have identical session tags. For more information, see [Matching training jobs](train-warm-pools.md#train-warm-pools-matching-criteria).

# Mapping of training storage paths managed by Amazon SageMaker AI
<a name="model-train-storage"></a>

This page provides a high-level summary of how the SageMaker training platform manages storage paths for training datasets, model artifacts, checkpoints, and outputs between AWS cloud storage and training jobs in SageMaker AI. Throughout this guide, you learn how to identify the default paths set by the SageMaker AI platform and how to map the data channels to your data sources in Amazon Simple Storage Service (Amazon S3), FSx for Lustre, and Amazon EFS. For more information about the various data channel input modes and storage options, see [Setting up training jobs to access datasets](model-access-training-data.md).

## Overview of how SageMaker AI maps storage paths
<a name="model-train-storage-overview"></a>

The following diagram shows an example of how SageMaker AI maps input and output paths when you run a training job using the SageMaker Python SDK [Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) class. 

![\[An example of how SageMaker AI maps paths between the training job container and the storage when you run a training job using the SageMaker Python SDK Estimator class and its fit method.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-training-storage.png)


SageMaker AI maps storage paths between storage services (such as Amazon S3, Amazon FSx, and Amazon EFS) and the SageMaker training container based on the paths and input mode specified through a SageMaker AI estimator object. For more information about how SageMaker AI reads from or writes to these paths and the purpose of each path, see [SageMaker AI environment variables and the default paths for training storage locations](model-train-storage-env-var-summary.md).

You can use `OutputDataConfig` in the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API to save the results of model training to an S3 bucket. Use the [ModelArtifacts](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ModelArtifacts.html) API to find the S3 bucket that contains your model artifacts. See the [abalone\_build\_train\_deploy](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-pipelines/tabular/abalone_build_train_deploy/sagemaker-pipelines-preprocess-train-evaluate-batch-transform.ipynb) notebook for an example of output paths and how they are used in API calls.

For more information and examples of how SageMaker AI manages data sources, input modes, and local paths in SageMaker training instances, see [Access Training Data](https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html).

**Topics**
+ [Overview of how SageMaker AI maps storage paths](#model-train-storage-overview)
+ [Uncompressed model output](model-train-storage-uncompressed.md)
+ [Managing storage paths for different types of instance local storage](model-train-storage-tips-considerations.md)
+ [SageMaker AI environment variables and the default paths for training storage locations](model-train-storage-env-var-summary.md)

# Uncompressed model output
<a name="model-train-storage-uncompressed"></a>

SageMaker AI stores your model in `/opt/ml/model` and your data in `/opt/ml/output/data`. After the model and data are written to those locations, they're uploaded to your Amazon S3 bucket as compressed files by default. 

You can save time on large data file compression by uploading model and data outputs to your S3 bucket as uncompressed files. To do this, create a training job in uncompressed upload mode by using either the AWS Command Line Interface (AWS CLI) or the SageMaker Python SDK. 

The following code example shows how to create a training job in uncompressed upload mode when using the AWS CLI. To enable uncompressed upload mode, set the `CompressionType` field in `OutputDataConfig` to `NONE`.

```
{
   "TrainingJobName": "uncompressed_model_upload",
   ...
   "OutputDataConfig": { 
      "S3OutputPath": "s3://amzn-s3-demo-bucket/uncompressed_upload/output",
      "CompressionType": "NONE"
   },
   ...
}
```

The following code example shows you how to create a training job in uncompressed upload mode using the SageMaker Python SDK.

```
import sagemaker
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="your-own-image-uri",
    role=sagemaker.get_execution_role(), 
    sagemaker_session=sagemaker.Session(),
    instance_count=1,
    instance_type='ml.c4.xlarge',
    disable_output_compression=True
)
```

# Managing storage paths for different types of instance local storage
<a name="model-train-storage-tips-considerations"></a>

Consider the following when setting up storage paths for training jobs in SageMaker AI.
+ If you want to store training artifacts for distributed training in the `/opt/ml/output/data` directory, you must properly append subdirectories or use unique file names for the artifacts through your model definition or training script. If the subdirectories and file names are not properly configured, all of the distributed training workers might write outputs to the same file name in the same output path in Amazon S3.
+ If you use a custom training container, make sure you install the [SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit) that helps set up the environment for SageMaker training jobs. Otherwise, you must specify the environment variables explicitly in your Dockerfile. For more information, see [Create a container with your own algorithms and models](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-create.html).
+ When using an ML instance with [NVMe SSD volumes](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html#nvme-ssd-volumes), SageMaker AI doesn't provision Amazon EBS gp2 storage. Available storage is fixed to the NVMe-type instance's storage capacity. SageMaker AI configures storage paths for training datasets, checkpoints, model artifacts, and outputs to use the entire capacity of the instance storage. For example, ML instance families with the NVMe-type instance storage include `ml.p4d`, `ml.g4dn`, and `ml.g5`. When using an ML instance with the EBS-only storage option and without instance storage, you must define the size of EBS volume through the `volume_size` parameter in the SageMaker AI estimator class (or `VolumeSizeInGB` if you are using the `ResourceConfig` API). For example, ML instance families that use EBS volumes include `ml.c5` and `ml.p2`. To look up instance types and their instance storage types and volumes, see [Amazon EC2 Instance Types](https://aws.amazon.com/ec2/instance-types/).
+ The default paths for SageMaker training jobs are mounted to Amazon EBS volumes or NVMe SSD volumes of the ML instance. When you adapt your training script to SageMaker AI, make sure that you use the default paths listed in the previous topic about [SageMaker AI environment variables and the default paths for training storage locations](model-train-storage-env-var-summary.md). We recommend that you use the `/tmp` directory as scratch space for temporarily storing large objects during training. Do not use directories that are mounted on the limited disk space allocated for the system, such as `/usr` and `/home`, to avoid out-of-space errors.
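As a sketch of the first and last recommendations above, a training script can append the current host name to output file names so that distributed workers don't overwrite each other's artifacts. The fallback values below are only so the sketch runs outside a SageMaker container; inside a training container, `SM_OUTPUT_DATA_DIR` and `SM_CURRENT_HOST` are set by the SageMaker training toolkit.

```python
import json
import os
import tempfile

# Inside a SageMaker training container, SM_OUTPUT_DATA_DIR resolves to
# /opt/ml/output/data and SM_CURRENT_HOST to a name such as "algo-1".
# The fallbacks below let this sketch run locally for illustration.
output_dir = os.environ.get("SM_OUTPUT_DATA_DIR", tempfile.mkdtemp())
current_host = os.environ.get("SM_CURRENT_HOST", "algo-1")

# Append the host name so each distributed worker writes a unique file
# and workers don't overwrite each other's outputs in the S3 output path.
metrics_path = os.path.join(output_dir, f"metrics-{current_host}.json")
with open(metrics_path, "w") as f:
    json.dump({"host": current_host, "loss": 0.42}, f)
```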

To learn more, see the AWS machine learning blog [Choose the best data source for your Amazon SageMaker training job](https://aws.amazon.com/blogs/machine-learning/choose-the-best-data-source-for-your-amazon-sagemaker-training-job/) that further discusses case studies and performance benchmarks of data sources and input modes.

# SageMaker AI environment variables and the default paths for training storage locations
<a name="model-train-storage-env-var-summary"></a>

The following table summarizes the input and output paths for training datasets, checkpoints, model artifacts, and outputs, managed by the SageMaker training platform.


| Local path in SageMaker training instance | SageMaker AI environment variable | Purpose | Read from S3 during start | Read from S3 during Spot-restart | Writes to S3 during training | Writes to S3 when job is terminated | 
| --- | --- | --- | --- | --- | --- | --- | 
|  `/opt/ml/input/data/channel_name`1   |  SM\_CHANNEL\_*CHANNEL\_NAME*  |  Reading training data from the input channels specified through the SageMaker AI Python SDK [Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) class or the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API operation. For more information about how to specify it in your training script using the SageMaker Python SDK, see [Prepare a Training script](https://sagemaker.readthedocs.io/en/stable/overview.html?highlight=VPC#prepare-a-training-script).  | Yes | Yes | No | No | 
|  `/opt/ml/output/data`2  | SM\_OUTPUT\_DIR |  Saving outputs such as loss, accuracy, intermediate layers, weights, gradients, bias, and TensorBoard-compatible outputs. You can also save any arbitrary output you’d like using this path. Note that this is a different path from the one for storing the final model artifact `/opt/ml/model/`.  | No | No | No | Yes | 
|  `/opt/ml/model`3  | SM\_MODEL\_DIR |  Storing the final model artifact. This is also the path from where the model artifact is deployed for [Real-time inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) in SageMaker AI Hosting.  | No | No | No | Yes | 
|  `/opt/ml/checkpoints`4  | - |  Saving model checkpoints (the state of model) to resume training from a certain point, and recover from unexpected or [Managed Spot Training](https://docs.aws.amazon.com/sagemaker/latest/dg/model-managed-spot-training.html) interruptions.  | Yes | Yes | Yes | No | 
|  `/opt/ml/code`  | SAGEMAKER\_SUBMIT\_DIRECTORY |  Copying training scripts, additional libraries, and dependencies.  | Yes | Yes | No | No | 
|  `/tmp`  | - |  Reading or writing to `/tmp` as a scratch space.  | No | No | No | No | 

1 `channel_name` is where you specify user-defined channel names for training data inputs. Each training job can contain several data input channels; you can specify up to 20 training input channels per training job. Note that the time spent downloading data from the data channels counts toward the billable time. For more information about data input paths, see [How Amazon SageMaker AI Provides Training Information](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-running-container.html). SageMaker AI supports three data input modes: file, fast file, and pipe mode. To learn more about the data input modes for training in SageMaker AI, see [Access Training Data](https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html).

2 SageMaker AI compresses and writes training artifacts to TAR files (`tar.gz`). Compression and upload time counts toward the billable time. For more information, see [How Amazon SageMaker AI Processes Training Output](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-output.html).

3 SageMaker AI compresses and writes the final model artifact to a TAR file (`tar.gz`). Compression and upload time counts toward the billable time. For more information, see [How Amazon SageMaker AI Processes Training Output](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-output.html).

4 Sync with Amazon S3 during training. Write as is without compressing to TAR files. For more information, see [Use Checkpoints in Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html).
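In a training script, these default paths are typically resolved through the environment variables above rather than hard-coded. The following is a minimal sketch, assuming a channel named `train` was defined when the job was created; the fallbacks are the documented default paths so the snippet also runs outside a training container.

```python
import os

# Resolve the storage paths from the environment variables set inside a
# SageMaker training container; fall back to the documented defaults.
model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
train_channel = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
checkpoint_dir = "/opt/ml/checkpoints"  # no environment variable; fixed path

print(f"final model artifact -> {model_dir}")
print(f"training data        <- {train_channel}")
print(f"checkpoints         <-> {checkpoint_dir}")
```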

# Running training jobs on a heterogeneous cluster
<a name="train-heterogeneous-cluster"></a>

Using the heterogeneous cluster feature of SageMaker Training, you can run a training job with multiple types of ML instances for better resource scaling and utilization across different ML training tasks and purposes. For example, if a training job on a cluster of GPU instances suffers from low GPU utilization and CPU bottlenecks due to CPU-intensive tasks, a heterogeneous cluster can offload those CPU-intensive tasks to more cost-efficient CPU instance groups, resolving the bottleneck and achieving better GPU utilization.

**Note**  
This feature is available in the SageMaker Python SDK v2.98.0 and later.

**Note**  
This feature is available through the SageMaker AI [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html) and [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) framework estimator classes. Supported frameworks are PyTorch v1.10 or later and TensorFlow v2.6 or later.

See also the blog [Improve price performance of your model training using Amazon SageMaker AI heterogeneous clusters](https://aws.amazon.com/blogs/machine-learning/improve-price-performance-of-your-model-training-using-amazon-sagemaker-heterogeneous-clusters/).

**Topics**
+ [Configure a training job with a heterogeneous cluster in Amazon SageMaker AI](train-heterogeneous-cluster-configure.md)
+ [Run distributed training on a heterogeneous cluster in Amazon SageMaker AI](train-heterogeneous-cluster-configure-distributed.md)
+ [Modify your training script to assign instance groups](train-heterogeneous-cluster-modify-training-script.md)

# Configure a training job with a heterogeneous cluster in Amazon SageMaker AI
<a name="train-heterogeneous-cluster-configure"></a>

This section provides instructions on how to run a training job using a heterogeneous cluster that consists of multiple instance types.

Note the following before you start. 
+ All instance groups share the same Docker image and training script. Therefore, your training script should be modified to detect which instance group it belongs to and fork execution accordingly.
+ The heterogeneous cluster feature is not compatible with SageMaker AI local mode.
+ The Amazon CloudWatch log streams of a heterogeneous cluster training job are not grouped by instance groups. You need to determine from the logs which nodes are in which group.

**Topics**
+ [Option 1: Using the SageMaker Python SDK](#train-heterogeneous-cluster-configure-pysdk)
+ [Option 2: Using the low-level SageMaker APIs](#train-heterogeneous-cluster-configure-api)

## Option 1: Using the SageMaker Python SDK
<a name="train-heterogeneous-cluster-configure-pysdk"></a>

The following instructions show how to configure instance groups for a heterogeneous cluster using the SageMaker Python SDK.

1. To configure instance groups of a heterogeneous cluster for a training job, use the `sagemaker.instance_group.InstanceGroup` class. You can specify a custom name for each instance group, the instance type, and the number of instances for each instance group. For more information, see [sagemaker.instance\_group.InstanceGroup](https://sagemaker.readthedocs.io/en/stable/api/utility/instance_group.html) in the *SageMaker AI Python SDK documentation*.
**Note**  
For more information about available instance types and the maximum number of instance groups that you can configure in a heterogeneous cluster, see the [ InstanceGroup](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InstanceGroup.html) API reference.

   The following code example shows how to set up two instance groups: `instance_group_1`, which consists of two `ml.c5.18xlarge` CPU-only instances, and `instance_group_2`, which consists of one `ml.p3dn.24xlarge` GPU instance, as shown in the following diagram.  
![\[A conceptual example of how data can be assigned in SageMaker Training Job.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/HCTraining.png)

   The preceding diagram shows a conceptual example of how pre-training processes, such as data preprocessing, can be assigned to the CPU instance group and stream the preprocessed data to the GPU instance group.

   ```
   from sagemaker.instance_group import InstanceGroup
   
   instance_group_1 = InstanceGroup(
       "instance_group_1", "ml.c5.18xlarge", 2
   )
   instance_group_2 = InstanceGroup(
       "instance_group_2", "ml.p3dn.24xlarge", 1
   )
   ```

1. Using the instance group objects, set up training input channels and assign instance groups to the channels through the `instance_group_names` argument of the [sagemaker.inputs.TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html) class. The `instance_group_names` argument accepts a list of strings of instance group names.

   The following example shows how to set up two training input channels and assign the instance groups created in the previous step. You can also specify different Amazon S3 bucket paths in the `s3_data` argument so that each instance group receives the data it needs to process.

   ```
   from sagemaker.inputs import TrainingInput
   
   training_input_channel_1 = TrainingInput(
       s3_data_type='S3Prefix', # Available Options: S3Prefix | ManifestFile | AugmentedManifestFile
       s3_data='s3://your-training-data-storage/folder1',
       distribution='FullyReplicated', # Available Options: FullyReplicated | ShardedByS3Key 
       input_mode='File', # Available Options: File | Pipe | FastFile
       instance_groups=["instance_group_1"]
   )
   
   training_input_channel_2 = TrainingInput(
       s3_data_type='S3Prefix',
       s3_data='s3://your-training-data-storage/folder2',
       distribution='FullyReplicated',
       input_mode='File',
       instance_groups=["instance_group_2"]
   )
   ```

   For more information about the arguments of `TrainingInput`, see the following links.
   + The [sagemaker.inputs.TrainingInput](https://sagemaker.readthedocs.io/en/stable/api/utility/inputs.html) class in the *SageMaker Python SDK documentation*
   + The [S3DataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html) API in the *SageMaker AI API Reference*

1. Configure a SageMaker AI estimator with the `instance_groups` argument as shown in the following code example. The `instance_groups` argument accepts a list of `InstanceGroup` objects.
**Note**  
The heterogeneous cluster feature is available through the SageMaker AI [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html) and [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator) framework estimator classes. Supported frameworks are PyTorch v1.10 or later and TensorFlow v2.6 or later. To find a complete list of available framework containers, framework versions, and Python versions, see [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) in the AWS Deep Learning Container GitHub repository.

------
#### [ PyTorch ]

   ```
   from sagemaker.pytorch import PyTorch
   
   estimator = PyTorch(
       ...
       entry_point='my-training-script.py',
       framework_version='x.y.z',    # 1.10.0 or later
       py_version='pyxy',            
       job_name='my-training-job-with-heterogeneous-cluster',
       instance_groups=[instance_group_1, instance_group_2]
   )
   ```

------
#### [ TensorFlow ]

   ```
   from sagemaker.tensorflow import TensorFlow
   
   estimator = TensorFlow(
       ...
       entry_point='my-training-script.py',
       framework_version='x.y.z', # 2.6.0 or later
       py_version='pyxy',
       job_name='my-training-job-with-heterogeneous-cluster',
       instance_groups=[instance_group_1, instance_group_2]
   )
   ```

------
**Note**  
The `instance_type` and `instance_count` argument pair and the `instance_groups` argument of the SageMaker AI estimator class are mutually exclusive. For homogeneous cluster training, use the `instance_type` and `instance_count` argument pair. For heterogeneous cluster training, use `instance_groups`.

1. Configure the `estimator.fit` method with the training input channels configured with the instance groups and start the training job.

   ```
   estimator.fit(
       inputs={
           'training': training_input_channel_1, 
           'dummy-input-channel': training_input_channel_2
       }
   )
   ```

## Option 2: Using the low-level SageMaker APIs
<a name="train-heterogeneous-cluster-configure-api"></a>

If you use the AWS Command Line Interface or AWS SDK for Python (Boto3) and want to use low-level SageMaker APIs for submitting a training job request with a heterogeneous cluster, see the following API references.
+ [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html)
+ [ResourceConfig ](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ResourceConfig.html)
+ [InstanceGroup](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InstanceGroup.html)
+ [S3DataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html)

# Run distributed training on a heterogeneous cluster in Amazon SageMaker AI
<a name="train-heterogeneous-cluster-configure-distributed"></a>

Through the `distribution` argument of the SageMaker AI estimator class, you can assign a specific instance group to run distributed training. For example, assume that you have the following two instance groups and want to run multi-GPU training on one of them. 

```
from sagemaker.instance_group import InstanceGroup

instance_group_1 = InstanceGroup("instance_group_1", "ml.c5.18xlarge", 1)
instance_group_2 = InstanceGroup("instance_group_2", "ml.p3dn.24xlarge", 2)
```

You can set the distributed training configuration for one of the instance groups. For example, the following code examples show how to assign `instance_group_2`, with two `ml.p3dn.24xlarge` instances, to the distributed training configuration.

**Note**  
Currently, only one instance group of a heterogeneous cluster can be specified in the `distribution` configuration.

**With MPI**

------
#### [ PyTorch ]

```
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    ...
    instance_groups=[instance_group_1, instance_group_2],
    distribution={
        "mpi": {
            "enabled": True, "processes_per_host": 8
        },
        "instance_groups": [instance_group_2]
    }
)
```

------
#### [ TensorFlow ]

```
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    ...
    instance_groups=[instance_group_1, instance_group_2],
    distribution={
        "mpi": {
            "enabled": True, "processes_per_host": 8
        },
        "instance_groups": [instance_group_2]
    }
)
```

------

**With the SageMaker AI data parallel library**

------
#### [ PyTorch ]

```
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    ...
    instance_groups=[instance_group_1, instance_group_2],
    distribution={
        "smdistributed": {
            "dataparallel": {
                "enabled": True
            }
        }, 
        "instance_groups": [instance_group_2]
    }
)
```

------
#### [ TensorFlow ]

```
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    ...
    instance_groups=[instance_group_1, instance_group_2],
    distribution={
        "smdistributed": {
            "dataparallel": {
                "enabled": True
            }
        }, 
        "instance_groups": [instance_group_2]
    }
)
```

------

**Note**  
When using the SageMaker AI data parallel library, make sure the instance group consists of the [supported instance types by the library](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-data-parallel-support.html#distributed-data-parallel-supported-instance-types). 

For more information about the SageMaker AI data parallel library, see [SageMaker AI Data Parallel Training](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html).

**With the SageMaker AI model parallel library**

------
#### [ PyTorch ]

```
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    ...
    instance_groups=[instance_group_1, instance_group_2],
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled":True,
                "parameters": {
                    ...   # SageMaker AI model parallel parameters
                } 
            }
        }, 
        "instance_groups": [instance_group_2]
    }
)
```

------
#### [ TensorFlow ]

```
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    ...
    instance_groups=[instance_group_1, instance_group_2],
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled":True,
                "parameters": {
                    ...   # SageMaker AI model parallel parameters
                } 
            }
        }, 
        "instance_groups": [instance_group_2]
    }
)
```

------

For more information about the SageMaker AI model parallel library, see [SageMaker AI Model Parallel Training](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).

# Modify your training script to assign instance groups
<a name="train-heterogeneous-cluster-modify-training-script"></a>

With the heterogeneous cluster configuration described in the previous sections, you have prepared the SageMaker training environment and instances for your training job. To further assign the instance groups to specific training and data processing tasks, the next step is to modify your training script. By default, the training job replicates the training script to all nodes regardless of instance size, which can lead to performance loss.

For example, if you mix CPU instances and GPU instances in a heterogeneous cluster while passing a deep neural network training script to the `entry_point` argument of the SageMaker AI estimator, the `entry_point` script is replicated to each instance. This means that, without proper task assignments, CPU instances also run the entire script and start the training job that’s designed for distributed training on GPU instances. Therefore, you must modify the specific processing functions that you want to offload and run on the CPU instances. You can use the SageMaker AI environment variables to retrieve information about the heterogeneous cluster and run specific processes accordingly.

When your training job starts, your training script reads SageMaker training environment information that includes heterogeneous cluster configuration. The configuration contains information such as the current instance groups, the current hosts in each group, and in which group the current host resides.

You can query instance group information during the initialization phase of a SageMaker AI training job in the following ways.

**(Recommended) Reading instance group information with the SageMaker training toolkit**

Use the environment Python module that the [SageMaker training toolkit library](https://github.com/aws/sagemaker-training-toolkit) provides. The toolkit library is preinstalled in the [SageMaker framework containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) for TensorFlow and PyTorch, so you don’t need an additional installation step when using the prebuilt containers. This is the recommended way to retrieve the SageMaker AI environment variables with fewer code changes in your training script.

```
from sagemaker_training import environment

env = environment.Environment()
```

Environment variables related to general SageMaker training and heterogeneous clusters:
+ `env.is_hetero` – Returns a Boolean indicating whether a heterogeneous cluster is configured.
+ `env.current_host` – Returns the current host.
+ `env.current_instance_type` – Returns the instance type of the current host.
+ `env.current_instance_group` – Returns the name of the current instance group.
+ `env.current_instance_group_hosts` – Returns a list of hosts in the current instance group.
+ `env.instance_groups` – Returns a list of instance group names used for training.
+ `env.instance_groups_dict` – Returns the entire heterogeneous cluster configuration of the training job.
+ `env.distribution_instance_groups` – Returns a list of instance groups assigned to the `distribution` parameter of the SageMaker AI estimator class.
+ `env.distribution_hosts` – Returns a list of hosts belonging to the instance groups assigned to the `distribution` parameter of the SageMaker AI estimator class.

For example, consider the following heterogeneous cluster, which consists of two instance groups.

```
from sagemaker.instance_group import InstanceGroup

instance_group_1 = InstanceGroup(
    "instance_group_1", "ml.c5.18xlarge", 1)
instance_group_2 = InstanceGroup(
    "instance_group_2", "ml.p3dn.24xlarge", 2)
```

The output of `env.instance_groups_dict` of the example heterogeneous cluster should be similar to the following.

```
{
    "instance_group_1": {
        "hosts": [
            "algo-2"
        ],
        "instance_group_name": "instance_group_1",
        "instance_type": "ml.c5.18xlarge"
    },
    "instance_group_2": {
        "hosts": [
            "algo-3",
            "algo-1"
        ],
        "instance_group_name": "instance_group_2",
        "instance_type": "ml.p3dn.24xlarge"
    }
}
```
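
As a sketch of the task assignment described earlier, a training script can map each instance group name to the work its hosts should run. The following is a minimal sketch assuming the group names from the preceding example configuration; `task_for_group` is a hypothetical helper, and in a real training script you would obtain the current group name from `environment.Environment().current_instance_group`.

```python
# Minimal sketch: map an instance group name to the task its hosts should run.
# Group names match the example configuration above. In a real training script,
# obtain the current group with:
#     from sagemaker_training import environment
#     group = environment.Environment().current_instance_group

def task_for_group(group_name):
    """Return the task a host in the given instance group should run."""
    tasks = {
        "instance_group_1": "data_processing",  # ml.c5.18xlarge (CPU) group
        "instance_group_2": "training",         # ml.p3dn.24xlarge (GPU) group
    }
    return tasks.get(group_name, "training")

print(task_for_group("instance_group_1"))  # data_processing
print(task_for_group("instance_group_2"))  # training
```

With a dispatch like this at the top of your `entry_point` script, CPU instances run only the data processing service while GPU instances run the training loop.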

**(Optional) Reading instance group information from the resource configuration JSON file**

If you prefer to retrieve the environment variables in JSON format, you can directly use the resource configuration JSON file. The JSON file in a SageMaker training instance is located at `/opt/ml/input/config/resourceconfig.json` by default.

```
import json

file_path = '/opt/ml/input/config/resourceconfig.json'
with open(file_path) as f:
    config = json.load(f)
print(json.dumps(config, indent=4, sort_keys=True))
```

# Use Incremental Training in Amazon SageMaker AI
<a name="incremental-training"></a>

Over time, you might find that a model generates inferences that are not as good as they were in the past. With incremental training, you can use the artifacts from an existing model and an expanded dataset to train a new model. Incremental training saves both time and resources.

Use incremental training to:
+ Train a new model using an expanded dataset that contains an underlying pattern that was not accounted for in the previous training and which resulted in poor model performance.
+ Use the model artifacts or a portion of the model artifacts from a popular publicly available model in a training job. You don't need to train a new model from scratch.
+ Resume a training job that was stopped.
+ Train several variants of a model, either with different hyperparameter settings or using different datasets.

For more information about training jobs, see [Train a Model with Amazon SageMaker](how-it-works-training.md).

You can train incrementally using the SageMaker AI console or the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

**Important**  
Only three built-in algorithms currently support incremental training: [Object Detection - MXNet](object-detection.md), [Image Classification - MXNet](image-classification.md), and [Semantic Segmentation Algorithm](semantic-segmentation.md).

**Topics**
+ [

## Perform Incremental Training (Console)
](#incremental-training-console)
+ [

## Perform Incremental Training (API)
](#incremental-training-api)

## Perform Incremental Training (Console)
<a name="incremental-training-console"></a>

To complete this procedure, you need:
+ The Amazon Simple Storage Service (Amazon S3) bucket URI where you've stored the training data.
+ The S3 bucket URI where you want to store the output of the job. 
+ The Amazon Elastic Container Registry path where the training code is stored. For more information, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths).
+ The URL of the S3 bucket where you've stored the model artifacts that you want to use in incremental training. To find the URL for the model artifacts, see the details page of the training job used to create the model. To find the details page, in the SageMaker AI console, choose **Inference**, choose **Models**, and then choose the model.

To restart a stopped training job, use the URL to the model artifacts shown on the details page, just as you would for a completed training job.

**To perform incremental training (console)**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the navigation pane, choose **Training**, then choose **Training jobs**. 

1. Choose **Create training job**.

1. Provide a name for the training job. The name must be unique within an AWS Region in an AWS account. The training job name must have 1 to 63 characters. Valid characters: a-z, A-Z, 0-9, and . : + = @ _ % - (hyphen).

1. Choose the algorithm that you want to use. For information about algorithms, see [Built-in algorithms and pretrained models in Amazon SageMaker](algos.md). 

1. (Optional) For **Resource configuration**, either leave the default values or increase the resource consumption to reduce computation time.

   1. (Optional) For **Instance type**, choose the ML compute instance type that you want to use. In most cases, **ml.m4.xlarge** is sufficient. 

   1. For **Instance count**, use the default, 1.

   1. (Optional) For **Additional volume per instance (GB)**, choose the size of the ML storage volume that you want to provision. In most cases, you can use the default, 1. If you are using a large dataset, use a larger size.

1. Provide information about the input data for the training dataset.

   1. For **Channel name**, either leave the default (**train**) or enter a more meaningful name for the training dataset, such as **expanded-training-dataset**.

   1. For **InputMode**, choose **File**. For incremental training, you need to use file input mode.

   1. For **S3 data distribution type**, choose **FullyReplicated**. This causes each ML compute instance to use a full replica of the expanded dataset when training incrementally.

   1. If the expanded dataset is uncompressed, set the **Compression type** to **None**. If the expanded dataset is compressed using Gzip, set it to **Gzip**.

   1. (Optional) If you are using File input mode, leave **Content type** empty. For Pipe input mode, specify the appropriate MIME type. *Content type* is the Multipurpose Internet Mail Extensions (MIME) type of the data.

   1. For **Record wrapper**, if the dataset is saved in RecordIO format, choose **RecordIO**. If your dataset is not saved as a RecordIO formatted file, choose **None**.

   1. For **S3 data type**, if the dataset is stored as a single file, choose **S3Prefix**. If the dataset is stored as several files in a folder, choose **Manifest**.

   1. For **S3 location**, provide the URL to the path where you stored the expanded dataset.

   1. Choose **Done**.

1. To use model artifacts in a training job, you need to add a new channel and provide the needed information about the model artifacts.

   1. For **Input data configuration**, choose **Add channel**.

   1. For **Channel name**, enter **model** to identify this channel as the source of the model artifacts.

   1. For **InputMode**, choose **File**. Model artifacts are stored as files.

   1. For **S3 data distribution type**, choose **FullyReplicated**. This indicates that each ML compute instance should use all of the model artifacts for training. 

   1. For **Compression type**, choose **None** because we are using a model for the channel.

   1. Leave **Content type** empty. Content type is the Multipurpose Internet Mail Extensions (MIME) type of the data; it is not needed for model artifacts.

   1. Set **Record wrapper** to **None** because model artifacts are not stored in RecordIO format.

   1. For **S3 data type**, if you are using a built-in algorithm or an algorithm that stores the model as a single file, choose **S3Prefix**. If you are using an algorithm that stores the model as several files, choose **Manifest**.

   1. For **S3 location**, provide the URL to the path where you stored the model artifacts. Typically, the model is stored with the name `model.tar.gz`. To find the URL for the model artifacts, in the navigation pane, choose **Inference**, then choose **Models**. From the list of models, choose a model to display its details page. The URL for the model artifacts is listed under **Primary container**.

   1. Choose **Done**.

1. For **Output data configuration**, provide the following information:

   1. For **S3 location**, type the path to the S3 bucket where you want to store the output data.

   1. (Optional) For **Encryption key**, you can add your AWS Key Management Service (AWS KMS) encryption key to encrypt the output data at rest. Provide the key ID or its Amazon Resource Number (ARN). For more information, see [KMS-Managed Encryption Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html).

1. (Optional) For **Tags**, add one or more tags to the training job. A *tag* is metadata that you can define and assign to AWS resources. In this case, you can use tags to help you manage your training jobs. A tag consists of a key and a value, which you define. For example, you might want to create a tag with **Project** as a key and a value referring to a project that is related to the training job, such as **Home value forecasts**.

1. Choose **Create training job**. SageMaker AI creates and runs the training job.

After the training job has completed, the newly trained model artifacts are stored under the **S3 output path** that you provided in the **Output data configuration** field. To deploy the model to get predictions, see [Deploy the model to Amazon EC2](ex1-model-deployment.md).

## Perform Incremental Training (API)
<a name="incremental-training-api"></a>

This example shows how to use SageMaker AI APIs to train a model using the SageMaker AI image classification algorithm and the [Caltech 256 Image Dataset](https://data.caltech.edu/records/nyy15-4j048), and then to use the first model to train a new model. It uses Amazon S3 for input and output sources. For more details on using incremental training, see the [incremental training sample notebook](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/imageclassification_caltech/Image-classification-incremental-training-highlevel.html).

**Note**  
In this example, we use the original datasets for incremental training. However, you can use different datasets, such as ones that contain newly added samples. Upload the new datasets to S3 and adjust the `data_channels` variable used to train the new model.

Get an AWS Identity and Access Management (IAM) role that grants required permissions and initialize environment variables:

```
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
print(role)

sess = sagemaker.Session()

bucket=sess.default_bucket()
print(bucket)
prefix = 'ic-incr-training'
```

Get the training image for the image classification algorithm:

```
from sagemaker.amazon.amazon_estimator import get_image_uri

training_image = get_image_uri(sess.boto_region_name, 'image-classification', repo_version="latest")
#Display the training image
print (training_image)
```

Download the training and validation datasets, then upload them to Amazon Simple Storage Service (Amazon S3):

```
import os
import urllib.request
import boto3

# Define a download function
def download(url):
    filename = url.split("/")[-1]
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)

# Download the caltech-256 training and validation datasets
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec')
download('http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec')

# Create four channels: train, validation, train_lst, and validation_lst
s3train = 's3://{}/{}/train/'.format(bucket, prefix)
s3validation = 's3://{}/{}/validation/'.format(bucket, prefix)

# Upload the first files to the train and validation channels
!aws s3 cp caltech-256-60-train.rec $s3train --quiet
!aws s3 cp caltech-256-60-val.rec $s3validation --quiet
```

Define the training hyperparameters:

```
# Define hyperparameters for the estimator
hyperparams = { "num_layers": "18",
                "resize": "32",
                "num_training_samples": "50000",
                "num_classes": "10",
                "image_shape": "3,28,28",
                "mini_batch_size": "128",
                "epochs": "3",
                "learning_rate": "0.1",
                "lr_scheduler_step": "2,3",
                "lr_scheduler_factor": "0.1",
                "augmentation_type": "crop_color",
                "optimizer": "sgd",
                "momentum": "0.9",
                "weight_decay": "0.0001",
                "beta_1": "0.9",
                "beta_2": "0.999",
                "gamma": "0.9",
                "eps": "1e-8",
                "top_k": "5",
                "checkpoint_frequency": "1",
                "use_pretrained_model": "0",
                "model_prefix": "" }
```

Create an estimator object and train the first model using the training and validation datasets:

```
# Fit the base estimator
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)
ic = sagemaker.estimator.Estimator(training_image,
                                   role,
                                   instance_count=1,
                                   instance_type='ml.p2.xlarge',
                                   volume_size=50,
                                   max_run=360000,
                                   input_mode='File',
                                   output_path=s3_output_location,
                                   sagemaker_session=sess,
                                   hyperparameters=hyperparams)

train_data = sagemaker.inputs.TrainingInput(s3train, distribution='FullyReplicated',
                                        content_type='application/x-recordio', s3_data_type='S3Prefix')
validation_data = sagemaker.inputs.TrainingInput(s3validation, distribution='FullyReplicated',
                                             content_type='application/x-recordio', s3_data_type='S3Prefix')

data_channels = {'train': train_data, 'validation': validation_data}

ic.fit(inputs=data_channels, logs=True)
```

To use the model to incrementally train another model, create a new estimator object and use the model artifacts (`ic.model_data`, in this example) for the `model_uri` input argument:

```
# Given the base estimator, create a new one for incremental training
incr_ic = sagemaker.estimator.Estimator(training_image,
                                        role,
                                        instance_count=1,
                                        instance_type='ml.p2.xlarge',
                                        volume_size=50,
                                        max_run=360000,
                                        input_mode='File',
                                        output_path=s3_output_location,
                                        sagemaker_session=sess,
                                        hyperparameters=hyperparams,
                                        model_uri=ic.model_data) # This parameter will ingest the previous job's model as a new channel
incr_ic.fit(inputs=data_channels, logs=True)
```

After the training job has completed, the newly trained model artifacts are stored under the `S3 output path` that you provided in `Output_path`. To deploy the model to get predictions, see [Deploy the model to Amazon EC2](ex1-model-deployment.md).

# Managed Spot Training in Amazon SageMaker AI
<a name="model-managed-spot-training"></a>

Amazon SageMaker AI makes it easy to train machine learning models using managed Amazon EC2 Spot Instances. Managed spot training can reduce the cost of training models by up to 90% compared to On-Demand Instances. SageMaker AI manages the Spot interruptions on your behalf. 

Managed Spot Training uses Amazon EC2 Spot Instances to run training jobs instead of On-Demand Instances. You can specify which training jobs use Spot Instances, along with a stopping condition that specifies how long SageMaker AI waits for a job to run on Spot Instances. Metrics and logs generated during training runs are available in CloudWatch. 

Amazon SageMaker AI automatic model tuning, also known as hyperparameter tuning, can use managed spot training. For more information on automatic model tuning, see [Automatic model tuning with SageMaker AI](automatic-model-tuning.md).

Spot instances can be interrupted, causing jobs to take longer to start or finish. You can configure your managed spot training job to use checkpoints. SageMaker AI copies checkpoint data from a local path to Amazon S3. When the job is restarted, SageMaker AI copies the data from Amazon S3 back into the local path. The training job can then resume from the last checkpoint instead of restarting. For more information about checkpointing, see [Checkpoints in Amazon SageMaker AI](model-checkpoints.md).

**Note**  
Unless your training job completes quickly, we recommend that you use checkpointing with managed spot training. SageMaker AI built-in algorithms and marketplace algorithms that do not checkpoint are currently limited to a `MaxWaitTimeInSeconds` of 3600 seconds (60 minutes). 

To use managed spot training, create a training job. Set `EnableManagedSpotTraining` to `True` and specify the `MaxWaitTimeInSeconds`. `MaxWaitTimeInSeconds` must be larger than `MaxRuntimeInSeconds`. For more information about creating a training job, see [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html). 
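
The relevant request fields can be sketched as follows. This is a minimal sketch of the managed spot settings in a `CreateTrainingJob` request; the S3 URI is a placeholder, and the durations are illustrative.

```python
# Sketch of the CreateTrainingJob fields that enable managed spot training.
# The S3 URI is a placeholder; the field names match the CreateTrainingJob API.
spot_settings = {
    "EnableManagedSpotTraining": True,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600,    # maximum training (compute) time
        "MaxWaitTimeInSeconds": 7200,   # total time to wait, including Spot delays
    },
    "CheckpointConfig": {
        "S3Uri": "s3://<your-bucket>/checkpoints/",  # resume point after interruptions
        "LocalPath": "/opt/ml/checkpoints",
    },
}

# MaxWaitTimeInSeconds must be larger than MaxRuntimeInSeconds.
stop = spot_settings["StoppingCondition"]
assert stop["MaxWaitTimeInSeconds"] > stop["MaxRuntimeInSeconds"]
```

These fields would be merged into a full `CreateTrainingJob` request alongside the algorithm, role, and data channel configuration.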

You can calculate the savings from using managed spot training using the formula `(1 - (BillableTimeInSeconds / TrainingTimeInSeconds)) * 100`. For example, if `BillableTimeInSeconds` is 100 and `TrainingTimeInSeconds` is 500, your training job ran for 500 seconds, but you were billed for only 100 seconds. The savings is (1 - (100 / 500)) * 100 = 80%.
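
The savings formula can be expressed as a small helper, shown here only to illustrate the arithmetic:

```python
def spot_savings_percent(billable_seconds, training_seconds):
    """Savings from managed spot training: (1 - billable/training) * 100."""
    return (1 - (billable_seconds / training_seconds)) * 100

print(spot_savings_percent(100, 500))  # 80.0
```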

To learn how to run training jobs on Amazon SageMaker AI spot instances and how managed spot training works and reduces the billable time, see the following example notebooks:
+ [Managed Spot Training with TensorFlow](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-python-sdk/managed_spot_training_tensorflow_estimator/managed_spot_training_tensorflow_estimator.html)
+ [Managed Spot Training with PyTorch](https://github.com/aws-samples/amazon-sagemaker-managed-spot-training/blob/main/pytorch_managed_spot_training_checkpointing/pytorch_managed_spot_training_checkpointing.ipynb)
+ [Managed Spot Training with XGBoost](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_managed_spot_training.html)
+ [Managed Spot Training with MXNet](https://github.com/aws/amazon-sagemaker-examples-community/blob/215215eb25b40eadaf126d055dbb718a245d7603/training/sagemaker-debugger/mxnet-spot-training-with-sagemakerdebugger.ipynb#L41)
+ [Amazon SageMaker AI Managed Spot Training Examples GitHub repository](https://github.com/aws-samples/amazon-sagemaker-managed-spot-training)

# Managed Spot Training Lifecycle
<a name="model-managed-spot-training-status"></a>

You can monitor a training job using `TrainingJobStatus` and `SecondaryStatus` returned by [DescribeTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html). The list below shows how `TrainingJobStatus` and `SecondaryStatus` values change depending on the training scenario:
+ **Spot instances acquired with no interruption during training**

  1. `InProgress`: `Starting` ↠ `Downloading` ↠ `Training` ↠ `Uploading`
+ **Spot instances interrupted once. Later, enough spot instances were acquired to finish the training job.**

  1. `InProgress`: `Starting` ↠ `Downloading` ↠ `Training` ↠ `Interrupted` ↠ `Starting` ↠ `Downloading` ↠ `Training` ↠ `Uploading` 
+ **Spot instances interrupted twice and `MaxWaitTimeInSeconds` exceeded.**

  1. `InProgress`: `Starting` ↠ `Downloading` ↠ `Training` ↠ `Interrupted` ↠ `Starting` ↠ `Downloading` ↠ `Training` ↠ `Interrupted` ↠ `Downloading` ↠ `Training` 

  1. `Stopping`: `Stopping` 

  1. `Stopped`: `MaxWaitTimeExceeded` 
+ **Spot instances were never launched.**

  1. `InProgress`: `Starting` 

  1. `Stopping`: `Stopping` 

  1. `Stopped`: `MaxWaitTimeExceeded` 
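
The status transitions above are reported in the `SecondaryStatusTransitions` field of the `DescribeTrainingJob` response, so you can inspect them programmatically. The following sketch counts interruptions in a hand-written example of that field; in practice you would fetch the response with `boto3.client("sagemaker").describe_training_job(TrainingJobName=...)`.

```python
def count_interruptions(transitions):
    """Count Spot interruptions in a job's SecondaryStatusTransitions list."""
    return sum(1 for t in transitions if t["Status"] == "Interrupted")

# Hand-written example matching the "interrupted once" scenario above.
transitions = [
    {"Status": "Starting"}, {"Status": "Downloading"}, {"Status": "Training"},
    {"Status": "Interrupted"},
    {"Status": "Starting"}, {"Status": "Downloading"}, {"Status": "Training"},
    {"Status": "Uploading"},
]
print(count_interruptions(transitions))  # 1
```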

# SageMaker AI Managed Warm Pools
<a name="train-warm-pools"></a>

SageMaker AI managed warm pools let you retain and reuse provisioned infrastructure after the completion of a training job to reduce latency for repetitive workloads, such as iterative experimentation or running many jobs consecutively. Subsequent training jobs that match specified parameters run on the retained warm pool infrastructure, which speeds up start times by reducing the time spent provisioning resources. 

**Important**  
SageMaker AI managed warm pools are a billable resource. For more information, see [Billing](#train-warm-pools-billing).

**Topics**
+ [

## How it works
](#train-warm-pools-how-it-works)
+ [

## Considerations
](#train-warm-pools-considerations)
+ [

# Request a warm pool quota increase
](train-warm-pools-resource-limits.md)
+ [

# Use SageMaker AI managed warm pools
](train-warm-pools-how-to-use.md)

## How it works
<a name="train-warm-pools-how-it-works"></a>

To use SageMaker AI managed warm pools and reduce latency between similar consecutive training jobs, create a training job that specifies a `KeepAlivePeriodInSeconds` value in its `ResourceConfig`. This value represents the duration of time in seconds to retain configured resources in a warm pool for subsequent training jobs. If you need to run several training jobs using similar configurations, you can further reduce latency and billable time by using a dedicated persistent cache directory to store and re-use your information in a different job.
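
At the API level, a minimal sketch of the `ResourceConfig` for such a request might look like the following; the instance type and volume size are illustrative, and the field names match the `CreateTrainingJob` API.

```python
# Sketch: ResourceConfig fields for a CreateTrainingJob request that keeps a
# warm pool alive for 30 minutes after the job completes. The instance type
# and volume size are illustrative.
resource_config = {
    "InstanceType": "ml.g4dn.xlarge",
    "InstanceCount": 1,
    "VolumeSizeInGB": 50,
    "KeepAlivePeriodInSeconds": 1800,  # retain the provisioned infrastructure
}

# KeepAlivePeriodInSeconds is capped at 3600 seconds (60 minutes) per job.
assert 0 < resource_config["KeepAlivePeriodInSeconds"] <= 3600
```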

**Topics**
+ [

### Warm pool lifecycle
](#train-warm-pools-lifecycle)
+ [

### Warm pool creation
](#train-warm-pools-creation)
+ [

### Matching training jobs
](#train-warm-pools-matching-criteria)
+ [

### Maximum warm pool duration
](#train-warm-pools-maximum-duration)
+ [

### Using persistent cache
](#train-warm-pools-persistent-cache)
+ [

### Billing
](#train-warm-pools-billing)

### Warm pool lifecycle
<a name="train-warm-pools-lifecycle"></a>

1. Create an initial training job with a `KeepAlivePeriodInSeconds` value greater than 0. When you run this first training job, this “cold-starts” a cluster with typical startup times. 

1. When the first training job completes, the provisioned resources are kept alive in a warm pool for the period specified in the `KeepAlivePeriodInSeconds` value. As long as the cluster is healthy and the warm pool is within the specified `KeepAlivePeriodInSeconds`, then the warm pool status is `Available`. 

1. The warm pool stays `Available` until it either identifies a matching training job for reuse or it exceeds the specified `KeepAlivePeriodInSeconds` and is terminated. The maximum length of time allowed for the `KeepAlivePeriodInSeconds` is 3600 seconds (60 minutes). If the warm pool status is `Terminated`, then this is the end of the warm pool lifecycle.

1. If the warm pool identifies a second training job with matching specifications such as instance count or instance type, then the warm pool moves from the first training job to the second training job for reuse. The status of the first training job warm pool becomes `Reused`. This is the end of the warm pool lifecycle for the first training job. 

1. The status of the second training job that reused the warm pool becomes `InUse`. After the second training job completes, the warm pool is `Available` for the `KeepAlivePeriodInSeconds` duration specified in the second training job. A warm pool can continue moving to subsequent matching training jobs for a maximum of 28 days.

1. If the warm pool is no longer available to reuse, the warm pool status is `Terminated`. Warm pools are no longer available if they are terminated by a user, for a patch update, or for exceeding the specified `KeepAlivePeriodInSeconds`.

For more information on warm pool status options, see [WarmPoolStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_WarmPoolStatus.html) in the *Amazon SageMaker API Reference*.

### Warm pool creation
<a name="train-warm-pools-creation"></a>

If an initial training job successfully completes and has a `KeepAlivePeriodInSeconds` value greater than 0, this creates a warm pool. If you stop a training job after a cluster is already launched, a warm pool is still retained. If the training job fails due to an algorithm or client error, a warm pool is still retained. If the training job fails for any other reason that might compromise the health of the cluster, then the warm pool is not created. 

To verify successful warm pool creation, check the warm pool status of your training job. If a warm pool successfully provisions, the warm pool status is `Available`. If a warm pool fails to provision, the warm pool status is `Terminated`.

### Matching training jobs
<a name="train-warm-pools-matching-criteria"></a>

For a warm pool to persist, it must find a matching training job within the time specified in the `KeepAlivePeriodInSeconds` value. The next training job is a match if the following values are identical: 
+ `RoleArn` 
+ `ResourceConfig` values:
  + `InstanceCount`
  + `InstanceType`
  + `VolumeKmsKeyId`
  + `VolumeSizeInGB`
+ `VpcConfig` values:
  + `SecurityGroupIds`
  + `Subnets`
+ `EnableInterContainerTrafficEncryption`
+ `EnableNetworkIsolation`
+ If you passed [session tags](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_session-tags.html#id_session-tags_operations) for your training job with `EnableSessionTagChaining` set to `True` in the training job's `SessionChainingConfig`, then a matching training job must also set `EnableSessionTagChaining` to `True` and have identical session keys. For more information, see [Use attribute-based access control (ABAC) for multi-tenancy training](model-access-training-data-abac.md). 

All of these values must be the same for a warm pool to move to a subsequent training job for reuse.

### Maximum warm pool duration
<a name="train-warm-pools-maximum-duration"></a>

The maximum `KeepAlivePeriodInSeconds` for a single training job is 3600 seconds (60 minutes) and the maximum length of time that a warm pool cluster can continue running consecutive training jobs is 28 days. 

Each subsequent training job must also specify a `KeepAlivePeriodInSeconds` value. When the warm pool moves to the next training job, it inherits the new `KeepAlivePeriodInSeconds` value specified in that training job’s `ResourceConfig`. In this way, you can keep a warm pool moving from training job to training job for a maximum of 28 days.

If no `KeepAlivePeriodInSeconds` is specified, then the warm pool spins down after the training job completes.

### Using persistent cache
<a name="train-warm-pools-persistent-cache"></a>

When you create a warm pool, SageMaker AI mounts a special directory on the volume that will persist throughout the lifecycle of the warm pool. This directory can also be used to store information that you want to re-use in another job. 

Using persistent cache can reduce latency and billable time over using warm pools alone for jobs that require the following:
+ multiple interactions with similar configurations
+ incremental training jobs
+ hyperparameter optimization

For example, you can avoid downloading the same Python dependencies on repeated runs by setting up a pip cache directory inside the persistent cache directory. You are fully responsible for managing the contents of this directory. The following are examples of types of information that you can put in your persistent cache to help reduce your latency and billable time.
+ Dependencies managed by pip.
+ Dependencies managed by conda.
+ [Checkpoint information](https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html).
+ Any additional information generated during training.

The location of the persistent cache is `/opt/ml/sagemaker/warmpoolcache`. The environment variable `SAGEMAKER_MANAGED_WARMPOOL_CACHE_DIRECTORY` points to the location of the persistent cache directory.

The following code example shows you how to set up a warm pool and use persistent cache to store your pip dependencies for use in a subsequent job. The subsequent job must run within the time frame given by the parameter `keep_alive_period_in_seconds`.

```
import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow
# Creates a SageMaker session and gets execution role
session = sagemaker.Session()
role = get_execution_role()
# Creates an example estimator
estimator = TensorFlow(
    ...
    entry_point='my-training-script.py',
    source_dir='code',
    role=role,
    model_dir='model_dir',
    framework_version='2.2',
    py_version='py37',
    job_name='my-training-job-1',
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    volume_size=250,
    hyperparameters={
        "batch-size": 512,
        "epochs": 1,
        "learning-rate": 1e-3,
        "beta_1": 0.9,
        "beta_2": 0.999,
    },
    keep_alive_period_in_seconds=1800,
    environment={"PIP_CACHE_DIR": "/opt/ml/sagemaker/warmpoolcache/pip"}
)
```

In the previous code example, the [environment](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#estimators) parameter exports the environment variable `PIP_CACHE_DIR` to point to the directory `/opt/ml/sagemaker/warmpoolcache/pip`. Exporting this environment variable changes where pip stores its cache to the new location. Any directory, including nested directories, that you create inside the persistent cache directory is available for re-use during a subsequent training run. In this example, the directory called `pip` becomes the default location for caching dependencies installed using pip.

The persistent cache location can also be accessed from within your Python training script using the environment variable, as shown in the following code example.

```
import os
import shutil
if __name__ == '__main__':
    PERSISTED_DIR = os.environ["SAGEMAKER_MANAGED_WARMPOOL_CACHE_DIRECTORY"]

    # create a file to be persisted
    open(os.path.join(PERSISTED_DIR, "test.txt"), 'a').close()
    # create a directory to be persisted
    os.mkdir(os.path.join(PERSISTED_DIR, "test_dir"))

    # Move a file to be persisted
    shutil.move("path/of/your/file.txt", PERSISTED_DIR)
```

### Billing
<a name="train-warm-pools-billing"></a>

SageMaker AI managed warm pools are a billable resource. Retrieve the warm pool status for your training job to check the billable time for your warm pools. You can check the warm pool status either in the [Amazon SageMaker AI console](train-warm-pools-how-to-use.md#train-warm-pools-how-to-use-sagemaker-console) or directly through the [DescribeTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html) API. For more information, see [WarmPoolStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_WarmPoolStatus.html) in the *Amazon SageMaker API Reference*.
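
For example, a script can read the billable time from the `WarmPoolStatus` field of the `DescribeTrainingJob` response. The response below is a hand-written example of that field's shape; in practice you would obtain it with `boto3.client("sagemaker").describe_training_job(TrainingJobName=...)`.

```python
# Hand-written example of the WarmPoolStatus portion of a DescribeTrainingJob
# response; the field names match the API.
response = {
    "WarmPoolStatus": {
        "Status": "Reused",
        "ResourceRetainedBillableTimeInSeconds": 600,  # billable warm pool time
        "ReusedByJob": "my-training-job-2",
    }
}

warm_pool = response["WarmPoolStatus"]
print(warm_pool["Status"], warm_pool["ResourceRetainedBillableTimeInSeconds"])
```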

**Note**  
After the time specified by the parameter `KeepAlivePeriodInSeconds` has ended, both the warm pool and persistent cache will shut down, and the contents will be deleted.

## Considerations
<a name="train-warm-pools-considerations"></a>

Consider the following items when using SageMaker AI managed warm pools.
+ SageMaker AI managed warm pools cannot be used with heterogeneous cluster training. 
+ SageMaker AI managed warm pools cannot be used with spot instances.
+ SageMaker AI managed warm pools are limited to a `KeepAlivePeriodInSeconds` value of 3600 seconds (60 minutes).
+ If a warm pool continues to successfully match training jobs within the specified `KeepAlivePeriodInSeconds` value, the cluster can only continue running for a maximum of 28 days.
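
The `KeepAlivePeriodInSeconds` limit above can be checked client-side before you submit a job. The following is a minimal sketch; the validation helper is illustrative and not part of the SageMaker Python SDK:

```python
MAX_KEEP_ALIVE_SECONDS = 3600  # warm pools are limited to 60 minutes

def validate_keep_alive(seconds):
    """Raise if the requested warm pool retention exceeds the service limit."""
    if not 0 <= seconds <= MAX_KEEP_ALIVE_SECONDS:
        raise ValueError(
            f"KeepAlivePeriodInSeconds must be between 0 and "
            f"{MAX_KEEP_ALIVE_SECONDS}, got {seconds}"
        )
    return seconds

validate_keep_alive(1800)  # within the limit, returns 1800
```

Catching an out-of-range value locally avoids submitting a training job request that the service would reject.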

# Request a warm pool quota increase
<a name="train-warm-pools-resource-limits"></a>

To get started, you must first request a service limit increase for SageMaker AI managed warm pools. The default resource limit for warm pools is 0.

If a training job is created with `KeepAlivePeriodInSeconds` specified, but you did not request a warm pool limit increase, then a warm pool is not retained after the completion of the training job. A warm pool is only created if your warm pool limit has sufficient resources. After a warm pool is created, its resources are released when they move to a matching training job or when the `KeepAlivePeriodInSeconds` expires (the warm pool status becomes `Reused` or `Terminated`, respectively).

Request a warm pool quota increase using the AWS Service Quotas console.

**Note**  
All warm pool instance usage counts toward your SageMaker training resource limit. Increasing your warm pool resource limit does not increase your instance limit, but allocates a subset of your resource limit to warm pool training.

1. Open the [AWS Service Quotas console](https://console.aws.amazon.com/servicequotas/home/).

1. On the left-hand navigation panel, choose **AWS services**.

1. Search for and choose **Amazon SageMaker AI**.

1. Search for the keyword **warm pool** to see all available warm pool service quotas.

1. Find the instance type for which you want to increase your warm pool quota, select the warm pool service quota for that instance type, and choose **Request quota increase**.

1. Enter your requested instance limit number under **Change quota value**. The new value must be greater than the current **Applied quota value**.

1. Choose **Request**.

There is a limit on the number of instances that you can retain for each account, which is determined by instance type. You can check your resource limits in the [AWS Service Quotas console](https://console.aws.amazon.com/servicequotas/home/) or directly using the [list-service-quotas](https://docs.aws.amazon.com/cli/latest/reference/service-quotas/list-service-quotas.html) AWS CLI command. For more information on AWS Service Quotas, see [Requesting a quota increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html) in the *Service Quotas User Guide*. 

You can also use [AWS Support Center](https://support.console.aws.amazon.com) to request a warm pool quota increase. For a list of available instance types according to Region, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) and choose **Training** in the **On-Demand Pricing** table.

# Use SageMaker AI managed warm pools
<a name="train-warm-pools-how-to-use"></a>

You can use SageMaker AI managed warm pools through the SageMaker Python SDK, the Amazon SageMaker AI console, or through the low-level APIs. Administrators can optionally use the `sagemaker:KeepAlivePeriod` condition key to further restrict the `KeepAlivePeriodInSeconds` limits for certain users or groups.

**Topics**
+ [Using the SageMaker AI Python SDK](#train-warm-pools-how-to-use-python-sdk)
+ [Using the Amazon SageMaker AI console](#train-warm-pools-how-to-use-sagemaker-console)
+ [Using the low-level SageMaker APIs](#train-warm-pools-how-to-use-low-level-apis)
+ [IAM condition key](#train-warm-pools-how-to-use-iam-condition-key)

## Using the SageMaker AI Python SDK
<a name="train-warm-pools-how-to-use-python-sdk"></a>

Create, update, or terminate warm pools using the SageMaker Python SDK.

**Note**  
This feature is available in the SageMaker AI [Python SDK v2.110.0](https://pypi.org/project/sagemaker/2.110.0/) and later.

**Topics**
+ [Create a warm pool](#train-warm-pools-how-to-use-python-sdk-create)
+ [Update a warm pool](#train-warm-pools-how-to-use-python-sdk-update)
+ [Terminate a warm pool](#train-warm-pools-how-to-use-python-sdk-terminate)

### Create a warm pool
<a name="train-warm-pools-how-to-use-python-sdk-create"></a>

To create a warm pool, use the SageMaker Python SDK to create an estimator with a `keep_alive_period_in_seconds` value greater than 0 and call `fit()`. When the training job completes, a warm pool is retained. For more information on training scripts and estimators, see [Train a Model with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#train-a-model-with-the-sagemaker-python-sdk). If your script does not create a warm pool, see [Warm pool creation](train-warm-pools.md#train-warm-pools-creation) for possible explanations.

```
import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow

# Creates a SageMaker AI session and gets execution role
session = sagemaker.Session()
role = get_execution_role()

# Creates an example estimator
estimator = TensorFlow(
    ...
    entry_point='my-training-script.py',
    source_dir='code',
    role=role,
    model_dir='model_dir',
    framework_version='2.2',
    py_version='py37',
    job_name='my-training-job-1',
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    volume_size=250,
    hyperparameters={
        "batch-size": 512,
        "epochs": 1,
        "learning-rate": 1e-3,
        "beta_1": 0.9,
        "beta_2": 0.999,
    },
    keep_alive_period_in_seconds=1800,
)

# Starts a SageMaker training job and waits until completion
estimator.fit('s3://my_bucket/my_training_data/')
```

Next, create a second matching training job. In this example, we create `my-training-job-2`, which has all of the necessary attributes to match with `my-training-job-1`, but has a different hyperparameter for experimentation. The second training job reuses the warm pool and starts up faster than the first training job. The following code example uses a TensorFlow estimator, but the warm pool feature can be used with any training algorithm that runs on Amazon SageMaker AI. For more information on which attributes need to match, see [Matching training jobs](train-warm-pools.md#train-warm-pools-matching-criteria).

```
# Creates an example estimator
estimator = TensorFlow(
    ...
    entry_point='my-training-script.py',
    source_dir='code',
    role=role,
    model_dir='model_dir',
    framework_version='2.2',
    py_version='py37',
    job_name='my-training-job-2',
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    volume_size=250,
    hyperparameters={
        "batch-size": 512,
        "epochs": 2,
        "learning-rate": 1e-3,
        "beta_1": 0.9,
        "beta_2": 0.999,
    },
    keep_alive_period_in_seconds=1800,
)

# Starts a SageMaker training job and waits until completion
estimator.fit('s3://my_bucket/my_training_data/')
```

Check the warm pool status of both training jobs to confirm that the warm pool is `Reused` for `my-training-job-1` and `InUse` for `my-training-job-2`.

**Note**  
Training job names have date/time suffixes. The example training job names `my-training-job-1` and `my-training-job-2` should be replaced with actual training job names. You can use the `estimator.latest_training_job.job_name` command to fetch the actual training job name.

```
session.describe_training_job('my-training-job-1')
session.describe_training_job('my-training-job-2')
```

The result of `describe_training_job` provides all details about a given training job. Find the `WarmPoolStatus` attribute to check information about a training job’s warm pool. Your output should look similar to the following example:

```
# Warm pool status for training-job-1
...
'WarmPoolStatus': {'Status': 'Reused', 
  'ResourceRetainedBillableTimeInSeconds': 1000,
  'ReusedByName': 'my-training-job-2'}
...

# Warm pool status for training-job-2
... 
'WarmPoolStatus': {'Status': 'InUse'}
...
```

### Update a warm pool
<a name="train-warm-pools-how-to-use-python-sdk-update"></a>

When the training job is complete and the warm pool status is `Available`, then you can update the `KeepAlivePeriodInSeconds` value.

```
session.update_training_job(job_name, resource_config={"KeepAlivePeriodInSeconds":3600})
```

### Terminate a warm pool
<a name="train-warm-pools-how-to-use-python-sdk-terminate"></a>

To manually terminate a warm pool, set the `KeepAlivePeriodInSeconds` value to 0.

```
session.update_training_job(job_name, resource_config={"KeepAlivePeriodInSeconds":0})
```

The warm pool automatically terminates when it exceeds the designated `KeepAlivePeriodInSeconds` value or if there is a patch update for the cluster.

## Using the Amazon SageMaker AI console
<a name="train-warm-pools-how-to-use-sagemaker-console"></a>

Through the console, you can create a warm pool, release a warm pool, or check the warm pool status and billable time of specific training jobs. You can also see which matching training job reused a warm pool.

1. Open the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/) and choose **Training jobs** from the navigation pane. If applicable, the warm pool status of each training job is visible in the **Warm pool status** column and the time left for an active warm pool is visible in the **Time left** column.

1. To create a training job that uses a warm pool from the console, choose **Create training job**. Then, be sure to specify a value for the **Keep alive period** field when configuring your training job resources. This value must be an integer between 1 and 3600, which represents duration of time in seconds.

1. To release a warm pool from the console, select a specific training job and choose **Release cluster** from the **Actions** dropdown menu.

1. To see more information about a warm pool, choose a training job name. In the job details page, scroll down to the **Warm pool status** section to find the warm pool status, the time left if the warm pool status is `Available`, the warm pool billable seconds, and the name of the training job that reused the warm pool if the warm pool status is `Reused`.

## Using the low-level SageMaker APIs
<a name="train-warm-pools-how-to-use-low-level-apis"></a>

Use SageMaker AI managed warm pools with either the SageMaker API or the AWS CLI.

### SageMaker AI API
<a name="train-warm-pools-how-to-use-low-level-apis-sagemaker"></a>

Set up SageMaker AI managed warm pools using the SageMaker API with the following commands:
+ [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html)
+ [UpdateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateTrainingJob.html)
+ [ListTrainingJobs](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListTrainingJobs.html)
+ [DescribeTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html)

### AWS CLI
<a name="train-warm-pools-how-to-use-low-level-apis-cli"></a>

Set up SageMaker AI managed warm pools using the AWS CLI with the following commands:
+ [create-training-job](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-training-job.html)
+ [update-training-job](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/update-training-job.html)
+ [list-training-jobs](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/list-training-jobs.html)
+ [describe-training-job](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/describe-training-job.html)

## IAM condition key
<a name="train-warm-pools-how-to-use-iam-condition-key"></a>

Administrators can optionally use the `sagemaker:KeepAlivePeriod` condition key to further restrict the `KeepAlivePeriodInSeconds` limits for certain users or groups. SageMaker AI managed warm pools are limited to a `KeepAlivePeriodInSeconds` value of 3600 seconds (60 minutes), but administrators can lower this limit if needed. 

The following example IAM policy allows a user to create a training job only when the requested `KeepAlivePeriod` is less than 1800 seconds:
```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnforceKeepAlivePeriodLimit",
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateTrainingJob"
            ],
            "Resource": "*",
            "Condition": {
                "NumericLessThanIfExists": {
                    "sagemaker:KeepAlivePeriod": "1800"
                }
            }
        }
    ]
}
```


For more information, see [Condition keys for Amazon SageMaker AI](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonsagemaker.html#amazonsagemaker-policy-keys) in the *Service Authorization Reference*.

# Amazon CloudWatch Metrics for Monitoring and Analyzing Training Jobs
<a name="training-metrics"></a>

An Amazon SageMaker training job is an iterative process that teaches a model to make predictions by presenting examples from a training dataset. Typically, a training algorithm computes several metrics, such as training error and prediction accuracy. These metrics help diagnose whether the model is learning well and will generalize well for making predictions on unseen data. The training algorithm writes the values of these metrics to logs, which SageMaker AI monitors and sends to Amazon CloudWatch in real time. To analyze the performance of your training job, you can view graphs of these metrics in CloudWatch. When a training job has completed, you can also get a list of the metric values that it computes in its final iteration by calling the [DescribeTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html) operation.
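
For example, the final metric values can be pulled out of a `DescribeTrainingJob` response via its `FinalMetricDataList` field. The helper below is an illustrative sketch (in practice the response comes from the `describe_training_job` API call, not a literal dictionary):

```python
def final_metrics(describe_response):
    """Map metric names to their final values from a DescribeTrainingJob response."""
    return {m["MetricName"]: m["Value"]
            for m in describe_response.get("FinalMetricDataList", [])}

# Example fragment shaped like a DescribeTrainingJob response
response = {
    "TrainingJobName": "my-training-job",
    "FinalMetricDataList": [
        {"MetricName": "train:error", "Value": 0.138318},
        {"MetricName": "validation:error", "Value": 0.324557},
    ],
}
print(final_metrics(response))
# {'train:error': 0.138318, 'validation:error': 0.324557}
```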

**Note**  
Amazon CloudWatch supports [high-resolution custom metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html), and its finest resolution is 1 second. However, the finer the resolution, the shorter the lifespan of the CloudWatch metrics. For the 1-second frequency resolution, the CloudWatch metrics are available for 3 hours. For more information about the resolution and the lifespan of the CloudWatch metrics, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*. 

**Tip**  
If you want to profile your training job with a finer resolution down to 100-millisecond (0.1 second) granularity and store the training metrics indefinitely in Amazon S3 for custom analysis at any time, consider using [Amazon SageMaker Debugger](https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html). SageMaker Debugger provides built-in rules to automatically detect common training issues; it detects hardware resource utilization issues (such as CPU, GPU, and I/O bottlenecks) and non-converging model issues (such as overfit, vanishing gradients, and exploding tensors). SageMaker Debugger also provides visualizations through Studio Classic and its profiling report. To explore the Debugger visualizations, see [SageMaker Debugger Insights Dashboard Walkthrough](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-on-studio-insights-walkthrough.html), [Debugger Profiling Report Walkthrough](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-profiling-report.html#debugger-profiling-report-walkthrough), and [Analyze Data Using the SMDebug Client Library](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-analyze-data.html).

**Topics**
+ [

# Define Training Metrics
](define-train-metrics.md)
+ [

# View training job metrics
](view-train-metrics.md)
+ [

# Example: Viewing a Training and Validation Curve
](train-valid-curve.md)

# Define Training Metrics
<a name="define-train-metrics"></a>

SageMaker AI automatically parses training job logs and sends training metrics to CloudWatch. By default, SageMaker AI sends system resource utilization metrics listed in [SageMaker AI Jobs and Endpoint Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-jobs). If you want SageMaker AI to parse logs and send custom metrics from a training job of your own algorithm to CloudWatch, you need to specify metrics definitions by passing the name of metrics and regular expressions when you configure a SageMaker AI training job request.

You can specify the metrics that you want to track using the SageMaker AI console, the [SageMaker AI Python SDK](https://github.com/aws/sagemaker-python-sdk), or the low-level SageMaker AI API.

If you are using your own algorithm, do the following:
+ Make sure that the algorithm writes the metrics that you want to capture to logs.
+ Define a regular expression that accurately searches the logs to capture the values of the metrics that you want to send to CloudWatch.

For example, suppose your algorithm emits the following metrics for training error and validation error:

```
Train_error=0.138318;  Valid_error=0.324557;
```

If you want to monitor both of those metrics in CloudWatch, the dictionary for the metric definitions should look like the following example:

```
[
    {
        "Name": "train:error",
        "Regex": "Train_error=(.*?);"
    },
    {
        "Name": "validation:error",
        "Regex": "Valid_error=(.*?);"
    }    
]
```

In the regex for the `train:error` metric defined in the preceding example, the first part of the regex finds the exact text `Train_error=`, and the expression `(.*?);` captures any characters until the first semicolon. In this expression, the parentheses tell the regex to capture what is inside them, `.` matches any character, `*` means zero or more, and `?` makes the match non-greedy, so it captures only up to the first instance of the `;` character.
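
You can check that a metric regex captures what you expect by testing it locally with Python's `re` module before configuring a training job:

```python
import re

# A sample log line emitted by the training algorithm
log_line = "Train_error=0.138318;  Valid_error=0.324557;"

metric_definitions = [
    {"Name": "train:error", "Regex": "Train_error=(.*?);"},
    {"Name": "validation:error", "Regex": "Valid_error=(.*?);"},
]

for definition in metric_definitions:
    match = re.search(definition["Regex"], log_line)
    if match:
        # group(1) is the value captured by the parentheses
        print(definition["Name"], "=", match.group(1))
# train:error = 0.138318
# validation:error = 0.324557
```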

## Define Metrics Using the SageMaker AI Python SDK
<a name="define-train-metrics-sdk"></a>

Define the metrics that you want to send to CloudWatch by specifying a list of metric names and regular expressions as the `metric_definitions` argument when you initialize an `Estimator` object. For example, if you want to monitor both the `train:error` and `validation:error` metrics in CloudWatch, your `Estimator` initialization would look like the following example:

```
import sagemaker
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="your-own-image-uri",
    role=sagemaker.get_execution_role(), 
    sagemaker_session=sagemaker.Session(),
    instance_count=1,
    instance_type='ml.c4.xlarge',
    metric_definitions=[
       {'Name': 'train:error', 'Regex': 'Train_error=(.*?);'},
       {'Name': 'validation:error', 'Regex': 'Valid_error=(.*?);'}
    ]
)
```

For more information about training by using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) estimators, see [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk#sagemaker-python-sdk-overview) on GitHub. 

## Define Metrics Using the SageMaker AI Console
<a name="define-train-metrics-console"></a>

If you choose the **Your own algorithm container in ECR** option as your algorithm source in the SageMaker AI console when you create a training job, add the metric definitions in the **Metrics** section. The following screenshot shows how it should look after you add the example metric names and the corresponding regular expressions.

![\[Example Algorithm options form in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/training-metrics-using-smconsole.png)


## Define Metrics Using the Low-level SageMaker AI API
<a name="define-train-metrics-api"></a>

Define the metrics that you want to send to CloudWatch by specifying a list of metric names and regular expressions in the `MetricDefinitions` field of the [AlgorithmSpecification](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AlgorithmSpecification.html) input parameter that you pass to the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) operation. For example, if you want to monitor both the `train:error` and `validation:error` metrics in CloudWatch, your `AlgorithmSpecification` would look like the following example:

```
"AlgorithmSpecification": {
    "TrainingImage": "your-own-image-uri",
    "TrainingInputMode": "File",
    "MetricDefinitions" : [
        {
            "Name": "train:error",
            "Regex": "Train_error=(.*?);"
        },
        {
            "Name": "validation:error",
            "Regex": "Valid_error=(.*?);"
        }
    ]
}
```

For more information about defining and running a training job by using the low-level SageMaker AI API, see [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html).

# View training job metrics
<a name="view-train-metrics"></a>

You can view the metrics emitted from your Amazon SageMaker training jobs in either the Amazon CloudWatch or SageMaker AI console.

## Monitor training job metrics (CloudWatch console)
<a name="view-train-metrics-cw"></a>

You can monitor the metrics that a training job emits in real time in the CloudWatch console.

**To monitor training job metrics (CloudWatch console)**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch](https://console.aws.amazon.com/cloudwatch).

1. Choose **Metrics**, then choose **/aws/sagemaker/TrainingJobs**.

1. Choose **TrainingJobName**.

1. On the **All metrics** tab, choose the names of the training metrics that you want to monitor.

1. On the **Graphed metrics** tab, configure the graph options. For more information about using CloudWatch graphs, see [Graph Metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/graph_metrics.html) in the *Amazon CloudWatch User Guide*.

## Monitor training job metrics (SageMaker AI console)
<a name="view-train-metrics-sm"></a>

You can monitor the metrics that a training job emits in real time by using the SageMaker AI console.

**To monitor training job metrics (SageMaker AI console)**

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker](https://console.aws.amazon.com/sagemaker).

1. Choose **Training jobs**, then choose the name of the training job whose metrics you want to see.

1. In the **Monitor** section, you can review the graphs of instance utilization and algorithm metrics.  
![\[Example graphs in the Monitor section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/console-metrics.png)

# Example: Viewing a Training and Validation Curve
<a name="train-valid-curve"></a>

Typically, you split the data on which you train your model into training and validation datasets. You use the training set to train the model parameters that are used to make predictions on the training dataset. Then you test how well the model makes predictions by calculating predictions for the validation set. To analyze the performance of a training job, you commonly plot a training curve against a validation curve. 

Viewing a graph that shows the accuracy for both the training and validation sets over time can help you to improve the performance of your model. For example, if training accuracy continues to increase over time, but, at some point, validation accuracy starts to decrease, you are likely overfitting your model. To address this, you can make adjustments to your model, such as increasing [regularization](https://docs.aws.amazon.com/glossary/latest/reference/glos-chap.html#regularization).

For this example, you can use the **Image-classification-full-training** example in the **Example notebooks** section of your SageMaker AI notebook instance. If you don't have a SageMaker notebook instance, create one by following the instructions at [Create an Amazon SageMaker Notebook Instance for the tutorial](gs-setup-working-env.md). If you prefer, you can follow along with the [End-to-End Multiclass Image Classification Example](https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/imageclassification_caltech/Image-classification-fulltraining.html) in the example notebook on GitHub. You also need an Amazon S3 bucket to store the training data and for the model output.

**To view training and validation error curves**

1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker](https://console.aws.amazon.com/sagemaker).

1. Choose **Notebooks**, and then choose **Notebook instances**.

1. Choose the notebook instance that you want to use, and then choose **Open**.

1. On the dashboard for your notebook instance, choose **SageMaker AI Examples**.

1. Expand the **Introduction to Amazon Algorithms** section, and then choose **Use** next to **Image-classification-fulltraining.ipynb**.

1. Choose **Create copy**. SageMaker AI creates an editable copy of the **Image-classification-fulltraining.ipynb** notebook in your notebook instance.

1. Run all of the cells in the notebook up to the **Inference** section. You don't need to deploy an endpoint or get inference for this example.

1. After the training job starts, open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch](https://console.aws.amazon.com/cloudwatch).

1. Choose **Metrics**, then choose **/aws/sagemaker/TrainingJobs**.

1. Choose **TrainingJobName**.

1. On the **All metrics** tab, choose the **train:accuracy** and **validation:accuracy** metrics for the training job that you created in the notebook.

1. On the graph, select an area of the metric's values to zoom in. You should see something like the following example.  
![\[Zoomed in area in the graph.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/train-valid-acc.png)

# Augmented Manifest Files for Training Jobs
<a name="augmented-manifest"></a>

To include metadata with your dataset in a training job, use an augmented manifest file. When using an augmented manifest file, your dataset must be stored in Amazon Simple Storage Service (Amazon S3), and you must configure your training job to use the dataset stored there. You specify the location and format of this dataset for one or more [Channel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html) objects. Augmented manifest files are supported only in Pipe input mode. To learn more about Pipe input mode, see the **InputMode** section of [Channel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html). 

When specifying a channel's parameters, you specify the path to the file as an `S3Uri`. Amazon SageMaker AI interprets this URI based on the specified `S3DataType` in [S3DataSource](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html). The `AugmentedManifestFile` option defines a manifest format that includes metadata with the input data. Using an augmented manifest file is an alternative to preprocessing when you have labeled data. For training jobs using labeled data, you typically need to preprocess the dataset to combine input data with metadata before training. If your training dataset is large, preprocessing can be time consuming and expensive.

## Augmented Manifest File Format
<a name="augmented-manifest-format"></a>

An augmented manifest file must be formatted in [JSON Lines](http://jsonlines.org/) format. In JSON Lines format, each line in the file is a complete JSON object followed by a newline separator.

During training, SageMaker AI parses each JSON line and sends some or all of its attributes on to the training algorithm. You specify which attribute contents to pass and the order in which to pass them with the `AttributeNames` parameter of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API. The `AttributeNames` parameter is an ordered list of attribute names that SageMaker AI looks for in the JSON object to use as training input.

For example, if you list `["line", "book"]` for `AttributeNames`, the input data must include the attribute names of `line` and `book` in the specified order. For this example, the following augmented manifest file content is valid:

```
{"author": "Herman Melville", "line": "Call me Ishmael", "book": "Moby Dick"}
{"line": "It was love at first sight.", "author": "Joseph Heller", "book": "Catch-22"}
```

SageMaker AI ignores unlisted attribute names even if they precede, follow, or are in between listed attributes.
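
The attribute selection described above can be sketched locally. This is not SageMaker AI's implementation, just an illustration of how `AttributeNames` selects and orders values from each JSON line:

```python
import json

# Two JSON Lines records, as in the manifest example above
manifest = '''{"author": "Herman Melville", "line": "Call me Ishmael", "book": "Moby Dick"}
{"line": "It was love at first sight.", "author": "Joseph Heller", "book": "Catch-22"}'''

attribute_names = ["line", "book"]

for json_line in manifest.splitlines():
    record = json.loads(json_line)
    # Pass only the listed attributes, in the order they were listed;
    # unlisted attributes such as "author" are ignored.
    values = [record[name] for name in attribute_names]
    print(values)
# ['Call me Ishmael', 'Moby Dick']
# ['It was love at first sight.', 'Catch-22']
```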

When using augmented manifest files, observe the following guidelines:
+ The order of the attributes listed in the `AttributeNames` parameter determines the order of the attributes passed to the algorithm in the training job.
+ The listed `AttributeNames` can be a subset of all of the attributes in the JSON line. SageMaker AI ignores unlisted attributes in the file.
+ You can specify any type of data allowed by the JSON format in `AttributeNames`, including text, numerical, data arrays, or objects.
+ To include an S3 URI as an attribute name, add the suffix `-ref` to it.

If an attribute name contains the suffix `-ref`, the attribute's value must be an S3 URI to a data file that is accessible to the training job. For example, if `AttributeNames` contains `["image-ref", "is-a-cat"]`, the following example shows a valid augmented manifest file:

```
{"image-ref": "s3://amzn-s3-demo-bucket/sample01/image1.jpg", "is-a-cat": 1}
{"image-ref": "s3://amzn-s3-demo-bucket/sample02/image2.jpg", "is-a-cat": 0}
```

For the first JSON line of this manifest file, SageMaker AI retrieves the `image1.jpg` file from `s3://amzn-s3-demo-bucket/sample01/` along with the string representation of the `is-a-cat` attribute value, `"1"`, for image classification.
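
A sketch of how the `-ref` naming convention distinguishes inline values from S3 references (illustrative only; SageMaker AI performs the actual retrieval and streaming):

```python
import json

json_line = '{"image-ref": "s3://amzn-s3-demo-bucket/sample01/image1.jpg", "is-a-cat": 1}'
attribute_names = ["image-ref", "is-a-cat"]

record = json.loads(json_line)
for name in attribute_names:
    if name.endswith("-ref"):
        # The value is an S3 URI; the referenced object would be fetched.
        print("fetch:", record[name])
    else:
        # Inline values are passed as their string representation.
        print("inline:", str(record[name]))
# fetch: s3://amzn-s3-demo-bucket/sample01/image1.jpg
# inline: 1
```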

**Tip**  
To create an augmented manifest file, use Amazon SageMaker Ground Truth and create a labeling job. For more information about the output from a labeling job, see [Labeling job output data](sms-data-output.md).

# Augmented Manifest File Format for Pipe Mode Training
<a name="augmented-manifest-stream"></a>

The augmented manifest format enables you to train in Pipe mode using files without needing to create RecordIO files. You need to specify both train and validation channels as values for the `InputDataConfig` parameter of the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. Augmented manifest files are supported only for channels that use Pipe input mode. For each channel, the data is extracted from its augmented manifest file and streamed (in order) to the algorithm through the channel's named pipe. Pipe mode uses the first in, first out (FIFO) method, so records are processed in the order in which they are queued. For information about Pipe input mode, see [InputMode](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html#SageMaker-Type-Channel-InputMode).
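As a concrete reference, the channel configuration in a `CreateTrainingJob` request might look like the following sketch. The bucket names, manifest file names, and attribute names are placeholders; substitute your own:

```python
# Sketch of InputDataConfig for a CreateTrainingJob request: train and
# validation channels in Pipe mode, each reading an augmented manifest file.
input_data_config = [
    {
        "ChannelName": "train",
        "InputMode": "Pipe",
        "RecordWrapperType": "RecordIO",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "AugmentedManifestFile",
                "S3Uri": "s3://amzn-s3-demo-bucket/train.manifest",
                "S3DataDistributionType": "FullyReplicated",
                "AttributeNames": ["source-ref", "annotations"],
            }
        },
    },
    {
        "ChannelName": "validation",
        "InputMode": "Pipe",
        "RecordWrapperType": "RecordIO",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "AugmentedManifestFile",
                "S3Uri": "s3://amzn-s3-demo-bucket/validation.manifest",
                "S3DataDistributionType": "FullyReplicated",
                "AttributeNames": ["source-ref", "annotations"],
            }
        },
    },
]
```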

Attribute names with a `"-ref"` suffix point to preformatted binary data. In some cases, the algorithm knows how to parse the data. In other cases, you might need to wrap the data so that records are delimited for the algorithm. If the algorithm is compatible with [RecordIO-formatted data](https://mxnet.apache.org/api/architecture/note_data_loading#data-format), specifying `RecordIO` for `RecordWrapperType` solves this issue. If the algorithm is not compatible with `RecordIO` format, specify `None` for `RecordWrapperType` and make sure that your data is parsed correctly for your algorithm.

Using the `["image-ref", "is-a-cat"]` example, if you use RecordIO wrapping, the following stream of data is sent to the queue:

`recordio_formatted(s3://amzn-s3-demo-bucket/foo/image1.jpg)recordio_formatted("1")recordio_formatted(s3://amzn-s3-demo-bucket/bar/image2.jpg)recordio_formatted("0")`

Images that are not wrapped in RecordIO format are streamed with the corresponding `is-a-cat` attribute value as one record. This can cause a problem because the algorithm might not delimit the images and attributes correctly. For more information about using augmented manifest files for image classification, see [Train with Augmented Manifest Image Format](https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html#IC-augmented-manifest-training).
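To see why wrapping matters, consider a simplified, length-prefixed framing scheme in the spirit of RecordIO (this sketch is illustrative only, not the actual RecordIO wire format): without the length prefix, a reader of the raw byte stream has no way to tell where one record ends and the next begins.

```python
import struct

def frame(record: bytes) -> bytes:
    # Prefix each record with its 4-byte little-endian length so a
    # reader can delimit records within a continuous byte stream.
    return struct.pack("<I", len(record)) + record

def read_records(stream: bytes):
    # Walk the stream, recovering one record per length prefix.
    records, offset = [], 0
    while offset < len(stream):
        (length,) = struct.unpack_from("<I", stream, offset)
        offset += 4
        records.append(stream[offset:offset + length])
        offset += length
    return records
```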

With augmented manifest files and Pipe mode in general, size limits of the EBS volume do not apply. This includes settings that otherwise must be within the EBS volume size limit, such as [S3DataDistributionType](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html#SageMaker-Type-S3DataSource-S3DataDistributionType). For more information about Pipe mode and how to use it, see [Using Your Own Training Algorithms - Input Data Configuration](your-algorithms-training-algo.html#your-algorithms-training-algo-running-container-inputdataconfig).

## Use an Augmented Manifest File
<a name="augmented-manifest-create"></a>

The following sections show you how to use augmented manifest files in your Amazon SageMaker training jobs, either with the SageMaker AI console or programmatically using the SageMaker Python SDK.

### Use an Augmented Manifest File (Console)
<a name="augmented-manifest-console"></a>

To complete this procedure, you need:
+ The URL of the S3 bucket where you've stored the augmented manifest file.
+ The data that is listed in the augmented manifest file stored in an S3 bucket.
+ The URL of the S3 bucket where you want to store the output of the job.

**To use an augmented manifest file in a training job (console)**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the navigation pane, choose **Training**, then choose **Training jobs**. 

1. Choose **Create training job**.

1. Provide a name for the training job. The name must be unique within an AWS Region in an AWS account. It can have 1 to 63 characters. Valid characters: a-z, A-Z, 0-9, and . : + = @ _ % - (hyphen).

1. Choose the algorithm that you want to use. For information about supported built-in algorithms, see [Built-in algorithms and pretrained models in Amazon SageMaker](algos.md). If you want to use a custom algorithm, make sure that it is compatible with Pipe mode.

1. (Optional) For **Resource configuration**, either accept the default values or, to reduce computation time, increase the resource consumption.

   1. (Optional) For **Instance type**, choose the ML compute instance type that you want to use. In most cases, **ml.m4.xlarge** is sufficient. 

   1. For **Instance count**, use the default, `1`.

   1. (Optional) For **Additional volume per instance (GB)**, choose the size of the ML storage volume that you want to provision. In most cases, you can use the default, `1`. If you are using a large dataset, use a larger size.

1. Provide information about the input data for the training dataset.

   1. For **Channel name**, either accept the default (**train**) or enter a more meaningful name, such as **training-augmented-manifest-file**.

   1. For **InputMode**, choose **Pipe**.

   1. For **S3 data distribution type**, choose **FullyReplicated**. When training incrementally, fully replicating causes each ML compute instance to use a complete copy of the expanded dataset. For neural-based algorithms, such as [Neural Topic Model (NTM) Algorithm](ntm.md), choose `ShardedByS3Key`.

   1. If the data specified in the augmented manifest file is uncompressed, set the **Compression type** to **None**. If the data is compressed using gzip, set it to **Gzip**.

   1. (Optional) For **Content type**, specify the appropriate MIME type. Content type is the multipurpose internet mail extension (MIME) type of the data.

   1. For **Record wrapper**, if the dataset specified in the augmented manifest file is saved in RecordIO format, choose **RecordIO**. If your dataset is not saved as a RecordIO-formatted file, choose **None**.

   1. For **S3 data type**, choose **AugmentedManifestFile**.

   1. For **S3 location**, provide the path to the bucket where you stored the augmented manifest file.

   1. For **AugmentedManifestFile attribute names**, specify the name of an attribute that you want to use. The attribute name must be present within the augmented manifest file, and is case-sensitive.

   1. (Optional) To add more attribute names, choose **Add row** and specify another attribute name for each attribute.

   1. (Optional) To adjust the order of attribute names, choose the up or down buttons next to the names. When using an augmented manifest file, the order of the specified attribute names is important.

   1. Choose **Done**.

1. For **Output data configuration**, provide the following information:

   1. For **S3 location**, type the path to the S3 bucket where you want to store the output data.

   1. (Optional) You can use your AWS Key Management Service (AWS KMS) encryption key to encrypt the output data at rest. For **Encryption key**, provide the key ID or its Amazon Resource Number (ARN). For more information, see [KMS-Managed Encryption Keys](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html).

1. (Optional) For **Tags**, add one or more tags to the training job. A *tag* is metadata that you can define and assign to AWS resources. In this case, you can use tags to help you manage your training jobs. A tag consists of a key and a value, which you define. For example, you might want to create a tag with **Project** as a key and a value that refers to a project that is related to the training job, such as **Home value forecasts**.

1. Choose **Create training job**. SageMaker AI creates and runs the training job.

After the training job has finished, SageMaker AI stores the model artifacts in the bucket whose path you provided for **S3 output path** in the **Output data configuration** field. To deploy the model to get predictions, see [Deploy the model to Amazon EC2](ex1-model-deployment.md).

### Use an Augmented Manifest File (API)
<a name="augmented-manifest-api"></a>

The following shows how to train a model with an augmented manifest file using the SageMaker AI high-level Python library:

```
import sagemaker

# Create an estimator configured to use "Pipe" input mode.
model = sagemaker.estimator.Estimator(
    training_image,
    role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    volume_size=50,
    max_run=360000,
    input_mode='Pipe',
    output_path=s3_output_location,
    sagemaker_session=session
)

# Create a train data channel with s3_data_type set to 'AugmentedManifestFile' and attribute names.
train_data = sagemaker.inputs.TrainingInput(
    your_augmented_manifest_file,
    distribution='FullyReplicated',
    content_type='application/x-recordio',
    s3_data_type='AugmentedManifestFile',
    attribute_names=['source-ref', 'annotations'],
    input_mode='Pipe',
    record_wrapping='RecordIO'
)

data_channels = {'train': train_data}

# Train the model.
model.fit(inputs=data_channels, logs=True)
```

After the training job has finished, SageMaker AI stores the model artifacts in the bucket whose path you provided for **S3 output path** in the **Output data configuration** field. To deploy the model to get predictions, see [Deploy the model to Amazon EC2](ex1-model-deployment.md).

# Checkpoints in Amazon SageMaker AI
<a name="model-checkpoints"></a>

Use checkpoints in Amazon SageMaker AI to save the state of machine learning (ML) models during training. Checkpoints are snapshots of the model and can be configured by the callback functions of ML frameworks. You can use the saved checkpoints to restart a training job from the last saved checkpoint. 

Using checkpoints, you can do the following:
+ Save snapshots of your model during training, so that progress is preserved if the training job or instance is unexpectedly interrupted.
+ Resume training the model in the future from a checkpoint.
+ Analyze the model at intermediate stages of training.
+ Use checkpoints with S3 Express One Zone for increased access speeds.
+ Use checkpoints with SageMaker AI managed spot training to save on training costs.

The SageMaker training mechanism uses training containers on Amazon EC2 instances, and the checkpoint files are saved under a local directory of the containers (the default is `/opt/ml/checkpoints`). SageMaker AI provides the functionality to copy the checkpoints from the local path to Amazon S3 and automatically syncs the checkpoints in that directory with S3. Existing checkpoints in S3 are written to the SageMaker AI container at the start of the job, enabling jobs to resume from a checkpoint. Checkpoints added to the S3 folder after the job has started are not copied to the training container. SageMaker AI also writes new checkpoints from the container to S3 during training. If a checkpoint is deleted in the SageMaker AI container, it will also be deleted in the S3 folder.

You can use checkpoints in Amazon SageMaker AI with the Amazon S3 Express One Zone storage class (S3 Express One Zone) for faster access to checkpoints. When you enable checkpointing and specify the S3 URI for your checkpoint storage destination, you can provide an S3 URI for a folder in either an S3 general purpose bucket or an S3 directory bucket. S3 directory buckets that are integrated with SageMaker AI can only be encrypted with server-side encryption with Amazon S3 managed keys (SSE-S3). Server-side encryption with AWS KMS keys (SSE-KMS) is not currently supported. For more information on S3 Express One Zone and S3 directory buckets, see [What is S3 Express One Zone](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-express-one-zone.html).

If you are using checkpoints with SageMaker AI managed spot training, SageMaker AI manages checkpointing your model training on a spot instance and resuming the training job on the next spot instance. With SageMaker AI managed spot training, you can significantly reduce the billable time for training ML models. For more information, see [Managed Spot Training in Amazon SageMaker AI](model-managed-spot-training.md).

**Topics**
+ [Checkpoints for frameworks and algorithms in SageMaker AI](#model-checkpoints-whats-supported)
+ [Considerations for checkpointing](#model-checkpoints-considerations)
+ [Enable checkpointing](model-checkpoints-enable.md)
+ [Browse checkpoint files](model-checkpoints-saved-file.md)
+ [Resume training from a checkpoint](model-checkpoints-resume.md)
+ [Cluster repairs for GPU errors](model-checkpoints-cluster-repair.md)

## Checkpoints for frameworks and algorithms in SageMaker AI
<a name="model-checkpoints-whats-supported"></a>

Use checkpoints to save snapshots of ML models built on your preferred frameworks within SageMaker AI.

**SageMaker AI frameworks and algorithms that support checkpointing**

SageMaker AI supports checkpointing for AWS Deep Learning Containers and a subset of built-in algorithms without requiring training script changes. SageMaker AI saves the checkpoints to the default local path `'/opt/ml/checkpoints'` and copies them to Amazon S3. 
+ Deep Learning Containers: [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html), [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/sagemaker.pytorch.html), [MXNet](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/sagemaker.mxnet.html), and [HuggingFace](https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html)
**Note**  
If you are using the HuggingFace framework estimator, you need to specify a checkpoint output path through hyperparameters. For more information, see [Run training on Amazon SageMaker AI](https://huggingface.co/docs/sagemaker/train) in the *HuggingFace documentation*.
+ Built-in algorithms: [Image Classification](https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html), [Object Detection](https://docs.aws.amazon.com/sagemaker/latest/dg/object-detection.html), [Semantic Segmentation](https://docs.aws.amazon.com/sagemaker/latest/dg/semantic-segmentation.html), and [XGBoost](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) (0.90-1 or later)
**Note**  
If you are using the XGBoost algorithm in framework mode (script mode), you need to bring an XGBoost training script with checkpointing that's manually configured. For more information about the XGBoost training methods to save model snapshots, see [Training XGBoost](https://xgboost.readthedocs.io/en/latest/python/python_intro.html#training) in *the XGBoost Python SDK documentation*.

If a built-in algorithm that does not support checkpointing is used in a managed spot training job, SageMaker AI does not allow a maximum wait time greater than an hour for the job, in order to limit wasted training time from interruptions.

**For custom training containers and other frameworks**

If you are using your own training containers, training scripts, or other frameworks not listed in the previous section, you must properly set up your training script using callbacks or training APIs to save checkpoints to the local path (`'/opt/ml/checkpoints'`) and load from the local path in your training script. SageMaker AI estimators can sync up with the local path and save the checkpoints to Amazon S3.
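For example, a framework-agnostic training script might persist and restore its state with helpers like the following sketch. The file-name pattern and JSON state format are illustrative; in practice you would use your framework's own save and load functions:

```python
import json
import os

def save_checkpoint(state, step, checkpoint_dir="/opt/ml/checkpoints"):
    # SageMaker AI syncs files written under checkpoint_dir to the
    # configured S3 checkpoint location while training runs.
    os.makedirs(checkpoint_dir, exist_ok=True)
    # Zero-padded step numbers keep lexicographic order == numeric order.
    path = os.path.join(checkpoint_dir, f"checkpoint-{step:06d}.json")
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    return path

def load_latest_checkpoint(checkpoint_dir="/opt/ml/checkpoints"):
    # On job start, SageMaker AI restores existing S3 checkpoints to
    # checkpoint_dir; resume from the highest-numbered one, if any.
    if not os.path.isdir(checkpoint_dir):
        return None
    files = sorted(
        f for f in os.listdir(checkpoint_dir)
        if f.startswith("checkpoint-") and f.endswith(".json")
    )
    if not files:
        return None
    with open(os.path.join(checkpoint_dir, files[-1])) as f:
        return json.load(f)
```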

## Considerations for checkpointing
<a name="model-checkpoints-considerations"></a>

Consider the following when using checkpoints in SageMaker AI.
+ To avoid overwrites in distributed training with multiple instances, you must manually configure the checkpoint file names and paths in your training script. The high-level SageMaker AI checkpoint configuration specifies a single Amazon S3 location without additional suffixes or prefixes to tag checkpoints from multiple instances.
+ The SageMaker Python SDK does not support high-level configuration for checkpointing frequency. To control the checkpointing frequency, modify your training script using the framework's model save functions or checkpoint callbacks.
+ If you use SageMaker AI checkpoints with SageMaker Debugger and SageMaker AI distributed and are facing issues, see the following pages for troubleshooting and considerations.
  + [Distributed training supported by Amazon SageMaker Debugger](debugger-reference.md#debugger-considerations)
  + [Troubleshooting for distributed training in Amazon SageMaker AI](distributed-troubleshooting-data-parallel.md)
  + [Model Parallel Troubleshooting](distributed-troubleshooting-model-parallel.md)
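One common way to avoid such overwrites is to embed a per-instance identifier in each checkpoint file name. The following sketch uses the `SM_CURRENT_HOST` environment variable that SageMaker AI sets in training containers; the fallback value and naming pattern are illustrative:

```python
import os

def checkpoint_path(step, checkpoint_dir="/opt/ml/checkpoints"):
    # Each instance has a distinct host name (for example, "algo-1"),
    # so files from different instances never collide in the shared
    # S3 checkpoint location.
    host = os.environ.get("SM_CURRENT_HOST", "algo-1")
    return os.path.join(checkpoint_dir, f"{host}-checkpoint-{step:06d}.ckpt")
```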

# Enable checkpointing
<a name="model-checkpoints-enable"></a>

After you enable checkpointing, SageMaker AI saves checkpoints to Amazon S3 and syncs your training job with the checkpoint S3 bucket. You can use either S3 general purpose or S3 directory buckets for your checkpoint S3 bucket. 

![\[Architecture diagram of writing checkpoints during training.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/checkpoints_write.png)


The following example shows how to configure checkpoint paths when you construct a SageMaker AI estimator. To enable checkpointing, add the `checkpoint_s3_uri` and `checkpoint_local_path` parameters to your estimator. 

The following example template shows how to create a generic SageMaker AI estimator and enable checkpointing. You can use this template for the supported algorithms by specifying the `image_uri` parameter. To find Docker image URIs for algorithms with checkpointing supported by SageMaker AI, see [Docker Registry Paths and Example Code](https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths). You can also replace `estimator` and `Estimator` with other SageMaker AI frameworks' estimator parent classes and estimator classes, such as [TensorFlow](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#create-an-estimator), [PyTorch](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#create-an-estimator), [MXNet](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html#create-an-estimator), [HuggingFace](https://huggingface.co/docs/sagemaker/train#create-a-hugging-face-estimator), and [XGBoost](https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html#create-an-estimator).

```
import sagemaker
from sagemaker.estimator import Estimator

bucket=sagemaker.Session().default_bucket()
base_job_name="sagemaker-checkpoint-test"
checkpoint_in_bucket="checkpoints"

# The S3 URI to store the checkpoints
checkpoint_s3_bucket="s3://{}/{}/{}".format(bucket, base_job_name, checkpoint_in_bucket)

# The local path where the model will save its checkpoints in the training container
checkpoint_local_path="/opt/ml/checkpoints"

estimator = Estimator(
    ...
    image_uri="<ecr_path>/<algorithm-name>:<tag>",  # Specify to use a built-in algorithm
    output_path="s3://{}".format(bucket),
    base_job_name=base_job_name,

    # Parameters required to enable checkpointing
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path=checkpoint_local_path
)
```

The following two parameters specify paths for checkpointing:
+ `checkpoint_local_path` – Specify the local path where the model saves the checkpoints periodically in a training container. The default path is set to `'/opt/ml/checkpoints'`. If you are using other frameworks or bringing your own training container, ensure that your training script's checkpoint configuration specifies the path to `'/opt/ml/checkpoints'`.
**Note**  
We recommend specifying the local paths as `'/opt/ml/checkpoints'` to be consistent with the default SageMaker AI checkpoint settings. If you prefer to specify your own local path, make sure you match the checkpoint saving path in your training script and the `checkpoint_local_path` parameter of the SageMaker AI estimators.
+ `checkpoint_s3_uri` – The URI to an S3 bucket where the checkpoints are stored in real time. You can specify either an S3 general purpose or S3 directory bucket to store your checkpoints. For more information on S3 directory buckets, see [Directory buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/directory-buckets-overview.html) in the *Amazon Simple Storage Service User Guide*. 

To find a complete list of SageMaker AI estimator parameters, see the [Estimator API](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) in the *[Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) documentation*.

# Browse checkpoint files
<a name="model-checkpoints-saved-file"></a>

Locate checkpoint files using the SageMaker Python SDK and the Amazon S3 console.

**To find the checkpoint files programmatically**

To retrieve the S3 bucket URI where the checkpoints are saved, check the following estimator attribute:

```
estimator.checkpoint_s3_uri
```

This returns the S3 output path for checkpoints that you configured in the `CreateTrainingJob` request. To find the saved checkpoint files using the Amazon S3 console, use the following procedure.
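From that URI, you might list the individual checkpoint objects with a sketch like the following. The parsing helper is split out so it can run without AWS credentials; the example URI is a placeholder, and the keys returned depend on your account:

```python
from urllib.parse import urlparse

def parse_s3_uri(uri):
    # Split "s3://bucket/prefix/..." into (bucket, prefix).
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")

def list_checkpoints(checkpoint_s3_uri):
    # boto3 is imported lazily so parse_s3_uri stays usable offline.
    import boto3
    bucket, prefix = parse_s3_uri(checkpoint_s3_uri)
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in resp.get("Contents", [])]
```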

**To find the checkpoint files from the S3 console**

1. Sign in to the AWS Management Console and open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Training jobs**.

1. Choose the link to the training job with checkpointing enabled to open **Job settings**.

1. On the **Job settings** page of the training job, locate the **Checkpoint configuration** section.  
![\[Checkpoint configuration section in the Job settings page of a training job.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/checkpoints_trainingjob.png)

1. Use the link to the S3 bucket to access the checkpoint files.

# Resume training from a checkpoint
<a name="model-checkpoints-resume"></a>

To resume a training job from a checkpoint, run a new estimator with the same `checkpoint_s3_uri` that you created in the [Enable checkpointing](model-checkpoints-enable.md) section. Once the training has resumed, the checkpoints from this S3 bucket are restored to `checkpoint_local_path` in each instance of the new training job. Ensure that the S3 bucket is in the same Region as that of the current SageMaker AI session.

![\[Architecture diagram of syncing checkpoints to resume training.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/checkpoints_resume.png)


# Cluster repairs for GPU errors
<a name="model-checkpoints-cluster-repair"></a>

If you are running a training job that fails on a GPU, SageMaker AI will run a GPU health check to see whether the failure is related to a GPU issue. SageMaker AI takes the following actions based on the health check results:
+ If the error is recoverable, and can be fixed by rebooting the instance or resetting the GPU, SageMaker AI will reboot the instance.
+ If the error is not recoverable, and caused by a GPU that needs to be replaced, SageMaker AI will replace the instance.

The instance is either replaced or rebooted as part of a SageMaker AI cluster repair process. During this process, you will see the following message in your training job status:

`Repairing training cluster due to hardware failure`

SageMaker AI attempts to repair the cluster up to `10` times. If the cluster repair succeeds, SageMaker AI automatically restarts the training job from the previous checkpoint. If the cluster repair fails, the training job also fails. You are not billed for the cluster repair process. Cluster repairs do not initiate unless your training job fails. If a GPU issue is detected for a warm pool cluster, the cluster enters repair mode to either reboot or replace the faulty instance. After repair, the cluster can still be used as a warm pool cluster.

The previously described cluster and instance repair process is depicted in the following diagram:

![\[The cluster and instance repair process.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/training-cluster-repair.png)