

# Amazon SageMaker Debugger
<a name="train-debugger"></a>

Debug model output tensors from machine learning training jobs in real time and detect non-converging issues using Amazon SageMaker Debugger.

## Amazon SageMaker Debugger features
<a name="debugger-features"></a>

A machine learning (ML) training job can have problems such as overfitting, saturated activation functions, and vanishing gradients, which can compromise model performance.

SageMaker Debugger provides tools to debug training jobs and resolve such problems to improve the performance of your model. Debugger also offers tools to send alerts when training anomalies are found, take actions against the problems, and identify the root cause of them by visualizing collected metrics and tensors.

SageMaker Debugger supports the Apache MXNet, PyTorch, TensorFlow, and XGBoost frameworks. For more information about available frameworks and versions supported by SageMaker Debugger, see [Supported frameworks and algorithms](debugger-supported-frameworks.md).

![\[Overview of how Amazon SageMaker Debugger works.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-main.png)


The high-level Debugger workflow is as follows:

1. Modify your training script with the `sagemaker-debugger` Python SDK if needed.

1. Configure a SageMaker training job with SageMaker Debugger.
   + Configure using the SageMaker AI Estimator API (for Python SDK).
   + Configure using the SageMaker AI [`CreateTrainingJob` request (for Boto3 or CLI)](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-createtrainingjob-api.html).
   + Configure [custom training containers](debugger-bring-your-own-container.md) with SageMaker Debugger.

1. Start a training job and monitor training issues in real time.
   + [List of Debugger built-in rules](debugger-built-in-rules.md).

1. Get alerts and take prompt actions against the training issues.
   + Receive text messages and emails, and stop training jobs automatically when training issues are found, using [Debugger built-in actions for rules](debugger-built-in-actions.md).
   + Set up your own actions using [Amazon CloudWatch Events and AWS Lambda](debugger-cloudwatch-lambda.md).

1. Explore deep analysis of the training issues.
   + For debugging model output tensors, see [Visualize Debugger Output Tensors in TensorBoard](debugger-enable-tensorboard-summaries.md).

1. Fix the issues, consider the suggestions provided by Debugger, and repeat steps 1–5 until you optimize your model and achieve target accuracy.
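Steps 2–4 above can be sketched with the SageMaker Python SDK. The following is a minimal, hypothetical example; the entry point script, S3 path, instance type, and framework versions are placeholders, while `DebuggerHookConfig`, `Rule.sagemaker`, and the `rule_configs` helpers are real `sagemaker.debugger` APIs:

```python
import sagemaker
from sagemaker.debugger import (
    CollectionConfig, DebuggerHookConfig, Rule, rule_configs,
)
from sagemaker.pytorch import PyTorch

# Step 4: stop the job automatically when the rule triggers.
actions = rule_configs.ActionList(rule_configs.StopTraining())

estimator = PyTorch(
    entry_point="train.py",                      # placeholder training script
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.p3.2xlarge",               # placeholder instance type
    framework_version="1.12",
    py_version="py38",
    # Step 2: tell Debugger which tensors to save, where, and how often.
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path="s3://your-bucket/debugger-output",  # placeholder bucket
        collection_configs=[
            CollectionConfig(name="losses", parameters={"save_interval": "50"}),
        ],
    ),
    # Step 3: monitor the job in real time with a built-in rule.
    rules=[Rule.sagemaker(rule_configs.loss_not_decreasing(), actions=actions)],
)
estimator.fit()  # starts the training job and the rule evaluation
```

The rule runs on a separate processing container, so it does not slow down the training instances.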

The SageMaker Debugger developer guide walks you through the following topics.

**Topics**
+ [Amazon SageMaker Debugger features](#debugger-features)
+ [Supported frameworks and algorithms](debugger-supported-frameworks.md)
+ [Amazon SageMaker Debugger architecture](debugger-how-it-works.md)
+ [Debugger tutorials](debugger-tutorial.md)
+ [Debugging training jobs using Amazon SageMaker Debugger](debugger-debug-training-jobs.md)
+ [List of Debugger built-in rules](debugger-built-in-rules.md)
+ [Creating custom rules using the Debugger client library](debugger-custom-rules.md)
+ [Use Debugger with custom training containers](debugger-bring-your-own-container.md)
+ [Configure Debugger using SageMaker API](debugger-createtrainingjob-api.md)
+ [Amazon SageMaker Debugger references](debugger-reference.md)

# Supported frameworks and algorithms
<a name="debugger-supported-frameworks"></a>

The following table shows SageMaker AI machine learning frameworks and algorithms supported by Debugger. 


| **SageMaker AI-supported frameworks and algorithms** | **Debugging output tensors** |
| --- | --- |
|  [TensorFlow](https://sagemaker.readthedocs.io/en/stable/using_tf.html)   |  [AWS TensorFlow deep learning containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers) 1.15.4 or later  | 
|  [PyTorch](https://sagemaker.readthedocs.io/en/stable/using_pytorch.html)  |  [AWS PyTorch deep learning containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers) 1.5.0 or later  | 
|  [MXNet](https://sagemaker.readthedocs.io/en/stable/using_mxnet.html)   |  [AWS MXNet deep learning containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#general-framework-containers) 1.6.0 or later  | 
|  [XGBoost](https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html)  |  1.0-1, 1.2-1, 1.3-1  | 
|  [SageMaker AI generic estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html)  |  [Custom training containers](debugger-bring-your-own-container.md) (available for TensorFlow, PyTorch, MXNet, and XGBoost with manual hook registration)  | 
+ **Debugging output tensors** – Track and debug model parameters, such as weights, gradients, biases, and scalar values of your training job. Available deep learning frameworks are Apache MXNet, TensorFlow, PyTorch, and XGBoost.
**Important**  
For the TensorFlow framework with Keras, SageMaker Debugger deprecates the zero code change support for debugging models built using the `tf.keras` modules of TensorFlow 2.6 and later. This is due to breaking changes announced in the [TensorFlow 2.6.0 release note](https://github.com/tensorflow/tensorflow/releases/tag/v2.6.0). For instructions on how to update your training script, see [Adapt your TensorFlow training script](debugger-modify-script-tensorflow.md).
**Important**  
Starting with PyTorch v1.12.0, SageMaker Debugger deprecates the zero code change support for debugging models.  
This is due to breaking changes that cause SageMaker Debugger to interfere with the `torch.jit` functionality. For instructions on how to update your training script, see [Adapt your PyTorch training script](debugger-modify-script-pytorch.md).
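After the deprecations above, you register the hook explicitly in your training script. The following is a minimal sketch for a Keras model using the `sagemaker-debugger` (`smdebug`) library; the output directory, model, data, and save interval are placeholders:

```python
import numpy as np
import tensorflow as tf
import smdebug.tensorflow as smd

# Placeholder output path; save tensors every 100 steps.
hook = smd.KerasHook(
    out_dir="/tmp/smdebug-output",
    save_config=smd.SaveConfig(save_interval=100),
    include_collections=["weights", "gradients", "losses"],
)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
# Wrapping the optimizer lets the hook capture gradients.
model.compile(optimizer=hook.wrap_optimizer(tf.keras.optimizers.Adam()), loss="mse")

# Pass the hook as a Keras callback so it can capture tensors during training.
x, y = np.random.rand(64, 10), np.random.rand(64, 1)
model.fit(x, y, epochs=1, callbacks=[hook])
```

See [Adapt your TensorFlow training script](debugger-modify-script-tensorflow.md) for the exact steps for your framework version.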

If the framework or algorithm that you want to train and debug is not listed in the table, go to the [AWS Discussion Forum](https://forums.aws.amazon.com/) and leave feedback on SageMaker Debugger.

## AWS Regions
<a name="debugger-support-aws-regions"></a>

Amazon SageMaker Debugger is available in all regions where Amazon SageMaker AI is in service except the following region.
+ Asia Pacific (Jakarta): `ap-southeast-3`

To find out whether Amazon SageMaker AI is in service in your AWS Region, see [AWS Regional Services](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/).

## Use Debugger with Custom Training Containers
<a name="debugger-byoc-intro"></a>

Bring your own training containers to SageMaker AI and gain insights into your training jobs using Debugger. Maximize your work efficiency by optimizing your model on Amazon EC2 instances using Debugger's monitoring and debugging features.

For more information about how to build your training container with the `sagemaker-debugger` client library, push it to the Amazon Elastic Container Registry (Amazon ECR), and monitor and debug, see [Use Debugger with custom training containers](debugger-bring-your-own-container.md).
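Inside a custom container, the typical pattern is to let the hook read the JSON configuration that SageMaker injects at runtime. A minimal PyTorch-flavored sketch (the model and loss function are placeholders):

```python
import torch.nn as nn
import smdebug.pytorch as smd

model = nn.Linear(10, 2)           # placeholder model
loss_fn = nn.CrossEntropyLoss()    # placeholder loss function

# SageMaker writes a Debugger JSON config into the container at launch;
# create_from_json_file() builds the hook from that config.
hook = smd.Hook.create_from_json_file()
hook.register_module(model)   # capture weights and gradients
hook.register_loss(loss_fn)   # capture loss values
```

Outside of a SageMaker training job, no JSON config exists, so this sketch only runs inside the container launched by a training job.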

## Debugger Open-Source GitHub Repositories
<a name="debugger-opensource"></a>

Debugger APIs are provided through the SageMaker Python SDK and designed to construct Debugger hook and rule configurations for the SageMaker AI [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) and [DescribeTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html) API operations. The `sagemaker-debugger` client library provides API operations to register *hooks* and access the training data through its *trial* feature. It supports the machine learning frameworks TensorFlow, PyTorch, MXNet, and XGBoost on Python 3.6 and later. 
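For example, once a training job has saved tensors, the *trial* feature gives you read access to them. A sketch with `smdebug` (the S3 path and tensor name are placeholders that depend on your job):

```python
from smdebug.trials import create_trial

# Point the trial at the S3 (or local) path where the hook saved tensors.
trial = create_trial("s3://your-bucket/debugger-output")  # placeholder path

print(trial.tensor_names(collection="losses"))  # list saved loss tensors

# Inspect one tensor across training steps; "loss" is a hypothetical name.
loss = trial.tensor("loss")
for step in loss.steps():
    print(step, loss.value(step))
```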

For direct resources about the Debugger and `sagemaker-debugger` API operations, see the following links: 
+ [The Amazon SageMaker Python SDK documentation](https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_debugger.html)
+ [The Amazon SageMaker Python SDK - Debugger APIs](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html)
+ [The `sagemaker-debugger` Python SDK documentation](https://sagemaker-debugger.readthedocs.io/en/website/index.html) for [the Amazon SageMaker Debugger open source client library](https://github.com/awslabs/sagemaker-debugger#amazon-sagemaker-debugger)
+ [The `sagemaker-debugger` PyPI](https://pypi.org/project/smdebug/)

If you use the SDK for Java to conduct SageMaker training jobs and want to configure Debugger APIs, see the following references:
+ [Amazon SageMaker Debugger APIs](debugger-reference.md#debugger-apis)
+ [Configure Debugger using SageMaker API](debugger-createtrainingjob-api.md)

# Amazon SageMaker Debugger architecture
<a name="debugger-how-it-works"></a>

This topic walks you through a high-level overview of the Amazon SageMaker Debugger workflow.

Debugger supports profiling functionality for *performance optimization* to identify computation issues, such as system bottlenecks and underutilization, and to help optimize hardware resource utilization at scale. 

Debugger's debugging functionality for *model optimization* is about analyzing non-converging training issues that can arise while minimizing the loss functions using optimization algorithms, such as gradient descent and its variations. 

The following diagram shows the architecture of SageMaker Debugger. The components with bold boundary lines are managed by Debugger to analyze your training job. 

![\[Overview of how Amazon SageMaker Debugger works.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger_new_diagram.png)


Debugger stores the following data from your training jobs in your secured Amazon S3 bucket:
+ **Output tensors** – Collections of scalars and model parameters that are continuously updated during the forward and backward passes while training ML models. The output tensors include scalar values (accuracy and loss) and matrices (weights, gradients, input layers, and output layers).
**Note**  
By default, Debugger monitors and debugs SageMaker training jobs without any Debugger-specific parameters configured in SageMaker AI estimators. Debugger collects system metrics every 500 milliseconds and basic output tensors (scalar outputs such as loss and accuracy) every 500 steps. It also runs the `ProfilerReport` rule to analyze the system metrics and generate the Studio Debugger insights dashboard and a profiling report. Debugger saves the output data in your secured Amazon S3 bucket.

The Debugger built-in rules run on processing containers, which are designed to evaluate machine learning models by processing the training data collected in your S3 bucket (see [Process Data and Evaluate Models](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html)). The built-in rules are fully managed by Debugger. You can also create your own rules customized to your model to watch for any issues you want to monitor. 
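A custom rule is a class that subclasses `smdebug.rules.rule.Rule` and implements `invoke_at_step`, returning `True` when the condition is met. A hypothetical sketch that flags vanishing gradients (the class name, threshold, and tensor regex are illustrative):

```python
from smdebug.rules.rule import Rule

class GradientsTooSmall(Rule):
    """Hypothetical custom rule: fire when mean absolute gradients vanish."""

    def __init__(self, base_trial, threshold=1e-8):
        super().__init__(base_trial)
        self.threshold = float(threshold)

    def invoke_at_step(self, step):
        # Check every saved tensor whose name matches "gradient" at this step.
        for name in self.base_trial.tensor_names(regex="gradient"):
            mean_abs = self.base_trial.tensor(name).reduction_value(
                step, "mean", abs=True
            )
            if mean_abs < self.threshold:
                return True   # condition met; Debugger records the rule as triggered
        return False
```

For details on packaging and launching custom rules, see [Creating custom rules using the Debugger client library](debugger-custom-rules.md).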

# Debugger tutorials
<a name="debugger-tutorial"></a>

The following topics walk you through tutorials from the basics to advanced use cases of monitoring, profiling, and debugging SageMaker training jobs using Debugger. Explore the Debugger features and learn how you can debug and improve your machine learning models efficiently by using Debugger.

**Topics**
+ [Debugger tutorial videos](debugger-videos.md)
+ [Debugger example notebooks](debugger-notebooks.md)
+ [Debugger advanced demos and visualization](debugger-visualization.md)

# Debugger tutorial videos
<a name="debugger-videos"></a>

The following videos provide a tour of Amazon SageMaker Debugger capabilities using SageMaker Studio and SageMaker AI notebook instances. 

**Topics**
+ [Debugging models with Amazon SageMaker Debugger in Studio Classic](#debugger-video-get-started)
+ [Deep dive on Amazon SageMaker Debugger and SageMaker AI model monitor](#debugger-video-dive-deep)

## Debugging models with Amazon SageMaker Debugger in Studio Classic
<a name="debugger-video-get-started"></a>

*Julien Simon, AWS Technical Evangelist | Length: 14 minutes 17 seconds*

This tutorial video demonstrates how to use Amazon SageMaker Debugger to capture and inspect debugging information from a training model. The example training model used in this video is a simple convolutional neural network (CNN) based on Keras with the TensorFlow backend. With the SageMaker AI TensorFlow framework and Debugger, you can build an estimator directly from the training script and debug the training job.

[![AWS Videos](http://img.youtube.com/vi/MqPdTj0Znwg/0.jpg)](http://www.youtube.com/watch?v=MqPdTj0Znwg)


You can find the example notebook in the video in [this Studio Demo repository](https://gitlab.com/juliensimon/amazon-studio-demos/-/tree/master) provided by the author. You need to clone the `debugger.ipynb` notebook file and the `mnist_keras_tf.py` training script to your SageMaker Studio or a SageMaker notebook instance. After you clone the two files, specify the path `keras_script_path` to the `mnist_keras_tf.py` file inside the `debugger.ipynb` notebook. For example, if you cloned the two files in the same directory, set it as `keras_script_path = "mnist_keras_tf.py"`.

## Deep dive on Amazon SageMaker Debugger and SageMaker AI model monitor
<a name="debugger-video-dive-deep"></a>

*Julien Simon, AWS Technical Evangelist | Length: 44 minutes 34 seconds*

This video session explores advanced features of Debugger and SageMaker Model Monitor that help boost productivity and the quality of your models. First, this video shows how to detect and fix training issues, visualize tensors, and improve models with Debugger. Next, at 22:41, the video shows how to monitor models in production and identify prediction issues such as missing features or data drift using SageMaker AI Model Monitor. Finally, it offers cost optimization tips to help you make the most of your machine learning budget.

[![AWS Videos](http://img.youtube.com/vi/0zqoeZxakOI/0.jpg)](http://www.youtube.com/watch?v=0zqoeZxakOI)


You can find the example notebook in the video in [this AWS Dev Days 2020 repository](https://gitlab.com/juliensimon/awsdevdays2020/-/tree/master/mls1) offered by the author.

# Debugger example notebooks
<a name="debugger-notebooks"></a>

[SageMaker Debugger example notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/) are provided in the [aws/amazon-sagemaker-examples](https://github.com/aws/amazon-sagemaker-examples) repository. The Debugger example notebooks walk you through basic to advanced use cases of debugging and profiling training jobs. 

We recommend that you run the example notebooks on SageMaker Studio or a SageMaker Notebook instance because most of the examples are designed for training jobs in the SageMaker AI ecosystem, including Amazon EC2, Amazon S3, and Amazon SageMaker Python SDK. 

To clone the example repository to SageMaker Studio, follow the instructions at [Amazon SageMaker Studio Tour](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-end-to-end.html).

**Important**  
To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the `SMDebug` client library. In your IPython kernel, Jupyter notebook, or JupyterLab environment, run the following code to install the latest versions of the libraries and restart the kernel.  

```
import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)
```

## Debugger example notebooks for profiling training jobs
<a name="debugger-notebooks-profiling"></a>

The following list shows Debugger example notebooks introducing Debugger's adaptability to monitor and profile training jobs for various machine learning models, datasets, and frameworks.


| Notebook Title | Framework | Model | Dataset | Description | 
| --- | --- | --- | --- | --- | 
|  [Amazon SageMaker Debugger Profiling Data Analysis](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/debugger_interactive_analysis_profiling/interactive_analysis_profiling_data.html)  |  TensorFlow  |  Keras ResNet50  | Cifar-10 |  This notebook provides an introduction to interactive analysis of profiled data captured by SageMaker Debugger. Explore the full functionality of the `SMDebug` interactive analysis tools.  | 
|  [Profile machine learning training with Amazon SageMaker Debugger ](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_nlp_sentiment_analysis/sentiment-analysis-tf-distributed-training-bringyourownscript.html)  |  TensorFlow  |  1-D Convolutional Neural Network  |  IMDB dataset  |  Profile a TensorFlow 1-D CNN for sentiment analysis of IMDB data that consists of movie reviews labeled as having positive or negative sentiment. Explore the Studio Debugger insights and Debugger profiling report.  | 
|  [Profiling TensorFlow ResNet model training with various distributed training settings](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow_profiling)  |  TensorFlow  | ResNet50 | Cifar-10 |  Run TensorFlow training jobs with various distributed training settings, monitor system resource utilization, and profile model performance using Debugger.  | 
|  [Profiling PyTorch ResNet model training with various distributed training settings](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/pytorch_profiling)   | PyTorch |  ResNet50  | Cifar-10 |  Run PyTorch training jobs with various distributed training settings, monitor system resource utilization, and profile model performance using Debugger.  | 

## Debugger example notebooks for analyzing model parameters
<a name="debugger-notebooks-debugging"></a>

The following list shows Debugger example notebooks introducing Debugger's adaptability to debug training jobs for various machine learning models, datasets, and frameworks.


| Notebook Title | Framework | Model | Dataset | Description | 
| --- | --- | --- | --- | --- | 
|  [Amazon SageMaker Debugger - Use built-in rule](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow_builtin_rule)  |  TensorFlow  |  Convolutional Neural Network  | MNIST |  Use the Amazon SageMaker Debugger built-in rules for debugging a TensorFlow model.  | 
|  [Amazon SageMaker Debugger - Tensorflow 2.1](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow2)  |  TensorFlow  |  ResNet50  | Cifar-10 |  Use the Amazon SageMaker Debugger hook configuration and built-in rules for debugging a model with the Tensorflow 2.1 framework.  | 
|  [Visualizing Debugging Tensors of MXNet training](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mnist_tensor_plot)  |  MXNet  |  Gluon Convolutional Neural Network  | Fashion MNIST |  Run a training job and configure SageMaker Debugger to store all tensors from this job, then visualize those tensors in a notebook.  | 
|  [Enable Spot Training with Amazon SageMaker Debugger](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mxnet_spot_training)   | MXNet |  Gluon Convolutional Neural Network  | Fashion MNIST |  Learn how Debugger collects tensor data from a training job on a spot instance, and how to use the Debugger built-in rules with managed spot training.  | 
| [Explain an XGBoost model that predicts an individual’s income with Amazon SageMaker Debugger](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/xgboost_census_explanations/xgboost-census-debugger-rules.html) | XGBoost |  XGBoost Regression  |  [Adult Census dataset](https://archive.ics.uci.edu/ml/datasets/adult)  | Learn how to use the Debugger hook and built-in rules for collecting and visualizing tensor data from an XGBoost regression model, such as loss values, features, and SHAP values. | 

To find advanced visualizations of model parameters and use cases, see the next topic at [Debugger advanced demos and visualization](debugger-visualization.md).

# Debugger advanced demos and visualization
<a name="debugger-visualization"></a>

The following demos walk you through advanced use cases and visualization scripts using Debugger.

**Topics**
+ [Training and pruning models with Amazon SageMaker Experiments and Debugger](#debugger-visualization-video-model-pruning)
+ [Using SageMaker Debugger to monitor a convolutional autoencoder model training](#debugger-visualization-autoencoder_mnist)
+ [Using SageMaker Debugger to monitor attentions in BERT model training](#debugger-visualization-bert_attention_head_view)
+ [Using SageMaker Debugger to visualize class activation maps in convolutional neural networks (CNNs)](#debugger-visualization-cnn_class_activation_maps)

## Training and pruning models with Amazon SageMaker Experiments and Debugger
<a name="debugger-visualization-video-model-pruning"></a>

*Dr. Nathalie Rauschmayr, AWS Applied Scientist | Length: 49 minutes 26 seconds*

[![AWS Videos](http://img.youtube.com/vi/Tnv6HsT1r4I/0.jpg)](http://www.youtube.com/watch?v=Tnv6HsT1r4I)


Find out how Amazon SageMaker Experiments and Debugger can simplify the management of your training jobs. Amazon SageMaker Debugger provides transparent visibility into training jobs and saves training metrics into your Amazon S3 bucket. SageMaker Experiments enables you to call the training information as *trials* through SageMaker Studio and supports visualization of the training job. This helps you keep model quality high while pruning less important parameters based on importance rank.

This video demonstrates a *model pruning* technique that makes pre-trained ResNet50 and AlexNet models lighter and more affordable while keeping high standards for model accuracy.

A SageMaker AI estimator trains these models, supplied from the PyTorch model zoo, in an AWS Deep Learning Container with the PyTorch framework, and Debugger extracts training metrics from the training process.

The video also demonstrates how to set up a Debugger custom rule to watch the accuracy of a pruned model, to trigger an Amazon CloudWatch event and an AWS Lambda function when the accuracy hits a threshold, and to automatically stop the pruning process to avoid redundant iterations. 

Learning objectives are as follows: 
+  Learn how to use SageMaker AI to accelerate ML model training and improve model quality. 
+  Understand how to manage training iterations with SageMaker Experiments by automatically capturing input parameters, configurations, and results. 
+  Discover how Debugger makes the training process transparent by automatically capturing real-time tensor data from metrics such as weights, gradients, and activation outputs of convolutional neural networks.
+ Use CloudWatch to trigger Lambda when Debugger catches issues.
+  Master the SageMaker training process using SageMaker Experiments and Debugger.

You can find the notebooks and training scripts used in this video from [SageMaker Debugger PyTorch Iterative Model Pruning](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/pytorch_iterative_model_pruning).

The following image shows how the iterative model pruning process reduces the size of AlexNet by cutting out the 100 least significant filters based on importance rank evaluated by activation outputs and gradients.

The pruning process reduced the initial 50 million parameters to 18 million. It also reduced the estimated model size from 201 MB to 73 MB. 

![\[An image containing model pruning result output visualizations\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-model-pruning-results-alexnet.gif)


You also need to track model accuracy, and the following image shows how you can plot the model pruning process to visualize changes in model accuracy based on the number of parameters in SageMaker Studio.

![\[An image of tensor visualization using Debugger in SageMaker Studio\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-model-pruning-studio.png)


In SageMaker Studio, choose the **Experiments** tab, select a list of tensors saved by Debugger from the pruning process, and then compose a **Trial Component List** panel. Select all ten iterations and then choose **Add chart** to create a **Trial Component Chart**. After you decide on a model to deploy, choose the trial component and choose a menu to perform an action or choose **Deploy model**.

**Note**  
To deploy a model through SageMaker Studio using the following notebook example, add a line at the end of the `train` function in the `train.py` script.  

```
# In the train.py script, look for the train function in line 58.
def train(epochs, batch_size, learning_rate):
    ...
        print('acc:{:.4f}'.format(correct/total))
        hook.save_scalar("accuracy", correct/total, sm_metric=True)

    # Add the following code to line 128 of the train.py script to save the pruned models
    # under the current SageMaker Studio model directory
    torch.save(model.state_dict(), os.environ['SM_MODEL_DIR'] + '/model.pt')
```

## [Using SageMaker Debugger to monitor a convolutional autoencoder model training](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/model_specific_realtime_analysis/autoencoder_mnist/autoencoder_mnist.html)
<a name="debugger-visualization-autoencoder_mnist"></a>

This notebook demonstrates how SageMaker Debugger visualizes tensors from an unsupervised (or self-supervised) learning process on an MNIST image dataset of handwritten numbers.

The training model in this notebook is a convolutional autoencoder with the MXNet framework. The convolutional autoencoder has a bottleneck-shaped convolutional neural network that consists of an encoder part and a decoder part. 

The encoder in this example has two convolution layers to produce a compressed representation (latent variables) of the input images. In this case, the encoder produces a latent variable of size (1, 20) from an original input image of size (28, 28), reducing the size of the data for training by a factor of about 40.

The decoder has two *deconvolutional* layers and ensures that the latent variables preserve key information by reconstructing output images.

The convolutional encoder reduces the input data size for clustering algorithms and improves the performance of clustering algorithms such as k-means, k-NN, and t-Distributed Stochastic Neighbor Embedding (t-SNE).

This notebook example demonstrates how to visualize the latent variables using Debugger, as shown in the following animation. It also demonstrates how the t-SNE algorithm classifies the latent variables into ten clusters and projects them into a two-dimensional space. The scatter plot color scheme on the right side of the image reflects the true values to show how well the autoencoder model and t-SNE algorithm organize the latent variables into the clusters.

![\[A conceptual image of convolutional autoencoder\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-cnn-autoencoder-plot.gif)


## [Using SageMaker Debugger to monitor attentions in BERT model training](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/model_specific_realtime_analysis/bert_attention_head_view/bert_attention_head_view.html)
<a name="debugger-visualization-bert_attention_head_view"></a>

Bidirectional Encoder Representations from Transformers (BERT) is a language representation model. As the name of the model suggests, the BERT model builds on *transfer learning* and the *Transformer model* for natural language processing (NLP).

The BERT model is pre-trained on unsupervised tasks such as predicting missing words in a sentence or predicting the next sentence that naturally follows a previous sentence. The training data contains 3.3 billion words (tokens) of English text, from sources such as Wikipedia and electronic books. For a simple example, the BERT model can give a high *attention* to appropriate verb tokens or pronoun tokens from a subject token.

The pre-trained BERT model can be fine-tuned with an additional output layer to achieve state-of-the-art model training in NLP tasks, such as automated responses to questions, text classification, and many others. 

Debugger collects tensors from the fine-tuning process. In the context of NLP, the weight of neurons is called *attention*. 

This notebook demonstrates how to use the [pre-trained BERT model from the GluonNLP model zoo](https://gluon-nlp.mxnet.io/model_zoo/bert/index.html) on the Stanford Question Answering Dataset and how to set up SageMaker Debugger to monitor the training job.

Plotting *attention scores* and individual neurons in the query and key vectors can help to identify causes of incorrect model predictions. With SageMaker AI Debugger, you can retrieve the tensors and plot the *attention-head view* in real time as training progresses and understand what the model is learning.

The following animation shows the attention scores of the first 20 input tokens for ten iterations in the training job provided in the notebook example.

![\[An animation of the attention scores\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-attention_scores.gif)


## [Using SageMaker Debugger to visualize class activation maps in convolutional neural networks (CNNs)](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/model_specific_realtime_analysis/cnn_class_activation_maps/cnn_class_activation_maps.html)
<a name="debugger-visualization-cnn_class_activation_maps"></a>

This notebook demonstrates how to use SageMaker Debugger to plot class activation maps for image detection and classification in convolutional neural networks (CNNs). In deep learning, a *convolutional neural network (CNN or ConvNet)* is a class of deep neural networks, most commonly applied to analyzing visual imagery. One of the applications that adopts the class activation maps is self-driving cars, which require instantaneous detection and classification of images such as traffic signs, roads, and obstacles.

In this notebook, the PyTorch ResNet model is trained on [the German Traffic Sign Dataset](http://benchmark.ini.rub.de/), which contains more than 40 classes of traffic-related objects and more than 50,000 images in total.

![\[An animation of CNN class activation maps\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-cnn-class-activation-maps.gif)


During the training process, SageMaker Debugger collects tensors to plot the class activation maps in real time. As shown in the animated image, the class activation map (also called a *saliency map*) highlights regions with high activation in red. 

Using tensors captured by Debugger, you can visualize how the activation map evolves during the model training. The model starts by detecting the edge on the lower-left corner at the beginning of the training job. As the training progresses, the focus shifts to the center and detects the speed limit sign, and the model successfully predicts the input image as Class 3, which is a class of speed limit 60km/h signs, with a 97% confidence level.

# Debugging training jobs using Amazon SageMaker Debugger
<a name="debugger-debug-training-jobs"></a>

To prepare your training script and run training jobs with SageMaker Debugger to debug model training progress, you follow the typical two-step process: modify your training script using the `sagemaker-debugger` Python SDK, and construct a SageMaker AI estimator using the SageMaker Python SDK. Go through the following topics to learn how to use SageMaker Debugger's debugging functionality.

**Topics**
+ [Adapting your training script to register a hook](debugger-modify-script.md)
+ [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md)
+ [SageMaker Debugger interactive report for XGBoost](debugger-report-xgboost.md)
+ [Action on Amazon SageMaker Debugger rules](debugger-action-on-rules.md)
+ [Visualize Amazon SageMaker Debugger output tensors in TensorBoard](debugger-enable-tensorboard-summaries.md)

# Adapting your training script to register a hook
<a name="debugger-modify-script"></a>

Amazon SageMaker Debugger comes with a client library called the [`sagemaker-debugger` Python SDK](https://sagemaker-debugger.readthedocs.io/en/website). The `sagemaker-debugger` Python SDK provides tools for adapting your training script before training and analysis tools after training. In this page, you'll learn how to adapt your training script using the client library. 

The `sagemaker-debugger` Python SDK provides wrapper functions that help register a hook to extract model tensors, without altering your training script. To get started with collecting model output tensors and debug them to find training issues, make the following modifications in your training script.

**Tip**  
While you're following this page, use the [`sagemaker-debugger` open source SDK documentation](https://sagemaker-debugger.readthedocs.io/en/website/index.html) for API references.

**Topics**
+ [Adapt your PyTorch training script](debugger-modify-script-pytorch.md)
+ [Adapt your TensorFlow training script](debugger-modify-script-tensorflow.md)

# Adapt your PyTorch training script
<a name="debugger-modify-script-pytorch"></a>

To start collecting model output tensors and debug training issues, make the following modifications to your PyTorch training script.

**Note**  
SageMaker Debugger cannot collect model output tensors from the [`torch.nn.functional`](https://pytorch.org/docs/stable/nn.functional.html) API operations. When you write a PyTorch training script, it is recommended to use the [`torch.nn`](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) modules (such as `torch.nn.NLLLoss`) instead.
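
For example, the following sketch contrasts the two styles by building the model entirely from `torch.nn` modules, which Debugger can hook, rather than calling `torch.nn.functional` operations inside `forward()`. The layer sizes here are arbitrary placeholders.

```
import torch.nn as nn

# Build the model from torch.nn modules so that Debugger can hook each layer
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),              # instead of calling torch.nn.functional.relu in forward()
    nn.Linear(128, 10),
    nn.LogSoftmax(dim=1),
)
loss_fn = nn.NLLLoss()      # instead of torch.nn.functional.nll_loss
```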

## For PyTorch 1.12.0
<a name="debugger-modify-script-pytorch-1-12-0"></a>

If you bring a PyTorch training script, you can run the training job and extract model output tensors with a few additional code lines in your training script. You need to use the [hook APIs](https://sagemaker-debugger.readthedocs.io/en/website/hook-api.html) in the `sagemaker-debugger` client library. Walk through the following instructions that break down the steps with code examples.

1. Create a hook.

   **(Recommended) For training jobs within SageMaker AI**

   ```
   import smdebug.pytorch as smd
   hook=smd.get_hook(create_if_not_exists=True)
   ```

   When you launch a training job in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md) with any of the `DebuggerHookConfig`, `TensorBoardConfig`, or `Rules` parameters in your estimator, SageMaker AI adds a JSON configuration file to your training instance that is picked up by the `get_hook` function. Note that if you do not include any of the configuration APIs in your estimator, there is no configuration file for the hook to find, and the function returns `None`.

   **(Optional) For training jobs outside SageMaker AI**

   If you run training jobs in local mode, directly on SageMaker notebook instances, Amazon EC2 instances, or your own local devices, use the `smd.Hook` class to create a hook. However, this approach can only store tensor collections that are usable for TensorBoard visualization. SageMaker Debugger’s built-in Rules don’t work with local mode, because the Rules require SageMaker AI ML training instances and S3 to store outputs from the remote instances in real time. The `smd.get_hook` API returns `None` in this case.

   If you want to create a manual hook to save tensors in local mode, use the following code snippet with the logic to check if the `smd.get_hook` API returns `None` and create a manual hook using the `smd.Hook` class. Note that you can specify any output directory in your local machine.

   ```
   import smdebug.pytorch as smd
   hook=smd.get_hook(create_if_not_exists=True)
   
   if hook is None:
       hook=smd.Hook(
           out_dir='/path/to/your/local/output/',
           export_tensorboard=True
       )
   ```

1. Wrap your model with the hook’s class methods.

   The `hook.register_module()` method takes your model and iterates through each layer, looking for any tensors that match the regular expressions that you provide through the configuration in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md). The tensors that can be collected through this hook method are weights, biases, activations, gradients, inputs, and outputs.

   ```
   hook.register_module(model)
   ```
**Tip**  
If you collect entire output tensors from a large deep learning model, the total size of those collections can grow exponentially and might cause bottlenecks. To save only specific tensors, you can use the `hook.save_tensor()` method. This method lets you pick the variable for a specific tensor and save it to a custom collection with a name of your choice. For more information, see [step 7](#debugger-modify-script-pytorch-save-custom-tensor) of this instruction.

1. Wrap the loss function with the hook’s class methods.

   The `hook.register_loss` method wraps the loss function. It extracts loss values at every `save_interval` that you set during configuration in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md), and saves them to the `"losses"` collection.

   ```
   hook.register_loss(loss_function)
   ```

1. Add `hook.set_mode(ModeKeys.TRAIN)` in the train block. This indicates the tensor collection is extracted during the training phase.

   ```
   def train():
       ...
       hook.set_mode(ModeKeys.TRAIN)
   ```

1. Add `hook.set_mode(ModeKeys.EVAL)` in the validation block. This indicates the tensor collection is extracted during the validation phase.

   ```
   def validation():
       ...
       hook.set_mode(ModeKeys.EVAL)
   ```

1. Use [`hook.save_scalar()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-constructor.html#smdebug.core.hook.BaseHook.save_scalar) to save custom scalars. You can save scalar values that aren’t in your model. For example, if you want to record the accuracy values computed during evaluation, add the following line of code below the line where you calculate accuracy.

   ```
   hook.save_scalar("accuracy", accuracy)
   ```

   Note that you need to provide a string as the first argument to name the custom scalar collection. This is the name that'll be used for visualizing the scalar values in TensorBoard, and can be any string you want.

1. <a name="debugger-modify-script-pytorch-save-custom-tensor"></a>Use [`hook.save_tensor()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-constructor.html#smdebug.core.hook.BaseHook.save_tensor) to save custom tensors. Similarly to [`hook.save_scalar()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-constructor.html#smdebug.core.hook.BaseHook.save_scalar), you can save additional tensors by defining your own tensor collection. For example, you can extract input image data that is passed into the model and save it as a custom tensor by adding the following line of code, where `"images"` is an example name for the custom tensor and `image_inputs` is an example variable for the input image data.

   ```
   hook.save_tensor("images", image_inputs)
   ```

   Note that you must provide a string as the first argument to name the custom tensor. `hook.save_tensor()` has a third argument, `collections_to_write`, to specify the tensor collection in which to save the custom tensor. The default is `collections_to_write="default"`. If you don't explicitly specify the third argument, the custom tensor is saved to the `"default"` tensor collection.
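
Putting the preceding steps together, the following is a minimal sketch of an adapted PyTorch training script. The model, loss function, data loaders, and hyperparameters are hypothetical placeholders, and the script assumes it runs as a Debugger-configured SageMaker training job so that `smd.get_hook` can find its configuration.

```
import torch
import torch.nn as nn
import smdebug.pytorch as smd
from smdebug import modes

# Step 1: create the hook from the SageMaker-provided configuration
hook = smd.get_hook(create_if_not_exists=True)

# Hypothetical model and loss function; replace with your own
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Steps 2 and 3: register the model and the loss function with the hook
hook.register_module(model)
hook.register_loss(loss_fn)

def train(dataloader):
    # Step 4: mark the training phase
    hook.set_mode(modes.TRAIN)
    for data, labels in dataloader:
        optimizer.zero_grad()
        loss = loss_fn(model(data), labels)
        loss.backward()
        optimizer.step()

def validate(dataloader):
    # Step 5: mark the validation phase
    hook.set_mode(modes.EVAL)
    correct, total = 0, 0
    with torch.no_grad():
        for data, labels in dataloader:
            pred = model(data).argmax(dim=1)
            correct += (pred == labels).sum().item()
            total += labels.numel()
    # Step 6: save a custom scalar
    hook.save_scalar("accuracy", correct / total)
```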

After you have completed adapting your training script, proceed to [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md).

# Adapt your TensorFlow training script
<a name="debugger-modify-script-tensorflow"></a>

To start collecting model output tensors and debug training issues, make the following modifications to your TensorFlow training script.

**Create a hook for training jobs within SageMaker AI**

```
import smdebug.tensorflow as smd

hook=smd.get_hook(hook_type="keras", create_if_not_exists=True)
```

This creates a hook when you start a SageMaker training job. When you launch a training job in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md) with any of the `DebuggerHookConfig`, `TensorBoardConfig`, or `Rules` in your estimator, SageMaker AI adds a JSON configuration file to your training instance that is picked up by the `smd.get_hook` method. Note that if you do not include any of the configuration APIs in your estimator, there will be no configuration file for the hook to find, and the function returns `None`.

**(Optional) Create a hook for training jobs outside SageMaker AI**

If you run training jobs in local mode, directly on SageMaker notebook instances, Amazon EC2 instances, or your own local devices, use the `smd.Hook` class to create a hook. However, this approach can only store tensor collections that are usable for TensorBoard visualization. SageMaker Debugger’s built-in Rules don’t work with local mode. The `smd.get_hook` method also returns `None` in this case.

If you want to create a manual hook, use the following code snippet with the logic to check if the hook returns `None` and create a manual hook using the `smd.Hook` class.

```
import smdebug.tensorflow as smd

hook=smd.get_hook(hook_type="keras", create_if_not_exists=True) 

if hook is None:
    hook=smd.KerasHook(
        out_dir='/path/to/your/local/output/',
        export_tensorboard=True
    )
```

After adding the hook creation code, proceed to the following topic for TensorFlow Keras.

**Note**  
SageMaker Debugger currently supports TensorFlow Keras only.

## Register the hook in your TensorFlow Keras training script
<a name="debugger-modify-script-tensorflow-keras"></a>

The following procedure walks you through how to use the hook and its methods to collect output scalars and tensors from your model and optimizer.

1. Wrap your Keras model and optimizer with the hook’s class methods.

   The `hook.register_model()` method takes your model and iterates through each layer, looking for any tensors that match the regular expressions that you provide through the configuration in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md). The tensors that can be collected through this hook method are weights, biases, and activations.

   ```
   model=tf.keras.Model(...)
   hook.register_model(model)
   ```

1. Wrap the optimizer with the `hook.wrap_optimizer()` method.

   ```
   optimizer=tf.keras.optimizers.Adam(...)
   optimizer=hook.wrap_optimizer(optimizer)
   ```

1. Compile the model in eager mode in TensorFlow.

   To collect tensors from the model, such as the input and output tensors of each layer, you must run the training in eager mode. Otherwise, SageMaker Debugger will not be able to collect the tensors. However, other tensors, such as model weights, biases, and the loss, can be collected without explicitly running in eager mode.

   ```
   model.compile(
       loss="categorical_crossentropy", 
       optimizer=optimizer, 
       metrics=["accuracy"],
       # Required for collecting tensors of each layer
       run_eagerly=True
   )
   ```

1. Register the hook to the Keras [`model.fit()`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) method.

   To collect the tensors from the hooks that you registered, add `callbacks=[hook]` to the Keras `model.fit()` class method. This will pass the `sagemaker-debugger` hook as a Keras callback.

   ```
   model.fit(
       X_train, Y_train,
       batch_size=batch_size,
       epochs=epoch,
       validation_data=(X_valid, Y_valid),
       shuffle=True, 
       callbacks=[hook]
   )
   ```

1. TensorFlow 2.x provides only symbolic gradient variables that do not provide access to their values. To collect gradients, wrap `tf.GradientTape` with the [`hook.wrap_tape()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-methods.html#tensorflow-specific-hook-api) method, which requires you to write your own training step as follows.

   ```
   def training_step(model, dataset):
       with hook.wrap_tape(tf.GradientTape()) as tape:
           pred=model(data)
           loss_value=loss_fn(labels, pred)
       grads=tape.gradient(loss_value, model.trainable_variables)
       optimizer.apply_gradients(zip(grads, model.trainable_variables))
   ```

   By wrapping the tape, the `sagemaker-debugger` hook can identify output tensors such as gradients, parameters, and losses. The `hook.wrap_tape()` method wraps the tape object’s functions, such as `push_tape()`, `pop_tape()`, and `gradient()`, so that SageMaker Debugger sets up its writers and saves the tensors that are provided as input to `gradient()` (trainable variables and loss) and as output of `gradient()` (gradients).
**Note**  
To collect with a custom training loop, make sure that you use eager mode. Otherwise, SageMaker Debugger is not able to collect any tensors.
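
Putting the preceding steps together, a minimal sketch of an adapted Keras training script might look like the following. The model architecture and the dataset variables (`X_train`, `Y_train`, `X_valid`, `Y_valid`) are hypothetical placeholders, and the script assumes it runs as a Debugger-configured SageMaker training job.

```
import tensorflow as tf
import smdebug.tensorflow as smd

# Create the hook from the SageMaker-provided configuration
hook = smd.get_hook(hook_type="keras", create_if_not_exists=True)

# Hypothetical model; replace with your own architecture
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Register the model and wrap the optimizer with the hook
hook.register_model(model)
optimizer = hook.wrap_optimizer(tf.keras.optimizers.Adam())

model.compile(
    loss="categorical_crossentropy",
    optimizer=optimizer,
    metrics=["accuracy"],
    run_eagerly=True,  # required to collect layer input/output tensors
)

# X_train, Y_train, X_valid, Y_valid are hypothetical dataset variables
model.fit(
    X_train, Y_train,
    epochs=10,
    validation_data=(X_valid, Y_valid),
    callbacks=[hook],  # pass the hook as a Keras callback
)
```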

For a full list of actions that the `sagemaker-debugger` hook APIs offer to construct hooks and save tensors, see [Hook Methods](https://sagemaker-debugger.readthedocs.io/en/website/hook-methods.html) in the *`sagemaker-debugger` Python SDK documentation*.

After you have completed adapting your training script, proceed to [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md).

# Launch training jobs with Debugger using the SageMaker Python SDK
<a name="debugger-configuration-for-debugging"></a>

To configure a SageMaker AI estimator with SageMaker Debugger, use [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) and specify Debugger-specific parameters. To fully utilize the debugging functionality, there are three parameters you need to configure: `debugger_hook_config`, `tensorboard_output_config`, and `rules`.

**Important**  
Before constructing and running the estimator fit method to launch a training job, make sure that you adapt your training script following the instructions at [Adapting your training script to register a hook](debugger-modify-script.md).

## Constructing a SageMaker AI Estimator with Debugger-specific parameters
<a name="debugger-configuration-structure"></a>

The code examples in this section show how to construct a SageMaker AI estimator with the Debugger-specific parameters.

**Note**  
The following code examples are templates for constructing the SageMaker AI framework estimators and not directly executable. You need to proceed to the next sections and configure the Debugger-specific parameters.

------
#### [ PyTorch ]

```
# An example of constructing a SageMaker AI PyTorch estimator
import boto3
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=PyTorch(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.12.0",
    py_version="py37",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ TensorFlow ]

```
# An example of constructing a SageMaker AI TensorFlow estimator
import boto3
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

session=boto3.session.Session()
region=session.region_name

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=TensorFlow(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ MXNet ]

```
# An example of constructing a SageMaker AI MXNet estimator
import sagemaker
from sagemaker.mxnet import MXNet
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=MXNet(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.7.0",
    py_version="py37",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ XGBoost ]

```
# An example of constructing a SageMaker AI XGBoost estimator
import sagemaker
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

estimator=XGBoost(
    entry_point="directory/to/your_training_script.py",
    role=sagemaker.get_execution_role(),
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.5-1",

    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
```

------
#### [ Generic estimator ]

```
# An example of constructing a SageMaker AI generic estimator using the XGBoost algorithm base image
import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig, Rule, rule_configs

debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule())
]

region=boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    
    # Debugger-specific parameters
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

estimator.fit(wait=False)
```

------

Configure the following parameters to activate SageMaker Debugger:
+ `debugger_hook_config` (an object of [https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig)) – Required to activate the hook in the adapted training script during [Adapting your training script to register a hook](debugger-modify-script.md), configure the SageMaker training launcher (estimator) to collect output tensors from your training job, and save the tensors into your secured S3 bucket or local machine. To learn how to configure the `debugger_hook_config` parameter, see [Configuring SageMaker Debugger to save tensors](debugger-configure-hook.md).
+ `rules` (a list of [https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.Rule](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.Rule) objects) – Configure this parameter to activate SageMaker Debugger built-in rules that you want to run in real time. The built-in rules are logics that automatically debug the training progress of your model and find training issues by analyzing the output tensors saved in your secured S3 bucket. To learn how to configure the `rules` parameter, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md). To find a complete list of built-in rules for debugging output tensors, see [Debugger rule](debugger-built-in-rules.md#debugger-built-in-rules-Rule). If you want to create your own logic to detect any training issues, see [Creating custom rules using the Debugger client library](debugger-custom-rules.md).
**Note**  
The built-in rules are available only through SageMaker training instances. You cannot use them in local mode.
+ `tensorboard_output_config` (an object of [https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.TensorBoardOutputConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.TensorBoardOutputConfig)) – Configure SageMaker Debugger to collect output tensors in the TensorBoard-compatible format and save to your S3 output path specified in the `TensorBoardOutputConfig` object. To learn more, see [Visualize Amazon SageMaker Debugger output tensors in TensorBoard](debugger-enable-tensorboard-summaries.md).
**Note**  
The `tensorboard_output_config` must be configured with the `debugger_hook_config` parameter, which also requires you to adapt your training script by adding the `sagemaker-debugger` hook.

**Note**  
SageMaker Debugger securely saves output tensors in subfolders of your S3 bucket. For example, the format of the default S3 bucket URI in your account is `s3://amzn-s3-demo-bucket-sagemaker-<region>-<12digit_account_id>/<base-job-name>/<debugger-subfolders>/`. There are two subfolders created by SageMaker Debugger: `debug-output` and `rule-output`. If you add the `tensorboard_output_config` parameter, you'll also find a `tensorboard-output` folder.
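
As a concrete illustration, the three Debugger-specific parameters might be configured as in the following sketch before being passed to one of the estimators above. The S3 path is a hypothetical placeholder, and `vanishing_gradient` and `loss_not_decreasing` are examples of Debugger built-in rules.

```
from sagemaker.debugger import (
    CollectionConfig,
    DebuggerHookConfig,
    Rule,
    TensorBoardOutputConfig,
    rule_configs,
)

# Collect the "losses" and "gradients" built-in tensor collections
debugger_hook_config = DebuggerHookConfig(
    collection_configs=[
        CollectionConfig(name="losses"),
        CollectionConfig(name="gradients"),
    ]
)

# Hypothetical S3 path for TensorBoard-compatible output
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path="s3://amzn-s3-demo-bucket/tensorboard-output"
)

# Example built-in rules to run against the saved tensors in real time
rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
]
```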

See the following topics to find more examples of how to configure the Debugger-specific parameters in detail.

**Topics**
+ [Constructing a SageMaker AI Estimator with Debugger-specific parameters](#debugger-configuration-structure)
+ [Configuring SageMaker Debugger to save tensors](debugger-configure-hook.md)
+ [How to configure Debugger built-in rules](use-debugger-built-in-rules.md)
+ [Turn off Debugger](debugger-turn-off.md)
+ [Useful SageMaker AI estimator class methods for Debugger](debugger-estimator-classmethods.md)

# Configuring SageMaker Debugger to save tensors
<a name="debugger-configure-hook"></a>

*Tensors* are data collections of updated parameters from the backward and forward passes of each training iteration. SageMaker Debugger collects the output tensors to analyze the state of a training job. SageMaker Debugger's [`CollectionConfig`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig) and [`DebuggerHookConfig`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig) API operations provide methods for grouping tensors into *collections* and saving them to a target S3 bucket. The following topics show how to use the `CollectionConfig` and `DebuggerHookConfig` API operations, followed by examples of how to use the Debugger hook to save, access, and visualize output tensors.

While constructing a SageMaker AI estimator, activate SageMaker Debugger by specifying the `debugger_hook_config` parameter. The following topics include examples of how to set up the `debugger_hook_config` using the `CollectionConfig` and `DebuggerHookConfig` API operations to pull tensors out of your training jobs and save them.

**Note**  
When properly configured and activated, SageMaker Debugger saves the output tensors in a default S3 bucket, unless otherwise specified. The format of the default S3 bucket URI is `s3://amzn-s3-demo-bucket-sagemaker-<region>-<12digit_account_id>/<training-job-name>/debug-output/`.

**Topics**
+ [Configure tensor collections using the `CollectionConfig` API](debugger-configure-tensor-collections.md)
+ [Configure the `DebuggerHookConfig` API to save tensors](debugger-configure-tensor-hook.md)
+ [Example notebooks and code samples to configure Debugger hook](debugger-save-tensors.md)

# Configure tensor collections using the `CollectionConfig` API
<a name="debugger-configure-tensor-collections"></a>

Use the `CollectionConfig` API operation to configure tensor collections. Debugger provides pre-built tensor collections that cover a variety of regular expressions (regex) for parameters of Debugger-supported deep learning frameworks and machine learning algorithms. As shown in the following example code, add the built-in tensor collections that you want to debug.

```
from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(name="weights"),
    CollectionConfig(name="gradients")
]
```

The preceding collections set up the Debugger hook to save the tensors every 500 steps based on the default `"save_interval"` value.

For a full list of available Debugger built-in collections, see [Debugger Built-in Collections](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#collection).

If you want to customize the built-in collections, such as changing the save intervals and tensor regex, use the following `CollectionConfig` template to adjust parameters.

```
from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(
        name="tensor_collection",
        parameters={
            "key_1": "value_1",
            "key_2": "value_2",
            ...
            "key_n": "value_n"
        }
    )
]
```

For more information about available parameter keys, see [CollectionConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). For example, the following code shows how you can adjust the save intervals of the `"losses"` tensor collection at different phases of training: save the training loss every 100 steps during the training phase and the validation loss every 10 steps during the validation phase.

```
from sagemaker.debugger import CollectionConfig

collection_configs=[
    CollectionConfig(
        name="losses",
        parameters={
            "train.save_interval": "100",
            "eval.save_interval": "10"
        }
    )
]
```

**Tip**  
This tensor collection configuration object can be used for both [DebuggerHookConfig](https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-configure-hook.html#debugger-configure-tensor-hook) and [Rule](https://docs.aws.amazon.com/sagemaker/latest/dg/use-debugger-built-in-rules.html#debugger-built-in-rules-configuration-param-change) API operations.

# Configure the `DebuggerHookConfig` API to save tensors
<a name="debugger-configure-tensor-hook"></a>

Use the [DebuggerHookConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig) API to create a `debugger_hook_config` object using the `collection_configs` object you created in the previous step.

```
from sagemaker.debugger import DebuggerHookConfig

debugger_hook_config=DebuggerHookConfig(
    collection_configs=collection_configs
)
```

Debugger saves the model training output tensors into the default S3 bucket. The format of the default S3 bucket URI is `s3://amzn-s3-demo-bucket-sagemaker-<region>-<12digit_account_id>/<training-job-name>/debug-output/`.

If you want to specify an exact S3 bucket URI, use the following code example:

```
from sagemaker.debugger import DebuggerHookConfig

debugger_hook_config=DebuggerHookConfig(
    s3_output_path="specify-uri",
    collection_configs=collection_configs
)
```

For more information, see [DebuggerHookConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).
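
After a training job runs with this hook configuration, you can read the saved tensors back with the open source `smdebug` trials API, as in the following sketch. The S3 path is a hypothetical placeholder for your job's `debug-output` location.

```
from smdebug.trials import create_trial

# Hypothetical output path; point this at a job's debug-output location
trial = create_trial("s3://amzn-s3-demo-bucket/training-job-name/debug-output")

# List the names of all saved tensors
print(trial.tensor_names())

# Retrieve the value of one tensor at its first saved step
name = trial.tensor_names()[0]
steps = trial.tensor(name).steps()
print(trial.tensor(name).value(steps[0]))
```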

# Example notebooks and code samples to configure Debugger hook
<a name="debugger-save-tensors"></a>

The following sections provide notebooks and code examples of how to use Debugger hook to save, access, and visualize output tensors.

**Topics**
+ [Tensor visualization example notebooks](#debugger-tensor-visualization-notebooks)
+ [Save tensors using Debugger built-in collections](#debugger-save-built-in-collections)
+ [Save tensors by modifying Debugger built-in collections](#debugger-save-modified-built-in-collections)
+ [Save tensors using Debugger custom collections](#debugger-save-custom-collections)

## Tensor visualization example notebooks
<a name="debugger-tensor-visualization-notebooks"></a>

The following two notebook examples show advanced use of Amazon SageMaker Debugger for visualizing tensors. Debugger provides a transparent view into training deep learning models.
+ [Interactive Tensor Analysis in SageMaker Studio Notebook with MXNet](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mnist_tensor_analysis)

  This notebook example shows how to visualize saved tensors using Amazon SageMaker Debugger. By visualizing the tensors, you can see how the tensor values change while training deep learning algorithms. This notebook includes a training job with a poorly configured neural network and uses Amazon SageMaker Debugger to aggregate and analyze tensors, including gradients, activation outputs, and weights. For example, the following plot shows the distribution of gradients of a convolutional layer that is suffering from a vanishing gradient problem.  
![\[A graph plotting the distribution of gradients.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-vanishing-gradient.gif)

  This notebook also illustrates how a good initial hyperparameter setting improves the training process by generating the same tensor distribution plots. 
+ [ Visualizing and Debugging Tensors from MXNet Model Training](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mnist_tensor_plot)

  This notebook example shows how to save and visualize tensors from an MXNet Gluon model training job using Amazon SageMaker Debugger. It shows how Debugger is set up to save all tensors to an Amazon S3 bucket and retrieves the ReLU activation outputs for visualization. The following figure shows a three-dimensional visualization of the ReLU activation outputs. The color scheme is set to blue to indicate values close to 0 and yellow to indicate values close to 1.
![\[A visualization of the ReLU activation outputs\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/tensorplot.gif)

  In this notebook, the `TensorPlot` class imported from `tensor_plot.py` is designed to plot convolutional neural networks (CNNs) that take two-dimensional images for inputs. The `tensor_plot.py` script provided with the notebook retrieves tensors using Debugger and visualizes the CNN. You can run this notebook on SageMaker Studio to reproduce the tensor visualization and implement your own convolutional neural network model. 
+ [Real-time Tensor Analysis in a SageMaker Notebook with MXNet](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mxnet_realtime_analysis)

  This example guides you through installing required components for emitting tensors in an Amazon SageMaker training job and using the Debugger API operations to access those tensors while training is running. A Gluon CNN model is trained on the Fashion MNIST dataset. While the job is running, the notebook shows how Debugger retrieves activation outputs of the first convolutional layer from each of 100 batches and visualizes them. It also shows how to visualize the weights after the job is done.

## Save tensors using Debugger built-in collections
<a name="debugger-save-built-in-collections"></a>

You can use built-in collections of tensors using the `CollectionConfig` API and save them using the `DebuggerHookConfig` API. The following example shows how to use the default settings of Debugger hook configurations to construct a SageMaker AI TensorFlow estimator. You can also use this for MXNet, PyTorch, and XGBoost estimators.

**Note**  
In the following example code, the `s3_output_path` parameter for `DebuggerHookConfig` is optional. If you do not specify it, Debugger saves the tensors at `s3://<output_path>/debug-output/`, where the `<output_path>` is the default output path of SageMaker training jobs. For example:  

```
"s3://sagemaker-us-east-1-111122223333/sagemaker-debugger-training-YYYY-MM-DD-HH-MM-SS-123/debug-output"
```
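The default path in the note is simply the training job's default output path with a `debug-output` suffix appended. The following sketch illustrates that composition; the helper function is ours (not part of the SageMaker SDK), and the bucket and job names are hypothetical placeholders.

```python
# Illustrative helper (not part of the SageMaker SDK): build the default
# Debugger tensor output URI from a training job's default S3 output path.
# The bucket and job name below are hypothetical placeholders.
def default_debug_output_uri(output_path: str) -> str:
    """Append the debug-output suffix that Debugger uses by default."""
    return output_path.rstrip("/") + "/debug-output"

uri = default_debug_output_uri(
    "s3://sagemaker-us-east-1-111122223333/sagemaker-debugger-training-YYYY-MM-DD-HH-MM-SS-123"
)
print(uri)
# s3://sagemaker-us-east-1-111122223333/sagemaker-debugger-training-YYYY-MM-DD-HH-MM-SS-123/debug-output
```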

```
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# use Debugger CollectionConfig to call built-in collections
collection_configs=[
        CollectionConfig(name="weights"),
        CollectionConfig(name="gradients"),
        CollectionConfig(name="losses"),
        CollectionConfig(name="biases")
    ]

# configure Debugger hook
# set a target S3 bucket as you want
sagemaker_session=sagemaker.Session()
BUCKET_NAME=sagemaker_session.default_bucket()
LOCATION_IN_BUCKET='debugger-built-in-collections-hook'

hook_config=DebuggerHookConfig(
    s3_output_path='s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.
                    format(BUCKET_NAME=BUCKET_NAME, 
                           LOCATION_IN_BUCKET=LOCATION_IN_BUCKET),
    collection_configs=collection_configs
)

# construct a SageMaker TensorFlow estimator
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-demo-job',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    
    # debugger-specific hook argument below
    debugger_hook_config=hook_config
)

sagemaker_estimator.fit()
```

To see a list of Debugger built-in collections, see [Debugger Built-in Collections](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#collection).

## Save tensors by modifying Debugger built-in collections
<a name="debugger-save-modified-built-in-collections"></a>

You can modify the Debugger built-in collections using the `CollectionConfig` API operation. The following example shows how to tweak the built-in `losses` collection and construct a SageMaker AI TensorFlow estimator. You can also use this for MXNet, PyTorch, and XGBoost estimators.

```
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# use Debugger CollectionConfig to call and modify built-in collections
collection_configs=[
    CollectionConfig(
                name="losses", 
                parameters={"save_interval": "50"})]

# configure Debugger hook
# set a target S3 bucket as you want
sagemaker_session=sagemaker.Session()
BUCKET_NAME=sagemaker_session.default_bucket()
LOCATION_IN_BUCKET='debugger-modified-collections-hook'

hook_config=DebuggerHookConfig(
    s3_output_path='s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.
                    format(BUCKET_NAME=BUCKET_NAME, 
                           LOCATION_IN_BUCKET=LOCATION_IN_BUCKET),
    collection_configs=collection_configs
)

# construct a SageMaker TensorFlow estimator
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-demo-job',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    
    # debugger-specific hook argument below
    debugger_hook_config=hook_config
)

sagemaker_estimator.fit()
```

For a full list of `CollectionConfig` parameters, see [ Debugger CollectionConfig API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk).

## Save tensors using Debugger custom collections
<a name="debugger-save-custom-collections"></a>

You can also save a reduced number of tensors instead of the full set of tensors (for example, if you want to reduce the amount of data saved in your Amazon S3 bucket). The following example shows how to customize the Debugger hook configuration to specify target tensors that you want to save. You can use this for TensorFlow, MXNet, PyTorch, and XGBoost estimators.

```
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# use Debugger CollectionConfig to create a custom collection
collection_configs=[
        CollectionConfig(
            name="custom_activations_collection",
            parameters={
                "include_regex": "relu|tanh", # Required
                "reductions": "mean,variance,max,abs_mean,abs_variance,abs_max"
            })
    ]
    
# configure Debugger hook
# set a target S3 bucket as you want
sagemaker_session=sagemaker.Session()
BUCKET_NAME=sagemaker_session.default_bucket()
LOCATION_IN_BUCKET='debugger-custom-collections-hook'

hook_config=DebuggerHookConfig(
    s3_output_path='s3://{BUCKET_NAME}/{LOCATION_IN_BUCKET}'.
                    format(BUCKET_NAME=BUCKET_NAME, 
                           LOCATION_IN_BUCKET=LOCATION_IN_BUCKET),
    collection_configs=collection_configs
)

# construct a SageMaker TensorFlow estimator
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-demo-job',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",
    
    # debugger-specific hook argument below
    debugger_hook_config=hook_config
)

sagemaker_estimator.fit()
```

For a full list of `CollectionConfig` parameters, see [ Debugger CollectionConfig](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk).
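The `parameters` dictionary for a custom collection holds plain string values, with the reductions given as a single comma-separated string. As a sketch (the helper function below is ours, not part of the SageMaker SDK), you can assemble it programmatically:

```python
# Illustrative helper (not part of the SageMaker SDK): assemble the
# `parameters` dict for a custom Debugger collection. `include_regex`
# is required; `reductions` is joined into the comma-separated string
# format shown in the example above.
def custom_collection_parameters(include_regex, reductions=None):
    params = {"include_regex": include_regex}
    if reductions:
        params["reductions"] = ",".join(reductions)
    return params

params = custom_collection_parameters(
    "relu|tanh",
    ["mean", "variance", "max", "abs_mean", "abs_variance", "abs_max"],
)
print(params)
# {'include_regex': 'relu|tanh', 'reductions': 'mean,variance,max,abs_mean,abs_variance,abs_max'}
```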

# How to configure Debugger built-in rules
<a name="use-debugger-built-in-rules"></a>

In the following topics, you'll learn how to use the SageMaker Debugger built-in rules. Amazon SageMaker Debugger's built-in rules analyze tensors emitted during the training of a model. SageMaker Debugger offers the `Rule` API operation, which monitors training job progress and errors so that your model trains successfully. For example, the rules can detect whether gradients are getting too large or too small, whether a model is overfitting or overtraining, and whether a training job fails to decrease its loss function and improve. To see a full list of available built-in rules, see [List of Debugger built-in rules](debugger-built-in-rules.md).

**Topics**
+ [Use Debugger built-in rules with the default parameter settings](debugger-built-in-rules-configuration.md)
+ [Use Debugger built-in rules with custom parameter values](debugger-built-in-rules-configuration-param-change.md)
+ [Example notebooks and code samples to configure Debugger rules](debugger-built-in-rules-example.md)

# Use Debugger built-in rules with the default parameter settings
<a name="debugger-built-in-rules-configuration"></a>

To specify Debugger built-in rules in an estimator, you need to configure a list object. The following example code shows the basic structure of listing the Debugger built-in rules:

```
from sagemaker.debugger import Rule, rule_configs

rules=[
    Rule.sagemaker(rule_configs.built_in_rule_name_1()),
    Rule.sagemaker(rule_configs.built_in_rule_name_2()),
    ...
    Rule.sagemaker(rule_configs.built_in_rule_name_n()),
    ... # You can also append more profiler rules in the ProfilerRule.sagemaker(rule_configs.*()) format.
]
```

For more information about default parameter values and descriptions of the built-in rule, see [List of Debugger built-in rules](debugger-built-in-rules.md).

To find the SageMaker Debugger API reference, see [https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.sagemaker.debugger.rule_configs](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.sagemaker.debugger.rule_configs) and [https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.Rule](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.Rule).

For example, to inspect the overall training performance and progress of your model, construct a SageMaker AI estimator with the following built-in rule configuration. 

```
from sagemaker.debugger import Rule, rule_configs

rules=[
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.stalled_training_rule())
]
```

When you start the training job, Debugger collects system resource utilization data every 500 milliseconds and the loss and accuracy values every 500 steps by default. Debugger analyzes the resource utilization to identify whether your model is having bottleneck problems. The `loss_not_decreasing`, `overfit`, `overtraining`, and `stalled_training_rule` rules monitor whether your model is optimizing the loss function without those training issues. If the rules detect training anomalies, the rule evaluation status changes to `IssueFound`. You can set up automated actions, such as sending notifications and stopping training jobs when training issues are found, using Amazon CloudWatch Events and AWS Lambda. For more information, see [Action on Amazon SageMaker Debugger rules](debugger-action-on-rules.md).
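To react to the `IssueFound` status programmatically, you can scan the rule evaluation summaries that the SageMaker SDK returns (for example, via `estimator.latest_training_job.rule_job_summary()`, a list of dictionaries). The following sketch is illustrative: the sample payload is made up, and the key names follow the `DescribeTrainingJob` rule evaluation status shape.

```python
# Illustrative sketch: scan rule evaluation summaries for rules whose
# status is "IssueFound". The sample payload below is made up for
# illustration; key names follow the DescribeTrainingJob rule
# evaluation status shape.
def rules_with_issues(rule_summaries):
    return [
        summary["RuleConfigurationName"]
        for summary in rule_summaries
        if summary.get("RuleEvaluationStatus") == "IssueFound"
    ]

sample_summaries = [
    {"RuleConfigurationName": "LossNotDecreasing", "RuleEvaluationStatus": "IssueFound"},
    {"RuleConfigurationName": "Overfit", "RuleEvaluationStatus": "InProgress"},
]
print(rules_with_issues(sample_summaries))  # ['LossNotDecreasing']
```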



# Use Debugger built-in rules with custom parameter values
<a name="debugger-built-in-rules-configuration-param-change"></a>

If you want to adjust the built-in rule parameter values and customize tensor collection regex, configure the `base_config` and `rule_parameters` parameters for the `ProfilerRule.sagemaker` and `Rule.sagemaker` class methods. For the `Rule.sagemaker` class method, you can also customize tensor collections through the `collections_to_save` parameter. For instructions on how to use the `CollectionConfig` class, see [Configure tensor collections using the `CollectionConfig` API](debugger-configure-tensor-collections.md). 

Use the following configuration template for built-in rules to customize parameter values. By changing the rule parameters as you want, you can adjust the sensitivity of the rules to be triggered. 
+ The `base_config` argument is where you call the built-in rule methods.
+ The `rule_parameters` argument adjusts the default key values of the built-in rules listed in [List of Debugger built-in rules](debugger-built-in-rules.md).
+ The `collections_to_save` argument takes in a tensor configuration through the `CollectionConfig` API, which requires `name` and `parameters` arguments. 
  + To find available tensor collections for `name`, see [ Debugger Built-in Tensor Collections ](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#built-in-collections). 
  + For a full list of adjustable `parameters`, see [ Debugger CollectionConfig API](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#configuring-collection-using-sagemaker-python-sdk).

For more information about the Debugger rule class, methods, and parameters, see [SageMaker AI Debugger Rule class](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

```
from sagemaker.debugger import Rule, ProfilerRule, rule_configs, CollectionConfig

rules=[
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule_name(),
        rule_parameters={
                "key": "value"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="tensor_collection_name", 
                parameters={
                    "key": "value"
                } 
            )
        ]
    )
]
```

The parameter descriptions and value customization examples are provided for each rule at [List of Debugger built-in rules](debugger-built-in-rules.md).
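Conceptually, the values you pass in `rule_parameters` override the rule's defaults key by key. The following sketch shows that merge behavior; the default values below are hypothetical placeholders, not the actual defaults of any built-in rule.

```python
# Conceptual sketch of how rule_parameters overrides work: custom
# values replace defaults key by key. The default values below are
# hypothetical placeholders, not actual built-in rule defaults.
def effective_rule_parameters(defaults, overrides):
    merged = dict(defaults)
    merged.update(overrides)
    return merged

hypothetical_defaults = {"threshold": "1000", "mode": "GLOBAL"}
print(effective_rule_parameters(hypothetical_defaults, {"threshold": "120"}))
# {'threshold': '120', 'mode': 'GLOBAL'}
```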

# Example notebooks and code samples to configure Debugger rules
<a name="debugger-built-in-rules-example"></a>

The following sections provide example notebooks and code samples that show how to use Debugger rules to monitor SageMaker training jobs.

**Topics**
+ [Debugger built-in rules example notebooks](#debugger-built-in-rules-notebook-example)
+ [Debugger built-in rules example code](#debugger-deploy-built-in-rules)
+ [Use Debugger built-in rules with parameter modifications](#debugger-deploy-modified-built-in-rules)

## Debugger built-in rules example notebooks
<a name="debugger-built-in-rules-notebook-example"></a>

The following example notebooks show how to use Debugger built-in rules when running training jobs with Amazon SageMaker AI: 
+ [Using a SageMaker Debugger built-in rule with TensorFlow](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow_builtin_rule)
+ [Using a SageMaker Debugger built-in rule with Managed Spot Training and MXNet](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mxnet_spot_training)
+ [Using a SageMaker Debugger built-in rule with parameter modifications for a real-time training job analysis with XGBoost](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/xgboost_realtime_analysis)

While running the example notebooks in SageMaker Studio, you can find the training job trial created on the **Studio Experiment List** tab. For example, as shown in the following screenshot, you can find and open a **Describe Trial Component** window for your current training job. On the **Debugger** tab, you can check whether the Debugger rules, `vanishing_gradient()` and `loss_not_decreasing()`, are monitoring the training session in parallel. For full instructions on how to find your training job trial components in the Studio UI, see [SageMaker Studio - View Experiments, Trials, and Trial Components](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-tasks.html#studio-tasks-experiments).

![\[An image of running a training job with Debugger built-in rules activated in SageMaker Studio\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-built-in-rule-studio.png)


There are two ways to use the Debugger built-in rules in the SageMaker AI environment: deploy the built-in rules as they are prepared, or adjust their parameters as you want. The following topics show you how to use the built-in rules with example code.

## Debugger built-in rules example code
<a name="debugger-deploy-built-in-rules"></a>

The following code sample shows how to set the Debugger built-in rules using the `Rule.sagemaker` method. To specify built-in rules that you want to run, use the `rule_configs` API operation to call the built-in rules. To find a full list of Debugger built-in rules and default parameter values, see [List of Debugger built-in rules](debugger-built-in-rules.md).

```
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

# call built-in rules that you want to use.
built_in_rules=[ 
            Rule.sagemaker(rule_configs.vanishing_gradient()),
            Rule.sagemaker(rule_configs.loss_not_decreasing())
]

# construct a SageMaker AI estimator with the Debugger built-in rules
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-built-in-rules-demo',
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",

    # debugger-specific arguments below
    rules=built_in_rules
)
sagemaker_estimator.fit()
```

**Note**  
The Debugger built-in rules run in parallel with your training job. The maximum number of built-in rule containers for a training job is 20. 
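Because of that 20-container limit, you may want to guard your rule list before calling `fit()`. A minimal sketch follows; the helper is ours, not part of the SageMaker SDK.

```python
# Minimal guard (not part of the SageMaker SDK): a training job can run
# at most 20 built-in rule containers, so fail fast if the list is too long.
MAX_RULE_CONTAINERS = 20

def validate_rule_count(rules):
    if len(rules) > MAX_RULE_CONTAINERS:
        raise ValueError(
            f"{len(rules)} rules requested; a training job supports "
            f"at most {MAX_RULE_CONTAINERS} built-in rule containers."
        )
    return rules

validate_rule_count(["rule"] * 20)   # passes
# validate_rule_count(["rule"] * 21) would raise ValueError
```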

For more information about the Debugger rule class, methods, and parameters, see the [SageMaker Debugger Rule class](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). 

To find an example of how to adjust the Debugger rule parameters, see the following [Use Debugger built-in rules with parameter modifications](#debugger-deploy-modified-built-in-rules) section.

## Use Debugger built-in rules with parameter modifications
<a name="debugger-deploy-modified-built-in-rules"></a>

The following code example shows the structure of built-in rules with adjusted parameters. In this example, the `stalled_training_rule` collects the `losses` tensor collection from the training stage at every 50 steps and from the evaluation stage at every 10 steps. If the training process stalls and stops collecting tensor outputs for 120 seconds, the `stalled_training_rule` stops the training job. 

```
import time

import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import Rule, CollectionConfig, rule_configs

# call the built-in rules and modify the CollectionConfig parameters

base_job_name_prefix= 'smdebug-stalled-demo-' + str(int(time.time()))

built_in_rules_modified=[
    Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
                'threshold': '120',
                'training_job_name_prefix': base_job_name_prefix,
                'stop_training_on_fire' : 'True'
        },
        collections_to_save=[ 
            CollectionConfig(
                name="losses", 
                parameters={
                      "train.save_interval": "50",
                      "eval.save_interval": "10"
                } 
            )
        ]
    )
]

# construct a SageMaker AI estimator with the modified Debugger built-in rule
sagemaker_estimator=TensorFlow(
    entry_point='directory/to/your_training_script.py',
    role=sagemaker.get_execution_role(),
    base_job_name=base_job_name_prefix,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.9.0",
    py_version="py39",

    # debugger-specific arguments below
    rules=built_in_rules_modified
)
sagemaker_estimator.fit()
```

For an advanced configuration of the Debugger built-in rules using the `CreateTrainingJob` API, see [Configure Debugger using SageMaker API](debugger-createtrainingjob-api.md).

# Turn off Debugger
<a name="debugger-turn-off"></a>

If you want to completely turn off Debugger, do one of the following:
+ Before starting a training job, do the following:

  To stop both monitoring and profiling, add the `disable_profiler` parameter to your estimator and set it to `True`.
**Warning**  
If you disable it, you won't be able to view the comprehensive Studio Debugger insights dashboard and the autogenerated profiling report.

  To stop debugging, set the `debugger_hook_config` parameter to `False`.
**Warning**  
If you disable it, you won't be able to collect output tensors or debug your model parameters.

  ```
  estimator=Estimator(
      ...
      disable_profiler=True,
      debugger_hook_config=False
  )
  ```

  For more information about the Debugger-specific parameters, see [SageMaker AI Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).
+ While a training job is running, do the following:

  To disable both monitoring and profiling while your training job is running, use the following estimator classmethod:

  ```
  estimator.disable_profiling()
  ```

  To disable framework profiling only and keep system monitoring, use the `update_profiler` method:

  ```
  estimator.update_profiler(disable_framework_metrics=True)
  ```

  For more information about the estimator extension methods, see the [estimator.disable_profiling](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.disable_profiling) and [estimator.update_profiler](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator.update_profiler) class methods in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) documentation.

# Useful SageMaker AI estimator class methods for Debugger
<a name="debugger-estimator-classmethods"></a>

The following estimator class methods are useful for accessing your SageMaker training job information and retrieving output paths of training data collected by Debugger. The following methods are executable after you initiate a training job with the `estimator.fit()` method.
+ To check the base S3 bucket URI of a SageMaker training job:

  ```
  estimator.output_path
  ```
+ To check the base job name of a SageMaker training job:

  ```
  estimator.latest_training_job.job_name
  ```
+ To see a full `CreateTrainingJob` API operation configuration of a SageMaker training job:

  ```
  estimator.latest_training_job.describe()
  ```
+ To check a full list of the Debugger rules while a SageMaker training job is running:

  ```
  estimator.latest_training_job.rule_job_summary()
  ```
+ To check the S3 bucket URI where the model parameter data (output tensors) are saved:

  ```
  estimator.latest_job_debugger_artifacts_path()
  ```
+ To check the S3 bucket URI where the model performance data (system and framework metrics) are saved:

  ```
  estimator.latest_job_profiler_artifacts_path()
  ```
+ To check the Debugger rule configuration for debugging output tensors:

  ```
  estimator.debugger_rule_configs
  ```
+ To check the list of the Debugger rules for debugging while a SageMaker training job is running:

  ```
  estimator.debugger_rules
  ```
+ To check the Debugger rule configuration for monitoring and profiling system and framework metrics:

  ```
  estimator.profiler_rule_configs
  ```
+ To check the list of the Debugger rules for monitoring and profiling while a SageMaker training job is running:

  ```
  estimator.profiler_rules
  ```

For more information about the SageMaker AI estimator class and its methods, see [Estimator API](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html#sagemaker.estimator.Estimator) in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).
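The `estimator.latest_training_job.describe()` call above returns a plain dictionary, so you can pull individual fields out of it. The following sketch is illustrative: the sample payload is made up, and the key names follow the `DescribeTrainingJob` response shape.

```python
# Illustrative sketch: pull the Debugger hook S3 output path out of a
# DescribeTrainingJob-style response dict. The sample payload is made up;
# key names follow the DescribeTrainingJob response shape.
def debug_hook_s3_output_path(describe_response):
    return describe_response.get("DebugHookConfig", {}).get("S3OutputPath")

sample_response = {
    "TrainingJobName": "debugger-demo-job-2023-01-01-00-00-00-000",
    "DebugHookConfig": {"S3OutputPath": "s3://sagemaker-us-east-1-111122223333/demo"},
}
print(debug_hook_s3_output_path(sample_response))
# s3://sagemaker-us-east-1-111122223333/demo
```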

# SageMaker Debugger interactive report for XGBoost
<a name="debugger-report-xgboost"></a>

Receive training reports autogenerated by Debugger. The Debugger reports provide insights into your training jobs and suggest recommendations to improve your model performance. For SageMaker AI XGBoost training jobs, use the Debugger [CreateXgboostReport](debugger-built-in-rules.md#create-xgboost-report) rule to receive a comprehensive training report of the training progress and results. Following this guide, specify the [CreateXgboostReport](debugger-built-in-rules.md#create-xgboost-report) rule while constructing an XGBoost estimator, download the report using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) or the Amazon S3 console, and gain insights into the training results.

**Note**  
You can download a Debugger report while your training job is running or after the job has finished. During training, Debugger concurrently updates the report to reflect the current evaluation status of the rules. You can download a complete Debugger report only after the training job has completed.

**Important**  
In the report, plots and recommendations are provided for informational purposes and are not definitive. You are responsible for making your own independent assessment of the information.

**Topics**
+ [Construct a SageMaker AI XGBoost estimator with the Debugger XGBoost Report rule](debugger-training-xgboost-report-estimator.md)
+ [Download the Debugger XGBoost training report](debugger-training-xgboost-report-download.md)
+ [Debugger XGBoost training report walkthrough](debugger-training-xgboost-report-walkthrough.md)

# Construct a SageMaker AI XGBoost estimator with the Debugger XGBoost Report rule
<a name="debugger-training-xgboost-report-estimator"></a>

The [CreateXgboostReport](debugger-built-in-rules.md#create-xgboost-report) rule collects the following output tensors from your training job: 
+ `hyperparameters` – Saves at the first step.
+ `metrics` – Saves loss and accuracy every 5 steps.
+ `feature_importance` – Saves every 5 steps.
+ `predictions` – Saves every 5 steps.
+ `labels` – Saves every 5 steps.

The output tensors are saved at a default S3 bucket. For example, `s3://sagemaker-<region>-<12digit_account_id>/<base-job-name>/debug-output/`.

When you construct a SageMaker AI estimator for an XGBoost training job, specify the rule as shown in the following example code.

------
#### [ Using the SageMaker AI generic estimator ]

```
import boto3
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker import image_uris
from sagemaker.debugger import Rule, rule_configs

rules=[
    Rule.sagemaker(rule_configs.create_xgboost_report())
]

region = boto3.Session().region_name
xgboost_container=sagemaker.image_uris.retrieve("xgboost", region, "1.2-1")

estimator=Estimator(
    role=sagemaker.get_execution_role(),
    image_uri=xgboost_container,
    base_job_name="debugger-xgboost-report-demo",
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    
    # Add the Debugger XGBoost report rule
    rules=rules
)

estimator.fit(wait=False)
```

------

# Download the Debugger XGBoost training report
<a name="debugger-training-xgboost-report-download"></a>

Download the Debugger XGBoost training report while your training job is running or after the job has finished using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) and AWS Command Line Interface (CLI).

------
#### [ Download using the SageMaker Python SDK and AWS CLI ]

1. Check the current job's default S3 output base URI.

   ```
   estimator.output_path
   ```

1. Check the current job name.

   ```
   estimator.latest_training_job.job_name
   ```

1. The Debugger XGBoost report is stored under `<default-s3-output-base-uri>/<training-job-name>/rule-output`. Configure the rule output path as follows:

   ```
   rule_output_path = estimator.output_path + "/" + estimator.latest_training_job.job_name + "/rule-output"
   ```

1. To check if the report is generated, list directories and files recursively under the `rule_output_path` using `aws s3 ls` with the `--recursive` option.

   ```
   ! aws s3 ls {rule_output_path} --recursive
   ```

   This should return a complete list of files under autogenerated folders that are named `CreateXgboostReport` and `ProfilerReport-1234567890`. The XGBoost training report is stored in the `CreateXgboostReport` folder, and the profiling report is stored in the `ProfilerReport-1234567890` folder. To learn more about the profiling report generated by default with the XGBoost training job, see [SageMaker Debugger interactive report](debugger-profiling-report.md).  
![\[An example of rule output.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-xgboost-report-ls.png)

   The `xgboost_report.html` file is the XGBoost training report autogenerated by Debugger, and `xgboost_report.ipynb` is a Jupyter notebook that's used to aggregate training results into the report. You can download all of the files, browse the HTML report file, and modify the report using the notebook.

1. Download the files recursively using `aws s3 cp`. The following command downloads all of the rule output files, including the `CreateXgboostReport` and `ProfilerReport-1234567890` folders, to the current working directory.

   ```
   ! aws s3 cp {rule_output_path} ./ --recursive
   ```
**Tip**  
If you are using a Jupyter notebook server, run `!pwd` to verify the current working directory.

1. Under the `/CreateXgboostReport` directory, open `xgboost_report.html`. If you are using JupyterLab, choose **Trust HTML** to see the autogenerated Debugger training report.  
![\[An example of rule output.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-xgboost-report-open-trust.png)

1. Open the `xgboost_report.ipynb` file to explore how the report is generated. You can customize and extend the training report using the Jupyter notebook file.
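Step 3 above boils down to joining the estimator's output path, the job name, and the `rule-output` suffix. The following sketch shows that as a plain function; the helper is ours, and the bucket and job names are hypothetical placeholders.

```python
# Sketch of step 3 above as a plain function: the Debugger rule output
# lands under <default-s3-output-base-uri>/<training-job-name>/rule-output.
# The bucket and job name below are hypothetical placeholders.
def rule_output_path(output_path, job_name):
    return f"{output_path.rstrip('/')}/{job_name}/rule-output"

print(rule_output_path(
    "s3://sagemaker-us-east-1-111122223333",
    "debugger-xgboost-report-demo-2023-01-01-00-00-00-000",
))
# s3://sagemaker-us-east-1-111122223333/debugger-xgboost-report-demo-2023-01-01-00-00-00-000/rule-output
```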

------
#### [ Download using the Amazon S3 console ]

1. Sign in to the AWS Management Console and open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

1. Search for the base S3 bucket. For example, if you haven't specified any base job name, the base S3 bucket name should be in the following format: `sagemaker-<region>-111122223333`. Look up the base S3 bucket through the **Find bucket by name** field.  
![\[The Find bucket by name field in the Amazon S3 console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-0.png)

1. In the base S3 bucket, look up the training job name by entering your job name prefix in **Find objects by prefix** and then choosing the training job name.  
![\[The Find objects by prefix field in the Amazon S3 console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-1.png)

1. In the training job's S3 bucket, choose the **rule-output/** subfolder. There are three subfolders for training data collected by Debugger: **debug-output/**, **profiler-output/**, and **rule-output/**.   
![\[An example to the rule output S3 bucket URI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-report-download-s3console-2.png)

1. In the **rule-output/** folder, choose the **CreateXgboostReport/** folder. The folder contains **xgboost_report.html** (the autogenerated report in HTML) and **xgboost_report.ipynb** (a Jupyter notebook with the scripts used to generate the report).

1. Choose the **xgboost_report.html** file, choose **Download actions**, and then choose **Download**.  
![\[An example to the rule output S3 bucket URI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-xgboost-report-s3-download.png)

1. Open the downloaded **xgboost_report.html** file in a web browser.

------

# Debugger XGBoost training report walkthrough
<a name="debugger-training-xgboost-report-walkthrough"></a>

This section walks you through the Debugger XGBoost training report. The report is automatically aggregated based on the output tensor regex, which determines whether your training job is a binary classification, multiclass classification, or regression job.

**Important**  
In the report, plots and recommendations are provided for informational purposes and are not definitive. You are responsible for making your own independent assessment of the information.

**Topics**
+ [Distribution of true labels of the dataset](#debugger-training-xgboost-report-walkthrough-dist-label)
+ [Loss versus step graph](#debugger-training-xgboost-report-walkthrough-loss-vs-step)
+ [Feature importance](#debugger-training-xgboost-report-walkthrough-feature-importance)
+ [Confusion matrix](#debugger-training-xgboost-report-walkthrough-confusion-matrix)
+ [Evaluation of the confusion matrix](#debugger-training-xgboost-report-walkthrough-eval-conf-matrix)
+ [Accuracy rate of each diagonal element over iteration](#debugger-training-xgboost-report-walkthrough-accuracy-rate)
+ [Receiver operating characteristic curve](#debugger-training-xgboost-report-walkthrough-rec-op-char)
+ [Distribution of residuals at the last saved step](#debugger-training-xgboost-report-walkthrough-dist-residual)
+ [Absolute validation error per label bin over iteration](#debugger-training-xgboost-report-walkthrough-val-error-per-label-bin)

## Distribution of true labels of the dataset
<a name="debugger-training-xgboost-report-walkthrough-dist-label"></a>

This histogram shows the distribution of labeled classes (for classification) or values (for regression) in your original dataset. Skewness in your dataset could contribute to inaccuracies. This visualization is available for the following model types: binary classification, multiclassification, and regression.

![\[An example of a distribution of true labels of the dataset graph.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-dist-label.png)
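As a quick pre-training check, you can approximate what this histogram summarizes with a small helper that computes per-class fractions and flags a dominant class. The `label_skew` function and the 0.8 threshold below are illustrative assumptions, not part of the report:

```python
from collections import Counter

def label_skew(labels, threshold=0.8):
    """Return per-class fractions and a flag indicating whether any
    single class makes up at least `threshold` of the dataset."""
    counts = Counter(labels)
    total = len(labels)
    fractions = {cls: n / total for cls, n in counts.items()}
    dominant = max(fractions.values())
    return fractions, dominant >= threshold

# 80% of the labels belong to class 0, so this dataset is flagged as skewed
fractions, skewed = label_skew([0, 0, 0, 0, 1])
```

A skewed result suggests rebalancing the dataset (resampling or class weights) before retraining.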


## Loss versus step graph
<a name="debugger-training-xgboost-report-walkthrough-loss-vs-step"></a>

This is a line chart that shows the progression of loss on training data and validation data throughout training steps. The loss is what you defined in your objective function, such as mean squared error. You can gauge whether the model is overfitting or underfitting from this plot. This section also provides insights that you can use to determine how to resolve overfitting and underfitting problems. This visualization is available for the following model types: binary classification, multiclassification, and regression. 

![\[An example of a loss versus step graph.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-loss-vs-step.png)


## Feature importance
<a name="debugger-training-xgboost-report-walkthrough-feature-importance"></a>

There are three types of feature importance visualizations: Weight, Gain, and Coverage. The report provides detailed definitions for each of the three. Feature importance visualizations help you learn what features in your training dataset contributed to the predictions. Feature importance visualizations are available for the following model types: binary classification, multiclassification, and regression. 

![\[An example of a feature importance graph.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-feature-importance.png)


## Confusion matrix
<a name="debugger-training-xgboost-report-walkthrough-confusion-matrix"></a>

This visualization is only applicable to binary and multiclass classification models. Accuracy alone might not be sufficient for evaluating the model performance. For some use cases, such as healthcare and fraud detection, it’s also important to know the false positive rate and false negative rate. A confusion matrix gives you the additional dimensions for evaluating your model performance.

![\[An example of confusion matrix.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-confusion-matrix.png)
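To make the matrix concrete, here is a minimal sketch of how such a matrix is built from true and predicted labels. The `confusion_matrix` helper below is illustrative only, not part of Debugger:

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[i][j] counts samples whose true class is i and whose
    predicted class is j; diagonal entries are correct predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# One class-0 sample was misclassified as class 1 (the off-diagonal entry)
m = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
# m == [[1, 1], [0, 2]]
```

Off-diagonal entries correspond to the false positives and false negatives discussed above.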


## Evaluation of the confusion matrix
<a name="debugger-training-xgboost-report-walkthrough-eval-conf-matrix"></a>

This section provides you with more insights on the micro, macro, and weighted metrics on precision, recall, and F1-score for your model.

![\[Evaluation of the confusion matrix.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-eval-conf-matrix.png)
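As a toy illustration of how these averages differ, the hypothetical helper below derives per-class precision, recall, and F1 from a confusion matrix and macro-averages F1. Macro averaging weights every class equally, while micro averaging pools all individual decisions, so frequent classes dominate it:

```python
def per_class_prf(matrix):
    """Precision, recall, and F1 for each class, given a confusion
    matrix with rows = true class and columns = predicted class."""
    n = len(matrix)
    stats = []
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp  # predicted c, wrong
        fn = sum(matrix[c]) - tp                       # true c, missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        stats.append((precision, recall, f1))
    return stats

stats = per_class_prf([[8, 2], [1, 9]])
# Macro F1: unweighted mean of the per-class F1 scores
macro_f1 = sum(f1 for _, _, f1 in stats) / len(stats)
```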


## Accuracy rate of each diagonal element over iteration
<a name="debugger-training-xgboost-report-walkthrough-accuracy-rate"></a>

This visualization is only applicable to binary classification and multiclass classification models. This is a line chart that plots the diagonal values in the confusion matrix throughout the training steps for each class. This plot shows you how the accuracy of each class progresses throughout the training steps. You can identify the under-performing classes from this plot. 

![\[An example of an accuracy rate of each diagonal element over iteration graph.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-accuracy-rate.gif)


## Receiver operating characteristic curve
<a name="debugger-training-xgboost-report-walkthrough-rec-op-char"></a>

This visualization is only applicable to binary classification models. The receiver operating characteristic (ROC) curve is commonly used to evaluate binary classification model performance. The y-axis of the curve is the true positive rate (TPR) and the x-axis is the false positive rate (FPR). The plot also displays the value for the area under the curve (AUC). The higher the AUC value, the more predictive your classifier. You can also use the ROC curve to understand the trade-off between TPR and FPR and identify the optimum classification threshold for your use case. The classification threshold can be adjusted to tune the behavior of the model to reduce more of one type of error or the other (false positives or false negatives).

![\[An example a receiver operating characteristic curve graph.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-rec-op-char.png)
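For intuition about the AUC value shown on the plot, here is a minimal sketch of the trapezoidal computation over (FPR, TPR) points. The `auc_from_roc` helper is illustrative only:

```python
def auc_from_roc(points):
    """Area under an ROC curve given (FPR, TPR) points sorted by FPR,
    using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# A perfect classifier passes through (0, 1), giving AUC == 1.0;
# the chance diagonal from (0, 0) to (1, 1) gives AUC == 0.5.
auc = auc_from_roc([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
```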


## Distribution of residuals at the last saved step
<a name="debugger-training-xgboost-report-walkthrough-dist-residual"></a>

This visualization is a column chart that shows the residual distribution at the last step Debugger captures. In this visualization, you can check whether the residual distribution is close to a normal distribution centered at zero. If the residuals are skewed, your features may not be sufficient for predicting the labels. 

![\[An example of a distribution of residuals at the last saved step graph.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-dist-residual.png)


## Absolute validation error per label bin over iteration
<a name="debugger-training-xgboost-report-walkthrough-val-error-per-label-bin"></a>

This visualization is only applicable to regression models. The actual target values are split into 10 intervals. This visualization shows, in line plots, how validation errors progress for each interval throughout the training steps. Absolute validation error is the absolute value of the difference between the prediction and the actual value during validation. You can identify the underperforming intervals from this visualization.

![\[An example an absolute validation error per label bin over iteration graph.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-training-xgboost-report-walkthrough-val-error-per-label-bin.png)
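As a rough sketch of the underlying computation, the hypothetical helper below bins true target values into equal-width intervals and reports the mean absolute error per bin. The report's actual binning may differ:

```python
def error_per_bin(y_true, y_pred, num_bins=10):
    """Mean absolute prediction error grouped into equal-width bins of
    the true target value; empty bins yield None."""
    lo, hi = min(y_true), max(y_true)
    width = (hi - lo) / num_bins or 1.0  # guard against a zero-width range
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for t, p in zip(y_true, y_pred):
        i = min(int((t - lo) / width), num_bins - 1)  # clamp the max value
        sums[i] += abs(t - p)
        counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Each of the two bins here has a mean absolute error of 1.0
errors = error_per_bin([0, 5, 10], [1, 5, 8], num_bins=2)
```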


# Action on Amazon SageMaker Debugger rules
<a name="debugger-action-on-rules"></a>

Based on the Debugger rule evaluation status, you can set up automated actions such as stopping a training job and sending notifications using Amazon Simple Notification Service (Amazon SNS). You can also create your own actions using Amazon CloudWatch Events and AWS Lambda. To learn how to set up automated actions based on the Debugger rule evaluation status, see the following topics.

**Topics**
+ [Use Debugger built-in actions for rules](debugger-built-in-actions.md)
+ [Actions on rules using Amazon CloudWatch and AWS Lambda](debugger-cloudwatch-lambda.md)

# Use Debugger built-in actions for rules
<a name="debugger-built-in-actions"></a>

Use Debugger built-in actions to respond to issues found by [Debugger rules](debugger-built-in-rules.md#debugger-built-in-rules-Rule). The Debugger `rule_configs` class provides tools to configure a list of actions, including automatically stopping training jobs and sending notifications using Amazon Simple Notification Service (Amazon SNS) when the Debugger rules find training issues. The following topics take you through the steps to accomplish these tasks.

**Topics**
+ [Set up Amazon SNS, create an `SMDebugRules` topic, and subscribe to the topic](#debugger-built-in-actions-sns)
+ [Set up your IAM role to attach required policies](#debugger-built-in-actions-iam)
+ [Configure Debugger rules with the built-in actions](#debugger-built-in-actions-on-rule)
+ [Considerations for using the Debugger built-in actions](#debugger-built-in-actions-considerations)

## Set up Amazon SNS, create an `SMDebugRules` topic, and subscribe to the topic
<a name="debugger-built-in-actions-sns"></a>

This section walks you through how to set up an Amazon SNS **SMDebugRules** topic, subscribe to it, and confirm the subscription to receive notifications from the Debugger rules.

**Note**  
For more information about billing for Amazon SNS, see [Amazon SNS pricing](https://aws.amazon.com/sns/pricing/) and [Amazon SNS FAQs](https://aws.amazon.com/sns/faqs/).

**To create a SMDebugRules topic**

1. Sign in to the AWS Management Console and open the Amazon SNS console at [https://console.aws.amazon.com/sns/v3/home](https://console.aws.amazon.com/sns/v3/home).

1. In the left navigation pane, choose **Topics**. 

1. On the **Topics** page, choose **Create topic**.

1. On the **Create topic** page, in the **Details** section, do the following:

   1. For **Type**, choose **Standard** for topic type.

   1. In **Name**, enter **SMDebugRules**.

1. Skip all other optional settings and choose **Create topic**. If you want to learn more about the optional settings, see [Creating an Amazon SNS topic](https://docs.aws.amazon.com/sns/latest/dg/sns-create-topic.html).

**To subscribe to the SMDebugRules topic**

1. Open the Amazon SNS console at [https://console.aws.amazon.com/sns/v3/home](https://console.aws.amazon.com/sns/v3/home).

1. In the left navigation pane, choose **Subscriptions**. 

1. On the **Subscriptions** page, choose **Create subscription**.

1. On the **Create subscription** page, in the **Details** section, do the following: 

   1. For **Topic ARN**, choose the **SMDebugRules** topic ARN. The ARN should be in the following format: `arn:aws:sns:<region-id>:111122223333:SMDebugRules`.

   1. For **Protocol**, choose **Email** or **SMS**. 

   1. For **Endpoint**, enter the endpoint value, such as an email address or a phone number that you want to receive notifications.
**Note**  
Make sure you type the correct email address and phone number. Phone numbers must include `+`, a country code, and the phone number, with no special characters or spaces. For example, the phone number +1 (222) 333-4444 is formatted as **+12223334444**.

1. Skip all other optional settings and choose **Create subscription**. If you want to learn more about the optional settings, see [Subscribing to an Amazon SNS topic](https://docs.aws.amazon.com/sns/latest/dg/sns-create-subscribe-endpoint-to-topic.html).

After you subscribe to the **SMDebugRules** topic, you receive the following confirmation message in email or by phone:

![\[A subscription confirmation email message for the Amazon SNS SMDebugRules topic.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-built-in-action-subscription-confirmation.png)


For more information about Amazon SNS, see [Mobile text messaging (SMS)](https://docs.aws.amazon.com/sns/latest/dg/sns-mobile-phone-number-as-subscriber.html) and [Email notifications](https://docs.aws.amazon.com/sns/latest/dg/sns-email-notifications.html) in the *Amazon SNS Developer Guide*.
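If you script your subscriptions, a small helper can normalize a human-readable phone number into the format described in the note above. The `to_e164` function is a hypothetical convenience, not part of any AWS SDK:

```python
import re

def to_e164(raw):
    """Strip spaces and punctuation so a number like '+1 (222) 333-4444'
    becomes '+12223334444', the shape Amazon SNS expects."""
    digits = re.sub(r"\D", "", raw)  # keep only the digits
    return "+" + digits

normalized = to_e164("+1 (222) 333-4444")
# normalized == "+12223334444"
```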

## Set up your IAM role to attach required policies
<a name="debugger-built-in-actions-iam"></a>

In this step, you add the required policies to your IAM role.

**To add the required policies to your IAM role**

1. Sign in to the AWS Management Console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. In the left navigation pane, choose **Policies**, and choose **Create policy**.

1. On the **Create policy** page, do the following to create a new sns-access policy:

   1. Choose the **JSON** tab.

   1. Paste the following JSON policy document into the JSON editor, replacing the 12-digit AWS account ID in the `Resource` value with your AWS account ID.

------
#### [ JSON ]

****  

      ```
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "VisualEditor0",
                  "Effect": "Allow",
                  "Action": [
                      "sns:Publish",
                      "sns:CreateTopic",
                      "sns:Subscribe"
                  ],
                  "Resource": "arn:aws:sns:*:111122223333:SMDebugRules"
              }
          ]
      }
      ```

------

   1. At the bottom of the page, choose **Review policy**.

   1. On the **Review policy** page, for **Name**, enter **sns-access**.

   1. At the bottom of the page, choose **Create policy**.

1. Go back to the IAM console, and choose **Roles** in the left navigation pane.

1. Look up the IAM role that you use for SageMaker AI model training and choose that IAM role.

1. On the **Permissions** tab of the **Summary** page, choose **Attach policies**.

1. Search for the **sns-access** policy, select the check box next to the policy, and then choose **Attach policy**.

For more examples of setting up IAM policies for Amazon SNS, see [Example cases for Amazon SNS access control](https://docs.aws.amazon.com/sns/latest/dg/sns-access-policy-use-cases.html).
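If you prefer to generate the policy document instead of editing it by hand, a hypothetical sketch like the following renders it for a given account ID. The template mirrors the JSON shown above:

```python
import json

# Template for the sns-access policy; {account_id} is filled in below
POLICY_TEMPLATE = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": ["sns:Publish", "sns:CreateTopic", "sns:Subscribe"],
            "Resource": "arn:aws:sns:*:{account_id}:SMDebugRules",
        }
    ],
}

def render_policy(account_id):
    """Return the sns-access policy JSON with the account ID filled in."""
    policy = json.loads(json.dumps(POLICY_TEMPLATE))  # deep copy
    resource = policy["Statement"][0]["Resource"]
    policy["Statement"][0]["Resource"] = resource.format(account_id=account_id)
    return json.dumps(policy, indent=4)
```

You can then paste the output of `render_policy("111122223333")` into the JSON editor.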

## Configure Debugger rules with the built-in actions
<a name="debugger-built-in-actions-on-rule"></a>

After successfully finishing the required settings in the preceding steps, you can configure the Debugger built-in actions for debugging rules as shown in the following example script. You can choose which built-in actions to use while building the `actions` list object. `rule_configs` is a helper module that provides high-level tools to configure Debugger built-in rules and actions. The following built-in actions are available for Debugger:
+ `rule_configs.StopTraining()` – Stops a training job when the Debugger rule finds an issue.
+ `rule_configs.Email("abc@abc.com")` – Sends a notification via email when the Debugger rule finds an issue. Use the email address that you used when you set up your SNS topic subscription.
+ `rule_configs.SMS("+1234567890")` – Sends a notification via text message when the Debugger rule finds an issue. Use the phone number that you used when you set up your SNS topic subscription.
**Note**  
Make sure you type the correct email address and phone number. Phone numbers must include `+`, a country code, and the phone number, with no special characters or spaces. For example, the phone number +1 (222) 333-4444 is formatted as **+12223334444**.

You can use all of the built-in actions or a subset of them by wrapping them with the `rule_configs.ActionList()` method, which takes the built-in actions and configures them as a list of actions.

**To add all three built-in actions to a single rule**

If you want to assign all three built-in actions to a single rule, configure a Debugger built-in action list while constructing an estimator. Use the following template to construct the estimator, and Debugger will stop training jobs and send notifications through email and text for any rules that you use to monitor your training job progress.

```
from sagemaker.estimator import Estimator
from sagemaker.debugger import Rule, rule_configs

# Configure an action list object for Debugger rules
actions = rule_configs.ActionList(
    rule_configs.StopTraining(), 
    rule_configs.Email("abc@abc.com"), 
    rule_configs.SMS("+1234567890")
)

# Configure rules for debugging with the actions parameter
rules = [
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule(),         # Required
        rule_parameters={"parameter_key": value},         # Optional
        actions=actions
    )
]

estimator = Estimator(
    ...
    rules = rules
)

estimator.fit(wait=False)
```

**To create multiple built-in action objects to assign different actions to a single rule**

If you want to assign the built-in actions to be triggered at different threshold values of a single rule, you can create multiple built-in action objects as shown in the following script. To avoid a conflict error from running the same rule twice, you must submit different rule job names (specify different strings for the rules' `name` attribute), as shown in the following example script template. This example shows how to set up [StalledTrainingRule](debugger-built-in-rules.md#stalled-training) to take two different actions: send an email to `abc@abc.com` when a training job stalls for 60 seconds, and stop the training job if it stalls for 120 seconds.

```
from sagemaker.estimator import Estimator
from sagemaker.debugger import Rule, rule_configs
import time

base_job_name_prefix= 'smdebug-stalled-demo-' + str(int(time.time()))

# Configure an action object for StopTraining
action_stop_training = rule_configs.ActionList(
    rule_configs.StopTraining()
)

# Configure an action object for Email
action_email = rule_configs.ActionList(
    rule_configs.Email("abc@abc.com")
)

# Configure a rule with the Email built-in action to trigger if a training job stalls for 60 seconds
stalled_training_job_rule_email = Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
                "threshold": "60", 
                "training_job_name_prefix": base_job_name_prefix
        },
        actions=action_email
)
stalled_training_job_rule_email.name="StalledTrainingJobRuleEmail"

# Configure a rule with the StopTraining built-in action to trigger if a training job stalls for 120 seconds
stalled_training_job_rule = Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
                "threshold": "120", 
                "training_job_name_prefix": base_job_name_prefix
        },
        actions=action_stop_training
)
stalled_training_job_rule.name="StalledTrainingJobRuleStopTraining"

estimator = Estimator(
    ...
    rules = [stalled_training_job_rule_email, stalled_training_job_rule]
)

estimator.fit(wait=False)
```

While the training job is running, the Debugger built-in action sends notification emails and text messages whenever the rule finds issues with your training job. The following screenshot shows an example of email notification for a training job that has a stalled training job issue. 

![\[An example email notification sent by Debugger when it detects a StalledTraining issue.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-built-in-action-email.png)


The following screenshot shows an example text notification that Debugger sends when the rule finds a StalledTraining issue.

![\[An example text notification sent by Debugger when it detects a StalledTraining issue.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-built-in-action-text.png)


## Considerations for using the Debugger built-in actions
<a name="debugger-built-in-actions-considerations"></a>
+ To use the Debugger built-in actions, an internet connection is required. This feature is not supported in the network isolation mode provided by Amazon SageMaker AI or Amazon VPC.
+ The built-in actions cannot be used for [Profiler rules](debugger-built-in-profiler-rules.md#debugger-built-in-profiler-rules-ProfilerRule).
+ The built-in actions cannot be used on training jobs with spot training interruptions.
+ In email or text notifications, `None` appears at the end of messages. This does not have any meaning, so you can disregard the text `None`.

# Actions on rules using Amazon CloudWatch and AWS Lambda
<a name="debugger-cloudwatch-lambda"></a>

Amazon CloudWatch collects Amazon SageMaker AI model training job logs and Amazon SageMaker Debugger rule processing job logs. Configure Debugger with Amazon CloudWatch Events and AWS Lambda to take action based on Debugger rule evaluation status. 

## Example notebooks
<a name="debugger-test-stop-training"></a>

You can run the following example notebooks, which demonstrate how to stop a training job by taking actions on Debugger built-in rules using Amazon CloudWatch and AWS Lambda.
+ [Amazon SageMaker Debugger - Reacting to CloudWatch Events from Rules](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_action_on_rule/tf-mnist-stop-training-job.html)

  This example notebook runs a training job that has a vanishing gradient issue. The Debugger [VanishingGradient](debugger-built-in-rules.md#vanishing-gradient) built-in rule is used while constructing the SageMaker AI TensorFlow estimator. When the Debugger rule detects the issue, the training job is terminated.
+ [Detect Stalled Training and Invoke Actions Using SageMaker Debugger Rule](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_action_on_rule/detect_stalled_training_job_and_actions.html)

  This example notebook runs a training script with a code line that forces it to sleep for 10 minutes. The Debugger [StalledTrainingRule](debugger-built-in-rules.md#stalled-training) built-in rule detects the stalled training issue and stops the training job.

**Topics**
+ [Example notebooks](#debugger-test-stop-training)
+ [Access CloudWatch logs for Debugger rules and training jobs](debugger-cloudwatch-metric.md)
+ [Set up Debugger for automated training job termination using CloudWatch and Lambda](debugger-stop-training.md)
+ [Disable the CloudWatch Events rule to stop using the automated training job termination](debugger-disable-cw.md)

# Access CloudWatch logs for Debugger rules and training jobs
<a name="debugger-cloudwatch-metric"></a>

You can use the training and Debugger rule job status in the CloudWatch logs to take further actions when there are training issues. The following procedure shows how to access the related CloudWatch logs. For more information about monitoring training jobs using CloudWatch, see [Monitor Amazon SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-overview.html).

**To access training job logs and Debugger rule job logs**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the left navigation pane under the **Log** node, choose **Log Groups**.

1. In the log groups list, do the following:
   + Choose **/aws/sagemaker/TrainingJobs** for training job logs.
   + Choose **/aws/sagemaker/ProcessingJobs** for Debugger rule job logs.

# Set up Debugger for automated training job termination using CloudWatch and Lambda
<a name="debugger-stop-training"></a>

The Debugger rules monitor training job status, and a CloudWatch Events rule watches the Debugger rule training job evaluation status. The following sections outline the process needed to automate training job termination using CloudWatch and Lambda.

**Topics**
+ [Step 1: Create a Lambda function](#debugger-lambda-function-create)
+ [Step 2: Configure the Lambda function](#debugger-lambda-function-configure)
+ [Step 3: Create a CloudWatch events rule and link to the Lambda function for Debugger](#debugger-cloudwatch-events)

## Step 1: Create a Lambda function
<a name="debugger-lambda-function-create"></a>

**To create a Lambda function**

1. Open the AWS Lambda console at [https://console.aws.amazon.com/lambda/](https://console.aws.amazon.com/lambda/).

1. In the left navigation pane, choose **Functions** and then choose **Create function**.

1. On the **Create function** page, choose the **Author from scratch** option.

1. In the **Basic information** section, enter a **Function name** (for example, **debugger-rule-stop-training-job**).

1. For **Runtime**, choose **Python 3.7**.

1. For **Permissions**, expand the dropdown, and choose **Change default execution role**.

1. For **Execution role**, choose **Use an existing role** and choose the IAM role that you use for training jobs on SageMaker AI.
**Note**  
Make sure you use the execution role with `AmazonSageMakerFullAccess` and `AWSLambdaBasicExecutionRole` attached. Otherwise, the Lambda function won't properly react to the Debugger rule status changes of the training job. If you are unsure which execution role is being used, run the following code in a Jupyter notebook cell to retrieve the execution role output:  

   ```
   import sagemaker
   sagemaker.get_execution_role()
   ```

1. At the bottom of the page, choose **Create function**.

The following figure shows an example of the **Create function** page with the input fields and selections completed.

![\[Create Function page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-lambda-create.png)


## Step 2: Configure the Lambda function
<a name="debugger-lambda-function-configure"></a>

**To configure the Lambda function**

1. In the **Function code** section of the configuration page, paste the following Python script in the Lambda code editor pane. The `lambda_handler` function monitors the Debugger rule evaluation status collected by CloudWatch and triggers the `StopTrainingJob` API operation. The AWS SDK for Python (Boto3) `client` for SageMaker AI provides a high-level method, `stop_training_job`, which triggers the `StopTrainingJob` API operation.

   ```
   import json
   import boto3
   import logging
   
   logger = logging.getLogger()
   logger.setLevel(logging.INFO)
   
   def lambda_handler(event, context):
       training_job_name = event.get("detail").get("TrainingJobName")
       logging.info(f'Evaluating Debugger rules for training job: {training_job_name}')
       eval_statuses = event.get("detail").get("DebugRuleEvaluationStatuses", None)
   
       if eval_statuses is None or len(eval_statuses) == 0:
           logging.info("Couldn't find any debug rule statuses, skipping...")
           return {
               'statusCode': 200,
               'body': json.dumps('Nothing to do')
           }
   
       # should only attempt stopping jobs with InProgress status
       training_job_status = event.get("detail").get("TrainingJobStatus", None)
       if training_job_status != 'InProgress':
           logging.debug(f"Current Training job status({training_job_status}) is not 'InProgress'. Exiting")
           return {
               'statusCode': 200,
               'body': json.dumps('Nothing to do')
           }
   
       client = boto3.client('sagemaker')
   
       for status in eval_statuses:
           logging.info(status.get("RuleEvaluationStatus") + ', RuleEvaluationStatus=' + str(status))
           if status.get("RuleEvaluationStatus") == "IssuesFound":
               secondary_status = event.get("detail").get("SecondaryStatus", None)
               logging.info(
                   f'About to stop training job, since evaluation of rule configuration {status.get("RuleConfigurationName")} resulted in "IssuesFound". ' +
                   f'\ntraining job "{training_job_name}" status is "{training_job_status}", secondary status is "{secondary_status}"' +
                   f'\nAttempting to stop training job "{training_job_name}"'
               )
               try:
                   client.stop_training_job(
                       TrainingJobName=training_job_name
                   )
               except Exception as e:
                   logging.error(
                       "Encountered error while trying to "
                       "stop training job {}: {}".format(
                           training_job_name, str(e)
                       )
                   )
                   raise e
       return None
   ```

   For more information about the Lambda code editor interface, see [Creating functions using the AWS Lambda console editor](https://docs.aws.amazon.com/lambda/latest/dg/code-editor.html).

1. Skip all other settings and choose **Save** at the top of the configuration page.
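Before wiring the function to CloudWatch, you can sanity-check its decision logic locally. The `should_stop` helper below is a hypothetical pure-Python mirror of the handler's checks; it makes no AWS calls:

```python
def should_stop(event):
    """Mirror of the Lambda handler's checks: stop only an InProgress
    training job whose Debugger rule evaluation reports IssuesFound."""
    detail = event.get("detail", {})
    statuses = detail.get("DebugRuleEvaluationStatuses") or []
    if detail.get("TrainingJobStatus") != "InProgress":
        return False
    return any(s.get("RuleEvaluationStatus") == "IssuesFound" for s in statuses)

# Sample CloudWatch event detail, trimmed to the fields the handler reads
sample = {
    "detail": {
        "TrainingJobName": "my-job",
        "TrainingJobStatus": "InProgress",
        "DebugRuleEvaluationStatuses": [
            {"RuleConfigurationName": "StalledTrainingRule",
             "RuleEvaluationStatus": "IssuesFound"}
        ],
    }
}
```

Feeding such sample events through the real handler in a local test harness is a quick way to catch parsing mistakes before deployment.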

## Step 3: Create a CloudWatch events rule and link to the Lambda function for Debugger
<a name="debugger-cloudwatch-events"></a>

**To create a CloudWatch Events rule and link to the Lambda function for Debugger**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the left navigation pane, choose **Rules** under the **Events** node.

1. Choose **Create rule**.

1. In the **Event Source** section of the **Step 1: Create rule** page, choose **SageMaker AI** for **Service Name**, and choose **SageMaker AI Training Job State Change** for **Event Type**. The Event Pattern Preview should look like the following example JSON: 

   ```
   {
       "source": [
           "aws.sagemaker"
       ],
       "detail-type": [
           "SageMaker Training Job State Change"
       ]
   }
   ```

1. In the **Targets** section, choose **Add target**, and choose the **debugger-rule-stop-training-job** Lambda function that you created. This step links the CloudWatch Events rule with the Lambda function.

1. Choose **Configure details** and go to the **Step 2: Configure rule details** page.

1. Specify the CloudWatch rule definition name. For example, **debugger-cw-event-rule**.

1. Choose **Create rule** to finish.

1. Go back to the Lambda function configuration page and refresh the page. Confirm that it's configured correctly in the **Designer** panel. The CloudWatch Events rule should be registered as a trigger for the Lambda function. The configuration design should look like the following example:  
<a name="lambda-designer-example"></a>![\[Designer panel for the CloudWatch configuration.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-lambda-designer.png)
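To see why the pattern above selects the events the Lambda function expects, here is a deliberately simplified sketch of how EventBridge matches the two top-level fields. Real pattern matching supports nesting, prefixes, and other operators not shown here:

```python
def matches(event, pattern):
    """Simplified EventBridge matching: every pattern key must appear in
    the event with a value contained in the pattern's allowed list."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())

pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Training Job State Change"],
}
event = {"source": "aws.sagemaker",
         "detail-type": "SageMaker Training Job State Change"}
```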

# Disable the CloudWatch Events rule to stop using the automated training job termination
<a name="debugger-disable-cw"></a>

If you want to disable the automated training job termination, you need to disable the CloudWatch Events rule. In the Lambda **Designer** panel, choose the **EventBridge (CloudWatch Events)** block linked to the Lambda function. This shows an **EventBridge** panel below the **Designer** panel (for example, see the previous screenshot). Select the check box next to **EventBridge (CloudWatch Events): debugger-cw-event-rule**, and then choose **Disable**. If you want to use the automated termination functionality later, you can enable the CloudWatch Events rule again.

# Visualize Amazon SageMaker Debugger output tensors in TensorBoard
<a name="debugger-enable-tensorboard-summaries"></a>

**Important**  
This page is deprecated in favor of Amazon SageMaker AI with TensorBoard, which provides a comprehensive TensorBoard experience integrated with SageMaker Training and the access control functionalities of SageMaker AI domain. To learn more, see [TensorBoard in Amazon SageMaker AI](tensorboard-on-sagemaker.md).

Use SageMaker Debugger to create output tensor files that are compatible with TensorBoard, then load the files into TensorBoard to analyze your SageMaker training jobs. Debugger automatically generates these files, and for any hook configuration you customize for saving output tensors, Debugger can create scalar summaries, distributions, and histograms that you can import to TensorBoard. 

![\[An architecture diagram of the Debugger output tensor saving mechanism.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-tensorboard-concept.png)


You can enable this by passing `DebuggerHookConfig` and `TensorBoardOutputConfig` objects to an `estimator`.

The following procedure explains how to save scalars, weights, and biases as full tensors, histograms, and distributions that can be visualized with TensorBoard. Debugger saves them to the training container's local path (the default path is `/opt/ml/output/tensors`) and syncs to the Amazon S3 locations passed through the Debugger output configuration objects.

**To save TensorBoard compatible output tensor files using Debugger**

1. Set up a `tensorboard_output_config` configuration object to save TensorBoard output using the Debugger `TensorBoardOutputConfig` class. For the `s3_output_path` parameter, specify the default S3 bucket of the current SageMaker AI session or a preferred S3 bucket. This example does not add the `container_local_output_path` parameter; instead, it is set to the default local path `/opt/ml/output/tensors`.

   ```
   import sagemaker
   from sagemaker.debugger import TensorBoardOutputConfig
   
   bucket = sagemaker.Session().default_bucket()
   tensorboard_output_config = TensorBoardOutputConfig(
       s3_output_path='s3://{}'.format(bucket)
   )
   ```

   For additional information, see the Debugger [`TensorBoardOutputConfig`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.TensorBoardOutputConfig) API in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

1. Configure the Debugger hook and customize the hook parameter values. For example, the following code configures a Debugger hook to save all scalar outputs every 100 steps in training phases and 10 steps in validation phases, the `weights` parameters every 500 steps (the default `save_interval` value for saving tensor collections is 500), and the `bias` parameters every 10 global steps until the global step reaches 500.

   ```
   from sagemaker.debugger import CollectionConfig, DebuggerHookConfig
   
   hook_config = DebuggerHookConfig(
       hook_parameters={
           "train.save_interval": "100",
           "eval.save_interval": "10"
       },
       collection_configs=[
           CollectionConfig("weights"),
           CollectionConfig(
               name="biases",
               parameters={
                   "save_interval": "10",
                   "end_step": "500",
                   "save_histogram": "True"
               }
           ),
       ]
   )
   ```

   For more information about the Debugger configuration APIs, see the Debugger [`CollectionConfig`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig) and [`DebuggerHookConfig`](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.DebuggerHookConfig) APIs in the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable).

1. Construct a SageMaker AI estimator, passing the Debugger configuration objects as parameters. The following example template shows how to create a generic SageMaker AI estimator. You can replace `estimator` and `Estimator` with other SageMaker AI frameworks' estimator parent classes and estimator classes. Available SageMaker AI framework estimators for this functionality are [`TensorFlow`](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#create-an-estimator), [`PyTorch`](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#create-an-estimator), and [`MXNet`](https://sagemaker.readthedocs.io/en/stable/frameworks/mxnet/using_mxnet.html#create-an-estimator).

   ```
   from sagemaker.estimator import Estimator
   
   estimator = Estimator(
       ...
       # Debugger parameters
       debugger_hook_config=hook_config,
       tensorboard_output_config=tensorboard_output_config
   )
   estimator.fit()
   ```

   The `estimator.fit()` method starts a training job, and Debugger writes the output tensor files in real time to the Debugger S3 output path and to the TensorBoard S3 output path. To retrieve the output paths, use the following estimator methods:
   + For the Debugger S3 output path, use `estimator.latest_job_debugger_artifacts_path()`.
   + For the TensorBoard S3 output path, use `estimator.latest_job_tensorboard_artifacts_path()`.

1. After the training has completed, check the names of saved output tensors:

   ```
   from smdebug.trials import create_trial
   trial = create_trial(estimator.latest_job_debugger_artifacts_path())
   trial.tensor_names()
   ```

1. Check the TensorBoard output data in Amazon S3:

   ```
   tensorboard_output_path = estimator.latest_job_tensorboard_artifacts_path()
   print(tensorboard_output_path)
   !aws s3 ls {tensorboard_output_path}/
   ```

1. Download the TensorBoard output data to your notebook instance. For example, the following AWS CLI command downloads the TensorBoard files to `/logs/fit` under the current working directory of your notebook instance.

   ```
   !aws s3 cp --recursive {tensorboard_output_path} ./logs/fit
   ```

1. Compress the file directory to a TAR file to download to your local machine.

   ```
   !tar -cf logs.tar logs
   ```

1. Download and extract the TensorBoard TAR file to a directory on your device, launch a Jupyter notebook server, open a new notebook, and run the TensorBoard app.

   ```
   !tar -xf logs.tar
   %load_ext tensorboard
   %tensorboard --logdir logs/fit
   ```

The following animated screenshot illustrates steps 5 through 8. It demonstrates how to download the Debugger TensorBoard TAR file and load the file in a Jupyter notebook on your local device.

![\[Animation on how to download and load the Debugger TensorBoard file on your local machine.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-tensorboard.gif)


# List of Debugger built-in rules
<a name="debugger-built-in-rules"></a>

You can use the Debugger built-in rules, provided by Amazon SageMaker Debugger, to analyze metrics and tensors collected while training your models. The following sections list the Debugger built-in rules, with parameter descriptions and an example of how to configure and deploy each rule.

The Debugger built-in rules monitor various common conditions that are critical for the success of a training job. You can call the built-in rules using [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable) or the low-level SageMaker API operations. 

There's no additional cost for using the built-in rules. For more information about billing, see the [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/) page.

**Note**  
The maximum number of built-in rules that you can attach to a training job is 20. SageMaker Debugger fully manages the built-in rules and analyzes your training job synchronously.

**Important**  
To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the SMDebug client library. In your IPython kernel, Jupyter notebook, or JupyterLab environment, run the following code to install the latest versions of the libraries and restart the kernel.  

```
import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)
```

## Debugger rule
<a name="debugger-built-in-rules-Rule"></a>

The following rules are the Debugger built-in rules that are callable using the `Rule.sagemaker` classmethod.

Debugger built-in rules for generating training reports


| Scope of Validity | Built-in Rules | 
| --- | --- | 
| Training Report for SageMaker AI XGboost training job |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html)  | 

Debugger built-in rules for debugging model training data (output tensors)


| Scope of Validity | Built-in Rules | 
| --- | --- | 
| Deep learning frameworks (TensorFlow, MXNet, and PyTorch) |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html)  | 
| Deep learning frameworks (TensorFlow, MXNet, and PyTorch) and the XGBoost algorithm  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html)  | 
| Deep learning applications |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html)  | 
| XGBoost algorithm |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html)  | 

**To use the built-in rules with default parameter values** – use the following configuration format:

```
from sagemaker.debugger import Rule, ProfilerRule, rule_configs

rules = [
    Rule.sagemaker(rule_configs.built_in_rule_name_1()),
    Rule.sagemaker(rule_configs.built_in_rule_name_2()),
    ...
    Rule.sagemaker(rule_configs.built_in_rule_name_n())
]
```

**To use the built-in rules with customized parameter values** – use the following configuration format:

```
from sagemaker.debugger import CollectionConfig, Rule, ProfilerRule, rule_configs

rules = [
    Rule.sagemaker(
        base_config=rule_configs.built_in_rule_name(),
        rule_parameters={
                "key": "value"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="tensor_collection_name", 
                parameters={
                    "key": "value"
                } 
            )
        ]
    )
]
```

To find available keys for the `rule_parameters` parameter, see the parameter description tables.

Sample rule configuration codes are provided for each built-in rule below the parameter description tables.
+ For full instructions and examples of using the Debugger built-in rules, see [Debugger built-in rules example code](debugger-built-in-rules-example.md#debugger-deploy-built-in-rules).
+ For full instructions on using the built-in rules with the low-level SageMaker API operations, see [Configure Debugger using SageMaker API](debugger-createtrainingjob-api.md).

## CreateXgboostReport
<a name="create-xgboost-report"></a>

The CreateXgboostReport rule collects output tensors from an XGBoost training job and autogenerates a comprehensive training report. You can download the report while a training job is running or after the training job is complete, and check the training progress or the final result of the training job. The CreateXgboostReport rule collects the following output tensors by default: 
+ `hyperparameters` – Saves at the first step
+ `metrics` – Saves loss and accuracy every 5 steps
+ `feature_importance` – Saves every 5 steps
+ `predictions` – Saves every 5 steps
+ `labels` – Saves every 5 steps

Parameter Descriptions for the CreateXgboostReport Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial | The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger. **Required** Valid values: String  | 

```
rules=[
    Rule.sagemaker(
        rule_configs.create_xgboost_report()
    )  
]
```

## DeadRelu
<a name="dead-relu"></a>

This rule detects when the percentage of rectified linear unit (ReLU) activation functions in a trial that are considered dead, because their activation activity has dropped below a threshold, becomes too high. If the percentage of inactive ReLUs in a layer is greater than the `threshold_layer` value, the rule returns `True`.
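
As an illustration of the condition this rule evaluates, the following is a simplified sketch, not Debugger's actual implementation; the function and data names are hypothetical. It computes, from saved activation outputs of one layer, the percentage of units whose activity falls below `threshold_inactivity`, and fires when that percentage exceeds `threshold_layer`.

```
def dead_relu_check(activations_per_step, threshold_inactivity=1.0, threshold_layer=50.0):
    """Illustrative sketch of the dead-ReLU condition.

    activations_per_step: one list of per-unit ReLU outputs per saved step.
    A unit is dead if it was active (output > 0) in fewer than
    threshold_inactivity percent of the steps.
    """
    num_steps = len(activations_per_step)
    num_units = len(activations_per_step[0])
    dead = 0
    for unit in range(num_units):
        active_steps = sum(1 for step in activations_per_step if step[unit] > 0)
        if 100.0 * active_steps / num_steps < threshold_inactivity:
            dead += 1
    # The rule fires if more than threshold_layer percent of units are dead.
    return 100.0 * dead / num_units > threshold_layer

# Three units over four steps: units 0 and 1 never activate, unit 2 does,
# so two thirds of the layer is dead and the check fires.
steps = [[0.0, 0.0, 1.2], [0.0, 0.0, 0.3], [0.0, 0.0, 0.0], [0.0, 0.0, 2.1]]
```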

Parameter Descriptions for the DeadRelu Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| tensor_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string  Default value: `".*relu_output"`  | 
| threshold_inactivity |  Defines a level of activity below which a ReLU is considered to be dead. A ReLU might be active in the beginning of a trial and then slowly die during the training process. If the ReLU is active less than the `threshold_inactivity`, it is considered to be dead. **Optional** Valid values: Float Default values: `1.0` (in percentage)  | 
| threshold_layer |  Returns `True` if the percentage of inactive ReLUs in a layer is greater than `threshold_layer`. Returns `False` if the percentage of inactive ReLUs in a layer is less than `threshold_layer`. **Optional** Valid values: Float Default values: `50.0` (in percentage)  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.dead_relu(),
        rule_parameters={
                "tensor_regex": ".*relu_output|.*ReLU_output",
                "threshold_inactivity": "1.0",
                "threshold_layer": "50.0"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="custom_relu_collection", 
                parameters={
                    "include_regex": ".*relu_output|.*ReLU_output",
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
This rule is not available for the XGBoost algorithm.

## ExplodingTensor
<a name="exploding-tensor"></a>

This rule detects whether the tensors emitted during training have non-finite values, either infinite or NaN (not a number). If a non-finite value is detected, the rule returns `True`.
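
The check itself is straightforward. As an illustrative sketch (not Debugger's implementation), a tensor flattened to a list of floats can be scanned for non-finite values, with the `only_nan` option narrowing the check to `NaN` values:

```
import math

def has_exploding_values(values, only_nan=False):
    """Illustrative sketch: True if any value is NaN, or, when only_nan
    is False, infinite as well."""
    if only_nan:
        return any(math.isnan(v) for v in values)
    return any(not math.isfinite(v) for v in values)

# An infinite gradient trips the default check but not the NaN-only check.
gradients = [0.1, -0.3, float("inf")]
```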

Parameter Descriptions for the ExplodingTensor Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| collection_names |  The list of collection names whose tensors the rule inspects. **Optional** Valid values: String Default value: `None`  | 
| tensor_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: String  Default value: `None`  | 
| only_nan |   `True` to monitor the `base_trial` tensors only for `NaN` values and not for infinity.  `False` to treat both `NaN` and infinity as exploding values and to monitor for both. **Optional** Valid values: Boolean Default value: `False`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.exploding_tensor(),
        rule_parameters={
                "tensor_regex": ".*gradient",
                "only_nan": "False"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="gradients", 
                parameters={
                    "save_interval": "500"
                }
            )
        ]
    )
]
```

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
This rule is not available for the XGBoost algorithm.

## PoorWeightInitialization
<a name="poor-weight-initialization"></a>

 This rule detects if your model parameters have been poorly initialized. 

Good initialization breaks the symmetry of the weights and gradients in a neural network and maintains commensurate activation variances across layers. Otherwise, the neural network doesn't learn effectively. Initializers like Xavier aim to keep variance constant across activations, which is especially relevant for training very deep neural nets. Too small an initialization can lead to vanishing gradients. Too large an initialization can lead to exploding gradients. This rule checks the variance of activation inputs across layers, the distribution of gradients, and the loss convergence for the initial steps to determine if a neural network has been poorly initialized.
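
As an illustration of the first check described above, the following is a simplified sketch, not Debugger's implementation; the function and data names are hypothetical. It compares per-layer weight variances and flags initialization when the spread between layers exceeds the `threshold` ratio.

```
from statistics import pvariance

def weight_variance_ratio_check(layer_weights, threshold=10.0):
    """Illustrative sketch: if the ratio between the largest and smallest
    per-layer weight variance exceeds `threshold`, initialization is suspect."""
    variances = [pvariance(w) for w in layer_weights]
    return max(variances) / min(variances) > threshold

# Two layers with commensurate variance pass; adding a near-zero-variance
# layer makes the max/min variance ratio explode and trips the check.
balanced = [[-0.1, 0.1, -0.2, 0.2], [-0.15, 0.15, -0.1, 0.1]]
skewed = balanced + [[-0.001, 0.001, -0.001, 0.001]]
```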

Parameter Descriptions for the PoorWeightInitialization Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| activation_inputs_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: String Default value: `".*relu_input"`  | 
| threshold |  If the ratio between minimum and maximum variance of weights per layer exceeds the `threshold` at a step, the rule returns `True`. **Optional** Valid values: Float Default value: `10.0`  | 
| distribution_range |  If the minimum difference between 5th and 95th percentiles of the gradient distribution is less than the `distribution_range`, the rule returns `True`. **Optional** Valid values: Float Default value: `0.001`  | 
| patience |  The number of steps to wait until the loss is considered to be no longer decreasing. **Optional** Valid values: Integer Default value: `5`  | 
| steps |  The number of steps this rule analyzes. You typically need to check only the first few iterations. **Optional** Valid values: Float Default value: `10`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.poor_weight_initialization(),
        rule_parameters={
                "activation_inputs_regex": ".*relu_input|.*ReLU_input",
                "threshold": "10.0",
                "distribution_range": "0.001",
                "patience": "5",
                "steps": "10"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="custom_relu_collection", 
                parameters={
                    "include_regex": ".*relu_input|.*ReLU_input",
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
This rule is not available for the XGBoost algorithm.

## SaturatedActivation
<a name="saturated-activation"></a>

This rule detects if the tanh and sigmoid activation layers are becoming saturated. An activation layer is saturated when the input of the layer is close to the maximum or minimum of the activation function. The minimum and maximum of the tanh and sigmoid activation functions are defined by their respective `min_threshold` and `max_threshold` values. If the activity of a node drops below the `threshold_inactivity` percentage, it is considered saturated. If more than a `threshold_layer` percent of the nodes are saturated, the rule returns `True`.
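
The default thresholds in the parameter table correspond roughly to the inputs at which each activation's analytical gradient has fallen below the vanishing-gradient threshold of `0.0000001`. A quick check of this relationship, using only the standard derivative formulas (a sketch for intuition, not part of the rule itself):

```
import math

# Derivative of tanh: 1 - tanh(x)^2 ; derivative of sigmoid: s(x) * (1 - s(x)).
def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

VANISHING = 0.0000001  # the vanishing-gradient threshold cited in the table

# At the default saturation thresholds, each activation's gradient is
# already below the vanishing-gradient threshold.
for x in (-9.4999, 9.4999):
    assert tanh_grad(x) < VANISHING
for x in (-23, 16.99999):
    assert sigmoid_grad(x) < VANISHING
```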

Parameter Descriptions for the SaturatedActivation Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| collection_names |  The list of collection names whose tensors the rule inspects. **Optional** Valid values: List of strings or a comma-separated string Default value: None  | 
| tensor_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: String  Default value: `".*tanh_input\|.*sigmoid_input"`  | 
| threshold_tanh_min |  The minimum and maximum thresholds that define the extremes of the input for a tanh activation function, defined as: `(min_threshold, max_threshold)`. The default values are determined based on a vanishing gradient threshold of 0.0000001. **Optional** Valid values: Float Default values: `-9.4999`  | 
| threshold_tanh_max |  The minimum and maximum thresholds that define the extremes of the input for a tanh activation function, defined as: `(min_threshold, max_threshold)`. The default values are determined based on a vanishing gradient threshold of 0.0000001. **Optional** Valid values: Float Default values: `9.4999`  | 
| threshold_sigmoid_min |  The minimum and maximum thresholds that define the extremes of the input for a sigmoid activation function, defined as: `(min_threshold, max_threshold)`. The default values are determined based on a vanishing gradient threshold of 0.0000001. **Optional** Valid values: Float Default values: `-23`  | 
| threshold_sigmoid_max |  The minimum and maximum thresholds that define the extremes of the input for a sigmoid activation function, defined as: `(min_threshold, max_threshold)`. The default values are determined based on a vanishing gradient threshold of 0.0000001. **Optional** Valid values: Float Default values: `16.99999`  | 
| threshold_inactivity |  The percentage of inactivity below which the activation layer is considered to be saturated. The activation might be active in the beginning of a trial and then slowly become less active during the training process. **Optional** Valid values: Float Default values: `1.0`  | 
| threshold_layer |  Returns `True` if the number of saturated activations in a layer is greater than the `threshold_layer` percentage. Returns `False` if the number of saturated activations in a layer is less than the `threshold_layer` percentage. **Optional** Valid values: Float Default values: `50.0`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.saturated_activation(),
        rule_parameters={
                "tensor_regex": ".*tanh_input|.*sigmoid_input",
                "threshold_tanh_min": "-9.4999",
                "threshold_tanh_max": "9.4999",
                "threshold_sigmoid_min": "-23",
                "threshold_sigmoid_max": "16.99999",
                "threshold_inactivity": "1.0",
                "threshold_layer": "50.0"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="custom_activations_collection",
                parameters={
                    "include_regex": ".*tanh_input|.*sigmoid_input",
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
This rule is not available for the XGBoost algorithm.

## VanishingGradient
<a name="vanishing-gradient"></a>

This rule detects if the gradients in a trial become extremely small or drop to a zero magnitude. If the mean of the absolute values of the gradients drops below a specified `threshold`, the rule returns `True`.

Parameters Descriptions for the VanishingGradient Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| threshold |  The value at which the gradient is determined to be vanishing. **Optional** Valid values: Float Default value: `0.0000001`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.vanishing_gradient(),
        rule_parameters={
                "threshold": "0.0000001"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="gradients", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
This rule is not available for the XGBoost algorithm.

## WeightUpdateRatio
<a name="weight-update-ratio"></a>

This rule keeps track of the ratio of updates to weights during training and detects if that ratio gets too large or too small. If the ratio of updates to weights is larger than the `large_threshold` value or if this ratio is smaller than the `small_threshold` value, the rule returns `True`.

Conditions for training are best when the updates are commensurate to gradients. Excessively large updates can push the weights away from optimal values, and very small updates result in very slow convergence. This rule requires weights to be available for two training steps, and `train.save_interval` needs to be set equal to `num_steps`.
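
A simplified sketch of the ratio this rule tracks (illustrative only, not Debugger's implementation), using the default threshold and epsilon values from the parameter table; the function and data names are hypothetical:

```
def weight_update_ratio_check(weights_now, weights_prev,
                              large_threshold=10.0,
                              small_threshold=0.00000001,
                              epsilon=0.000000001):
    """Illustrative sketch: the mean absolute update relative to the mean
    absolute weight, with epsilon guarding against division by zero."""
    n = len(weights_now)
    mean_update = sum(abs(a - b) for a, b in zip(weights_now, weights_prev)) / n
    mean_weight = sum(abs(b) for b in weights_prev) / n
    ratio = mean_update / (mean_weight + epsilon)
    # Fires when updates are excessively large or vanishingly small.
    return ratio > large_threshold or ratio < small_threshold

# A modest ~1% update passes; a 20x jump in the weights trips large_threshold.
prev = [0.5, -0.5, 0.25, -0.25]
```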

Parameter Descriptions for the WeightUpdateRatio Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| num_steps |  The number of steps across which the rule compares the weight ratios. If you pass no value, the rule runs by default against the current step and the immediately previous saved step. If you override the default by passing a value for this parameter, the comparison is done between weights at step `s` and at a step `>= s - num_steps`. **Optional** Valid values: Integer Default value: `None`  | 
| large_threshold |  The maximum value that the ratio of updates to weights can take before the rule returns `True`.  **Optional** Valid values: Float Default value: `10.0`  | 
| small_threshold |  The minimum value that the ratio of updates to weights can take, below which the rule returns `True`. **Optional** Valid values: Float Default value: `0.00000001`  | 
| epsilon |  A small constant used to ensure that Debugger does not divide by zero when computing the ratio of updates to weights. **Optional** Valid values: Float Default value: `0.000000001`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.weight_update_ratio(),
        rule_parameters={
                "num_steps": "100",
                "large_threshold": "10.0",
                "small_threshold": "0.00000001",
                "epsilon": "0.000000001"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="weights", 
                parameters={
                    "train.save_interval": "100"
                } 
            )
        ]
    )
]
```

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
This rule is not available for the XGBoost algorithm.

## AllZero
<a name="all-zero"></a>

This rule detects if all or a specified percentage of the tensor values are zero.

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either the `collection_names` or `tensor_regex` parameter. If both parameters are specified, the rule inspects the union of tensors from both sets.
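
As an illustrative sketch (not Debugger's implementation), the condition reduces to counting the percentage of zero values in a tensor flattened to a list:

```
def all_zero_check(values, threshold=100.0):
    """Illustrative sketch of the AllZero condition: True if at least
    `threshold` percent of the tensor's values are zero."""
    zero_pct = 100.0 * sum(1 for v in values if v == 0) / len(values)
    return zero_pct >= threshold
```

With the default `threshold` of 100, the rule fires only when every value is zero; lowering the threshold makes it fire on partially zero tensors as well.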

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameters Descriptions for the AllZero Rule


| Parameter Name | Description | 
| --- | --- | 
| base\$1trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| collection\$1names |  The list of collection names whose tensors the rule inspects. **Optional** Valid values: List of strings or a comma-separated string Default value: `None`  | 
| tensor\$1regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string Default value: `None`  | 
| threshold |  Specifies the percentage of values in the tensor that needs to be zero for this rule to be invoked.  **Optional** Valid values: Float Default value: 100 (in percentage)  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.all_zero(),
        rule_parameters={
                "tensor_regex": ".*",
                "threshold": "100"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="all", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## ClassImbalance
<a name="class-imbalance"></a>

This rule measures sampling imbalances between classes and throws errors if the imbalance exceeds a threshold or if too many mispredictions for underrepresented classes occur as a result of the imbalance.

Classification models require well-balanced classes in the training dataset or a proper weighting/sampling of classes during training. The rule performs the following checks:
+  It counts the occurrences per class. If the ratio of number of samples between smallest and largest class is larger than the `threshold_imbalance`, an error is thrown.
+  It checks the prediction accuracy per class. If resampling or weighting has not been correctly applied, then the model can reach high accuracy for the class with many training samples, but low accuracy for the classes with few training samples. If a fraction of mispredictions for a certain class is above `threshold_misprediction`, an error is thrown.
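
The two checks above can be sketched as follows (illustrative only, not Debugger's implementation); labels and predictions are assumed to be flat lists of class IDs, and the default thresholds match the parameter table:

```
from collections import Counter

def class_imbalance_check(labels, predictions,
                          threshold_imbalance=10.0,
                          threshold_misprediction=0.7):
    """Illustrative sketch of the two ClassImbalance checks."""
    counts = Counter(labels)
    smallest, largest = min(counts.values()), max(counts.values())
    # Check 1: sample imbalance between the largest and smallest class.
    if largest / smallest > threshold_imbalance:
        return True
    # Check 2: per-class misprediction fraction.
    for cls, total in counts.items():
        wrong = sum(1 for l, p in zip(labels, predictions) if l == cls and p != l)
        if wrong / total > threshold_misprediction:
            return True
    return False
```

For example, eleven samples of one class against a single sample of another trips the imbalance check, and a balanced dataset whose minority class is always mispredicted trips the misprediction check.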

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the ClassImbalance Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| threshold_imbalance |  The acceptable imbalance between the number of samples in the smallest class and in the largest class. Exceeding this threshold value throws an error. **Optional** Valid values: Float Default value: `10`  | 
| threshold_misprediction |  A limit on the fraction of mispredictions allowed for each class. Exceeding this threshold throws an error. The underrepresented classes are most at risk of crossing this threshold.  **Optional** Valid values: Float Default value: `0.7`  | 
| samples |  The number of labels that have to be processed before an imbalance is evaluated. The rule might not be triggered until it has seen sufficient samples across several steps. The more classes that your dataset contains, the larger this `sample` number should be.  **Optional** Valid values: Integer Default value: `500` (assuming a dataset like MNIST with 10 classes)  | 
| argmax |  If `True`, [np.argmax](https://docs.scipy.org/doc/numpy-1.9.3/reference/generated/numpy.argmax.html) is applied to the prediction tensor. Required when you have a vector of probabilities for each class. It is used to determine which class has the highest probability. **Conditional** Valid values: Boolean Default value: `False`  | 
| labels\$1regex |  The name of the tensor that contains the labels. **Optional** Valid values: String Default value: `".*labels"`  | 
| predictions\$1regex |  The name of the tensor that contains the predictions. **Optional** Valid values: String Default value: `".*predictions"`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.class_imbalance(),
        rule_parameters={
                "threshold_imbalance": "10",
                "threshold_misprediction": "0.7",
                "samples": "500",
                "argmax": "False",
                "labels_regex": ".*labels",
                "predictions_regex": ".*predictions"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="custom_output_collection",
                parameters={
                    "include_regex": ".*labels|.*predictions",
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## LossNotDecreasing
<a name="loss-not-decreasing"></a>

This rule detects when the loss is not decreasing in value at an adequate rate. These losses must be scalars. 

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either the `collection_names` or `tensor_regex` parameter. If both parameters are specified, the rule inspects the union of tensors from both sets.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the LossNotDecreasing Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| collection_names |  The list of collection names whose tensors the rule inspects. **Optional** Valid values: List of strings or a comma-separated string Default value: `None`  | 
| tensor_regex |  A list of regex patterns that is used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string  Default value: `None`  | 
| use_losses_collection |  If set to `True`, looks for losses in the collection named "losses" when the collection is present. **Optional** Valid values: Boolean Default value: `True`  | 
| num_steps |  The minimum number of steps after which the rule checks if the loss has decreased. Rule evaluation happens every `num_steps`. The rule compares the loss for this step with the loss at a step which is at least `num_steps` behind the current step. For example, suppose that the loss is being saved every three steps, but `num_steps` is set to 10. At step 21, loss for step 21 is compared with loss for step 9. The next step at which loss is checked is step 33, because ten steps after step 21 is step 31, and at step 31 and step 32 loss is not saved.  **Optional** Valid values: Integer Default value: `10`  | 
| diff_percent |  The minimum percentage difference by which the loss should decrease between `num_steps`. **Optional** Valid values: `0.0` < float < `100` Default value: `0.1` (in percentage)  | 
| increase_threshold_percent |  The maximum threshold percent that the loss is allowed to increase in case the loss has been increasing. **Optional** Valid values: `0` < float < `100` Default value: `5` (in percentage)  | 
| mode |  The name of the Debugger mode to query tensor values for rule checking. If this is not passed, the rule checks in order by default for the `mode.EVAL`, then `mode.TRAIN`, and then `mode.GLOBAL`.  **Optional** Valid values: String (`EVAL`, `TRAIN`, or `GLOBAL`) Default value: `GLOBAL`  | 

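
The `num_steps` windowing described in the table above can be sketched as follows (hypothetical helper for illustration; not the `smdebug` implementation):

```python
# Hypothetical sketch of the num_steps comparison window: at the current
# step, compare against the latest saved step that is at least num_steps
# behind it (illustration only, not the smdebug implementation).
def comparison_step(saved_steps, current_step, num_steps=10):
    candidates = [s for s in saved_steps if s <= current_step - num_steps]
    return max(candidates) if candidates else None

# Loss saved every 3 steps: at step 21 the rule compares against step 9,
# the latest saved step that is at least 10 steps behind step 21.
saved = list(range(0, 34, 3))   # 0, 3, 6, ..., 33
print(comparison_step(saved, 21))  # 9
```
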
```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.loss_not_decreasing(),
        rule_parameters={
                "tensor_regex": ".*",
                "use_losses_collection": "True",
                "num_steps": "10",
                "diff_percent": "0.1",
                "increase_threshold_percent": "5",
                "mode": "GLOBAL"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="losses", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## Overfit
<a name="overfit"></a>

This rule detects if your model is being overfit to the training data by comparing the validation and training losses.
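
A minimal sketch of the comparison, using the ratio defined by the `ratio_threshold` parameter below (hypothetical helper, not the `smdebug` implementation):

```python
# Hypothetical sketch of the Overfit rule's comparison: the gap between the
# mean validation loss and mean training loss, relative to the mean training
# loss, is compared against ratio_threshold (illustration only).
def overfit_ratio(train_losses, val_losses):
    mean_train = sum(train_losses) / len(train_losses)
    mean_val = sum(val_losses) / len(val_losses)
    return (mean_val - mean_train) / mean_train
```

A ratio above `ratio_threshold` for `patience` consecutive checks indicates overfitting.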

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
A standard way to prevent overfitting is to regularize your model.

Parameter Descriptions for the Overfit Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| tensor_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string  Default value: None  | 
| start_step |  The step from which to start comparing the validation and training loss. **Optional** Valid values: Integer Default value: `0`  | 
| patience |  The number of steps for which the `ratio_threshold` is allowed to exceed the value set before the model is considered to be overfit. **Optional** Valid values: Integer Default value: `1`  | 
| ratio_threshold |  The maximum ratio of the difference between the mean validation loss and mean training loss to the mean training loss. If this threshold is exceeded for a `patience` number of steps, the model is being overfit and the rule returns `True`. **Optional** Valid values: Float Default value: `0.1`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.overfit(),
        rule_parameters={
                "tensor_regex": ".*",
                "start_step": "0",
                "patience": "1",
                "ratio_threshold": "0.1"
        },
        collections_to_save=[
            CollectionConfig(
                name="losses", 
                parameters={
                    "train.save_interval": "100",
                    "eval.save_interval": "10"
                } 
            )
        ]
    )
]
```

## Overtraining
<a name="overtraining"></a>

This rule detects if a model is being overtrained. After a number of training iterations on a well-behaved model (both training and validation loss decrease), the model approaches a minimum of the loss function and no longer improves. If the model continues training, the validation loss can start increasing because the model starts overfitting. This rule sets up thresholds and conditions to determine whether the model has stopped improving, and prevents overfitting problems caused by overtraining.
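
The patience logic can be pictured as an early-stopping-style counter. The following is an illustrative sketch under assumed semantics, not the `smdebug` implementation:

```python
# Hypothetical early-stopping-style counter for the Overtraining rule:
# track how many steps have passed since the loss last improved by at
# least delta, and fire once the wait reaches the patience budget.
class PatienceCounter:
    def __init__(self, patience, delta=0.01):
        self.patience = patience
        self.delta = delta
        self.best = float("inf")
        self.waited = 0

    def update(self, loss):
        if loss < self.best - self.delta:
            self.best = loss        # new optimum found
            self.waited = 0
        else:
            self.waited += 1
        return self.waited >= self.patience  # True -> no longer improving
```

One counter would be kept per loss stream: `patience_train` for the training loss and `patience_validation` for the validation loss.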

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

**Note**  
Overtraining can be avoided by early stopping. For information on early stopping, see [Stop Training Jobs Early](automatic-model-tuning-early-stopping.md). For an example that shows how to use spot training with Debugger, see [Enable Spot Training with Amazon SageMaker Debugger](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/mxnet_spot_training/mxnet-spot-training-with-sagemakerdebugger.html). 

Parameter Descriptions for the Overtraining Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| patience_train |  The number of steps to wait before the training loss is considered to no longer be improving. **Optional** Valid values: Integer Default value: `5`  | 
| patience_validation |  The number of steps to wait before the validation loss is considered to no longer be improving. **Optional** Valid values: Integer Default value: `10`  | 
| delta |  The minimum threshold by which the error must improve before it is considered a new optimum. **Optional** Valid values: Float Default value: `0.01`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.overtraining(),
        rule_parameters={
                "patience_train": "5",
                "patience_validation": "10",
                "delta": "0.01"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="losses", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## SimilarAcrossRuns
<a name="similar-across-runs"></a>

This rule compares tensors gathered from a base trial with tensors from another trial. 

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the SimilarAcrossRuns Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| other_trials |  A completed training job name whose tensors you want to compare to those tensors gathered from the current `base_trial`. **Required** Valid values: String  | 
| collection_names |  The list of collection names whose tensors the rule inspects. **Optional** Valid values: List of strings or a comma-separated string Default value: None  | 
| tensor_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string  Default value: None  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.similar_across_runs(),
        rule_parameters={
                "other_trials": "<specify-another-job-name>",
                "collection_names": "losses",
                "tensor_regex": ".*"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="losses", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## StalledTrainingRule
<a name="stalled-training"></a>

StalledTrainingRule detects whether a training job is making no progress and, if so, stops the training job. This rule requires tensors to be saved periodically within a time interval defined by its `threshold` parameter. The rule continuously monitors for new tensors; if no new tensor has been emitted within the threshold interval, the rule fires.
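
The stall check itself can be pictured as a simple wall-clock comparison. This is an illustrative sketch under an assumed bookkeeping scheme (tracking the time of the most recently emitted tensor), not the `smdebug` implementation:

```python
import time

# Hypothetical stall check: assume we record the wall-clock time of the
# most recently emitted tensor; the rule fires when no new tensor has
# arrived within `threshold` seconds (illustration only).
def is_stalled(last_tensor_time, threshold=1800, now=None):
    now = time.time() if now is None else now
    return (now - last_tensor_time) > threshold
```
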

Parameter Descriptions for the StalledTrainingRule Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| threshold |  A threshold that defines how much time in seconds the rule waits for a tensor output before it fires a stalled-training issue. **Optional** Valid values: Integer Default value: `1800` (seconds)  | 
| stop_training_on_fire |  If set to `True`, stops the base training job when the rule detects that no tensors have been output for `threshold` seconds. **Optional** Valid values: Boolean Default value: `False`  | 
| training_job_name_prefix |  The prefix of the base training job name. If `stop_training_on_fire` is true, the rule searches for SageMaker training jobs with this prefix in the same account. If inactivity is found, the rule takes a `StopTrainingJob` action. Note that if multiple jobs are found with the same prefix, the rule skips termination. It is important that the prefix is unique for each training job. **Optional** Valid values: String  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.stalled_training_rule(),
        rule_parameters={
                "threshold": "1800",
                "stop_training_on_fire": "True",
                "training_job_name_prefix": "<specify-training-base-job-name>"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="losses", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## TensorVariance
<a name="tensor-variance"></a>

This rule detects if you have tensors with very high or low variances. Very high or low variances in a tensor could lead to neuron saturation, which reduces the learning ability of the neural network. Very high variance in tensors can also eventually lead to exploding tensors. Use this rule to detect such issues early.

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either the `collection_names` or `tensor_regex` parameter. If both parameters are specified, the rule inspects the union of tensors from both sets.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the TensorVariance Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| collection_names |  The list of collection names whose tensors the rule inspects. **Optional** Valid values: List of strings or a comma-separated string Default value: None  | 
| tensor_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string  Default value: None  | 
| max_threshold |  The threshold for the upper bound of tensor variance. **Optional** Valid values: Float Default value: None  | 
| min_threshold |  The threshold for the lower bound of tensor variance. **Optional** Valid values: Float Default value: None  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.tensor_variance(),
        rule_parameters={
                "collection_names": "weights",
                "max_threshold": "10",
                "min_threshold": "0.00001",
        },
        collections_to_save=[ 
            CollectionConfig(
                name="weights", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## UnchangedTensor
<a name="unchanged-tensor"></a>

This rule detects whether a tensor is no longer changing across steps. 

This rule runs the [numpy.allclose](https://docs.scipy.org/doc/numpy/reference/generated/numpy.allclose.html) method to check if the tensor isn't changing.
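
For reference, `numpy.allclose(a, b, rtol, atol)` treats two tensors as equal when `|a - b| <= atol + rtol * |b|` holds elementwise, which is how the rule decides a tensor is "unchanged":

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])

# A difference far below atol + rtol * |b| counts as "unchanged".
print(np.allclose(a, a + 1e-9))   # True
# A large difference means the tensor has changed.
print(np.allclose(a, a + 1.0))    # False
```

Tightening `rtol` and `atol` makes the rule more sensitive to small changes.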

This rule can be applied either to one of the supported deep learning frameworks (TensorFlow, MXNet, and PyTorch) or to the XGBoost algorithm. You must specify either the `collection_names` or `tensor_regex` parameter. If both parameters are specified, the rule inspects the union of tensors from both sets.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the UnchangedTensor Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| collection_names |  The list of collection names whose tensors the rule inspects. **Optional** Valid values: List of strings or a comma-separated string Default value: None  | 
| tensor_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string  Default value: None  | 
| num_steps |  The number of steps across which the rule checks to determine if the tensor has changed.  This checks the last `num_steps` that are available. They don't need to be consecutive. If `num_steps` is 2, at step s the rule doesn't necessarily check steps s-1 and s. If s-1 isn't available, it checks the last available step along with s. **Optional** Valid values: Integer Default value: `3`  | 
| rtol |  The relative tolerance parameter to be passed to the [numpy.allclose](https://docs.scipy.org/doc/numpy/reference/generated/numpy.allclose.html) method.  **Optional** Valid values: Float Default value: `1e-05`  | 
| atol |  The absolute tolerance parameter to be passed to the [numpy.allclose](https://docs.scipy.org/doc/numpy/reference/generated/numpy.allclose.html) method. **Optional** Valid values: Float Default value: `1e-08`  | 
| equal_nan |  Whether to compare NaNs as equal. If `True`, NaNs in input array a are considered equal to NaNs in input array b in the output array. This parameter is passed to the [numpy.allclose](https://docs.scipy.org/doc/numpy/reference/generated/numpy.allclose.html) method. **Optional** Valid values: Boolean Default value: `False`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.unchanged_tensor(),
        rule_parameters={
                "collection_names": "losses",
                "tensor_regex": "",
                "num_steps": "3",
                "rtol": "1e-05",
                "atol": "1e-08",
                "equal_nan": "False"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="losses", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## CheckInputImages
<a name="checkinput-mages"></a>

This rule checks if input images have been correctly normalized. Specifically, it detects if the mean of the sample data differs by more than a threshold value from zero. Many computer vision models require that input data has a zero mean and unit variance.
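
The normalization check can be sketched as follows (hypothetical helper for illustration; not the `smdebug` implementation):

```python
import numpy as np

# Hypothetical sketch of the check described above: once enough pixels
# have been sampled, fire if the running mean deviates from zero by more
# than threshold_mean (illustration only, not the smdebug implementation).
def inputs_unnormalized(batches, threshold_mean=0.2, threshold_samples=500):
    data = np.concatenate([np.ravel(b) for b in batches])
    if data.size < threshold_samples:
        return None                     # not enough samples yet
    return bool(abs(data.mean()) > threshold_mean)
```

For example, zero-centered data passes, while raw pixel values in `[0, 1]` (mean around 0.5) would fire the check.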

This rule is applicable to deep learning applications.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the CheckInputImages Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| threshold_mean |  A threshold that defines how much the mean of the input data can differ from zero. **Optional** Valid values: Float Default value: `0.2`  | 
| threshold_samples |  The number of images that have to be sampled before an error can be thrown. If the value is too low, the estimate of the dataset mean is inaccurate. **Optional** Valid values: Integer Default value: `500`  | 
| regex |  The name of the input data tensor. **Optional** Valid values: String Default value: `".*hybridsequential0_input_0"` (the name of the input tensor for Apache MXNet models using HybridSequential)  | 
| channel |  The position of the color channel in the input tensor shape array.  **Optional** Valid values: Integer Default value: `1` (for example, MXNet expects input data in the form of (batch_size, channel, height, width))  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.check_input_images(),
        rule_parameters={
                "threshold_mean": "0.2",
                "threshold_samples": "500",
                "regex": ".*hybridsequential0_input_0",
                "channel": "1"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="custom_inputs_collection", 
                parameters={
                    "include_regex": ".*hybridsequential0_input_0",
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## NLPSequenceRatio
<a name="nlp-sequence-ratio"></a>

This rule calculates the ratio of specific tokens in the input sequence, which is useful for optimizing performance. For example, you can calculate the percentage of padding end-of-sentence (EOS) tokens in your input sequence. If the number of EOS tokens is too high, an alternate bucketing strategy should be used. You can also calculate the percentage of unknown tokens in your input sequence. If the number of unknown words is too high, an alternate vocabulary could be used.
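
The token-ratio computation can be sketched as follows (hypothetical helper for illustration; not the `smdebug` implementation):

```python
# Hypothetical sketch of the token-ratio computation described above: the
# percentage of each watched token value in the input sequence, to be
# compared against token_thresholds_percent (illustration only).
def token_ratios(sequence, token_values=(0,)):
    n = len(sequence)
    return {t: 100.0 * sequence.count(t) / n for t in token_values}

# A heavily padded sequence: 6 of the 8 tokens are the padding value 0.
padded = [4, 9, 0, 0, 0, 0, 0, 0]
print(token_ratios(padded))  # {0: 75.0}
```
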

This rule is applicable to deep learning applications.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the NLPSequenceRatio Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| tensor_regex |  A list of regex patterns used to restrict this comparison to specific scalar-valued tensors. The rule inspects only the tensors that match the regex patterns specified in the list. If no patterns are passed, the rule compares all tensors gathered in the trials by default. Only scalar-valued tensors can be matched. **Optional** Valid values: List of strings or a comma-separated string  Default value: `".*embedding0_input_0"` (assuming an embedding as the initial layer of the network)  | 
| token_values |  A string of a list of the numerical values of the tokens. For example, "3, 0". **Optional** Valid values: Comma-separated string of numerical values Default value: `0`  | 
| token_thresholds_percent |  A string of a list of thresholds (in percentages) that correspond to each of the `token_values`. For example, "50.0, 50.0". **Optional** Valid values: Comma-separated string of floats Default value: `"50"`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.nlp_sequence_ratio(),
        rule_parameters={
                "tensor_regex": ".*embedding0_input_0",
                "token_values": "0",
                "token_thresholds_percent": "50"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="custom_inputs_collection", 
                parameters={
                    "include_regex": ".*embedding0_input_0"
                } 
            )
        ]
    )
]
```

## Confusion
<a name="confusion"></a>

This rule evaluates the goodness of a confusion matrix for a classification problem.

It creates a matrix of size `category_no*category_no` and populates it with data coming from (`labels`, `predictions`) pairs. For each (`labels`, `predictions`) pair, the count in `confusion[labels][predictions]` is incremented by 1. When the matrix is fully populated, the ratios of on-diagonal and off-diagonal values are evaluated as follows:
+ For elements on the diagonal: `confusion[i][i]/sum_j(confusion[j][i])>=min_diag`
+ For elements off the diagonal: `confusion[j][i]/sum_j(confusion[j][i])<=max_off_diag`
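
The checks can be sketched with NumPy, assuming both ratios are taken against the column sums `sum_j(confusion[j][i])`. This is an illustration only, not the `smdebug` implementation:

```python
import numpy as np

# Hypothetical sketch: normalize each column by its sum, then test the
# diagonal against min_diag and the off-diagonal entries against
# max_off_diag (illustration only, not the smdebug implementation).
def confusion_ok(confusion, min_diag=0.9, max_off_diag=0.1):
    confusion = np.asarray(confusion, dtype=float)
    col_sums = confusion.sum(axis=0)
    col_sums[col_sums == 0] = 1.0                 # avoid division by zero
    ratios = confusion / col_sums
    diag_ok = np.all(np.diag(ratios) >= min_diag)
    off = ratios - np.diag(np.diag(ratios))       # zero out the diagonal
    return bool(diag_ok and np.all(off <= max_off_diag))
```

For example, a near-perfect 2-class matrix such as `[[95, 5], [5, 95]]` passes, while a noisy one such as `[[60, 40], [40, 60]]` fails.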

This rule can be applied to the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the Confusion Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| category_no |  The number of categories. **Optional** Valid values: Integer ≥2 Default value: `"None"`  | 
| labels |  The `labels` tensor collection or a 1-d vector of true labels.  **Optional** Valid values: String Default value: `"labels"`  | 
| predictions |  The `predictions` tensor collection or a 1-d vector of estimated labels.  **Optional** Valid values: String Default value: `"predictions"`  | 
| labels_collection |  The rule inspects the tensors in this collection for `labels`. **Optional** Valid values: String Default value: `"labels"`  | 
| predictions_collection |  The rule inspects the tensors in this collection for `predictions`. **Optional** Valid values: String Default value: `"predictions"`  | 
| min_diag |  The minimum threshold for the ratio of data on the diagonal. **Optional** Valid values: `0`≤float≤`1` Default value: `0.9`  | 
| max_off_diag |  The maximum threshold for the ratio of data off the diagonal. **Optional** Valid values: `0`≤float≤`1` Default value: `0.1`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.confusion(),
        rule_parameters={
                "category_no": "10",
                "labels": "labels",
                "predictions": "predictions",
                "labels_collection": "labels",
                "predictions_collection": "predictions",
                "min_diag": "0.9",
                "max_off_diag": "0.1"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="labels",
                parameters={
                    "save_interval": "500"
                } 
            ),
            CollectionConfig(
                name="predictions",
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

**Note**  
This rule infers default values for the optional parameters if their values aren't specified.

## FeatureImportanceOverweight
<a name="feature_importance_overweight"></a>

This rule accumulates the weights of the n largest feature importance values per step and ensures that they do not exceed the threshold. For example, you can set the threshold for the top 3 features to not hold more than 80 percent of the total weights of the model.
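
The check can be sketched as follows (hypothetical helper for illustration; not the `smdebug` implementation):

```python
# Hypothetical sketch of the check described above: do the nfeatures
# largest importance weights hold more than `threshold` of the total
# weight? (Illustration only, not the smdebug implementation.)
def top_n_overweight(weights, nfeatures=3, threshold=0.8):
    total = sum(weights)
    top = sum(sorted(weights, reverse=True)[:nfeatures])
    return top / total > threshold
```

For example, with weights `[50, 30, 10, 5, 5]`, the top 3 features hold 90 percent of the total and the check fires.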

This rule is valid only for the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the FeatureImportanceOverweight Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| threshold |  Defines the threshold for the proportion of the cumulative sum of the `n` largest features. The number `n` is defined by the `nfeatures` parameter. **Optional** Valid values: Float Default value: `0.8`  | 
| nfeatures |  The number of largest features. **Optional** Valid values: Integer Default value: `3`  | 
| tensor_regex |  Regular expression (regex) of the tensor names that the rule analyzes. **Optional** Valid values: String Default value: `".*feature_importance/weight"`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.feature_importance_overweight(),
        rule_parameters={
                "threshold": "0.8",
                "nfeatures": "3",
                "tensor_regex": ".*feature_importance/weight"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="feature_importance", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```

## TreeDepth
<a name="tree-depth"></a>

This rule measures the depth of trees in an XGBoost model. XGBoost rejects splits if they do not improve loss. This regularizes the training. As a result, the tree might not grow as deep as defined by the `depth` parameter.
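
The depth computation described above can be sketched as follows (hypothetical helper, illustration only):

```python
import math

# Depth is obtained as the base-2 logarithm of the largest node ID, per
# the description above: in a binary tree, node IDs at depth d range up
# to roughly 2^d (illustration only, not the smdebug implementation).
def tree_depth(largest_node_id):
    return math.log2(largest_node_id) if largest_node_id > 0 else 0.0
```
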

This rule is valid only for the XGBoost algorithm.

For an example of how to configure and deploy a built-in rule, see [How to configure Debugger built-in rules](use-debugger-built-in-rules.md).

Parameter Descriptions for the TreeDepth Rule


| Parameter Name | Description | 
| --- | --- | 
| base_trial |  The base trial training job name. This parameter is automatically set to the current training job by Amazon SageMaker Debugger.  **Required** Valid values: String  | 
| depth |  The depth of the tree. The depth of the tree is obtained by computing the base 2 logarithm of the largest node ID. **Optional** Valid values: Float Default value: `4`  | 

```
built_in_rules = [
    Rule.sagemaker(
        base_config=rule_configs.tree_depth(),
        rule_parameters={
                "depth": "4"
        },
        collections_to_save=[ 
            CollectionConfig(
                name="tree", 
                parameters={
                    "save_interval": "500"
                } 
            )
        ]
    )
]
```
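The `depth` computation described in the parameter table can be sketched in plain Python. The heap-style node numbering here (root at 1, node *n* with children 2*n* and 2*n*+1) is an assumption made for the illustration, chosen so that `floor(log2(largest_node_id))` yields the tree depth:

```python
import math

# Sketch of the depth computation described above, assuming nodes are
# numbered heap-style starting at 1 (node n has children 2n and 2n+1),
# so depth = floor(log2(largest_node_id)).
def tree_depth(largest_node_id):
    return math.floor(math.log2(largest_node_id))

print(tree_depth(1))   # 0 (root only)
print(tree_depth(15))  # 3 (a full tree of depth 3 ends at node 15)
```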

# Creating custom rules using the Debugger client library
<a name="debugger-custom-rules"></a>

You can create custom rules to monitor your training job using the Debugger rule APIs and the open source [`smdebug` Python library](https://github.com/awslabs/sagemaker-debugger/), which provide tools to build your own rule containers.

## Prerequisites for creating a custom rule
<a name="debugger-custom-rules-prerequisite"></a>

To create Debugger custom rules, you need the following prerequisites.
+ [SageMaker Debugger Rule.custom API](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.Rule.custom)
+ [The open source smdebug Python library](https://github.com/awslabs/sagemaker-debugger/)
+ Your own custom rule Python script
+ [Amazon SageMaker Debugger image URIs for custom rule evaluators](debugger-reference.md#debuger-custom-rule-registry-ids)

**Topics**
+ [Prerequisites for creating a custom rule](#debugger-custom-rules-prerequisite)
+ [Use the `smdebug` client library to create a custom rule as a Python script](debugger-custom-rules-python-script.md)
+ [Use the Debugger APIs to run your own custom rules](debugger-custom-rules-python-sdk.md)

# Use the `smdebug` client library to create a custom rule as a Python script
<a name="debugger-custom-rules-python-script"></a>

The `smdebug` Rule API provides an interface for setting up your own custom rules. The following Python script is a sample of how to construct a custom rule, `CustomGradientRule`. This tutorial custom rule watches whether the gradients are getting too large, with the default threshold set to 10. The custom rule takes a base trial created by a SageMaker AI estimator when it initiates a training job.

```
from smdebug.rules.rule import Rule

class CustomGradientRule(Rule):
    def __init__(self, base_trial, threshold=10.0):
        super().__init__(base_trial)
        self.threshold = float(threshold)

    def invoke_at_step(self, step):
        for tname in self.base_trial.tensor_names(collection="gradients"):
            t = self.base_trial.tensor(tname)
            abs_mean = t.reduction_value(step, "mean", abs=True)
            if abs_mean > self.threshold:
                return True
        return False
```

You can add as many custom rule classes as you want to the same Python script and deploy them to any training job trials by constructing custom rule objects, as described in the following section.
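The decision made by `invoke_at_step` reduces to comparing each gradient tensor's mean absolute value with the threshold. The following standard-library-only sketch illustrates that check; the real rule reads these values from the trial through `reduction_value`, so the dictionary of raw gradient values here is purely illustrative:

```python
# Stdlib-only sketch of the CustomGradientRule check: fire if any
# gradient tensor's mean absolute value exceeds the threshold.
def gradients_too_large(gradients_by_name, threshold=10.0):
    for name, values in gradients_by_name.items():
        abs_mean = sum(abs(v) for v in values) / len(values)
        if abs_mean > threshold:
            return True
    return False

# Hypothetical gradient values for two tensors.
grads = {"dense/kernel_grad": [0.5, -0.2], "dense/bias_grad": [30.0, -25.0]}
print(gradients_too_large(grads))  # True: the bias gradients average 27.5 in magnitude
```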

# Use the Debugger APIs to run your own custom rules
<a name="debugger-custom-rules-python-sdk"></a>

The following code sample shows how to configure a custom rule with the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). This example assumes that the custom rule script you created in the previous step is located at '*path/to/my\_custom\_rule.py*'.

```
from sagemaker.debugger import Rule, CollectionConfig

custom_rule = Rule.custom(
    name='MyCustomRule',
    image_uri='759209512951.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rule-evaluator:latest', 
    instance_type='ml.t3.medium',     
    source='path/to/my_custom_rule.py', 
    rule_to_invoke='CustomGradientRule',     
    collections_to_save=[CollectionConfig("gradients")], 
    rule_parameters={"threshold": "20.0"}
)
```

The following list explains the Debugger `Rule.custom` API arguments.
+ `name` (str): A name for your custom rule.
+ `image_uri` (str): The image of the container that has the logic of evaluating your custom rule. It sources and evaluates the specified tensor collections that you save in the training job. You can find the list of open source SageMaker AI rule evaluator images in [Amazon SageMaker Debugger image URIs for custom rule evaluators](debugger-reference.md#debuger-custom-rule-registry-ids).
+ `instance_type` (str): The type of instance on which to run the rule Docker container. The instance spins up in parallel with the training container.
+ `source` (str): The local path or the Amazon S3 URI of your custom rule script.
+ `rule_to_invoke` (str): The particular Rule class implementation in your custom rule script. SageMaker AI supports only one rule to be evaluated at a time in a rule job.
+ `collections_to_save` (list): The tensor collections that you save for the rule to run on.
+ `rule_parameters` (dict): Parameter inputs in a dictionary format. You can adjust the parameters that you configured in the custom rule script.

After you set up the `custom_rule` object, you can use it to build a SageMaker AI estimator for any training job. Specify the `entry_point` to your training script. You do not need to make any changes to your training script.

```
import sagemaker
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
                role=sagemaker.get_execution_role(),
                base_job_name='smdebug-custom-rule-demo-tf-keras',
                entry_point='path/to/your_training_script.py',
                train_instance_type='ml.p2.xlarge',
                ...
                
                # debugger-specific arguments below
                rules=[custom_rule]
)

estimator.fit()
```

For more variations and advanced examples of using Debugger custom rules, see the following example notebooks.
+ [Monitor your training job with Amazon SageMaker Debugger custom rules](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_keras_custom_rule/tf-keras-custom-rule.html)
+ [PyTorch iterative model pruning of ResNet and AlexNet](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/pytorch_iterative_model_pruning)
+ [Trigger Amazon CloudWatch Events using Debugger Rules to Take an Action Based on Training Status with TensorFlow](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow_action_on_rule)

# Use Debugger with custom training containers
<a name="debugger-bring-your-own-container"></a>

Amazon SageMaker Debugger is available for any deep learning models that you bring to Amazon SageMaker AI. The AWS CLI, the SageMaker AI `Estimator` API, and the Debugger APIs enable you to use any Docker base image to build and customize containers to train your models. To use Debugger with customized containers, you need to make a minimal change to your training script to implement the Debugger hook callback and retrieve tensors from training jobs. The following sections walk you through how to use Debugger with custom training containers.

You need the following resources to build a customized container with Debugger.
+ [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable)
+ [The SMDebug open source client library](https://github.com/awslabs/sagemaker-debugger)
+ A Docker base image of your choice
+ Your training script with a Debugger hook registered – For more information about registering a Debugger hook to your training script, see [Register Debugger hook to your training script](#debugger-script-mode).

For an end-to-end example of using Debugger with a custom training container, see the following example notebook.
+ [Build a Custom Training Container and Debug Training Jobs with Debugger](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/build_your_own_container_with_debugger/debugger_byoc.html)

**Tip**  
This custom container with Debugger guide is an extension of the [Adapting your own training container](adapt-training-container.md) guide, which walks you through how to build and push your custom training container to Amazon ECR.

## Prepare to build a custom training container
<a name="debugger-bring-your-own-container-1"></a>

To build a Docker container, the basic structure of files should look like the following:

```
├── debugger_custom_container_test_notebook.ipynb      # a notebook to run python snippet codes
└── debugger_custom_container_test_folder              # this is a docker folder
    ├──  your-training-script.py                       # your training script with Debugger hook
    └──  Dockerfile                                    # a Dockerfile to build your own container
```

## Register Debugger hook to your training script
<a name="debugger-script-mode"></a>

To debug your model training, you need to add a Debugger hook to your training script.

**Note**  
This step is required to collect model parameters (output tensors) for debugging your model training. If you only want to monitor and profile, you can skip this hook registration step and exclude the `debugger_hook_config` parameter when constructing an estimator.

The following example code shows the structure of a training script using the Keras ResNet50 model and how to pass the Debugger hook as a Keras callback for debugging. To find a complete training script, see [TensorFlow training script with SageMaker Debugger hook](https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-debugger/build_your_own_container_with_debugger/docker/tf_keras_resnet_byoc.py).

```
# An example of training script (your-training-script.py)
import argparse

import tensorflow.compat.v2 as tf
from tensorflow.keras.applications.resnet50 import ResNet50
import smdebug.tensorflow as smd

def train(batch_size, epoch, model, hook):

    ...
    model.fit(X_train, Y_train,
              batch_size=batch_size,
              epochs=epoch,
              validation_data=(X_valid, Y_valid),
              shuffle=True,

              # smdebug modification: Pass the Debugger hook in the main() as a Keras callback
              callbacks=[hook])

def main():
    parser=argparse.ArgumentParser(description="Train resnet50 cifar10")

    # hyperparameter settings
    parser.add_argument(...)
    
    args = parser.parse_args()

    model=ResNet50(weights=None, input_shape=(32,32,3), classes=10)

    # Add the following line to register the Debugger hook for Keras.
    hook=smd.KerasHook.create_from_json_file()

    # Start the training.
    train(args.batch_size, args.epoch, model, hook)

if __name__ == "__main__":
    main()
```

For more information about registering the Debugger hook for the supported frameworks and algorithm, see the following links in the SMDebug client library:
+ [SMDebug TensorFlow hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/tensorflow.md)
+ [SMDebug PyTorch hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/pytorch.md)
+ [SMDebug MXNet hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/mxnet.md)
+ [SMDebug XGBoost hook](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/xgboost.md)

The training scripts in the following example notebooks provide more detailed examples of how to add Debugger hooks to training scripts and collect output tensors:
+ [ Debugger in script mode with the TensorFlow 2.1 framework](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow2/tensorflow2_keras_custom_container/tf2-keras-custom-container.html)

  To see the difference between using Debugger in a Deep Learning Container and in script mode, open this notebook and put it and [ the previous Debugger in a Deep Learning Container TensorFlow v2.1 notebook example](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow2/tensorflow2_zero_code_change/tf2-keras-default-container.html) side by side. 

   In script mode, the hook configuration part is removed from the script in which you set the estimator. Instead, the Debugger hook feature is merged into the training script, [ TensorFlow Keras ResNet training script in script mode](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-debugger/tensorflow2/tensorflow2_keras_custom_container/src/tf_keras_resnet_byoc.py). The training script imports the `smdebug` library in the required TensorFlow Keras environment to communicate with the TensorFlow ResNet50 algorithm. It also manually implements the `smdebug` hook functionality by adding the `callbacks=[hook]` argument inside the `train` function (in line 49), and by adding the manual hook configuration (in line 89) provided through SageMaker Python SDK.

  This script mode example runs the training job in the TF 2.1 framework for direct comparison with the zero script change in the TF 2.1 example. The benefit of setting up Debugger in script mode is the flexibility to choose framework versions not covered by AWS Deep Learning Containers. 
+ [ Using Amazon SageMaker Debugger in a PyTorch Container in Script Mode ](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/pytorch_custom_container)

This notebook enables Debugger in script mode with the PyTorch v1.3.1 framework. PyTorch v1.3.1 is supported by SageMaker AI containers, and this example shows details of how to modify a training script. 

  The SageMaker AI PyTorch estimator is already in script mode by default. In the notebook, the line to activate `script_mode` is not included in the estimator configuration.

  This notebook shows detailed steps to change [the original PyTorch training script](https://github.com/pytorch/examples/blob/master/mnist/main.py) to a modified version to enable Debugger. Additionally, this example shows how you can use Debugger built-in rules to detect training issues such as the vanishing gradients problem, and the Debugger trial features to call and analyze the saved tensors. 

## Create and configure a Dockerfile
<a name="debugger-bring-your-own-container-2"></a>

Open your SageMaker AI JupyterLab and create a new folder, `debugger_custom_container_test_folder` in this example, to save your training script and `Dockerfile`. The following code example is a `Dockerfile` that includes essential Docker build commands. Paste the following code into the `Dockerfile` text file and save it. Upload your training script to the same folder.

```
# Specify a docker base image
FROM tensorflow/tensorflow:2.2.0rc2-gpu-py3
RUN /usr/bin/python3 -m pip install --upgrade pip
RUN pip install --upgrade protobuf

# Install required packages to enable the SageMaker Python SDK and the smdebug library
RUN pip install sagemaker-training
RUN pip install smdebug
CMD ["bin/bash"]
```

If you want to use a pre-built AWS Deep Learning Container image, see [Available AWS Deep Learning Containers Images](https://aws.amazon.com/releasenotes/available-deep-learning-containers-images/).

## Build and push the custom training image to Amazon ECR
<a name="debugger-bring-your-own-container-3"></a>

Create a test notebook, `debugger_custom_container_test_notebook.ipynb`, and run the following code in a notebook cell. The code accesses the `debugger_custom_container_test_folder` directory, builds the Docker image with the specified `ecr_repository` name, and pushes the container image to your Amazon ECR repository.

```
import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'sagemaker-debugger-mnist-byoc-tf2'
tag = ':latest'

region = boto3.session.Session().region_name

uri_suffix = 'amazonaws.com'
if region in ['cn-north-1', 'cn-northwest-1']:
    uri_suffix = 'amazonaws.com.cn'
byoc_image_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)

!docker build -t $ecr_repository debugger_custom_container_test_folder
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $byoc_image_uri
!docker push $byoc_image_uri
```

**Tip**  
If you use one of the AWS Deep Learning Container base images, run the following code to log in to Amazon ECR and access the Deep Learning Container image repository.  

```
! aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
```

## Run and debug training jobs using the custom training container
<a name="debugger-bring-your-own-container-4"></a>

After you build and push your Docker container to Amazon ECR, configure a SageMaker AI estimator with your training script and the Debugger-specific parameters. After you run `estimator.fit()`, Debugger collects output tensors, monitors them, and detects training issues. Using the saved tensors, you can further analyze the training job with the `smdebug` core features and tools. By configuring a workflow of the Debugger rule monitoring process with Amazon CloudWatch Events and AWS Lambda, you can automate stopping the training job whenever the Debugger rules spot training issues.

```
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.debugger import Rule, ProfilerRule, DebuggerHookConfig, ProfilerConfig, CollectionConfig, rule_configs

profiler_config=ProfilerConfig(...)
debugger_hook_config=DebuggerHookConfig(...)
rules=[
    Rule.sagemaker(rule_configs.built_in_rule()),
    ProfilerRule.sagemaker(rule_configs.BuiltInRule())
]

estimator=Estimator(
    image_uri=byoc_image_uri,
    entry_point="./debugger_custom_container_test_folder/your-training-script.py",
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-custom-container-test',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    
    # Debugger-specific parameters
    profiler_config=profiler_config,
    debugger_hook_config=debugger_hook_config,
    rules=rules
)

# start training
estimator.fit()
```

# Configure Debugger using SageMaker API
<a name="debugger-createtrainingjob-api"></a>

The preceding topics focus on using Debugger through the Amazon SageMaker Python SDK, which is a wrapper around the AWS SDK for Python (Boto3) and the SageMaker API operations, and which offers a high-level experience of accessing them. If you need to manually configure the SageMaker API operations using AWS Boto3 or the AWS Command Line Interface (AWS CLI), or use other SDKs such as Java, Go, and C++, this section covers how to configure the following low-level API operations.

**Topics**
+ [JSON (AWS CLI)](debugger-built-in-rules-api.CLI.md)
+ [SDK for Python (Boto3)](debugger-built-in-rules-api.Boto3.md)

# JSON (AWS CLI)
<a name="debugger-built-in-rules-api.CLI"></a>

Amazon SageMaker Debugger built-in rules can be configured for a training job using the [DebugHookConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugHookConfig.html), [DebugRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugRuleConfiguration.html), [ProfilerConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerConfig.html), and [ProfilerRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerRuleConfiguration.html) objects through the SageMaker AI [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API operation. You need to specify the right image URI in the `RuleEvaluatorImage` parameter, and the following examples walk you through how to set up the JSON strings to request [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html).

The following code shows a complete JSON template to run a training job with required settings and Debugger configurations. Save the template as a JSON file in your working directory and run the training job using AWS CLI. For example, save the following code as `debugger-training-job-cli.json`.

**Note**  
Ensure that you use the correct Docker container images. To find AWS Deep Learning Container images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules).

```
{
   "TrainingJobName": "debugger-aws-cli-test",
   "RoleArn": "arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-YYYYMMDDT123456",
   "AlgorithmSpecification": {
      // Specify a training Docker container image URI (Deep Learning Container or your own training container) to TrainingImage.
      "TrainingImage": "763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04",
      "TrainingInputMode": "File",
      "EnableSageMakerMetricsTimeSeries": false
   },
   "HyperParameters": {
      "sagemaker_program": "entry_point/tf-hvd-train.py",
      "sagemaker_submit_directory": "s3://sagemaker-us-west-2-111122223333/debugger-boto3-profiling-test/source.tar.gz"
   },
   "OutputDataConfig": { 
      "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/output"
   },
   "DebugHookConfig": { 
      "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/debug-output",
      "CollectionConfigurations": [
         {
            "CollectionName": "losses",
            "CollectionParameters" : {
                "train.save_interval": "50"
            }
         }
      ]
   },
   "DebugRuleConfigurations": [ 
      { 
         "RuleConfigurationName": "LossNotDecreasing",
         "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
         "RuleParameters": {"rule_to_invoke": "LossNotDecreasing"}
      }
   ],
   "ProfilerConfig": { 
      "S3OutputPath": "s3://sagemaker-us-west-2-111122223333/debugger-aws-cli-test/profiler-output",
      "ProfilingIntervalInMilliseconds": 500,
      "ProfilingParameters": {
          "DataloaderProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"MetricsRegex\": \".*\"}",
          "DetailedProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3}",
          "PythonProfilingConfig": "{\"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName\": \"cprofile\", \"cProfileTimer\": \"total_time\"}",
          "LocalPath": "/opt/ml/output/profiler/" 
      }
   },
   "ProfilerRuleConfigurations": [ 
      { 
         "RuleConfigurationName": "ProfilerReport",
         "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
         "RuleParameters": {"rule_to_invoke": "ProfilerReport"}
      }
   ],
   "ResourceConfig": { 
      "InstanceType": "ml.p3.8xlarge",
      "InstanceCount": 1,
      "VolumeSizeInGB": 30
   },
   
   "StoppingCondition": { 
      "MaxRuntimeInSeconds": 86400
   }
}
```

After saving the JSON file, run the following command in your terminal. (Use `!` at the beginning of the line if you use a Jupyter notebook.)

```
aws sagemaker create-training-job --cli-input-json file://debugger-training-job-cli.json
```
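If you prefer Python over the CLI, the parsed JSON template is an ordinary dictionary that can be submitted with Boto3 instead. A minimal sketch follows, with the `create_training_job` call left as a comment so the snippet runs without AWS credentials; the abbreviated request body here is illustrative, and you would load the full template file instead.

```python
import json

# Parse the request body (abbreviated here; in practice, load the full
# debugger-training-job-cli.json file saved earlier with json.load).
request = json.loads("""
{
   "TrainingJobName": "debugger-aws-cli-test",
   "StoppingCondition": {"MaxRuntimeInSeconds": 86400}
}
""")

# The dict maps directly onto the Boto3 call:
# import boto3
# boto3.client("sagemaker").create_training_job(**request)
print(request["TrainingJobName"])  # debugger-aws-cli-test
```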

## To configure a Debugger rule for debugging model parameters
<a name="debugger-built-in-rules-api-debug.CLI"></a>

The following code samples show how to configure a built-in `VanishingGradient` rule using this SageMaker API. 

**To enable Debugger to collect output tensors**

Specify the Debugger hook configuration as follows:

```
"DebugHookConfig": {
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
    "CollectionConfigurations": [
        {
            "CollectionName": "gradients",
            "CollectionParameters" : {
                "save_interval": "500"
            }
        }
    ]
}
```

This makes the training job save the `gradients` tensor collection every 500 steps, as set by `save_interval`. To find available `CollectionName` values, see [Debugger Built-in Collections](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#built-in-collections) in the *SMDebug client library documentation*. To find available `CollectionParameters` parameter keys and values, see the [CollectionConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig) class in the *SageMaker Python SDK documentation*.
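Assuming the hook saves on every step that is a multiple of the interval, starting at step 0, the steps captured by a `save_interval` of 500 can be sketched as:

```python
# Sketch of save_interval semantics (assumed: tensors are saved on every
# step that is a multiple of the interval, starting at step 0).
def saved_steps(total_steps, save_interval=500):
    return [step for step in range(total_steps) if step % save_interval == 0]

print(saved_steps(2001))  # [0, 500, 1000, 1500, 2000]
```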

**To enable Debugger rules for debugging the output tensors**

The following `DebugRuleConfigurations` API example shows how to run the built-in `VanishingGradient` rule on the saved `gradients` collection.

```
"DebugRuleConfigurations": [
    {
        "RuleConfigurationName": "VanishingGradient",
        "RuleEvaluatorImage": "503895931360.dkr.ecr.us-east-1.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "VanishingGradient",
            "threshold": "20.0"
        }
    }
]
```

With a configuration like the one in this sample, Debugger starts a rule evaluation job for your training job using the `VanishingGradient` rule on the saved `gradients` tensor collection. To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules). To find the key-value pairs for `RuleParameters`, see [List of Debugger built-in rules](debugger-built-in-rules.md).

## To configure a Debugger built-in rule for profiling system and framework metrics
<a name="debugger-built-in-rules-api-profile.CLI"></a>

The following example code shows how to specify the [ProfilerConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerConfig.html) object to enable collecting system and framework metrics.

**To enable Debugger profiling to collect system and framework metrics**

------
#### [ Target Step ]

```
"ProfilerConfig": { 
    // Optional. Path to an S3 bucket to save profiling outputs
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output", 
    // Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    "ProfilingIntervalInMilliseconds": 500, 
    "ProfilingParameters": {
        "DataloaderProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3, \"MetricsRegex\": \".*\" }",
        "DetailedProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3 }",
        // For PythonProfilingConfig,
        // available ProfilerName options: cProfile, Pyinstrument
        // available cProfileTimer options only when using cProfile: cpu, off_cpu, total_time
        "PythonProfilingConfig": "{ \"StartStep\": 5, \"NumSteps\": 3, \"ProfilerName\": \"cProfile\", \"cProfileTimer\": \"total_time\" }",
        // Optional. Local path for profiling outputs
        "LocalPath": "/opt/ml/output/profiler/" 
    }
}
```

------
#### [ Target Time Duration ]

```
"ProfilerConfig": { 
    // Optional. Path to an S3 bucket to save profiling outputs
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/profiler-output", 
    // Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    "ProfilingIntervalInMilliseconds": 500,
    "ProfilingParameters": {
        "DataloaderProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10, \"MetricsRegex\": \".*\" }",
        "DetailedProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10 }",
        // For PythonProfilingConfig,
        // available ProfilerName options: cProfile, Pyinstrument
        // available cProfileTimer options only when using cProfile: cpu, off_cpu, total_time
        "PythonProfilingConfig": "{ \"StartTimeInSecSinceEpoch\": 12345567789, \"DurationInSeconds\": 10, \"ProfilerName\": \"cProfile\", \"cProfileTimer\": \"total_time\" }",
        // Optional. Local path for profiling outputs
        "LocalPath": "/opt/ml/output/profiler/"  
    }
}
```

------

**To enable Debugger rules for profiling the metrics**

The following example code shows how to configure the `ProfilerReport` rule.

```
"ProfilerRuleConfigurations": [ 
    {
        "RuleConfigurationName": "ProfilerReport",
        "RuleEvaluatorImage": "895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest",
        "RuleParameters": {
            "rule_to_invoke": "ProfilerReport",
            "CPUBottleneck_cpu_threshold": "90",
            "IOBottleneck_threshold": "90"
        }
    }
]
```

To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules). To find the key-value pairs for `RuleParameters`, see [List of Debugger built-in rules](debugger-built-in-rules.md).

## Update Debugger profiling configuration using the `UpdateTrainingJob` API
<a name="debugger-updatetrainingjob-api.CLI"></a>

You can update the Debugger profiling configuration while your training job is running by using the [UpdateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateTrainingJob.html) API operation. Configure new [ProfilerConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerConfig.html) and [ProfilerRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerRuleConfiguration.html) objects, and specify the training job name in the `TrainingJobName` parameter.

```
{
    "ProfilerConfig": { 
        "DisableProfiler": boolean,
        "ProfilingIntervalInMilliseconds": number,
        "ProfilingParameters": { 
            "string" : "string" 
        }
    },
    "ProfilerRuleConfigurations": [ 
        { 
            "RuleConfigurationName": "string",
            "RuleEvaluatorImage": "string",
            "RuleParameters": { 
                "string" : "string" 
            }
        }
    ],
    "TrainingJobName": "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS"
}
```
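A minimal sketch of assembling that request body in Python follows, with the Boto3 call left as a comment so the snippet runs without an AWS session; the job name and rule evaluator image URI are placeholders, and the parameter names follow the template above:

```python
# Build the UpdateTrainingJob request shown above as a plain dict.
# The job name and image URI are placeholders, not real values.
request = {
    "TrainingJobName": "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS",
    "ProfilerConfig": {
        "DisableProfiler": False,
        "ProfilingIntervalInMilliseconds": 500,
        "ProfilingParameters": {"LocalPath": "/opt/ml/output/profiler/"},
    },
    "ProfilerRuleConfigurations": [
        {
            "RuleConfigurationName": "ProfilerReport",
            "RuleEvaluatorImage": "<rule-evaluator-image-uri>",
            "RuleParameters": {"rule_to_invoke": "ProfilerReport"},
        }
    ],
}

# Submit the update (requires AWS credentials and a running job):
# import boto3
# boto3.client("sagemaker").update_training_job(**request)
print(sorted(request.keys()))
```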

## Add Debugger custom rule configuration to the `CreateTrainingJob` API
<a name="debugger-custom-rules-api.CLI"></a>

A custom rule can be configured for a training job using the [DebugHookConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugHookConfig.html) and [DebugRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugRuleConfiguration.html) objects in the [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API operation. The following code sample shows how to configure a custom `ImproperActivation` rule written with the *smdebug* library using this SageMaker API operation. This example assumes that you've written the custom rule in the *custom\_rules.py* file and uploaded it to an Amazon S3 bucket. Pre-built Docker images that you can use to run your custom rules are listed at [Amazon SageMaker Debugger image URIs for custom rule evaluators](debugger-reference.md#debuger-custom-rule-registry-ids). You specify the URL registry address for the pre-built Docker image in the `RuleEvaluatorImage` parameter.

```
"DebugHookConfig": {
    "S3OutputPath": "s3://<default-bucket>/<training-job-name>/debug-output",
    "CollectionConfigurations": [
        {
            "CollectionName": "relu_activations",
            "CollectionParameters": {
                "include_regex": "relu",
                "save_interval": "500",
                "end_step": "5000"
            }
        }
    ]
},
"DebugRuleConfigurations": [
    {
        "RuleConfigurationName": "improper_activation_job",
        "RuleEvaluatorImage": "552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest",
        "InstanceType": "ml.c4.xlarge",
        "VolumeSizeInGB": 400,
        "RuleParameters": {
           "source_s3_uri": "s3://bucket/custom_rules.py",
           "rule_to_invoke": "ImproperActivation",
           "collection_names": "relu_activations"
        }
    }
]
```

To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules). To find the key-value pairs for `RuleParameters`, see [List of Debugger built-in rules](debugger-built-in-rules.md).
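Note that the API expects every value in `CollectionParameters` and `RuleParameters` to be a string, even for numeric settings such as `save_interval`. A minimal sketch of preparing such parameter maps (the `stringify_params` helper is hypothetical, for illustration only; the bucket path is the placeholder from the example above):

```python
# The CreateTrainingJob API models CollectionParameters and RuleParameters
# as string-to-string maps, so numeric values must be passed as strings.
# This helper (hypothetical, for illustration) casts all values to str.

def stringify_params(params):
    """Return a copy of params with every value cast to a string."""
    return {key: str(value) for key, value in params.items()}

rule_parameters = stringify_params({
    "source_s3_uri": "s3://bucket/custom_rules.py",  # placeholder bucket from the example
    "rule_to_invoke": "ImproperActivation",
    "collection_names": "relu_activations",
})

collection_parameters = stringify_params({
    "include_regex": "relu",
    "save_interval": 500,   # numbers are fine here; they are cast to "500"
    "end_step": 5000,
})

assert all(isinstance(v, str) for v in collection_parameters.values())
```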

# SDK for Python (Boto3)
<a name="debugger-built-in-rules-api.Boto3"></a>

Amazon SageMaker Debugger built-in rules can be configured for a training job using the [create_training_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_training_job) function of the AWS Boto3 SageMaker AI client. You need to specify the right image URI in the `RuleEvaluatorImage` parameter; the following examples walk you through how to set up the request body for this function.

The following code shows a complete example of how to configure Debugger for the `create_training_job()` request body and start a training job in `us-west-2`, assuming that a training script `entry_point/train.py` is prepared using TensorFlow. To find an end-to-end example notebook, see [Profiling TensorFlow Multi GPU Multi Node Training Job with Amazon SageMaker Debugger (Boto3)](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_profiling/tf-resnet-profiling-multi-gpu-multi-node-boto3.html).

**Note**  
Ensure that you use the correct Docker container images. To find available AWS Deep Learning Container images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules).

```
import sagemaker, boto3
import datetime, tarfile

# Start setting up a SageMaker session and a Boto3 SageMaker client
session = sagemaker.Session()
region = session.boto_region_name
bucket = session.default_bucket()

# Upload a training script to a default Amazon S3 bucket of the current SageMaker session
source = 'source.tar.gz'
project = 'debugger-boto3-test'

tar = tarfile.open(source, 'w:gz')
tar.add('entry_point/train.py') # Specify the directory and name of your training script
tar.close()

s3 = boto3.client('s3')
s3.upload_file(source, bucket, project+'/'+source)

# Set up a Boto3 session client for SageMaker
sm = boto3.Session(region_name=region).client("sagemaker")

# Start a training job
sm.create_training_job(
    TrainingJobName='debugger-boto3-'+datetime.datetime.now().strftime('%Y-%m-%d-%H-%M-%S'),
    HyperParameters={
        'sagemaker_submit_directory': 's3://'+bucket+'/'+project+'/'+source,
        'sagemaker_program': '/entry_point/train.py' # training script file location and name under the sagemaker_submit_directory
    },
    AlgorithmSpecification={
        # Specify a training Docker container image URI (Deep Learning Container or your own training container) to TrainingImage.
        'TrainingImage': '763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04',
        'TrainingInputMode': 'File',
        'EnableSageMakerMetricsTimeSeries': False
    },
    RoleArn='arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-20201014T161125',
    OutputDataConfig={'S3OutputPath': 's3://'+bucket+'/'+project+'/output'},
    ResourceConfig={
        'InstanceType': 'ml.p3.8xlarge',
        'InstanceCount': 1,
        'VolumeSizeInGB': 30
    },
    StoppingCondition={
        'MaxRuntimeInSeconds': 86400
    },
    DebugHookConfig={
        'S3OutputPath': 's3://'+bucket+'/'+project+'/debug-output',
        'CollectionConfigurations': [
            {
                'CollectionName': 'losses',
                'CollectionParameters' : {
                    'train.save_interval': '500',
                    'eval.save_interval': '50'
                }
            }
        ]
    },
    DebugRuleConfigurations=[
        {
            'RuleConfigurationName': 'LossNotDecreasing',
            'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
            'RuleParameters': {'rule_to_invoke': 'LossNotDecreasing'}
        }
    ],
    ProfilerConfig={
        'S3OutputPath': 's3://'+bucket+'/'+project+'/profiler-output',
        'ProfilingIntervalInMilliseconds': 500,
        'ProfilingParameters': {
            'DataloaderProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}',
            'DetailedProfilingConfig': '{"StartStep": 5, "NumSteps": 3}',
            'PythonProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
            'LocalPath': '/opt/ml/output/profiler/' # Optional. Local path for profiling outputs
        }
    },
    ProfilerRuleConfigurations=[
        {
            'RuleConfigurationName': 'ProfilerReport',
            'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
            'RuleParameters': {'rule_to_invoke': 'ProfilerReport'}
        }
    ]
)
```

## To configure a Debugger rule for debugging model parameters
<a name="debugger-built-in-rules-api-debug.Boto3"></a>

The following code samples show how to configure a built-in `VanishingGradient` rule using this SageMaker API. 

**To enable Debugger to collect output tensors**

Specify the Debugger hook configuration as follows:

```
DebugHookConfig={
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/debug-output',
    'CollectionConfigurations': [
        {
            'CollectionName': 'gradients',
            'CollectionParameters' : {
                'train.save_interval': '500',
                'eval.save_interval': '50'
            }
        }
    ]
}
```

This configuration makes the training job save the `gradients` tensor collection every 500 training steps and every 50 evaluation steps. To find available `CollectionName` values, see [Debugger Built-in Collections](https://github.com/awslabs/sagemaker-debugger/blob/master/docs/api.md#built-in-collections) in the *SMDebug client library documentation*. To find available `CollectionParameters` keys and values, see the [CollectionConfig](https://sagemaker.readthedocs.io/en/stable/api/training/debugger.html#sagemaker.debugger.CollectionConfig) class in the *SageMaker Python SDK documentation*.
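The interval arithmetic behind `save_interval` can be sketched as follows (a simplified illustration of which steps a collection is saved at, assuming saving starts at step 0):

```python
def saved_steps(save_interval, last_step):
    """Sketch: steps at which a collection is saved for a given interval."""
    return list(range(0, last_step + 1, save_interval))

# With 'train.save_interval': '500', tensors are saved at steps 0, 500, 1000, ...
print(saved_steps(500, 2000))  # [0, 500, 1000, 1500, 2000]
# With 'eval.save_interval': '50', evaluation tensors are saved every 50 steps.
print(saved_steps(50, 200))    # [0, 50, 100, 150, 200]
```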

**To enable Debugger rules for debugging the output tensors**

The following `DebugRuleConfigurations` API example shows how to run the built-in `VanishingGradient` rule on the saved `gradients` collection.

```
DebugRuleConfigurations=[
    {
        'RuleConfigurationName': 'VanishingGradient',
        'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
        'RuleParameters': {
            'rule_to_invoke': 'VanishingGradient',
            'threshold': '20.0'
        }
    }
]
```

With a configuration like the one in this sample, Debugger starts a rule evaluation job for your training job using the `VanishingGradient` rule on the `gradients` tensor collection. To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules). To find the key-value pairs for `RuleParameters`, see [List of Debugger built-in rules](debugger-built-in-rules.md).
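Conceptually, a vanishing-gradient check fires when the mean absolute gradient drops below a threshold. The following is a simplified stand-in for that condition (not the actual rule implementation, which runs on the saved tensor data inside the rule evaluation job):

```python
def gradients_vanishing(gradient_values, threshold):
    """Simplified stand-in: True if the mean absolute gradient is below threshold."""
    mean_abs = sum(abs(g) for g in gradient_values) / len(gradient_values)
    return mean_abs < threshold

# Tiny gradients trip the condition; healthy gradients do not.
assert gradients_vanishing([1e-9, -2e-9, 5e-10], threshold=1e-7)
assert not gradients_vanishing([0.5, -0.3, 0.2], threshold=1e-7)
```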

## To configure a Debugger built-in rule for profiling system and framework metrics
<a name="debugger-built-in-rules-api-profile.Boto3"></a>

The following example code shows how to specify the `ProfilerConfig` object to enable collecting system and framework metrics.

**To enable Debugger profiling to collect system and framework metrics**

------
#### [ Target Step ]

```
ProfilerConfig={ 
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/profiler-output', # Optional. Path to an S3 bucket to save profiling outputs
    # Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    'ProfilingIntervalInMilliseconds': 500, 
    'ProfilingParameters': {
        # Each profiling config is a JSON document passed as a string.
        'DataloaderProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}',
        'DetailedProfilingConfig': '{"StartStep": 5, "NumSteps": 3}',
        # ProfilerName options: cprofile, pyinstrument.
        # cProfileTimer (include only when using cprofile) options: cpu, off_cpu, total_time.
        'PythonProfilingConfig': '{"StartStep": 5, "NumSteps": 3, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
        'LocalPath': '/opt/ml/output/profiler/' # Optional. Local path for profiling outputs
    }
}
```

------
#### [ Target Time Duration ]

```
ProfilerConfig={ 
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/profiler-output', # Optional. Path to an S3 bucket to save profiling outputs
    # Available values for ProfilingIntervalInMilliseconds: 100, 200, 500, 1000 (1 second), 5000 (5 seconds), and 60000 (1 minute) milliseconds.
    'ProfilingIntervalInMilliseconds': 500,
    'ProfilingParameters': {
        # Each profiling config is a JSON document passed as a string.
        'DataloaderProfilingConfig': '{"StartTimeInSecSinceEpoch": 12345567789, "DurationInSeconds": 10, "MetricsRegex": ".*"}',
        'DetailedProfilingConfig': '{"StartTimeInSecSinceEpoch": 12345567789, "DurationInSeconds": 10}',
        # ProfilerName options: cprofile, pyinstrument.
        # cProfileTimer (include only when using cprofile) options: cpu, off_cpu, total_time.
        'PythonProfilingConfig': '{"StartTimeInSecSinceEpoch": 12345567789, "DurationInSeconds": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
        'LocalPath': '/opt/ml/output/profiler/' # Optional. Local path for profiling outputs
    }
}
```

------
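Because each `ProfilingParameters` entry is a JSON document passed as a string, a quick sanity check before submitting the request can catch malformed values (for example, trailing commas or single quotes) early. A minimal sketch:

```python
import json

# The same profiling config strings used in the Target Step example above.
profiling_parameters = {
    "DataloaderProfilingConfig": '{"StartStep": 5, "NumSteps": 3, "MetricsRegex": ".*"}',
    "DetailedProfilingConfig": '{"StartStep": 5, "NumSteps": 3}',
    "PythonProfilingConfig": '{"StartStep": 5, "NumSteps": 3, "ProfilerName": "cprofile", "cProfileTimer": "total_time"}',
}

for name, value in profiling_parameters.items():
    json.loads(value)  # raises json.JSONDecodeError if the string is not valid JSON
```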

**To enable Debugger rules for profiling the metrics**

The following example code shows how to configure the `ProfilerReport` rule.

```
ProfilerRuleConfigurations=[ 
    {
        'RuleConfigurationName': 'ProfilerReport',
        'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest',
        'RuleParameters': {
            'rule_to_invoke': 'ProfilerReport',
            'CPUBottleneck_cpu_threshold': '90',
            'IOBottleneck_threshold': '90'
        }
    }
]
```

To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules). To find the key-value pairs for `RuleParameters`, see [List of Debugger built-in rules](debugger-built-in-rules.md).

## Update Debugger profiling configuration using the `UpdateTrainingJob` API operation
<a name="debugger-updatetrainingjob-api.Boto3"></a>

You can update the Debugger profiling configuration while your training job is running by using the [update_training_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.update_training_job) function of the AWS Boto3 SageMaker AI client. Configure new [ProfilerConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerConfig.html) and [ProfilerRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerRuleConfiguration.html) objects, and specify the training job name in the `TrainingJobName` parameter.

```
ProfilerConfig={ 
    'DisableProfiler': boolean,
    'ProfilingIntervalInMilliseconds': number,
    'ProfilingParameters': { 
        'string' : 'string' 
    }
},
ProfilerRuleConfigurations=[ 
    { 
        'RuleConfigurationName': 'string',
        'RuleEvaluatorImage': 'string',
        'RuleParameters': { 
            'string' : 'string' 
        }
    }
],
TrainingJobName='your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS'
```
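For example, to turn profiling off mid-job you might assemble the update payload as follows (a sketch; the job name is the placeholder from the skeleton above, and the actual `update_training_job` call is left commented out because it requires AWS credentials):

```python
# Payload for update_training_job that disables profiling on a running job.
update_kwargs = {
    "TrainingJobName": "your-training-job-name-YYYY-MM-DD-HH-MM-SS-SSS",  # placeholder
    "ProfilerConfig": {
        "DisableProfiler": True,  # stop collecting system and framework metrics
    },
}

# With credentials configured, the call would look like:
# import boto3
# sm = boto3.Session(region_name="us-west-2").client("sagemaker")
# sm.update_training_job(**update_kwargs)
```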

## Add Debugger custom rule configuration to the `CreateTrainingJob` API operation
<a name="debugger-custom-rules-api.Boto3"></a>

A custom rule can be configured for a training job using the [DebugHookConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugHookConfig.html) and [DebugRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugRuleConfiguration.html) objects with the AWS Boto3 SageMaker AI client's [create_training_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_training_job) function. The following code sample shows how to configure a custom `ImproperActivation` rule written with the *smdebug* library. This example assumes that you’ve written the custom rule in a *custom_rules.py* file and uploaded it to an Amazon S3 bucket. SageMaker AI provides pre-built Docker images that you can use to run your custom rules; they are listed at [Amazon SageMaker Debugger image URIs for custom rule evaluators](debugger-reference.md#debuger-custom-rule-registry-ids). Specify the registry URL of the pre-built Docker image in the `RuleEvaluatorImage` parameter.

```
DebugHookConfig={
    'S3OutputPath': 's3://<default-bucket>/<training-job-name>/debug-output',
    'CollectionConfigurations': [
        {
            'CollectionName': 'relu_activations',
            'CollectionParameters': {
                'include_regex': 'relu',
                'save_interval': '500',
                'end_step': '5000'
            }
        }
    ]
},
DebugRuleConfigurations=[
    {
        'RuleConfigurationName': 'improper_activation_job',
        'RuleEvaluatorImage': '552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest',
        'InstanceType': 'ml.c4.xlarge',
        'VolumeSizeInGB': 400,
        'RuleParameters': {
           'source_s3_uri': 's3://bucket/custom_rules.py',
           'rule_to_invoke': 'ImproperActivation',
           'collection_names': 'relu_activations'
        }
    }
]
```

To find a complete list of available Docker images for using the Debugger rules, see [Docker images for Debugger rules](debugger-reference.md#debugger-docker-images-rules). To find the key-value pairs for `RuleParameters`, see [List of Debugger built-in rules](debugger-built-in-rules.md).

# Amazon SageMaker Debugger references
<a name="debugger-reference"></a>

Find more information and references about using Amazon SageMaker Debugger in the following topics.

**Topics**
+ [Amazon SageMaker Debugger APIs](#debugger-apis)
+ [Docker images for Debugger rules](#debugger-docker-images-rules)
+ [Amazon SageMaker Debugger exceptions](#debugger-exceptions)
+ [Distributed training supported by Amazon SageMaker Debugger](#debugger-considerations)

## Amazon SageMaker Debugger APIs
<a name="debugger-apis"></a>

Amazon SageMaker Debugger has API operations in several locations that are used to implement its monitoring and analysis of model training.

Amazon SageMaker Debugger also provides the open source [`sagemaker-debugger` Python SDK](https://github.com/awslabs/sagemaker-debugger/tree/master/smdebug) that is used to configure built-in rules, define custom rules, and register hooks to collect output tensor data from training jobs.

The [Amazon SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/) is a high-level SDK focused on machine learning experimentation. The SDK can be used to deploy built-in or custom rules defined with the `SMDebug` Python library to monitor and analyze these tensors using SageMaker AI estimators.

Debugger has added operations and types to the Amazon SageMaker API that enable the platform to use Debugger when training a model and to manage the configuration of inputs and outputs. 
+ [CreateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) and [UpdateTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateTrainingJob.html) use the following Debugger APIs to configure tensor collections, rules, rule images, and profiling options:
  + [CollectionConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CollectionConfiguration.html)
  + [DebugHookConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugHookConfig.html)
  + [DebugRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugRuleConfiguration.html)
  + [TensorBoardOutputConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_TensorBoardOutputConfig.html)
  + [ProfilerConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerConfig.html)
  + [ProfilerRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerRuleConfiguration.html)
+ [DescribeTrainingJob](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html) provides a full description of a training job, including the following Debugger configurations and rule evaluation statuses:
  + [DebugHookConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugHookConfig.html)
  + [DebugRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugRuleConfiguration.html)
  + [DebugRuleEvaluationStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DebugRuleEvaluationStatus.html)
  + [ProfilerConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerConfig.html)
  + [ProfilerRuleConfiguration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerRuleConfiguration.html)
  + [ProfilerRuleEvaluationStatus](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProfilerRuleEvaluationStatus.html)

The rule configuration API operations use the SageMaker Processing functionality when analyzing a model training. For more information about SageMaker Processing, see [Data transformation workloads with SageMaker Processing](processing-job.md).

## Docker images for Debugger rules
<a name="debugger-docker-images-rules"></a>

Amazon SageMaker AI provides two sets of Docker images for rules: one set for evaluating rules provided by SageMaker AI (built-in rules) and one set for evaluating custom rules provided in Python source files. 

If you use the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable), you can use the SageMaker AI high-level Debugger API operations with the SageMaker AI Estimator API operations, without having to manually retrieve the Debugger Docker images and configure the `CreateTrainingJob` API. 

If you are not using the SageMaker Python SDK, you have to retrieve the relevant pre-built container image for the Debugger rules. Amazon SageMaker Debugger provides pre-built Docker images for built-in and custom rules, and the images are stored in Amazon Elastic Container Registry (Amazon ECR). To pull an image from an Amazon ECR repository, specify the image's full registry URL in the `CreateTrainingJob` API. SageMaker AI uses the following URL pattern for the Debugger rule container image registry address. 

```
<account_id>.dkr.ecr.<Region>.amazonaws.com/<ECR repository name>:<tag>
```

For the account ID in each AWS Region, the Amazon ECR repository name, and the tag value, see the following topics.
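The URL pattern can be assembled programmatically. A minimal sketch using two example rows from the built-in rules account table that follows (extend the dictionary with the other Regions you need):

```python
# Account IDs for the built-in rules images, copied from the table below.
BUILT_IN_RULE_ACCOUNTS = {
    "us-west-2": "895741380848",
    "ap-south-1": "904829902805",
}

def debugger_rule_image_uri(region, repository="sagemaker-debugger-rules", tag="latest"):
    """Build the ECR registry URL for a Debugger built-in rule image (sketch)."""
    account_id = BUILT_IN_RULE_ACCOUNTS[region]
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"

print(debugger_rule_image_uri("us-west-2"))
# 895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest
```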

**Topics**
+ [Amazon SageMaker Debugger image URIs for built-in rule evaluators](#debuger-built-in-registry-ids)
+ [Amazon SageMaker Debugger image URIs for custom rule evaluators](#debuger-custom-rule-registry-ids)

### Amazon SageMaker Debugger image URIs for built-in rule evaluators
<a name="debuger-built-in-registry-ids"></a>

Use the following values for the components of the registry URLs for the images that provide built-in rules for Amazon SageMaker Debugger. For account IDs, see the following table.

**ECR Repository Name**: sagemaker-debugger-rules 

**Tag**: latest 

**Example of a full registry URL**: 

`904829902805.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rules:latest`

Account IDs for Built-in Rules Container Images by AWS Region


| Region | Account ID | 
| --- | --- | 
| af-south-1 |  314341159256  | 
| ap-east-1 |  199566480951  | 
| ap-northeast-1 |  430734990657   | 
| ap-northeast-2 |  578805364391  | 
| ap-south-1 |  904829902805  | 
| ap-southeast-1 |  972752614525  | 
| ap-southeast-2 |  184798709955  | 
| ca-central-1 |  519511493484  | 
| cn-north-1 |  618459771430  | 
| cn-northwest-1 |  658757709296  | 
| eu-central-1 |  482524230118  | 
| eu-north-1 |  314864569078  | 
| eu-south-1 |  563282790590  | 
| eu-west-1 |  929884845733  | 
| eu-west-2 |  250201462417  | 
| eu-west-3 |  447278800020  | 
| me-south-1 |  986000313247  | 
| sa-east-1 |  818342061345  | 
| us-east-1 |  503895931360  | 
| us-east-2 |  915447279597  | 
| us-west-1 |  685455198987  | 
| us-west-2 |  895741380848  | 
| us-gov-west-1 |  515509971035  | 

### Amazon SageMaker Debugger image URIs for custom rule evaluators
<a name="debuger-custom-rule-registry-ids"></a>

Use the following values for the components of the registry URL for the images that provide custom rule evaluators for Amazon SageMaker Debugger. For account IDs, see the following table.

**ECR Repository Name**: sagemaker-debugger-rule-evaluator 

**Tag**: latest 

**Example of a full registry URL**: 

`552407032007.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-debugger-rule-evaluator:latest`

Account IDs for Custom Rules Container Images by AWS Region


| Region | Account ID | 
| --- | --- | 
| af-south-1 |  515950693465  | 
| ap-east-1 |  645844755771  | 
| ap-northeast-1 |  670969264625   | 
| ap-northeast-2 |  326368420253  | 
| ap-south-1 |  552407032007  | 
| ap-southeast-1 |  631532610101  | 
| ap-southeast-2 |  445670767460  | 
| ca-central-1 |  105842248657  | 
| cn-north-1 |  617202126805  | 
| cn-northwest-1 |  658559488188  | 
| eu-central-1 |  691764027602  | 
| eu-north-1 |  091235270104  | 
| eu-south-1 |  335033873580  | 
| eu-west-1 |  606966180310  | 
| eu-west-2 |  074613877050  | 
| eu-west-3 |  224335253976  | 
| me-south-1 |  050406412588  | 
| sa-east-1 |  466516958431  | 
| us-east-1 |  864354269164  | 
| us-east-2 |  840043622174  | 
| us-west-1 |  952348334681  | 
| us-west-2 |  759209512951  | 
| us-gov-west-1 |  515361955729  | 

## Amazon SageMaker Debugger exceptions
<a name="debugger-exceptions"></a>

Amazon SageMaker Debugger is designed to be aware that the tensors required to execute a rule might not be available at every step. As a result, it raises a few exceptions that let you control what happens when a tensor is missing. These exceptions are available in the [smdebug.exceptions module](https://github.com/awslabs/sagemaker-debugger/blob/master/smdebug/exceptions.py). You can import them as follows:

```
from smdebug.exceptions import *
```

The following exceptions are available:
+ `TensorUnavailableForStep` – The requested tensor is not available for the step. Either the hook did not save anything at this step, or it saved other tensors but not the requested one. This exception means that the tensor can never become available for this step. If reductions of the tensor were saved for the step, the exception notifies you that they can be queried.
+ `TensorUnavailable` – The tensor is not being saved, or has not been saved, by the `smdebug` hook, so it is not seen at any step.
+ `StepUnavailable` – The step was not saved and Debugger has no data from the step.
+ `StepNotYetAvailable` – The step has not yet been seen by `smdebug`. It might be available in the future if the training is still going on. Debugger automatically loads new data as it becomes available.
+ `NoMoreData` – Raised when the training ends. Once you see this, you know that there are no more steps and no more tensors to be saved.
+ `IndexReaderException` – The index reader is not valid.
+ `InvalidWorker` – A worker was invoked that was not valid.
+ `RuleEvaluationConditionMet` – Evaluation of the rule at the step resulted in the condition being met.
+ `InsufficientInformationForRuleInvocation` – Insufficient information was provided to invoke the rule.

## Distributed training supported by Amazon SageMaker Debugger
<a name="debugger-considerations"></a>

The following list shows the scope of validity and considerations for using Debugger on training jobs with deep learning frameworks and various distributed training options.
+ **Horovod**

  Scope of validity of using Debugger for training jobs with Horovod    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-reference.html)
+ **SageMaker AI distributed data parallel**

  Scope of validity of using Debugger for training jobs with SageMaker AI distributed data parallel    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/debugger-reference.html)

  \* Debugger does not support framework profiling for TensorFlow 2.x.

  \*\* SageMaker AI distributed data parallel does not support TensorFlow 2.x with the Keras implementation.
+ **SageMaker AI distributed model parallel** – Debugger does not support SageMaker AI distributed model parallel training.
+ **Distributed training with SageMaker AI checkpoints** – Debugger is not available for training jobs when both the distributed training option and SageMaker AI checkpoints are enabled. You might see an error that looks like the following: 

  ```
  SMDebug Does Not Currently Support Distributed Training Jobs With Checkpointing Enabled
  ```

  To use Debugger for training jobs with distributed training options, you need to disable SageMaker AI checkpointing and add manual checkpointing functions to your training script. For more information about using Debugger with distributed training options and checkpoints, see [Using SageMaker AI distributed data parallel with Amazon SageMaker Debugger and checkpoints](distributed-troubleshooting-data-parallel.md#distributed-ts-data-parallel-debugger) and [Saving Checkpoints](distributed-troubleshooting-model-parallel.md#distributed-ts-model-parallel-checkpoints).
+ **Parameter Server** – Debugger does not support parameter server-based distributed training.
+ Profiling distributed training framework operations, such as the `AllReduce` operation of SageMaker AI distributed data parallel and [Horovod operations](https://horovod.readthedocs.io/en/stable/timeline_include.html), is not available.