

# Debugger tutorials
<a name="debugger-tutorial"></a>

The following topics walk you through tutorials from the basics to advanced use cases of monitoring, profiling, and debugging SageMaker training jobs using Debugger. Explore the Debugger features and learn how you can debug and improve your machine learning models efficiently by using Debugger.

**Topics**
+ [Debugger tutorial videos](debugger-videos.md)
+ [Debugger example notebooks](debugger-notebooks.md)
+ [Debugger advanced demos and visualization](debugger-visualization.md)

# Debugger tutorial videos
<a name="debugger-videos"></a>

The following videos provide a tour of Amazon SageMaker Debugger capabilities using SageMaker Studio and SageMaker AI notebook instances. 

**Topics**
+ [Debugging models with Amazon SageMaker Debugger in Studio Classic](#debugger-video-get-started)
+ [Deep dive on Amazon SageMaker Debugger and SageMaker AI model monitor](#debugger-video-dive-deep)

## Debugging models with Amazon SageMaker Debugger in Studio Classic
<a name="debugger-video-get-started"></a>

*Julien Simon, AWS Technical Evangelist | Length: 14 minutes 17 seconds*

This tutorial video demonstrates how to use Amazon SageMaker Debugger to capture and inspect debugging information from a training model. The example training model used in this video is a simple convolutional neural network (CNN) based on Keras with the TensorFlow backend. With the SageMaker AI TensorFlow framework and Debugger, you can build an estimator directly from the training script and debug the training job.

[![AWS Videos](http://img.youtube.com/vi/MqPdTj0Znwg/0.jpg)](http://www.youtube.com/watch?v=MqPdTj0Znwg)


You can find the example notebook in the video in [this Studio Demo repository](https://gitlab.com/juliensimon/amazon-studio-demos/-/tree/master) provided by the author. You need to clone the `debugger.ipynb` notebook file and the `mnist_keras_tf.py` training script to your SageMaker Studio or a SageMaker notebook instance. After you clone the two files, set the `keras_script_path` variable inside the `debugger.ipynb` notebook to the path of the `mnist_keras_tf.py` file. For example, if you cloned the two files in the same directory, set it as `keras_script_path = "mnist_keras_tf.py"`.
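The setup described above — a script-mode TensorFlow estimator with Debugger enabled — can be sketched roughly as follows, assuming the SageMaker Python SDK v2. The framework version, instance type, and rule choice here are illustrative placeholders, not the exact values used in the video.

```python
# Sketch: a TensorFlow estimator built directly from the training script,
# with a Debugger hook and one built-in rule attached. Placeholder values
# are marked in comments.
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, Rule, rule_configs

estimator = TensorFlow(
    entry_point="mnist_keras_tf.py",       # training script from the demo repository
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type="ml.m5.xlarge",          # placeholder instance type
    framework_version="2.8",               # placeholder framework version
    py_version="py39",
    # Tell Debugger how often to save tensors during training
    debugger_hook_config=DebuggerHookConfig(
        hook_parameters={"save_interval": "100"}
    ),
    # Built-in rule that flags a loss that stops decreasing
    rules=[Rule.sagemaker(rule_configs.loss_not_decreasing())],
)
# estimator.fit()  # starts the training job with Debugger enabled
```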

## Deep dive on Amazon SageMaker Debugger and SageMaker AI model monitor
<a name="debugger-video-dive-deep"></a>

*Julien Simon, AWS Technical Evangelist | Length: 44 minutes 34 seconds*

This video session explores advanced features of Debugger and SageMaker Model Monitor that help boost productivity and the quality of your models. First, this video shows how to detect and fix training issues, visualize tensors, and improve models with Debugger. Next, at 22:41, the video shows how to monitor models in production and identify prediction issues such as missing features or data drift using SageMaker AI Model Monitor. Finally, it offers cost optimization tips to help you make the most of your machine learning budget.

[![AWS Videos](http://img.youtube.com/vi/0zqoeZxakOI/0.jpg)](http://www.youtube.com/watch?v=0zqoeZxakOI)


You can find the example notebook in the video in [this AWS Dev Days 2020 repository](https://gitlab.com/juliensimon/awsdevdays2020/-/tree/master/mls1) offered by the author.

# Debugger example notebooks
<a name="debugger-notebooks"></a>

[SageMaker Debugger example notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/) are provided in the [aws/amazon-sagemaker-examples](https://github.com/aws/amazon-sagemaker-examples) repository. The Debugger example notebooks walk you through basic to advanced use cases of debugging and profiling training jobs. 

We recommend that you run the example notebooks on SageMaker Studio or a SageMaker Notebook instance because most of the examples are designed for training jobs in the SageMaker AI ecosystem, including Amazon EC2, Amazon S3, and Amazon SageMaker Python SDK. 

To clone the example repository to SageMaker Studio, follow the instructions at [Amazon SageMaker Studio Tour](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-studio-end-to-end.html).

**Important**  
To use the new Debugger features, you need to upgrade the SageMaker Python SDK and the `SMDebug` client library. In your IPython kernel, Jupyter Notebook, or JupyterLab environment, run the following code to install the latest versions of the libraries and restart the kernel.  

```
import sys
import IPython
!{sys.executable} -m pip install -U sagemaker smdebug
IPython.Application.instance().kernel.do_shutdown(True)
```

## Debugger example notebooks for profiling training jobs
<a name="debugger-notebooks-profiling"></a>

The following Debugger example notebooks show how to use Debugger to monitor and profile training jobs for various machine learning models, datasets, and frameworks.


| Notebook Title | Framework | Model | Dataset | Description | 
| --- | --- | --- | --- | --- | 
|  [Amazon SageMaker Debugger Profiling Data Analysis](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/debugger_interactive_analysis_profiling/interactive_analysis_profiling_data.html)  |  TensorFlow  |  Keras ResNet50  | Cifar-10 |  This notebook provides an introduction to interactive analysis of profiled data captured by SageMaker Debugger. Explore the full functionality of the `SMDebug` interactive analysis tools.  | 
|  [Profile machine learning training with Amazon SageMaker Debugger ](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/tensorflow_nlp_sentiment_analysis/sentiment-analysis-tf-distributed-training-bringyourownscript.html)  |  TensorFlow  |  1-D Convolutional Neural Network  |  IMDB dataset  |  Profile a TensorFlow 1-D CNN for sentiment analysis of IMDB data that consists of movie reviews labeled as having positive or negative sentiment. Explore the Studio Debugger insights and Debugger profiling report.  | 
|  [Profiling TensorFlow ResNet model training with various distributed training settings](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow_profiling)  |  TensorFlow  | ResNet50 | Cifar-10 |  Run TensorFlow training jobs with various distributed training settings, monitor system resource utilization, and profile model performance using Debugger.  | 
|  [Profiling PyTorch ResNet model training with various distributed training settings](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/pytorch_profiling)   | PyTorch |  ResNet50  | Cifar-10 |  Run PyTorch training jobs with various distributed training settings, monitor system resource utilization, and profile model performance using Debugger.  | 

## Debugger example notebooks for analyzing model parameters
<a name="debugger-notebooks-debugging"></a>

The following Debugger example notebooks show how to use Debugger to debug training jobs for various machine learning models, datasets, and frameworks.


| Notebook Title | Framework | Model | Dataset | Description | 
| --- | --- | --- | --- | --- | 
|  [Amazon SageMaker Debugger - Use built-in rule](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow_builtin_rule)  |  TensorFlow  |  Convolutional Neural Network  | MNIST |  Use the Amazon SageMaker Debugger built-in rules for debugging a TensorFlow model.  | 
|  [Amazon SageMaker Debugger - Tensorflow 2.1](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/tensorflow2)  |  TensorFlow  |  ResNet50  | Cifar-10 |  Use the Amazon SageMaker Debugger hook configuration and built-in rules for debugging a model with the TensorFlow 2.1 framework.  | 
|  [Visualizing Debugging Tensors of MXNet training](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mnist_tensor_plot)  |  MXNet  |  Gluon Convolutional Neural Network  | Fashion MNIST |  Run a training job and configure SageMaker Debugger to store all tensors from this job, then visualize those tensors in a notebook.  | 
|  [Enable Spot Training with Amazon SageMaker Debugger](https://github.com/aws/amazon-sagemaker-examples/tree/master/sagemaker-debugger/mxnet_spot_training)   | MXNet |  Gluon Convolutional Neural Network  | Fashion MNIST |  Learn how Debugger collects tensor data from a training job on a spot instance, and how to use the Debugger built-in rules with managed spot training.  | 
| [Explain an XGBoost model that predicts an individual’s income with Amazon SageMaker Debugger](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/xgboost_census_explanations/xgboost-census-debugger-rules.html) | XGBoost |  XGBoost Regression  |  [Adult Census dataset](https://archive.ics.uci.edu/ml/datasets/adult)  | Learn how to use the Debugger hook and built-in rules for collecting and visualizing tensor data from an XGBoost regression model, such as loss values, features, and SHAP values. | 

To find advanced visualizations of model parameters and use cases, see the next topic at [Debugger advanced demos and visualization](debugger-visualization.md).

# Debugger advanced demos and visualization
<a name="debugger-visualization"></a>

The following demos walk you through advanced use cases and visualization scripts using Debugger.

**Topics**
+ [Training and pruning models with Amazon SageMaker Experiments and Debugger](#debugger-visualization-video-model-pruning)
+ [Using SageMaker Debugger to monitor a convolutional autoencoder model training](#debugger-visualization-autoencoder_mnist)
+ [Using SageMaker Debugger to monitor attentions in BERT model training](#debugger-visualization-bert_attention_head_view)
+ [Using SageMaker Debugger to visualize class activation maps in convolutional neural networks (CNNs)](#debugger-visualization-cnn_class_activation_maps)

## Training and pruning models with Amazon SageMaker Experiments and Debugger
<a name="debugger-visualization-video-model-pruning"></a>

*Dr. Nathalie Rauschmayr, AWS Applied Scientist | Length: 49 minutes 26 seconds*

[![AWS Videos](http://img.youtube.com/vi/Tnv6HsT1r4I/0.jpg)](http://www.youtube.com/watch?v=Tnv6HsT1r4I)


Find out how Amazon SageMaker Experiments and Debugger can simplify the management of your training jobs. Amazon SageMaker Debugger provides transparent visibility into training jobs and saves training metrics into your Amazon S3 bucket. SageMaker Experiments enables you to call the training information as *trials* through SageMaker Studio and supports visualization of the training job. This helps you keep model quality high while removing less important parameters based on their importance rank.

This video demonstrates a *model pruning* technique that makes pre-trained ResNet50 and AlexNet models lighter and more affordable to run while maintaining high model accuracy.

A SageMaker AI estimator trains these models, supplied from the PyTorch model zoo, in AWS Deep Learning Containers with the PyTorch framework, and Debugger extracts training metrics from the training process.

The video also demonstrates how to set up a Debugger custom rule to watch the accuracy of a pruned model, to trigger an Amazon CloudWatch event and an AWS Lambda function when the accuracy hits a threshold, and to automatically stop the pruning process to avoid redundant iterations. 
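As a rough illustration of that stopping mechanism, a Lambda function triggered by a CloudWatch Events rule might look like the following sketch. The event field names are assumptions based on the SageMaker training job state change event format; verify them against your own event payloads before relying on them.

```python
# Hypothetical Lambda handler: stop a training job when a Debugger rule
# reports issues. Event field names are assumed from the SageMaker
# training job state change event format.
def lambda_handler(event, context):
    detail = event["detail"]
    job_name = detail["TrainingJobName"]
    rule_statuses = detail.get("DebugRuleEvaluationStatuses", [])
    if any(s.get("RuleEvaluationStatus") == "IssuesFound" for s in rule_statuses):
        # Create the SageMaker client only when the job actually needs stopping
        import boto3
        sagemaker_client = boto3.client("sagemaker")
        sagemaker_client.stop_training_job(TrainingJobName=job_name)
        return {"stopped": job_name}
    return {"stopped": None}
```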

Learning objectives are as follows: 
+  Learn how to use SageMaker AI to accelerate ML model training and improve model quality. 
+  Understand how to manage training iterations with SageMaker Experiments by automatically capturing input parameters, configurations, and results. 
+  Discover how Debugger makes the training process transparent by automatically capturing real-time tensor data from metrics such as weights, gradients, and activation outputs of convolutional neural networks.
+ Use CloudWatch to trigger Lambda when Debugger catches issues.
+  Master the SageMaker training process using SageMaker Experiments and Debugger.

You can find the notebooks and training scripts used in this video from [SageMaker Debugger PyTorch Iterative Model Pruning](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-debugger/pytorch_iterative_model_pruning).

The following image shows how the iterative model pruning process reduces the size of AlexNet by cutting out the 100 least significant filters based on importance rank evaluated by activation outputs and gradients.

The pruning process reduced the initial 50 million parameters to 18 million. It also reduced the estimated model size from 201 MB to 73 MB. 

![\[An image containing model pruning result output visualizations\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-model-pruning-results-alexnet.gif)


You also need to track model accuracy, and the following image shows how you can plot the model pruning process to visualize changes in model accuracy based on the number of parameters in SageMaker Studio.

![\[An image of tensor visualization using Debugger in SageMaker Studio\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-model-pruning-studio.png)


In SageMaker Studio, choose the **Experiments** tab, select a list of tensors saved by Debugger from the pruning process, and then compose a **Trial Component List** panel. Select all ten iterations and then choose **Add chart** to create a **Trial Component Chart**. After you decide on a model to deploy, choose the trial component and choose a menu to perform an action or choose **Deploy model**.

**Note**  
To deploy a model through SageMaker Studio using the following notebook example, add a line at the end of the `train` function in the `train.py` script.  

```
# In the train.py script, look for the train function in line 58.
def train(epochs, batch_size, learning_rate):
    ...
        print('acc:{:.4f}'.format(correct/total))
        hook.save_scalar("accuracy", correct/total, sm_metric=True)

    # Add the following code to line 128 of the train.py script to save the pruned models
    # under the current SageMaker Studio model directory
    torch.save(model.state_dict(), os.environ['SM_MODEL_DIR'] + '/model.pt')
```

## [Using SageMaker Debugger to monitor a convolutional autoencoder model training](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/model_specific_realtime_analysis/autoencoder_mnist/autoencoder_mnist.html)
<a name="debugger-visualization-autoencoder_mnist"></a>

This notebook demonstrates how SageMaker Debugger visualizes tensors from an unsupervised (or self-supervised) learning process on an MNIST image dataset of handwritten digits.

The training model in this notebook is a convolutional autoencoder with the MXNet framework. The convolutional autoencoder has a bottleneck-shaped convolutional neural network that consists of an encoder part and a decoder part. 

The encoder in this example has two convolution layers to produce compressed representation (latent variables) of the input images. In this case, the encoder produces a latent variable of size (1, 20) from an original input image of size (28, 28) and significantly reduces the size of data for training by 40 times.
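The roughly 40-times figure follows directly from the tensor shapes:

```python
# Compression achieved by the autoencoder bottleneck: a (28, 28) input
# image is encoded into a (1, 20) latent variable.
input_size = 28 * 28     # 784 values per input image
latent_size = 1 * 20     # 20 values per latent variable
ratio = input_size / latent_size
print(ratio)             # → 39.2, roughly a 40x reduction
```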

The decoder has two *deconvolutional* layers and ensures that the latent variables preserve key information by reconstructing output images.

The convolutional encoder reduces the input data size and thereby improves the performance of clustering algorithms such as k-means, k-NN, and t-Distributed Stochastic Neighbor Embedding (t-SNE).

This notebook example demonstrates how to visualize the latent variables using Debugger, as shown in the following animation. It also demonstrates how the t-SNE algorithm classifies the latent variables into ten clusters and projects them into a two-dimensional space. The scatter plot color scheme on the right side of the image reflects the true values to show how well the autoencoder and the t-SNE algorithm organize the latent variables into the clusters.

![\[A conceptual image of convolutional autoencoder\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-cnn-autoencoder-plot.gif)


## [Using SageMaker Debugger to monitor attentions in BERT model training](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/model_specific_realtime_analysis/bert_attention_head_view/bert_attention_head_view.html)
<a name="debugger-visualization-bert_attention_head_view"></a>

Bidirectional Encoder Representations from Transformers (BERT) is a language representation model. As the model's name reflects, the BERT model builds on *transfer learning* and the *Transformer model* for natural language processing (NLP).

The BERT model is pre-trained on unsupervised tasks such as predicting missing words in a sentence or predicting the next sentence that naturally follows a previous sentence. The training data contains 3.3 billion words (tokens) of English text, from sources such as Wikipedia and electronic books. For a simple example, the BERT model can give high *attention* to the appropriate verb tokens or pronoun tokens for a subject token.

The pre-trained BERT model can be fine-tuned with an additional output layer to achieve state-of-the-art model training in NLP tasks, such as automated responses to questions, text classification, and many others. 

Debugger collects tensors from the fine-tuning process. In the context of NLP, the weight of neurons is called *attention*. 

This notebook demonstrates how to use the [pre-trained BERT model from the GluonNLP model zoo](https://gluon-nlp.mxnet.io/model_zoo/bert/index.html) on the Stanford Question and Answering dataset and how to set up SageMaker Debugger to monitor the training job.

Plotting *attention scores* and individual neurons in the query and key vectors can help to identify causes of incorrect model predictions. With SageMaker AI Debugger, you can retrieve the tensors and plot the *attention-head view* in real time as training progresses and understand what the model is learning.
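For example, the tensors saved by Debugger can be loaded during or after training with the `SMDebug` trials API; in the sketch below, the S3 output path and the tensor-name pattern are placeholders that will differ in your own training job.

```python
# Sketch: read tensors saved by Debugger with the SMDebug trials API.
# The S3 path and the "attention" name pattern are placeholders.
from smdebug.trials import create_trial

trial = create_trial("s3://my-bucket/debugger-output")   # placeholder path
attention_tensors = trial.tensor_names(regex=".*attention.*")
for name in attention_tensors:
    for step in trial.tensor(name).steps():
        value = trial.tensor(name).value(step)           # NumPy array
        print(name, step, value.shape)
```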

The following animation shows the attention scores of the first 20 input tokens for ten iterations in the training job provided in the notebook example.

![\[An animation of the attention scores\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-attention_scores.gif)


## [Using SageMaker Debugger to visualize class activation maps in convolutional neural networks (CNNs)](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-debugger/model_specific_realtime_analysis/cnn_class_activation_maps/cnn_class_activation_maps.html)
<a name="debugger-visualization-cnn_class_activation_maps"></a>

This notebook demonstrates how to use SageMaker Debugger to plot class activation maps for image detection and classification in convolutional neural networks (CNNs). In deep learning, a *convolutional neural network (CNN or ConvNet)* is a class of deep neural networks, most commonly applied to analyzing visual imagery. One of the applications that adopts the class activation maps is self-driving cars, which require instantaneous detection and classification of images such as traffic signs, roads, and obstacles.

In this notebook, the PyTorch ResNet model is trained on [the German Traffic Sign Dataset](http://benchmark.ini.rub.de/), which contains more than 40 classes of traffic-related objects and more than 50,000 images in total.

![\[An animation of CNN class activation maps\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/debugger/debugger-cnn-class-activation-maps.gif)


During the training process, SageMaker Debugger collects tensors to plot the class activation maps in real time. As shown in the animated image, the class activation map (also called a *saliency map*) highlights regions with high activation in red. 

Using tensors captured by Debugger, you can visualize how the activation map evolves during the model training. The model starts by detecting the edge on the lower-left corner at the beginning of the training job. As the training progresses, the focus shifts to the center and detects the speed limit sign, and the model successfully predicts the input image as Class 3, which is a class of speed limit 60km/h signs, with a 97% confidence level.