

# Adapting your training script to register a hook
<a name="debugger-modify-script"></a>

Amazon SageMaker Debugger comes with a client library called the [`sagemaker-debugger` Python SDK](https://sagemaker-debugger.readthedocs.io/en/website). The `sagemaker-debugger` Python SDK provides tools for adapting your training script before training and analysis tools for after training. On this page, you'll learn how to adapt your training script using the client library. 

The `sagemaker-debugger` Python SDK provides wrapper functions that help register a hook to extract model tensors, without altering your training script. To get started with collecting model output tensors and debugging them to find training issues, make the following modifications in your training script.

**Tip**  
While you're following this page, use the [`sagemaker-debugger` open source SDK documentation](https://sagemaker-debugger.readthedocs.io/en/website/index.html) for API references.

**Topics**
+ [Adapt your PyTorch training script](debugger-modify-script-pytorch.md)
+ [Adapt your TensorFlow training script](debugger-modify-script-tensorflow.md)

# Adapt your PyTorch training script
<a name="debugger-modify-script-pytorch"></a>

To start collecting model output tensors and debug training issues, make the following modifications to your PyTorch training script.

**Note**  
SageMaker Debugger cannot collect model output tensors from the [`torch.nn.functional`](https://pytorch.org/docs/stable/nn.functional.html) API operations. When you write a PyTorch training script, it is recommended to use `torch.nn` modules, such as [`torch.nn.NLLLoss`](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html), instead.

## For PyTorch 1.12.0
<a name="debugger-modify-script-pytorch-1-12-0"></a>

If you bring a PyTorch training script, you can run the training job and extract model output tensors with a few additional code lines in your training script. You need to use the [hook APIs](https://sagemaker-debugger.readthedocs.io/en/website/hook-api.html) in the `sagemaker-debugger` client library. Walk through the following instructions that break down the steps with code examples.

1. Create a hook.

   **(Recommended) For training jobs within SageMaker AI**

   ```
   import smdebug.pytorch as smd
   hook=smd.get_hook(create_if_not_exists=True)
   ```

   When you launch a training job in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md) with any of the `DebuggerHookConfig`, `TensorBoardConfig`, or `Rules` parameters in your estimator, SageMaker AI adds a JSON configuration file to your training instance that is picked up by the `get_hook` function. Note that if you do not include any of the configuration APIs in your estimator, there will be no configuration file for the hook to find, and the function returns `None`.
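
   For example, the following is a minimal sketch of an estimator that includes a `DebuggerHookConfig`, so that `get_hook` finds the configuration file on the training instance. The entry point, IAM role, instance type, and S3 path below are placeholder values, not prescribed names.

   ```
   # Sketch only: the entry point, role, and S3 path are placeholders.
   from sagemaker.pytorch import PyTorch
   from sagemaker.debugger import CollectionConfig, DebuggerHookConfig

   estimator = PyTorch(
       entry_point="train.py",                # your adapted training script
       role="YourSageMakerExecutionRole",     # placeholder IAM role
       instance_count=1,
       instance_type="ml.p3.2xlarge",
       framework_version="1.12.0",
       py_version="py38",
       # Including DebuggerHookConfig makes SageMaker AI write the JSON
       # configuration file that smd.get_hook() picks up on the instance.
       debugger_hook_config=DebuggerHookConfig(
           s3_output_path="s3://your-bucket/debugger-output",  # placeholder
           collection_configs=[
               CollectionConfig(name="weights"),
               CollectionConfig(name="losses", parameters={"save_interval": "50"}),
           ],
       ),
   )
   estimator.fit()
   ```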

   **(Optional) For training jobs outside SageMaker AI**

   If you run training jobs in local mode, directly on SageMaker Notebook instances, Amazon EC2 instances, or your own local devices, use the `smd.Hook` class to create a hook. However, this approach can only store the tensor collections, which are usable for TensorBoard visualization. SageMaker Debugger’s built-in Rules don’t work in local mode, because the Rules require SageMaker AI ML training instances and Amazon S3 to store outputs from the remote instances in real time. The `smd.get_hook` API returns `None` in this case. 

   If you want to create a manual hook to save tensors in local mode, use the following code snippet, which checks whether the `smd.get_hook` API returns `None` and, if so, creates a manual hook using the `smd.Hook` class. Note that you can specify any output directory on your local machine.

   ```
   import smdebug.pytorch as smd
   hook=smd.get_hook(create_if_not_exists=True)
   
   if hook is None:
       hook=smd.Hook(
           out_dir='/path/to/your/local/output/',
           export_tensorboard=True
       )
   ```

1. Wrap your model with the hook’s class methods.

   The `hook.register_module()` method takes your model and iterates through each layer, looking for any tensors that match the regular expressions that you provide through the configuration in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md). The tensors that can be collected through this hook method are weights, biases, activations, gradients, inputs, and outputs.

   ```
   hook.register_module(model)
   ```
**Tip**  
If you collect the entire output tensors from a large deep learning model, the total size of those collections can grow exponentially and might cause bottlenecks. If you want to save specific tensors, you can also use the `hook.save_tensor()` method. This method lets you pick the variable for the specific tensor and save it to a custom collection with a name of your choice. For more information, see [step 7](#debugger-modify-script-pytorch-save-custom-tensor) of this instruction.

1. Wrap the loss function with the hook’s class methods.

   Use the `hook.register_loss` method to wrap the loss function. It extracts loss values every `save_interval` that you set during configuration in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md), and saves them to the `"losses"` collection.

   ```
   hook.register_loss(loss_function)
   ```

1. Add `hook.set_mode(ModeKeys.TRAIN)` in the train block. This indicates that the tensors are collected during the training phase.

   ```
   def train():
       ...
       hook.set_mode(ModeKeys.TRAIN)
   ```

1. Add `hook.set_mode(ModeKeys.EVAL)` in the validation block. This indicates that the tensors are collected during the validation phase.

   ```
   def validation():
       ...
       hook.set_mode(ModeKeys.EVAL)
   ```
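
   Putting the previous steps together, the following is a minimal sketch of a training script with the hook registered. The model class `Net` and the data loaders `train_loader` and `valid_loader` are hypothetical placeholders for your own code, and the optimizer settings are illustrative.

   ```
   import torch
   import smdebug.pytorch as smd
   from smdebug.core.modes import ModeKeys

   model = Net()                        # placeholder for your own model class
   loss_function = torch.nn.NLLLoss()
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

   hook = smd.get_hook(create_if_not_exists=True)
   hook.register_module(model)          # collect weights, biases, gradients, ...
   hook.register_loss(loss_function)    # save loss values to the "losses" collection

   def train():
       model.train()
       hook.set_mode(ModeKeys.TRAIN)    # tensors are saved under the TRAIN mode
       for data, target in train_loader:
           optimizer.zero_grad()
           loss = loss_function(model(data), target)
           loss.backward()
           optimizer.step()

   def validation():
       model.eval()
       hook.set_mode(ModeKeys.EVAL)     # tensors are saved under the EVAL mode
       with torch.no_grad():
           for data, target in valid_loader:
               output = model(data)
   ```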

1. Use [`hook.save_scalar()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-constructor.html#smdebug.core.hook.BaseHook.save_scalar) to save custom scalars. You can save scalar values that aren’t in your model. For example, if you want to record the accuracy values computed during evaluation, add the following line of code below the line where you calculate accuracy.

   ```
   hook.save_scalar("accuracy", accuracy)
   ```

   Note that you need to provide a string as the first argument to name the custom scalar collection. This is the name that'll be used for visualizing the scalar values in TensorBoard, and can be any string you want.

1. <a name="debugger-modify-script-pytorch-save-custom-tensor"></a>Use [`hook.save_tensor()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-constructor.html#smdebug.core.hook.BaseHook.save_tensor) to save custom tensors. Similarly to [`hook.save_scalar()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-constructor.html#smdebug.core.hook.BaseHook.save_scalar), you can save additional tensors, defining your own tensor collection. For example, you can extract input image data that is passed into the model and save it as a custom tensor by adding the following line of code, where `"images"` is an example name for the custom tensor and `image_inputs` is an example variable for the input image data.

   ```
   hook.save_tensor("images", image_inputs)
   ```

   Note that you must provide a string as the first argument to name the custom tensor. `hook.save_tensor()` has a third argument, `collections_to_write`, that specifies the tensor collection in which to save the custom tensor. The default is `collections_to_write="default"`. If you don't explicitly specify the third argument, the custom tensor is saved to the `"default"` tensor collection.
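
   For example, the following sketch saves the input batch to a custom collection instead of the `"default"` collection. The collection name `"custom_images"` is arbitrary and only illustrative.

   ```
   hook.save_tensor(
       "images",                              # name of the custom tensor
       image_inputs,                          # the tensor variable to save
       collections_to_write="custom_images",  # target tensor collection
   )
   ```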

After you have completed adapting your training script, proceed to [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md).

# Adapt your TensorFlow training script
<a name="debugger-modify-script-tensorflow"></a>

To start collecting model output tensors and debug training issues, make the following modifications to your TensorFlow training script.

**Create a hook for training jobs within SageMaker AI**

```
import smdebug.tensorflow as smd

hook=smd.get_hook(hook_type="keras", create_if_not_exists=True)
```

This creates a hook when you start a SageMaker training job. When you launch a training job in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md) with any of the `DebuggerHookConfig`, `TensorBoardConfig`, or `Rules` in your estimator, SageMaker AI adds a JSON configuration file to your training instance that is picked up by the `smd.get_hook` method. Note that if you do not include any of the configuration APIs in your estimator, there will be no configuration file for the hook to find, and the function returns `None`.

**(Optional) Create a hook for training jobs outside SageMaker AI**

If you run training jobs in local mode, directly on SageMaker Notebook instances, Amazon EC2 instances, or your own local devices, use the `smd.Hook` class to create a hook. However, this approach can only store the tensor collections, which are usable for TensorBoard visualization. SageMaker Debugger’s built-in Rules don’t work in local mode. The `smd.get_hook` method also returns `None` in this case. 

If you want to create a manual hook, use the following code snippet, which checks whether the hook is `None` and, if so, creates a manual hook using the `smd.Hook` class.

```
import smdebug.tensorflow as smd

hook=smd.get_hook(hook_type="keras", create_if_not_exists=True) 

if hook is None:
    hook=smd.KerasHook(
        out_dir='/path/to/your/local/output/',
        export_tensorboard=True
    )
```

After adding the hook creation code, proceed to the following topic for TensorFlow Keras.

**Note**  
SageMaker Debugger currently supports TensorFlow Keras only.

## Register the hook in your TensorFlow Keras training script
<a name="debugger-modify-script-tensorflow-keras"></a>

The following procedure walks you through how to use the hook and its methods to collect output scalars and tensors from your model and optimizer.

1. Wrap your Keras model and optimizer with the hook’s class methods.

   The `hook.register_model()` method takes your model and iterates through each layer, looking for any tensors that match the regular expressions that you provide through the configuration in [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md). The tensors that can be collected through this hook method are weights, biases, and activations.

   ```
   model=tf.keras.Model(...)
   hook.register_model(model)
   ```

1. Wrap the optimizer with the `hook.wrap_optimizer()` method.

   ```
   optimizer=tf.keras.optimizers.Adam(...)
   optimizer=hook.wrap_optimizer(optimizer)
   ```

1. Compile the model in eager mode in TensorFlow.

   To collect tensors from the model, such as the input and output tensors of each layer, you must run the training in eager mode. Otherwise, SageMaker Debugger will not be able to collect the tensors. However, other tensors, such as model weights, biases, and the loss, can be collected without explicitly running in eager mode.

   ```
   model.compile(
       loss="categorical_crossentropy", 
       optimizer=optimizer, 
       metrics=["accuracy"],
       # Required for collecting tensors of each layer
       run_eagerly=True
   )
   ```

1. Register the hook to the Keras [`model.fit()`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) method.

   To collect the tensors from the hooks that you registered, add `callbacks=[hook]` to the Keras `model.fit()` class method. This will pass the `sagemaker-debugger` hook as a Keras callback.

   ```
   model.fit(
       X_train, Y_train,
       batch_size=batch_size,
       epochs=epoch,
       validation_data=(X_valid, Y_valid),
       shuffle=True, 
       callbacks=[hook]
   )
   ```
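
   Combining the previous steps, the following is a minimal end-to-end sketch. The layer sizes and the dataset variables `X_train`, `Y_train`, `X_valid`, and `Y_valid` are placeholders for your own data.

   ```
   import tensorflow as tf
   import smdebug.tensorflow as smd

   hook = smd.get_hook(hook_type="keras", create_if_not_exists=True)

   # Placeholder model; replace with your own architecture.
   model = tf.keras.Sequential([
       tf.keras.layers.Dense(128, activation="relu"),
       tf.keras.layers.Dense(10, activation="softmax"),
   ])
   hook.register_model(model)                  # step 1: wrap the model

   optimizer = hook.wrap_optimizer(            # step 2: wrap the optimizer
       tf.keras.optimizers.Adam()
   )

   model.compile(                              # step 3: compile in eager mode
       loss="categorical_crossentropy",
       optimizer=optimizer,
       metrics=["accuracy"],
       run_eagerly=True,
   )

   model.fit(                                  # step 4: pass the hook as a callback
       X_train, Y_train,
       batch_size=64,
       epochs=5,
       validation_data=(X_valid, Y_valid),
       callbacks=[hook],
   )
   ```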

1. TensorFlow 2.x provides only symbolic gradient variables that do not provide access to their values. To collect gradients, wrap `tf.GradientTape` with the [`hook.wrap_tape()`](https://sagemaker-debugger.readthedocs.io/en/website/hook-methods.html#tensorflow-specific-hook-api) method, which requires you to write your own training step as follows.

   ```
   def training_step(model, data, labels):
       with hook.wrap_tape(tf.GradientTape()) as tape:
           pred=model(data)
           loss_value=loss_fn(labels, pred)
       grads=tape.gradient(loss_value, model.trainable_variables)
       optimizer.apply_gradients(zip(grads, model.trainable_variables))
   ```

   By wrapping the tape, the `sagemaker-debugger` hook can identify output tensors such as gradients, parameters, and losses. The `hook.wrap_tape()` method wraps functions of the tape object, such as `push_tape()`, `pop_tape()`, and `gradient()`, to set up the writers of SageMaker Debugger and save the tensors that are provided as input to `gradient()` (trainable variables and loss) and output from `gradient()` (gradients).
**Note**  
To collect tensors with a custom training loop, make sure that you use eager mode. Otherwise, SageMaker Debugger is not able to collect any tensors.
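
A custom training step like the one above is typically driven by an outer loop that also sets the mode, for example as in the following sketch. The names `train_dataset` and `num_epochs`, and the `training_step` signature shown here, are illustrative assumptions rather than prescribed APIs.

```
from smdebug.core.modes import ModeKeys

# Hypothetical driver loop for a custom training step that takes
# (model, data, labels); adjust to match your own function signature.
hook.set_mode(ModeKeys.TRAIN)   # mark the saved tensors as training-phase
for epoch in range(num_epochs):
    for data, labels in train_dataset:
        training_step(model, data, labels)
```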

For a full list of actions that the `sagemaker-debugger` hook APIs offer to construct hooks and save tensors, see [Hook Methods](https://sagemaker-debugger.readthedocs.io/en/website/hook-methods.html) in the *`sagemaker-debugger` Python SDK documentation*.

After you have completed adapting your training script, proceed to [Launch training jobs with Debugger using the SageMaker Python SDK](debugger-configuration-for-debugging.md).