

# GPU Monitoring and Optimization
<a name="tutorial-gpu"></a>

This section guides you through GPU monitoring and optimization options. It is organized like a typical workflow, with monitoring overseeing preprocessing and training.
+ [Monitoring](tutorial-gpu-monitoring.md)
  + [Monitor GPUs with CloudWatch](tutorial-gpu-monitoring-gpumon.md)
+ [Optimization](tutorial-gpu-opt.md)
  + [Preprocessing](tutorial-gpu-opt-preprocessing.md)
  + [Training](tutorial-gpu-opt-training.md)

# Monitoring
<a name="tutorial-gpu-monitoring"></a>

Your DLAMI comes preinstalled with several GPU monitoring tools. This guide also mentions tools that are available to download and install.
+ [Monitor GPUs with CloudWatch](tutorial-gpu-monitoring-gpumon.md) - a preinstalled utility that reports GPU usage statistics to Amazon CloudWatch.
+ [nvidia-smi CLI](https://developer.nvidia.com/nvidia-system-management-interface) - a utility to monitor overall GPU compute and memory utilization. This is preinstalled on your AWS Deep Learning AMIs (DLAMI).
+ [NVML C library](https://developer.nvidia.com/nvidia-management-library-nvml) - a C-based API to directly access GPU monitoring and management functions. It is used by the nvidia-smi CLI under the hood and is preinstalled on your DLAMI. It also has Python and Perl bindings to facilitate development in those languages. The gpumon.py utility preinstalled on your DLAMI uses the pynvml package from [nvidia-ml-py](https://pypi.org/project/nvidia-ml-py/).
+ [NVIDIA DCGM](https://developer.nvidia.com/data-center-gpu-manager-dcgm) - a cluster management tool. Visit the developer page to learn how to install and configure this tool.
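
As a quick illustration of reading the figures that the nvidia-smi CLI reports, the following minimal Python sketch parses the CSV output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`. The helper names (`parse_gpu_csv`, `query_gpus`) and the sample text are assumptions made for this example and are not part of the DLAMI. `query_gpus` needs an NVIDIA driver, so the example below only parses a hypothetical sample string.

```python
import subprocess

# Fields passed to nvidia-smi's --query-gpu option, in output order.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def _to_number(value):
    """Convert one CSV field to a float, or None if it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None  # nvidia-smi prints e.g. "[N/A]" for unsupported fields


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        rows.append({name: _to_number(v) for name, v in zip(QUERY_FIELDS, values)})
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (requires an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Hypothetical sample output for a two-GPU instance.
sample = "0, 35, 1024, 16160\n1, 0, 3, 16160\n"
for gpu in parse_gpu_csv(sample):
    print(gpu)
```

On a GPU instance, calling `query_gpus()` in a loop gives a simple polling monitor. For production use, the NVML bindings avoid spawning a process for every sample.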

**Tip**  
Check out NVIDIA's developer blog for the latest info on using the CUDA tools installed on your DLAMI:  
[Monitoring TensorCore utilization using Nsight IDE and nvprof](https://devblogs.nvidia.com/using-nsight-compute-nvprof-mixed-precision-deep-learning-models/).

# Monitor GPUs with CloudWatch
<a name="tutorial-gpu-monitoring-gpumon"></a>

When you use your DLAMI with a GPU, you might want to track GPU usage during training or inference. This information can help you optimize your data pipeline and tune your deep learning network.

There are two ways to configure GPU metrics with CloudWatch:
+ [Configure metrics with the AWS CloudWatch agent (Recommended)](#tutorial-gpu-monitoring-gpumon-cloudwatch-agent-guide)
+ [Configure metrics with the preinstalled `gpumon.py` script](#tutorial-gpu-monitoring-gpumon-script)

## Configure metrics with the AWS CloudWatch agent (Recommended)
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-guide"></a>

Integrate your DLAMI with the [unified CloudWatch agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html) to configure GPU metrics and monitor the utilization of GPU coprocessors in Amazon EC2 accelerated instances.

There are four ways to configure [GPU metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-NVIDIA-GPU.html) with your DLAMI:
+ [Configure minimal GPU metrics](#tutorial-gpu-monitoring-gpumon-cloudwatch-agent-minimal)
+ [Configure partial GPU metrics](#tutorial-gpu-monitoring-gpumon-cloudwatch-agent-partial)
+ [Configure all available GPU metrics](#tutorial-gpu-monitoring-gpumon-cloudwatch-agent-all)
+ [Configure custom GPU metrics](#tutorial-gpu-monitoring-gpumon-cloudwatch-agent-custom)

For information on updates and security patches, see [Security patching for the AWS CloudWatch agent](#tutorial-gpu-monitoring-gpumon-cloudwatch-agent-security).

### Prerequisites
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-prerequisites"></a>

To get started, you must configure Amazon EC2 instance IAM permissions that allow your instance to push metrics to CloudWatch. For detailed steps, see [Create IAM roles and users for use with the CloudWatch agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create-iam-roles-for-cloudwatch-agent.html).

### Configure minimal GPU metrics
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-minimal"></a>

Configure minimal GPU metrics using the `dlami-cloudwatch-agent@minimal` `systemd` service. This service configures the following metrics:
+ `utilization_gpu`
+ `utilization_memory`

You can find the configuration file that the `systemd` service uses for the minimal preconfigured GPU metrics in the following location:

```
/opt/aws/amazon-cloudwatch-agent/etc/dlami-amazon-cloudwatch-agent-minimal.json
```

Enable and start the `systemd` service with the following commands:

```
sudo systemctl enable dlami-cloudwatch-agent@minimal
sudo systemctl start dlami-cloudwatch-agent@minimal
```

### Configure partial GPU metrics
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-partial"></a>

Configure partial GPU metrics using the `dlami-cloudwatch-agent@partial` `systemd` service. This service configures the following metrics:
+ `utilization_gpu`
+ `utilization_memory`
+ `memory_total`
+ `memory_used`
+ `memory_free`

You can find the configuration file that the `systemd` service uses for the partial preconfigured GPU metrics in the following location:

```
/opt/aws/amazon-cloudwatch-agent/etc/dlami-amazon-cloudwatch-agent-partial.json
```

Enable and start the `systemd` service with the following commands:

```
sudo systemctl enable dlami-cloudwatch-agent@partial
sudo systemctl start dlami-cloudwatch-agent@partial
```

### Configure all available GPU metrics
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-all"></a>

Configure all available GPU metrics using the `dlami-cloudwatch-agent@all` `systemd` service. This service configures the following metrics:
+ `utilization_gpu`
+ `utilization_memory`
+ `memory_total`
+ `memory_used`
+ `memory_free`
+ `temperature_gpu`
+ `power_draw`
+ `fan_speed`
+ `pcie_link_gen_current`
+ `pcie_link_width_current`
+ `encoder_stats_session_count`
+ `encoder_stats_average_fps`
+ `encoder_stats_average_latency`
+ `clocks_current_graphics`
+ `clocks_current_sm`
+ `clocks_current_memory`
+ `clocks_current_video`

You can find the configuration file that the `systemd` service uses for all available preconfigured GPU metrics in the following location:

```
/opt/aws/amazon-cloudwatch-agent/etc/dlami-amazon-cloudwatch-agent-all.json
```

Enable and start the `systemd` service with the following commands:

```
sudo systemctl enable dlami-cloudwatch-agent@all
sudo systemctl start dlami-cloudwatch-agent@all
```
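
Because each preconfigured profile is a superset of the previous one, you can pick a profile by checking which one first covers the metrics you need. The following Python sketch shows that selection. The metric lists are copied from the sections above, and the helper name `smallest_profile` is hypothetical and not part of the DLAMI.

```python
# Metric lists copied from the three preconfigured profiles described above.
MINIMAL = ["utilization_gpu", "utilization_memory"]
PARTIAL = MINIMAL + ["memory_total", "memory_used", "memory_free"]
ALL = PARTIAL + [
    "temperature_gpu", "power_draw", "fan_speed",
    "pcie_link_gen_current", "pcie_link_width_current",
    "encoder_stats_session_count", "encoder_stats_average_fps",
    "encoder_stats_average_latency",
    "clocks_current_graphics", "clocks_current_sm",
    "clocks_current_memory", "clocks_current_video",
]
# Ordered from the smallest profile to the largest.
PROFILES = [("minimal", MINIMAL), ("partial", PARTIAL), ("all", ALL)]


def smallest_profile(required):
    """Return the systemd unit for the smallest profile covering `required`,
    or None if none of the preconfigured profiles includes every metric."""
    needed = set(required)
    for name, metrics in PROFILES:
        if needed <= set(metrics):
            return f"dlami-cloudwatch-agent@{name}"
    return None


print(smallest_profile(["utilization_gpu"]))
print(smallest_profile(["memory_used", "utilization_gpu"]))
print(smallest_profile(["power_draw"]))
print(smallest_profile(["some_other_metric"]))
```

A result of `None` means that no preconfigured profile includes every requested metric, so a custom configuration is needed.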

### Configure custom GPU metrics
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-custom"></a>

If the preconfigured metrics do not meet your requirements, you can create a custom CloudWatch agent configuration file.

#### Create a custom configuration file
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-custom-create"></a>

To create a custom configuration file, refer to the detailed steps in [Manually create or edit the CloudWatch agent configuration file](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html).

For this example, assume that the custom configuration file is located at `/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json`.
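
As a rough sketch of what such a file can contain, the following Python snippet generates a configuration that collects a chosen subset of GPU metrics. The `nvidia_gpu` section with its `measurement` and `metrics_collection_interval` fields reflects the CloudWatch agent GPU metrics documentation linked earlier; verify the schema against that reference before relying on it. The function name `build_gpu_config` and the chosen metrics are assumptions for illustration. The snippet prints the JSON rather than writing it, because the target directory requires root access.

```python
import json


def build_gpu_config(measurements, interval_seconds=60):
    """Return a CloudWatch agent configuration dict that collects GPU metrics."""
    return {
        "metrics": {
            "metrics_collected": {
                "nvidia_gpu": {
                    # Metric names as listed for the preconfigured profiles.
                    "measurement": list(measurements),
                    # How often the agent samples the GPU, in seconds.
                    "metrics_collection_interval": interval_seconds,
                }
            }
        }
    }


config = build_gpu_config(["utilization_gpu", "memory_used", "temperature_gpu"])
print(json.dumps(config, indent=2))
```

Save the printed JSON to the configuration file path with root privileges.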

#### Configure metrics with your custom file
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-custom-configure"></a>

Run the following command to configure the CloudWatch agent according to your custom file:

```
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config -m ec2 -s -c \
file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
```

### Security patching for the AWS CloudWatch agent
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-security"></a>

Newly released DLAMIs are configured with the latest available AWS CloudWatch agent security patches. Refer to the following sections to update your current DLAMI with the latest security patches depending on your operating system of choice.

#### Amazon Linux 2
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-security-al2"></a>

Use `yum` to get the latest AWS CloudWatch agent security patches for an Amazon Linux 2 DLAMI.

```
sudo yum update
```

#### Ubuntu
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-security-ubuntu"></a>

To get the latest AWS CloudWatch agent security patches for a DLAMI with Ubuntu, reinstall the AWS CloudWatch agent using an Amazon S3 download link. In the following commands, replace both occurrences of `region` with your AWS Region (for example, `us-east-1`). The `arm64` path segment applies to Arm-based instances; use `amd64` for x86-64 instances.

```
wget https://s3.region.amazonaws.com/amazoncloudwatch-agent-region/ubuntu/arm64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i -E ./amazon-cloudwatch-agent.deb
```

For more information on installing the AWS CloudWatch agent using Amazon S3 download links, see [Installing and running the CloudWatch agent on your servers](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-commandline-fleet.html).

## Configure metrics with the preinstalled `gpumon.py` script
<a name="tutorial-gpu-monitoring-gpumon-script"></a>

A utility called gpumon.py is preinstalled on your DLAMI. It integrates with CloudWatch and supports monitoring of per-GPU usage: GPU memory, GPU temperature, and GPU power. The script periodically sends the monitored data to CloudWatch. You can configure the level of granularity for data being sent to CloudWatch by changing a few settings in the script. Before you start the script, however, you need to set up CloudWatch to receive the metrics.

**How to set up and run GPU monitoring with CloudWatch**

1. Create an IAM user, or modify an existing one, to have a policy for publishing metrics to CloudWatch. If you create a new user, take note of the credentials because you need them in the next step.

   The IAM action to allow is `cloudwatch:PutMetricData`. The policy to add is as follows:

   ```
   {
      "Version": "2012-10-17",
      "Statement": [
         {
            "Action": [
               "cloudwatch:PutMetricData"
            ],
            "Effect": "Allow",
            "Resource": "*"
         }
      ]
   }
   ```

**Tip**  
For more information on creating an IAM user and adding policies for CloudWatch, refer to the [CloudWatch documentation](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create-iam-roles-for-cloudwatch-agent.html).

1. On your DLAMI, run [aws configure](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html#cli-quick-configuration) and specify the IAM user credentials.

   ```
   $ aws configure
   ```

1. You might need to modify the gpumon utility before you run it. You can find the gpumon utility and README in the location shown in the following code block. For more information on the `gpumon.py` script, see [the Amazon S3 location of the script](https://s3.amazonaws.com/aws-bigdata-blog/artifacts/GPUMonitoring/gpumon.py).

   ```
   Folder: ~/tools/GPUCloudWatchMonitor
   Files:  ~/tools/GPUCloudWatchMonitor/gpumon.py
           ~/tools/GPUCloudWatchMonitor/README
   ```

   Options:
   + Change the region in gpumon.py if your instance is NOT in us-east-1.
   + Change other parameters such as the CloudWatch `namespace` or the reporting period with `store_reso`.

1. Currently the script only supports Python 3. Activate your preferred framework’s Python 3 environment or activate the DLAMI general Python 3 environment. 

   ```
   $ source activate python3
   ```

1. Run the gpumon utility in the background.

   ```
   (python3)$ python gpumon.py &
   ```

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/), and then select the metric. It is in the `DeepLearningTrain` namespace.
**Tip**  
You can change the namespace by modifying gpumon.py. You can also modify the reporting interval by adjusting `store_reso`. 

The following is an example CloudWatch chart reporting on a run of gpumon.py monitoring a training job on a p2.8xlarge instance.

![\[GPU monitoring on CloudWatch\]](http://docs.aws.amazon.com/dlami/latest/devguide/images/gpumon.png)
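
To make the reporting step more concrete, the following Python sketch shows one way a script could shape per-GPU readings for the CloudWatch `PutMetricData` API. It is not the code of `gpumon.py`. The metric names, dimension names, instance ID, and readings are hypothetical, and `publish` assumes that boto3 is installed and that credentials are configured as described in the steps above.

```python
def build_metric_data(instance_id, gpu_index, readings, resolution=60):
    """Build a MetricData list shaped for CloudWatch PutMetricData."""
    # Dimensions let you filter the metrics per instance and per GPU.
    dimensions = [
        {"Name": "InstanceId", "Value": instance_id},
        {"Name": "GPUNumber", "Value": str(gpu_index)},
    ]
    return [
        {
            "MetricName": name,
            "Dimensions": dimensions,
            "Unit": unit,
            "StorageResolution": resolution,
            "Value": value,
        }
        for name, (value, unit) in readings.items()
    ]


def publish(metric_data, namespace="DeepLearningTrain", region="us-east-1"):
    """Send the payload to CloudWatch. Requires boto3 and valid credentials."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    client = boto3.client("cloudwatch", region_name=region)
    client.put_metric_data(Namespace=namespace, MetricData=metric_data)


# Hypothetical readings for GPU 0: metric name -> (value, CloudWatch unit).
readings = {
    "GPU Usage": (87.0, "Percent"),
    "Memory Usage": (42.5, "Percent"),
    "Temperature (C)": (61.0, "None"),
    "Power Usage (Watts)": (148.3, "None"),
}
payload = build_metric_data("i-0123456789abcdef0", 0, readings)
print(len(payload), payload[0]["MetricName"])
```

Calling `publish(payload)` on an instance with the `cloudwatch:PutMetricData` permission sends the four data points in a single request.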


You might be interested in these other topics on GPU monitoring and optimization:
+ [Monitoring](tutorial-gpu-monitoring.md)
  + [Monitor GPUs with CloudWatch](#tutorial-gpu-monitoring-gpumon)
+ [Optimization](tutorial-gpu-opt.md)
  + [Preprocessing](tutorial-gpu-opt-preprocessing.md)
  + [Training](tutorial-gpu-opt-training.md)

# Optimization
<a name="tutorial-gpu-opt"></a>

To make the most of your GPUs, you can optimize your data pipeline and tune your deep learning network. As the following chart shows, a naive or basic implementation of a neural network might use the GPU inconsistently and not to its fullest potential. When you optimize your preprocessing and data loading, you can reduce the bottleneck from your CPU to your GPU. You can adjust the neural network itself by using hybridization (when supported by the framework), adjusting batch size, and synchronizing calls. You can also use mixed-precision (float16 or int8) training in most frameworks, which can substantially improve throughput.

The following chart shows the cumulative performance gains when applying different optimizations. Your results will depend on the data you are processing and the network you are optimizing.

![\[Performance enhancements for GPUs\]](http://docs.aws.amazon.com/dlami/latest/devguide/images/performance-enhancements.png)


The following guides introduce options that will work with your DLAMI and help you boost GPU performance.

**Topics**
+ [Preprocessing](tutorial-gpu-opt-preprocessing.md)
+ [Training](tutorial-gpu-opt-training.md)

# Preprocessing
<a name="tutorial-gpu-opt-preprocessing"></a>

Data preprocessing through transformations or augmentations can often be a CPU-bound process, and this can be the bottleneck in your overall pipeline. Frameworks have built-in operators for image processing, but the NVIDIA Data Loading Library (DALI) can outperform frameworks’ built-in options.
+ NVIDIA Data Loading Library (DALI): DALI offloads data augmentation to the GPU. It is not preinstalled on the DLAMI, but you can access it by installing it or loading a supported framework container on your DLAMI or other Amazon Elastic Compute Cloud (Amazon EC2) instance. Refer to the [DALI project page](https://docs.nvidia.com/deeplearning/sdk/dali-install-guide/index.html) on the NVIDIA website for details. For an example use-case and to download code samples, see the [SageMaker Preprocessing Training Performance](https://github.com/aws-samples/sagemaker-cv-preprocessing-training-performance) sample.
+ nvJPEG: a GPU-accelerated JPEG decoder library for C programmers. It supports decoding single images or batches as well as subsequent transformation operations that are common in deep learning. nvJPEG comes built-in with DALI, or you can download from the [NVIDIA website's nvjpeg page](https://developer.nvidia.com/nvjpeg) and use it separately.

# Training
<a name="tutorial-gpu-opt-training"></a>

With mixed-precision training, you can deploy larger networks with the same amount of memory, or reduce memory usage compared to a single-precision or double-precision network, and compute throughput typically increases. You also benefit from smaller and faster data transfers, an important factor in multi-node distributed training. To take advantage of mixed-precision training, you need to adjust data casting and loss scaling. The following guides describe how to do this for the frameworks that support mixed precision.
+ [NVIDIA Deep Learning SDK](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/) - docs on the NVIDIA website describing mixed-precision implementation for MXNet, PyTorch, and TensorFlow.
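
To show why loss scaling matters, the following framework-independent Python sketch uses the standard library's half-precision packing to round values to float16. It is a numeric illustration only, not a framework API: the gradient value and the static loss scale of 1024 are hypothetical, and frameworks that support mixed precision typically handle scaling for you.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8          # a small gradient, below float16's smallest subnormal (~6e-8)
loss_scale = 1024.0  # hypothetical static loss scale (a power of two)

# Without scaling, the gradient underflows to zero when stored as float16.
unscaled = to_fp16(grad)

# With scaling, the value survives in float16; unscale it in higher precision
# before the weight update.
scaled = to_fp16(grad * loss_scale)
recovered = scaled / loss_scale

print(unscaled)   # 0.0: the gradient is lost
print(recovered)  # close to the original 1e-8
```

Multiplying the loss by the scale factor before backpropagation scales every gradient by the same factor, which is why a single division before the weight update restores the original magnitudes.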

**Tip**  
Be sure to check the website for your framework of choice, and search for "mixed precision" or "fp16" for the latest optimization techniques. Here are some mixed-precision guides you might find helpful:
+ [Mixed-precision training with TensorFlow (video)](https://devblogs.nvidia.com/mixed-precision-resnet-50-tensor-cores/) - on the NVIDIA blog site.
+ [Mixed-precision training using float16 with MXNet](https://mxnet.apache.org/api/faq/float16) - an FAQ article on the MXNet website.
+ [NVIDIA Apex: a tool for easy mixed-precision training with PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/) - a blog article on the NVIDIA website.
