

# Optimization
<a name="tutorial-gpu-opt"></a>

To make the most of your GPUs, you can optimize your data pipeline and tune your deep learning network. As the following chart shows, a naive or basic implementation of a neural network might use the GPU inconsistently and not to its fullest potential. Optimizing your preprocessing and data loading shifts the bottleneck away from your CPU so that your GPU stays busy. You can also adjust the neural network itself by using hybridization (when supported by the framework), adjusting batch size, and synchronizing calls. Mixed-precision (float16 or int8) training, available in most frameworks, can also dramatically improve throughput. 
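One reason reduced precision helps throughput is simple arithmetic: casting a batch from float32 to float16 halves its size in memory and on the wire. The following minimal NumPy sketch illustrates this; the batch shape is an arbitrary example, not something your framework requires:

```python
import numpy as np

# A hypothetical batch of 32 RGB images at 224x224, in float32 (4 bytes/element).
batch_fp32 = np.zeros((32, 3, 224, 224), dtype=np.float32)

# Casting to float16 (2 bytes/element) halves memory use and transfer size.
batch_fp16 = batch_fp32.astype(np.float16)

print(batch_fp32.nbytes)  # 19267584 bytes
print(batch_fp16.nbytes)  # 9633792 bytes
assert batch_fp16.nbytes == batch_fp32.nbytes // 2
```

The same halving applies to activations and gradients exchanged between nodes during distributed training.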

The following chart shows the cumulative performance gains when applying different optimizations. Your results will depend on the data you are processing and the network you are optimizing.

![Performance enhancements for GPUs](http://docs.aws.amazon.com/dlami/latest/devguide/images/performance-enhancements.png)


The following guides introduce options that will work with your DLAMI and help you boost GPU performance.

**Topics**
+ [Preprocessing](tutorial-gpu-opt-preprocessing.md)
+ [Training](tutorial-gpu-opt-training.md)

# Preprocessing
<a name="tutorial-gpu-opt-preprocessing"></a>

Data preprocessing through transformations or augmentations can often be a CPU-bound process, and this can be the bottleneck in your overall pipeline. Frameworks have built-in operators for image processing, but the NVIDIA Data Loading Library (DALI) demonstrates improved performance over frameworks' built-in options.
+ NVIDIA Data Loading Library (DALI): DALI offloads data augmentation to the GPU. It is not preinstalled on the DLAMI, but you can access it by installing it or loading a supported framework container on your DLAMI or other Amazon Elastic Compute Cloud instance. Refer to the [DALI project page](https://docs.nvidia.com/deeplearning/sdk/dali-install-guide/index.html) on the NVIDIA website for details. For an example use-case and to download code samples, see the [SageMaker Preprocessing Training Performance](https://github.com/aws-samples/sagemaker-cv-preprocessing-training-performance) sample.
+ nvJPEG: a GPU-accelerated JPEG decoder library for C programmers. It supports decoding single images or batches, as well as subsequent transformation operations that are common in deep learning. nvJPEG comes built in with DALI, or you can download it from the [NVIDIA website's nvjpeg page](https://developer.nvidia.com/nvjpeg) and use it separately.
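Whether or not you offload augmentation to the GPU, the core pipelining idea is to prepare the next batch while the current one is being consumed. The sketch below is a minimal, framework-free illustration of that pattern: a background producer thread fills a bounded queue, and the consumer loop stands in for the training step. The `preprocess` and `train_step` functions are hypothetical stand-ins using `sleep`, not real workloads:

```python
import queue
import threading
import time

def preprocess(i):
    # Stand-in for CPU-bound augmentation (decode, resize, normalize).
    time.sleep(0.01)
    return i * 2

def train_step(batch):
    # Stand-in for a GPU compute step.
    time.sleep(0.01)
    return batch

def producer(n, q):
    for i in range(n):
        q.put(preprocess(i))
    q.put(None)  # Sentinel: no more batches.

# Bounded queue so preprocessing stays at most 4 batches ahead.
q = queue.Queue(maxsize=4)
threading.Thread(target=producer, args=(8, q), daemon=True).start()

results = []
while (batch := q.get()) is not None:
    results.append(train_step(batch))

print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

Because `sleep` releases the interpreter lock, the producer and consumer overlap here; for genuinely CPU-bound Python augmentation, framework data loaders use multiple worker processes instead of a single thread, but the pipelining pattern is the same.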

You might be interested in these other topics on GPU monitoring and optimization:
+ [Monitoring](tutorial-gpu-monitoring.md)
  + [Monitor GPUs with CloudWatch](tutorial-gpu-monitoring-gpumon.md)
+ [Optimization](tutorial-gpu-opt.md)
  + [Preprocessing](#tutorial-gpu-opt-preprocessing)
  + [Training](tutorial-gpu-opt-training.md)

# Training
<a name="tutorial-gpu-opt-training"></a>

With mixed-precision training, you can fit larger networks in the same amount of memory, or reduce memory usage compared to a single- or double-precision network, while also increasing compute performance. You also benefit from smaller and faster data transfers, an important factor in multi-node distributed training. To take advantage of mixed-precision training, you need to adjust data casting and loss scaling. The following guides describe how to do this for the frameworks that support mixed precision.
+ [NVIDIA Deep Learning SDK](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/) - documentation on the NVIDIA website describing mixed-precision implementations for MXNet, PyTorch, and TensorFlow.
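Loss scaling is needed because float16 cannot represent very small gradient values: they underflow to zero, and those updates are lost. Multiplying the loss by a scale factor before the backward pass shifts gradients into float16's representable range; the optimizer then divides by the same factor in float32. A minimal NumPy illustration (the gradient value and scale factor here are arbitrary examples):

```python
import numpy as np

grad = 1e-8  # A small gradient value, common late in training.

# Cast directly to float16: it underflows to zero, and the update is lost.
assert np.float16(grad) == 0.0

# Scale the loss (and therefore the gradients) by 1024 before casting.
scale = 1024.0
scaled = np.float16(grad * scale)
assert scaled != 0.0

# Unscale in float32 before the optimizer step to recover the true gradient.
recovered = np.float32(scaled) / scale
print(recovered)  # ~1e-8
```

Framework utilities such as automatic mixed precision typically pick and adjust the scale factor for you, but this is the mechanism they automate.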

**Tip**  
Be sure to check the website for your framework of choice, and search for "mixed precision" or "fp16" for the latest optimization techniques. Here are some mixed-precision guides you might find helpful:  
[Mixed-precision training with TensorFlow (video)](https://devblogs.nvidia.com/mixed-precision-resnet-50-tensor-cores/) - on the NVIDIA blog site.
[Mixed-precision training using float16 with MXNet](https://mxnet.apache.org/api/faq/float16) - an FAQ article on the MXNet website.
[NVIDIA Apex: a tool for easy mixed-precision training with PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/) - a blog article on the NVIDIA website.

You might be interested in these other topics on GPU monitoring and optimization:
+ [Monitoring](tutorial-gpu-monitoring.md)
  + [Monitor GPUs with CloudWatch](tutorial-gpu-monitoring-gpumon.md)
+ [Optimization](tutorial-gpu-opt.md)
  + [Preprocessing](tutorial-gpu-opt-preprocessing.md)
  + [Training](#tutorial-gpu-opt-training)