

# Tutorials
<a name="tutorials"></a>

The following are tutorials on how to use the software included with the Deep Learning AMI with Conda.

**Topics**
+ [Activating Frameworks](activating.md)
+ [Distributed training using Elastic Fabric Adapter](tutorial-efa.md)
+ [GPU Monitoring and Optimization](tutorial-gpu.md)
+ [The AWS Inferentia Chip With DLAMI](tutorial-inferentia.md)
+ [The ARM64 DLAMI](tutorial-arm64.md)
+ [Inference](tutorial-inference.md)
+ [Model Serving](model-serving.md)

# Activating Frameworks
<a name="activating"></a>

The following deep learning frameworks are installed on the Deep Learning AMI with Conda. Choose a framework to learn how to activate it.

**Topics**
+ [PyTorch](tutorial-pytorch.md)
+ [TensorFlow 2](tutorial-tensorflow-2.md)

# PyTorch
<a name="tutorial-pytorch"></a>

## Activating PyTorch
<a name="tutorial-pytorch-overview"></a>

When a stable Conda package of a framework is released, it's tested and pre-installed on the DLAMI. If you want to run the latest, untested nightly build, you can [Install PyTorch's Nightly Build (experimental)](#tutorial-pytorch-install) manually. 

To activate the currently installed framework, follow these instructions on your Deep Learning AMI with Conda.

For PyTorch on Python 3 with CUDA and MKL-DNN, run this command:

```
$ source activate pytorch_p310
```

Start the IPython terminal.

```
(pytorch_p310)$ ipython
```

Run a quick PyTorch program.

```
import torch
x = torch.rand(5, 3)
print(x)
print(x.size())
y = torch.rand(5, 3)
print(torch.add(x, y))
```

You should see the initial random tensor printed, then its size, and then the sum of the two random tensors.

## Install PyTorch's Nightly Build (experimental)
<a name="tutorial-pytorch-install"></a>

**How to install PyTorch from a nightly build**

You can install the latest PyTorch build into either or both of the PyTorch Conda environments on your Deep Learning AMI with Conda.

1. 
   + (Option for Python 3) - Activate the Python 3 PyTorch environment:

     ```
     $ source activate pytorch_p310
     ```

1. The remaining steps assume you are using the `pytorch_p310` environment. Remove the currently installed PyTorch:

   ```
   (pytorch_p310)$ pip uninstall torch
   ```

1. 
   + (Option for GPU instances) - Install the latest nightly build of PyTorch with CUDA 10.0:

     ```
     (pytorch_p310)$ pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu100/torch_nightly.html
     ```
   + (Option for CPU instances) - Install the latest nightly build of PyTorch for instances with no GPUs:

     ```
     (pytorch_p310)$ pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
     ```

1. To verify that you have successfully installed the latest nightly build, start the IPython terminal and check the version of PyTorch.

   ```
   (pytorch_p310)$ ipython
   ```

   ```
   import torch
   print(torch.__version__)
   ```

   The output should print something similar to `1.0.0.dev20180922`.

1. To verify that the PyTorch nightly build works well with the MNIST example, you can run a test script from PyTorch's examples repository:

   ```
   (pytorch_p310)$ cd ~
   (pytorch_p310)$ git clone https://github.com/pytorch/examples.git pytorch_examples
   (pytorch_p310)$ cd pytorch_examples/mnist
   (pytorch_p310)$ python main.py || exit 1
   ```

## More Tutorials
<a name="tutorial-pytorch-more"></a>

For further tutorials and examples, refer to the framework's official docs: the [PyTorch documentation](http://pytorch.org/docs/master/) and the [PyTorch](http://pytorch.org) website.

# TensorFlow 2
<a name="tutorial-tensorflow-2"></a>

This tutorial shows how to activate TensorFlow 2 on an instance running the Deep Learning AMI with Conda (DLAMI on Conda) and run a TensorFlow 2 program.

When a stable Conda package of a framework is released, it's tested and pre-installed on the DLAMI. 

## Activating TensorFlow 2
<a name="tutorial-tensorflow-2-overview"></a>

**To run TensorFlow on the DLAMI with Conda**

1. To activate TensorFlow 2, connect to an Amazon Elastic Compute Cloud (Amazon EC2) instance running the DLAMI with Conda.

1. For TensorFlow 2 and Keras 2 on Python 3 with CUDA 10.1 and MKL-DNN, run this command:

   ```
   $ source activate tensorflow2_p310
   ```

1. Start the IPython terminal:

   ```
   (tensorflow2_p310)$ ipython
   ```

1. Run a TensorFlow 2 program to verify that it is working properly:

   ```
   import tensorflow as tf
   hello = tf.constant('Hello, TensorFlow!')
   tf.print(hello)
   ```

   `Hello, TensorFlow!` should appear on your screen.

## More Tutorials
<a name="tutorial-tensorflow-2-more"></a>

For more tutorials and examples, see the TensorFlow documentation for the [TensorFlow Python API](https://www.tensorflow.org/api_docs/python/) or see the [TensorFlow](https://www.tensorflow.org) website.

# Distributed training using Elastic Fabric Adapter
<a name="tutorial-efa"></a>

An [Elastic Fabric Adapter](https://aws.amazon.com/hpc/efa/) (EFA) is a network device that you can attach to your DLAMI instance to accelerate High Performance Computing (HPC) applications. EFA enables you to achieve the application performance of an on-premises HPC cluster, with the scalability, flexibility, and elasticity provided by the AWS Cloud.

The following topics show you how to get started using EFA with the DLAMI.

**Note**  
Choose your DLAMI from this [Base GPU DLAMI list](appendix-ami-release-notes.md#appendix-ami-release-notes-base).

**Topics**
+ [Launching an AWS Deep Learning AMIs Instance With EFA](tutorial-efa-launching.md)
+ [Using EFA on the DLAMI](tutorial-efa-using.md)

# Launching an AWS Deep Learning AMIs Instance With EFA
<a name="tutorial-efa-launching"></a>

The latest Base DLAMI is ready to use with EFA and comes with the required drivers, kernel modules, Libfabric, Open MPI, and the [NCCL OFI plugin](https://github.com/aws/aws-ofi-nccl/tree/aws) for GPU instances.

You can find the supported CUDA versions of a Base DLAMI in the [release notes](appendix-ami-release-notes.md#appendix-ami-release-notes-base).

Note:
+ When running an NCCL application using `mpirun` on EFA, you must specify the full path to the EFA-supported Open MPI installation:

  ```
  /opt/amazon/openmpi/bin/mpirun <command>  
  ```
+ To enable your application to use EFA, add `FI_PROVIDER="efa"` to the `mpirun` command as shown in [Using EFA on the DLAMI](tutorial-efa-using.md).

**Topics**
+ [Prepare an EFA Enabled Security Group](#tutorial-efa-security-group)
+ [Launch Your Instance](#tutorial-efa-launch)
+ [Verify EFA Attachment](#tutorial-efa-verify-attachment)

## Prepare an EFA Enabled Security Group
<a name="tutorial-efa-security-group"></a>

EFA requires a security group that allows all inbound and outbound traffic to and from the security group itself. For more information, see the [EFA Documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security).

1. Open the Amazon EC2 console at [https://console.aws.amazon.com/ec2/](https://console.aws.amazon.com/ec2/). 

1. In the navigation pane, choose **Security Groups** and then choose **Create Security Group**. 

1. In the **Create Security Group** window, do the following: 
   + For **Security group name**, enter a descriptive name for the security group, such as `EFA-enabled security group`. 
   + (Optional) For **Description**, enter a brief description of the security group. 
   + For **VPC**, select the VPC into which you intend to launch your EFA-enabled instances. 
   + Choose **Create**. 

1. Select the security group that you created, and on the **Description** tab, copy the **Group ID**. 

1. On the **Inbound** and **Outbound** tabs, do the following: 
   + Choose **Edit**. 
   + For **Type**, choose **All traffic**. 
   + For **Source**, choose **Custom**. 
   + Paste the security group ID that you copied into the field. 
   + Choose **Save**. 

1. Enable inbound traffic as described in [Authorizing Inbound Traffic for Your Linux Instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/authorizing-access-to-an-instance.html). If you skip this step, you won't be able to communicate with your DLAMI instance.

## Launch Your Instance
<a name="tutorial-efa-launch"></a>

EFA on the AWS Deep Learning AMIs is currently supported with the following instance types and operating systems:
+  P3dn: Amazon Linux 2, Ubuntu 20.04
+  P4d, P4de: Amazon Linux 2, Amazon Linux 2023, Ubuntu 20.04, Ubuntu 22.04
+  P5, P5e, P5en: Amazon Linux 2, Amazon Linux 2023, Ubuntu 20.04, Ubuntu 22.04

The following section shows how to launch an EFA enabled DLAMI instance. For more information on launching an EFA enabled instance, see [Launch EFA-Enabled Instances into a Cluster Placement Group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-instances).

1. Open the Amazon EC2 console at [https://console.aws.amazon.com/ec2/](https://console.aws.amazon.com/ec2/). 

1. Choose **Launch Instance**. 

1. On the **Choose an AMI** page, select a supported DLAMI found on the [DLAMI Release Notes Page](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes) 

1. On the **Choose an Instance Type** page, select one of the supported instance types, and then choose **Next: Configure Instance Details**. For the list of supported instances, see [Get started with EFA and MPI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html). 

1. On the **Configure Instance Details** page, do the following: 
   + For **Number of instances**, enter the number of EFA-enabled instances that you want to launch. 
   + For **Network** and **Subnet**, select the VPC and subnet into which to launch the instances. 
   + [Optional] For **Placement group**, select **Add instance to placement group**. For best performance, launch the instances within a placement group. 
   + [Optional] For **Placement group name**, select **Add to a new placement group**, enter a descriptive name for the placement group, and then for **Placement group strategy**, select **cluster**. 
   + Make sure to enable **Elastic Fabric Adapter** on this page. If this option is disabled, change the subnet to one that supports your selected instance type. 
   + In the **Network Interfaces** section, for device **eth0**, choose **New network interface**. You can optionally specify a primary IPv4 address and one or more secondary IPv4 addresses. If you're launching the instance into a subnet that has an associated IPv6 CIDR block, you can optionally specify a primary IPv6 address and one or more secondary IPv6 addresses. 
   + Choose **Next: Add Storage**. 

1. On the **Add Storage** page, specify the volumes to attach to the instances in addition to the volumes specified by the AMI (such as the root device volume), and then choose **Next: Add Tags**. 

1. On the **Add Tags** page, specify tags for the instances, such as a user-friendly name, and then choose **Next: Configure Security Group**. 

1. On the **Configure Security Group** page, for **Assign a security group**, select **Select an existing security group**, and then select the security group that you created previously. 

1. Choose **Review and Launch**. 

1. On the **Review Instance Launch** page, review the settings, and then choose **Launch** to choose a key pair and to launch your instances. 

## Verify EFA Attachment
<a name="tutorial-efa-verify-attachment"></a>

### From the Console
<a name="tutorial-efa-verify-attachment-console"></a>

After launching the instance, check the instance details in the AWS Management Console. Select the instance in the EC2 console and view the **Description** tab in the lower pane. Find the **Network Interfaces: eth0** parameter and choose **eth0** to open a pop-up. Make sure that **Elastic Fabric Adapter** is enabled. 

If EFA is not enabled, you can fix this in one of two ways:
+ Terminate the EC2 instance and launch a new one following the same steps. Make sure the EFA is attached. 
+ Attach an EFA to the existing instance:

  1. In the EC2 console, go to **Network Interfaces**.

  1. Choose **Create a Network Interface**.

  1. Select the same subnet that your instance is in.

  1. Make sure to enable **Elastic Fabric Adapter** and choose **Create**.

  1. Go back to the **EC2 Instances** tab and select your instance.

  1. From **Actions**, choose **Instance State** and stop the instance before you attach the EFA.

  1. From **Actions**, select **Networking**, **Attach Network Interface**.

  1. Select the interface that you just created and choose **Attach**.

  1. Restart your instance.

### From the Instance
<a name="tutorial-efa-verify-attachment-instance"></a>

The `fi_info` utility is preinstalled on the DLAMI. Run it to ensure that the kernel modules are loaded correctly.

```
$ fi_info -p efa
```

Your output should look similar to the following.

```
provider: efa
    fabric: EFA-fe80::e5:56ff:fe34:56a8
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::e5:56ff:fe34:56a8
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::e5:56ff:fe34:56a8
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD
```

### Verify Security Group Configuration
<a name="tutorial-efa-verify-attachment-security"></a>

The following test script is already present on the DLAMI. Run it to ensure that the security group you created is configured correctly.

```
$ cd /opt/amazon/efa/test/ 
$ ./efa_test.sh
```

Your output should look similar to the following.

```
Starting server...
Starting client...
bytes   #sent   #ack     total       time     MB/sec    usec/xfer   Mxfers/sec
64      10      =10      1.2k        0.02s      0.06    1123.55       0.00
256     10      =10      5k          0.00s     17.66      14.50       0.07
1k      10      =10      20k         0.00s     67.81      15.10       0.07
4k      10      =10      80k         0.00s    237.45      17.25       0.06
64k     10      =10      1.2m        0.00s    921.10      71.15       0.01
1m      10      =10      20m         0.01s   2122.41     494.05       0.00
```

If it stops responding or does not complete, ensure that your security group has the correct inbound/outbound rules. 

# Using EFA on the DLAMI
<a name="tutorial-efa-using"></a>

The following section describes how to use EFA to run multi-node applications on the AWS Deep Learning AMIs.

## Running Multi-Node Applications with EFA
<a name="tutorial-efa-using-multi-node"></a>

To run an application across a cluster of nodes, the following configuration is required.

**Topics**
+ [Enable Passwordless SSH](#tutorial-efa-using-multi-node-ssh)
+ [Create Hosts File](#tutorial-efa-using-multi-node-hosts)
+ [NCCL Tests](#tutorial-efa-using-2node)

### Enable Passwordless SSH
<a name="tutorial-efa-using-multi-node-ssh"></a>

Select one node in your cluster as the leader node. The remaining nodes are referred to as the member nodes. 

1. On the leader node, generate the RSA keypair.

   ```
   ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
   ```

1. Change the permissions of the private key on the leader node.

   ```
   chmod 600 ~/.ssh/id_rsa
   ```

1. Copy the public key `~/.ssh/id_rsa.pub` from the leader node and append it to `~/.ssh/authorized_keys` on each member node in the cluster. 

1. You should now be able to log in directly to the member nodes from the leader node using the private IP.

   ```
   ssh <member private ip>
   ```

1. Disable `StrictHostKeyChecking` and enable agent forwarding by adding the following to the `~/.ssh/config` file on the leader node: 

   ```
   Host *
       ForwardAgent yes
   Host *
       StrictHostKeyChecking no
   ```

1. On Amazon Linux 2 instances, run the following command on the leader node to provide correct permissions to the config file:

   ```
   chmod 600 ~/.ssh/config
   ```

### Create Hosts File
<a name="tutorial-efa-using-multi-node-hosts"></a>

On the leader node, create a hosts file to identify the nodes in the cluster. The hosts file must have an entry for each node in the cluster. Create a file `~/hosts` and add each node using the private IP as follows: 

```
localhost slots=8
<private ip of node 1> slots=8
<private ip of node 2> slots=8
```

### NCCL Tests
<a name="tutorial-efa-using-2node"></a>

**Note**  
These tests have been run using EFA version 1.38.0 and OFI NCCL Plugin 1.13.2.

Listed below is a subset of the NCCL tests provided by NVIDIA to test both functionality and performance over multiple compute nodes.

**Supported Instances: P3dn, P4, P5, P5e, P5en**

#### Performance Tests
<a name="tutorial-efa-using-multinode"></a>

##### Multi-node NCCL Performance Test on P4d.24xlarge
<a name="tutorial-efa-using-multi-node-performance"></a>

To check NCCL performance with EFA, run the standard NCCL performance test that is available in the official [NCCL-Tests repo](https://github.com/NVIDIA/nccl-tests.git). The DLAMI comes with this test already built for CUDA XX.X. You can similarly run your own script with EFA.

When constructing your own script, refer to the following guidance:
+ Use the complete path to `mpirun` as shown in the example when running NCCL applications with EFA.
+ Change the parameters `-n` and `-N` based on the number of instances and GPUs in your cluster.
+ Add the `NCCL_DEBUG=INFO` flag and make sure that the logs indicate EFA usage with "Selected Provider is EFA".
+ Set the training log location so that it can be parsed for validation: 

  ```
  TRAINING_LOG="testEFA_$(date +"%N").log"
  ```

Use the command `watch nvidia-smi` on any of the member nodes to monitor GPU usage. The following commands are for a generic CUDA xx.x version and depend on the operating system of your instance. You can run the commands for any CUDA version available on your Amazon EC2 instance by replacing the CUDA version in the script.
+ Amazon Linux 2, Amazon Linux 2023:

  ```
  $ /opt/amazon/openmpi/bin/mpirun -n 16 -N 8 \
  -x NCCL_DEBUG=INFO --mca pml ^cm \
  -x LD_LIBRARY_PATH=/usr/local/cuda-xx.x/efa/lib:/usr/local/cuda-xx.x/lib:/usr/local/cuda-xx.x/lib64:/usr/local/cuda-xx.x:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:$LD_LIBRARY_PATH \
  --hostfile hosts --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
  /usr/local/cuda-xx.x/efa/test-cuda-xx.x/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100 | tee ${TRAINING_LOG}
  ```
+ Ubuntu 20.04, Ubuntu 22.04:

  ```
  $ /opt/amazon/openmpi/bin/mpirun -n 16 -N 8 \
  -x NCCL_DEBUG=INFO --mca pml ^cm \
  -x LD_LIBRARY_PATH=/usr/local/cuda-xx.x/efa/lib:/usr/local/cuda-xx.x/lib:/usr/local/cuda-xx.x/lib64:/usr/local/cuda-xx.x:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:$LD_LIBRARY_PATH \
  --hostfile hosts --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
  /usr/local/cuda-xx.x/efa/test-cuda-xx.x/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100 | tee ${TRAINING_LOG}
  ```

Your output should look like the following:

```
# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  33378 on ip-172-31-42-25 device  0 [0x10] NVIDIA A100-SXM4-40GB
#  Rank  1 Group  0 Pid  33379 on ip-172-31-42-25 device  1 [0x10] NVIDIA A100-SXM4-40GB
#  Rank  2 Group  0 Pid  33380 on ip-172-31-42-25 device  2 [0x20] NVIDIA A100-SXM4-40GB
#  Rank  3 Group  0 Pid  33381 on ip-172-31-42-25 device  3 [0x20] NVIDIA A100-SXM4-40GB
#  Rank  4 Group  0 Pid  33382 on ip-172-31-42-25 device  4 [0x90] NVIDIA A100-SXM4-40GB
#  Rank  5 Group  0 Pid  33383 on ip-172-31-42-25 device  5 [0x90] NVIDIA A100-SXM4-40GB
#  Rank  6 Group  0 Pid  33384 on ip-172-31-42-25 device  6 [0xa0] NVIDIA A100-SXM4-40GB
#  Rank  7 Group  0 Pid  33385 on ip-172-31-42-25 device  7 [0xa0] NVIDIA A100-SXM4-40GB
#  Rank  8 Group  0 Pid  30378 on ip-172-31-43-8 device  0 [0x10] NVIDIA A100-SXM4-40GB
#  Rank  9 Group  0 Pid  30379 on ip-172-31-43-8 device  1 [0x10] NVIDIA A100-SXM4-40GB
#  Rank 10 Group  0 Pid  30380 on ip-172-31-43-8 device  2 [0x20] NVIDIA A100-SXM4-40GB
#  Rank 11 Group  0 Pid  30381 on ip-172-31-43-8 device  3 [0x20] NVIDIA A100-SXM4-40GB
#  Rank 12 Group  0 Pid  30382 on ip-172-31-43-8 device  4 [0x90] NVIDIA A100-SXM4-40GB
#  Rank 13 Group  0 Pid  30383 on ip-172-31-43-8 device  5 [0x90] NVIDIA A100-SXM4-40GB
#  Rank 14 Group  0 Pid  30384 on ip-172-31-43-8 device  6 [0xa0] NVIDIA A100-SXM4-40GB
#  Rank 15 Group  0 Pid  30385 on ip-172-31-43-8 device  7 [0xa0] NVIDIA A100-SXM4-40GB
ip-172-31-42-25:33385:33385 [7] NCCL INFO cudaDriverVersion 12060
ip-172-31-43-8:30383:30383 [5] NCCL INFO Bootstrap : Using ens32:172.31.43.8
ip-172-31-43-8:30383:30383 [5] NCCL INFO NCCL version 2.23.4+cuda12.5
...
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.13.2-aws
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Using Libfabric version 1.22
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Using CUDA driver version 12060 with runtime 12050
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Configuring AWS-specific options
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Setting provider_filter to efa
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
ip-172-31-42-25:33384:33451 [6] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /opt/amazon/ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
...
-----------------------------some output truncated-----------------------------------
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    180.3    0.00    0.00      0    179.3    0.00    0.00      0
          16             4     float     sum      -1    178.1    0.00    0.00      0    177.6    0.00    0.00      0
          32             8     float     sum      -1    178.5    0.00    0.00      0    177.9    0.00    0.00      0
          64            16     float     sum      -1    178.8    0.00    0.00      0    178.7    0.00    0.00      0
         128            32     float     sum      -1    178.2    0.00    0.00      0    177.8    0.00    0.00      0
         256            64     float     sum      -1    178.6    0.00    0.00      0    178.8    0.00    0.00      0
         512           128     float     sum      -1    177.2    0.00    0.01      0    177.1    0.00    0.01      0
        1024           256     float     sum      -1    179.2    0.01    0.01      0    179.3    0.01    0.01      0
        2048           512     float     sum      -1    181.3    0.01    0.02      0    181.2    0.01    0.02      0
        4096          1024     float     sum      -1    184.2    0.02    0.04      0    183.9    0.02    0.04      0
        8192          2048     float     sum      -1    191.2    0.04    0.08      0    190.6    0.04    0.08      0
       16384          4096     float     sum      -1    202.5    0.08    0.15      0    202.3    0.08    0.15      0
       32768          8192     float     sum      -1    233.0    0.14    0.26      0    232.1    0.14    0.26      0
       65536         16384     float     sum      -1    238.6    0.27    0.51      0    235.1    0.28    0.52      0
      131072         32768     float     sum      -1    237.2    0.55    1.04      0    236.8    0.55    1.04      0
      262144         65536     float     sum      -1    248.3    1.06    1.98      0    247.0    1.06    1.99      0
      524288        131072     float     sum      -1    309.2    1.70    3.18      0    307.7    1.70    3.20      0
     1048576        262144     float     sum      -1    408.7    2.57    4.81      0    404.3    2.59    4.86      0
     2097152        524288     float     sum      -1    613.5    3.42    6.41      0    607.9    3.45    6.47      0
     4194304       1048576     float     sum      -1    924.5    4.54    8.51      0    914.8    4.58    8.60      0
     8388608       2097152     float     sum      -1   1059.5    7.92   14.85      0   1054.3    7.96   14.92      0
    16777216       4194304     float     sum      -1   1269.9   13.21   24.77      0   1272.0   13.19   24.73      0
    33554432       8388608     float     sum      -1   1642.7   20.43   38.30      0   1636.7   20.50   38.44      0
    67108864      16777216     float     sum      -1   2446.7   27.43   51.43      0   2445.8   27.44   51.45      0
   134217728      33554432     float     sum      -1   4143.6   32.39   60.73      0   4142.4   32.40   60.75      0
   268435456      67108864     float     sum      -1   7351.9   36.51   68.46      0   7346.7   36.54   68.51      0
   536870912     134217728     float     sum      -1    13717   39.14   73.39      0    13703   39.18   73.46      0
  1073741824     268435456     float     sum      -1    26416   40.65   76.21      0    26420   40.64   76.20      0
...
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 15.5514
```
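As a side note on reading this table, nccl-tests derives the `busbw` column from `algbw`; for all_reduce over n ranks the scaling factor is 2(n-1)/n. A quick sketch of that conversion, using the largest message size from the output above (small differences are rounding in the printed table):

```
def allreduce_busbw(algbw_gbs, nranks):
    """Bus bandwidth from algorithm bandwidth for all_reduce (nccl-tests convention)."""
    return algbw_gbs * 2 * (nranks - 1) / nranks

# 16 ranks (2 x p4d.24xlarge, 8 GPUs each), algbw 40.65 GB/s at 1 GiB
print(round(allreduce_busbw(40.65, 16), 2))  # close to the 76.21 busbw in the table
```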

#### Validation Tests
<a name="tutorial-efa-validation"></a>

To validate that the EFA tests returned a valid result, use the following checks to confirm: 
+ Get the instance type using EC2 Instance Metadata:

  ```
  TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
  INSTANCE_TYPE=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/instance-type)
  ```
+ Run the [Performance Tests](#tutorial-efa-using-multinode).
+ Set the following parameters: 

  ```
  CUDA_VERSION
  CUDA_RUNTIME_VERSION
  NCCL_VERSION
  ```
+ Validate the results as shown: 

  ```
  RETURN_VAL=`echo $?`
  if [ ${RETURN_VAL} -eq 0 ]; then
  
      # [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.13.2-aws
      # [0] NCCL INFO NET/OFI Using CUDA driver version 12060 with runtime 12010
  
      # cudaDriverVersion 12060  --> This is max supported cuda version by nvidia driver
      # NCCL version 2.23.4+cuda12.5 --> This is NCCL version compiled with cuda version
  
      # Validation of logs
      grep "NET/OFI Configuring AWS-specific options" ${TRAINING_LOG} || { echo "AWS-specific options text not found"; exit 1; } 
      grep "busbw" ${TRAINING_LOG} || { echo "busbw text not found"; exit 1; } 
      grep "Avg bus bandwidth " ${TRAINING_LOG} || { echo "Avg bus bandwidth text not found"; exit 1; } 
      grep "NCCL version $NCCL_VERSION" ${TRAINING_LOG} || { echo "Text not found: NCCL version $NCCL_VERSION"; exit 1; }
      if [[ ${INSTANCE_TYPE} == "p4d.24xlarge" ]]; then
          grep "NET/Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Text not found: NET/Libfabric/0/GDRDMA"; exit 1; }  
          grep "NET/OFI Selected Provider is efa (found 4 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }   
      elif [[ ${INSTANCE_TYPE} == "p4de.24xlarge" ]]; then
          grep "NET/Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Text not found: NET/Libfabric/0/GDRDMA"; exit 1; }
          grep "NET/OFI Selected Provider is efa (found 4 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }
      elif [[ ${INSTANCE_TYPE} == "p5.48xlarge" ]]; then
          grep "NET/Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Text not found: NET/Libfabric/0/GDRDMA"; exit 1; }
          grep "NET/OFI Selected Provider is efa (found 32 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }
      elif [[ ${INSTANCE_TYPE} == "p5e.48xlarge" ]]; then
          grep "NET/Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Text not found: NET/Libfabric/0/GDRDMA"; exit 1; }
          grep "NET/OFI Selected Provider is efa (found 32 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }
      elif [[ ${INSTANCE_TYPE} == "p5en.48xlarge" ]]; then
          grep "NET/Libfabric/0/GDRDMA" ${TRAINING_LOG} || { echo "Text not found: NET/Libfabric/0/GDRDMA"; exit 1; }
          grep "NET/OFI Selected Provider is efa (found 16 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }
      elif [[ ${INSTANCE_TYPE} == "p3dn.24xlarge" ]]; then
          grep "NET/OFI Selected Provider is efa (found 4 nics)" ${TRAINING_LOG} || { echo "Selected Provider is efa text not found"; exit 1; }
      fi
      echo "***************************** check_efa_nccl_all_reduce passed for cuda version ${CUDA_VERSION} *****************************"
  else
      echo "***************************** check_efa_nccl_all_reduce failed for cuda version ${CUDA_VERSION} *****************************"
  fi
  ```
+ To access the benchmark data, parse the final row of the table output from the multi-node all_reduce test: 

  ```
  benchmark=$(sudo cat ${TRAINING_LOG} | grep '1073741824' | tail -n1 | awk '{print $12}' | sed 's/ //' | sed 's/  5e-07//')
  if [[ -z "${benchmark}" ]]; then
    echo "benchmark variable is empty"
    exit 1
  fi
  
  echo "Benchmark throughput: ${benchmark}"
  ```
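
If you prefer to post-process the log outside the shell, the same extraction can be sketched in Python. This is a hypothetical helper, not part of the DLAMI; the sample line is copied from the benchmark output shown earlier:

```
SAMPLE_LOG = """\
  1073741824     268435456     float     sum      -1    26416   40.65   76.21      0    26420   40.64   76.20      0
"""

def parse_busbw(log_text):
    """Return the in-place busbw (12th whitespace-separated field) of the last 1 GiB row."""
    rows = [l for l in log_text.splitlines() if l.strip().startswith("1073741824")]
    if not rows:
        raise ValueError("benchmark row not found")
    return float(rows[-1].split()[11])  # field 12, matching the awk command above

print(parse_busbw(SAMPLE_LOG))
```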

# GPU Monitoring and Optimization
<a name="tutorial-gpu"></a>

The following sections guide you through GPU optimization and monitoring options. This section is organized like a typical workflow, with monitoring overseeing preprocessing and training. 
+ [Monitoring](tutorial-gpu-monitoring.md)
  + [Monitor GPUs with CloudWatch](tutorial-gpu-monitoring-gpumon.md)
+ [Optimization](tutorial-gpu-opt.md)
  + [Preprocessing](tutorial-gpu-opt-preprocessing.md)
  + [Training](tutorial-gpu-opt-training.md)

# Monitoring
<a name="tutorial-gpu-monitoring"></a>

Your DLAMI comes preinstalled with several GPU monitoring tools. This guide also mentions tools that are available to download and install.
+ [Monitor GPUs with CloudWatch](tutorial-gpu-monitoring-gpumon.md) - a preinstalled utility that reports GPU usage statistics to Amazon CloudWatch.
+ [nvidia-smi CLI](https://developer.nvidia.com/nvidia-system-management-interface) - a utility to monitor overall GPU compute and memory utilization. This is preinstalled on your AWS Deep Learning AMIs (DLAMI).
+ [NVML C library](https://developer.nvidia.com/nvidia-management-library-nvml) - a C-based API to directly access GPU monitoring and management functions. This is used by the nvidia-smi CLI under the hood and is preinstalled on your DLAMI. It also has Python and Perl bindings to facilitate development in those languages. The gpumon.py utility preinstalled on your DLAMI uses the pynvml package from [nvidia-ml-py](https://pypi.org/project/nvidia-ml-py/).
+ [NVIDIA DCGM](https://developer.nvidia.com/data-center-gpu-manager-dcgm) - A cluster management tool. Visit the developer page to learn how to install and configure this tool.
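
As a quick illustration of working with these tools, the following sketch parses the kind of CSV produced by `nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader,nounits`. The sample string stands in for a live query, which requires a GPU instance:

```
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV rows into (index, utilization %, memory MiB) tuples."""
    stats = []
    for line in csv_text.strip().splitlines():
        idx, util, mem = (field.strip() for field in line.split(","))
        stats.append((int(idx), int(util), int(mem)))
    return stats

# Illustrative sample for a 2-GPU instance; on a DLAMI with GPUs you would use:
#   csv_text = subprocess.check_output(QUERY, text=True)
sample = "0, 87, 31492\n1, 85, 31488"
print(parse_gpu_stats(sample))
```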

**Tip**  
Check out NVIDIA's developer blog for the latest information on using the CUDA tools installed on your DLAMI:  
[Monitoring TensorCore utilization using Nsight IDE and nvprof](https://devblogs.nvidia.com/using-nsight-compute-nvprof-mixed-precision-deep-learning-models/).

# Monitor GPUs with CloudWatch
<a name="tutorial-gpu-monitoring-gpumon"></a>

When you use your DLAMI with a GPU, you might find that you are looking for ways to track its usage during training or inference. This can be useful for optimizing your data pipeline and tuning your deep learning network. 

There are two ways to configure GPU metrics with CloudWatch:
+ [Configure metrics with the AWS CloudWatch agent (Recommended)](#tutorial-gpu-monitoring-gpumon-cloudwatch-agent-guide)
+ [Configure metrics with the preinstalled `gpumon.py` script](#tutorial-gpu-monitoring-gpumon-script)

## Configure metrics with the AWS CloudWatch agent (Recommended)
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-guide"></a>

Integrate your DLAMI with the [unified CloudWatch agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html) to configure GPU metrics and monitor the utilization of GPU coprocessors in Amazon EC2 accelerated instances.

There are four ways to configure [GPU metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-NVIDIA-GPU.html) with your DLAMI:
+ [Configure minimal GPU metrics](#tutorial-gpu-monitoring-gpumon-cloudwatch-agent-minimal)
+ [Configure partial GPU metrics](#tutorial-gpu-monitoring-gpumon-cloudwatch-agent-partial)
+ [Configure all available GPU metrics](#tutorial-gpu-monitoring-gpumon-cloudwatch-agent-all)
+ [Configure custom GPU metrics](#tutorial-gpu-monitoring-gpumon-cloudwatch-agent-custom)

For information on updates and security patches, see [Security patching for the AWS CloudWatch agent](#tutorial-gpu-monitoring-gpumon-cloudwatch-agent-security).

### Prerequisites
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-prerequisites"></a>

To get started, you must configure Amazon EC2 instance IAM permissions that allow your instance to push metrics to CloudWatch. For detailed steps, see [ Create IAM roles and users for use with the CloudWatch agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create-iam-roles-for-cloudwatch-agent.html).

### Configure minimal GPU metrics
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-minimal"></a>

Configure minimal GPU metrics using the `dlami-cloudwatch-agent@minimal` `systemd` service. This service configures the following metrics:
+ `utilization_gpu`
+ `utilization_memory`

You can find the `systemd` service for minimal preconfigured GPU metrics in the following location:

```
/opt/aws/amazon-cloudwatch-agent/etc/dlami-amazon-cloudwatch-agent-minimal.json
```

Enable and start the `systemd` service with the following commands:

```
sudo systemctl enable dlami-cloudwatch-agent@minimal
sudo systemctl start dlami-cloudwatch-agent@minimal
```

### Configure partial GPU metrics
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-partial"></a>

Configure partial GPU metrics using the `dlami-cloudwatch-agent@partial` `systemd` service. This service configures the following metrics:
+ `utilization_gpu`
+ `utilization_memory`
+ `memory_total`
+ `memory_used`
+ `memory_free`

You can find the `systemd` service for partial preconfigured GPU metrics in the following location:

```
/opt/aws/amazon-cloudwatch-agent/etc/dlami-amazon-cloudwatch-agent-partial.json
```

Enable and start the `systemd` service with the following commands:

```
sudo systemctl enable dlami-cloudwatch-agent@partial
sudo systemctl start dlami-cloudwatch-agent@partial
```

### Configure all available GPU metrics
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-all"></a>

Configure all available GPU metrics using the `dlami-cloudwatch-agent@all` `systemd` service. This service configures the following metrics:
+ `utilization_gpu`
+ `utilization_memory`
+ `memory_total`
+ `memory_used`
+ `memory_free`
+ `temperature_gpu`
+ `power_draw`
+ `fan_speed`
+ `pcie_link_gen_current`
+ `pcie_link_width_current`
+ `encoder_stats_session_count`
+ `encoder_stats_average_fps`
+ `encoder_stats_average_latency`
+ `clocks_current_graphics`
+ `clocks_current_sm`
+ `clocks_current_memory`
+ `clocks_current_video`

You can find the `systemd` service for all available preconfigured GPU metrics in the following location:

```
/opt/aws/amazon-cloudwatch-agent/etc/dlami-amazon-cloudwatch-agent-all.json
```

Enable and start the `systemd` service with the following commands:

```
sudo systemctl enable dlami-cloudwatch-agent@all
sudo systemctl start dlami-cloudwatch-agent@all
```

### Configure custom GPU metrics
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-custom"></a>

If the preconfigured metrics do not meet your requirements, you can create a custom CloudWatch agent configuration file.

#### Create a custom configuration file
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-custom-create"></a>

To create a custom configuration file, refer to the detailed steps in [Manually create or edit the CloudWatch agent configuration file](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-Agent-Configuration-File-Details.html).

For this example, assume that the schema definition is located at `/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json`.
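
As a sketch, a configuration file that collects a small set of NVIDIA GPU metrics might look like the following. The exact fields shown here are an assumption based on the general agent configuration schema, not a verbatim copy of any preinstalled file; consult the configuration file reference linked above for the authoritative format.

```
{
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "nvidia_gpu": {
        "measurement": [
          "utilization_gpu",
          "utilization_memory",
          "memory_used"
        ],
        "metrics_collection_interval": 60
      }
    }
  }
}
```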

#### Configure metrics with your custom file
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-custom-configure"></a>

Run the following command to configure the CloudWatch agent according to your custom file:

```
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config -m ec2 -s -c \
file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json
```

### Security patching for the AWS CloudWatch agent
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-security"></a>

Newly released DLAMIs are configured with the latest available AWS CloudWatch agent security patches. Refer to the following sections to update your current DLAMI with the latest security patches depending on your operating system of choice.

#### Amazon Linux 2
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-security-al2"></a>

Use `yum` to get the latest AWS CloudWatch agent security patches for an Amazon Linux 2 DLAMI.

```
sudo yum update
```

#### Ubuntu
<a name="tutorial-gpu-monitoring-gpumon-cloudwatch-agent-security-ubuntu"></a>

To get the latest AWS CloudWatch agent security patches for a DLAMI running Ubuntu, reinstall the AWS CloudWatch agent using an Amazon S3 download link.

```
wget https://s3.region.amazonaws.com/amazoncloudwatch-agent-region/ubuntu/arm64/latest/amazon-cloudwatch-agent.deb
```

For more information on installing the AWS CloudWatch agent using Amazon S3 download links, see [ Installing and running the CloudWatch agent on your servers](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Agent-commandline-fleet.html).

## Configure metrics with the preinstalled `gpumon.py` script
<a name="tutorial-gpu-monitoring-gpumon-script"></a>

A utility called `gpumon.py` is preinstalled on your DLAMI. It integrates with CloudWatch and supports monitoring of per-GPU usage: GPU memory, GPU temperature, and GPU power. The script periodically sends the monitored data to CloudWatch. You can configure the level of granularity for the data being sent to CloudWatch by changing a few settings in the script. Before starting the script, however, you need to set up CloudWatch to receive the metrics. 

**How to setup and run GPU monitoring with CloudWatch**

1. Create an IAM user, or modify an existing one to have a policy for publishing metrics to CloudWatch. If you create a new user, take note of the credentials, as you will need them in the next step. 

   The IAM permission to search for is `cloudwatch:PutMetricData`. The policy to add is as follows:

------
#### [ JSON ]

   ```
   {
      "Version":"2012-10-17",		 	 	 
      "Statement": [
           {
               "Action": [
                   "cloudwatch:PutMetricData"
                ],
                "Effect": "Allow",
                "Resource": "*"
           }
      ]
   }
   ```

------
**Tip**  
For more information on creating an IAM user and adding policies for CloudWatch, refer to the [ CloudWatch documentation](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/create-iam-roles-for-cloudwatch-agent.html).

1. On your DLAMI, run [`aws configure`](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html#cli-quick-configuration) and specify the IAM user credentials. 

   ```
   $ aws configure
   ```

1. You might need to make some modifications to the gpumon utility before you run it. You can find the gpumon utility and README in the location shown in the following code block. For more information on the `gpumon.py` script, see [the Amazon S3 location of the script](https://s3.amazonaws.com/aws-bigdata-blog/artifacts/GPUMonitoring/gpumon.py).

   ```
   Folder: ~/tools/GPUCloudWatchMonitor
   Files:  ~/tools/GPUCloudWatchMonitor/gpumon.py
           ~/tools/GPUCloudWatchMonitor/README
   ```

   Options:
   + Change the region in gpumon.py if your instance is NOT in us-east-1.
   + Change other parameters, such as the CloudWatch `namespace`, or adjust the reporting period with `store_reso`.

1. Currently, the script supports only Python 3. Activate your preferred framework’s Python 3 environment or activate the DLAMI general Python 3 environment. 

   ```
   $ source activate python3
   ```

1. Run the gpumon utility in the background.

   ```
   (python3)$ python gpumon.py &
   ```

1. Open your browser to the [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/), then select your metric. It will be under the namespace 'DeepLearningTrain'. 
**Tip**  
You can change the namespace by modifying gpumon.py. You can also modify the reporting interval by adjusting `store_reso`. 
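
The shape of the data that a gpumon-style script reports can be sketched as follows. This is an illustrative helper only: the function name and field choices are assumptions, not the script's actual code, but the dictionary matches the metric-datum shape that a CloudWatch `put_metric_data` call accepts.

```
import datetime

def build_gpu_metric(instance_id, gpu_index, name, value):
    # Hypothetical helper: builds one CloudWatch metric datum of the
    # kind a gpumon-style script reports for a single GPU. The real
    # script assembles a similar payload and sends it with boto3, e.g.
    # cloudwatch.put_metric_data(Namespace='DeepLearningTrain',
    #                            MetricData=[datum]).
    return {
        'MetricName': name,  # for example, 'GPU Usage'
        'Dimensions': [
            {'Name': 'InstanceId', 'Value': instance_id},
            {'Name': 'GPUNumber', 'Value': str(gpu_index)},
        ],
        'Unit': 'Percent',
        'Timestamp': datetime.datetime.utcnow(),
        'Value': value,
    }

datum = build_gpu_metric('i-0123456789abcdef0', 0, 'GPU Usage', 87.5)
print(datum['MetricName'], datum['Value'])
```

Each datum carries the instance ID and GPU index as dimensions, so CloudWatch can chart every GPU on an instance separately.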

The following is an example CloudWatch chart reporting on a run of gpumon.py monitoring a training job on a p2.8xlarge instance. 

![\[GPU monitoring on CloudWatch\]](http://docs.aws.amazon.com/dlami/latest/devguide/images/gpumon.png)


You might be interested in these other topics on GPU monitoring and optimization:
+ [Monitoring](tutorial-gpu-monitoring.md)
  + [Monitor GPUs with CloudWatch](#tutorial-gpu-monitoring-gpumon)
+ [Optimization](tutorial-gpu-opt.md)
  + [Preprocessing](tutorial-gpu-opt-preprocessing.md)
  + [Training](tutorial-gpu-opt-training.md)

# Optimization
<a name="tutorial-gpu-opt"></a>

To make the most of your GPUs, you can optimize your data pipeline and tune your deep learning network. As the following chart describes, a naive or basic implementation of a neural network might use the GPU inconsistently and not to its fullest potential. When you optimize your preprocessing and data loading, you can reduce the bottleneck from your CPU to your GPU. You can adjust the neural network itself by using hybridization (when supported by the framework), adjusting batch size, and synchronizing calls. You can also use mixed-precision (float16 or int8) training in most frameworks, which can have a dramatic effect on improving throughput. 

The following chart shows the cumulative performance gains when applying different optimizations. Your results will depend on the data you are processing and the network you are optimizing.

![\[Performance enhancements for GPUs\]](http://docs.aws.amazon.com/dlami/latest/devguide/images/performance-enhancements.png)


The following guides introduce options that will work with your DLAMI and help you boost GPU performance.

**Topics**
+ [

# Preprocessing
](tutorial-gpu-opt-preprocessing.md)
+ [

# Training
](tutorial-gpu-opt-training.md)

# Preprocessing
<a name="tutorial-gpu-opt-preprocessing"></a>

Data preprocessing through transformations or augmentations can often be a CPU-bound process, and this can be the bottleneck in your overall pipeline. Frameworks have built-in operators for image processing, but DALI (NVIDIA Data Loading Library) demonstrates improved performance over frameworks’ built-in options.
+ NVIDIA Data Loading Library (DALI): DALI offloads data augmentation to the GPU. It is not preinstalled on the DLAMI, but you can access it by installing it or loading a supported framework container on your DLAMI or other Amazon Elastic Compute Cloud instance. Refer to the [DALI project page](https://docs.nvidia.com/deeplearning/sdk/dali-install-guide/index.html) on the NVIDIA website for details. For an example use-case and to download code samples, see the [ SageMaker Preprocessing Training Performance](https://github.com/aws-samples/sagemaker-cv-preprocessing-training-performance) sample.
+ nvJPEG: a GPU-accelerated JPEG decoder library for C programmers. It supports decoding single images or batches as well as subsequent transformation operations that are common in deep learning. nvJPEG comes built-in with DALI, or you can download from the [NVIDIA website's nvjpeg page](https://developer.nvidia.com/nvjpeg) and use it separately.

You might be interested in these other topics on GPU monitoring and optimization:
+ [Monitoring](tutorial-gpu-monitoring.md)
  + [Monitor GPUs with CloudWatch](tutorial-gpu-monitoring-gpumon.md)
+ [Optimization](tutorial-gpu-opt.md)
  + [Preprocessing](#tutorial-gpu-opt-preprocessing)
  + [Training](tutorial-gpu-opt-training.md)

# Training
<a name="tutorial-gpu-opt-training"></a>

With mixed-precision training you can deploy larger networks with the same amount of memory, or reduce memory usage compared to your single or double precision network, and you will see compute performance increases. You also get the benefit of smaller and faster data transfers, an important factor in multi-node distributed training. To take advantage of mixed-precision training, you need to adjust data casting and loss scaling. The following guides describe how to do this for the frameworks that support mixed precision.
+ [NVIDIA Deep Learning SDK](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/) - docs on the NVIDIA website describing mixed-precision implementation for MXNet, PyTorch, and TensorFlow.

**Tip**  
Be sure to check the website for your framework of choice, and search for "mixed precision" or "fp16" for the latest optimization techniques. Here are some mixed-precision guides you might find helpful:  
[Mixed-precision training with TensorFlow (video)](https://devblogs.nvidia.com/mixed-precision-resnet-50-tensor-cores/) - on the NVIDIA blog site.
[Mixed-precision training using float16 with MXNet](https://mxnet.apache.org/api/faq/float16) - an FAQ article on the MXNet website.
[NVIDIA Apex: a tool for easy mixed-precision training with PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/) - a blog article on the NVIDIA website.
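
The core of the float16 argument can be checked with plain NumPy: half the bytes per value, but a much narrower range, which is exactly what loss scaling compensates for. A minimal sketch with illustrative numbers (not framework code):

```
import numpy as np

# float16 uses 2 bytes per value vs. 4 for float32 -- half the memory
# and bandwidth for activations, gradients, and transfers.
assert np.dtype(np.float16).itemsize == 2
assert np.dtype(np.float32).itemsize == 4

# Tiny gradients below float16's subnormal range underflow to zero...
tiny_grad = np.float32(1e-8)
print(np.float16(tiny_grad))            # 0.0 -- the update is lost

# ...which is why frameworks scale the loss (and hence the gradients)
# up before the backward pass, then divide the scale back out in
# float32 before applying the weight update.
scale = np.float32(1024.0)
scaled = np.float16(tiny_grad * scale)  # nonzero, representable
recovered = np.float32(scaled) / scale
print(scaled, recovered)
```

This is why the guides above pair casting with loss scaling: casting alone silently zeroes out small gradients.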

You might be interested in these other topics on GPU monitoring and optimization:
+ [Monitoring](tutorial-gpu-monitoring.md)
  + [Monitor GPUs with CloudWatch](tutorial-gpu-monitoring-gpumon.md)
+ [Optimization](tutorial-gpu-opt.md)
  + [Preprocessing](tutorial-gpu-opt-preprocessing.md)
  + [Training](#tutorial-gpu-opt-training)

# The AWS Inferentia Chip With DLAMI
<a name="tutorial-inferentia"></a>

AWS Inferentia is a custom machine learning chip designed by AWS that you can use for high-performance inference predictions. In order to use the chip, set up an Amazon Elastic Compute Cloud instance and use the AWS Neuron software development kit (SDK) to invoke the Inferentia chip. To provide customers with the best Inferentia experience, Neuron has been built into the AWS Deep Learning AMIs (DLAMI). 

The following topics show you how to get started using Inferentia with the DLAMI. 

**Topics**
+ [

# Launching a DLAMI Instance with AWS Neuron
](tutorial-inferentia-launching.md)
+ [

# Using the DLAMI with AWS Neuron
](tutorial-inferentia-using.md)

# Launching a DLAMI Instance with AWS Neuron
<a name="tutorial-inferentia-launching"></a>

 The latest DLAMI is ready to use with AWS Inferentia and comes with the AWS Neuron API package. To launch a DLAMI instance, see [Launching and Configuring a DLAMI](https://docs.aws.amazon.com/dlami/latest/devguide/launch-config.html). After you have a DLAMI, use the steps here to ensure that your AWS Inferentia chip and AWS Neuron resources are active.

**Topics**
+ [

## Verify Your Instance
](#tutorial-inferentia-launching-verify)
+ [

## Identifying AWS Inferentia Devices
](#tutorial-inferentia-launching-identify)
+ [

## View Resource Usage
](#tutorial-inferentia-launching-resource-usage)
+ [

## Using Neuron Monitor (neuron-monitor)
](#tutorial-inferentia-launching-neuron-monitor)
+ [

## Upgrading Neuron Software
](#tutorial-inferentia-launching-upgrade)

## Verify Your Instance
<a name="tutorial-inferentia-launching-verify"></a>

 Before using your instance, verify that it's properly set up and configured with Neuron. 

## Identifying AWS Inferentia Devices
<a name="tutorial-inferentia-launching-identify"></a>

 To identify the number of Inferentia devices on your instance, use the following command: 

```
neuron-ls
```

 If your instance has Inferentia devices attached to it, your output will look similar to the following: 

```
+--------+--------+--------+-----------+--------------+
| NEURON | NEURON | NEURON | CONNECTED |     PCI      |
| DEVICE | CORES  | MEMORY |  DEVICES  |     BDF      |
+--------+--------+--------+-----------+--------------+
| 0      | 4      | 8 GB   | 1         | 0000:00:1c.0 |
| 1      | 4      | 8 GB   | 2, 0      | 0000:00:1d.0 |
| 2      | 4      | 8 GB   | 3, 1      | 0000:00:1e.0 |
| 3      | 4      | 8 GB   | 2         | 0000:00:1f.0 |
+--------+--------+--------+-----------+--------------+
```

 The example output above is taken from an inf1.6xlarge instance and includes the following columns:
+ NEURON DEVICE: The logical ID assigned to the NeuronDevice. This ID is used when configuring multiple runtimes to use different NeuronDevices.
+ NEURON CORES: The number of NeuronCores present in the NeuronDevice. 
+ NEURON MEMORY: The amount of DRAM memory in the NeuronDevice.
+ CONNECTED DEVICES: Other NeuronDevices connected to the NeuronDevice. 
+ PCI BDF: The PCI Bus Device Function (BDF) ID of the NeuronDevice.
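
If you want to consume this table programmatically, it can be parsed in a few lines of Python. This is a hypothetical helper based on the sample output above, offered only as a sketch; check the Neuron tools documentation for any machine-readable output options before relying on text parsing.

```
def parse_neuron_ls(text):
    # Parse the ASCII table printed by neuron-ls into dicts.
    # Hypothetical helper: the keys below follow the columns in the
    # sample output above, not an official schema.
    devices = []
    for line in text.splitlines():
        if not line.startswith('|'):
            continue                      # skip the +---+ border rows
        cells = [c.strip() for c in line.strip('|').split('|')]
        if not cells[0].isdigit():
            continue                      # skip the two header rows
        devices.append({
            'device': int(cells[0]),      # NEURON DEVICE
            'cores': int(cells[1]),       # NEURON CORES
            'memory': cells[2],           # NEURON MEMORY
            'connected': cells[3],        # CONNECTED DEVICES
            'pci_bdf': cells[4],          # PCI BDF
        })
    return devices

sample = """\
| 0      | 4      | 8 GB   | 1         | 0000:00:1c.0 |
| 1      | 4      | 8 GB   | 2, 0      | 0000:00:1d.0 |"""
print(parse_neuron_ls(sample))
```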

## View Resource Usage
<a name="tutorial-inferentia-launching-resource-usage"></a>

 View useful information about NeuronCore and vCPU utilization, memory usage, loaded models, and Neuron applications with the `neuron-top` command. Launching `neuron-top` with no arguments will show data for all machine learning applications that utilize NeuronCores. 

```
neuron-top
```

 When an application is using four NeuronCores, the output should look similar to the following image: 

![\[The output of the neuron-top command, with information for one of four NeuronCores highlighted.\]](http://docs.aws.amazon.com/dlami/latest/devguide/images/neuron-top-output.png)


For more information on resources to monitor and optimize Neuron-based inference applications, see [Neuron Tools](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-tools/index.html).

## Using Neuron Monitor (neuron-monitor)
<a name="tutorial-inferentia-launching-neuron-monitor"></a>

Neuron Monitor collects metrics from the Neuron runtimes running on the system and streams the collected data to stdout in JSON format. These metrics are organized into metric groups that you configure by providing a configuration file. For more information on Neuron Monitor, see the [User Guide for Neuron Monitor](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-tools/neuron-monitor-user-guide.html).
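
Because the output is line-oriented JSON, a consumer can be sketched in a few lines of Python. The field names in the simulated stream below are placeholders, not the real Neuron Monitor schema; in practice you would read from a `subprocess` pipe attached to `neuron-monitor`.

```
import io
import json

def iter_json_reports(stream):
    # Yield one parsed report per JSON line, as neuron-monitor emits
    # them on stdout. Works on any file-like object; in practice you
    # would wrap subprocess.Popen(['neuron-monitor'],
    # stdout=subprocess.PIPE).stdout instead of a StringIO.
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Simulated two-line stream standing in for neuron-monitor output;
# the keys here are illustrative only.
fake = io.StringIO('{"period": 1, "neuron_runtime_data": []}\n'
                   '{"period": 2, "neuron_runtime_data": []}\n')
for report in iter_json_reports(fake):
    print(report["period"])
```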

## Upgrading Neuron Software
<a name="tutorial-inferentia-launching-upgrade"></a>

For information on how to update Neuron SDK software within DLAMI, see the AWS Neuron [Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-intro/neuron-install-guide.html).

**Next Step**  
[Using the DLAMI with AWS Neuron](tutorial-inferentia-using.md)

# Using the DLAMI with AWS Neuron
<a name="tutorial-inferentia-using"></a>

 A typical workflow with the AWS Neuron SDK is to compile a previously trained machine learning model on a compilation server. After this, distribute the artifacts to the Inf1 instances for execution. The AWS Deep Learning AMI (DLAMI) comes preinstalled with everything you need to compile and run inference on an Inf1 instance that uses Inferentia. 

 The following sections describe how to use the DLAMI with Inferentia. 

**Topics**
+ [

# Using TensorFlow-Neuron and the AWS Neuron Compiler
](tutorial-inferentia-tf-neuron.md)
+ [

# Using AWS Neuron TensorFlow Serving
](tutorial-inferentia-tf-neuron-serving.md)
+ [

# Using MXNet-Neuron and the AWS Neuron Compiler
](tutorial-inferentia-mxnet-neuron.md)
+ [

# Using MXNet-Neuron Model Serving
](tutorial-inferentia-mxnet-neuron-serving.md)
+ [

# Using PyTorch-Neuron and the AWS Neuron Compiler
](tutorial-inferentia-pytorch-neuron.md)

# Using TensorFlow-Neuron and the AWS Neuron Compiler
<a name="tutorial-inferentia-tf-neuron"></a>

 This tutorial shows how to use the AWS Neuron compiler to compile the Keras ResNet-50 model and export it in the SavedModel format, a common TensorFlow model interchange format. You also learn how to run inference on an Inf1 instance with example input.  

 For more information about the Neuron SDK, see the [AWS Neuron SDK documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/tensorflow-neuron/index.html). 

**Topics**
+ [

## Prerequisites
](#tutorial-inferentia-tf-neuron-prerequisites)
+ [

## Activate the Conda environment
](#tutorial-inferentia-tf-neuron-activate)
+ [

## Resnet50 Compilation
](#tutorial-inferentia-tf-neuron-compilation)
+ [

## ResNet50 Inference
](#tutorial-inferentia-tf-neuron-inference)

## Prerequisites
<a name="tutorial-inferentia-tf-neuron-prerequisites"></a>

 Before using this tutorial, you should have completed the setup steps in [Launching a DLAMI Instance with AWS Neuron](tutorial-inferentia-launching.md). You should also be familiar with deep learning and using the DLAMI. 

## Activate the Conda environment
<a name="tutorial-inferentia-tf-neuron-activate"></a>

 Activate the TensorFlow-Neuron conda environment using the following command: 

```
source activate aws_neuron_tensorflow_p36
```

 To exit the current conda environment, run the following command: 

```
source deactivate
```

## Resnet50 Compilation
<a name="tutorial-inferentia-tf-neuron-compilation"></a>

Create a Python script called **tensorflow_compile_resnet50.py** that has the following content. This Python script compiles the Keras ResNet-50 model and exports it as a saved model. 

```
import os
import time
import shutil
import tensorflow as tf
import tensorflow.neuron as tfn
import tensorflow.compat.v1.keras as keras
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

# Create a workspace
WORKSPACE = './ws_resnet50'
os.makedirs(WORKSPACE, exist_ok=True)

# Prepare export directory (old one removed)
model_dir = os.path.join(WORKSPACE, 'resnet50')
compiled_model_dir = os.path.join(WORKSPACE, 'resnet50_neuron')
shutil.rmtree(model_dir, ignore_errors=True)
shutil.rmtree(compiled_model_dir, ignore_errors=True)

# Instantiate Keras ResNet50 model
keras.backend.set_learning_phase(0)
model = ResNet50(weights='imagenet')

# Export SavedModel
tf.saved_model.simple_save(
 session            = keras.backend.get_session(),
 export_dir         = model_dir,
 inputs             = {'input': model.inputs[0]},
 outputs            = {'output': model.outputs[0]})

# Compile using Neuron
tfn.saved_model.compile(model_dir, compiled_model_dir)

# Prepare SavedModel for uploading to Inf1 instance
shutil.make_archive(compiled_model_dir, 'zip', WORKSPACE, 'resnet50_neuron')
```

 Compile the model using the following command: 

```
python tensorflow_compile_resnet50.py
```

The compilation process will take a few minutes. When it completes, your output should look like the following: 

```
...
INFO:tensorflow:fusing subgraph neuron_op_d6f098c01c780733 with neuron-cc
INFO:tensorflow:Number of operations in TensorFlow session: 4638
INFO:tensorflow:Number of operations after tf.neuron optimizations: 556
INFO:tensorflow:Number of operations placed on Neuron runtime: 554
INFO:tensorflow:Successfully converted ./ws_resnet50/resnet50 to ./ws_resnet50/resnet50_neuron
...
```

 After compilation, the saved model is zipped at **ws_resnet50/resnet50_neuron.zip**. Unzip the model and download the sample image for inference using the following commands: 

```
unzip ws_resnet50/resnet50_neuron.zip -d .
curl -O https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg
```

## ResNet50 Inference
<a name="tutorial-inferentia-tf-neuron-inference"></a>

Create a Python script called **tensorflow_infer_resnet50.py** that has the following content. This script runs inference on the sample image using the previously compiled model. 

```
import os
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications import resnet50

# Create input from image
img_sgl = image.load_img('kitten_small.jpg', target_size=(224, 224))
img_arr = image.img_to_array(img_sgl)
img_arr2 = np.expand_dims(img_arr, axis=0)
img_arr3 = resnet50.preprocess_input(img_arr2)
# Load model
COMPILED_MODEL_DIR = './ws_resnet50/resnet50_neuron/'
predictor_inferentia = tf.contrib.predictor.from_saved_model(COMPILED_MODEL_DIR)
# Run inference
model_feed_dict={'input': img_arr3}
infa_rslts = predictor_inferentia(model_feed_dict)
# Display results
print(resnet50.decode_predictions(infa_rslts["output"], top=5)[0])
```

 Run inference on the model using the following command: 

```
python tensorflow_infer_resnet50.py
```

 Your output should look like the following: 

```
...
[('n02123045', 'tabby', 0.6918919), ('n02127052', 'lynx', 0.12770271), ('n02123159', 'tiger_cat', 0.08277027), ('n02124075', 'Egyptian_cat', 0.06418919), ('n02128757', 'snow_leopard', 0.009290541)]
```

**Next Step**  
[Using AWS Neuron TensorFlow Serving](tutorial-inferentia-tf-neuron-serving.md)

# Using AWS Neuron TensorFlow Serving
<a name="tutorial-inferentia-tf-neuron-serving"></a>

This tutorial shows how to construct a graph and add an AWS Neuron compilation step before exporting the saved model to use with TensorFlow Serving. TensorFlow Serving is a serving system that allows you to scale up inference across a network. Neuron TensorFlow Serving uses the same API as normal TensorFlow Serving. The only difference is that a saved model must be compiled for AWS Inferentia and the entry point is a different binary named `tensorflow_model_server_neuron`. The binary is found at `/usr/local/bin/tensorflow_model_server_neuron` and is preinstalled on the DLAMI. 

 For more information about the Neuron SDK, see the [AWS Neuron SDK documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/tensorflow-neuron/index.html). 

**Topics**
+ [

## Prerequisites
](#tutorial-inferentia-tf-neuron--serving-prerequisites)
+ [

## Activate the Conda environment
](#tutorial-inferentia-tf-neuron-serving-activate)
+ [

## Compile and Export the Saved Model
](#tutorial-inferentia-tf-neuron-serving-compile)
+ [

## Serving the Saved Model
](#tutorial-inferentia-tf-neuron-serving-serving)
+ [

## Generate inference requests to the model server
](#tutorial-inferentia-tf-neuron-serving-inference)

## Prerequisites
<a name="tutorial-inferentia-tf-neuron--serving-prerequisites"></a>

Before using this tutorial, you should have completed the setup steps in [Launching a DLAMI Instance with AWS Neuron](tutorial-inferentia-launching.md). You should also be familiar with deep learning and using the DLAMI. 

## Activate the Conda environment
<a name="tutorial-inferentia-tf-neuron-serving-activate"></a>

 Activate the TensorFlow-Neuron conda environment using the following command: 

```
source activate aws_neuron_tensorflow_p36
```

 If you need to exit the current conda environment, run: 

```
source deactivate
```

## Compile and Export the Saved Model
<a name="tutorial-inferentia-tf-neuron-serving-compile"></a>

Create a Python script called `tensorflow-model-server-compile.py` with the following content. This script constructs a graph and compiles it using Neuron. It then exports the compiled graph as a saved model.  

```
import tensorflow as tf
import tensorflow.neuron
import os

tf.keras.backend.set_learning_phase(0)
model = tf.keras.applications.ResNet50(weights='imagenet')
sess = tf.keras.backend.get_session()
inputs = {'input': model.inputs[0]}
outputs = {'output': model.outputs[0]}

# save the model using tf.saved_model.simple_save
modeldir = "./resnet50/1"
tf.saved_model.simple_save(sess, modeldir, inputs, outputs)

# compile the model for Inferentia
neuron_modeldir = os.path.join(os.path.expanduser('~'), 'resnet50_inf1', '1')
tf.neuron.saved_model.compile(modeldir, neuron_modeldir, batch_size=1)
```

 Compile the model using the following command: 

```
python tensorflow-model-server-compile.py
```

 Your output should look like the following: 

```
...
INFO:tensorflow:fusing subgraph neuron_op_d6f098c01c780733 with neuron-cc
INFO:tensorflow:Number of operations in TensorFlow session: 4638
INFO:tensorflow:Number of operations after tf.neuron optimizations: 556
INFO:tensorflow:Number of operations placed on Neuron runtime: 554
INFO:tensorflow:Successfully converted ./resnet50/1 to /home/ubuntu/resnet50_inf1/1
```

## Serving the Saved Model
<a name="tutorial-inferentia-tf-neuron-serving-serving"></a>

Once the model has been compiled, you can use the following command to serve the saved model with the `tensorflow_model_server_neuron` binary: 

```
tensorflow_model_server_neuron --model_name=resnet50_inf1 \
    --model_base_path=$HOME/resnet50_inf1/ --port=8500 &
```

 Your output should look like the following. The compiled model is staged in the Inferentia device’s DRAM by the server to prepare for inference. 

```
...
2019-11-22 01:20:32.075856: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:311] SavedModel load for tags { serve }; Status: success. Took 40764 microseconds.
2019-11-22 01:20:32.075888: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:105] No warmup data file found at /home/ubuntu/resnet50_inf1/1/assets.extra/tf_serving_warmup_requests
2019-11-22 01:20:32.075950: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: resnet50_inf1 version: 1}
2019-11-22 01:20:32.077859: I tensorflow_serving/model_servers/server.cc:353] Running gRPC ModelServer at 0.0.0.0:8500 ...
```

## Generate inference requests to the model server
<a name="tutorial-inferentia-tf-neuron-serving-inference"></a>

Create a Python script called `tensorflow-model-server-infer.py` with the following content. This script runs inference via gRPC, an open-source remote procedure call framework. 

```
import numpy as np
import grpc
import tensorflow as tf
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
from tensorflow.keras.applications.resnet50 import decode_predictions

if __name__ == '__main__':
    channel = grpc.insecure_channel('localhost:8500')
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    img_file = tf.keras.utils.get_file(
        "./kitten_small.jpg",
        "https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg")
    img = image.load_img(img_file, target_size=(224, 224))
    img_array = preprocess_input(image.img_to_array(img)[None, ...])
    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'resnet50_inf1'
    request.inputs['input'].CopyFrom(
        tf.contrib.util.make_tensor_proto(img_array, shape=img_array.shape))
    result = stub.Predict(request)
    prediction = tf.make_ndarray(result.outputs['output'])
    print(decode_predictions(prediction))
```

 Run inference on the model by using gRPC with the following command: 

```
python tensorflow-model-server-infer.py
```

 Your output should look like the following: 

```
[[('n02123045', 'tabby', 0.6918919), ('n02127052', 'lynx', 0.12770271), ('n02123159', 'tiger_cat', 0.08277027), ('n02124075', 'Egyptian_cat', 0.06418919), ('n02128757', 'snow_leopard', 0.009290541)]]
```

# Using MXNet-Neuron and the AWS Neuron Compiler
<a name="tutorial-inferentia-mxnet-neuron"></a>

The MXNet-Neuron compilation API provides a method to compile a model graph that you can run on an AWS Inferentia device. 

 In this example, you use the API to compile a ResNet-50 model and use it to run inference. 

 For more information about the Neuron SDK, see the [AWS Neuron SDK documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/mxnet-neuron/index.html). 

**Topics**
+ [

## Prerequisites
](#tutorial-inferentia-mxnet-neuron-prerequisites)
+ [

## Activate the Conda Environment
](#tutorial-inferentia-mxnet-neuron-activate)
+ [

## ResNet50 Compilation
](#tutorial-inferentia-mxnet-neuron-compilation)
+ [

## ResNet50 Inference
](#tutorial-inferentia-mxnet-neuron-inference)

## Prerequisites
<a name="tutorial-inferentia-mxnet-neuron-prerequisites"></a>

 Before using this tutorial, you should have completed the setup steps in [Launching a DLAMI Instance with AWS Neuron](tutorial-inferentia-launching.md). You should also be familiar with deep learning and using the DLAMI. 

## Activate the Conda Environment
<a name="tutorial-inferentia-mxnet-neuron-activate"></a>

 Activate the MXNet-Neuron conda environment using the following command: 

```
source activate aws_neuron_mxnet_p36
```

To exit the current conda environment, run: 

```
source deactivate
```

## ResNet50 Compilation
<a name="tutorial-inferentia-mxnet-neuron-compilation"></a>

Create a Python script called `mxnet_compile_resnet50.py` with the following content. This script uses the MXNet-Neuron compilation Python API to compile a ResNet-50 model. 

```
import mxnet as mx
import numpy as np

print("downloading...")
path='http://data.mxnet.io/models/imagenet/'
mx.test_utils.download(path+'resnet/50-layers/resnet-50-0000.params')
mx.test_utils.download(path+'resnet/50-layers/resnet-50-symbol.json')
print("download finished.")

sym, args, aux = mx.model.load_checkpoint('resnet-50', 0)

print("compile for inferentia using neuron... this will take a few minutes...")
inputs = { "data" : mx.nd.ones([1,3,224,224], name='data', dtype='float32') }

sym, args, aux = mx.contrib.neuron.compile(sym, args, aux, inputs)

print("save compiled model...")
mx.model.save_checkpoint("compiled_resnet50", 0, sym, args, aux)
```

 Compile the model using the following command: 

```
python mxnet_compile_resnet50.py
```

 Compilation will take a few minutes. When compilation has finished, the following files will be in your current directory: 

```
resnet-50-0000.params
resnet-50-symbol.json
compiled_resnet50-0000.params
compiled_resnet50-symbol.json
```

## ResNet50 Inference
<a name="tutorial-inferentia-mxnet-neuron-inference"></a>

Create a Python script called `mxnet_infer_resnet50.py` with the following content. This script downloads a sample image and uses it to run inference with the compiled model. 

```
import mxnet as mx
import numpy as np

path='http://data.mxnet.io/models/imagenet/'
mx.test_utils.download(path+'synset.txt')

fname = mx.test_utils.download('https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg')
img = mx.image.imread(fname)

# Convert the image into (batch, RGB, width, height) format
img = mx.image.imresize(img, 224, 224)  # resize
img = img.transpose((2, 0, 1))  # channel first
img = img.expand_dims(axis=0)  # batchify
img = img.astype(dtype='float32')

sym, args, aux = mx.model.load_checkpoint('compiled_resnet50', 0)
softmax = mx.nd.random_normal(shape=(1,))
args['softmax_label'] = softmax
args['data'] = img
# Inferentia context
ctx = mx.neuron()

exe = sym.bind(ctx=ctx, args=args, aux_states=aux, grad_req='null')
with open('synset.txt', 'r') as f:
    labels = [l.rstrip() for l in f]

exe.forward(data=img)
prob = exe.outputs[0].asnumpy()
# print the top-5
prob = np.squeeze(prob)
a = np.argsort(prob)[::-1] 
for i in a[0:5]:
    print('probability=%f, class=%s' %(prob[i], labels[i]))
```
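The preprocessing in the script above reshapes the decoded image from channel-last (height, width, channel) layout into the batched channel-first (batch, channel, height, width) layout the network expects. A minimal NumPy sketch of the same shape transformation, using a zero array as a stand-in for a decoded image:

```
import numpy as np

# Stand-in for a decoded 224x224 RGB image in (height, width, channel) layout
img = np.zeros((224, 224, 3), dtype=np.uint8)

img = img.transpose((2, 0, 1))   # channel first: (3, 224, 224)
img = img[np.newaxis, ...]       # batchify: (1, 3, 224, 224)
img = img.astype(np.float32)

print(img.shape)  # (1, 3, 224, 224)
```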

 Run inference with the compiled model using the following command: 

```
python mxnet_infer_resnet50.py
```

 Your output should look like the following: 

```
probability=0.642454, class=n02123045 tabby, tabby cat
probability=0.189407, class=n02123159 tiger cat
probability=0.100798, class=n02124075 Egyptian cat
probability=0.030649, class=n02127052 lynx, catamount
probability=0.016278, class=n02129604 tiger, Panthera tigris
```

**Next Step**  
[Using MXNet-Neuron Model Serving](tutorial-inferentia-mxnet-neuron-serving.md)

# Using MXNet-Neuron Model Serving
<a name="tutorial-inferentia-mxnet-neuron-serving"></a>

In this tutorial, you learn to use a pre-trained MXNet model to perform real-time image classification with Multi Model Server (MMS). MMS is a flexible and easy-to-use tool for serving deep learning models that are trained using any machine learning or deep learning framework. This tutorial includes a compilation step using AWS Neuron and an implementation of MMS using MXNet.

 For more information about the Neuron SDK, see the [AWS Neuron SDK documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/mxnet-neuron/index.html). 

**Topics**
+ [

## Prerequisites
](#tutorial-inferentia-mxnet-neuron-serving-prerequisites)
+ [

## Activate the Conda Environment
](#tutorial-inferentia-mxnet-neuron-serving-activate)
+ [

## Download the Example Code
](#tutorial-inferentia-mxnet-neuron-serving-download)
+ [

## Compile the Model
](#tutorial-inferentia-mxnet-neuron-serving-compile)
+ [

## Run Inference
](#tutorial-inferentia-mxnet-neuron-serving-inference)

## Prerequisites
<a name="tutorial-inferentia-mxnet-neuron-serving-prerequisites"></a>

 Before using this tutorial, you should have completed the setup steps in [Launching a DLAMI Instance with AWS Neuron](tutorial-inferentia-launching.md). You should also be familiar with deep learning and using the DLAMI. 

## Activate the Conda Environment
<a name="tutorial-inferentia-mxnet-neuron-serving-activate"></a>

 Activate the MXNet-Neuron conda environment by using the following command: 

```
source activate aws_neuron_mxnet_p36
```

 To exit the current conda environment, run: 

```
source deactivate
```

## Download the Example Code
<a name="tutorial-inferentia-mxnet-neuron-serving-download"></a>

 To run this example, download the example code using the following commands: 

```
git clone https://github.com/awslabs/multi-model-server
cd multi-model-server/examples/mxnet_vision
```

## Compile the Model
<a name="tutorial-inferentia-mxnet-neuron-serving-compile"></a>

Create a Python script called `multi-model-server-compile.py` with the following content. This script compiles the ResNet50 model to the Inferentia device target. 

```
import mxnet as mx
from mxnet.contrib import neuron
import numpy as np

path='http://data.mxnet.io/models/imagenet/'
mx.test_utils.download(path+'resnet/50-layers/resnet-50-0000.params')
mx.test_utils.download(path+'resnet/50-layers/resnet-50-symbol.json')
mx.test_utils.download(path+'synset.txt')

nn_name = "resnet-50"

#Load a model
sym, args, auxs = mx.model.load_checkpoint(nn_name, 0)

# Define compilation parameters: input shape and dtype
inputs = {'data' : mx.nd.zeros([1,3,224,224], dtype='float32') }

# compile graph to inferentia target
csym, cargs, cauxs = neuron.compile(sym, args, auxs, inputs)

# save compiled model
mx.model.save_checkpoint(nn_name + "_compiled", 0, csym, cargs, cauxs)
```

 To compile the model, use the following command: 

```
python multi-model-server-compile.py
```

 Your output should look like the following: 

```
...
[21:18:40] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[21:18:40] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
[21:19:00] src/operator/subgraph/build_subgraph.cc:698: start to execute partition graph.
[21:19:00] src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[21:19:00] src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
```

 Create a file named `signature.json` with the following content to configure the input name and shape: 

```
{
  "inputs": [
    {
      "data_name": "data",
      "data_shape": [
        1,
        3,
        224,
        224
      ]
    }
  ]
}
```
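If you prefer to generate this file rather than edit it by hand, the same content can be written from Python. A sketch using only the standard library; the input name and shape mirror the file above and must match the shape the model was compiled with:

```
import json

# Input signature for MMS; data_shape is (batch, channel, height, width)
# and must match the compilation input shape ([1, 3, 224, 224] here)
signature = {
    "inputs": [
        {
            "data_name": "data",
            "data_shape": [1, 3, 224, 224],
        }
    ]
}

with open("signature.json", "w") as f:
    json.dump(signature, f, indent=2)
```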

Download the `synset.txt` file by using the following command. This file is a list of names for ImageNet prediction classes. 

```
curl -O https://s3.amazonaws.com/model-server/model_archive_1.0/examples/squeezenet_v1.1/synset.txt
```
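Each line of `synset.txt` pairs a WordNet ID with human-readable class names (for example, `n02123045 tabby, tabby cat`). A small sketch of how serving code typically splits these lines into IDs and labels, shown here on two inline sample lines in that format:

```
# Two sample lines in the synset.txt format (WordNet ID, then class names)
sample = "n02123045 tabby, tabby cat\nn02123159 tiger cat"

labels = []
for line in sample.splitlines():
    wnid, names = line.split(" ", 1)  # split off the WordNet ID
    labels.append((wnid, names))

print(labels[0])  # ('n02123045', 'tabby, tabby cat')
```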

Create a custom service class following the template in the `model_service_template` folder. Copy the template into your current working directory by using the following command: 

```
cp -r ../model_service_template/* .
```

 Edit the `mxnet_model_service.py` module to replace the `mx.cpu()` context with the `mx.neuron()` context as follows. You also need to comment out the unnecessary data copy for `model_input` because MXNet-Neuron does not support the NDArray and Gluon APIs. 

```
...
self.mxnet_ctx = mx.neuron() if gpu_id is None else mx.gpu(gpu_id)
...
#model_input = [item.as_in_context(self.mxnet_ctx) for item in model_input]
```

 Package the model with model-archiver using the following commands: 

```
cd ~/multi-model-server/examples
model-archiver --force --model-name resnet-50_compiled --model-path mxnet_vision --handler mxnet_vision_service:handle
```

## Run Inference
<a name="tutorial-inferentia-mxnet-neuron-serving-inference"></a>

Start the Multi Model Server and load the model using the RESTful API with the following commands. Ensure that **neuron-rtd** is running with the default settings. 

```
cd ~/multi-model-server/
multi-model-server --start --model-store examples > /dev/null # Pipe to log file if you want to keep a log of MMS
curl -v -X POST "http://localhost:8081/models?initial_workers=1&max_workers=4&synchronous=true&url=resnet-50_compiled.mar"
sleep 10 # allow sufficient time to load model
```
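The `curl` call above targets the MMS management API on port 8081; the query string controls worker scaling and whether registration blocks until the model is loaded. If you script this step from Python, the same registration URL can be built with the standard library (host, port, and parameter names are taken from the command above):

```
from urllib.parse import urlencode

# Parameters mirror the curl registration call above
params = {
    "initial_workers": 1,     # workers started at registration time
    "max_workers": 4,         # upper bound on workers
    "synchronous": "true",    # block until the model has loaded
    "url": "resnet-50_compiled.mar",
}
register_url = "http://localhost:8081/models?" + urlencode(params)
print(register_url)
```

You could then POST this URL with `urllib.request` or any HTTP client, exactly as the `curl` command does.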

 Run inference using an example image with the following commands: 

```
curl -O https://raw.githubusercontent.com/awslabs/multi-model-server/master/docs/images/kitten_small.jpg
curl -X POST http://127.0.0.1:8080/predictions/resnet-50_compiled -T kitten_small.jpg
```

 Your output should look like the following: 

```
[
  {
    "probability": 0.6388034820556641,
    "class": "n02123045 tabby, tabby cat"
  },
  {
    "probability": 0.16900072991847992,
    "class": "n02123159 tiger cat"
  },
  {
    "probability": 0.12221276015043259,
    "class": "n02124075 Egyptian cat"
  },
  {
    "probability": 0.028706775978207588,
    "class": "n02127052 lynx, catamount"
  },
  {
    "probability": 0.01915954425930977,
    "class": "n02129604 tiger, Panthera tigris"
  }
]
```

 To clean up after the test, issue a delete command via the RESTful API and stop the model server using the following commands: 

```
curl -X DELETE http://127.0.0.1:8081/models/resnet-50_compiled

multi-model-server --stop
```

 You should see the following output: 

```
{
  "status": "Model \"resnet-50_compiled\" unregistered"
}
Model server stopped.
Found 1 models and 1 NCGs.
Unloading 10001 (MODEL_STATUS_STARTED) :: success
Destroying NCG 1 :: success
```

# Using PyTorch-Neuron and the AWS Neuron Compiler
<a name="tutorial-inferentia-pytorch-neuron"></a>

The PyTorch-Neuron compilation API provides a method to compile a model graph that you can run on an AWS Inferentia device. 

A trained model must be compiled to an Inferentia target before it can be deployed on Inf1 instances. The following tutorial compiles the torchvision ResNet50 model and exports it as a saved TorchScript module. This model is then used to run inference.

For convenience, this tutorial uses an Inf1 instance for both compilation and inference. In practice, you may compile your model using another instance type, such as the c5 instance family. You must then deploy your compiled model to the Inf1 inference server. For more information, see the [AWS Neuron PyTorch SDK Documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-guide/neuron-frameworks/pytorch-neuron/index.html).

**Topics**
+ [

## Prerequisites
](#tutorial-inferentia-pytorch-neuron-prerequisites)
+ [

## Activate the Conda Environment
](#tutorial-inferentia-pytorch-neuron-activate)
+ [

## ResNet50 Compilation
](#tutorial-inferentia-pytorch-neuron-compilation)
+ [

## ResNet50 Inference
](#tutorial-inferentia-pytorch-neuron-inference)

## Prerequisites
<a name="tutorial-inferentia-pytorch-neuron-prerequisites"></a>

Before using this tutorial, you should have completed the setup steps in [Launching a DLAMI Instance with AWS Neuron](tutorial-inferentia-launching.md). You should also be familiar with deep learning and using the DLAMI. 

## Activate the Conda Environment
<a name="tutorial-inferentia-pytorch-neuron-activate"></a>

Activate the PyTorch-Neuron conda environment using the following command: 

```
source activate aws_neuron_pytorch_p36
```

To exit the current conda environment, run: 

```
source deactivate
```

## ResNet50 Compilation
<a name="tutorial-inferentia-pytorch-neuron-compilation"></a>

Create a Python script called `pytorch_trace_resnet50.py` with the following content. This script uses the PyTorch-Neuron compilation Python API to compile a ResNet-50 model. 

**Note**  
There is a dependency between versions of torchvision and the torch package that you should be aware of when compiling torchvision models. These dependency rules can be managed through pip. Torchvision==0.6.1 matches the torch==1.5.1 release, while torchvision==0.8.2 matches the torch==1.7.1 release.

```
import torch
import numpy as np
import os
import torch_neuron
from torchvision import models

image = torch.zeros([1, 3, 224, 224], dtype=torch.float32)

## Load a pretrained ResNet50 model
model = models.resnet50(pretrained=True)

## Tell the model we are using it for evaluation (not training)
model.eval()
model_neuron = torch.neuron.trace(model, example_inputs=[image])

## Export to saved model
model_neuron.save("resnet50_neuron.pt")
```

Run the compilation script.

```
python pytorch_trace_resnet50.py
```

Compilation will take a few minutes. When compilation has finished, the compiled model is saved as `resnet50_neuron.pt` in the local directory.
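As the version note above mentions, torch and torchvision releases must be paired. A sketch of a pre-flight check you could run before compiling; only the two pairings from the note are included, and version strings carrying local suffixes (for example `1.7.1+cpu`) would need normalizing first:

```
# Known-good (torch, torchvision) pairings from the version note above
COMPATIBLE = {
    "1.5.1": "0.6.1",
    "1.7.1": "0.8.2",
}

def check_versions(torch_version, torchvision_version):
    """Return True if this torchvision release matches this torch release."""
    return COMPATIBLE.get(torch_version) == torchvision_version

print(check_versions("1.5.1", "0.6.1"))  # True
print(check_versions("1.7.1", "0.6.1"))  # False
```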

## ResNet50 Inference
<a name="tutorial-inferentia-pytorch-neuron-inference"></a>

Create a Python script called `pytorch_infer_resnet50.py` with the following content. This script downloads a sample image and uses it to run inference with the compiled model. 

```
import os
import time
import torch
import torch_neuron
import json
import numpy as np

from urllib import request

from torchvision import models, transforms, datasets

## Create an image directory containing a small kitten
os.makedirs("./torch_neuron_test/images", exist_ok=True)
request.urlretrieve("https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg",
                    "./torch_neuron_test/images/kitten_small.jpg")


## Fetch labels to output the top classifications
request.urlretrieve("https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json","imagenet_class_index.json")
idx2label = []

with open("imagenet_class_index.json", "r") as read_file:
    class_idx = json.load(read_file)
    idx2label = [class_idx[str(k)][1] for k in range(len(class_idx))]

## Import a sample image and normalize it into a tensor
normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225])

eval_dataset = datasets.ImageFolder(
    os.path.dirname("./torch_neuron_test/"),
    transforms.Compose([
    transforms.Resize([224, 224]),
    transforms.ToTensor(),
    normalize,
    ])
)

image, _ = eval_dataset[0]
image = torch.tensor(image.numpy()[np.newaxis, ...])

## Load model
model_neuron = torch.jit.load( 'resnet50_neuron.pt' )

## Predict
results = model_neuron( image )

# Get the top 5 results
top5_idx = results[0].sort()[1][-5:]

# Lookup and print the top 5 labels
top5_labels = [idx2label[idx] for idx in top5_idx]

print("Top 5 labels:\n {}".format(top5_labels) )
```
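The top-5 lookup in the script works because `Tensor.sort()` returns indices in ascending order of score, so the last five indices belong to the highest-scoring classes. The same logic in plain Python, with hypothetical scores and labels for illustration only:

```
# Hypothetical class scores and labels, for illustration only
scores = [0.1, 0.7, 0.05, 0.9, 0.3, 0.2]
labels = ["cat", "dog", "fish", "tiger", "lynx", "fox"]

# Sort indices by ascending score, then keep the last five (highest scores),
# mirroring results[0].sort()[1][-5:] in the script above
top5_idx = sorted(range(len(scores)), key=lambda i: scores[i])[-5:]
top5_labels = [labels[i] for i in top5_idx]

print(top5_labels)  # ['cat', 'fox', 'lynx', 'dog', 'tiger']
```

Note that the highest-scoring label comes last, which matches the ordering you see in the sample output below.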

Run inference with the compiled model using the following command: 

```
python pytorch_infer_resnet50.py
```

Your output should look like the following: 

```
Top 5 labels:
 ['tiger', 'lynx', 'tiger_cat', 'Egyptian_cat', 'tabby']
```

# The ARM64 DLAMI
<a name="tutorial-arm64"></a>

AWS ARM64 GPU DLAMIs are designed to provide high performance and cost efficiency for deep learning workloads. Specifically, the G5g instance type features the Arm64-based [AWS Graviton2 processor](https://aws.amazon.com/ec2/graviton/), which was built from the ground up by AWS and optimized for how customers run their workloads in the cloud. AWS ARM64 GPU DLAMIs are pre-configured with Docker, NVIDIA Docker, NVIDIA Driver, CUDA, CuDNN, NCCL, as well as popular machine learning frameworks such as TensorFlow and PyTorch.

With the G5g instance type, you can take advantage of the price and performance benefits of Graviton2 to deploy GPU-accelerated deep learning models at a significantly lower cost when compared with x86-based instances with GPU acceleration.

## Select an ARM64 DLAMI
<a name="tutorial-arm64-select-dlami"></a>

Launch a [G5g instance](https://aws.amazon.com/ec2/instance-types/g5g/) with the ARM64 DLAMI of your choice. 

For step-by-step instructions on launching a DLAMI, see [Launching and Configuring a DLAMI.](https://docs.aws.amazon.com/dlami/latest/devguide/launch-config.html) 

For a list of the most recent ARM64 DLAMIs, see the [Release Notes for DLAMI](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html).

## Get Started
<a name="tutorial-arm64-get-started"></a>

The following topics show you how to get started using the ARM64 DLAMI. 

**Topics**
+ [

## Select an ARM64 DLAMI
](#tutorial-arm64-select-dlami)
+ [

## Get Started
](#tutorial-arm64-get-started)
+ [

# Using the ARM64 GPU PyTorch DLAMI
](tutorial-arm64-pytorch.md)

# Using the ARM64 GPU PyTorch DLAMI
<a name="tutorial-arm64-pytorch"></a>

The AWS Deep Learning AMI is ready to use with Arm64 processor-based GPUs and comes optimized for PyTorch. The ARM64 GPU PyTorch DLAMI includes a Python environment pre-configured with [PyTorch](https://aws.amazon.com/pytorch), [TorchVision](https://pytorch.org/vision/stable/index.html), and [TorchServe](https://pytorch.org/serve/) for deep learning training and inference use cases.

**Topics**
+ [

## Verify PyTorch Python Environment
](#tutorial-arm64-pytorch-environment)
+ [

## Run Training Sample with PyTorch
](#tutorial-arm64-pytorch-training)
+ [

## Run Inference Sample with PyTorch
](#tutorial-arm64-pytorch-inference)

## Verify PyTorch Python Environment
<a name="tutorial-arm64-pytorch-environment"></a>

Connect to your G5g instance and activate the base Conda environment with the following command:

```
source activate base
```

Your command prompt should indicate that you are working in the base Conda environment, which contains PyTorch, TorchVision, and other libraries.

```
(base) $
```

Verify the default tool paths of the PyTorch environment:

```
(base) $ which python
(base) $ which pip
(base) $ which conda
(base) $ which mamba
```

Then start Python and verify that you can import PyTorch and TorchVision and create a tensor on the GPU:

```
(base) $ python
>>> import torch, torchvision
>>> torch.__version__
>>> torchvision.__version__
>>> v = torch.autograd.Variable(torch.randn(10, 3, 224, 224))
>>> v = torch.autograd.Variable(torch.randn(10, 3, 224, 224)).cuda()
>>> assert isinstance(v, torch.Tensor)
```

## Run Training Sample with PyTorch
<a name="tutorial-arm64-pytorch-training"></a>

Run a sample MNIST training job:

```
git clone https://github.com/pytorch/examples.git
cd examples/mnist
python main.py
```

Your output should look similar to the following:

```
...
Train Epoch: 14 [56320/60000 (94%)]    Loss: 0.021424
Train Epoch: 14 [56960/60000 (95%)]    Loss: 0.023695
Train Epoch: 14 [57600/60000 (96%)]    Loss: 0.001973
Train Epoch: 14 [58240/60000 (97%)]    Loss: 0.007121
Train Epoch: 14 [58880/60000 (98%)]    Loss: 0.003717
Train Epoch: 14 [59520/60000 (99%)]    Loss: 0.001729
Test set: Average loss: 0.0275, Accuracy: 9916/10000 (99%)
```

## Run Inference Sample with PyTorch
<a name="tutorial-arm64-pytorch-inference"></a>

Use the following commands to download a pre-trained densenet161 model and run inference using TorchServe:

```
# Set up TorchServe
cd $HOME
git clone https://github.com/pytorch/serve.git
mkdir -p serve/model_store
cd serve

# Download a pre-trained densenet161 model
wget https://download.pytorch.org/models/densenet161-8d451a50.pth >/dev/null

# Save the model using torch-model-archiver
torch-model-archiver --model-name densenet161 \
    --version 1.0 \
    --model-file examples/image_classifier/densenet_161/model.py \
    --serialized-file densenet161-8d451a50.pth \
    --handler image_classifier \
    --extra-files examples/image_classifier/index_to_name.json  \
    --export-path model_store 

# Start the model server
torchserve --start --no-config-snapshots \
    --model-store model_store \
    --models densenet161=densenet161.mar &> torchserve.log

# Wait for the model server to start
sleep 30

# Run a prediction request
curl http://127.0.0.1:8080/predictions/densenet161 -T examples/image_classifier/kitten.jpg
```

Your output should look similar to the following:

```
{
  "tiger_cat": 0.4693363308906555,
  "tabby": 0.4633873701095581,
  "Egyptian_cat": 0.06456123292446136,
  "lynx": 0.0012828150065615773,
  "plastic_bag": 0.00023322898778133094
}
```
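TorchServe returns predictions as a JSON object mapping class names to probabilities. If you call the endpoint from Python instead of `curl`, the response body can be parsed with the standard library; the sample payload below mirrors the output above, truncated for readability:

```
import json

# Sample response body in the shape TorchServe returns
body = '{"tiger_cat": 0.469, "tabby": 0.463, "Egyptian_cat": 0.065}'

predictions = json.loads(body)
top_class = max(predictions, key=predictions.get)  # highest-probability class
print(top_class, predictions[top_class])  # tiger_cat 0.469
```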

Use the following commands to unregister the densenet161 model and stop the server:

```
curl -X DELETE http://localhost:8081/models/densenet161/1.0
torchserve --stop
```

Your output should look similar to the following:

```
{
  "status": "Model \"densenet161\" unregistered"
}
TorchServe has stopped.
```

# Inference
<a name="tutorial-inference"></a>

This section provides tutorials on how to run inference using the DLAMI's frameworks and tools.

## Inference Tools
<a name="tutorial-inference-tools"></a>
+ [TensorFlow Serving](tutorial-tfserving.md)

# Model Serving
<a name="model-serving"></a>

The following are model serving options installed on the Deep Learning AMI with Conda. Click on one of the options to learn how to use it.

**Topics**
+ [

# TensorFlow Serving
](tutorial-tfserving.md)
+ [

# TorchServe
](tutorial-torchserve.md)

# TensorFlow Serving
<a name="tutorial-tfserving"></a>

[TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving) is a flexible, high-performance serving system for machine learning models.

The `tensorflow-serving-api` is pre-installed with the single-framework DLAMI. To use TensorFlow Serving, first activate the TensorFlow environment.

```
$ source /opt/tensorflow/bin/activate
```

Then use your preferred text editor to create a script with the following content. Name it `test_train_mnist.py`. This script is adapted from the [TensorFlow beginner quickstart tutorial](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/quickstart/beginner.ipynb), and it trains and evaluates a neural network model that classifies images.

```
import tensorflow as tf
mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)
```

Now run the script:

```
$ /opt/tensorflow/bin/python3 test_train_mnist.py
```

 Be patient, as this script may take a while before producing any output. When training is complete, you should see output like the following: 

```
I0000 00:00:1739482012.389276    4284 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
1875/1875 [==============================] - 24s 2ms/step - loss: 0.2973 - accuracy: 0.9134 
Epoch 2/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.1422 - accuracy: 0.9582
Epoch 3/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.1076 - accuracy: 0.9687
Epoch 4/5
1875/1875 [==============================] - 3s 2ms/step - loss: 0.0872 - accuracy: 0.9731
Epoch 5/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0731 - accuracy: 0.9771
313/313 [==============================] - 0s 1ms/step - loss: 0.0749 - accuracy: 0.9780
```

## More Features and Examples
<a name="tutorial-tfserving-project"></a>

If you are interested in learning more about TensorFlow Serving, check out the [TensorFlow website](https://www.tensorflow.org/serving/).

# TorchServe
<a name="tutorial-torchserve"></a>

TorchServe is a flexible tool for serving deep learning models that have been exported from PyTorch. TorchServe comes preinstalled with the Deep Learning AMI with Conda. 

For more information on using TorchServe, see [Model Server for PyTorch Documentation](https://github.com/pytorch/serve/blob/master/docs/README.md). 

 **Topics** 
+ [

## Serve an Image Classification Model on TorchServe
](#tutorial-torchserve-serving)

## Serve an Image Classification Model on TorchServe
<a name="tutorial-torchserve-serving"></a>

This tutorial shows how to serve an image classification model with TorchServe. It uses a DenseNet-161 model provided by PyTorch. Once the server is running, it listens for prediction requests. When you upload an image, in this case, an image of a kitten, the server returns a prediction of the top 5 matching classes out of the classes that the model was trained on. 

**To serve an example image classification model on TorchServe**

1. Connect to an Amazon Elastic Compute Cloud (Amazon EC2) instance with Deep Learning AMI with Conda v34 or later. 

1. Activate the `pytorch_p310` environment. 

   ```
   source activate pytorch_p310
   ```

1. Clone the TorchServe repository, then create a directory to store your models.  

   ```
   git clone https://github.com/pytorch/serve.git
   mkdir model_store
   ```

1. Archive the model using the model archiver. The `extra-files` parameter uses a file from the TorchServe repo, so update the path if necessary. For more information about the model archiver, see [Torch Model Archiver for TorchServe](https://github.com/pytorch/serve/blob/master/model-archiver/README.md). 

   ```
   wget https://download.pytorch.org/models/densenet161-8d451a50.pth
   torch-model-archiver --model-name densenet161 --version 1.0 --model-file ./serve/examples/image_classifier/densenet_161/model.py --serialized-file densenet161-8d451a50.pth --export-path model_store --extra-files ./serve/examples/image_classifier/index_to_name.json --handler image_classifier
   ```

1. Run TorchServe to start an endpoint. Adding `> /dev/null` quiets the log output. 

   ```
   torchserve --start --ncs --model-store model_store --models densenet161.mar > /dev/null
   ```

1. Download an image of a kitten and send it to the TorchServe predict endpoint: 

   ```
   curl -O https://s3.amazonaws.com/model-server/inputs/kitten.jpg
   curl http://127.0.0.1:8080/predictions/densenet161 -T kitten.jpg
   ```

   The predict endpoint returns a prediction in JSON similar to the following top five predictions, where the image has a 47% probability of containing a tiger cat, followed by a 46% chance that it contains a tabby cat. 

   ```
   {
    "tiger_cat": 0.46933576464653015,
    "tabby": 0.463387668132782,
    "Egyptian_cat": 0.0645613968372345,
    "lynx": 0.0012828196631744504,
    "plastic_bag": 0.00023323058849200606
   }
   ```

1. When you finish testing, stop the server: 

   ```
   torchserve --stop
   ```

 **Other Examples** 

TorchServe has a variety of examples that you can run on your DLAMI instance. You can view them on [the TorchServe project repository examples page](https://github.com/pytorch/serve/tree/master/examples). 

 **More Info** 

 For more TorchServe documentation, including how to set up TorchServe with Docker and the latest TorchServe features, see [the TorchServe project page](https://github.com/pytorch/serve) on GitHub. 