Training
This section shows how to run training on AWS Deep Learning Containers for Amazon EC2 using PyTorch and TensorFlow.
PyTorch training
To begin training with PyTorch from your Amazon EC2 instance, use the following commands to run the container. For GPU images, you must use nvidia-docker, or the --gpus flag if you have docker-ce 19.03 or later.
-
For CPU
$
docker run -it <CPU training container>
-
For GPU
$
nvidia-docker run -it <GPU training container>
-
If you have docker-ce version 19.03 or later, you can use the --gpus flag with docker:
$
docker run -it --gpus all <GPU training container>
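The choice between the three commands above can be summarized in a small Python sketch. `build_run_command` is a hypothetical helper (not an AWS or Docker API) that only assembles the argument list and never starts a container; it assumes you already know your docker-ce version, for example from `docker --version`, and that `--gpus all` (expose every GPU) is the request you want.

```python
# Hypothetical helper, not part of AWS Deep Learning Containers: it only
# builds the argument list for the commands shown above and does not run them.
import shlex
from typing import List


def build_run_command(image: str, use_gpu: bool, docker_version: str = "19.03") -> List[str]:
    """Return the argv for starting a training container interactively.

    CPU images use plain `docker run -it`. GPU images use `docker run --gpus all`
    when docker-ce is 19.03 or later, and `nvidia-docker run` otherwise.
    """
    if not use_gpu:
        return ["docker", "run", "-it", image]
    # Compare only the major.minor part, e.g. "20.10.7" -> (20, 10).
    major, minor = (int(part) for part in docker_version.split(".")[:2])
    if (major, minor) >= (19, 3):
        return ["docker", "run", "-it", "--gpus", "all", image]
    return ["nvidia-docker", "run", "-it", image]


if __name__ == "__main__":
    image = "<GPU training container>"  # placeholder, replace with a real image URI
    print(shlex.join(build_run_command(image, use_gpu=True, docker_version="18.09")))
    print(shlex.join(build_run_command(image, use_gpu=True, docker_version="20.10.7")))
```

Comparing (major, minor) tuples rather than raw strings avoids ordering mistakes such as "9.0" sorting after "19.03".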
Run the following to begin training.
-
For CPU
$
git clone https://github.com/pytorch/examples.git
$
python examples/mnist/main.py --no-cuda
-
For GPU
$
git clone https://github.com/pytorch/examples.git
$
python examples/mnist/main.py
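The only difference between the CPU and GPU commands is the `--no-cuda` flag. Versions of the MNIST example that accept this flag combine it with `torch.cuda.is_available()`, roughly as in the sketch below. The function names are hypothetical, `cuda_available` stands in for `torch.cuda.is_available()` so the logic runs without PyTorch, and newer revisions of pytorch/examples may use a different flag, so check `python examples/mnist/main.py --help` in your clone.

```python
# Sketch of how a training script can turn a --no-cuda flag into a device choice.
# In real PyTorch code `cuda_available` would come from torch.cuda.is_available();
# here it is a plain parameter so the logic runs without PyTorch installed.
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="MNIST device selection sketch")
    # argparse stores "--no-cuda" as the attribute "no_cuda".
    parser.add_argument("--no-cuda", action="store_true", default=False,
                        help="disable CUDA training even if a GPU is present")
    return parser.parse_args(argv)


def select_device(no_cuda: bool, cuda_available: bool) -> str:
    """Return "cuda" only when CUDA is allowed and actually available."""
    return "cuda" if (not no_cuda and cuda_available) else "cpu"


if __name__ == "__main__":
    args = parse_args(["--no-cuda"])  # mirrors the CPU command above
    print(select_device(args.no_cuda, cuda_available=True))
```

Because the GPU path also requires CUDA to be available, running the GPU command in a container that cannot see a GPU would fall back to the CPU rather than fail at device selection.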
PyTorch distributed GPU training with NVIDIA Apex
NVIDIA Apex is a PyTorch extension with utilities for mixed precision and distributed training. For more information on the utilities offered with Apex, see the NVIDIA Apex website.
To begin distributed training using NVIDIA Apex, run the following in the terminal of the GPU training container. This example requires at least two GPUs on your Amazon EC2 instance to run parallel distributed training.
$
git clone https://github.com/NVIDIA/apex.git && cd apex
$
python -m torch.distributed.launch --nproc_per_node=2 examples/simple/distributed/distributed_data_parallel.py
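With --nproc_per_node=2, the launcher starts two processes on the instance, one per GPU, and tells each one its place in the job. The sketch below is a hypothetical illustration, not the launcher's source code, of the standard variables involved (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK, LOCAL_RANK); the 127.0.0.1 and 29500 defaults match the launcher's usual defaults. Depending on the PyTorch version, the local rank may instead arrive as a --local_rank command-line argument.

```python
# Hypothetical illustration (not the launcher's source code) of the environment
# a per-node launcher such as torch.distributed.launch hands to each process.
from typing import Dict, List


def worker_envs(nproc_per_node: int, nnodes: int = 1, node_rank: int = 0,
                master_addr: str = "127.0.0.1", master_port: int = 29500,
                gpus_available: int = 2) -> List[Dict[str, str]]:
    """Return one environment dict per process started on this node."""
    if nproc_per_node > gpus_available:
        # One process per GPU: more processes than GPUs cannot each get a device.
        raise ValueError(f"{nproc_per_node} processes requested but only "
                         f"{gpus_available} GPUs available")
    world_size = nnodes * nproc_per_node
    envs = []
    for local_rank in range(nproc_per_node):
        envs.append({
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
            "WORLD_SIZE": str(world_size),
            # Global rank: processes on lower-numbered nodes come first.
            "RANK": str(node_rank * nproc_per_node + local_rank),
            # The local rank usually selects the GPU this process uses.
            "LOCAL_RANK": str(local_rank),
        })
    return envs


if __name__ == "__main__":
    for env in worker_envs(nproc_per_node=2):
        print(env["RANK"], env["LOCAL_RANK"], env["WORLD_SIZE"])
```

The GPU-count check reflects why the example needs at least two GPUs: each of the two processes expects a device of its own.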
TensorFlow training
After you log in to your Amazon EC2 instance, you can run TensorFlow and TensorFlow 2 containers with the following commands. You must use nvidia-docker for GPU images.
-
For CPU-based training, run the following.
$
docker run -it <CPU training container>
-
For GPU-based training, run the following.
$
nvidia-docker run -it <GPU training container>
The previous command runs the container in interactive mode and provides a shell prompt inside the container. You can then run the following to import TensorFlow.
$
python
>>> import tensorflow
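At the same Python prompt, or in a script, you can also check whether TensorFlow detects the GPUs that nvidia-docker exposes. `visible_gpus` below is a hypothetical helper, not a TensorFlow API; it assumes TensorFlow 2.1 or later, where `tf.config.list_physical_devices` is available, and it returns an empty list on a CPU image.

```python
# Sketch: report which GPUs TensorFlow can see, without failing if TensorFlow
# is not installed. Assumes TensorFlow 2.1 or later for tf.config.list_physical_devices.
import importlib.util
from typing import List, Optional


def visible_gpus() -> Optional[List[str]]:
    """Return GPU device names seen by TensorFlow, or None if it is not installed."""
    if importlib.util.find_spec("tensorflow") is None:
        return None
    import tensorflow as tf
    # Each PhysicalDevice has a name such as "/physical_device:GPU:0".
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    else:
        print(f"TensorFlow sees {len(gpus)} GPU(s): {gpus}")
```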
Press Ctrl+D to return to the bash prompt. Run the following to begin training:
$
git clone https://github.com/fchollet/keras.git
$
cd keras
$
python examples/mnist_cnn.py
Next steps
To learn about inference on Amazon EC2 using PyTorch with Deep Learning Containers, see PyTorch Inference.