Distributed GPU Training
This section shows how to run distributed training on multi-node GPU clusters.
Set up your cluster for distributed training
To run distributed training on EKS, you need the following components installed on your cluster.
- The default installation of Kubeflow with required components, such as PyTorch operators, and the NVIDIA plugin.
- MPI operators.
Download and run the script to install the required components in the cluster.
$ wget -O install_kubeflow.sh https://raw.githubusercontent.com/aws/deep-learning-containers/master/test/dlc_tests/eks/eks_manifest_templates/kubeflow/install_kubeflow.sh
$ chmod +x install_kubeflow.sh
$ ./install_kubeflow.sh <EKS_CLUSTER_NAME> <AWS_REGION>
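For example, with a hypothetical cluster named pytorch-training-cluster in us-east-1 (substitute your own cluster name and Region), the final command and an optional follow-up check might look like the following.

# Hypothetical cluster name and Region; replace with your own values.
$ ./install_kubeflow.sh pytorch-training-cluster us-east-1

# Optional: confirm that the Kubeflow components are running. This assumes
# they are installed into the kubeflow namespace, which the daemonset check
# later in this section also uses.
$ kubectl get pods -n kubeflow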
PyTorch distributed GPU training
This tutorial guides you through distributed training with PyTorch on your multi-node GPU cluster. It uses Gloo as the backend.
- Verify that the PyTorch custom resource is installed.

  $ kubectl get crd

  The output should include pytorchjobs.kubeflow.org.
- Ensure that the NVIDIA plugin daemonset is running.

  $ kubectl get daemonset -n kubeflow

  The output should look similar to the following.

  NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
  nvidia-device-plugin-daemonset   3         3         3       3            3           <none>          35h
- Use the following text to create a gloo-based distributed data parallel job. Save it in a file named distributed.yaml.

  apiVersion: kubeflow.org/v1
  kind: PyTorchJob
  metadata:
    name: "kubeflow-pytorch-gpu-dist-job"
  spec:
    pytorchReplicaSpecs:
      Master:
        replicas: 1
        restartPolicy: OnFailure
        template:
          spec:
            containers:
              - name: "pytorch"
                image: "763104351884.dkr.ecr.us-east-1.amazonaws.com/aws-samples-pytorch-training:1.7.1-gpu-py36-cu110-ubuntu18.04-example"
                args:
                  - "--backend"
                  - "gloo"
                  - "--epochs"
                  - "5"
      Worker:
        replicas: 2
        restartPolicy: OnFailure
        template:
          spec:
            containers:
              - name: "pytorch"
                image: "763104351884.dkr.ecr.us-east-1.amazonaws.com/aws-samples-pytorch-training:1.7.1-gpu-py36-cu110-ubuntu18.04-example"
                args:
                  - "--backend"
                  - "gloo"
                  - "--epochs"
                  - "5"
                resources:
                  limits:
                    nvidia.com/gpu: 1
- Run a distributed training job with the job manifest you just created. (Additional status-check and cleanup commands are shown after these steps.)

  $ kubectl create -f distributed.yaml
- You can check the status of the job using the following:

  $ kubectl logs kubeflow-pytorch-gpu-dist-job

  To view logs continuously, use:

  $ kubectl logs -f <pod>
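In addition to the logs, you can inspect the job and its pods directly with kubectl. The following is a minimal sketch; the pod names assume the common PyTorchJob convention of <job-name>-master-0 and <job-name>-worker-N, which may vary across Kubeflow versions.

# List the PyTorchJob custom resources created from the manifest.
$ kubectl get pytorchjobs

# List the pods the job created. Under the naming convention assumed above,
# they appear as kubeflow-pytorch-gpu-dist-job-master-0,
# kubeflow-pytorch-gpu-dist-job-worker-0, and so on.
$ kubectl get pods

# Show per-replica status and events for the job.
$ kubectl describe pytorchjob kubeflow-pytorch-gpu-dist-job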
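When the job has finished, or if you want to stop it early, delete it with the same manifest. This also removes the pods that the job created.

# Delete the PyTorchJob created from distributed.yaml.
$ kubectl delete -f distributed.yaml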