Run Training with EFA
SageMaker AI provides integration with EFA devices to accelerate High Performance Computing (HPC) and machine learning applications. This integration allows you to leverage an EFA device when running your distributed training jobs. You can add EFA integration to an existing Docker container that you bring to SageMaker AI. The following information outlines how to configure your own container to use an EFA device for your distributed training jobs.
Prerequisites
Your container must satisfy the SageMaker Training container specification.
Install EFA and required packages
Your container must download and install the EFA software. This allows your container to recognize the EFA device, and provides compatible versions of Libfabric and Open MPI.
Any tools like MPI and NCCL must be installed and managed inside the container to be used as part of your EFA-enabled training job. For a list of all available EFA versions, see Verify the EFA installer using a checksum. The following example shows how to modify the Dockerfile of your EFA-enabled container to install EFA, MPI, OFI, NCCL, and NCCL-TEST.
Note
When using PyTorch with EFA on your container, the NCCL version of your container should match the NCCL version of your PyTorch installation. To verify the PyTorch NCCL version, use the following command:
torch.cuda.nccl.version()
ARG OPEN_MPI_PATH=/opt/amazon/openmpi/ ENV NCCL_VERSION=2.7.8 ENV EFA_VERSION=1.30.0 ENV BRANCH_OFI=1.1.1 ################################################# ## EFA and MPI SETUP RUN cd $HOME \ && curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-${EFA_VERSION}.tar.gz \ && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \ && cd aws-efa-installer \ && ./efa_installer.sh -y --skip-kmod -g \ ENV PATH="$OPEN_MPI_PATH/bin:$PATH" ENV LD_LIBRARY_PATH="$OPEN_MPI_PATH/lib/:$LD_LIBRARY_PATH" ################################################# ## NCCL, OFI, NCCL-TEST SETUP RUN cd $HOME \ && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \ && cd nccl \ && make -j64 src.build BUILDDIR=/usr/local RUN apt-get update && apt-get install -y autoconf RUN cd $HOME \ && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \ && cd aws-ofi-nccl \ && ./autogen.sh \ && ./configure --with-libfabric=/opt/amazon/efa \ --with-mpi=/opt/amazon/openmpi \ --with-cuda=/usr/local/cuda \ --with-nccl=/usr/local --prefix=/usr/local \ && make && make install RUN cd $HOME \ && git clone https://github.com/NVIDIA/nccl-tests \ && cd nccl-tests \ && make MPI=1 MPI_HOME=/opt/amazon/openmpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/local
Considerations when creating your container
The EFA device is mounted to the container as /dev/infiniband/uverbs0
under
the list of devices accessible to the container. On P4d instances, the container has access to
4 EFA devices. The EFA devices can be found in the list of devices accessible to the container
as:
-
/dev/infiniband/uverbs0
-
/dev/infiniband/uverbs1
-
/dev/infiniband/uverbs2
-
/dev/infiniband/uverbs3
To get information about hostname, peer hostnames, and network interface (for MPI) from
the resourceconfig.json
file provided to each container instances, see Distributed Training Configuration. Your container handles regular TCP traffic
among peers through the default Elastic Network Interfaces (ENI), while handling OFI (kernel
bypass) traffic through the EFA device.
Verify that your EFA device is recognized
To verify that the EFA device is recognized, run the following command from within your container.
/opt/amazon/efa/bin/fi_info -p efa
Your output should look similar to the following.
provider: efa fabric: EFA-fe80::e5:56ff:fe34:56a8 domain: efa_0-rdm version: 2.0 type: FI_EP_RDM protocol: FI_PROTO_EFA provider: efa fabric: EFA-fe80::e5:56ff:fe34:56a8 domain: efa_0-dgrm version: 2.0 type: FI_EP_DGRAM protocol: FI_PROTO_EFA provider: efa;ofi_rxd fabric: EFA-fe80::e5:56ff:fe34:56a8 domain: efa_0-dgrm version: 1.0 type: FI_EP_RDM protocol: FI_PROTO_RXD
Running a training job with EFA
Once you’ve created an EFA-enabled container, you can run a training job with EFA using a SageMaker AI Estimator the same way as you would with any other Docker image. For more information on registering your container and using it for training, see Adapting Your Own Training Container.