Create your own Docker container with the SageMaker distributed data parallel library
To build your own Docker container for training and use the SageMaker data parallel library, you must include the correct dependencies and the binary files of the SageMaker distributed parallel libraries in your Dockerfile. This section provides instructions on how to create a complete Dockerfile with the minimum set of dependencies for distributed training in SageMaker using the data parallel library.
Note
This custom Docker option with the SageMaker data parallel library as a binary is available only for PyTorch.
To create a Dockerfile with the SageMaker training toolkit and the data parallel library
-
Start with a Docker image from NVIDIA CUDA
. Use the cuDNN developer versions that contain CUDA runtime and development tools (headers and libraries) to build from the PyTorch source code . FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
Tip
The official AWS Deep Learning Container (DLC) images are built from the NVIDIA CUDA base images
. If you want to use the prebuilt DLC images as references while following the rest of the instructions, see the AWS Deep Learning Containers for PyTorch Dockerfiles . -
Add the following arguments to specify versions of PyTorch and other packages. Also, indicate the Amazon S3 bucket paths to the SageMaker data parallel library and other software to use AWS resources, such as the Amazon S3 plug-in.
To use versions of the third party libraries other than the ones provided in the following code example, we recommend you look into the official Dockerfiles of AWS Deep Learning Container for PyTorch
to find versions that are tested, compatible, and suitable for your application. To find URLs for the
SMDATAPARALLEL_BINARY
argument, see the lookup tables at Supported frameworks.ARG PYTORCH_VERSION=
1.10.2
ARG PYTHON_SHORT_VERSION=3.8
ARG EFA_VERSION=1.14.1
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl ARG CONDA_PREFIX="/opt/conda" ARG BRANCH_OFI=1.1.3-aws
-
Set the following environment variables to properly build SageMaker training components and run the data parallel library. You use these variables for the components in the subsequent steps.
# Set ENV variables required to build PyTorch ENV TORCH_CUDA_ARCH_LIST="7.0+PTX 8.0" ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all" ENV NCCL_VERSION=2.10.3 # Add OpenMPI to the path. ENV PATH /opt/amazon/openmpi/bin:$PATH # Add Conda to path ENV PATH $CONDA_PREFIX/bin:$PATH # Set this enviroment variable for SageMaker to launch SMDDP correctly. ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main # Add enviroment variable for processes to be able to call fork() ENV RDMAV_FORK_SAFE=1 # Indicate the container type ENV DLC_CONTAINER_TYPE=training # Add EFA and SMDDP to LD library path ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH" ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH
-
Install or update
curl
,wget
, andgit
to download and build packages in the subsequent steps.RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ apt-get update && apt-get install -y --no-install-recommends \ curl \ wget \ git \ && rm -rf /var/lib/apt/lists/*
-
Install Elastic Fabric Adapter (EFA) software for Amazon EC2 network communication.
RUN DEBIAN_FRONTEND=noninteractive apt-get update RUN mkdir /tmp/efa \ && cd /tmp/efa \ && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \ && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \ && cd aws-efa-installer \ && ./efa_installer.sh -y --skip-kmod -g \ && rm -rf /tmp/efa
-
Install Conda
to handle package management. RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \ chmod +x ~/miniconda.sh && \ ~/miniconda.sh -b -p $CONDA_PREFIX && \ rm ~/miniconda.sh && \ $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \ $CONDA_PREFIX/bin/conda clean -ya
-
Get, build, and install PyTorch and its dependencies. We build PyTorch from the source code
because we need to have control of the NCCL version to guarantee compatibility with the AWS OFI NCCL plug-in . -
Following the steps in the PyTorch official dockerfile
, install build dependencies and set up ccache to speed up recompilation. RUN DEBIAN_FRONTEND=noninteractive \ apt-get install -y --no-install-recommends \ build-essential \ ca-certificates \ ccache \ cmake \ git \ libjpeg-dev \ libpng-dev \ && rm -rf /var/lib/apt/lists/* # Setup ccache RUN /usr/sbin/update-ccache-symlinks RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
-
Install PyTorch’s common and Linux dependencies
. # Common dependencies for PyTorch RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses # Linux specific dependency for PyTorch RUN conda install -c pytorch magma-cuda113
-
Clone the PyTorch GitHub repository
. RUN --mount=type=cache,target=/opt/ccache \ cd / \ && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
-
Install and build a specific NCCL
version. To do this, replace the content in the PyTorch’s default NCCL folder ( /pytorch/third_party/nccl
) with the specific NCCL version from the NVIDIA repository. The NCCL version was set in the step 3 of this guide.RUN cd /pytorch/third_party/nccl \ && rm -rf nccl \ && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \ && cd nccl \ && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \ && make pkg.txz.build \ && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1
-
Build and install PyTorch. This process usually takes slightly more than 1 hour to complete. It is built using the NCCL version downloaded in a previous step.
RUN cd /pytorch \ && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \ python setup.py install \ && rm -rf /pytorch
-
-
Build and install AWS OFI NCCL plugin
. This enables libfabric support for the SageMaker data parallel library. RUN DEBIAN_FRONTEND=noninteractive apt-get update \ && apt-get install -y --no-install-recommends \ autoconf \ automake \ libtool RUN mkdir /tmp/efa-ofi-nccl \ && cd /tmp/efa-ofi-nccl \ && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \ && cd aws-ofi-nccl \ && ./autogen.sh \ && ./configure --with-libfabric=/opt/amazon/efa \ --with-mpi=/opt/amazon/openmpi \ --with-cuda=/usr/local/cuda \ --with-nccl=$CONDA_PREFIX \ && make \ && make install \ && rm -rf /tmp/efa-ofi-nccl
-
Build and install TorchVision
. RUN pip install --no-cache-dir -U \ packaging \ mpi4py==3.0.3 RUN cd /tmp \ && git clone https://github.com/pytorch/vision.git -b v0.9.1 \ && cd vision \ && BUILD_VERSION="0.9.1+cu111" python setup.py install \ && cd /tmp \ && rm -rf vision
-
Install and configure OpenSSH. OpenSSH is required for MPI to communicate between containers. Allow OpenSSH to talk to containers without asking for confirmation.
RUN apt-get update \ && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \ && apt-get install -y --no-install-recommends openssh-client openssh-server \ && mkdir -p /var/run/sshd \ && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \ && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \ && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \ && rm -rf /var/lib/apt/lists/* # Configure OpenSSH so that nodes can communicate with each other RUN mkdir -p /var/run/sshd && \ sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd RUN rm -rf /root/.ssh/ && \ mkdir -p /root/.ssh/ && \ ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \ cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \ && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config
-
Install the PT S3 plug-in to efficiently access datasets in Amazon S3.
RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU} RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt
-
Install the libboost
library. This package is needed for networking the asynchronous IO functionality of the SageMaker data parallel library. WORKDIR / RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \ && tar -xzf boost_1_73_0.tar.gz \ && cd boost_1_73_0 \ && ./bootstrap.sh \ && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \ && cd .. \ && rm -rf boost_1_73_0.tar.gz \ && rm -rf boost_1_73_0 \ && cd ${CONDA_PREFIX}/include/boost
-
Install the following SageMaker tools for PyTorch training.
WORKDIR /root RUN pip install --no-cache-dir -U \ smclarify \ "sagemaker>=2,<3" \ sagemaker-experiments==0.* \ sagemaker-pytorch-training
-
Finally, install the SageMaker data parallel binary and the remaining dependencies.
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ apt-get update && apt-get install -y --no-install-recommends \ jq \ libhwloc-dev \ libnuma1 \ libnuma-dev \ libssl1.1 \ libtool \ hwloc \ && rm -rf /var/lib/apt/lists/* RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
-
After you finish creating the Dockerfile, see Adapting Your Own Training Container to learn how to build the Docker container, host it in Amazon ECR, and run a training job using the SageMaker Python SDK.
The following example code shows a complete Dockerfile after combining all the previous code blocks.
# This file creates a docker image with minimum dependencies to run SageMaker data parallel training FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04 # Set appropiate versions and location for components ARG PYTORCH_VERSION=1.10.2 ARG PYTHON_SHORT_VERSION=3.8 ARG EFA_VERSION=1.14.1 ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl ARG CONDA_PREFIX="/opt/conda" ARG BRANCH_OFI=1.1.3-aws # Set ENV variables required to build PyTorch ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0" ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all" ENV NCCL_VERSION=2.10.3 # Add OpenMPI to the path. ENV PATH /opt/amazon/openmpi/bin:$PATH # Add Conda to path ENV PATH $CONDA_PREFIX/bin:$PATH # Set this enviroment variable for SageMaker to launch SMDDP correctly. ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main # Add enviroment variable for processes to be able to call fork() ENV RDMAV_FORK_SAFE=1 # Indicate the container type ENV DLC_CONTAINER_TYPE=training # Add EFA and SMDDP to LD library path ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH" ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH # Install basic dependencies to download and build other dependencies RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ apt-get update && apt-get install -y --no-install-recommends \ curl \ wget \ git \ && rm -rf /var/lib/apt/lists/* # Install EFA. # This is required for SMDDP backend communication RUN DEBIAN_FRONTEND=noninteractive apt-get update RUN mkdir /tmp/efa \ && cd /tmp/efa \ && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \ && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \ && cd aws-efa-installer \ && ./efa_installer.sh -y --skip-kmod -g \ && rm -rf /tmp/efa # Install Conda RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \ chmod +x ~/miniconda.sh && \ ~/miniconda.sh -b -p $CONDA_PREFIX && \ rm ~/miniconda.sh && \ $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \ $CONDA_PREFIX/bin/conda clean -ya # Install PyTorch. # Start with dependencies listed in official PyTorch dockerfile # https://github.com/pytorch/pytorch/blob/master/Dockerfile RUN DEBIAN_FRONTEND=noninteractive \ apt-get install -y --no-install-recommends \ build-essential \ ca-certificates \ ccache \ cmake \ git \ libjpeg-dev \ libpng-dev && \ rm -rf /var/lib/apt/lists/* # Setup ccache RUN /usr/sbin/update-ccache-symlinks RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache # Common dependencies for PyTorch RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses # Linux specific dependency for PyTorch RUN conda install -c pytorch magma-cuda113 # Clone PyTorch RUN --mount=type=cache,target=/opt/ccache \ cd / \ && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION} # Note that we need to use the same NCCL version for PyTorch and OFI plugin. # To enforce that, install NCCL from source before building PT and OFI plugin. # Install NCCL. # Required for building OFI plugin (OFI requires NCCL's header files and library) RUN cd /pytorch/third_party/nccl \ && rm -rf nccl \ && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \ && cd nccl \ && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \ && make pkg.txz.build \ && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1 # Build and install PyTorch. RUN cd /pytorch \ && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \ python setup.py install \ && rm -rf /pytorch RUN ccache -C # Build and install OFI plugin. \ # It is required to use libfabric. RUN DEBIAN_FRONTEND=noninteractive apt-get update \ && apt-get install -y --no-install-recommends \ autoconf \ automake \ libtool RUN mkdir /tmp/efa-ofi-nccl \ && cd /tmp/efa-ofi-nccl \ && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \ && cd aws-ofi-nccl \ && ./autogen.sh \ && ./configure --with-libfabric=/opt/amazon/efa \ --with-mpi=/opt/amazon/openmpi \ --with-cuda=/usr/local/cuda \ --with-nccl=$CONDA_PREFIX \ && make \ && make install \ && rm -rf /tmp/efa-ofi-nccl # Build and install Torchvision RUN pip install --no-cache-dir -U \ packaging \ mpi4py==3.0.3 RUN cd /tmp \ && git clone https://github.com/pytorch/vision.git -b v0.9.1 \ && cd vision \ && BUILD_VERSION="0.9.1+cu111" python setup.py install \ && cd /tmp \ && rm -rf vision # Install OpenSSH. # Required for MPI to communicate between containers, allow OpenSSH to talk to containers without asking for confirmation RUN apt-get update \ && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \ && apt-get install -y --no-install-recommends openssh-client openssh-server \ && mkdir -p /var/run/sshd \ && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \ && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \ && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \ && rm -rf /var/lib/apt/lists/* # Configure OpenSSH so that nodes can communicate with each other RUN mkdir -p /var/run/sshd && \ sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd RUN rm -rf /root/.ssh/ && \ mkdir -p /root/.ssh/ && \ ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \ cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \ && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config # Install PT S3 plugin. # Required to efficiently access datasets in Amazon S3 RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU} RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt # Install libboost from source. # This package is needed for smdataparallel functionality (for networking asynchronous IO). WORKDIR / RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \ && tar -xzf boost_1_73_0.tar.gz \ && cd boost_1_73_0 \ && ./bootstrap.sh \ && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \ && cd .. \ && rm -rf boost_1_73_0.tar.gz \ && rm -rf boost_1_73_0 \ && cd ${CONDA_PREFIX}/include/boost # Install SageMaker PyTorch training. WORKDIR /root RUN pip install --no-cache-dir -U \ smclarify \ "sagemaker>=2,<3" \ sagemaker-experiments==0.* \ sagemaker-pytorch-training # Install SageMaker data parallel binary (SMDDP) # Start with dependencies RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ apt-get update && apt-get install -y --no-install-recommends \ jq \ libhwloc-dev \ libnuma1 \ libnuma-dev \ libssl1.1 \ libtool \ hwloc \ && rm -rf /var/lib/apt/lists/* # Install SMDDP RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
Tip
For more general information about creating a custom Dockerfile for training in SageMaker, see Use Your Own Training Algorithms.
Tip
If you want to extend the custom Dockerfile to incorporate the SageMaker model parallel library, see Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library.