使用 SageMaker AI 分散式資料平行程式庫建立您自己的 Docker 容器 - Amazon SageMaker AI

本文為英文版的機器翻譯版本,如內容有任何歧義或不一致之處,概以英文版為準。

使用 SageMaker AI 分散式資料平行程式庫建立您自己的 Docker 容器

若要建置自己的 Docker 容器進行訓練並使用 SageMaker AI 資料平行程式庫,您必須在 Dockerfile 中包含正確的相依性和 SageMaker AI 分散式平行程式庫的二進位檔案。本節說明如何使用資料平行程式庫,在 SageMaker AI 中建立具有最小組分散式訓練相依性的完整 Dockerfile。

注意

此自訂 Docker 選項搭配 SageMaker AI 資料平行程式庫做為二進位檔,僅適用於 PyTorch。

若要使用 SageMaker 訓練工具組和資料平行程式庫建立 Dockerfile
  1. NVIDIA CUDA 的 Docker 映像開始。使用包含 CUDA 執行期和開發工具 (標題和程式庫) 的 cuDNN 開發人員版本,從 PyTorch 原始程式碼進行建置。

    FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
    提示

    官方 AWS 的深度學習容器 (DLC) 映像是從 NVIDIA CUDA 基礎映像建置而成。如想使用預先建置的 DLC 映像做為參考,同時遵循其餘指示,請參閱適用於 PyTorch Dockerfiles 的AWS 深度學習容器

  2. 新增下列引數以指定 PyTorch 和其他套件的版本。此外,指出 SageMaker AI 資料平行程式庫和其他軟體的 Amazon S3 儲存貯體路徑,以使用 AWS 資源,例如 Amazon S3 外掛程式。

    若要使用下列程式碼範例所提供版本以外的第三方程式庫版本,建議您查看適用於 PyTorch 的 AWS 深度學習容器官方 Dockerfiles,以尋找經過測試、相容且適合您的應用程式的版本。

    若要尋找 SMDATAPARALLEL_BINARY 引數URLs,請參閱位於 的查詢資料表支援的架構

    ARG PYTORCH_VERSION=1.10.2 ARG PYTHON_SHORT_VERSION=3.8 ARG EFA_VERSION=1.14.1 ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl ARG CONDA_PREFIX="/opt/conda" ARG BRANCH_OFI=1.1.3-aws
  3. 設定下列環境變數,以妥善建置 SageMaker 訓練元件,並執行資料平行程式庫。您可以在後續步驟中將這些變數用於元件。

    # Set ENV variables required to build PyTorch ENV TORCH_CUDA_ARCH_LIST="7.0+PTX 8.0" ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all" ENV NCCL_VERSION=2.10.3 # Add OpenMPI to the path. ENV PATH /opt/amazon/openmpi/bin:$PATH # Add Conda to path ENV PATH $CONDA_PREFIX/bin:$PATH # Set this enviroment variable for SageMaker AI to launch SMDDP correctly. ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main # Add enviroment variable for processes to be able to call fork() ENV RDMAV_FORK_SAFE=1 # Indicate the container type ENV DLC_CONTAINER_TYPE=training # Add EFA and SMDDP to LD library path ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH" ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH
  4. 在後續步驟中安裝或更新 curlwgetgit,以下載並建置套件。

    RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ apt-get update && apt-get install -y --no-install-recommends \ curl \ wget \ git \ && rm -rf /var/lib/apt/lists/*
  5. 為 Amazon EC2 網路通訊安裝 Elastic Fabric Adapter (EFA) 軟體。

    RUN DEBIAN_FRONTEND=noninteractive apt-get update RUN mkdir /tmp/efa \ && cd /tmp/efa \ && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \ && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \ && cd aws-efa-installer \ && ./efa_installer.sh -y --skip-kmod -g \ && rm -rf /tmp/efa
  6. 安裝 Conda 以處理套件管理。

    RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \ chmod +x ~/miniconda.sh && \ ~/miniconda.sh -b -p $CONDA_PREFIX && \ rm ~/miniconda.sh && \ $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \ $CONDA_PREFIX/bin/conda clean -ya
  7. 獲取、建置和安裝 PyTorch 及其相依項。我們從原始程式碼建置 PyTorch,因為我們需要控制 NCCL 版本,以確保與 AWS OFI NCCL 外掛程式的相容性。

    1. 按照 PyTorch 官方 Dockerfile 中的步驟,安裝建置相依項並設定 ccache 以加速重新編譯。

      RUN DEBIAN_FRONTEND=noninteractive \ apt-get install -y --no-install-recommends \ build-essential \ ca-certificates \ ccache \ cmake \ git \ libjpeg-dev \ libpng-dev \ && rm -rf /var/lib/apt/lists/* # Setup ccache RUN /usr/sbin/update-ccache-symlinks RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
    2. 安裝 PyTorch 常見的和 Linux 的相依項

      # Common dependencies for PyTorch RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses # Linux specific dependency for PyTorch RUN conda install -c pytorch magma-cuda113
    3. 複製 PyTorch GitHub 儲存器

      RUN --mount=type=cache,target=/opt/ccache \ cd / \ && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
    4. 安裝並建置特定的 NCCL 版本。若要這麼做,請將 PyTorch 的預設 NCCL 資料夾 (/pytorch/third_party/nccl) 中的內容取代為 NVIDIA 儲存庫中的特定 NCCL 版本。NCCL 版本已在本指南的步驟 3 中設定。

      RUN cd /pytorch/third_party/nccl \ && rm -rf nccl \ && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \ && cd nccl \ && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \ && make pkg.txz.build \ && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1
    5. 建置和安裝 PyTorch。此程序通常需要稍微超過 1 小時才能完成。它是使用上一個步驟中所下載的 NCCL 版本。

      RUN cd /pytorch \ && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \ python setup.py install \ && rm -rf /pytorch
  8. 建置並安裝 AWS OFI NCCL 外掛程式。這可啟用 SageMaker AI 資料平行程式庫的 libfabric 支援。

    RUN DEBIAN_FRONTEND=noninteractive apt-get update \ && apt-get install -y --no-install-recommends \ autoconf \ automake \ libtool RUN mkdir /tmp/efa-ofi-nccl \ && cd /tmp/efa-ofi-nccl \ && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \ && cd aws-ofi-nccl \ && ./autogen.sh \ && ./configure --with-libfabric=/opt/amazon/efa \ --with-mpi=/opt/amazon/openmpi \ --with-cuda=/usr/local/cuda \ --with-nccl=$CONDA_PREFIX \ && make \ && make install \ && rm -rf /tmp/efa-ofi-nccl
  9. 建置並安裝 TorchVision

    RUN pip install --no-cache-dir -U \ packaging \ mpi4py==3.0.3 RUN cd /tmp \ && git clone https://github.com/pytorch/vision.git -b v0.9.1 \ && cd vision \ && BUILD_VERSION="0.9.1+cu111" python setup.py install \ && cd /tmp \ && rm -rf vision
  10. 安裝及設定 OpenSSH。MPI 需要 OpenSSH 才能在容器之間進行通訊。允許 OpenSSH 不必要求確認就會與容器通話。

    RUN apt-get update \ && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \ && apt-get install -y --no-install-recommends openssh-client openssh-server \ && mkdir -p /var/run/sshd \ && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \ && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \ && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \ && rm -rf /var/lib/apt/lists/* # Configure OpenSSH so that nodes can communicate with each other RUN mkdir -p /var/run/sshd && \ sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd RUN rm -rf /root/.ssh/ && \ mkdir -p /root/.ssh/ && \ ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \ cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \ && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config
  11. 安裝 PT S3 外掛程式,以有效率地存取 Amazon S3 中的資料集。

    RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU} RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt
  12. 安裝 libboost 程式庫。聯網 SageMaker AI 資料平行程式庫的非同步 IO 功能需要此套件。

    WORKDIR / RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \ && tar -xzf boost_1_73_0.tar.gz \ && cd boost_1_73_0 \ && ./bootstrap.sh \ && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \ && cd .. \ && rm -rf boost_1_73_0.tar.gz \ && rm -rf boost_1_73_0 \ && cd ${CONDA_PREFIX}/include/boost
  13. 安裝下列適用於 PyTorch 訓練的 SageMaker AI 工具。

    WORKDIR /root RUN pip install --no-cache-dir -U \ smclarify \ "sagemaker>=2,<3" \ sagemaker-experiments==0.* \ sagemaker-pytorch-training
  14. 最後,安裝 SageMaker AI 資料平行二進位和剩餘的相依性。

    RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ apt-get update && apt-get install -y --no-install-recommends \ jq \ libhwloc-dev \ libnuma1 \ libnuma-dev \ libssl1.1 \ libtool \ hwloc \ && rm -rf /var/lib/apt/lists/* RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
  15. 建立完 Dockerfile 後,請參閱調整您自己的訓練容器,以了解如何建置 Docker 容器、在 Amazon ECR 中託管,以及使用 SageMaker Python SDK 執行訓練任務。

以下範例程式碼顯示了一個完整的 Dockerfile 於合併所有先前的程式碼區塊的情況。

# This file creates a docker image with minimum dependencies to run SageMaker AI data parallel training FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04 # Set appropiate versions and location for components ARG PYTORCH_VERSION=1.10.2 ARG PYTHON_SHORT_VERSION=3.8 ARG EFA_VERSION=1.14.1 ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl ARG CONDA_PREFIX="/opt/conda" ARG BRANCH_OFI=1.1.3-aws # Set ENV variables required to build PyTorch ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0" ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all" ENV NCCL_VERSION=2.10.3 # Add OpenMPI to the path. ENV PATH /opt/amazon/openmpi/bin:$PATH # Add Conda to path ENV PATH $CONDA_PREFIX/bin:$PATH # Set this enviroment variable for SageMaker AI to launch SMDDP correctly. ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main # Add enviroment variable for processes to be able to call fork() ENV RDMAV_FORK_SAFE=1 # Indicate the container type ENV DLC_CONTAINER_TYPE=training # Add EFA and SMDDP to LD library path ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH" ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH # Install basic dependencies to download and build other dependencies RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ apt-get update && apt-get install -y --no-install-recommends \ curl \ wget \ git \ && rm -rf /var/lib/apt/lists/* # Install EFA. # This is required for SMDDP backend communication RUN DEBIAN_FRONTEND=noninteractive apt-get update RUN mkdir /tmp/efa \ && cd /tmp/efa \ && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \ && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \ && cd aws-efa-installer \ && ./efa_installer.sh -y --skip-kmod -g \ && rm -rf /tmp/efa # Install Conda RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \ chmod +x ~/miniconda.sh && \ ~/miniconda.sh -b -p $CONDA_PREFIX && \ rm ~/miniconda.sh && \ $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \ $CONDA_PREFIX/bin/conda clean -ya # Install PyTorch. # Start with dependencies listed in official PyTorch dockerfile # https://github.com/pytorch/pytorch/blob/master/Dockerfile RUN DEBIAN_FRONTEND=noninteractive \ apt-get install -y --no-install-recommends \ build-essential \ ca-certificates \ ccache \ cmake \ git \ libjpeg-dev \ libpng-dev && \ rm -rf /var/lib/apt/lists/* # Setup ccache RUN /usr/sbin/update-ccache-symlinks RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache # Common dependencies for PyTorch RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses # Linux specific dependency for PyTorch RUN conda install -c pytorch magma-cuda113 # Clone PyTorch RUN --mount=type=cache,target=/opt/ccache \ cd / \ && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION} # Note that we need to use the same NCCL version for PyTorch and OFI plugin. # To enforce that, install NCCL from source before building PT and OFI plugin. # Install NCCL. # Required for building OFI plugin (OFI requires NCCL's header files and library) RUN cd /pytorch/third_party/nccl \ && rm -rf nccl \ && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \ && cd nccl \ && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \ && make pkg.txz.build \ && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1 # Build and install PyTorch. RUN cd /pytorch \ && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \ python setup.py install \ && rm -rf /pytorch RUN ccache -C # Build and install OFI plugin. \ # It is required to use libfabric. RUN DEBIAN_FRONTEND=noninteractive apt-get update \ && apt-get install -y --no-install-recommends \ autoconf \ automake \ libtool RUN mkdir /tmp/efa-ofi-nccl \ && cd /tmp/efa-ofi-nccl \ && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \ && cd aws-ofi-nccl \ && ./autogen.sh \ && ./configure --with-libfabric=/opt/amazon/efa \ --with-mpi=/opt/amazon/openmpi \ --with-cuda=/usr/local/cuda \ --with-nccl=$CONDA_PREFIX \ && make \ && make install \ && rm -rf /tmp/efa-ofi-nccl # Build and install Torchvision RUN pip install --no-cache-dir -U \ packaging \ mpi4py==3.0.3 RUN cd /tmp \ && git clone https://github.com/pytorch/vision.git -b v0.9.1 \ && cd vision \ && BUILD_VERSION="0.9.1+cu111" python setup.py install \ && cd /tmp \ && rm -rf vision # Install OpenSSH. # Required for MPI to communicate between containers, allow OpenSSH to talk to containers without asking for confirmation RUN apt-get update \ && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \ && apt-get install -y --no-install-recommends openssh-client openssh-server \ && mkdir -p /var/run/sshd \ && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \ && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \ && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \ && rm -rf /var/lib/apt/lists/* # Configure OpenSSH so that nodes can communicate with each other RUN mkdir -p /var/run/sshd && \ sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd RUN rm -rf /root/.ssh/ && \ mkdir -p /root/.ssh/ && \ ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \ cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \ && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config # Install PT S3 plugin. # Required to efficiently access datasets in Amazon S3 RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU} RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt # Install libboost from source. # This package is needed for smdataparallel functionality (for networking asynchronous IO). WORKDIR / RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \ && tar -xzf boost_1_73_0.tar.gz \ && cd boost_1_73_0 \ && ./bootstrap.sh \ && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \ && cd .. \ && rm -rf boost_1_73_0.tar.gz \ && rm -rf boost_1_73_0 \ && cd ${CONDA_PREFIX}/include/boost # Install SageMaker AI PyTorch training. WORKDIR /root RUN pip install --no-cache-dir -U \ smclarify \ "sagemaker>=2,<3" \ sagemaker-experiments==0.* \ sagemaker-pytorch-training # Install SageMaker AI data parallel binary (SMDDP) # Start with dependencies RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ apt-get update && apt-get install -y --no-install-recommends \ jq \ libhwloc-dev \ libnuma1 \ libnuma-dev \ libssl1.1 \ libtool \ hwloc \ && rm -rf /var/lib/apt/lists/* # Install SMDDP RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
提示

如需在 SageMaker AI 中建立自訂 Dockerfile 以進行訓練的一般資訊,請參閱使用您自己的訓練演算法

提示

如果您想要擴展自訂 Dockerfile 以整合 SageMaker AI 模型平行程式庫,請參閱 使用 SageMaker 分散式模型平行程式庫建立您自己的 Docker 容器