To build your own Docker container for training and use the SageMaker AI data parallel library, you must include the correct dependencies and the binary files of the SageMaker AI distributed parallel libraries in your Dockerfile. This section provides instructions on how to create a complete Dockerfile with the minimum set of dependencies for distributed training in SageMaker AI using the data parallel library.
Note
This custom Docker option with the SageMaker AI data parallel library as a binary is available only for PyTorch.
Create a Dockerfile with the SageMaker training toolkit and the data parallel library
-
Start with a Docker image from NVIDIA CUDA. Use the cuDNN developer version that contains CUDA runtime and development tools (headers and libraries) to build PyTorch from source.

FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
Tip
The official AWS Deep Learning Container (DLC) images are built from the NVIDIA CUDA base images. If you want to use the prebuilt DLC images as references while following the rest of the instructions, see the AWS Deep Learning Containers Dockerfiles for PyTorch.
-
Add the following arguments to specify versions of PyTorch and other packages. Also, indicate the Amazon S3 bucket paths to the SageMaker AI data parallel library and other software that uses AWS resources, such as the Amazon S3 plugin.

To use versions of the third-party libraries other than the ones provided in the following code example, we recommend that you look into the official Dockerfiles of AWS Deep Learning Containers for PyTorch to find versions that are tested, compatible, and suitable for your application. To find URLs for the SMDATAPARALLEL_BINARY argument, see the lookup tables at Supported frameworks.

ARG PYTORCH_VERSION=1.10.2
ARG PYTHON_SHORT_VERSION=3.8
ARG EFA_VERSION=1.14.1
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
ARG CONDA_PREFIX="/opt/conda"
ARG BRANCH_OFI=1.1.3-aws
-
Set the following environment variables to properly build SageMaker training components and run the data parallel library. You use these variables for the components in the subsequent steps.
# Set ENV variables required to build PyTorch
ENV TORCH_CUDA_ARCH_LIST="7.0+PTX 8.0"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV NCCL_VERSION=2.10.3
# Add OpenMPI to the path.
ENV PATH /opt/amazon/openmpi/bin:$PATH
# Add Conda to path
ENV PATH $CONDA_PREFIX/bin:$PATH
# Set this environment variable for SageMaker AI to launch SMDDP correctly.
ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main
# Add environment variable for processes to be able to call fork()
ENV RDMAV_FORK_SAFE=1
# Indicate the container type
ENV DLC_CONTAINER_TYPE=training
# Add EFA and SMDDP to LD library path
ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH"
ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH
-
Install or update curl, wget, and git to download and build packages in the subsequent steps.

RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
apt-get update && apt-get install -y --no-install-recommends \
curl \
wget \
git \
&& rm -rf /var/lib/apt/lists/*
-
Install Elastic Fabric Adapter (EFA) software for Amazon EC2 network communication.

RUN DEBIAN_FRONTEND=noninteractive apt-get update
RUN mkdir /tmp/efa \
&& cd /tmp/efa \
&& curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \
&& tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
&& cd aws-efa-installer \
&& ./efa_installer.sh -y --skip-kmod -g \
&& rm -rf /tmp/efa
-
Install Conda to handle package management.

RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
chmod +x ~/miniconda.sh && \
~/miniconda.sh -b -p $CONDA_PREFIX && \
rm ~/miniconda.sh && \
$CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \
$CONDA_PREFIX/bin/conda clean -ya
-
Get, build, and install PyTorch and its dependencies. We build PyTorch from source because we need to have control over the NCCL version to guarantee compatibility with the AWS OFI NCCL plugin.
-
Following the steps in the official PyTorch dockerfile, install the build dependencies and set up ccache to speed up recompilation.

RUN DEBIAN_FRONTEND=noninteractive \
apt-get install -y --no-install-recommends \
build-essential \
ca-certificates \
ccache \
cmake \
git \
libjpeg-dev \
libpng-dev \
&& rm -rf /var/lib/apt/lists/*
# Setup ccache
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
-
Install PyTorch's common and Linux-specific dependencies.

# Common dependencies for PyTorch
RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
# Linux specific dependency for PyTorch
RUN conda install -c pytorch magma-cuda113
-
Clone PyTorch.

RUN --mount=type=cache,target=/opt/ccache \
cd / \
&& git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
-
Install and build a specific NCCL version. To do this, replace the content in PyTorch's default NCCL folder (/pytorch/third_party/nccl) with the specific NCCL version from the NVIDIA repository. The NCCL version was set in step 3 of this guide.

RUN cd /pytorch/third_party/nccl \
&& rm -rf nccl \
&& git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
&& cd nccl \
&& make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
&& make pkg.txz.build \
&& tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1
-
Build and install PyTorch. This process usually takes slightly more than an hour to complete. It is built using the NCCL version downloaded in the previous step.

RUN cd /pytorch \
&& CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
python setup.py install \
&& rm -rf /pytorch
-
Build and install the AWS OFI NCCL plugin. This enables libfabric support for the SageMaker AI data parallel library.

RUN DEBIAN_FRONTEND=noninteractive apt-get update \
&& apt-get install -y --no-install-recommends \
autoconf \
automake \
libtool
RUN mkdir /tmp/efa-ofi-nccl \
&& cd /tmp/efa-ofi-nccl \
&& git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
&& cd aws-ofi-nccl \
&& ./autogen.sh \
&& ./configure --with-libfabric=/opt/amazon/efa \
--with-mpi=/opt/amazon/openmpi \
--with-cuda=/usr/local/cuda \
--with-nccl=$CONDA_PREFIX \
&& make \
&& make install \
&& rm -rf /tmp/efa-ofi-nccl
-
Build and install TorchVision.

RUN pip install --no-cache-dir -U \
packaging \
mpi4py==3.0.3
RUN cd /tmp \
&& git clone https://github.com/pytorch/vision.git -b v0.9.1 \
&& cd vision \
&& BUILD_VERSION="0.9.1+cu111" python setup.py install \
&& cd /tmp \
&& rm -rf vision
-
Install and configure OpenSSH. OpenSSH is required for MPI to communicate between containers. Allow OpenSSH to talk to containers without asking for confirmation.

RUN apt-get update \
&& apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
&& apt-get install -y --no-install-recommends openssh-client openssh-server \
&& mkdir -p /var/run/sshd \
&& cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
&& echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
&& mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
&& rm -rf /var/lib/apt/lists/*
# Configure OpenSSH so that nodes can communicate with each other
RUN mkdir -p /var/run/sshd && \
sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
RUN rm -rf /root/.ssh/ && \
mkdir -p /root/.ssh/ && \
ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
&& printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config
-
Install the PT S3 plugin to efficiently access datasets in Amazon S3.

RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}
RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt
-
Install the libboost library. This package is needed for networking the asynchronous IO functionality of the SageMaker AI data parallel library.

WORKDIR /
RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
&& tar -xzf boost_1_73_0.tar.gz \
&& cd boost_1_73_0 \
&& ./bootstrap.sh \
&& ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
&& cd .. \
&& rm -rf boost_1_73_0.tar.gz \
&& rm -rf boost_1_73_0 \
&& cd ${CONDA_PREFIX}/include/boost
-
Install the following SageMaker AI tools for PyTorch training.

WORKDIR /root
RUN pip install --no-cache-dir -U \
smclarify \
"sagemaker>=2,<3" \
sagemaker-experiments==0.* \
sagemaker-pytorch-training
-
Finally, install the SageMaker AI data parallel binary and the remaining dependencies. A sketch of how a training script can use the installed binary is shown after this step.

RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
apt-get update && apt-get install -y --no-install-recommends \
jq \
libhwloc-dev \
libnuma1 \
libnuma-dev \
libssl1.1 \
libtool \
hwloc \
&& rm -rf /var/lib/apt/lists/*
RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
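Once the binary is installed, a training script running inside this container can load the library and use it as a PyTorch distributed backend. The following is a minimal sketch, assuming SMDDP v1.4.0 or later (which registers the smddp process group backend for PyTorch); the script name train.py, the model, and the LOCAL_RANK environment variable usage are illustrative placeholders to adapt to your own script.

# train.py - a minimal, hypothetical sketch of a training script that uses
# the SMDDP binary installed above. Assumes smdistributed-dataparallel
# v1.4.0 or later, which registers "smddp" as a PyTorch distributed
# process group backend.
import os

import torch
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401 - registers the "smddp" backend

dist.init_process_group(backend="smddp")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Wrap any model with DistributedDataParallel; gradient synchronization
# then runs through the SMDDP collectives over EFA.
model = torch.nn.Linear(10, 1).cuda(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])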
After you create the Dockerfile, see Adapt your own training container to learn how to build the Docker container, host it in Amazon ECR, and run a training job using the SageMaker Python SDK. A minimal sketch of that workflow is shown after the complete Dockerfile below.
The following example code shows a complete Dockerfile that combines all of the previous code blocks.
# This file creates a docker image with minimum dependencies to run SageMaker AI data parallel training
FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04
# Set appropriate versions and location for components
ARG PYTORCH_VERSION=1.10.2
ARG PYTHON_SHORT_VERSION=3.8
ARG EFA_VERSION=1.14.1
ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl
ARG CONDA_PREFIX="/opt/conda"
ARG BRANCH_OFI=1.1.3-aws
# Set ENV variables required to build PyTorch
ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV NCCL_VERSION=2.10.3
# Add OpenMPI to the path.
ENV PATH /opt/amazon/openmpi/bin:$PATH
# Add Conda to path
ENV PATH $CONDA_PREFIX/bin:$PATH
# Set this environment variable for SageMaker AI to launch SMDDP correctly.
ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main
# Add environment variable for processes to be able to call fork()
ENV RDMAV_FORK_SAFE=1
# Indicate the container type
ENV DLC_CONTAINER_TYPE=training
# Add EFA and SMDDP to LD library path
ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH"
ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH
# Install basic dependencies to download and build other dependencies
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
apt-get update && apt-get install -y --no-install-recommends \
curl \
wget \
git \
&& rm -rf /var/lib/apt/lists/*
# Install EFA.
# This is required for SMDDP backend communication
RUN DEBIAN_FRONTEND=noninteractive apt-get update
RUN mkdir /tmp/efa \
&& cd /tmp/efa \
&& curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \
&& tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
&& cd aws-efa-installer \
&& ./efa_installer.sh -y --skip-kmod -g \
&& rm -rf /tmp/efa
# Install Conda
RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
chmod +x ~/miniconda.sh && \
~/miniconda.sh -b -p $CONDA_PREFIX && \
rm ~/miniconda.sh && \
$CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \
$CONDA_PREFIX/bin/conda clean -ya
# Install PyTorch.
# Start with dependencies listed in official PyTorch dockerfile
# https://github.com/pytorch/pytorch/blob/master/Dockerfile
RUN DEBIAN_FRONTEND=noninteractive \
apt-get install -y --no-install-recommends \
build-essential \
ca-certificates \
ccache \
cmake \
git \
libjpeg-dev \
libpng-dev && \
rm -rf /var/lib/apt/lists/*
# Setup ccache
RUN /usr/sbin/update-ccache-symlinks
RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache
# Common dependencies for PyTorch
RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses
# Linux specific dependency for PyTorch
RUN conda install -c pytorch magma-cuda113
# Clone PyTorch
RUN --mount=type=cache,target=/opt/ccache \
cd / \
&& git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION}
# Note that we need to use the same NCCL version for PyTorch and OFI plugin.
# To enforce that, install NCCL from source before building PT and OFI plugin.
# Install NCCL.
# Required for building OFI plugin (OFI requires NCCL's header files and library)
RUN cd /pytorch/third_party/nccl \
&& rm -rf nccl \
&& git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \
&& cd nccl \
&& make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \
&& make pkg.txz.build \
&& tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1
# Build and install PyTorch.
RUN cd /pytorch \
&& CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
python setup.py install \
&& rm -rf /pytorch
RUN ccache -C
# Build and install OFI plugin.
# It is required to use libfabric.
RUN DEBIAN_FRONTEND=noninteractive apt-get update \
&& apt-get install -y --no-install-recommends \
autoconf \
automake \
libtool
RUN mkdir /tmp/efa-ofi-nccl \
&& cd /tmp/efa-ofi-nccl \
&& git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \
&& cd aws-ofi-nccl \
&& ./autogen.sh \
&& ./configure --with-libfabric=/opt/amazon/efa \
--with-mpi=/opt/amazon/openmpi \
--with-cuda=/usr/local/cuda \
--with-nccl=$CONDA_PREFIX \
&& make \
&& make install \
&& rm -rf /tmp/efa-ofi-nccl
# Build and install Torchvision
RUN pip install --no-cache-dir -U \
packaging \
mpi4py==3.0.3
RUN cd /tmp \
&& git clone https://github.com/pytorch/vision.git -b v0.9.1 \
&& cd vision \
&& BUILD_VERSION="0.9.1+cu111" python setup.py install \
&& cd /tmp \
&& rm -rf vision
# Install OpenSSH.
# Required for MPI to communicate between containers, allow OpenSSH to talk to containers without asking for confirmation
RUN apt-get update \
&& apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
&& apt-get install -y --no-install-recommends openssh-client openssh-server \
&& mkdir -p /var/run/sshd \
&& cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \
&& echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \
&& mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \
&& rm -rf /var/lib/apt/lists/*
# Configure OpenSSH so that nodes can communicate with each other
RUN mkdir -p /var/run/sshd && \
sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd
RUN rm -rf /root/.ssh/ && \
mkdir -p /root/.ssh/ && \
ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \
cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \
&& printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config
# Install PT S3 plugin.
# Required to efficiently access datasets in Amazon S3
RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU}
RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt
# Install libboost from source.
# This package is needed for smdataparallel functionality (for networking asynchronous IO).
WORKDIR /
RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \
&& tar -xzf boost_1_73_0.tar.gz \
&& cd boost_1_73_0 \
&& ./bootstrap.sh \
&& ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \
&& cd .. \
&& rm -rf boost_1_73_0.tar.gz \
&& rm -rf boost_1_73_0 \
&& cd ${CONDA_PREFIX}/include/boost
# Install SageMaker AI PyTorch training.
WORKDIR /root
RUN pip install --no-cache-dir -U \
smclarify \
"sagemaker>=2,<3" \
sagemaker-experiments==0.* \
sagemaker-pytorch-training
# Install SageMaker AI data parallel binary (SMDDP)
# Start with dependencies
RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \
apt-get update && apt-get install -y --no-install-recommends \
jq \
libhwloc-dev \
libnuma1 \
libnuma-dev \
libssl1.1 \
libtool \
hwloc \
&& rm -rf /var/lib/apt/lists/*
# Install SMDDP
RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
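As a quick orientation for the workflow mentioned above, the following is a minimal sketch of launching a training job with this container through the SageMaker Python SDK, assuming you have already built the image and pushed it to Amazon ECR (see Adapt your own training container for the exact steps). The image URI, IAM role, entry point, instance settings, and S3 paths are placeholders to replace with your own values.

# Hedged sketch: launch a training job that uses the custom container.
# All names below (image URI, role, script, bucket) are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>",
    role="<your-sagemaker-execution-role>",
    entry_point="train.py",
    instance_count=2,
    instance_type="ml.p4d.24xlarge",  # an EFA-enabled GPU instance type
    # Enable the SageMaker AI data parallel library for this training job.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit("s3://<your-bucket>/<training-data-prefix>")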
Tip
For more general information about creating a custom Dockerfile for training in SageMaker AI, see Use your own training algorithms.
Tip
If you want to extend the custom Dockerfile to incorporate the SageMaker AI model parallel library, see Create your own Docker container with the SageMaker distributed model parallel library.