Run distributed training workloads with Slurm on HyperPod
SageMaker HyperPod is purpose-built for training large language models (LLMs) and foundation models (FMs). These workloads often require multiple parallelism techniques and operations optimized for ML infrastructure and resources. With SageMaker HyperPod, you can use the following SageMaker distributed training libraries:
- The SageMaker distributed data parallelism (SMDDP) library, which offers collective communication operations optimized for AWS.
- The SageMaker model parallelism (SMP) library, which implements various model parallelism techniques.
Using SMDDP on a SageMaker HyperPod cluster
The SMDDP library is a collective communication library that improves the compute performance of distributed data parallel training. It works with open source distributed training frameworks such as PyTorch Fully Sharded Data Parallelism (FSDP) and DeepSpeed ZeRO. The SMDDP library addresses the communication overhead of key collective communication operations by offering the following for SageMaker HyperPod:
- The library offers AllGather optimized for AWS. AllGather is a key operation used in sharded data parallel training, which is a memory-efficient data parallelism technique offered by popular libraries, including the SageMaker model parallelism (SMP) library, DeepSpeed Zero Redundancy Optimizer (ZeRO), and PyTorch Fully Sharded Data Parallelism (FSDP); see the sketch after this list.
- The library performs optimized node-to-node communication by fully utilizing the AWS network infrastructure and the SageMaker ML instance topology.
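The following is a minimal sketch of how the SMDDP AllGather optimization is typically activated in a PyTorch script: importing the SMDDP module registers an smddp process-group backend, which PyTorch FSDP then uses for its collective operations. The model is a placeholder, not part of the SMDDP API, and the script is intended to run under a distributed launcher (for example, torchrun) that sets the usual rank and world-size environment variables.

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Importing this module has the side effect of registering the "smddp"
# backend with torch.distributed.
import smdistributed.dataparallel.torch.torch_smddp

# Initialize the process group with the SMDDP backend instead of "nccl".
dist.init_process_group(backend="smddp")

# Placeholder model; FSDP shards its parameters and relies on AllGather
# (served by SMDDP) to reassemble them during the forward and backward passes.
model = torch.nn.Linear(1024, 1024).cuda()
model = FSDP(model)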
To run sample data-parallel training jobs
Explore the distributed training samples that implement data parallelism techniques using the SMDDP library.
To set up an environment for using the SMDDP library on SageMaker HyperPod
The following are the training environment requirements for using the SMDDP library on SageMaker HyperPod; a quick verification sketch follows the list.
- PyTorch v2.0.1 and later
- CUDA v11.8 and later
- libstdc++ runtime version greater than 3
- Python v3.10.x and later
- ml.p4d.24xlarge and ml.p4de.24xlarge, which are the instance types supported by the SMDDP library
- IMDSv2 enabled on the training host
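As a quick sanity check, a short Python snippet such as the following (a sketch, not part of the SMDDP library) can confirm that the interpreter, PyTorch, and CUDA versions on a cluster node meet these requirements:

import sys
import torch

# Expect Python 3.10 or later.
print("Python:", sys.version.split()[0])
# Expect PyTorch 2.0.1 or later built against CUDA 11.8 or later.
print("PyTorch:", torch.__version__)
print("CUDA:", torch.version.cuda)
# Confirm that GPUs are visible on the node (for example, 8 on ml.p4d.24xlarge).
print("GPUs visible:", torch.cuda.device_count())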
Depending on how you want to run the distributed training job, there are two options to install the SMDDP library:
- A direct installation using the SMDDP binary file.
- Using the SageMaker Deep Learning Containers (DLCs) pre-installed with the SMDDP library.
Docker images pre-installed with the SMDDP library or the URLs to the SMDDP binary files are listed at Supported Frameworks in the SMDDP library documentation.
To install the SMDDP library on the SageMaker HyperPod DLAMI
- Install the SMDDP binary file with pip, replacing <pytorch-version>, cuXYZ, YYYY-MM-DD, and X.Y.Z with the PyTorch version, CUDA version, release date, and SMDDP library version of the binary you want to install:

pip install --no-cache-dir https://smdataparallel.s3.amazonaws.com/binary/pytorch/<pytorch-version>/cuXYZ/YYYY-MM-DD/smdistributed_dataparallel-X.Y.Z-cp310-cp310-linux_x86_64.whl

Note: If you work in a Conda environment, ensure that you install PyTorch using conda install instead of pip:

conda install pytorch==X.Y.Z torchvision==X.Y.Z torchaudio==X.Y.Z pytorch-cuda=X.Y.Z -c pytorch -c nvidia
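To verify that the installation succeeded, you can run a quick import check such as the following sketch; importing the SMDDP PyTorch module should complete without errors and register the smddp backend:

# Quick check that the SMDDP binary installed correctly.
import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend
import torch.distributed as dist

# "smddp" can now be passed as the backend name to dist.init_process_group.
print("SMDDP imported; 'smddp' backend registered with torch.distributed")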
To use the SMDDP library in a Docker container
- The SMDDP library is pre-installed in the SageMaker Deep Learning Containers (DLCs). To find the list of SageMaker framework DLCs for PyTorch that include the SMDDP library, see Supported Frameworks in the SMDDP library documentation. You can also bring your own Docker container with the required dependencies installed to use the SMDDP library. To learn more about setting up a custom Docker container for the SMDDP library, see Create your own Docker container with the SageMaker distributed data parallel library.
Important
To use the SMDDP library in a Docker container, mount the /var/log directory from the host machine onto /var/log in the container. This can be done by adding the following option when running your container.

docker run <OTHER_OPTIONS> -v /var/log:/var/log ...
To learn how to run data-parallel training jobs with SMDDP in general, see Distributed training with the SageMaker distributed data parallelism library.
Using SMP on a SageMaker HyperPod cluster
The SageMaker model parallelism (SMP) library offers various state-of-the-art model parallelism techniques, including:
- fully sharded data parallelism
- expert parallelism
- mixed precision training with FP16/BF16 and FP8 data types
- tensor parallelism
The SMP library is also compatible with open source frameworks such as PyTorch FSDP, NVIDIA Megatron, and NVIDIA Transformer Engine.
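As a rough illustration, the following sketch shows the pattern typically used with SMP v2: initialize the library through its torch.sagemaker module at the start of the training script and combine it with PyTorch FSDP for sharded data parallelism. The model is a placeholder, and the exact initialization arguments and configuration keys depend on your SMP version, so treat this as an assumption-laden outline rather than a definitive recipe.

import torch
import torch.distributed as dist
import torch.sagemaker as tsm  # provided by the SMP v2 library
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# SMP v2 works with the SMDDP collectives; importing this module registers
# the "smddp" backend (assumed available in the training environment).
import smdistributed.dataparallel.torch.torch_smddp

dist.init_process_group(backend="smddp")

# Initialize SMP. In practice a configuration (for example, the tensor parallel
# degree) is usually supplied here or through the job launcher; this bare call
# is a placeholder.
tsm.init()

# Placeholder model; in a real script this would be your LLM or FM definition.
model = torch.nn.Linear(1024, 1024).cuda()

# SMP v2 is designed to work alongside PyTorch FSDP for sharded data parallelism.
model = FSDP(model)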
To run a sample model-parallel training workload
The SageMaker service teams provide sample training jobs implementing model parallelism with the SMP library at awsome-distributed-training/3.test_cases/17.SM-modelparallelv2.