Run Docker containers on a Slurm compute node on HyperPod
To run Docker containers with Slurm on SageMaker HyperPod, you need to use Enroot and Pyxis. Enroot imports Docker images into a container runtime that Slurm can launch, and Pyxis schedules that runtime as a Slurm job through the srun command: srun --container-image=docker/image:tag.
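For example, assuming you have access to a public image such as ubuntu:22.04 (the image tag here is only an illustration), a minimal smoke test on a single compute node looks like the following. Pyxis pulls the image, Enroot unpacks it, and the command runs inside the container.
# Minimal sketch; replace ubuntu:22.04 with any image your nodes can pull.
$ srun -N 1 --container-image=ubuntu:22.04 cat /etc/os-release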
Tip
The Docker, Enroot, and Pyxis packages should be installed during cluster creation as part of running the lifecycle scripts, as guided in Start with base lifecycle scripts provided by HyperPod. Use the Config class in the base lifecycle script config.py with the boolean parameter for installing the packages set to True (enable_docker_enroot_pyxis=True). This flag is parsed by lifecycle_script.py, which calls the install_docker.sh and install_enroot_pyxis.sh scripts from the utils folder. The installation scripts also detect whether the instances have local NVMe storage and, if so, set the root paths for Docker and Enroot to /opt/dlami/nvme.
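You can also verify that the Pyxis plugin is active on your cluster by checking that it registered its container options with srun. This is an optional sanity check, not part of the lifecycle scripts themselves.
# If Pyxis is loaded as a Slurm SPANK plugin, srun gains --container-* options.
# No output here means the plugin is not active.
$ srun --help | grep -- --container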
The default root volume of any fresh instance is mounted to /tmp with only a 100 GB EBS volume, which runs out if your workload involves training LLMs and therefore large Docker containers. If you use instance families such as P and G with local NVMe storage, make sure that you use the NVMe storage attached at /opt/dlami/nvme; the installation scripts take care of the configuration process.
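Before pulling large images, you can optionally confirm that the NVMe volume is mounted and has enough free space on every compute node. The node count of 8 below matches the example cluster used in this guide; adjust it to your cluster size.
# Optional capacity check on the local NVMe storage of each node.
$ srun -N 8 df -h /opt/dlami/nvme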
To check if the root paths are set up properly
On a compute node of your Slurm cluster on SageMaker HyperPod, run the following commands to make sure that the lifecycle script worked properly and that the root volume of each node is set to /opt/dlami/nvme/*. The following commands show examples of checking the Enroot runtime path and the Docker data root path for 8 compute nodes of a Slurm cluster.
$ srun -N 8 cat /etc/enroot/enroot.conf | grep "ENROOT_RUNTIME_PATH"
ENROOT_RUNTIME_PATH /opt/dlami/nvme/tmp/enroot/user-$(id -u)
... // The same or similar lines repeat 7 times
$ srun -N 8 cat /etc/docker/daemon.json
{
    "data-root": "/opt/dlami/nvme/docker/data-root"
}
... // The same or similar lines repeat 7 times
After you confirm that the runtime paths are properly set to /opt/dlami/nvme/*, you're ready to build and run Docker containers with Enroot and Pyxis.
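The general pattern, which the docker_build.sh example later on this page also follows, is to build a local image with Docker, import it into an Enroot squashfs file, and point srun at that file. The image name my-image and the Dockerfile location below are placeholders for illustration.
# Build a local image from a Dockerfile in the current directory (placeholder name).
$ docker build -t my-image:latest .
# Convert the image from the local Docker daemon into an Enroot squashfs file.
$ enroot import -o my-image.sqsh dockerd://my-image:latest
# Run a command inside the container through Slurm and Pyxis.
$ srun -N 1 --container-image=$(pwd)/my-image.sqsh hostname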
To test Docker with Slurm
-
On your compute node, try the following commands to check if Docker and Enroot are properly installed.
$ docker --help
$ enroot --help
-
Test whether Pyxis and Enroot are installed correctly by running one of the NVIDIA CUDA Ubuntu images. If you plan to reuse the image across jobs, see the import sketch after this list for a way to cache it as a squashfs file.
$ srun --container-image=nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY nvidia-smi
pyxis: importing docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
pyxis: imported docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
DAY MMM DD HH:MM:SS YYYY
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: XX.YY    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   40C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
You can also test it by creating a script and running an sbatch command as follows.
$ cat <<EOF >> container-test.sh
#!/bin/bash
#SBATCH --container-image=nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
nvidia-smi
EOF
$ sbatch container-test.sh
pyxis: importing docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
pyxis: imported docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
DAY MMM DD HH:MM:SS YYYY
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: XX.YY    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   40C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
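If you plan to reuse the same registry image across many jobs, as referenced in the Pyxis test above, you can import it once into a squashfs file with Enroot and pass that file to srun, so the image isn't pulled again on every submission. The file name cuda.sqsh is just an example.
# Import the registry image once; docker:// is Enroot's registry URI scheme.
$ enroot import -o cuda.sqsh docker://nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
# Run from the cached squashfs file instead of pulling the image again.
$ srun --container-image=$(pwd)/cuda.sqsh nvidia-smi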
To run a test Slurm job with Docker
After you have completed setting up Slurm with Docker, you can bring any pre-built Docker image and run it using Slurm on SageMaker HyperPod. The following is a sample use case that walks you through how to run a training job using Docker and Slurm on SageMaker HyperPod. It shows an example job of model-parallel training of the Llama 2 model with the SageMaker model parallelism (SMP) library.
-
If you want to use one of the pre-built ECR images distributed by SageMaker or DLC, make sure that you give your HyperPod cluster the permissions to pull ECR images through the IAM role for SageMaker HyperPod. If you use your own or an open-source Docker image, you can skip this step. Add the following permissions to the IAM role for SageMaker HyperPod; for one way to attach them with the AWS CLI, see the sketch after this procedure. In this tutorial, we use the SMP Docker image pre-packaged with the SMP library.
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "ecr:BatchCheckLayerAvailability", "ecr:BatchGetImage", "ecr-public:*", "ecr:GetDownloadUrlForLayer", "ecr:GetAuthorizationToken", "sts:*" ], "Resource": "*" } ] }
-
On the compute node, clone the repository and go to the folder that provides the example scripts for training with SMP.
$ git clone https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training/3.test_cases/17.SM-modelparallelv2
-
In this tutorial, run the sample script docker_build.sh, which pulls the SMP Docker image, builds the Docker container, and imports it as an Enroot runtime. You can modify this as you want.
$ cat docker_build.sh
#!/usr/bin/env bash
region=us-west-2
dlc_account_id=658645717510
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com
docker build -t smpv2 .
enroot import -o smpv2.sqsh dockerd://smpv2:latest
$ bash docker_build.sh
-
Create a batch script to launch a training job using sbatch. In this tutorial, the provided sample script launch_training_enroot.sh launches a model-parallel training job of the 70-billion-parameter Llama 2 model with a synthetic dataset on 8 compute nodes. A set of training scripts is provided at 3.test_cases/17.SM-modelparallelv2/scripts, and launch_training_enroot.sh takes train_external.py as the entrypoint script.
Important
To use a Docker container on SageMaker HyperPod, you must mount the /var/log directory from the host machine, which is the HyperPod compute node in this case, onto the /var/log directory in the container. You can set it up by adding the following variable for Enroot.
"${HYPERPOD_PATH:="/var/log/aws/clusters":"/var/log/aws/clusters"}"
$ cat launch_training_enroot.sh
#!/bin/bash

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

#SBATCH --nodes=8 # number of nodes to use, 2 p4d(e) = 16 A100 GPUs
#SBATCH --job-name=smpv2_llama # name of your job
#SBATCH --exclusive # job has exclusive use of the resource, no sharing
#SBATCH --wait-all-nodes=1

set -ex;

###########################
###### User Variables #####
###########################

#########################
model_type=llama_v2
model_size=70b

# Toggle this to use synthetic data
use_synthetic_data=1

# To run training on your own data set Training/Test Data path -> Change this to the tokenized dataset path in Fsx. Acceptable formats are huggingface (arrow) and Jsonlines.
# Also change the use_synthetic_data to 0
export TRAINING_DIR=/fsx/path_to_data
export TEST_DIR=/fsx/path_to_data

export CHECKPOINT_DIR=$(pwd)/checkpoints

# Variables for Enroot
: "${IMAGE:=$(pwd)/smpv2.sqsh}"
: "${HYPERPOD_PATH:="/var/log/aws/clusters":"/var/log/aws/clusters"}" # This is needed for validating its hyperpod cluster
: "${TRAIN_DATA_PATH:=$TRAINING_DIR:$TRAINING_DIR}"
: "${TEST_DATA_PATH:=$TEST_DIR:$TEST_DIR}"
: "${CHECKPOINT_PATH:=$CHECKPOINT_DIR:$CHECKPOINT_DIR}"

###########################
## Environment Variables ##
###########################

#export NCCL_SOCKET_IFNAME=en
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_PROTO="simple"
export NCCL_SOCKET_IFNAME="^lo,docker"
export RDMAV_FORK_SAFE=1
export FI_EFA_USE_DEVICE_RDMA=1
export NCCL_DEBUG_SUBSYS=off
export NCCL_DEBUG="INFO"
export SM_NUM_GPUS=8
export GPU_NUM_DEVICES=8
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0

# async runtime error ...
export CUDA_DEVICE_MAX_CONNECTIONS=1

#########################
## Command and Options ##
#########################

if [ "$model_size" == "7b" ]; then
    HIDDEN_WIDTH=4096
    NUM_LAYERS=32
    NUM_HEADS=32
    LLAMA_INTERMEDIATE_SIZE=11008
    DEFAULT_SHARD_DEGREE=8
# More Llama model size options
elif [ "$model_size" == "70b" ]; then
    HIDDEN_WIDTH=8192
    NUM_LAYERS=80
    NUM_HEADS=64
    LLAMA_INTERMEDIATE_SIZE=28672
    # Reduce for better perf on p4de
    DEFAULT_SHARD_DEGREE=64
fi

if [ -z "$shard_degree" ]; then
    SHARD_DEGREE=$DEFAULT_SHARD_DEGREE
else
    SHARD_DEGREE=$shard_degree
fi

if [ -z "$LLAMA_INTERMEDIATE_SIZE" ]; then
    LLAMA_ARGS=""
else
    LLAMA_ARGS="--llama_intermediate_size $LLAMA_INTERMEDIATE_SIZE "
fi

if [ $use_synthetic_data == 1 ]; then
    echo "using synthetic data"
    declare -a ARGS=(
        --container-image $IMAGE
        --container-mounts $HYPERPOD_PATH,$CHECKPOINT_PATH
    )
else
    echo "using real data...."
    declare -a ARGS=(
        --container-image $IMAGE
        --container-mounts $HYPERPOD_PATH,$TRAIN_DATA_PATH,$TEST_DATA_PATH,$CHECKPOINT_PATH
    )
fi

declare -a TORCHRUN_ARGS=(
    # change this to match the number of gpus per node:
    --nproc_per_node=8 \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$(hostname) \
)

srun -l "${ARGS[@]}" torchrun "${TORCHRUN_ARGS[@]}" /path_to/train_external.py \
    --train_batch_size 4 \
    --max_steps 100 \
    --hidden_width $HIDDEN_WIDTH \
    --num_layers $NUM_LAYERS \
    --num_heads $NUM_HEADS \
    ${LLAMA_ARGS} \
    --shard_degree $SHARD_DEGREE \
    --model_type $model_type \
    --profile_nsys 1 \
    --use_smp_implementation 1 \
    --max_context_width 4096 \
    --tensor_parallel_degree 1 \
    --use_synthetic_data $use_synthetic_data \
    --training_dir $TRAINING_DIR \
    --test_dir $TEST_DIR \
    --dataset_type hf \
    --checkpoint_dir $CHECKPOINT_DIR \
    --checkpoint_freq 100 \

$ sbatch launch_training_enroot.sh
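After you submit the job with sbatch, you can monitor it with standard Slurm commands. Because the sample script doesn't set #SBATCH --output, sbatch writes the job output to a slurm-<job ID>.out file in the directory you submitted from; the job ID placeholder below is for illustration.
# Check that the job is running and on how many nodes.
$ squeue
# Follow the training logs, such as the pyxis image import messages shown in the earlier test.
$ tail -f slurm-<job ID>.out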
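As mentioned in the first step of this procedure, one way to attach the ECR pull permissions is to save the policy JSON to a file and add it as an inline policy on the IAM role for SageMaker HyperPod. The role name, policy name, and file name below are placeholders.
# Attach the policy document as an inline policy on the HyperPod IAM role.
$ aws iam put-role-policy \
    --role-name MyHyperPodExecutionRole \
    --policy-name EcrPullAccess \
    --policy-document file://ecr-pull-policy.json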
To find the downloadable code examples, see Run a model-parallel training job using the SageMaker model parallelism library, Docker and Enroot with Slurm.