
Run Docker containers on a Slurm compute node on HyperPod

To run Docker containers with Slurm on SageMaker HyperPod, you need to use Enroot and Pyxis. The Enroot package converts Docker images into a runtime that Slurm can understand, while Pyxis enables scheduling that runtime as a Slurm job through an srun command, such as srun --container-image=docker/image:tag.
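
For example, after the packages are installed you can launch a containerized command directly through srun. The following is a minimal sketch; the image tag and the command are placeholders, and Pyxis imports the image through Enroot before running the command inside the resulting container.

# Minimal sketch (image tag and command are placeholders).
# Pyxis imports the Docker image through Enroot, then runs the command inside it.
$ srun --container-image=ubuntu:22.04 cat /etc/os-release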

Tip

The Docker, Enroot, and Pyxis packages should be installed during cluster creation as part of running the lifecycle scripts, as guided in Start with base lifecycle scripts provided by HyperPod. Use the base lifecycle scripts provided by the HyperPod service team when you create a HyperPod cluster; they install the packages by default. In the config.py script, the Config class has a boolean parameter for installing the packages that is set to True (enable_docker_enroot_pyxis=True). This parameter is parsed by the lifecycle_script.py script, which calls the install_docker.sh and install_enroot_pyxis.sh scripts from the utils folder. The installation scripts are where the actual installation of the packages takes place. They also detect whether the instances they run on have NVMe store paths and, if so, set the root paths for Docker and Enroot to /opt/dlami/nvme.

The default root volume of any fresh instance is mounted to /tmp only, with a 100GB EBS volume, which runs out of space if your workload involves training LLMs and therefore large Docker containers. If you use instance families such as P and G with local NVMe storage, make sure that you use the NVMe storage attached at /opt/dlami/nvme; the installation scripts take care of this configuration.
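
As a quick sanity check on a compute node, you can also confirm the data root that the running Docker daemon actually uses and the free space on the NVMe-backed volume. This is a minimal sketch that complements the procedure below; the expected path comes from the daemon.json example shown in the next section.

# Print the data root the Docker daemon is using; it should point under /opt/dlami/nvme.
$ docker info --format '{{ .DockerRootDir }}'
/opt/dlami/nvme/docker/data-root

# Check how much space is left on the NVMe-backed volume.
$ df -h /opt/dlami/nvme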

To check if the root paths are set up properly

On a compute node of your Slurm cluster on SageMaker HyperPod, run the following commands to make sure that the lifecycle script worked properly and that the root volume of each node is set to /opt/dlami/nvme/*. The following commands show examples of checking the Enroot runtime path and the Docker data root path for 8 compute nodes of a Slurm cluster.

$ srun -N 8 cat /etc/enroot/enroot.conf | grep "ENROOT_RUNTIME_PATH"
ENROOT_RUNTIME_PATH /opt/dlami/nvme/tmp/enroot/user-$(id -u)
... // The same or similar lines repeat 7 times

$ srun -N 8 cat /etc/docker/daemon.json
{
    "data-root": "/opt/dlami/nvme/docker/data-root"
}
... // The same or similar lines repeat 7 times

After you confirm that the runtime paths are properly set to /opt/dlami/nvme/*, you're ready to build and run Docker containers with Enroot and Pyxis.
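
You can also build your own Docker image on a compute node, convert it into an Enroot squash file, and pass that file to Pyxis. The following is a minimal sketch under the assumption that a Dockerfile exists in the current directory; the image name my-image and the command are placeholders, and the same pattern is used by the docker_build.sh script later in this topic.

# Build a local image, import it as an Enroot squash file, and run it with Pyxis.
$ docker build -t my-image:latest .
$ enroot import -o my-image.sqsh dockerd://my-image:latest
$ srun --container-image ./my-image.sqsh cat /etc/os-release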

To test Docker with Slurm

  1. On your compute node, try the following commands to check if Docker and Enroot are properly installed.

    $ docker --help
    $ enroot --help
  2. Test whether Pyxis and Enroot are installed correctly by running one of the NVIDIA CUDA Ubuntu images.

    $ srun --container-image=nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY nvidia-smi
    pyxis: importing docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
    pyxis: imported docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
    DAY MMM DD HH:MM:SS YYYY
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: XX.YY    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
    | N/A   40C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

    You can also test it by creating a script and running an sbatch command as follows.

    $ cat <<EOF >> container-test.sh
    #!/bin/bash
    #SBATCH --container-image=nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
    nvidia-smi
    EOF
    $ sbatch container-test.sh
    pyxis: importing docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
    pyxis: imported docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
    DAY MMM DD HH:MM:SS YYYY
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: XX.YY    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
    | N/A   40C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
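
    Pyxis can also mount host directories into the container through its --container-mounts option, which the launch script later in this topic uses to mount log and checkpoint paths. The following is a minimal sketch; the /fsx host path and the command are placeholders.

    $ srun --container-image=nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY \
        --container-mounts=/fsx:/fsx \
        ls /fsx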

To run a test Slurm job with Docker

After you have completed setting up Slurm with Docker, you can bring any pre-built Docker image and run it using Slurm on SageMaker HyperPod. The following sample use case walks you through how to run a training job using Docker and Slurm on SageMaker HyperPod. It shows an example job of model-parallel training of the Llama 2 model with the SageMaker model parallelism (SMP) library.

  1. If you want to use one of the pre-built ECR images distributed by SageMaker or the AWS Deep Learning Containers (DLC), make sure that you give your HyperPod cluster permission to pull ECR images through the IAM role for SageMaker HyperPod. If you use your own or an open source Docker image, you can skip this step. Add the following permissions to the IAM role for SageMaker HyperPod; a sketch of attaching them with the AWS CLI follows this procedure. In this tutorial, we use the SMP Docker image pre-packaged with the SMP library.

    { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "ecr:BatchCheckLayerAvailability", "ecr:BatchGetImage", "ecr-public:*", "ecr:GetDownloadUrlForLayer", "ecr:GetAuthorizationToken", "sts:*" ], "Resource": "*" } ] }
  2. On the compute node, clone the repository and go to the folder that provides the example scripts of training with SMP.

    $ git clone https://github.com/aws-samples/awsome-distributed-training/
    $ cd awsome-distributed-training/3.test_cases/17.SM-modelparallelv2
  3. In this tutorial, run the sample script docker_build.sh, which pulls the SMP Docker image, builds a Docker container from it, and imports the container as an Enroot runtime. You can modify the script as needed.

    $ cat docker_build.sh
    #!/usr/bin/env bash
    region=us-west-2
    dlc_account_id=658645717510
    aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com
    docker build -t smpv2 .
    enroot import -o smpv2.sqsh dockerd://smpv2:latest
    $ bash docker_build.sh
  4. Create a batch script to launch the training job using sbatch. In this tutorial, the provided sample script launch_training_enroot.sh launches a model-parallel training job of the 70-billion-parameter Llama 2 model with a synthetic dataset on 8 compute nodes. A set of training scripts is provided at 3.test_cases/17.SM-modelparallelv2/scripts, and launch_training_enroot.sh uses train_external.py as the entrypoint script.

    Important

    To use a Docker container on SageMaker HyperPod, you must mount the /var/log directory from the host machine, which in this case is the HyperPod compute node, onto the /var/log directory in the container. You can set this up by adding the following variable for Enroot.

    "${HYPERPOD_PATH:="/var/log/aws/clusters":"/var/log/aws/clusters"}"
    $ cat launch_training_enroot.sh
    #!/bin/bash
    # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
    # SPDX-License-Identifier: MIT-0

    #SBATCH --nodes=8 # number of nodes to use, 2 p4d(e) = 16 A100 GPUs
    #SBATCH --job-name=smpv2_llama # name of your job
    #SBATCH --exclusive # job has exclusive use of the resource, no sharing
    #SBATCH --wait-all-nodes=1

    set -ex;

    ###########################
    ###### User Variables #####
    ###########################

    #########################
    model_type=llama_v2
    model_size=70b

    # Toggle this to use synthetic data
    use_synthetic_data=1

    # To run training on your own data set Training/Test Data path -> Change this to the tokenized dataset path in Fsx. Acceptable formats are huggingface (arrow) and Jsonlines.
    # Also change the use_synthetic_data to 0
    export TRAINING_DIR=/fsx/path_to_data
    export TEST_DIR=/fsx/path_to_data

    export CHECKPOINT_DIR=$(pwd)/checkpoints

    # Variables for Enroot
    : "${IMAGE:=$(pwd)/smpv2.sqsh}"
    : "${HYPERPOD_PATH:="/var/log/aws/clusters":"/var/log/aws/clusters"}" # This is needed for validating its hyperpod cluster
    : "${TRAIN_DATA_PATH:=$TRAINING_DIR:$TRAINING_DIR}"
    : "${TEST_DATA_PATH:=$TEST_DIR:$TEST_DIR}"
    : "${CHECKPOINT_PATH:=$CHECKPOINT_DIR:$CHECKPOINT_DIR}"

    ###########################
    ## Environment Variables ##
    ###########################

    #export NCCL_SOCKET_IFNAME=en
    export NCCL_ASYNC_ERROR_HANDLING=1
    export NCCL_PROTO="simple"
    export NCCL_SOCKET_IFNAME="^lo,docker"
    export RDMAV_FORK_SAFE=1
    export FI_EFA_USE_DEVICE_RDMA=1
    export NCCL_DEBUG_SUBSYS=off
    export NCCL_DEBUG="INFO"
    export SM_NUM_GPUS=8
    export GPU_NUM_DEVICES=8
    export FI_EFA_SET_CUDA_SYNC_MEMOPS=0

    # async runtime error ...
    export CUDA_DEVICE_MAX_CONNECTIONS=1

    #########################
    ## Command and Options ##
    #########################

    if [ "$model_size" == "7b" ]; then
        HIDDEN_WIDTH=4096
        NUM_LAYERS=32
        NUM_HEADS=32
        LLAMA_INTERMEDIATE_SIZE=11008
        DEFAULT_SHARD_DEGREE=8
    # More Llama model size options
    elif [ "$model_size" == "70b" ]; then
        HIDDEN_WIDTH=8192
        NUM_LAYERS=80
        NUM_HEADS=64
        LLAMA_INTERMEDIATE_SIZE=28672
        # Reduce for better perf on p4de
        DEFAULT_SHARD_DEGREE=64
    fi

    if [ -z "$shard_degree" ]; then
        SHARD_DEGREE=$DEFAULT_SHARD_DEGREE
    else
        SHARD_DEGREE=$shard_degree
    fi

    if [ -z "$LLAMA_INTERMEDIATE_SIZE" ]; then
        LLAMA_ARGS=""
    else
        LLAMA_ARGS="--llama_intermediate_size $LLAMA_INTERMEDIATE_SIZE "
    fi

    if [ $use_synthetic_data == 1 ]; then
        echo "using synthetic data"
        declare -a ARGS=(
            --container-image $IMAGE
            --container-mounts $HYPERPOD_PATH,$CHECKPOINT_PATH
        )
    else
        echo "using real data...."
        declare -a ARGS=(
            --container-image $IMAGE
            --container-mounts $HYPERPOD_PATH,$TRAIN_DATA_PATH,$TEST_DATA_PATH,$CHECKPOINT_PATH
        )
    fi

    declare -a TORCHRUN_ARGS=(
        # change this to match the number of gpus per node:
        --nproc_per_node=8 \
        --nnodes=$SLURM_JOB_NUM_NODES \
        --rdzv_id=$SLURM_JOB_ID \
        --rdzv_backend=c10d \
        --rdzv_endpoint=$(hostname) \
    )

    srun -l "${ARGS[@]}" torchrun "${TORCHRUN_ARGS[@]}" /path_to/train_external.py \
        --train_batch_size 4 \
        --max_steps 100 \
        --hidden_width $HIDDEN_WIDTH \
        --num_layers $NUM_LAYERS \
        --num_heads $NUM_HEADS \
        ${LLAMA_ARGS} \
        --shard_degree $SHARD_DEGREE \
        --model_type $model_type \
        --profile_nsys 1 \
        --use_smp_implementation 1 \
        --max_context_width 4096 \
        --tensor_parallel_degree 1 \
        --use_synthetic_data $use_synthetic_data \
        --training_dir $TRAINING_DIR \
        --test_dir $TEST_DIR \
        --dataset_type hf \
        --checkpoint_dir $CHECKPOINT_DIR \
        --checkpoint_freq 100

    $ sbatch launch_training_enroot.sh
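
If you prefer to attach the permissions from step 1 with the AWS CLI instead of the console, the following is one way to add them as an inline policy. The role name MyHyperPodExecutionRole and the file name smp-ecr-policy.json are placeholders; substitute the IAM role that your HyperPod cluster uses and the file where you saved the policy document from step 1.

# Attach the JSON policy from step 1 (saved as smp-ecr-policy.json, a placeholder file name)
# as an inline policy on the IAM role used by your HyperPod cluster.
$ aws iam put-role-policy \
    --role-name MyHyperPodExecutionRole \
    --policy-name HyperPodPullSMPImages \
    --policy-document file://smp-ecr-policy.json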

To find the downloadable code examples, see Run a model-parallel training job using the SageMaker model parallelism library, Docker and Enroot with Slurm in the Awsome Distributed Training GitHub repository. For more information about distributed training with a Slurm cluster on SageMaker HyperPod, proceed to the next topic at Run distributed training workloads with Slurm on HyperPod.