Run Docker containers on a Slurm compute node on HyperPod
To run Docker containers with Slurm on SageMaker HyperPod, you need to use Enroot and Pyxis. Enroot imports Docker images into a container runtime that Slurm can launch, and Pyxis schedules that runtime as a Slurm job through the srun command: srun --container-image=docker/image:tag.
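For example, assuming you have access to a public image such as ubuntu:22.04 (the image tag here is only an illustration), a minimal smoke test on a single compute node looks like the following. Pyxis pulls the image, Enroot unpacks it, and the command runs inside the container.
# Minimal sketch; replace ubuntu:22.04 with any image your nodes can pull.
$ srun -N 1 --container-image=ubuntu:22.04 cat /etc/os-release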
Tip
The Docker, Enroot, and Pyxis packages should be installed during cluster creation as part of running the lifecycle scripts, as guided in Start with base lifecycle scripts provided by HyperPod. Use the Config class in the base lifecycle script config.py with the boolean parameter for installing the packages set to True (enable_docker_enroot_pyxis=True). This flag is parsed by lifecycle_script.py, which calls the install_docker.sh and install_enroot_pyxis.sh scripts from the utils folder. The installation scripts also detect whether the instances have local NVMe storage and, if so, set the root paths for Docker and Enroot to /opt/dlami/nvme.
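You can also verify that the Pyxis plugin is active on your cluster by checking that it registered its container options with srun. This is an optional sanity check, not part of the lifecycle scripts themselves.
# If Pyxis is loaded as a Slurm SPANK plugin, srun gains --container-* options.
# No output here means the plugin is not active.
$ srun --help | grep -- --container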
The default root volume of any fresh instance is mounted to /tmp with only a 100 GB EBS volume, which runs out if your workload involves training LLMs and therefore large Docker containers. If you use instance families such as P and G with local NVMe storage, make sure that you use the NVMe storage attached at /opt/dlami/nvme; the installation scripts take care of the configuration process.
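Before pulling large images, you can optionally confirm that the NVMe volume is mounted and has enough free space on every compute node. The node count of 8 below matches the example cluster used in this guide; adjust it to your cluster size.
# Optional capacity check on the local NVMe storage of each node.
$ srun -N 8 df -h /opt/dlami/nvme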
To check if the root paths are set up properly
On a compute node of your Slurm cluster on SageMaker HyperPod, run the following commands to make sure that the lifecycle script worked properly and that the root volume of each node is set to /opt/dlami/nvme/*. The following commands show examples of checking the Enroot runtime path and the Docker data root path for 8 compute nodes of a Slurm cluster.
$ srun -N 8 cat /etc/enroot/enroot.conf | grep "ENROOT_RUNTIME_PATH"
ENROOT_RUNTIME_PATH /opt/dlami/nvme/tmp/enroot/user-$(id -u)
... // The same or similar lines repeat 7 times
$ srun -N 8 cat /etc/docker/daemon.json
{
    "data-root": "/opt/dlami/nvme/docker/data-root"
}
... // The same or similar lines repeat 7 times
After you confirm that the runtime paths are properly set to /opt/dlami/nvme/*, you're ready to build and run Docker containers with Enroot and Pyxis.
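The general pattern, which the docker_build.sh example later on this page also follows, is to build a local image with Docker, import it into an Enroot squashfs file, and point srun at that file. The image name my-image and the Dockerfile location below are placeholders for illustration.
# Build a local image from a Dockerfile in the current directory (placeholder name).
$ docker build -t my-image:latest .
# Convert the image from the local Docker daemon into an Enroot squashfs file.
$ enroot import -o my-image.sqsh dockerd://my-image:latest
# Run a command inside the container through Slurm and Pyxis.
$ srun -N 1 --container-image=$(pwd)/my-image.sqsh hostname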
To test Docker with Slurm
-
On your compute node, try the following commands to check if Docker and Enroot are properly installed.
$ docker --help
$ enroot --help
-
Test whether Pyxis and Enroot are installed correctly by running one of the NVIDIA CUDA Ubuntu images. If you plan to reuse the image across jobs, see the import sketch after this list for a way to cache it as a squashfs file.
$ srun --container-image=nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY nvidia-smi
pyxis: importing docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
pyxis: imported docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
DAY MMM DD HH:MM:SS YYYY
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: XX.YY    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   40C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
You can also test it by creating a script and running an sbatch command as follows.
$ cat <<EOF >> container-test.sh
#!/bin/bash
#SBATCH --container-image=nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
nvidia-smi
EOF
$ sbatch container-test.sh
pyxis: importing docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
pyxis: imported docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
DAY MMM DD HH:MM:SS YYYY
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: XX.YY    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   40C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
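If you plan to reuse the same registry image across many jobs, as referenced in the Pyxis test above, you can import it once into a squashfs file with Enroot and pass that file to srun, so the image isn't pulled again on every submission. The file name cuda.sqsh is just an example.
# Import the registry image once; docker:// is Enroot's registry URI scheme.
$ enroot import -o cuda.sqsh docker://nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
# Run from the cached squashfs file instead of pulling the image again.
$ srun --container-image=$(pwd)/cuda.sqsh nvidia-smi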
To run a test Slurm job with Docker
After you have completed setting up Slurm with Docker, you can bring any pre-built Docker image and run it using Slurm on SageMaker HyperPod. The following is a sample use case that walks you through how to run a training job using Docker and Slurm on SageMaker HyperPod. It shows an example job of model-parallel training of the Llama 2 model with the SageMaker model parallelism (SMP) library.
-
If you want to use one of the pre-built ECR images distributed by SageMaker or DLC, make sure that you give your HyperPod cluster the permissions to pull ECR images through the IAM role for SageMaker HyperPod. If you use your own or an open-source Docker image, you can skip this step. Add the following permissions to the IAM role for SageMaker HyperPod; for one way to attach them with the AWS CLI, see the sketch after this procedure. In this tutorial, we use the SMP Docker image pre-packaged with the SMP library.
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "ecr:BatchCheckLayerAvailability", "ecr:BatchGetImage", "ecr-public:*", "ecr:GetDownloadUrlForLayer", "ecr:GetAuthorizationToken", "sts:*" ], "Resource": "*" } ] }
-
On the compute node, clone the repository and go to the folder that provides the example scripts for training with SMP.
$ git clone https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training/3.test_cases/17.SM-modelparallelv2
-
In this tutorial, run the sample script docker_build.sh, which pulls the SMP Docker image, builds the Docker container, and imports it as an Enroot runtime. You can modify this as you want.
$ cat docker_build.sh
#!/usr/bin/env bash
region=us-west-2
dlc_account_id=658645717510
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com
docker build -t smpv2 .
enroot import -o smpv2.sqsh dockerd://smpv2:latest
$ bash docker_build.sh
-
Create a batch script to launch a training job using sbatch. In this tutorial, the provided sample script launch_training_enroot.sh launches a model-parallel training job of the 70-billion-parameter Llama 2 model with a synthetic dataset on 8 compute nodes. A set of training scripts is provided at 3.test_cases/17.SM-modelparallelv2/scripts, and launch_training_enroot.sh takes train_external.py as the entrypoint script.
Important
To use a Docker container on SageMaker HyperPod, you must mount the /var/log directory from the host machine, which is the HyperPod compute node in this case, onto the /var/log directory in the container. You can set it up by adding the following variable for Enroot.
"${HYPERPOD_PATH:="/var/log/aws/clusters":"/var/log/aws/clusters"}"
$ cat launch_training_enroot.sh
#!/bin/bash

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

#SBATCH --nodes=8 # number of nodes to use, 2 p4d(e) = 16 A100 GPUs
#SBATCH --job-name=smpv2_llama # name of your job
#SBATCH --exclusive # job has exclusive use of the resource, no sharing
#SBATCH --wait-all-nodes=1

set -ex;

###########################
###### User Variables #####
###########################

#########################
model_type=llama_v2
model_size=70b

# Toggle this to use synthetic data
use_synthetic_data=1

# To run training on your own data set Training/Test Data path -> Change this to the tokenized dataset path in Fsx. Acceptable formats are huggingface (arrow) and Jsonlines.
# Also change the use_synthetic_data to 0
export TRAINING_DIR=/fsx/path_to_data
export TEST_DIR=/fsx/path_to_data

export CHECKPOINT_DIR=$(pwd)/checkpoints

# Variables for Enroot
: "${IMAGE:=$(pwd)/smpv2.sqsh}"
: "${HYPERPOD_PATH:="/var/log/aws/clusters":"/var/log/aws/clusters"}" # This is needed for validating its hyperpod cluster
: "${TRAIN_DATA_PATH:=$TRAINING_DIR:$TRAINING_DIR}"
: "${TEST_DATA_PATH:=$TEST_DIR:$TEST_DIR}"
: "${CHECKPOINT_PATH:=$CHECKPOINT_DIR:$CHECKPOINT_DIR}"

###########################
## Environment Variables ##
###########################

#export NCCL_SOCKET_IFNAME=en
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_PROTO="simple"
export NCCL_SOCKET_IFNAME="^lo,docker"
export RDMAV_FORK_SAFE=1
export FI_EFA_USE_DEVICE_RDMA=1
export NCCL_DEBUG_SUBSYS=off
export NCCL_DEBUG="INFO"
export SM_NUM_GPUS=8
export GPU_NUM_DEVICES=8
export FI_EFA_SET_CUDA_SYNC_MEMOPS=0

# async runtime error ...
export CUDA_DEVICE_MAX_CONNECTIONS=1

#########################
## Command and Options ##
#########################

if [ "$model_size" == "7b" ]; then
    HIDDEN_WIDTH=4096
    NUM_LAYERS=32
    NUM_HEADS=32
    LLAMA_INTERMEDIATE_SIZE=11008
    DEFAULT_SHARD_DEGREE=8
# More Llama model size options
elif [ "$model_size" == "70b" ]; then
    HIDDEN_WIDTH=8192
    NUM_LAYERS=80
    NUM_HEADS=64
    LLAMA_INTERMEDIATE_SIZE=28672
    # Reduce for better perf on p4de
    DEFAULT_SHARD_DEGREE=64
fi

if [ -z "$shard_degree" ]; then
    SHARD_DEGREE=$DEFAULT_SHARD_DEGREE
else
    SHARD_DEGREE=$shard_degree
fi

if [ -z "$LLAMA_INTERMEDIATE_SIZE" ]; then
    LLAMA_ARGS=""
else
    LLAMA_ARGS="--llama_intermediate_size $LLAMA_INTERMEDIATE_SIZE "
fi

if [ $use_synthetic_data == 1 ]; then
    echo "using synthetic data"
    declare -a ARGS=(
        --container-image $IMAGE
        --container-mounts $HYPERPOD_PATH,$CHECKPOINT_PATH
    )
else
    echo "using real data...."
    declare -a ARGS=(
        --container-image $IMAGE
        --container-mounts $HYPERPOD_PATH,$TRAIN_DATA_PATH,$TEST_DATA_PATH,$CHECKPOINT_PATH
    )
fi

declare -a TORCHRUN_ARGS=(
    # change this to match the number of gpus per node:
    --nproc_per_node=8 \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$(hostname) \
)

srun -l "${ARGS[@]}" torchrun "${TORCHRUN_ARGS[@]}" /path_to/train_external.py \
    --train_batch_size 4 \
    --max_steps 100 \
    --hidden_width $HIDDEN_WIDTH \
    --num_layers $NUM_LAYERS \
    --num_heads $NUM_HEADS \
    ${LLAMA_ARGS} \
    --shard_degree $SHARD_DEGREE \
    --model_type $model_type \
    --profile_nsys 1 \
    --use_smp_implementation 1 \
    --max_context_width 4096 \
    --tensor_parallel_degree 1 \
    --use_synthetic_data $use_synthetic_data \
    --training_dir $TRAINING_DIR \
    --test_dir $TEST_DIR \
    --dataset_type hf \
    --checkpoint_dir $CHECKPOINT_DIR \
    --checkpoint_freq 100 \

$ sbatch launch_training_enroot.sh
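After you submit the job with sbatch, you can monitor it with standard Slurm commands. Because the sample script doesn't set #SBATCH --output, sbatch writes the job output to a slurm-<job ID>.out file in the directory you submitted from; the job ID placeholder below is for illustration.
# Check that the job is running and on how many nodes.
$ squeue
# Follow the training logs, such as the pyxis image import messages shown in the earlier test.
$ tail -f slurm-<job ID>.out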
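As mentioned in the first step of this procedure, one way to attach the ECR pull permissions is to save the policy JSON to a file and add it as an inline policy on the IAM role for SageMaker HyperPod. The role name, policy name, and file name below are placeholders.
# Attach the policy document as an inline policy on the HyperPod IAM role.
$ aws iam put-role-policy \
    --role-name MyHyperPodExecutionRole \
    --policy-name EcrPullAccess \
    --policy-document file://ecr-pull-policy.json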
To find the downloadable code examples, see Run a model-parallel training job using the SageMaker model parallelism library, Docker and Enroot with Slurm.