Orchestrating SageMaker HyperPod clusters with Slurm - Amazon SageMaker AI

Slurm support in SageMaker HyperPod helps you provision resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). It accelerates development of FMs by removing the undifferentiated heavy lifting involved in building and maintaining large-scale compute clusters powered by thousands of accelerators, such as AWS Trainium and NVIDIA A100 and H100 Graphics Processing Units (GPUs). The resiliency features of SageMaker HyperPod monitor the cluster instances, and when accelerators fail, they automatically detect and replace the faulty hardware on the fly so that you can focus on running ML workloads. Additionally, with lifecycle configuration support in SageMaker HyperPod, you can customize your computing environment to suit your needs and configure it with the Amazon SageMaker AI distributed training libraries to achieve optimal performance on AWS.

Operating clusters

You can create, configure, and maintain SageMaker HyperPod clusters graphically through the console user interface (UI) and programmatically through the AWS Command Line Interface (AWS CLI) or the AWS SDK for Python (Boto3). With Amazon VPC, you can secure the cluster network and configure your cluster with resources in your VPC, such as Amazon FSx for Lustre, which provides high-throughput shared storage. You can also assign different IAM roles to cluster instance groups and limit the actions that your cluster resources and users can perform. To learn more, see SageMaker HyperPod operation.
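As a minimal sketch of the programmatic path, the following Python code assembles a request for the CreateCluster API and passes it to the Boto3 SageMaker client. The instance group names, instance types, instance counts, and the on_create.sh entry script are placeholders chosen for illustration, not values prescribed by SageMaker HyperPod. The two role parameters show how each instance group can receive its own IAM role.

```python
from __future__ import annotations

from typing import Any


def build_create_cluster_request(
    cluster_name: str,
    controller_role_arn: str,
    worker_role_arn: str,
    lifecycle_s3_uri: str,
    subnet_ids: list[str],
    security_group_ids: list[str],
) -> dict[str, Any]:
    """Assemble keyword arguments for the SageMaker CreateCluster API."""

    def group(name: str, instance_type: str, count: int, role: str) -> dict[str, Any]:
        return {
            "InstanceGroupName": name,
            "InstanceType": instance_type,
            "InstanceCount": count,
            # Lifecycle scripts are read from this S3 prefix when a node is created.
            "LifeCycleConfig": {
                "SourceS3Uri": lifecycle_s3_uri,
                "OnCreate": "on_create.sh",  # placeholder script name
            },
            "ExecutionRole": role,
        }

    return {
        "ClusterName": cluster_name,
        "InstanceGroups": [
            # Placeholder group names, instance types, and counts.
            group("controller-machine", "ml.c5.xlarge", 1, controller_role_arn),
            group("worker-group-1", "ml.p4d.24xlarge", 2, worker_role_arn),
        ],
        # Places the cluster inside your own VPC.
        "VpcConfig": {
            "Subnets": subnet_ids,
            "SecurityGroupIds": security_group_ids,
        },
    }


def create_cluster(request: dict[str, Any]) -> str:
    """Send the request to SageMaker and return the new cluster's ARN."""
    import boto3  # imported lazily so the builder above works without AWS access

    response = boto3.client("sagemaker").create_cluster(**request)
    return response["ClusterArn"]
```

Keeping the request builder separate from the API call lets you review or log the request before any resources are created.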

Configuring your ML environment

SageMaker HyperPod runs the SageMaker HyperPod DLAMI, which sets up an ML environment on the HyperPod clusters. You can further customize the DLAMI for your use case by providing lifecycle scripts. To learn more about how to set up lifecycle scripts, see Tutorial for getting started with SageMaker HyperPod and Customize SageMaker HyperPod clusters using lifecycle scripts.
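Lifecycle scripts for Slurm clusters typically read a small JSON file that maps instance groups to Slurm roles and partitions. The following sketch generates such a file. The field names follow the provisioning_parameters.json layout used by the AWS sample lifecycle scripts as best recalled here, so treat them as assumptions and check them against the scripts you actually deploy. The group and partition names are placeholders.

```python
from __future__ import annotations

import json
from typing import Any, Optional


def build_provisioning_parameters(
    controller_group: str,
    worker_partitions: dict[str, str],
    fsx_dns_name: Optional[str] = None,
    fsx_mountname: Optional[str] = None,
) -> dict[str, Any]:
    """Map instance groups to Slurm roles (field names are assumptions)."""
    params: dict[str, Any] = {
        "version": "1.0.0",
        "workload_manager": "slurm",
        "controller_group": controller_group,
        "worker_groups": [
            {"instance_group_name": group, "partition_name": partition}
            for group, partition in worker_partitions.items()
        ],
    }
    # Only reference a shared file system when both values are supplied.
    if fsx_dns_name and fsx_mountname:
        params["fsx_dns_name"] = fsx_dns_name
        params["fsx_mountname"] = fsx_mountname
    return params


def write_provisioning_parameters(params: dict[str, Any], path: str) -> None:
    """Write the parameters next to the lifecycle scripts before uploading them."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(params, handle, indent=2)
```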

Scheduling jobs

After you successfully create a HyperPod cluster, cluster users can log in to the cluster nodes (such as the head or controller node, login nodes, and worker nodes) and schedule jobs for running machine learning workloads. To learn more, see Jobs on SageMaker HyperPod clusters.
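As an illustration of scheduling a job from a cluster node, the following sketch renders a Slurm batch script and submits it with sbatch. The helper names, the job name, and the training command are hypothetical. The code assumes that it runs on a node where the Slurm client tools are installed.

```python
import shutil
import subprocess
import tempfile


def render_sbatch_script(
    job_name: str, nodes: int, command: str, tasks_per_node: int = 1
) -> str:
    """Return the text of a Slurm batch script that runs `command` with srun."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        "#SBATCH --output=%x_%j.out",  # %x = job name, %j = job ID
        "",
        f"srun {command}",
        "",
    ]
    return "\n".join(lines)


def submit(script_text: str) -> str:
    """Submit the script with sbatch and return what sbatch prints (the job ID)."""
    if shutil.which("sbatch") is None:
        raise RuntimeError("sbatch not found; run this on a cluster node")
    with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as handle:
        handle.write(script_text)
    # --parsable makes sbatch print only the job ID (plus cluster name, if any).
    result = subprocess.run(
        ["sbatch", "--parsable", handle.name],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

For example, submit(render_sbatch_script("train-llm", 2, "python train.py")) would queue a two-node job, assuming train.py exists in the working directory.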

Resiliency against hardware failures

SageMaker HyperPod runs health checks on cluster nodes and provides workload auto-resume functionality. With the cluster resiliency features of HyperPod, in clusters with more than 16 nodes, you can resume your workload from the last checkpoint you saved after faulty nodes are replaced with healthy ones. To learn more, see SageMaker HyperPod cluster resiliency.
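Resuming from the last saved checkpoint requires the training script itself to find that checkpoint when the job restarts. The following sketch shows one way to do this. The step_<number>.ckpt file naming and the helper names are hypothetical conventions for this example and are not part of SageMaker HyperPod.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical naming convention: step_000100.ckpt, step_000200.ckpt, ...
_CHECKPOINT_NAME = re.compile(r"^step_(\d+)\.ckpt$")


def latest_checkpoint(directory: str) -> Optional[Path]:
    """Return the checkpoint file with the highest step number, if any."""
    root = Path(directory)
    if not root.is_dir():
        return None
    best: Optional[Path] = None
    best_step = -1
    for path in root.iterdir():
        match = _CHECKPOINT_NAME.match(path.name)
        if match and int(match.group(1)) > best_step:
            best, best_step = path, int(match.group(1))
    return best


def starting_step(directory: str) -> int:
    """Return the step to resume from: 0 for a fresh run, else last step + 1."""
    checkpoint = latest_checkpoint(directory)
    if checkpoint is None:
        return 0
    return int(_CHECKPOINT_NAME.match(checkpoint.name).group(1)) + 1
```

Writing checkpoints to shared storage that every node can read means a replacement node sees the same files as the node it replaced.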

Logging and managing clusters

You can find SageMaker HyperPod resource utilization metrics and lifecycle logs in Amazon CloudWatch, and manage SageMaker HyperPod resources by tagging them. Each CreateCluster API call creates a distinct log stream, named in <cluster-name>-<timestamp> format. In the log stream, you can check the host names, the names of failed lifecycle scripts, and the output of the failed scripts, such as stdout and stderr. For more information, see SageMaker HyperPod cluster management.
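To inspect lifecycle logs programmatically, you can filter log streams by the cluster name prefix described above and read their events. The following sketch uses the Boto3 CloudWatch Logs calls describe_log_streams and get_log_events. The log group name is left as a parameter because this page does not state it, and the helper name is hypothetical.

```python
from __future__ import annotations

from typing import Any


def read_lifecycle_logs(
    logs_client: Any, log_group_name: str, cluster_name: str, limit: int = 50
) -> dict[str, list[str]]:
    """Return the first `limit` messages of each log stream for a cluster.

    `logs_client` is expected to behave like boto3.client("logs").
    """
    # Streams are named <cluster-name>-<timestamp>, so filter by that prefix.
    streams = logs_client.describe_log_streams(
        logGroupName=log_group_name,
        logStreamNamePrefix=f"{cluster_name}-",
    )["logStreams"]

    messages: dict[str, list[str]] = {}
    for stream in streams:
        name = stream["logStreamName"]
        events = logs_client.get_log_events(
            logGroupName=log_group_name,
            logStreamName=name,
            limit=limit,
            startFromHead=True,  # read from the oldest event forward
        )["events"]
        messages[name] = [event["message"] for event in events]
    return messages


# Usage on a machine with AWS credentials (log group name is an assumption):
#   import boto3
#   logs = read_lifecycle_logs(boto3.client("logs"), "<log-group-name>", "my-cluster")
```

This sketch reads only the first page of streams and events. Longer logs would need the nextToken and nextForwardToken values that these calls return.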

Compatible with SageMaker AI tools

Using SageMaker HyperPod, you can configure clusters with AWS-optimized collective communications libraries offered by SageMaker AI, such as the SageMaker AI distributed data parallelism (SMDDP) library. The SMDDP library implements an AllGather operation that is optimized for the AWS compute and network infrastructure of the most performant SageMaker AI machine learning instances powered by NVIDIA A100 GPUs. To learn more, see Run distributed training workloads with Slurm on HyperPod.
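In a PyTorch training script, the SMDDP library is typically enabled by importing its PyTorch module and selecting smddp as the process group backend. The following sketch falls back to NCCL when the library is not installed. The helper names are hypothetical, and the sketch assumes that the launcher (for example, srun with torchrun) sets the usual rank and world-size environment variables.

```python
import importlib.util


def choose_backend() -> str:
    """Return "smddp" when the SMDDP package is installed, else "nccl"."""
    if importlib.util.find_spec("smdistributed") is not None:
        return "smddp"
    return "nccl"


def init_distributed() -> str:
    """Initialize torch.distributed with the chosen backend and return its name."""
    import torch.distributed as dist  # requires PyTorch on the node

    backend = choose_backend()
    if backend == "smddp":
        # Importing this module registers the "smddp" backend with PyTorch.
        import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401

    dist.init_process_group(backend=backend)
    return backend
```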