Orchestrating SageMaker HyperPod clusters with Amazon EKS

SageMaker HyperPod is a SageMaker-managed service for large-scale training of foundation models on long-running, resilient compute clusters. It integrates with Amazon EKS, which orchestrates the HyperPod compute resources. With Amazon EKS clusters and the HyperPod resiliency features, which check for various hardware failures and automatically recover faulty nodes, you can run training jobs at scale for weeks or months without interruption.

Key features for cluster admin users include the following.

For data scientist users, EKS support in HyperPod enables the following.

  • Running containerized workloads for training foundation models on the HyperPod cluster

  • Running inference on the EKS cluster, leveraging the integration between HyperPod and EKS

  • Leveraging the job auto-resume capability for Kubeflow PyTorch training (PyTorchJob)
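As a rough illustration of the last item, the following Python sketch builds a Kubeflow PyTorchJob manifest with auto-resume turned on. It is a minimal sketch under stated assumptions, not a reference configuration: the two annotation keys, the job name, the container image, and the `train.py` entry point are assumptions made for this example and should be checked against the current HyperPod and Kubeflow documentation. The helper `build_pytorchjob` is hypothetical.

```python
import json


def build_pytorchjob(name, image, workers, max_retries=2):
    """Return a Kubeflow PyTorchJob manifest as a plain dict (illustrative only)."""
    container = {
        "name": "pytorch",  # default container name used by the Kubeflow training operator
        "image": image,  # hypothetical image URI supplied by the caller
        "command": ["python", "train.py"],  # hypothetical entry point
    }
    replica = {
        "restartPolicy": "OnFailure",
        "template": {"spec": {"containers": [container]}},
    }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {
            "name": name,
            "annotations": {
                # Assumed annotation keys for HyperPod job auto-resume;
                # verify them before relying on this manifest.
                "sagemaker.amazonaws.com/enable-job-auto-resume": "true",
                "sagemaker.amazonaws.com/job-max-retry-count": str(max_retries),
            },
        },
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": {"replicas": 1, **replica},
                "Worker": {"replicas": workers, **replica},
            }
        },
    }


manifest = build_pytorchjob("demo-train", "example.com/train:latest", workers=3)
# kubectl accepts JSON as well as YAML, so the output can be saved and applied.
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and submitted with `kubectl apply -f`, assuming the Kubeflow training operator is installed on the EKS cluster.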

The high-level architecture of Amazon EKS support in HyperPod involves a 1-to-1 mapping between an EKS cluster (control plane) and a HyperPod cluster (worker nodes) within a VPC, as shown in the following diagram.

[Diagram: EKS and HyperPod VPC architecture with control plane, cluster nodes, and AWS services.]
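The 1-to-1 mapping can be checked programmatically. The sketch below assumes that each HyperPod cluster description carries the ARN of its orchestrating EKS cluster under `Orchestrator.Eks.ClusterArn`, which is the shape the SageMaker `DescribeCluster` response is understood to use. The sample ARNs and the helper names `eks_arn` and `shared_eks_clusters` are hypothetical, and the code works on plain dictionaries rather than calling AWS.

```python
from collections import Counter


def eks_arn(cluster_description):
    """Extract the orchestrating EKS cluster ARN, or None if it is absent."""
    return (
        cluster_description.get("Orchestrator", {})
        .get("Eks", {})
        .get("ClusterArn")
    )


def shared_eks_clusters(cluster_descriptions):
    """Return EKS ARNs referenced by more than one HyperPod cluster."""
    counts = Counter(eks_arn(d) for d in cluster_descriptions)
    counts.pop(None, None)  # ignore descriptions without an EKS orchestrator
    return sorted(arn for arn, n in counts.items() if n > 1)


# Hypothetical descriptions, shaped like DescribeCluster responses.
clusters = [
    {
        "ClusterName": "hyperpod-a",
        "Orchestrator": {
            "Eks": {"ClusterArn": "arn:aws:eks:us-west-2:111122223333:cluster/eks-a"}
        },
    },
    {
        "ClusterName": "hyperpod-b",
        "Orchestrator": {
            "Eks": {"ClusterArn": "arn:aws:eks:us-west-2:111122223333:cluster/eks-b"}
        },
    },
]

# An empty list means no EKS cluster is shared, so the mapping is 1-to-1.
print(shared_eks_clusters(clusters))
```

In practice the input dictionaries would come from an AWS SDK call rather than hand-written samples; the check itself stays the same.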