Getting started with Amazon EKS support in SageMaker HyperPod

In addition to the general Prerequisites for using SageMaker HyperPod, check the following requirements and considerations for orchestrating SageMaker HyperPod clusters using Amazon EKS.

Requirements

Note

Before creating a HyperPod cluster, you need a running Amazon EKS cluster that is configured with a VPC and has the required packages installed using Helm.

  • If using the SageMaker AI console, you can create an Amazon EKS cluster within the HyperPod cluster console page. For more information, see Create a SageMaker HyperPod cluster.

  • If using the AWS CLI, create the Amazon EKS cluster first, and then create the HyperPod cluster to associate with it. For more information, see Create an Amazon EKS cluster in the Amazon EKS User Guide.

When provisioning your Amazon EKS cluster, consider the following:

  1. Kubernetes version support

    • SageMaker HyperPod supports Kubernetes versions 1.28, 1.29, and 1.30.

  2. Amazon EKS cluster authentication mode

    • The Amazon EKS cluster authentication modes supported by SageMaker HyperPod are API and API_AND_CONFIG_MAP.

  3. Networking

    • SageMaker HyperPod requires the Amazon VPC Container Network Interface (CNI) plugin version 1.18.3 or later.

      Note

      The Amazon VPC CNI plugin for Kubernetes is the only CNI supported by SageMaker HyperPod.

    • The subnets in your VPC that you use for HyperPod clusters must be private.

  4. IAM roles

  5. Amazon EKS cluster add-ons

    • You can continue using the add-ons provided by Amazon EKS, such as kube-proxy, CoreDNS, the Amazon VPC Container Network Interface (CNI) plugin, Amazon EKS Pod Identity, the GuardDuty agent, the Amazon FSx Container Storage Interface (CSI) driver, the Mountpoint for Amazon S3 CSI driver, the AWS Distro for OpenTelemetry, and the CloudWatch Observability agent.
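
The version, authentication mode, and CNI requirements listed above can be checked programmatically before you create a HyperPod cluster. The following is a minimal sketch, not an official validation tool. It assumes that the cluster argument is a dictionary shaped like the cluster object returned by the Amazon EKS DescribeCluster API, and that the CNI version string looks like v1.18.3-eksbuild.1. The function names check_eks_cluster and parse_addon_version are hypothetical.

```python
import re

# Values taken from the requirements listed in this topic.
SUPPORTED_K8S_VERSIONS = {"1.28", "1.29", "1.30"}
SUPPORTED_AUTH_MODES = {"API", "API_AND_CONFIG_MAP"}
MIN_VPC_CNI_VERSION = (1, 18, 3)


def parse_addon_version(addon_version):
    """Extract (major, minor, patch) from a string such as 'v1.18.3-eksbuild.1'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", addon_version)
    if match is None:
        raise ValueError(f"unrecognized add-on version: {addon_version!r}")
    return tuple(int(part) for part in match.groups())


def check_eks_cluster(cluster, vpc_cni_version):
    """Return a list of problems found; an empty list means none were found.

    `cluster` is assumed to look like the `cluster` object in a
    DescribeCluster response (keys `version` and `accessConfig`).
    """
    problems = []

    version = cluster.get("version")
    if version not in SUPPORTED_K8S_VERSIONS:
        problems.append(f"Kubernetes version {version} is not supported")

    auth_mode = (cluster.get("accessConfig") or {}).get("authenticationMode")
    if auth_mode not in SUPPORTED_AUTH_MODES:
        problems.append(f"authentication mode {auth_mode} is not supported")

    if parse_addon_version(vpc_cni_version) < MIN_VPC_CNI_VERSION:
        problems.append(f"VPC CNI version {vpc_cni_version} is older than 1.18.3")

    return problems


if __name__ == "__main__":
    example = {"version": "1.30", "accessConfig": {"authenticationMode": "API"}}
    print(check_eks_cluster(example, "v1.18.3-eksbuild.1"))
```

If you use the AWS SDK for Python (Boto3), the inputs could come from the describe_cluster and describe_addon calls of the EKS client. This sketch does not check the subnet type or IAM roles.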

Considerations for configuring SageMaker HyperPod clusters with Amazon EKS

  • You can't mount additional EBS volumes directly to Pods running on HyperPod cluster nodes. Instead, use InstanceStorageConfigs to provision and mount additional EBS volumes to the HyperPod nodes. You can attach additional EBS volumes only to new instance groups, while creating or updating a HyperPod cluster. After you configure instance groups with these additional EBS volumes, set the local path to /opt/sagemaker in your Amazon EKS Pod configuration file to mount the volumes to your Amazon EKS Pods.

  • You can deploy the Amazon EBS CSI (Container Storage Interface) controller on HyperPod nodes. However, the Amazon EBS CSI node DaemonSet, which handles mounting and unmounting of EBS volumes, can run only on non-HyperPod instances.

  • If you use instance-type labels to define scheduling constraints, use the SageMaker AI ML instance types, which carry the ml. prefix. For example, for P5 instances, use ml.p5.48xlarge instead of p5.48xlarge.
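
The two Pod-level considerations above (the /opt/sagemaker local path and the ml. prefix on instance-type labels) can be illustrated with a small sketch that builds a Pod manifest as a Python dictionary. This is an illustration under assumptions, not the required configuration: it assumes that the local path is exposed through a hostPath volume and that nodes carry the standard Kubernetes label node.kubernetes.io/instance-type. The helper names, the volume name hyperpod-storage, and the container mount path /data are hypothetical.

```python
def with_ml_prefix(instance_type):
    """Return the SageMaker AI ML form of an instance type, for example ml.p5.48xlarge."""
    if instance_type.startswith("ml."):
        return instance_type
    return f"ml.{instance_type}"


def build_pod_manifest(name, image, instance_type, container_mount_path="/data"):
    """Build a Pod manifest that targets one instance type and mounts /opt/sagemaker.

    Assumption: a hostPath volume is used to expose the node's local path.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Scheduling constraint that uses the ml.-prefixed instance type.
            "nodeSelector": {
                "node.kubernetes.io/instance-type": with_ml_prefix(instance_type),
            },
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "volumeMounts": [
                        {"name": "hyperpod-storage", "mountPath": container_mount_path},
                    ],
                }
            ],
            # Local path on the HyperPod node where the additional volume is mounted.
            "volumes": [
                {"name": "hyperpod-storage", "hostPath": {"path": "/opt/sagemaker"}},
            ],
        },
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_pod_manifest("trainer", "example-image:latest", "p5.48xlarge"), indent=2))
```

The resulting dictionary can be serialized to JSON or YAML and applied with your usual tooling. Adjust the volume type if your cluster exposes the path differently.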

Considerations for configuring network for SageMaker HyperPod clusters with Amazon EKS

  • Each HyperPod cluster instance supports one Elastic Network Interface (ENI). For the maximum number of Pods per instance type, refer to the following table.

    Instance type       Maximum number of Pods
    ml.p4d.24xlarge     49
    ml.p4de.24xlarge    49
    ml.p5.48xlarge      49
    ml.trn1.32xlarge    49
    ml.trn1n.32xlarge   49
    ml.g5.xlarge        14
    ml.g5.2xlarge       14
    ml.g5.4xlarge       29
    ml.g5.8xlarge       29
    ml.g5.12xlarge      49
    ml.g5.16xlarge      29
    ml.g5.24xlarge      49
    ml.g5.48xlarge      49
    ml.c5.large         9
    ml.c5.xlarge        14
    ml.c5.2xlarge       14
    ml.c5.4xlarge       29
    ml.c5.9xlarge       29
    ml.c5.12xlarge      29
    ml.c5.18xlarge      49
    ml.c5.24xlarge      49
    ml.c5n.large        9
    ml.c5n.2xlarge      14
    ml.c5n.4xlarge      29
    ml.c5n.9xlarge      29
    ml.c5n.18xlarge     49
    ml.m5.large         9
    ml.m5.xlarge        14
    ml.m5.2xlarge       14
    ml.m5.4xlarge       29
    ml.m5.8xlarge       29
    ml.m5.12xlarge      29
    ml.m5.16xlarge      49
    ml.m5.24xlarge      49
    ml.t3.medium        5
    ml.t3.large         11
    ml.t3.xlarge        14
    ml.t3.2xlarge       14
    ml.g6.xlarge        14
    ml.g6.2xlarge       14
    ml.g6.4xlarge       29
    ml.g6.8xlarge       29
    ml.g6.12xlarge      29
    ml.g6.16xlarge      49
    ml.g6.24xlarge      49
    ml.g6.48xlarge      49
    ml.gr6.4xlarge      29
    ml.gr6.8xlarge      29
    ml.g6e.xlarge       14
    ml.g6e.2xlarge      14
    ml.g6e.4xlarge      29
    ml.g6e.8xlarge      29
    ml.g6e.12xlarge     29
    ml.g6e.16xlarge     49
    ml.g6e.24xlarge     49
    ml.g6e.48xlarge     49
    ml.p5e.48xlarge     49

  • Only Pods with hostNetwork = true have access to the Amazon EC2 Instance Metadata Service (IMDS) by default. To manage access to AWS credentials for Pods, use Amazon EKS Pod Identity or IAM roles for service accounts (IRSA).

  • SageMaker HyperPod clusters currently support only IPv4 addressing. IPv6 addressing is not supported at this time.
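
The per-instance Pod limits in the preceding table can be used for rough capacity planning. The following sketch estimates the smallest number of nodes of one instance type that can hold a given number of Pods. It copies only a few rows from the table for brevity, and the function name min_nodes_for_pods is hypothetical. Treat the result as a lower bound: Pods that run on every node, such as DaemonSet Pods, are not accounted for here.

```python
import math

# A few rows copied from the table above; extend as needed.
MAX_PODS_PER_INSTANCE = {
    "ml.p5.48xlarge": 49,
    "ml.g5.xlarge": 14,
    "ml.g5.4xlarge": 29,
    "ml.c5.large": 9,
    "ml.t3.medium": 5,
}


def min_nodes_for_pods(instance_type, pod_count):
    """Return the minimum number of nodes needed to fit pod_count Pods."""
    if pod_count < 0:
        raise ValueError("pod_count must not be negative")
    max_pods = MAX_PODS_PER_INSTANCE[instance_type]  # KeyError for unlisted types
    return math.ceil(pod_count / max_pods)


if __name__ == "__main__":
    print(min_nodes_for_pods("ml.g5.xlarge", 30))
```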

Considerations for using the HyperPod cluster resiliency features

  • Node auto-replacement is not supported for CPU instances.

  • The HyperPod health monitoring agent needs to be installed for node auto-recovery to work. The agent can be installed using Helm. For more information, see Install packages on the Amazon EKS cluster using Helm.

  • The HyperPod deep health check and the health monitoring agent support GPU and Trn instances.

  • SageMaker AI applies the following taint to nodes when they are undergoing deep health checks:

    effect: NoSchedule
    key: sagemaker.amazonaws.com/node-health-status
    value: Unschedulable

    Note

    You cannot add custom taints to nodes in instance groups with DeepHealthChecks turned on.
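
To tell whether a node is currently undergoing a deep health check, you can look for the taint described above in the node's spec. The following is a minimal sketch that assumes the node argument is a dictionary shaped like the JSON form of a Kubernetes Node object, where taints are listed under spec.taints. The function name is_under_deep_health_check is hypothetical.

```python
# The taint that SageMaker AI applies during deep health checks, as described above.
DEEP_HEALTH_CHECK_TAINT = {
    "key": "sagemaker.amazonaws.com/node-health-status",
    "value": "Unschedulable",
    "effect": "NoSchedule",
}


def is_under_deep_health_check(node):
    """Return True if the node carries the deep health check taint."""
    taints = (node.get("spec") or {}).get("taints") or []
    return any(
        all(taint.get(field) == expected for field, expected in DEEP_HEALTH_CHECK_TAINT.items())
        for taint in taints
    )


if __name__ == "__main__":
    node = {"spec": {"taints": [dict(DEEP_HEALTH_CHECK_TAINT)]}}
    print(is_under_deep_health_check(node))
```

A check like this can help explain why Pods stay in the Pending state on a node that otherwise appears to be available.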

Once your Amazon EKS cluster is running, and before you create your HyperPod cluster, configure the Amazon EKS cluster using the Helm package manager as instructed in Install packages on the Amazon EKS cluster using Helm.