Getting started with Amazon EKS support in SageMaker HyperPod
In addition to the general Prerequisites for using SageMaker HyperPod, check the following requirements and considerations for orchestrating SageMaker HyperPod clusters using Amazon EKS.
Requirements
Note
Before creating a HyperPod cluster, you need a running Amazon EKS cluster that is configured with a VPC and has the required packages installed using Helm.
-
If using the SageMaker AI console, you can create an Amazon EKS cluster within the HyperPod cluster console page. For more information, see Create a SageMaker HyperPod cluster.
-
If using the AWS CLI, create an Amazon EKS cluster first, and then create the HyperPod cluster that you associate with it. For more information, see Create an Amazon EKS cluster in the Amazon EKS User Guide.
When provisioning your Amazon EKS cluster, consider the following:
-
Kubernetes version support
-
SageMaker HyperPod supports Kubernetes versions 1.28, 1.29, and 1.30.
-
Amazon EKS cluster authentication mode
-
The Amazon EKS cluster authentication modes supported by SageMaker HyperPod are API and API_AND_CONFIG_MAP.
-
Networking
-
SageMaker HyperPod requires the Amazon VPC Container Network Interface (CNI) plugin version 1.18.3 or later.
Note
The AWS VPC CNI plugin for Kubernetes is the only CNI supported by SageMaker HyperPod.
-
The type of the subnet in your VPC must be private for HyperPod clusters.
-
IAM roles
-
Ensure the necessary IAM roles for HyperPod are set up as guided in the AWS Identity and Access Management for SageMaker HyperPod section.
-
Amazon EKS cluster add-ons
-
You can continue using the various add-ons provided by Amazon EKS such as Kube-proxy, CoreDNS, the Amazon VPC Container Network Interface (CNI) plugin, Amazon EKS pod identity, the GuardDuty agent, the Amazon FSx Container Storage Interface (CSI) driver, the Mountpoint for Amazon S3 CSI driver, the AWS Distro for OpenTelemetry, and the CloudWatch Observability agent.
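The Kubernetes version, authentication mode, and CNI version requirements above can be checked programmatically before you create a HyperPod cluster. The following Python sketch is one possible way to do that, not an official tool. It assumes you pass in a dictionary shaped like the cluster object returned by the Amazon EKS DescribeCluster API (its version and accessConfig.authenticationMode fields) and the VPC CNI add-on version as a string such as v1.18.3-eksbuild.1. The function names and the sample values are hypothetical.

```python
import re

# Values taken from the requirements listed above.
SUPPORTED_K8S_VERSIONS = {"1.28", "1.29", "1.30"}
SUPPORTED_AUTH_MODES = {"API", "API_AND_CONFIG_MAP"}
MIN_VPC_CNI_VERSION = (1, 18, 3)


def parse_cni_version(version):
    """Extract (major, minor, patch) from a string such as 'v1.18.3-eksbuild.1'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized VPC CNI version: {version!r}")
    return tuple(int(part) for part in match.groups())


def check_eks_requirements(cluster, vpc_cni_version):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    k8s_version = cluster.get("version")
    if k8s_version not in SUPPORTED_K8S_VERSIONS:
        problems.append(f"Kubernetes version {k8s_version} is not supported")

    auth_mode = cluster.get("accessConfig", {}).get("authenticationMode")
    if auth_mode not in SUPPORTED_AUTH_MODES:
        problems.append(f"Authentication mode {auth_mode} is not supported")

    if parse_cni_version(vpc_cni_version) < MIN_VPC_CNI_VERSION:
        problems.append(f"VPC CNI version {vpc_cni_version} is older than 1.18.3")

    return problems


if __name__ == "__main__":
    # Hypothetical sample input, not output from a real cluster.
    sample_cluster = {
        "version": "1.30",
        "accessConfig": {"authenticationMode": "CONFIG_MAP"},
    }
    for problem in check_eks_requirements(sample_cluster, "v1.18.1-eksbuild.1"):
        print(problem)
```

Keeping the check as a pure function over plain dictionaries means it can be exercised without AWS credentials; how you fetch the cluster description and add-on version is left to your own tooling.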
Considerations for configuring SageMaker HyperPod clusters with Amazon EKS
-
You can't mount additional EBS volumes directly to Pods running on HyperPod cluster nodes. Instead, use InstanceStorageConfigs to provision and mount additional EBS volumes to the HyperPod nodes. You can attach additional EBS volumes only to new instance groups, while creating or updating a HyperPod cluster. After you configure instance groups with these additional EBS volumes, set the local path to /opt/sagemaker in your Amazon EKS Pod configuration file to mount the volumes to your Amazon EKS Pods.
-
You can deploy the Amazon EBS CSI (Container Storage Interface) controller on HyperPod nodes. However, the Amazon EBS CSI node DaemonSet, which facilitates the mounting and unmounting of EBS volumes, can only run on non-HyperPod instances.
-
If you use instance-type labels for defining scheduling constraints, ensure that you use the SageMaker AI ML instance types, which carry the ml. prefix. For example, for P5 instances, use ml.p5.48xlarge instead of p5.48xlarge.
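To illustrate the points above about the /opt/sagemaker path and the ml. prefix, the following Python sketch builds a Pod manifest as a dictionary and prints it as JSON, a format that kubectl accepts alongside YAML. It is an illustration under assumptions rather than a reference configuration: it assumes a hostPath volume is used to expose /opt/sagemaker, and that the instance-type label is the well-known Kubernetes label node.kubernetes.io/instance-type. The Pod name, container image, volume name, and in-container mount path are hypothetical placeholders.

```python
import json

HYPERPOD_VOLUME_PATH = "/opt/sagemaker"  # path named in the text above
INSTANCE_TYPE_LABEL = "node.kubernetes.io/instance-type"  # assumed label key


def to_hyperpod_instance_type(instance_type):
    """Add the 'ml.' prefix if the instance type does not already have it."""
    if instance_type.startswith("ml."):
        return instance_type
    return f"ml.{instance_type}"


def build_pod_manifest(name, image, instance_type):
    """Build a Pod manifest that mounts /opt/sagemaker and targets one instance type."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Schedule only onto nodes of the ml.-prefixed instance type.
            "nodeSelector": {
                INSTANCE_TYPE_LABEL: to_hyperpod_instance_type(instance_type)
            },
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # Hypothetical mount point inside the container.
                    "volumeMounts": [
                        {"name": "hyperpod-storage", "mountPath": "/data"}
                    ],
                }
            ],
            # Assumption: the node-level path is exposed through hostPath.
            "volumes": [
                {
                    "name": "hyperpod-storage",
                    "hostPath": {"path": HYPERPOD_VOLUME_PATH},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder name and image, for illustration only.
    manifest = build_pod_manifest("example-pod", "busybox", "p5.48xlarge")
    print(json.dumps(manifest, indent=2))
```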
Considerations for configuring network for SageMaker HyperPod clusters with Amazon EKS
-
Each HyperPod cluster instance supports one Elastic Network Interface (ENI). For the maximum number of Pods per instance type, refer to the following table.
Instance type        Maximum number of Pods
ml.p4d.24xlarge      49
ml.p4de.24xlarge     49
ml.p5.48xlarge       49
ml.trn1.32xlarge     49
ml.trn1n.32xlarge    49
ml.g5.xlarge         14
ml.g5.2xlarge        14
ml.g5.4xlarge        29
ml.g5.8xlarge        29
ml.g5.12xlarge       49
ml.g5.16xlarge       29
ml.g5.24xlarge       49
ml.g5.48xlarge       49
ml.c5.large          9
ml.c5.xlarge         14
ml.c5.2xlarge        14
ml.c5.4xlarge        29
ml.c5.9xlarge        29
ml.c5.12xlarge       29
ml.c5.18xlarge       49
ml.c5.24xlarge       49
ml.c5n.large         9
ml.c5n.2xlarge       14
ml.c5n.4xlarge       29
ml.c5n.9xlarge       29
ml.c5n.18xlarge      49
ml.m5.large          9
ml.m5.xlarge         14
ml.m5.2xlarge        14
ml.m5.4xlarge        29
ml.m5.8xlarge        29
ml.m5.12xlarge       29
ml.m5.16xlarge       49
ml.m5.24xlarge       49
ml.t3.medium         5
ml.t3.large          11
ml.t3.xlarge         14
ml.t3.2xlarge        14
ml.g6.xlarge         14
ml.g6.2xlarge        14
ml.g6.4xlarge        29
ml.g6.8xlarge        29
ml.g6.12xlarge       29
ml.g6.16xlarge       49
ml.g6.24xlarge       49
ml.g6.48xlarge       49
ml.gr6.4xlarge       29
ml.gr6.8xlarge       29
ml.g6e.xlarge        14
ml.g6e.2xlarge       14
ml.g6e.4xlarge       29
ml.g6e.8xlarge       29
ml.g6e.12xlarge      29
ml.g6e.16xlarge      49
ml.g6e.24xlarge      49
ml.g6e.48xlarge      49
ml.p5e.48xlarge      49
-
Only Pods with hostNetwork = true have access to the Amazon EC2 Instance Metadata Service (IMDS) by default. Use Amazon EKS Pod Identity or IAM roles for service accounts (IRSA) to manage access to AWS credentials for Pods.
-
SageMaker HyperPod clusters currently support only IPv4 IP addressing. IPv6 IP addressing is not supported at this time.
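The per-instance Pod limits listed above can be used for rough capacity planning. The following Python sketch sums those limits across the instance types in a cluster. For brevity it copies only a few rows of the table, and the function name and the sample cluster composition are hypothetical.

```python
# A few rows copied from the Pod limit table above; extend as needed.
MAX_PODS = {
    "ml.p5.48xlarge": 49,
    "ml.g5.xlarge": 14,
    "ml.g5.4xlarge": 29,
    "ml.c5.large": 9,
    "ml.t3.medium": 5,
}


def total_pod_capacity(instance_counts):
    """Sum the Pod limits, given a mapping of instance type to instance count."""
    total = 0
    for instance_type, count in instance_counts.items():
        if instance_type not in MAX_PODS:
            raise KeyError(f"No Pod limit recorded for {instance_type}")
        total += MAX_PODS[instance_type] * count
    return total


if __name__ == "__main__":
    # Hypothetical cluster: 2 x ml.p5.48xlarge and 3 x ml.g5.xlarge.
    print(total_pod_capacity({"ml.p5.48xlarge": 2, "ml.g5.xlarge": 3}))  # 2*49 + 3*14 = 140
```

Treat the result as an upper bound: Pods that run on every node, such as DaemonSet Pods, generally count toward the same per-node limit.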
Considerations for using the HyperPod cluster resiliency features
-
Node auto-replacement is not supported for CPU instances.
-
The HyperPod health monitoring agent needs to be installed for node auto-recovery to work. The agent can be installed using Helm. For more information, see Install packages on the Amazon EKS cluster using Helm.
-
The HyperPod deep health check and the health monitoring agent support GPU and Trn instances.
-
SageMaker AI applies the following taint to nodes when they are undergoing deep health checks:
effect: NoSchedule
key: sagemaker.amazonaws.com/node-health-status
value: Unschedulable
Note
You cannot add custom taints to nodes in instance groups with DeepHealthChecks turned on.
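To show how the deep health check taint might be detected, the following Python sketch inspects a Node object for it. It assumes the node is supplied as a dictionary in the standard Kubernetes Node shape, with taints under spec.taints, for example as parsed from the JSON output of kubectl get node. The function name and the sample node are hypothetical.

```python
# Taint fields as listed in the text above.
DEEP_HEALTH_CHECK_TAINT = {
    "key": "sagemaker.amazonaws.com/node-health-status",
    "value": "Unschedulable",
    "effect": "NoSchedule",
}


def has_deep_health_check_taint(node):
    """Return True if the node carries a taint matching all three fields."""
    taints = node.get("spec", {}).get("taints") or []
    return any(
        all(
            taint.get(field) == expected
            for field, expected in DEEP_HEALTH_CHECK_TAINT.items()
        )
        for taint in taints
    )


if __name__ == "__main__":
    # Hypothetical node object, not output from a real cluster.
    sample_node = {
        "metadata": {"name": "example-node"},
        "spec": {"taints": [dict(DEEP_HEALTH_CHECK_TAINT)]},
    }
    print(has_deep_health_check_taint(sample_node))  # True
```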
Once your Amazon EKS cluster is running, configure your cluster using the Helm package manager as instructed in Install packages on the Amazon EKS cluster using Helm before creating your HyperPod cluster.