Create a SageMaker HyperPod cluster


Follow these instructions to create a new SageMaker HyperPod cluster using the SageMaker HyperPod console UI.

  1. Open the Amazon SageMaker AI console at https://console.aws.amazon.com/sagemaker/.

  2. In the left navigation pane, choose HyperPod Clusters, and then choose Cluster Management.

  3. In the SageMaker HyperPod landing page, choose Create HyperPod cluster.

  4. From the drop-down menu of Create HyperPod cluster, choose Orchestrated by Amazon EKS.

  5. From the Amazon EKS cluster list, choose the EKS cluster with which you want to configure the new HyperPod cluster.

    1. If you need to create a new EKS cluster, choose Create EKS cluster. You can create it from the EKS cluster list page without having to open the Amazon EKS console.

      Note

      The VPC subnet you choose for HyperPod must be private.

    2. After submitting a new EKS cluster creation request, wait until the EKS cluster becomes Active.

    3. Install the Helm chart as instructed in Install packages on the Amazon EKS cluster using Helm.

    4. After the EKS cluster creation has completed, choose Create HyperPod cluster and then Orchestrated by Amazon EKS again. You should be able to find and select the new EKS cluster. To proceed, choose Select.

  6. On the Configure a new HyperPod cluster page, set up the basic information for the cluster, such as its name, the HyperPod cluster resiliency options, and tags.

  7. For Cluster name, specify a name for the new cluster.

  8. For Cluster resiliency - node recovery, specify Automatic to enable automatic node recovery. SageMaker HyperPod replaces or reboots instances (nodes) when issues are found by the health-monitoring agent.

  9. For Tags, add key-value pairs to the new cluster so that you can manage it as an AWS resource. To learn more, see Tagging your AWS resources.

  10. In Step 2: Advanced configuration, configure the network settings for traffic within the cluster and into and out of the cluster. When a SageMaker HyperPod cluster is orchestrated with Amazon EKS, the VPC is automatically set to the one configured with the EKS cluster you selected.

  11. In Step 3: Configure instance groups, choose Create instance group. Each instance group can be configured differently, so you can create a heterogeneous cluster that consists of multiple instance groups with various instance types.

    In the Create an instance group pop-up window, configure the new instance group by following the UI guidance.

    1. For Instance group name, specify a name for the instance group.

    2. For Select instance type, choose the instance for the instance group.

    3. For Quantity, specify an integer not exceeding the instance quota for cluster usage.

    4. Prepare a lifecycle configuration script and upload it to an Amazon S3 bucket, such as s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/.

      For a quick start, download the sample script on_create.sh from the AWSome Distributed Training GitHub repository, and upload it to the S3 bucket. This script sets up the logging file /var/log/provision/provisioning.log required for CloudWatch to gather logs from Pod containers. You can also include additional setup instructions, a series of setup scripts, or commands to be executed during the HyperPod cluster provisioning stage.

    5. For S3 bucket URI for lifecycle scripts, enter the Amazon S3 path in which the lifecycle scripts are stored.

    6. For Directory path to the entrypoint script in the base Amazon S3 path, enter the file name of the lifecycle script, relative to the Amazon S3 path you entered for the lifecycle script files. If you use the provided sample script, enter on_create.sh.

    7. For IAM role, choose the IAM role you have created for SageMaker HyperPod resources, following the section IAM role for SageMaker HyperPod.

    8. Under Advanced configuration, you can set up the following optional configurations.

      1. (Optional) For Threads per core, specify 1 to disable multi-threading or 2 to enable it. To find which instance types support multi-threading, see the reference table of CPU cores and threads per CPU core per instance type in the Amazon EC2 User Guide.

      2. (Optional) For Additional instance storage configs, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group, and its default mount path is /opt/sagemaker. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify that the EBS volume is mounted correctly by running the df -h command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the Amazon EBS volumes section in the Amazon Elastic Block Store User Guide.

  12. For Deep health check, select the advanced health checks you want to run on the instances. To learn more, see Deep health checks.

  13. In Step 4: Review and create, review the configuration you set in Step 1 through Step 3, and then submit the cluster creation request.

  14. After the status of the cluster changes to InService, you can start logging in to the cluster nodes. To access the cluster nodes and start running ML workloads, see Jobs on SageMaker HyperPod clusters.
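
If you prefer to script the same setup, the console fields above map roughly onto the request for the SageMaker CreateCluster API. The following Python sketch only builds and validates that request as a plain dictionary, and polls for the InService status through a caller-supplied function; it makes no AWS calls itself. The helper names (build_instance_group, build_create_cluster_request, wait_until_in_service) are hypothetical, and the request field names reflect our understanding of the API, so check them against the current API reference before relying on them.

```python
import time
from typing import Any, Callable, Dict, List, Optional


def build_instance_group(
    name: str,
    instance_type: str,
    count: int,
    lifecycle_s3_uri: str,
    execution_role_arn: str,
    on_create: str = "on_create.sh",
    threads_per_core: Optional[int] = None,
    ebs_volume_gb: Optional[int] = None,
    deep_health_checks: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Mirror the 'Create an instance group' form as one instance group entry."""
    if count < 1:
        raise ValueError("count must be a positive integer")
    if not lifecycle_s3_uri.startswith("s3://"):
        raise ValueError("lifecycle_s3_uri must start with s3://")
    group: Dict[str, Any] = {
        "InstanceGroupName": name,
        "InstanceType": instance_type,
        "InstanceCount": count,
        # S3 location of the lifecycle scripts and the entrypoint file name.
        "LifeCycleConfig": {"SourceS3Uri": lifecycle_s3_uri, "OnCreate": on_create},
        "ExecutionRole": execution_role_arn,
    }
    if threads_per_core is not None:
        # 1 disables multi-threading, 2 enables it.
        if threads_per_core not in (1, 2):
            raise ValueError("threads_per_core must be 1 or 2")
        group["ThreadsPerCore"] = threads_per_core
    if ebs_volume_gb is not None:
        # The console accepts an integer between 1 and 16384 (GB).
        if not 1 <= ebs_volume_gb <= 16384:
            raise ValueError("ebs_volume_gb must be between 1 and 16384")
        group["InstanceStorageConfigs"] = [
            {"EbsVolumeConfig": {"VolumeSizeInGB": ebs_volume_gb}}
        ]
    if deep_health_checks:
        group["OnStartDeepHealthChecks"] = list(deep_health_checks)
    return group


def build_create_cluster_request(
    cluster_name: str,
    eks_cluster_arn: str,
    instance_groups: List[Dict[str, Any]],
    subnet_ids: List[str],
    security_group_ids: List[str],
    tags: Optional[Dict[str, str]] = None,
    automatic_node_recovery: bool = True,
) -> Dict[str, Any]:
    """Assemble the cluster-level settings from Steps 1 to 3 of the console flow."""
    return {
        "ClusterName": cluster_name,
        "Orchestrator": {"Eks": {"ClusterArn": eks_cluster_arn}},
        "NodeRecovery": "Automatic" if automatic_node_recovery else "None",
        "VpcConfig": {
            "Subnets": list(subnet_ids),  # private subnets, per the note above
            "SecurityGroupIds": list(security_group_ids),
        },
        "InstanceGroups": list(instance_groups),
        "Tags": [{"Key": k, "Value": v} for k, v in (tags or {}).items()],
    }


def wait_until_in_service(
    describe_cluster: Callable[..., Dict[str, Any]],
    cluster_name: str,
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> str:
    """Poll a describe function until the cluster reports InService."""
    for _ in range(max_polls):
        status = describe_cluster(ClusterName=cluster_name)["ClusterStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"cluster {cluster_name} entered the Failed state")
        time.sleep(poll_seconds)
    raise TimeoutError(f"cluster {cluster_name} did not reach InService in time")
```

With boto3 installed and credentials configured, one way to use these helpers is to create a client with boto3.client("sagemaker"), pass the built request as client.create_cluster(**request), and then call wait_until_in_service(client.describe_cluster, cluster_name). Taking the describe function as a parameter keeps the polling logic testable without an AWS account.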

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.