

# Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod"></a>

SageMaker HyperPod helps you provision resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). It accelerates development of FMs by removing the undifferentiated heavy lifting involved in building and maintaining large-scale compute clusters powered by thousands of accelerators such as AWS Trainium and NVIDIA A100 and H100 Graphics Processing Units (GPUs). When accelerators fail, the resiliency features of SageMaker HyperPod monitor the cluster instances, automatically detecting and replacing the faulty hardware on the fly so that you can focus on running ML workloads.

To get started, check [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md), set up [AWS Identity and Access Management for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md), and choose one of the following orchestrator options supported by SageMaker HyperPod.

**Slurm support in SageMaker HyperPod**

SageMaker HyperPod provides support for running machine learning workloads on resilient clusters by integrating with Slurm, an open-source workload manager. Slurm support in SageMaker HyperPod enables seamless cluster orchestration through Slurm cluster configuration, allowing you to set up head, login, and worker nodes on the SageMaker HyperPod clusters. This integration also facilitates Slurm-based job scheduling for running ML workloads on the cluster, as well as direct access to cluster nodes for job scheduling. With HyperPod's lifecycle configuration support, you can customize the computing environment of the clusters to meet your specific requirements. Additionally, by leveraging the Amazon SageMaker AI distributed training libraries, you can optimize the clusters' performance on AWS computing and network resources. To learn more, see [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md). 

**Amazon EKS support in SageMaker HyperPod**

SageMaker HyperPod also integrates with Amazon EKS to enable large-scale training of foundation models on long-running and resilient compute clusters. This allows cluster admin users to provision HyperPod clusters and attach them to an EKS control plane, enabling dynamic capacity management, direct access to cluster instances, and resiliency capabilities. For data scientists, Amazon EKS support in HyperPod enables running containerized workloads for training foundation models, running inference on the EKS cluster, and leveraging the job auto-resume capability for Kubeflow PyTorch training. The architecture involves a 1-to-1 mapping between an EKS cluster (control plane) and a HyperPod cluster (worker nodes) within a VPC, providing a tightly integrated solution for running large-scale ML workloads. To learn more, see [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**UltraServers with HyperPod**

HyperPod with UltraServers delivers AI computing power by integrating NVIDIA superchips into a cohesive, high-performance infrastructure. Each NVL72 UltraServer combines 18 instances with 72 NVIDIA Blackwell GPUs interconnected via NVLink, enabling faster inference and faster training performance compared to previous generation instances. This architecture is particularly valuable for organizations working with trillion-parameter foundation models, as the unified GPU memory allows entire models to remain within a single NVLink domain, eliminating cross-node networking bottlenecks. HyperPod enhances this hardware advantage with intelligent topology-aware scheduling that optimizes workload placement, automatic instance replacement to minimize disruptions, and flexible deployment options that support both dedicated and shared resource configurations. For teams pushing the boundaries of model size and performance, this integration provides the computational foundation needed to train and deploy the most advanced AI models with unprecedented efficiency.

SageMaker HyperPod automatically optimizes instance placement across your UltraServers. By default, HyperPod prioritizes all instances in one UltraServer before using a different one. For example, if you want 14 instances and have 2 UltraServers in your plan, SageMaker AI uses all of the instances in the first UltraServer. If you want 20 instances, SageMaker AI uses all 18 instances in the first UltraServer and then uses 2 more from the second.
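The default fill-first placement can be sketched as the following arithmetic. The helper function below is illustrative only (not part of any SageMaker API) and assumes 18 instances per NVL72 UltraServer, as described above:

```python
# Sketch of HyperPod's default fill-first placement across UltraServers.
# Assumes 18 instances per NVL72 UltraServer; this helper is illustrative
# and not part of any SageMaker API.
def place_instances(requested: int, ultraservers: int, capacity: int = 18) -> list[int]:
    """Return how many instances land on each UltraServer, filling each in turn."""
    placement = []
    remaining = requested
    for _ in range(ultraservers):
        used = min(remaining, capacity)
        placement.append(used)
        remaining -= used
    if remaining > 0:
        raise ValueError("Not enough UltraServer capacity for the requested instances")
    return placement

# 14 instances with 2 UltraServers: all 14 fit on the first one.
print(place_instances(14, 2))  # [14, 0]
# 20 instances: 18 fill the first UltraServer, 2 go to the second.
print(place_instances(20, 2))  # [18, 2]
```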

## AWS Regions supported by SageMaker HyperPod
<a name="sagemaker-hyperpod-available-regions"></a>

SageMaker HyperPod is available in the following AWS Regions. 
+ us-east-1
+ us-east-2
+ us-west-1
+ us-west-2
+ eu-central-1
+ eu-north-1
+ eu-west-1
+ eu-west-2
+ eu-south-2
+ ap-south-1
+ ap-southeast-1
+ ap-southeast-2
+ ap-southeast-3
+ ap-southeast-4
+ ap-northeast-1
+ sa-east-1

**Topics**
+ [AWS Regions supported by SageMaker HyperPod](#sagemaker-hyperpod-available-regions)
+ [Amazon SageMaker HyperPod quickstart](sagemaker-hyperpod-quickstart.md)
+ [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md)
+ [AWS Identity and Access Management for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md)
+ [Customer managed AWS KMS key encryption for SageMaker HyperPod](smcluster-cmk.md)
+ [SageMaker HyperPod recipes](sagemaker-hyperpod-recipes.md)
+ [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md)
+ [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md)
+ [Using topology-aware scheduling in Amazon SageMaker HyperPod](sagemaker-hyperpod-topology.md)
+ [Deploying models on Amazon SageMaker HyperPod](sagemaker-hyperpod-model-deployment.md)
+ [HyperPod in Studio](sagemaker-hyperpod-studio.md)
+ [SageMaker HyperPod references](sagemaker-hyperpod-ref.md)
+ [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md)
+ [Amazon SageMaker HyperPod AMI](sagemaker-hyperpod-release-ami.md)

# Amazon SageMaker HyperPod quickstart
<a name="sagemaker-hyperpod-quickstart"></a>

This quickstart guides you through creating your first HyperPod cluster with Slurm or Amazon EKS orchestration. Choose the orchestration that best fits your infrastructure needs to get started with SageMaker HyperPod.

**Topics**
+ [Create a Slurm-orchestrated SageMaker HyperPod cluster](#sagemaker-hyperpod-quickstart-slurm)
+ [Create an EKS-orchestrated SageMaker HyperPod cluster](#sagemaker-hyperpod-quickstart-eks)
+ [Submit workloads](#sagemaker-hyperpod-quickstart-workload)

## Create a Slurm-orchestrated SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-quickstart-slurm"></a>

Follow these steps to create your first SageMaker HyperPod cluster with Slurm orchestration.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **HyperPod Clusters** in the left navigation pane and then **Cluster Management**.

1. On the **SageMaker HyperPod Clusters** page, choose **Create HyperPod cluster**. 

1. On the **Create HyperPod cluster** drop-down, choose **Orchestrated by Slurm**.

1. On the cluster creation page, choose **Quick setup**. With this option, you get started immediately with default settings. SageMaker AI will create new resources such as VPC, subnets, security groups, Amazon S3 bucket, IAM role, and FSx for Lustre in the process of creating your cluster.

1. On **General settings**, specify a name for the new cluster. You can’t change the name after the cluster is created.

1. On **Instance groups**, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group. You can add one instance group at a time. To create multiple instance groups, repeat the process for each instance group.

   Follow these steps to add an instance group.

   1. For **Instance group type**, choose a type for your instance group. For this quickstart, choose **Controller (head)** for `my-controller-group`, **Login** for `my-login-group`, and **Compute (worker)** for `worker-group-1`. 

   1. For **Name**, specify a name for the instance group. For this quickstart, create three instance groups named `my-controller-group`, `my-login-group`, and `worker-group-1`.

   1.  For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

   1. For **Instance type**, choose the instance for the instance group. For this quickstart, select `ml.c5.xlarge` for `my-controller-group`, `ml.m5.4xlarge` for `my-login-group`, and `ml.trn1.32xlarge` for `worker-group-1`. 

      Ensure that you choose the instance type with sufficient quotas in your account, or request additional quotas by following the instructions at [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

   1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this quickstart, enter **1** for all three groups.

   1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

   1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com//ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

   1. Choose **Add instance group**.

1.  On **Quick configuration defaults**, review the default settings. This section lists all the default settings for your cluster creation, including all the new AWS resources that will be created during the cluster creation process.

1. Choose **Submit**.

For more information, see [Getting started with SageMaker HyperPod using the SageMaker AI console](smcluster-getting-started-slurm-console.md).
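The console steps above can also be expressed programmatically. The following is a minimal sketch of a `CreateCluster` request body mirroring the three quickstart instance groups; the IAM role ARN, the lifecycle-script Amazon S3 URI, and the `on_create.sh` script name are placeholders you would replace with your own resources:

```python
# Sketch of a CreateCluster request for a Slurm-orchestrated HyperPod cluster,
# mirroring the three quickstart instance groups. The role ARN, S3 URI, and
# script name are placeholders, not real resources.
def build_create_cluster_request(cluster_name: str, execution_role: str,
                                 lifecycle_s3_uri: str) -> dict:
    def group(name: str, instance_type: str) -> dict:
        return {
            "InstanceGroupName": name,
            "InstanceType": instance_type,
            "InstanceCount": 1,  # quickstart uses 1 instance per group
            "ExecutionRole": execution_role,
            "LifeCycleConfig": {
                "SourceS3Uri": lifecycle_s3_uri,
                "OnCreate": "on_create.sh",  # placeholder script name
            },
        }
    return {
        "ClusterName": cluster_name,
        "InstanceGroups": [
            group("my-controller-group", "ml.c5.xlarge"),
            group("my-login-group", "ml.m5.4xlarge"),
            group("worker-group-1", "ml.trn1.32xlarge"),
        ],
    }

request = build_create_cluster_request(
    "my-hyperpod-cluster",
    "arn:aws:iam::111122223333:role/HyperPodExecutionRole",  # placeholder
    "s3://amzn-s3-demo-bucket/lifecycle-scripts/",           # placeholder
)
print([g["InstanceType"] for g in request["InstanceGroups"]])
```

In a real account, you would pass this dictionary to `boto3.client("sagemaker").create_cluster(**request)` after uploading your lifecycle scripts to the S3 location.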

## Create an EKS-orchestrated SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-quickstart-eks"></a>

Follow these steps to create your first SageMaker HyperPod cluster with Amazon EKS orchestration.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **HyperPod Clusters** in the left navigation pane and then **Cluster Management**.

1. On the **SageMaker HyperPod Clusters** page, choose **Create HyperPod cluster**. 

1. On the **Create HyperPod cluster** drop-down, choose **Orchestrated by Amazon EKS**.

1. On the cluster creation page, choose **Quick configuration**. With this option, you can get started immediately with default settings. SageMaker AI will create new resources such as VPC, subnets, security groups, Amazon S3 bucket, IAM role, and FSx for Lustre in the process of creating your cluster.

1. On **General settings**, specify a name for the new cluster. You can’t change the name after the cluster is created. 

1. On **Instance groups**, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group. You can add one instance group at a time. To create multiple instance groups, repeat the process for each instance group.

   Follow these steps to add an instance group.

   1. For **Instance group type**, choose **Standard** or **Restricted Instance Group (RIG)**. Typically, you will choose **Standard**, which provides a general-purpose computing environment without additional security restrictions. **Restricted Instance Group (RIG)** is a specialized environment for foundation model customization, such as Amazon Nova. For more information about setting up a RIG for Amazon Nova model customization, see Amazon Nova customization on SageMaker HyperPod in the [Amazon Nova 1.0 user guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp.html) or the [Amazon Nova 2.0 user guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp.html).

   1. For **Name**, specify a name for the instance group.

   1.  For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

   1. For **Instance type**, choose the instance for the instance group. Ensure that you choose an instance type with sufficient quotas in your account, or request additional quotas by following the instructions at [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

   1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this quickstart, enter **1**.

   1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

   1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com//ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

   1. For **Instance deep health checks**, choose your option. Deep health checks monitor instance health during creation and after software updates, automatically recovering faulty instances through reboots or replacements when enabled.

   1. Choose **Add instance group**.

1.  On **Quick configuration defaults**, review the default settings. This section lists all the default settings for your cluster creation, including all the new AWS resources that will be created during the cluster creation process.

1. Choose **Submit**.

For more information, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md).
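For EKS orchestration, the equivalent `CreateCluster` request attaches the HyperPod cluster to an existing EKS control plane and requires a VPC configuration. The following sketch uses placeholder ARNs and IDs throughout; the `OnStartDeepHealthChecks` values correspond to the deep health checks option in the console:

```python
# Sketch of a CreateCluster request for an EKS-orchestrated HyperPod cluster.
# Every ARN and ID below is a placeholder. VpcConfig is required for EKS
# orchestration, and the EKS cluster (control plane) maps 1-to-1 to the
# HyperPod cluster (worker nodes).
request = {
    "ClusterName": "my-hyperpod-eks-cluster",
    "Orchestrator": {
        "Eks": {"ClusterArn": "arn:aws:eks:us-west-2:111122223333:cluster/my-eks"}  # placeholder
    },
    "VpcConfig": {
        "SecurityGroupIds": ["sg-0123456789abcdef0"],  # placeholder
        "Subnets": ["subnet-0123456789abcdef0"],       # placeholder
    },
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.trn1.32xlarge",
        "InstanceCount": 1,
        "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",  # placeholder
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle/",  # placeholder
            "OnCreate": "on_create.sh",                            # placeholder
        },
        # Deep health checks run during creation and after software updates.
        "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"],
    }],
}
print(sorted(request["InstanceGroups"][0]["OnStartDeepHealthChecks"]))
```

As with the Slurm sketch, you would submit this with `boto3.client("sagemaker").create_cluster(**request)` in a real account.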

## Submit workloads
<a name="sagemaker-hyperpod-quickstart-workload"></a>

Follow these workshop tutorials to submit sample workloads.
+ [Amazon SageMaker HyperPod for Slurm](https://catalog.workshops.aws/sagemaker-hyperpod/en-US)
+ [Amazon SageMaker HyperPod for Amazon EKS](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US)

# Prerequisites for using SageMaker HyperPod
<a name="sagemaker-hyperpod-prerequisites"></a>

The following sections walk you through the prerequisites for getting started with SageMaker HyperPod.

**Topics**
+ [SageMaker HyperPod quotas](#sagemaker-hyperpod-prerequisites-quotas)
+ [Setting up SageMaker HyperPod with a custom Amazon VPC](#sagemaker-hyperpod-prerequisites-optional-vpc)
+ [Setting up SageMaker HyperPod clusters across multiple AZs](#sagemaker-hyperpod-prerequisites-multiple-availability-zones)
+ [Setting up AWS Systems Manager and Run As for cluster user access control](#sagemaker-hyperpod-prerequisites-ssm)
+ [(Optional) Setting up SageMaker HyperPod with Amazon FSx for Lustre](#sagemaker-hyperpod-prerequisites-optional-fsx)

## SageMaker HyperPod quotas
<a name="sagemaker-hyperpod-prerequisites-quotas"></a>

You can create SageMaker HyperPod clusters up to the quotas for *cluster usage* in your AWS account.

**Important**  
To learn more about SageMaker HyperPod pricing, see [SageMaker HyperPod pricing](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-pricing) and [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

### View Amazon SageMaker HyperPod quotas using the AWS Management Console
<a name="sagemaker-hyperpod-prerequisites-quotas-view"></a>

Look up the default and applied values of the *cluster usage* quota (also referred to as a *limit*), which is used for SageMaker HyperPod.

1. Open the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).

1. In the left navigation pane, choose **AWS services**.

1. From the **AWS services** list, search for and select **Amazon SageMaker AI**.

1. In the **Service quotas** list, you can see the service quota name, applied value (if it's available), AWS default quota, and whether the quota value is adjustable. 

1. In the search bar, type **cluster usage**. This shows quotas for cluster usage, applied quotas, and the default quotas.

**List of common service quotas to create a HyperPod cluster and its prerequisites**

Before creating a new HyperPod cluster and its prerequisite resources in the SageMaker AI console, check whether you need to request increases for the following service quotas. Navigate to the **Service Quotas** console and search for the following terms.



| No | Quota Name | Search Term | Description | 
| --- | --- | --- | --- | 
| 1 | Maximum number of instances allowed per SageMaker HyperPod cluster | Under SageMaker AI, search for "Maximum number of instances allowed per SageMaker HyperPod cluster" | Your account-level quota value must be more than the number of instances you wish to add to your cluster | 
| 2 | Maximum size of EBS volume in GB for a SageMaker HyperPod cluster instance | Under SageMaker AI, search for "Maximum size of EBS volume in GB for a HyperPod cluster instance" | Your account-level quota value must be more than the size of the EBS volume you wish to add to your cluster instances | 
| 3 | Total number of instances allowed across SageMaker HyperPod clusters | Under SageMaker AI, search for "Total number of instances allowed across SageMaker HyperPod clusters" | Your account-level quota value must be more than the total number of instances you wish to add across all clusters in your account in aggregate | 
| 4 | Instance quotas | Under SageMaker AI, search for "ml.<instance type> for cluster usage", for example "ml.p5.48xlarge for cluster usage" | Your account-level quota value for the particular instance type (for example, ml.p5.48xlarge) must be greater than the number of instances of that type to add across all clusters in your account in aggregate | 
| 5 | VPCs per Region | Under Amazon Virtual Private Cloud (Amazon VPC), search for "VPCs per Region" | Your account-level quota value must be enough to create a new VPC in the account when setting up your HyperPod cluster. Check whether you have already exhausted this quota in the VPC console. This quota increase is only needed if you will create a new VPC via the Quick or Custom cluster setup option in the SageMaker HyperPod console. | 
| 6 | Internet gateways per Region | Under Amazon Virtual Private Cloud (Amazon VPC), search for "Internet gateways per Region" | Your account-level quota value must be enough to create one additional internet gateway in the account when setting up your SageMaker HyperPod cluster. This quota increase is only needed if you will create a new VPC via the Quick or Custom cluster setup option in the SageMaker HyperPod console. | 
| 7 | Network interfaces per Region | Under Amazon Virtual Private Cloud (Amazon VPC), search for "Network interfaces per Region" | Your account-level quota value must allow enough network interfaces in the account when setting up your HyperPod cluster. | 
| 8 | EC2-VPC Elastic IPs | Under Amazon Elastic Compute Cloud (Amazon EC2), search for "EC2-VPC Elastic IPs" | Your account-level quota value must be enough to allocate the Elastic IP addresses needed when setting up your HyperPod cluster. Check whether you have already exhausted this quota in the Amazon EC2 console. This quota increase is only needed if you will create a new VPC via the Quick or Custom cluster setup option in the SageMaker HyperPod console. | 

### Request an Amazon SageMaker HyperPod quota increase using the AWS Management Console
<a name="sagemaker-hyperpod-prerequisites-quotas-increase"></a>

Increase your quotas at the account or resource level.

1. To increase the quota of instances for *cluster usage*, select the quota that you want to increase.

1. If the quota is adjustable, you can request a quota increase at either the account level or resource level based on the value listed in the **Adjustability** column.

1. For **Increase quota value**, enter the new value. The new value must be greater than the current value.

1. Choose **Request**.

1. To view any pending or recently resolved requests in the console, navigate to the **Request history** tab from the service's details page, or choose **Dashboard** from the navigation pane. For pending requests, choose the status of the request to open the request receipt. The initial status of a request is **Pending**. After the status changes to **Quota requested**, you see the case number with AWS Support. Choose the case number to open the ticket for your request.

To learn more about requesting a quota increase in general, see [Requesting a Quota Increase](https://docs.aws.amazon.com/servicequotas/latest/userguide/request-quota-increase.html) in the *AWS Service Quotas User Guide*.
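Quota increases can also be requested through the Service Quotas API. The sketch below only builds the request parameters; the quota code is a placeholder that you would look up in the Service Quotas console or via `list_service_quotas(ServiceCode="sagemaker")`:

```python
# Sketch of building parameters for a cluster-usage quota increase request
# through the Service Quotas API. "L-XXXXXXXX" is a placeholder quota code.
def quota_increase_params(quota_code: str, desired: float) -> dict:
    return {
        "ServiceCode": "sagemaker",
        "QuotaCode": quota_code,
        "DesiredValue": float(desired),  # must be greater than the current value
    }

params = quota_increase_params("L-XXXXXXXX", 4)  # placeholder quota code
# In a real account:
# boto3.client("service-quotas").request_service_quota_increase(**params)
print(params["ServiceCode"])
```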

## Setting up SageMaker HyperPod with a custom Amazon VPC
<a name="sagemaker-hyperpod-prerequisites-optional-vpc"></a>

To set up a SageMaker HyperPod cluster with a custom Amazon VPC, review the following prerequisites.

**Note**  
VPC configuration is mandatory for Amazon EKS orchestration. For Slurm orchestration, VPC setup is optional.
+  Validate [Elastic Network Interface](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html) (ENI) capacity in your AWS account before creating a SageMaker HyperPod cluster with a custom VPC. The ENI limit is controlled by Amazon EC2 and varies by AWS Region. SageMaker HyperPod cannot automatically request quota increases. 

**To verify your current ENI quota:**

  1. Open the [Service Quotas console](https://console.aws.amazon.com/servicequotas/).

  1. In the **Manage quotas** section, use the **AWS services** drop-down list to search for **VPC**. 

  1. Choose **Amazon Virtual Private Cloud (Amazon VPC)** to view its quotas. 

  1. Look for the service quota **Network interfaces per Region** or the **Quota code** `L-DF5E4CA3`.

  If your current ENI limit is insufficient for your SageMaker HyperPod cluster needs, request a quota increase. Ensuring adequate ENI capacity beforehand helps prevent cluster deployment failures.
+ When using a custom VPC to connect a SageMaker HyperPod cluster with AWS resources, provide the VPC name, ID, AWS Region, subnet IDs, and security group IDs during cluster creation.
**Note**  
When your Amazon VPC and subnets support IPv6 in the [`VpcConfig`](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_CreateCluster.html#sagemaker-CreateCluster-request-VpcConfig) of the cluster, or at the instance group level using the `OverrideVpcConfig` attribute of [`ClusterInstanceGroupSpecification`](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_ClusterInstanceGroupSpecification.html), network communications differ based on the cluster orchestration platform:  
Slurm-orchestrated clusters automatically configure nodes with dual IPv6 and IPv4 addresses, allowing immediate IPv6 network communications. No additional configuration is required beyond the `VPCConfig` IPv6 settings.
In EKS-orchestrated clusters, nodes receive dual-stack addressing, but pods can only use IPv6 when the Amazon EKS cluster is explicitly IPv6-enabled. You must create a new IPv6 Amazon EKS cluster; existing IPv4 Amazon EKS clusters cannot be converted to IPv6. For information about deploying an IPv6 Amazon EKS cluster, see [Amazon EKS IPv6 Cluster Deployment](https://docs.aws.amazon.com/eks/latest/userguide/deploy-ipv6-cluster.html#_deploy_an_ipv6_cluster_with_eksctl).
Additional resources for IPv6 configuration:  
For information about adding IPv6 support to your VPC, see [IPv6 Support for VPC](https://docs.aws.amazon.com//vpc/latest/userguide/vpc-migrate-ipv6.html).
For information about creating a new IPv6-compatible VPC, see [Amazon VPC Creation Guide](https://docs.aws.amazon.com//vpc/latest/userguide/create-vpc.html).
To configure SageMaker HyperPod with a custom Amazon VPC, see [Custom Amazon VPC setup for SageMaker HyperPod](https://docs.aws.amazon.com//sagemaker/latest/dg/sagemaker-hyperpod-prerequisites.html#sagemaker-hyperpod-prerequisites-optional-vpc).
+ Make sure that all resources are deployed in the same AWS Region as the SageMaker HyperPod cluster. Configure security group rules to allow inter-resource communication within the VPC. For example, when creating a VPC in `us-west-2`, provision subnets across one or more Availability Zones (such as `us-west-2a` or `us-west-2b`), and create a security group allowing intra-group traffic.
**Note**  
SageMaker HyperPod supports multi-Availability Zone deployment. For more information, see [Setting up SageMaker HyperPod clusters across multiple AZs](#sagemaker-hyperpod-prerequisites-multiple-availability-zones).
+ Establish Amazon Simple Storage Service (Amazon S3) connectivity for VPC-deployed SageMaker HyperPod instance groups by creating a VPC endpoint. Without internet access, instance groups cannot store or retrieve lifecycle scripts, training data, or model artifacts. We recommend that you create a custom IAM policy restricting Amazon S3 bucket access to the private VPC. For more information, see [Endpoints for Amazon S3](https://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-endpoints-s3.html) in the *AWS PrivateLink Guide*.
+ For HyperPod clusters using Elastic Fabric Adapter (EFA)-enabled instances, configure the security group to allow all inbound and outbound traffic to and from the security group itself. Specifically, avoid using `0.0.0.0/0` for outbound rules, as this may cause EFA health check failures. For more information about EFA security group preparation guidelines, see [Step 1: Prepare an EFA-enabled security group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security) in the *Amazon EC2 User Guide*.
+ Consider your subnet's Classless Inter-Domain Routing (CIDR) block size carefully before creating HyperPod clusters.
  + The subnet CIDR block size cannot be changed after creation. This is especially important when you use large accelerated instances like P5. Without sufficient block size, you must recreate your clusters when scaling up.
  + When choosing the appropriate subnet CIDR block size, consider these factors: your instance types, expected number of instances, and the number of IP addresses consumed by each instance.
  + For Slurm-orchestrated clusters, each P5 instance can create 32 IP addresses (one per network card). For EKS-orchestrated clusters, each P5 instance can create 81 IP addresses (50 from the primary card plus one from each of the remaining 31 cards). For detailed specifications, see [Network specifications](https://docs.aws.amazon.com/ec2/latest/instancetypes/ac.html#ac_network) in the *Amazon EC2 Instance Types Developer Guide*.
  + For examples of CloudFormation templates that specify the subnet CIDR block size, see the [HyperPod Slurm template](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/sagemaker-hyperpod.yaml) and [HyperPod Amazon EKS template](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/7.sagemaker-hyperpod-eks/cfn-templates/nested-stacks/private-subnet-stack.yaml) in the [awsome-distributed-training repository](https://github.com/aws-samples/awsome-distributed-training/tree/main).
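The per-instance IP counts above translate into a minimum subnet size as follows. This is back-of-the-envelope arithmetic, assuming AWS's standard reservation of 5 addresses per subnet:

```python
# Back-of-the-envelope subnet sizing using the per-instance IP counts above
# (32 per P5 instance for Slurm-orchestrated clusters, 81 for EKS-orchestrated
# clusters). AWS reserves 5 addresses in every subnet.
import math

def min_subnet_prefix(instances: int, ips_per_instance: int, reserved: int = 5) -> int:
    """Smallest IPv4 prefix length whose subnet holds the required addresses."""
    needed = instances * ips_per_instance + reserved
    bits = math.ceil(math.log2(needed))
    return 32 - bits

# 16 P5 instances on an EKS-orchestrated cluster need 16 * 81 + 5 = 1301
# addresses, which requires at least a /21 (2048 addresses).
print(min_subnet_prefix(16, 81))  # 21
# The same 16 instances on a Slurm-orchestrated cluster fit in a /22.
print(min_subnet_prefix(16, 32))  # 22
```

Because the CIDR block cannot be changed after subnet creation, erring on the side of a larger block avoids having to recreate clusters when scaling up.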

## Setting up SageMaker HyperPod clusters across multiple AZs
<a name="sagemaker-hyperpod-prerequisites-multiple-availability-zones"></a>

You can configure your SageMaker HyperPod clusters across multiple Availability Zones (AZs) to improve reliability and availability.

**Note**  
Elastic Fabric Adapter (EFA) traffic cannot cross AZs or VPCs. This does not apply to normal IP traffic from the ENA device of an EFA interface. For more information, see [EFA limitations](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html).
+ **Default behavior**

  HyperPod deploys all cluster instances in a single Availability Zone. The VPC configuration determines the deployment AZ:
  + For Slurm-orchestrated clusters, VPC configuration is optional. When no VPC configuration is provided, HyperPod defaults to one subnet from the platform VPC. 
  + For EKS-orchestrated clusters, VPC configuration is required.
  + For both Slurm and EKS orchestrators, when [`VpcConfig`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_VpcConfig.html) is provided, HyperPod selects a subnet from the provided `VpcConfig`'s subnet list. All instance groups inherit the subnet's AZ. 
**Note**  
Once you create a cluster, you cannot modify its `VpcConfig` settings.

  To learn more about configuring VPCs for HyperPod clusters, see the preceding section, [Setting up SageMaker HyperPod with a custom Amazon VPC](#sagemaker-hyperpod-prerequisites-optional-vpc).
+ **Multi-AZ configuration**

  You can set up your HyperPod cluster across multiple AZs when creating a cluster or when adding a new instance group to an existing cluster. To configure multi-AZ deployments, you can override the default VPC settings of the cluster by specifying different subnets and security groups, potentially across different Availability Zones, for individual instance groups within your cluster. 

  SageMaker HyperPod API users can use the `OverrideVpcConfig` property within the [ClusterInstanceGroupSpecification](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceGroupSpecification.html) when working with the [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) or [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) APIs.

  The `OverrideVpcConfig` field:
  + Cannot be modified after the instance group is created.
  + Is optional. If not specified, the cluster-level [VpcConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_VpcConfig.html) is used as the default.
  + For Slurm-orchestrated clusters, can only be specified when a cluster-level `VpcConfig` is provided. If no `VpcConfig` is specified at the cluster level, `OverrideVpcConfig` cannot be used for any instance group.
  + Contains two required fields:
    + `Subnets` - accepts between 1 and 16 subnet IDs
    + `SecurityGroupIds` - accepts between 1 and 5 security group IDs
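
  As a sketch, an instance group entry in the `InstanceGroups` list of a `CreateCluster` or `UpdateCluster` request might override the cluster-level VPC settings as follows (the group name, instance type and count, role ARN, and IDs are placeholders):

  ```
  {
      "InstanceGroupName": "worker-group-2",
      "InstanceType": "ml.g5.8xlarge",
      "InstanceCount": 4,
      "ExecutionRole": "arn:aws:iam::111122223333:role/sagemaker-cluster-execution-role",
      "LifeCycleConfig": {
          "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/",
          "OnCreate": "on_create.sh"
      },
      "OverrideVpcConfig": {
          "Subnets": ["subnet-0abc1234def567890"],
          "SecurityGroupIds": ["sg-0abc1234def567890"]
      }
  }
  ```

  Because the subnet determines the AZ, specifying different subnets in `OverrideVpcConfig` for different instance groups is what distributes the cluster across multiple AZs.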

  For more information about creating or updating a SageMaker HyperPod cluster using the SageMaker HyperPod console UI or the AWS CLI:
  + Slurm orchestration: See [Operating Slurm-orchestrated HyperPod clusters](sagemaker-hyperpod-operate-slurm.md).
  + EKS orchestration: See [Operating EKS-orchestrated HyperPod clusters](sagemaker-hyperpod-eks-operate.md).

**Note**  
When running workloads across multiple AZs, be aware that network communication between AZs introduces additional latency. Consider this impact when designing latency-sensitive applications.

## Setting up AWS Systems Manager and Run As for cluster user access control
<a name="sagemaker-hyperpod-prerequisites-ssm"></a>

[SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami) comes with [AWS Systems Manager](https://aws.amazon.com/systems-manager/) (SSM) out of the box to help you manage access to your SageMaker HyperPod cluster instance groups. This section describes how to create operating system (OS) users in your SageMaker HyperPod clusters and associate them with IAM users and roles. This is useful for authenticating SSM sessions with the credentials of the OS user account.

**Note**  
Granting users access to HyperPod cluster nodes allows them to install and operate user-managed software on the nodes. Ensure that you maintain the principle of least-privilege permissions for users.

### Enabling Run As in your AWS account
<a name="sagemaker-hyperpod-prerequisites-ssm-enable-runas"></a>

As an AWS account admin or a cloud administrator, you can manage access to SageMaker HyperPod clusters at an IAM role or user level by using the [Run As feature in SSM](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-preferences-run-as.html). With this feature, you can start each SSM session using the OS user associated with the IAM role or user.

To enable Run As in your AWS account, follow the steps in [Turn on Run As support for Linux and macOS managed nodes](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-preferences-run-as.html). If you already created OS users in your cluster, make sure that you associate them with IAM roles or users by tagging them as guided in **Option 2** of step 5 under **To turn on Run As support for Linux and macOS managed nodes**.
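
For example, associating an IAM role with an OS user is done by tagging the role with the `SSMSessionRunAs` key. The following is a sketch of that tag (the OS user name `janedoe` is a placeholder):

```
{
    "Key": "SSMSessionRunAs",
    "Value": "janedoe"
}
```

With this tag in place, SSM sessions started by principals assuming the role run as the `janedoe` OS user on the cluster nodes.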

## (Optional) Setting up SageMaker HyperPod with Amazon FSx for Lustre
<a name="sagemaker-hyperpod-prerequisites-optional-fsx"></a>

To start using SageMaker HyperPod and mapping data paths between the cluster and your FSx for Lustre file system, select one of the AWS Regions supported by SageMaker HyperPod. After choosing the AWS Region you prefer, you should also determine which Availability Zone (AZ) to use. 

If you use SageMaker HyperPod compute nodes in AZs different from the AZ where your FSx for Lustre file system is set up within the same AWS Region, there might be communication and network overhead. We recommend that you use the same physical AZ as the one for the SageMaker HyperPod service account to avoid any cross-AZ traffic between SageMaker HyperPod clusters and your FSx for Lustre file system. Also, make sure that you have configured the file system with your VPC. If you want to use Amazon FSx as the main file system for storage, you must configure SageMaker HyperPod clusters with your VPC.

# AWS Identity and Access Management for SageMaker HyperPod
<a name="sagemaker-hyperpod-prerequisites-iam"></a>

AWS Identity and Access Management (IAM) is an AWS service that helps an administrator securely control access to AWS resources. IAM administrators control who can be *authenticated* (signed in) and *authorized* (have permissions) to use SageMaker HyperPod resources. IAM is an AWS service that you can use with no additional charge.

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

Assume that there are two main types of SageMaker HyperPod users: *cluster admin users* and *data scientist users*.
+ **Cluster admin users** – Are responsible for creating and managing SageMaker HyperPod clusters. This includes configuring the HyperPod clusters and managing user access to them.
  + Create and configure SageMaker HyperPod clusters with Slurm or Amazon EKS.
  + Create and configure IAM roles for data scientist users and HyperPod cluster resources.
  + For SageMaker HyperPod orchestration with Amazon EKS, create and configure [EKS access entries](https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html), [role-based access control (RBAC)](sagemaker-hyperpod-eks-setup-rbac.md), and Pod Identity to fulfill data science use cases.
+ **Data scientist users** – Focus on ML model training. They use the open-source orchestrator or the SageMaker HyperPod CLI to submit and manage training jobs.
  + Assume and use the IAM Role provided by cluster admin users.
  + Interact with the open-source orchestrator CLIs supported by SageMaker HyperPod (Slurm or Kubernetes) or the SageMaker HyperPod CLI to check cluster capacity, connect to clusters, and submit workloads.

Set up IAM roles for cluster admins by attaching the right permissions or policies to operate SageMaker HyperPod clusters. Cluster admins should also create IAM roles for SageMaker HyperPod resources to assume so that the clusters can run and communicate with the necessary AWS resources, such as Amazon S3, Amazon CloudWatch, and AWS Systems Manager (SSM). Finally, the AWS account admin or the cluster admins should grant scientists permissions to access the SageMaker HyperPod clusters and run ML workloads.
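
For reference, a role that SageMaker HyperPod resources assume needs a trust policy that allows the SageMaker service principal to assume it. A minimal sketch follows (permissions policies for Amazon S3, CloudWatch, and SSM would be attached separately):

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```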

Depending on which orchestrator you choose, the permissions needed for cluster admins and scientists may vary. You can also control the scope of permissions for various actions in the roles by using each service's condition keys. Use the following Service Authorization References to add detailed scope for the services related to SageMaker HyperPod.
+ [Amazon Elastic Compute Cloud](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonec2.html)
+ [Amazon Elastic Container Registry](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonelasticcontainerregistry.html) (for SageMaker HyperPod cluster orchestration with Amazon EKS)
+ [Amazon Elastic Kubernetes Service](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonelastickubernetesservice.html) (for SageMaker HyperPod cluster orchestration with Amazon EKS)
+ [Amazon FSx](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonfsx.html)
+ [AWS IAM Identity Center (successor to AWS Single Sign-On)](https://docs.aws.amazon.com/service-authorization/latest/reference/list_awsiamidentitycentersuccessortoawssinglesign-on.html)
+ [AWS Identity and Access Management (IAM)](https://docs.aws.amazon.com/service-authorization/latest/reference/list_awsidentityandaccessmanagementiam.html)
+ [Amazon Simple Storage Service](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazons3.html)
+ [Amazon SageMaker AI](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonsagemaker.html)
+ [AWS Systems Manager](https://docs.aws.amazon.com/service-authorization/latest/reference/list_awssystemsmanager.html)

**Topics**
+ [IAM permissions for cluster creation](#sagemaker-hyperpod-prerequisites-iam-cluster-creation)
+ [IAM users for cluster admin](#sagemaker-hyperpod-prerequisites-iam-cluster-admin)
+ [IAM users for scientists](#sagemaker-hyperpod-prerequisites-iam-cluster-user)
+ [IAM role for SageMaker HyperPod](#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod)

## IAM permissions for cluster creation
<a name="sagemaker-hyperpod-prerequisites-iam-cluster-creation"></a>

Creating HyperPod clusters requires the IAM permissions outlined in the following policy example. If your AWS account has [AdministratorAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AdministratorAccess.html) permissions, these permissions are granted by default.

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateCluster",
                "sagemaker:DeleteCluster",
                "sagemaker:UpdateCluster"
            ],
            "Resource": "arn:aws:sagemaker:*:*:cluster/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:AddTags"
            ],
            "Resource": "arn:aws:sagemaker:*:*:cluster/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListTags",
                "sagemaker:ListClusters",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListComputeQuotas",
                "sagemaker:ListTrainingPlans",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudformation:CreateStack",
                "cloudformation:UpdateStack",
                "cloudformation:DeleteStack",
                "cloudformation:ContinueUpdateRollback",
                "cloudformation:SetStackPolicy",
                "cloudformation:ValidateTemplate",
                "cloudformation:DescribeStacks",
                "cloudformation:DescribeStackEvents",
                "cloudformation:Get*",
                "cloudformation:List*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::*:role/sagemaker-*",
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": [
                        "sagemaker.amazonaws.com",
                        "eks.amazonaws.com",
                        "lambda.amazonaws.com"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:PassRole",
                "iam:GetRole"
            ],
            "Resource": "arn:aws:iam::*:role/*",
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": [
                        "sagemaker.amazonaws.com",
                        "eks.amazonaws.com",
                        "lambda.amazonaws.com",
                        "cloudformation.amazonaws.com"
                    ]
                }
            }
        },
        {
            "Sid": "AmazonVPCFullAccess",
            "Effect": "Allow",
            "Action": [
                "ec2:AcceptVpcPeeringConnection",
                "ec2:AcceptVpcEndpointConnections",
                "ec2:AllocateAddress",
                "ec2:AssignIpv6Addresses",
                "ec2:AssignPrivateIpAddresses",
                "ec2:AssociateAddress",
                "ec2:AssociateDhcpOptions",
                "ec2:AssociateRouteTable",
                "ec2:AssociateSecurityGroupVpc",
                "ec2:AssociateSubnetCidrBlock",
                "ec2:AssociateVpcCidrBlock",
                "ec2:AttachClassicLinkVpc",
                "ec2:AttachInternetGateway",
                "ec2:AttachNetworkInterface",
                "ec2:AttachVpnGateway",
                "ec2:AuthorizeSecurityGroupEgress",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:CreateCarrierGateway",
                "ec2:CreateCustomerGateway",
                "ec2:CreateDefaultSubnet",
                "ec2:CreateDefaultVpc",
                "ec2:CreateDhcpOptions",
                "ec2:CreateEgressOnlyInternetGateway",
                "ec2:CreateFlowLogs",
                "ec2:CreateInternetGateway",
                "ec2:CreateLocalGatewayRouteTableVpcAssociation",
                "ec2:CreateNatGateway",
                "ec2:CreateNetworkAcl",
                "ec2:CreateNetworkAclEntry",
                "ec2:CreateNetworkInterface",
                "ec2:CreateNetworkInterfacePermission",
                "ec2:CreateRoute",
                "ec2:CreateRouteTable",
                "ec2:CreateSecurityGroup",
                "ec2:CreateSubnet",
                "ec2:CreateTags",
                "ec2:CreateVpc",
                "ec2:CreateVpcEndpoint",
                "ec2:CreateVpcEndpointConnectionNotification",
                "ec2:CreateVpcEndpointServiceConfiguration",
                "ec2:CreateVpcPeeringConnection",
                "ec2:CreateVpnConnection",
                "ec2:CreateVpnConnectionRoute",
                "ec2:CreateVpnGateway",
                "ec2:DeleteCarrierGateway",
                "ec2:DeleteCustomerGateway",
                "ec2:DeleteDhcpOptions",
                "ec2:DeleteEgressOnlyInternetGateway",
                "ec2:DeleteFlowLogs",
                "ec2:DeleteInternetGateway",
                "ec2:DeleteLocalGatewayRouteTableVpcAssociation",
                "ec2:DeleteNatGateway",
                "ec2:DeleteNetworkAcl",
                "ec2:DeleteNetworkAclEntry",
                "ec2:DeleteNetworkInterface",
                "ec2:DeleteNetworkInterfacePermission",
                "ec2:DeleteRoute",
                "ec2:DeleteRouteTable",
                "ec2:DeleteSecurityGroup",
                "ec2:DeleteSubnet",
                "ec2:DeleteTags",
                "ec2:DeleteVpc",
                "ec2:DeleteVpcEndpoints",
                "ec2:DeleteVpcEndpointConnectionNotifications",
                "ec2:DeleteVpcEndpointServiceConfigurations",
                "ec2:DeleteVpcPeeringConnection",
                "ec2:DeleteVpnConnection",
                "ec2:DeleteVpnConnectionRoute",
                "ec2:DeleteVpnGateway",
                "ec2:DescribeAccountAttributes",
                "ec2:DescribeAddresses",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeCarrierGateways",
                "ec2:DescribeClassicLinkInstances",
                "ec2:DescribeCustomerGateways",
                "ec2:DescribeDhcpOptions",
                "ec2:DescribeEgressOnlyInternetGateways",
                "ec2:DescribeFlowLogs",
                "ec2:DescribeInstances",
                "ec2:DescribeInternetGateways",
                "ec2:DescribeIpv6Pools",
                "ec2:DescribeLocalGatewayRouteTables",
                "ec2:DescribeLocalGatewayRouteTableVpcAssociations",
                "ec2:DescribeKeyPairs",
                "ec2:DescribeMovingAddresses",
                "ec2:DescribeNatGateways",
                "ec2:DescribeNetworkAcls",
                "ec2:DescribeNetworkInterfaceAttribute",
                "ec2:DescribeNetworkInterfacePermissions",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribePrefixLists",
                "ec2:DescribeRouteTables",
                "ec2:DescribeSecurityGroupReferences",
                "ec2:DescribeSecurityGroupRules",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSecurityGroupVpcAssociations",
                "ec2:DescribeStaleSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeTags",
                "ec2:DescribeVpcAttribute",
                "ec2:DescribeVpcClassicLink",
                "ec2:DescribeVpcClassicLinkDnsSupport",
                "ec2:DescribeVpcEndpointConnectionNotifications",
                "ec2:DescribeVpcEndpointConnections",
                "ec2:DescribeVpcEndpoints",
                "ec2:DescribeVpcEndpointServiceConfigurations",
                "ec2:DescribeVpcEndpointServicePermissions",
                "ec2:DescribeVpcEndpointServices",
                "ec2:DescribeVpcPeeringConnections",
                "ec2:DescribeVpcs",
                "ec2:DescribeVpnConnections",
                "ec2:DescribeVpnGateways",
                "ec2:DetachClassicLinkVpc",
                "ec2:DetachInternetGateway",
                "ec2:DetachNetworkInterface",
                "ec2:DetachVpnGateway",
                "ec2:DisableVgwRoutePropagation",
                "ec2:DisableVpcClassicLink",
                "ec2:DisableVpcClassicLinkDnsSupport",
                "ec2:DisassociateAddress",
                "ec2:DisassociateRouteTable",
                "ec2:DisassociateSecurityGroupVpc",
                "ec2:DisassociateSubnetCidrBlock",
                "ec2:DisassociateVpcCidrBlock",
                "ec2:EnableVgwRoutePropagation",
                "ec2:EnableVpcClassicLink",
                "ec2:EnableVpcClassicLinkDnsSupport",
                "ec2:GetSecurityGroupsForVpc",
                "ec2:ModifyNetworkInterfaceAttribute",
                "ec2:ModifySecurityGroupRules",
                "ec2:ModifySubnetAttribute",
                "ec2:ModifyVpcAttribute",
                "ec2:ModifyVpcEndpoint",
                "ec2:ModifyVpcEndpointConnectionNotification",
                "ec2:ModifyVpcEndpointServiceConfiguration",
                "ec2:ModifyVpcEndpointServicePermissions",
                "ec2:ModifyVpcPeeringConnectionOptions",
                "ec2:ModifyVpcTenancy",
                "ec2:MoveAddressToVpc",
                "ec2:RejectVpcEndpointConnections",
                "ec2:RejectVpcPeeringConnection",
                "ec2:ReleaseAddress",
                "ec2:ReplaceNetworkAclAssociation",
                "ec2:ReplaceNetworkAclEntry",
                "ec2:ReplaceRoute",
                "ec2:ReplaceRouteTableAssociation",
                "ec2:ResetNetworkInterfaceAttribute",
                "ec2:RestoreAddressToClassic",
                "ec2:RevokeSecurityGroupEgress",
                "ec2:RevokeSecurityGroupIngress",
                "ec2:UnassignIpv6Addresses",
                "ec2:UnassignPrivateIpAddresses",
                "ec2:UpdateSecurityGroupRuleDescriptionsEgress",
                "ec2:UpdateSecurityGroupRuleDescriptionsIngress"
            ],
            "Resource": "*"
        },
        {
            "Sid": "CloudWatchPermissions",
            "Effect": "Allow",
            "Action": [
                "cloudwatch:*",
                "logs:*",
                "sns:CreateTopic",
                "sns:ListSubscriptions",
                "sns:ListSubscriptionsByTopic",
                "sns:ListTopics",
                "sns:Subscribe",
                "iam:GetPolicy",
                "iam:GetPolicyVersion",
                "iam:GetRole",
                "oam:ListSinks",
                "rum:*",
                "synthetics:*",
                "xray:*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:DeleteBucket",
                "s3:PutBucketPolicy",
                "s3:PutBucketTagging",
                "s3:PutBucketPublicAccessBlock",
                "s3:PutBucketLogging",
                "s3:DeleteBucketPolicy",
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:PutEncryptionConfiguration",
                "s3:AbortMultipartUpload",
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::*",
                "arn:aws:s3:::*/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "eks:CreateCluster",
                "eks:DeleteCluster",
                "eks:CreateNodegroup",
                "eks:DeleteNodegroup",
                "eks:UpdateNodegroupConfig",
                "eks:UpdateNodegroupVersion",
                "eks:UpdateClusterConfig",
                "eks:UpdateClusterVersion",
                "eks:CreateFargateProfile",
                "eks:DeleteFargateProfile",
                "eks:CreateAddon",
                "eks:DeleteAddon",
                "eks:UpdateAddon",
                "eks:CreateAccessEntry",
                "eks:DeleteAccessEntry",
                "eks:UpdateAccessEntry",
                "eks:AssociateAccessPolicy",
                "eks:AssociateIdentityProviderConfig",
                "eks:DisassociateIdentityProviderConfig",
                "eks:TagResource",
                "eks:UntagResource",
                "eks:AccessKubernetesApi",
                "eks:Describe*",
                "eks:List*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameter",
                "ssm:PutParameter",
                "ssm:DeleteParameter",
                "ssm:DescribeParameters"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:GenerateDataKey"
            ],
            "Resource": "*",
            "Condition": {
                "StringLike": {
                    "kms:ViaService": [
                        "sagemaker.*.amazonaws.com",
                        "ec2.*.amazonaws.com",
                        "s3.*.amazonaws.com",
                        "eks.*.amazonaws.com"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "lambda:CreateFunction",
                "lambda:DeleteFunction",
                "lambda:GetFunction",
                "lambda:UpdateFunctionCode",
                "lambda:UpdateFunctionConfiguration",
                "lambda:AddPermission",
                "lambda:RemovePermission",
                "lambda:PublishLayerVersion",
                "lambda:DeleteLayerVersion",
                "lambda:InvokeFunction",
                "lambda:Get*",
                "lambda:List*",
                "lambda:TagResource"
            ],
            "Resource": [
                "arn:aws:lambda:*:*:function:*",
                "arn:aws:lambda:*:*:layer:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:DeleteRole",
                "iam:DeleteRolePolicy"
            ],
            "Resource": [
                "arn:aws:iam::*:role/*sagemaker*",
                "arn:aws:iam::*:role/*eks*",
                "arn:aws:iam::*:role/*hyperpod*",
                "arn:aws:iam::*:policy/*sagemaker*",
                "arn:aws:iam::*:policy/*hyperpod*",
                "arn:aws:iam::*:role/*LifeCycleScriptStack*",
                "arn:aws:iam::*:role/*LifeCycleScript*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:CreateRole",
                "iam:TagRole",
                "iam:PutRolePolicy",
                "iam:Get*",
                "iam:List*",
                "iam:AttachRolePolicy",
                "iam:DetachRolePolicy"
            ],
            "Resource": [
                "arn:aws:iam::*:role/*",
                "arn:aws:iam::*:policy/*"
            ]
        },
        {
            "Sid": "FullAccessToFSx",
            "Effect": "Allow",
            "Action": [
                "fsx:AssociateFileGateway",
                "fsx:AssociateFileSystemAliases",
                "fsx:CancelDataRepositoryTask",
                "fsx:CopyBackup",
                "fsx:CopySnapshotAndUpdateVolume",
                "fsx:CreateAndAttachS3AccessPoint",
                "fsx:CreateBackup",
                "fsx:CreateDataRepositoryAssociation",
                "fsx:CreateDataRepositoryTask",
                "fsx:CreateFileCache",
                "fsx:CreateFileSystem",
                "fsx:CreateFileSystemFromBackup",
                "fsx:CreateSnapshot",
                "fsx:CreateStorageVirtualMachine",
                "fsx:CreateVolume",
                "fsx:CreateVolumeFromBackup",
                "fsx:DetachAndDeleteS3AccessPoint",
                "fsx:DeleteBackup",
                "fsx:DeleteDataRepositoryAssociation",
                "fsx:DeleteFileCache",
                "fsx:DeleteFileSystem",
                "fsx:DeleteSnapshot",
                "fsx:DeleteStorageVirtualMachine",
                "fsx:DeleteVolume",
                "fsx:DescribeAssociatedFileGateways",
                "fsx:DescribeBackups",
                "fsx:DescribeDataRepositoryAssociations",
                "fsx:DescribeDataRepositoryTasks",
                "fsx:DescribeFileCaches",
                "fsx:DescribeFileSystemAliases",
                "fsx:DescribeFileSystems",
                "fsx:DescribeS3AccessPointAttachments",
                "fsx:DescribeSharedVpcConfiguration",
                "fsx:DescribeSnapshots",
                "fsx:DescribeStorageVirtualMachines",
                "fsx:DescribeVolumes",
                "fsx:DisassociateFileGateway",
                "fsx:DisassociateFileSystemAliases",
                "fsx:ListTagsForResource",
                "fsx:ManageBackupPrincipalAssociations",
                "fsx:ReleaseFileSystemNfsV3Locks",
                "fsx:RestoreVolumeFromSnapshot",
                "fsx:TagResource",
                "fsx:UntagResource",
                "fsx:UpdateDataRepositoryAssociation",
                "fsx:UpdateFileCache",
                "fsx:UpdateFileSystem",
                "fsx:UpdateSharedVpcConfiguration",
                "fsx:UpdateSnapshot",
                "fsx:UpdateStorageVirtualMachine",
                "fsx:UpdateVolume"
            ],
            "Resource": "*"
        }
    ]
}
```

------

## IAM users for cluster admin
<a name="sagemaker-hyperpod-prerequisites-iam-cluster-admin"></a>

Cluster administrators (admins) operate and configure SageMaker HyperPod clusters, performing the tasks in [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md). The following policy example includes the minimum set of permissions for cluster administrators to run the SageMaker HyperPod core APIs and manage SageMaker HyperPod clusters within your AWS account.

**Note**  
IAM users with cluster admin roles can use condition keys to provide granular access control when managing SageMaker HyperPod cluster resources specifically for the `CreateCluster` and `UpdateCluster` actions. To find the condition keys supported for these actions, search for `CreateCluster` or `UpdateCluster` in the [Actions defined by SageMaker AI](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonsagemaker.html#amazonsagemaker-actions-as-permissions).

------
#### [ Slurm ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateCluster",
                "sagemaker:ListClusters"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:DeleteCluster",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:UpdateCluster",
                "sagemaker:UpdateClusterSoftware",
                "sagemaker:BatchDeleteClusterNodes"
            ],
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:cluster/*"
        }
    ]
}
```

------
#### [ Amazon EKS ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::111122223333:role/execution-role-name"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateCluster",
                "sagemaker:DeleteCluster",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "sagemaker:UpdateCluster",
                "sagemaker:UpdateClusterSoftware",
                "sagemaker:BatchAddClusterNodes",
                "sagemaker:BatchDeleteClusterNodes",
                "sagemaker:ListComputeQuotas",
                "sagemaker:ListClusterSchedulerConfigs",
                "sagemaker:DeleteClusterSchedulerConfig",
                "sagemaker:DeleteComputeQuota",
                "eks:DescribeCluster",
                "eks:CreateAccessEntry",
                "eks:DescribeAccessEntry",
                "eks:DeleteAccessEntry",
                "eks:AssociateAccessPolicy",
                "iam:CreateServiceLinkedRole"
            ],
            "Resource": "*"
        }
    ]
}
```

------

To grant permissions to access the SageMaker AI console, use the sample policy provided at [Permissions required to use the Amazon SageMaker AI console](https://docs.aws.amazon.com/sagemaker/latest/dg/security_iam_id-based-policy-examples.html#console-permissions).

To grant permissions to access the AWS Systems Manager console, use the sample policy provided at [Using the AWS Systems Manager console](https://docs.aws.amazon.com/systems-manager/latest/userguide/security_iam_id-based-policy-examples.html#security_iam_id-based-policy-examples-console) in the *AWS Systems Manager User Guide*.

You might also consider attaching the [`AmazonSageMakerFullAccess`](security-iam-awsmanpol.md#security-iam-awsmanpol-AmazonSageMakerFullAccess) policy to the role; however, note that the `AmazonSageMakerFullAccess` policy grants permissions for all SageMaker API calls, features, and resources.

For guidance on IAM users in general, see [IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users.html) in the *AWS Identity and Access Management User Guide*.

## IAM users for scientists
<a name="sagemaker-hyperpod-prerequisites-iam-cluster-user"></a>

Scientists log in to SageMaker HyperPod cluster nodes provisioned by cluster admins and run ML workloads on them. For scientists in your AWS account, grant the `"ssm:StartSession"` permission, which is required to run the SSM `start-session` command. The following are policy examples for IAM users.

------
#### [ Slurm ]

Add the following policy to grant SSM session permissions to connect to an SSM target for all resources. This allows you to access HyperPod clusters.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	             
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": "*"    
        }
    ]
}
```

------

------
#### [ Amazon EKS ]

Grant the following permissions to the IAM roles of data scientists so that they can run the `hyperpod list-clusters` and `hyperpod connect-cluster` HyperPod CLI commands. To learn more about the HyperPod CLI, see [Running jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-run-jobs.md). The policy also includes SSM session permissions to connect to an SSM target for all resources, which allows you to access HyperPod clusters.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "DescribeHyerpodClusterPermissions",
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeCluster"
            ],
            "Resource": "arn:aws:sagemaker:us-east-2:111122223333:cluster/hyperpod-cluster-name"
        },
        {
            "Sid": "UseEksClusterPermissions",
            "Effect": "Allow",
            "Action": [
                "eks:DescribeCluster"
            ],
            "Resource": "arn:aws:sagemaker:us-east-2:111122223333:cluster/eks-cluster-name"
        },
        {
            "Sid": "ListClustersPermission",
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListClusters"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession",
                "ssm:TerminateSession"
            ],
            "Resource": "*"    
        }
    ]
}
```

------

To grant data scientists' IAM users or roles access to the Kubernetes APIs in the cluster, see also [Grant IAM users and roles access to Kubernetes APIs](https://docs.aws.amazon.com/eks/latest/userguide/grant-k8s-access.html) in the *Amazon EKS User Guide*.

------
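With `ssm:StartSession` granted, a scientist connects to a node by passing an SSM target that combines the cluster ID, instance group name, and instance ID. The following sketch shows how that target string is composed; all IDs are hypothetical placeholders, so substitute your own values.

```shell
# Placeholders for illustration; replace with your own cluster values.
CLUSTER_ID="abcd1234efgh"            # the ID portion of your cluster ARN
INSTANCE_GROUP="worker-group-1"
INSTANCE_ID="i-0123456789abcdef0"

# HyperPod SSM targets use the form sagemaker-cluster:<cluster-id>_<group>_<instance-id>.
TARGET="sagemaker-cluster:${CLUSTER_ID}_${INSTANCE_GROUP}_${INSTANCE_ID}"
echo "${TARGET}"

# Requires ssm:StartSession on the caller's IAM identity:
# aws ssm start-session --target "${TARGET}"
```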

## IAM role for SageMaker HyperPod
<a name="sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod"></a>

For SageMaker HyperPod clusters to run and communicate with the necessary AWS resources, you need to create an IAM role for the HyperPod cluster to assume. 

Start by attaching the [AWS managed policy: AmazonSageMakerHyperPodServiceRolePolicy](security-iam-awsmanpol-AmazonSageMakerHyperPodServiceRolePolicy.md). With this AWS managed policy, SageMaker HyperPod cluster instance groups assume the role to communicate with Amazon CloudWatch, Amazon S3, and the AWS Systems Manager Agent (SSM Agent). This managed policy is the minimum requirement for SageMaker HyperPod resources to run properly, so you must provide an IAM role with this policy to all instance groups. 

**Tip**  
Depending on how you want to design the level of permissions for multiple instance groups, you can also set up multiple IAM roles and attach them to different instance groups. When you set up access for scientists to specific SageMaker HyperPod cluster nodes through [AWS Systems Manager](https://aws.amazon.com/systems-manager/) (see also [Setting up AWS Systems Manager and Run As for cluster user access control](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-ssm)), the nodes assume the role with the selective permissions you manually attach.

After you finish creating the IAM roles, make note of their names and ARNs. You use the roles when creating a SageMaker HyperPod cluster, granting each instance group the permissions it needs to communicate with the necessary AWS resources.
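As a sketch of this step (the role name `HyperPodExecutionRole` and the file path are illustrative assumptions, not fixed names), you could create the role with a trust policy for the SageMaker service principal and then attach a managed policy such as `AmazonSageMakerClusterInstanceRolePolicy`:

```shell
# Write a trust policy that lets SageMaker assume the role.
cat > hyperpod-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "sagemaker.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create the role and attach the managed policy (commented out;
# run only with appropriate IAM permissions in your account):
# aws iam create-role --role-name HyperPodExecutionRole \
#     --assume-role-policy-document file://hyperpod-trust-policy.json
# aws iam attach-role-policy --role-name HyperPodExecutionRole \
#     --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerClusterInstanceRolePolicy
```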

------
#### [ Slurm ]

For HyperPod orchestrated with Slurm, you must attach the following managed policy to the SageMaker HyperPod IAM role.
+ [AmazonSageMakerClusterInstanceRolePolicy](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerClusterInstanceRolePolicy.html)

**(Optional) Additional permissions for using SageMaker HyperPod with Amazon Virtual Private Cloud**

If you want to use your own Amazon Virtual Private Cloud (VPC) instead of the default SageMaker AI VPC, you should add the following additional permissions to the IAM role for SageMaker HyperPod.

```
{
    "Effect": "Allow",
    "Action": [
        "ec2:CreateNetworkInterface",
        "ec2:CreateNetworkInterfacePermission",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteNetworkInterfacePermission",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeVpcs",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups",
        "ec2:DetachNetworkInterface"
    ],
    "Resource": "*"
},
{
    "Effect": "Allow",
    "Action": "ec2:CreateTags",
    "Resource": [
        "arn:aws:ec2:*:*:network-interface/*"
    ]
}
```

The following list breaks down which permissions are needed to enable SageMaker HyperPod cluster functionalities when you configure the cluster with your own Amazon VPC.
+ The following `ec2` permissions are required to enable configuring a SageMaker HyperPod cluster with your VPC.

  ```
  {
      "Effect": "Allow",
      "Action": [
          "ec2:CreateNetworkInterface",
          "ec2:CreateNetworkInterfacePermission",
          "ec2:DeleteNetworkInterface",
          "ec2:DeleteNetworkInterfacePermission",
          "ec2:DescribeNetworkInterfaces",
          "ec2:DescribeVpcs",
          "ec2:DescribeDhcpOptions",
          "ec2:DescribeSubnets",
          "ec2:DescribeSecurityGroups"
      ],
      "Resource": "*"
  }
  ```
+ The following `ec2` permission is required to enable the [SageMaker HyperPod auto-resume functionality](sagemaker-hyperpod-resiliency-slurm-auto-resume.md).

  ```
  {
      "Effect": "Allow",
      "Action": [
          "ec2:DetachNetworkInterface"
      ],
      "Resource": "*"
  }
  ```
+ The following `ec2` permission allows SageMaker HyperPod to create tags on the network interfaces within your account.

  ```
  {
      "Effect": "Allow",
      "Action": "ec2:CreateTags",
      "Resource": [
          "arn:aws:ec2:*:*:network-interface/*"
      ]
  }
  ```

------
#### [ Amazon EKS ]

For HyperPod orchestrated with Amazon EKS, you must attach the following managed policy to the SageMaker HyperPod IAM role.
+ [AmazonSageMakerClusterInstanceRolePolicy](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerClusterInstanceRolePolicy.html)

In addition to the managed policy, attach the following permission policy to the role.

------
#### [ JSON ]

****  

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:AssignPrivateIpAddresses",
        "ec2:AttachNetworkInterface",
        "ec2:CreateNetworkInterface",
        "ec2:CreateNetworkInterfacePermission",
        "ec2:DeleteNetworkInterface",
        "ec2:DeleteNetworkInterfacePermission",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeNetworkInterfaces",
        "ec2:DescribeTags",
        "ec2:DescribeVpcs",
        "ec2:DescribeDhcpOptions",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups",
        "ec2:DetachNetworkInterface",
        "ec2:ModifyNetworkInterfaceAttribute",
        "ec2:UnassignPrivateIpAddresses",
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetAuthorizationToken",
        "ecr:GetDownloadUrlForLayer",
        "eks-auth:AssumeRoleForPodIdentity"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:CreateTags"
      ],
      "Resource": [
        "arn:aws:ec2:*:*:network-interface/*"
      ]
    }
  ]
}
```

------

**Note**  
The `"eks-auth:AssumeRoleForPodIdentity"` permission is optional. It's required if you plan to use EKS Pod identity.

**SageMaker HyperPod service-linked role**

For Amazon EKS support in SageMaker HyperPod, HyperPod creates a service-linked role with [AWS managed policy: AmazonSageMakerHyperPodServiceRolePolicy](security-iam-awsmanpol-AmazonSageMakerHyperPodServiceRolePolicy.md) to monitor your EKS cluster and to support resiliency operations, such as replacing faulty nodes and restarting jobs.

**Additional IAM policies for Amazon EKS cluster with restricted instance group (RIG)**

Workloads running in restricted instance groups rely on the execution role to load data from Amazon S3. Add the following Amazon S3 permissions to the execution role so that customization jobs running in restricted instance groups can fetch their input data.

------
#### [ JSON ]

****  

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket"      
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket/*"
      ]
    }
  ]
}
```

------

------

# Customer managed AWS KMS key encryption for SageMaker HyperPod
<a name="smcluster-cmk"></a>

By default, the root Amazon EBS volume attached to your SageMaker HyperPod cluster is encrypted using an AWS KMS key owned by AWS. You now have the option to encrypt both the root Amazon EBS volume and the secondary volume with your own customer managed KMS keys. The following topic describes how customer managed keys (CMKs) work with volumes in HyperPod clusters.

**Note**  
The following exclusions apply when using customer managed keys for SageMaker HyperPod clusters:  
+ Customer managed key encryption is only supported for clusters that use the continuous node provisioning mode. Restricted instance groups don't support customer managed keys.
+ HyperPod clusters don't currently support passing AWS KMS encryption context in customer managed key encryption requests. Ensure that your KMS key policy isn't scoped down using encryption context conditions, because this prevents the cluster from using the key.
+ KMS key transition isn't currently supported, so you can't change the KMS key specified in your configuration. To use a different key, create a new instance group with the desired key and delete the old instance group.
+ Specifying customer managed keys for HyperPod clusters through the console isn't currently supported.

## Permissions
<a name="smcluster-cmk-permissions"></a>

Before you can use your customer managed key with HyperPod, you must complete the following prerequisites:
+ Ensure that the AWS IAM execution role that you're using for SageMaker AI has the following AWS KMS permissions added. The [`kms:CreateGrant`](https://docs.aws.amazon.com/kms/latest/APIReference/API_CreateGrant.html) permission allows HyperPod to take the following actions using your KMS key:
  + Scaling out your instance count (`UpdateCluster` operations)
  + Adding cluster nodes (`BatchAddClusterNodes` operations)
  + Patching software (`UpdateClusterSoftware` operations)

  For more information on updating your IAM role's permissions, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *IAM User Guide*.

------
#### [ JSON ]

****  

  ```
  {
      "Version":"2012-10-17",		 	 	 
      "Statement": [
          {
              "Effect": "Allow",
              "Action": [
                  "kms:CreateGrant",
                  "kms:DescribeKey"
              ],
              "Resource": "*"
          }
      ]
  }
  ```

------
+ Add the following permissions to your KMS key policy. For more information, see [Change a key policy](https://docs.aws.amazon.com/kms/latest/developerguide/key-policy-modifying.html) in the *AWS KMS Developer Guide*.

------
#### [ JSON ]

****  

  ```
  {
      "Version":"2012-10-17",		 	 	 
      "Id": "hyperpod-key-policy",
      "Statement": [
          {
              "Sid": "Enable IAM User Permissions",
              "Effect": "Allow",
              "Principal": {
                  "AWS": "arn:aws:iam::111122223333:root"
              },
              "Action": "kms:*",
              "Resource": "*"
          },
          {
              "Effect": "Allow",
              "Principal": {
                  "AWS": "arn:aws:iam::111122223333:role/<iam-role>"
              },
              "Action": "kms:CreateGrant",
              "Resource": "arn:aws:kms:us-east-1:111122223333:key/key-id",
              "Condition": {
                  "StringEquals": {
                      "kms:ViaService": "sagemaker.us-east-1.amazonaws.com"
                  },
                  "Bool": {
                      "kms:GrantIsForAWSResource": "true"
                  }
              }
          },
          {
              "Effect": "Allow",
              "Principal": {
                  "AWS": "arn:aws:iam::111122223333:role/<iam-role>"
              },
              "Action": "kms:DescribeKey",
              "Resource": "arn:aws:kms:us-east-1:111122223333:key/key-id",
              "Condition": {
                  "StringEquals": {
                      "kms:ViaService": "sagemaker.us-east-1.amazonaws.com"
                  }
              }
          }
      ]
  }
  ```

------

## How to use your KMS key
<a name="smcluster-cmk-usage"></a>

You can specify your customer managed keys when creating or updating a cluster using the [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) and [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) API operations. The `InstanceStorageConfigs` structure allows up to two `EbsVolumeConfig` configurations, in which you can configure the root Amazon EBS volume and, optionally, a secondary volume. You can use the same KMS key or a different KMS key for each volume, depending on your needs.

You can specify a customer managed key for both volumes, either volume, or neither. However, you can't specify two root volumes or two secondary volumes.

When configuring the root volume, the following requirements apply:
+ `RootVolume` must be set to `True`. The default value is `False`, which configures the secondary volume instead.
+ The `VolumeKmsKeyId` field is required and you must specify your customer managed key. This is because the root volume must always be encrypted with either an AWS owned key or a customer managed key (if you don't specify your own, then an AWS owned key is used).
+ You can't specify the `VolumeSizeInGB` field for root volumes since HyperPod determines the size of the root volume for you.

When configuring the secondary volume, the following requirements apply:
+ `RootVolume` must be `False` (the default value of this field is `False`).
+ The `VolumeKmsKeyId` field is optional. You can use the same customer managed key you specified for the root volume, or you can use a different key.
+ The `VolumeSizeInGB` field is required, since you must specify your desired size for the secondary volume.

**Important**  
When using customer managed keys, we strongly recommend that you use different KMS keys for each instance group in your cluster. Using the same customer managed key across multiple instance groups might lead to unintended continued permissions even if you try to revoke a grant. For example, if you revoke an AWS KMS grant for one instance group's volumes, that instance group might still allow scaling and patching operations due to grants existing on other instance groups that use the same key. To prevent this issue, assign a unique KMS key to each instance group in your cluster. If you need to restrict permissions on instance groups, you can try one of the following options:  
+ Disable the KMS key.
+ Apply deny policies to the KMS key policy.
+ Revoke all instance group grants for the key (rather than revoking one grant).
+ Delete the instance group.
+ Delete the cluster.
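As an illustrative sketch of the grant-auditing and key-disabling options (the key ARN below is a placeholder), you might use the AWS KMS CLI as follows:

```shell
# Placeholder key ARN for illustration; substitute your own key.
KEY_ID="arn:aws:kms:us-east-1:111122223333:key/key-id"
echo "${KEY_ID}"

# List the grants that exist on the key (requires kms:ListGrants):
# aws kms list-grants --key-id "${KEY_ID}"

# Revoke every grant on the key rather than a single one:
# aws kms list-grants --key-id "${KEY_ID}" --query 'Grants[].GrantId' --output text \
#   | xargs -n1 aws kms revoke-grant --key-id "${KEY_ID}" --grant-id

# Or disable the key entirely:
# aws kms disable-key --key-id "${KEY_ID}"
```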

The following examples show how to specify customer managed keys for both root and secondary volumes using the CreateCluster and UpdateCluster APIs. These examples show only the required fields for customer managed key integration. To configure a customer managed key for only one of the volumes, specify only one `EbsVolumeConfig`.

For more information about configuring cluster creation and update requests, see [Creating a SageMaker HyperPod cluster](sagemaker-hyperpod-eks-operate-cli-command-create-cluster.md) and [Updating SageMaker HyperPod cluster configuration](sagemaker-hyperpod-eks-operate-cli-command-update-cluster.md).

------
#### [ CreateCluster ]

The following example shows a [`create-cluster`](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-cluster.html) AWS CLI request with customer managed key encryption.

```
aws sagemaker create-cluster \
  --cluster-name <your-hyperpod-cluster> \
  --instance-groups '[{
    "ExecutionRole": "arn:aws:iam::111122223333:role/<your-SageMaker-Execution-Role>",
    "InstanceCount": 2,
    "InstanceGroupName": "<your-ig-name>",
    "InstanceStorageConfigs": [
            {
                "EbsVolumeConfig": {
                    "RootVolume": True,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/root-volume-key-id"
                }
            },
            {
                "EbsVolumeConfig": {
                    "VolumeSizeInGB": 100,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/secondary-volume-key-id"
                }
            }
    ],
    "InstanceType": "<desired-instance-type>"
  }]' \
  --vpc-config '{
    "SecurityGroupIds": ["<sg-id>"],
    "Subnets": ["<subnet-id>"]
  }'
```

------
#### [ UpdateCluster ]

The following example shows an [`update-cluster`](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) AWS CLI request with customer managed key encryption.

```
aws sagemaker update-cluster \
  --cluster-name <your-hyperpod-cluster> \
  --instance-groups '[{
    "InstanceGroupName": "<your-ig-name>",
    "InstanceStorageConfigs": [
            {
                "EbsVolumeConfig": {
                    "RootVolume": true,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/root-volume-key-id"
                }
            },
            {
                "EbsVolumeConfig": {
                    "VolumeSizeInGB": 100,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/secondary-volume-key-id"
                }
            }
    ]
  }]'
```

------

# SageMaker HyperPod recipes
<a name="sagemaker-hyperpod-recipes"></a>

Amazon SageMaker HyperPod recipes are pre-configured training stacks provided by AWS to help you quickly start training and fine-tuning publicly available foundation models (FMs) from various model families such as Llama, Mistral, Mixtral, or DeepSeek. Recipes automate the end-to-end training loop, including loading datasets, applying distributed training techniques, and managing checkpoints for faster recovery from faults. 

SageMaker HyperPod recipes are particularly beneficial for users who may not have deep machine learning expertise, as they abstract away much of the complexity involved in training large models.

You can run recipes within SageMaker HyperPod or as SageMaker training jobs.

The following tables are maintained in the SageMaker HyperPod GitHub repository and provide the most up-to-date information on the models supported for pre-training and fine-tuning, their respective recipes and launch scripts, supported instance types, and more.
+ For the most current list of supported models, recipes, and launch scripts for pre-training, see the [pre-training table](https://github.com/aws/sagemaker-hyperpod-recipes?tab=readme-ov-file#pre-training).
+ For the most current list of supported models, recipes, and launch scripts for fine-tuning, see the [fine-tuning table](https://github.com/aws/sagemaker-hyperpod-recipes?tab=readme-ov-file#fine-tuning).

For SageMaker HyperPod users, the automation of end-to-end training workflows comes from the integration of the training adapter with SageMaker HyperPod recipes. The training adapter is built on the [NVIDIA NeMo framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html) and the [Neuronx Distributed Training package](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/index.html). If you're familiar with using NeMo, the process of using the training adapter is the same. The training adapter runs the recipe on your cluster.

![\[Diagram showing SageMaker HyperPod recipe workflow. A "Recipe" icon at the top feeds into a "HyperPod recipe launcher" box. This box connects to a larger section labeled "Cluster: Slurm, K8s, ..." containing three GPU icons with associated recipe files. The bottom of the cluster section is labeled "Train with HyperPod Training Adapter".\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-hyperpod-recipes-overview.png)


You can also train your own model by defining your own custom recipe.

To get started with a tutorial, see [Tutorials](sagemaker-hyperpod-recipes-tutorials.md).

**Topics**
+ [Tutorials](sagemaker-hyperpod-recipes-tutorials.md)
+ [Default configurations](default-configurations.md)
+ [Cluster-specific configurations](cluster-specific-configurations.md)
+ [Considerations](cluster-specific-configurations-special-considerations.md)
+ [Advanced settings](cluster-specific-configurations-advanced-settings.md)
+ [Appendix](appendix.md)

# Tutorials
<a name="sagemaker-hyperpod-recipes-tutorials"></a>

The following quick-start tutorials help you get started with using the recipes for training:
+ SageMaker HyperPod with Slurm Orchestration
  + Pre-training
    + [HyperPod Slurm cluster pre-training tutorial (GPU)](hyperpod-gpu-slurm-pretrain-tutorial.md)
    + [Trainium Slurm cluster pre-training tutorial](hyperpod-trainium-slurm-cluster-pretrain-tutorial.md)
  + Fine-tuning
    + [HyperPod Slurm cluster PEFT-Lora tutorial (GPU)](hyperpod-gpu-slurm-peft-lora-tutorial.md)
    + [HyperPod Slurm cluster DPO tutorial (GPU)](hyperpod-gpu-slurm-dpo-tutorial.md)
+ SageMaker HyperPod with K8s Orchestration
  + Pre-training
    + [Kubernetes cluster pre-training tutorial (GPU)](sagemaker-hyperpod-gpu-kubernetes-cluster-pretrain-tutorial.md)
    + [Trainium SageMaker training jobs pre-training tutorial](sagemaker-hyperpod-trainium-sagemaker-training-jobs-pretrain-tutorial.md)
+ SageMaker training jobs
  + Pre-training
    + [SageMaker training jobs pre-training tutorial (GPU)](sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial.md)
    + [Trainium SageMaker training jobs pre-training tutorial](sagemaker-hyperpod-trainium-sagemaker-training-jobs-pretrain-tutorial.md)

# HyperPod Slurm cluster pre-training tutorial (GPU)
<a name="hyperpod-gpu-slurm-pretrain-tutorial"></a>

The following tutorial sets up a Slurm environment and starts a training job on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment to run the recipe, make sure you have:  
+ A HyperPod GPU Slurm cluster set up.
+ Nvidia Enroot and Pyxis enabled on the HyperPod Slurm cluster (these are enabled by default).
+ A shared storage location, such as an Amazon FSx file system or an NFS system, that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).
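If your dataset starts as plain JSON, producing the JSONGZ variant is just a matter of gzip-compressing the file. A minimal sketch, with a made-up file name and record:

```shell
# Create a small JSON file and compress it to JSONGZ; -k keeps the original, -f overwrites.
printf '{"text": "example training sample"}\n' > train.json
gzip -kf train.json
gunzip -c train.json.gz    # round-trip check: prints the original record
```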

## HyperPod GPU Slurm environment setup
<a name="hyperpod-gpu-slurm-environment-setup"></a>

To initiate a training job on a HyperPod GPU Slurm cluster, do the following:

1. SSH into the head node of your Slurm cluster.

1. After you log in, set up the virtual environment. Make sure you're using Python 3.9 or greater.

   ```
   #set up a virtual environment
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location.

   ```
   git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
   git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
   cd sagemaker-hyperpod-recipes
   pip3 install -r requirements.txt
   ```

1. Create a squash file using Enroot. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md). To gain a deeper understanding of how to use the Enroot file, see [Build AWS-optimized Nemo-Launcher image](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher#2-build-aws-optimized-nemo-launcher-image).

   ```
   REGION="<region>"
   IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
   aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
   enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
   mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
   ```

1. To use the Enroot squash file to start training, use the following example to modify the `recipes_collection/config.yaml` file.

   ```
   container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
   ```

## Launch the training job
<a name="hyperpod-gpu-slurm-launch-training-job"></a>

After you install the dependencies, start a training job from the `sagemaker-hyperpod-recipes/launcher_scripts` directory. You get the dependencies by cloning the [SageMaker HyperPod recipes repository](https://github.com/aws/sagemaker-hyperpod-recipes).

First, pick your training recipe from GitHub; the model name is specified as part of the recipe. In the following example, we use the `launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh` script to launch the Llama 3 8B pre-training recipe with a 16K (16,384) sequence length, `llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain`. Set the following in the script:
+ `IMAGE`: The container from the environment setup section.
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=<your_hf_token>
  ```

```
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"

TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset

# Experiment output directory
EXP_DIR="${YOUR_EXP_DIR}"

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
  recipes=training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
  base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
  recipes.run.name="hf_llama3_8b" \
  recipes.exp_manager.exp_dir="$EXP_DIR" \
  recipes.model.data.train_dir="$TRAIN_DIR" \
  recipes.model.data.val_dir="$VAL_DIR" \
  container="${IMAGE}" \
  +cluster.container_mounts.0="/fsx:/fsx"
```

After you've configured all the required parameters in the launcher script, you can run the script using the following command.

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

# Trainium Slurm cluster pre-training tutorial
<a name="hyperpod-trainium-slurm-cluster-pretrain-tutorial"></a>

The following tutorial sets up a Trainium environment on a Slurm cluster and starts a training job on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment, make sure you have:  
+ A SageMaker HyperPod Trainium Slurm cluster set up.
+ A shared storage location, such as an Amazon FSx file system or an NFS system, that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up the Trainium environment on the Slurm Cluster
<a name="hyperpod-trainium-slurm-cluster-pretrain-setup-trainium-environment"></a>

To initiate a training job on a Slurm cluster, do the following:
+ SSH into the head node of your Slurm cluster.
+ After you log in, set up the Neuron environment. For information about setting up Neuron, see [Neuron setup steps](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_SFT.html#setting-up-the-environment). We recommend using the Deep Learning AMIs that come with the Neuron drivers pre-installed, such as [Ubuntu 20 with DLAMI PyTorch](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu20-pytorch-dlami.html#setup-torch-neuronx-ubuntu20-dlami-pytorch).
+ Clone the SageMaker HyperPod recipes repository to a shared storage location in the cluster. The shared storage location can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.

  ```
  git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Go through the following tutorial: [HuggingFace Llama3-8B Pretraining](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_pretraining.html#)
+ Prepare a model configuration. The model configurations are available in the Neuron repository. For the model configuration used in this tutorial, see the [llama3 8b model config](https://github.com/aws-neuron/neuronx-distributed/blob/main/examples/training/llama/tp_zero1_llama_hf_pretrain/8B_config_llama3/config.json).

## Launch the training job in Trainium
<a name="hyperpod-trainium-slurm-cluster-pretrain-launch-training-job-trainium"></a>

To launch a training job in Trainium, specify a cluster configuration and a Neuron recipe. For example, to launch a Llama 3 8B pre-training job on Trainium, set the launch script, `launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh`, to the following:
+ `MODEL_CONFIG`: The model config from the environment setup section
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=<your_hf_token>
  ```

```
#!/bin/bash

#Users should set up their cluster type in /recipes_collection/config.yaml

SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

COMPILE=0
TRAIN_DIR="${TRAIN_DIR}" # Location of training dataset
MODEL_CONFIG="${MODEL_CONFIG}" # Location of config.json for the model

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    instance_type="trn1.32xlarge" \
    recipes.run.compile="$COMPILE" \
    recipes.run.name="hf-llama3-8b" \
    recipes.trainer.num_nodes=4 \
    recipes=training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain \
    recipes.data.train_dir="$TRAIN_DIR" \
    recipes.model.model_config="$MODEL_CONFIG"
```

To launch the training job, run the following command:

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

# HyperPod Slurm cluster DPO tutorial (GPU)
<a name="hyperpod-gpu-slurm-dpo-tutorial"></a>

The following tutorial sets up a Slurm environment and starts a direct preference optimization (DPO) job on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment, make sure you have:
+ A HyperPod GPU Slurm cluster set up, with NVIDIA Enroot and Pyxis enabled (these are enabled by default).
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ A tokenized binary preference dataset in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) If you need the pre-trained weights from HuggingFace or if you're training a Llama 3.2 model, you must get the HuggingFace token before you start training. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).
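As a hedged illustration of the JSONGZ format listed above: a compressed preference file is simply gzip-compressed JSON lines. The field names below are hypothetical, and the actual schema (including tokenization) is defined by the recipe's data pipeline.

```python
import gzip
import json

# Hypothetical preference records; the real field names and tokenization
# are defined by the recipe's data pipeline, not by this sketch.
records = [
    {
        "prompt": "Summarize DPO in one sentence.",
        "chosen": "DPO fine-tunes a model directly on preference pairs.",
        "rejected": "DPO is a kind of database.",
    },
]

# JSONGZ: one JSON object per line, gzip-compressed.
with gzip.open("preferences.jsonl.gz", "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read it back to verify the round trip.
with gzip.open("preferences.jsonl.gz", "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```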

## Set up the HyperPod GPU Slurm environment
<a name="hyperpod-gpu-slurm-dpo-hyperpod-gpu-slurm-environment"></a>

To initiate a training job on a Slurm cluster, do the following:
+ SSH into the head node of your Slurm cluster.
+ After you log in, set up the virtual environment. Make sure you're using Python 3.9 or greater.

  ```
  #set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
+ Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location. The shared storage location can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.

  ```
  git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
  git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Create a squash file using Enroot. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md). For more information about using the Enroot file, see [Build AWS-optimized Nemo-Launcher image](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher#2-build-aws-optimized-nemo-launcher-image).

  ```
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
  enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
  mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
  ```
+ To use the Enroot squash file to start training, use the following example to modify the `recipes_collection/config.yaml` file.

  ```
  container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
  ```

## Launch the training job
<a name="hyperpod-gpu-slurm-dpo-launch-training-job"></a>

To launch a DPO job for the Llama 8 billion parameter model with a sequence length of 8192 on a single Slurm compute node, set the launch script, `launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_dpo.sh`, to the following:
+ `IMAGE`: The container from the environment setup section.
+ `HF_MODEL_NAME_OR_PATH`: Define the name or the path of the pre-trained weights in the `hf_model_name_or_path` parameter of the recipe.
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=${HF_ACCESS_TOKEN}
  ```

**Note**  
The reference model used for DPO in this setup is automatically derived from the base model being trained (no separate reference model is explicitly defined). DPO-specific hyperparameters are preconfigured with the following default values:  
`beta`: 0.1 (controls the strength of KL divergence regularization)
`label_smoothing`: 0.0 (no smoothing applied to preference labels)

```
recipes.dpo.beta=${BETA}
recipes.dpo.label_smoothing=${LABEL_SMOOTHING}
```
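For intuition about these two values, the standard DPO objective (an illustrative sketch, not SageMaker's internal implementation) uses `beta` to scale the policy-versus-reference log-probability margin and `label_smoothing` to soften the preference labels:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp,
             beta=0.1, label_smoothing=0.0):
    """Standard DPO loss for one preference pair (illustrative sketch)."""
    # Margin of the policy over the reference model, scaled by beta.
    logits = beta * ((policy_chosen_logp - policy_rejected_logp)
                     - (ref_chosen_logp - ref_rejected_logp))

    def log_sigmoid(x):
        return -math.log1p(math.exp(-x))

    # label_smoothing interpolates between the pair label and its flip.
    return (-(1.0 - label_smoothing) * log_sigmoid(logits)
            - label_smoothing * log_sigmoid(-logits))
```

With the defaults, a pair where the policy and reference agree exactly gives the neutral loss `ln 2`, and the loss drops as the policy prefers the chosen response more strongly than the reference does.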

```
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"

TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset
# experiment output directory
EXP_DIR="${YOUR_EXP_DIR}"
HF_ACCESS_TOKEN="${YOUR_HF_TOKEN}"
HF_MODEL_NAME_OR_PATH="${HF_MODEL_NAME_OR_PATH}"
BETA="${BETA}"
LABEL_SMOOTHING="${LABEL_SMOOTHING}"

# Add hf_model_name_or_path and turn off synthetic_data
HYDRA_FULL_ERROR=1 python3 ${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py \
recipes=fine-tuning/llama/hf_llama3_8b_seq8k_gpu_dpo \
base_results_dir=${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results \
recipes.run.name="hf_llama3_dpo" \
recipes.exp_manager.exp_dir="$EXP_DIR" \
recipes.model.data.train_dir="$TRAIN_DIR" \
recipes.model.data.val_dir="$VAL_DIR" \
recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
container="${IMAGE}" \
+cluster.container_mounts.0="/fsx:/fsx" \
recipes.model.hf_access_token="${HF_ACCESS_TOKEN}" \
recipes.dpo.enabled=true \
recipes.dpo.beta="${BETA}" \
recipes.dpo.label_smoothing="${LABEL_SMOOTHING}"
```
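Since the script reads its inputs from shell variables, a quick preflight check (a hypothetical helper, not part of the recipes repository) can catch unset values before you submit the job:

```python
import os

# Variables the DPO launch script above expects to be set.
REQUIRED = ["YOUR_IMAGE", "YOUR_TRAIN_DIR", "YOUR_VAL_DIR",
            "YOUR_EXP_DIR", "HF_MODEL_NAME_OR_PATH"]

def missing_vars(env=os.environ):
    """Return the required launcher variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]
```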

After you've configured all the required parameters in the preceding script, you can initiate the training job by running it.

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_dpo.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

# HyperPod Slurm cluster PEFT-Lora tutorial (GPU)
<a name="hyperpod-gpu-slurm-peft-lora-tutorial"></a>

The following tutorial sets up a Slurm environment and starts a parameter-efficient fine-tuning (PEFT) job on a Llama 8 billion parameter model.

**Prerequisites**  
Before you start setting up your environment, make sure you have:
+ A HyperPod GPU Slurm cluster set up, with NVIDIA Enroot and Pyxis enabled (these are enabled by default).
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) If you need the pre-trained weights from HuggingFace or if you're training a Llama 3.2 model, you must get the HuggingFace token before you start training. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up the HyperPod GPU Slurm environment
<a name="hyperpod-gpu-slurm-peft-lora-setup-hyperpod-gpu-slurm-environment"></a>

To initiate a training job on a Slurm cluster, do the following:
+ SSH into the head node of your Slurm cluster.
+ After you log in, set up the virtual environment. Make sure you're using Python 3.9 or greater.

  ```
  #set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
+ Clone the SageMaker HyperPod recipes and SageMaker HyperPod adapter repositories to a shared storage location. The shared storage location can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.

  ```
  git clone https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo.git
  git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Create a squash file using Enroot. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md). For more information about using the Enroot file, see [Build AWS-optimized Nemo-Launcher image](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/2.nemo-launcher#2-build-aws-optimized-nemo-launcher-image).

  ```
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  aws ecr get-login-password --region ${REGION} | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
  enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
  mv $PWD/smdistributed-modelparallel.sqsh "/fsx/<any-path-in-the-shared-filesystem>"
  ```
+ To use the Enroot squash file to start training, use the following example to modify the `recipes_collection/config.yaml` file.

  ```
  container: /fsx/path/to/your/smdistributed-modelparallel.sqsh
  ```

## Launch the training job
<a name="hyperpod-gpu-slurm-peft-lora-launch-training-job"></a>

To launch a PEFT job for the Llama 8 billion parameter model with a sequence length of 8192 on a single Slurm compute node, set the launch script, `launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_lora.sh`, to the following:
+ `IMAGE`: The container from the environment setup section.
+ `HF_MODEL_NAME_OR_PATH`: Define the name or the path of the pre-trained weights in the `hf_model_name_or_path` parameter of the recipe.
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=${HF_ACCESS_TOKEN}
  ```

```
#!/bin/bash
IMAGE="${YOUR_IMAGE}"
SAGEMAKER_TRAINING_LAUNCHER_DIR="${SAGEMAKER_TRAINING_LAUNCHER_DIR:-${PWD}}"

TRAIN_DIR="${YOUR_TRAIN_DIR}" # Location of training dataset
VAL_DIR="${YOUR_VAL_DIR}" # Location of validation dataset

# experiment output directory
EXP_DIR="${YOUR_EXP_DIR}"
HF_ACCESS_TOKEN="${YOUR_HF_TOKEN}"
HF_MODEL_NAME_OR_PATH="${YOUR_HF_MODEL_NAME_OR_PATH}"

# Add hf_model_name_or_path and turn off synthetic_data
HYDRA_FULL_ERROR=1 python3 ${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py \
    recipes=fine-tuning/llama/hf_llama3_8b_seq8k_gpu_lora \
    base_results_dir=${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results \
    recipes.run.name="hf_llama3_lora" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    container="${IMAGE}" \
    +cluster.container_mounts.0="/fsx:/fsx" \
    recipes.model.hf_access_token="${HF_ACCESS_TOKEN}"
```

After you've configured all the required parameters in the preceding script, you can initiate the training job by running it.

```
bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_gpu_lora.sh
```

For more information about the Slurm cluster configuration, see [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md).

# Kubernetes cluster pre-training tutorial (GPU)
<a name="sagemaker-hyperpod-gpu-kubernetes-cluster-pretrain-tutorial"></a>

There are two ways to launch a training job in a GPU Kubernetes cluster:
+ (Recommended) [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli)
+ The NeMo style launcher

**Prerequisites**  
Before you start setting up your environment, make sure you have:
+ A properly set up HyperPod GPU Kubernetes cluster.
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) You must get a HuggingFace token if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## GPU Kubernetes environment setup
<a name="sagemaker-hyperpod-gpu-kubernetes-environment-setup"></a>

To set up a GPU Kubernetes environment, do the following:
+ Set up the virtual environment. Make sure you're using Python 3.9 or greater.

  ```
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  ```
+ Install dependencies using one of the following methods:
  + (Recommended): [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli) method:

    ```
    # install HyperPod command line tools
    git clone https://github.com/aws/sagemaker-hyperpod-cli
    cd sagemaker-hyperpod-cli
    pip3 install .
    ```
  + SageMaker HyperPod recipes method:

    ```
    # install SageMaker HyperPod Recipes.
    git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
    cd sagemaker-hyperpod-recipes
    pip3 install -r requirements.txt
    ```
+ [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)
+ [Install Helm](https://helm.sh/docs/intro/install/)
+ Connect to your Kubernetes cluster

  ```
  aws eks update-kubeconfig --region "CLUSTER_REGION" --name "CLUSTER_NAME"
  hyperpod connect-cluster --cluster-name "CLUSTER_NAME" [--region "CLUSTER_REGION"] [--namespace <namespace>]
  ```

## Launch the training job with the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-gpu-kubernetes-launch-training-job-cli"></a>

We recommend using the SageMaker HyperPod command-line interface (CLI) tool to submit your training job with your configurations. The following example submits a training job for the `hf_llama3_8b_seq16k_gpu_p5x16_pretrain` model.
+ `your_training_container`: A Deep Learning container. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md).
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  "recipes.model.hf_access_token": "<your_hf_token>"
  ```

```
hyperpod start-job --recipe training/llama/hf_llama3_8b_seq16k_gpu_p5x16_pretrain \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
"recipes.run.name": "hf-llama3-8b",
"recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
"container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
"recipes.model.data.train_dir": "<your_train_data_dir>",
"recipes.model.data.val_dir": "<your_val_data_dir>",
"cluster": "k8s",
"cluster_type": "k8s"
}'
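The argument to `--override-parameters` is plain JSON, so you can also build it programmatically instead of hand-editing the quoted string. Every value below is a placeholder to substitute with your own experiment directory, data paths, region, and container image.

```python
import json

# Placeholder values; substitute your own paths, region, and image tag.
overrides = {
    "recipes.run.name": "hf-llama3-8b",
    "recipes.exp_manager.exp_dir": "/data/my-experiment",
    "container": ("658645717510.dkr.ecr.us-west-2.amazonaws.com"
                  "/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"),
    "recipes.model.data.train_dir": "/data/train",
    "recipes.model.data.val_dir": "/data/val",
    "cluster": "k8s",
    "cluster_type": "k8s",
}

# Quote this string as the value of --override-parameters.
override_json = json.dumps(overrides, indent=1)
```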
```

After you've submitted a training job, you can use the following command to verify if you submitted it successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `PENDING` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod name_of_pod
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs name_of_pod
```

The `STATUS` becomes `Completed` when you run `kubectl get pods`.

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-gpu-kubernetes-launch-training-job-recipes"></a>

Alternatively, you can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.
+ In `k8s.yaml`, update `persistent_volume_claims`. It mounts the Amazon FSx claim to the `/data` directory of each computing pod.

  ```
  persistent_volume_claims:
    - claimName: fsx-claim
      mountPath: data
  ```
+ In `config.yaml`, update `repo_url_or_path` under `git`.

  ```
  git:
    repo_url_or_path: <training_adapter_repo>
    branch: null
    commit: null
    entry_script: null
    token: null
  ```
+ Update `launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh`
  + `your_container`: A Deep Learning container. To find the most recent release of the SMP container, see [Release notes for the SageMaker model parallelism library](model-parallel-release-notes.md).
  + (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

    ```
    recipes.model.hf_access_token=<your_hf_token>
    ```

  ```
  #!/bin/bash
  #Users should setup their cluster type in /recipes_collection/config.yaml
  REGION="<region>"
  IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
  SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
  EXP_DIR="<your_exp_dir>" # Location to save experiment info including logging, checkpoints, etc.
  TRAIN_DIR="<your_training_data_dir>" # Location of training dataset
  VAL_DIR="<your_val_data_dir>" # Location of validation dataset
  
  HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
      recipes=training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain \
      base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
      recipes.run.name="hf-llama3" \
      recipes.exp_manager.exp_dir="$EXP_DIR" \
      cluster=k8s \
      cluster_type=k8s \
      container="${IMAGE}" \
      recipes.model.data.train_dir=$TRAIN_DIR \
      recipes.model.data.val_dir=$VAL_DIR
  ```
+ Launch the training job

  ```
  bash launcher_scripts/llama/run_hf_llama3_8b_seq16k_gpu_p5x16_pretrain.sh
  ```

After you've submitted the training job, you can use the following command to verify if you submitted it successfully.

```
kubectl get pods
```

```
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `PENDING` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name-of-pod>
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs name_of_pod
```

The `STATUS` will turn to `Completed` when you run `kubectl get pods`.

For more information about the k8s cluster configuration, see [Running a training job on HyperPod k8s](cluster-specific-configurations-run-training-job-hyperpod-k8s.md).

# Trainium Kubernetes cluster pre-training tutorial
<a name="sagemaker-hyperpod-trainium-kubernetes-cluster-pretrain-tutorial"></a>

You can use one of the following methods to start a training job in a Trainium Kubernetes cluster.
+ (Recommended) [HyperPod command-line tool](https://github.com/aws/sagemaker-hyperpod-cli)
+ The NeMo style launcher

**Prerequisites**  
Before you start setting up your environment, make sure you have:
+ A HyperPod Trainium Kubernetes cluster set up.
+ A shared storage location that can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) You must get a HuggingFace token if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up your Trainium Kubernetes environment
<a name="sagemaker-hyperpod-trainium-setup-trainium-kubernetes-environment"></a>

To set up the Trainium Kubernetes environment, do the following:

1. Complete the steps in the following tutorial: [HuggingFace Llama3-8B Pretraining](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-training/tutorials/hf_llama3_8B_pretraining.html#download-the-dataset) starting from **Download the dataset**. 

1. Prepare a model configuration. The model configurations are available in the Neuron repository. For this tutorial, you can use the llama3 8b model config.

1. Set up the virtual environment. Make sure you're using Python 3.9 or greater.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. Install the dependencies
   + (Recommended) Use the following HyperPod command-line tool

     ```
     # install HyperPod command line tools
     git clone https://github.com/aws/sagemaker-hyperpod-cli
     cd sagemaker-hyperpod-cli
     pip3 install .
     ```
   + If you're using SageMaker HyperPod recipes, specify the following

     ```
     # install SageMaker HyperPod Recipes.
     git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
     cd sagemaker-hyperpod-recipes
     pip3 install -r requirements.txt
     ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   hyperpod connect-cluster --cluster-name "${CLUSTER_NAME}" [--region "${CLUSTER_REGION}"] [--namespace <namespace>]
   ```

1. Container: The [Neuron container](https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-training-neuronx)

## Launch the training job with the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-trainium-launch-training-job-cli"></a>

We recommend using the SageMaker HyperPod command-line interface (CLI) tool to submit your training job with your configurations. The following example submits a training job for the `hf_llama3_8b_seq8k_trn1x4_pretrain` Trainium model.
+ `your_neuron_container`: The [Neuron container](https://github.com/aws-neuron/deep-learning-containers?tab=readme-ov-file#pytorch-training-neuronx).
+ `your_model_config`: The model configuration from the environment setup section
+ (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  "recipes.model.hf_access_token": "<your_hf_token>"
  ```

```
hyperpod start-job --recipe training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
 "cluster": "k8s",
 "cluster_type": "k8s",
 "container": "<your_neuron_container>",
 "recipes.run.name": "hf-llama3",
 "recipes.run.compile": 0,
 "recipes.model.model_config": "<your_model_config>",
 "instance_type": "trn1.32xlarge",
 "recipes.data.train_dir": "<your_train_data_dir>"
}'
```

After you've submitted a training job, you can use the following command to verify if you submitted it successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `PENDING` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod name_of_pod
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs name_of_pod
```

The `STATUS` will turn to `Completed` when you run `kubectl get pods`.

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-trainium-launch-training-job-recipes"></a>

Alternatively, use SageMaker HyperPod recipes to submit your training job. To submit the training job using a recipe, update `k8s.yaml` and `config.yaml`, and then run the bash script for the model to launch it.
+ In `k8s.yaml`, update `persistent_volume_claims` to mount the Amazon FSx claim to the `/data` directory in the compute nodes.

  ```
  persistent_volume_claims:
    - claimName: fsx-claim
      mountPath: data
  ```
+ Update `launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh`
  + `your_neuron_container`: The container from the environment setup section
  + `your_model_config`: The model config from the environment setup section

  (Optional) You can provide the HuggingFace token if you need pre-trained weights from HuggingFace by setting the following key-value pair:

  ```
  recipes.model.hf_access_token=<your_hf_token>
  ```

  ```
  #!/bin/bash
  #Users should set up their cluster type in /recipes_collection/config.yaml
  IMAGE="<your_neuron_container>"
  MODEL_CONFIG="<your_model_config>"
  SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
  TRAIN_DIR="<your_training_data_dir>" # Location of training dataset
  VAL_DIR="<your_val_data_dir>" # Location of validation dataset
  
  HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-llama3-8b" \
    instance_type=trn1.32xlarge \
    recipes.model.model_config="$MODEL_CONFIG" \
    cluster=k8s \
    cluster_type=k8s \
    container="${IMAGE}" \
    recipes.data.train_dir=$TRAIN_DIR \
    recipes.data.val_dir=$VAL_DIR
  ```
+ Launch the job

  ```
  bash launcher_scripts/llama/run_hf_llama3_8b_seq8k_trn1x4_pretrain.sh
  ```

After you've submitted a training job, you can use the following command to verify if you submitted it successfully.

```
kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
hf-llama3-<your-alias>-worker-0   0/1     Running   0          36s
```

If the `STATUS` is `PENDING` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod name_of_pod
```

After the job `STATUS` changes to `Running`, you can examine the log by using the following command.

```
kubectl logs name_of_pod
```

The `STATUS` will turn to `Completed` when you run `kubectl get pods`.

For more information about the k8s cluster configuration, see [Running a training job on HyperPod k8s](cluster-specific-configurations-run-training-job-hyperpod-k8s.md).

# SageMaker training jobs pre-training tutorial (GPU)
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial"></a>

This tutorial guides you through the process of setting up and running a pre-training job using SageMaker training jobs with GPU instances.
+ Set up your environment
+ Launch a training job using SageMaker HyperPod recipes

Before you begin, make sure you have the following prerequisites.

**Prerequisites**  
Before you start setting up your environment, make sure you have:
+ An Amazon FSx file system or an Amazon S3 bucket where you can load the data and output the training artifacts.
+ Requested a service quota for 1x ml.p4d.24xlarge and 1x ml.p5.48xlarge on Amazon SageMaker AI. To request a service quota increase, do the following:
  1. On the AWS Service Quotas console, navigate to AWS services.
  1. Choose **Amazon SageMaker AI**.
  1. Choose one ml.p4d.24xlarge and one ml.p5.48xlarge instance.
+ An AWS Identity and Access Management (IAM) role with the following managed policies to give SageMaker AI permissions to run the examples:
  + AmazonSageMakerFullAccess
  + AmazonEC2FullAccess
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) You must get a HuggingFace token if you're using the model weights from HuggingFace for pre-training or fine-tuning. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## GPU SageMaker training jobs environment setup
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-environment-setup"></a>

Before you run a SageMaker training job, configure your AWS credentials and preferred region by running the `aws configure` command. As an alternative to the configure command, you can provide your credentials through environment variables such as `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`. For more information, see [SageMaker AI Python SDK](https://github.com/aws/sagemaker-python-sdk).

We strongly recommend using a SageMaker AI Jupyter notebook in SageMaker AI JupyterLab to launch a SageMaker training job. For more information, see [SageMaker JupyterLab](studio-updated-jl.md).
+ (Optional) Set up the virtual environment and dependencies. If you are using a Jupyter notebook in Amazon SageMaker Studio, you can skip this step. Make sure you're using Python 3.9 or greater.

  ```
  # set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  # install dependencies after git clone.
  
  git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  # Set the aws region.
  
  aws configure set region <your_region>
  ```
+ Install SageMaker AI Python SDK

  ```
  pip3 install --upgrade sagemaker
  ```
+ `Container`: The GPU container is set automatically by the SageMaker AI Python SDK. You can also provide your own container.
**Note**  
If you're running a Llama 3.2 multi-modal training job, the `transformers` version must be `4.45.2` or greater.

  Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you're using the SageMaker AI Python SDK. For example, append it if you're using it in a notebook in SageMaker AI JupyterLab.

  If you're using HyperPod recipes to launch with cluster type `sm_jobs`, this is done automatically.

## Launch the training job using a Jupyter Notebook
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-launch-training-job-notebook"></a>

You can use the following Python code to run a SageMaker training job with your recipe. It leverages the PyTorch estimator from the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/) to submit the recipe. The following example launches the llama3-8b recipe on the SageMaker AI Training platform.

```
import os
import sagemaker,boto3
from sagemaker.debugger import TensorBoardOutputConfig

from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket = sagemaker_session.default_bucket() 
output = os.path.join(f"s3://{bucket}", "output")
output_path = "<s3-URI>"

overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },   
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output, 'tensorboard'),
    container_local_output_path=overrides["exp_manager"]["explicit_log_dir"]
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name=f"llama-recipe",
    role=role,
    instance_type="ml.p5.48xlarge",
    training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
    recipe_overrides=overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)

estimator.fit(inputs={"train": "s3 or fsx input", "val": "s3 or fsx input"}, wait=True)
```

The preceding code creates a PyTorch estimator object with the training recipe and then fits the model using the `fit()` method. Use the `training_recipe` parameter to specify the recipe you want to use for training.
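The dictionary passed to `recipe_overrides` is nested, and the launcher applies it as Hydra-style dotted overrides on top of the recipe. The mapping can be sketched like this (a simplification of the SDK's actual behavior):

```python
def flatten_overrides(overrides, prefix=""):
    """Flatten a nested override dict into Hydra-style dotted keys."""
    flat = {}
    for key, value in overrides.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_overrides(value, dotted))
        else:
            flat[dotted] = value
    return flat

recipe_overrides = {
    "run": {"results_dir": "/opt/ml/model"},
    "model": {"data": {"train_dir": "/opt/ml/input/data/train"}},
}
flat = flatten_overrides(recipe_overrides)
# e.g. flat["model.data.train_dir"] == "/opt/ml/input/data/train"
```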

**Note**  
If you're running a Llama 3.2 multi-modal training job, the transformers version must be 4.45.2 or greater.

Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you're using the SageMaker AI Python SDK directly. For example, append the version to the text file when you're launching from a Jupyter notebook.

When you deploy the endpoint for a SageMaker training job, you must specify the image URI that you're using. If you don't provide the image URI, the estimator uses the training image as the image for the deployment. The training images that SageMaker HyperPod provides don't contain the dependencies required for inference and deployment. The following is an example of how an inference image can be used for deployment:

```
from sagemaker import image_uris
container=image_uris.retrieve(framework='pytorch',region='us-west-2',version='2.0',py_version='py310',image_scope='inference', instance_type='ml.p4d.24xlarge')
predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.p4d.24xlarge',image_uri=container)
```

**Note**  
Running the preceding code on a SageMaker notebook instance might need more than the default 5 GB of storage that SageMaker AI JupyterLab provides. If you run into storage space issues, create a new notebook instance with increased storage.

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-launch-training-job-recipes"></a>

Update the `./recipes_collection/cluster/sm_jobs.yaml` file to look like the following:

```
sm_jobs_config:
  output_path: <s3_output_path>
  tensorboard_config:
    output_path: <s3_output_path>
    container_logs_path: /opt/ml/output/tensorboard  # Path to logs on the container
  wait: True  # Whether to wait for training job to finish
  inputs:  # Inputs to call fit with. Set either s3 or file_system, not both.
    s3:  # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
      train: <s3_train_data_path>
      val: null
  additional_estimator_kwargs:  # All other additional args to pass to estimator. Must be int, float or string.
    max_run: 180000
    enable_remote_debug: True
  recipe_overrides:
    exp_manager:
      explicit_log_dir: /opt/ml/output/tensorboard
    data:
      train_dir: /opt/ml/input/data/train
    model:
      model_config: /opt/ml/input/data/train/config.json
    compiler_cache_url: "<compiler_cache_url>"
```

Update `./recipes_collection/config.yaml` to specify `sm_jobs` for both `cluster` and `cluster_type`.

```
defaults:
  - _self_
  - cluster: sm_jobs  # set to `slurm`, `k8s` or `sm_jobs`, depending on the desired cluster
  - recipes: training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain
cluster_type: sm_jobs  # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.
```

Launch the job with the following command:

```
python3 main.py --config-path recipes_collection --config-name config
```

For more information about configuring SageMaker training jobs, see Run a training job on SageMaker training jobs.

# Trainium SageMaker training jobs pre-training tutorial
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-pretrain-tutorial"></a>

This tutorial guides you through the process of setting up and running a pre-training job using SageMaker training jobs with AWS Trainium instances.
+ Set up your environment
+ Launch a training job

Before you begin, make sure you have the following prerequisites.

**Prerequisites**  
Before you start setting up your environment, make sure you have:
+ An Amazon FSx file system or Amazon S3 bucket where you can load the data and output the training artifacts.
+ A service quota for the `ml.trn1.32xlarge` instance on Amazon SageMaker AI. To request a service quota increase, do the following:
  1. Navigate to the AWS Service Quotas console.
  1. Choose AWS services.
  1. Select JupyterLab.
  1. Specify one instance for `ml.trn1.32xlarge`.
+ An AWS Identity and Access Management (IAM) role with the `AmazonSageMakerFullAccess` and `AmazonEC2FullAccess` managed policies. These policies provide Amazon SageMaker AI with permissions to run the examples.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ (Optional) If you need the pre-trained weights from HuggingFace or if you're training a Llama 3.2 model, you must get the HuggingFace token before you start training. For more information about getting the token, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up your environment for Trainium SageMaker training jobs
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-environment-setup"></a>

Before you run a SageMaker training job, use the `aws configure` command to configure your AWS credentials and preferred region. As an alternative, you can provide your credentials through environment variables such as `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`. For more information, see [SageMaker AI Python SDK](https://github.com/aws/sagemaker-python-sdk).

We strongly recommend using a SageMaker AI Jupyter notebook in SageMaker AI JupyterLab to launch a SageMaker training job. For more information, see [SageMaker JupyterLab](studio-updated-jl.md).
+ (Optional) If you're using a Jupyter notebook in Amazon SageMaker Studio, you can skip running the following commands. Make sure you're using Python 3.9 or greater.

  ```
  # set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  # install dependencies after git clone.
  
  git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  ```
+ Install the SageMaker AI Python SDK

  ```
  pip3 install --upgrade sagemaker
  ```
+ Note the following:
  + If you're running a Llama 3.2 multi-modal training job, the `transformers` version must be `4.45.2` or greater.
    + Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you're using the SageMaker AI Python SDK.
    + If you're using HyperPod recipes to launch with `sm_jobs` as the cluster type, you don't have to specify the transformers version.
  + `Container`: The Neuron container is set automatically by the SageMaker AI Python SDK.

## Launch the training job with a Jupyter Notebook
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-launch-training-job-notebook"></a>

You can use the following Python code to run a SageMaker training job using your recipe. It leverages the PyTorch estimator from the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/) to submit the recipe. The following example launches the llama3-8b recipe as a SageMaker AI Training Job.
+ `compiler_cache_url`: Cache to be used to save the compiled artifacts, such as an Amazon S3 artifact.

```
import os
import sagemaker
import boto3
from sagemaker.debugger import TensorBoardOutputConfig

from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket = sagemaker_session.default_bucket()
output = os.path.join(f"s3://{bucket}", "output")
output_path = "<s3-URI>"

recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "explicit_log_dir": "/opt/ml/output/tensorboard",
    },
    "data": {
        "train_dir": "/opt/ml/input/data/train",
    },
    "model": {
        "model_config": "/opt/ml/input/data/train/config.json",
    },
    "compiler_cache_url": "<compiler_cache_url>"
} 

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output, 'tensorboard'),
    container_local_output_path=recipe_overrides["exp_manager"]["explicit_log_dir"]
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name="llama-trn",
    role=role,
    instance_type="ml.trn1.32xlarge",
    sagemaker_session=sagemaker_session,
    training_recipe="training/llama/hf_llama3_70b_seq8k_trn1x16_pretrain",
    recipe_overrides=recipe_overrides,
)

estimator.fit(inputs={"train": "your-inputs"}, wait=True)
```

The preceding code creates a PyTorch estimator object with the training recipe and then fits the model using the `fit()` method. Use the `training_recipe` parameter to specify the recipe you want to use for training.

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-trainium-sagemaker-training-jobs-launch-training-job-recipes"></a>
+ Update `./recipes_collection/cluster/sm_jobs.yaml`
  + `compiler_cache_url`: The URL used to save the artifacts. It can be an Amazon S3 URL.

  ```
  sm_jobs_config:
    output_path: <s3_output_path>
    tensorboard_config:
      output_path: <s3_output_path>
      container_logs_path: /opt/ml/output/tensorboard  # Path to logs on the container
    wait: True  # Whether to wait for training job to finish
    inputs:  # Inputs to call fit with. Set either s3 or file_system, not both.
      s3:  # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
        train: <s3_train_data_path>
        val: null
    additional_estimator_kwargs:  # All other additional args to pass to estimator. Must be int, float or string.
      max_run: 180000
      image_uri: <your_image_uri>
      enable_remote_debug: True
      py_version: py39
    recipe_overrides:
      exp_manager:
        exp_dir: <exp_dir>
      data:
        train_dir: /opt/ml/input/data/train
        val_dir: /opt/ml/input/data/val
  ```
+ Update `./recipes_collection/config.yaml`

  ```
  defaults:
    - _self_
    - cluster: sm_jobs
    - recipes: training/llama/hf_llama3_8b_seq8k_trn1x4_pretrain
  cluster_type: sm_jobs # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.
  
  instance_type: ml.trn1.32xlarge
  base_results_dir: ~/sm_job/hf_llama3_8B # Location to store the results, checkpoints and logs.
  ```
+ Launch the job with `main.py`

  ```
  python3 main.py --config-path recipes_collection --config-name config
  ```

For more information about configuring SageMaker training jobs, see [SageMaker training jobs pre-training tutorial (GPU)](sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial.md).

# Default configurations
<a name="default-configurations"></a>

This section outlines the essential components and settings required to initiate and customize your Large Language Model (LLM) training processes using SageMaker HyperPod. This section covers the key repositories, configuration files, and recipe structures that form the foundation of your training jobs. Understanding these default configurations is crucial for effectively setting up and managing your LLM training workflows, whether you're using pre-defined recipes or customizing them to suit your specific needs.

**Topics**
+ [GitHub repositories](github-repositories.md)
+ [General configuration](sagemaker-hyperpod-recipes-general-configuration.md)

# GitHub repositories
<a name="github-repositories"></a>

To launch a training job, you utilize files from two distinct GitHub repositories:
+ [SageMaker HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes)
+ [SageMaker HyperPod training adapter for NeMo](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo)

These repositories contain essential components for initiating, managing, and customizing Large Language Model (LLM) training processes. You use the scripts from the repositories to set up and run the training jobs for your LLMs.

## HyperPod recipe repository
<a name="sagemaker-hyperpod-recipe-repository"></a>

Use the [SageMaker HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes) repository to get a recipe.

1. `main.py`: This file serves as the primary entry point for initiating the process of submitting a training job to either a cluster or a SageMaker training job.

1. `launcher_scripts`: This directory contains a collection of commonly used scripts designed to facilitate the training process for various Large Language Models (LLMs).

1. `recipes_collection`: This folder houses a compilation of pre-defined LLM recipes provided by the developers. Users can leverage these recipes in conjunction with their custom data to train LLM models tailored to their specific requirements.

You use the SageMaker HyperPod recipes to launch training or fine-tuning jobs. Regardless of the cluster you're using, the process of submitting the job is the same. For example, you can use the same script to submit a job to a Slurm or Kubernetes cluster. The launcher dispatches a training job based on three configuration files:

1. General Configuration (`config.yaml`): Includes common settings such as the default parameters or environment variables used in the training job.

1. Cluster Configuration (cluster): For training jobs using clusters only. If you're submitting a training job to a Kubernetes cluster, you might need to specify information such as volume, label, or restart policy. For Slurm clusters, you might need to specify the Slurm job name. All the parameters are related to the specific cluster that you're using.

1. Recipe (recipes): Recipes contain the settings for your training job, such as the model types, sharding degree, or dataset paths. For example, you can specify Llama as your training model and train it using model or data parallelism techniques like Fully Sharded Distributed Parallel (FSDP) across eight machines. You can also specify different checkpoint frequencies or paths for your training job.

After you've specified a recipe, you run the launcher script to specify an end-to-end training job on a cluster based on the configurations through the `main.py` entry point. For each recipe that you use, there are accompanying shell scripts located in the `launcher_scripts` folder. These examples guide you through submitting and initiating training jobs. The following figure illustrates how a SageMaker HyperPod recipe launcher submits a training job to a cluster based on the preceding configurations. Currently, the SageMaker HyperPod recipe launcher is built on top of the NVIDIA NeMo Framework Launcher. For more information, see [NeMo Launcher Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html).

![\[Diagram illustrating the HyperPod recipe launcher workflow. On the left, inside a dashed box, are three file icons labeled "Recipe", "config.yaml", and "slurm.yaml or k8s.yaml or sm_job.yaml (Cluster config)". An arrow points from this box to a central box labeled "HyperPod recipe Launcher". From this central box, another arrow points right to "Training Job", with "main.py" written above the arrow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/sagemaker-hyperpod-recipe-launcher.png)


## HyperPod recipe adapter repository
<a name="hyperpod-recipe-adapter"></a>

The SageMaker HyperPod training adapter is a training framework. You can use it to manage the entire lifecycle of your training jobs. Use the adapter to distribute the pre-training or fine-tuning of your models across multiple machines. The adapter uses different parallelism techniques to distribute the training. It also handles the implementation and management of saving the checkpoints. For more details, see [Advanced settings](cluster-specific-configurations-advanced-settings.md).

Use the [SageMaker HyperPod recipe adapter repository](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo) to use the recipe adapter.

1. `src`: This directory contains the implementation of Large-scale Language Model (LLM) training, encompassing various features such as model parallelism, mixed-precision training, and checkpointing management.

1. `examples`: This folder provides a collection of examples demonstrating how to create an entry point for training an LLM model, serving as a practical guide for users.

# General configuration
<a name="sagemaker-hyperpod-recipes-general-configuration"></a>

The config.yaml file specifies the training recipe and the cluster. It also includes runtime configurations such as environment variables for the training job.

```
defaults:
  - _self_
  - cluster: slurm 
  - recipes: training/llama/hf_llama3_8b_seq8192_gpu
instance_type: p5.48xlarge
git:
  repo_url_or_path: null
  branch: null
  commit: null
  entry_script: null
  token: null
env_vars:
  NCCL_DEBUG: WARN
```

You can modify the following parameters in `config.yaml`:

1. `defaults`: Specify your default settings, such as the default cluster or default recipes.

1. `instance_type`: Modify the Amazon EC2 instance type to match the instance type that you're using.

1. `git`: Specify the location of the SageMaker HyperPod recipe adapter repository for the training job.

1. `env_vars`: You can specify the environment variables to be passed into your runtime training job. For example, you can adjust the logging level of NCCL by specifying the `NCCL_DEBUG` environment variable.

The recipe is the core configuration that defines your training job architecture. This file includes many important pieces of information for your training job, such as the following:
+ Whether to use model parallelism
+ The source of your datasets
+ Mixed precision training
+ Checkpointing-related configurations

You can use the recipes as-is. You can also use the following information to modify them.

## run
<a name="run"></a>

The following is the basic run information for running your training job.

```
run:
  name: llama-8b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "6-00:00:00"
  model_type: hf
```

1. `name`: Specify the name for your training job in the configuration file.

1. `results_dir`: You can specify the directory where the results of your training job are stored.

1. `time_limit`: You can set a maximum training time for your training job to prevent it from occupying hardware resources for too long.

1. `model_type`: You can specify the type of model you are using. For example, you can specify `hf` if your model is from HuggingFace.
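The `time_limit` value in the example above uses the Slurm-style `days-hours:minutes:seconds` format. A minimal sketch with a hypothetical helper (not part of the recipes) that converts such a value into seconds:

```
# hypothetical helper, not part of the recipes: converts a Slurm-style
# "days-hours:minutes:seconds" time_limit into total seconds
def parse_time_limit(limit: str) -> int:
    days, _, clock = limit.partition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days) * 86400 + hours * 3600 + minutes * 60 + seconds

print(parse_time_limit("6-00:00:00"))  # 518400 seconds, i.e. 6 days
```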

## exp_manager
<a name="exp-manager"></a>

The `exp_manager` section configures the experiment. With `exp_manager`, you can specify fields such as the output directory or checkpoint settings. The following is an example of how you can configure `exp_manager`.

```
exp_manager:
  exp_dir: null
  name: experiment
  create_tensorboard_logger: True
```

1. `exp_dir`: The experiment directory includes the standard output and standard error files for your training job. By default, it uses your current directory.

1. `name`: The experiment name used to identify your experiment under the `exp_dir`.

1. `create_tensorboard_logger`: Specify `True` or `False` to enable or disable the TensorBoard logger.

## Checkpointing
<a name="checkpointing"></a>

Here are three types of checkpointing we support:
+ Auto checkpointing
+ Manual checkpointing
+ Full checkpointing

### Auto checkpointing
<a name="auto-checkpointing"></a>

If you're saving or loading checkpoints that are automatically managed by the SageMaker HyperPod recipe adapter, you can enable `auto_checkpoint`. To enable `auto_checkpoint`, set `enabled` to `True`. You can use auto checkpointing for both training and fine-tuning. You can use auto checkpointing with both shared file systems and Amazon S3.

```
exp_manager:
  checkpoint_dir: ${recipes.exp_manager.exp_dir}/checkpoints/
  auto_checkpoint:
    enabled: True
```

Auto checkpointing saves the `local_state_dict` asynchronously with an automatically computed optimal saving interval.

**Note**  
Under this checkpointing mode, the auto saved checkpointing doesn't support re-sharding between training runs. To resume from the latest auto saved checkpoint, you must preserve the same shard degrees. You don't need to specify extra information to auto resume.

### Manual checkpointing
<a name="manual-checkpointing"></a>

You can modify `checkpoint_callback_params` to asynchronously save an intermediate checkpoint in `sharded_state_dict`. For example, you can specify the following configuration to enable sharded checkpointing every 10 steps and keep the latest 3 checkpoints.

Sharded checkpointing allows you to change the shard degrees between training runs and load the checkpoint by setting `resume_from_checkpoint`.

**Note**  
If you're running a PEFT fine-tuning job, sharded checkpointing doesn't support Amazon S3.
Auto and manual checkpointing are mutually exclusive.
Only changes to FSDP shard degrees and replication degrees are allowed.

```
exp_manager:
  checkpoint_callback_params:
    # Set save_top_k = 0 to disable sharded checkpointing
    save_top_k: 3
    every_n_train_steps: 10
    monitor: "step"
    mode: "max"
    save_last: False
  resume_from_checkpoint: ${recipes.exp_manager.exp_dir}/checkpoints/
```
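The retention policy that `save_top_k` configures can be sketched as follows; this is illustrative only, not the adapter's implementation:

```
# illustrative only: keep the newest save_top_k checkpoints, ranked by the
# monitored "step" value in "max" mode, and prune the rest
def prune_checkpoints(steps, save_top_k):
    return sorted(steps)[-save_top_k:]

print(prune_checkpoints([10, 20, 30, 40, 50], save_top_k=3))  # [30, 40, 50]
```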

To learn more about checkpointing, see [Checkpointing using SMP](model-parallel-core-features-v2-checkpoints.md).

### Full checkpointing
<a name="full-checkpointing"></a>

The exported `full_state_dict` checkpoint can be used for inference or fine-tuning. You can load a full checkpoint through `hf_model_name_or_path`. Under this mode, only the model weights are saved.

To export the `full_state_dict` model, you can set the following parameters.

**Note**  
Currently, full checkpointing isn't supported for Amazon S3 checkpointing. You can't set the S3 path for `exp_manager.checkpoint_dir` if you're enabling full checkpointing. However, you can set `exp_manager.export_full_model.final_export_dir` to a specific directory on your local filesystem while setting `exp_manager.checkpoint_dir` to an Amazon S3 path.

```
exp_manager:
  export_full_model:
    # Set every_n_train_steps = 0 to disable full checkpointing
    every_n_train_steps: 0
    save_last: True
    final_export_dir : null
```

## model
<a name="model"></a>

Define various aspects of your model architecture and training process. This includes settings for model parallelism, precision, and data handling. Below are the key components you can configure within the model section:

### model parallelism
<a name="model-parallelism"></a>

After you've specified the recipe, you define the model that you're training. You can also define the model parallelism. For example, you can define `tensor_model_parallel_degree`. You can enable other features like training with FP8 precision. For example, you can train a model with tensor parallelism and context parallelism:

```
model:
  model_type: llama_v3
  # Base configs
  train_batch_size: 4
  val_batch_size: 1
  seed: 12345
  grad_clip: 1.0

  # Model parallelism
  tensor_model_parallel_degree: 4
  expert_model_parallel_degree: 1
  context_parallel_degree: 2
```

To gain a better understanding of different types of model parallelism techniques, you can refer to the following approaches:

1. [Tensor parallelism](model-parallel-core-features-v2-tensor-parallelism.md)

1. [Expert parallelism](model-parallel-core-features-v2-expert-parallelism.md)

1. [Context parallelism](model-parallel-core-features-v2-context-parallelism.md)

1. [Hybrid sharded data parallelism](model-parallel-core-features-v2-sharded-data-parallelism.md)
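As a back-of-the-envelope check, the parallelism degrees in the example above multiply together to give the number of devices each data-parallel replica spans:

```
# degrees taken from the recipe example above; the replica-size arithmetic
# here is a simplified illustration of how the degrees compose
tensor_model_parallel_degree = 4
context_parallel_degree = 2
devices_per_replica = tensor_model_parallel_degree * context_parallel_degree
print(devices_per_replica)  # 8 accelerators per data-parallel replica
```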

### FP8
<a name="fp8"></a>

To enable FP8 (8-bit floating-point precision), you can specify the FP8-related configuration in the following example:

```
model:
  # FP8 config
  fp8: True
  fp8_amax_history_len: 1024
  fp8_amax_compute_algo: max
```

It's important to note that the FP8 data format is currently supported only on the P5 instance type. If you are using an older instance type, such as P4, disable the FP8 feature for your model training process. For more information about FP8, see [Mixed precision training](model-parallel-core-features-v2-mixed-precision.md).
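For intuition only, the following sketches how delayed FP8 scaling can derive a scaling factor from the amax history that `fp8_amax_history_len` and `fp8_amax_compute_algo: max` configure. The constant and formula are simplified assumptions, not the exact library behavior:

```
# for intuition only: delayed FP8 scaling derives a scale from the recent
# amax (absolute-max) history; 448.0 approximates the E4M3 representable max,
# and this formula is a simplified assumption, not the library's exact one
from collections import deque

def fp8_scale(amax_history, fp8_max=448.0):
    # fp8_amax_compute_algo: max -> use the max over the history window
    return fp8_max / max(amax_history)

hist = deque(maxlen=1024)  # mirrors fp8_amax_history_len above
hist.extend([1.0, 2.0, 4.0])
print(fp8_scale(hist))  # 112.0
```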

### data
<a name="data"></a>

You can specify your custom datasets for your training job by adding the data paths under `data`. The data module supports the following data formats:

1. JSON

1. JSONGZ (Compressed JSON)

1. ARROW

However, you are responsible for preparing your own pre-tokenized dataset. If you're an advanced user with specific requirements, there is also an option to implement and integrate a customized data module. For more information on HuggingFace datasets, see [Datasets](https://huggingface.co/docs/datasets/v3.1.0/en/index).

```
model:
  data:
    train_dir: /path/to/your/train/data
    val_dir: /path/to/your/val/data
    dataset_type: hf
    use_synthetic_data: False
```
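A pre-tokenized sample in the JSON format might look like the following. The `input_ids` field name and the token values are hypothetical, since the exact schema depends on the data module and tokenizer you use:

```
import json

# hypothetical pre-tokenized sample; the field name and token values here
# are illustrative, not a schema the data module guarantees
sample = {"input_ids": [1, 15043, 3186, 2]}
line = json.dumps(sample)  # one JSON record per line in the training file
print(line)
```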

You can specify how you're training the model. By default, the recipe uses pre-training instead of fine-tuning. The following example configures the recipe to run a fine-tuning job with LoRA (Low-Rank Adaptation).

```
model:
  # Fine tuning config
  do_finetune: True
  # The path to resume from, needs to be HF compatible
  hf_model_name_or_path: null
  hf_access_token: null
  # PEFT config
  peft:
    peft_type: lora
    rank: 32
    alpha: 16
    dropout: 0.1
```
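For a rough sense of scale, LoRA with `rank: 32` adds only a small number of trainable parameters per adapted weight matrix. The matrix dimensions below are illustrative assumptions, not values from any specific recipe:

```
# rough estimate with illustrative dimensions: LoRA adds two low-rank
# matrices (d_in x rank and rank x d_out) per adapted weight matrix
d_in, d_out, rank = 4096, 4096, 32  # rank matches the peft config above
lora_params = rank * (d_in + d_out)
full_params = d_in * d_out
print(lora_params)                # 262144 trainable parameters
print(lora_params / full_params)  # 0.015625, under 2% of the full matrix
```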

For information about the recipes, see [SageMaker HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes).

# Cluster-specific configurations
<a name="cluster-specific-configurations"></a>

SageMaker HyperPod offers flexibility in running training jobs across different cluster environments. Each environment has its own configuration requirements and setup process. This section outlines the steps and configurations needed for running training jobs in SageMaker HyperPod Slurm, SageMaker HyperPod k8s, and SageMaker training jobs. Understanding these configurations is crucial for effectively leveraging the power of distributed training in your chosen environment.

You can use a recipe in the following cluster environments:
+ SageMaker HyperPod Slurm Orchestration
+ SageMaker HyperPod Amazon Elastic Kubernetes Service Orchestration
+ SageMaker training jobs

To launch a training job in a cluster, set and install the corresponding cluster configuration and environment.

**Topics**
+ [Running a training job on HyperPod Slurm](cluster-specific-configurations-run-training-job-hyperpod-slurm.md)
+ [Running a training job on HyperPod k8s](cluster-specific-configurations-run-training-job-hyperpod-k8s.md)
+ [Running a SageMaker training job](cluster-specific-configurations-run-sagemaker-training-job.md)

# Running a training job on HyperPod Slurm
<a name="cluster-specific-configurations-run-training-job-hyperpod-slurm"></a>

SageMaker HyperPod Recipes supports submitting a training job to a GPU or Trainium Slurm cluster. Before you submit the training job, update the cluster configuration. Use one of the following methods to update the cluster configuration:
+ Modify `slurm.yaml`
+ Override it through the command line

After you've updated the cluster configuration, install the environment.

## Configure the cluster
<a name="cluster-specific-configurations-configure-cluster-slurm-yaml"></a>

To submit a training job to a Slurm cluster, specify the Slurm-specific configuration. Modify `slurm.yaml` to configure the Slurm cluster. The following is an example of a Slurm cluster configuration. You can modify this file for your own training needs:

```
job_name_prefix: 'sagemaker-'
slurm_create_submission_file_only: False 
stderr_to_stdout: True
srun_args:
  # - "--no-container-mount-home"
slurm_docker_cfg:
  docker_args:
    # - "--runtime=nvidia" 
  post_launch_commands: 
container_mounts: 
  - "/fsx:/fsx"
```

1. `job_name_prefix`: Specify a job name prefix to easily identify your submissions to the Slurm cluster.

1. `slurm_create_submission_file_only`: Set this configuration to True for a dry run to help you debug.

1. `stderr_to_stdout`: Specify whether you're redirecting your standard error (stderr) to standard output (stdout).

1. `srun_args`: Customize additional srun configurations, such as excluding specific compute nodes. For more information, see the srun documentation.

1. `slurm_docker_cfg`: The SageMaker HyperPod recipe launcher launches a Docker container to run your training job. You can specify additional Docker arguments within this parameter.

1. `container_mounts`: Specify the volumes you're mounting into the container for the recipe launcher, for your training jobs to access the files in those volumes.

# Running a training job on HyperPod k8s
<a name="cluster-specific-configurations-run-training-job-hyperpod-k8s"></a>

SageMaker HyperPod Recipes supports submitting a training job to a GPU or Trainium Kubernetes cluster. Before you submit the training job, do one of the following:
+ Modify the `k8s.yaml` cluster configuration file
+ Override the cluster configuration through the command line

After you've done either of the preceding steps, install the corresponding environment.

## Configure the cluster using `k8s.yaml`
<a name="cluster-specific-configurations-configure-cluster-k8s-yaml"></a>

To submit a training job to a Kubernetes cluster, you specify Kubernetes-specific configurations. The configurations include the cluster namespace or the location of the persistent volume.

```
pullPolicy: Always
restartPolicy: Never
namespace: default
persistent_volume_claims:
  - null
```

1. `pullPolicy`: You can specify the pull policy when you submit a training job. If you specify "Always," the Kubernetes cluster always pulls your image from the repository. For more information, see [Image pull policy](https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy).

1. `restartPolicy`: Specify whether to restart your training job if it fails.

1. `namespace`: You can specify the Kubernetes namespace where you're submitting the training job.

1. `persistent_volume_claims`: You can specify a shared volume for your training job for all training processes to access the files in the volume.

# Running a SageMaker training job
<a name="cluster-specific-configurations-run-sagemaker-training-job"></a>

SageMaker HyperPod Recipes supports submitting a SageMaker training job. Before you submit the training job, you must update the cluster configuration, `sm_job.yaml`, and install the corresponding environment.

## Use your recipe as a SageMaker training job
<a name="cluster-specific-configurations-cluster-config-sm-job-yaml"></a>

You can use your recipe as a SageMaker training job if you aren't hosting a cluster. You must modify the SageMaker training job configuration file, `sm_job.yaml`, to run your recipe.

```
sm_jobs_config:
  output_path: null 
  tensorboard_config:
    output_path: null 
    container_logs_path: null
  wait: True 
  inputs: 
    s3: 
      train: null
      val: null
    file_system:  
      directory_path: null
  additional_estimator_kwargs: 
    max_run: 1800
```

1. `output_path`: Specify the Amazon S3 URL where you're saving your model.

1. `tensorboard_config`: You can specify a TensorBoard related configuration such as the output path or TensorBoard logs path.

1. `wait`: You can specify whether you're waiting for the job to be completed when you submit your training job.

1. `inputs`: You can specify the paths for your training and validation data. The data source can be from a shared filesystem such as Amazon FSx or an Amazon S3 URL.

1. `additional_estimator_kwargs`: Additional estimator arguments for submitting a training job to the SageMaker training job platform. For more information, see [Algorithm Estimator](https://sagemaker.readthedocs.io/en/stable/api/training/algorithm.html).
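Conceptually, the launcher maps these YAML fields onto the estimator's keyword arguments. The following sketch mirrors the YAML keys above and is not a real launcher API:

```
# illustrative sketch only: shows how sm_jobs_config fields could map onto
# estimator keyword arguments; key names mirror the YAML above, and this
# mapping is an assumption, not the launcher's actual implementation
sm_jobs_config = {
    "output_path": "s3://my-bucket/output",
    "wait": True,
    "additional_estimator_kwargs": {"max_run": 1800},
}
estimator_kwargs = {
    "output_path": sm_jobs_config["output_path"],
    **sm_jobs_config["additional_estimator_kwargs"],  # passed through as-is
}
print(estimator_kwargs)  # {'output_path': 's3://my-bucket/output', 'max_run': 1800}
```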

# Considerations
<a name="cluster-specific-configurations-special-considerations"></a>

When you're using Amazon SageMaker HyperPod recipes, there are some factors that can impact the process of model training.
+ The `transformers` version must be `4.45.2` or greater for Llama 3.2. If you're using a Slurm or K8s workflow, the version is automatically updated.
+ Mixtral doesn't support 8-bit floating point precision (FP8).
+ Amazon EC2 P4 instances don't support FP8.

# Advanced settings
<a name="cluster-specific-configurations-advanced-settings"></a>

The SageMaker HyperPod recipe adapter is built on top of the NVIDIA NeMo and PyTorch Lightning frameworks. If you've already used these frameworks, integrating your custom models or features into the SageMaker HyperPod recipe adapter is a similar process. In addition to modifying the recipe adapter, you can change your own pre-training or fine-tuning script. For guidance on writing your custom training script, see [examples](https://github.com/aws/sagemaker-hyperpod-training-adapter-for-nemo/tree/main/examples).

## Use the SageMaker HyperPod adapter to create your own model
<a name="cluster-specific-configurations-use-hyperpod-adapter-create-model"></a>

Within the recipe adapter, you can customize the following files in the following locations:

1. `collections/data`: Contains a module responsible for loading datasets. Currently, it only supports datasets from HuggingFace. If you have more advanced requirements, the code structure allows you to add custom data modules within the same folder.

1. `collections/model`: Includes the definitions of various language models. Currently, it supports common large language models like Llama, Mixtral, and Mistral. You have the flexibility to introduce your own model definitions within this folder.

1. `collections/parts`: This folder contains strategies for training models in a distributed manner. One example is the Fully Sharded Data Parallel (FSDP) strategy, which allows for sharding a large language model across multiple accelerators. Additionally, the strategies support various forms of model parallelism. You also have the option to introduce your own customized training strategies for model training.

1. `utils`: Contains various utilities that facilitate the management of a training job. It serves as a repository for your own tools, which you can use for tasks such as troubleshooting or benchmarking. You can also add your own personalized PyTorch Lightning callbacks within this folder to seamlessly integrate specific functionalities or operations into the training lifecycle.

1. `conf`: Contains the configuration schema definitions used for validating specific parameters in a training job. If you introduce new parameters or configurations, you can add your own customized schema to this folder to define validation rules, such as constraints on data types, value ranges, or any other parameter properties.

# Appendix
<a name="appendix"></a>

Use the following information to monitor and analyze training results.

## Monitor training results
<a name="monitor-training-results"></a>

Monitoring and analyzing training results is essential for developers to assess convergence and troubleshoot issues. SageMaker HyperPod recipes offer TensorBoard integration to analyze training behavior. To address the challenges of profiling large distributed training jobs, these recipes also incorporate VizTracer, a low-overhead tool for tracing and visualizing Python code execution. For more information about VizTracer, see [VizTracer](https://viztracer.readthedocs.io/en/latest/installation.html).

The following sections guide you through the process of implementing these features in your SageMaker HyperPod recipes.

### TensorBoard
<a name="tensorboard"></a>

TensorBoard is a powerful tool for visualizing and analyzing the training process. To enable TensorBoard, modify your recipe by setting the following parameter:

```
exp_manager:
  exp_dir: null
  name: experiment
  create_tensorboard_logger: True
```

After you enable the TensorBoard logger, the training logs are generated and stored within the experiment directory, which is defined in `exp_manager.exp_dir`. To access and analyze these logs locally, use the following procedure:

**To access and analyze logs**

1. Download the TensorBoard experiment folder from your training environment to your local machine.

1. Open a terminal or command prompt on your local machine.

1. Navigate to the directory containing the downloaded experiment folder.

1. Launch TensorBoard with the following command.

   ```
   tensorboard --port=<port> --bind_all --logdir experiment
   ```

1. Open your web browser and visit `http://localhost:<port>`, using the port you specified in the previous step.

You can now see the status and visualizations of your training jobs within the TensorBoard interface. Monitoring and analyzing the training process in this way helps you gain insights into the behavior and performance of your models. For more information about monitoring and analyzing training with TensorBoard, see the [NVIDIA NeMo Framework User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/llms/index.html).
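For example, if the experiment folder was synced to an Amazon S3 bucket, the download-and-launch steps might look like the following sketch. The bucket name, prefix, and port are hypothetical placeholders:

```
# Download the TensorBoard experiment folder (hypothetical bucket and prefix)
aws s3 sync s3://my-bucket/experiment ./experiment

# Launch TensorBoard, then visit http://localhost:8008 in your browser
tensorboard --port=8008 --bind_all --logdir ./experiment
```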

### VizTracer
<a name="viztracer"></a>

To enable VizTracer, modify your recipe by setting the `model.viztracer.enabled` parameter to `true`. For example, you can update your Llama recipe to enable VizTracer by adding the following configuration:

```
model:
  viztracer:
    enabled: true
```

After the training has completed, your VizTracer profile is in the experiment folder at `exp_dir/result.json`. To analyze the profile, download it and open it using the vizviewer tool:

```
vizviewer --port <port> result.json
```

This command launches vizviewer on the port that you specify. To view your VizTracer profile, specify `http://localhost:<port>` in your browser. After you open the profile, you can begin analyzing the training. For more information about using VizTracer, see the [VizTracer documentation](https://viztracer.readthedocs.io/en/latest/installation.html).

## SageMaker JumpStart versus SageMaker HyperPod
<a name="sagemaker-jumpstart-vs-hyperpod"></a>

While SageMaker JumpStart provides fine-tuning capabilities, the SageMaker HyperPod recipes provide the following:
+ Additional fine-grained control over the training loop
+ Recipe customization for your own models and data
+ Support for model parallelism

Use the SageMaker HyperPod recipes when you need access to the model's hyperparameters, multi-node training, and customization options for the training loop.

For more information about fine-tuning your models in SageMaker JumpStart, see [Fine-tune publicly available foundation models with the `JumpStartEstimator` class](jumpstart-foundation-models-use-python-sdk-estimator-class.md).

# Orchestrating SageMaker HyperPod clusters with Slurm
<a name="sagemaker-hyperpod-slurm"></a>

Slurm support in SageMaker HyperPod helps you provision resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). It accelerates development of FMs by removing the undifferentiated heavy lifting involved in building and maintaining large-scale compute clusters powered by thousands of accelerators such as AWS Trainium and NVIDIA A100 and H100 graphics processing units (GPUs). When accelerators fail, the resiliency features of SageMaker HyperPod monitor the cluster instances and automatically detect and replace the faulty hardware on the fly so that you can focus on running ML workloads. Additionally, with lifecycle configuration support in SageMaker HyperPod, you can customize your computing environment to best suit your needs and configure it with the Amazon SageMaker AI distributed training libraries to achieve optimal performance on AWS.

**Operating clusters**

You can create, configure, and maintain SageMaker HyperPod clusters graphically through the console user interface (UI) and programmatically through the AWS Command Line Interface (AWS CLI) or the AWS SDK for Python (Boto3). With Amazon VPC, you can secure the cluster network and also take advantage of configuring your cluster with resources in your VPC, such as Amazon FSx for Lustre, which offers the fastest throughput. You can also attach different IAM roles to cluster instance groups and limit the actions that your cluster resources and users can perform. To learn more, see [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md).

**Configuring your ML environment**

SageMaker HyperPod runs [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami), which sets up an ML environment on the HyperPod clusters. You can configure additional customizations to the DLAMI by providing lifecycle scripts to support your use case. To learn more about how to set up lifecycle scripts, see [Getting started with SageMaker HyperPod](smcluster-getting-started-slurm.md) and [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

**Scheduling jobs**

After you successfully create a HyperPod cluster, cluster users can log in to the cluster nodes (such as the head or controller node, login node, and worker nodes) and schedule jobs for running machine learning workloads. To learn more, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).

**Resiliency against hardware failures**

SageMaker HyperPod runs health checks on cluster nodes and provides a workload auto-resume functionality. With the cluster resiliency features of HyperPod, you can resume your workload from the last checkpoint you saved, after faulty nodes are replaced with healthy ones in clusters with more than 16 nodes. To learn more, see [SageMaker HyperPod cluster resiliency](sagemaker-hyperpod-resiliency-slurm.md).

**Logging and managing clusters**

You can find SageMaker HyperPod resource utilization metrics and lifecycle logs in Amazon CloudWatch, and manage SageMaker HyperPod resources by tagging them. Each `CreateCluster` API run creates a distinct log stream, named in `<cluster-name>-<timestamp>` format. In the log stream, you can check the host names, the name of failed lifecycle scripts, and outputs from the failed scripts such as `stdout` and `stderr`. For more information, see [SageMaker HyperPod cluster management](sagemaker-hyperpod-cluster-management-slurm.md).

**Compatible with SageMaker AI tools**

Using SageMaker HyperPod, you can configure clusters with AWS optimized collective communications libraries offered by SageMaker AI, such as the [SageMaker AI distributed data parallelism (SMDDP) library](data-parallel.md). The SMDDP library implements the `AllGather` operation optimized to the AWS compute and network infrastructure for the most performant SageMaker AI machine learning instances powered by NVIDIA A100 GPUs. To learn more, see [Running distributed training workloads with Slurm on HyperPod](sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload.md).

**Instance placement with UltraServers**

SageMaker AI automatically allocates jobs to instances within your UltraServer based on a best-effort strategy of using all of the instances in one UltraServer before using another one. For example, if you request 14 instances and have 2 UltraServers in your training plan, SageMaker AI places all 14 instances in the first UltraServer. If you request 20 instances and have 2 UltraServers in your training plan, SageMaker AI uses all 17 instances in the first UltraServer and then uses 3 from the second UltraServer.
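The best-effort packing described above amounts to a min/remainder calculation. The following sketch illustrates it; the per-UltraServer capacity of 17 is taken from the example in this section and is an assumption for illustration, not a general guarantee:

```shell
# Illustrative sketch of best-effort UltraServer packing.
# allocate REQUESTED CAPACITY prints how many instances land on the
# first and second UltraServer, assuming CAPACITY usable instances each.
allocate() {
  local requested=$1 capacity=$2
  local first=$(( requested < capacity ? requested : capacity ))
  echo "first UltraServer: $first, second UltraServer: $(( requested - first ))"
}

allocate 14 17   # → first UltraServer: 14, second UltraServer: 0
allocate 20 17   # → first UltraServer: 17, second UltraServer: 3
```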

**Topics**
+ [Getting started with SageMaker HyperPod](smcluster-getting-started-slurm.md)
+ [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md)
+ [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md)
+ [SageMaker HyperPod multi-head node support](sagemaker-hyperpod-multihead-slurm.md)
+ [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md)
+ [SageMaker HyperPod cluster resources monitoring](sagemaker-hyperpod-cluster-observability-slurm.md)
+ [SageMaker HyperPod cluster resiliency](sagemaker-hyperpod-resiliency-slurm.md)
+ [Continuous provisioning for enhanced cluster operations with Slurm](sagemaker-hyperpod-scaling-slurm.md)
+ [SageMaker HyperPod cluster management](sagemaker-hyperpod-cluster-management-slurm.md)
+ [SageMaker HyperPod FAQs](sagemaker-hyperpod-faq-slurm.md)

# Getting started with SageMaker HyperPod
<a name="smcluster-getting-started-slurm"></a>

Get started with creating your first SageMaker HyperPod cluster and learn the cluster operation functionalities of SageMaker HyperPod. You can create a SageMaker HyperPod cluster through the SageMaker AI console UI or the AWS CLI. This tutorial shows how to create a new SageMaker HyperPod cluster with Slurm, a popular workload scheduler. After you go through this tutorial, you will know how to log in to the cluster nodes using AWS Systems Manager commands (`aws ssm`). After you complete this tutorial, see also [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md) to learn more about the basic SageMaker HyperPod operations, and [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md) to learn how to schedule jobs on the provisioned cluster.
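As a preview of the login step, connecting to a cluster node with AWS Systems Manager generally takes the following shape. The target string format is shown for illustration, and the placeholder values are ones you obtain from your own cluster (for example, with `aws sagemaker describe-cluster` and `aws sagemaker list-cluster-nodes`):

```
# Placeholders: substitute the values from your own cluster
aws ssm start-session \
    --target sagemaker-cluster:<cluster-id>_<instance-group-name>-<instance-id>
```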

**Tip**  
To find practical examples and solutions, see also the [SageMaker HyperPod workshop](https://catalog.workshops.aws/sagemaker-hyperpod).

**Topics**
+ [Getting started with SageMaker HyperPod using the SageMaker AI console](smcluster-getting-started-slurm-console.md)
+ [Creating SageMaker HyperPod clusters using CloudFormation templates](smcluster-getting-started-slurm-console-create-cluster-cfn.md)
+ [Getting started with SageMaker HyperPod using the AWS CLI](smcluster-getting-started-slurm-cli.md)

# Getting started with SageMaker HyperPod using the SageMaker AI console
<a name="smcluster-getting-started-slurm-console"></a>

The following tutorial demonstrates how to create a new SageMaker HyperPod cluster and set it up with Slurm through the SageMaker AI console UI. Following the tutorial, you'll create a HyperPod cluster with three Slurm nodes: `my-controller-group`, `my-login-group`, and `worker-group-1`.

**Topics**
+ [Create cluster](#smcluster-getting-started-slurm-console-create-cluster-page)
+ [Deploy resources](#smcluster-getting-started-slurm-console-create-cluster-deploy)
+ [Delete the cluster and clean resources](#smcluster-getting-started-slurm-console-delete-cluster-and-clean)

## Create cluster
<a name="smcluster-getting-started-slurm-console-create-cluster-page"></a>

To navigate to the **SageMaker HyperPod Clusters** page and choose **Slurm** orchestration, follow these steps.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **HyperPod Clusters** in the left navigation pane and then **Cluster Management**.

1. On the **SageMaker HyperPod Clusters** page, choose **Create HyperPod cluster**. 

1. On the **Create HyperPod cluster** drop-down, choose **Orchestrated by Slurm**.

1. On the Slurm cluster creation page, you will see two options. Choose the option that best fits your needs.

   1. **Quick setup** - To get started immediately with default settings, choose **Quick setup**. With this option, SageMaker AI will create new resources such as VPC, subnets, security groups, Amazon S3 bucket, IAM role, and FSx for Lustre in the process of creating your cluster.

   1. **Custom setup** - To integrate with existing AWS resources or have specific networking, security, or storage requirements, choose **Custom setup**. With this option, you can choose to use the existing resources or create new ones, and you can customize the configuration that best fits your needs.

## Quick setup
<a name="smcluster-getting-started-slurm-console-create-cluster-default"></a>

On the **Quick setup** section, follow these steps to create your HyperPod cluster with Slurm orchestration.

### General settings
<a name="smcluster-getting-started-slurm-console-create-cluster-default-general"></a>

Specify a name for the new cluster. You can’t change the name after the cluster is created.

### Instance groups
<a name="smcluster-getting-started-slurm-console-create-cluster-default-instance-groups"></a>

To add an instance group, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group for Controller and Compute group types.

**Important**  
You can add one instance group at a time. To create multiple instance groups, repeat the process for each instance group.

Follow these steps to add an instance group.

1. For **Instance group type**, choose a type for your instance group. For this tutorial, choose **Controller (head)** for `my-controller-group`, **Login** for `my-login-group`, and **Compute (worker)** for `worker-group-1`.

1. For **Name**, specify a name for the instance group. For this tutorial, create three instance groups named `my-controller-group`, `my-login-group`, and `worker-group-1`.

1.  For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

1. For **Instance type**, choose the instance for the instance group. For this tutorial, select `ml.c5.xlarge` for `my-controller-group`, `ml.m5.4xlarge` for `my-login-group`, and `ml.trn1.32xlarge` for `worker-group-1`. 
**Important**  
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this tutorial, enter **1** for all three groups.

1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

1. Choose **Add instance group**.

### Quick setup defaults
<a name="smcluster-getting-started-slurm-console-create-cluster-default-settings"></a>

This section lists all the default settings for your cluster creation, including all the new AWS resources that will be created during the cluster creation process. Review the default settings.

## Custom setup
<a name="smcluster-getting-started-slurm-console-create-cluster-custom"></a>

On the **Custom setup** section, follow these steps to create your HyperPod cluster with Slurm orchestration.

### General settings
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-general"></a>

Specify a name for the new cluster. You can’t change the name after the cluster is created.

For **Instance recovery**, choose **Automatic - *recommended*** or **None**.

### Networking
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-network"></a>

Configure your network settings for the cluster creation. These settings can't be changed after the cluster is created.

1. For **VPC**, choose your own VPC if you already have one that gives SageMaker AI access to your VPC. To create a new VPC, follow the instructions at [Create a VPC](https://docs.aws.amazon.com/vpc/latest/userguide/create-vpc.html) in the *Amazon Virtual Private Cloud User Guide*. You can leave it as **None** to use the default SageMaker AI VPC.

1. For **VPC IPv4 CIDR block**, enter the starting IP of your VPC.

1. For **Availability Zones**, choose the Availability Zones (AZ) where HyperPod will create subnets for your cluster. Choose AZs that match the location of your accelerated compute capacity.

1. For **Security groups**, create a security group or choose up to five security groups configured with rules to allow inter-resource communication within the VPC.

### Instance groups
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-instance-groups"></a>

To add an instance group, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group.

**Important**  
You can add one instance group at a time. To create multiple instance groups, repeat the process for each instance group.

Follow these steps to add an instance group.

1. For **Instance group type**, choose a type for your instance group. For this tutorial, choose **Controller (head)** for `my-controller-group`, **Login** for `my-login-group`, and **Compute (worker)** for `worker-group-1`.

1. For **Name**, specify a name for the instance group. For this tutorial, create three instance groups named `my-controller-group`, `my-login-group`, and `worker-group-1`.

1.  For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

1. For **Instance type**, choose the instance for the instance group. For this tutorial, select `ml.c5.xlarge` for `my-controller-group`, `ml.m5.4xlarge` for `my-login-group`, and `ml.trn1.32xlarge` for `worker-group-1`. 
**Important**  
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this tutorial, enter **1** for all three groups.

1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

1. Choose **Add instance group**.

### Lifecycle scripts
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-lifecycle"></a>

You can choose to use the default lifecycle scripts or the custom lifecycle scripts, which will be stored in your Amazon S3 bucket. You can view the default lifecycle scripts in the [Awesome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts). To learn more about the lifecycle scripts, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

1. For **Lifecycle scripts**, choose to use default or custom lifecycle scripts.

1. For **S3 bucket for lifecycle scripts**, choose to create a new bucket or use an existing bucket to store the lifecycle scripts.

### Permissions
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-permissions"></a>

Choose or create an IAM role that allows HyperPod to run and access necessary AWS resources on your behalf.

### Storage
<a name="smcluster-getting-started-slurm-console-create-cluster-custom-storage"></a>

Configure the FSx for Lustre file system to be provisioned on the HyperPod cluster.

1. For **File system**, choose to use an existing FSx for Lustre file system, create a new FSx for Lustre file system, or not provision an FSx for Lustre file system.

1. For **Throughput per unit of storage**, choose the throughput that will be available per TiB of provisioned storage.

1. For **Storage capacity**, enter a capacity value in TB.

1. For **Data compression type**, choose **LZ4** to enable data compression.

1. For **Lustre version**, view the value that's recommended for the new file systems.

### Tags - optional
<a name="smcluster-getting-started-slurm-console-create-cluster-tags"></a>

For **Tags - *optional***, add key and value pairs to the new cluster and manage the cluster as an AWS resource. To learn more, see [Tagging your AWS resources](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).

## Deploy resources
<a name="smcluster-getting-started-slurm-console-create-cluster-deploy"></a>

After you complete the cluster configurations using either **Quick setup** or **Custom setup**, choose one of the following options to start resource provisioning and cluster creation.
+  **Submit** - SageMaker AI will start provisioning the default configuration resources and creating the cluster. 
+ **Download CloudFormation template parameters** - You download the configuration parameter JSON file and run an AWS CLI command to deploy the CloudFormation stack that provisions the configured resources and creates the cluster. You can edit the downloaded parameter JSON file if needed. If you choose this option, see the instructions in [Creating SageMaker HyperPod clusters using CloudFormation templates](smcluster-getting-started-slurm-console-create-cluster-cfn.md).

## Delete the cluster and clean resources
<a name="smcluster-getting-started-slurm-console-delete-cluster-and-clean"></a>

After you have successfully tested creating a SageMaker HyperPod cluster, it continues running in the `InService` state until you delete it. We recommend that you delete any clusters created using on-demand SageMaker AI instances when not in use to avoid incurring continued service charges based on on-demand pricing. In this tutorial, you have created a cluster that consists of three instance groups, one of which uses a C5 instance, so make sure you delete the cluster by following the instructions at [Delete a SageMaker HyperPod cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-delete-cluster).

However, if you have created a cluster with reserved compute capacity, the status of the clusters does not affect service billing.

To clean up the lifecycle scripts from the S3 bucket used for this tutorial, go to the S3 bucket you used during cluster creation and remove the files entirely.
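For example, the lifecycle scripts can be removed with a single recursive delete. The bucket name and prefix below reuse the placeholders from this guide; substitute the values you used during cluster creation:

```
# Remove the uploaded lifecycle scripts (substitute your own bucket and prefix)
aws s3 rm s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/ --recursive
```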

If you have tested running any workloads on the cluster, check whether you uploaded any data or whether your jobs saved any artifacts to other S3 buckets or file system services such as Amazon FSx for Lustre and Amazon Elastic File System. To prevent incurring charges, delete all artifacts and data from the storage or file systems.

# Creating SageMaker HyperPod clusters using CloudFormation templates
<a name="smcluster-getting-started-slurm-console-create-cluster-cfn"></a>

You can create SageMaker HyperPod clusters using the CloudFormation templates for HyperPod. You must install the AWS CLI to proceed.

**Topics**
+ [Configure resources in the console and deploy using CloudFormation](#smcluster-getting-started-slurm-console-create-cluster-deploy-console)
+ [Configure resources and deploy using CloudFormation](#smcluster-getting-started-slurm-console-create-cluster-deploy-cfn)

## Configure resources in the console and deploy using CloudFormation
<a name="smcluster-getting-started-slurm-console-create-cluster-deploy-console"></a>

You can configure resources using the AWS Management Console and deploy using the CloudFormation templates. 

Follow these steps.

1. *Instead of choosing **Submit***, choose **Download CloudFormation template parameters** at the end of the tutorial in [Getting started with SageMaker HyperPod using the SageMaker AI console](smcluster-getting-started-slurm-console.md). The tutorial contains important configuration information you will need to create your cluster successfully.
**Important**  
If you choose **Submit**, you will not be able to deploy a cluster with the same name until you delete the cluster.

   After you choose **Download CloudFormation template parameters**, the **Using the configuration file to create the cluster using the AWS CLI** window will appear on the right side of the page.

1. On the **Using the configuration file to create the cluster using the AWS CLI** window, choose **Download configuration parameters file**. The file will be downloaded to your machine. You can edit the configuration JSON file based on your needs or leave it as-is, if no change is required.

1. In the terminal, navigate to the directory containing the downloaded parameter file (referenced as `file://params.json` in the following command).

1. Run the [create-stack](https://docs.aws.amazon.com//cli/latest/reference/cloudformation/create-stack.html) AWS CLI command to deploy the CloudFormation stack that will provision the configured resources and create the HyperPod cluster.

   ```
   aws cloudformation create-stack \
       --stack-name my-stack \
       --template-url https://aws-sagemaker-hyperpod-cluster-setup.amazonaws.com/templates-slurm/main-stack-slurm-based-template.yaml \
       --parameters file://params.json \
       --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
   ```

1. To view the status of the resources provisioning, navigate to the [CloudFormation console](https://console.aws.amazon.com/cloudformation).

   After the cluster creation completes, view the new cluster under **Clusters** in the main pane of the SageMaker HyperPod console. You can check its status under the **Status** column.

1. After the status of the cluster turns to `InService`, you can start logging into the cluster nodes. To access the cluster nodes and start running ML workloads, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).
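If you prefer the AWS CLI over the console for checking provisioning progress, the `describe-stacks` command reports the stack status. The stack name here matches the `my-stack` example used in this procedure:

```
aws cloudformation describe-stacks \
    --stack-name my-stack \
    --query 'Stacks[0].StackStatus' \
    --output text
```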

## Configure resources and deploy using CloudFormation
<a name="smcluster-getting-started-slurm-console-create-cluster-deploy-cfn"></a>

You can configure resources and deploy using the CloudFormation templates for SageMaker HyperPod.

Follow these steps.

1. Download a CloudFormation template for SageMaker HyperPod from the [sagemaker-hyperpod-cluster-setup](https://github.com/aws/sagemaker-hyperpod-cluster-setup) GitHub repository.

1. Run the [create-stack](https://docs.aws.amazon.com//cli/latest/reference/cloudformation/create-stack.html) AWS CLI command to deploy the CloudFormation stack that will provision the configured resources and create the HyperPod cluster.

   ```
   aws cloudformation create-stack \
       --stack-name my-stack \
       --template-url URL_of_the_file_that_contains_the_template_body \
       --parameters file://params.json \
       --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
   ```

1. To view the status of the resources provisioning, navigate to the CloudFormation console.

   After the cluster creation completes, view the new cluster under **Clusters** in the main pane of the SageMaker HyperPod console. You can check its status under the **Status** column.

1. After the status of the cluster turns to `InService`, you can start logging into the cluster nodes. To access the cluster nodes and start running ML workloads, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).

# Getting started with SageMaker HyperPod using the AWS CLI
<a name="smcluster-getting-started-slurm-cli"></a>

Create your first SageMaker HyperPod cluster using the AWS CLI commands for HyperPod.

## Create your first SageMaker HyperPod cluster with Slurm
<a name="smcluster-getting-started-slurm-cli-create-cluster"></a>

The following tutorial demonstrates how to create a new SageMaker HyperPod cluster and set it up with Slurm through the [AWS CLI commands for SageMaker HyperPod](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-cli). Following the tutorial, you'll create a HyperPod cluster with three Slurm nodes: `my-controller-group`, `my-login-group`, and `worker-group-1`.

With the API-driven configuration approach, you define Slurm node types and partition assignments directly in the CreateCluster API request using `SlurmConfig`. This eliminates the need for a separate `provisioning_parameters.json` file and provides built-in validation, drift detection, and per-instance-group FSx configuration.

1. First, prepare and upload lifecycle scripts to an Amazon S3 bucket. During cluster creation, HyperPod runs them in each instance group. Upload lifecycle scripts to Amazon S3 using the following command.

   ```
   aws s3 sync \
       ~/local-dir-to-lifecycle-scripts/ \
       s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src
   ```
**Note**  
The S3 bucket path must start with the prefix `sagemaker-`, because the [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) with `AmazonSageMakerClusterInstanceRolePolicy` only allows access to Amazon S3 buckets whose names start with that prefix.

   If you are starting from scratch, use sample lifecycle scripts provided in the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/). The following sub-steps show how to download and upload the sample lifecycle scripts to an Amazon S3 bucket.

   1. Download a copy of the lifecycle script samples to a directory on your local computer.

      ```
      git clone https://github.com/aws-samples/awsome-distributed-training/
      ```

   1. Go into the directory [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config), where you can find a set of lifecycle scripts.

      ```
      cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config
      ```

      To learn more about the lifecycle script samples, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

   1. Upload the scripts to `s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src`. You can do so by using the Amazon S3 console, or by running the following AWS CLI Amazon S3 command.

      ```
      aws s3 sync \
          ~/local-dir-to-lifecycle-scripts/ \
          s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src
      ```
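      Optionally, verify that the scripts landed in the bucket before moving on:

      ```
      # List the uploaded lifecycle scripts to confirm the sync succeeded.
      aws s3 ls s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src/ --recursive
      ```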
**Note**  
With API-driven configuration, you do not need to create or upload a `provisioning_parameters.json` file. The Slurm configuration is defined directly in the CreateCluster API request in the next step.

1. Prepare a [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) request file in JSON format and save it as `create_cluster.json`.

   With API-driven configuration, you specify the Slurm node type and partition assignment for each instance group using the `SlurmConfig` field. You also configure the cluster-level Slurm settings using `Orchestrator.Slurm`.

   For `ExecutionRole`, provide the ARN of the IAM role you created with the managed `AmazonSageMakerClusterInstanceRolePolicy` in [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md).

   ```
   {
       "ClusterName": "my-hyperpod-cluster",
       "InstanceGroups": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceType": "ml.c5.xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Controller"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole",
               "InstanceStorageConfigs": [
                   {
                       "EbsVolumeConfig": {
                           "VolumeSizeInGB": 500
                       }
                   }
               ]
           },
           {
               "InstanceGroupName": "my-login-group",
               "InstanceType": "ml.m5.4xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Login"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
           },
           {
               "InstanceGroupName": "worker-group-1",
               "InstanceType": "ml.trn1.32xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Compute",
                   "PartitionNames": ["partition-1"]
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
           }
       ],
       "Orchestrator": {
           "Slurm": {
               "SlurmConfigStrategy": "Managed"
           }
       }
   }
   ```

   **SlurmConfig fields:**    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started-slurm-cli.html)

   **Orchestrator.Slurm fields:**    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started-slurm-cli.html)

   **SlurmConfigStrategy options:**
   + `Managed` (recommended): HyperPod fully manages `slurm.conf` and detects unauthorized changes (drift detection). Updates fail if drift is detected.
   + `Overwrite`: HyperPod overwrites `slurm.conf` on updates, ignoring any manual changes.
   + `Merge`: HyperPod preserves manual changes and merges them with API configuration.
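   Before submitting the request, it can help to catch JSON syntax errors locally. The following sketch uses only the Python standard library:

   ```
   # Validate the JSON syntax of create_cluster.json before calling the API.
   # Any syntax error is reported with its line and column.
   python3 -m json.tool create_cluster.json > /dev/null \
       && echo "create_cluster.json: valid JSON" \
       || echo "create_cluster.json: syntax error"
   ```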

   **Adding FSx for Lustre (optional):**

   To mount an FSx for Lustre filesystem to your compute nodes, add `FsxLustreConfig` to the `InstanceStorageConfigs` for the instance group. This requires a Custom VPC configuration.

   ```
   {
       "InstanceGroupName": "worker-group-1",
       "InstanceType": "ml.trn1.32xlarge",
       "InstanceCount": 1,
       "SlurmConfig": {
           "NodeType": "Compute",
           "PartitionNames": ["partition-1"]
       },
       "InstanceStorageConfigs": [
           {
               "FsxLustreConfig": {
                   "DnsName": "fs-0abc123def456789.fsx.us-west-2.amazonaws.com",
                   "MountPath": "/fsx",
                   "MountName": "abcdefgh"
               }
           }
       ],
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
   }
   ```
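   If you don't have the DNS name and mount name of your filesystem handy, you can look them up through the Amazon FSx API. The filesystem ID below is illustrative; replace it with your own.

   ```
   # Look up the DNS name and Lustre mount name of an existing FSx filesystem.
   aws fsx describe-file-systems \
       --file-system-ids fs-0abc123def456789 \
       --query "FileSystems[0].[DNSName,LustreConfiguration.MountName]" \
       --output text
   ```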

   **Adding FSx for OpenZFS (optional):**

   You can also mount FSx for OpenZFS filesystems:

   ```
   "InstanceStorageConfigs": [
       {
           "FsxOpenZfsConfig": {
               "DnsName": "fs-0xyz789abc123456.fsx.us-west-2.amazonaws.com",
               "MountPath": "/shared"
           }
       }
   ]
   ```
**Note**  
Each instance group can have at most one FSx for Lustre and one FSx for OpenZFS configuration. Different instance groups can mount different filesystems.

   **Adding VPC configuration (required for FSx):**

   If using FSx, you must specify a Custom VPC configuration:

   ```
   {
       "ClusterName": "my-hyperpod-cluster",
       "InstanceGroups": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceType": "ml.c5.xlarge",
               "InstanceCount": 1,
               "SlurmConfig": {
                   "NodeType": "Controller"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::<account-id>:role/HyperPodExecutionRole"
           }
       ],
       "Orchestrator": {
           "Slurm": {
               "SlurmConfigStrategy": "Managed"
           }
       },
       "VpcConfig": {
           "SecurityGroupIds": ["sg-0abc123def456789a"],
           "Subnets": ["subnet-0abc123def456789a"]
       }
   }
   ```
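   To fill in `VpcConfig`, you can look up the subnet and security group IDs of your VPC from the CLI. The VPC ID below is illustrative; replace it with your own.

   ```
   # List the subnets and security groups of a VPC for use in VpcConfig.
   aws ec2 describe-subnets \
       --filters Name=vpc-id,Values=vpc-0abc123def456789 \
       --query "Subnets[].SubnetId"
   aws ec2 describe-security-groups \
       --filters Name=vpc-id,Values=vpc-0abc123def456789 \
       --query "SecurityGroups[].[GroupId,GroupName]" \
       --output table
   ```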

1. Run the following command to create the cluster.

   ```
   aws sagemaker create-cluster --cli-input-json file://complete/path/to/create_cluster.json
   ```

   This should return the ARN of the created cluster.

   ```
   {
       "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/my-hyperpod-cluster"
   }
   ```

   If you receive an error due to resource limits, ensure that you change the instance type to one with sufficient quotas in your account, or request additional quotas by following [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

   **Common validation errors:**    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started-slurm-cli.html)

1. Run `describe-cluster` to check the status of the cluster.

   ```
   aws sagemaker describe-cluster --cluster-name my-hyperpod-cluster
   ```

   Example response:

   ```
   {
       "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/my-hyperpod-cluster",
       "ClusterName": "my-hyperpod-cluster",
       "ClusterStatus": "Creating",
       "InstanceGroups": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceType": "ml.c5.xlarge",
               "InstanceCount": 1,
               "CurrentCount": 0,
               "TargetCount": 1,
               "SlurmConfig": {
                   "NodeType": "Controller"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<bucket>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
           },
           {
               "InstanceGroupName": "my-login-group",
               "InstanceType": "ml.m5.4xlarge",
               "InstanceCount": 1,
               "CurrentCount": 0,
               "TargetCount": 1,
               "SlurmConfig": {
                   "NodeType": "Login"
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<bucket>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
           },
           {
               "InstanceGroupName": "worker-group-1",
               "InstanceType": "ml.trn1.32xlarge",
               "InstanceCount": 1,
               "CurrentCount": 0,
               "TargetCount": 1,
               "SlurmConfig": {
                   "NodeType": "Compute",
                   "PartitionNames": ["partition-1"]
               },
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-<bucket>/src",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
           }
       ],
       "Orchestrator": {
           "Slurm": {
               "SlurmConfigStrategy": "Managed"
           }
       },
       "CreationTime": "2024-01-15T10:30:00Z"
   }
   ```

   After the status of the cluster changes to **InService**, proceed to the next step. Cluster creation typically takes 10 to 15 minutes.
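   Rather than rerunning `describe-cluster` manually, a small polling sketch can watch for the transition; the cluster name follows the example above.

   ```
   # Poll the cluster status once a minute until it leaves the Creating state.
   while true; do
       STATUS=$(aws sagemaker describe-cluster \
           --cluster-name my-hyperpod-cluster \
           --query ClusterStatus --output text)
       echo "$(date '+%H:%M:%S') ClusterStatus: $STATUS"
       if [ "$STATUS" = "InService" ] || [ "$STATUS" = "Failed" ]; then
           break
       fi
       sleep 60
   done
   ```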

1. Run `list-cluster-nodes` to check the details of the cluster nodes.

   ```
   aws sagemaker list-cluster-nodes --cluster-name my-hyperpod-cluster
   ```

   Example response:

   ```
   {
       "ClusterNodeSummaries": [
           {
               "InstanceGroupName": "my-controller-group",
               "InstanceId": "i-0abc123def456789a",
               "InstanceType": "ml.c5.xlarge",
               "InstanceStatus": {
                   "Status": "Running",
                   "Message": ""
               },
               "LaunchTime": "2024-01-15T10:35:00Z"
           },
           {
               "InstanceGroupName": "my-login-group",
               "InstanceId": "i-0abc123def456789b",
               "InstanceType": "ml.m5.4xlarge",
               "InstanceStatus": {
                   "Status": "Running",
                   "Message": ""
               },
               "LaunchTime": "2024-01-15T10:35:00Z"
           },
           {
               "InstanceGroupName": "worker-group-1",
               "InstanceId": "i-0abc123def456789c",
               "InstanceType": "ml.trn1.32xlarge",
               "InstanceStatus": {
                   "Status": "Running",
                   "Message": ""
               },
               "LaunchTime": "2024-01-15T10:36:00Z"
           }
       ]
   }
   ```

   Your cluster users need the `InstanceId` values to log in to the cluster nodes through AWS Systems Manager (`aws ssm`). For more information about logging in to the cluster nodes and running ML workloads, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).
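   To extract just the fields needed for login, you can narrow the response with a `--query` filter, for example:

   ```
   # Print only the instance group name and instance ID of each cluster node.
   aws sagemaker list-cluster-nodes \
       --cluster-name my-hyperpod-cluster \
       --query "ClusterNodeSummaries[].[InstanceGroupName,InstanceId]" \
       --output table
   ```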

1. Connect to your cluster using AWS Systems Manager Session Manager.

   ```
   aws ssm start-session \
       --target sagemaker-cluster:my-hyperpod-cluster_my-login-group-i-0abc123def456789b \
       --region us-west-2
   ```

   Once connected, verify that Slurm is configured correctly:

   ```
   # Check Slurm nodes
   sinfo
   
   # Check Slurm partitions
   sinfo -p partition-1
   
   # Submit a test job
   srun -p partition-1 --nodes=1 hostname
   ```
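   Beyond single `srun` commands, you can submit batch jobs. The following is a minimal batch script sketch; the job name and output path are illustrative.

   ```
   #!/bin/bash
   #SBATCH --job-name=hello-hyperpod
   #SBATCH --partition=partition-1
   #SBATCH --nodes=1
   #SBATCH --output=hello-%j.out

   # Print the hostname of the allocated compute node.
   srun hostname
   ```

   Save it as, for example, `hello.sbatch`, submit it with `sbatch hello.sbatch`, and inspect the output file after the job completes.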

## Delete the cluster and clean resources
<a name="smcluster-getting-started-slurm-cli-delete-cluster-and-clean"></a>

After you have successfully tested creating a SageMaker HyperPod cluster, the cluster continues running in the `InService` state until you delete it. We recommend that you delete any clusters created using on-demand SageMaker AI capacity when not in use to avoid incurring continued service charges based on on-demand pricing. In this tutorial, you created a cluster that consists of three instance groups. Make sure that you delete the cluster by running the following command.

```
aws sagemaker delete-cluster --cluster-name my-hyperpod-cluster
```

To clean up the lifecycle scripts from the Amazon S3 bucket used for this tutorial, go to the Amazon S3 bucket you used during cluster creation and remove the files entirely.

```
aws s3 rm s3://sagemaker-<unique-s3-bucket-name>/<lifecycle-script-directory>/src --recursive
```
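To confirm that the cluster was deleted, you can list clusters filtered by name; once deletion completes, the response contains no matching entries.

```
# Confirm that the cluster no longer appears in your account.
aws sagemaker list-clusters --name-contains my-hyperpod-cluster
```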

If you have tested running any model training workloads on the cluster, also check if you have uploaded any data or if your job has saved any artifacts to different Amazon S3 buckets or file system services such as Amazon FSx for Lustre and Amazon Elastic File System. To prevent incurring charges, delete all artifacts and data from the storage or file system.

## Related topics
<a name="smcluster-getting-started-slurm-cli-related-topics"></a>
+ [SageMaker HyperPod Slurm configuration](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-slurm-configuration)
+ [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md)
+ [FSx configuration via InstanceStorageConfigs](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-slurm-fsx-config)
+ [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md)

# SageMaker HyperPod Slurm cluster operations
<a name="sagemaker-hyperpod-operate-slurm"></a>

This section provides guidance on managing SageMaker HyperPod through the SageMaker AI console UI or the AWS Command Line Interface (CLI). You'll learn how to perform various tasks related to SageMaker HyperPod, whether you prefer a visual interface or working with commands.

**Topics**
+ [Managing SageMaker HyperPod Slurm clusters using the SageMaker console](sagemaker-hyperpod-operate-slurm-console-ui.md)
+ [Managing SageMaker HyperPod Slurm clusters using the AWS CLI](sagemaker-hyperpod-operate-slurm-cli-command.md)

# Managing SageMaker HyperPod Slurm clusters using the SageMaker console
<a name="sagemaker-hyperpod-operate-slurm-console-ui"></a>

The following topics provide guidance on how to manage SageMaker HyperPod through the console UI.

**Topics**
+ [Create a SageMaker HyperPod cluster](#sagemaker-hyperpod-operate-slurm-console-ui-create-cluster)
+ [Browse your SageMaker HyperPod clusters](#sagemaker-hyperpod-operate-slurm-console-ui-browse-clusters)
+ [View details of each SageMaker HyperPod cluster](#sagemaker-hyperpod-operate-slurm-console-ui-view-details-of-clusters)
+ [Edit a SageMaker HyperPod cluster](#sagemaker-hyperpod-operate-slurm-console-ui-edit-clusters)
+ [Delete a SageMaker HyperPod cluster](#sagemaker-hyperpod-operate-slurm-console-ui-delete-cluster)

## Create a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-operate-slurm-console-ui-create-cluster"></a>

See the instructions in [Getting started with SageMaker HyperPod using the SageMaker AI console](smcluster-getting-started-slurm-console.md) to create a new SageMaker HyperPod cluster through the SageMaker HyperPod console UI.

## Browse your SageMaker HyperPod clusters
<a name="sagemaker-hyperpod-operate-slurm-console-ui-browse-clusters"></a>

On the SageMaker HyperPod console main page, all created clusters appear under the **Clusters** section, which provides a summary view of the clusters, their ARNs, statuses, and creation times.

## View details of each SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-operate-slurm-console-ui-view-details-of-clusters"></a>

Under **Clusters** on the console main page, the cluster **Names** are activated as links. Choose the cluster name link to see details of each cluster.

## Edit a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-operate-slurm-console-ui-edit-clusters"></a>

1. Under **Clusters** in the main pane of the SageMaker HyperPod console, choose the cluster you want to update.

1. Select your cluster, and choose **Edit**.

1. In the **Edit <your-cluster>** page, you can edit the configurations of existing instance groups, add more instance groups, delete instance groups, and change tags for the cluster. After making changes, choose **Submit**. 

   1. In the **Configure instance groups** section, you can add more instance groups by choosing **Create instance group**.

   1. In the **Configure instance groups** section, you can select an instance group and choose **Edit** to change its configuration or **Delete** to remove the instance group permanently.
**Important**  
When deleting an instance group, consider the following points:  
Your SageMaker HyperPod cluster must always maintain at least one instance group.  
Ensure that all critical data is backed up before removal.  
The removal process cannot be undone.
**Note**  
Deleting an instance group will terminate all compute resources associated with that group.

   1. In the **Tags** section, you can update tags for the cluster.

## Delete a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-operate-slurm-console-ui-delete-cluster"></a>

1. Under **Clusters** in the main pane of the SageMaker HyperPod console, choose the cluster you want to delete.

1. Select your cluster, and choose **Delete**.

1. In the pop-up window for cluster deletion, review the cluster information carefully to confirm that you chose the right cluster to delete.

1. After you have reviewed the cluster information, choose **Yes, delete cluster**.

1. In the text field to confirm this deletion, type **delete**.

1. Choose **Delete** on the lower right corner of the pop-up window to finish sending the cluster deletion request.

# Managing SageMaker HyperPod Slurm clusters using the AWS CLI
<a name="sagemaker-hyperpod-operate-slurm-cli-command"></a>

The following topics provide guidance on writing SageMaker HyperPod API request files in JSON format and running them using the AWS CLI commands.

**Topics**
+ [Create a new cluster](#sagemaker-hyperpod-operate-slurm-cli-command-create-cluster)
+ [Describe a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-describe-cluster)
+ [List details of cluster nodes](#sagemaker-hyperpod-operate-slurm-cli-command-list-cluster-nodes)
+ [Describe details of a cluster node](#sagemaker-hyperpod-operate-slurm-cli-command-describe-cluster-node)
+ [List clusters](#sagemaker-hyperpod-operate-slurm-cli-command-list-clusters)
+ [Update cluster configuration](#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster)
+ [Update the SageMaker HyperPod platform software of a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software)
+ [Scale down a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-scale-down)
+ [Delete a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-delete-cluster)

## Create a new cluster
<a name="sagemaker-hyperpod-operate-slurm-cli-command-create-cluster"></a>

1. Prepare lifecycle configuration scripts and upload them to an S3 bucket, such as `s3://sagemaker-amzn-s3-demo-bucket/lifecycle-script-directory/src/`. Step 2 assumes that there's an entry point script named `on_create.sh` in the specified S3 bucket.
**Important**  
Make sure that you set the S3 path to start with `s3://sagemaker-`. The [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) has the managed [https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-cluster.html](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-cluster.html) attached, which allows access to S3 buckets with the specific prefix `sagemaker-`.

1. Prepare a [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) API request file in JSON format. Configure the instance groups to match the Slurm cluster design in the `provisioning_parameters.json` file, which is used during cluster creation as part of running your set of lifecycle scripts. To learn more, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md). The following template has two instance groups to meet the minimum requirement for a Slurm cluster: one controller (head) node and one compute (worker) node. For `ExecutionRole`, provide the ARN of the IAM role you created with the managed `AmazonSageMakerClusterInstanceRolePolicy` from the section [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod).

   ```
   // create_cluster.json
   {
       "ClusterName": "your-hyperpod-cluster",
       "InstanceGroups": [
           {
               "InstanceGroupName": "controller-group",
               "InstanceType": "ml.m5.xlarge",
               "InstanceCount": 1,
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-amzn-s3-demo-bucket/lifecycle-script-directory/src/",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
               // Optional: Configure an additional storage per instance group.
               "InstanceStorageConfigs": [
                   {
                      // Attach an additional EBS volume to each instance within the instance group.
                      // The default mount path for the additional EBS volume is /opt/sagemaker.
                      "EbsVolumeConfig":{
                         // Specify an integer between 1 and 16384 in gigabytes (GB).
                         "VolumeSizeInGB": integer,
                      }
                   }
               ]
           }, 
           {
               "InstanceGroupName": "worker-group-1",
               "InstanceType": "ml.p4d.24xlarge",
               "InstanceCount": 1,
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://sagemaker-amzn-s3-demo-bucket/lifecycle-script-directory/src/",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster"
           }
       ],
       // Optional
       "Tags": [ 
           { 
              "Key": "string",
              "Value": "string"
           }
       ],
       // Optional
       "VpcConfig": { 
           "SecurityGroupIds": [ "string" ],
           "Subnets": [ "string" ]
       }
   }
   ```

   Depending on how you design the cluster structure through your lifecycle scripts, you can configure up to 20 instance groups under the `InstanceGroups` parameter.

   For the `Tags` request parameter, you can add custom tags for managing the SageMaker HyperPod cluster as an AWS resource. You can add tags to your cluster in the same way you add them in other AWS services that support tagging. To learn more about tagging AWS resources in general, see [Tagging AWS Resources User Guide](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).

   For the `VpcConfig` request parameter, specify the information of a VPC you want to use. For more information, see [Setting up SageMaker HyperPod with a custom Amazon VPC](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-optional-vpc).

1. Run the [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-cluster.html) command as follows.

   ```
   aws sagemaker create-cluster \
       --cli-input-json file://complete/path/to/create_cluster.json
   ```

   This should return the ARN of the new cluster.

## Describe a cluster
<a name="sagemaker-hyperpod-operate-slurm-cli-command-describe-cluster"></a>

Run [describe-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster.html) to check the status of the cluster. You can specify either the name or the ARN of the cluster.

```
aws sagemaker describe-cluster --cluster-name your-hyperpod-cluster
```

After the status of the cluster changes to **InService**, proceed to the next step. Using this API, you can also retrieve failure messages from other HyperPod API operations.
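For example, when a cluster operation fails, you can query the failure message directly; the `FailureMessage` field is part of the `DescribeCluster` response.

```
# Retrieve the failure message of a cluster whose status is Failed or RollingBack.
aws sagemaker describe-cluster \
    --cluster-name your-hyperpod-cluster \
    --query FailureMessage --output text
```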

## List details of cluster nodes
<a name="sagemaker-hyperpod-operate-slurm-cli-command-list-cluster-nodes"></a>

Run [list-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-cluster-nodes.html) to check the key information of the cluster nodes.

```
aws sagemaker list-cluster-nodes --cluster-name your-hyperpod-cluster
```

This returns a response that includes the `InstanceId` of each node, which you need for logging in to the nodes (using `aws ssm`).

## Describe details of a cluster node
<a name="sagemaker-hyperpod-operate-slurm-cli-command-describe-cluster-node"></a>

Run [describe-cluster-node](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster-node.html) to retrieve details of a cluster node. You can get the cluster node ID from list-cluster-nodes output. You can specify either the name or the ARN of the cluster.

```
aws sagemaker describe-cluster-node \
    --cluster-name your-hyperpod-cluster \
    --node-id i-111222333444555aa
```

## List clusters
<a name="sagemaker-hyperpod-operate-slurm-cli-command-list-clusters"></a>

Run [list-clusters](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-clusters.html) to list all clusters in your account.

```
aws sagemaker list-clusters
```

You can also add flags to filter the list of clusters. To learn more about the underlying API operation and additional flags for filtering, see the [ListClusters](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusters.html) API reference.
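For example, the following sketch lists up to five clusters created after a given date, newest first:

```
# List up to 5 clusters created after January 1, 2024, sorted by creation time.
aws sagemaker list-clusters \
    --max-results 5 \
    --sort-by CREATION_TIME \
    --sort-order Descending \
    --creation-time-after 2024-01-01T00:00:00Z
```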

## Update cluster configuration
<a name="sagemaker-hyperpod-operate-slurm-cli-command-update-cluster"></a>

Run [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) to update the configuration of a cluster.

**Note**  
You can use the `UpdateCluster` API to scale down or remove entire instance groups from your SageMaker HyperPod cluster. For additional instructions on how to scale down or delete instance groups, see [Scale down a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-scale-down).

1. Create an `UpdateCluster` request file in JSON format. Make sure that you specify the right cluster name and instance group name to update. You can change the instance type, the number of instances, the lifecycle configuration entrypoint script, and the path to the script.

   1. For `ClusterName`, specify the name of the cluster you want to update.

   1. For `InstanceGroupName`

      1. To update an existing instance group, specify the name of the instance group you want to update.

      1. To add a new instance group, specify a new name not existing in your cluster.

   1. For `InstanceType`

      1. To update an existing instance group, you must use the same instance type that you initially specified for the group.

      1. To add a new instance group, specify an instance type you want to configure the group with.

   1. For `InstanceCount`

      1. To update an existing instance group, specify an integer that corresponds to your desired number of instances. You can provide a higher or lower value (down to 0) to scale the instance group up or down.

      1. To add a new instance group, specify an integer greater than or equal to 1.

   1. For `LifeCycleConfig`, you can change both `SourceS3Uri` and `OnCreate` values as you want to update the instance group.

   1. For `ExecutionRole`

      1. For updating an existing instance group, keep using the same IAM role you attached during cluster creation.

      1. For adding a new instance group, specify an IAM role you want to attach.

   1. For `ThreadsPerCore`

      1. For updating an existing instance group, keep using the same value you specified during cluster creation.

      1. For adding a new instance group, you can choose any value from the allowed options per instance type. For more information, search the instance type and see the **Valid threads per core** column in the reference table at [CPU cores and threads per CPU core per instance type](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html) in the *Amazon EC2 User Guide*.

   The following code snippet is a JSON request file template you can use. For more information about the request syntax and parameters of this API, see the [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) API reference.

   ```
   // update_cluster.json
   {
       // Required
       "ClusterName": "name-of-cluster-to-update",
       // Required
       "InstanceGroups": [
           {
               "InstanceGroupName": "name-of-instance-group-to-update",
               "InstanceType": "ml.m5.xlarge",
               "InstanceCount": 1,
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://amzn-s3-demo-bucket-sagemaker/lifecycle-script-directory/src/",
                   "OnCreate": "on_create.sh"
               },
               "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
               // Optional: Configure an additional storage per instance group.
               "InstanceStorageConfigs": [
                   {
                      // Attach an additional EBS volume to each instance within the instance group.
                      // The default mount path for the additional EBS volume is /opt/sagemaker.
                      "EbsVolumeConfig":{
                         // Specify an integer between 1 and 16384 in gigabytes (GB).
                          "VolumeSizeInGB": integer
                      }
                   }
               ]
           },
           // add more blocks of instance groups as needed
           { ... }
       ]
   }
   ```

1. Run the following `update-cluster` command to submit the request. 

   ```
   aws sagemaker update-cluster \
       --cli-input-json file://complete/path/to/update_cluster.json
   ```
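If you script cluster operations, you can also generate the request file programmatically. The following is a minimal sketch (Python standard library only; the cluster name, instance group name, bucket path, and role ARN are placeholders) that assembles the same `UpdateCluster` payload as the template above and prints the equivalent AWS CLI command.

```python
import json

def build_update_request(cluster_name, group_name, instance_type,
                         instance_count, source_s3_uri, on_create, role_arn):
    """Assemble an UpdateCluster request payload for a single instance group."""
    if instance_count < 0:
        raise ValueError("InstanceCount must be 0 or greater")
    return {
        "ClusterName": cluster_name,
        "InstanceGroups": [
            {
                "InstanceGroupName": group_name,
                "InstanceType": instance_type,
                "InstanceCount": instance_count,
                "LifeCycleConfig": {
                    "SourceS3Uri": source_s3_uri,
                    "OnCreate": on_create,
                },
                "ExecutionRole": role_arn,
            }
        ],
    }

# Placeholder values for illustration only.
request = build_update_request(
    "my-hyperpod-cluster",
    "worker-group-1",
    "ml.m5.xlarge",
    2,  # scale the group to 2 instances
    "s3://sagemaker-demo/lifecycle-script-directory/src/",
    "on_create.sh",
    "arn:aws:iam::111122223333:role/iam-role-for-cluster",
)

# Write the request file, then submit it with the AWS CLI.
with open("update_cluster.json", "w") as f:
    json.dump(request, f, indent=4)

print("aws sagemaker update-cluster --cli-input-json file://update_cluster.json")
```

Generating the file this way avoids the hand-edited JSON comments in the template, which are not valid JSON and must be removed before submission.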

## Update the SageMaker HyperPod platform software of a cluster
<a name="sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software"></a>

Run [update-cluster-software](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster-software.html) to update existing clusters with software and security patches provided by the SageMaker HyperPod service. For `--cluster-name`, specify either the name or the ARN of the cluster to update.

**Important**  
You must back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that any data stored on the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see [Use the backup script provided by SageMaker HyperPod](#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup).

```
aws sagemaker update-cluster-software --cluster-name your-hyperpod-cluster
```

This command calls the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API. After the API call, SageMaker HyperPod checks whether a newer DLAMI is available for the cluster instances. If a DLAMI update is required, SageMaker HyperPod updates the cluster instances to use the latest [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami) and runs your lifecycle scripts from the Amazon S3 bucket that you specified during cluster creation or update. If the cluster is already using the latest DLAMI, SageMaker HyperPod makes no changes to the cluster and doesn't run the lifecycle scripts again. The SageMaker HyperPod service team regularly rolls out new [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami)s to enhance security and improve the user experience. We recommend that you always update to the latest SageMaker HyperPod DLAMI. For future DLAMI updates for security patching, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

**Tip**  
If the security patch fails, you can retrieve failure messages by running the [DescribeCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html) API as instructed at [Describe a cluster](#sagemaker-hyperpod-operate-slurm-cli-command-describe-cluster).

**Note**  
You can only run this API programmatically. The patching functionality is not implemented in the SageMaker HyperPod console UI.

### Use the backup script provided by SageMaker HyperPod
<a name="sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup"></a>

SageMaker HyperPod provides a script to back up and restore your data at [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/patching-backup.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/patching-backup.sh) in the *Awsome Distributed Training GitHub repository*. The script provides the following two functions.

**To back up data to an S3 bucket before patching**

```
sudo bash patching-backup.sh --create <s3-backup-bucket-path>
```

After you run the command, the script uses `squeue` to check whether there are queued jobs, stops Slurm if the queue is empty, backs up `mariadb`, and copies the local items on disk defined under `LOCAL_ITEMS`. You can add more files and directories to `LOCAL_ITEMS`.

```
# Define files and directories to back up.
LOCAL_ITEMS=(
    "/var/spool/slurmd"
    "/var/spool/slurmctld"
    "/etc/systemd/system/slurmctld.service"
    "/home/ubuntu/backup_slurm_acct_db.sql"
    # ... Add more items as needed
)
```

Also, you can add custom code to the provided script to back up any applications for your use case.

**To restore data from an S3 bucket after patching**

```
sudo bash patching-backup.sh --restore <s3-backup-bucket-path>
```
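To keep the backup, patch, and restore steps in a fixed order, you can drive the cycle from a small helper. The following sketch (Python standard library only; the cluster name and S3 path are placeholders) only builds the commands in sequence rather than executing them — note that the backup and restore commands must run on the cluster's head node, while `update-cluster-software` can run from any host with AWS CLI access.

```python
import shlex

def build_patch_commands(cluster_name, s3_backup_path):
    """Return the shell commands, in order, for a backup -> patch -> restore cycle."""
    return [
        # 1. Back up Slurm state and local items to S3 before patching.
        f"sudo bash patching-backup.sh --create {shlex.quote(s3_backup_path)}",
        # 2. Apply the SageMaker HyperPod platform software update.
        f"aws sagemaker update-cluster-software --cluster-name {shlex.quote(cluster_name)}",
        # 3. Restore the backed-up data after the root volume is replaced.
        f"sudo bash patching-backup.sh --restore {shlex.quote(s3_backup_path)}",
    ]

# Placeholder cluster name and bucket path for illustration.
cmds = build_patch_commands("my-hyperpod-cluster", "s3://sagemaker-demo/backups")
for cmd in cmds:
    print(cmd)
```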

## Scale down a cluster
<a name="sagemaker-hyperpod-operate-slurm-cli-command-scale-down"></a>

You can scale down the number of instances or delete instance groups in your SageMaker HyperPod cluster to optimize resource allocation or reduce costs.

You scale down by either using the `UpdateCluster` API operation to randomly terminate instances from your instance group down to a specified number, or by terminating specific instances using the `BatchDeleteClusterNodes` API operation. You can also completely remove entire instance groups using the `UpdateCluster` API. For more information about how to scale down using these methods, see [Scaling down a SageMaker HyperPod cluster](smcluster-scale-down.md).

**Note**  
You cannot remove instances that are configured as Slurm controller nodes. Attempting to delete a Slurm controller node results in a validation error with the error code `NODE_ID_IN_USE`.
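When scripting scale-down with `BatchDeleteClusterNodes`, it can help to filter out controller-node instance IDs before submitting the request, so the call doesn't partially fail with `NODE_ID_IN_USE`. The following is a minimal sketch under the assumption that you track which instance groups serve as Slurm controller nodes; the group names and instance IDs are hypothetical.

```python
def partition_deletable(nodes, controller_groups):
    """Split (group_name, instance_id) pairs into deletable and protected IDs."""
    deletable, protected = [], []
    for group_name, instance_id in nodes:
        if group_name in controller_groups:
            # Deleting these would fail with the NODE_ID_IN_USE error code.
            protected.append(instance_id)
        else:
            deletable.append(instance_id)
    return deletable, protected

# Hypothetical cluster layout for illustration.
nodes = [
    ("controller-machine", "i-0aaaa1111bbbb2222c"),
    ("worker-group-1", "i-0cccc3333dddd4444e"),
    ("worker-group-1", "i-0eeee5555ffff6666a"),
]
deletable, protected = partition_deletable(nodes, {"controller-machine"})
print(deletable)  # safe to pass as node IDs to batch-delete-cluster-nodes
print(protected)  # controller-node IDs that must not be deleted
```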

## Delete a cluster
<a name="sagemaker-hyperpod-operate-slurm-cli-command-delete-cluster"></a>

Run [delete-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-cluster.html) to delete a cluster. You can specify either the name or the ARN of the cluster.

```
aws sagemaker delete-cluster --cluster-name your-hyperpod-cluster
```

# Customizing SageMaker HyperPod clusters using lifecycle scripts
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm"></a>

SageMaker HyperPod offers always up-and-running compute clusters that are highly customizable: you write lifecycle scripts that tell SageMaker HyperPod how to set up the cluster resources. The following topics cover best practices for preparing lifecycle scripts to set up SageMaker HyperPod clusters with open source workload manager tools.

The following topics discuss in-depth best practices for preparing lifecycle scripts to set up Slurm configurations on SageMaker HyperPod.

## High-level overview
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-highlevel-overview"></a>

The following procedure is the main flow of provisioning a HyperPod cluster and setting it up with Slurm. The steps are ordered in a ***bottom-up*** approach.

1. Plan how you want to create Slurm nodes on a HyperPod cluster. For example, if you want to configure two Slurm nodes, you'll need to set up two instance groups in a HyperPod cluster.

1. Prepare Slurm configuration. Choose one of the following approaches:
   + **Option A: API-driven configuration (recommended)** – Define Slurm node types and partitions directly in the `CreateCluster` API payload using `SlurmConfig` within each instance group. With this approach:
     + No `provisioning_parameters.json` file is needed
     + Slurm topology is defined in the API payload alongside instance group definitions
     + FSx filesystems are configured per-instance-group via `InstanceStorageConfigs`
     + Configuration strategy is controlled via `Orchestrator.Slurm.SlurmConfigStrategy`

     Example `SlurmConfig` in an instance group:

     ```
     {
         "InstanceGroupName": "gpu-compute",
         "InstanceType": "ml.p4d.24xlarge",
         "InstanceCount": 8,
         "SlurmConfig": {
             "NodeType": "Compute",
             "PartitionNames": ["gpu-training"]
         }
     }
     ```
   + **Option B: Legacy configuration** – Prepare a `provisioning_parameters.json` file, which is a [Configuration form for provisioning_parameters.json](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-provisioning-forms-slurm). `provisioning_parameters.json` should contain the Slurm node configuration to be provisioned on the HyperPod cluster, and should reflect the design of the Slurm nodes from Step 1.

1. Prepare a set of lifecycle scripts that set up Slurm on HyperPod, install software packages, and configure the cluster environment for your use case. You should structure the lifecycle scripts to collectively run in order from a central Python script (`lifecycle_script.py`), and write an entrypoint shell script (`on_create.sh`) to run the Python script. The entrypoint shell script is what you provide in a HyperPod cluster creation request later in Step 6. 

   Also, note that you should write the scripts to expect `resource_config.json` that will be generated by HyperPod during cluster creation. `resource_config.json` contains HyperPod cluster resource information such as IP addresses, instance types, and ARNs, and is what you need to use for configuring Slurm.

1. Collect all the files from the previous steps into a folder. The folder structure depends on the configuration approach you selected in Step 2.

   If you selected Option A (API-driven configuration):

   Your folder only needs lifecycle scripts for custom setup tasks. Slurm configuration and FSx mounting are handled automatically by HyperPod based on the API payload.

   ```
   └── lifecycle_files // your local folder
   
       ├── on_create.sh
       ├── lifecycle_script.py
       └── ... // more setup scripts to be fed into lifecycle_script.py
   ```
**Note**  
The `provisioning_parameters.json` file is not required when using API-driven configuration.

   If you selected Option B (legacy configuration):

   Your folder must include `provisioning_parameters.json` and the full set of lifecycle scripts.

   ```
   └── lifecycle_files // your local folder
   
       ├── provisioning_parameters.json
       ├── on_create.sh
       ├── lifecycle_script.py
        └── ... // more setup scripts to be fed into lifecycle_script.py
   ```

1. Upload all the files to an S3 bucket. Copy and keep the S3 bucket path. Note that you should create an S3 bucket path starting with `sagemaker-` because you need to choose an [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) with [`AmazonSageMakerClusterInstanceRolePolicy`](security-iam-awsmanpol-AmazonSageMakerClusterInstanceRolePolicy.md) attached, which only allows S3 bucket paths starting with the prefix `sagemaker-`. The following is an example command to upload all the files to an S3 bucket.

   ```
   aws s3 cp --recursive ./lifecycle_files s3://sagemaker-hyperpod-lifecycle/src
   ```

1. Prepare a HyperPod cluster creation request. 
   + Option 1: If you use the AWS CLI, write a cluster creation request in JSON format (`create_cluster.json`) following the instructions at [Create a new cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-create-cluster).
   + Option 2: If you use the SageMaker AI console UI, fill the **Create a cluster** request form in the HyperPod console UI following the instructions at [Create a SageMaker HyperPod cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-create-cluster).

   At this stage, make sure that you create instance groups in the same structure that you planned in Steps 1 and 2. Also, make sure that you specify the S3 bucket from Step 5 in the request forms.

1. Submit the cluster creation request. HyperPod provisions a cluster based on the request, creates a `resource_config.json` file on the HyperPod cluster instances, and sets up Slurm on the cluster by running the lifecycle scripts.
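Before uploading in Step 5, you can sanity-check the local folder from Step 4. The following sketch (Python standard library only) verifies the expected files for the legacy (Option B) layout and the required `sagemaker-` bucket prefix; pass `legacy=False` for the API-driven (Option A) layout. The file contents used in the demonstration are placeholders.

```python
import json
import tempfile
from pathlib import Path

def check_lifecycle_folder(folder, bucket, legacy=True):
    """Return a list of problems found; an empty list means the folder looks ready."""
    problems = []
    folder = Path(folder)
    required = ["on_create.sh", "lifecycle_script.py"]
    if legacy:
        required.append("provisioning_parameters.json")
    for name in required:
        if not (folder / name).is_file():
            problems.append(f"missing {name}")
    params = folder / "provisioning_parameters.json"
    if legacy and params.is_file():
        try:
            json.loads(params.read_text())
        except json.JSONDecodeError as err:
            problems.append(f"provisioning_parameters.json is not valid JSON: {err}")
    # The managed policy only allows bucket paths starting with "sagemaker-".
    if not bucket.startswith("sagemaker-"):
        problems.append("bucket name must start with 'sagemaker-' for the managed policy")
    return problems

# Demonstrate against a throwaway folder with placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "on_create.sh").write_text("#!/bin/bash\n")
    Path(tmp, "lifecycle_script.py").write_text("# setup\n")
    Path(tmp, "provisioning_parameters.json").write_text('{"version": "1.0.0"}')
    issues = check_lifecycle_folder(tmp, "sagemaker-hyperpod-lifecycle")
print(issues)
```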

The following topics walk you through and dive deep into details on how to organize configuration files and lifecycle scripts to work properly during HyperPod cluster creation.

**Topics**
+ [High-level overview](#sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-highlevel-overview)
+ [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md)
+ [What particular configurations HyperPod manages in Slurm configuration files](sagemaker-hyperpod-lifecycle-best-practices-slurm-what-hyperpod-overrides-in-slurm-conf.md)
+ [Slurm log rotations](sagemaker-hyperpod-slurm-log-rotation.md)
+ [Mounting Amazon FSx for Lustre and Amazon FSx for OpenZFS to a HyperPod cluster](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-setup-with-fsx.md)
+ [Validating the JSON configuration files before creating a Slurm cluster on HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-json-files.md)
+ [Validating runtime before running production workloads on a HyperPod Slurm cluster](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-runtime.md)
+ [Developing lifecycle scripts interactively on a HyperPod cluster node](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-develop-lifecycle-scripts.md)

# Base lifecycle scripts provided by HyperPod
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config"></a>

This section walks you through every component of the basic flow of setting up Slurm on HyperPod in a ***top-down*** approach. It starts from preparing a HyperPod cluster creation request to run the `CreateCluster` API, and dives deep into the hierarchical structure down to lifecycle scripts. Use the sample lifecycle scripts provided in the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/). Clone the repository by running the following command.

```
git clone https://github.com/aws-samples/awsome-distributed-training/
```

The base lifecycle scripts for setting up a Slurm cluster on SageMaker HyperPod are available at [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config).

```
cd awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config
```

The following flowchart shows a detailed overview of how you should design the base lifecycle scripts. The descriptions below the diagram and the procedural guide explain how they work during the HyperPod `CreateCluster` API call.

![\[A detailed flow chart of HyperPod cluster creation and the structure of lifecycle scripts.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-lifecycle-structure.png)


***Figure:** A detailed flow chart of HyperPod cluster creation and the structure of lifecycle scripts. (1) The dashed arrows are directed to where the boxes are "called into" and show the flow of preparing configuration files and lifecycle scripts. It starts from preparing `provisioning_parameters.json` and the lifecycle scripts. These are then coded in `lifecycle_script.py` for a collective execution in order. The execution of the `lifecycle_script.py` script is done by the `on_create.sh` shell script, which is to be run in the HyperPod instance terminal. (2) The solid arrows show the main HyperPod cluster creation flow and how the boxes are "called into" or "submitted to". `on_create.sh` is required for a cluster creation request, either in `create_cluster.json` or the **Create a cluster** request form in the console UI. After you submit the request, HyperPod runs the `CreateCluster` API based on the given configuration information from the request and the lifecycle scripts. (3) The dotted arrow indicates that the HyperPod platform creates `resource_config.json` on the cluster instances during cluster resource provisioning. `resource_config.json` contains HyperPod cluster resource information such as the cluster ARN, instance types, and IP addresses. It is important to note that you should prepare the lifecycle scripts to expect the `resource_config.json` file during cluster creation. For more information, see the procedural guide below.*

The following procedural guide explains what happens during HyperPod cluster creation and how the base lifecycle scripts are designed.

1. `create_cluster.json` – To submit a HyperPod cluster creation request, you prepare a `CreateCluster` request file in JSON format. In this best practices example, we assume that the request file is named `create_cluster.json`. Write `create_cluster.json` to provision a HyperPod cluster with instance groups. The best practice is to add the same number of instance groups as the number of Slurm nodes you plan to configure on the HyperPod cluster. Make sure that you give distinctive names to the instance groups that you'll assign to Slurm nodes you plan to set up.

   Also, you are required to specify an S3 bucket path to store your entire set of configuration files and lifecycle scripts to the field name `InstanceGroups.LifeCycleConfig.SourceS3Uri` in the `CreateCluster` request form, and specify the file name of an entrypoint shell script (assume that it's named `on_create.sh`) to `InstanceGroups.LifeCycleConfig.OnCreate`.
**Note**  
If you are using the **Create a cluster** submission form in the HyperPod console UI, the console manages filling and submitting the `CreateCluster` request on your behalf, and runs the `CreateCluster` API in the backend. In this case, you don't need to create `create_cluster.json`; instead, make sure that you specify the correct cluster configuration information to the **Create a cluster** submission form.

1. `on_create.sh` – For each instance group, you need to provide an entrypoint shell script, `on_create.sh`, to run commands, run scripts to install software packages, and set up the HyperPod cluster environment with Slurm. The two things you need to prepare are a `provisioning_parameters.json` required by HyperPod for setting up Slurm and a set of lifecycle scripts for installing software packages. This script should be written to find and run the following files as shown in the sample script at [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/on_create.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/on_create.sh).
**Note**  
Make sure that you upload the entire set of lifecycle scripts to the S3 location you specify in `create_cluster.json`. You should also place your `provisioning_parameters.json` in the same location.

   1. `provisioning_parameters.json` – This is a [Configuration form for provisioning_parameters.json](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-provisioning-forms-slurm). The `on_create.sh` script finds this JSON file and defines an environment variable for identifying the path to it. Through this JSON file, you can configure Slurm nodes and storage options such as Amazon FSx for Lustre for Slurm to communicate with. In `provisioning_parameters.json`, make sure that you assign the HyperPod cluster instance groups to the Slurm nodes appropriately, using the names you specified in `create_cluster.json`, based on how you plan to set them up.

      The following diagram shows an example of how the two JSON configuration files `create_cluster.json` and `provisioning_parameters.json` should be written to assign HyperPod instance groups to Slurm nodes. In this example, we assume a case of setting up three Slurm nodes: a controller (management) node, a login node (which is optional), and a compute (worker) node.
**Tip**  
To help you validate these two JSON files, the HyperPod service team provides a validation script, [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/validate-config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/validate-config.py). To learn more, see [Validating the JSON configuration files before creating a Slurm cluster on HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-json-files.md).  
![\[Direct comparison between .json files.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-lifecycle-slurm-config.png)

      ***Figure:** Direct comparison between `create_cluster.json` for HyperPod cluster creation and `provisioning_parameters.json` for Slurm configuration. The number of instance groups in `create_cluster.json` should match the number of nodes you want to configure as Slurm nodes. In the example in the figure, three Slurm nodes are configured on a HyperPod cluster of three instance groups. You should assign the HyperPod cluster instance groups to Slurm nodes by specifying the instance group names accordingly.*

   1. `resource_config.json` – During cluster creation, the `lifecycle_script.py` script is written to expect a `resource_config.json` file from HyperPod. This file contains information about the cluster, such as instance types and IP addresses.

      When you run the `CreateCluster` API, HyperPod creates a resource configuration file at `/opt/ml/config/resource_config.json` based on the `create_cluster.json` file. The file path is saved to the environment variable named `SAGEMAKER_RESOURCE_CONFIG_PATH`. 
**Important**  
The `resource_config.json` file is auto-generated by the HyperPod platform, and you DO NOT need to create it. The following code is to show an example of `resource_config.json` that would be created from the cluster creation based on `create_cluster.json` in the previous step, and to help you understand what happens in the backend and how an auto-generated `resource_config.json` would look.

      ```
      {
      
          "ClusterConfig": {
              "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/abcde01234yz",
              "ClusterName": "your-hyperpod-cluster"
          },
          "InstanceGroups": [
              {
                  "Name": "controller-machine",
                  "InstanceType": "ml.c5.xlarge",
                  "Instances": [
                      {
                          "InstanceName": "controller-machine-1",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      }
                  ]
              },
              {
                  "Name": "login-group",
                  "InstanceType": "ml.m5.xlarge",
                  "Instances": [
                      {
                          "InstanceName": "login-group-1",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      }
                  ]
              },
              {
                  "Name": "compute-nodes",
                  "InstanceType": "ml.trn1.32xlarge",
                  "Instances": [
                      {
                          "InstanceName": "compute-nodes-1",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      },
                      {
                          "InstanceName": "compute-nodes-2",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      },
                      {
                          "InstanceName": "compute-nodes-3",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      },
                      {
                          "InstanceName": "compute-nodes-4",
                          "AgentIpAddress": "111.222.333.444",
                          "CustomerIpAddress": "111.222.333.444",
                          "InstanceId": "i-12345abcedfg67890"
                      }
                  ]
              }
          ]
      }
      ```

   1. `lifecycle_script.py` – This is the main Python script that collectively runs lifecycle scripts setting up Slurm on the HyperPod cluster while being provisioned. This script reads in `provisioning_parameters.json` and `resource_config.json` from the paths that are specified or identified in `on_create.sh`, passes the relevant information to each lifecycle script, and then runs the lifecycle scripts in order.

      Lifecycle scripts are a set of scripts that you have complete flexibility to customize to install software packages and set up necessary or custom configurations during cluster creation, such as setting up Slurm, creating users, and installing Conda or Docker. The sample [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py) script is prepared to run other base lifecycle scripts in the repository, such as launching Slurm daemons ([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/start_slurm.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/start_slurm.sh)), mounting Amazon FSx for Lustre ([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/mount_fsx.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/mount_fsx.sh)), and setting up MariaDB accounting ([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_mariadb_accounting.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_mariadb_accounting.sh)) and RDS accounting ([https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_rds_accounting.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_rds_accounting.sh)). You can also add more scripts, package them under the same directory, and add code lines to `lifecycle_script.py` to let HyperPod run the scripts. For more information about the base lifecycle scripts, see also [3.1 Lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod#31-lifecycle-scripts) in the *Awsome Distributed Training GitHub repository*.
**Note**  
HyperPod runs [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami) on each instance of a cluster, and the AMI has pre-installed software packages that are compatible with each other and with HyperPod functionalities. If you reinstall any of the pre-installed packages, you are responsible for installing compatible packages, and note that some HyperPod functionalities might not work as expected.

      In addition to the default setups, more scripts for installing the following software are available under the [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils) folder. The `lifecycle_script.py` file already includes code lines for running the installation scripts, so see the following items to find those lines and uncomment them to activate the corresponding installations.

      1. The following code lines are for installing [Docker](https://www.docker.com/), [Enroot](https://github.com/NVIDIA/enroot), and [Pyxis](https://github.com/NVIDIA/pyxis). These packages are required to run Docker containers on a Slurm cluster. 

         To enable this installation step, set the `enable_docker_enroot_pyxis` parameter to `True` in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file.

         ```
         # Install Docker/Enroot/Pyxis
         if Config.enable_docker_enroot_pyxis:
             ExecuteBashScript("./utils/install_docker.sh").run()
             ExecuteBashScript("./utils/install_enroot_pyxis.sh").run(node_type)
         ```

      1. You can integrate your HyperPod cluster with [Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html) and [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html) to export metrics about the HyperPod cluster and cluster nodes to Amazon Managed Grafana dashboards. To export metrics and use the [Slurm dashboard](https://grafana.com/grafana/dashboards/4323-slurm-dashboard/), the [NVIDIA DCGM Exporter dashboard](https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/), and the [EFA Metrics dashboard](https://grafana.com/grafana/dashboards/20579-efa-metrics-dev/) on Amazon Managed Grafana, you need to install the [Slurm exporter for Prometheus](https://github.com/vpenso/prometheus-slurm-exporter), the [NVIDIA DCGM exporter](https://github.com/NVIDIA/dcgm-exporter), and the [EFA node exporter](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md). For more information about installing the exporter packages and using Grafana dashboards on an Amazon Managed Grafana workspace, see [SageMaker HyperPod cluster resources monitoring](sagemaker-hyperpod-cluster-observability-slurm.md). 

         To enable this installation step, set the `enable_observability` parameter to `True` in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file.

         ```
         # Install metric exporting software and Prometheus for observability
         
         if Config.enable_observability:
             if node_type == SlurmNodeType.COMPUTE_NODE:
                 ExecuteBashScript("./utils/install_docker.sh").run()
                 ExecuteBashScript("./utils/install_dcgm_exporter.sh").run()
                 ExecuteBashScript("./utils/install_efa_node_exporter.sh").run()
             
             if node_type == SlurmNodeType.HEAD_NODE:
                 wait_for_scontrol()
                 ExecuteBashScript("./utils/install_docker.sh").run()
                 ExecuteBashScript("./utils/install_slurm_exporter.sh").run()
                 ExecuteBashScript("./utils/install_prometheus.sh").run()
         ```

1. Make sure that you upload all configuration files and setup scripts from **Step 2** to the S3 bucket you provide in the `CreateCluster` request in **Step 1**. For example, assume that your `create_cluster.json` has the following.

   ```
   "LifeCycleConfig": { 
   
       "SourceS3URI": "s3://sagemaker-hyperpod-lifecycle/src",
       "OnCreate": "on_create.sh"
   }
   ```

   Then, your `"s3://sagemaker-hyperpod-lifecycle/src"` should contain `on_create.sh`, `lifecycle_script.py`, `provisioning_parameters.json`, and all other setup scripts. Assume that you have prepared the files in a local folder as follows.

   ```
   └── lifecycle_files // your local folder
       ├── provisioning_parameters.json
       ├── on_create.sh
       ├── lifecycle_script.py
    └── ... // more setup scripts to be fed into lifecycle_script.py
   ```

   To upload the files, use the S3 command as follows.

   ```
   aws s3 cp --recursive ./lifecycle_scripts s3://sagemaker-hyperpod-lifecycle/src
   ```

# What particular configurations HyperPod manages in Slurm configuration files
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-what-hyperpod-overrides-in-slurm-conf"></a>

When you create a Slurm cluster on HyperPod, the HyperPod agent sets up the [https://slurm.schedmd.com/slurm.conf.html](https://slurm.schedmd.com/slurm.conf.html) and [https://slurm.schedmd.com/gres.conf.html](https://slurm.schedmd.com/gres.conf.html) files at `/opt/slurm/etc/` to manage the Slurm cluster based on your HyperPod cluster creation request and lifecycle scripts. The following list shows which specific parameters the HyperPod agent handles and overwrites. 

**Important**  
We strongly recommend that you **do not** change these parameters managed by HyperPod.
+ In [https://slurm.schedmd.com/slurm.conf.html](https://slurm.schedmd.com/slurm.conf.html), HyperPod sets up the following basic parameters: `ClusterName`, `SlurmctldHost`, `PartitionName`, and `NodeName`.

  Also, to enable the [Automatic node recovery and auto-resume](sagemaker-hyperpod-resiliency-slurm-auto-resume.md) functionality, HyperPod requires the `TaskPlugin` and `SchedulerParameters` parameters set as follows. The HyperPod agent sets up these two parameters with the required values by default.

  ```
  TaskPlugin=task/none
  SchedulerParameters=permit_job_expansion
  ```
+ In [https://slurm.schedmd.com/gres.conf.html](https://slurm.schedmd.com/gres.conf.html), HyperPod manages `NodeName` for GPU nodes.
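If you maintain custom tooling around your cluster, you might sanity-check that the two HyperPod-managed auto-resume parameters are still in place before making unrelated edits to `slurm.conf`. The following is an illustrative sketch, not part of HyperPod itself, and it ignores multi-valued `SchedulerParameters` lists for simplicity:

```python
# Sketch: confirm the two parameters HyperPod requires for auto-resume are
# present in slurm.conf text. Simplification: assumes single-valued parameters.
REQUIRED = {
    "TaskPlugin": "task/none",
    "SchedulerParameters": "permit_job_expansion",
}

def hyperpod_params_ok(slurm_conf_text: str) -> bool:
    found = {}
    for line in slurm_conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if "=" in line:
            key, value = line.split("=", 1)
            found[key.strip()] = value.strip()
    return all(found.get(k) == v for k, v in REQUIRED.items())
```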

# Slurm log rotations
<a name="sagemaker-hyperpod-slurm-log-rotation"></a>

SageMaker HyperPod provides automatic log rotation for Slurm daemon logs to help manage disk space usage and maintain system performance. Log rotation is crucial for preventing logs from consuming excessive disk space and ensuring optimal system operation by automatically archiving and removing old log files while maintaining recent logging information. Slurm log rotations are enabled by default when you create a cluster.

## How log rotation works
<a name="sagemaker-hyperpod-slurm-log-rotation-how-it-works"></a>

When enabled, the log rotation configuration:
+ Monitors all Slurm log files with the extension `.log` located in the `/var/log/slurm/` folder on the controller, login, and compute nodes.
+ Rotates logs when they reach 50 MB in size.
+ Maintains up to two rotated log files before deleting them.
+ Sends SIGUSR2 signal to Slurm daemons (`slurmctld`, `slurmd`, and `slurmdbd`) after rotation.

## List of log files rotated
<a name="sagemaker-hyperpod-slurm-log-rotation-log-files-list"></a>

Slurm logs are located in the `/var/log/slurm/` directory. Log rotation is enabled for all files that match `/var/log/slurm/*.log`. When rotation occurs, rotated files have numerical suffixes (such as `slurmd.log.1`). The following list is not exhaustive but shows some of the critical log files that rotate automatically:
+ `/var/log/slurm/slurmctld.log`
+ `/var/log/slurm/slurmd.log`
+ `/var/log/slurm/slurmdbd.log`
+ `/var/log/slurm/slurmrestd.log`

## Enable or disable log rotation
<a name="sagemaker-hyperpod-slurm-log-rotation-enable-disable"></a>

You can control the log rotation feature using the `enable_slurm_log_rotation` parameter in the `config.py` script of your cluster's lifecycle scripts, as shown in the following example:

```
class Config:
    # Set false if you want to disable log rotation of Slurm daemon logs
    enable_slurm_log_rotation = True  # Default value
```

To disable log rotation, set the parameter to `False`, as shown in the following example:

```
enable_slurm_log_rotation = False
```

**Note**  
Lifecycle scripts run on all Slurm nodes (controller, login, and compute nodes) during cluster creation, and also on new nodes when they are added to the cluster. To update the log rotation configuration after cluster creation, you must edit it manually; the configuration is stored in `/etc/logrotate.d/sagemaker-hyperpod-slurm`. To disable log rotation on a running node, delete the `sagemaker-hyperpod-slurm` file or comment out its contents by adding `#` at the start of each line. We recommend keeping log rotation enabled to prevent log files from consuming excessive disk space.

## Default log rotation settings
<a name="sagemaker-hyperpod-slurm-log-rotation-default-settings"></a>

The following settings are configured automatically for each log file rotated:


| Setting | Value | Description | 
| --- | --- | --- | 
| rotate | 2 | Number of rotated log files to keep | 
| size | 50 MB | Maximum size before rotation | 
| copytruncate | enabled | Copies and truncates the original log file | 
| compress | disabled | Rotated logs are not compressed | 
| missingok | enabled | No error if log file is missing | 
| notifempty | enabled | Doesn't rotate empty files | 
| noolddir | enabled | Rotated files stay in same directory | 
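Taken together, these settings correspond to a `logrotate` stanza roughly like the following. This is an illustration of the defaults in the table, not the exact contents of the file HyperPod writes to `/etc/logrotate.d/sagemaker-hyperpod-slurm`:

```
/var/log/slurm/*.log {
    size 50M
    rotate 2
    copytruncate
    missingok
    notifempty
    noolddir
}
```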

# Mounting Amazon FSx for Lustre and Amazon FSx for OpenZFS to a HyperPod cluster
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-setup-with-fsx"></a>

To mount an Amazon FSx for Lustre shared file system to your HyperPod cluster, set up the following.

1. Use your Amazon VPC. 

   1. For HyperPod cluster instances to communicate within your VPC, make sure that you attach the additional permissions described in [Setting up SageMaker HyperPod with a custom Amazon VPC](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-optional-vpc) to the IAM role for SageMaker HyperPod. 

   1. In `create_cluster.json`, include the following VPC information.

      ```
      "VpcConfig": { 
          "SecurityGroupIds": [ "string" ],
          "Subnets": [ "string" ]
      }
      ```

      For more tips about setting up Amazon VPC, see [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md).

1. To finish configuring Slurm with Amazon FSx for Lustre, use one of the following approaches. You can find the Amazon FSx information either in the Amazon FSx for Lustre console in your account or by running the AWS CLI command `aws fsx describe-file-systems`.

   **Option A: API-Driven Configuration (Recommended)**

   Specify the Amazon FSx configuration directly in the `CreateCluster` API payload using `InstanceStorageConfigs` within each instance group. This approach supports both FSx for Lustre and FSx for OpenZFS, and allows per-instance-group FSx configuration.

   ```
   "InstanceStorageConfigs": [
       {
           "FsxLustreConfig": {
               "DnsName": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
               "MountPath": "/fsx",
               "MountName": "1abcdefg"
           }
       }
   ]
   ```

   For FSx for OpenZFS, use `FsxOpenZfsConfig` instead:

   ```
   "InstanceStorageConfigs": [
       {
           "FsxOpenZfsConfig": {
               "DnsName": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
               "MountPath": "/fsx-openzfs"
           }
       }
   ]
   ```

   For more details, see [Getting started with SageMaker HyperPod using the AWS CLI](sagemaker-hyperpod-quickstart.md).

   **Option B: Legacy Configuration**

   Specify the Amazon FSx DNS name and Amazon FSx mount name in `provisioning_parameters.json` as shown in the figure in the [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) section.

   ```
   "fsx_dns_name": "fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
   "fsx_mountname": "1abcdefg"
   ```
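As an illustration of Option A, the request fragments above can also be composed programmatically, for example when building a `CreateCluster` call with an SDK. The helper names below are hypothetical, and the DNS name, mount path, and mount name values are placeholders; the key names follow the JSON examples in this section:

```python
# Sketch: compose InstanceStorageConfigs entries matching the JSON examples
# above. Helper names are hypothetical; values are placeholders.
def fsx_lustre_storage(dns_name: str, mount_path: str, mount_name: str) -> dict:
    return {"FsxLustreConfig": {
        "DnsName": dns_name, "MountPath": mount_path, "MountName": mount_name}}

def fsx_openzfs_storage(dns_name: str, mount_path: str) -> dict:
    return {"FsxOpenZfsConfig": {"DnsName": dns_name, "MountPath": mount_path}}

instance_storage_configs = [
    fsx_lustre_storage("fs-12345678a90b01cde.fsx.us-west-2.amazonaws.com",
                       "/fsx", "1abcdefg"),
]
```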

# Validating the JSON configuration files before creating a Slurm cluster on HyperPod
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-json-files"></a>

To validate the JSON configuration files before submitting a cluster creation request, use the configuration validation script [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/validate-config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/validate-config.py). This script parses and compares your HyperPod cluster configuration JSON file and Slurm configuration JSON file, and identifies any resource misconfiguration between the two files as well as across your Amazon EC2, Amazon VPC, and Amazon FSx resources. For example, to validate the `create_cluster.json` and `provisioning_parameters.json` files from the [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) section, run the validation script as follows.

```
python3 validate-config.py --cluster-config create_cluster.json --provisioning-parameters provisioning_parameters.json
```

The following is an example output of a successful validation.

```
✔️  Validated instance group name worker-group-1 is correct ...

✔️  Validated subnet subnet-012345abcdef67890 ...
✔️  Validated security group sg-012345abcdef67890 ingress rules ...
✔️  Validated security group sg-012345abcdef67890 egress rules ...
✔️  Validated FSx Lustre DNS name fs-012345abcdef67890.fsx.us-east-1.amazonaws.com
✔️  Validated FSx Lustre mount name abcdefgh
✅ Cluster Validation succeeded
```

# Validating runtime before running production workloads on a HyperPod Slurm cluster
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-validate-runtime"></a>

To check the runtime before running any production workloads on a Slurm cluster on HyperPod, use the runtime validation script [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/hyperpod-precheck.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/hyperpod-precheck.py). This script checks if the Slurm cluster has all packages installed for running Docker, if the cluster has a properly mounted FSx for Lustre file system and a user directory sharing the file system, and if the Slurm daemon is running on all compute nodes.

To run the script on multiple nodes at once, use `srun` as shown in the following example, which runs the script on a Slurm cluster of 8 nodes.

```
# The following command runs on 8 nodes
srun -N 8 python3 hyperpod-precheck.py
```

**Note**  
To learn more about the validation script such as what runtime validation functions the script provides and guidelines to resolve issues that don't pass the validations, see [Runtime validation before running workloads](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod#35-runtime-validation-before-running-workloads) in the *Awsome Distributed Training GitHub repository*.

# Developing lifecycle scripts interactively on a HyperPod cluster node
<a name="sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-develop-lifecycle-scripts"></a>

This section explains how you can interactively develop lifecycle scripts without repeatedly creating and deleting a HyperPod cluster.

1. Create a HyperPod cluster with the base lifecycle scripts.

1. Log in to a cluster node.

1. Develop a script (`configure_xyz.sh`) by editing and running it repeatedly on the node.

   1. HyperPod runs lifecycle scripts as the root user, so we recommend that you also run `configure_xyz.sh` as the root user while developing it. This makes sure the script is tested under the same conditions as when HyperPod runs it.

1. Integrate the script into `lifecycle_script.py` by adding a code line similar to the following.

   ```
   ExecuteBashScript("./utils/configure_xyz.sh").run()
   ```

1. Upload the updated lifecycle scripts to the S3 bucket that you initially used for uploading the base lifecycle scripts.

1. Test the integrated version of `lifecycle_script.py` by creating a new HyperPod cluster. You can also use manual instance replacement to test the updated lifecycle scripts by creating new instances. For detailed instructions, see [Manually replace a node](https://docs.aws.amazon.com//sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance.html#sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-replace). Note that only worker nodes are replaceable.
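Because lifecycle scripts also run on newly added or replaced nodes, it helps to make each custom script idempotent and fail fast so that errors surface in `lifecycle_script.py`. The following is a hypothetical skeleton for `configure_xyz.sh`; the marker path and messages are examples only, not part of HyperPod:

```shell
#!/bin/bash
# Hypothetical configure_xyz.sh skeleton. Fail fast on errors, and skip
# work that has already been applied so re-runs are safe.
set -euo pipefail

MARKER=/tmp/.configure_xyz_done   # example marker path

if [ -f "$MARKER" ]; then
    echo "configure_xyz: already applied, skipping"
else
    # ... your actual configuration steps go here ...
    touch "$MARKER"
    echo "configure_xyz: done"
fi
```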

# SageMaker HyperPod multi-head node support
<a name="sagemaker-hyperpod-multihead-slurm"></a>

You can create multiple controller (head) nodes in a single SageMaker HyperPod Slurm cluster, with one serving as the primary controller node and the others serving as backup controller nodes. The primary controller node is responsible for controlling the compute (worker) nodes and handling Slurm operations. The backup controller nodes constantly monitor the primary controller node. If the primary controller node fails or becomes unresponsive, one of the backup controller nodes will automatically take over as the new primary controller node.

Configuring multiple controller nodes in SageMaker HyperPod Slurm clusters provides several key benefits. It removes the single point of failure of a lone controller node by providing redundant backup controller nodes, enables automatic failover to a backup controller node for faster recovery, and allows you to manage your own accounting database and Slurm configuration independently.

## Key concepts
<a name="sagemaker-hyperpod-multihead-slurm-concepts"></a>

The following provides details about the concepts related to SageMaker HyperPod multiple controller (head) nodes support for Slurm clusters.

**Controller node**

A controller node is an Amazon EC2 instance within a cluster that runs critical Slurm services for managing and coordinating the cluster's operations. Specifically, it hosts the [Slurm controller daemon (slurmctld)](https://slurm.schedmd.com/slurmctld.html) and the [Slurm database daemon (slurmdbd)](https://slurm.schedmd.com/slurmdbd.html). A controller node is also known as a head node.

**Primary controller node**

A primary controller node is the active controller node that currently manages a Slurm cluster. It is identified by Slurm as the primary controller responsible for managing the cluster. The primary controller node receives and executes commands from users to control and allocate resources on the compute nodes for running jobs.

**Backup controller node**

A backup controller node is an inactive and standby controller node in a Slurm cluster. It is identified by Slurm as a backup controller node that is not currently managing the cluster. The backup controller node runs the [Slurm controller daemon (slurmctld)](https://slurm.schedmd.com/slurmctld.html) in standby mode. Any controller commands executed on the backup controller nodes will be propagated to the primary controller node for execution. Its primary purpose is to continuously monitor the primary controller node and take over its responsibilities if the primary controller node fails or becomes unresponsive.

**Compute node**

A compute node is an Amazon EC2 instance within a cluster that hosts the [Slurm worker daemon (slurmd)](https://slurm.schedmd.com/slurmd.html). The compute node's primary function is to execute jobs assigned by the [Slurm controller daemon (slurmctld)](https://slurm.schedmd.com/slurmctld.html) running on the primary controller node. When a job is scheduled, the compute node receives instructions from the controller daemon to carry out the necessary tasks and computations for that job within the node itself. A compute node is also known as a worker node.

## How it works
<a name="sagemaker-hyperpod-multihead-slurm-how"></a>

The following diagram illustrates how different AWS services work together to support the multiple controller (head) nodes architecture for SageMaker HyperPod Slurm clusters.

![\[SageMaker HyperPod multi-head nodes architecture diagram\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-multihead-architecture.png)


**AWS services that work together to support the SageMaker HyperPod multiple controller nodes architecture**  

| Service | Description | 
| --- | --- | 
| IAM (AWS Identity and Access Management) | Defines two IAM roles to control the access permissions: one role for the compute node instance group and the other for the controller node instance group. | 
| Amazon RDS for MariaDB | Stores accounting data for Slurm, which holds job records and metering data. | 
| AWS Secrets Manager | Stores and manages credentials that can be accessed by Amazon FSx for Lustre. | 
| Amazon FSx for Lustre  | Stores Slurm configurations and runtime state. | 
| Amazon VPC | Provides an isolated network environment where the HyperPod cluster and its resources are deployed. | 
| Amazon SNS  | Sends notifications to administrators when there are status changes (Slurm controller is ON or OFF) related to the primary controller (head) node. | 

The HyperPod cluster itself consists of controller nodes (primary and backup) and compute nodes. The controller nodes run the Slurm controller daemon (`slurmctld`) and Slurm database daemon (`slurmdbd`), which manage and monitor the workload across the compute nodes.

The controller nodes access Slurm configurations and runtime state stored in the Amazon FSx for Lustre file system. The Slurm accounting data is stored in the Amazon RDS for MariaDB database. AWS Secrets Manager provides secure access to the database credentials for the controller nodes.

If there is a status change (Slurm controller is `ON` or `OFF`) in the Slurm controller nodes, Amazon SNS sends notifications to the admin for further action.

This multiple controller nodes architecture eliminates the single point of failure of a single controller (head) node, enables fast and automatic failover recovery, and gives you control over the Slurm accounting database and configurations.

# Setting up multiple controller nodes for a SageMaker HyperPod Slurm cluster
<a name="sagemaker-hyperpod-multihead-slurm-setup"></a>

This topic explains how to configure multiple controller (head) nodes in a SageMaker HyperPod Slurm cluster using lifecycle scripts. Before you start, review the prerequisites listed in [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md) and familiarize yourself with the lifecycle scripts in [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md). The instructions in this topic use AWS CLI commands in an Amazon Linux environment. Note that the environment variables used in these commands are available only in the current session unless explicitly preserved.

**Topics**
+ [Provisioning resources using CloudFormation stacks](sagemaker-hyperpod-multihead-slurm-cfn.md)
+ [Creating and attaching an IAM policy](sagemaker-hyperpod-multihead-slurm-iam.md)
+ [Preparing and uploading lifecycle scripts](sagemaker-hyperpod-multihead-slurm-scripts.md)
+ [Creating a SageMaker HyperPod cluster](sagemaker-hyperpod-multihead-slurm-create.md)
+ [Considering important notes](sagemaker-hyperpod-multihead-slurm-notes.md)
+ [Reviewing environment variables reference](sagemaker-hyperpod-multihead-slurm-variables-reference.md)

# Provisioning resources using CloudFormation stacks
<a name="sagemaker-hyperpod-multihead-slurm-cfn"></a>

To set up multiple controller nodes in a HyperPod Slurm cluster, provision AWS resources through two CloudFormation stacks: [Provision basic resources](#sagemaker-hyperpod-multihead-slurm-cfn-basic) and [Provision additional resources to support multiple controller nodes](#sagemaker-hyperpod-multihead-slurm-cfn-multihead).

## Provision basic resources
<a name="sagemaker-hyperpod-multihead-slurm-cfn-basic"></a>

Follow these steps to provision basic resources for your Amazon SageMaker HyperPod Slurm cluster.

1. Download the [sagemaker-hyperpod.yaml](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/sagemaker-hyperpod.yaml) template file to your machine. This YAML file is a CloudFormation template that defines the following resources to create for your Slurm cluster.
   + An execution IAM role for the compute node instance group
   + An Amazon S3 bucket to store the lifecycle scripts
   + Public and private subnets (private subnets have internet access through NAT gateways)
   + Internet Gateway/NAT gateways
   + Two Amazon EC2 security groups
   + An Amazon FSx volume to store configuration files

1. Run the following CLI command to create a CloudFormation stack named `sagemaker-hyperpod`. Define the Availability Zone (AZ) IDs for your cluster in `PrimarySubnetAZ` and `BackupSubnetAZ`. For example, *use1-az4* is an AZ ID for an Availability Zone in the `us-east-1` Region. For more information, see [Availability Zone IDs](https://docs.aws.amazon.com//ram/latest/userguide/working-with-az-ids.html) and [Setting up SageMaker HyperPod clusters across multiple AZs](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-multiple-availability-zones).

   ```
   aws cloudformation deploy \
   --template-file /path_to_template/sagemaker-hyperpod.yaml \
   --stack-name sagemaker-hyperpod \
   --parameter-overrides PrimarySubnetAZ=use1-az4 BackupSubnetAZ=use1-az1 \
   --capabilities CAPABILITY_IAM
   ```

   For more information, see [deploy](https://docs.aws.amazon.com//cli/latest/reference/cloudformation/deploy/) from the AWS Command Line Interface Reference. The stack creation can take a few minutes to complete. When it's complete, you will see the following in your command line interface.

   ```
   Waiting for changeset to be created..
   Waiting for stack create/update to complete
   Successfully created/updated stack - sagemaker-hyperpod
   ```

1. (Optional) Verify the stack in the [CloudFormation console](https://console.aws.amazon.com/cloudformation/home).
   + From the left navigation pane, choose **Stacks**.
   + On the **Stacks** page, find and choose **sagemaker-hyperpod**.
   + Choose tabs such as **Resources** and **Outputs** to review the resources and outputs.

1. Create environment variables from the stack (`sagemaker-hyperpod`) outputs. You will use values of these variables to [Provision additional resources to support multiple controller nodes](#sagemaker-hyperpod-multihead-slurm-cfn-multihead).

   ```
   source .env
   PRIMARY_SUBNET=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`PrimaryPrivateSubnet`].OutputValue' --output text)
   BACKUP_SUBNET=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`BackupPrivateSubnet`].OutputValue' --output text)
   EMAIL=$(bash -c 'read -p "INPUT YOUR SNSSubEmailAddress HERE: " && echo $REPLY')
   DB_USER_NAME=$(bash -c 'read -p "INPUT YOUR DB_USER_NAME HERE: " && echo $REPLY')
   SECURITY_GROUP=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`SecurityGroup`].OutputValue' --output text)
   ROOT_BUCKET_NAME=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`AmazonS3BucketName`].OutputValue' --output text)
   SLURM_FSX_DNS_NAME=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`FSxLustreFilesystemDNSname`].OutputValue' --output text)
   SLURM_FSX_MOUNT_NAME=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`FSxLustreFilesystemMountname`].OutputValue' --output text)
   COMPUTE_NODE_ROLE=$(aws --region $REGION cloudformation describe-stacks --stack-name $SAGEMAKER_STACK_NAME --query 'Stacks[0].Outputs[?OutputKey==`AmazonSagemakerClusterExecutionRoleArn`].OutputValue' --output text)
   ```

   When you see prompts asking for your email address and database user name, enter values like the following.

   ```
   INPUT YOUR SNSSubEmailAddress HERE: Email_address_to_receive_SNS_notifications
   INPUT YOUR DB_USER_NAME HERE: Database_user_name_you_define
   ```

   To verify variable values, use the `echo` command as follows.

   ```
   echo $REGION
   us-east-1
   ```

## Provision additional resources to support multiple controller nodes
<a name="sagemaker-hyperpod-multihead-slurm-cfn-multihead"></a>

Follow these steps to provision additional resources for your Amazon SageMaker HyperPod Slurm cluster with multiple controller nodes.

1. Download the [sagemaker-hyperpod-slurm-multi-headnode.yaml](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/sagemaker-hyperpod-slurm-multi-headnode.yaml) template file to your machine. This second YAML file is a CloudFormation template that defines the following additional resources to support multiple controller nodes in your Slurm cluster.
   + An execution IAM role for the controller node instance group
   + An Amazon RDS for MariaDB instance
   + An Amazon SNS topic and subscription
   + AWS Secrets Manager credentials for Amazon RDS for MariaDB

1. Run the following CLI command to create a CloudFormation stack named `sagemaker-hyperpod-mh`. This second stack uses the CloudFormation template to create additional AWS resources to support the multiple controller nodes architecture.

   ```
   aws cloudformation deploy \
   --template-file /path_to_template/sagemaker-hyperpod-slurm-multi-headnode.yaml \
   --stack-name sagemaker-hyperpod-mh \
   --parameter-overrides \
   SlurmDBSecurityGroupId=$SECURITY_GROUP \
   SlurmDBSubnetGroupId1=$PRIMARY_SUBNET \
   SlurmDBSubnetGroupId2=$BACKUP_SUBNET \
   SNSSubEmailAddress=$EMAIL \
   SlurmDBUsername=$DB_USER_NAME \
   --capabilities CAPABILITY_NAMED_IAM
   ```

   For more information, see [deploy](https://docs.aws.amazon.com//cli/latest/reference/cloudformation/deploy/) from the AWS Command Line Interface Reference. The stack creation can take a few minutes to complete. When it's complete, you will see the following in your command line interface.

   ```
   Waiting for changeset to be created..
   Waiting for stack create/update to complete
   Successfully created/updated stack - sagemaker-hyperpod-mh
   ```

1. (Optional) Verify the stack in the [CloudFormation console](https://console.aws.amazon.com/cloudformation/home).
   + From the left navigation pane, choose **Stacks**.
   + On the **Stacks** page, find and choose **sagemaker-hyperpod-mh**.
   + Choose tabs such as **Resources** and **Outputs** to review the resources and outputs.

1. Create environment variables from the stack (`sagemaker-hyperpod-mh`) outputs. You will use values of these variables to update the configuration file (`provisioning_parameters.json`) in [Preparing and uploading lifecycle scripts](sagemaker-hyperpod-multihead-slurm-scripts.md).

   ```
   source .env
   SLURM_DB_ENDPOINT_ADDRESS=$(aws --region $REGION cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmDBEndpointAddress`].OutputValue' --output text)
   SLURM_DB_SECRET_ARN=$(aws --region $REGION cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmDBSecretArn`].OutputValue' --output text)
   SLURM_EXECUTION_ROLE_ARN=$(aws --region $REGION cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmExecutionRoleArn`].OutputValue' --output text)
   SLURM_SNS_FAILOVER_TOPIC_ARN=$(aws --region $REGION cloudformation describe-stacks --stack-name $MULTI_HEAD_SLURM_STACK --query 'Stacks[0].Outputs[?OutputKey==`SlurmFailOverSNSTopicArn`].OutputValue' --output text)
   ```

# Creating and attaching an IAM policy
<a name="sagemaker-hyperpod-multihead-slurm-iam"></a>

This section explains how to create an IAM policy and attach it to the execution role you created in [Provision additional resources to support multiple controller nodes](sagemaker-hyperpod-multihead-slurm-cfn.md#sagemaker-hyperpod-multihead-slurm-cfn-multihead).

1. Download the [IAM policy example](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/1.AmazonSageMakerClustersExecutionRolePolicy.json) to your machine from the GitHub repository.

1. Create an IAM policy with the downloaded example, using the [create-policy](https://docs.aws.amazon.com//cli/latest/reference/iam/create-policy.html) CLI command.

   ```
   aws --region us-east-1 iam create-policy \
       --policy-name AmazonSagemakerExecutionPolicy \
       --policy-document file://1.AmazonSageMakerClustersExecutionRolePolicy.json
   ```

   Example output of the command.

   ```
   {
       "Policy": {
           "PolicyName": "AmazonSagemakerExecutionPolicy",
           "PolicyId": "ANPAXISIWY5UYZM7WJR4W",
           "Arn": "arn:aws:iam::111122223333:policy/AmazonSagemakerExecutionPolicy",
           "Path": "/",
           "DefaultVersionId": "v1",
           "AttachmentCount": 0,
           "PermissionsBoundaryUsageCount": 0,
           "IsAttachable": true,
           "CreateDate": "2025-01-22T20:01:21+00:00",
           "UpdateDate": "2025-01-22T20:01:21+00:00"
       }
   }
   ```

1. Attach the policy `AmazonSagemakerExecutionPolicy` to the Slurm execution role you created in [Provision additional resources to support multiple controller nodes](sagemaker-hyperpod-multihead-slurm-cfn.md#sagemaker-hyperpod-multihead-slurm-cfn-multihead), using the [attach-role-policy](https://docs.aws.amazon.com//cli/latest/reference/iam/attach-role-policy.html) CLI command.

   ```
   aws --region us-east-1 iam attach-role-policy \
       --role-name AmazonSagemakerExecutionRole \
       --policy-arn arn:aws:iam::111122223333:policy/AmazonSagemakerExecutionPolicy
   ```

   This command doesn't produce any output.

   (Optional) If you use environment variables, here are the example commands.
   + To get the role name and policy name 

     ```
      POLICY=$(aws --region $REGION iam list-policies --query 'Policies[?PolicyName==`AmazonSagemakerExecutionPolicy`].Arn' --output text)
      ROLENAME=$(aws --region $REGION iam list-roles --query "Roles[?Arn=='${SLURM_EXECUTION_ROLE_ARN}'].RoleName" --output text)
     ```
   + To attach the policy

     ```
     aws  --region us-east-1 iam attach-role-policy \
          --role-name $ROLENAME --policy-arn $POLICY
     ```
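Because an IAM role name is the final path component of its ARN, you can also derive `ROLENAME` locally without an extra IAM call. The following is a minimal sketch; the ARN below is the placeholder value used on this page.

```shell
# Derive the role name from the role ARN by stripping everything
# up to and including the last '/'.
SLURM_EXECUTION_ROLE_ARN="arn:aws:iam::111122223333:role/AmazonSagemakerExecutionRole"
ROLENAME="${SLURM_EXECUTION_ROLE_ARN##*/}"
echo "$ROLENAME"   # prints: AmazonSagemakerExecutionRole
```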

For more information, see [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod).

# Preparing and uploading lifecycle scripts
<a name="sagemaker-hyperpod-multihead-slurm-scripts"></a>

After creating all the required resources, you'll need to set up [lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts) for your SageMaker HyperPod cluster. These [lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts) provide a [base configuration](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config) you can use to create a basic HyperPod Slurm cluster.

## Prepare the lifecycle scripts
<a name="sagemaker-hyperpod-multihead-slurm-prepare-scripts"></a>

Follow these steps to get the lifecycle scripts.

1. Download the [lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts) from the GitHub repository to your machine.

1. Upload the [lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts) to the Amazon S3 bucket you created in [Provision basic resources](sagemaker-hyperpod-multihead-slurm-cfn.md#sagemaker-hyperpod-multihead-slurm-cfn-basic), using the [cp](https://docs.aws.amazon.com//cli/latest/reference/s3/cp.html) CLI command.

   ```
   aws s3 cp --recursive LifecycleScripts/base-config s3://${ROOT_BUCKET_NAME}/LifeCycleScripts/base-config
   ```

## Create configuration file
<a name="sagemaker-hyperpod-multihead-slurm-update-config-file"></a>

Follow these steps to create the configuration file and upload it to the same Amazon S3 bucket where you store the lifecycle scripts.

1. Create a configuration file named `provisioning_parameters.json` with the following configuration. The `slurm_sns_arn` field is optional; if you don't provide it, HyperPod doesn't set up Amazon SNS notifications.

   ```
   cat <<EOF > /tmp/provisioning_parameters.json
   {
     "version": "1.0.0",
     "workload_manager": "slurm",
     "controller_group": "$CONTOLLER_IG_NAME",
     "login_group": "my-login-group",
     "worker_groups": [
       {
         "instance_group_name": "$COMPUTE_IG_NAME",
         "partition_name": "dev"
       }
     ],
     "fsx_dns_name": "$SLURM_FSX_DNS_NAME",
     "fsx_mountname": "$SLURM_FSX_MOUNT_NAME",
     "slurm_configurations": {
       "slurm_database_secret_arn": "$SLURM_DB_SECRET_ARN",
       "slurm_database_endpoint": "$SLURM_DB_ENDPOINT_ADDRESS",
       "slurm_shared_directory": "/fsx",
       "slurm_database_user": "$DB_USER_NAME",
       "slurm_sns_arn": "$SLURM_SNS_FAILOVER_TOPIC_ARN"
     }
   }
   EOF
   ```

1. Upload the `provisioning_parameters.json` file to the same Amazon S3 bucket where you store the lifecycle scripts.

   ```
   aws s3 cp /tmp/provisioning_parameters.json s3://${ROOT_BUCKET_NAME}/LifeCycleScripts/base-config/provisioning_parameters.json
   ```
**Note**  
If you are using API-driven configuration, the `provisioning_parameters.json` file is not required. With API-driven configuration, you define Slurm node types, partitions, and FSx mounting directly in the CreateCluster API payload. For details, see [Getting started with SageMaker HyperPod using the AWS CLI](smcluster-getting-started-slurm-cli.md).
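Because the heredoc interpolates several environment variables, a stray character in one of them can produce malformed JSON. Before uploading, you can validate the file locally. The following sketch validates a sample file; the values shown are illustrative placeholders, and in practice you would point `json.tool` at your real `/tmp/provisioning_parameters.json`.

```shell
# Write a sample provisioning_parameters.json and confirm that it parses as JSON.
cat <<'EOF' > /tmp/provisioning_parameters_sample.json
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-group",
  "worker_groups": [
    { "instance_group_name": "worker-group-1", "partition_name": "dev" }
  ]
}
EOF
python3 -m json.tool /tmp/provisioning_parameters_sample.json > /dev/null && echo "valid JSON"
```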

## Verify files in Amazon S3 bucket
<a name="sagemaker-hyperpod-multihead-slurm-verify-s3"></a>

After you upload all the lifecycle scripts and the `provisioning_parameters.json` file, your Amazon S3 bucket should look like the following.

![\[Image showing all the lifecycle scripts uploaded to the Amazon S3 bucket in the Amazon Simple Storage Service console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-lifecycle-scripts-s3.png)


For more information, see [Start with base lifecycle scripts provided by HyperPod](https://docs.aws.amazon.com//sagemaker/latest/dg/sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.html).

# Creating a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-multihead-slurm-create"></a>

After setting up all the required resources and uploading the scripts to the Amazon S3 bucket, you can create a cluster.

1. To create a cluster, run the [create-cluster](https://docs.aws.amazon.com//cli/latest/reference/sagemaker/create-cluster.html) AWS CLI command. The creation process can take up to 15 minutes to complete.

   ```
   aws --region $REGION sagemaker create-cluster \
       --cluster-name $HP_CLUSTER_NAME \
       --vpc-config '{
           "SecurityGroupIds":["'$SECURITY_GROUP'"],
           "Subnets":["'$PRIMARY_SUBNET'", "'$BACKUP_SUBNET'"]
       }' \
       --instance-groups '[{                  
       "InstanceGroupName": "'$CONTOLLER_IG_NAME'",
       "InstanceType": "ml.t3.medium",
       "InstanceCount": 2,
       "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$ROOT_BUCKET_NAME'/LifeCycleScripts/base-config",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "'$SLURM_EXECUTION_ROLE_ARN'",
       "ThreadsPerCore": 1
   },
   {
       "InstanceGroupName": "'$COMPUTE_IG_NAME'",          
       "InstanceType": "ml.c5.xlarge",
       "InstanceCount": 2,
       "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$ROOT_BUCKET_NAME'/LifeCycleScripts/base-config",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "'$COMPUTE_NODE_ROLE'",
       "ThreadsPerCore": 1
   }]'
   ```

   After successful execution, the command returns the cluster ARN like the following.

   ```
   {
       "ClusterArn": "arn:aws:sagemaker:us-east-1:111122223333:cluster/cluster_id"
   }
   ```

1. (Optional) To check the status of your cluster, you can use the SageMaker AI console ([https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/)). From the left navigation, choose **HyperPod Clusters**, then choose **Cluster Management**. Choose a cluster name to open the cluster details page. If your cluster is created successfully, you will see the cluster status is **InService**.  
![\[Image showing a HyperPod Slurm cluster with multiple controller nodes in the Amazon SageMaker AI console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-lifecycle-multihead-cluster.png)
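The cluster ID at the end of the returned ARN is what you use later to connect to cluster nodes through AWS Systems Manager. The following is a minimal sketch that extracts it from the example ARN shown earlier.

```shell
# Extract the cluster ID (the final path component) from the cluster ARN.
CLUSTER_ARN="arn:aws:sagemaker:us-east-1:111122223333:cluster/cluster_id"
CLUSTER_ID="${CLUSTER_ARN##*/}"
echo "$CLUSTER_ID"   # prints: cluster_id
```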

# Considering important notes
<a name="sagemaker-hyperpod-multihead-slurm-notes"></a>

This section provides several important notes that you might find helpful. 

1. To migrate to a multi-controller Slurm cluster, complete these steps.

   1. Follow the instructions in [Provisioning resources using CloudFormation stacks](sagemaker-hyperpod-multihead-slurm-cfn.md) to provision all the required resources.

   1. Follow the instructions in [Preparing and uploading lifecycle scripts](sagemaker-hyperpod-multihead-slurm-scripts.md) to upload the updated lifecycle scripts. When updating the `provisioning_parameters.json` file, move your existing controller group to the `worker_groups` section, and add a new controller group name in the `controller_group` section.

   1. Run the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) API call to create a new controller group and keep the original compute instance groups and controller group.

1. To scale down the number of controller nodes, use the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) CLI command. For each controller instance group, the minimum number of controller nodes you can scale down to is 1. This means that you cannot scale down the number of controller nodes to 0.
**Important**  
For clusters created before Jan 24, 2025, you must first update your cluster software using the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API before running the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) CLI command.

   The following is an example CLI command to scale down the number of controller nodes.

   ```
   aws sagemaker update-cluster \
       --cluster-name my_cluster \
       --instance-groups '[{                  
       "InstanceGroupName": "controller_ig_name",
       "InstanceType": "ml.t3.medium",
       "InstanceCount": 3,
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://amzn-s3-demo-bucket1",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "slurm_execution_role_arn",
       "ThreadsPerCore": 1
   },
   {
       "InstanceGroupName": "compute-ig_name",       
       "InstanceType": "ml.c5.xlarge",
       "InstanceCount": 2,
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://amzn-s3-demo-bucket1",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "compute_node_role_arn",
       "ThreadsPerCore": 1
   }]'
   ```

1. To batch delete controller nodes, use the [batch-delete-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/batch-delete-cluster-nodes.html) CLI command. You must keep at least one controller node in each controller instance group; a batch delete request that would remove all of the controller nodes in a group fails.
**Important**  
For clusters created before Jan 24, 2025, you must first update your cluster software using the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API before running the [batch-delete-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/batch-delete-cluster-nodes.html) CLI command.

   The following is an example CLI command to batch delete the controller nodes.

   ```
   aws sagemaker batch-delete-cluster-nodes --cluster-name my_cluster --node-ids instance_ids_to_delete
   ```

1. To troubleshoot your cluster creation issues, check the failure message from the cluster details page in your SageMaker AI console. You can also use CloudWatch logs to troubleshoot cluster creation issues. From the CloudWatch console, choose **Log groups**. Then, search `clusters` to see the list of log groups related to your cluster creation.  
![\[Image showing Amazon SageMaker HyperPod cluster log groups in the CloudWatch console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-lifecycle-multihead-logs.png)

# Reviewing environment variables reference
<a name="sagemaker-hyperpod-multihead-slurm-variables-reference"></a>

The following environment variables are defined and used in the tutorial of [Setting up multiple controller nodes for a SageMaker HyperPod Slurm cluster](sagemaker-hyperpod-multihead-slurm-setup.md). These environment variables are only available in the current session unless explicitly preserved. They are defined using the `$variable_name` syntax. Variables with key/value pairs represent AWS-created resources, while variables without keys are user-defined.


**Environment variables reference**  

| Variable | Description | 
| --- | --- | 
| `$BACKUP_SUBNET` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$COMPUTE_IG_NAME` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$COMPUTE_NODE_ROLE` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$CONTOLLER_IG_NAME` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$DB_USER_NAME` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$EMAIL` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$PRIMARY_SUBNET` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$POLICY` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$REGION` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$ROOT_BUCKET_NAME` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$SECURITY_GROUP` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$SLURM_DB_ENDPOINT_ADDRESS` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$SLURM_DB_SECRET_ARN` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$SLURM_EXECUTION_ROLE_ARN` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$SLURM_FSX_DNS_NAME` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$SLURM_FSX_MOUNT_NAME` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 
| `$SLURM_SNS_FAILOVER_TOPIC_ARN` |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-multihead-slurm-variables-reference.html)  | 

# Jobs on SageMaker HyperPod clusters
<a name="sagemaker-hyperpod-run-jobs-slurm"></a>

The following topics provide procedures and examples of accessing compute nodes and running ML workloads on provisioned SageMaker HyperPod clusters. Depending on how you set up the environment on your HyperPod cluster, there are many ways to run ML workloads. Additional examples are available in the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/). These topics walk you through how to log in to the provisioned HyperPod clusters and get started with running sample ML workloads.

**Tip**  
To find practical examples and solutions, see also the [SageMaker HyperPod workshop](https://catalog.workshops.aws/sagemaker-hyperpod).

**Topics**
+ [Accessing your SageMaker HyperPod cluster nodes](sagemaker-hyperpod-run-jobs-slurm-access-nodes.md)
+ [Scheduling a Slurm job on a SageMaker HyperPod cluster](sagemaker-hyperpod-run-jobs-slurm-schedule-slurm-job.md)
+ [Running Docker containers on a Slurm compute node on HyperPod](sagemaker-hyperpod-run-jobs-slurm-docker.md)
+ [Running distributed training workloads with Slurm on HyperPod](sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload.md)

# Accessing your SageMaker HyperPod cluster nodes
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes"></a>

You can access your **InService** cluster through AWS Systems Manager (SSM) by running the AWS CLI command `aws ssm start-session` with the SageMaker HyperPod cluster host name in the format `sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]`. You can retrieve the cluster ID, the instance ID, and the instance group name from the [SageMaker HyperPod console](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-view-details-of-clusters) or by running `describe-cluster` and `list-cluster-nodes` from the [AWS CLI commands for SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-list-cluster-nodes). For example, if your cluster ID is `aa11bbbbb222`, the instance group name is `controller-group`, and the instance ID is `i-111222333444555aa`, the SSM `start-session` command should be the following.

**Note**  
Granting users access to HyperPod cluster nodes allows them to install and operate user-managed software on the nodes. Ensure that you maintain the principle of least-privilege permissions for users.  
If you haven't set up AWS Systems Manager, follow the instructions provided at [Setting up AWS Systems Manager and Run As for cluster user access control](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-ssm).

```
$ aws ssm start-session \
    --target sagemaker-cluster:aa11bbbbb222_controller-group-i-111222333444555aa \
    --region us-west-2
Starting session with SessionId: s0011223344aabbccdd
root@ip-111-22-333-444:/usr/bin#
```

Note that this initially connects you as the root user. Before running jobs, switch to the `ubuntu` user by running the following command.

```
root@ip-111-22-333-444:/usr/bin# sudo su - ubuntu
ubuntu@ip-111-22-333-444:/usr/bin#
```
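The SSM target string follows a fixed pattern, so once you've retrieved the cluster ID, instance group name, and instance ID, you can assemble it in the shell. The following is a sketch using the example values above.

```shell
# Assemble the SSM target string from its three components.
CLUSTER_ID="aa11bbbbb222"
INSTANCE_GROUP="controller-group"
INSTANCE_ID="i-111222333444555aa"
TARGET="sagemaker-cluster:${CLUSTER_ID}_${INSTANCE_GROUP}-${INSTANCE_ID}"
echo "$TARGET"   # prints: sagemaker-cluster:aa11bbbbb222_controller-group-i-111222333444555aa
```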

For advanced settings for practical use of HyperPod clusters, see the following topics.

**Topics**
+ [Additional tips for accessing your SageMaker HyperPod cluster nodes](#sagemaker-hyperpod-run-jobs-slurm-access-nodes-tips)
+ [Set up a multi-user environment through the Amazon FSx shared space](#sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-fxs-shared-space)
+ [Set up a multi-user environment by integrating HyperPod clusters with Active Directory](#sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-active-directory)

## Additional tips for accessing your SageMaker HyperPod cluster nodes
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes-tips"></a>

**Use the `easy-ssh.sh` script provided by HyperPod for simplifying the connection process**

To condense the previous process into a single command, the HyperPod team provides the [easy-ssh.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh) script, which retrieves your cluster information, assembles it into the SSM command, and connects to the compute node. You don't need to look up the required HyperPod cluster information manually, because the script runs the `describe-cluster` and `list-cluster-nodes` commands and parses out the information needed to complete the SSM command. The following example commands show how to run the [easy-ssh.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh) script. If it runs successfully, you're connected to the cluster as the root user. The script also prints a code snippet for setting up SSH by adding the HyperPod cluster as a remote host through an SSM proxy. With SSH set up, you can connect your local development environment, such as Visual Studio Code, to the HyperPod cluster.

```
$ chmod +x easy-ssh.sh
$ ./easy-ssh.sh -c <node-group> <cluster-name>
Cluster id: <cluster_id>
Instance id: <instance_id>
Node Group: <node-group>
Add the following to your ~/.ssh/config to easily connect:

$ cat <<EOF >> ~/.ssh/config
Host <cluster-name>
  User ubuntu
  ProxyCommand sh -c "aws ssm start-session  --target sagemaker-cluster:<cluster_id>_<node-group>-<instance_id> --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
EOF

Add your ssh keypair and then you can do:

$ ssh <cluster-name>

aws ssm start-session --target sagemaker-cluster:<cluster_id>_<node-group>-<instance_id>

Starting session with SessionId: s0011223344aabbccdd
root@ip-111-22-333-444:/usr/bin#
```

Note that this initially connects you as the root user. Before running jobs, switch to the `ubuntu` user by running the following command.

```
root@ip-111-22-333-444:/usr/bin# sudo su - ubuntu
ubuntu@ip-111-22-333-444:/usr/bin#
```

**Set up for easy access with SSH by using the HyperPod compute node as a remote host**

To further simplify access to the compute node using SSH from a local machine, the `easy-ssh.sh` script outputs a code snippet for setting up the HyperPod cluster as a remote host, as shown in the previous section. The code snippet is auto-generated so that you can add it directly to the `~/.ssh/config` file on your local device. The following procedure shows how to set up easy access using SSH through the SSM proxy, so that you or your cluster users can run `ssh <cluster-name>` directly to connect to the HyperPod cluster node.

1. On your local device, add the HyperPod compute node with a user name as a remote host to the `~/.ssh/config` file. The following command shows how to append the auto-generated code snippet from the `easy-ssh.sh` script to the `~/.ssh/config` file. Make sure that you copy it from the auto-generated output of the `easy-ssh.sh` script that has the correct cluster information.

   ```
   $ cat <<EOF >> ~/.ssh/config
   Host <cluster-name>
     User ubuntu
     ProxyCommand sh -c "aws ssm start-session  --target sagemaker-cluster:<cluster_id>_<node-group>-<instance_id> --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
   EOF
   ```

1. On the HyperPod cluster node, add the public key on your local device to the `~/.ssh/authorized_keys` file on the HyperPod cluster node.

   1. Print the public key file on your local machine.

      ```
      $ cat ~/.ssh/id_rsa.pub
      ```

      This should return your key. Copy the output of this command. 

      (Optional) If you don't have a public key, create one by running the following command.

      ```
      $ ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
      ```

   1. Connect to the cluster node and switch to the user for whom you want to add the key. The following command is an example of accessing the node as the `ubuntu` user. Replace `ubuntu` with the user name for which you want to set up easy access with SSH.

      ```
      $ ./easy-ssh.sh -c <node-group> <cluster-name>
      $ sudo su - ubuntu
      ubuntu@ip-111-22-333-444:/usr/bin#
      ```

   1. Open the `~/.ssh/authorized_keys` file and add the public key at the end of the file.

      ```
      ubuntu@ip-111-22-333-444:/usr/bin# vim ~/.ssh/authorized_keys
      ```

After you finish setting up, you can connect to the HyperPod cluster node as the user by running a simplified SSH command as follows.

```
$ ssh <cluster-name>
ubuntu@ip-111-22-333-444:/usr/bin#
```

Also, you can use the host for remote development from an IDE on your local device, such as [Visual Studio Code Remote - SSH](https://code.visualstudio.com/docs/remote/ssh).

## Set up a multi-user environment through the Amazon FSx shared space
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-fxs-shared-space"></a>

You can use the Amazon FSx shared space to manage a multi-user environment in a Slurm cluster on SageMaker HyperPod. If you configured your Slurm cluster with Amazon FSx during HyperPod cluster creation, this is a good option for setting up workspaces for your cluster users. Create a new user and set up the user's home directory on the Amazon FSx shared file system.

**Tip**  
To allow users to access your cluster through their user name and dedicated directories, you should also associate them with IAM roles or users by tagging them as guided in **Option 2** of step 5 under the procedure **To turn on Run As support for Linux and macOS managed nodes** provided at [Turn on Run As support for Linux and macOS managed nodes](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-preferences-run-as.html) in the AWS Systems Manager User Guide. See also [Setting up AWS Systems Manager and Run As for cluster user access control](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-ssm).

**To set up a multi-user environment while creating a Slurm cluster on SageMaker HyperPod**

The SageMaker HyperPod service team provides the [add_users.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/add_users.sh) script as part of the base lifecycle script samples. 

1. Prepare a text file named `shared_users.txt` in the following format. The first column is for user names, the second column is for unique user IDs, and the third column is for the user directories in the Amazon FSx shared space.

   ```
   username1,uid1,/fsx/username1
   username2,uid2,/fsx/username2
   ...
   ```

1. Make sure that you upload the `shared_users.txt` and [add_users.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/add_users.sh) files to the S3 bucket for your HyperPod lifecycle scripts. During cluster creation, cluster update, or cluster software update, the `add_users.sh` script reads `shared_users.txt` and sets up the user directories.
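Because the user list is parsed line by line, a malformed row may cause a user to be skipped. The following is a quick local sanity check; the user names and UIDs below are hypothetical.

```shell
# Write a sample shared_users.txt and verify that every line has exactly three
# comma-separated fields: user name, UID, and FSx home directory.
cat <<'EOF' > /tmp/shared_users.txt
alice,2001,/fsx/alice
bob,2002,/fsx/bob
EOF
awk -F',' 'NF != 3 { bad = 1 } END { print (bad ? "format error" : "format OK") }' /tmp/shared_users.txt
```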

**To create new users and add them to an existing Slurm cluster running on SageMaker HyperPod**

1. On the head node, run the following command to save a script that helps create a user. Make sure that you run this with sudo permissions.

   ```
   $ cat > create-user.sh << EOL
   #!/bin/bash
   
   set -x
   
   # Prompt user to get the new user name.
   read -p "Enter the new user name, i.e. 'sean': 
   " USER
   
   # create home directory as /fsx/<user>
   # Create the new user on the head node
   sudo useradd \$USER -m -d /fsx/\$USER --shell /bin/bash;
   user_id=\$(id -u \$USER)
   
   # add user to docker group
   sudo usermod -aG docker \${USER}
   
   # setup SSH Keypair
   sudo -u \$USER ssh-keygen -t rsa -q -f "/fsx/\$USER/.ssh/id_rsa" -N ""
   sudo -u \$USER cat /fsx/\$USER/.ssh/id_rsa.pub | sudo -u \$USER tee /fsx/\$USER/.ssh/authorized_keys
   
   # add user to compute nodes
   read -p "Number of compute nodes in your cluster, i.e. 8: 
   " NUM_NODES
   srun -N \$NUM_NODES sudo useradd -u \$user_id \$USER -d /fsx/\$USER --shell /bin/bash;
   
   # add them as a sudoer
   read -p "Do you want this user to be a sudoer? (y/N):
   " SUDO
   if [ "\$SUDO" = "y" ]; then
           sudo usermod -aG sudo \$USER
           sudo srun -N \$NUM_NODES sudo usermod -aG sudo \$USER
           echo -e "If you haven't already you'll need to run:\n\nsudo visudo /etc/sudoers\n\nChange the line:\n\n%sudo   ALL=(ALL:ALL) ALL\n\nTo\n\n%sudo   ALL=(ALL:ALL) NOPASSWD: ALL\n\nOn each node."
   fi
   EOL
   ```

1. Run the script with the following command. You'll be prompted to enter the user name and the number of compute nodes that the user should have access to.

   ```
   $ bash create-user.sh
   ```

1. Test the user by running the following commands. 

   ```
   $ sudo su - <user> && ssh $(srun hostname)
   ```

1. Add the user information to the `shared_users.txt` file, so the user will be created on any new compute nodes or new clusters.

## Set up a multi-user environment by integrating HyperPod clusters with Active Directory
<a name="sagemaker-hyperpod-run-jobs-slurm-access-nodes-multi-user-with-active-directory"></a>

In practical use cases, HyperPod clusters are typically used by multiple users: machine learning (ML) researchers, software engineers, data scientists, and cluster administrators. They edit their own files and run their own jobs without impacting each other's work. To set up a multi-user environment, use the Linux user and group mechanism to statically create multiple users on each instance through lifecycle scripts. However, the drawback to this approach is that you need to duplicate user and group settings across multiple instances in the cluster to keep a consistent configuration across all instances when you make updates such as adding, editing, and removing users.

To solve this, you can use [Lightweight Directory Access Protocol (LDAP)](https://en.wikipedia.org/wiki/Lightweight_Directory_Access_Protocol) and [LDAP over TLS/SSL (LDAPS)](https://en.wikipedia.org/wiki/Lightweight_Directory_Access_Protocol) to integrate with a directory service such as [AWS Directory Service for Microsoft Active Directory](https://aws.amazon.com/directoryservice/). To learn more about setting up Active Directory and a multi-user environment in a HyperPod cluster, see the blog post [Integrate HyperPod clusters with Active Directory for seamless multi-user login](https://aws.amazon.com/blogs/machine-learning/integrate-hyperpod-clusters-with-active-directory-for-seamless-multi-user-login/).

# Scheduling a Slurm job on a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-run-jobs-slurm-schedule-slurm-job"></a>

You can launch training jobs using the standard Slurm `sbatch` or `srun` commands. For example, to launch an 8-node training job, you can run `srun -N 8 --exclusive train.sh`. SageMaker HyperPod supports training in a range of environments, including `conda`, `venv`, `docker`, and `enroot`. You can configure an ML environment by running lifecycle scripts on your SageMaker HyperPod clusters. You also have the option to attach a shared file system, such as Amazon FSx, which can also be used as a virtual environment.
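For longer-running jobs, the same launch is typically wrapped in an `sbatch` script rather than an interactive `srun`. The following is a minimal sketch; the script name, job name, and node count are illustrative, and submitting the script requires a Slurm cluster.

```shell
# Write a minimal batch script equivalent to `srun -N 8 --exclusive train.sh`.
cat <<'EOF' > submit.sbatch
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --nodes=8
#SBATCH --exclusive
srun train.sh
EOF
# Submit it from a cluster node with: sbatch submit.sbatch
```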

The following example shows how to run a job for training Llama-2 with the Fully Sharded Data Parallelism (FSDP) technique on a SageMaker HyperPod cluster with an Amazon FSx shared file system. You can also find more examples from the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/).

**Tip**  
All SageMaker HyperPod examples are available in the `3.test_cases` folder of the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/).

1. Clone the [Awsome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/), and copy the training job examples to your Amazon FSx file system. 

   ```
   $ TRAINING_DIR=/fsx/users/my-user/fsdp
   $ git clone https://github.com/aws-samples/awsome-distributed-training/
   ```

1. Run the [https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/0.create_conda_env.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/0.create_conda_env.sh) script. This creates a `conda` environment on your Amazon FSx file system. Make sure that the file system is accessible to all nodes in the cluster.

1. Build the virtual Conda environment by launching a single-node Slurm job as follows.

   ```
   $ srun -N 1 /path_to/create_conda_env.sh
   ```

1. After the environment is built, you can launch a training job by pointing to the environment path on the shared volume. You can launch both single-node and multi-node training jobs with the same setup. To launch a job, create a job launcher script (also called an entry point script) as follows.

   ```
   #!/usr/bin/env bash
   set -ex
   
   ENV_PATH=/fsx/users/my_user/pytorch_env
   TORCHRUN=$ENV_PATH/bin/torchrun
   TRAINING_SCRIPT=/fsx/users/my_user/pt_train.py
   
   WORLD_SIZE_JOB=$SLURM_NTASKS
   RANK_NODE=$SLURM_NODEID
   PROC_PER_NODE=8
   MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
   MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
   
   DIST_ARGS="--nproc_per_node=$PROC_PER_NODE \
              --nnodes=$WORLD_SIZE_JOB \
              --node_rank=$RANK_NODE \
              --master_addr=$MASTER_ADDR \
              --master_port=$MASTER_PORT \
             "
             
   $TORCHRUN $DIST_ARGS $TRAINING_SCRIPT
   ```
**Tip**  
If you want to make your training job more resilient against hardware failures by using the auto-resume capability of SageMaker HyperPod, you need to properly set up the environment variable `MASTER_ADDR` in the entrypoint script. To learn more, see [Automatic node recovery and auto-resume](sagemaker-hyperpod-resiliency-slurm-auto-resume.md).

   This tutorial assumes that this script is saved as `/fsx/users/my_user/train.sh`.

1. With this script in the shared volume at `/fsx/users/my_user/train.sh`, run the following `srun` command to schedule the Slurm job.

   ```
   $ cd /fsx/users/my_user/
   $ srun -N 8 train.sh
   ```
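In the launcher script above, `MASTER_PORT` is derived from the last four digits of the Slurm job ID so that concurrent jobs rendezvous on distinct ports. A standalone check of that arithmetic, using a sample job ID:

```shell
# Simulate the port derivation from the entry point script.
SLURM_JOBID=123456
MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
echo "$MASTER_PORT"   # prints 13456
```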

# Running Docker containers on a Slurm compute node on HyperPod
<a name="sagemaker-hyperpod-run-jobs-slurm-docker"></a>

To run Docker containers with Slurm on SageMaker HyperPod, you need to use [Enroot](https://github.com/NVIDIA/enroot) and [Pyxis](https://github.com/NVIDIA/pyxis). The Enroot package converts Docker images into a runtime that Slurm can understand, while Pyxis enables scheduling that runtime as a Slurm job through an `srun` command such as `srun --container-image=docker/image:tag`. 

**Tip**  
The Docker, Enroot, and Pyxis packages should be installed during cluster creation as part of running the lifecycle scripts as guided in [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md). Use the [base lifecycle scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config) provided by the HyperPod service team when creating a HyperPod cluster. Those base scripts install the packages by default: in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) script, the `Config` class has a Boolean parameter for installing the packages that is set to `True` (`enable_docker_enroot_pyxis=True`). This parameter is parsed in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py) script, which calls the `install_docker.sh` and `install_enroot_pyxis.sh` scripts from the [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils) folder. The installation scripts are where the actual installation of the packages takes place. 
Additionally, the installation scripts check whether NVMe store paths are available on the instances they run on and, if so, set the root paths for Docker and Enroot to `/opt/dlami/nvme`. The default root volume of any fresh instance is mounted to `/tmp` with only a 100GB EBS volume, which runs out of space if the workload you plan to run involves training LLMs and therefore large Docker containers. If you use instance families with local NVMe storage, such as the P and G families, make sure that you use the NVMe storage attached at `/opt/dlami/nvme`; the installation scripts take care of the configuration process.

**To check if the root paths are set up properly**

On your Slurm cluster on SageMaker HyperPod, run the following commands to make sure that the lifecycle scripts worked properly and that the root volume of each node is set to `/opt/dlami/nvme/*`. The following commands show examples of checking the Enroot runtime path and the Docker data root path for 8 compute nodes of a Slurm cluster.

```
$ srun -N 8 cat /etc/enroot/enroot.conf | grep "ENROOT_RUNTIME_PATH"
ENROOT_RUNTIME_PATH        /opt/dlami/nvme/tmp/enroot/user-$(id -u)
... // The same or similar lines repeat 7 times
```

```
$ srun -N 8 cat /etc/docker/daemon.json
{
    "data-root": "/opt/dlami/nvme/docker/data-root"
}
... // The same or similar lines repeat 7 times
```

After you confirm that the runtime paths are properly set to `/opt/dlami/nvme/*`, you're ready to build and run Docker containers with Enroot and Pyxis.

**To test Docker with Slurm**

1. On your compute node, try the following commands to check if Docker and Enroot are properly installed.

   ```
   $ docker --help
   $ enroot --help
   ```

1. Test if Pyxis and Enroot are installed correctly by running one of the [NVIDIA CUDA Ubuntu](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda) images.

   ```
   $ srun --container-image=nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY nvidia-smi
   pyxis: importing docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   pyxis: imported docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   DAY MMM DD HH:MM:SS YYYY
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: XX.YY    |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
   | N/A   40C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+
   
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
   ```

   You can also test it by creating a script and running an `sbatch` command as follows.

   ```
   $ cat <<EOF > container-test.sh
   #!/bin/bash
   #SBATCH --container-image=nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   nvidia-smi
   EOF
   
   $ sbatch container-test.sh
   pyxis: importing docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   pyxis: imported docker image: nvidia/cuda:XX.Y.Z-base-ubuntuXX.YY
   DAY MMM DD HH:MM:SS YYYY
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: XX.YY    |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
   | N/A   40C    P0    27W /  70W |      0MiB / 15109MiB |      0%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+
   
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   |  No running processes found                                                 |
   +-----------------------------------------------------------------------------+
   ```

**To run a test Slurm job with Docker**

After you have completed setting up Slurm with Docker, you can bring any pre-built Docker image and run it using Slurm on SageMaker HyperPod. The following is a sample use case that walks you through how to run a training job using Docker and Slurm on SageMaker HyperPod. It shows an example job of model-parallel training of the Llama 2 model with the SageMaker AI model parallelism (SMP) library.

1. If you want to use one of the pre-built ECR images distributed by SageMaker AI or the AWS Deep Learning Containers (DLCs), make sure that you give your HyperPod cluster the permissions to pull ECR images through the [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod). If you use your own or an open source Docker image, you can skip this step. Add the following permissions to the [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod). In this tutorial, we use the [SMP Docker image](distributed-model-parallel-support-v2.md#distributed-model-parallel-supported-frameworks-v2) pre-packaged with the SMP library.

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "ecr:BatchCheckLayerAvailability",
                   "ecr:BatchGetImage",
                   "ecr-public:*",
                   "ecr:GetDownloadUrlForLayer",
                   "ecr:GetAuthorizationToken",
                   "sts:*"
               ],
               "Resource": "*"
           }
       ]
   }
   ```

------
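One way to attach these permissions with the AWS CLI is as an inline policy on the cluster's execution role. The role name and policy name below are placeholders, and the `aws iam put-role-policy` call is commented out because it requires AWS credentials.

```shell
# Save the policy document shown above to a local file.
cat > hyperpod-ecr-policy.json <<'EOF'
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:BatchCheckLayerAvailability",
                "ecr:BatchGetImage",
                "ecr-public:*",
                "ecr:GetDownloadUrlForLayer",
                "ecr:GetAuthorizationToken",
                "sts:*"
            ],
            "Resource": "*"
        }
    ]
}
EOF

# Attach it as an inline policy (placeholder names; needs AWS credentials):
# aws iam put-role-policy \
#     --role-name my-hyperpod-execution-role \
#     --policy-name HyperPodEcrPull \
#     --policy-document file://hyperpod-ecr-policy.json
```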

1. On the compute node, clone the repository and go to the folder that provides the example scripts for training with SMP.

   ```
   $ git clone https://github.com/aws-samples/awsome-distributed-training/
   $ cd awsome-distributed-training/3.test_cases/17.SM-modelparallelv2
   ```

1. In this tutorial, run the sample script [https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/17.SM-modelparallelv2/docker_build.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/17.SM-modelparallelv2/docker_build.sh) that pulls the SMP Docker image, builds the Docker container, and imports it as an Enroot image. You can modify the script as you want.

   ```
   $ cat docker_build.sh
   #!/usr/bin/env bash
   
   region=us-west-2
   dlc_account_id=658645717510
   aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com
   
   docker build -t smpv2 .
   enroot import -o smpv2.sqsh  dockerd://smpv2:latest
   ```

   ```
   $ bash docker_build.sh
   ```

1. Create a batch script to launch a training job using `sbatch`. In this tutorial, the provided sample script [https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/17.SM-modelparallelv2/launch_training_enroot.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/17.SM-modelparallelv2/launch_training_enroot.sh) launches a model-parallel training job of the 70-billion-parameter Llama 2 model with a synthetic dataset on 8 compute nodes. A set of training scripts are provided at [https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2/scripts](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2/scripts), and `launch_training_enroot.sh` takes `train_external.py` as the entrypoint script.
**Important**  
To use a Docker container on SageMaker HyperPod, you must mount the `/var/log` directory from the host machine, which is the HyperPod compute node in this case, onto the `/var/log` directory in the container. You can set this up by adding the following variable for Enroot.  

   ```
   "${HYPERPOD_PATH:="/var/log/aws/clusters":"/var/log/aws/clusters"}"
   ```

   ```
   $ cat launch_training_enroot.sh
   #!/bin/bash
   
   # Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
   # SPDX-License-Identifier: MIT-0
   
   #SBATCH --nodes=8 # number of nodes to use
   #SBATCH --job-name=smpv2_llama # name of your job
   #SBATCH --exclusive # job has exclusive use of the resource, no sharing
   #SBATCH --wait-all-nodes=1
   
   set -ex;
   
   ###########################
   ###### User Variables #####
   ###########################
   
   #########################
   model_type=llama_v2
   model_size=70b
   
   # Toggle this to use synthetic data
   use_synthetic_data=1
   
   
   # To run training on your own data  set Training/Test Data path  -> Change this to the tokenized dataset path in Fsx. Acceptable formats are huggingface (arrow) and Jsonlines.
   # Also change the use_synthetic_data to 0
   
   export TRAINING_DIR=/fsx/path_to_data
   export TEST_DIR=/fsx/path_to_data
   export CHECKPOINT_DIR=$(pwd)/checkpoints
   
   # Variables for Enroot
   : "${IMAGE:=$(pwd)/smpv2.sqsh}"
   : "${HYPERPOD_PATH:="/var/log/aws/clusters":"/var/log/aws/clusters"}" # This is needed to validate that this is a HyperPod cluster
   : "${TRAIN_DATA_PATH:=$TRAINING_DIR:$TRAINING_DIR}"
   : "${TEST_DATA_PATH:=$TEST_DIR:$TEST_DIR}"
   : "${CHECKPOINT_PATH:=$CHECKPOINT_DIR:$CHECKPOINT_DIR}"   
   
   
   ###########################
   ## Environment Variables ##
   ###########################
   
   #export NCCL_SOCKET_IFNAME=en
   export NCCL_ASYNC_ERROR_HANDLING=1
   
   export NCCL_PROTO="simple"
   export NCCL_SOCKET_IFNAME="^lo,docker"
   export RDMAV_FORK_SAFE=1
   export FI_EFA_USE_DEVICE_RDMA=1
   export NCCL_DEBUG_SUBSYS=off
   export NCCL_DEBUG="INFO"
   export SM_NUM_GPUS=8
   export GPU_NUM_DEVICES=8
   export FI_EFA_SET_CUDA_SYNC_MEMOPS=0
   
   # async runtime error ...
   export CUDA_DEVICE_MAX_CONNECTIONS=1
   
   
   #########################
   ## Command and Options ##
   #########################
   
   if [ "$model_size" == "7b" ]; then
       HIDDEN_WIDTH=4096
       NUM_LAYERS=32
       NUM_HEADS=32
       LLAMA_INTERMEDIATE_SIZE=11008
       DEFAULT_SHARD_DEGREE=8
   # More Llama model size options
   elif [ "$model_size" == "70b" ]; then
       HIDDEN_WIDTH=8192
       NUM_LAYERS=80
       NUM_HEADS=64
       LLAMA_INTERMEDIATE_SIZE=28672
       # Reduce for better perf on p4de
       DEFAULT_SHARD_DEGREE=64
   fi
   
   
   if [ -z "$shard_degree" ]; then
       SHARD_DEGREE=$DEFAULT_SHARD_DEGREE
   else
       SHARD_DEGREE=$shard_degree
   fi
   
   if [ -z "$LLAMA_INTERMEDIATE_SIZE" ]; then
       LLAMA_ARGS=""
   else
       LLAMA_ARGS="--llama_intermediate_size $LLAMA_INTERMEDIATE_SIZE "
   fi
   
   
   if [ $use_synthetic_data == 1 ]; then
       echo "using synthetic data"
       declare -a ARGS=(
       --container-image $IMAGE
       --container-mounts $HYPERPOD_PATH,$CHECKPOINT_PATH
       )
   else
       echo "using real data...."
       declare -a ARGS=(
       --container-image $IMAGE
       --container-mounts $HYPERPOD_PATH,$TRAIN_DATA_PATH,$TEST_DATA_PATH,$CHECKPOINT_PATH
       )
   fi
   
   
   declare -a TORCHRUN_ARGS=(
       # change this to match the number of gpus per node:
       --nproc_per_node=8 \
       --nnodes=$SLURM_JOB_NUM_NODES \
       --rdzv_id=$SLURM_JOB_ID \
       --rdzv_backend=c10d \
       --rdzv_endpoint=$(hostname) \
   )
   
   srun -l "${ARGS[@]}" torchrun "${TORCHRUN_ARGS[@]}" /path_to/train_external.py \
               --train_batch_size 4 \
               --max_steps 100 \
               --hidden_width $HIDDEN_WIDTH \
               --num_layers $NUM_LAYERS \
               --num_heads $NUM_HEADS \
               ${LLAMA_ARGS} \
               --shard_degree $SHARD_DEGREE \
               --model_type $model_type \
               --profile_nsys 1 \
               --use_smp_implementation 1 \
               --max_context_width 4096 \
               --tensor_parallel_degree 1 \
               --use_synthetic_data $use_synthetic_data \
               --training_dir $TRAINING_DIR \
               --test_dir $TEST_DIR \
               --dataset_type hf \
               --checkpoint_dir $CHECKPOINT_DIR \
               --checkpoint_freq 100 \
   
   $ sbatch launch_training_enroot.sh
   ```

To find the downloadable code examples, see [Run a model-parallel training job using the SageMaker AI model parallelism library, Docker and Enroot with Slurm](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2#option-2----run-training-using-docker-and-enroot) in the *Awsome Distributed Training GitHub repository*. For more information about distributed training with a Slurm cluster on SageMaker HyperPod, proceed to the next topic at [Running distributed training workloads with Slurm on HyperPod](sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload.md).

# Running distributed training workloads with Slurm on HyperPod
<a name="sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload"></a>

SageMaker HyperPod is specialized for workloads of training large language models (LLMs) and foundation models (FMs). These workloads often require the use of multiple parallelism techniques and optimized operations for ML infrastructure and resources. Using SageMaker HyperPod, you can use the following SageMaker AI distributed training frameworks:
+ The [SageMaker AI distributed data parallelism (SMDDP) library](data-parallel.md) that offers collective communication operations optimized for AWS.
+ The [SageMaker AI model parallelism (SMP) library](model-parallel-v2.md) that implements various model parallelism techniques.

**Topics**
+ [Using SMDDP on a SageMaker HyperPod cluster](#sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smddp)
+ [Using SMP on a SageMaker HyperPod cluster](#sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smp)

## Using SMDDP on a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smddp"></a>

The [SMDDP library](data-parallel.md) is a collective communication library that improves compute performance of distributed data parallel training. The SMDDP library works with the following open source distributed training frameworks:
+ [PyTorch distributed data parallel (DDP)](https://pytorch.org/docs/stable/notes/ddp.html)
+ [PyTorch fully sharded data parallelism (FSDP)](https://pytorch.org/docs/stable/fsdp.html)
+ [DeepSpeed](https://github.com/microsoft/DeepSpeed)
+ [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed)

The SMDDP library addresses the communication overhead of the key collective communication operations by offering the following for SageMaker HyperPod.
+ The library offers `AllGather` optimized for AWS. `AllGather` is a key operation used in sharded data parallel training, which is a memory-efficient data parallelism technique offered by popular libraries. These include the SageMaker AI model parallelism (SMP) library, DeepSpeed Zero Redundancy Optimizer (ZeRO), and PyTorch Fully Sharded Data Parallelism (FSDP).
+ The library performs optimized node-to-node communication by fully utilizing the AWS network infrastructure and the SageMaker AI ML instance topology. 

**To run sample data-parallel training jobs**

Explore the following distributed training samples implementing data parallelism techniques using the SMDDP library.
+ [https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/12.SM-dataparallel-FSDP](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/12.SM-dataparallel-FSDP)
+ [https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/13.SM-dataparallel-deepspeed](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/13.SM-dataparallel-deepspeed)

**To set up an environment for using the SMDDP library on SageMaker HyperPod**

The following are training environment requirements for using the SMDDP library on SageMaker HyperPod.
+ PyTorch v2.0.1 and later
+ CUDA v11.8 and later
+ `libstdc++` runtime version greater than 3
+ Python v3.10.x and later
+ `ml.p4d.24xlarge` and `ml.p4de.24xlarge`, which are the instance types supported by the SMDDP library
+ `imdsv2` enabled on training host

Depending on how you want to run the distributed training job, there are two options to install the SMDDP library:
+ A direct installation using the SMDDP binary file.
+ Using the SageMaker AI Deep Learning Containers (DLCs) pre-installed with the SMDDP library.

Docker images pre-installed with the SMDDP library or the URLs to the SMDDP binary files are listed at [Supported Frameworks](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-data-parallel-support.html#distributed-data-parallel-supported-frameworks) in the SMDDP library documentation.

**To install the SMDDP library on the SageMaker HyperPod DLAMI**
+ `pip install --no-cache-dir https://smdataparallel.s3.amazonaws.com/binary/pytorch/<pytorch-version>/cuXYZ/YYYY-MM-DD/smdistributed_dataparallel-X.Y.Z-cp310-cp310-linux_x86_64.whl`
**Note**  
If you work in a Conda environment, ensure that you install PyTorch using `conda install` instead of `pip`.  

  ```
  conda install pytorch==X.Y.Z  torchvision==X.Y.Z torchaudio==X.Y.Z pytorch-cuda=X.Y.Z -c pytorch -c nvidia
  ```

**To use the SMDDP library on a Docker container**
+ The SMDDP library is pre-installed on the SageMaker AI Deep Learning Containers (DLCs). To find the list of SageMaker AI framework DLCs for PyTorch with the SMDDP library, see [Supported Frameworks](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-data-parallel-support.html#distributed-data-parallel-supported-frameworks) in the SMDDP library documentation. You can also bring your own Docker container with required dependencies installed to use the SMDDP library. To learn more about setting up a custom Docker container to use the SMDDP library, see also [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md).
**Important**  
To use the SMDDP library in a Docker container, mount the `/var/log` directory from the host machine onto `/var/log` in the container. This can be done by adding the following option when running your container.  

  ```
  docker run <OTHER_OPTIONS> -v /var/log:/var/log ...
  ```

To learn how to run data-parallel training jobs with SMDDP in general, see [Distributed training with the SageMaker AI distributed data parallelism library](data-parallel-modify-sdp.md).

## Using SMP on a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-run-jobs-slurm-distributed-training-workload-smp"></a>

The [SageMaker AI model parallelism (SMP) library](model-parallel-v2.md) offers various [state-of-the-art model parallelism techniques](model-parallel-core-features-v2.md), including:
+ fully sharded data parallelism
+ expert parallelism
+ mixed precision training with FP16/BF16 and FP8 data types
+ tensor parallelism

The SMP library is also compatible with open source frameworks such as PyTorch FSDP, NVIDIA Megatron, and NVIDIA Transformer Engine.

**To run a sample model-parallel training workload**

The SageMaker AI service teams provide sample training jobs implementing model parallelism with the SMP library at [https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2](https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/17.SM-modelparallelv2).

# SageMaker HyperPod cluster resources monitoring
<a name="sagemaker-hyperpod-cluster-observability-slurm"></a>

To achieve comprehensive observability into your SageMaker HyperPod cluster resources and software components, integrate the cluster with [Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html) and [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html). The integration with Amazon Managed Service for Prometheus enables the export of metrics related to your HyperPod cluster resources, providing insights into their performance, utilization, and health. The integration with Amazon Managed Grafana enables the visualization of these metrics through various Grafana dashboards that offer an intuitive interface for monitoring and analyzing the cluster's behavior. By leveraging these services, you gain a centralized and unified view of your HyperPod cluster, facilitating proactive monitoring, troubleshooting, and optimization of your distributed training workloads.

**Tip**  
To find practical examples and solutions, see also the [SageMaker HyperPod workshop](https://catalog.workshops.aws/sagemaker-hyperpod).

![\[An overview of configuring SageMaker HyperPod with Amazon Managed Service for Prometheus and Amazon Managed Grafana.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-observability-architecture.png)


Figure: This architecture diagram shows an overview of configuring SageMaker HyperPod with Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Proceed to the following topics to set up for SageMaker HyperPod cluster observability.

**Topics**
+ [Prerequisites for SageMaker HyperPod cluster observability](sagemaker-hyperpod-cluster-observability-slurm-prerequisites.md)
+ [Installing metrics exporter packages on your HyperPod cluster](sagemaker-hyperpod-cluster-observability-slurm-install-exporters.md)
+ [Validating Prometheus setup on the head node of a HyperPod cluster](sagemaker-hyperpod-cluster-observability-slurm-validate-prometheus-setup.md)
+ [Setting up an Amazon Managed Grafana workspace](sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws.md)
+ [Exported metrics reference](sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference.md)
+ [Amazon SageMaker HyperPod Slurm metrics](smcluster-slurm-metrics.md)

# Prerequisites for SageMaker HyperPod cluster observability
<a name="sagemaker-hyperpod-cluster-observability-slurm-prerequisites"></a>

Before proceeding with [Installing metrics exporter packages on your HyperPod cluster](sagemaker-hyperpod-cluster-observability-slurm-install-exporters.md), ensure that the following prerequisites are met.

## Enable IAM Identity Center
<a name="sagemaker-hyperpod-cluster-observability-slurm-prerequisites-iam-id-center"></a>

To enable observability for your SageMaker HyperPod cluster, you must first enable IAM Identity Center. This is a prerequisite for deploying a CloudFormation stack that sets up the Amazon Managed Grafana workspace and Amazon Managed Service for Prometheus. Both of these services also require IAM Identity Center for authentication and authorization, ensuring secure user access and management of the monitoring infrastructure.

For detailed guidance on enabling IAM Identity Center, see the [Enabling IAM Identity Center](https://docs.aws.amazon.com/singlesignon/latest/userguide/get-set-up-for-idc.html) section in the *AWS IAM Identity Center User Guide*. 

After successfully enabling IAM Identity Center, set up a user account that will serve as the administrative user throughout the following configuration procedures.

## Create and deploy a CloudFormation stack for SageMaker HyperPod observability
<a name="sagemaker-hyperpod-cluster-observability-slurm-prerequisites-cloudformation-stack"></a>

Create and deploy a CloudFormation stack for SageMaker HyperPod observability to monitor HyperPod cluster metrics in real time using Amazon Managed Service for Prometheus and Amazon Managed Grafana. Before you deploy the stack, you must enable [IAM Identity Center](https://console.aws.amazon.com/singlesignon) as described in the preceding section.

Use the sample CloudFormation script [https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/4.prometheus-grafana/cluster-observability.yaml](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/4.prometheus-grafana/cluster-observability.yaml) that helps you set up Amazon VPC subnets, Amazon FSx for Lustre file systems, Amazon S3 buckets, and IAM roles required to create a HyperPod cluster observability stack.
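As a sketch, the stack can be deployed with the AWS CLI. The stack name below is a placeholder, and the command is echoed rather than executed because deployment requires AWS credentials and the template downloaded locally.

```shell
STACK_NAME=hyperpod-observability
TEMPLATE=cluster-observability.yaml

# Print the deployment command; remove the leading "echo" to deploy.
# An IAM capability flag is needed when the template creates IAM roles.
echo aws cloudformation create-stack \
    --stack-name "$STACK_NAME" \
    --template-body "file://$TEMPLATE" \
    --capabilities CAPABILITY_IAM
```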

# Installing metrics exporter packages on your HyperPod cluster
<a name="sagemaker-hyperpod-cluster-observability-slurm-install-exporters"></a>

The [base configuration lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) that the SageMaker HyperPod team provides also include the installation of various metric exporter packages. To activate the installation step, set the parameter `enable_observability=True` in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file. The lifecycle scripts are designed to bootstrap your cluster with the following open-source metric exporter packages.


| Name | Script deployment target node | Exporter description | 
| --- | --- | --- | 
| [Slurm exporter for Prometheus](https://github.com/vpenso/prometheus-slurm-exporter) | Head (controller) node |  Exports Slurm Accounting metrics.  | 
|  [Elastic Fabric Adapter (EFA) node exporter](https://github.com/aws-samples/awsome-distributed-training/tree/main/4.validation_and_observability/3.efa-node-exporter)  |  Compute node  |  Exports metrics from cluster nodes and EFA. The package is a fork of the [Prometheus node exporter](https://github.com/prometheus/node_exporter).  | 
|  [NVIDIA Data Center GPU Management (DCGM) exporter](https://github.com/NVIDIA/dcgm-exporter)  | Compute node |  Exports NVIDIA DCGM metrics about health and performance of NVIDIA GPUs.  | 
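
For illustration, the `enable_observability` parameter is a plain boolean attribute on the `Config` class in the linked `config.py` file (illustrative excerpt, not the full file):

```python
# Illustrative excerpt of LifecycleScripts/base-config/config.py.
# Setting this flag to True activates the exporter installation step
# that lifecycle_script.py runs during cluster bootstrap.
class Config:
    enable_observability = True
```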

With `enable_observability=True` in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/config.py) file, the following installation step is activated in the [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py) script. 

```
# Install metric exporting software and Prometheus for observability
if Config.enable_observability:
    if node_type == SlurmNodeType.COMPUTE_NODE:
        ExecuteBashScript("./utils/install_docker.sh").run()
        ExecuteBashScript("./utils/install_dcgm_exporter.sh").run()
        ExecuteBashScript("./utils/install_efa_node_exporter.sh").run()

    if node_type == SlurmNodeType.HEAD_NODE:
        wait_for_scontrol()
        ExecuteBashScript("./utils/install_docker.sh").run()
        ExecuteBashScript("./utils/install_slurm_exporter.sh").run()
        ExecuteBashScript("./utils/install_prometheus.sh").run()
```

On the compute nodes, the script installs the NVIDIA Data Center GPU Management (DCGM) exporter and the Elastic Fabric Adapter (EFA) node exporter. The DCGM exporter is an exporter for Prometheus that collects metrics from NVIDIA GPUs, enabling monitoring of GPU usage, performance, and health. The EFA node exporter, on the other hand, gathers metrics related to the EFA network interface, which is essential for low-latency and high-bandwidth communication in HPC clusters.

On the head node, the script installs the Slurm exporter for Prometheus and the [Prometheus open-source software](https://prometheus.io/docs/introduction/overview/). The Slurm exporter provides Prometheus with metrics related to Slurm jobs, partitions, and node states.

Note that the lifecycle scripts install all the exporter packages as Docker containers, so the Docker package also must be installed on both the head and compute nodes. The scripts for these components are provided in the [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils) folder of the *Awsome Distributed Training GitHub repository*.

After you have successfully set up your HyperPod cluster with the exporter packages installed, proceed to the next topic to finish setting up Amazon Managed Service for Prometheus and Amazon Managed Grafana.

# Validating Prometheus setup on the head node of a HyperPod cluster
<a name="sagemaker-hyperpod-cluster-observability-slurm-validate-prometheus-setup"></a>

After you have successfully set up your HyperPod cluster with the exporter packages installed, check that Prometheus is properly set up on the head node of your HyperPod cluster.

1. Connect to the head node of your cluster. For instructions on accessing a node, see [Accessing your SageMaker HyperPod cluster nodes](sagemaker-hyperpod-run-jobs-slurm-access-nodes.md).

1. Run the following command to verify that the Prometheus service created by the lifecycle script `install_prometheus.sh` is running on the controller node. The output should show the **Active** status as **active (running)**.

   ```
   $ sudo systemctl status prometheus
   • prometheus.service - Prometheus Exporter
   Loaded: loaded (/etc/systemd/system/prometheus.service; enabled; preset: disabled)
   Active: active (running) since DAY YYYY-MM-DD HH:MM:SS UTC; Ss ago
   Main PID: 12345 (prometheus)
   Tasks: 7 (limit: 9281)
   Memory: 35M
   CPU: 234ms
   CGroup: /system.slice/prometheus.service
           └─12345 /usr/bin/prometheus --config.file=/etc/prometheus/prometheus.yml
   ```

1. Validate the Prometheus configuration file as follows. The output should be similar to the following, with the three exporters configured with the correct compute node IP addresses.

   ```
   $ cat /etc/prometheus/prometheus.yml
   global:
     scrape_interval: 15s
     evaluation_interval: 15s
     scrape_timeout: 15s
   
   scrape_configs:
     - job_name: 'slurm_exporter'
       static_configs:
         - targets:
             - 'localhost:8080'
     - job_name: 'dcgm_exporter'
       static_configs:
         - targets:
             - '<ComputeNodeIP>:9400'
             - '<ComputeNodeIP>:9400'
     - job_name: 'efa_node_exporter'
       static_configs:
         - targets:
             - '<ComputeNodeIP>:9100'
             - '<ComputeNodeIP>:9100'
   
   remote_write:
     - url: <AMPRemoteWriteURL>
       queue_config:
         max_samples_per_send: 1000
         max_shards: 200
         capacity: 2500
       sigv4:
         region: <Region>
   ```

1. To test whether Prometheus is exporting Slurm, DCGM, and EFA metrics properly, run the following `curl` command against Prometheus on port `9090` on the head node.

   ```
   $ curl -s http://localhost:9090/metrics | grep -E 'slurm|dcgm|efa'
   ```

   With the metrics exported to your Amazon Managed Service for Prometheus workspace through the Prometheus remote write configuration on the controller node, you can proceed to the next topic to set up Amazon Managed Grafana dashboards to display the metrics.
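
These checks can also be scripted. The following is a minimal, illustrative Python sketch (not part of the HyperPod tooling) that verifies the three expected scrape jobs appear in the Prometheus configuration file:

```python
# Illustrative check of /etc/prometheus/prometheus.yml for the scrape
# jobs installed by the lifecycle scripts. Plain string matching is
# used so that no YAML library is required.
EXPECTED_JOBS = ("slurm_exporter", "dcgm_exporter", "efa_node_exporter")

def missing_jobs(config_text, expected=EXPECTED_JOBS):
    """Return the scrape job names that are absent from the config text."""
    return [job for job in expected if f"job_name: '{job}'" not in config_text]

if __name__ == "__main__":
    with open("/etc/prometheus/prometheus.yml") as f:
        missing = missing_jobs(f.read())
    print("missing scrape jobs:", missing or "none")
```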

# Setting up an Amazon Managed Grafana workspace
<a name="sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws"></a>

Create a new Amazon Managed Grafana workspace or update an existing Amazon Managed Grafana workspace with Amazon Managed Service for Prometheus as the data source.

**Topics**
+ [Create a Grafana workspace and set Amazon Managed Service for Prometheus as a data source](#sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-create)
+ [Open the Grafana workspace and finish setting up the data source](#sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-connect-data-source)
+ [Import open-source Grafana dashboards](#sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-import-dashboards)

## Create a Grafana workspace and set Amazon Managed Service for Prometheus as a data source
<a name="sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-create"></a>

To visualize metrics from Amazon Managed Service for Prometheus, create an Amazon Managed Grafana workspace and set it up to use Amazon Managed Service for Prometheus as a data source.

1. To create a Grafana workspace, follow the instructions at [Creating a workspace](https://docs.aws.amazon.com/grafana/latest/userguide/AMG-create-workspace.html#creating-workspace) in the *Amazon Managed Grafana User Guide*.

   1. In Step 13, select Amazon Managed Service for Prometheus as the data source.

   1. In Step 17, you can add the admin user, as well as other users, from your IAM Identity Center.

For more information, see also the following resources.
+ [Set up Amazon Managed Grafana for use with Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-amg.html) in the *Amazon Managed Service for Prometheus User Guide*
+ [Use AWS data source configuration to add Amazon Managed Service for Prometheus as a data source](https://docs.aws.amazon.com/grafana/latest/userguide/AMP-adding-AWS-config.html) in the *Amazon Managed Grafana User Guide*

## Open the Grafana workspace and finish setting up the data source
<a name="sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-connect-data-source"></a>

After you have successfully created or updated an Amazon Managed Grafana workspace, choose the workspace URL to open the workspace. You are prompted to enter the user name and password of a user that you set up in IAM Identity Center. Log in as the admin user to finish setting up the workspace.

1. In the workspace **Home** page, choose **Apps**, **AWS Data Sources**, and **Data sources**.

1. On the **Data sources** page, choose the **Data sources** tab.

1. For **Service**, choose Amazon Managed Service for Prometheus.

1. In the **Browse and provision data sources** section, choose the AWS Region where you provisioned your Amazon Managed Service for Prometheus workspace.

1. From the list of data sources in the selected Region, choose the one for Amazon Managed Service for Prometheus. Verify the resource ID and the resource alias of the Amazon Managed Service for Prometheus workspace that you set up for the HyperPod observability stack.

## Import open-source Grafana dashboards
<a name="sagemaker-hyperpod-cluster-observability-slurm-managed-grafana-ws-import-dashboards"></a>

After you've successfully set up your Amazon Managed Grafana workspace with Amazon Managed Service for Prometheus as the data source, metrics start flowing into Prometheus, and you should start seeing dashboards with charts and related information. The Grafana open-source community provides various dashboards that you can import into Amazon Managed Grafana.

**To import open-source Grafana dashboards to Amazon Managed Grafana**

1. In the **Home** page of your Amazon Managed Grafana workspace, choose **Dashboards**.

1. Choose the drop-down menu labeled **New**, and select **Import**.

1. Paste the URL of the [Slurm Dashboard](https://grafana.com/grafana/dashboards/4323-slurm-dashboard/).

   ```
   https://grafana.com/grafana/dashboards/4323-slurm-dashboard/
   ```

1. Select **Load**.

1. Repeat the previous steps to import the following dashboards.

   1. [Node Exporter Full Dashboard](https://grafana.com/grafana/dashboards/1860-node-exporter-full/)

      ```
      https://grafana.com/grafana/dashboards/1860-node-exporter-full/
      ```

   1. [NVIDIA DCGM Exporter Dashboard](https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/)

      ```
      https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/
      ```

   1. [EFA Metrics Dashboard](https://grafana.com/grafana/dashboards/20579-efa-metrics-dev/)

      ```
      https://grafana.com/grafana/dashboards/20579-efa-metrics-dev/
      ```

   1. [FSx for Lustre Metrics Dashboard](https://grafana.com/grafana/dashboards/20906-fsx-lustre/)

      ```
      https://grafana.com/grafana/dashboards/20906-fsx-lustre/
      ```

# Exported metrics reference
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference"></a>

The following sections provide comprehensive lists of the metrics that SageMaker HyperPod exports to Amazon Managed Service for Prometheus after you successfully configure the CloudFormation stack for SageMaker HyperPod observability. You can monitor these metrics visualized in the Amazon Managed Grafana dashboards.

## Slurm exporter dashboard
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference-slurm-exporter"></a>

Provides visualized information of Slurm clusters on SageMaker HyperPod.

**Types of metrics**
+ **Cluster Overview:** Displaying the total number of nodes, jobs, and their states.
+ **Job Metrics:** Visualizing job counts and states over time.
+ **Node Metrics:** Showing node states, allocation, and available resources.
+ **Partition Metrics:** Monitoring partition-specific metrics such as CPU, memory, and GPU utilization.
+ **Job Efficiency:** Calculating job efficiency based on resources utilized.

**List of metrics**


| Metric name | Description | 
| --- | --- | 
| slurm\_job\_count | Total number of jobs in the Slurm cluster | 
| slurm\_job\_state\_count | Count of jobs in each state (e.g., running, pending, completed) | 
| slurm\_node\_count | Total number of nodes in the Slurm cluster | 
| slurm\_node\_state\_count | Count of nodes in each state (e.g., idle, alloc, mix) | 
| slurm\_partition\_node\_count | Count of nodes in each partition | 
| slurm\_partition\_job\_count | Count of jobs in each partition | 
| slurm\_partition\_alloc\_cpus | Total number of allocated CPUs in each partition | 
| slurm\_partition\_free\_cpus | Total number of available CPUs in each partition | 
| slurm\_partition\_alloc\_memory | Total allocated memory in each partition | 
| slurm\_partition\_free\_memory | Total available memory in each partition | 
| slurm\_partition\_alloc\_gpus | Total allocated GPUs in each partition | 
| slurm\_partition\_free\_gpus | Total available GPUs in each partition | 
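
Exporters publish these metrics in the Prometheus exposition format, one `name{labels} value` sample per line. As an illustrative sketch (the sample lines below are hypothetical), you can parse that format with a few lines of Python:

```python
def parse_metric(text, name):
    """Return a {label-string: value} dict for all samples of a labeled metric."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip comment/HELP/TYPE lines and lines for other metrics
        if line.startswith("#") or not line.startswith(name + "{"):
            continue
        metric, _, value = line.rpartition(" ")
        labels = metric[len(name):].strip("{}")
        samples[labels] = float(value)
    return samples

# Hypothetical exporter output, for illustration only
sample = """\
# HELP slurm_node_state_count Count of nodes in each state
slurm_node_state_count{state="idle"} 3
slurm_node_state_count{state="alloc"} 5
"""
counts = parse_metric(sample, "slurm_node_state_count")
```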

## Node exporter dashboard
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference-node-exporter"></a>

Provides visualized information of system metrics collected by the [Prometheus node exporter](https://github.com/prometheus/node_exporter) from the HyperPod cluster nodes.

**Types of metrics**
+ **System overview:** Displaying CPU load averages and memory usage.
+ **Memory metrics:** Visualizing memory utilization including total memory, free memory, and swap space.
+ **Disk usage:** Monitoring disk space utilization and availability.
+ **Network traffic:** Showing network bytes received and transmitted over time.
+ **File system metrics:** Analyzing file system usage and availability.
+ **Disk I/O metrics:** Visualizing disk read and write activity.

**List of metrics**

For a complete list of exported metrics, see the [Node exporter](https://github.com/prometheus/node_exporter?tab=readme-ov-file#enabled-by-default) and [procfs](https://github.com/prometheus/procfs?tab=readme-ov-file) GitHub repositories. The following table shows a subset of the metrics that provide insights into system resource utilization, such as CPU load, memory usage, disk space, and network activity.


| Metric name | Description | 
| --- | --- | 
|  node\_load1  | 1-minute load average | 
|  node\_load5  | 5-minute load average | 
|  node\_load15  | 15-minute load average | 
|  node\_memory\_MemTotal  | Total system memory | 
|  node\_memory\_MemFree  | Free system memory | 
|  node\_memory\_MemAvailable  | Available memory for allocation to processes | 
|  node\_memory\_Buffers  | Memory used by the kernel for buffering | 
|  node\_memory\_Cached  | Memory used by the kernel for caching file system data | 
|  node\_memory\_SwapTotal  | Total swap space available | 
|  node\_memory\_SwapFree  | Free swap space | 
|  node\_memory\_SwapCached  | Memory that was swapped out, is swapped back in, but is still in swap | 
|  node\_filesystem\_avail\_bytes  | Available disk space in bytes | 
|  node\_filesystem\_size\_bytes  | Total disk space in bytes | 
|  node\_filesystem\_free\_bytes  | Free disk space in bytes | 
|  node\_network\_receive\_bytes  | Network bytes received | 
|  node\_network\_transmit\_bytes  | Network bytes transmitted | 
|  node\_disk\_read\_bytes  | Disk bytes read | 
|  node\_disk\_written\_bytes  | Disk bytes written | 

## NVIDIA DCGM exporter dashboard
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference-nvidia-dcgm-exporter"></a>

Provides visualized information of NVIDIA GPU metrics collected by the [NVIDIA DCGM exporter](https://github.com/NVIDIA/dcgm-exporter).

**Types of metrics**
+ **GPU Overview:** Displaying GPU utilization, temperatures, power usage, and memory usage. 
+ **Temperature Metrics:** Visualizing GPU temperatures over time. 
+ **Power Usage:** Monitoring GPU power draw and power usage trends. 
+ **Memory Utilization:** Analyzing GPU memory usage including used, free, and total memory. 
+ **Fan Speed:** Showing GPU fan speeds and variations. 
+ **ECC Errors:** Tracking GPU memory ECC errors and pending errors.

**List of metrics**

The following table lists metrics that provide insights into NVIDIA GPU health and performance, including clock frequencies, temperatures, power usage, memory utilization, fan speeds, and error metrics.


| Metric name | Description | 
| --- | --- | 
|  DCGM\_FI\_DEV\_SM\_CLOCK  | SM clock frequency (in MHz) | 
|  DCGM\_FI\_DEV\_MEM\_CLOCK  | Memory clock frequency (in MHz) | 
|  DCGM\_FI\_DEV\_MEMORY\_TEMP  | Memory temperature (in C) | 
|  DCGM\_FI\_DEV\_GPU\_TEMP  | GPU temperature (in C) | 
|  DCGM\_FI\_DEV\_POWER\_USAGE  | Power draw (in W) | 
|  DCGM\_FI\_DEV\_TOTAL\_ENERGY\_CONSUMPTION  | Total energy consumption since boot (in mJ) | 
|  DCGM\_FI\_DEV\_PCIE\_REPLAY\_COUNTER  | Total number of PCIe retries | 
|  DCGM\_FI\_DEV\_MEM\_COPY\_UTIL  | Memory utilization (in %) | 
|  DCGM\_FI\_DEV\_ENC\_UTIL  | Encoder utilization (in %) | 
|  DCGM\_FI\_DEV\_DEC\_UTIL  | Decoder utilization (in %) | 
|  DCGM\_FI\_DEV\_XID\_ERRORS  | Value of the last XID error encountered | 
|  DCGM\_FI\_DEV\_FB\_FREE  | Frame buffer memory free (in MiB) | 
|  DCGM\_FI\_DEV\_FB\_USED  | Frame buffer memory used (in MiB) | 
|  DCGM\_FI\_DEV\_NVLINK\_BANDWIDTH\_TOTAL  | Total number of NVLink bandwidth counters for all lanes | 
|  DCGM\_FI\_DEV\_VGPU\_LICENSE\_STATUS  | vGPU license status | 
|  DCGM\_FI\_DEV\_UNCORRECTABLE\_REMAPPED\_ROWS  | Number of remapped rows for uncorrectable errors | 
|  DCGM\_FI\_DEV\_CORRECTABLE\_REMAPPED\_ROWS  | Number of remapped rows for correctable errors | 
|  DCGM\_FI\_DEV\_ROW\_REMAP\_FAILURE  | Whether remapping of rows has failed | 

## EFA metrics dashboard
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference-efa-exporter"></a>

Provides visualized information of the metrics from [Amazon Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) equipped on P instances collected by the [EFA node exporter](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md).

**Types of metrics**
+ **EFA error metrics:** Visualizing errors such as allocation errors, command errors, and memory map errors.
+ **EFA network traffic:** Monitoring received and transmitted bytes, packets, and work requests.
+ **EFA RDMA performance:** Analyzing RDMA read and write operations, including bytes transferred and error rates.
+ **EFA port lifespan:** Displaying the lifespan of EFA ports over time.
+ **EFA keep-alive packets:** Tracking the number of keep-alive packets received.

**List of metrics**

The following table lists metrics that provide insights into various aspects of EFA operation, including errors, completed commands, network traffic, and resource utilization.


| Metric name | Description | 
| --- | --- | 
|  node\_amazonefa\_info  | Non-numeric data from /sys/class/infiniband/; the value is always 1 | 
|  node\_amazonefa\_lifespan  | Lifespan of the port | 
|  node\_amazonefa\_rdma\_read\_bytes  | Number of bytes read with RDMA | 
|  node\_amazonefa\_rdma\_read\_resp\_bytes  | Number of read response bytes with RDMA | 
|  node\_amazonefa\_rdma\_read\_wr\_err  | Number of RDMA read work request errors | 
|  node\_amazonefa\_rdma\_read\_wrs  | Number of RDMA read work requests | 
|  node\_amazonefa\_rdma\_write\_bytes  | Number of bytes written with RDMA | 
|  node\_amazonefa\_rdma\_write\_recv\_bytes  | Number of bytes written and received with RDMA | 
|  node\_amazonefa\_rdma\_write\_wr\_err  | Number of RDMA write work request errors | 
|  node\_amazonefa\_rdma\_write\_wrs  | Number of RDMA write work requests | 
|  node\_amazonefa\_recv\_bytes  | Number of bytes received | 
|  node\_amazonefa\_recv\_wrs  | Number of receive work requests | 
|  node\_amazonefa\_rx\_bytes  | Number of bytes received | 
|  node\_amazonefa\_rx\_drops  | Number of packets dropped | 
|  node\_amazonefa\_rx\_pkts  | Number of packets received | 
|  node\_amazonefa\_send\_bytes  | Number of bytes sent | 
|  node\_amazonefa\_send\_wrs  | Number of send work requests | 
|  node\_amazonefa\_tx\_bytes  | Number of bytes transmitted | 
|  node\_amazonefa\_tx\_pkts  | Number of packets transmitted | 

## FSx for Lustre metrics dashboard
<a name="sagemaker-hyperpod-cluster-observability-slurm-exported-metrics-reference-fsx-exporter"></a>

Provides visualized information of the [metrics from Amazon FSx for Lustre](https://docs.aws.amazon.com/fsx/latest/LustreGuide/monitoring-cloudwatch.html) file system collected by [Amazon CloudWatch](https://docs.aws.amazon.com/fsx/latest/LustreGuide/monitoring-cloudwatch.html).

**Note**  
The Grafana FSx for Lustre dashboard utilizes Amazon CloudWatch as its data source, which differs from the other dashboards that you have configured to use Amazon Managed Service for Prometheus. To ensure accurate monitoring and visualization of metrics related to your FSx for Lustre file system, configure the FSx for Lustre dashboard to use Amazon CloudWatch as the data source, specifying the same AWS Region where your FSx for Lustre file system is deployed.

**Types of metrics**
+ **DataReadBytes:** The number of bytes for file system read operations.
+ **DataWriteBytes:** The number of bytes for file system write operations.
+ **DataReadOperations:** The number of read operations.
+ **DataWriteOperations:** The number of write operations.
+ **MetadataOperations:** The number of metadata operations.
+ **FreeDataStorageCapacity:** The amount of available storage capacity.

# Amazon SageMaker HyperPod Slurm metrics
<a name="smcluster-slurm-metrics"></a>

Amazon SageMaker HyperPod provides a set of Amazon CloudWatch metrics that you can use to monitor the health and performance of your HyperPod clusters. These metrics are collected from the Slurm workload manager running on your HyperPod clusters and are available in the `/aws/sagemaker/Clusters` CloudWatch namespace.

## Cluster level metrics
<a name="smcluster-slurm-metrics-cluster"></a>

The following cluster-level metrics are available for HyperPod. These metrics use the `ClusterId` dimension to identify the specific HyperPod cluster.


| CloudWatch metric name | Notes | Amazon EKS Container Insights metric name | 
| --- | --- | --- | 
| cluster\_node\_count | Total number of nodes in the cluster | cluster\_node\_count | 
| cluster\_idle\_node\_count | Number of idle nodes in the cluster | N/A | 
| cluster\_failed\_node\_count | Number of failed nodes in the cluster | cluster\_failed\_node\_count | 
| cluster\_cpu\_count | Total CPU cores in the cluster | node\_cpu\_limit | 
| cluster\_idle\_cpu\_count | Number of idle CPU cores in the cluster | N/A | 
| cluster\_gpu\_count | Total GPUs in the cluster | node\_gpu\_limit | 
| cluster\_idle\_gpu\_count | Number of idle GPUs in the cluster | N/A | 
| cluster\_running\_task\_count | Number of running Slurm jobs in the cluster | N/A | 
| cluster\_pending\_task\_count | Number of pending Slurm jobs in the cluster | N/A | 
| cluster\_preempted\_task\_count | Number of preempted Slurm jobs in the cluster | N/A | 
| cluster\_avg\_task\_wait\_time | Average wait time for Slurm jobs in the cluster | N/A | 
| cluster\_max\_task\_wait\_time | Maximum wait time for Slurm jobs in the cluster | N/A | 

## Instance level metrics
<a name="smcluster-slurm-metrics-instance"></a>

The following instance-level metrics are available for HyperPod. These metrics also use the `ClusterId` dimension to identify the specific HyperPod cluster.


| CloudWatch metric name | Notes | Amazon EKS Container Insights metric name | 
| --- | --- | --- | 
| node\_gpu\_utilization | Average GPU utilization across all instances | node\_gpu\_utilization | 
| node\_gpu\_memory\_utilization | Average GPU memory utilization across all instances | node\_gpu\_memory\_utilization | 
| node\_cpu\_utilization | Average CPU utilization across all instances | node\_cpu\_utilization | 
| node\_memory\_utilization | Average memory utilization across all instances | node\_memory\_utilization | 
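
If you want to retrieve these CloudWatch metrics programmatically, you can call the `GetMetricStatistics` API. The following Python sketch builds the request parameters using the namespace and dimension described above; the cluster ID is a placeholder, and the final boto3 call is shown only as a comment:

```python
import datetime

def hyperpod_metric_params(cluster_id, metric_name, hours=1, period=300):
    """Build GetMetricStatistics parameters for a HyperPod CloudWatch metric."""
    end = datetime.datetime.now(datetime.timezone.utc)
    return {
        "Namespace": "/aws/sagemaker/Clusters",  # HyperPod metrics namespace
        "MetricName": metric_name,               # e.g. node_gpu_utilization
        "Dimensions": [{"Name": "ClusterId", "Value": cluster_id}],
        "StartTime": end - datetime.timedelta(hours=hours),
        "EndTime": end,
        "Period": period,                        # seconds per datapoint
        "Statistics": ["Average"],
    }

params = hyperpod_metric_params("<ClusterId>", "node_gpu_utilization")
# To run the query (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").get_metric_statistics(**params)
```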

# SageMaker HyperPod cluster resiliency
<a name="sagemaker-hyperpod-resiliency-slurm"></a>

SageMaker HyperPod with Slurm orchestration provides the following cluster resiliency features.

**Topics**
+ [Health monitoring agent](sagemaker-hyperpod-resiliency-slurm-cluster-health-check.md)
+ [Automatic node recovery and auto-resume](sagemaker-hyperpod-resiliency-slurm-auto-resume.md)
+ [Manually replace or reboot a node using Slurm](sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance.md)

# Health monitoring agent
<a name="sagemaker-hyperpod-resiliency-slurm-cluster-health-check"></a>

This section describes the set of health checks that SageMaker HyperPod uses to regularly monitor cluster instance health for issues with devices such as accelerators (GPU and Trainium cores) and networking (EFA). The SageMaker HyperPod health monitoring agent (HMA) continuously monitors the health status of each GPU-based or Trainium-based instance. When it detects an instance or GPU failure, the agent marks the instance as unhealthy.

The SageMaker HyperPod HMA performs the same health checks for both EKS and Slurm orchestrators. For more information about the HMA, see [Health Monitoring System](sagemaker-hyperpod-eks-resiliency-health-monitoring-agent.md).

# Automatic node recovery and auto-resume
<a name="sagemaker-hyperpod-resiliency-slurm-auto-resume"></a>

**Note**  
As of September 11, 2025, HyperPod with Slurm orchestration supports health monitoring agents. Run [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) and update to the latest version of the AMI to use this functionality.

This section describes Amazon SageMaker HyperPod's two complementary resilience features: automatic node recovery, which replaces faulty infrastructure without manual intervention, and auto-resume functionality, which restarts training jobs from the last checkpoint after hardware failures.

## How automatic node recovery works
<a name="sagemaker-hyperpod-resiliency-slurm-auto-resume-how"></a>

During cluster creation or update, cluster admin users can select the node (instance) recovery option between `Automatic` (Recommended) and `None` at the cluster level. If set to `Automatic`, SageMaker HyperPod reboots or replaces faulty nodes automatically. 

**Important**  
We recommend setting the `Automatic` option. By default, the clusters are set up with Automatic node recovery.

Automatic node recovery runs when issues are found by the health monitoring agent, basic health checks, or deep health checks. If set to `None`, the health monitoring agent labels instances when a fault is detected, but it doesn't automatically initiate any repair or recovery actions on the affected nodes. We don't recommend this option.

## Running a training job with the Amazon SageMaker HyperPod auto-resume functionality
<a name="sagemaker-hyperpod-resiliency-slurm-auto-resume-job"></a>

This section describes how to run a training job with the SageMaker HyperPod auto-resume functionality, which provides a zero-touch resiliency infrastructure to automatically recover a training job from the last saved checkpoint in the event of a hardware failure.

With the auto-resume functionality, if a job fails due to a hardware failure or any transient issue during training, SageMaker HyperPod auto-resume starts the node replacement workflow and restarts the job after the faulty nodes are replaced. The following hardware checks run whenever a job fails while using auto-resume:


| Category | Utility name | Instance type compatibility | Description | 
| --- | --- | --- | --- | 
| Accelerator | NVIDIA SMI | GPU | The [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) utility is a well-known CLI for managing and monitoring GPUs. The built-in health checker parses the output from nvidia-smi to determine the health of the instance. | 
| Accelerator | Neuron sysfs | Trainium | For Trainium-powered instances, the health of the Neuron devices is determined by reading counters from [Neuron sysfs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-sysfs-user-guide.html) propagated directly by the Neuron driver. | 
| Network | EFA | GPU and Trainium | To aid in the diagnosis of Elastic Fabric Adapter (EFA) devices, the EFA health checker runs a series of connectivity tests using all available EFA cards within the instance. | 

**Note**  
When [Generic Resources (GRES)](https://slurm.schedmd.com/gres.html) are attached to a Slurm node, Slurm typically doesn't permit changes to the node allocation, such as replacing nodes, and thus doesn't allow resuming a failed job. Unless explicitly forbidden, the HyperPod auto-resume functionality automatically requeues any faulty job associated with GRES-enabled nodes. This process involves stopping the job, placing it back into the job queue, and then restarting the job from the beginning.

**Using the SageMaker HyperPod auto-resume functionality with Slurm**

When you use SageMaker HyperPod auto-resume with Slurm, run the job inside an exclusive allocation acquired with either `salloc` or `sbatch`. In either case, modify the entrypoint script so that all setup steps run in a single `srun` command when the job resumes. Through the entrypoint script, it is important to set up the environment on the replaced node to be consistent with the environment that the job step was running in before it was stopped. The following procedure shows how to prepare an entrypoint script that keeps the environment consistent and runs as a single `srun` command.

**Tip**  
If you use `sbatch`, you can keep the batch script simple by creating a separate script for setting up the environment and using a single `srun` command.

1. Create a script using the following code example and save it as `train_auto_resume.sh`. This script sets up the training environment, assuming that no manual configuration was previously made on the replaced node. This ensures that the environment is node-agnostic, so that when a node is replaced, the same environment is provisioned on it before the job resumes.
**Note**  
The following code example shows how to discover the Slurm node list associated with the job. Do not use the `$SLURM_JOB_NODELIST` environment variable provided by Slurm, because its value might be outdated after SageMaker HyperPod auto-resumes the job. The code example shows how to define a new `NODE_LIST` variable to replace `SLURM_JOB_NODELIST`, and then set up the `MASTER_NODE` and `MASTER_ADDR` variables based on the `NODE_LIST` variable.

   ```
   #!/bin/bash
   
   # Filename: train_auto_resume.sh
   # Sample containerized script to launch a training job with a single srun which can be auto-resumed.
   
   # Place your training environment setup here. 
   # Example: Install conda, docker, activate virtual env, etc.
   
   # Get the list of nodes for a given job
   NODE_LIST=$(scontrol show jobid=$SLURM_JOBID | # Show details of the Slurm job
               awk -F= '/NodeList=/{print $2}' |  # Extract the NodeList field
               grep -v Exc)                       # Exclude nodes marked as excluded
   
   # Determine the master node from the node list
   MASTER_NODE=$(scontrol show hostname $NODE_LIST | # Convert the node list to hostnames
                 head -n 1)                          # Select the first hostname as the master node
   
   # Get the master node address
   MASTER_ADDR=$(scontrol show node=$MASTER_NODE | # Show node information
                 awk -F= '/NodeAddr=/{print $2}' | # Extract the NodeAddr field
                 awk '{print $1}')                 # Print the first part of NodeAddr
   
   # Torchrun command to launch the training job
   torchrun_cmd="torchrun --nnodes=$SLURM_NNODES \
                          --nproc_per_node=1 \
                          --node_rank=$SLURM_NODEID \
                          --master_addr=$MASTER_ADDR \
                          --master_port=1234 \
                          <your_training_script.py>"
   
   # Execute the torchrun command in the 'pytorch' Conda environment,
   # streaming output live
   /opt/conda/bin/conda run --live-stream -n pytorch $torchrun_cmd
   ```
**Tip**  
You can use the preceding script to add more commands for installing any additional dependencies for your job. However, we recommend that you keep dependency installation in the [set of lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) that run during cluster creation. If you use a virtual environment hosted on a shared directory, you can also use this script to activate the virtual environment.

1. Launch the job with SageMaker HyperPod auto-resume enabled by adding the flag `--auto-resume=1` to indicate that the `srun` command should be automatically retried in case of hardware failure. 
**Note**  
If you have set up a resource allocation using `sbatch` or `salloc`, you can run multiple `srun` commands within the allocation. In the event of a failure, the SageMaker HyperPod auto-resume functionality only operates in the current [job step](https://slurm.schedmd.com/job_launch.html#step_allocation) of the `srun` command with the flag `--auto-resume=1`. In other words, activating auto-resume in an `srun` command doesn't apply to other `srun` commands launched within a resource allocation session.

   The following are `srun` command examples with `auto-resume` enabled.

   **Using sbatch**

   Because most of the logic for setting up the environment is already in `train_auto_resume.sh`, the batch script should be simple and similar to the following code example. Assume that the following batch script is saved as `batch.sh`.

   ```
   #!/bin/bash
   #SBATCH --nodes 2
   #SBATCH --exclusive
   srun --auto-resume=1 train_auto_resume.sh
   ```

   Run the preceding batch script using the following command.

   ```
   sbatch batch.sh
   ```

   **Using salloc**

   Start by acquiring an exclusive allocation, and run the `srun` command with the `--auto-resume` flag and the entrypoint script.

   ```
   salloc -N 2 --exclusive
   srun --auto-resume=1 train_auto_resume.sh
   ```

## How automatic node recovery and auto-resume work together
<a name="sagemaker-hyperpod-resiliency-slurm-auto-resume-node-recovery"></a>

When both automatic node recovery and auto-resume are active, they follow a coordinated approach to handling failures. If the health monitoring agent detects a hardware fault, the node is marked for drain regardless of job-level status. With automatic node recovery enabled, the nodes are replaced automatically once all jobs running on them exit. In this scenario, for jobs with auto-resume enabled, a non-zero exit status in the job step triggers auto-resume, and the jobs resume once the nodes are replaced. Jobs without auto-resume enabled simply exit and require manual resubmission by administrators or users.

**Note**  
If you use auto-resume, the nodes are always replaced (no reboots) when hardware failures are detected.

# Manually replace or reboot a node using Slurm
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance"></a>

This section describes when you should manually reboot or replace a node and provides instructions for both.

## When to manually reboot or replace a node
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-when"></a>

The HyperPod auto-resume functionality monitors whether the state of your Slurm nodes changes to `fail` or `down`. You can check the state of Slurm nodes by running `sinfo`.
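For example, the following sketch filters `sinfo`-style output for nodes in the `fail` or `down` state. The sample output is illustrative; on a real cluster, pipe the output of `sinfo -N -h -o "%N %t %E"` (node name, state, reason) instead of the stand-in function.

```shell
#!/bin/bash
# Sketch: find nodes in the "fail" or "down" state. The here-document stands
# in for real sinfo output; the node names and reasons are examples.
sample_sinfo_output() {
cat <<'EOF'
ip-10-1-2-3 idle none
ip-10-1-2-4 fail Action:Replace
ip-10-1-2-5 down Not responding
EOF
}

# On a real cluster, replace `sample_sinfo_output` with: sinfo -N -h -o "%N %t %E"
sample_sinfo_output | awk '$2 == "fail" || $2 == "down"'
```

This prints only the two problem nodes, along with their state and reason.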

If a node remains stuck or unresponsive and the auto-resume process does not recover it, you can manually initiate recovery. The choice between rebooting and replacing a node depends on the nature of the issue. Consider rebooting when facing temporary or software-related problems, such as system hangs, memory leaks, GPU driver issues, kernel updates, or hung processes. However, if you encounter persistent or hardware-related problems like failing GPUs, memory or networking faults, repeated health check failures, or nodes that remain unresponsive after multiple reboot attempts, node replacement is the more appropriate solution.

## Ways to manually reboot or replace nodes
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-ways"></a>

SageMaker HyperPod offers two methods for manual node recovery. The preferred approach is using the SageMaker HyperPod reboot and replace APIs, which provide a faster and more transparent recovery process that works across all orchestrators. Alternatively, you can use traditional Slurm commands like `scontrol update`, though this legacy method requires direct access to the Slurm controller node. Both methods activate the same SageMaker HyperPod recovery processes.

## Manually reboot a node using reboot API
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-reboot-api"></a>

You can use the **BatchRebootClusterNodes** API to manually reboot a faulty node in your SageMaker HyperPod cluster.

The following example runs the reboot operation on two instances of a cluster using the AWS Command Line Interface (AWS CLI):

```
aws sagemaker batch-reboot-cluster-nodes \
    --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
    --node-ids i-0123456789abcdef0 i-0fedcba9876543210
```

## Manually replace a node using replace API
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-replace-api"></a>

You can use the **BatchReplaceClusterNodes** API to manually replace a faulty node in your SageMaker HyperPod cluster.

The following example runs the replace operation on two instances of a cluster using the AWS Command Line Interface (AWS CLI):

```
aws sagemaker batch-replace-cluster-nodes \
    --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
    --node-ids i-0123456789abcdef0 i-0fedcba9876543210
```

## Manually reboot a node using Slurm
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-reboot"></a>

You can also use Slurm `scontrol` commands to trigger node recovery. These commands interact directly with the Slurm control plane and invoke the same underlying SageMaker HyperPod recovery mechanisms.

In the following command, replace `<ip-ipv4>` with the Slurm node name (host name) of the faulty instance that you want to reboot.

```
scontrol update node=<ip-ipv4> state=fail reason="Action:Reboot"
```

This marks the node as FAIL with the specified reason. SageMaker HyperPod detects this and reboots the instance. Avoid changing the node state or restarting the Slurm controller during the operation.

## Manually replace a node using Slurm
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-replace"></a>

You can use the scontrol update command as follows to replace a node.

In the following command, replace `<ip-ipv4>` with the Slurm node name (host name) of the faulty instance you want to replace.

```
scontrol update node=<ip-ipv4> state=fail reason="Action:Replace"
```

After running this command, the node goes into the `fail` state, waits for the currently running jobs to finish, is replaced with a healthy instance, and recovers with the same host name. This process takes time, depending on instance availability in your Availability Zone and the time it takes to run your lifecycle scripts. During the update and replacement processes, avoid changing the state of the node manually again or restarting the Slurm controller; doing so can lead to a replacement failure. If the node doesn't recover or turn to the `idle` state after a long time, contact [AWS Support](https://console.aws.amazon.com/support/).

## Manually force change a node
<a name="sagemaker-hyperpod-resiliency-slurm-replace-faulty-instance-force"></a>

If the faulty node is continuously stuck in the `fail` state, as a last resort you can manually force the node state to `down`. This requires administrator privileges (sudo permissions).

**Warning**  
Proceed carefully before you run the following command, as it force-kills all running jobs and you might lose unsaved work.

```
scontrol update node=<ip-ipv4> state=down reason="Action:Replace"
```

# Continuous provisioning for enhanced cluster operations with Slurm
<a name="sagemaker-hyperpod-scaling-slurm"></a>

Amazon SageMaker HyperPod clusters created with Slurm orchestration now support continuous provisioning, a capability that enables greater flexibility and efficiency when running large-scale AI/ML workloads. Continuous provisioning lets you start training quickly, scale seamlessly, perform maintenance without disrupting operations, and have granular visibility into cluster operations.

**Note**  
Continuous provisioning is available as an optional configuration for new HyperPod clusters created with Slurm orchestration. Existing clusters using the previous scaling model cannot be migrated to continuous provisioning at this time.

## How it works
<a name="sagemaker-hyperpod-scaling-slurm-how"></a>

The continuous provisioning system introduces a desired-state architecture that replaces the traditional all-or-nothing scaling model. In the previous model, if any instance group could not be fully provisioned, the entire cluster creation or update operation failed and rolled back. With continuous provisioning, the system accepts partial capacity and continues to provision remaining instances asynchronously.

The continuous provisioning system:
+ **Accepts the request**: Records the target instance count for each instance group.
+ **Initiates provisioning**: Begins launching instances for all instance groups in parallel.
+ **Provisions priority nodes first**: The cluster transitions to `InService` after at least one controller node (and one login node, if a login instance group is specified) is successfully provisioned.
+ **Tracks progress**: Monitors each instance launch attempt and records the status.
+ **Handles failures**: Automatically retries failed launches for worker nodes asynchronously.

Continuous provisioning is disabled by default. To use this feature, set `NodeProvisioningMode` to `Continuous` in your `CreateCluster` request.

With continuous provisioning enabled, you can initiate multiple scaling operations simultaneously without waiting for previous operations to complete. This lets you scale different instance groups in the same cluster concurrently and submit multiple scaling requests to the same instance group.

## Priority-based provisioning
<a name="sagemaker-hyperpod-scaling-slurm-priority"></a>

Slurm clusters require a controller node to be operational before worker nodes can register and accept jobs. Continuous provisioning handles this automatically through priority-based provisioning:

1. The controller instance group is provisioned first.

1. Once one controller node is healthy, login nodes and worker nodes begin provisioning in parallel.

1. The cluster transitions to `InService` when one controller node is up and one login node is up (if a login instance group is specified). If no login instance group is specified, the cluster transitions to `InService` as soon as the controller node is provisioned.

1. Worker nodes that cannot be immediately provisioned due to capacity constraints enter an asynchronous retry loop and are added to the Slurm cluster automatically as they become available.

## Controller failure handling
<a name="sagemaker-hyperpod-scaling-slurm-controller-failure"></a>

During cluster creation, if the controller node fails to provision, the behavior depends on whether the error is retryable or non-retryable.

**Retryable errors** (for example, unhealthy instance or transient failures):
+ HyperPod continuously replaces the instance and retries provisioning until the controller comes up.
+ Worker and login nodes that have already been provisioned remain available, but the cluster does not transition to `InService` until the controller is healthy.

**Non-retryable errors** (for example, no capacity available for the controller instance type or lifecycle script failure):
+ The cluster is marked as `Failed`.
+ You are notified of the failure reason and must take corrective action, such as choosing a different instance type, fixing lifecycle scripts, or retrying in a different Availability Zone.

## Prerequisites
<a name="sagemaker-hyperpod-scaling-slurm-prerequisites"></a>

Continuous provisioning requires that Slurm provisioning parameters (node types, partition names) are provided via the API payload in each instance group's `SlurmConfig` field. Clusters that rely on the legacy `provisioning_parameters.json` file in Amazon S3 are not compatible with continuous provisioning.

**Note**  
The following features are not currently supported with continuous provisioning on Slurm clusters: migration of existing clusters, multi-head node configuration via API-based Slurm topology, and `SlurmConfigStrategy`. Continuous provisioning operates exclusively in merge mode for `slurm.conf` management.

## Usage metering
<a name="sagemaker-hyperpod-scaling-slurm-metering"></a>

HyperPod clusters with continuous provisioning use instance-level metering to provide accurate billing that reflects actual resource usage. This metering approach differs from traditional cluster-level billing by tracking each instance independently.

**Instance-level billing**

With continuous provisioning, billing starts and stops at the individual instance level rather than waiting for cluster-level state changes. This approach provides the following benefits:
+ **Precise billing accuracy**: Billing starts when lifecycle script execution begins. If the lifecycle script fails, provisioning of the instance is retried, and you are charged for the duration of the lifecycle script runtime.
+ **Independent metering**: Each instance's billing lifecycle is managed separately, preventing cascading billing errors.
+ **Real-time billing updates**: Billing starts when an instance begins executing its lifecycle configuration script and stops when the instance enters a terminating state.

**Billing lifecycle**

Each instance in your HyperPod cluster follows this billing lifecycle:
+ **Billing starts**: When the instance successfully launches and begins executing its lifecycle configuration script.
+ **Billing continues**: Throughout the instance's operational lifetime.
+ **Billing stops**: When the instance enters a terminating state, regardless of the reason for termination.

**Note**  
Billing does not start for instances that fail to launch. If an instance launch fails due to insufficient capacity or other issues, you are not charged for that failed attempt. Billing is calculated at the instance level and costs are aggregated and reported under your cluster's Amazon Resource Name (ARN).
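As a small illustration of how instance-level metering aggregates under the cluster ARN, the following sketch sums per-instance billable hours. The instance IDs and durations are hypothetical; a launch that fails never starts billing, so it contributes nothing to the total.

```shell
# Sketch: aggregate hypothetical per-instance billable hours for a cluster.
# One instance ran a full day; the other ran its lifecycle script for 15
# minutes before being replaced. Failed launches are simply absent.
printf '%s\n' \
    "i-0123456789abcdef0 24.00" \
    "i-0fedcba9876543210 0.25" |
awk '{ total += $2 } END { printf "%.2f\n", total }'
# prints 24.25
```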

## Create a cluster with continuous provisioning enabled
<a name="sagemaker-hyperpod-scaling-slurm-create"></a>

**Note**  
Prepare a lifecycle configuration script and upload it to an Amazon S3 bucket that your execution role can access. For more information, see [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md).

Prepare a `CreateCluster` API request file in JSON format. Set `NodeProvisioningMode` to `Continuous` and provide Slurm topology information in each instance group's `SlurmConfig` field.

```
// create_cluster.json
{
    "ClusterName": "my-training-cluster",
    "NodeProvisioningMode": "Continuous",
    "Orchestrator": {
        "Slurm": {}
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "controller-group",
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
            "SlurmConfig": {
                "NodeType": "Controller"
            }
        },
        {
            "InstanceGroupName": "login-group",
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
            "SlurmConfig": {
                "NodeType": "Login"
            }
        },
        {
            "InstanceGroupName": "worker-gpu-a",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 16,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
            "SlurmConfig": {
                "NodeType": "Compute",
                "PartitionNames": ["gpu-training"]
            }
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": ["sg-12345678"],
        "Subnets": ["subnet-12345678"]
    }
}
```

Run the `create-cluster` command to submit the request.

```
aws sagemaker create-cluster \
    --cli-input-json file://complete/path/to/create_cluster.json
```

This returns the ARN of the new cluster.

```
{
    "ClusterArn": "arn:aws:sagemaker:us-west-2:111122223333:cluster/abcde12345"
}
```
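Because the cluster can transition to `InService` while worker nodes are still provisioning asynchronously, you may want to poll the cluster status after submitting the request. The following sketch extracts `ClusterStatus` from a trimmed, illustrative `DescribeCluster` response; on a live cluster, you would produce the JSON with the AWS CLI command shown in the comment.

```shell
# Sketch: check whether the cluster has reached InService. The JSON below is
# a trimmed sample response; on a live cluster, generate it with:
#   aws sagemaker describe-cluster --cluster-name my-training-cluster
response='{"ClusterArn":"arn:aws:sagemaker:us-west-2:111122223333:cluster/abcde12345","ClusterStatus":"InService"}'
status=$(printf '%s' "$response" |
    python3 -c 'import json, sys; print(json.load(sys.stdin)["ClusterStatus"])')
echo "cluster status: $status"
```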

## Slurm configuration management
<a name="sagemaker-hyperpod-scaling-slurm-config"></a>

Continuous provisioning operates exclusively in merge mode for `slurm.conf` partition management. In merge mode, HyperPod applies its partition configuration changes additively on top of whatever you have modified in `slurm.conf`. HyperPod only updates the partition-related sections of `slurm.conf` (such as partition name and node name entries); other Slurm configuration parameters are not modified. This means:
+ Your manual edits to `slurm.conf` are preserved.
+ There is no automated drift detection or resolution of conflicts between your modifications and HyperPod's expected state.
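As an illustrative sketch (the values are examples, not output generated by HyperPod), a merged `slurm.conf` might contain partition-related entries that HyperPod manages alongside entries that you set yourself:

```
# Partition-related entries managed by HyperPod in merge mode
NodeName=ip-10-1-2-[4-19] Gres=gpu:8 State=CLOUD
PartitionName=gpu-training Nodes=ip-10-1-2-[4-19] Default=YES State=UP

# Your own entries are preserved as-is
SchedulerType=sched/backfill
```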

The `SlurmConfigStrategy` parameter (`Managed`, `Merge`, `Overwrite`) is not supported with continuous provisioning. Passing any `SlurmConfigStrategy` value results in an API error.

# SageMaker HyperPod cluster management
<a name="sagemaker-hyperpod-cluster-management-slurm"></a>

The following topics discuss logging and managing SageMaker HyperPod clusters.

## Logging SageMaker HyperPod events
<a name="sagemaker-hyperpod-cluster-management-slurm-logging-hyperpod-events"></a>

All events and logs from SageMaker HyperPod are saved to Amazon CloudWatch under the log group name `/aws/sagemaker/Clusters/[ClusterName]/[ClusterID]`. Every call to the `CreateCluster` API creates a new log group. The following table shows the log streams collected in each log group.


| Log Group Name | Log Stream Name |
| --- | --- |
| /aws/sagemaker/Clusters/[ClusterName]/[ClusterID] | LifecycleConfig/[instance-group-name]/[instance-id] |
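Following this naming pattern, you can construct the log group and stream names for a given instance and fetch its logs with the AWS CLI. All of the identifiers below are examples.

```shell
# Sketch: build the CloudWatch log group and stream names for a cluster
# instance, following the documented pattern. All identifiers are examples.
cluster_name="my-training-cluster"
cluster_id="abcde12345"
instance_group="worker-gpu-a"
instance_id="i-0123456789abcdef0"

log_group="/aws/sagemaker/Clusters/${cluster_name}/${cluster_id}"
log_stream="LifecycleConfig/${instance_group}/${instance_id}"
echo "$log_group"
echo "$log_stream"

# On a machine with the AWS CLI configured:
#   aws logs get-log-events --log-group-name "$log_group" --log-stream-name "$log_stream"
```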

## Logging SageMaker HyperPod at instance level
<a name="sagemaker-hyperpod-cluster-management-slurm-logging-at-instance-level"></a>

You can access the LifecycleScript logs published to CloudWatch during cluster instance configuration. Every instance within the created cluster generates a separate log stream, distinguishable by the `LifecycleConfig/[instance-group-name]/[instance-id]` format. 

All logs that are written to `/var/log/provision/provisioning.log` are uploaded to the preceding CloudWatch stream. Sample LifecycleScripts at [https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config) redirect their `stdout` and `stderr` to this location. If you are using your custom scripts, write your logs to the `/var/log/provision/provisioning.log` location for them to be available in CloudWatch.
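The following sketch shows one way a custom lifecycle entrypoint can mirror its output to that location. For illustration, the sketch writes to a temporary directory so it can run anywhere; on a real cluster node, you would set `LOG_FILE` to `/var/log/provision/provisioning.log`. The setup commands are placeholders.

```shell
#!/bin/bash
# Sketch: mirror a custom lifecycle script's stdout and stderr to the file
# the CloudWatch agent tails. This writes to a temporary directory so the
# sketch is runnable anywhere; on a cluster node, use
#   LOG_FILE=/var/log/provision/provisioning.log
LOG_DIR="$(mktemp -d)"
LOG_FILE="$LOG_DIR/provisioning.log"

run_setup() {
    echo "[custom-lifecycle] environment setup starting"
    # Place your real setup commands here (install packages, mount storage, etc.)
    echo "[custom-lifecycle] environment setup finished"
}

# tee keeps output on stdout for interactive debugging and appends it to the log
run_setup 2>&1 | tee -a "$LOG_FILE"
```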

**Lifecycle script log markers**

CloudWatch logs for lifecycle scripts include specific markers to help you track execution progress and identify issues:


| Marker | Description |
| --- | --- |
| START | Indicates the beginning of lifecycle script logs for the instance |
| [SageMaker] Lifecycle scripts were provided, with S3 uri: [s3://bucket-name/] and entrypoint script: [script-name.sh] | Indicates the S3 location and entrypoint script that will be used |
| [SageMaker] Downloading lifecycle scripts | Indicates scripts are being downloaded from the specified S3 location |
| [SageMaker] Lifecycle scripts have been downloaded | Indicates scripts have been successfully downloaded from S3 |
| [SageMaker] The lifecycle scripts succeeded | Indicates successful completion of all lifecycle scripts |
| [SageMaker] The lifecycle scripts failed | Indicates failed execution of lifecycle scripts |

These markers help you quickly identify where in the lifecycle script execution process an issue occurred. When troubleshooting failures, review the log entries to identify where the process stopped or failed.

**Lifecycle script failure messages**

If the lifecycle script exists but fails during execution, you receive an error message that includes the CloudWatch log group name and log stream name. If lifecycle scripts fail across multiple instances, the error message indicates only one failed instance, but the log group contains streams for all affected instances.

You can view the error message by running the [DescribeCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html) API or by viewing the cluster details page in the SageMaker console. In the console, a **View lifecycle script logs** button is provided that navigates directly to the CloudWatch log stream. The error message has the following format:

```
Instance [instance-id] failed to provision with the following error: "Lifecycle scripts did not run successfully. To view lifecycle script logs,
visit log group '/aws/sagemaker/Clusters/[cluster-name]/[cluster-id]' and log stream 'LifecycleConfig/[instance-group-name]/[instance-id]'.
If you cannot find corresponding lifecycle script logs in CloudWatch, please make sure you follow one of the options here:
https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-faq-slurm.html#hyperpod-faqs-q1." Note that multiple instances may be impacted.
```

## Tagging resources
<a name="sagemaker-hyperpod-cluster-management-slurm-tagging"></a>

The AWS tagging system helps you manage, identify, organize, search for, and filter resources. SageMaker HyperPod supports tagging, so you can manage clusters as AWS resources. You can add or edit tags for a cluster during cluster creation or when editing an existing cluster. To learn more about tagging in general, see [Tagging your AWS resources](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).

### Using the SageMaker HyperPod console UI
<a name="sagemaker-hyperpod-cluster-management-slurm-tagging-in-console"></a>

When you are [creating a new cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-create-cluster) and [editing a cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-edit-clusters), you can add, remove, or edit tags.

### Using the SageMaker HyperPod APIs
<a name="sagemaker-hyperpod-cluster-management-slurm-tagging-in-api-request"></a>

When you write a [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) or [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) API request file in JSON format, edit the `Tags` section.

### Using the AWS CLI tagging commands for SageMaker AI
<a name="sagemaker-hyperpod-cluster-management-slurm-tagging-using-cli"></a>

**To tag a cluster**

Use [add-tags](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/add-tags.html) as follows.

```
aws sagemaker add-tags --resource-arn cluster_ARN --tags Key=string,Value=string
```

**To untag a cluster**

Use [delete-tags](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-tags.html) as follows.

```
aws sagemaker delete-tags --resource-arn cluster_ARN --tag-keys "tag_key"
```

**To list tags for a resource**

Use [list-tags](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-tags.html) as follows.

```
aws sagemaker list-tags --resource-arn cluster_ARN
```

# SageMaker HyperPod FAQs
<a name="sagemaker-hyperpod-faq-slurm"></a>

Use the following frequently asked questions to troubleshoot problems with using SageMaker HyperPod.

**Topics**
+ [Why can't I find log groups of my SageMaker HyperPod cluster in Amazon CloudWatch?](#hyperpod-faqs-q1)
+ [What particular configurations does HyperPod manage in Slurm configuration files such as `slurm.conf` and `gres.conf`?](#hyperpod-faqs-q2)
+ [How do I run Docker on Slurm nodes on HyperPod?](#hyperpod-faqs-q3)
+ [Why does my parallel training job fail when I use NVIDIA Collective Communications Library (NCCL) with Slurm on SageMaker HyperPod platform?](#hyperpod-faqs-q4)
+ [How do I use local NVMe store of P instances for launching Docker or Enroot containers with Slurm?](#hyperpod-faqs-q5)
+ [How to set up EFA security groups?](#hyperpod-faqs-q6)
+ [How do I monitor my HyperPod cluster nodes? Are there any CloudWatch metrics exported from HyperPod?](#hyperpod-faqs-q7)
+ [Can I add an additional storage to the HyperPod cluster nodes? The cluster instances have limited local instance store.](#hyperpod-faqs-q8)
+ [Why are my compute nodes showing as "DOWN" or "DRAINED" after a reboot?](#hyperpod-faqs-q9)
+ [Why do my nodes keep getting drained due to Out of Memory (OOM) issues?](#hyperpod-faqs-q10)
+ [How can I ensure resources are properly cleaned up after jobs complete?](#hyperpod-faqs-q11)

## Why can't I find log groups of my SageMaker HyperPod cluster in Amazon CloudWatch?
<a name="hyperpod-faqs-q1"></a>

By default, agent logs and instance start-up logs are sent to the HyperPod platform account's CloudWatch. Lifecycle configuration logs from your lifecycle scripts are sent to your own account's CloudWatch.

If you use the [sample lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) provided by the HyperPod service team, you can expect to find the lifecycle configuration logs written to `/var/log/provision/provisioning.log`, and you wouldn’t encounter this problem.

However, if you use custom paths for collecting logs from lifecycle provisioning and can't find the log groups in your account's CloudWatch, it might be because the log file paths specified in your lifecycle scripts don't match the paths that the CloudWatch agent running on the HyperPod cluster instances looks for. In this case, you need to set up your lifecycle scripts to send logs to the CloudWatch agent and configure the CloudWatch agent accordingly. To resolve the problem, choose one of the following options.
+ **Option 1:** Update your lifecycle scripts to write logs to `/var/log/provision/provisioning.log`.
+ **Option 2:** Update the CloudWatch agent to look for your custom paths for logging lifecycle provisioning.

  1. Each HyperPod cluster instance contains a CloudWatch agent configuration file in JSON format at `/opt/aws/amazon-cloudwatch-agent/sagemaker_cwagent_config.json`. In the configuration file, find the field name `logs.logs_collected.files.collect_list.file_path`. With the default setup by HyperPod, the key-value pair should be `"file_path": "/var/log/provision/provisioning.log"` as documented at [Logging SageMaker HyperPod at instance level](sagemaker-hyperpod-cluster-management-slurm.md#sagemaker-hyperpod-cluster-management-slurm-logging-at-instance-level). The following code snippet shows how the JSON file looks with the HyperPod default configuration.

     ```
     "logs": {
         "logs_collected": {
             "files": {
                 "collect_list": [
                     {
                         "file_path": "/var/log/provision/provisioning.log",
                         "log_group_name": "/aws/sagemaker/Clusters/[ClusterName]/[ClusterID]",
                         "log_stream_name": "LifecycleConfig/[InstanceGroupName]/{instance_id}",
                         "retention_in_days": -1
                     }
                 ]
             }
         },
         "force_flush_interval": 3
     }
     ```

  1. Replace the value for the `"file_path"` field name with the custom path you use in your lifecycle scripts. For example, if you have set up your lifecycle scripts to write to `/var/log/custom-provision/custom-provisioning.log`, update the value to match with it as follows.

     ```
     "file_path": "/var/log/custom-provision/custom-provisioning.log"
     ```

  1. Restart the CloudWatch agent with the configuration file to finish applying the custom path. For example, the following command restarts the CloudWatch agent with the configuration file from step 1. For more information, see [Troubleshooting the CloudWatch agent](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/troubleshooting-CloudWatch-Agent.html).

     ```
     sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
         -a fetch-config -m ec2 -s -c \
         file:/opt/aws/amazon-cloudwatch-agent/sagemaker_cwagent_config.json
     ```

## What particular configurations does HyperPod manage in Slurm configuration files such as `slurm.conf` and `gres.conf`?
<a name="hyperpod-faqs-q2"></a>

When you create a Slurm cluster on HyperPod, the HyperPod agent sets up the [https://slurm.schedmd.com/slurm.conf.html](https://slurm.schedmd.com/slurm.conf.html) and [https://slurm.schedmd.com/gres.conf.html](https://slurm.schedmd.com/gres.conf.html) files at `/opt/slurm/etc/` to manage the Slurm cluster based on your HyperPod cluster creation request and lifecycle scripts. The following list shows what specific parameters the HyperPod agent handles and overwrites. 

**Important**  
We strongly recommend that you DON’T change these parameters managed by HyperPod.
+ In [https://slurm.schedmd.com/slurm.conf.html](https://slurm.schedmd.com/slurm.conf.html), HyperPod sets up the following basic parameters: `ClusterName`, `SlurmctldHost`, `PartitionName`, and `NodeName`.

  Also, to enable the [Automatic node recovery and auto-resume](sagemaker-hyperpod-resiliency-slurm-auto-resume.md) functionality, HyperPod requires the `TaskPlugin` and `SchedulerParameters` parameters set as follows. The HyperPod agent sets up these two parameters with the required values by default.

  ```
  TaskPlugin=task/none
  SchedulerParameters=permit_job_expansion
  ```
+ In [https://slurm.schedmd.com/gres.conf.html](https://slurm.schedmd.com/gres.conf.html), HyperPod manages `NodeName` for GPU nodes.
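For illustration only, the managed entries resemble the following sketch. The cluster, host, partition, node names, and device counts here are hypothetical placeholders, not values that HyperPod generates; the actual entries depend on your cluster creation request.

```
# slurm.conf (HyperPod-managed entries)
ClusterName=example-hyperpod-cluster
SlurmctldHost=ip-10-1-0-10
PartitionName=dev Nodes=ip-10-1-0-[11-14] Default=YES State=UP
NodeName=ip-10-1-0-[11-14] CPUs=96 State=UNKNOWN

# gres.conf (HyperPod-managed entries for GPU nodes)
NodeName=ip-10-1-0-[11-14] Name=gpu File=/dev/nvidia[0-7]
```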

## How do I run Docker on Slurm nodes on HyperPod?
<a name="hyperpod-faqs-q3"></a>

To help you run Docker on your Slurm nodes running on HyperPod, the HyperPod service team provides setup scripts that you can include as part of the lifecycle configuration for cluster creation. To learn more, see [Base lifecycle scripts provided by HyperPod](sagemaker-hyperpod-lifecycle-best-practices-slurm-slurm-base-config.md) and [Running Docker containers on a Slurm compute node on HyperPod](sagemaker-hyperpod-run-jobs-slurm-docker.md).

## Why does my parallel training job fail when I use NVIDIA Collective Communications Library (NCCL) with Slurm on SageMaker HyperPod platform?
<a name="hyperpod-faqs-q4"></a>

By default, the Linux OS sets `RemoveIPC=yes` in `/etc/systemd/logind.conf`. Slurm and mpirun jobs that use NCCL generate inter-process communication (IPC) resources under non-root user sessions, and these user sessions might log out while the job is running.

When you run jobs with Slurm or mpirun, if `systemd` detects that the user isn't logged in, it cleans up the IPC resources. Slurm and mpirun jobs can run without the user being logged in, but that requires that you disable cleanup at the systemd level and set it up at the Slurm level instead. For more information, see [Systemd in the NCCL documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#systemd).

To disable cleanup at the systemd level, complete the following steps.

1. Set `RemoveIPC=no` in the file `/etc/systemd/logind.conf` (uncommenting the line if necessary) if you're running training jobs that use Slurm and NCCL.

1.  By default, Slurm doesn't clean up shared resources. We recommend that you set up a Slurm epilog script to clean up shared resources. This cleanup is useful when you have a lot of shared resources and want to clean them up after training jobs. The following is an example script.

   ```
   #!/bin/bash
   : <<'SUMMARY'
   Script: epilog.sh
   
   Use this script with caution, as it can potentially delete resources that are still needed and cause issues if you don't use it correctly.
   
   Note: You must save this script in a shared location that is accessible to all nodes in the cluster, such as an /fsx volume.
   Worker nodes must be able to access the script so that they can run it after jobs complete.
   
   SUMMARY
   
   # Define the log directory and create it if it doesn't exist
   LOG_DIR="/<PLACEHOLDER>/epilogue" #NOTE: Update PLACEHOLDER to be a shared value path, such as /fsx/epilogue.
   mkdir -p "$LOG_DIR"
   
   # Name the log file using the Slurm job name and job ID
   log_file="$LOG_DIR/epilogue-${SLURM_JOB_NAME}_${SLURM_JOB_ID}.log"
   
   logging() {
       echo "[$(date)] $1" | tee -a "$log_file"
   }
   
   # Slurm epilogue script to clean up IPC resources
   logging "Starting IPC cleanup for Job $SLURM_JOB_ID"
   
   # Clean up shared memory segments by username
   for seg in $(ipcs -m | awk -v owner="$SLURM_JOB_USER" '$3 == owner {print $2}'); do
       if ipcrm -m "$seg"; then
           logging "Removed shared memory segment $seg"
       else
           logging "Failed to remove shared memory segment $seg"
       fi
   done
   
   # Clean up semaphores by username
   for sem in $(ipcs -s | awk -v user="$SLURM_JOB_USER" '$3 == user {print $2}'); do
       if ipcrm -s "$sem"; then
           logging "Removed semaphore $sem"
       else
           logging "Failed to remove semaphore $sem"
       fi
   done
   
   # Clean up NCCL IPC
   NCCL_IPC_PATH="/dev/shm/nccl-*"
   for file in $NCCL_IPC_PATH; do
       if [ -e "$file" ]; then
           if rm "$file"; then
               logging "Removed NCCL IPC file $file"
           else
               logging "Failed to remove NCCL IPC file $file"
           fi
       fi
   done
   logging "IPC cleanup completed for Job $SLURM_JOB_ID"
   exit 0
   ```

   For more information about the Epilog parameter, see [Slurm documentation](https://slurm.schedmd.com/slurm.conf.html#OPT_Epilog).

1. In the `slurm.conf` file from the controller node, add in a line to point to the epilog script you created.

   ```
   Epilog=/path/to/epilog.sh  # For example: /fsx/epilogue/epilog.sh
   ```

1. Run the following commands to change permissions of the script and to make it executable.

   ```
   chown slurm:slurm /path/to/epilog.sh
   chmod +x /path/to/epilog.sh
   ```

1. To apply all of your changes, run `scontrol reconfigure`.
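To see how the epilog script's `ipcs`-plus-`awk` selection works, you can run the filtering logic against sample output. The following sketch uses made-up usernames and segment IDs instead of a live system; in the epilog script, `$SLURM_JOB_USER` takes the place of the hard-coded owner.

```shell
# Sample `ipcs -m` output (hypothetical owners and segment IDs)
sample_ipcs_output="------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 32768      alice      600        524288     2
0x00000000 32769      bob        600        524288     0"

# Select the segment IDs owned by a given user, as the epilog script does.
# The header lines are skipped automatically because their third field
# doesn't match the owner name.
segs=$(printf '%s\n' "$sample_ipcs_output" | awk -v owner="bob" '$3 == owner {print $2}')
echo "$segs"   # prints: 32769
```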

## How do I use local NVMe store of P instances for launching Docker or Enroot containers with Slurm?
<a name="hyperpod-faqs-q5"></a>

Because the default root volume of your head node is usually limited to a 100 GB Amazon EBS volume, you need to set up Docker and Enroot to use the local NVMe instance store. To learn how to set up the NVMe store and use it for launching Docker containers, see [Running Docker containers on a Slurm compute node on HyperPod](sagemaker-hyperpod-run-jobs-slurm-docker.md).

## How do I set up EFA security groups?
<a name="hyperpod-faqs-q6"></a>

If you want to create a HyperPod cluster with EFA-enabled instances, make sure that you set up a security group to allow all inbound and outbound traffic to and from the security group itself. To learn more, see [Step 1: Prepare an EFA-enabled security group](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security) in the *Amazon EC2 User Guide*.

## How do I monitor my HyperPod cluster nodes? Are there any CloudWatch metrics exported from HyperPod?
<a name="hyperpod-faqs-q7"></a>

To gain observability into the resource utilization of your HyperPod cluster, we recommend that you integrate the HyperPod cluster with Amazon Managed Grafana and Amazon Managed Service for Prometheus. With various open-source Grafana dashboards and exporter packages, you can export and visualize metrics related to the HyperPod cluster resources. To learn more about setting up SageMaker HyperPod with Amazon Managed Grafana and Amazon Managed Service for Prometheus, see [SageMaker HyperPod cluster resources monitoring](sagemaker-hyperpod-cluster-observability-slurm.md). Note that SageMaker HyperPod currently doesn't support exporting system metrics to Amazon CloudWatch.

## Can I add an additional storage to the HyperPod cluster nodes? The cluster instances have limited local instance store.
<a name="hyperpod-faqs-q8"></a>

If the default instance storage is insufficient for your workload, you can configure additional storage per instance. Starting from the [release on June 20, 2024](sagemaker-hyperpod-release-notes.md#sagemaker-hyperpod-release-notes-20240620), you can add an additional Amazon Elastic Block Store (EBS) volume to each instance in your SageMaker HyperPod cluster. Note that this capability cannot be applied to existing instance groups in clusters created before June 20, 2024; to use it with those clusters, patch them and add new instance groups. The capability is fully effective for any SageMaker HyperPod cluster created after June 20, 2024.
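The additional volume is specified per instance group through the `InstanceStorageConfigs` field of the cluster API. The following fragment is a sketch of the relevant part of a `CreateCluster` or `UpdateCluster` request; the instance group name, instance type, counts, and volume size are placeholders you would replace with your own values.

```
"InstanceGroups": [
    {
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p5.48xlarge",
        "InstanceCount": 4,
        "InstanceStorageConfigs": [
            {
                "EbsVolumeConfig": {
                    "VolumeSizeInGB": 500
                }
            }
        ]
    }
]
```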

## Why are my compute nodes showing as "DOWN" or "DRAINED" after a reboot?
<a name="hyperpod-faqs-q9"></a>

This typically occurs when nodes are rebooted using `sudo reboot` instead of Slurm's control interface. To properly reboot nodes, use the Slurm command `scontrol reboot nextstate=resume <list_of_nodes>`. This ensures Slurm maintains proper control of the node state and resumes normal operation after reboot.

For GPU instances (like NVIDIA P5), this can also happen if the node can't complete its boot process within Slurm's default time limit (60 seconds). To resolve this, increase the `ResumeTimeout` parameter in `slurm.conf` to 300 seconds. This gives GPU instances sufficient time to boot and initialize drivers.
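For example, in `slurm.conf` (the Slurm default for `ResumeTimeout` is 60 seconds; apply the change with `scontrol reconfigure`):

```
ResumeTimeout=300
```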

## Why do my nodes keep getting drained due to Out of Memory (OOM) issues?
<a name="hyperpod-faqs-q10"></a>

OOM issues occur when jobs exceed the node's memory capacity. To prevent this, implement `cgroups` to enforce memory limits per job. This prevents a single job from affecting the entire node and improves isolation and stability.

Example setup in `slurm.conf`: 

```
TaskPlugin=task/cgroup
```

Example setup in `cgroup.conf`:

```
CgroupAutomount=yes
ConstrainCores=yes
CgroupPlugin=autodetect
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
SignalChildrenProcesses=yes
MaxRAMPercent=99
MaxSwapPercent=80
MinRAMSpace=100
```

For more information, see [Control Group in Slurm](https://slurm.schedmd.com/cgroups.html), [Cgroup and PAM-based login control for Slurm compute nodes](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/pam_adopt_cgroup_wheel.sh#L197), and [Configure Cgroups for Slurm](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/07-tips-and-tricks/16-enable-cgroups).
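After applying the configuration and running `scontrol reconfigure`, a quick check that the constraint settings are present can look like the following. This is a local sketch against a sample file; on the controller node, you would set `conf` to the real `/opt/slurm/etc/cgroup.conf` instead.

```shell
# Write a sample cgroup.conf to check against; on a live controller,
# use conf=/opt/slurm/etc/cgroup.conf instead of creating a sample.
conf=/tmp/cgroup.conf
cat > "$conf" <<'EOF'
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
EOF

# Report whether each memory/CPU constraint is enabled in the file
for key in ConstrainCores ConstrainRAMSpace ConstrainSwapSpace; do
    if grep -q "^${key}=yes" "$conf"; then
        echo "$key enabled"
    else
        echo "$key MISSING"
    fi
done
```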

## How can I ensure resources are properly cleaned up after jobs complete?
<a name="hyperpod-faqs-q11"></a>

Implement epilogue scripts to automatically clean up resources after jobs complete. Resources might not be cleared correctly when jobs crash unexpectedly, when jobs contain bugs that prevent normal cleanup, or when shared memory buffers (including those shared between processes and GPU drivers) remain allocated.

Epilogue scripts can perform tasks such as clearing GPU memory, removing temporary files, and unmounting file systems. These scripts have limitations when resources are not exclusively allocated to a single job. For detailed instructions and sample scripts, see the second bullet point of the question [Why does my parallel training job fail when I use NVIDIA Collective Communications Library (NCCL) with Slurm on SageMaker HyperPod platform?](#hyperpod-faqs-q4). For more information, see [Enable Slurm epilog script](https://catalog.workshops.aws/sagemaker-hyperpod/en-US/07-tips-and-tricks/18-slurm-epilogue).

# Orchestrating SageMaker HyperPod clusters with Amazon EKS
<a name="sagemaker-hyperpod-eks"></a>

SageMaker HyperPod is a SageMaker AI-managed service that enables large-scale training of foundation models on long-running and resilient compute clusters, integrating with Amazon EKS for orchestrating the HyperPod compute resources. You can run uninterrupted training jobs spanning weeks or months at scale using Amazon EKS clusters with HyperPod resiliency features that check for various hardware failures and automatically recover faulty nodes. 

Key features for cluster admin users include the following.
+ Provisioning resilient HyperPod clusters and attaching them to an EKS control plane
+ Enabling dynamic capacity management, such as adding more nodes, updating software, and deleting clusters
+ Enabling access to the cluster instances directly through `kubectl` or SSM/SSH
+ Offering [resiliency capabilities](sagemaker-hyperpod-eks-resiliency.md), including basic health checks, deep health checks, a health-monitoring agent, and support for PyTorch job auto-resume
+ Integrating with observability tools such as [Amazon CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html), [Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html), and [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html)

For data scientist users, EKS support in HyperPod enables the following.
+ Running containerized workloads for training foundation models on the HyperPod cluster
+ Running inference on the EKS cluster, leveraging the integration between HyperPod and EKS
+ Leveraging the job auto-resume capability for [Kubeflow PyTorch training (PyTorchJob)](https://www.kubeflow.org/docs/components/training/user-guides/pytorch/)

**Note**  
Amazon EKS enables user-managed orchestration of tasks and infrastructure on SageMaker HyperPod through the Amazon EKS Control Plane. Ensure that user access to the cluster through the Kubernetes API Server endpoint follows the principle of least-privilege, and that network egress from the HyperPod cluster is secured.  
To learn more about securing access to the Amazon EKS API Server, see [Control network access to cluster API server endpoint](https://docs.aws.amazon.com/eks/latest/userguide/cluster-endpoint.html).  
To learn more about securing network access on HyperPod, see [Setting up SageMaker HyperPod with a custom Amazon VPC](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-optional-vpc).

The high-level architecture of Amazon EKS support in HyperPod involves a 1-to-1 mapping between an EKS cluster (control plane) and a HyperPod cluster (worker nodes) within a VPC, as shown in the following diagram.

![\[EKS and HyperPod VPC architecture with control plane, cluster nodes, and AWS services.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-eks-diagram.png)


# Managing SageMaker HyperPod clusters orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-operate"></a>

This section provides guidance on managing SageMaker HyperPod through the SageMaker AI console UI or the AWS Command Line Interface (CLI). It explains how to perform various tasks related to SageMaker HyperPod, whether you prefer a visual interface or working with commands.

**Topics**
+ [Getting started with Amazon EKS support in SageMaker HyperPod](sagemaker-hyperpod-eks-prerequisites.md)
+ [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md)
+ [Setting up Kubernetes role-based access control](sagemaker-hyperpod-eks-setup-rbac.md)
+ [Custom Amazon Machine Images (AMIs) for SageMaker HyperPod clusters](hyperpod-custom-ami-support.md)
+ [Managing SageMaker HyperPod EKS clusters using the SageMaker console](sagemaker-hyperpod-eks-operate-console-ui.md)
+ [Creating SageMaker HyperPod clusters using CloudFormation templates](smcluster-getting-started-eks-console-create-cluster-cfn.md)
+ [Managing SageMaker HyperPod EKS clusters using the AWS CLI](sagemaker-hyperpod-eks-operate-cli-command.md)
+ [HyperPod managed tiered checkpointing](managed-tier-checkpointing.md)
+ [SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md)
+ [Usage reporting for cost attribution in SageMaker HyperPod](sagemaker-hyperpod-usage-reporting.md)
+ [Configuring storage for SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-setup-storage.md)
+ [Using the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters](sagemaker-hyperpod-eks-ebs.md)
+ [Configuring custom Kubernetes labels and taints in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-custom-labels-and-taints.md)

# Getting started with Amazon EKS support in SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-prerequisites"></a>

In addition to the general [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md) for SageMaker HyperPod, check the following requirements and considerations for orchestrating SageMaker HyperPod clusters using Amazon EKS.

**Important**  
You can set up resources configuration for creating SageMaker HyperPod clusters using the AWS Management Console and CloudFormation. For more information, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md) and [Creating SageMaker HyperPod clusters using CloudFormation templates](smcluster-getting-started-eks-console-create-cluster-cfn.md).

**Requirements**

**Note**  
Before creating a HyperPod cluster, you need a running Amazon EKS cluster configured with a VPC and with the required packages installed using Helm.
+ If using the SageMaker AI console, you can create an Amazon EKS cluster within the HyperPod cluster console page. For more information, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md).
+ If using AWS CLI, you should create an Amazon EKS cluster before creating a HyperPod cluster to associate with. For more information, see [Create an Amazon EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html) in the Amazon EKS User Guide.

When provisioning your Amazon EKS cluster, consider the following:

1. **Kubernetes version support**
   + SageMaker HyperPod supports Kubernetes versions 1.28, 1.29, 1.30, 1.31, 1.32, 1.33, and 1.34.

1. **Amazon EKS cluster authentication mode**
   + The Amazon EKS cluster authentication modes supported by SageMaker HyperPod are `API` and `API_AND_CONFIG_MAP`.

1. **Networking**
   + SageMaker HyperPod requires the Amazon VPC Container Network Interface (CNI) plug-in version 1.18.3 or later.
**Note**  
[AWS VPC CNI plugin for Kubernetes](https://github.com/aws/amazon-vpc-cni-k8s) is the only CNI supported by SageMaker HyperPod.
   + The [type of the subnet](https://docs.aws.amazon.com/vpc/latest/userguide/configure-subnets.html#subnet-types) in your VPC must be private for HyperPod clusters.

1. **IAM roles**
   + Ensure the necessary IAM roles for HyperPod are set up as guided in the [AWS Identity and Access Management for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md) section.

1. **Amazon EKS cluster add-ons**
   + You can continue using the various add-ons provided by Amazon EKS such as [Kube-proxy](https://docs.aws.amazon.com/eks/latest/userguide/add-ons-kube-proxy.html), [CoreDNS](https://docs.aws.amazon.com/eks/latest/userguide/add-ons-coredns.html), the [Amazon VPC Container Network Interface (CNI)](https://docs.aws.amazon.com/eks/latest/userguide/add-ons-vpc-cni.html) plugin, Amazon EKS pod identity, the GuardDuty agent, the Amazon FSx Container Storage Interface (CSI) driver, the Mountpoint for Amazon S3 CSI driver, the AWS Distro for OpenTelemetry, and the CloudWatch Observability agent.

**Considerations for configuring SageMaker HyperPod clusters with Amazon EKS**
+ You must use distinct IAM roles based on the type of your nodes. For HyperPod nodes, use a role based on [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod). For Amazon EKS nodes, see [Amazon EKS node IAM role](https://docs.aws.amazon.com/eks/latest/userguide/create-node-role.html).
+ You can provision and mount additional Amazon EBS volumes on SageMaker HyperPod nodes using two approaches: use [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceGroupSpecification.html#sagemaker-Type-ClusterInstanceGroupSpecification-InstanceStorageConfigs](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceGroupSpecification.html#sagemaker-Type-ClusterInstanceGroupSpecification-InstanceStorageConfigs) for cluster-level volume provisioning (available when creating or updating instance groups), or use the Amazon Elastic Block Store (Amazon EBS) Container Storage Interface (CSI) driver for dynamic pod-level volume management. With [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceGroupSpecification.html#sagemaker-Type-ClusterInstanceGroupSpecification-InstanceStorageConfigs](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceGroupSpecification.html#sagemaker-Type-ClusterInstanceGroupSpecification-InstanceStorageConfigs), set the [local path](https://kubernetes.io/docs/concepts/storage/volumes/#local) to `/opt/sagemaker` to properly mount the volumes to your Amazon EKS pods. For information about how to deploy the [Amazon EBS CSI](https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html) controller on HyperPod nodes, see [Using the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters](sagemaker-hyperpod-eks-ebs.md).
+ If you use instance-type labels for defining scheduling constraints, ensure that you use the SageMaker AI ML instance types prefixed with `ml.`. For example, for P5 instances, use `ml.p5.48xlarge` instead of `p5.48xlarge`.

**Considerations for configuring network for SageMaker HyperPod clusters with Amazon EKS**
+ Each HyperPod cluster instance supports one Elastic Network Interface (ENI). For the maximum number of Pods per instance type, refer to the following table.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ Only Pods with `hostNetwork = true` have access to the Amazon EC2 Instance Metadata Service (IMDS) by default. Use the Amazon EKS Pod identity or the [IAM roles for service accounts (IRSA)](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html) to manage access to the AWS credentials for Pods.
+ EKS-orchestrated HyperPod clusters support both IPv4 and IPv6 addressing modes; you can configure IPv6 for Amazon EKS clusters running in IPv6-enabled VPC and subnet environments. For more information, see [Setting up SageMaker HyperPod with a custom Amazon VPC](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-optional-vpc).

**Considerations for using the HyperPod cluster resiliency features**
+ Node auto-replacement is not supported for CPU instances.
+ The HyperPod health monitoring agent needs to be installed for node auto-recovery to work. The agent can be installed using Helm. For more information, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md).
+ The HyperPod deep health checks and the health-monitoring agent support GPU and Trn instances.
+ SageMaker AI applies the following taint to nodes when they are undergoing deep health checks:

  ```
  effect: NoSchedule
  key: sagemaker.amazonaws.com/node-health-status
  value: Unschedulable
  ```
**Note**  
You cannot add custom taints to nodes in instance groups with `DeepHealthChecks` turned on.
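For example, to find nodes currently carrying this taint, you can filter the node list on the taint key `sagemaker.amazonaws.com/node-health-status`. The sketch below runs against a sample node document with hypothetical node names; on a live cluster, you would feed the output of `kubectl get nodes -o json` into the same filter.

```shell
# Sample of `kubectl get nodes -o json` output (abbreviated, hypothetical names)
cat > /tmp/nodes.json <<'EOF'
{"items": [
  {"metadata": {"name": "hyperpod-i-0aaa"},
   "spec": {"taints": [{"key": "sagemaker.amazonaws.com/node-health-status",
                        "value": "Unschedulable", "effect": "NoSchedule"}]}},
  {"metadata": {"name": "hyperpod-i-0bbb"}, "spec": {}}
]}
EOF

# Print the names of nodes tainted for deep health checks
python3 - <<'EOF'
import json

nodes = json.load(open("/tmp/nodes.json"))["items"]
for n in nodes:
    taints = n.get("spec", {}).get("taints") or []
    if any(t.get("key") == "sagemaker.amazonaws.com/node-health-status" for t in taints):
        print(n["metadata"]["name"])
EOF
```

Pods without a matching toleration won't be scheduled onto the tainted node until the deep health checks pass and SageMaker AI removes the taint.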

 Once your Amazon EKS cluster is running, configure your cluster using the Helm package manager as instructed in [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md) before creating your HyperPod cluster.

# Installing packages on the Amazon EKS cluster using Helm
<a name="sagemaker-hyperpod-eks-install-packages-using-helm-chart"></a>

Before creating a SageMaker HyperPod cluster and attaching it to an Amazon EKS cluster, you should install packages using [Helm](https://helm.sh/), a package manager for Kubernetes. Helm is an open-source tool that automates and streamlines the installation of dependencies on Kubernetes clusters, and it simplifies the various setups needed to prepare the Amazon EKS cluster as the orchestrator (control plane) for a SageMaker HyperPod cluster.

The SageMaker HyperPod service team provides a Helm chart package, which bundles key dependencies such as device and EFA plug-ins, the [Kubeflow Training Operator](https://www.kubeflow.org/docs/components/training/), and associated permission configurations.

**Important**  
This Helm installation step is required. If you set up your Amazon EKS cluster using the [AWS Management Console](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md) or [CloudFormation](smcluster-getting-started-eks-console-create-cluster-cfn.md), you can skip this step because the installation is handled automatically during the setup process. If you set up the cluster directly using the APIs, use the provided Helm chart to configure your Amazon EKS cluster. Failure to configure your Amazon EKS cluster using the provided Helm chart might result in the SageMaker HyperPod cluster not functioning correctly or the creation process failing entirely. The `aws-hyperpod` namespace name cannot be modified.

1. [Install Helm](https://helm.sh/docs/intro/install/) on your local machine.

1. Download the Helm charts provided by SageMaker HyperPod located at `helm_chart/HyperPodHelmChart` in the [SageMaker HyperPod CLI repository](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart).

   ```
   git clone https://github.com/aws/sagemaker-hyperpod-cli.git
   cd sagemaker-hyperpod-cli/helm_chart
   ```

1. Update the dependencies of the Helm chart, preview the changes that will be made to your Kubernetes cluster, and install the Helm chart.

   ```
   helm dependencies update HyperPodHelmChart
   ```

   ```
   helm install hyperpod-dependencies HyperPodHelmChart --namespace kube-system --dry-run
   ```

   ```
   helm install hyperpod-dependencies HyperPodHelmChart --namespace kube-system
   ```

In summary, the Helm installation sets up various components for your Amazon EKS cluster, including job scheduling and queueing (Kueue), storage management, MLflow integration, and Kubeflow. Additionally, the charts install the following components for integrating with the SageMaker HyperPod cluster resiliency features, which are required components.
+ **Health monitoring agent** – This installs the health-monitoring agent provided by SageMaker HyperPod. This is required if you want your HyperPod cluster to be monitored. The health-monitoring agent is provided as a Docker image, and the image is preset in the `values.yaml` file of the Helm charts. The agent supports GPU-based instances and Trainium-accelerator-based instances (`trn1`, `trn1n`, `inf2`), and it is installed to the `aws-hyperpod` namespace. To find your supported URI, see [Supported Regions and their ECR URIs in the sagemaker-hyperpod-cli repository on GitHub](https://github.com/aws/sagemaker-hyperpod-cli/blob/main/helm_chart/readme.md#6-notes).
+ **Deep health check** – This sets up a `ClusterRole`, a ServiceAccount (`deep-health-check-service-account`) in the `aws-hyperpod` namespace, and a `ClusterRoleBinding` to enable the SageMaker HyperPod deep health check feature. For more information about the Kubernetes RBAC file for deep health check, see the configuration file [https://github.com/aws/sagemaker-hyperpod-cli/blob/main/helm_chart/HyperPodHelmChart/charts/deep-health-check/templates/deep-health-check-rbac.yaml](https://github.com/aws/sagemaker-hyperpod-cli/blob/main/helm_chart/HyperPodHelmChart/charts/deep-health-check/templates/deep-health-check-rbac.yaml) in the SageMaker HyperPod CLI GitHub repository. 
+ **`job-auto-restart`** - This sets up a `ClusterRole`, a ServiceAccount (`job-auto-restart`) in the `aws-hyperpod` namespace, and a `ClusterRoleBinding`, to enable the auto-restart feature for PyTorch training jobs in SageMaker HyperPod. For more information about the Kubernetes RBAC file for `job-auto-restart`, see the configuration file [https://github.com/aws/sagemaker-hyperpod-cli/blob/main/helm_chart/HyperPodHelmChart/charts/job-auto-restart/templates/job-auto-restart-rbac.yaml](https://github.com/aws/sagemaker-hyperpod-cli/blob/main/helm_chart/HyperPodHelmChart/charts/job-auto-restart/templates/job-auto-restart-rbac.yaml) in the SageMaker HyperPod CLI GitHub repository. 
+ **Kubeflow MPI operator** – The [MPI Operator](https://github.com/kubeflow/mpi-operator) is a Kubernetes operator that simplifies running distributed Machine Learning (ML) and High-Performance Computing (HPC) workloads using the Message Passing Interface (MPI) on Kubernetes clusters. It installs MPI Operator v0.5. It is installed to the `mpi-operator` namespace.
+ **`nvidia-device-plugin`** – This is a Kubernetes device plug-in that allows you to automatically expose NVIDIA GPUs for consumption by containers in your Amazon EKS cluster. It allows Kubernetes to allocate and provide access to the requested GPUs for that container. Required when using an instance type with GPU.
+ **`neuron-device-plugin`** – This is a Kubernetes device plug-in that allows you to automatically expose AWS Inferentia chips for consumption by containers in your Amazon EKS cluster. It allows Kubernetes to access and utilize the AWS Inferentia chips on the cluster nodes. Required when using a Neuron instance type.
+ **`aws-efa-k8s-device-plugin`** – This is a Kubernetes device plug-in that enables the use of AWS Elastic Fabric Adapter (EFA) on Amazon EKS clusters. EFA is a network device that provides low-latency and high-throughput communication between instances in a cluster. Required when using an EFA supported instance type.

For more information about the installation procedure using the provided Helm charts, see the [README file in the SageMaker HyperPod CLI repository](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart).

# Setting up Kubernetes role-based access control
<a name="sagemaker-hyperpod-eks-setup-rbac"></a>

Cluster admin users also need to set up [Kubernetes role-based access control (RBAC)](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) for data scientist users to use the [SageMaker HyperPod CLI](https://github.com/aws/sagemaker-hyperpod-cli) to run workloads on HyperPod clusters orchestrated with Amazon EKS.

## Option 1: Set up RBAC using Helm chart
<a name="sagemaker-hyperpod-eks-setup-rbac-helm"></a>

The SageMaker HyperPod service team provides a Helm sub-chart for setting up RBAC. To learn more, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md).

## Option 2: Set up RBAC manually
<a name="sagemaker-hyperpod-eks-setup-rbac-manual"></a>

Create `ClusterRole` and `ClusterRoleBinding` with the minimum privilege, and create `Role` and `RoleBinding` with mutation permissions.

**To create `ClusterRole` & `ClusterRoleBinding` for data scientist IAM role**

Create a cluster-level configuration file `cluster_level_config.yaml` as follows.

```
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: hyperpod-scientist-user-cluster-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hyperpod-scientist-user-cluster-role-binding
subjects:
- kind: Group
  name: hyperpod-scientist-user-cluster-level
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: hyperpod-scientist-user-cluster-role # this must match the name of the Role or ClusterRole you wish to bind to
  apiGroup: rbac.authorization.k8s.io
```

Apply the configuration to the EKS cluster.

```
kubectl apply -f cluster_level_config.yaml
```

**To create Role and RoleBinding in namespace**

This is the namespace where the training operator runs training jobs and that the resiliency features monitor by default. Job auto-resume is supported only in the `kubeflow` namespace or in namespaces prefixed with `aws-hyperpod`.

Create a role configuration file `namespace_level_role.yaml` as follows. This example creates a role in the `kubeflow` namespace.

```
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: kubeflow
  name: hyperpod-scientist-user-namespace-level-role
###
#  1) create/get pods, get/list nodes
#  2) get/list/create/update/delete Kubeflow PyTorchJobs
#  3) get pod logs and exec into pods
###
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "get"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["get", "create"]
- apiGroups: ["kubeflow.org"]
  resources: ["pytorchjobs", "pytorchjobs/status"]
  verbs: ["get", "list", "create", "delete", "update", "describe"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["create", "update", "get", "list", "delete"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create", "get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: kubeflow
  name: hyperpod-scientist-user-namespace-level-role-binding
subjects:
- kind: Group
  name: hyperpod-scientist-user-namespace-level
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: hyperpod-scientist-user-namespace-level-role # this must match the name of the Role or ClusterRole you wish to bind to
  apiGroup: rbac.authorization.k8s.io
```

Apply the configuration to the EKS cluster.

```
kubectl apply -f namespace_level_role.yaml
```
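After both configurations are applied, you can sanity-check the effective permissions with `kubectl auth can-i`, impersonating the scientist groups. The user and group names below are assumptions matching the examples above, and this sketch only prints the commands to run rather than executing them against a cluster.

```shell
# Group names from the ClusterRoleBinding and RoleBinding above (assumed).
CLUSTER_GROUP="hyperpod-scientist-user-cluster-level"
NS_GROUP="hyperpod-scientist-user-namespace-level"

# Print the checks to run against your EKS cluster (requires admin access).
echo kubectl auth can-i list nodes --as scientist --as-group "$CLUSTER_GROUP"
echo kubectl auth can-i create pods -n kubeflow --as scientist --as-group "$NS_GROUP"
```

Run the printed commands with cluster-admin credentials; each should answer `yes` if the bindings took effect.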

## Create an access entry for Kubernetes groups
<a name="sagemaker-hyperpod-eks-setup-rbac-access-entry"></a>

After you have set up RBAC using one of the two options above, create an access entry with the following sample command, replacing the placeholder values with your own information.

```
aws eks create-access-entry \
    --cluster-name <eks-cluster-name> \
    --principal-arn arn:aws:iam::<AWS_ACCOUNT_ID_SCIENTIST_USER>:role/ScientistUserRole \
    --kubernetes-groups '["hyperpod-scientist-user-namespace-level","hyperpod-scientist-user-cluster-level"]'
```

For the `principal-arn` parameter, specify the ARN of the scientist user's IAM role. For more information, see [IAM users for scientists](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-user).
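You can confirm that the access entry exists and inspect its group mappings with `aws eks describe-access-entry`. The cluster name and account ID below are placeholders; this sketch prints the command for reference rather than calling AWS.

```shell
# Placeholders; substitute your EKS cluster name and scientist role ARN.
EKS_CLUSTER="my-eks-cluster"
PRINCIPAL_ARN="arn:aws:iam::111122223333:role/ScientistUserRole"

echo aws eks describe-access-entry \
  --cluster-name "$EKS_CLUSTER" \
  --principal-arn "$PRINCIPAL_ARN"
```

The response's `kubernetesGroups` field should list the groups you passed to `create-access-entry`.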

# Custom Amazon Machine Images (AMIs) for SageMaker HyperPod clusters
<a name="hyperpod-custom-ami-support"></a>

Using base Amazon Machine Images (AMIs) provided and made public by Amazon SageMaker HyperPod, you can build custom AMIs. With a custom AMI, you can create specialized environments for AI workloads with pre-configured software stacks, driver customizations, proprietary dependencies, and security agents. This capability eliminates the need for complex post-launch bootstrapping using lifecycle configuration scripts.

With custom AMIs, you can standardize environments across different stages, accelerate startup times, and have full control over your runtime environment while leveraging SageMaker HyperPod's infrastructure capabilities and scaling advantages. This helps you maintain control over your AI infrastructure while still benefiting from SageMaker HyperPod's optimized base runtime.

You can build upon the SageMaker HyperPod performance-tuned base images by adding security agents, compliance tools, and specialized libraries while preserving all the distributed training benefits. This capability removes the previously required choice between infrastructure optimization and organizational security policies.

The custom AMI experience integrates seamlessly with established enterprise security workflows. Security teams build hardened images using SageMaker HyperPod's public AMIs as a base, and AI platform teams can specify these custom AMIs when creating or updating clusters through the SageMaker HyperPod APIs. The APIs validate image compatibility, handle necessary permissions, and maintain backwards compatibility so existing workflows continue functioning. Organizations with stringent security protocols can eliminate the error-prone alternative of installing security agents at runtime through lifecycle scripts. By aligning with enterprise security practices rather than forcing organizations to adapt their protocols to SageMaker HyperPod's limitations, custom AMIs remove a common barrier to adoption for security-conscious organizations running critical AI workloads.

For release notes on updates to the public AMIs, see [Public AMI releases](sagemaker-hyperpod-release-public-ami.md). To learn how to get started with building a custom AMI and using it in your HyperPod clusters, see the following topics.

**Topics**
+ [Build a custom AMI](hyperpod-custom-ami-how-to.md)
+ [Cluster management with custom AMIs](hyperpod-custom-ami-cluster-management.md)

# Build a custom AMI
<a name="hyperpod-custom-ami-how-to"></a>

The following page explains how to build a custom Amazon Machine Image (AMI) using Amazon SageMaker HyperPod base AMIs. You begin by selecting a base AMI, and then you create your own customized AMI using any of the common methods for creating new images, such as the AWS CLI.

## Select a SageMaker HyperPod base AMI
<a name="hyperpod-custom-ami-select-base"></a>

You can select a SageMaker HyperPod base AMI through one of the following methods.

### AWS console selection
<a name="hyperpod-custom-ami-console-selection"></a>

You can select public SageMaker HyperPod AMIs through the AWS console or by using the `DescribeImages` API call. SageMaker HyperPod AMIs are public and visible in every AWS account. You can find them in the Amazon EC2 AMI catalog by applying a filter to search for public AMIs owned by Amazon.

To find SageMaker HyperPod AMIs in the console:

1. Sign in to the Amazon EC2 console.

1. In the left navigation pane, choose **AMIs**.

1. For the **Image type** dropdown, select **Public images**.

1. In the search bar filters, set the **Owner alias** filter to **amazon**.

1. Search for AMIs prefixed with **HyperPod EKS** and select the AMI (preferably the latest) that works for your use case. For instance, you can choose between an AMI for Kubernetes 1.31 and one for Kubernetes 1.30.
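You can perform the same search from the CLI with `describe-images`. The name pattern below is an assumption based on the **HyperPod EKS** prefix mentioned above; the sketch builds and prints the command rather than calling AWS.

```shell
# Assumed name pattern for public HyperPod EKS AMIs owned by Amazon.
NAME_FILTER="Name=name,Values=HyperPod EKS*"

echo aws ec2 describe-images --owners amazon \
  --filters "$NAME_FILTER" "Name=is-public,Values=true" \
  --query "Images[].{Id:ImageId,Name:Name}" --output table
```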

### Fetch latest public AMI ID through the AWS CLI
<a name="hyperpod-custom-ami-cli-fetch"></a>

If you want to always use the latest publicly released AMI, it is more efficient to query the public SageMaker HyperPod SSM parameter that holds the ID of the most recently released AMI.

The following example shows how to retrieve the latest AMI ID using the AWS CLI:

```
aws ssm get-parameter \
  --name "/aws/service/sagemaker-hyperpod/ami/x86_64/eks-1.31-amazon-linux-2/latest/ami-id" \
  --region us-east-1 \
  --query "Parameter.Value" \
  --output text
```

**Note**  
Replace the parameter name with the corresponding Kubernetes version as required. For example, if you want to use Kubernetes 1.30, use the following parameter: `/aws/service/sagemaker-hyperpod/ami/x86_64/eks-1.30-amazon-linux-2/latest/ami-id`.
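Because the parameter name is composed from the architecture and Kubernetes version, you can build it programmatically. A minimal sketch (the values shown are examples):

```shell
# Compose the SSM parameter name from architecture and Kubernetes version.
ARCH="x86_64"
K8S_VERSION="1.30"
PARAM_NAME="/aws/service/sagemaker-hyperpod/ami/${ARCH}/eks-${K8S_VERSION}-amazon-linux-2/latest/ami-id"
echo "$PARAM_NAME"
```

Pass the resulting name to `aws ssm get-parameter` as shown above.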

## Build your custom AMI
<a name="hyperpod-custom-ami-build"></a>

After you have selected a SageMaker HyperPod public AMI, use it as the base AMI to build your own custom AMI with one of the following methods. Note that this is not an exhaustive list; you can use any AMI-building method of your choice, and SageMaker HyperPod does not recommend a specific one.
+ **AWS Management Console**: You can launch an Amazon EC2 instance using the SageMaker HyperPod AMI, make desired customizations, and then create an AMI from that instance.
+ **AWS CLI**: You can also use the `aws ec2 create-image` command to create an AMI from an existing Amazon EC2 instance after performing the customization.
+ **HashiCorp Packer**: Packer is an open-source tool from HashiCorp that enables you to create identical machine images for multiple platforms from a single source configuration. It supports creating AMIs for AWS, as well as images for other cloud providers and virtualization platforms.
+ **Image Builder**: EC2 Image Builder is a fully managed AWS service that makes it easier to automate the creation, maintenance, validation, sharing, and deployment of Linux or Windows Server images. For more information, see the [EC2 Image Builder User Guide](https://docs.aws.amazon.com/imagebuilder/latest/userguide/what-is-image-builder.html).

### Build a custom AMI with customer managed AWS KMS encryption
<a name="hyperpod-custom-ami-build-kms"></a>

The following sections describe how to build a custom AMI with a customer managed AWS KMS key to encrypt your HyperPod cluster volumes. For more information about customer managed keys in HyperPod and granting the required IAM and KMS key policy permissions, see [Customer managed AWS KMS key encryption for SageMaker HyperPod](smcluster-cmk.md). If you plan to use a custom AMI that is encrypted with a customer managed key, ensure that you also encrypt your HyperPod cluster's Amazon EBS root volume with the same key.

#### AWS CLI example: Create a new AMI using EC2 Image Builder and a HyperPod base image
<a name="hyperpod-custom-ami-cli-example"></a>

The following example shows how to create an AMI using Image Builder with AWS KMS encryption:

```
aws imagebuilder create-image-recipe \
    --name "hyperpod-custom-recipe" \
    --semantic-version "1.0.0" \
    --parent-image "<hyperpod-base-image-id>" \
    --components '[{"componentArn":"<component-arn>"}]' \
    --block-device-mappings DeviceName="/dev/xvda",Ebs={VolumeSize=100,VolumeType=gp3,Encrypted=true,KmsKeyId=arn:aws:kms:us-east-1:111122223333:key/key-id,DeleteOnTermination=true}
```

#### Amazon EC2 console: Create a new AMI from an Amazon EC2 instance
<a name="hyperpod-custom-ami-console-example"></a>

To create an AMI from an Amazon EC2 instance using the Amazon EC2 console:

1. Right-click on your customized Amazon EC2 instance and choose **Create Image**.

1. In the **Encryption** section, select **Encrypt snapshots**.

1. Select your KMS key from the dropdown. For example: `arn:aws:kms:us-east-2:111122223333:key/<your-kms-key-id>` or use the key alias: `alias/<your-hyperpod-key>`.

#### AWS CLI example: Create a new AMI from an Amazon EC2 instance
<a name="hyperpod-custom-ami-cli-create-image"></a>

Use the `aws ec2 create-image` command with AWS KMS encryption:

```
aws ec2 create-image \
    --instance-id "<instance-id>" \
    --name "MyCustomHyperPodAMI" \
    --description "Custom HyperPod AMI" \
    --block-device-mappings '[
        {
            "DeviceName": "/dev/xvda",
            "Ebs": {
                "Encrypted": true,
                "KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id",
                "VolumeType": "gp2" 
            }
        }
    ]'
```
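Because `--block-device-mappings` takes a JSON document, a small typo (for example, a Python-style `True` instead of `true`) makes the call fail at the API. You can lint the document locally before running the command; this sketch assumes `python3` is available on your workstation.

```shell
# Validate the block-device-mappings JSON before passing it to the CLI.
MAPPINGS='[
    {
        "DeviceName": "/dev/xvda",
        "Ebs": {
            "Encrypted": true,
            "KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id",
            "VolumeType": "gp2"
        }
    }
]'
echo "$MAPPINGS" | python3 -m json.tool > /dev/null && echo "mappings OK"
```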

# Cluster management with custom AMIs
<a name="hyperpod-custom-ami-cluster-management"></a>

After the custom AMI is built, you can use it for creating or updating an Amazon SageMaker HyperPod cluster. You can also scale up or add instance groups that use the new AMI.

## Permissions required for cluster operations
<a name="hyperpod-custom-ami-permissions"></a>

Add the following permissions to the cluster admin user who operates and configures SageMaker HyperPod clusters. The following policy example includes the minimum set of permissions for cluster administrators to run the SageMaker HyperPod core APIs and manage SageMaker HyperPod clusters with custom AMI.

Note that AMI and AMI EBS snapshot sharing permissions are included through `ModifyImageAttribute` and `ModifySnapshotAttribute` API permissions as part of the following policy. For scoping down the sharing permissions, you can take the following steps:
+ Add tags to control the AMI sharing permissions to AMI and AMI snapshot. For example, you can tag the AMI with `AllowSharing` as `true`.
+ Add the context key in the policy to only allow AMI sharing for AMIs tagged with certain tags.

The following policy is scoped down so that sharing is allowed only for AMIs and snapshots tagged with `AllowSharing` set to `true`.

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::111122223333:role/your-execution-role-name"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateCluster",
                "sagemaker:DeleteCluster",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "sagemaker:UpdateCluster",
                "sagemaker:UpdateClusterSoftware",
                "sagemaker:BatchDeleteClusterNodes",
                "eks:DescribeCluster",
                "eks:CreateAccessEntry",
                "eks:DescribeAccessEntry",
                "eks:DeleteAccessEntry",
                "eks:AssociateAccessPolicy",
                "iam:CreateServiceLinkedRole",
                "ec2:DescribeImages",
                "ec2:DescribeSnapshots"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:ModifyImageAttribute",
                "ec2:ModifySnapshotAttribute"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "ec2:ResourceTag/AllowSharing": "true"
                }
            }
        }
    ]
}
```

------
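For the condition in the policy above to allow sharing, the AMI and its backing snapshot must both carry the `AllowSharing` tag. A hedged sketch using `aws ec2 create-tags` (the resource IDs are placeholders, and the command is printed rather than executed):

```shell
# Placeholders; substitute your AMI ID and its backing snapshot ID.
AMI_ID="ami-0123456789abcdef0"
SNAPSHOT_ID="snap-0123456789abcdef0"

echo aws ec2 create-tags --resources "$AMI_ID" "$SNAPSHOT_ID" \
  --tags Key=AllowSharing,Value=true
```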

**Important**  
If you plan to use an encrypted custom AMI, then make sure that your KMS key meets the permissions described in [Customer managed AWS KMS key encryption for SageMaker HyperPod](smcluster-cmk.md). Additionally, ensure that your custom AMI's KMS key is also used to encrypt your cluster's Amazon EBS root volume.

## Create a cluster
<a name="hyperpod-custom-ami-api-create"></a>

You can specify your custom AMI in the `ImageId` field for the `CreateCluster` operation.

The following examples show how to create a cluster with a custom AMI, both with and without an AWS KMS customer managed key for encrypting the cluster volumes.

------
#### [ Standard example ]

The following example shows how to create a cluster with a custom AMI.

```
aws sagemaker create-cluster \
   --cluster-name <exampleClusterName> \
   --orchestrator 'Eks={ClusterArn='<eks_cluster_arn>'}' \
   --node-provisioning-mode Continuous \
   --instance-groups '{
   "InstanceGroupName": "<exampleGroupName>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ImageId": "<your_custom_ami>",
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "InstanceStorageConfigs": [
   
        {
            "EbsVolumeConfig": {
                "VolumeSizeInGB": 200
            }
        }
   ]
}' --vpc-config '{
   "SecurityGroupIds": ["<security_group>"],
   "Subnets": ["<subnet>"]
}'
```

------
#### [ Customer managed key example ]

The following example shows how to create a cluster with a custom AMI while specifying your own AWS KMS customer managed key for encrypting the cluster's Amazon EBS volumes. It is possible to specify different customer managed keys for the root volume and the instance storage volume. If you don't use customer managed keys in the `InstanceStorageConfigs` field, then an AWS owned KMS key is used to encrypt the volumes. If you use different keys for the root volume and secondary instance storage volumes, then set the required KMS key policies on both of your keys.

```
aws sagemaker create-cluster \
   --cluster-name <exampleClusterName> \
   --orchestrator 'Eks={ClusterArn='<eks_cluster_arn>'}' \
   --node-provisioning-mode Continuous \
   --instance-groups '{
   "InstanceGroupName": "<exampleGroupName>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ImageId": "<your_custom_ami>",
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "InstanceStorageConfigs": [
        {
            "EbsVolumeConfig": {
                "RootVolume": true,
                "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id"
            }
        },
        {
            "EbsVolumeConfig": {
                "VolumeSizeInGB": 100,
                "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id"
            }
        }
   ]
}' --vpc-config '{
   "SecurityGroupIds": ["<security_group>"],
   "Subnets": ["<subnet>"]
}'
```

------
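`CreateCluster` returns while provisioning continues in the background, so you can poll the cluster status until it reaches `InService`. The cluster name below is a placeholder, and the command is printed for reference rather than executed:

```shell
# Placeholder cluster name; substitute your own.
CLUSTER_NAME="exampleClusterName"

echo aws sagemaker describe-cluster --cluster-name "$CLUSTER_NAME" \
  --query ClusterStatus --output text
```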

## Update the cluster software
<a name="hyperpod-custom-ami-api-update"></a>

If you want to update an existing instance group on your cluster with your custom AMI, use the `UpdateClusterSoftware` operation and specify your custom AMI in the `ImageId` field. Note that unless you specify the names of specific instance groups in your request, the new image is applied to all of the instance groups in your cluster.

The following example shows how to update a cluster's platform software with a custom AMI:

```
aws sagemaker update-cluster-software \
   --cluster-name <exampleClusterName> \
   --instance-groups <instanceGroupToUpdate> \
   --image-id <customAmiId>
```

## Scale up an instance group
<a name="hyperpod-custom-ami-scale-up"></a>

The following examples show how to scale up an instance group for a cluster using a custom AMI, both with and without using an AWS KMS customer managed key for encryption.

------
#### [ Standard example ]

The following example shows how to scale up an instance group with a custom AMI.

```
aws sagemaker update-cluster \
    --cluster-name <exampleClusterName> \
    --instance-groups '[{
    "InstanceGroupName": "<exampleGroupName>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "ImageId": "<your_custom_ami>"
}]'
```

------
#### [ Customer managed key example ]

The following example shows how to update and scale up your cluster with a custom AMI while specifying your own AWS KMS customer managed key for encrypting the cluster's Amazon EBS volumes. It is possible to specify different customer managed keys for the root volume and the instance storage volume. If you don't use customer managed keys in the `InstanceStorageConfigs` field, then an AWS owned KMS key is used to encrypt the volumes. If you use different keys for the root volume and secondary instance storage volumes, then set the required KMS key policies on both of your keys.

```
aws sagemaker update-cluster \
    --cluster-name <exampleClusterName> \
    --instance-groups '[{
    "InstanceGroupName": "<exampleGroupName>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "ImageId": "<your_custom_ami>",
   "InstanceStorageConfigs": [
        {
            "EbsVolumeConfig": {
                "RootVolume": true,
                "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id"
            }
        },
        {
            "EbsVolumeConfig": {
                "VolumeSizeInGB": 100,
                "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id"
            }
        }
   ]
}]'
```

------

## Add an instance group
<a name="hyperpod-custom-ami-add-instance-group"></a>

The following example shows how to add an instance group to a cluster using a custom AMI:

```
aws sagemaker update-cluster \
   --cluster-name "<exampleClusterName>" \
   --instance-groups '{
   "InstanceGroupName": "<exampleGroupName>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "ImageId": "<your_custom_ami>"
}' '{
   "InstanceGroupName": "<exampleGroupName2>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 1,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "ImageId": "<your_custom_ami>"
}'
```

# Managing SageMaker HyperPod EKS clusters using the SageMaker console
<a name="sagemaker-hyperpod-eks-operate-console-ui"></a>

The following topics provide guidance on how to manage SageMaker HyperPod in the SageMaker AI console.

**Topics**
+ [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md)
+ [Browsing, viewing, and editing SageMaker HyperPod clusters](sagemaker-hyperpod-eks-operate-console-ui-browse-view-edit.md)
+ [Deleting a SageMaker HyperPod cluster](sagemaker-hyperpod-eks-operate-console-ui-delete-cluster.md)

# Creating a SageMaker HyperPod cluster with Amazon EKS orchestration
<a name="sagemaker-hyperpod-eks-operate-console-ui-create-cluster"></a>

The following tutorial demonstrates how to create a new SageMaker HyperPod cluster and set it up with Amazon EKS orchestration through the SageMaker AI console UI.

**Topics**
+ [Create cluster](#smcluster-getting-started-eks-console-create-cluster-page)
+ [Deploy resources](#smcluster-getting-started-eks-console-create-cluster-deploy)

## Create cluster
<a name="smcluster-getting-started-eks-console-create-cluster-page"></a>

To navigate to the **SageMaker HyperPod Clusters** page and choose Amazon EKS orchestration, follow these steps.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **HyperPod Clusters** in the left navigation pane and then **Cluster Management**.

1. On the **SageMaker HyperPod Clusters** page, choose **Create HyperPod cluster**. 

1. On the **Create HyperPod cluster** drop-down, choose **Orchestrated by Amazon EKS**.

1. On the EKS cluster creation page, you will see two options. Choose the option that best fits your needs.

   1. **Quick setup** - To get started immediately with default settings, choose **Quick setup**. With this option, SageMaker AI creates new resources such as a VPC, subnets, security groups, an Amazon S3 bucket, an IAM role, and an FSx for Lustre file system in the process of creating your cluster.

   1. **Custom setup** - To integrate with existing AWS resources or have specific networking, security, or storage requirements, choose **Custom setup**. With this option, you can choose to use the existing resources or create new ones, and you can customize the configuration that best fits your needs.

## Quick setup
<a name="smcluster-getting-started-eks-console-create-cluster-default"></a>

On the **Quick setup** section, follow these steps to create your HyperPod cluster with Amazon EKS orchestration.

### General settings
<a name="smcluster-getting-started-eks-console-create-cluster-default-general"></a>

Specify a name for the new cluster. You can’t change the name after the cluster is created.

### Instance groups
<a name="smcluster-getting-started-eks-console-create-cluster-default-instance-groups"></a>

To add an instance group, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group. Follow these steps to add an instance group.

1. For **Instance group type**, choose **Standard** or **Restricted Instance Group (RIG)**. Typically, you will choose **Standard**, which provides a general-purpose computing environment without additional security restrictions. **Restricted Instance Group (RIG)** is a specialized environment for foundation model customization, such as Amazon Nova. For more information about setting up a RIG for Amazon Nova model customization, see Amazon Nova customization on SageMaker HyperPod in the [Amazon Nova 1.0 user guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp.html) or the [Amazon Nova 2.0 user guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp.html).

1. For **Name**, specify a name for the instance group.

1.  For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

1. For **Instance type**, choose the instance for the instance group.
**Important**  
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this tutorial, enter **1** for all three groups.

1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

1. For **Instance deep health checks**, choose your option. Deep health checks monitor instance health during creation and after software updates, automatically recovering faulty instances through reboots or replacements when enabled.

1. If your instance type supports GPU partitioning with Multi-Instance GPU (MIG), you can enable GPU partition configuration for the instance group. GPU partitioning allows you to divide GPUs into smaller, isolated partitions for improved resource utilization. For more information, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).

   1. Toggle **Use GPU partition** to enable GPU partitioning for this instance group.

   1. Select a **GPU partition profile** from the available options for your instance type. Each profile defines the GPU slice configuration and memory allocation.

1. Choose **Add instance group**.

### Quick setup defaults
<a name="smcluster-getting-started-eks-console-create-cluster-default-settings"></a>

This section lists all the default settings for your cluster creation, including all the new AWS resources that will be created during the cluster creation process. Review the default settings.

## Custom setup
<a name="smcluster-getting-started-eks-console-create-cluster-custom"></a>

On the **Custom setup** section, follow these steps to create your first HyperPod cluster with Amazon EKS orchestration.

### General settings
<a name="smcluster-getting-started-eks-console-create-cluster-custom-general"></a>

Specify a name for the new cluster. You can’t change the name after the cluster is created.

For **Instance recovery**, choose **Automatic - *recommended*** or **None**. 

### Networking
<a name="smcluster-getting-started-eks-console-create-cluster-custom-network"></a>

Configure network settings within the cluster and in-and-out of the cluster. For orchestration of SageMaker HyperPod cluster with Amazon EKS, the VPC is automatically set to the one configured with the EKS cluster you selected.

1. For **VPC**, choose your own VPC if you already have one that gives SageMaker AI access to your VPC. To create a new VPC, follow the instructions at [Create a VPC](https://docs.aws.amazon.com/vpc/latest/userguide/create-vpc.html) in the *Amazon Virtual Private Cloud User Guide*. You can leave it as **None** to use the default SageMaker AI VPC.

1. For **VPC IPv4 CIDR block**, enter the starting IP of your VPC.

1. For **Availability Zones**, choose the Availability Zones (AZ) where HyperPod will create subnets for your cluster. Choose AZs that match the location of your accelerated compute capacity.

1. For **Security group(s)**, choose security groups that are either attached to the Amazon EKS cluster or whose inbound traffic is permitted by the security group associated with the Amazon EKS cluster. To create new security groups, go to the Amazon VPC console.

### Orchestration
<a name="smcluster-getting-started-eks-console-create-cluster-custom-orchestration"></a>

Follow these steps to create or select an Amazon EKS cluster to use as an orchestrator. 

1. For **EKS cluster**, choose whether to create a new Amazon EKS cluster or use an existing one. 

   If you need to create a new EKS cluster, you can create it from the EKS cluster section without having to open the Amazon EKS console.
**Note**  
The VPC subnet you choose for HyperPod has to be private.   
After submitting a new EKS cluster creation request, wait until the EKS cluster becomes `Active`.

1. For **Kubernetes version**, choose a version from the drop-down menu. For more information about Kubernetes versions, see [Understand the Kubernetes version lifecycle on EKS](https://docs.aws.amazon.com//eks/latest/userguide/kubernetes-versions.html) from the *Amazon EKS User Guide*.

1. For **Operators**, choose **Use default Helm charts and add-ons** or **Don't install operators**. The option defaults to **Use default Helm charts and add-ons**, which will be used to install operators on the EKS cluster. For more information about the default Helm charts and add-ons, see [https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart/HyperPodHelmChart](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart/HyperPodHelmChart) from the GitHub repository. For more information, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md).

1. For **Enabled operators**, view the list of enabled operators. To edit the operators, uncheck the box at the top and choose which operators to enable for the EKS cluster. 
**Note**  
To use HyperPod with EKS, you must install Helm charts and add-ons that enable operators on the EKS cluster. These components configure EKS as the control plane for HyperPod and provide the necessary setup for workload management and orchestration.

### Instance groups
<a name="smcluster-getting-started-eks-console-create-cluster-custom-instance-groups"></a>

To add an instance group, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group. Follow these steps to add an instance group.

1. For **Instance group type**, choose **Standard** or **Restricted Instance Group (RIG)**. Typically, you will choose **Standard**, which provides a general-purpose computing environment without additional security restrictions. **Restricted Instance Group (RIG)** is a specialized environment for foundation model customization, such as Amazon Nova. For more information about setting up RIG for Amazon Nova model customization, see Amazon Nova customization on SageMaker HyperPod in the [Amazon Nova 1.0 user guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp.html) or the [Amazon Nova 2.0 user guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp.html).

1. For **Name**, specify a name for the instance group.

1.  For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

1. For **Instance type**, choose the instance for the instance group.
**Important**  
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this tutorial, enter **1** for all three groups.

1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

1. For **Instance deep health checks**, choose your option. Deep health checks monitor instance health during creation and after software updates, automatically recovering faulty instances through reboots or replacements when enabled. To learn more, see [Deep health checks](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md).

1. For **Use GPU partition - optional**, if your instance type supports GPU partitioning with Multi-Instance GPU (MIG), you can enable this option to configure the GPU partition profile for the instance group. GPU partitioning allows you to divide GPUs into smaller, isolated partitions for improved resource utilization. For more information, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).

   1. Toggle **Use GPU partition** to enable GPU partitioning for this instance group.

   1. Select a **GPU partition profile** from the available options for your instance type. Each profile defines the GPU slice configuration and memory allocation.

1. Choose **Add instance group**.

### Lifecycle scripts
<a name="smcluster-getting-started-eks-console-create-cluster-custom-lifecycle"></a>

You can choose to use the default lifecycle scripts or custom lifecycle scripts stored in your Amazon S3 bucket. You can view the default lifecycle scripts in the [AWSome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts). To learn more about lifecycle scripts, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

1. For **Lifecycle scripts**, choose to use default or custom lifecycle scripts.

1. For **S3 bucket for lifecycle scripts**, choose to create a new bucket or use an existing bucket to store the lifecycle scripts.

### Permissions
<a name="smcluster-getting-started-eks-console-create-cluster-custom-permissions"></a>

Choose or create an IAM role that allows HyperPod to run and access necessary AWS resources on your behalf. For more information, see [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod).

### Storage
<a name="smcluster-getting-started-eks-console-create-cluster-custom-storage"></a>

Configure the FSx for Lustre file system to be provisioned on the HyperPod cluster.

1. For **File system**, choose whether to use an existing FSx for Lustre file system, create a new FSx for Lustre file system, or skip provisioning an FSx for Lustre file system.

1. For **Throughput per unit of storage**, choose the throughput that will be available per TiB of provisioned storage.

1. For **Storage capacity**, enter a capacity value in TB.

1. For **Data compression type**, choose **LZ4** to enable data compression.

1. For **Lustre version**, view the version that's recommended for new file systems.

### Tags - optional
<a name="smcluster-getting-started-eks-console-create-cluster-tags"></a>

For **Tags - *optional***, add key and value pairs to the new cluster and manage the cluster as an AWS resource. To learn more, see [Tagging your AWS resources](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).

## Deploy resources
<a name="smcluster-getting-started-eks-console-create-cluster-deploy"></a>

After you complete the cluster configurations using either **Quick setup** or **Custom setup**, choose one of the following options to start resource provisioning and cluster creation.
+ **Submit** - SageMaker AI will start provisioning the default configuration resources and creating the cluster. 
+ **Download CloudFormation template parameters** - You download the configuration parameter JSON file and run an AWS CLI command to deploy the CloudFormation stack that provisions the configuration resources and creates the cluster. You can edit the downloaded parameter JSON file if needed. If you choose this option, see the instructions in [Creating SageMaker HyperPod clusters using CloudFormation templates](smcluster-getting-started-eks-console-create-cluster-cfn.md).

# Browsing, viewing, and editing SageMaker HyperPod clusters
<a name="sagemaker-hyperpod-eks-operate-console-ui-browse-view-edit"></a>

Use the following instructions to browse, view, and edit SageMaker HyperPod clusters orchestrated by Amazon EKS in the SageMaker AI console.

**Topics**
+ [To browse your SageMaker HyperPod clusters](#sagemaker-hyperpod-eks-operate-console-ui-browse-clusters)
+ [To view details of each SageMaker HyperPod cluster](#sagemaker-hyperpod-eks-operate-console-ui-view-details-of-clusters)
+ [To edit a SageMaker HyperPod cluster](#sagemaker-hyperpod-eks-operate-console-ui-edit-clusters)

## To browse your SageMaker HyperPod clusters
<a name="sagemaker-hyperpod-eks-operate-console-ui-browse-clusters"></a>

Under **Clusters** on the SageMaker HyperPod page in the SageMaker AI console, all created clusters should be listed under the **Clusters** section, which provides a summary view of clusters, their ARNs, status, and creation time.

## To view details of each SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-console-ui-view-details-of-clusters"></a>

Under **Clusters** on the SageMaker HyperPod page in the SageMaker AI console, the cluster names are displayed as links. Choose a cluster name link to see the details of that cluster.

## To edit a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-console-ui-edit-clusters"></a>

1. Under **Clusters** in the main pane of the SageMaker HyperPod console, choose the cluster you want to update.

1. Select your cluster, and choose **Edit**.

1. In the **Edit <your-cluster>** page, you can edit the configurations of existing instance groups, add more instance groups, delete instance groups, and change tags for the cluster. After making changes, choose **Submit**. 

   1. In the **Configure instance groups** section, you can add more instance groups by choosing **Create instance group**.

   1. In the **Configure instance groups** section, you can choose **Edit** next to an instance group to change its configuration or **Delete** to remove the instance group permanently.
**Important**  
When deleting an instance group, consider the following points:  
Your SageMaker HyperPod cluster must always maintain at least one instance group.
Ensure all critical data is backed up before removal.
The removal process cannot be undone.
**Note**  
Deleting an instance group will terminate all compute resources associated with that group.

   1. In the **Tags** section, you can update tags for the cluster.

# Deleting a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-console-ui-delete-cluster"></a>

Use the following instructions to delete SageMaker HyperPod clusters orchestrated by Amazon EKS in the SageMaker AI console.

1. Under **Clusters** in the main pane of the SageMaker HyperPod console, choose the cluster you want to delete.

1. Select your cluster, and choose **Delete**.

1. In the pop-up window for cluster deletion, review the cluster information carefully to confirm that you chose the right cluster to delete.

1. After you review the cluster information, choose **Yes, delete cluster**.

1. In the text field to confirm this deletion, type **delete**.

1. Choose **Delete** on the lower right corner of the pop-up window to finish sending the cluster deletion request.

**Note**  
When cluster deletion fails due to attached SageMaker HyperPod task governance policies, you will need to [Delete policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete.md).

# Creating SageMaker HyperPod clusters using CloudFormation templates
<a name="smcluster-getting-started-eks-console-create-cluster-cfn"></a>

You can create SageMaker HyperPod clusters using the CloudFormation templates for HyperPod. You must install the AWS CLI to proceed.

**Topics**
+ [Configure resources in the console and deploy using CloudFormation](#smcluster-getting-started-eks-console-create-cluster-deploy-console)
+ [Configure and deploy resources using CloudFormation](#smcluster-getting-started-eks-console-create-cluster-deploy-cfn)

## Configure resources in the console and deploy using CloudFormation
<a name="smcluster-getting-started-eks-console-create-cluster-deploy-console"></a>

You can configure resources using the AWS Management Console and deploy using the CloudFormation templates.

Follow these steps.

1. *Instead of choosing **Submit***, choose **Download CloudFormation template parameters** at the end of the tutorial in [Getting started with SageMaker HyperPod using the SageMaker AI console](smcluster-getting-started-slurm-console.md). The tutorial contains important configuration information you will need to create your cluster successfully.
**Important**  
If you choose **Submit**, you will not be able to deploy a cluster with the same name until you delete the cluster.

   After you choose **Download CloudFormation template parameters**, the **Using the configuration file to create the cluster using the AWS CLI** window will appear on the right side of the page.

1. In the **Using the configuration file to create the cluster using the AWS CLI** window, choose **Download configuration parameters file**. The file is downloaded to your machine. You can edit the configuration JSON file based on your needs or leave it as-is if no change is required.

1. In the terminal, navigate to the directory that contains the downloaded parameter file (passed as `file://params.json` in the following command).

1. Run the [create-stack](https://docs.aws.amazon.com//cli/latest/reference/cloudformation/create-stack.html) AWS CLI command to deploy the CloudFormation stack that will provision the configured resources and create the HyperPod cluster.

   ```
   aws cloudformation create-stack \
       --stack-name my-stack \
       --template-url https://aws-sagemaker-hyperpod-cluster-setup.amazonaws.com/templates-slurm/main-stack-slurm-based-template.yaml \
       --parameters file://params.json \
       --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
   ```

1. To view the status of the resources provisioning, navigate to the [CloudFormation console](https://console.aws.amazon.com/cloudformation).

   After the cluster creation completes, view the new cluster under **Clusters** in the main pane of the SageMaker HyperPod console. You can check its status under the **Status** column.

1. After the status of the cluster turns to `InService`, you can start logging into the cluster nodes. To access the cluster nodes and start running ML workloads, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).
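If you edit the downloaded parameter file before deploying, a quick structural check can catch malformed JSON before `create-stack` rejects it. The following Python sketch is illustrative only; it assumes the file is a JSON array of `ParameterKey`/`ParameterValue` objects, which is the shape that `--parameters file://params.json` expects, and the sample parameter name is hypothetical.

```python
import json

# Minimal sanity check for a CloudFormation parameters file:
# a JSON array of {"ParameterKey": ..., "ParameterValue": ...} objects.
def validate_params(text):
    params = json.loads(text)  # raises ValueError on malformed JSON
    if not isinstance(params, list):
        raise ValueError("expected a JSON array of parameter objects")
    for entry in params:
        missing = {"ParameterKey", "ParameterValue"} - set(entry)
        if missing:
            raise ValueError(f"parameter entry is missing {missing}")
    return params

# Hypothetical parameter entry for demonstration only.
sample = '[{"ParameterKey": "HyperPodClusterName", "ParameterValue": "ml-cluster"}]'
print(len(validate_params(sample)))  # prints 1
```

A check like this fails locally in milliseconds instead of waiting for CloudFormation to reject the stack request.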

## Configure and deploy resources using CloudFormation
<a name="smcluster-getting-started-eks-console-create-cluster-deploy-cfn"></a>

You can configure and deploy resources using the CloudFormation templates for SageMaker HyperPod.

Follow these steps.

1. Download a CloudFormation template for SageMaker HyperPod from the [sagemaker-hyperpod-cluster-setup](https://github.com/aws/sagemaker-hyperpod-cluster-setup) GitHub repository.

1. Run the [create-stack](https://docs.aws.amazon.com//cli/latest/reference/cloudformation/create-stack.html) AWS CLI command to deploy the CloudFormation stack that will provision the configured resources and create the HyperPod cluster.

   ```
   aws cloudformation create-stack \
       --stack-name my-stack \
       --template-url URL_of_the_file_that_contains_the_template_body \
       --parameters file://params.json \
       --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
   ```

1. To view the status of the resources provisioning, navigate to the CloudFormation console.

   After the cluster creation completes, view the new cluster under **Clusters** in the main pane of the SageMaker HyperPod console. You can check its status under the **Status** column.

1. After the status of the cluster turns to `InService`, you can start logging into the cluster nodes.

# Managing SageMaker HyperPod EKS clusters using the AWS CLI
<a name="sagemaker-hyperpod-eks-operate-cli-command"></a>

The following topics provide guidance on writing SageMaker HyperPod API request files in JSON format and running them using AWS CLI commands.

**Topics**
+ [Creating a SageMaker HyperPod cluster](sagemaker-hyperpod-eks-operate-cli-command-create-cluster.md)
+ [Retrieving SageMaker HyperPod cluster details](sagemaker-hyperpod-eks-operate-cli-command-cluster-details.md)
+ [Updating SageMaker HyperPod cluster configuration](sagemaker-hyperpod-eks-operate-cli-command-update-cluster.md)
+ [Updating the SageMaker HyperPod platform software](sagemaker-hyperpod-eks-operate-cli-command-update-cluster-software.md)
+ [Accessing SageMaker HyperPod cluster nodes](sagemaker-hyperpod-eks-operate-access-through-terminal.md)
+ [Scaling down a SageMaker HyperPod cluster](smcluster-scale-down.md)
+ [Deleting a SageMaker HyperPod cluster](sagemaker-hyperpod-eks-operate-cli-command-delete-cluster.md)

# Creating a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-cli-command-create-cluster"></a>

Learn how to create SageMaker HyperPod clusters orchestrated by Amazon EKS using the AWS CLI.

1. Before creating a SageMaker HyperPod cluster:

   1. Ensure that you have an existing Amazon EKS cluster up and running. For detailed instructions about how to set up an Amazon EKS cluster, see [Create an Amazon EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html) in the *Amazon EKS User Guide*.

   1. Install the Helm chart as instructed in [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md). If you create an [Amazon Nova SageMaker HyperPod cluster](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp-cluster.html), you will need a separate Helm chart.

1. Prepare a lifecycle configuration script and upload it to an Amazon S3 bucket, such as `s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/`.

   For a quick start, download the sample script [https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts/base-config/on_create.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts/base-config/on_create.sh) from the AWSome Distributed Training GitHub repository, and upload it to the S3 bucket. You can also include additional setup instructions, a series of setup scripts, or commands to be executed during the HyperPod cluster provisioning stage.
**Important**  
If you create an [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) attaching only the managed [https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-cluster.html](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-cluster.html), your cluster can access only Amazon S3 buckets with the specific prefix `sagemaker-`.

   If you create a restricted instance group, you don't need to download and run the lifecycle script. Instead, you need to run `install_rig_dependencies.sh`. 

   The prerequisites to run the `install_rig_dependencies.sh` script include:
   + AWS Node (CNI) and CoreDNS must both be enabled. These are standard EKS add-ons that are not managed by the standard SageMaker HyperPod Helm chart, but they can be enabled in the EKS console under **Add-ons**.
   + The standard SageMaker HyperPod Helm chart must be installed before running this script.

   The `install_rig_dependencies.sh` script performs the following actions:
   + `aws-node` (CNI): Creates a new `rig-aws-node` DaemonSet and patches the existing `aws-node` DaemonSet so that it avoids RIG nodes.
   + `coredns`: Converts CoreDNS to a DaemonSet for RIGs to support multi-RIG use and prevent overloading.
   + Training operators: Updates them with RIG worker taint tolerations and node affinity favoring non-RIG instances.
   + Elastic Fabric Adapter (EFA): Updates it to tolerate the RIG worker taint and use the correct container images for each Region.

1. Prepare a [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) API request file in JSON format. For `ExecutionRole`, provide the ARN of the IAM role you created with the managed `AmazonSageMakerClusterInstanceRolePolicy` from the section [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod).
**Note**  
Ensure that your SageMaker HyperPod cluster is deployed within the same Virtual Private Cloud (VPC) as your Amazon EKS cluster. The subnets and security groups specified in the SageMaker HyperPod cluster configuration must allow network connectivity and communication with the Amazon EKS cluster's API server endpoint.

   ```
   // create_cluster.json
   {
       "ClusterName": "string",
       "InstanceGroups": [{
           "InstanceGroupName": "string",
           "InstanceType": "string",
           "InstanceCount": number,
           "LifeCycleConfig": {
               "SourceS3Uri": "s3://amzn-s3-demo-bucket-sagemaker/lifecycle-script-directory/src/",
               "OnCreate": "on_create.sh"
           },
           "ExecutionRole": "string",
           "ThreadsPerCore": number,
           "OnStartDeepHealthChecks": [
               "InstanceStress", "InstanceConnectivity"
           ]
       }],
       "RestrictedInstanceGroups": [ 
         { 
            "EnvironmentConfig": { 
               "FSxLustreConfig": { 
                  "PerUnitStorageThroughput": number,
                  "SizeInGiB": number
               }
            },
            "ExecutionRole": "string",
            "InstanceCount": number,
            "InstanceGroupName": "string",
            "InstanceStorageConfigs": [ 
               { ... }
            ],
            "InstanceType": "string",
            "OnStartDeepHealthChecks": [ "string" ],
            "OverrideVpcConfig": { 
               "SecurityGroupIds": [ "string" ],
               "Subnets": [ "string" ]
            },
            "ScheduledUpdateConfig": { 
               "DeploymentConfig": { 
                  "AutoRollbackConfiguration": [ 
                     { 
                        "AlarmName": "string"
                     }
                  ],
                  "RollingUpdatePolicy": { 
                     "MaximumBatchSize": { 
                        "Type": "string",
                        "Value": number
                     },
                     "RollbackMaximumBatchSize": { 
                        "Type": "string",
                        "Value": number
                     }
                  },
                  "WaitIntervalInSeconds": number
               },
               "ScheduleExpression": "string"
            },
            "ThreadsPerCore": number,
            "TrainingPlanArn": "string"
         }
      ],
       "VpcConfig": {
           "SecurityGroupIds": ["string"],
           "Subnets": ["string"]
       },
       "Tags": [{
           "Key": "string",
           "Value": "string"
       }],
       "Orchestrator": {
           "Eks": {
               "ClusterArn": "string",
               "KubernetesConfig": {
                   "Labels": {
                       "nvidia.com/mig.config": "all-3g.40gb"
                   }
               }
           }
       },
       "NodeRecovery": "Automatic"
   }
   ```
**Flexible instance groups**  
Instead of specifying a single `InstanceType`, you can use the `InstanceRequirements` parameter to specify multiple instance types for an instance group. Note the following:  
`InstanceType` and `InstanceRequirements` are mutually exclusive. You must specify one or the other, but not both.
`InstanceRequirements.InstanceTypes` is an ordered list that determines provisioning priority. SageMaker HyperPod attempts to provision the first instance type in the list and falls back to subsequent types if capacity is unavailable. You can specify up to 20 instance types, and the list must not contain duplicates.
Flexible instance groups require continuous node provisioning mode.
The following example shows an instance group using `InstanceRequirements`:  

   ```
   {
       "InstanceGroupName": "flexible-ig",
       "InstanceRequirements": {
           "InstanceTypes": ["ml.p5.48xlarge", "ml.p4d.24xlarge", "ml.g6.48xlarge"]
       },
       "InstanceCount": 10,
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://amzn-s3-demo-bucket-sagemaker/lifecycle-script-directory/src/",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster"
   }
   ```

   Note the following when configuring a new SageMaker HyperPod cluster associated with an EKS cluster.
   + You can configure up to 20 instance groups under the `InstanceGroups` parameter.
   + For `Orchestrator.Eks.ClusterArn`, specify the ARN of the EKS cluster you want to use as the orchestrator.
   + For `OnStartDeepHealthChecks`, add `InstanceStress` and `InstanceConnectivity` to enable [Deep health checks](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md).
   + For `NodeRecovery`, specify `Automatic` to enable automatic node recovery. SageMaker HyperPod replaces or reboots instances (nodes) when issues are found by the health-monitoring agent.
   + For the `Tags` parameter, you can add custom tags for managing the SageMaker HyperPod cluster as an AWS resource. You can add tags to your cluster in the same way you add them in other AWS services that support tagging. To learn more about tagging AWS resources in general, see [Tagging your AWS resources](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).
   + For the `VpcConfig` parameter, specify the information of the VPC used in the EKS cluster. The subnets must be private.
   + For `Orchestrator.Eks.KubernetesConfig.Labels`, you can optionally specify Kubernetes labels to apply to the nodes. To enable GPU partitioning with Multi-Instance GPU (MIG), add the `nvidia.com/mig.config` label with the desired MIG profile. For example, `"nvidia.com/mig.config": "all-3g.40gb"` configures all GPUs with the 3g.40gb partition profile. For more information about GPU partitioning and available profiles, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).

1. Run the [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-cluster.html) command as follows.
**Important**  
When running the `create-cluster` command with the `--cli-input-json` parameter, you must include the `file://` prefix before the complete path to the JSON file. This prefix is required to ensure that the AWS CLI recognizes the input as a file path. Omitting the `file://` prefix results in a parsing parameter error.

   ```
   aws sagemaker create-cluster \
       --cli-input-json file://complete/path/to/create_cluster.json
   ```

   This should return the ARN of the new cluster.
**Important**  
You can use the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) operation to remove a restricted instance group (RIG). When a RIG is scaled down to 0, the FSx for Lustre file system won't be deleted. To completely remove the FSx for Lustre file system, you must remove the RIG entirely.   
Removing a RIG will not delete any artifacts stored in the service-managed Amazon S3 bucket. However, you should ensure all artifacts in the FSx for Lustre file system are fully synchronized to Amazon S3 before removal. We recommend waiting at least 30 minutes after job completion to ensure complete synchronization of all artifacts from the FSx for Lustre file system to the service-managed Amazon S3 bucket.
**Important**  
When using an onboarded On-Demand Capacity Reservation (ODCR), you must map your instance group to the same Availability Zone ID (AZ ID) as the ODCR by setting `OverrideVpcConfig` with a subnet in the matching AZ ID.  
Verify the `OverrideVpcConfig` configuration before deployment to avoid incurring duplicate charges for both ODCR and On-Demand capacity.
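The `InstanceType` and `InstanceRequirements` rules described in this procedure can be encoded as a small pre-submission check on your request file. The following Python sketch is illustrative only and is not part of the AWS CLI or the SageMaker API; it simply restates the constraints from this section.

```python
def validate_instance_group(group):
    """Check the InstanceType/InstanceRequirements rules from this section."""
    has_type = "InstanceType" in group
    has_reqs = "InstanceRequirements" in group
    # The two fields are mutually exclusive; exactly one must be present.
    if has_type == has_reqs:
        raise ValueError("specify exactly one of InstanceType or InstanceRequirements")
    if has_reqs:
        types = group["InstanceRequirements"]["InstanceTypes"]
        # Ordered list: the first entry is tried first; up to 20 entries, no duplicates.
        if not 1 <= len(types) <= 20:
            raise ValueError("InstanceTypes must contain 1-20 entries")
        if len(set(types)) != len(types):
            raise ValueError("InstanceTypes must not contain duplicates")

# Passes: a flexible group with an ordered fallback list.
validate_instance_group({
    "InstanceGroupName": "flexible-ig",
    "InstanceRequirements": {"InstanceTypes": ["ml.p5.48xlarge", "ml.p4d.24xlarge"]},
})
```

A check like this fails fast locally instead of waiting for the `create-cluster` request to be rejected.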

# Retrieving SageMaker HyperPod cluster details
<a name="sagemaker-hyperpod-eks-operate-cli-command-cluster-details"></a>

Learn how to retrieve SageMaker HyperPod cluster details using the AWS CLI.

## Describe a cluster
<a name="sagemaker-hyperpod-eks-operate-cli-command-describe-cluster"></a>

Run [describe-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster.html) to check the status of the cluster. You can specify either the name or the ARN of the cluster.

```
aws sagemaker describe-cluster --cluster-name your-hyperpod-cluster
```

After the status of the cluster turns to **InService**, proceed to the next step. Using this API, you can also retrieve failure messages from running other HyperPod API operations.
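If you script the wait, the polling loop can be sketched as follows. This is a hypothetical helper, not part of the AWS CLI; `get_status` stands in for however you retrieve the `ClusterStatus` value (for example, by calling `describe-cluster` and reading the status field).

```python
import time

def wait_for_in_service(get_status, poll_seconds=30, max_polls=120):
    """Poll a status callable until the cluster reaches InService.

    get_status is a zero-argument callable returning the ClusterStatus
    string (for example, Creating, InService, or Failed).
    """
    for _ in range(max_polls):
        status = get_status()
        if status == "InService":
            return True
        if status in ("Failed", "RollingBack"):
            raise RuntimeError(f"cluster entered terminal status {status}")
        time.sleep(poll_seconds)
    return False  # gave up before the cluster became InService

# Example with a canned sequence of statuses standing in for real calls:
statuses = iter(["Creating", "Creating", "InService"])
print(wait_for_in_service(lambda: next(statuses), poll_seconds=0))  # prints True
```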

## List details of cluster nodes
<a name="sagemaker-hyperpod-eks-operate-cli-command-list-cluster-nodes"></a>

Run [list-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-cluster-nodes.html) to check the key information of the cluster nodes.

```
aws sagemaker list-cluster-nodes --cluster-name your-hyperpod-cluster
```

This returns a response that includes an `InstanceId` for each node, which is what you need for logging in to the nodes (using `aws ssm`).
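If you script against this response, the instance IDs can be collected as in the following Python sketch. The `ClusterNodeSummaries` field name follows the ListClusterNodes API response; the sample response below is made up for illustration.

```python
import json

# Illustrative response in the shape returned by list-cluster-nodes.
response = json.loads("""
{
    "ClusterNodeSummaries": [
        {"InstanceGroupName": "worker-group-1",
         "InstanceId": "i-111222333444555aa",
         "InstanceType": "ml.p5.48xlarge",
         "InstanceStatus": {"Status": "Running"}}
    ]
}
""")

# Collect the instance IDs you would pass to `aws ssm` for node access.
instance_ids = [node["InstanceId"] for node in response["ClusterNodeSummaries"]]
print(instance_ids)  # prints ['i-111222333444555aa']
```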

## Describe details of a cluster node
<a name="sagemaker-hyperpod-eks-operate-cli-command-describe-cluster-node"></a>

Run [describe-cluster-node](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster-node.html) to retrieve details of a cluster node. You can get the cluster node ID from list-cluster-nodes output. You can specify either the name or the ARN of the cluster.

```
aws sagemaker describe-cluster-node \
    --cluster-name your-hyperpod-cluster \
    --node-id i-111222333444555aa
```

## List clusters
<a name="sagemaker-hyperpod-eks-operate-cli-command-list-clusters"></a>

Run [list-clusters](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-clusters.html) to list all clusters in your account.

```
aws sagemaker list-clusters
```

You can also add flags to filter the list of clusters. To learn more about what this command runs at a low level and the additional flags for filtering, see the [ListClusters](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusters.html) API reference.
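As a sketch of what client-side filtering can look like, the following Python example narrows a saved `list-clusters` response to clusters in a given status. The `ClusterSummaries` field name follows the ListClusters API response; the cluster names are made up.

```python
import json

# Illustrative response in the shape returned by list-clusters.
response = json.loads("""
{
    "ClusterSummaries": [
        {"ClusterName": "ml-cluster", "ClusterStatus": "InService"},
        {"ClusterName": "test-cluster", "ClusterStatus": "Creating"}
    ]
}
""")

# Keep only clusters that are already in service.
in_service = [c["ClusterName"] for c in response["ClusterSummaries"]
              if c["ClusterStatus"] == "InService"]
print(in_service)  # prints ['ml-cluster']
```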

# Updating SageMaker HyperPod cluster configuration
<a name="sagemaker-hyperpod-eks-operate-cli-command-update-cluster"></a>

Run [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) to update the configuration of a cluster.

**Note**  
Important considerations:  
You cannot change the EKS cluster that your HyperPod cluster is associated with after the cluster is created.  
If deep health checks are running on the cluster, this API will not function as expected. You might encounter an error message stating that deep health checks are in progress. To update the cluster, you should wait until the deep health checks finish.

1. Create an [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) API request file in JSON format. Make sure that you specify the right cluster name and instance group name to update. For each instance group, you can change the instance type, the number of instances, the lifecycle configuration entrypoint script, and the path to the script.
**Note**  
You can use the `UpdateCluster` API to scale down or remove entire instance groups from your SageMaker HyperPod cluster. For additional instructions on how to scale down or delete instance groups, see [Scaling down a SageMaker HyperPod cluster](smcluster-scale-down.md).

   1. For `ClusterName`, specify the name of the cluster you want to update.

   1. For `InstanceGroupName`

      1. To update an existing instance group, specify the name of the instance group you want to update.

      1. To add a new instance group, specify a new name not existing in your cluster.

   1. For `InstanceType`

      1. To update an existing instance group, you must specify the same instance type that you initially assigned to the group.

      1. To add a new instance group, specify an instance type you want to configure the group with.

      For instance groups that use `InstanceRequirements` instead of `InstanceType`, you can add or remove instance types from the `InstanceTypes` list. However, you cannot remove an instance type that has active nodes running on it. You also cannot switch between `InstanceType` and `InstanceRequirements` when updating an existing instance group. `InstanceType` and `InstanceRequirements` are mutually exclusive.

   1. For `InstanceCount`

      1. To update an existing instance group, specify an integer that corresponds to your desired number of instances. You can provide a higher or lower value (down to 0) to scale the instance group up or down.

      1. To add a new instance group, specify an integer greater than or equal to 1.

   1. For `LifeCycleConfig`, you can change the values of both `SourceS3Uri` and `OnCreate` to update the lifecycle configuration of the instance group.

   1. For `ExecutionRole`

      1. For updating an existing instance group, keep using the same IAM role you attached during cluster creation.

      1. For adding a new instance group, specify an IAM role you want to attach.

   1. For `ThreadsPerCore`

      1. For updating an existing instance group, keep using the same value you specified during cluster creation.

      1. For adding a new instance group, you can choose any value from the allowed options per instance type. For more information, search the instance type and see the **Valid threads per core** column in the reference table at [CPU cores and threads per CPU core per instance type](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html) in the *Amazon EC2 User Guide*.

   1. For `OnStartDeepHealthChecks`, add `InstanceStress` and `InstanceConnectivity` to enable [Deep health checks](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md).

   1. For `NodeRecovery`, specify `Automatic` to enable automatic node recovery. SageMaker HyperPod replaces or reboots instances (nodes) when issues are found by the health-monitoring agent.

   The following code snippet is a JSON request file template you can use. For more information about the request syntax and parameters of this API, see the [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) API reference.

   ```
   // update_cluster.json
   {
       // Required
       "ClusterName": "name-of-cluster-to-update",
       // Required
       "InstanceGroups": [{
           "InstanceGroupName": "string",
           "InstanceType": "string",
           "InstanceCount": number,
           "LifeCycleConfig": {
               "SourceS3Uri": "string",
               "OnCreate": "string"
           },
           "ExecutionRole": "string",
           "ThreadsPerCore": number,
           "OnStartDeepHealthChecks": [
               "InstanceStress", "InstanceConnectivity"
           ]
       }],
       "NodeRecovery": "Automatic"
   }
   ```

1. Run the following `update-cluster` command to submit the request. 

   ```
   aws sagemaker update-cluster \
       --cli-input-json file://complete/path/to/update_cluster.json
   ```
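Before calling `update-cluster`, a quick local sanity check of the request file can catch issues such as the `InstanceType`/`InstanceRequirements` conflict described above. The following Python sketch is illustrative only: the `validate_update_cluster_request` helper is hypothetical, not part of any AWS SDK, and the service performs its own, fuller validation.

```
# Illustrative, local pre-submission check for an UpdateCluster request file.
# Field names follow the UpdateCluster API; the rules below are a simplified
# sketch of the constraints described in this section.

def validate_update_cluster_request(request: dict) -> list[str]:
    """Return a list of problems found in the request dict (empty if none)."""
    problems = []
    if not request.get("ClusterName"):
        problems.append("ClusterName is required")
    for group in request.get("InstanceGroups", []):
        name = group.get("InstanceGroupName", "<unnamed>")
        # InstanceType and InstanceRequirements are mutually exclusive
        if "InstanceType" in group and "InstanceRequirements" in group:
            problems.append(
                f"{name}: InstanceType and InstanceRequirements are mutually exclusive"
            )
        # Existing groups may scale down to 0, but never below it
        if group.get("InstanceCount", 0) < 0:
            problems.append(f"{name}: InstanceCount must be >= 0")
    return problems
```

Running this check against your `update_cluster.json` before submission avoids a round trip to the API for obvious mistakes.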

# Updating the SageMaker HyperPod platform software
<a name="sagemaker-hyperpod-eks-operate-cli-command-update-cluster-software"></a>

When you create your SageMaker HyperPod cluster, SageMaker HyperPod selects an Amazon Machine Image (AMI) corresponding to the Kubernetes version of your Amazon EKS cluster.

Run [update-cluster-software](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster-software.html) to update existing clusters with software and security patches provided by the SageMaker HyperPod service. For `--cluster-name`, specify either the name or the ARN of the cluster to update.

**Important**  
When this API is called, SageMaker HyperPod doesn’t drain or redistribute the jobs (Pods) running on the nodes. Make sure to check if there are any jobs running on the nodes before calling this API.
The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre.
All cluster nodes experience downtime (nodes appear as `<NotReady>` in the output of `kubectl get node`) while the patching is in progress. We recommend that you terminate all workloads before patching and resume them after the patch completes.   
If the security patch fails, you can retrieve failure messages by running the [DescribeCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html) API as instructed at [Describe a cluster](sagemaker-hyperpod-eks-operate-cli-command-cluster-details.md#sagemaker-hyperpod-eks-operate-cli-command-describe-cluster).

```
aws sagemaker update-cluster-software --cluster-name your-hyperpod-cluster
```

**Rolling upgrades with flexible instance groups**  
For instance groups that use `InstanceRequirements` with multiple instance types, rolling upgrades spread each instance type proportionally across batches. For example, if an instance group has 100 instances (10 P5 and 90 G6) and you configure a 10% batch size, each batch contains 1 P5 instance and 9 G6 instances.
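The proportional split described above can be sketched as follows. This is an illustrative calculation based on the example; the service's exact rounding behavior is not documented here.

```
# Illustrative sketch of a proportional rolling-upgrade batch split.
# Each instance type contributes batch_percent% of its count to one batch.

def batch_allocation(type_counts: dict[str, int], batch_percent: int) -> dict[str, int]:
    """Number of instances of each type included in one upgrade batch."""
    return {
        instance_type: count * batch_percent // 100
        for instance_type, count in type_counts.items()
    }
```

For the example above, a 10% batch over 10 P5 and 90 G6 instances yields 1 P5 and 9 G6 instances per batch.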

 When calling the `UpdateClusterSoftware` API, SageMaker HyperPod updates the Kubernetes version of the nodes by selecting the latest [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami) based on the Kubernetes version of your Amazon EKS cluster. It then runs the lifecycle scripts in the Amazon S3 bucket that you specified during the cluster creation or update. 

You can verify the kubelet version of a node by running the `kubectl describe node` command.

The Kubernetes version of SageMaker HyperPod cluster nodes does not automatically update when you update your Amazon EKS cluster version. After updating the Kubernetes version for your Amazon EKS cluster, you must use the `UpdateClusterSoftware` API to update your SageMaker HyperPod cluster nodes to the same Kubernetes version.

We recommend that you update your SageMaker HyperPod cluster after updating your Amazon EKS cluster, and that you avoid more than one version difference between the Amazon EKS cluster version and the SageMaker HyperPod cluster node version.
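The one-version recommendation can be expressed as a small sketch. The helper below is hypothetical and assumes versions in `major.minor` form, such as `1.29`.

```
# Hypothetical helper for the recommendation above: keep the HyperPod node
# Kubernetes version within one minor version of the EKS control plane.

def minor_version_skew(eks_version: str, node_version: str) -> int:
    """Minor-version difference between versions like '1.29' and '1.28'."""
    eks_minor = int(eks_version.split(".")[1])
    node_minor = int(node_version.split(".")[1])
    return abs(eks_minor - node_minor)

def skew_within_recommendation(eks_version: str, node_version: str) -> bool:
    return minor_version_skew(eks_version, node_version) <= 1
```

You can feed this check with the control plane version from the Amazon EKS console and the kubelet version from `kubectl describe node`.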

The SageMaker HyperPod service team regularly rolls out new [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami)s to enhance security and improve the user experience. We recommend that you always update to the latest SageMaker HyperPod DLAMI. For future SageMaker HyperPod DLAMI updates for security patching, see the [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

**Note**  
You can only run this API programmatically. The patching functionality is not implemented in the SageMaker HyperPod console UI.

# Accessing SageMaker HyperPod cluster nodes
<a name="sagemaker-hyperpod-eks-operate-access-through-terminal"></a>

You can directly access the nodes of a SageMaker HyperPod cluster in service using the AWS CLI commands for AWS Systems Manager (SSM). Run `aws ssm start-session` with the host name of the node in the format `sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]`. You can retrieve the cluster ID, the instance ID, and the instance group name from the [SageMaker HyperPod console](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-view-details-of-clusters) or by running `describe-cluster` and `list-cluster-nodes` from the [AWS CLI commands for SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-list-cluster-nodes). For example, if your cluster ID is `aa11bbbbb222`, the instance group name is `controller-group`, and the instance ID is `i-111222333444555aa`, the SSM `start-session` command should be the following.

**Note**  
If you haven't set up AWS Systems Manager, follow the instructions provided at [Setting up AWS Systems Manager and Run As for cluster user access control](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-ssm).

```
$ aws ssm start-session \
    --target sagemaker-cluster:aa11bbbbb222_controller-group-i-111222333444555aa \
    --region us-west-2
Starting session with SessionId: s0011223344aabbccdd
root@ip-111-22-333-444:/usr/bin#
```
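A small helper can assemble the target string in the documented format. This is illustrative only; the function name is hypothetical and the IDs are placeholders.

```
# Helper that assembles the SSM target string in the documented format:
# sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]

def ssm_target(cluster_id: str, instance_group: str, instance_id: str) -> str:
    return f"sagemaker-cluster:{cluster_id}_{instance_group}-{instance_id}"
```

Using the example values above, `ssm_target("aa11bbbbb222", "controller-group", "i-111222333444555aa")` produces the target shown in the preceding command.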

# Scaling down a SageMaker HyperPod cluster
<a name="smcluster-scale-down"></a>

You can scale down the number of instances running on your Amazon SageMaker HyperPod cluster. You might want to scale down a cluster for various reasons, such as reduced resource utilization or cost optimization.

The following page outlines two main approaches to scaling down:
+ **Scale down at the instance group level:** This approach uses the `UpdateCluster` API, with which you can:
  + Scale down the instance counts for specific instance groups independently. SageMaker AI handles the termination of nodes in a way that reaches the new target instance counts you've set for each group. See [Scale down an instance group](#smcluster-scale-down-updatecluster).
  + Completely delete instance groups from your cluster. See [Delete instance groups](#smcluster-remove-instancegroup).
+ **Scale down at the instance level:** This approach uses the `BatchDeleteClusterNodes` API, with which you can specify the individual nodes you want to terminate. See [Scale down at the instance level](#smcluster-scale-down-batchdelete).

**Note**  
When scaling down at the instance level with `BatchDeleteClusterNodes`, you can only terminate a maximum of 99 instances at a time. `UpdateCluster` supports terminating any number of instances.
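Because of the 99-node limit, a larger termination list must be split across multiple `BatchDeleteClusterNodes` requests. The following Python sketch (illustrative only, not part of any AWS SDK) chunks a node ID list into compliant batches:

```
# BatchDeleteClusterNodes accepts at most 99 node IDs per request, so a
# larger termination list is split into compliant batches, one per API call.

MAX_NODES_PER_BATCH = 99

def node_id_batches(node_ids: list[str]) -> list[list[str]]:
    """Split node IDs into chunks of at most 99 for separate API calls."""
    return [
        node_ids[i:i + MAX_NODES_PER_BATCH]
        for i in range(0, len(node_ids), MAX_NODES_PER_BATCH)
    ]
```

Each resulting chunk can then be passed as the `--node-ids` argument of a separate `batch-delete-cluster-nodes` call.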

## Important considerations
<a name="smcluster-scale-down-considerations"></a>
+ When scaling down a cluster, you should ensure that the remaining resources are sufficient to handle your workload and that any necessary data migration or rebalancing is properly handled to avoid disruptions. 
+ Make sure to back up your data to Amazon S3 or an FSx for Lustre file system before invoking the API on a worker node group. This can help prevent any potential data loss from the instance root volume. For more information about backup, see [Use the backup script provided by SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup).
+ To invoke this API on an existing cluster, you must first patch the cluster by running the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API. For more information about patching a cluster, see [Update the SageMaker HyperPod platform software of a cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software).
+ Metering/billing for on-demand instances will automatically be stopped after scale down. To stop metering for scaled-down reserved instances, you should reach out to your AWS account team for support.
+ You can use the released capacity from the scaled-down reserved instances to scale up another SageMaker HyperPod cluster.

## Scale down at the instance group level
<a name="smcluster-scale-down-or-delete"></a>

The [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) operation allows you to make changes to the configuration of your SageMaker HyperPod cluster, such as scaling down the number of instances of an instance group or removing entire instance groups. This can be useful when you want to adjust the resources allocated to your cluster based on changes in your workload, optimize costs, or change the instance type of an instance group.

### Scale down an instance group
<a name="smcluster-scale-down-updatecluster"></a>

Use this approach when you have an instance group that is idle and it's safe to terminate any of the instances for scaling down. When you submit an `UpdateCluster` request to scale down, HyperPod randomly chooses instances for termination and scales down to the specified number of nodes for the instance group.

**Scale-down behavior with flexible instance groups**  
For instance groups that use `InstanceRequirements` with multiple instance types, HyperPod terminates the lowest-priority instance types first during scale-down. The priority is determined by the order of instance types in the `InstanceTypes` list, where the first type has the highest priority. This protects higher-priority instances, which are typically higher-performance, during scale-down operations.
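The priority rule can be illustrated with a small sketch. This is a hypothetical helper, not the service implementation: the first type in the priority list is terminated last.

```
# Sketch of the documented scale-down priority: instance types later in the
# InstanceTypes list are lower priority and are terminated first.

def termination_order(instance_types_priority: list[str],
                      nodes: list[tuple[str, str]]) -> list[str]:
    """Order node IDs so lowest-priority instance types come first.

    nodes is a list of (node_id, instance_type) pairs; priority is the index
    in instance_types_priority (index 0 = highest priority, terminated last).
    """
    return [
        node_id for node_id, _ in
        sorted(nodes,
               key=lambda n: instance_types_priority.index(n[1]),
               reverse=True)
    ]
```

For example, with priority `["ml.p5.48xlarge", "ml.g6.12xlarge"]`, the G6 nodes are ordered before the P5 nodes for termination.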

**Note**  
When you scale the number of instances in an instance group down to 0, all the instances within that group will be terminated. However, the instance group itself will still exist as part of the SageMaker HyperPod cluster. You can scale the instance group back up at a later time, using the same instance group configuration.   
Alternatively, you can choose to remove an instance group permanently. For more information, see [Delete instance groups](#smcluster-remove-instancegroup).

**To scale down with `UpdateCluster`**

1. Follow the steps outlined in [Updating SageMaker HyperPod cluster configuration](sagemaker-hyperpod-eks-operate-cli-command-update-cluster.md). When you reach step **1.d** where you specify the **InstanceCount** field, enter a number that is smaller than the current number of instances to scale down the cluster.

1. Run the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) AWS CLI command to submit your request.

The following is an example of an `UpdateCluster` JSON object. Consider the case where your instance group currently has 2 running instances. If you set the **InstanceCount** field to 1, as shown in the example, then HyperPod randomly selects one of the instances and terminates it.

```
{
  "ClusterName": "name-of-cluster-to-update",
  "InstanceGroups": [
    {
      "InstanceGroupName": "training-instances",
      "InstanceType": "instance-type",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://amzn-s3-demo-bucket/training-script.py",
        "OnCreate": "s3://amzn-s3-demo-bucket/setup-script.sh"
      },
      "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerRole",
      "ThreadsPerCore": number-of-threads,
      "OnStartDeepHealthChecks": [
        "InstanceStress",
        "InstanceConnectivity"
      ]
    }
  ],
  "NodeRecovery": "Automatic"
}
```

### Delete instance groups
<a name="smcluster-remove-instancegroup"></a>

You can use the [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) operation to remove entire instance groups from your SageMaker HyperPod cluster when they are no longer needed. This goes beyond simple scaling down, allowing you to completely eliminate specific instance groups from your cluster's configuration.

**Note**  
When removing an instance group:
+ All instances within the targeted group are terminated.
+ The entire group configuration is deleted from the cluster.
+ Any workloads running on that instance group are stopped.

**To delete instance groups with `UpdateCluster`**

1. When following the steps outlined in [Updating SageMaker HyperPod cluster configuration](sagemaker-hyperpod-eks-operate-cli-command-update-cluster.md):

   1. Set the optional `InstanceGroupsToDelete` parameter in your `UpdateCluster` JSON and pass the comma-separated list of instance group names that you want to delete.

   1.  When you specify the `InstanceGroups` list, ensure that the specifications of the instance groups you are removing are no longer listed in the `InstanceGroups` list.

1. Run the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) AWS CLI command to submit your request.

**Important**  
+ Your SageMaker HyperPod cluster must always maintain at least one instance group.
+ Ensure that all critical data is backed up before removal.
+ The removal process cannot be undone.

The following is an example of an `UpdateCluster` JSON object. Consider the case where a cluster currently has 3 instance groups, a *training*, a *prototype-training*, and an *inference-serving* group. You want to delete the *prototype-training* group.

```
{
  "ClusterName": "name-of-cluster-to-update",
  "InstanceGroups": [
    {
      "InstanceGroupName": "training",
      "InstanceType": "instance-type",
      "InstanceCount": instance-count,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://amzn-s3-demo-bucket/training-script.py",
        "OnCreate": "s3://amzn-s3-demo-bucket/setup-script.sh"
      },
      "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerRole",
      "ThreadsPerCore": number-of-threads,
      "OnStartDeepHealthChecks": [
        "InstanceStress",
        "InstanceConnectivity"
      ]
    },
    {
      "InstanceGroupName": "inference-serving",
      "InstanceType": "instance-type",
      "InstanceCount": 2,
      [...]
    }
  ],
  "InstanceGroupsToDelete": [ "prototype-training" ],
  "NodeRecovery": "Automatic"
}
```
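The steps above can be sketched as a small helper that derives the request payload from an existing configuration. This is illustrative only; the `build_delete_groups_request` name is hypothetical.

```
# Illustrative helper that turns an existing cluster configuration into an
# UpdateCluster payload that deletes selected instance groups: deleted groups
# are listed in InstanceGroupsToDelete and removed from InstanceGroups.

def build_delete_groups_request(cluster_name: str,
                                instance_groups: list[dict],
                                groups_to_delete: list[str]) -> dict:
    remaining = [
        g for g in instance_groups
        if g["InstanceGroupName"] not in groups_to_delete
    ]
    if not remaining:
        # A cluster must always keep at least one instance group
        raise ValueError("A cluster must keep at least one instance group")
    return {
        "ClusterName": cluster_name,
        "InstanceGroups": remaining,
        "InstanceGroupsToDelete": groups_to_delete,
    }
```

The returned dictionary can be serialized to JSON and passed to `aws sagemaker update-cluster --cli-input-json`.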

## Scale down at the instance level
<a name="smcluster-scale-down-batchdelete"></a>

The `BatchDeleteClusterNodes` operation allows you to scale down a SageMaker HyperPod cluster by specifying the individual nodes you want to terminate. `BatchDeleteClusterNodes` provides more granular control for targeted node removal and cluster optimization. For example, you might use `BatchDeleteClusterNodes` to delete targeted nodes for maintenance, rolling upgrades, or rebalancing resources geographically.

**API request and response**

When you submit a `BatchDeleteClusterNodes` request, SageMaker HyperPod deletes nodes by their instance IDs. The API accepts a request with the cluster name and a list of node IDs to be deleted. 

The response includes two sections: 
+ `Failed`: A list of errors of type [`BatchDeleteClusterNodesError`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BatchDeleteClusterNodesError.html), one per failed instance ID.
+ `Successful`: The list of instance IDs that were successfully terminated.

**Validation and error handling**

The API performs various validations, such as:
+ Verifying the node ID format (prefix of `i-` and Amazon EC2 instance ID structure). 
+ Checking the node list length, with a limit of 99 or fewer node IDs in a single `BatchDeleteClusterNodes` request.
+ Ensuring a valid SageMaker HyperPod cluster with the input cluster-name is present and that no cluster-level operations (update, system update, patching, or deletion) are in progress.
+ Handling cases where instances are not found, have invalid status, or are in use.
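The node ID format check can be approximated locally. This is an approximation only; the service's exact validation rules may differ.

```
# Approximation of the documented node ID format check: a prefix of "i-"
# followed by an EC2-style hexadecimal suffix.
import re

_NODE_ID_RE = re.compile(r"^i-[0-9a-f]{8,17}$")

def is_valid_node_id(node_id: str) -> bool:
    return bool(_NODE_ID_RE.fullmatch(node_id))
```

Filtering a candidate list with this check before calling the API avoids spending one of the 99 per-request slots on an obviously malformed ID.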

**API Response Codes**
+ The API returns a `200` status code for fully successful requests (all input nodes pass validation) and for partially successful requests (some input nodes fail validation).
+ If all input nodes fail validation, the API returns a `400` Bad Request response with the appropriate error messages and error codes.

**Example**

The following is an example of **scaling down a cluster at the instance level** using the AWS CLI:

```
aws sagemaker batch-delete-cluster-nodes --cluster-name "cluster-name" --node-ids '["i-111112222233333", "i-444445555566666"]'
```

# Deleting a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-cli-command-delete-cluster"></a>

Run [delete-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-cluster.html) to delete a cluster. You can specify either the name or the ARN of the cluster.

```
aws sagemaker delete-cluster --cluster-name your-hyperpod-cluster
```

This API only cleans up the SageMaker HyperPod resources. It doesn't delete any resources associated with the EKS cluster, such as the Amazon EKS cluster itself, EKS Pod identities, Amazon FSx volumes, and EKS add-ons, or the initial configuration you added to your EKS cluster. If you want to clean up all resources, make sure that you also clean up the EKS resources separately.

Make sure that you first delete the SageMaker HyperPod resources, followed by the EKS resources. Performing the deletion in the reverse order may result in lingering resources.

**Important**  
When this API is called, SageMaker HyperPod doesn’t drain or redistribute the jobs (Pods) running on the nodes. Make sure to check if there are any jobs running on the nodes before calling this API.

# HyperPod managed tiered checkpointing
<a name="managed-tier-checkpointing"></a>

This section explains how managed tiered checkpointing works and the benefits it provides for large-scale model training.

Amazon SageMaker HyperPod managed tiered checkpointing helps you train large-scale generative AI models more efficiently. It uses multiple storage tiers, including your cluster’s CPU memory. This approach reduces your time to recovery and minimizes loss in training progress. It also uses underutilized memory resources in your training infrastructure.

Managed tiered checkpointing saves checkpoints to memory at a high frequency and periodically persists them to durable storage. This maintains both performance and reliability during your training process.

This guide covers how to set up, configure, and use managed tiered checkpointing with PyTorch frameworks on Amazon EKS HyperPod clusters.

## How managed tiered checkpointing works
<a name="managed-tier-checkpointing-works"></a>

Managed tiered checkpointing uses a multi-tier storage approach. CPU memory serves as the primary tier to store model checkpoints. Secondary tiers include persistent storage options like Amazon S3.

When you save a checkpoint, the system stores it in allocated memory space across your cluster nodes. It automatically replicates data across adjacent compute nodes for enhanced reliability. This replication strategy protects against single or multiple node failures while providing fast access for recovery operations.

The system also periodically saves checkpoints to persistent storage according to your configuration. This ensures long-term durability of your training progress.

Key components include:
+ **Memory management system**: A memory management daemon that provides disaggregated memory as a service for checkpoint storage
+ **HyperPod Python library**: Interfaces with the disaggregated storage APIs and provides utilities for saving, loading, and managing checkpoints across tiers
+ **Checkpoint replication**: Automatically replicates checkpoints across multiple nodes for fault tolerance

The system integrates seamlessly with PyTorch training loops through simple API calls. It requires minimal changes to your existing code.

## Benefits
<a name="managed-tier-checkpointing-benefits"></a>

Managed tiered checkpointing delivers several advantages for large-scale model training:
+ **Improved usability**: Manages checkpoint save, replication, persistence, and recovery
+ **Faster checkpoint operations**: Memory-based storage provides faster save and load times compared to disk-based checkpointing, leading to faster recovery
+ **Fault tolerance**: Automatic checkpoint replication across nodes protects against hardware node failures
+ **Minimal code changes**: Simple API integration requires only minor modifications to existing training scripts
+ **Improved training throughput**: Reduced checkpoint overhead means more time spent on actual training

**Topics**
+ [How managed tiered checkpointing works](#managed-tier-checkpointing-works)
+ [Benefits](#managed-tier-checkpointing-benefits)
+ [Set up managed tiered checkpointing](managed-tier-checkpointing-setup.md)
+ [Removing managed tiered checkpointing](managed-tier-checkpointing-remove.md)
+ [Security considerations for managed tiered checkpointing](managed-tier-security-considerations.md)

# Set up managed tiered checkpointing
<a name="managed-tier-checkpointing-setup"></a>

This section describes the setup process for managed tiered checkpointing on Amazon SageMaker HyperPod. You’ll learn how to enable the capability on your cluster and implement checkpointing in your training code.

**Topics**
+ [Prerequisites](#managed-tier-checkpointing-setup-prerequisites)
+ [Step 1: Enable managed tiered checkpointing for your cluster](#managed-tier-checkpointing-setup-step-enable-for-cluster)
+ [Step 2: Install the Python library in your training image](#managed-tier-checkpointing-setup-step-install-library)
+ [Step 3: Save checkpoints in your training loop](#managed-tier-checkpointing-setup-step-save-checkpoint-in-loop)
+ [Step 4: Load checkpoints for recovery](#managed-tier-checkpointing-setup-step-load-checkpoint)
+ [Validate your managed tiered checkpointing operations](#managed-tier-checkpointing-setup-validation)

## Prerequisites
<a name="managed-tier-checkpointing-setup-prerequisites"></a>

Before setting up managed tiered checkpointing, ensure you have:
+ An Amazon EKS HyperPod cluster with sufficient CPU memory available for checkpoint allocation
+ PyTorch training workloads; both standard PyTorch and PyTorch Distributed Checkpoint (DCP) jobs are supported
+ Appropriate IAM permissions for cluster management, including:
  + Amazon CloudWatch and Amazon S3 write permissions for the training pod to read/write checkpoints and push metrics
  + You can configure these permissions by using [IAM roles for service accounts](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html)

## Step 1: Enable managed tiered checkpointing for your cluster
<a name="managed-tier-checkpointing-setup-step-enable-for-cluster"></a>

**Important**  
You must opt in to use managed tiered checkpointing.

Enable managed tiered checkpointing through the HyperPod APIs when creating or updating your cluster. The service automatically installs the memory management system when you specify the `TieredStorageConfig` parameter.

For new clusters, use the [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-cluster.html) AWS CLI command.

```
aws sagemaker create-cluster \
    --cluster-name cluster-name \
    --orchestrator "Eks={ClusterArn=eks-cluster-arn}" \
    --instance-groups '{
        "InstanceGroupName": "instance-group-name",
        "InstanceType": "instance-type",
        "InstanceCount": instance-count,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3-path-to-lifecycle-scripts",
            "OnCreate": "lifecycle-script-name"
        },
        "ExecutionRole": "instance-group-iam-role",
        "ThreadsPerCore": threads-per-core,
        "InstanceStorageConfigs": [
            { "EbsVolumeConfig": {"VolumeSizeInGB": volume-size} }
        ]
    }' \
    --vpc-config '{
        "SecurityGroupIds": ["security-group-ids"],
        "Subnets": ["subnets"]
    }' \
    --tiered-storage-config '{
        "Mode": "Enable"
    }'
```

The optional `InstanceMemoryAllocationPercentage` parameter of `TieredStorageConfig` specifies the percentage (an integer) of cluster memory to allocate for checkpointing. The valid range is 20-100.
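A simple local check of the documented range might look like the following. This is illustrative only; the service validates the submitted value itself.

```
# Local sanity check of the documented 20-100 range for
# InstanceMemoryAllocationPercentage (illustrative only).

def validate_memory_allocation_percentage(percentage: int) -> int:
    if not 20 <= percentage <= 100:
        raise ValueError(
            "InstanceMemoryAllocationPercentage must be between 20 and 100"
        )
    return percentage
```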

## Step 2: Install the Python library in your training image
<a name="managed-tier-checkpointing-setup-step-install-library"></a>

Install the [Amazon SageMaker checkpointing library](https://pypi.org/project/amzn-sagemaker-checkpointing/) and its dependencies in your training image by adding it to your Dockerfile:

```
# Add this line to your training image Dockerfile
RUN pip install amzn-sagemaker-checkpointing s3torchconnector tenacity torch boto3
```

## Step 3: Save checkpoints in your training loop
<a name="managed-tier-checkpointing-setup-step-save-checkpoint-in-loop"></a>

In your training loop, you can asynchronously save checkpoints using PyTorch DCP. The following is an example of how to do so.

```
import os
import time

import torch
import torch.distributed as dist
from torch.distributed.checkpoint import async_save, load
from amzn_sagemaker_checkpointing.checkpointing.filesystem.filesystem import (
    SageMakerTieredStorageWriter,
    SageMakerTieredStorageReader
)

# Initialize distributed training
dist.init_process_group(backend="nccl")

# Configure checkpointing
checkpoint_config = SageMakerCheckpointConfig(
    # Unique ID for your training job 
    # Allowed characters in ID include: alphanumeric, hyphens, and underscores
    namespace=os.environ.get('TRAINING_JOB_NAME', f'job-{int(time.time())}'),

    # Number of distributed processes/available GPUs
    world_size=dist.get_world_size(),

    # S3 storage location, required for SageMakerTieredStorageReader for read fallbacks
    # Required for SageMakerTieredStorageWriter when save_to_s3 is True
    s3_tier_base_path="s3://my-bucket/checkpoints"
)

# Your model and optimizer
model = MyModel()
optimizer = torch.optim.AdamW(model.parameters())

# Training loop
future = None
epoch = 0  # current epoch; referenced in the checkpoint state_dict below
in_memory_ckpt_freq = 10
s3_ckpt_freq = 50

for training_step in range(1000):
    # ... training code ...
    
    # Save checkpoint
    if (training_step % in_memory_ckpt_freq == 0 or 
        training_step % s3_ckpt_freq == 0):
        # Create state dictionary
        state_dict = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": training_step,
            "epoch": epoch
        }
        
        # Create storage writer for current step
        checkpoint_config.save_to_s3 = training_step % s3_ckpt_freq == 0
        storage_writer = SageMakerTieredStorageWriter(
            checkpoint_config=checkpoint_config,
            step=training_step
        )

        # wait for previous checkpoint to get completed
        if future is not None:
            exc = future.exception()
            if exc:
                print(f"Failure in saving previous checkpoint:{str(exc)}")
                # Handle failures as required
            else:
                result = future.result()
                # Process results from save, if required
        
        # Async save checkpoint using PyTorch DCP
        future = async_save(state_dict=state_dict, storage_writer=storage_writer)
        
        # Continue training while checkpoint saves in background
```

## Step 4: Load checkpoints for recovery
<a name="managed-tier-checkpointing-setup-step-load-checkpoint"></a>

The following is an example of loading a checkpoint.

```
# Create state dictionary template
state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": 0,
    "epoch": 0
}

# Load latest checkpoint
storage_reader = SageMakerTieredStorageReader(checkpoint_config=checkpoint_config)
load(state_dict, storage_reader=storage_reader)

# Load specific checkpoint step
storage_reader = SageMakerTieredStorageReader(
    checkpoint_config=checkpoint_config, 
    step=500 # Omit step to load the latest available checkpoint.
)
try:
    load(state_dict, storage_reader=storage_reader)
except BaseException as e:
    print(f"Checkpoint load failed: {str(e)}")
    # Add additional exception handling
```

## Validate your managed tiered checkpointing operations
<a name="managed-tier-checkpointing-setup-validation"></a>

You can validate your managed tiered checkpointing operations with logs.

**Custom logging (optional)**

You can integrate checkpointing logs with other logs by passing a custom logger to the library. For example, you can add a custom logger to your training code so that all logs from the library are also collected in the training logger.

**Enhanced service logging (optional)**

For enhanced debugging and service visibility, you can mount the checkpointing log path `/var/log/sagemaker_checkpointing` from within your pod to a path `/var/logs/sagemaker_checkpointing` on your host. This ensures that only library-specific logs are collected separately. This provides the service team with enhanced visibility for debugging and support.

# Removing managed tiered checkpointing
<a name="managed-tier-checkpointing-remove"></a>

This section explains how to disable managed tiered checkpointing when you no longer need it.

To disable managed tiered checkpointing, use the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) AWS CLI command to update your cluster configuration:

```
aws sagemaker update-cluster \
    --cluster-name cluster-name \
    --tiered-storage-config '{ "Mode": "Disable" }'
```

This removes the memory management daemon from your cluster. The daemon is implemented as a standard Kubernetes DaemonSet and follows standard Kubernetes lifecycle management.

# Security considerations for managed tiered checkpointing
<a name="managed-tier-security-considerations"></a>

This section covers important security considerations when using managed tiered checkpointing. These include Python pickle usage, Amazon S3 storage best practices, and network endpoint security.

**Python pickle usage**

Managed tiered checkpointing uses Python’s pickle module to deserialize checkpoint data stored in Amazon S3. This implementation has important security implications:
+ **Extended trust boundary**: When using managed tiered checkpointing with Amazon S3, the Amazon S3 bucket becomes part of your cluster’s trust boundary.
+ **Code execution risk**: Python’s pickle module can execute arbitrary code during deserialization. If an unauthorized user gains write access to your checkpoint Amazon S3 bucket, they could potentially craft malicious pickle data that executes when loaded by managed tiered checkpointing.

**Best practices for Amazon S3 storage**

When using managed tiered checkpointing with Amazon S3 storage:
+ **Restrict Amazon S3 bucket access**: Ensure that only authorized users and roles associated with your training cluster have access to the Amazon S3 bucket used for checkpointing.
+ **Implement bucket policies**: Configure appropriate bucket policies to prevent unauthorized access or modifications.
+ **Validate access patterns**: Implement logging for validating access patterns to your checkpoint Amazon S3 buckets.
+ **Validate bucket names**: Use caution with bucket name selection to avoid potential bucket hijacking.
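
For illustration, the following sketch of a bucket policy denies object writes to the checkpoint bucket by any principal other than the training cluster's role. The bucket name, account ID, and role name are hypothetical placeholders; adapt the policy to your own environment:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyWritesExceptTrainingRole",
            "Effect": "Deny",
            "Principal": "*",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::amzn-s3-demo-checkpoint-bucket/*",
            "Condition": {
                "ArnNotEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::111122223333:role/HyperPodTrainingClusterRole"
                }
            }
        }
    ]
}
```

A deny statement of this shape limits who can write checkpoint data, which reduces the risk of malicious pickle payloads described above.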

**Network endpoints**

Managed tiered checkpointing enables network endpoints on each of your compute nodes on the following ports: 9200/TCP, 9209/UDP, 9210/UDP, 9219/UDP, 9220/UDP, 9229/UDP, 9230/UDP, 9239/UDP, 9240/UDP. These ports are necessary for the checkpointing service to function and maintain data synchronization.

By default, SageMaker’s network configuration restricts access to these endpoints for security purposes. We recommend that you maintain these default restrictions.

When configuring your network settings for your nodes and VPC, follow AWS best practices for VPCs, security groups, and ACLs. For more information, see the following:
+ [Amazon SageMaker HyperPod prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-prerequisites.html#sagemaker-hyperpod-prerequisites-optional-vpcCluster)
+ [VPC security best practices](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-best-practices.html)

# SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance"></a>

SageMaker HyperPod task governance is a robust management system designed to streamline resource allocation and ensure efficient utilization of compute resources across teams and projects for your Amazon EKS clusters. It gives administrators the ability to set:
+ Priority levels for various tasks
+ Compute allocation for each team
+ How each team lends and borrows idle compute
+ Whether a team can preempt its own tasks

HyperPod task governance also provides observability for your Amazon EKS clusters, offering real-time visibility into cluster capacity. This includes compute availability and usage, team allocation and utilization, and task run and wait time information, setting you up for informed decision-making and proactive resource management.

The following sections cover how to set up, understand key concepts, and use HyperPod task governance for your Amazon EKS clusters.

**Topics**
+ [Setup for SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md)
+ [Dashboard](sagemaker-hyperpod-eks-operate-console-ui-governance-metrics.md)
+ [Tasks](sagemaker-hyperpod-eks-operate-console-ui-governance-tasks.md)
+ [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md)
+ [Example HyperPod task governance AWS CLI commands](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md)
+ [Troubleshoot](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md)
+ [Attribution document for Amazon SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-attributions.md)

# Setup for SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-setup"></a>

The following section provides information on how to get set up with the Amazon CloudWatch Observability EKS and SageMaker HyperPod task governance add-ons.

Ensure that you have the minimum permission policy for HyperPod cluster administrators with Amazon EKS, in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin). This includes permissions to run the SageMaker HyperPod core APIs, manage SageMaker HyperPod clusters within your AWS account, and perform the tasks in [Managing SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-operate.md).

**Topics**
+ [Dashboard setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard.md)
+ [Task governance setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-task-governance.md)

# Dashboard setup
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard"></a>

Use the following information to get set up with Amazon SageMaker HyperPod Amazon CloudWatch Observability EKS add-on. This sets you up with a detailed visual dashboard that provides a view into metrics for your EKS cluster hardware, team allocation, and tasks.

If you are having issues setting up, please see [Troubleshoot](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md) for known troubleshooting solutions.

**Topics**
+ [HyperPod Amazon CloudWatch Observability EKS add-on prerequisites](#hp-eks-dashboard-prerequisites)
+ [HyperPod Amazon CloudWatch Observability EKS add-on setup](#hp-eks-dashboard-setup)

## HyperPod Amazon CloudWatch Observability EKS add-on prerequisites
<a name="hp-eks-dashboard-prerequisites"></a>

The following section includes the prerequisites needed before installing the Amazon EKS Observability add-on.
+ Ensure that you have the minimum permission policy for HyperPod cluster administrators, in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin).
+ Attach the `CloudWatchAgentServerPolicy` IAM policy to your worker nodes. To do so, enter the following command. Replace `my-worker-node-role` with the IAM role used by your Kubernetes worker nodes.

  ```
  aws iam attach-role-policy \
  --role-name my-worker-node-role \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
  ```

## HyperPod Amazon CloudWatch Observability EKS add-on setup
<a name="hp-eks-dashboard-setup"></a>

Use the following options to set up the Amazon SageMaker HyperPod Amazon CloudWatch Observability EKS add-on.

------
#### [ Setup using the SageMaker AI console ]

The following permissions are required for setup and visualizing the HyperPod task governance dashboard. This section expands upon the permissions listed in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin). 

To manage task governance, use the sample policy:

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListClusters",
                "sagemaker:DescribeCluster",
                "sagemaker:ListComputeQuotas",
                "sagemaker:CreateComputeQuota",
                "sagemaker:UpdateComputeQuota",
                "sagemaker:DescribeComputeQuota",
                "sagemaker:DeleteComputeQuota",
                "sagemaker:ListClusterSchedulerConfigs",
                "sagemaker:DescribeClusterSchedulerConfig",
                "sagemaker:CreateClusterSchedulerConfig",
                "sagemaker:UpdateClusterSchedulerConfig",
                "sagemaker:DeleteClusterSchedulerConfig",
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:DescribeAddon",
                "eks:DescribeCluster",
                "eks:DescribeAccessEntry",
                "eks:ListAssociatedAccessPolicies",
                "eks:AssociateAccessPolicy",
                "eks:DisassociateAccessPolicy"
            ],
            "Resource": "*"
        }
    ]
}
```

------

To grant permissions to manage Amazon CloudWatch Observability Amazon EKS and view the HyperPod cluster dashboard through the SageMaker AI console, use the sample policy below:

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:UpdateAddon",
                "eks:DescribeAddon",
                "eks:DescribeAddonVersions",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "sagemaker:ListComputeQuotas",
                "sagemaker:DescribeComputeQuota",
                "sagemaker:ListClusterSchedulerConfigs",
                "sagemaker:DescribeClusterSchedulerConfig",
                "eks:DescribeCluster",
                "cloudwatch:GetMetricData",
                "eks:AccessKubernetesApi"
            ],
            "Resource": "*"
        }
    ]
}
```

------

Navigate to the **Dashboard** tab in the SageMaker HyperPod console to install the Amazon CloudWatch Observability EKS add-on. To ensure that task governance related metrics are included in the **Dashboard**, select the Kueue metrics checkbox. Enabling Kueue metrics incurs CloudWatch **Metrics** costs after the free-tier limit is reached. For more information, see **Metrics** in [Amazon CloudWatch Pricing](https://aws.amazon.com/cloudwatch/pricing/).

------
#### [ Setup using the EKS AWS CLI ]

Use the following EKS AWS CLI command to install the add-on:

```
aws eks create-addon --cluster-name cluster-name \
--addon-name amazon-cloudwatch-observability \
--configuration-values "configuration json"
```

The following is an example of the configuration values JSON:

```
{
    "agent": {
        "config": {
            "logs": {
                "metrics_collected": {
                    "kubernetes": {
                        "kueue_container_insights": true,
                        "enhanced_container_insights": true
                    },
                    "application_signals": { }
                }
            },
            "traces": {
                "traces_collected": {
                    "application_signals": { }
                }
            }
        }
    }
}
```

------
#### [ Setup using the EKS Console UI ]

1. Navigate to the [EKS console](https://console.aws.amazon.com/eks/home#/clusters).

1. Choose your cluster.

1. Choose **Add-ons**.

1. Find the **Amazon CloudWatch Observability** add-on and install it. Install version 2.4.0 or later of the add-on.

1. Include the following JSON in the **Configuration values** field:

   ```
   {
       "agent": {
           "config": {
               "logs": {
                   "metrics_collected": {
                       "kubernetes": {
                           "kueue_container_insights": true,
                           "enhanced_container_insights": true
                       },
                       "application_signals": { }
                    }
               },
               "traces": {
                   "traces_collected": {
                       "application_signals": { }
                   }
               }
            }
        }
   }
   ```

------

Once the EKS Observability add-on has been successfully installed, you can view your EKS cluster metrics under the HyperPod console **Dashboard** tab.

# Task governance setup
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-setup-task-governance"></a>

This section includes information on how to set up the Amazon SageMaker HyperPod task governance EKS add-on. This includes granting permissions that allow you to set task prioritization, compute allocation for teams, how idle compute is shared, and task preemption for teams.

If you are having issues setting up, please see [Troubleshoot](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md) for known troubleshooting solutions.

**Topics**
+ [Kueue Settings](#hp-eks-task-governance-kueue-settings)
+ [HyperPod Task governance prerequisites](#hp-eks-task-governance-prerequisites)
+ [HyperPod task governance setup](#hp-eks-task-governance-setup)

## Kueue Settings
<a name="hp-eks-task-governance-kueue-settings"></a>

The HyperPod task governance EKS add-on installs [Kueue](https://github.com/kubernetes-sigs/kueue/tree/main/apis/kueue) on your HyperPod EKS clusters. Kueue is a Kubernetes-native system that manages quotas and how jobs consume them.


| EKS HyperPod task governance add-on version | Version of Kueue that is installed as part of the add-on | 
| --- | --- | 
|  v1.1.3  |  v0.12.0  | 

**Note**  
Kueue v0.12.0 and higher don't include kueue-rbac-proxy as part of the installation. Previous versions might have kueue-rbac-proxy installed. For example, if you're using Kueue v0.8.1, you might have kueue-rbac-proxy v0.18.1.

HyperPod task governance leverages Kueue for Kubernetes-native job queueing, scheduling, and quota management, and Kueue is installed with the HyperPod task governance EKS add-on. When installed, HyperPod creates and modifies SageMaker AI-managed Kubernetes resources such as `KueueManagerConfig`, `ClusterQueues`, `LocalQueues`, `WorkloadPriorityClasses`, `ResourceFlavors`, and `ValidatingAdmissionPolicies`. While Kubernetes administrators can modify the state of these resources, any changes made to a SageMaker AI-managed resource may be overwritten by the service.

The following outlines the configuration settings used by the HyperPod task governance add-on to set up Kueue.

```
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
health:
  healthProbeBindAddress: :8081
metrics:
  bindAddress: :8443
  enableClusterQueueResources: true
webhook:
  port: 9443
manageJobsWithoutQueueName: false
leaderElection:
  leaderElect: true
  resourceName: c1f6bfd2.kueue.x-k8s.io
controller:
  groupKindConcurrency:
    Job.batch: 5
    Pod: 5
    Workload.kueue.x-k8s.io: 5
    LocalQueue.kueue.x-k8s.io: 1
    ClusterQueue.kueue.x-k8s.io: 1
    ResourceFlavor.kueue.x-k8s.io: 1
clientConnection:
  qps: 50
  burst: 100
integrations:
  frameworks:
  - "batch/job"
  - "kubeflow.org/mpijob"
  - "ray.io/rayjob"
  - "ray.io/raycluster"
  - "jobset.x-k8s.io/jobset"
  - "kubeflow.org/mxjob"
  - "kubeflow.org/paddlejob"
  - "kubeflow.org/pytorchjob"
  - "kubeflow.org/tfjob"
  - "kubeflow.org/xgboostjob"
  - "pod"
  - "deployment"
  - "statefulset"
  - "leaderworkerset.x-k8s.io/leaderworkerset"
  podOptions:
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: [ kube-system, kueue-system ]
fairSharing:
  enable: true
  preemptionStrategies: [LessThanOrEqualToFinalShare, LessThanInitialShare]
resources:
  excludeResourcePrefixes: []

For more information about each configuration entry, see [Configuration](https://kueue.sigs.k8s.io/docs/reference/kueue-config.v1beta1/#Configuration) in the Kueue documentation.

## HyperPod Task governance prerequisites
<a name="hp-eks-task-governance-prerequisites"></a>
+ Ensure that you have the minimum permission policy for HyperPod cluster administrators, in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin). This includes permissions to run the SageMaker HyperPod core APIs, manage SageMaker HyperPod clusters within your AWS account, and perform the tasks in [Managing SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-operate.md).
+ Your Kubernetes version must be 1.30 or later. For instructions, see [Update existing clusters to the new Kubernetes version](https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html).
+ If you already have Kueue installed in your cluster, uninstall Kueue before installing the EKS add-on.
+ A HyperPod node must already exist in the EKS cluster before installing the HyperPod task governance add-on. 

## HyperPod task governance setup
<a name="hp-eks-task-governance-setup"></a>

The following provides information on how to get set up with HyperPod task governance.

------
#### [ Setup using the SageMaker AI console ]

The following provides information on how to get set up with HyperPod task governance using the SageMaker HyperPod console.

If you already granted permissions to manage Amazon CloudWatch Observability EKS and view the HyperPod cluster dashboard through the SageMaker AI console in [HyperPod Amazon CloudWatch Observability EKS add-on setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard.md#hp-eks-dashboard-setup), you already have all of the following permissions attached. If you have not set this up, use the sample policy below to grant permissions to manage the HyperPod task governance add-on and view the HyperPod cluster dashboard through the SageMaker AI console.

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:UpdateAddon",
                "eks:DescribeAddon",
                "eks:DescribeAddonVersions",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "eks:DescribeCluster",
                "eks:AccessKubernetesApi"
            ],
            "Resource": "*"
        }
    ]
}
```

------

Navigate to the **Dashboard** tab in the SageMaker HyperPod console to install the Amazon SageMaker HyperPod task governance add-on.

------
#### [ Setup using the Amazon EKS AWS CLI ]

Use the following example [create-addon](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/create-addon.html) EKS AWS CLI command to set up the HyperPod task governance Amazon EKS API and console UI:

```
aws eks create-addon --region region --cluster-name cluster-name --addon-name amazon-sagemaker-hyperpod-taskgovernance
```

------

If the installation was successful, you can view the **Policies** tab in the SageMaker HyperPod console. You can also use the following example [describe-addon](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/describe-addon.html) EKS AWS CLI command to check the status.

```
aws eks describe-addon --region region --cluster-name cluster-name --addon-name amazon-sagemaker-hyperpod-taskgovernance
```

# Dashboard
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-metrics"></a>

Amazon SageMaker HyperPod task governance provides a comprehensive dashboard view of your Amazon EKS cluster utilization metrics, including hardware, team, and task metrics. You must install the EKS add-on to view the dashboard. For more information, see [Dashboard setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard.md).

In the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/), under **HyperPod Clusters**, you can navigate to the HyperPod console and view your list of HyperPod clusters in your Region. Choose your cluster and navigate to the **Dashboard** tab. The dashboard contains the following metrics. You can download the data for a section by choosing the corresponding **Export**.

**Utilization**

Provides point-in-time and trend-based EKS cluster health metrics for critical compute resources. By default, **All Instance Groups** are shown. Use the dropdown menu to filter your instance groups. The metrics included in this section are:
+ Number of total, running, and pending recovery instances. The number of pending recovery instances refers to the number of instances that need attention for recovery.
+ GPUs, GPU memory, vCPUs, and vCPU memory.
+ GPU utilization, GPU memory utilization, vCPU utilization, and vCPU memory utilization.
+ An interactive graph of your GPU and vCPU utilization. 

**Teams**

Provides information into team-specific resource management. This includes:
+ Instance and GPU allocation.
+ GPU utilization rates.
+ Borrowed GPU statistics.
+ Task status (running or pending).
+ A bar chart view of GPU utilization versus compute allocation across teams.
+ Team detailed GPU and vCPU-related information. By default, the information displayed includes **All teams**. You can filter by team and instances by choosing the dropdown menus. In the interactive plot you can filter by time.

**Tasks**

**Note**  
To view your HyperPod EKS cluster tasks in the dashboard:  
Configure Kubernetes Role-Based Access Control (RBAC) for data scientist users in the designated HyperPod namespace to authorize task execution on Amazon EKS-orchestrated clusters. Namespaces follow the format `hyperpod-ns-team-name`. To establish RBAC permissions, refer to the [team role creation instructions](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).
Ensure that your job is submitted with the appropriate namespace and priority class labels. For a comprehensive example, see [Submit a job to SageMaker AI-managed queue and namespace](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md#hp-eks-cli-start-job).

Provides information on task-related metrics. This includes number of running, pending, and preempted tasks, and run and wait time statistics. By default, the information displayed includes **All teams**. You can filter by team by choosing the dropdown menu. In the interactive plot you can filter by time.

# Tasks
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-tasks"></a>

The following provides information on Amazon SageMaker HyperPod EKS cluster tasks. Tasks are operations or jobs that are sent to the cluster, such as machine learning operations like training, running experiments, or inference. The viewable task details include status, run time, and how much compute is being used per task.

In the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/), under **HyperPod Clusters**, you can navigate to the HyperPod console and view your list of HyperPod clusters in your Region. Choose your cluster and navigate to the **Tasks** tab.

For the **Tasks** tab to be viewable by anyone besides the administrator, the administrator needs to [add an access entry to the EKS cluster for the IAM role](https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html).

**Note**  
To view your HyperPod EKS cluster tasks in the dashboard:  
Configure Kubernetes Role-Based Access Control (RBAC) for data scientist users in the designated HyperPod namespace to authorize task execution on Amazon EKS-orchestrated clusters. Namespaces follow the format `hyperpod-ns-team-name`. To establish RBAC permissions, refer to the [team role creation instructions](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).
Ensure that your job is submitted with the appropriate namespace and priority class labels. For a comprehensive example, see [Submit a job to SageMaker AI-managed queue and namespace](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md#hp-eks-cli-start-job).

For EKS clusters, Kubeflow (PyTorch, MPI, TensorFlow) tasks are shown. By default, PyTorch tasks are shown. You can filter for PyTorch, MPI, or TensorFlow tasks by choosing the dropdown menu or using the search field. The information shown for each task includes the task name, status, namespace, priority class, and creation time.

# Using topology-aware scheduling in Amazon SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-tasks-scheduling"></a>

Topology-aware scheduling in Amazon SageMaker HyperPod task governance optimizes the training efficiency of distributed machine learning workloads by placing pods based on the physical network topology of your Amazon EC2 instances. By considering the hierarchical structure of AWS infrastructure, including Availability Zones, network blocks, and physical racks, topology-aware scheduling ensures that pods requiring frequent communication are scheduled in close proximity to minimize network latency. This intelligent placement is particularly beneficial for large-scale machine learning training jobs that involve intensive pod-to-pod communication, resulting in reduced training times and more efficient resource utilization across your cluster.

**Note**  
To use topology-aware scheduling, make sure that your version of HyperPod task governance is v1.2.2-eksbuild.1 or higher.

Topology-aware scheduling supports the following instance types:
+ ml.p3dn.24xlarge
+ ml.p4d.24xlarge
+ ml.p4de.24xlarge
+ ml.p5.48xlarge
+ ml.p5e.48xlarge
+ ml.p5en.48xlarge
+ ml.p6e-gb200.36xlarge
+ ml.trn1.2xlarge
+ ml.trn1.32xlarge
+ ml.trn1n.32xlarge
+ ml.trn2.48xlarge
+ ml.trn2u.48xlarge

Topology-aware scheduling integrates with your existing HyperPod workflows while providing flexible topology preferences through both kubectl YAML files and the HyperPod CLI. HyperPod task governance automatically configures cluster nodes with topology labels and works with HyperPod task governance policies and resource borrowing mechanisms, ensuring that topology-aware scheduling doesn't disrupt your current operational processes. With built-in support for both preferred and required topology specifications, you can fine-tune workload placement to match your specific performance requirements while maintaining the flexibility to fall back to standard scheduling when topology constraints cannot be satisfied.

By leveraging topology-aware labels in HyperPod, you can enhance your machine learning workloads through intelligent pod placement that considers the physical network infrastructure. HyperPod task governance automatically optimizes pod scheduling based on the hierarchical data center topology, which directly translates to reduced network latency and improved training performance for distributed ML tasks. This topology awareness is particularly valuable for large-scale machine learning workloads, as it minimizes communication overhead by strategically placing related pods closer together in the network hierarchy. The result is optimized communication network latency between pods, more efficient resource utilization, and better overall performance for compute-intensive AI/ML applications, all achieved without you needing to manually manage complex network topology configurations.

The following are labels for the available topology network layers that HyperPod task governance can schedule pods in:
+ topology.k8s.aws/network-node-layer-1
+ topology.k8s.aws/network-node-layer-2
+ topology.k8s.aws/network-node-layer-3
+ topology.k8s.aws/ultraserver-id

To use topology-aware scheduling, include the following labels in your YAML file:
+ kueue.x-k8s.io/podset-required-topology - indicates that all pods of the job must be scheduled within the same topology layer.
+ kueue.x-k8s.io/podset-preferred-topology - indicates that scheduling all pods of the job within the same topology layer is preferred but not required. HyperPod task governance tries to schedule the pods within one layer before trying the next topology layer.

If resources don’t share the same topology label, the job is suspended and placed on the waitlist. When Kueue determines that enough resources are available, it admits and runs the job.

The following example demonstrates how to use the labels in your YAML files:

```
apiVersion: batch/v1
kind: Job
metadata:
  name: test-tas-job
  namespace: hyperpod-ns-team-name
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
    kueue.x-k8s.io/priority-class: PRIORITY_CLASS-priority
spec:
  parallelism: 10
  completions: 10
  suspend: true
  template:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
      annotations:
        kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
        # Alternatively, use the preferred-topology annotation instead:
        # kueue.x-k8s.io/podset-preferred-topology: "topology.k8s.aws/network-node-layer-3"
    spec:
      nodeSelector:
        topology.k8s.aws/network-node-layer-3: TOPOLOGY_LABEL_VALUE
      containers:
        - name: dummy-job
          image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
          args: ["3600s"]
          resources:
            requests:
              cpu: "100"
      restartPolicy: Never
```

The following table explains the new parameters you can use in the kubectl YAML file.


| Parameter | Description | 
| --- | --- | 
| kueue.x-k8s.io/queue-name | The name of the queue to use to run the job. The format of this queue-name must be hyperpod-ns-team-name-localqueue. | 
| kueue.x-k8s.io/priority-class | Lets you specify a priority for pod scheduling. This specification is optional. | 
| annotations | Contains the topology annotation that you attach to the job. Available topologies are kueue.x-k8s.io/podset-required-topology and kueue.x-k8s.io/podset-preferred-topology. You can use either an annotation or nodeSelector, but not both at the same time. | 
| nodeSelector | Specifies the network layer that represents the layer of Amazon EC2 instance placement. Use either this field or an annotation, but not both at the same time. In your YAML file, you can also use the nodeSelector parameter to choose the exact layer for your pods. To get the value of your label, use the [ DescribeInstanceTopology](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceTopology.html) API operation. | 

You can also use the HyperPod CLI to run your job with topology-aware scheduling. For more information about the HyperPod CLI, see [SageMaker HyperPod CLI commands](sagemaker-hyperpod-eks-hyperpod-cli-reference.md).

```
hyp create hyp-pytorch-job \                                            
  --version 1.1 \
  --job-name sample-pytorch-job \
  --image 123456789012.dkr.ecr.us-west-2.amazonaws.com/ptjob:latest \
  --pull-policy "Always" \
  --tasks-per-node 1 \
  --max-retry 1 \
  --priority high-priority \
  --namespace hyperpod-ns-team-name \
  --queue-name hyperpod-ns-team-name-localqueue \
  --preferred-topology-label topology.k8s.aws/network-node-layer-1
```

The following is an example configuration file that you might use to run a PyTorchJob with topology labels. The file is largely similar if you want to run MPI or TensorFlow jobs; if you run those jobs instead, adjust the configuration file accordingly, such as using the correct job kind and image instead of PyTorchJob. If you’re running a PyTorchJob, you can assign different topologies to the master and worker nodes. A PyTorchJob always has one master node, so we recommend that you apply topology to the worker pods instead.

```
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  annotations: {}
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
  name: tas-test-pytorch-job
  namespace: hyperpod-ns-team-name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
        spec:
          containers:
          - command:
            - python3
            - /opt/pytorch-mnist/mnist.py
            - --epochs=1
            image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            imagePullPolicy: Always
            name: pytorch
    Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
        metadata:
          # annotations:
            # kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
          labels:
            kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
        spec:
          containers:
          - command:
            - python3
            - /opt/pytorch-mnist/mnist.py
            - --epochs=1
            image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            imagePullPolicy: Always
            name: pytorch
            resources:
              limits:
                cpu: 1
              requests:
                memory: 200Mi
                cpu: 1
          #nodeSelector:
          #  topology.k8s.aws/network-node-layer-3: xxxxxxxxxxx
```

To see the topologies for your cluster, use the [DescribeInstanceTopology](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceTopology.html) API operation. By default, the topology columns are hidden in the AWS Management Console and Amazon SageMaker Studio. Follow these steps to see them in the interface that you’re using.

**SageMaker Studio**

1. In SageMaker Studio, navigate to your cluster.

1. In the Tasks view, choose the options menu in the Name column, then choose **Manage columns**.

1. Select **Requested topology** and **Topology constraint** to add the columns to see the topology information in the list of Kubernetes pods.

**AWS Management Console**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Under **HyperPod clusters**, choose **Cluster management**.

1. Choose the **Tasks** tab, then choose the gear icon.

1. Under instance attributes, toggle **Requested topology** and **Topology constraint**.

1. Choose **Confirm** to see the topology information in the table.

# Using gang scheduling in Amazon SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-tasks-gang-scheduling"></a>

In distributed ML training, a job often requires multiple pods running concurrently across nodes with pod-to-pod communication. HyperPod task governance uses Kueue's `waitForPodsReady` feature to implement gang scheduling. When enabled, the workload is monitored by Kueue until all of its pods are ready, meaning scheduled, running, and passing the optional readiness probe. If not all pods of the workload are ready within the configured timeout, the workload is evicted and requeued.

Gang scheduling provides the following benefits:
+ **Prevents resource waste** — Kueue evicts and requeues the workload if not all pods become ready, ensuring resources are not held indefinitely by partially running workloads.
+ **Avoids deadlocks** — Prevents jobs from holding partial resources and blocking each other indefinitely.
+ **Automatic recovery** — If pods aren't ready within the timeout, the workload is evicted and requeued with configurable exponential backoff, rather than hanging indefinitely.

## Activate gang scheduling
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-tasks-gang-scheduling-activate"></a>

To activate gang scheduling, you must have a HyperPod Amazon EKS cluster with the task governance Amazon EKS add-on installed. The add-on status must be `Active` or `Degraded`.

**Note**  
Gang scheduling can also be configured directly using `kubectl` by editing the Kueue configuration on the cluster.

**Activate gang scheduling (SageMaker AI console)**

1. Open the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/) and navigate to your HyperPod cluster.

1. Choose the **Policy management** tab.

1. In the **Task governance** section, open **Actions**, then choose **Configure gang scheduling**.

1. Toggle gang scheduling on and configure the settings.

1. Choose **Save**. The Kueue controller restarts to apply the change.

## Gang scheduling configuration settings
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-tasks-gang-scheduling-settings"></a>

The following table describes the configuration settings for gang scheduling.


| Setting | Description | Default | 
| --- | --- | --- | 
| timeout | How long Kueue waits for all pods to become ready before evicting and requeuing the workload. | 5m | 
| recoveryTimeout | How long Kueue waits for a pod to recover after a node failure before requeuing the workload. Set to 0s to disable. Defaults to the value of timeout if not set. | 5m | 
| blockAdmission | When enabled, workloads are admitted sequentially. No new workload is admitted until all pods of the current one are ready. Prevents deadlocks on resource-constrained clusters. | Off | 
| requeuingStrategy timestamp | Whether requeue order uses Creation (original submission time, preserves queue position) or Eviction (time of last eviction, effectively deprioritizing repeatedly failing jobs). | Eviction | 
| requeuingStrategy backoffLimitCount | Maximum requeue attempts before Kueue permanently deactivates the workload. Leave empty for unlimited retries. | Unlimited | 
| requeuingStrategy backoffBaseSeconds | The base time in seconds for exponential backoff when requeuing a workload after each consecutive timeout. The exponent is 2. | 60s | 
| requeuingStrategy backoffMaxSeconds | Cap on the exponential backoff delay. Once reached, Kueue continues requeuing at this fixed interval. | 3600s | 
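To estimate how long a repeatedly timed-out workload waits between requeue attempts, the backoff described above (base time, exponent 2, capped at the maximum) can be sketched as follows. This is an approximation of the behavior the table describes, not Kueue's exact internal formula:

```python
def requeue_delay(attempt, base_seconds=60, max_seconds=3600):
    """Approximate backoff delay before the attempt-th requeue (attempt >= 1):
    base * 2^(attempt - 1), capped at max_seconds."""
    return min(base_seconds * 2 ** (attempt - 1), max_seconds)

# With the defaults, delays grow 60s, 120s, 240s, ... until capped at 3600s.
for n in range(1, 8):
    print(n, requeue_delay(n))
```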

**Note**  
Modifying gang scheduling settings restarts the Kueue controller, which may temporarily delay job admission. This applies whether you are enabling, disabling, or updating any value. Running jobs are not interrupted.

**Note**  
Gang scheduling is cluster-wide. It applies to all Kueue-managed workloads on the cluster, not just specific teams or queues.

# Policies
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies"></a>

Amazon SageMaker HyperPod task governance simplifies how your Amazon EKS cluster resources are allocated and how tasks are prioritized. The following provides information on HyperPod EKS cluster policies. For information on how to set up task governance, see [Task governance setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-task-governance.md).

Policies are divided into **Compute prioritization** and **Compute allocation**. The concepts below are organized in the context of these two policies.

**Compute prioritization**, or cluster policy, determines how idle compute is borrowed and how tasks are prioritized by teams.
+ **Idle compute allocation** defines how idle compute is allocated across teams. That is, how unused compute can be borrowed from teams. When choosing an **Idle compute allocation**, you can choose between:
  + **First-come first-serve**: When applied, teams are not prioritized against each other and each incoming task is equally likely to obtain over-quota resources. Tasks are prioritized based on order of submission. This means a user may be able to use 100% of the idle compute if they request it first.
  + **Fair-share**: When applied, teams borrow idle compute based on their assigned **Fair-share weight**. These weights are defined in **Compute allocation**. For more information on how this can be used, see [Sharing idle compute resources examples](#hp-eks-task-governance-policies-examples).
+ **Task prioritization** defines how tasks are queued as compute becomes available. When choosing a **Task prioritization**, you can choose between:
  + **First-come first-serve**: When applied, tasks are queued in the order they are requested.
  + **Task ranking**: When applied, tasks are queued in the order defined by their prioritization. If this option is chosen, you must add priority classes along with the weights at which they should be prioritized. Tasks of the same priority class will be executed on a first-come first-serve basis. When enabled in Compute allocation, tasks are preempted from lower priority tasks by higher priority tasks within the team.

    When data scientists submit jobs to the cluster, they use the priority class name in the YAML file. The priority class is in the format `priority-class-name-priority`. For an example, see [Submit a job to SageMaker AI-managed queue and namespace](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md#hp-eks-cli-start-job).
  + **Priority classes**: These classes establish a relative priority for tasks when borrowing capacity. When a task is running using borrowed quota, it may be preempted by another task of higher priority than it, if no more capacity is available for the incoming task. If **Preemption** is enabled in the **Compute allocation**, a higher priority task may also preempt tasks within its own team.
+ **Unallocated resource sharing** enables teams to borrow compute resources that are not allocated to any team through compute quota. When enabled, unallocated cluster capacity becomes available for teams to borrow automatically. For more information, see [How unallocated resource sharing works](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-how-it-works).

**Compute allocation**, or compute quota, defines a team’s compute allocation and what weight (or priority level) a team is given for fair-share idle compute allocation. 
+ **Team name**: The team name. A corresponding **Namespace** will be created, with the format `hyperpod-ns-team-name`. 
+ **Members**: Members of the team namespace. You will need to set up a Kubernetes role-based access control (RBAC) for data scientist users that you want to be part of this team, to run tasks on HyperPod clusters orchestrated with Amazon EKS. To set up a Kubernetes RBAC, use the instructions in [create team role](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).
+ **Fair-share weight**: This is the level of prioritization assigned to the team when **Fair-share** is applied for **Idle compute allocation**. The highest priority has a weight of 100 and the lowest priority has a weight of 0. Higher weight enables a team to access unutilized resources within shared capacity sooner. A zero weight signifies the lowest priority, implying this team will always be at a disadvantage compared to other teams. 

  The fair-share weight provides a comparative edge to this team when vying for available resources against others. Admission prioritizes scheduling tasks from teams with the highest weights and the lowest borrowing. For example, if Team A has a weight of 10 and Team B has a weight of 5, Team A would have priority in accessing unutilized resources, as it would have jobs that are scheduled earlier than Team B.
+ **Task preemption**: Compute is reclaimed from a task based on priority. By default, the team lending idle compute preempts tasks from the teams borrowing it. 
+ **Lending and borrowing**: How idle compute is being lent by the team and if the team can borrow from other teams.
  + **Percentage-based borrow limit**: The limit of idle compute that a team is allowed to borrow, expressed as a percentage of their guaranteed quota. A team can borrow up to 10,000% of allocated compute. The value you provide here is interpreted as a percentage. For example, a value of 500 will be interpreted as 500%. This percentage applies uniformly across all resource types (CPU, GPU, Memory) and instance types in the team's quota.
  + **Absolute borrow limit**: The limit of idle compute that a team is allowed to borrow, defined as absolute resource values per instance type. This provides granular control over borrowing behavior for specific instance types. You need to specify absolute limits using the same schema as **Compute quota**, including instance count, accelerators, vCPU, memory, or accelerator partitions. You can specify absolute limits for one or more instance types in your team's quota.
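To illustrate the percentage-based limit, the upper bound on what a team can consume (guaranteed quota plus borrowed idle compute) can be sketched as follows. The helper and its inputs are illustrative arithmetic only, not a HyperPod API:

```python
def max_usable(guaranteed, borrow_limit_pct):
    """Upper bound on a team's usable compute: its guaranteed quota plus
    borrowed idle compute, capped at borrow_limit_pct of the quota."""
    return guaranteed + guaranteed * borrow_limit_pct / 100

# A team with 4 guaranteed GPUs and a borrow limit of 500 (interpreted as
# 500%) can use at most 4 + 20 = 24 GPUs, idle capacity permitting.
print(max_usable(4, 500))
```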

For information on how these concepts are used, such as priority classes and name spaces, see [Example HyperPod task governance AWS CLI commands](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md).

## Sharing idle compute resources examples
<a name="hp-eks-task-governance-policies-examples"></a>

The total reserved quota should not surpass the cluster's available capacity for that resource, to ensure proper quota management. For example, if a cluster comprises 20 `ml.c5.2xlarge` instances, the cumulative quota assigned to teams should remain under 20. 

If the **Compute allocation** policies for teams allow for **Lend and Borrow** or **Lend**, the idle capacity is shared between these teams. For example, suppose Team A and Team B both have **Lend and Borrow** enabled. Team A has a quota of 6 but is using only 2 for its jobs, and Team B has a quota of 5 and is using 4 for its jobs. If a job requiring 4 resources is submitted to Team B, 3 of those resources are borrowed from Team A. 
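The Team A and Team B example can be traced with a short sketch. The helper below is illustrative arithmetic only, not a HyperPod API:

```python
def admit(job_size, own_quota, own_usage, lendable_idle):
    """Return (from_own_quota, borrowed) for an incoming job, or None if
    the idle compute available to borrow is insufficient."""
    own_free = own_quota - own_usage
    from_own = min(job_size, own_free)
    borrowed = job_size - from_own
    if borrowed > lendable_idle:
        return None
    return from_own, borrowed

# Team B: quota 5, using 4; Team A has 6 - 2 = 4 idle units to lend.
# The incoming job of size 4 uses 1 unit of Team B's quota and borrows 3.
print(admit(job_size=4, own_quota=5, own_usage=4, lendable_idle=4))
```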

If any team's **Compute allocation** policy is set to **Don't Lend**, the team would not be able to borrow any additional capacity beyond its own allocations.

## How unallocated resource sharing works
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-how-it-works"></a>

Unallocated resource sharing automatically manages the pool of resources that are not allocated to any compute quota in your cluster. This means HyperPod continuously monitors your cluster state and automatically updates to the correct configuration over time.

**Initial Setup**
+ When you set `IdleResourceSharing` to `Enabled` in your ClusterSchedulerConfig (by default it is `Disabled`), HyperPod task governance begins monitoring your cluster and calculates available idle resources by subtracting team quotas from total node capacity.
+ Unallocated resource sharing ClusterQueues are created to represent the borrowable resource pool.
+ When you first enable unallocated resource sharing, infrastructure setup takes several minutes. You can monitor the progress through policy `Status` and `DetailedStatus` in ClusterSchedulerConfig.
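The idle-pool calculation in the first step (total node capacity minus team quotas) can be sketched as follows; the team names and numbers are illustrative:

```python
def unallocated_pool(total_capacity, team_quotas):
    """Idle resources available for borrowing: capacity not reserved by
    any team's compute quota, floored at zero."""
    return max(total_capacity - sum(team_quotas.values()), 0)

# 20 instances in the cluster with 6 + 5 reserved by two teams
# leaves 9 instances in the unallocated pool.
print(unallocated_pool(20, {"hyperpod-ns-team-a": 6, "hyperpod-ns-team-b": 5}))
```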

**Ongoing Reconciliation**
+ HyperPod task governance continuously monitors for changes such as node additions or removals and cluster queue quota updates.
+ When changes occur, unallocated resource sharing recalculates quota and updates ClusterQueues. Reconciliation typically completes within seconds. 

**Monitoring**

 You can verify that unallocated resource sharing is fully configured by checking for unallocated resource sharing ClusterQueues: 

```
kubectl get clusterqueue | grep hyperpod-ns-idle-resource-sharing
```

When you see ClusterQueues with names like `hyperpod-ns-idle-resource-sharing-cq-1`, unallocated resource sharing is active. Note that multiple unallocated resource sharing ClusterQueues may exist depending on the number of resource flavors in your cluster. 

## Node eligibility for unallocated resource sharing
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-node-eligibility"></a>

Unallocated resource sharing only includes nodes that meet the following requirements:

1. **Node Ready Status**
   + Nodes must be in `Ready` status to contribute to the unallocated resource pool.
   + Nodes in `NotReady` or other non-ready states are excluded from capacity calculations.
   + When a node becomes `Ready`, it is automatically included in the next reconciliation cycle.

1. **Node Schedulable Status**
   + Nodes with `spec.unschedulable: true` are excluded from unallocated resource sharing.
   + When a node becomes schedulable again, it is automatically included in the next reconciliation cycle.

1. **MIG Configuration (GPU nodes only)**
   + For GPU nodes with MIG (Multi-Instance GPU) partitioning, the `nvidia.com/mig.config.state` label must show `success` for the node to contribute MIG profiles to unallocated resource sharing.
   + These nodes will be retried automatically once MIG configuration completes successfully.

1. **Supported Instance Types**
   + The instance must be a supported SageMaker HyperPod instance type.
   + See the list of instance types supported by SageMaker HyperPod clusters.

**Topics**
+ [Sharing idle compute resources examples](#hp-eks-task-governance-policies-examples)
+ [How unallocated resource sharing works](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-how-it-works)
+ [Node eligibility for unallocated resource sharing](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-node-eligibility)
+ [Create policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-create.md)
+ [Edit policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-edit.md)
+ [Delete policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete.md)
+ [Allocating compute quota in Amazon SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation.md)

# Create policies
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-create"></a>

You can create your **Cluster policy** and **Compute allocation** configurations in the **Policies** tab. This page provides instructions for creating the following configurations.
+ Create your **Cluster policy** to update how tasks are prioritized and idle compute is allocated.
+ Create a **Compute allocation** to define a new compute allocation policy for a team.
**Note**  
When you create a **Compute allocation** you will need to set up a Kubernetes role-based access control (RBAC) for data scientist users in the corresponding namespace to run tasks on HyperPod clusters orchestrated with Amazon EKS. The namespaces have the format `hyperpod-ns-team-name`. To set up a Kubernetes RBAC, use the instructions in [create team role](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).

For information about the HyperPod task governance EKS cluster policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

**Create HyperPod task governance policies**

This procedure assumes that you have already created an Amazon EKS cluster set up with HyperPod. If you have not already done so, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md).

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, under **HyperPod Clusters**, choose **Cluster Management**.

1. Choose your Amazon EKS cluster listed under **SageMaker HyperPod clusters**.

1. Choose the **Policies** tab.

1. To create your **Cluster policy**:

   1. Choose the corresponding **Edit** to update how tasks are prioritized and idle compute is allocated.

   1. After you have made your changes, choose **Submit**.

1. To create a **Compute allocation**:

   1. Choose the corresponding **Create**. This takes you to the compute allocation creation page.

   1. After you have made your changes, choose **Submit**.

# Edit policies
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-edit"></a>

You can edit your **Cluster policy** and **Compute allocation** configurations in the **Policies** tab. This page provides instructions for editing the following configurations.
+ Edit your **Cluster policy** to update how tasks are prioritized and idle compute is allocated.
+ Edit a **Compute allocation** to update an existing compute allocation policy for a team.
**Note**  
When you create a **Compute allocation** you will need to set up a Kubernetes role-based access control (RBAC) for data scientist users in the corresponding namespace to run tasks on HyperPod clusters orchestrated with Amazon EKS. The namespaces have the format `hyperpod-ns-team-name`. To set up a Kubernetes RBAC, use the instructions in [create team role](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).

For more information about the HyperPod task governance EKS cluster policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

**Edit HyperPod task governance policies**

This procedure assumes that you have already created an Amazon EKS cluster set up with HyperPod. If you have not already done so, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md).

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, under **HyperPod Clusters**, choose **Cluster Management**.

1. Choose your Amazon EKS cluster listed under **SageMaker HyperPod clusters**.

1. Choose the **Policies** tab.

1. To edit your **Cluster policy**:

   1. Choose the corresponding **Edit** to update how tasks are prioritized and idle compute is allocated.

   1. After you have made your changes, choose **Submit**.

1. To edit your **Compute allocation**:

   1. Choose the configuration you wish to edit under **Compute allocation**. This takes you to the configuration details page.

   1. If you wish to edit these configurations, choose **Edit**.

   1. After you have made your changes, choose **Submit**.

# Delete policies
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete"></a>

You can delete your **Cluster policy** and **Compute allocation** configurations using the SageMaker AI console or AWS CLI. The following page provides instructions on how to delete your SageMaker HyperPod task governance policies and configurations.

For more information about the HyperPod task governance EKS cluster policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

**Note**  
If you are having issues with listing or deleting task governance policies, you may need to update your cluster administrator minimum set of permissions. See the **Amazon EKS** tab in the [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin) section. For additional information, see [Deleting clusters](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md#hp-eks-troubleshoot-delete-policies).

## Delete HyperPod task governance policies (console)
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete-console"></a>

The following uses the SageMaker AI console to delete your HyperPod task governance policies.

**Note**  
You cannot delete your **Cluster policy** (`ClusterSchedulerConfig`) using the SageMaker AI console. To learn how to do so using the AWS CLI, see [Delete HyperPod task governance policies (AWS CLI)](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete-cli).

**To delete task governance policies (console)**

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, under **HyperPod Clusters**, choose **Cluster Management**.

1. Choose your Amazon EKS cluster listed under **SageMaker HyperPod clusters**.

1. Choose the **Policies** tab.

1. To delete your **Compute allocation** (`ComputeQuota`):

   1. In the **Compute allocation** section, select the configuration you want to delete.

   1. In the **Actions** dropdown menu, choose **Delete**.

   1. Follow the instructions in the UI to complete the task.

## Delete HyperPod task governance policies (AWS CLI)
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete-cli"></a>

The following uses the AWS CLI to delete your HyperPod task governance policies.

**Note**  
If you are having issues using the following commands, you may need to update your AWS CLI. For more information, see [Installing or updating to the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).

**To delete task governance policies (AWS CLI)**

First set your variables for the AWS CLI commands that follow.

```
REGION=aws-region
```

1. Get the *cluster-arn* associated with the policies you wish to delete. You can use the following AWS CLI command to list the clusters in your AWS Region.

   ```
   aws sagemaker list-clusters \
       --region ${REGION}
   ```

1. To delete your compute allocations (`ComputeQuota`):

   1. List all of the compute quotas associated with the HyperPod cluster.

      ```
      aws sagemaker list-compute-quotas \
          --cluster-arn cluster-arn \
          --region ${REGION}
      ```

   1. For each `compute-quota-id` you wish to delete, run the following command to delete the compute quota.

      ```
      aws sagemaker delete-compute-quota \
          --compute-quota-id compute-quota-id \
          --region ${REGION}
      ```

1. To delete your cluster policies (`ClusterSchedulerConfig`):

   1. List all of the cluster policies associated with the HyperPod cluster.

      ```
      aws sagemaker list-cluster-scheduler-configs \
          --cluster-arn cluster-arn \
          --region ${REGION}
      ```

   1. For each `cluster-scheduler-config-id` you wish to delete, run the following command to delete the cluster policy.

      ```
      aws sagemaker delete-cluster-scheduler-config \
          --cluster-scheduler-config-id scheduler-config-id \
          --region ${REGION}
      ```
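If you manage many teams, the delete loops above can be scripted. The following sketch works with any client object that exposes operations mirroring the CLI commands shown, for example a sufficiently recent `boto3` SageMaker client; the response field names follow the SageMaker API reference and should be verified against your SDK version:

```python
def delete_task_governance_policies(sm, cluster_arn):
    """Delete all compute quotas, then all cluster policies, for a cluster.

    `sm` is assumed to expose list/delete operations mirroring the AWS CLI
    commands above (e.g. a boto3 SageMaker client).
    """
    # Delete every compute quota (Compute allocation) on the cluster.
    quotas = sm.list_compute_quotas(ClusterArn=cluster_arn)
    for q in quotas.get("ComputeQuotaSummaries", []):
        sm.delete_compute_quota(ComputeQuotaId=q["ComputeQuotaId"])

    # Delete every cluster scheduler config (Cluster policy).
    configs = sm.list_cluster_scheduler_configs(ClusterArn=cluster_arn)
    for c in configs.get("ClusterSchedulerConfigSummaries", []):
        sm.delete_cluster_scheduler_config(
            ClusterSchedulerConfigId=c["ClusterSchedulerConfigId"]
        )
```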

# Allocating compute quota in Amazon SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation"></a>

Cluster administrators can decide how the organization uses purchased compute. Doing so reduces waste and idle resources. You can allocate compute quota such that teams can borrow unused resources from each other. Compute quota allocation in HyperPod task governance lets administrators allocate resources at the instance level and at a more granular resource level. This capability provides flexible and efficient resource management for teams by allowing granular control over individual compute resources instead of requiring entire instance allocations. Allocating at a granular level eliminates inefficiencies of traditional instance-level allocation. Through this approach, you can optimize resource utilization and reduce idle compute.

Compute quota allocation supports three types of resource allocation: accelerators, vCPU, and memory. Accelerators are components in accelerated computing instances that perform functions such as floating point calculations, graphics processing, or data pattern matching. Accelerators include GPUs, Trainium accelerators, and neuron cores. For multi-team GPU sharing, different teams can receive specific GPU allocations from the same instance type, maximizing utilization of accelerator hardware. For memory-intensive workloads that require additional RAM for data preprocessing or model caching scenarios, you can allocate memory quota beyond the default GPU-to-memory ratio. For CPU-heavy preprocessing tasks that need substantial CPU resources alongside GPU training, you can allocate CPU resources independently.

Once you provide a value, HyperPod task governance calculates the ratio using the formula **allocated resource divided by the total amount of resources available in the instance**. HyperPod task governance then uses this ratio to apply default allocations to other resources, but you can override these defaults and customize them based on your use case. The following are sample scenarios of how HyperPod task governance allocates resources based on your values:
+ **Only accelerator specified** - HyperPod task governance applies the default ratio to vCPU and memory based on the accelerator values.
+ **Only vCPU specified** - HyperPod task governance calculates the ratio and applies it to memory. Accelerators are set to 0.
+ **Only memory specified** - HyperPod task governance calculates the ratio and applies it to vCPU because compute is required to run memory-specified workloads. Accelerators are set to 0.
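As a worked sketch of the ratio rule (the instance totals below are illustrative; check the actual GPU, vCPU, and memory counts for your instance type):

```python
def default_allocations(alloc_gpu, total_gpu, total_vcpu, total_mem_gib):
    """Derive default vCPU and memory quota from a GPU-only allocation:
    ratio = allocated resource / total resource on the instance,
    then apply that ratio to the other resources."""
    ratio = alloc_gpu / total_gpu
    return {"vCpu": total_vcpu * ratio, "MemoryInGiB": total_mem_gib * ratio}

# Allocating 2 of 4 GPUs on an instance with 96 vCPU and 384 GiB memory
# gives a ratio of 0.5, so defaults of 48 vCPU and 192 GiB.
print(default_allocations(2, 4, 96, 384))
```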

To programmatically control quota allocation, you can use the [ComputeQuotaResourceConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ComputeQuotaResourceConfig.html) object and specify your allocations as numeric values.

```
{
    "ComputeQuotaConfig": {
        "ComputeQuotaResources": [{
            "InstanceType": "ml.g5.24xlarge",
            "Accelerators": "16",
            "vCpu": "200.0",
            "MemoryInGiB": "2.0"
        }]
    }
}
```

To see all of the allocations, including the defaults, use the [DescribeComputeQuota](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeComputeQuota.html) operation. To update your allocations, use the [UpdateComputeQuota](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateComputeQuota.html) operation.

You can also use the HyperPod CLI to request granular resources against your compute quota. For more information about the HyperPod CLI, see [Running jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-run-jobs.md). The following example demonstrates how to submit a job that specifies granular resource requests and limits.

```
hyp create hyp-pytorch-job --version 1.1 --job-name sample-job \
--image 123456789012.dkr.ecr.us-west-2.amazonaws.com/ptjob:latest \
--pull-policy "Always" \
--tasks-per-node 1 \
--max-retry 1 \
--priority high-priority \
--namespace hyperpod-ns-team-name \
--queue-name hyperpod-ns-team-name-localqueue \
--instance-type sample-instance-type \
--accelerators 1 \
--vcpu 3 \
--memory 1 \
--accelerators-limit 1 \
--vcpu-limit 4 \
--memory-limit 2
```

To allocate quotas using the AWS console, follow these steps.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Under HyperPod clusters, choose **Cluster management**.

1. Under **Compute allocations**, choose **Create**.

1. If you don’t already have instances, choose **Add allocation** to add an instance.

1. Under **Allocations**, choose to allocate by instances or individual resources. If you allocate by individual resources, SageMaker AI automatically assigns allocations to the other resources based on the ratio that you chose. To override this ratio-based allocation for a given resource, use the corresponding toggle.

1. Repeat steps 4 and 5 to configure additional instances.

After allocating compute quota, you can then submit jobs through the HyperPod CLI or `kubectl`. HyperPod efficiently schedules workloads based on available quota. 

# Allocating GPU partition quota
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation-gpu-partitions"></a>

You can extend compute quota allocation to support GPU partitioning, enabling fine-grained resource sharing at the GPU partition level. When GPU partitioning is enabled on supported GPUs in the cluster, each physical GPU can be partitioned into multiple isolated GPUs with defined compute, memory, and streaming multiprocessor allocations. For more information about GPU partitioning, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md). You can allocate specific GPU partitions to teams, allowing multiple teams to share a single GPU while maintaining hardware-level isolation and predictable performance.

For example, an ml.p5.48xlarge instance with 8 H100 GPUs can be partitioned into GPU partitions, and you can allocate individual partitions to different teams based on their task requirements. When you specify GPU partition allocations, HyperPod task governance calculates proportional vCPU and memory quotas based on the GPU partition, similar to GPU-level allocation. This approach maximizes GPU utilization by eliminating idle capacity and enabling cost-effective resource sharing across multiple concurrent tasks on the same physical GPU.
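
The following is a sketch of that proportional arithmetic, with assumed instance figures (192 vCPUs and 2048 GiB of memory; the exact ratios HyperPod applies may differ):

```python
# Illustrative only: a team allocated a fraction of an instance's GPU
# partitions receives the same fraction of its vCPU and memory.
def proportional_quota(partitions_allocated, partitions_total,
                       instance_vcpu, instance_memory_gib):
    fraction = partitions_allocated / partitions_total
    return {
        "vCpu": instance_vcpu * fraction,
        "MemoryInGiB": instance_memory_gib * fraction,
    }

# Example: 2 of 8 partitions on an instance with 192 vCPUs and 2048 GiB memory.
quota = proportional_quota(2, 8, 192, 2048)
# quota == {"vCpu": 48.0, "MemoryInGiB": 512.0}
```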

## Creating Compute Quotas
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation-gpu-partitions-creating"></a>

The following example creates a compute quota with a GPU partition allocation using the AWS CLI.

```
aws sagemaker create-compute-quota \
  --name "fractional-gpu-quota" \
  --compute-quota-config '{
    "ComputeQuotaResources": [
      {
        "InstanceType": "ml.p4d.24xlarge",
        "AcceleratorPartition": {
            "Count": 4,
            "Type": "mig-1g.5gb"
        }
      }
    ],
    "ResourceSharingConfig": { 
      "Strategy": "LendAndBorrow", 
      "BorrowLimit": 100 
    }
  }'
```

## Verifying Quota Resources
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation-gpu-partitions-verifying"></a>

The following `kubectl` commands verify that the quota resources were created.

```
# Check ClusterQueue
kubectl get clusterqueues
kubectl describe clusterqueue QUEUE_NAME

# Check ResourceFlavors
kubectl get resourceflavor
kubectl describe resourceflavor FLAVOR_NAME
```

# Example HyperPod task governance AWS CLI commands
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-cli"></a>

You can use HyperPod with Amazon EKS through `kubectl` or through the HyperPod custom CLI. You can run these commands from Studio or the AWS CLI. The following SageMaker HyperPod task governance examples show how to view cluster details using the HyperPod AWS CLI commands. For more information, including how to install the CLI, see the [HyperPod CLI Github repository](https://github.com/aws/sagemaker-hyperpod-cli).

**Topics**
+ [Get cluster accelerator device quota information](#hp-eks-cli-get-clusters)
+ [Submit a job to SageMaker AI-managed queue and namespace](#hp-eks-cli-start-job)
+ [List jobs](#hp-eks-cli-list-jobs)
+ [Get job detailed information](#hp-eks-cli-get-job)
+ [Suspend and unsuspend jobs](#hp-eks-cli-patch-job)
+ [Debugging jobs](#hp-eks-cli-other)

## Get cluster accelerator device quota information
<a name="hp-eks-cli-get-clusters"></a>

The following example command gets the information on the cluster accelerator device quota.

```
hyperpod get-clusters -n hyperpod-ns-test-team
```

The namespace in this example, `hyperpod-ns-test-team`, is created in Kubernetes based on the team name provided, `test-team`, when the compute allocation is created. For more information, see [Edit policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-edit.md).
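
The namespace name can be derived from the team name; the following is a small illustration of that convention (a hypothetical helper, not part of the CLI):

```python
# Illustrative helper: HyperPod creates a Kubernetes namespace named
# "hyperpod-ns-<team-name>" for each team in the compute allocation.
def team_namespace(team_name: str) -> str:
    return f"hyperpod-ns-{team_name}"

print(team_namespace("test-team"))  # hyperpod-ns-test-team
```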

Example response:

```
[
    {
        "Cluster": "hyperpod-eks-test-cluster-id",
        "InstanceType": "ml.g5.xlarge",
        "TotalNodes": 2,
        "AcceleratorDevicesAvailable": 1,
        "NodeHealthStatus=Schedulable": 2,
        "DeepHealthCheckStatus=Passed": "N/A",
        "Namespaces": {
            "hyperpod-ns-test-team": {
                "TotalAcceleratorDevices": 1,
                "AvailableAcceleratorDevices": 1
            }
        }
    }
]
```
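
A response like this can also be processed in a script. The following illustrative sketch (not part of the CLI) parses the example response, trimmed to the fields used, and totals the available accelerator devices per namespace:

```python
import json

# Parse a `hyperpod get-clusters`-style response (shape from the example
# above, trimmed to the fields used here) and total the available
# accelerator devices per namespace across clusters.
response = json.loads("""
[
    {
        "Cluster": "hyperpod-eks-test-cluster-id",
        "InstanceType": "ml.g5.xlarge",
        "Namespaces": {
            "hyperpod-ns-test-team": {
                "TotalAcceleratorDevices": 1,
                "AvailableAcceleratorDevices": 1
            }
        }
    }
]
""")

available = {}
for cluster in response:
    for namespace, devices in cluster.get("Namespaces", {}).items():
        available[namespace] = (available.get(namespace, 0)
                                + devices["AvailableAcceleratorDevices"])

print(available)  # {'hyperpod-ns-test-team': 1}
```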

## Submit a job to SageMaker AI-managed queue and namespace
<a name="hp-eks-cli-start-job"></a>

The following example command submits a job to your HyperPod cluster. If you have access to only one team, the HyperPod AWS CLI automatically assigns the queue for you. If multiple queues are discovered, the CLI displays all viable options for you to select.

```
hyperpod start-job --job-name hyperpod-cli-test \
--job-kind kubeflow/PyTorchJob \
--image docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd \
--entry-script /opt/pytorch-mnist/mnist.py \
--pull-policy IfNotPresent \
--instance-type ml.g5.xlarge \
--node-count 1 \
--tasks-per-node 1 \
--results-dir ./result \
--priority training-priority
```

The priority classes are defined in the **Cluster policy**, which defines how tasks are prioritized and idle compute is allocated. When a data scientist submits a job, they use one of the priority class names with the format `priority-class-name-priority`. In this example, `training-priority` refers to the priority class named “training”. For more information on policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

If a priority class is not specified, the job is treated as a low priority job, with a task ranking value of 0. 

If a priority class is specified, but does not correspond to one of the priority classes defined in the **Cluster policy**, the submission fails and an error message provides the defined set of priority classes.

You can also submit the job using a YAML configuration file with the following command:

```
hyperpod start-job --config-file ./yaml-configuration-file-name.yaml
```

The following is an example YAML configuration file that is equivalent to submitting a job as discussed above.

```
defaults:
  - override hydra/job_logging: stdout
hydra:
  run:
    dir: .
  output_subdir: null
training_cfg:
  entry_script: /opt/pytorch-mnist/mnist.py
  script_args: []
  run:
    name: hyperpod-cli-test
    nodes: 1
    ntasks_per_node: 1
cluster:
  cluster_type: k8s
  instance_type: ml.g5.xlarge
  custom_labels:
    kueue.x-k8s.io/priority-class: training-priority
  cluster_config:
    label_selector:
      required:
        sagemaker.amazonaws.com/node-health-status:
          - Schedulable
      preferred:
        sagemaker.amazonaws.com/deep-health-check-status:
          - Passed
      weights:
        - 100
    pullPolicy: IfNotPresent
base_results_dir: ./result
container: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd
env_vars:
  NCCL_DEBUG: INFO
```

Alternatively, you can submit a job using `kubectl` to ensure the task appears in the **Dashboard** tab. The following is an example kubectl command.

```
kubectl apply -f ./yaml-configuration-file-name.yaml
```

When submitting the job, include your queue name and priority class labels. For example, with the queue name `hyperpod-ns-team-name-localqueue` and priority class `priority-class-name-priority`, you must include the following labels:
+ `kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue` 
+ `kueue.x-k8s.io/priority-class: priority-class-name-priority`

The following YAML configuration snippet demonstrates how to add labels to your original configuration file to ensure your task appears in the **Dashboard** tab:

```
metadata:
    name: job-name
    namespace: hyperpod-ns-team-name
    labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
        kueue.x-k8s.io/priority-class: priority-class-name-priority
```

## List jobs
<a name="hp-eks-cli-list-jobs"></a>

The following command lists the jobs and their details.

```
hyperpod list-jobs
```

Example response:

```
{
    "jobs": [
        {
            "Name": "hyperpod-cli-test",
            "Namespace": "hyperpod-ns-test-team",
            "CreationTime": "2024-11-18T21:21:15Z",
            "Priority": "training",
            "State": "Succeeded"
        }
    ]
}
```
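
This output can be filtered in a script as well; for example, the following illustrative sketch (assuming the response shape above) keeps only the jobs in a given state:

```python
import json

# Filter a `hyperpod list-jobs`-style response (shape from the example
# above) down to the names of jobs in a given state.
response = json.loads("""
{
    "jobs": [
        {"Name": "hyperpod-cli-test", "Namespace": "hyperpod-ns-test-team",
         "Priority": "training", "State": "Succeeded"}
    ]
}
""")

def jobs_in_state(resp, state):
    return [job["Name"] for job in resp["jobs"] if job["State"] == state]

print(jobs_in_state(response, "Succeeded"))  # ['hyperpod-cli-test']
```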

## Get job detailed information
<a name="hp-eks-cli-get-job"></a>

The following command provides a job’s details. If no namespace is specified, the HyperPod AWS CLI fetches a SageMaker AI-managed namespace that you have access to.

```
hyperpod get-job --job-name hyperpod-cli-test
```

Example response:

```
{
    "Name": "hyperpod-cli-test",
    "Namespace": "hyperpod-ns-test-team",
    "Label": {
        "app": "hyperpod-cli-test",
        "app.kubernetes.io/managed-by": "Helm",
        "kueue.x-k8s.io/priority-class": "training"
    },
    "CreationTimestamp": "2024-11-18T21:21:15Z",
    "Status": {
        "completionTime": "2024-11-18T21:25:24Z",
        "conditions": [
            {
                "lastTransitionTime": "2024-11-18T21:21:15Z",
                "lastUpdateTime": "2024-11-18T21:21:15Z",
                "message": "PyTorchJob hyperpod-cli-test is created.",
                "reason": "PyTorchJobCreated",
                "status": "True",
                "type": "Created"
            },
            {
                "lastTransitionTime": "2024-11-18T21:21:17Z",
                "lastUpdateTime": "2024-11-18T21:21:17Z",
                "message": "PyTorchJob hyperpod-ns-test-team/hyperpod-cli-test is running.",
                "reason": "PyTorchJobRunning",
                "status": "False",
                "type": "Running"
            },
            {
                "lastTransitionTime": "2024-11-18T21:25:24Z",
                "lastUpdateTime": "2024-11-18T21:25:24Z",
                "message": "PyTorchJob hyperpod-ns-test-team/hyperpod-cli-test successfully completed.",
                "reason": "PyTorchJobSucceeded",
                "status": "True",
                "type": "Succeeded"
            }
        ],
            "replicaStatuses": {
                "Worker": {
                    "selector": "training.kubeflow.org/job-name=hyperpod-cli-test,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=worker",
                    "succeeded": 1
                }
            },
        "startTime": "2024-11-18T21:21:15Z"
    },
    "ConsoleURL": "https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/cluster-management/hyperpod-eks-test-cluster-id"
}
```

## Suspend and unsuspend jobs
<a name="hp-eks-cli-patch-job"></a>

If you want to remove a submitted job from the scheduler, the HyperPod AWS CLI provides the `suspend` command to temporarily remove the job from orchestration. A suspended job is no longer scheduled unless it is manually added back to the queue with the `unsuspend` command.

To temporarily suspend a job:

```
hyperpod patch-job suspend --job-name hyperpod-cli-test
```

To add a job back to the queue:

```
hyperpod patch-job unsuspend --job-name hyperpod-cli-test
```

## Debugging jobs
<a name="hp-eks-cli-other"></a>

The HyperPod AWS CLI also provides other commands to help you debug job submission issues, such as `list-pods` and `get-logs`. For more information, see the HyperPod CLI Github repository.

# Troubleshoot
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot"></a>

The following page contains known solutions for troubleshooting your HyperPod EKS clusters.

**Topics**
+ [Dashboard tab](#hp-eks-troubleshoot-dashboard)
+ [Tasks tab](#hp-eks-troubleshoot-tasks)
+ [Policies](#hp-eks-troubleshoot-policies)
+ [Deleting clusters](#hp-eks-troubleshoot-delete-policies)
+ [Unallocated resource sharing](#hp-eks-troubleshoot-unallocated-resource-sharing)

## Dashboard tab
<a name="hp-eks-troubleshoot-dashboard"></a>

**The EKS add-on fails to install**

For the EKS add-on installation to succeed, your cluster must run Kubernetes version 1.30 or later. To update, see [Update Kubernetes version](https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html).

For the EKS add-on installation to succeed, all of the nodes need to be in **Ready** status and all of the pods need to be in **Running** status. 

To check the status of your nodes, use the [list-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-cluster-nodes.html) AWS CLI command or navigate to your EKS cluster in the [EKS console](https://console.aws.amazon.com/eks/home#/clusters) and view the status of your nodes. Resolve the issue for each node or reach out to your administrator. If the node status is **Unknown**, delete the node. Once all node statuses are **Ready**, retry installing the EKS add-on in HyperPod from the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

To check the status of your pods, use the [Kubernetes CLI](https://kubernetes.io/docs/reference/kubectl/) command `kubectl get pods -n cloudwatch-agent` or navigate to your EKS cluster in the [EKS console](https://console.aws.amazon.com/eks/home#/clusters) and view the status of the pods in the `cloudwatch-agent` namespace. Resolve the issue for each pod or reach out to your administrator. Once all pod statuses are **Running**, retry installing the EKS add-on in HyperPod from the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

For more troubleshooting, see [Troubleshooting the Amazon CloudWatch Observability EKS add-on](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Observability-EKS-addon.html#Container-Insights-setup-EKS-addon-troubleshoot).

## Tasks tab
<a name="hp-eks-troubleshoot-tasks"></a>

If you see an error message stating that the **Custom Resource Definition (CRD) is not configured on the cluster**, grant the `EKSAdminViewPolicy` and `ClusterAccessRole` policies to your domain execution role.
+ For information on how to get your execution role, see [Get your execution role](sagemaker-roles.md#sagemaker-roles-get-execution-role).
+ To learn how to attach policies to an IAM user or group, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).

## Policies
<a name="hp-eks-troubleshoot-policies"></a>

The following lists solutions to errors relating to policies using the HyperPod APIs or console.
+ If the policy is in `CreateFailed` or `CreateRollbackFailed` status, you need to delete the failed policy and create a new one.
+ If the policy is in `UpdateFailed` status, retry the update with the same policy ARN.
+ If the policy is in `UpdateRollbackFailed` status, you need to delete the failed policy and then create a new one.
+ If the policy is in `DeleteFailed` or `DeleteRollbackFailed` status, retry the delete with the same policy ARN.
  + If you run into an error while trying to delete the **Compute prioritization** (cluster policy) using the HyperPod console, try deleting the `cluster-scheduler-config` using the API. To check the status of the resource, go to the details page of a compute allocation.

For more details about the failure, use the corresponding describe API.

## Deleting clusters
<a name="hp-eks-troubleshoot-delete-policies"></a>

The following lists known solutions to errors relating to deleting clusters.
+ When cluster deletion fails due to attached SageMaker HyperPod task governance policies, you will need to [Delete policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete.md).
+ When cluster deletion fails due to missing the following permissions, you will need to update your cluster administrator's minimum set of permissions. See the **Amazon EKS** tab in the [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin) section.
  + `sagemaker:ListComputeQuotas`
  + `sagemaker:ListClusterSchedulerConfig`
  + `sagemaker:DeleteComputeQuota`
  + `sagemaker:DeleteClusterSchedulerConfig`
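
A minimal IAM policy statement sketch that grants these four permissions might look like the following (scope the `Resource` element to your own ARNs where possible):

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListComputeQuotas",
                "sagemaker:ListClusterSchedulerConfig",
                "sagemaker:DeleteComputeQuota",
                "sagemaker:DeleteClusterSchedulerConfig"
            ],
            "Resource": "*"
        }
    ]
}
```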

## Unallocated resource sharing
<a name="hp-eks-troubleshoot-unallocated-resource-sharing"></a>

If your unallocated resource pool capacity is less than expected:

1. **Check node ready status**

   ```
   kubectl get nodes
   ```

   Verify all nodes show `Ready` status in the STATUS column.

1. **Check node schedulable status**

   ```
   kubectl get nodes -o custom-columns=NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable
   ```

   Verify nodes show `<none>` or `false` (not `true`).

1. **List unallocated resource sharing ClusterQueues:**

   ```
   kubectl get clusterqueue | grep hyperpod-ns-idle-resource-sharing
   ```

   This shows all unallocated resource sharing ClusterQueues. If the ClusterQueues are not showing up, check the `FailureReason` under the ClusterSchedulerConfig policy for any failure messages to continue debugging.

1. **Verify unallocated resource sharing quota:**

   ```
   kubectl describe clusterqueue hyperpod-ns-idle-resource-sharing-<index>
   ```

   Check the `spec.resourceGroups[].flavors[].resources` section to see the quota allocated for each resource flavor.

   Multiple unallocated resource sharing ClusterQueues may exist depending on the number of resource flavors in your cluster. 

1. **Check MIG configuration status (GPU nodes):**

   ```
   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/mig\.config\.state}{"\n"}{end}'
   ```

   Verify MIG-enabled nodes show `success` state.

# Attribution document for Amazon SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-attributions"></a>

In the following you can learn about attributions and third-party licenses for material used in Amazon SageMaker HyperPod task governance.

**Topics**
+ [base-files](#hp-eks-task-governance-attributions-base-files)
+ [netbase](#hp-eks-task-governance-attributions-netbase)
+ [golang-lru](#hp-eks-task-governance-attributions-golang-lru)

## [base-files](https://packages.debian.org/bookworm/base-files)
<a name="hp-eks-task-governance-attributions-base-files"></a>

```
This is the Debian prepackaged version of the Debian Base System
Miscellaneous files. These files were written by Ian Murdock
<imurdock@debian.org> and Bruce Perens <bruce@pixar.com>.

This package was first put together by Bruce Perens <Bruce@Pixar.com>,
from his own sources.

The GNU Public Licenses in /usr/share/common-licenses were taken from
ftp.gnu.org and are copyrighted by the Free Software Foundation, Inc.

The Artistic License in /usr/share/common-licenses is the one coming
from Perl and its SPDX name is "Artistic License 1.0 (Perl)".


Copyright © 1995-2011 Software in the Public Interest.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

On Debian systems, the complete text of the GNU General
Public License can be found in `/usr/share/common-licenses/GPL'.
```

## [netbase](https://packages.debian.org/source/stable/netbase)
<a name="hp-eks-task-governance-attributions-netbase"></a>

```
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Comment:
 This package was created by Peter Tobias tobias@et-inf.fho-emden.de on
 Wed, 24 Aug 1994 21:33:28 +0200 and maintained by Anthony Towns
 <ajt@debian.org> until 2001.
 It is currently maintained by Marco d'Itri <md@linux.it>.

Files: *
Copyright:
 Copyright © 1994-1998 Peter Tobias
 Copyright © 1998-2001 Anthony Towns
 Copyright © 2002-2022 Marco d'Itri
License: GPL-2
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License, version 2, as
 published by the Free Software Foundation.
 .
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.
 .
 You should have received a copy of the GNU General Public License along
 with this program; if not, write to the Free Software Foundation,
 Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 .
 On Debian systems, the complete text of the GNU General Public License
 version 2 can be found in '/usr/share/common-licenses/GPL-2'.
```

## [golang-lru](https://github.com/hashicorp/golang-lru)
<a name="hp-eks-task-governance-attributions-golang-lru"></a>

```
Copyright © 2014 HashiCorp, Inc.

Mozilla Public License, version 2.0

1. Definitions

1.1. "Contributor"

     means each individual or legal entity that creates, contributes to the
     creation of, or owns Covered Software.

1.2. "Contributor Version"

     means the combination of the Contributions of others (if any) used by a
     Contributor and that particular Contributor's Contribution.

1.3. "Contribution"

     means Covered Software of a particular Contributor.

1.4. "Covered Software"

     means Source Code Form to which the initial Contributor has attached the
     notice in Exhibit A, the Executable Form of such Source Code Form, and
     Modifications of such Source Code Form, in each case including portions
     thereof.

1.5. "Incompatible With Secondary Licenses"
     means

     a. that the initial Contributor has attached the notice described in
        Exhibit B to the Covered Software; or

     b. that the Covered Software was made available under the terms of
        version 1.1 or earlier of the License, but not also under the terms of
        a Secondary License.

1.6. "Executable Form"

     means any form of the work other than Source Code Form.

1.7. "Larger Work"

     means a work that combines Covered Software with other material, in a
     separate file or files, that is not Covered Software.

1.8. "License"

     means this document.

1.9. "Licensable"

     means having the right to grant, to the maximum extent possible, whether
     at the time of the initial grant or subsequently, any and all of the
     rights conveyed by this License.

1.10. "Modifications"

     means any of the following:

     a. any file in Source Code Form that results from an addition to,
        deletion from, or modification of the contents of Covered Software; or

     b. any new file in Source Code Form that contains any Covered Software.

1.11. "Patent Claims" of a Contributor

      means any patent claim(s), including without limitation, method,
      process, and apparatus claims, in any patent Licensable by such
      Contributor that would be infringed, but for the grant of the License,
      by the making, using, selling, offering for sale, having made, import,
      or transfer of either its Contributions or its Contributor Version.

1.12. "Secondary License"

      means either the GNU General Public License, Version 2.0, the GNU Lesser
      General Public License, Version 2.1, the GNU Affero General Public
      License, Version 3.0, or any later versions of those licenses.

1.13. "Source Code Form"

      means the form of the work preferred for making modifications.

1.14. "You" (or "Your")

      means an individual or a legal entity exercising rights under this
      License. For legal entities, "You" includes any entity that controls, is
      controlled by, or is under common control with You. For purposes of this
      definition, "control" means (a) the power, direct or indirect, to cause
      the direction or management of such entity, whether by contract or
      otherwise, or (b) ownership of more than fifty percent (50%) of the
      outstanding shares or beneficial ownership of such entity.


2. License Grants and Conditions

2.1. Grants

     Each Contributor hereby grants You a world-wide, royalty-free,
     non-exclusive license:

     a. under intellectual property rights (other than patent or trademark)
        Licensable by such Contributor to use, reproduce, make available,
        modify, display, perform, distribute, and otherwise exploit its
        Contributions, either on an unmodified basis, with Modifications, or
        as part of a Larger Work; and

     b. under Patent Claims of such Contributor to make, use, sell, offer for
        sale, have made, import, and otherwise transfer either its
        Contributions or its Contributor Version.

2.2. Effective Date

     The licenses granted in Section 2.1 with respect to any Contribution
     become effective for each Contribution on the date the Contributor first
     distributes such Contribution.

2.3. Limitations on Grant Scope

     The licenses granted in this Section 2 are the only rights granted under
     this License. No additional rights or licenses will be implied from the
     distribution or licensing of Covered Software under this License.
     Notwithstanding Section 2.1(b) above, no patent license is granted by a
     Contributor:

     a. for any code that a Contributor has removed from Covered Software; or

     b. for infringements caused by: (i) Your and any other third party's
        modifications of Covered Software, or (ii) the combination of its
        Contributions with other software (except as part of its Contributor
        Version); or

     c. under Patent Claims infringed by Covered Software in the absence of
        its Contributions.

     This License does not grant any rights in the trademarks, service marks,
     or logos of any Contributor (except as may be necessary to comply with
     the notice requirements in Section 3.4).

2.4. Subsequent Licenses

     No Contributor makes additional grants as a result of Your choice to
     distribute the Covered Software under a subsequent version of this
     License (see Section 10.2) or under the terms of a Secondary License (if
     permitted under the terms of Section 3.3).

2.5. Representation

     Each Contributor represents that the Contributor believes its
     Contributions are its original creation(s) or it has sufficient rights to
     grant the rights to its Contributions conveyed by this License.

2.6. Fair Use

     This License is not intended to limit any rights You have under
     applicable copyright doctrines of fair use, fair dealing, or other
     equivalents.

2.7. Conditions

     Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted in
     Section 2.1.


3. Responsibilities

3.1. Distribution of Source Form

     All distribution of Covered Software in Source Code Form, including any
     Modifications that You create or to which You contribute, must be under
     the terms of this License. You must inform recipients that the Source
     Code Form of the Covered Software is governed by the terms of this
     License, and how they can obtain a copy of this License. You may not
     attempt to alter or restrict the recipients' rights in the Source Code
     Form.

3.2. Distribution of Executable Form

     If You distribute Covered Software in Executable Form then:

     a. such Covered Software must also be made available in Source Code Form,
        as described in Section 3.1, and You must inform recipients of the
        Executable Form how they can obtain a copy of such Source Code Form by
        reasonable means in a timely manner, at a charge no more than the cost
        of distribution to the recipient; and

     b. You may distribute such Executable Form under the terms of this
        License, or sublicense it under different terms, provided that the
        license for the Executable Form does not attempt to limit or alter the
        recipients' rights in the Source Code Form under this License.

3.3. Distribution of a Larger Work

     You may create and distribute a Larger Work under terms of Your choice,
     provided that You also comply with the requirements of this License for
     the Covered Software. If the Larger Work is a combination of Covered
     Software with a work governed by one or more Secondary Licenses, and the
     Covered Software is not Incompatible With Secondary Licenses, this
     License permits You to additionally distribute such Covered Software
     under the terms of such Secondary License(s), so that the recipient of
     the Larger Work may, at their option, further distribute the Covered
     Software under the terms of either this License or such Secondary
     License(s).

3.4. Notices

     You may not remove or alter the substance of any license notices
     (including copyright notices, patent notices, disclaimers of warranty, or
     limitations of liability) contained within the Source Code Form of the
     Covered Software, except that You may alter any license notices to the
     extent required to remedy known factual inaccuracies.

3.5. Application of Additional Terms

     You may choose to offer, and to charge a fee for, warranty, support,
     indemnity or liability obligations to one or more recipients of Covered
     Software. However, You may do so only on Your own behalf, and not on
     behalf of any Contributor. You must make it absolutely clear that any
     such warranty, support, indemnity, or liability obligation is offered by
     You alone, and You hereby agree to indemnify every Contributor for any
     liability incurred by such Contributor as a result of warranty, support,
     indemnity or liability terms You offer. You may include additional
     disclaimers of warranty and limitations of liability specific to any
     jurisdiction.

4. Inability to Comply Due to Statute or Regulation

   If it is impossible for You to comply with any of the terms of this License
   with respect to some or all of the Covered Software due to statute,
   judicial order, or regulation then You must: (a) comply with the terms of
   this License to the maximum extent possible; and (b) describe the
   limitations and the code they affect. Such description must be placed in a
   text file included with all distributions of the Covered Software under
   this License. Except to the extent prohibited by statute or regulation,
   such description must be sufficiently detailed for a recipient of ordinary
   skill to be able to understand it.

5. Termination

5.1. The rights granted under this License will terminate automatically if You
     fail to comply with any of its terms. However, if You become compliant,
     then the rights granted under this License from a particular Contributor
     are reinstated (a) provisionally, unless and until such Contributor
     explicitly and finally terminates Your grants, and (b) on an ongoing
     basis, if such Contributor fails to notify You of the non-compliance by
     some reasonable means prior to 60 days after You have come back into
     compliance. Moreover, Your grants from a particular Contributor are
     reinstated on an ongoing basis if such Contributor notifies You of the
     non-compliance by some reasonable means, this is the first time You have
     received notice of non-compliance with this License from such
     Contributor, and You become compliant prior to 30 days after Your receipt
     of the notice.

5.2. If You initiate litigation against any entity by asserting a patent
     infringement claim (excluding declaratory judgment actions,
     counter-claims, and cross-claims) alleging that a Contributor Version
     directly or indirectly infringes any patent, then the rights granted to
     You by any and all Contributors for the Covered Software under Section
     2.1 of this License shall terminate.

5.3. In the event of termination under Sections 5.1 or 5.2 above, all end user
     license agreements (excluding distributors and resellers) which have been
     validly granted by You or Your distributors under this License prior to
     termination shall survive termination.

6. Disclaimer of Warranty

   Covered Software is provided under this License on an "as is" basis,
   without warranty of any kind, either expressed, implied, or statutory,
   including, without limitation, warranties that the Covered Software is free
   of defects, merchantable, fit for a particular purpose or non-infringing.
   The entire risk as to the quality and performance of the Covered Software
   is with You. Should any Covered Software prove defective in any respect,
   You (not any Contributor) assume the cost of any necessary servicing,
   repair, or correction. This disclaimer of warranty constitutes an essential
   part of this License. No use of  any Covered Software is authorized under
   this License except under this disclaimer.

7. Limitation of Liability

   Under no circumstances and under no legal theory, whether tort (including
   negligence), contract, or otherwise, shall any Contributor, or anyone who
   distributes Covered Software as permitted above, be liable to You for any
   direct, indirect, special, incidental, or consequential damages of any
   character including, without limitation, damages for lost profits, loss of
   goodwill, work stoppage, computer failure or malfunction, or any and all
   other commercial damages or losses, even if such party shall have been
   informed of the possibility of such damages. This limitation of liability
   shall not apply to liability for death or personal injury resulting from
   such party's negligence to the extent applicable law prohibits such
   limitation. Some jurisdictions do not allow the exclusion or limitation of
   incidental or consequential damages, so this exclusion and limitation may
   not apply to You.

8. Litigation

   Any litigation relating to this License may be brought only in the courts
   of a jurisdiction where the defendant maintains its principal place of
   business and such litigation shall be governed by laws of that
   jurisdiction, without reference to its conflict-of-law provisions. Nothing
   in this Section shall prevent a party's ability to bring cross-claims or
   counter-claims.

9. Miscellaneous

   This License represents the complete agreement concerning the subject
   matter hereof. If any provision of this License is held to be
   unenforceable, such provision shall be reformed only to the extent
   necessary to make it enforceable. Any law or regulation which provides that
   the language of a contract shall be construed against the drafter shall not
   be used to construe this License against a Contributor.


10. Versions of the License

10.1. New Versions

      Mozilla Foundation is the license steward. Except as provided in Section
      10.3, no one other than the license steward has the right to modify or
      publish new versions of this License. Each version will be given a
      distinguishing version number.

10.2. Effect of New Versions

      You may distribute the Covered Software under the terms of the version
      of the License under which You originally received the Covered Software,
      or under the terms of any subsequent version published by the license
      steward.

10.3. Modified Versions

      If you create software not governed by this License, and you want to
      create a new license for such software, you may create and use a
      modified version of this License if you rename the license and remove
      any references to the name of the license steward (except to note that
      such modified license differs from this License).

10.4. Distributing Source Code Form that is Incompatible With Secondary
      Licenses If You choose to distribute Source Code Form that is
      Incompatible With Secondary Licenses under the terms of this version of
      the License, the notice described in Exhibit B of this License must be
      attached.

Exhibit A - Source Code Form License Notice

      This Source Code Form is subject to the
      terms of the Mozilla Public License, v.
      2.0. If a copy of the MPL was not
      distributed with this file, You can
      obtain one at
      http://mozilla.org/MPL/2.0/.

If it is not possible or desirable to put the notice in a particular file,
then You may include the notice in a location (such as a LICENSE file in a
relevant directory) where a recipient would be likely to look for such a
notice.

You may add additional accurate notices of copyright ownership.

Exhibit B - "Incompatible With Secondary Licenses" Notice

      This Source Code Form is "Incompatible
      With Secondary Licenses", as defined by
      the Mozilla Public License, v. 2.0.
```

# Usage reporting for cost attribution in SageMaker HyperPod
<a name="sagemaker-hyperpod-usage-reporting"></a>

Usage reporting in SageMaker HyperPod EKS-orchestrated clusters provides granular visibility into compute resource consumption. The capability allows organizations to implement transparent cost attribution, allocating cluster costs to teams, projects, or departments based on their actual usage. By tracking metrics such as GPU/CPU hours and Neuron Core utilization, captured in *both team-level aggregates and task-specific breakdowns*, usage reporting complements HyperPod's [Task Governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md) functionality, ensuring fair cost distribution in shared multi-tenant clusters by:
+ Eliminating guesswork in cost allocation
+ Directly linking expenses to measurable resource consumption
+ Enforcing usage-based accountability in shared infrastructure environments

## Prerequisites
<a name="sagemaker-hyperpod-usage-reporting-prerequisites"></a>

To use this capability:
+ You need:
  + An active **SageMaker HyperPod environment** with a running EKS-orchestrated cluster.
  + (Strongly recommended) **Task Governance configured** with compute quotas and priority rules. For setup instructions, see [Task Governance setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md).
+ Familiarize yourself with these core concepts:
  + **Allocated compute quota:** Resources reserved for a team based on predefined quotas in their Task Governance policies. This is *guaranteed capacity* for their workloads.
  + **Borrowed compute:** Idle resources from the shared cluster pool that teams can temporarily use *beyond their allocated quota*. Borrowed compute is assigned dynamically based on priority rules in the Task Governance policies and availability of unused resources.
  + **Compute usage:** The measurement of resources (GPU, CPU, Neuron Core hours) consumed by a team, tracked as:
    + **Allocated utilization**: Usage within the team's quota.
    + **Borrowed utilization**: Usage beyond the quota, drawn from the shared pool.
  + **Cost attribution:** The process of allocating cluster costs to teams based on their *actual compute usage*, including both resources consumed within their predefined quota and resources temporarily used from the shared cluster pool beyond their quota.
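The split between allocated and borrowed utilization follows directly from these definitions: usage up to the quota is allocated, and any excess is borrowed from the shared pool. The following sketch (a hypothetical helper, not part of any HyperPod API) makes the split concrete:

```python
def split_usage(used_hours: float, quota_hours: float) -> tuple[float, float]:
    """Split a team's measured compute hours into allocated and borrowed parts.

    Allocated utilization is capped at the team's quota; anything beyond the
    quota counts as borrowed utilization drawn from the shared cluster pool.
    """
    allocated = min(used_hours, quota_hours)
    borrowed = max(0.0, used_hours - quota_hours)
    return allocated, borrowed

# A team with a 170-hour quota that used 200 GPU hours:
allocated, borrowed = split_usage(200.0, 170.0)
print(allocated, borrowed)  # 170.0 30.0
```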

## Report types
<a name="sagemaker-hyperpod-usage-reporting-report-types"></a>

HyperPod's usage reports provide two levels of operational granularity:
+ **Summary reports** provide organization-wide visibility into compute usage, aggregating total GPU/CPU/Neuron Core hours per team (namespace) while distinguishing between *regular usage* (resources from a team's allocated quota) and *borrowed compute* (overflow capacity from shared pools).
+ **Detailed reports** offer task-level breakdowns by team, tracking exact compute hours spent running specific tasks – including preempted tasks, hourly utilization patterns, and namespace-specific allocations.

**Important**  
HyperPod usage reporting tracks compute utilization across *all Kubernetes namespaces* in a cluster—including those managed by Task Governance, default namespaces, and namespaces created **outside of Task Governance** (e.g., via direct Kubernetes API calls or external tools). This infrastructure-level monitoring ensures comprehensive usage-based accountability, preventing gaps in cost attribution for shared clusters regardless of how namespaces are managed.

## Report formats and time range
<a name="sagemaker-hyperpod-usage-reporting-formats"></a>

Using the Python script provided in [Generate reports](sagemaker-hyperpod-usage-reporting-generate.md), administrators can generate usage reports on demand in CSV or PDF formats, selecting time ranges from daily snapshots to 180-day (6-month) historical windows.

**Note**  
You can configure the historical window to extend beyond the default 180-day maximum when setting up the reporting infrastructure. For more information on configuring the data retention period, see [Install Usage Report Infrastructure using CloudFormation](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#install-usage-report-infrastructure-using-cloudformation). 

## Illustrative use cases
<a name="sagemaker-hyperpod-usage-reporting-use-cases"></a>

This capability addresses critical scenarios in multi-tenant AI/ML environments such as:

1. **Cost allocation for shared clusters**: An administrator manages a HyperPod cluster shared by 20 teams training generative AI models. Using a *summary usage report*, they analyze daily GPU utilization over 180 days and discover Team A consumed 200 GPU hours of a specific instance type—170 from their allocated quota and 30 from borrowed compute. The administrator invoices Team A based on this reported usage.

1. **Auditing and dispute resolution**: A finance team questions cost attribution accuracy, citing inconsistencies. The administrator can export a *detailed task-level report* to audit discrepancies. By cross-referencing timestamps, instance types, and preempted jobs within the team's namespace, the report transparently reconciles disputed usage data.

# Report details and data breakdown
<a name="sagemaker-hyperpod-usage-reporting-content"></a>

SageMaker HyperPod's usage reports provide two distinct lenses for analyzing compute resource consumption: **summary reports** for cost allocation and **detailed reports** for granular auditing. Summary reports aggregate cluster-wide usage by team or namespace, highlighting trends in allocated versus borrowed compute across GPU, CPU, and Neuron Core resources. Detailed reports drill into individual tasks, exposing metrics such as execution windows, task status, and priority-class utilization. In this section, we break down the structure of these reports, explain their key metrics, and demonstrate how administrators and finance teams can cross-reference summary trends with task-level data to validate cost attribution accuracy, resolve discrepancies, and optimize shared infrastructure.

## Common report headers
<a name="sagemaker-hyperpod-usage-reporting-content-headers"></a>

Both summary and detailed reports include the following metadata to contextualize the usage data:
+ **ClusterName:** The EKS-orchestrated HyperPod cluster name where resources were consumed.
+ **Type:** The report category (`Summary Utilization Report` or `Detailed Utilization Report`).
+ **Date Generated:** When the report was created (e.g., `2025-04-18`).
+ **Date Range (UTC):** The timeframe covered (e.g., `2025-04-16 to 2025-04-18`).
+ **Missing data periods:** Gaps in data collection due to cluster downtime or monitoring issues (e.g., `2025-04-16 00:00:00 to 2025-04-19 00:00:00`).

## Summary reports
<a name="sagemaker-hyperpod-usage-reporting-content-summary"></a>

Summary reports provide a per-day, high-level overview of compute resource consumption across teams/namespaces and instance types, distinguishing between allocated (reserved quota) and borrowed (shared pool) utilization. These reports are ideal for invoice generation, cost attribution statements, or capacity forecasting.

*Example: A summary report might show that Team A used 200 GPU hours—170 from their allocated quota and 30 borrowed.*

Here's a structured breakdown of the key columns in a summary report:
+ **Date:** The date of the reported usage (e.g., `2025-04-18`).
+ **Namespace:** The Kubernetes namespace associated with the team (e.g., `hyperpod-ns-ml-team`).
+ **Team:** The owning team/department (e.g., `ml-team`).
+ **Instance Type:** The compute instance used (e.g., `ml.g5.4xlarge`).
+ **Total/Allocated/Borrowed Utilization (Hours):** The breakdown of GPU, CPU, or Neuron Core usage by category.

  Where:
  + **Total utilization = Allocated utilization + Borrowed utilization**
  + **Allocated utilization** is the actual GPU, CPU, or Neuron Core hours a team has used, capped at 100% of their allocated quota.
  + **Borrowed utilization** is the actual GPU, CPU, or Neuron Core hours a team has used *beyond their allocated quota*, drawn from the shared cluster pool based on Task Governance priority rules and resource availability.

Example: 72 GPU hours total (48 allocated, 24 borrowed).

**Note**  
Only total utilization is displayed for namespaces not managed by Task Governance.
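The column breakdown above implies a simple consistency check: for every row, total utilization should equal allocated plus borrowed utilization. The sketch below parses a hypothetical summary-report row and applies that check; the exact CSV header names are an assumption based on the column breakdown, not a documented schema.

```python
import csv
import io

# Hypothetical summary-report content; header names are illustrative only.
summary_csv = """\
Date,Namespace,Team,Instance Type,Total Utilization (Hours),Allocated Utilization (Hours),Borrowed Utilization (Hours)
2025-04-18,hyperpod-ns-ml-team,ml-team,ml.g5.4xlarge,72,48,24
"""

for row in csv.DictReader(io.StringIO(summary_csv)):
    total = float(row["Total Utilization (Hours)"])
    allocated = float(row["Allocated Utilization (Hours)"])
    borrowed = float(row["Borrowed Utilization (Hours)"])
    # Sanity check: total utilization = allocated + borrowed
    assert total == allocated + borrowed
```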

## Detailed reports
<a name="sagemaker-hyperpod-usage-reporting-content-detailed"></a>

Detailed reports provide forensic-level visibility into compute usage, breaking down resource consumption by task and exposing granular metrics like task execution windows, status (e.g., Succeeded, Failed), and priority-class usage. These reports are ideal for validating billing discrepancies or ensuring compliance with governance policies.

Here's a structured breakdown of the key columns in a detailed report:
+ **Date:** The date of the reported usage (e.g., `2025-04-18`).
+ **Period Start/End:** The exact execution window (UTC) for the task (e.g., `19:54:34`).
+ **Namespace:** The Kubernetes namespace associated with the team (e.g., `hyperpod-ns-ml-team`).
+ **Team:** The owning team/department (e.g., `ml-team`).
+ **Task:** The identifier for the job/pod (e.g., `pytorchjob-ml-pytorch-job-2p5zt-db686`).
+ **Instance:** The compute instance used (e.g., `ml.g5.4xlarge`).
+ **Status:** Task outcome (Succeeded, Failed, Preempted).
+ **Total Utilization:** Total consumption (hours and instance count) of GPU, CPU, or Neuron Core resources.
+ **Priority Class:** The priority tier assigned (e.g., `training-priority`).
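Because each detailed row carries the team/namespace alongside its hours, task-level data can be rolled up and reconciled against the matching summary report. The following sketch performs that roll-up over a few hypothetical rows; the field names are illustrative, mirroring the column breakdown above.

```python
from collections import defaultdict

# Hypothetical detailed-report rows (field names are illustrative).
detailed_rows = [
    {"Team": "ml-team", "Task": "pytorchjob-a", "Status": "Succeeded", "Hours": 5.0},
    {"Team": "ml-team", "Task": "pytorchjob-b", "Status": "Preempted", "Hours": 1.5},
    {"Team": "research", "Task": "pytorchjob-c", "Status": "Failed", "Hours": 2.0},
]

# Roll task-level usage up to per-team totals so they can be cross-checked
# against the corresponding summary report.
per_team = defaultdict(float)
for row in detailed_rows:
    per_team[row["Team"]] += row["Hours"]

print(dict(per_team))  # {'ml-team': 6.5, 'research': 2.0}
```

Note that preempted and failed tasks still consumed compute, so they are included in the totals just as they are in the reports.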

# Generate reports
<a name="sagemaker-hyperpod-usage-reporting-generate"></a>

This guide provides step-by-step instructions to configure and manage usage reporting for your SageMaker HyperPod clusters. Follow these procedures to deploy infrastructure, generate custom reports, and remove resources when no longer needed.

## Set up usage reporting
<a name="sagemaker-hyperpod-usage-reporting-install"></a>

**Note**  
Before configuring the SageMaker HyperPod usage report infrastructure in your SageMaker HyperPod cluster, ensure you have met all prerequisites detailed in the [Prerequisites](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#prerequisites) section of the usage report README.

Usage reporting in HyperPod requires:
+ Deploying SageMaker HyperPod usage report AWS resources using a CloudFormation stack
+ Installing the SageMaker HyperPod usage report Kubernetes operator via a Helm chart

You can find comprehensive installation instructions in the [SageMaker HyperPod usage report GitHub repository](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md). Specifically, follow the steps in the [Set up](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#set-up-usage-reporting) section.

## Generate usage reports on demand
<a name="sagemaker-hyperpod-usage-reporting-use"></a>

Once the usage reporting infrastructure and Kubernetes operator are installed, job data for your SageMaker HyperPod cluster is automatically collected and stored in the S3 bucket you configured during setup. The operator continuously captures detailed usage metrics in the background, creating raw data files in the `raw` directory of your designated S3 bucket.

To generate an on-demand usage report, you can use the `run.py` script provided in the [SageMaker HyperPod usage report GitHub repository](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md) to extract and export usage metrics. Specifically, you can find the script and comprehensive instructions for generating a report in the [Generate Reports](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#generate-reports) section.

The script allows you to:
+ Specify custom date ranges for report generation
+ Choose between detailed and summary report types
+ Export reports in CSV or PDF formats
+ Direct reports to a specific S3 location
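The capabilities above translate into command-line arguments to `run.py`. As an illustration only, the snippet below assembles one plausible invocation; the flag names are assumptions, so check the Generate Reports section of the repository README for the script's actual arguments.

```python
# Illustrative only: assembles a hypothetical run.py invocation as an
# argument list (e.g., for subprocess.run). Flag names are assumptions.
report_cmd = [
    "python", "run.py",
    "--start-date", "2025-04-16",
    "--end-date", "2025-04-18",
    "--report-type", "summary",   # or "detailed"
    "--format", "csv",            # or "pdf"
    "--output-s3-uri", "s3://my-usage-reports/reports/",
]

print(" ".join(report_cmd))
```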

## Clean up usage reporting resources
<a name="sagemaker-hyperpod-usage-reporting-cleanup"></a>

When you no longer need your SageMaker HyperPod usage reporting infrastructure, follow the steps in [Clean Up Resources](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#clean-up-resources) to clean up the Kubernetes operator and AWS resources (in that order). Proper resource deletion helps prevent unnecessary costs.

# Configuring storage for SageMaker HyperPod clusters orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-setup-storage"></a>

Cluster admins need to configure storage so that data scientist users can manage input and output data and store checkpoints during training on SageMaker HyperPod clusters.

**Handling large datasets (input/output data)**
+ **Data access and management**: Data scientists often work with large datasets that are required for training machine learning models. Specifying storage parameters in the job submission allows them to define where these datasets are located (e.g., Amazon S3 buckets, persistent volumes in Kubernetes) and how they are accessed during the job execution.
+ **Performance optimization**: The efficiency of accessing input data can significantly impact the performance of the training job. By optimizing storage parameters, data scientists can ensure that data is read and written efficiently, reducing I/O bottlenecks.

**Storing checkpoints**
+ **Checkpointing in training**: During long-running training jobs, it’s common practice to save checkpoints—intermediate states of the model. This allows data scientists to resume training from a specific point in case of a failure, rather than starting from scratch.
+ **Data recovery and experimentation**: By specifying the storage location for checkpoints, data scientists can ensure that these checkpoints are securely stored, potentially in a distributed storage system that offers redundancy and high availability. This is crucial for recovering from interruptions and for experimenting with different training strategies.
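The checkpointing pattern described above can be sketched in a framework-agnostic way: write intermediate state to a directory on shared storage, and on startup look for the latest checkpoint to resume from. The directory path below is a hypothetical example; in practice it would point at a mount such as an FSx for Lustre file system.

```python
import json
import os
from typing import Optional

# Hypothetical checkpoint location on shared storage (for example, an FSx for
# Lustre mount such as /fsx/checkpoints); adjust to your cluster's setup.
CHECKPOINT_DIR = os.environ.get("CHECKPOINT_DIR", "/tmp/checkpoints")

def save_checkpoint(step: int, state: dict) -> str:
    """Persist intermediate training state so a job can resume after failure."""
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    path = os.path.join(CHECKPOINT_DIR, f"step-{step}.json")
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    return path

def load_latest_checkpoint() -> Optional[dict]:
    """Return the most recent checkpoint, or None to start from scratch."""
    if not os.path.isdir(CHECKPOINT_DIR):
        return None
    steps = sorted(
        int(name.split("-")[1].split(".")[0])
        for name in os.listdir(CHECKPOINT_DIR)
        if name.startswith("step-") and name.endswith(".json")
    )
    if not steps:
        return None
    with open(os.path.join(CHECKPOINT_DIR, f"step-{steps[-1]}.json")) as f:
        return json.load(f)

save_checkpoint(100, {"loss": 0.42})
resumed = load_latest_checkpoint()  # resumes from the most recent step
```

Real training jobs would store framework-native checkpoints (e.g., model and optimizer state) rather than JSON, but the resume logic is the same.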

**Tip**  
For a hands-on experience and guidance on how to set up storage for SageMaker HyperPod cluster orchestrated with Amazon EKS, see the following sections in the [Amazon EKS Support in SageMaker HyperPod workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e).  
[Set up Amazon FSx for Lustre on SageMaker HyperPod](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e/en-US/01-cluster/06-fsx-for-lustre)
[Set up a mountpoint for Amazon S3](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e/en-US/01-cluster/09-s3-mountpoint) using [Mountpoint for Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mountpoint.html)

# Using the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters
<a name="sagemaker-hyperpod-eks-ebs"></a>

SageMaker HyperPod supports the Amazon Elastic Block Store (Amazon EBS) Container Storage Interface (CSI) driver, which manages the lifecycle of Amazon EBS volumes as storage for the Kubernetes volumes that you create. With the Amazon EBS CSI driver, you can create, attach, and manage your Amazon EBS volumes for your machine learning workloads running on SageMaker HyperPod clusters with Amazon EKS orchestration.

**Topics**
+ [Key storage capabilities](#sagemaker-hyperpod-eks-ebs-features)
+ [Use cases](#sagemaker-hyperpod-eks-ebs-use)
+ [Setting up the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters](#sagemaker-hyperpod-eks-ebs-setup)
+ [Using the APIs](#sagemaker-hyperpod-eks-ebs-setup-apis)

## Key storage capabilities
<a name="sagemaker-hyperpod-eks-ebs-features"></a>

The Amazon EBS CSI driver on SageMaker HyperPod supports the following storage capabilities.
+ Static provisioning: Associates pre-created Amazon EBS volumes with Kubernetes [persistent volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) for use in your pods.
+ Dynamic provisioning: Automatically creates Amazon EBS volumes and associated persistent volumes from [persistent volume claims](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims). Parameters can be passed via [storage classes](https://kubernetes.io/docs/concepts/storage/storage-classes/) for fine-grained control over volume creation.
+ Volume resizing: Expands existing volumes by updating the [persistent volume claim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims) size specification without disrupting running workloads. This can be essential for handling growing model repositories or adapting to larger nodes without service disruption.
+ Volume snapshots: Creates point-in-time snapshots of volumes for backup, recovery, and data versioning.
+ Block volumes: Provides raw block device access for high-performance applications requiring direct storage access.
+ Volume modification: Changes volume properties such as type, input or output operations per second (IOPS), or throughput using [volume attributes classes](https://kubernetes.io/docs/concepts/storage/volume-attributes-classes/).
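Dynamic provisioning, for example, pairs a StorageClass that names the EBS CSI driver as its provisioner with a PersistentVolumeClaim that references that class. The sketch below builds both objects as plain Python dicts (which could be dumped to JSON or YAML and applied with kubectl); the names and sizes are illustrative, while `ebs.csi.aws.com` is the driver's well-known provisioner name.

```python
# Illustrative dynamic-provisioning pair: a StorageClass backed by the EBS
# CSI driver, and a PVC that requests a volume from it. Names/sizes are
# examples only.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "ebs-gp3"},
    "provisioner": "ebs.csi.aws.com",
    "parameters": {"type": "gp3"},
    # Delay volume creation until a pod is scheduled, so the volume lands
    # in the same Availability Zone as the node.
    "volumeBindingMode": "WaitForFirstConsumer",
}

pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "training-data"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "ebs-gp3",
        "resources": {"requests": {"storage": "100Gi"}},
    },
}
```

`WaitForFirstConsumer` matters for EBS because volumes are zonal; binding at scheduling time avoids creating a volume in a zone with no matching node.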

For more information about the Amazon EBS CSI driver, see [Use Kubernetes volume storage with Amazon EBS](https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html) from the *Amazon EKS User Guide*.

For more information about providing storage to pods in your cluster, see [Storage](https://kubernetes.io/docs/concepts/storage/) from the *Kubernetes Documentation*.

## Use cases
<a name="sagemaker-hyperpod-eks-ebs-use"></a>

The Amazon EBS CSI driver integration enables several key use cases for both training and inference workloads on SageMaker HyperPod EKS clusters.

**Training workloads**
+ Dataset storage: Provision volumes for training datasets that persist across pod restarts
+ Checkpoint storage: Save model checkpoints and intermediate training results
+ Shared artifacts: Access common datasets and model artifacts across multiple training jobs

**Inference workloads**
+ Model storage: Dynamically provision appropriately sized volumes based on model requirements
+ Container caching: Create ephemeral storage for improved inference performance
+ Event logging: Store inference results and logs with persistent storage

## Setting up the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters
<a name="sagemaker-hyperpod-eks-ebs-setup"></a>

The Amazon Elastic Block Store (Amazon EBS) Container Storage Interface (CSI) driver allows you to dynamically provision and manage Amazon EBS volumes for your containerized workloads running on SageMaker HyperPod clusters with EKS orchestration. This section walks you through installing and configuring the Amazon EBS CSI driver to enable persistent storage for your machine learning workloads.

### Prerequisites
<a name="sagemaker-hyperpod-eks-ebs-setup-prerequisite"></a>

Before you begin, do the following:
+ [Install and configure the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html)
+ [Create a SageMaker HyperPod cluster with Amazon EKS orchestration](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-create-cluster.html)
+ Install version [v1.47.0](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/CHANGELOG.md#v1470) of the Amazon EBS CSI driver

### Additional permissions
<a name="sagemaker-hyperpod-eks-ebs-setup-permissions"></a>

To set up the Amazon EBS CSI driver add-on, follow the instructions in [Use Kubernetes volume storage with Amazon EBS](https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html) from the *Amazon EKS User Guide*. You should also add the following additional permissions to the IAM role used to run the driver add-on. Note that this is the IAM role specified in your service account configuration for the driver add-on, not the HyperPod cluster execution role.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement":
    [
        {
            "Effect": "Allow",
            "Action":
            [
                "sagemaker:AttachClusterNodeVolume",
                "sagemaker:DetachClusterNodeVolume"
            ],
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:cluster/*"
        },
        {
            "Effect": "Allow",
            "Action":
            [
                "eks:DescribeCluster"
            ],
            "Resource": "arn:aws:eks:us-east-1:111122223333:cluster/my-cluster-name"
        }
    ]
}
```

------

## Using the APIs
<a name="sagemaker-hyperpod-eks-ebs-setup-apis"></a>

As an alternative, you can use the [AttachClusterNodeVolume](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AttachClusterNodeVolume.html) and [DetachClusterNodeVolume](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DetachClusterNodeVolume.html) API operations to attach and detach your Amazon EBS volumes to SageMaker HyperPod EKS cluster instances.

**Key requirements for using these APIs include the following.**
+ Both the Amazon EBS volume and SageMaker HyperPod EKS cluster must be owned by the same AWS account.
+ The calling principal needs specific minimum permissions to successfully perform the attach or detach operation. For more information about the minimum permissions, see the following sections.
+ After attaching a volume to your HyperPod node, follow the instructions in [Accessing SageMaker HyperPod cluster nodes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-access-through-terminal.html) to access the cluster node, and [Make a volume available for use](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-using-volumes.html) to mount the attached volume.
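A minimal sketch of calling the attach API via boto3 follows. The request parameter names below are assumptions inferred from the API reference naming (confirm them in the `AttachClusterNodeVolume` documentation before use), and the actual client call is left commented out because it requires credentials and the permissions shown in the following sections.

```python
# Sketch only: the parameter names here are assumptions based on the API
# reference; verify them against the AttachClusterNodeVolume documentation.
def build_attach_request(cluster_arn: str, node_id: str, volume_id: str) -> dict:
    """Assemble the request for attaching an EBS volume to a HyperPod node."""
    return {"ClusterArn": cluster_arn, "NodeId": node_id, "VolumeId": volume_id}

request = build_attach_request(
    "arn:aws:sagemaker:us-east-1:111122223333:cluster/abc123",
    "i-0123456789abcdef0",
    "vol-0123456789abcdef0",
)

# Not executed here; requires credentials and the IAM permissions below.
# import boto3
# client = boto3.client("sagemaker")
# client.attach_cluster_node_volume(**request)
```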

### Required permissions for `sagemaker:AttachClusterNodeVolume`
<a name="sagemaker-hyperpod-eks-ebs-setup-apis-attach"></a>

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement":
    [
        {
            "Effect": "Allow",
            "Action":
            [
                "sagemaker:AttachClusterNodeVolume"
            ],
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:cluster/*"
        },
        {
            "Effect": "Allow",
            "Action":
            [
                "eks:DescribeCluster"
            ],
            "Resource": "arn:aws:eks:us-east-1:111122223333:cluster/my-cluster-name"
        },
        {
            "Effect": "Allow",
            "Action":
            [
                "ec2:AttachVolume",
                "ec2:DescribeVolumes"
            ],
            "Resource": "arn:aws:ec2:us-east-1:111122223333:volume/*"
        }
    ]
}
```

------

### Required permissions for `sagemaker:DetachClusterNodeVolume`
<a name="sagemaker-hyperpod-eks-ebs-setup-apis-detach"></a>

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement":
    [
        {
            "Effect": "Allow",
            "Action":
            [
                "sagemaker:DetachClusterNodeVolume"
            ],
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:cluster/*"
        },
        {
            "Effect": "Allow",
            "Action":
            [
                "eks:DescribeCluster"
            ],
            "Resource": "arn:aws:eks:us-east-1:111122223333:cluster/my-cluster-name"
        },
        {
            "Effect": "Allow",
            "Action":
            [
                "ec2:DetachVolume",
                "ec2:DescribeVolumes"
            ],
            "Resource": "arn:aws:ec2:us-east-1:111122223333:volume/*"
        }
    ]
}
```

------

### Required permissions for AWS KMS keys
<a name="sagemaker-hyperpod-eks-ebs-setup-apis-kms"></a>

Add the following AWS KMS permissions only if you're using customer managed KMS keys to encrypt your Amazon EBS volumes attached to HyperPod cluster nodes. These permissions are not required if you're using AWS-managed KMS keys (the default encryption option).

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Id": "key-default-1",
    "Statement":
    [
        {
            "Effect": "Allow",
            "Principal":
            {
                "AWS": "arn:aws:iam::111122223333:role/caller-role"
            },
            "Action": "kms:DescribeKey",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Principal":
            {
                "AWS": "arn:aws:iam::111122223333:role/caller-role"
            },
            "Action": "kms:CreateGrant",
            "Resource": "*",
            "Condition":
            {
                "StringEquals":
                {
                    "kms:CallerAccount": "111122223333",
                    "kms:ViaService": "ec2.us-east-1.amazonaws.com"
                },
                "ForAnyValue:StringEquals":
                {
                    "kms:EncryptionContextKeys": "aws:ebs:id"
                },
                "Bool":
                {
                    "kms:GrantIsForAWSResource": true
                },
                "ForAllValues:StringEquals":
                {
                    "kms:GrantOperations":
                    [
                        "Decrypt"
                    ]
                }
            }
        }
    ]
}
```

------

**Note**  
These AWS KMS permissions are not required for `sagemaker:DetachClusterNodeVolume` when detaching a Cluster Auto Volume Attachment (CAVA) volume encrypted with customer managed KMS keys.

# Configuring custom Kubernetes labels and taints in Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints"></a>

Amazon SageMaker HyperPod clusters with Amazon Elastic Kubernetes Service (Amazon EKS) orchestrator support custom Kubernetes labels and taints for nodes within instance groups. Labels and taints are fundamental scheduling and organization mechanisms in Kubernetes that give you fine-grained control over pod placement and resource utilization.

Labels are key-value pairs that can be attached to Kubernetes objects, allowing you to organize and select resources based on attributes. Taints, working in conjunction with tolerations, are node-specific properties that influence pod scheduling by repelling pods that don't have matching tolerations. Together, these mechanisms enable you to isolate workloads, assign them according to hardware specifications, and ensure optimal resource utilization.

## Common use cases
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-use-cases"></a>

The following are common scenarios where custom labels and taints are beneficial:
+ **Preventing system pods on expensive instances** - Apply taints to GPU instances to prevent system pods and other non-critical workloads from consuming expensive compute resources
+ **Integration with existing tooling** - Apply labels that match your organization's established infrastructure patterns and node affinity configurations

## Configuring labels and taints
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-configure"></a>

You can configure custom Kubernetes labels and taints at the instance group level using the `KubernetesConfig` parameter in your cluster configuration. Labels and taints are applied to all nodes in the instance group and persist throughout the cluster's lifecycle.

The `KubernetesConfig` parameter is declarative, meaning you specify the complete desired state of labels and taints for an instance group. SageMaker HyperPod then reconciles the actual state of the nodes to match this desired state.
+ **Adding labels or taints** - Include the new labels or taints in the `KubernetesConfig` along with any existing ones you want to keep
+ **Updating labels or taints** - Modify the values in the `KubernetesConfig` for the labels or taints you want to change, and include all others you want to keep
+ **Removing labels or taints** - Omit the labels or taints you want to remove from the `KubernetesConfig`, keeping only those you want to retain
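As a concrete sketch of the declarative model, suppose an instance group currently carries the labels `env=prod` and `team=ml-training`. Submitting the following hypothetical `KubernetesConfig` fragment keeps `env` and removes `team`, because any label omitted from the desired state is deleted during reconciliation; the empty `Taints` list likewise removes any existing taints:

```
{
    "KubernetesConfig": {
        "Labels": {
            "env": "prod"
        },
        "Taints": []
    }
}
```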

### Creating a cluster with labels and taints
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-create"></a>

When creating a new SageMaker HyperPod cluster, include the `KubernetesConfig` parameter in your instance group configuration. The following example shows how to create a cluster with custom labels and taints:

```
{
    "ClusterName": "my-cluster",
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p4d.24xlarge",
        "InstanceCount": 4,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://my-bucket/lifecycle-config.sh",
            "OnCreate": "on-create.sh"
        },
        "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
        "ThreadsPerCore": 1,
        "KubernetesConfig": { 
            "Labels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100"
            },
            "Taints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            },
            {
                "key": "dedicated",
                "value": "ml-workloads",
                "effect": "NoExecute"
            }]
        }
    }],
    "VpcConfig": {
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0", "subnet-0123456789abcdef1"]
    },
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "arn:aws:eks:us-west-2:123456789012:cluster/my-eks-cluster"
        }
    }
}
```

In this example:
+ **Labels** - Three custom labels are applied: `env=prod`, `team=ml-training`, and `gpu-type=a100`
+ **Taints** - Two taints are configured to prevent unwanted pod scheduling
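To schedule a pod onto nodes in this instance group, the pod must tolerate both taints and can target the custom labels with a node selector. The following is a minimal sketch of such a pod spec (the pod name and container image are placeholders):

```
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  # Target nodes carrying the instance group's custom labels.
  nodeSelector:
    env: prod
    gpu-type: a100
  # Tolerate both taints so the pod is not repelled from these nodes.
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
    - key: dedicated
      operator: Equal
      value: ml-workloads
      effect: NoExecute
  containers:
    - name: trainer
      image: my-training-image:latest   # placeholder
```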

### Updating labels and taints on an existing cluster
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-update"></a>

You can modify labels and taints on an existing cluster using the `UpdateCluster` API. The following example shows how to update the `KubernetesConfig` for an instance group:

```
{
    "ClusterName": "my-cluster",
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "KubernetesConfig": { 
            "Labels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100",
                "cost-center": "ml-ops"
            },
            "Taints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }]
        }
    }]
}
```

When you update labels and taints, SageMaker HyperPod applies the changes to all nodes in the instance group. The service manages the transition from current to desired state, which you can monitor using the `DescribeCluster` API.
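The update request above can be submitted with the AWS CLI. The following sketch assumes the request body is saved as `update-config.json` (a name chosen for illustration); the `aws sagemaker update-cluster` call itself requires credentials and an existing cluster, so it is shown commented out:

```shell
# Save the update request as update-config.json.
cat > update-config.json <<'EOF'
{
    "ClusterName": "my-cluster",
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "KubernetesConfig": {
            "Labels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100",
                "cost-center": "ml-ops"
            },
            "Taints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }]
        }
    }]
}
EOF

# Confirm the request body is well-formed JSON before submitting.
python3 -m json.tool update-config.json > /dev/null && echo "request JSON is valid"

# Submit the update (requires AWS credentials and an existing cluster):
# aws sagemaker update-cluster --cli-input-json file://update-config.json
```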

## Monitoring label and taint application
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-monitor"></a>

SageMaker HyperPod provides APIs to monitor the status of labels and taints as they are applied to your cluster nodes.

### Checking cluster-level status
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-describe-cluster"></a>

Use the `DescribeCluster` API to view the current and desired states of labels and taints at the instance group level. The following example shows the response structure:

```
{
    "ClusterName": "my-cluster",
    "ClusterStatus": "InService",
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p4d.24xlarge",
        "CurrentInstanceCount": 4,
        "TargetInstanceCount": 4,
        "KubernetesConfig": {
            "CurrentLabels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100"
            },
            "DesiredLabels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100"
            },
            "CurrentTaints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }],
            "DesiredTaints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }]
        }
    }]
}
```

When the `CurrentLabels` match `DesiredLabels` and `CurrentTaints` match `DesiredTaints`, all nodes in the instance group have the specified configuration applied. If they differ, the cluster is still in the process of applying the changes.
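You can also check convergence programmatically by comparing the current and desired fields in the response. The sketch below runs against a saved copy of the sample response; in practice you would fetch it with `aws sagemaker describe-cluster --cluster-name my-cluster` (the cluster and file names are placeholders):

```shell
# Write the sample DescribeCluster response to a file for the check below;
# in practice: aws sagemaker describe-cluster --cluster-name my-cluster > describe.json
cat > describe.json <<'EOF'
{
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "KubernetesConfig": {
            "CurrentLabels": {"env": "prod", "team": "ml-training", "gpu-type": "a100"},
            "DesiredLabels": {"env": "prod", "team": "ml-training", "gpu-type": "a100"},
            "CurrentTaints": [{"key": "gpu", "value": "true", "effect": "NoSchedule"}],
            "DesiredTaints": [{"key": "gpu", "value": "true", "effect": "NoSchedule"}]
        }
    }]
}
EOF

# Report whether each instance group has converged to its desired state.
python3 - <<'PY'
import json
for ig in json.load(open("describe.json"))["InstanceGroups"]:
    kc = ig["KubernetesConfig"]
    converged = (kc["CurrentLabels"] == kc["DesiredLabels"]
                 and kc["CurrentTaints"] == kc["DesiredTaints"])
    print(ig["InstanceGroupName"], "converged" if converged else "still reconciling")
PY
```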

### Checking individual node status
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-describe-node"></a>

For node-level details, use the `DescribeClusterNode` API to check the label and taint configuration of individual nodes. The following example shows the response structure:

```
{
    "NodeDetails": { 
        "InstanceId": "i-0123456789abcdef0",
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p4d.24xlarge",
        "InstanceStatus": {
            "Status": "Running",
            "Message": "Node is healthy"
        },
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://my-bucket/lifecycle-config.sh",
            "OnCreate": "on-create.sh"
        },
        "LaunchTime": 1699564800.0,
        "KubernetesConfig": {
            "CurrentLabels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100"
            },
            "DesiredLabels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100"
            },
            "CurrentTaints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }],
            "DesiredTaints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }]
        }
    }
}
```

Node-level monitoring is useful for troubleshooting when labels or taints are not applying correctly to specific nodes, or when you need to verify the configuration of a particular instance.

## Reserved prefixes
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-reserved-prefixes"></a>

Certain prefixes are reserved for system use and should not be used for custom labels or taints. The following prefixes are reserved:
+ `kubernetes.io/` - Reserved for Kubernetes core components
+ `k8s.io/` - Reserved for Kubernetes core components
+ `sagemaker.amazonaws.com/` - Reserved for SageMaker HyperPod
+ `eks.amazonaws.com/` - Reserved for Amazon EKS
+ `k8s.aws/` - Reserved for Amazon EKS
+ `karpenter.sh/` - Reserved for Karpenter autoscaling

Labels and taints with these prefixes are managed by system components and should not be overwritten with custom values.
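A small sketch of guarding against reserved prefixes before calling `CreateCluster` or `UpdateCluster`; the `check_key` helper is made up for illustration:

```shell
# Reserved prefixes that must not appear in custom label or taint keys.
RESERVED_PREFIXES="kubernetes.io/ k8s.io/ sagemaker.amazonaws.com/ eks.amazonaws.com/ k8s.aws/ karpenter.sh/"

# check_key: print "reserved" and fail if the key uses a reserved prefix.
check_key() {
  key="$1"
  for prefix in $RESERVED_PREFIXES; do
    case "$key" in
      "$prefix"*) echo "reserved: $key"; return 1 ;;
    esac
  done
  echo "ok: $key"
}

check_key "env"                              # custom key: allowed
check_key "kubernetes.io/hostname" || true   # reserved key: rejected
```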

# Checkpointless training in Amazon SageMaker HyperPod
<a name="sagemaker-eks-checkpointless"></a>

Checkpointless training on Amazon SageMaker HyperPod enables faster recovery from training infrastructure faults. The following documentation helps you get started with checkpointless training and fine-tuning for NeMo-supported models.

Checkpointless training has the following prerequisites:
+ [Getting started with Amazon EKS support in SageMaker HyperPod](sagemaker-hyperpod-eks-prerequisites.md)
+ [Installing the training operator](sagemaker-eks-operator-install.md). You must install v1.2.0 or above.

Checkpointless training on SageMaker HyperPod is built on top of the [NVIDIA NeMo Framework User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/core/exp_manager.html#experiment-manager). You can run checkpointless training with pre-created SageMaker HyperPod recipes. If you're familiar with NeMo, the process of using the checkpointless training recipes is similar. With minor changes, you can start training a model using checkpointless training features that enable you to recover quickly from training faults.

The following HyperPod recipes are pre-configured with checkpointless training optimizations. You can specify your data paths as part of the recipe and use the associated launch script to run training (see the quick start guide below):


| Model | Method | Size | Nodes | Instance | Accelerator | Recipe | Script | Tutorial | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| GPT OSS | Full finetune example | 120b | 16 | p5.48xlarge | GPU H100 | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_full_fine_tuning.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_full_fine_tuning.sh) | [link](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html) | 
| GPT OSS | LoRA-example | 120b | 2 | p5.48xlarge | GPU H100 | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_lora.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_lora.sh) | [link](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-peft.html) | 
| Llama3 | Pretrain example | 70b | 16 | p5.48xlarge | GPU H100 | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/training/llama/checkpointless_llama3_70b_pretrain.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/llama/run_checkpointless_llama3_70b_pretrain.sh) | [link](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-pretraining-llama3.html) | 
| Llama3 | LoRA-example | 70b | 2 | p5.48xlarge | GPU H100 | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/llama/checkpointless_llama3_70b_lora.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/llama/run_checkpointless_llama3_70b_lora.sh) | [link](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-peft-llama.html) | 

The following quick-start guide provides tutorials for using checkpointless training recipes:

**Getting started examples**
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Full Finetuning GPT OSS 120b](sagemaker-eks-checkpointless-recipes-finetune.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA GPT OSS 120b](sagemaker-eks-checkpointless-recipes-peft.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining Llama 3 70b](sagemaker-eks-checkpointless-recipes-pretraining-llama3.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA Llama 3 70b](sagemaker-eks-checkpointless-recipes-peft-llama.md)

If you’d like to pre-train or fine-tune custom models, see [Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining or Finetuning Custom Models](sagemaker-eks-checkpointless-recipes-custom.md).

To learn more about incorporating specific checkpointless training components, see [HyperPod checkpointless training features](sagemaker-eks-checkpointless-features.md).

# Amazon SageMaker HyperPod checkpointless training tutorials
<a name="sagemaker-eks-checkpointless-recipes"></a>

[HyperPod checkpointless training recipes](https://github.com/aws/sagemaker-hyperpod-checkpointless-training) are predefined job configurations with checkpointless training features enabled. Using these recipes makes it easier to get started with checkpointless training on HyperPod.

**Topics**
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Full Finetuning GPT OSS 120b](sagemaker-eks-checkpointless-recipes-finetune.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA GPT OSS 120b](sagemaker-eks-checkpointless-recipes-peft.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining Llama 3 70b](sagemaker-eks-checkpointless-recipes-pretraining-llama3.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA Llama 3 70b](sagemaker-eks-checkpointless-recipes-peft-llama.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining or Finetuning Custom Models](sagemaker-eks-checkpointless-recipes-custom.md)

# Tutorials - Amazon SageMaker HyperPod Checkpointless Full Finetuning GPT OSS 120b
<a name="sagemaker-eks-checkpointless-recipes-finetune"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-finetune-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [ Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2.0 or above)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ Pick a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [source](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection).
+ [Download the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and convert them to a [NeMo supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-finetune-recipes-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure your version of Python is greater than or equal to 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [ Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [ Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:

   1. Method 1: SageMaker HyperPod recipes method:

      ```
      # install SageMaker HyperPod Recipes.
      git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
      cd sagemaker-hyperpod-recipes
      pip3 install -r requirements.txt
      ```

   1. Method 2: kubectl with pre-defined job yaml method

      ```
      # install SageMaker HyperPod checkpointless training.
      git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
      cd sagemaker-hyperpod-checkpointless-training
      ```

You can now launch the checkpointless training recipe using either the NeMo-style launcher or kubectl.

## Launch training jobs with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-finetune-launcher"></a>

You can use the Amazon SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.

1. Update `launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_full_fine_tuning.sh`

   `your_container`: A Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   MODEL_NAME_OR_PATH="${MODEL_NAME_OR_PATH}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_full_fine_tuning \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.resume.restore_config.path="${MODEL_NAME_OR_PATH}" \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1 \
   ```

1. Launch the training job

   ```
   bash launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_full_fine_tuning.sh
   ```

After you've submitted the training job, you can use the following command to verify that you submitted it successfully.

```
kubectl get pods

NAME                             READY   STATUS             RESTARTS        AGE
gpt-oss-120b-worker-0             0/1    Running               0            36s
```

If the STATUS is Pending or ContainerCreating, run the following command to get more details:

```
kubectl describe pod <name of pod>
```

After the job STATUS changes to Running, you can examine the log by using the following command.

```
kubectl logs <name of pod>
```

The `STATUS` will turn to `Completed` when you run `kubectl get pods`.

## Launch the training job with kubectl with pre-defined yaml
<a name="sagemaker-eks-checkpointless-recipes-finetune-kubectl"></a>

Another option is to launch the training through kubectl with a pre-defined job yaml.

1. Update `examples/gpt_oss/launch/full_finetune_gpt_oss_120b_checkpointless_p5.yaml`:
   + image: A Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights in NeMo format that you downloaded in the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html#sagemaker-eks-checkpointless-recipes-finetune-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset stored in the shared storage.

1. Submit the job using kubectl with `full_finetune_gpt_oss_120b_checkpointless_p5.yaml`:

   ```
   kubectl apply -f examples/gpt_oss/launch/full_finetune_gpt_oss_120b_checkpointless_p5.yaml
   ```

After you've submitted the training job, you can use the following command to verify that you submitted it successfully.

```
kubectl get pods

NAME                             READY   STATUS             RESTARTS        AGE
gpt-oss-120b-worker-0             0/1    Running               0            36s
```

If the STATUS is Pending or ContainerCreating, run the following command to get more details:

```
kubectl describe pod <name of pod>
```

After the job STATUS changes to Running, you can examine the log by using the following command.

```
kubectl logs <name of pod>
```

The `STATUS` will turn to `Completed` when you run `kubectl get pods`.

# Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA GPT OSS 120b
<a name="sagemaker-eks-checkpointless-recipes-peft"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-peft-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [ Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2.0 or above)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ Pick a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [source](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection).
+ [Download the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and convert them to a [NeMo supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-peft-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure your version of Python is greater than or equal to 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [ Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [ Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:
   + SageMaker HyperPod recipes method:

     ```
     # install SageMaker HyperPod Recipes.
     git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
     cd sagemaker-hyperpod-recipes
     pip3 install -r requirements.txt
     ```
   + kubectl with pre-defined job yaml method

     ```
     # install SageMaker HyperPod checkpointless training.
     git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
     cd sagemaker-hyperpod-checkpointless-training
     ```

You can now launch the checkpointless training recipe using either the NeMo-style launcher or kubectl.

## Launch the training job with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-peft-recipes-launcher"></a>

You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.

1. Update `launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_lora.sh`

   `your_container`: A Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   MODEL_NAME_OR_PATH="${MODEL_NAME_OR_PATH}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_lora \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.resume.restore_config.path="${MODEL_NAME_OR_PATH}" \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1 \
   ```

1. Launch the training job

   ```
   bash launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_lora.sh
   ```

After you've submitted the training job, you can use the following command to verify that you submitted it successfully.

```
kubectl get pods

NAME                             READY   STATUS             RESTARTS        AGE
gpt-oss-120b-worker-0             0/1    Running               0            36s
```

If the STATUS is Pending or ContainerCreating, run the following command to get more details:

```
kubectl describe pod <name of pod>
```

After the job STATUS changes to Running, you can examine the log by using the following command.

```
kubectl logs <name of pod>
```

The `STATUS` will turn to `Completed` when you run `kubectl get pods`.

## Launch the training job with kubectl with pre-defined yaml
<a name="sagemaker-eks-checkpointless-recipes-peft-kubectl"></a>

Another option is to launch the training through kubectl with a pre-defined job yaml.

1. Update `examples/gpt_oss/launch/peft_gpt_oss_120b_checkpointless_p5.yaml`:
   + image: A Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights in NeMo format that you downloaded in the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-peft.html#sagemaker-eks-checkpointless-recipes-peft-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset stored in the shared storage.

1. Submit the job using kubectl with `peft_gpt_oss_120b_checkpointless_p5.yaml`:

   ```
   kubectl apply -f examples/gpt_oss/launch/peft_gpt_oss_120b_checkpointless_p5.yaml
   ```

After you've submitted the training job, you can use the following command to verify that you submitted it successfully.

```
kubectl get pods

NAME                                             READY   STATUS             RESTARTS        AGE
gpt-120b-lora-checkpointless-worker-0             0/1    Running               0            36s
```

If the STATUS is Pending or ContainerCreating, run the following command to get more details:

```
kubectl describe pod <name of pod>
```

After the job STATUS changes to Running, you can examine the log by using the following command.

```
kubectl logs <name of pod>
```

The `STATUS` will turn to `Completed` when you run `kubectl get pods`.

# Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining Llama 3 70b
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [ Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2.0 or above)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ Pick a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [source](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection).
+ [Download the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and convert them to a [NeMo supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure you're using Python greater than or equal to 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [ Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [ Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:

   1. Method 1: SageMaker HyperPod recipes method:

      ```
      # install SageMaker HyperPod Recipes.
      git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
      cd sagemaker-hyperpod-recipes
      pip3 install -r requirements.txt
      ```

   1. Method 2: kubectl with pre-defined job yaml method

      ```
      # install SageMaker HyperPod checkpointless training.
      git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
      cd sagemaker-hyperpod-checkpointless-training
      ```

You can now launch the checkpointless training recipe using either the NeMo-style launcher or kubectl.

## Method 1: Launch the training job with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-recipes-launcher"></a>

You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.

1. Update `launcher_scripts/llama/run_checkpointless_llama3_70b_pretrain.sh`

   Set `CONTAINER` to a Deep Learning Container image. To find the most recent release of the checkpointless training container, see [ checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=training/llama/checkpointless_llama3_70b_pretrain \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.data.global_batch_size=16 \
       recipes.data.micro_batch_size=4 \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1
   ```

1. Launch the training job

   ```
   bash launcher_scripts/llama/run_checkpointless_llama3_70b_pretrain.sh
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods
   
   NAME                   READY   STATUS    RESTARTS   AGE
   llama-3-70b-worker-0   0/1     Running   0          36s
   ```

1. If the STATUS is `Pending` or `ContainerCreating`, run the following command to get more details:

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job STATUS changes to Running, you can examine the log by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   The STATUS changes to `Completed` when the job finishes, as shown in the `kubectl get pods` output.

## Method 2: Launch the training job with kubectl with pre-defined yaml
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-kubectl"></a>

Another option is to launch the training through kubectl with a pre-defined job yaml.

1. Update the `examples/llama3/launch/pretrain_llama3_70b_checkpointless_p5.yaml`
   + `image`: A Deep Learning container. To find the most recent release of the checkpointless training container, see [ checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights, in NeMo format, that you downloaded in the [ Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html#sagemaker-eks-checkpointless-recipes-finetune-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset that is stored in the shared storage.

1. Submit the job using kubectl with `pretrain_llama3_70b_checkpointless_p5.yaml`

   ```
   kubectl apply -f examples/llama3/launch/pretrain_llama3_70b_checkpointless_p5.yaml
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods
   
   NAME                                      READY   STATUS    RESTARTS   AGE
   llama3-pretrain-checkpointless-worker-0   0/1     Running   0          36s
   ```

1. If the STATUS is `Pending` or `ContainerCreating`, run the following command to get more details:

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job STATUS changes to Running, you can examine the log by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   The STATUS changes to `Completed` when the job finishes, as shown in the `kubectl get pods` output.

# Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA Llama 3 70b
<a name="sagemaker-eks-checkpointless-recipes-peft-llama"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [ Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [ Set up HyperPod training operator (v1.2+)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ Pick a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [ recipes collection](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection).
+ [ Download the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and convert them to a [ NeMo supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure you're using Python greater than or equal to 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [ Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [ Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:

   1. Method 1: SageMaker HyperPod recipes method:

      ```
      # install SageMaker HyperPod Recipes.
      git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
      cd sagemaker-hyperpod-recipes
      pip3 install -r requirements.txt
      ```

   1. Method 2: kubectl with pre-defined job yaml method

      ```
      # install SageMaker HyperPod checkpointless training.
      git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
      cd sagemaker-hyperpod-checkpointless-training
      ```

You can now launch the checkpointless training recipe using either the NeMo-style recipes launcher or kubectl.

## Method 1: Launch the training job with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-recipes-launcher"></a>

You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.

1. Update `launcher_scripts/llama/run_checkpointless_llama3_70b_lora.sh`

   Set `CONTAINER` to a Deep Learning Container image. To find the most recent release of the checkpointless training container, see [ checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   MODEL_NAME_OR_PATH="${MODEL_NAME_OR_PATH}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=fine-tuning/llama/checkpointless_llama3_70b_lora \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.resume.restore_config.path="${MODEL_NAME_OR_PATH}" \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1
   ```

1. Launch the training job

   ```
   bash launcher_scripts/llama/run_checkpointless_llama3_70b_lora.sh
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods
   
   NAME                   READY   STATUS    RESTARTS   AGE
   llama-3-70b-worker-0   0/1     Running   0          36s
   ```

1. If the STATUS is `Pending` or `ContainerCreating`, run the following command to get more details:

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job STATUS changes to Running, you can examine the log by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   The STATUS changes to `Completed` when the job finishes, as shown in the `kubectl get pods` output.

## Method 2: Launch the training job with kubectl with pre-defined yaml
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-kubectl"></a>

Another option is to launch the training through kubectl with a pre-defined job yaml.

1. Update the `examples/llama3/launch/peft_llama3_70b_checkpointless_p5.yaml`
   + `image`: A Deep Learning container. To find the most recent release of the checkpointless training container, see [ checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights, in NeMo format, that you downloaded in the [ Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html#sagemaker-eks-checkpointless-recipes-finetune-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset that is stored in the shared storage.

1. Submit the job using kubectl with `peft_llama3_70b_checkpointless_p5.yaml`

   ```
   kubectl apply -f examples/llama3/launch/peft_llama3_70b_checkpointless_p5.yaml
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods
   
   NAME                                      READY   STATUS    RESTARTS   AGE
   llama3-70b-lora-checkpointless-worker-0   0/1     Running   0          36s
   ```

1. If the STATUS is `Pending` or `ContainerCreating`, run the following command to get more details:

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job STATUS changes to Running, you can examine the log by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   The STATUS changes to `Completed` when the job finishes, as shown in the `kubectl get pods` output.

# Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining or Finetuning Custom Models
<a name="sagemaker-eks-checkpointless-recipes-custom"></a>

The following sequence of steps is required to run checkpointless training with your custom model on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-custom-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [ Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [ Set up HyperPod training operator (v1.2+)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ [ Download the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and convert them to a [ NeMo supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-custom-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure you're using Python greater than or equal to 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [ Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies

   ```
   # install SageMaker HyperPod checkpointless training.
   git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
   cd sagemaker-hyperpod-checkpointless-training
   ```

## Checkpointless training modification instructions
<a name="sagemaker-eks-checkpointless-recipes-custom-modification-instructions"></a>

To incrementally adopt checkpointless training for custom models, follow the integration guide (here we use Llama 3 70b pretraining as an example), which involves:
+ Fast communicator creation
+ Memory-mapped dataloader (MMAP)
+ In-process & Checkpointless recovery

### Component 1: Fast communicator creation
<a name="sagemaker-eks-checkpointless-recipes-custom-component1"></a>

This optimization reduces the time needed to establish connections between the workers. No code changes are needed; you only set environment variables:

```
  # Enable Rootless features
  export HPCT_USE_ROOTLESS=1 && \
  sysctl -w net.ipv4.ip_local_port_range="20000 65535" && \

  hyperpodrun --nproc_per_node=8 \
              ...
              --inprocess-restart \
              ...
```

The full change can be found in the [ Llama 3 70B pretrain launch job config](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/llama3/launch/pretrain_llama3_70b_checkpointless_p5.yaml).

### Component 2: Memory-mapped dataloader (MMAP)
<a name="sagemaker-eks-checkpointless-recipes-custom-component2"></a>

MMAP uses caches to store prefetched data samples and enables training to start immediately, without waiting for data preprocessing. Adopting it requires minimal code changes: you wrap your existing dataloader.

```
data_module = MMAPDataModule(
  data_module=base_data_module,
  mmap_config=CacheResumeMMAPConfig(cache_dir=…)
)
```

### Components 3 and 4: In-process and checkpointless recovery
<a name="sagemaker-eks-checkpointless-recipes-custom-components3-4"></a>

This enables failure recovery without restarting training processes or loading from checkpoints. It requires additional code changes: updating the strategy and training config, and wrapping your existing main function.

```
@HPWrapper(
  health_check=CudaHealthCheck(),
  hp_api_factory=HPAgentK8sAPIFactory(),
  abort_timeout=60.0,
...)
def run_main(
  cfg,
  caller: Optional[HPCallWrapper] = None):
...


CheckpointlessMegatronStrategy(
  **self.cfg.strategy,
  ddp=self.ddp,
)
```

The full change can be found in the [ Llama 3 70B pretrain entry script](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/llama3/llama3_70b_pretrain_checkpointless.py) and the corresponding training config change can be found in the [ Llama 3 70B training config](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/llama3/config/llama3_70b_peft_checkpointless.yaml).

### Launch training
<a name="sagemaker-eks-checkpointless-recipes-custom-launch"></a>

You can now launch the checkpointless training using kubectl.

```
kubectl apply -f your_job_config.yaml
```

# HyperPod checkpointless training features
<a name="sagemaker-eks-checkpointless-features"></a>

See the following pages to learn about the training features in checkpointless training.

**Topics**
+ [Amazon SageMaker HyperPod checkpointless training repositories](#sagemaker-eks-checkpointless-repositories)
+ [Collective communication initialization improvements](sagemaker-eks-checkpointless-features-communication.md)
+ [Memory mapped dataloader](sagemaker-eks-checkpointless-features-mmap.md)
+ [In-process recovery and checkpointless training](sagemaker-eks-checkpointless-in-process-recovery.md)

## Amazon SageMaker HyperPod checkpointless training repositories
<a name="sagemaker-eks-checkpointless-repositories"></a>

[ HyperPod checkpointless training](https://github.com/aws/sagemaker-hyperpod-checkpointless-training) accelerates recovery from cluster faults in large-scale distributed training environments through framework-level optimizations. These optimizations are delivered via a base container image that includes enhanced NCCL initialization improvements, data loading optimizations, and in-process and checkpointless recovery components. The HyperPod checkpointless training package is built on this foundation.

Checkpointless training is enabled via three optimization tracks that run in concert:
+ **Communication initialization improvements (NCCL and Gloo)** - Eliminate communication bottlenecks by decentralizing rank peer and ring information (red box below).
+ **Data loading optimizations** - Reduce the time required to serve the first batch of data during restart operations (orange boxes below).
+ **Program restart overhead reduction** - Minimize restart costs and enable checkpointless replenishment through process recovery on healthy nodes (blue and green boxes below).

![\[Diagram of the three checkpointless training optimization tracks.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-optimization-tracks.png)


# Collective communication initialization improvements
<a name="sagemaker-eks-checkpointless-features-communication"></a>

NCCL and Gloo are fundamental communication libraries that enable collective operations (such as all-reduce and broadcast) across distributed training processes. However, traditional NCCL and Gloo initialization can create bottlenecks during fault recovery.

The standard recovery process requires all processes to connect to a centralized TCPStore and coordinate through a root process, introducing an expensive overhead that becomes particularly problematic during restarts. This centralized design creates three critical issues: coordination overhead from mandatory TCPStore connections, recovery delays as each restart must repeat the full initialization sequence, and a single point of failure in the root process itself. The result is expensive, centralized coordination every time training initializes or restarts.
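
As a rough conceptual illustration of the scaling difference between the two designs (this models connection counts only, not NCCL's actual protocol):

```
def max_connections_centralized(n_workers):
    # Centralized rendezvous: every worker opens a connection to the single
    # TCPStore/root, so the busiest endpoint handles n_workers connections.
    return n_workers

def max_connections_ring(n_workers):
    # Decentralized ring: each worker only talks to its two neighbors,
    # so per-endpoint load stays constant as the cluster grows.
    return 2 if n_workers > 2 else max(n_workers - 1, 0)
```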

HyperPod checkpointless training eliminates these coordination bottlenecks, enabling faster recovery from faults by making initialization "rootless" and "TCPStoreless."

## Rootless configurations
<a name="sagemaker-eks-checkpointless-features-communication-rootless-config"></a>

To enable Rootless, set the following environment variables.

```
export HPCT_USE_ROOTLESS=1
sysctl -w net.ipv4.ip_local_port_range="20000 65535"
```

`HPCT_USE_ROOTLESS`: 0 or 1. Use to turn rootless on or off.

`sysctl -w net.ipv4.ip_local_port_range="20000 65535"`: Sets the system port range.

See [the example](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/llama3/launch/pretrain_llama3_70b_checkpointless_p5.yaml#L111-L113) for enabling Rootless.
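
As a sketch of how code might branch on this flag (the helper names here are hypothetical; the real code paths live inside the checkpointless training container):

```
import os

def rootless_enabled(env=os.environ):
    # HPCT_USE_ROOTLESS=1 turns the rootless path on; anything else leaves it off.
    return env.get("HPCT_USE_ROOTLESS", "0") == "1"

def select_init_path(env=os.environ):
    # Hypothetical dispatcher: use the decentralized ("rootless")
    # initialization when the flag is set, otherwise fall back to the
    # standard TCPStore-based path.
    return "rootless" if rootless_enabled(env) else "tcpstore"
```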

## Rootless
<a name="sagemaker-eks-checkpointless-features-communication-rootless"></a>

HyperPod checkpointless training offers novel initialization methods, Rootless and TCPStoreless, for NCCL and Gloo process groups.

The implementation of these optimizations involves modifying NCCL, Gloo, and PyTorch:
+ Extending third-party library APIs to enable Rootless and Storeless NCCL and Gloo optimizations while maintaining backward compatibility
+ Updating process group backends to conditionally use optimized paths and handle in-process recovery issues
+ Bypassing expensive TCPStore creation at the PyTorch distributed layer while maintaining symmetric address patterns through global group counters

The following graph shows the architecture of the distributed training libraries and the changes made in checkpointless training.

![\[The following graph shows the architecture of the distributed training libraries and the changes made in checkpointless training.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-training-libraries.png)


### NCCL and Gloo
<a name="sagemaker-eks-checkpointless-features-communication-nccl-gloo"></a>

These are independent packages that perform the core functionality of collective communications. They provide key APIs, such as ncclCommInitRank, to initialize communication networks, manage the underlying resources, and perform collective communications. With custom changes in NCCL and Gloo, the Rootless and Storeless optimizations streamline initialization of the communication network (for example, by skipping the connection to the TCPStore). You can flexibly switch between the original and optimized code paths.

### PyTorch process group backend
<a name="sagemaker-eks-checkpointless-features-communication-pytorch"></a>

The process group backends, specifically ProcessGroupNCCL and ProcessGroupGloo, implement the ProcessGroup APIs by invoking the APIs of their corresponding underlying libraries. Since we extend the third-party libraries' APIs, we have to invoke them properly and make code path switches based on customers' configurations.

In addition to optimization code paths, we also change the process group backend to support in-process recovery.

# Memory mapped dataloader
<a name="sagemaker-eks-checkpointless-features-mmap"></a>

Another restart overhead stems from data loading: the training cluster remains idle while the dataloader initializes, downloads data from remote file systems, and processes it into batches.

To address this, we introduce the Memory Mapped Dataloader (MMAP), which caches prefetched batches in persistent memory, ensuring they remain available even after a fault-induced restart. This approach eliminates dataloader setup time and enables training to resume immediately using cached batches, while the dataloader concurrently reinitializes and fetches subsequent data in the background. The data cache resides on each rank that requires training data and maintains two types of batches: recently consumed batches that have been used for training, and prefetched batches ready for immediate use.

![\[This image illustrates the MMAP Dataloader, caches, and consumed batches.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-mmap-dataloader.png)


The MMAP dataloader offers the following two features:
+ **Data Prefetching** - Proactively fetches and caches data generated by the dataloader
+ **Persistent Caching** - Stores both consumed and prefetched batches in a temporary filesystem that survives process restarts

Using the cache, the training job will benefit from:
+ **Reduced Memory Footprint** - Leverages memory-mapped I/O to maintain a single shared copy of data in host CPU memory, eliminating redundant copies across GPU processes (e.g., reduces from 8 copies to 1 on a p5 instance with 8 GPUs)
+ **Faster Recovery** - Reduces Mean Time to Restart (MTTR) by enabling training to resume immediately from cached batches, eliminating the wait for dataloader reinitialization and first-batch generation
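
The reduced-memory-footprint benefit rests on standard memory-mapped I/O: processes that map the same file share one physical copy in the OS page cache. A minimal stdlib illustration (not the actual implementation):

```
import mmap
import os
import tempfile

def write_sample(path, payload):
    # Writer: materialize one copy of the data on a tmpfs-style filesystem.
    with open(path, "wb") as f:
        f.write(payload)

def read_via_mmap(path):
    # Reader: map the file instead of copying it into a private buffer.
    # Every process that maps the same file shares one physical copy in
    # the OS page cache.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return bytes(m)
```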

## MMAP configurations
<a name="sagemaker-eks-checkpointless-features-communication-mmap-config"></a>

To use MMAP, pass your original data module into `MMAPDataModule`:

```
data_module=MMAPDataModule(
    data_module=MY_DATA_MODULE(...),
    mmap_config=CacheResumeMMAPConfig(
        cache_dir=self.cfg.mmap.cache_dir,
        checkpoint_frequency=self.cfg.mmap.checkpoint_frequency),
)
```

`CacheResumeMMAPConfig`: MMAP Dataloader parameters control cache directory location, size limits, and data fetching delegation. By default, only TP rank 0 per node fetches data from the source, while other ranks in the same data replication group read from the shared cache, eliminating redundant transfers.

`MMAPDataModule`: It wraps the original data module and returns the mmap dataloader for both train and validation.

See [the example](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/gpt_oss/gpt_oss_120b_full_finetune_checkpointless.py#L101-L109) for enabling MMAP.
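
The default data-fetching delegation described above can be sketched as follows (hypothetical helper names):

```
def is_data_loading_rank(tp_rank):
    # Mirrors the documented default: only tensor-parallel rank 0 in each
    # group reads from the data source.
    return tp_rank == 0

def data_source_for(tp_rank):
    # Loading ranks fetch from the remote dataset; the other ranks in the
    # same data replication group read from the shared cache instead,
    # eliminating redundant transfers.
    return "remote_dataset" if is_data_loading_rank(tp_rank) else "shared_cache"
```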

## API reference
<a name="sagemaker-eks-checkpointless-mmap-reference"></a>

### CacheResumeMMAPConfig
<a name="sagemaker-eks-checkpointless-mmap-reference-cacheresume"></a>

```
class hyperpod_checkpointless_training.dataloader.config.CacheResumeMMAPConfig(
  cache_dir='/dev/shm/pdl_cache',
  prefetch_length=10,
  val_prefetch_length=10,
  lookback_length=2,
  checkpoint_frequency=None,
  model_parallel_group=None,
  enable_batch_encryption=False)
```

Configuration class for cache-resume memory-mapped (MMAP) dataloader functionality in HyperPod checkpointless training.

This configuration enables efficient data loading with caching and prefetching capabilities, allowing training to resume quickly after failures by maintaining cached data batches in memory-mapped files.

**Parameters**
+ **`cache_dir`** (str, optional) – Directory path for storing cached data batches. Default: "/dev/shm/pdl_cache"
+ **`prefetch_length`** (int, optional) – Number of batches to prefetch ahead during training. Default: 10
+ **`val_prefetch_length`** (int, optional) – Number of batches to prefetch ahead during validation. Default: 10
+ **`lookback_length`** (int, optional) – Number of previously used batches to keep in cache for potential reuse. Default: 2
+ **`checkpoint_frequency`** (int, optional) – Frequency of model checkpointing steps. Used for cache performance optimization. Default: None
+ **`model_parallel_group`** (object, optional) – Process group for model parallelism. If None, will be created automatically. Default: None
+ **`enable_batch_encryption`** (bool, optional) – Whether to enable encryption for cached batch data. Default: False

**Methods**

```
create(dataloader_init_callable,
    parallel_state_util,
    step,
    is_data_loading_rank,
    create_model_parallel_group_callable,
    name='Train',
    is_val=False,
    cached_len=0)
```

Creates and returns a configured MMAP dataloader instance.

**Parameters**
+ **`dataloader_init_callable`** (Callable) – Function to initialize the underlying dataloader
+ **`parallel_state_util`** (object) – Utility for managing parallel state across processes
+ **`step`** (int) – The data step to resume from during training
+ **`is_data_loading_rank`** (Callable) – Function that returns True if current rank should load data
+ **`create_model_parallel_group_callable`** (Callable) – Function to create model parallel process group
+ **`name`** (str, optional) – Name identifier for the dataloader. Default: "Train"
+ **`is_val`** (bool, optional) – Whether this is a validation dataloader. Default: False
+ **`cached_len`** (int, optional) – Length of cached data if resuming from existing cache. Default: 0

Returns `CacheResumePrefetchedDataLoader` or `CacheResumeReadDataLoader` – Configured MMAP dataloader instance

Raises `ValueError` if the step parameter is `None`.

**Example**

```
from hyperpod_checkpointless_training.dataloader.config import CacheResumeMMAPConfig

# Create configuration
config = CacheResumeMMAPConfig(
    cache_dir="/tmp/training_cache",
    prefetch_length=20,
    checkpoint_frequency=100,
    enable_batch_encryption=False
)

# Create dataloader
dataloader = config.create(
    dataloader_init_callable=my_dataloader_init,
    parallel_state_util=parallel_util,
    step=current_step,
    is_data_loading_rank=lambda: rank == 0,
    create_model_parallel_group_callable=create_mp_group,
    name="TrainingData"
)
```

**Notes**
+ The cache directory should have sufficient space and fast I/O performance (e.g., /dev/shm for in-memory storage).
+ Setting `checkpoint_frequency` improves cache performance by aligning cache management with model checkpointing
+ For validation dataloaders (`is_val=True`), the step is reset to 0 and cold start is forced
+ Different dataloader implementations are used based on whether the current rank is responsible for data loading

### MMAPDataModule
<a name="sagemaker-eks-checkpointless-mmap-reference-mmapdatamodule"></a>

```
class hyperpod_checkpointless_training.dataloader.mmap_data_module.MMAPDataModule(  
    data_module,  
    mmap_config,  
    parallel_state_util=MegatronParallelStateUtil(),  
    is_data_loading_rank=None)
```

A PyTorch Lightning DataModule wrapper that applies memory-mapped (MMAP) data loading capabilities to existing DataModules for checkpointless training.

This class wraps an existing PyTorch Lightning DataModule and enhances it with MMAP functionality, enabling efficient data caching and fast recovery during training failures. It maintains compatibility with the original DataModule interface while adding checkpointless training capabilities.

**Parameters**

`data_module` (pl.LightningDataModule)  
The underlying DataModule to wrap (e.g., LLMDataModule)

`mmap_config` (MMAPConfig)  
The MMAP configuration object that defines caching behavior and parameters

`parallel_state_util` (MegatronParallelStateUtil, optional)  
Utility for managing parallel state across distributed processes. Default: MegatronParallelStateUtil()

`is_data_loading_rank` (Callable, optional)  
Function that returns True if the current rank should load data. If None, defaults to `parallel_state_util.is_tp_0`. Default: None

**Attributes**

`global_step` (int)  
Current global training step, used for resuming from checkpoints

`cached_train_dl_len` (int)  
Cached length of the training dataloader

`cached_val_dl_len` (int)  
Cached length of the validation dataloader

**Methods**

```
setup(stage=None)
```

Setup the underlying data module for the specified training stage.

`stage` (str, optional)  
Stage of training ('fit', 'validate', 'test', or 'predict'). Default: None

```
train_dataloader()
```

Create the training DataLoader with MMAP wrapping.

*Returns:* DataLoader – MMAP-wrapped training DataLoader with caching and prefetching capabilities

```
val_dataloader()
```

Create the validation DataLoader with MMAP wrapping.

*Returns:* DataLoader – MMAP-wrapped validation DataLoader with caching capabilities

```
test_dataloader()
```

Create the test DataLoader if the underlying data module supports it.

*Returns:* DataLoader or None – Test DataLoader from the underlying data module, or None if not supported

```
predict_dataloader()
```

Create the predict DataLoader if the underlying data module supports it.

*Returns:* DataLoader or None – Predict DataLoader from the underlying data module, or None if not supported

```
load_checkpoint(checkpoint)
```

Load checkpoint information to resume training from a specific step.

checkpoint (dict)  
Checkpoint dictionary containing the 'global\_step' key

```
get_underlying_data_module()
```

Get the underlying wrapped data module.

*Returns:* pl.LightningDataModule – The original data module that was wrapped

```
state_dict()
```

Get the state dictionary of the MMAP DataModule for checkpointing.

*Returns:* dict – Dictionary containing cached dataloader lengths

```
load_state_dict(state_dict)
```

Load the state dictionary to restore MMAP DataModule state.

`state_dict` (dict)  
State dictionary to load

**Properties**

```
data_sampler
```

Expose the underlying data module's data sampler to NeMo framework.

*Returns:* object or None – The data sampler from the underlying data module

**Example**

```
from hyperpod_checkpointless_training.dataloader.mmap_data_module import MMAPDataModule  
from hyperpod_checkpointless_training.dataloader.config import CacheResumeMMAPConfig  
from my_project import MyLLMDataModule  

# Create MMAP configuration  
mmap_config = CacheResumeMMAPConfig(  
    cache_dir="/tmp/training_cache",  
    prefetch_length=20,  
    checkpoint_frequency=100  
)  

# Create original data module  
original_data_module = MyLLMDataModule(  
    data_path="/path/to/data",  
    batch_size=32  
)  

# Wrap with MMAP capabilities  
mmap_data_module = MMAPDataModule(  
    data_module=original_data_module,  
    mmap_config=mmap_config  
)  

# Use in PyTorch Lightning Trainer  
trainer = pl.Trainer()  
trainer.fit(model, data=mmap_data_module)  

# Resume from checkpoint  
checkpoint = {"global_step": 1000}  
mmap_data_module.load_checkpoint(checkpoint)
```

**Notes**
+ The wrapper delegates most attribute access to the underlying data module using `__getattr__`
+ Only data loading ranks actually initialize and use the underlying data module; other ranks use fake dataloaders
+ Cached dataloader lengths are maintained to optimize performance during training resumption

# In-process recovery and checkpointless training
<a name="sagemaker-eks-checkpointless-in-process-recovery"></a>

HyperPod checkpointless training uses model redundancy to enable fault-tolerant training. The core principle is that model and optimizer states are fully replicated across multiple node groups, with weight updates and optimizer state changes synchronously replicated within each group. When a failure occurs, healthy replicas complete their optimizer steps and transmit the updated model/optimizer states to recovering replicas.

This model redundancy-based approach enables several fault handling mechanisms:
+ **In-process recovery:** processes remain active despite faults, keeping all model and optimizer states in GPU memory with the latest values
+ **Graceful abort handling:** controlled aborts and resource cleanup for affected operations
+ **Code block re-execution:** re-running only the affected code segments within a Re-executable Code Block (RCB)
+ **Checkpointless recovery with no lost training progress:** since processes persist and states remain in memory, no training progress is lost; when a fault occurs training resumes from the previous step, as opposed to resuming from the last saved checkpoint

**Checkpointless Configurations**

The following snippet shows the core configuration for checkpointless training.

```
from hyperpod_checkpointless_training.inprocess.train_utils import wait_rank

wait_rank()

def main():
    @HPWrapper(
        health_check=CudaHealthCheck(),
        hp_api_factory=HPAgentK8sAPIFactory(),
        abort_timeout=60.0,
        checkpoint_manager=PEFTCheckpointManager(enable_offload=True),
        abort=CheckpointlessAbortManager.get_default_checkpointless_abort(),
        finalize=CheckpointlessFinalizeCleanup(),
    )
    def run_main(cfg, caller: Optional[HPCallWrapper] = None):
        ...
        trainer = Trainer(
            strategy=CheckpointlessMegatronStrategy(...,
                num_distributed_optimizer_instances=2),
            callbacks=[..., CheckpointlessCallback(...)],
        )
        trainer.fresume = resume
        trainer._checkpoint_connector = CheckpointlessCompatibleConnector(trainer)
        trainer.wrapper = caller
```
+ `wait_rank`: All ranks wait for rank information from the HyperPod training operator infrastructure.
+ `HPWrapper`: Python function wrapper that enables restart capabilities for a Re-executable Code Block (RCB). The implementation uses a context manager rather than a Python decorator because decorators cannot determine the number of RCBs to monitor at runtime.
+ `CudaHealthCheck`: Ensures the CUDA context for the current process is in a healthy state by synchronizing with the GPU. Uses the device specified by the LOCAL\_RANK environment variable, or defaults to the main thread's CUDA device if LOCAL\_RANK is not set.
+ `HPAgentK8sAPIFactory`: This API enables checkpointless training to query the training status of other pods in the Kubernetes training cluster. It also provides an infrastructure-level barrier that ensures all ranks successfully complete abort and restart operations before proceeding.
+ `CheckpointManager`: Manages in-memory checkpoints and peer-to-peer recovery for checkpointless fault tolerance. It has the following core responsibilities:
  + **In-Memory Checkpoint Management**: Saves and manages NeMo model checkpoints in memory for fast recovery without disk I/O during checkpointless recovery scenarios.
  + **Recovery Feasibility Validation**: Determines if checkpointless recovery is possible by validating global step consistency, rank health, and model state integrity.
  + **Peer-to-Peer Recovery Orchestration**: Coordinates checkpoint transfer between healthy and failed ranks using distributed communication for fast recovery.
  + **RNG State Management**: Preserves and restores random number generator states across Python, NumPy, PyTorch, and Megatron for deterministic recovery.
  + **[Optional] Checkpoint Offload**: Offloads the in-memory checkpoint to CPU if the GPU does not have enough memory capacity.
+ `PEFTCheckpointManager`: Extends `CheckpointManager` by also keeping the base model weights in memory for PEFT fine-tuning.
+ `CheckpointlessAbortManager`: Manages abort operations in a background thread when an error is encountered. By default, it aborts TransformerEngine, Checkpointing, TorchDistributed, and DataLoader. Users can register custom abort handlers as needed. After the abort completes, all communication must cease and all processes and threads must terminate to prevent resource leaks.
+ `CheckpointlessFinalizeCleanup`: Handles final cleanup operations in the main thread for components that cannot be safely aborted or cleaned up in the background thread.
+ `CheckpointlessMegatronStrategy`: This inherits from the `MegatronStrategy` in NeMo. Note that checkpointless training requires `num_distributed_optimizer_instances` to be at least 2 so that there is optimizer replication. The strategy also takes care of essential attribute registration and process group initialization (e.g., rootless initialization).
+ `CheckpointlessCallback`: Lightning callback that integrates NeMo training with checkpointless training's fault tolerance system. It has the following core responsibilities:
  + **Training Step Lifecycle Management**: Tracks training progress and coordinates with ParameterUpdateLock to enable/disable checkpointless recovery based on training state (first step vs subsequent steps).
  + **Checkpoint State Coordination**: Manages in-memory PEFT base model checkpoint saving/restoring.
+ `CheckpointlessCompatibleConnector`: A PTL `CheckpointConnector` that attempts to pre-load the checkpoint to memory, with the source determined in this priority:
  + try checkpointless recovery
  + if checkpointless recovery returns None, fall back to `parent.resume_start()`
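
This fallback order can be sketched as follows; `SketchConnector`, `ParentConnector`, and `FakeManager` are simplified stand-ins for illustration, not the real Lightning or HyperPod classes:

```python
class ParentConnector:
    """Stand-in for the PTL CheckpointConnector base class (disk-based resume)."""
    def resume_start(self):
        return {"source": "disk", "global_step": 900}

class SketchConnector(ParentConnector):
    """Mirrors the documented priority: in-memory recovery first, then disk."""
    def __init__(self, checkpoint_manager):
        self.checkpoint_manager = checkpoint_manager

    def resume_start(self):
        # 1) try checkpointless recovery (in-memory, from peer ranks)
        ckpt = self.checkpoint_manager.try_checkpointless_load(trainer=None)
        if ckpt is not None:
            return ckpt
        # 2) checkpointless returned None: fall back to the parent's resume path
        return super().resume_start()

class FakeManager:
    """Illustrative checkpoint manager that may or may not hold an in-memory checkpoint."""
    def __init__(self, ckpt):
        self._ckpt = ckpt
    def try_checkpointless_load(self, trainer):
        return self._ckpt

conn_mem = SketchConnector(FakeManager({"source": "memory", "global_step": 1000}))
conn_disk = SketchConnector(FakeManager(None))
print(conn_mem.resume_start()["source"], conn_disk.resume_start()["source"])
```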

See [the example](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/gpt_oss/gpt_oss_120b_full_finetune.py) to learn how to add checkpointless training features to your code.

**Concepts**

This section introduces checkpointless training concepts. Checkpointless training on Amazon SageMaker HyperPod supports in-process recovery. The API interface follows a format similar to the NVRx APIs.

**Concept - Re-Executable Code Block (RCB)**

When a failure occurs, healthy processes remain alive, but a portion of the code must be re-executed to recover the training states and Python stacks. A Re-executable Code Block (RCB) is a specific code segment that re-runs during failure recovery. In the configuration snippet shown earlier, the RCB encompasses the entire training script (i.e., everything under `main()`), meaning that each failure recovery restarts the training script while preserving the in-memory model and optimizer states.
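
The key property of an RCB — process-local memory survives while the block itself re-runs — can be illustrated with a toy retry loop (the `run_rcb` helper below is purely illustrative; the real restart logic lives in `HPWrapper`):

```python
# Toy illustration of the RCB idea: the process and its in-memory state
# survive a fault, and only the wrapped code block is re-executed.
model_state = {"step": 0}          # stays in memory across re-executions
_fault_injected = {"done": False}  # inject exactly one simulated fault

def training_block():
    for _ in range(5):
        model_state["step"] += 1
        if model_state["step"] == 3 and not _fault_injected["done"]:
            _fault_injected["done"] = True
            raise RuntimeError("simulated fault")

def run_rcb(fn, max_restarts=3):
    """Re-run the code block on failure; in-memory state is preserved."""
    for attempt in range(max_restarts):
        try:
            fn()
            return attempt  # number of restarts that were needed
        except RuntimeError:
            continue        # model_state survives; re-execute the block
    raise RuntimeError("exceeded restart budget")

restarts_used = run_rcb(training_block)
print(model_state["step"], restarts_used)
```

The fault interrupts the block at step 3, but the step counter is not lost: the re-executed block continues accumulating on the surviving in-memory state.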

**Concept - Faults control**

A fault controller module receives notifications when failures occur during checkpointless training. This fault controller includes the following components:
+ **Fault detection module:** Receives infrastructure fault notifications
+ **RCB definition APIs:** Enables users to define the re-executable code block (RCB) in their code
+ **Restart module:** Terminates the RCB, cleans up resources, and restarts the RCB

![\[This image illustrates how a fault controller module receives notifications when failure occurs during checkpointless training.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-fault-controller-module.png)


**Concept - Model redundancy**

Large model training usually requires a large enough data parallel size to train models efficiently. In traditional data parallelism like PyTorch DDP and Horovod, the model is fully replicated. More advanced sharded data parallelism techniques like DeepSpeed ZeRO optimizer and FSDP also support hybrid sharding mode, which allows sharding the model/optimizer states within the sharding group and fully replicating across replication groups. NeMo also has this hybrid sharding feature through the argument `num_distributed_optimizer_instances`, which allows redundancy.

However, adding redundancy indicates that the model will not be fully sharded across the entire cluster, resulting in higher device memory usage. The amount of redundant memory will vary depending on the specific model sharding techniques implemented by the user. The low-precision model weights, gradients, and activation memory will not be affected, since they are sharded through model parallelism. The high-precision master model weights/gradients and optimizer states will be affected. Adding one redundant model replica increases device memory usage by roughly the equivalent of one DCP checkpoint size.

Hybrid sharding breaks the collectives across the entire DP groups into relatively smaller collectives. Previously there was a reduce-scatter and an all-gather across the entire DP group. After the hybrid sharding, the reduce-scatter is only running inside each model replica, and there will be an all-reduce across model replica groups. The all-gather is also running inside each model replica. As a result, the entire communication volume remains roughly unchanged, but collectives are running with smaller groups, so we expect better latency.
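
To make the redundancy cost concrete, here is a back-of-envelope estimate under common mixed-precision assumptions: bf16 weights, gradients, and activations sharded by model parallelism are unaffected, while the fp32 master weights plus the two Adam moments (roughly 12 bytes per parameter) are what gets replicated. The function and byte counts below are illustrative assumptions, not measured values or part of the HyperPod API.

```python
def redundant_bytes_per_gpu(num_params, dp_size, num_optimizer_instances):
    """Extra device memory per GPU introduced by hybrid sharding redundancy."""
    # fp32 master weights (4 B) + Adam exp_avg (4 B) + exp_avg_sq (4 B)
    high_precision_bytes = num_params * 12
    # Fully sharded across DP: each GPU holds 1/dp_size of these states.
    fully_sharded = high_precision_bytes / dp_size
    # Hybrid sharding: states are sharded only within each optimizer-instance
    # group of size dp_size / num_optimizer_instances, so each GPU holds more.
    group_size = dp_size // num_optimizer_instances
    hybrid = high_precision_bytes / group_size
    return hybrid - fully_sharded

# 8B-parameter model, DP=16, two optimizer instances (the minimum required
# for checkpointless training):
extra = redundant_bytes_per_gpu(8e9, 16, 2)
print(f"{extra / 2**30:.1f} GiB extra per GPU")
```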

**Concept - Failure and Restart Types**

The following table records different failure types and associated recovery mechanisms. Checkpointless training first attempts failure recovery via an in-process recovery, followed by a process-level restart. It falls back to a job-level restart only in the event of a catastrophic failure (e.g., multiple nodes fail at the same time).


| Failure Type | Cause | Recovery Type | Recovery Mechanism | 
| --- | --- | --- | --- | 
| In-process failure | Code-level errors, exceptions | In-Process Recovery (IPR) | Rerun RCB within existing process; healthy processes remain active | 
| Process restart failure | Corrupted CUDA context, terminated process | Process Level Restart (PLR) | SageMaker HyperPod training operator restarts processes; skips K8s pod restart | 
| Node replacement failure | Permanent node/GPU hardware failure | Job Level Restart (JLR) | Replace failed node; restart entire training job | 

**Concept - Atomic lock protection for optimizer step**

Model execution is divided into three phases: forward propagation, backward propagation, and optimizer step. Recovery behavior varies based on the failure timing:
+ **Forward/backward propagation:** Roll back to the beginning of the current training step and broadcast model states to replacement node(s)
+ **Optimizer step:** Allow healthy replicas to complete the step under lock protection, then broadcast the updated model states to replacement node(s)

This strategy ensures completed optimizer updates are never discarded, helping reduce fault recovery time.
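
A minimal sketch of this lock protection, using a plain `threading.Lock` as a stand-in for the `ParameterUpdateLock` mentioned in the CheckpointManager attributes; the real coordination is distributed, so this toy only illustrates the ordering guarantee:

```python
import threading

param_update_lock = threading.Lock()
state = {"step": 0}

def optimizer_step():
    # The optimizer update runs under the lock, so it is atomic with
    # respect to fault handling: it either fully completes or has not begun.
    with param_update_lock:
        state["step"] += 1

def on_fault():
    # Fault handler: acquiring the lock blocks until any in-flight optimizer
    # step has completed, so a completed update is never discarded. The
    # consistent state can then be broadcast to replacement ranks.
    with param_update_lock:
        return state["step"]

optimizer_step()
print(on_fault())
```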

![\[This image illustrates how a failure is handled depending on whether it occurs before or after the optimizer step.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-optimizer.png)


## Checkpointless Training Flow Diagram
<a name="sagemaker-eks-checkpointless-training-flow"></a>

![\[This diagram illustrates the checkpointless training flow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-training-flow.png)


The following steps outline the failure detection and checkpointless recovery process:

1. Training loop starts

1. Fault occurs

1. Evaluate checkpointless resume feasibility

1. Check if it is feasible to do checkpointless resume
   + If feasible, attempt checkpointless resume
     + If resume fails, fall back to checkpoint loading from storage
     + If resume succeeds, training continues from recovered state
   + If not feasible, fall back to checkpoint loading from storage

1. Clean up resources - abort all process groups and backends and free resources in preparation for restart.

1. Resume training loop - a new training loop begins, and the process returns to step 1.
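
The steps above can be condensed into a small control-flow sketch; the function and event names below are placeholders for illustration, not the real API:

```python
def recovery_flow(feasible, checkpointless_succeeds):
    """Placeholder sketch of the recovery decision flow described above."""
    events = ["fault"]
    if feasible and checkpointless_succeeds:
        # checkpointless resume: continue from the in-memory recovered state
        events.append("checkpointless-resume")
    else:
        # not feasible, or the resume attempt failed: load from storage
        events.append("load-from-storage")
    events.append("cleanup")               # abort process groups, free resources
    events.append("resume-training-loop")  # back to step 1
    return events

print(recovery_flow(True, True))
print(recovery_flow(True, False))
print(recovery_flow(False, True))
```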

## API reference
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference"></a>

### wait\_rank
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-wait_rank"></a>

```
hyperpod_checkpointless_training.inprocess.train_utils.wait_rank()
```

Waits for and retrieves rank information from HyperPod, then updates the current process environment with distributed training variables.

This function obtains the correct rank assignment and environment variables for distributed training. It ensures that each process gets the appropriate configuration for its role in the distributed training job.

**Parameters**

None

**Returns**

**None**

**Behavior**
+ **Process Check**: Skips execution if called from a subprocess (only runs in MainProcess)
+ **Environment Retrieval**: Gets current `RANK` and `WORLD_SIZE` from environment variables
+ **HyperPod Communication**: Calls `hyperpod_wait_rank_info()` to retrieve rank information from HyperPod
+ **Environment Update**: Updates the current process environment with worker-specific environment variables received from HyperPod

**Environment Variables**

The function reads the following environment variables:
+ **RANK** (*int*) – Current process rank (default: -1 if not set)
+ **WORLD\_SIZE** (*int*) – Total number of processes in the distributed job (default: 0 if not set)

**Raises**
+ **AssertionError** – If the response from HyperPod is not in the expected format or if required fields are missing

**Example**

```
from hyperpod_checkpointless_training.inprocess.train_utils import wait_rank  

# Call before initializing distributed training  
wait_rank()  

# Now environment variables are properly set for this rank  
import torch.distributed as dist  
dist.init_process_group(backend='nccl')
```

**Notes**
+ Only executes in the main process; subprocess calls are automatically skipped
+ The function blocks until HyperPod provides the rank information

### HPWrapper
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-HPWrapper"></a>

```
class hyperpod_checkpointless_training.inprocess.wrap.HPWrapper(  
    *,  
    abort=Compose(HPAbortTorchDistributed()),  
    finalize=None,  
    health_check=None,  
    hp_api_factory=None,  
    abort_timeout=None,  
    enabled=True,  
    trace_file_path=None,  
    async_raise_before_abort=True,  
    early_abort_communicator=False,  
    checkpoint_manager=None,  
    check_memory_status=True)
```

*Python function wrapper that enables restart capabilities for a Re-executable Code Block (RCB) in HyperPod checkpointless training.*

*This wrapper provides fault tolerance and automatic recovery capabilities by monitoring training execution and coordinating restarts across distributed processes when failures occur. It uses a context manager approach rather than a decorator to maintain global resources throughout the training lifecycle.*

**Parameters**
+ **abort** (*Abort*, *optional*) – Asynchronously aborts execution when failures are detected. Default: `Compose(HPAbortTorchDistributed())`
+ **finalize** (*Finalize*, *optional*) – Rank-local finalize handler executed during restart. Default: `None`
+ **health\_check** (*HealthCheck*, *optional*) – Rank-local health check executed during restart. Default: `None`
+ **hp\_api\_factory** (*Callable*, *optional*) – Factory function for creating a HyperPod API to interact with HyperPod. Default: `None`
+ **abort\_timeout** (*float*, *optional*) – Timeout for abort call in fault controlling thread. Default: `None`
+ **enabled** (*bool*, *optional*) – Enables the wrapper functionality. When `False`, the wrapper becomes a pass-through. Default: `True`
+ **trace\_file\_path** (*str*, *optional*) – Path to the trace file for VizTracer profiling. Default: `None`
+ **async\_raise\_before\_abort** (*bool*, *optional*) – Enable raise before abort in fault controlling thread. Default: `True`
+ **early\_abort\_communicator** (*bool*, *optional*) – Abort communicator (NCCL/Gloo) before aborting dataloader. Default: `False`
+ **checkpoint\_manager** (*Any*, *optional*) – Manager for handling checkpoints during recovery. Default: `None`
+ **check\_memory\_status** (*bool*, *optional*) – Enable memory status checking and logging. Default: `True`

**Methods**

```
def __call__(self, fn)
```

*Wraps a function to enable restart capabilities.*

**Parameters:**
+ **fn** (*Callable*) – The function to wrap with restart capabilities

**Returns:**
+ **Callable** – Wrapped function with restart capabilities, or original function if disabled

**Example**

```
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager import CheckpointManager  
from hyperpod_checkpointless_training.nemo_plugins.patches import patch_megatron_optimizer  
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_connector import CheckpointlessCompatibleConnector  
from hyperpod_checkpointless_training.inprocess.train_utils import HPAgentK8sAPIFactory  
from hyperpod_checkpointless_training.inprocess.abort import CheckpointlessFinalizeCleanup, CheckpointlessAbortManager   
      
@HPWrapper(  
    health_check=CudaHealthCheck(),  
    hp_api_factory=HPAgentK8sAPIFactory(),  
    abort_timeout=60.0,  
    checkpoint_manager=CheckpointManager(enable_offload=False),  
    abort=CheckpointlessAbortManager.get_default_checkpointless_abort(),  
    finalize=CheckpointlessFinalizeCleanup(),  
)  
def training_function():  
    # Your training code here  
    pass
```

**Notes**
+ The wrapper requires `torch.distributed` to be available
+ When `enabled=False`, the wrapper becomes a pass-through and returns the original function unchanged
+ The wrapper maintains global resources like monitoring threads throughout the training lifecycle
+ Supports VizTracer profiling when `trace_file_path` is provided
+ Integrates with HyperPod for coordinated fault handling across distributed training

### HPCallWrapper
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-HPCallWrapper"></a>

```
class hyperpod_checkpointless_training.inprocess.wrap.HPCallWrapper(wrapper)
```

Monitors and manages the state of a Re-executable Code Block (RCB) during execution.

This class handles the lifecycle of RCB execution, including failure detection, coordination with other ranks for restarts, and cleanup operations. It manages distributed synchronization and ensures consistent recovery across all training processes.

**Parameters**
+ **wrapper** (*HPWrapper*) – The parent wrapper containing global in-process recovery settings

**Attributes**
+ **step\_upon\_restart** (*int*) – Counter that tracks steps since the last restart, used for determining restart strategy

**Methods**

```
def initialize_barrier()
```

Wait for HyperPod barrier synchronization after encountering an exception from RCB.

```
def start_hp_fault_handling_thread()
```

Start the fault handling thread for monitoring and coordinating failures.

```
def handle_fn_exception(call_ex)
```

Process exceptions from the execution function or RCB.

**Parameters:**
+ **call\_ex** (*Exception*) – Exception from the monitoring function

```
def restart(term_ex)
```

Execute restart handler including finalization, garbage collection, and health checks.

**Parameters:**
+ **term\_ex** (*RankShouldRestart*) – Termination exception triggering the restart

```
def launch(fn, *a, **kw)
```

*Execute the RCB with proper exception handling.*

**Parameters:**
+ **fn** (*Callable*) – Function to be executed
+ **a** – Function arguments
+ **kw** – Function keyword arguments

```
def run(fn, a, kw)
```

Main execution loop that handles restarts and barrier synchronization.

**Parameters:**
+ **fn** (*Callable*) – Function to be executed
+ **a** – Function arguments
+ **kw** – Function keyword arguments

```
def shutdown()
```

Shutdown fault handling and monitoring threads.

**Notes**
+ Automatically handles `RankShouldRestart` exceptions for coordinated recovery
+ Manages memory tracking, aborts, and garbage collection during restarts
+ Supports both in-process recovery and PLR (Process-Level Restart) strategies based on failure timing

### CudaHealthCheck
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-cudahealthcheck"></a>

```
class hyperpod_checkpointless_training.inprocess.health_check.CudaHealthCheck(timeout=datetime.timedelta(seconds=30))
```

Ensures that the CUDA context for the current process is in a healthy state during checkpointless training recovery.

This health check synchronizes with the GPU to verify that the CUDA context is not corrupted after a training failure. It performs GPU synchronization operations to detect any issues that might prevent successful training resumption. The health check is executed after distributed groups are destroyed and finalization is complete.

**Parameters**
+ **timeout** (*datetime.timedelta*, *optional*) – Timeout duration for GPU synchronization operations. Default: `datetime.timedelta(seconds=30)`

**Methods**

```
__call__(state, train_ex=None)
```

Execute the CUDA health check to verify GPU context integrity.

**Parameters:**
+ **state** (*HPState*) – Current HyperPod state containing rank and distributed information
+ **train\_ex** (*Exception*, *optional*) – The original training exception that triggered the restart. Default: `None`

**Returns:**
+ **tuple** – A tuple containing `(state, train_ex)` unchanged if health check passes

**Raises:**
+ **TimeoutError** – If GPU synchronization times out, indicating a potentially corrupted CUDA context

**State Preservation**: Returns the original state and exception unchanged if all checks pass

**Example**

```
import datetime  
from hyperpod_checkpointless_training.inprocess.health_check import CudaHealthCheck  
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
  
# Create CUDA health check with custom timeout  
cuda_health_check = CudaHealthCheck(  
    timeout=datetime.timedelta(seconds=60)  
)  
  
# Use with HPWrapper for fault-tolerant training  
@HPWrapper(  
    health_check=cuda_health_check,  
    enabled=True  
)  
def training_function():  
    # Your training code here  
    pass
```

**Notes**
+ Uses threading to implement timeout protection for GPU synchronization
+ Designed to detect corrupted CUDA contexts that could prevent successful training resumption
+ Should be used as part of the fault tolerance pipeline in distributed training scenarios

### HPAgentK8sAPIFactory
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-HPAgentK8sAPIFactory"></a>

```
class hyperpod_checkpointless_training.inprocess.train_utils.HPAgentK8sAPIFactory()
```

Factory class for creating HPAgentK8sAPI instances that communicate with HyperPod infrastructure for distributed training coordination.

This factory provides a standardized way to create and configure HPAgentK8sAPI objects that handle communication between training processes and the HyperPod control plane. It encapsulates the creation of the underlying socket client and API instance, ensuring consistent configuration across different parts of the training system.

**Methods**

```
__call__()
```

Create and return an HPAgentK8sAPI instance configured for HyperPod communication.

**Returns:**
+ **HPAgentK8sAPI** – Configured API instance for communicating with HyperPod infrastructure

**Example**

```
from hyperpod_checkpointless_training.inprocess.train_utils import HPAgentK8sAPIFactory  
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
from hyperpod_checkpointless_training.inprocess.health_check import CudaHealthCheck  
  
# Create the factory  
hp_api_factory = HPAgentK8sAPIFactory()  
  
# Use with HPWrapper for fault-tolerant training  
hp_wrapper = HPWrapper(  
    hp_api_factory=hp_api_factory,  
    health_check=CudaHealthCheck(),  
    abort_timeout=60.0,  
    enabled=True  
)  
  
@hp_wrapper  
def training_function():  
    # Your distributed training code here  
    pass
```

**Notes**
+ Designed to work seamlessly with HyperPod's Kubernetes-based infrastructure. It is essential for coordinated fault handling and recovery in distributed training scenarios

### CheckpointManager
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointManager"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager.CheckpointManager(  
    enable_checksum=False,  
    enable_offload=False)
```

Manages in-memory checkpoints and peer-to-peer recovery for checkpointless fault tolerance in distributed training.

This class provides the core functionality for HyperPod checkpointless training by managing NeMo model checkpoints in memory, validating recovery feasibility, and orchestrating peer-to-peer checkpoint transfer between healthy and failed ranks. It eliminates the need for disk I/O during recovery, significantly reducing mean time to recovery (MTTR).

**Parameters**
+ **enable\_checksum** (*bool*, *optional*) – Enable model state checksum validation for integrity checks during recovery. Default: `False`
+ **enable\_offload** (*bool*, *optional*) – Enable checkpoint offloading from GPU to CPU memory to reduce GPU memory usage. Default: `False`

**Attributes**
+ **global\_step** (*int* or *None*) – Current training step associated with the saved checkpoint
+ **rng\_states** (*list* or *None*) – Stored random number generator states for deterministic recovery
+ **checksum\_manager** (*MemoryChecksumManager*) – Manager for model state checksum validation
+ **parameter\_update\_lock** (*ParameterUpdateLock*) – Lock for coordinating parameter updates during recovery

**Methods**

```
save_checkpoint(trainer)
```

Save NeMo model checkpoint in memory for potential checkpointless recovery.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Notes:**
+ Called by CheckpointlessCallback at batch end or during exception handling
+ Creates recovery points without disk I/O overhead
+ Stores complete model, optimizer, and scheduler states

```
delete_checkpoint()
```

Delete the in-memory checkpoint and perform cleanup operations.

**Notes:**
+ Clears checkpoint data, RNG states, and cached tensors
+ Performs garbage collection and CUDA cache cleanup
+ Called after successful recovery or when checkpoint is no longer needed

```
try_checkpointless_load(trainer)
```

Attempt checkpointless recovery by loading state from peer ranks.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Returns:**
+ **dict** or **None** – Restored checkpoint if successful, None if fallback to disk needed

**Notes:**
+ Main entry point for checkpointless recovery
+ Validates recovery feasibility before attempting P2P transfer
+ Always cleans up in-memory checkpoints after recovery attempt

```
checkpointless_recovery_feasible(trainer, include_checksum_verification=True)
```

Determine if checkpointless recovery is possible for the current failure scenario.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance
+ **include\_checksum\_verification** (*bool*, *optional*) – Whether to include checksum validation. Default: `True`

**Returns:**
+ **bool** – True if checkpointless recovery is feasible, False otherwise

**Validation Criteria:**
+ Global step consistency across healthy ranks
+ Sufficient healthy replicas available for recovery
+ Model state checksum integrity (if enabled)
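
These validation criteria can be simulated locally; `recovery_feasible` below is an illustrative stand-in for the documented method, with `None` marking a failed rank:

```python
def recovery_feasible(rank_steps, healthy_replicas, checksums_ok=True):
    """Illustrative feasibility check mirroring the criteria above."""
    healthy = [s for s in rank_steps if s is not None]  # None = failed rank
    if not healthy:
        return False
    # Global step consistency: all healthy ranks must agree on the step.
    step_consistent = len(set(healthy)) == 1
    # Sufficient healthy replicas and (optionally) checksum integrity.
    return step_consistent and healthy_replicas >= 1 and checksums_ok

print(recovery_feasible([1000, 1000, None, 1000], healthy_replicas=1))  # feasible
print(recovery_feasible([1000, 999, None, 1000], healthy_replicas=1))   # step mismatch
print(recovery_feasible([1000, 1000, 1000, 1000], healthy_replicas=0))  # no healthy replica
```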

```
store_rng_states()
```

Store all random number generator states for deterministic recovery.

**Notes:**
+ Captures Python, NumPy, PyTorch CPU/GPU, and Megatron RNG states
+ Essential for maintaining training determinism after recovery

```
load_rng_states()
```

Restore all RNG states for deterministic recovery continuation.

**Notes:**
+ Restores all previously stored RNG states
+ Ensures training continues with identical random sequences
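
The store/restore round-trip these two methods provide can be demonstrated with Python's built-in `random` module alone (the real manager also captures NumPy, PyTorch CPU/GPU, and Megatron generators):

```python
import random

random.seed(1234)
saved_state = random.getstate()                 # analogous to store_rng_states()

expected = [random.random() for _ in range(3)]  # draws before the fault

# ... fault and recovery happen here; the generator state is then restored ...

random.setstate(saved_state)                    # analogous to load_rng_states()
replayed = [random.random() for _ in range(3)]

print(expected == replayed)  # True: training continues with identical sequences
```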

```
maybe_offload_checkpoint()
```

Offload checkpoint from GPU to CPU memory if offload is enabled.

**Notes:**
+ Reduces GPU memory usage for large models
+ Only executes if `enable_offload=True`
+ Maintains checkpoint accessibility for recovery

**Example**

```
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager import CheckpointManager  
# Use with HPWrapper for complete fault tolerance  
@HPWrapper(  
    checkpoint_manager=CheckpointManager(),  
    enabled=True  
)  
def training_function():  
    # Training code with automatic checkpointless recovery  
    pass
```

**Validation**: Verifies checkpoint integrity using checksums (if enabled)

**Notes**
+ Uses distributed communication primitives for efficient P2P transfer
+ Automatically handles tensor dtype conversions and device placement
+ **MemoryChecksumManager** – Handles model state integrity validation

### PEFTCheckpointManager
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-PEFTCheckpointManager"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager.PEFTCheckpointManager(  
    *args,  
    **kwargs)
```

Manages checkpoints for PEFT (Parameter-Efficient Fine-Tuning) with separate base and adapter handling for optimized checkpointless recovery.

This specialized checkpoint manager extends CheckpointManager to optimize PEFT workflows by separating base model weights from adapter parameters.

**Parameters**

Inherits all parameters from **CheckpointManager**:
+ **enable\_checksum** (*bool*, *optional*) – Enable model state checksum validation. Default: `False`
+ **enable\_offload** (*bool*, *optional*) – Enable checkpoint offloading to CPU memory. Default: `False`

**Additional Attributes**
+ **params\_to\_save** (*set*) – Set of parameter names that should be saved as adapter parameters
+ **base\_model\_weights** (*dict* or *None*) – Cached base model weights, saved once and reused
+ **base\_model\_keys\_to\_extract** (*list* or *None*) – Keys for extracting base model tensors during P2P transfer

**Methods**

```
maybe_save_base_model(trainer)
```

Save base model weights once, filtering out adapter parameters.

**Parameters:**
+ **trainer** (*pytorch_lightning.Trainer*) – PyTorch Lightning trainer instance

**Notes:**
+ Only saves base model weights on first call; subsequent calls are no-ops
+ Filters out adapter parameters to store only frozen base model weights
+ Base model weights are preserved across multiple training sessions
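The separation of frozen base weights from adapter weights can be pictured with a short, self-contained sketch. `split_peft_state` and its predicate are illustrative names under assumed semantics, not part of the library's API:

```python
# Hypothetical sketch: split a PEFT state dict into frozen base weights
# (saved once) and adapter weights (saved on every checkpoint).
def split_peft_state(state_dict, is_adapter_key):
    base, adapter = {}, {}
    for key, tensor in state_dict.items():
        (adapter if is_adapter_key(key) else base)[key] = tensor
    return base, adapter

state = {
    "model.embed_tokens.weight": "frozen-base-tensor",
    "model.layers.0.adapter.lora_A": "trainable-adapter-tensor",
}
base, adapter = split_peft_state(state, lambda k: ".adapter." in k)
print(sorted(base))     # ['model.embed_tokens.weight']
print(sorted(adapter))  # ['model.layers.0.adapter.lora_A']
```

Because the base weights never change during PEFT training, saving them once and reusing the cached copy avoids repeatedly serializing the largest part of the model.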

```
save_checkpoint(trainer)
```

Save NeMo PEFT adapter model checkpoint in memory for potential checkpointless recovery.

**Parameters:**
+ **trainer** (*pytorch_lightning.Trainer*) – PyTorch Lightning trainer instance

**Notes:**
+ Automatically calls `maybe_save_base_model()` if base model not yet saved
+ Filters checkpoint to include only adapter parameters and training state
+ Significantly reduces checkpoint size compared to full model checkpoints

```
try_base_model_checkpointless_load(trainer)
```

Attempt PEFT base model weights checkpointless recovery by loading state from peer ranks.

**Parameters:**
+ **trainer** (*pytorch_lightning.Trainer*) – PyTorch Lightning trainer instance

**Returns:**
+ **dict** or **None** – Restored base model checkpoint if successful, None if fallback needed

**Notes:**
+ Used during model initialization to recover base model weights
+ Does not clean up base model weights after recovery (preserves for reuse)
+ Optimized for model-weights-only recovery scenarios

```
try_checkpointless_load(trainer)
```

Attempt PEFT adapter weights checkpointless recovery by loading state from peer ranks.

**Parameters:**
+ **trainer** (*pytorch_lightning.Trainer*) – PyTorch Lightning trainer instance

**Returns:**
+ **dict** or **None** – Restored adapter checkpoint if successful, None if fallback needed

**Notes:**
+ Recovers only adapter parameters, optimizer states, and schedulers
+ Automatically loads optimizer and scheduler states after successful recovery
+ Cleans up adapter checkpoints after recovery attempt

```
is_adapter_key(key)
```

Check if state dict key belongs to adapter parameters.

**Parameters:**
+ **key** (*str* or *tuple*) – State dict key to check

**Returns:**
+ **bool** – True if key is adapter parameter, False if base model parameter

**Detection Logic:**
+ Checks if key is in `params_to_save` set
+ Identifies keys containing ".adapter." substring
+ Identifies keys ending with ".adapters"
+ For tuple keys, checks if parameter requires gradients
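The detection logic above can be sketched in a few lines; this is a simplified stand-in, not the library's actual implementation:

```python
# Simplified sketch of the adapter-key detection logic described above.
def is_adapter_key(key, params_to_save):
    """Return True if a state dict key belongs to adapter parameters."""
    if isinstance(key, tuple):
        # For tuple keys, assume the second element is the parameter itself;
        # in a PEFT setup only adapter parameters require gradients.
        return bool(getattr(key[1], "requires_grad", False))
    if key in params_to_save:
        return True
    return ".adapter." in key or key.endswith(".adapters")

saved = {"model.layers.0.lora_A.weight"}
print(is_adapter_key("model.layers.0.lora_A.weight", saved))       # True
print(is_adapter_key("decoder.block.1.adapter.down_proj", saved))  # True
print(is_adapter_key("model.embed_tokens.weight", saved))          # False
```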

```
maybe_offload_checkpoint()
```

Offload base model weights from GPU to CPU memory.

**Notes:**
+ Extends parent method to handle base model weight offloading
+ Adapter weights are typically small and don't require offloading
+ Sets internal flag to track offload state

**Notes**
+ Designed specifically for Parameter-Efficient Fine-Tuning scenarios (LoRA, Adapters, etc.)
+ Automatically handles separation of base model and adapter parameters

**Example**

```
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager import PEFTCheckpointManager  
# Use with HPWrapper for complete fault tolerance  
@HPWrapper(  
    checkpoint_manager=PEFTCheckpointManager(),  
    enabled=True  
)  
def training_function():  
    # Training code with automatic checkpointless recovery  
    pass
```

### CheckpointlessAbortManager
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessAbortManager"></a>

```
class hyperpod_checkpointless_training.inprocess.abort.CheckpointlessAbortManager()
```

Factory class for creating and managing abort component compositions for checkpointless fault tolerance.

This utility class provides static methods to create, customize, and manage abort component compositions used during fault handling in HyperPod checkpointless training. It simplifies the configuration of abort sequences that handle cleanup of distributed training components, data loaders, and framework-specific resources during failure recovery.

**Parameters**

None (all methods are static)

**Static Methods**

```
get_default_checkpointless_abort()
```

Get the default abort compose instance containing all standard abort components.

**Returns:**
+ **Compose** – Default composed abort instance with all abort components

**Default Components:**
+ **AbortTransformerEngine()** – Cleans up TransformerEngine resources
+ **HPCheckpointingAbort()** – Handles checkpointing system cleanup
+ **HPAbortTorchDistributed()** – Aborts PyTorch distributed operations
+ **HPDataLoaderAbort()** – Stops and cleans up data loaders

```
create_custom_abort(*abort_instances)
```

Create a custom abort compose with only the specified abort instances.

**Parameters:**
+ **abort_instances** (*Abort*) – Variable number of abort instances to include in the compose

**Returns:**
+ **Compose** – New composed abort instance containing only the specified components

**Raises:**
+ **ValueError** – If no abort instances are provided

```
override_abort(abort_compose, abort_type, new_abort)
```

Replace a specific abort component in a Compose instance with a new component.

**Parameters:**
+ **abort_compose** (*Compose*) – The original Compose instance to modify
+ **abort_type** (*type*) – The type of abort component to replace (e.g., `HPCheckpointingAbort`)
+ **new_abort** (*Abort*) – The new abort instance to use as replacement

**Returns:**
+ **Compose** – New Compose instance with the specified component replaced

**Raises:**
+ **ValueError** – If abort_compose doesn't have an 'instances' attribute
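The replacement behavior can be illustrated with a minimal, self-contained sketch; `Compose` and the abort classes below are stand-ins for the library's components, not its actual implementations:

```python
# Minimal stand-ins illustrating the override_abort() behavior described above.
class Compose:
    def __init__(self, *instances):
        self.instances = list(instances)

    def __call__(self, *args, **kwargs):
        for instance in self.instances:
            instance(*args, **kwargs)

class HPCheckpointingAbort:
    def __call__(self, *a, **kw):
        print("default checkpointing cleanup")

class CustomCheckpointingAbort:
    def __call__(self, *a, **kw):
        print("custom checkpointing cleanup")

def override_abort(abort_compose, abort_type, new_abort):
    """Return a new Compose with the component of the given type replaced."""
    if not hasattr(abort_compose, "instances"):
        raise ValueError("abort_compose must have an 'instances' attribute")
    replaced = [new_abort if isinstance(inst, abort_type) else inst
                for inst in abort_compose.instances]
    return Compose(*replaced)

default = Compose(HPCheckpointingAbort())
custom = override_abort(default, HPCheckpointingAbort, CustomCheckpointingAbort())
custom()  # prints "custom checkpointing cleanup"
```

The same pattern applies to any of the default components, so you can swap in a custom data loader or checkpointing abort while keeping the rest of the default sequence intact.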

**Example**

```
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
from hyperpod_checkpointless_training.nemo_plugins.callbacks import CheckpointlessCallback  
from hyperpod_checkpointless_training.inprocess.abort import CheckpointlessFinalizeCleanup, CheckpointlessAbortManager  
  
# The strategy automatically integrates with HPWrapper  
@HPWrapper(  
    abort=CheckpointlessAbortManager.get_default_checkpointless_abort(),  
    health_check=CudaHealthCheck(),  
    finalize=CheckpointlessFinalizeCleanup(),  
    enabled=True  
)  
def training_function():  
    trainer.fit(...)
```

**Notes**
+ Custom configurations allow fine-tuned control over cleanup behavior
+ Abort operations are critical for proper resource cleanup during fault recovery

### CheckpointlessFinalizeCleanup
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessFinalizeCleanup"></a>

```
class hyperpod_checkpointless_training.inprocess.abort.CheckpointlessFinalizeCleanup()
```

Performs comprehensive cleanup after fault detection to prepare for in-process recovery during checkpointless training.

This finalize handler executes framework-specific cleanup operations including Megatron/TransformerEngine abort, DDP cleanup, module reloading, and memory cleanup by destroying training component references. It ensures that the training environment is properly reset for successful in-process recovery without requiring full process termination.

**Parameters**

None

**Attributes**
+ **trainer** (*pytorch_lightning.Trainer* or *None*) – Reference to the PyTorch Lightning trainer instance

**Methods**

```
__call__(*a, **kw)
```

Execute comprehensive cleanup operations for in-process recovery preparation.

**Parameters:**
+ **a** – Variable positional arguments (inherited from Finalize interface)
+ **kw** – Variable keyword arguments (inherited from Finalize interface)

**Cleanup Operations:**
+ **Megatron Framework Cleanup** – Calls `abort_megatron()` to clean up Megatron-specific resources
+ **TransformerEngine Cleanup** – Calls `abort_te()` to clean up TransformerEngine resources
+ **RoPE Cleanup** – Calls `cleanup_rope()` to clean up rotary position embedding resources
+ **DDP Cleanup** – Calls `cleanup_ddp()` to clean up DistributedDataParallel resources
+ **Module Reloading** – Calls `reload_megatron_and_te()` to reload framework modules
+ **Lightning Module Cleanup** – Optionally clears Lightning module to reduce GPU memory
+ **Memory Cleanup** – Destroys training component references to free memory

```
register_attributes(trainer)
```

Register the trainer instance for use during cleanup operations.

**Parameters:**
+ **trainer** (*pytorch_lightning.Trainer*) – PyTorch Lightning trainer instance to register

**Integration with CheckpointlessCallback**

```
from hyperpod_checkpointless_training.nemo_plugins.callbacks import CheckpointlessCallback  
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
  
# The strategy automatically integrates with HPWrapper  
@HPWrapper(  
    ...  
    finalize=CheckpointlessFinalizeCleanup(),   
)  
def training_function():  
    trainer.fit(...)
```

**Notes**
+ Cleanup operations are executed in a specific order to avoid dependency issues
+ Memory cleanup uses garbage collection introspection to find target objects
+ All cleanup operations are designed to be idempotent and safe to retry

### CheckpointlessMegatronStrategy
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessMegatronStrategy"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.megatron_strategy.CheckpointlessMegatronStrategy(*args, **kwargs)
```

NeMo Megatron strategy with integrated checkpointless recovery capabilities for fault-tolerant distributed training.

Note that checkpointless training requires `num_distributed_optimizer_instances` to be at least 2 so that the optimizer state is replicated across instances. The strategy also takes care of essential attribute registration and process group initialization.

**Parameters**

Inherits all parameters from **MegatronStrategy**:
+ Standard NeMo MegatronStrategy initialization parameters
+ Distributed training configuration options
+ Model parallelism settings

**Attributes**
+ **base_store** (*torch.distributed.TCPStore* or *None*) – Distributed store for process group coordination

**Methods**

```
setup(trainer)
```

Initialize the strategy and register fault tolerance components with the trainer.

**Parameters:**
+ **trainer** (*pytorch_lightning.Trainer*) – PyTorch Lightning trainer instance

**Setup Operations:**
+ **Parent Setup** – Calls parent MegatronStrategy setup
+ **Fault Injection Registration** – Registers HPFaultInjectionCallback hooks if present
+ **Finalize Registration** – Registers trainer with finalize cleanup handlers
+ **Abort Registration** – Registers trainer with abort handlers that support it

```
setup_distributed()
```

Initialize the process group using either a TCPStore with a prefix or a rootless connection.

```
load_model_state_dict(checkpoint, strict=True)
```

Load model state dict with checkpointless recovery compatibility.

**Parameters:**
+ **checkpoint** (*Mapping[str, Any]*) – Checkpoint dictionary containing model state
+ **strict** (*bool*, *optional*) – Whether to strictly enforce state dict key matching. Default: `True`

```
get_wrapper()
```

Get the HPCallWrapper instance for fault tolerance coordination.

**Returns:**
+ **HPCallWrapper** – The wrapper instance attached to the trainer for fault tolerance

```
is_peft()
```

Check if PEFT (Parameter-Efficient Fine-Tuning) is enabled in the training configuration by checking for PEFT callbacks.

**Returns:**
+ **bool** – True if PEFT callback is present, False otherwise
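The callback scan can be sketched as follows; the `PEFT` class here is a stand-in for NeMo's PEFT callback, not the real class:

```python
# Simplified sketch of the is_peft() check: scan the trainer's callback
# list for a PEFT callback instance. PEFT is a stand-in class here.
class PEFT:
    pass

class OtherCallback:
    pass

def is_peft(callbacks):
    return any(isinstance(cb, PEFT) for cb in callbacks)

print(is_peft([OtherCallback(), PEFT()]))  # True
print(is_peft([OtherCallback()]))          # False
```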

```
teardown()
```

Override PyTorch Lightning native teardown to delegate cleanup to abort handlers.

**Example**

```
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
  
# The strategy automatically integrates with HPWrapper  
@HPWrapper(  
    checkpoint_manager=checkpoint_manager,  
    enabled=True  
)  
def training_function():  
    trainer = pl.Trainer(strategy=CheckpointlessMegatronStrategy())  
    trainer.fit(model, datamodule)
```

### CheckpointlessCallback
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessCallback"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.callbacks.CheckpointlessCallback(  
    enable_inprocess=False,  
    enable_checkpointless=False,  
    enable_checksum=False,  
    clean_tensor_hook=False,  
    clean_lightning_module=False)
```

Lightning callback that integrates NeMo training with checkpointless training's fault tolerance system.

This callback manages step tracking, checkpoint saving, and parameter update coordination for in-process recovery capabilities. It serves as the primary integration point between PyTorch Lightning training loops and HyperPod checkpointless training mechanisms, coordinating fault tolerance operations throughout the training lifecycle.

**Parameters**
+ **enable_inprocess** (*bool*, *optional*) – Enable in-process recovery capabilities. Default: `False`
+ **enable_checkpointless** (*bool*, *optional*) – Enable checkpointless recovery (requires `enable_inprocess=True`). Default: `False`
+ **enable_checksum** (*bool*, *optional*) – Enable model state checksum validation (requires `enable_checkpointless=True`). Default: `False`
+ **clean_tensor_hook** (*bool*, *optional*) – Clear tensor hooks from all GPU tensors during cleanup (expensive operation). Default: `False`
+ **clean_lightning_module** (*bool*, *optional*) – Enable Lightning module cleanup to free GPU memory after each restart. Default: `False`

**Attributes**
+ **tried_adapter_checkpointless** (*bool*) – Flag to track if adapter checkpointless restore has been attempted

**Methods**

```
get_wrapper_from_trainer(trainer)
```

Get the HPCallWrapper instance from the trainer for fault tolerance coordination.

**Parameters:**
+ **trainer** (*pytorch_lightning.Trainer*) – PyTorch Lightning trainer instance

**Returns:**
+ **HPCallWrapper** – The wrapper instance for fault tolerance operations

```
on_train_batch_start(trainer, pl_module, batch, batch_idx, *args, **kwargs)
```

Called at the start of each training batch to manage step tracking and recovery.

**Parameters:**
+ **trainer** (*pytorch_lightning.Trainer*) – PyTorch Lightning trainer instance
+ **pl_module** (*pytorch_lightning.LightningModule*) – Lightning module being trained
+ **batch** – Current training batch data
+ **batch_idx** (*int*) – Index of the current batch
+ **args** – Additional positional arguments
+ **kwargs** – Additional keyword arguments

```
on_train_batch_end(trainer, pl_module, outputs, batch, batch_idx)
```

Release parameter update lock at the end of each training batch.

**Parameters:**
+ **trainer** (*pytorch_lightning.Trainer*) – PyTorch Lightning trainer instance
+ **pl_module** (*pytorch_lightning.LightningModule*) – Lightning module being trained
+ **outputs** (*STEP_OUTPUT*) – Training step outputs
+ **batch** (*Any*) – Current training batch data
+ **batch_idx** (*int*) – Index of the current batch

**Notes:**
+ Lock release timing ensures checkpointless recovery can proceed after parameter updates complete
+ Only executes when both `enable_inprocess` and `enable_checkpointless` are True
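The batch-boundary lock pattern can be pictured with a simplified, hypothetical sketch using a plain `threading.Lock`; the library's actual `ParameterUpdateLock` coordination is more involved:

```python
import threading

# Hypothetical stand-in for the parameter update lock described above.
class ParameterUpdateLock:
    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self):
        # Held while an optimizer step may be mutating parameters.
        self._lock.acquire()

    def release(self):
        # Released at batch end: parameters are in a consistent state,
        # so checkpointless recovery can safely proceed.
        self._lock.release()

    def locked(self):
        return self._lock.locked()

lock = ParameterUpdateLock()
lock.acquire()            # on_train_batch_start: updates in flight
assert lock.locked()      # recovery must wait at this point
lock.release()            # on_train_batch_end: safe recovery point
assert not lock.locked()
```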

```
get_peft_callback(trainer)
```

Retrieve the PEFT callback from the trainer's callback list.

**Parameters:**
+ **trainer** (*pytorch_lightning.Trainer*) – PyTorch Lightning trainer instance

**Returns:**
+ **PEFT** or **None** – PEFT callback instance if found, None otherwise

```
_try_adapter_checkpointless_restore(trainer, params_to_save)
```

Attempt checkpointless restore for PEFT adapter parameters.

**Parameters:**
+ **trainer** (*pytorch_lightning.Trainer*) – PyTorch Lightning trainer instance
+ **params_to_save** (*set*) – Set of parameter names to save as adapter parameters

**Notes:**
+ Only executes once per training session (controlled by `tried_adapter_checkpointless` flag)
+ Configures checkpoint manager with adapter parameter information

**Example**

```
from hyperpod_checkpointless_training.nemo_plugins.callbacks import CheckpointlessCallback  
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager import CheckpointManager  
import pytorch_lightning as pl  
  
# Create checkpoint manager  
checkpoint_manager = CheckpointManager(  
    enable_checksum=True,  
    enable_offload=True  
)  
  
# Create checkpointless callback with full fault tolerance  
checkpointless_callback = CheckpointlessCallback(  
    enable_inprocess=True,  
    enable_checkpointless=True,  
    enable_checksum=True,  
    clean_tensor_hook=True,  
    clean_lightning_module=True  
)  
  
# Use with PyTorch Lightning trainer  
trainer = pl.Trainer(  
    callbacks=[checkpointless_callback],  
    strategy=CheckpointlessMegatronStrategy()  
)  
  
# Training with fault tolerance  
trainer.fit(model, datamodule=data_module)
```

**Memory Management**
+ **clean_tensor_hook**: Removes tensor hooks during cleanup (expensive but thorough)
+ **clean_lightning_module**: Frees Lightning module GPU memory during restarts
+ Both options help reduce memory footprint during fault recovery
+ Coordinates with ParameterUpdateLock for thread-safe parameter update tracking

### CheckpointlessCompatibleConnector
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessCompatibleConnector"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.checkpoint_connector.CheckpointlessCompatibleConnector()
```

PyTorch Lightning checkpoint connector that integrates checkpointless recovery with traditional disk-based checkpoint loading.

This connector extends PyTorch Lightning's `_CheckpointConnector` to provide seamless integration between checkpointless recovery and standard checkpoint restoration. It attempts checkpointless recovery first, then falls back to disk-based checkpoint loading if checkpointless recovery is not feasible or fails.

**Parameters**

Inherits all parameters from **_CheckpointConnector**

**Methods**

```
resume_start(checkpoint_path=None)
```

Attempt to pre-load checkpoint with checkpointless recovery priority.

**Parameters:**
+ **checkpoint_path** (*str* or *None*, *optional*) – Path to disk checkpoint for fallback. Default: `None`

```
resume_end()
```

Complete the checkpoint loading process and perform post-load operations.

**Notes**
+ Extends PyTorch Lightning's internal `_CheckpointConnector` class with checkpointless recovery support
+ Maintains full compatibility with standard PyTorch Lightning checkpoint workflows
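The recovery priority implemented by `resume_start()` can be sketched conceptually; the function and callable names below are illustrative, not the connector's real interface:

```python
# Conceptual sketch: attempt checkpointless (peer-to-peer) recovery first,
# then fall back to a disk checkpoint, then start fresh.
def resume_start(try_checkpointless_load, load_from_disk, checkpoint_path=None):
    state = try_checkpointless_load()  # returns a state dict or None
    if state is not None:
        return state, "checkpointless"
    if checkpoint_path is not None:
        return load_from_disk(checkpoint_path), "disk"
    return None, "fresh"

# Peer recovery succeeds: the disk checkpoint is never read.
state, source = resume_start(lambda: {"step": 1200},
                             lambda path: {"step": 1000},
                             checkpoint_path="/ckpt/last.pt")
print(source)  # checkpointless

# Peer recovery fails: fall back to the disk checkpoint.
state, source = resume_start(lambda: None,
                             lambda path: {"step": 1000},
                             checkpoint_path="/ckpt/last.pt")
print(source)  # disk
```

This ordering is what makes checkpointless recovery transparent: when peers hold a newer in-memory state, restarts skip the slower disk load entirely.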

### CheckpointlessAutoResume
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessAutoResume"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.resume.CheckpointlessAutoResume()
```

Extends NeMo's AutoResume with delayed setup to enable checkpointless recovery validation before checkpoint path resolution.

This class implements a two-phase initialization strategy that allows checkpointless recovery validation to occur before falling back to traditional disk-based checkpoint loading. It conditionally delays AutoResume setup to prevent premature checkpoint path resolution, enabling the CheckpointManager to first validate whether checkpointless peer-to-peer recovery is feasible.

**Parameters**

Inherits all parameters from **AutoResume**

**Methods**

```
setup(trainer, model=None, force_setup=False)
```

Conditionally delay AutoResume setup to enable checkpointless recovery validation.

**Parameters:**
+ **trainer** (*pytorch_lightning.Trainer* or *lightning.fabric.Fabric*) – PyTorch Lightning trainer or Fabric instance
+ **model** (*optional*) – Model instance for setup. Default: `None`
+ **force_setup** (*bool*, *optional*) – If True, bypass delay and execute AutoResume setup immediately. Default: `False`

**Example**

```
from hyperpod_checkpointless_training.nemo_plugins.resume import CheckpointlessAutoResume  
from hyperpod_checkpointless_training.nemo_plugins.megatron_strategy import CheckpointlessMegatronStrategy  
import pytorch_lightning as pl  
  
# Create trainer with checkpointless auto-resume  
trainer = pl.Trainer(  
    strategy=CheckpointlessMegatronStrategy(),  
    resume=CheckpointlessAutoResume()  
)
```

**Notes**
+ Extends NeMo's AutoResume class with delay mechanism for enabling checkpointless recovery
+ Works in conjunction with `CheckpointlessCompatibleConnector` for complete recovery workflow

# Special considerations
<a name="sagemaker-eks-checkpointless-considerations"></a>

We collect certain routine aggregated and anonymized operational metrics to provide essential service availability. The creation of these metrics is fully automated and does not involve human review of the underlying model training workload. These metrics relate to job operations, resource management, and essential service functionality. 

**HyperPod managed tiered checkpointing and elastic training**: HyperPod checkpointless training is currently incompatible with HyperPod managed tiered checkpointing and with elastic training.

Checkpointless training recipes for GPT OSS 120B and Llama models are provided to simplify getting started. These recipes have been verified on ml.p5 instances; using other instance types may require additional modifications to the underlying recipes. The recipes can also be adapted to full fine-tuning workflows. For custom models, we recommend reviewing the [getting started examples](https://docs.aws.amazon.com/sagemaker-eks-checkpointless-recipes-custom).

# Appendix
<a name="sagemaker-eks-checkpointless-appendix"></a>

**Monitor training results via HyperPod recipes**

SageMaker HyperPod recipes offer TensorBoard integration to analyze training behavior. These recipes also incorporate VizTracer, a low-overhead tool for tracing and visualizing Python code execution. For more information, see [VizTracer](https://github.com/gaogaotiantian/viztracer).

The TensorBoard logs are generated and stored within the `log_dir`. To access and analyze these logs locally, use the following procedure:

1. Download the TensorBoard experiment folder from your training environment to your local machine.

1. Open a terminal or command prompt on your local machine.

1. Navigate to the directory containing the downloaded experiment folder.

1. Launch TensorBoard by running the command:

   ```
   tensorboard --port=<port> --bind_all --logdir experiment
   ```

1. Open your web browser and visit `http://localhost:<port>`.

You can now see the status and visualizations of your training jobs within the TensorBoard interface, which helps you monitor the training process and gain insights into the behavior and performance of your models. For more information about monitoring and analyzing training with TensorBoard, see the [NVIDIA NeMo Framework User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/core/exp_manager.html#experiment-manager).

**VizTracer**

To enable VizTracer, you can modify your recipe by setting the environment variable `ENABLE_VIZTRACER` to `1`. After the training has completed, your VizTracer profile is in the experiment folder `log_dir/viztracer_xxx.json`. To analyze your profile, you can download it and open it using the **vizviewer** tool:

```
vizviewer --port <port> viztracer_xxx.json
```

This command launches vizviewer on the specified port. You can view the trace by going to `http://localhost:<port>` in your browser. For more information about using VizTracer, see the [VizTracer documentation](https://viztracer.readthedocs.io/en/latest/installation.html).

# Release notes
<a name="sagemaker-eks-checkpointless-release-notes"></a>

See the following release notes to track the latest updates for SageMaker HyperPod checkpointless training.

**SageMaker HyperPod checkpointless training v1.0.1**

Date: April 10, 2026

**Bug Fixes**
+ Fixed incorrect CUDA device binding in the fault handling thread. The fault handling thread now correctly sets the CUDA device context by using `LOCAL_RANK`. This fix prevents device mismatch errors during in-process fault recovery.

**SageMaker HyperPod checkpointless training v1.0.0**

Date: December 03, 2025

**SageMaker HyperPod checkpointless training features**
+ **Collective communication initialization improvements**: Offers novel initialization methods, Rootless and TCPStoreless, for NCCL and Gloo.
+ **Memory-mapped (MMAP) dataloader**: Caches (persists) prefetched batches so that they are available even when a fault causes a restart of the training job.
+ **Checkpointless recovery**: Enables faster recovery from cluster training faults in large-scale distributed training environments by making framework-level optimizations.
+ **Built on NVIDIA NeMo and PyTorch Lightning**: Leverages these powerful frameworks for efficient and flexible model training.
  + [NVIDIA NeMo](https://github.com/NVIDIA-NeMo/NeMo)
  + [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/)

**SageMaker HyperPod checkpointless training Docker container**

Checkpointless training on HyperPod is built on top of the [NVIDIA NeMo framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html). HyperPod checkpointless training aims to recover faster from cluster training faults in large-scale distributed training environments through framework-level optimizations, delivered in a base container that includes NCCL and PyTorch optimizations.

**Availability**

Currently, images are available only in the following Regions:

```
eu-north-1
ap-south-1
us-east-2
eu-west-1
eu-central-1
sa-east-1
us-east-1
eu-west-2
ap-northeast-1
us-west-2
us-west-1
ap-southeast-1
ap-southeast-2
```

They are not available in the following three opt-in Regions:

```
ap-southeast-3
ap-southeast-4
eu-south-2
```

**Container details**

Checkpointless training Docker container for PyTorch v2.6.0 with CUDA v12.9

```
963403601044.dkr.ecr.eu-north-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
423350936952.dkr.ecr.ap-south-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
556809692997.dkr.ecr.us-east-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
942446708630.dkr.ecr.eu-west-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
391061375763.dkr.ecr.eu-central-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
311136344257.dkr.ecr.sa-east-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
327873000638.dkr.ecr.us-east-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
016839105697.dkr.ecr.eu-west-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
356859066553.dkr.ecr.ap-northeast-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
920498770698.dkr.ecr.us-west-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
827510180725.dkr.ecr.us-west-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
885852567298.dkr.ecr.ap-southeast-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
304708117039.dkr.ecr.ap-southeast-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
```

**Pre-installed packages**

```
PyTorch: v2.6.0
CUDA: v12.9
NCCL: v2.27.5
EFA: v1.43.0
AWS-OFI-NCCL: v1.16.0
Libfabric: v2.1
Megatron: v0.15.0
NeMo: v2.6.0rc0
```

# Using GPU partitions in Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-gpu-partitioning"></a>

Cluster administrators can choose how to maximize GPU utilization across their organization. You can enable GPU partitioning with NVIDIA Multi-Instance GPU (MIG) technology to partition GPU resources into smaller, isolated instances for better resource utilization. This lets you run multiple smaller tasks concurrently on a single GPU instead of dedicating the entire device to a single, often underutilized task, eliminating wasted compute power and memory.

GPU partitioning with MIG technology allows you to partition a single supported GPU into up to seven separate GPU partitions. Each GPU partition has dedicated memory, cache, and compute resources, providing predictable isolation.

## Benefits
<a name="sagemaker-hyperpod-eks-gpu-partitioning-benefits"></a>
+ **Improved GPU utilization** - Maximize compute efficiency by partitioning GPUs based on compute and memory requirements
+ **Task isolation** - Each GPU partition operates independently with dedicated memory, cache, and compute resources
+ **Task flexibility** - Support a mix of tasks on a single physical GPU, all running in parallel
+ **Flexible setup management** - Support both do-it-yourself (DIY) Kubernetes configuration using the Kubernetes command-line client `kubectl`, and a managed solution with custom labels that makes it easy to configure and apply labels associated with GPU partitions

**Important**  
GPU partitioning with MIG is not supported with flexible instance groups (instance groups that use `InstanceRequirements`). To use MIG, create an instance group with a single `InstanceType`.

## Supported Instance Types
<a name="sagemaker-hyperpod-eks-gpu-partitioning-instance-types"></a>

GPU partitioning with MIG technology is supported on the following HyperPod instance types:

**A100 GPU Instances** - [https://aws.amazon.com/ec2/instance-types/p4/](https://aws.amazon.com/ec2/instance-types/p4/)
+ **ml.p4d.24xlarge** - 8 NVIDIA A100 GPUs (40GB HBM2 per GPU)
+ **ml.p4de.24xlarge** - 8 NVIDIA A100 GPUs (80GB HBM2e per GPU)

**H100 GPU Instances** - [https://aws.amazon.com/ec2/instance-types/p5/](https://aws.amazon.com/ec2/instance-types/p5/)
+ **ml.p5.48xlarge** - 8 NVIDIA H100 GPUs (80GB HBM3 per GPU)

**H200 GPU Instances** - [https://aws.amazon.com/ec2/instance-types/p5/](https://aws.amazon.com/ec2/instance-types/p5/)
+ **ml.p5e.48xlarge** - 8 NVIDIA H200 GPUs (141GB HBM3e per GPU)
+ **ml.p5en.48xlarge** - 8 NVIDIA H200 GPUs (141GB HBM3e per GPU)

**B200 GPU Instances** - [https://aws.amazon.com/ec2/instance-types/p6/](https://aws.amazon.com/ec2/instance-types/p6/)
+ **ml.p6-b200.48xlarge** - 8 NVIDIA B200 GPUs

## GPU partitions
<a name="sagemaker-hyperpod-eks-gpu-partitioning-profiles"></a>

NVIDIA MIG profiles define how GPUs are partitioned. Each profile specifies the compute and memory allocation per MIG instance. The following are the MIG profiles associated with each GPU type:

**A100 GPU (ml.p4d.24xlarge)**


| Profile | Memory (GB) | Instances per GPU | Total per ml.p4d.24xlarge | 
| --- | --- | --- | --- | 
| `1g.5gb` | 5 | 7 | 56 | 
| `2g.10gb` | 10 | 3 | 24 | 
| `3g.20gb` | 20 | 2 | 16 | 
| `4g.20gb` | 20 | 1 | 8 | 
| `7g.40gb` | 40 | 1 | 8 | 

**H100 GPU (ml.p5.48xlarge)**


| Profile | Memory (GB) | Instances per GPU | Total per ml.p5.48xlarge | 
| --- | --- | --- | --- | 
| `1g.10gb` | 10 | 7 | 56 | 
| `1g.20gb` | 20 | 4 | 32 | 
| `2g.20gb` | 20 | 3 | 24 | 
| `3g.40gb` | 40 | 2 | 16 | 
| `4g.40gb` | 40 | 1 | 8 | 
| `7g.80gb` | 80 | 1 | 8 | 

**H200 GPU (ml.p5e.48xlarge and ml.p5en.48xlarge)**


| Profile | Memory (GB) | Instances per GPU | Total per instance | 
| --- | --- | --- | --- | 
| `1g.18gb` | 18 | 7 | 56 | 
| `1g.35gb` | 35 | 4 | 32 | 
| `2g.35gb` | 35 | 3 | 24 | 
| `3g.71gb` | 71 | 2 | 16 | 
| `4g.71gb` | 71 | 1 | 8 | 
| `7g.141gb` | 141 | 1 | 8 | 

**Topics**
+ [Benefits](#sagemaker-hyperpod-eks-gpu-partitioning-benefits)
+ [Supported Instance Types](#sagemaker-hyperpod-eks-gpu-partitioning-instance-types)
+ [GPU partitions](#sagemaker-hyperpod-eks-gpu-partitioning-profiles)
+ [Setting up GPU partitions on Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning-setup.md)
+ [Node Lifecycle and Labels](sagemaker-hyperpod-eks-gpu-partitioning-labels.md)
+ [Task Submission with MIG](sagemaker-hyperpod-eks-gpu-partitioning-task-submission.md)

# Setting up GPU partitions on Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup"></a>

**Topics**
+ [Prerequisites](#sagemaker-hyperpod-eks-gpu-partitioning-setup-prerequisites)
+ [Creating a Cluster with MIG Configuration](#sagemaker-hyperpod-eks-gpu-partitioning-setup-create-cluster)
+ [Adding GPU operator to an existing cluster](#sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator)
+ [Updating MIG Configuration](#sagemaker-hyperpod-eks-gpu-partitioning-setup-update)
+ [Verifying MIG Configuration](#sagemaker-hyperpod-eks-gpu-partitioning-setup-verify)
+ [Common Commands for Debugging MIG Configuration](#sagemaker-hyperpod-eks-gpu-partitioning-setup-debug-commands)
+ [Using SageMaker AI Console](#sagemaker-hyperpod-eks-gpu-partitioning-setup-console)

## Prerequisites
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-prerequisites"></a>
+ HyperPod Amazon EKS cluster with supported GPU instances
+ NVIDIA GPU Operator installed
+ Appropriate IAM permissions for cluster management

## Creating a Cluster with MIG Configuration
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-create-cluster"></a>

### Using AWS CLI
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-create-cluster-cli"></a>

```
aws sagemaker create-cluster \
  --cluster-name my-mig-cluster \
  --orchestrator 'Eks={ClusterArn=arn:aws:eks:region:account:cluster/cluster-name}' \
  --instance-groups '{
    "InstanceGroupName": "gpu-group",
    "InstanceType": "ml.p4d.24xlarge",
    "InstanceCount": 1,
    "LifeCycleConfig": {
       "SourceS3Uri": "s3://my-bucket",
       "OnCreate": "on_create_script.sh"
    },
    "KubernetesConfig": {
       "Labels": {
          "nvidia.com/mig.config": "all-1g.5gb"
       }
    },
    "ExecutionRole": "arn:aws:iam::account:role/execution-role",
    "ThreadsPerCore": 1
  }' \
  --vpc-config '{
     "SecurityGroupIds": ["sg-12345"],
     "Subnets": ["subnet-12345"]
  }' \
  --node-provisioning-mode Continuous
```

### Using CloudFormation
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-create-cluster-cfn"></a>

```
{
  "ClusterName": "my-mig-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "gpu-group",
      "InstanceType": "ml.p4d.24xlarge",
      "InstanceCount": 1,
      "KubernetesConfig": {
        "Labels": {
          "nvidia.com/mig.config": "all-2g.10gb"
        }
      },
      "ExecutionRole": "arn:aws:iam::account:role/execution-role"
    }
  ],
  "Orchestrator": {
    "Eks": {
      "ClusterArn": "arn:aws:eks:region:account:cluster/cluster-name"
    }
  },
  "NodeProvisioningMode": "Continuous"
}
```

## Adding GPU operator to an existing cluster
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator"></a>

### Install GPU Operator
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator-install"></a>

Replace `${AWS_REGION}` with your cluster region (e.g., us-east-1, us-west-2).

```
helm install gpuo helm_chart/HyperPodHelmChart/charts/gpu-operator \
-f helm_chart/HyperPodHelmChart/charts/gpu-operator/regional-values/values-${AWS_REGION}.yaml \
-n kube-system
```

### Verify Installation (Wait 2-3 minutes)
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator-verify"></a>

Check all GPU operator pods are running:

```
kubectl get pods -n kube-system | grep -E "(gpu-operator|nvidia-)"
```

**Expected pods:**
+ `gpu-operator-xxxxx` - 1 instance (cluster controller)
+ `nvidia-device-plugin-daemonset-xxxxx` - 1 per GPU node (all GPU instances)
+ `nvidia-mig-manager-xxxxx` - 1 per MIG-capable node (A100/H100)

### Remove Old Device Plugin
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator-remove"></a>

Disable the existing nvidia-device-plugin:

```
helm upgrade dependencies helm_chart/HyperPodHelmChart \
--set nvidia-device-plugin.devicePlugin.enabled=false \
-n kube-system
```

### Verify GPU Resources
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator-verify-gpu"></a>

Confirm that the nodes show GPU capacity. The output should include `nvidia.com/gpu: 8` (or your actual GPU count).

```
kubectl describe nodes | grep "nvidia.com/gpu"
```

## Updating MIG Configuration
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update"></a>

**Preparing Nodes Before MIG Updates**  
Before updating MIG configurations on your instance group, you must prepare the nodes to prevent workload disruption. Follow these steps to safely drain workloads from the nodes that will be reconfigured.

### Step 1: Identify Nodes in the Instance Group
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-identify"></a>

First, identify all nodes that belong to the instance group you want to update:

```
# List all nodes in the instance group
kubectl get nodes -l sagemaker.amazonaws.com/instance-group-name=INSTANCE_GROUP_NAME

# Example:
kubectl get nodes -l sagemaker.amazonaws.com/instance-group-name=p4d-group
```

This command returns a list of all nodes in the specified instance group. Make note of each node name for the following steps.

### Step 2: Cordon and Drain Each Node
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-drain"></a>

For each node identified in Step 1, perform the following actions:

#### Cordon the Node
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-drain-cordon"></a>

Cordoning prevents new pods from being scheduled on the node:

```
# Cordon a single node
kubectl cordon NODE_NAME

# Example:
kubectl cordon hyperpod-i-014a41a7001adca60
```

#### Drain Workload Pods from the Node
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-drain-evict"></a>

Drain the node to evict all workload pods while preserving system pods:

```
# Drain the node (ignore DaemonSets and evict pods)
kubectl drain NODE_NAME \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=300

# Example:
kubectl drain hyperpod-i-014a41a7001adca60 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=300
```

**Command Options Explained:**
+ `--ignore-daemonsets` - Allows the drain operation to proceed even if DaemonSet pods are present
+ `--delete-emptydir-data` - Deletes pods using emptyDir volumes (required for draining to succeed)
+ `--force` - Forces deletion of pods not managed by a controller (use with caution)
+ `--grace-period=300` - Gives pods 5 minutes to terminate gracefully

**Important**  
+ The drain operation may take several minutes depending on the number of pods and their termination grace periods.
+ System pods in the following namespaces will remain running: `kube-system`, `cert-manager`, `kubeflow`, `hyperpod-inference-system`, `kube-public`, `mpi-operator`, `gpu-operator`, `aws-hyperpod`, `jupyter-k8s-system`, `hyperpod-observability`, `kueue-system`, and `keda`.
+ DaemonSet pods will remain on the node (they are ignored by design).

### Step 3: Verify No Workload Pods are Running
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-verify"></a>

After draining, verify that no workload pods remain on the nodes (excluding system namespaces):

```
# Check for any remaining pods outside system namespaces
kubectl get pods --all-namespaces --field-selector spec.nodeName=NODE_NAME \
  | grep -v "kube-system" \
  | grep -v "cert-manager" \
  | grep -v "kubeflow" \
  | grep -v "hyperpod-inference-system" \
  | grep -v "kube-public" \
  | grep -v "mpi-operator" \
  | grep -v "gpu-operator" \
  | grep -v "aws-hyperpod" \
  | grep -v "jupyter-k8s-system" \
  | grep -v "hyperpod-observability" \
  | grep -v "kueue-system" \
  | grep -v "keda"

# Example:
kubectl get pods --all-namespaces --field-selector spec.nodeName=hyperpod-i-014a41a7001adca60 \
  | grep -v "kube-system" \
  | grep -v "cert-manager" \
  | grep -v "kubeflow" \
  | grep -v "hyperpod-inference-system" \
  | grep -v "kube-public" \
  | grep -v "mpi-operator" \
  | grep -v "gpu-operator" \
  | grep -v "aws-hyperpod" \
  | grep -v "jupyter-k8s-system" \
  | grep -v "hyperpod-observability" \
  | grep -v "kueue-system" \
  | grep -v "keda"
```

**Expected Output:** If the node is properly drained, this command should return no results (or only show the header row). If any pods are still running, investigate why they weren't evicted and manually delete them if necessary.

### Step 4: Verify Node Readiness Status
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-readiness"></a>

Before proceeding with the MIG update, confirm that all nodes are cordoned:

```
# Check node status - should show "SchedulingDisabled"
kubectl get nodes -l sagemaker.amazonaws.com/instance-group-name=INSTANCE_GROUP_NAME
```

Nodes should show `SchedulingDisabled` in the STATUS column, indicating they are cordoned and ready for the MIG update.

### Update MIG Profile on Existing Cluster
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-change"></a>

You can change MIG profiles on existing clusters:

```
aws sagemaker update-cluster \
  --cluster-name my-mig-cluster \
  --instance-groups '{
    "InstanceGroupName": "gpu-group",
    "InstanceType": "ml.p4d.24xlarge",
    "InstanceCount": 1,
    "KubernetesConfig": {
       "Labels": {
          "nvidia.com/mig.config": "all-3g.20gb"
       }
    },
    "ExecutionRole": "arn:aws:iam::account:role/execution-role"
  }'
```

**Note**  
If jobs are already running on a node, MIG partitioning will fail. You will receive an error message instructing you to drain the nodes before re-attempting the MIG partitioning.
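
Once the update succeeds and the nodes report the new MIG profile, you can return the drained nodes to service by uncordoning them. A minimal sketch, assuming `p4d-group` is the instance group you updated:

```
# Uncordon every node in the updated instance group
for node in $(kubectl get nodes \
  -l sagemaker.amazonaws.com/instance-group-name=p4d-group \
  -o jsonpath='{.items[*].metadata.name}'); do
  kubectl uncordon "$node"
done
```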

## Verifying MIG Configuration
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-verify"></a>

After cluster creation or update, verify the MIG configuration:

```
# Update kubeconfig
aws eks update-kubeconfig --name your-eks-cluster --region us-east-2

# Check MIG labels
kubectl get node NODE_NAME -o=jsonpath='{.metadata.labels}' | grep mig

# Check available MIG resources
kubectl describe node NODE_NAME | grep -A 10 "Allocatable:"
```

## Common Commands for Debugging MIG Configuration
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-debug-commands"></a>

Use the following commands to troubleshoot and validate MIG configuration in your cluster:

```
# Check GPU Operator status
kubectl get pods -n gpu-operator-resources

# View MIG configuration
kubectl exec -n gpu-operator-resources nvidia-driver-XXXXX -- nvidia-smi mig -lgi

# Check device plugin configuration
kubectl logs -n gpu-operator-resources nvidia-device-plugin-XXXXX

# Monitor node events
kubectl get events --field-selector involvedObject.name=NODE_NAME
```

**Note**  
Replace `nvidia-driver-XXXXX` and `nvidia-device-plugin-XXXXX` with the actual pod names from your cluster, and `NODE_NAME` with your node's name.

## Using SageMaker AI Console
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-console"></a>

### Creating a New Cluster with MIG
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-console-create"></a>

1. Navigate to **Amazon SageMaker AI** > **HyperPod Clusters** > **Cluster Management** > **Create HyperPod cluster**

1. Select **Orchestrated by EKS**

1. Choose **Custom setup** and verify **GPU Operator** is enabled by default

1. Under **Instance groups** section, click **Add group**

1. Configure the instance group, then navigate to **Advanced Configuration**, enable the **Use GPU partition** toggle, and choose your desired **MIG configuration** from the dropdown

1. Click **Add Instance group** and complete the remaining cluster configuration

1. Click **Submit** to create the cluster

### Updating MIG Configuration on Existing Cluster
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-console-update"></a>

1. Navigate to **Amazon SageMaker AI** > **HyperPod Clusters** > **Cluster Management**

1. Select your existing cluster and click **Edit** on the instance group you want to modify

1. In **Advanced configuration**, toggle **Use GPU partition** if not already enabled and select a different **MIG configuration** from the dropdown

1. Click **Save changes**

# Node Lifecycle and Labels
<a name="sagemaker-hyperpod-eks-gpu-partitioning-labels"></a>

Amazon SageMaker HyperPod performs deep health checks on cluster instances during the creation and update of HyperPod clusters, before GPU partitioning begins. The HyperPod health-monitoring agent continuously monitors the health status of GPU-partitioned instances.

## MIG Configuration States
<a name="sagemaker-hyperpod-eks-gpu-partitioning-labels-states"></a>

Nodes with GPU partition configuration go through several states:
+ **Pending** - Node is being configured with a MIG profile
+ **Configuring** - GPU Operator is applying MIG partitioning
+ **Success** - GPU partitioning completed successfully
+ **Failed** - GPU partitioning encountered an error

## Monitoring Node States
<a name="sagemaker-hyperpod-eks-gpu-partitioning-labels-monitoring"></a>

```
# Check node health status
kubectl get nodes -l sagemaker.amazonaws.com/node-health-status=Schedulable

# Monitor MIG configuration progress
kubectl get node NODE_NAME -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'

# Check for configuration errors
kubectl describe node NODE_NAME | grep -A 5 "Conditions:"
```

## Custom Labels and Taints
<a name="sagemaker-hyperpod-eks-gpu-partitioning-labels-custom"></a>

You can combine the MIG configuration label with custom labels and taints to categorize your GPU partitions and control scheduling across instances:

```
{
  "KubernetesConfig": {
    "Labels": {
      "nvidia.com/mig.config": "all-2g.10gb",
      "task-type": "inference",
      "environment": "production"
    },
    "Taints": [
      {
        "Key": "gpu-task",
        "Value": "mig-enabled",
        "Effect": "NoSchedule"
      }
    ]
  }
}
```

# Task Submission with MIG
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission"></a>

**Topics**
+ [Using Kubernetes YAML](#sagemaker-hyperpod-eks-gpu-partitioning-task-submission-kubectl)
+ [Using HyperPod CLI](#sagemaker-hyperpod-eks-gpu-partitioning-task-submission-cli)
+ [Model Deployment with MIG](#sagemaker-hyperpod-eks-gpu-partitioning-task-submission-deployment)
+ [Using HyperPod CLI](#sagemaker-hyperpod-eks-gpu-partitioning-task-submission-hyperpod-cli)

## Using Kubernetes YAML
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-kubectl"></a>

```
apiVersion: batch/v1
kind: Job
metadata:
  name: mig-job
  namespace: default
spec:
  template:
    spec:
      containers:
      - name: pytorch
        image: pytorch/pytorch:latest
        resources:
          requests:
            nvidia.com/mig-1g.5gb: 1
            cpu: "100m"
            memory: "128Mi"
          limits:
            nvidia.com/mig-1g.5gb: 1
      restartPolicy: Never
```
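
After saving the manifest above (for example as `mig-job.yaml`, a hypothetical filename), you can submit it and confirm that the job's pod was scheduled onto a MIG slice:

```
kubectl apply -f mig-job.yaml
kubectl get pods -l job-name=mig-job
```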

## Using HyperPod CLI
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-cli"></a>

Use the HyperPod CLI to deploy JumpStart models with MIG support. The following example demonstrates the new CLI parameters for GPU partitioning:

```
# Deploy JumpStart model with MIG
hyp create hyp-jumpstart-endpoint \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --instance-type ml.p5.48xlarge \
  --accelerator-partition-type mig-2g.10gb \
  --accelerator-partition-validation True \
  --endpoint-name my-endpoint \
  --tls-certificate-output-s3-uri s3://certificate-bucket/ \
  --namespace default
```

## Model Deployment with MIG
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-deployment"></a>

HyperPod Inference allows you to deploy models on MIG profiles through Studio Classic, `kubectl`, and the HyperPod CLI. To deploy JumpStart models with `kubectl`, the CRDs provide a `spec.server.acceleratorPartitionType` field that deploys the model to the desired MIG profile. HyperPod runs validations to ensure that models can be deployed on the MIG profile selected in the CRD. To disable the MIG validation checks, set `spec.server.validations.acceleratorPartitionValidation` to `False`.

### JumpStart Models
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-jumpstart"></a>

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: JumpStartModel
metadata:
  name: deepseek-model
  namespace: default
spec:
  sageMakerEndpoint:
    name: deepseek-endpoint
  model:
    modelHubName: SageMakerPublicHub
    modelId: deepseek-llm-r1-distill-qwen-1-5b
  server:
    acceleratorPartitionType: mig-7g.40gb
    instanceType: ml.p4d.24xlarge
```

### Deploy model from Amazon S3 using InferenceEndpointConfig
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-s3"></a>

InferenceEndpointConfig allows you to deploy a custom model from Amazon S3. To deploy a model on a MIG partition, specify the MIG profile in the `requests` and `limits` of `spec.worker.resources`, as in the following example:

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: custom-model
  namespace: default
spec:
  replicas: 1
  modelName: my-model
  endpointName: my-endpoint
  instanceType: ml.p4d.24xlarge
  modelSourceConfig:
    modelSourceType: s3
    s3Storage:
      bucketName: my-model-bucket
      region: us-east-2
    modelLocation: model-path
  worker:
    resources:
      requests:
        nvidia.com/mig-3g.20gb: 1
        cpu: "5600m"
        memory: "10Gi"
      limits:
        nvidia.com/mig-3g.20gb: 1
```

### Deploy model from FSx for Lustre using InferenceEndpointConfig
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-fsx"></a>

InferenceEndpointConfig allows you to deploy a custom model from FSx for Lustre. To deploy a model on a MIG partition, specify the MIG profile in the `requests` and `limits` of `spec.worker.resources`, as in the following example:

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: custom-model
  namespace: default
spec:
  replicas: 1
  modelName: my-model
  endpointName: my-endpoint
  instanceType: ml.p4d.24xlarge
  modelSourceConfig:
    modelSourceType: fsx
    fsxStorage:
      fileSystemId: fs-xxxxx
    modelLocation: location-on-fsx
  worker:
    resources:
      requests:
        nvidia.com/mig-3g.20gb: 1
        cpu: "5600m"
        memory: "10Gi"
      limits:
        nvidia.com/mig-3g.20gb: 1
```

### Using Studio Classic UI
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-studio"></a>

#### Deploying JumpStart Models with MIG
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-studio-deploy"></a>

1. Open **Studio Classic** and navigate to **JumpStart**

1. Browse or search for your desired model (for example, "DeepSeek" or "Llama")

1. Click on the model card and select **Deploy**

1. In the deployment configuration:
   + Choose **HyperPod** as the deployment target
   + Select your MIG-enabled cluster from the dropdown
   + Under **Instance configuration**:
     + Select instance type (e.g., `ml.p4d.24xlarge`)
     + Choose **GPU Partition Type** from available options
     + Configure **Instance count** and **Auto-scaling** settings

1. Review and click **Deploy**

1. Monitor deployment progress in the **Endpoints** section

#### Model Configuration Options
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-studio-config"></a>

**Endpoint Settings:**
+ **Endpoint name** - Unique identifier for your deployment
+ **Variant name** - Configuration variant (default: AllTraffic)
+ **Instance type** - Must support GPU partitioning (P series)
+ **MIG profile** - The GPU partition profile to apply
+ **Initial instance count** - Number of instances to deploy
+ **Auto-scaling** - Enable for dynamic scaling based on traffic

**Advanced Configuration:**
+ **Model data location** - Amazon S3 path for custom models
+ **Container image** - Custom inference container (optional)
+ **Environment variables** - Model-specific configurations
+ **Amazon VPC configuration** - Network isolation settings

#### Monitoring Deployed Models
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-studio-monitor"></a>

1. Navigate to **Studio Classic** > **Deployments** > **Endpoints**

1. Select your MIG-enabled endpoint

1. View metrics including:
   + **MIG utilization** - Per GPU partition usage
   + **Memory consumption** - Per GPU partition
   + **Inference latency** - Request processing time
   + **Throughput** - Requests per second

1. Set up **Amazon CloudWatch alarms** for automated monitoring

1. Configure **auto-scaling policies** based on MIG utilization

## Using HyperPod CLI
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-hyperpod-cli"></a>

### JumpStart Deployment
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-hyperpod-cli-jumpstart"></a>

The HyperPod CLI JumpStart command includes two new fields for MIG support:
+ `--accelerator-partition-type` - Specifies the MIG configuration (e.g., `mig-4g.20gb`)
+ `--accelerator-partition-validation` - Validates compatibility between the model and the MIG profile (default: `true`)

```
hyp create hyp-jumpstart-endpoint \
  --version 1.1 \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --instance-type ml.p4d.24xlarge \
  --endpoint-name js-test \
  --accelerator-partition-type "mig-4g.20gb" \
  --accelerator-partition-validation true \
  --tls-certificate-output-s3-uri s3://my-bucket/certs/
```

### Custom Endpoint Deployment
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-hyperpod-cli-custom"></a>

For deploying via custom endpoint, use the existing fields `--resources-requests` and `--resources-limits` to enable MIG profile functionality:

```
hyp create hyp-custom-endpoint \
  --namespace default \
  --metadata-name deepseek15b-mig-10-14-v2 \
  --endpoint-name deepseek15b-mig-endpoint \
  --instance-type ml.p4d.24xlarge \
  --model-name deepseek15b-mig \
  --model-source-type s3 \
  --model-location deep-seek-15b \
  --prefetch-enabled true \
  --tls-certificate-output-s3-uri s3://sagemaker-bucket \
  --image-uri lmcache/vllm-openai:v0.3.7 \
  --container-port 8080 \
  --model-volume-mount-path /opt/ml/model \
  --model-volume-mount-name model-weights \
  --s3-bucket-name model-storage-123456789 \
  --s3-region us-east-2 \
  --invocation-endpoint invocations \
  --resources-requests '{"cpu":"5600m","memory":"10Gi","nvidia.com/mig-3g.20gb":"1"}' \
  --resources-limits '{"nvidia.com/mig-3g.20gb":"1"}' \
  --env '{
    "OPTION_ROLLING_BATCH":"vllm",
    "SERVING_CHUNKED_READ_TIMEOUT":"480",
    "DJL_OFFLINE":"true",
    "NUM_SHARD":"1",
    "SAGEMAKER_PROGRAM":"inference.py",
    "SAGEMAKER_SUBMIT_DIRECTORY":"/opt/ml/model/code",
    "MODEL_CACHE_ROOT":"/opt/ml/model",
    "SAGEMAKER_MODEL_SERVER_WORKERS":"1",
    "SAGEMAKER_MODEL_SERVER_TIMEOUT":"3600",
    "OPTION_TRUST_REMOTE_CODE":"true",
    "OPTION_ENABLE_REASONING":"true",
    "OPTION_REASONING_PARSER":"deepseek_r1",
    "SAGEMAKER_CONTAINER_LOG_LEVEL":"20",
    "SAGEMAKER_ENV":"1"
  }'
```

# Cluster resiliency features for SageMaker HyperPod cluster orchestration with Amazon EKS
<a name="sagemaker-hyperpod-eks-resiliency"></a>

SageMaker HyperPod provides the following cluster resiliency features. 

**Topics**
+ [Health Monitoring System](sagemaker-hyperpod-eks-resiliency-health-monitoring-agent.md)
+ [Basic health checks](sagemaker-hyperpod-eks-resiliency-basic-health-check.md)
+ [Deep health checks](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md)
+ [Automatic node recovery](sagemaker-hyperpod-eks-resiliency-node-recovery.md)
+ [Resilience-related Kubernetes labels by SageMaker HyperPod](sagemaker-hyperpod-eks-resiliency-node-labels.md)
+ [Manually quarantine, replace, or reboot a node](sagemaker-hyperpod-eks-resiliency-manual.md)
+ [Suggested resilience configurations](sagemaker-hyperpod-eks-resiliency-config-tips.md)

# Health Monitoring System
<a name="sagemaker-hyperpod-eks-resiliency-health-monitoring-agent"></a>

The SageMaker HyperPod health monitoring system includes two components: 

1. Monitoring agents installed on your nodes, which include the Health Monitoring Agent (HMA) that serves as an on-host health monitor, and a set of out-of-node health monitors.

1. The Node Recovery System, managed by SageMaker HyperPod. The health monitoring system continuously monitors node health through the monitoring agents and, when a fault is detected, automatically takes action through the Node Recovery System. 

![\[This image illustrates how the health monitoring system integrates with the HyperPod cluster.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-resilience-event.png)


## Health checks done by the SageMaker HyperPod health-monitoring agent
<a name="sagemaker-hyperpod-eks-resiliency-health-monitoring-agent-list-of-checks"></a>

The SageMaker HyperPod health-monitoring agent checks the following.

**NVIDIA GPUs**
+ [DCGM policy violation notifications](https://docs.nvidia.com/datacenter/dcgm/3.0/user-guide/feature-overview.html#notifications)
+ Errors in the `nvidia-smi` output
+ Various errors in the logs generated by the Amazon Elastic Compute Cloud (EC2) platform
+ GPU Count validation — if there’s a mismatch between the expected number of GPUs in a particular instance type (for example: 8 GPUs in ml.p5.48xlarge instance type) and the count returned by `nvidia-smi`, then HMA reboots the node 

**AWS Trainium**
+ Errors in the output from the [AWS Neuron monitor](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html)
+ Outputs generated by the Neuron node problem detector (For more information about the AWS Neuron node problem detector, see [Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters](https://aws.amazon.com/blogs/machine-learning/node-problem-detection-and-recovery-for-aws-neuron-nodes-within-amazon-eks-clusters/).)
+ Various errors in the logs generated by the Amazon EC2 platform
+ Neuron device count validation — if there’s a mismatch between the expected number of Neuron devices in a particular instance type and the count returned by `neuron-ls`, then HMA reboots the node

The above checks are passive, background health checks that HyperPod runs continuously on your nodes. In addition to these checks, HyperPod also runs deep (or active) health checks during the creation and update of HyperPod clusters. Learn more about [Deep health checks](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-resiliency-deep-health-checks.html).

## Fault Detection
<a name="sagemaker-hyperpod-eks-resiliency-health-monitoring-fault-detection"></a>

When SageMaker HyperPod detects a fault, it implements a four-part response:

1. **Node Labels**

   1. Health Status: `sagemaker.amazonaws.com/node-health-status`

   1. Fault Type: `sagemaker.amazonaws.com/fault-types` label for high-level categorization

   1. Fault Reason: `sagemaker.amazonaws.com/fault-reasons` label for detailed fault information

1. **Node Taint**

   1. `sagemaker.amazonaws.com/node-health-status=Unschedulable:NoSchedule`

1. **Node Annotation**

   1. Fault details: `sagemaker.amazonaws.com/fault-details`

   1. Records up to 20 faults with timestamps that occurred on the node

1. **Node Conditions**([Kubernetes Node Condition](https://kubernetes.io/docs/reference/node/node-status/#condition))

   1. Reflects current health status in node conditions:
      + Type: Same as fault type
      + Status: `True`
      + Reason: Same as fault reason
      + LastTransitionTime: Fault occurrence time

![\[This image illustrates how the health monitoring system works when a fault is detected.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-resilience-workflow.png)
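
The labels, taints, and annotations above make faulty nodes easy to locate with standard Kubernetes selectors; for example (label and annotation keys are as described above):

```
# List nodes currently marked unhealthy
kubectl get nodes -l sagemaker.amazonaws.com/node-health-status=Unschedulable

# Inspect the recorded fault details on a node
kubectl get node NODE_NAME \
  -o jsonpath='{.metadata.annotations.sagemaker\.amazonaws\.com\.rproxy\.goskope\.com/fault-details}'
```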


## Logs generated by the SageMaker HyperPod health-monitoring agent
<a name="sagemaker-hyperpod-eks-resiliency-health-monitoring-agent-health-check-results"></a>

The SageMaker HyperPod health-monitoring agent is an out-of-the-box health check feature and continuously runs on all HyperPod clusters. The health monitoring agent publishes detected health events on GPU or Trn instances to CloudWatch under the Cluster log group `/aws/sagemaker/Clusters/`.

The detection logs from the HyperPod health monitoring agent are created as separate log streams named `SagemakerHealthMonitoringAgent` for each node. You can query the detection logs using CloudWatch Logs Insights as follows.

```
fields @timestamp, @message
| filter @message like /HealthMonitoringAgentDetectionEvent/
```

This should return an output similar to the following.

```
2024-08-21T11:35:35.532-07:00
    {"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z","reason":"XidHardwareFailure","message":"Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2024-08-21T11:35:35.532-07:00
    {"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z","reason":"XidHardwareFailure","message":"Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
```

# Basic health checks
<a name="sagemaker-hyperpod-eks-resiliency-basic-health-check"></a>

SageMaker HyperPod performs a set of *basic health checks* on cluster instances during the creation and update of HyperPod clusters. These basic health checks are orchestrator-agnostic, so they apply regardless of the underlying orchestration platform supported by SageMaker HyperPod (Amazon EKS or Slurm).

The basic health checks monitor cluster instances for issues related to devices such as accelerators (GPU and Trainium cores) and network devices (Elastic Fabric Adapter, or EFA). To find the list of basic cluster health checks, see [Cluster health checks](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm.html#sagemaker-hyperpod-resiliency-slurm-cluster-health-check).

# Deep health checks
<a name="sagemaker-hyperpod-eks-resiliency-deep-health-checks"></a>

SageMaker HyperPod performs *deep health checks* on cluster instances during the creation and update of HyperPod clusters. You can also request deep health checks on-demand for a SageMaker HyperPod cluster using [StartClusterHealthCheck](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StartClusterHealthCheck.html) API. The deep health checks ensure the reliability and stability of the SageMaker HyperPod clusters by testing the underlying hardware and infrastructure components. This proactive approach helps identify and mitigate potential issues early in the cluster lifecycle.

## List of deep health checks done by SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-resiliency-deep-health-checks-list"></a>

SageMaker HyperPod runs the following deep health checks.

**Instance-level deep health checks**


| Category | Utility name | Instance type compatibility | Description | 
| --- | --- | --- | --- | 
| Accelerator | GPU/NVLink count | GPU | Verifies GPU/NVLink counts. | 
| Accelerator | [DCGM diagnostics](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html) level 4 | GPU | Assesses the health and functionality of NVIDIA GPUs by running DCGM (NVIDIA Data Center GPU Manager) diagnostics at level 4, including additional memory tests. | 
| Accelerator | Neuron sysfs | Trainium | For Trainium-powered instances, the health of the Neuron devices is determined by reading counters from [Neuron sysfs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-sysfs-user-guide.html) propagated directly by the Neuron driver. | 
| Accelerator | Neuron hardware check | Trainium | Runs a training workload and verifies the results to test the hardware. | 
| Accelerator | NCCOM local test | Trainium | Evaluates the performance of collective communication operations on a single Trainium node. | 
| Network | EFA | GPU and Trainium | Runs latency and bandwidth benchmarking on the attached EFA device. | 

**Cluster-level deep health checks**


| Category | Utility name | Instance type compatibility | Description | 
| --- | --- | --- | --- | 
| Accelerator | NCCL test | GPU | Verifies the performance of collective communication operations on multiple NVIDIA GPUs. | 
| Accelerator | NCCOM cluster test | Trainium | Verifies the performance of collective communication operations on multiple Trainium nodes. | 

**Deep health checks with flexible instance groups**  
For instance groups that use `InstanceRequirements` with multiple instance types, deep health checks behave as follows:
+ Instance-level deep health checks run only on eligible GPU instance types. CPU instance types within a flexible instance group are skipped.
+ Cluster-level connectivity tests (such as NCCL AllReduce) run only between instances of the same type within the instance group. This ensures accurate test results that reflect the networking capabilities of each instance type.
+ If deep health checks are enabled, at least one instance type in the flexible instance group must support deep health checks.

## Logs from the deep health checks
<a name="sagemaker-hyperpod-eks-resiliency-deep-health-checks-log"></a>

The following are example logs from the SageMaker HyperPod deep health checks.

**Cluster-level logs** 

The cluster-level deep health check logs are stored in your CloudWatch log group at `/aws/sagemaker/Clusters/<cluster_name>/<cluster_id>`.

The log streams are named `DeepHealthCheckResults/<log_stream_id>`.
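You can list these log streams from the command line with the AWS CLI. The following is a minimal sketch; the cluster name and ID are illustrative placeholders, and the call assumes AWS credentials are configured.

```
# Build the log group name described above (placeholders are illustrative)
CLUSTER_NAME="my-cluster"
CLUSTER_ID="abcde12345"
LOG_GROUP="/aws/sagemaker/Clusters/${CLUSTER_NAME}/${CLUSTER_ID}"
echo "$LOG_GROUP"

# List the deep health check log streams (skipped if the AWS CLI is absent)
if command -v aws >/dev/null 2>&1; then
  aws logs describe-log-streams \
      --log-group-name "$LOG_GROUP" \
      --log-stream-name-prefix "DeepHealthCheckResults/" || true
fi
```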

As shown in the following example, the deep health check output logs include the ID of the instance that failed the checks and the cause of the failure.

```
{
    "level": "error",
    "ts": "2024-06-18T21:15:22Z",
    "msg": "Encountered FaultyInstance. Replace the Instance. Region: us-west-2, InstanceType: p4d.24xlarge. ERROR:Bandwidth has less than threshold: Expected minimum threshold :80,NCCL Test output Bw: 30"
}
```

**Instance-level logs** 

The instance-level deep health check logs are stored at `/var/log/aws/clusters/sagemaker-deep-health-check.log` on each node. SSH into the node and open the log file by running the following command.

```
cat /var/log/aws/clusters/sagemaker-deep-health-check.log
```

The following is an example output of the hardware stress, [NVIDIA DCGM](https://developer.nvidia.com/dcgm) stress, and EFA connectivity test.

```
# Hardware Stress Test output

2024-08-20T21:53:58Z info Executing Hardware stress check with command: stress-ng, and args: [--cpu 32 --vm 2 --hdd 1 --fork 8 --switch 4 --timeout 60 --metrics]

2024-08-20T21:54:58Z info stress-ng success

2024-08-20T21:54:58Z    info    GpuPci Count check success

# DCGM Stress Test

2024-08-20T22:25:02Z    info    DCGM diagnostic health summary: dcgmCheckLevel: 0 dcgmVersion: 3.3.7 gpuDriverVersion: 535.183.01, gpuDeviceIds: [2237] replacementRequired: false rebootRequired:false

# EFA Loopback Test

2024-08-20T22:26:28Z    info    EFA Loopback check passed for device: rdmap0s29 . Output summary is MaxBw: 58.590000, AvgBw: 32.420000, MaxTypicalLat: 30.870000, MinTypicalLat: 20.080000, AvgLat: 21.630000
```
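Instead of reading the whole file, you can scan it for error-level entries. The following is a minimal sketch; a non-zero count suggests the node failed a check.

```
# Count error-level lines in the instance-level deep health check log
LOG=/var/log/aws/clusters/sagemaker-deep-health-check.log
errors=$(grep -ci "error" "$LOG" || true)
echo "error lines: ${errors:-0}"
```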

The following is an example output of the NCCL connectivity test.

```
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong

#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

           8             2     float     sum      -1    353.9    0.00    0.00      0    304.2    0.00    0.00      0
          16             4     float     sum      -1    352.8    0.00    0.00      0    422.9    0.00    0.00      0
          32             8     float     sum      -1    520.0    0.00    0.00      0    480.3    0.00    0.00      0
          64            16     float     sum      -1    563.0    0.00    0.00      0    416.1    0.00    0.00      0
         128            32     float     sum      -1    245.1    0.00    0.00      0    308.4    0.00    0.00      0
         256            64     float     sum      -1    310.8    0.00    0.00      0    304.9    0.00    0.00      0
         512           128     float     sum      -1    304.9    0.00    0.00      0    300.8    0.00    0.00      0
        1024           256     float     sum      -1    509.3    0.00    0.00      0    495.4    0.00    0.00      0
        2048           512     float     sum      -1    530.3    0.00    0.00      0    420.0    0.00    0.00      0
        4096          1024     float     sum      -1    391.2    0.01    0.01      0    384.5    0.01    0.01      0
        8192          2048     float     sum      -1    328.5    0.02    0.02      0    253.2    0.03    0.03      0
       16384          4096     float     sum      -1    497.6    0.03    0.03      0    490.9    0.03    0.03      0
       32768          8192     float     sum      -1    496.7    0.07    0.07      0    425.0    0.08    0.08      0
       65536         16384     float     sum      -1    448.0    0.15    0.15      0    501.0    0.13    0.13      0
      131072         32768     float     sum      -1    577.4    0.23    0.23      0    593.4    0.22    0.22      0
      262144         65536     float     sum      -1    757.8    0.35    0.35      0    721.6    0.36    0.36      0
      524288        131072     float     sum      -1   1057.1    0.50    0.50      0   1019.1    0.51    0.51      0
     1048576        262144     float     sum      -1   1460.5    0.72    0.72      0   1435.6    0.73    0.73      0
     2097152        524288     float     sum      -1   2450.6    0.86    0.86      0   2583.1    0.81    0.81      0
     4194304       1048576     float     sum      -1   4344.5    0.97    0.97      0   4419.3    0.95    0.95      0
     8388608       2097152     float     sum      -1   8176.5    1.03    1.03      0   8197.8    1.02    1.02      0
    16777216       4194304     float     sum      -1    15312    1.10    1.10      0    15426    1.09    1.09      0
    33554432       8388608     float     sum      -1    30149    1.11    1.11      0    29941    1.12    1.12      0
    67108864      16777216     float     sum      -1    57819    1.16    1.16      0    58635    1.14    1.14      0
   134217728      33554432     float     sum      -1   115699    1.16    1.16      0   115331    1.16    1.16      0
   268435456      67108864     float     sum      -1   227507    1.18    1.18      0   228047    1.18    1.18      0
   536870912     134217728     float     sum      -1   453751    1.18    1.18      0   456595    1.18    1.18      0
  1073741824     268435456     float     sum      -1   911719    1.18    1.18      0   911808    1.18    1.18      0
  2147483648     536870912     float     sum      -1  1804971    1.19    1.19      0  1806895    1.19    1.19      0

2024-08-20T16:22:43.831-07:00

# Out of bounds values : 0 OK

2024-08-20T16:22:43.831-07:00

# Avg bus bandwidth    : 0.488398 

2024-08-20T23:22:43Z    info    Nccl test successful. Summary: NcclMaxAlgoBw: 1.190000, NcclAvgAlgoBw: 0.488398, NcclThresholdAlgoBw: 1.180000, NcclOutOfBoundError: OK, NcclOperations: all_reduce_perf, NcclTotalDevices: 2, NcclNodes: 2, NcclClusterMessage:
```
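To use the summary numbers in scripts, you can extract them with `awk`. The following is a minimal sketch; `nccl-test.log` is a hypothetical file containing output like the above.

```
# Pull the average bus bandwidth out of an NCCL test log
avg=$(awk -F': ' '/Avg bus bandwidth/ {print $2}' nccl-test.log | tr -d ' ')
echo "avg bus bandwidth: ${avg:-unknown} GB/s"
```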

# Automatic node recovery
<a name="sagemaker-hyperpod-eks-resiliency-node-recovery"></a>

During cluster creation or update, cluster admin users can select the node (instance) recovery option between `Automatic` (Recommended) and `None` at the cluster level. If set to `Automatic`, SageMaker HyperPod reboots or replaces faulty nodes automatically. 

**Important**  
We recommend setting the `Automatic` option.

Automatic node recovery runs when issues are found by the health-monitoring agent, basic health checks, or deep health checks. If set to `None`, the health-monitoring agent labels the instances when a fault is detected, but it does not automatically initiate any repair or recovery actions on the affected nodes. This option is not recommended.

# Resilience-related Kubernetes labels by SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-resiliency-node-labels"></a>

*Labels* are key-value pairs that are attached to [Kubernetes objects](https://kubernetes.io/docs/concepts/overview/working-with-objects/#kubernetes-objects). SageMaker HyperPod introduces the following labels for the health checks it provides.

## Node health status labels
<a name="sagemaker-hyperpod-eks-resiliency-node-labels-health-status"></a>

The `node-health-status` labels represent the health status of a node and can be used as a node selector to filter for healthy nodes.


| Label | Description | 
| --- | --- | 
| sagemaker.amazonaws.com/node-health-status: Schedulable | The node has passed basic health checks and is available for running workloads. This health check is the same as the [currently available SageMaker HyperPod resiliency features for Slurm clusters](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm.html). | 
| sagemaker.amazonaws.com/node-health-status: Unschedulable | The node is running deep health checks and is not available for running workloads. | 
| sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReplacement | The node has failed deep health checks or health-monitoring agent checks and requires a replacement. If automatic node recovery is enabled, the node will be automatically replaced by SageMaker HyperPod. | 
| sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReboot | The node has failed deep health checks or health-monitoring agent checks and requires a reboot. If automatic node recovery is enabled, the node will be automatically rebooted by SageMaker HyperPod. | 
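Because these are standard Kubernetes labels, you can use them in a `nodeSelector` so that workloads are scheduled only on healthy nodes. The following pod spec is a minimal sketch; the pod name and container image are illustrative.

```
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  nodeSelector:
    sagemaker.amazonaws.com/node-health-status: Schedulable
  containers:
    - name: train
      image: my-training-image:latest
```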

## Deep health check labels
<a name="sagemaker-hyperpod-eks-resiliency-node-labels-deep-health-check"></a>

The `deep-health-check-status` labels represent the progress of deep health checks on a specific node. They help Kubernetes users quickly filter nodes by the progress of the overall deep health checks.


| Label | Description | 
| --- | --- | 
| sagemaker.amazonaws.com/deep-health-check-status: InProgress | The node is running deep health checks and is not available for running workloads. | 
| sagemaker.amazonaws.com/deep-health-check-status: Passed | The node has successfully completed deep health checks and health-monitoring agent checks, and is available for running workloads. | 
| sagemaker.amazonaws.com/deep-health-check-status: Failed | The node has failed deep health checks or health-monitoring agent checks and requires a reboot or replacement. If automatic node recovery is enabled, the node will be automatically rebooted or replaced by SageMaker HyperPod. | 
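To check overall progress from the command line, you can filter nodes on this label. The following is a minimal sketch; it assumes `kubectl` is configured for the cluster's EKS control plane.

```
# List the nodes that are still running deep health checks
selector="sagemaker.amazonaws.com/deep-health-check-status=InProgress"
if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes -l "$selector" || true  # tolerate missing cluster access
fi
```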

## Fault type and reason labels
<a name="sagemaker-hyperpod-eks-resiliency-node-labels-fault-type-and-reason"></a>

The following describes the `fault-type` and `fault-reason` labels.
+ `fault-type` labels represent high-level fault categories when health checks fail. These are populated for failures identified during both deep health and health-monitoring agent checks.
+ `fault-reason` labels represent the detailed fault reason associated with a `fault-type`.

## How SageMaker HyperPod labels
<a name="sagemaker-hyperpod-eks-resiliency-node-how-it-labels"></a>

The following topics cover how labeling is done depending on various cases.

**Topics**
+ [When a node is added to a SageMaker HyperPod cluster with deep health check config disabled](#sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-dhc-is-off)
+ [When a node is added to a SageMaker HyperPod cluster with deep health check config enabled](#sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-dhc-is-on)
+ [When there are any compute failures on nodes](#sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-node-fails)

### When a node is added to a SageMaker HyperPod cluster with deep health check config disabled
<a name="sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-dhc-is-off"></a>

When a new node is added into a cluster, and if deep health check is not enabled for the instance group, SageMaker HyperPod runs the same health checks as the [currently available SageMaker HyperPod health checks for Slurm clusters](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm.html). 

If the health check passes, the nodes will be marked with the following label.

```
sagemaker.amazonaws.com/node-health-status: Schedulable
```

If the health check doesn't pass, the nodes will be terminated and replaced. This behavior is the same as the way SageMaker HyperPod health check works for Slurm clusters. 

### When a node is added to a SageMaker HyperPod cluster with deep health check config enabled
<a name="sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-dhc-is-on"></a>

When a new node is added to a SageMaker HyperPod cluster, and the deep health check test is enabled for the instance group, HyperPod first taints the node and starts the 1 to 2 hour deep health check and stress test on the node. There are three possible node label outcomes after the deep health check.

1. When the deep health check test passes

   ```
   sagemaker.amazonaws.com/node-health-status: Schedulable
   ```

1. When the deep health check test fails, and the instance needs to be replaced

   ```
   sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReplacement
   ```

1. When the deep health check test fails, and the instance needs to be rebooted to rerun the deep health check

   ```
   sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReboot
   ```

If an instance fails the deep health check test, the instance will always be replaced. If the deep health check test succeeds, the taint on the node is removed.

### When there are any compute failures on nodes
<a name="sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-node-fails"></a>

The SageMaker HyperPod health-monitoring agent also continuously monitors the health status of each node. When it detects a failure (such as a GPU failure or driver crash), the agent marks the node with one of the following labels.

1. When the node is unhealthy and needs to be replaced

   ```
   sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReplacement
   ```

1. When the node is unhealthy and needs to be rebooted

   ```
   sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReboot
   ```

The health-monitoring agent also taints the node when it detects any node health issues.

# Manually quarantine, replace, or reboot a node
<a name="sagemaker-hyperpod-eks-resiliency-manual"></a>

Learn how to manually quarantine, replace, and reboot a faulty node in SageMaker HyperPod clusters orchestrated with Amazon EKS.

**To quarantine a node and force delete a training pod**

```
kubectl cordon <node-name>
```

After quarantining the node, force delete the pod. This is useful when a pod is stuck terminating for more than 30 minutes or when `kubectl describe pod` shows `Node is not ready` in the events.

```
kubectl delete pods <pod-name> --grace-period=0 --force
```

SageMaker HyperPod offers two methods for manual node recovery. The preferred approach is using the SageMaker HyperPod Reboot and Replace APIs, which provide a faster and more transparent recovery process that works across all orchestrators. Alternatively, you can use `kubectl` commands to label nodes for reboot and replace operations. Both methods activate the same SageMaker HyperPod recovery processes.

**To reboot a node using the Reboot API**

To reboot a node, you can use the `BatchRebootClusterNodes` API.

The following example runs the reboot operation on two instances of a cluster using the AWS Command Line Interface (AWS CLI):

```
 aws sagemaker batch-reboot-cluster-nodes \
        --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
        --node-ids i-0123456789abcdef0 i-0fedcba9876543210
```

**To replace a node using the Replace API**

To replace a node, you can use the `BatchReplaceClusterNodes` API.

The following example runs the replace operation on two instances of a cluster using the AWS CLI:

```
 aws sagemaker batch-replace-cluster-nodes \
        --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
        --node-ids i-0123456789abcdef0 i-0fedcba9876543210
```

**Karpenter-managed clusters**  
For SageMaker HyperPod clusters using Karpenter for node provisioning, the `BatchReplaceClusterNodes` API does not guarantee that a replacement node will be created. The specified node *will* be terminated, but replacement depends on Karpenter's pod-demand-based provisioning model. Karpenter only creates new nodes when there are pods in a `Pending` state that cannot be scheduled on existing nodes.  
If the workload from the deleted node can be rescheduled onto remaining nodes in the cluster (for example, if those nodes have sufficient capacity), Karpenter does not provision a replacement. To ensure a replacement node is created, verify that your workload configuration (such as pod anti-affinity rules or resource requests) requires a new node for the displaced pods.  
We are aware of this limitation and are actively working on a solution to enforce node replacement when requested through the API.

**To replace a node using kubectl**

Label the node to replace with `sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement`, which triggers the SageMaker HyperPod [Automatic node recovery](sagemaker-hyperpod-eks-resiliency-node-recovery.md). Note that you also need to activate automatic node recovery during cluster creation or update.

```
kubectl label nodes <node-name> \
   sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement
```

**To reboot a node using kubectl**

Label the node to reboot with `sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot`, which triggers the SageMaker HyperPod [Automatic node recovery](sagemaker-hyperpod-eks-resiliency-node-recovery.md). Note that you also need to activate automatic node recovery during cluster creation or update.

```
kubectl label nodes <node-name> \
   sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot
```

After the label `UnschedulablePendingReplacement` or `UnschedulablePendingReboot` is applied, you should see the node terminated or rebooted within a few minutes. 

# Suggested resilience configurations
<a name="sagemaker-hyperpod-eks-resiliency-config-tips"></a>

When the deep health checks are enabled, whenever a new instance is added to the HyperPod cluster (either during cluster creation or through automatic node replacement), the new instance goes through the deep health check process (instance-level stress tests), which takes about a couple of hours. The following are suggested resilience config combinations depending on possible cases.

1. **Case**: When you have additional spare nodes within a cluster as back-up resources (not using the full capacity), or if you can wait for about 2 hours for the deep health check process to get the less error-prone instances.

   **Recommendation**: Enable the deep health check config throughout the cluster lifecycle. Node auto-recovery config is enabled by default.

1. **Case**: When you don't have additional backup nodes (capacity is fully used for some training load). You want to get the replacement nodes as soon as possible to resume the training job. 

   **Recommendation**: Enable the deep health check during cluster creation, then turn off the deep health check config after the cluster is created. Node auto-recovery config is enabled by default.

1. **Case**: When you don't have additional backup nodes, and you don't want to wait for the 1 to 2 hour deep health check process (small clusters).

   **Recommendation**: Disable the deep health check config throughout the cluster lifecycle. Node auto-recovery config is enabled by default.

If you want to resume the training job from a failure immediately, make sure that you have additional spare nodes as backup resources in the cluster.

# Spot instances in Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-spot"></a>

Amazon SageMaker HyperPod supports Amazon EC2 Spot Instances, enabling significant cost savings for fault-tolerant and stateless AI/ML workloads. Use cases include batch inference and training jobs, hyperparameter tuning, and experimental workloads. You can also use Spot Instances to automatically scale your compute capacity when this low-cost capacity is available and scale back to On-Demand capacity when the added Spot capacity is reclaimed.

By default, Spot Instances on HyperPod work with HyperPod’s [continuous provisioning feature](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-scaling-eks.html), which enables SageMaker HyperPod to automatically provision remaining capacity in the background while workloads start immediately on available instances. When node provisioning encounters failures due to capacity constraints or other issues, SageMaker HyperPod automatically retries in the background until clusters reach their desired scale, so your autoscaling operations remain resilient and non-blocking. You can also use Spot Instances with [Karpenter-based](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling.html) autoscaling.

**Key Capabilities & Concepts to consider**
+ Capture up to 90% cost savings compared to On-Demand instances
+ Use Spot Instances for jobs that can handle interruptions and where job start and completion times are flexible
+ When using Karpenter for automatic scaling, you can configure HyperPod to automatically fall back to On-Demand when Spot capacity is interrupted or unavailable
+ Access a wide range of CPU, GPU, and accelerator instance types supported by HyperPod
+ Capacity availability depends on supply from EC2 and varies by region and instance type
+ You can perform various actions such as identifying the likelihood of obtaining desired instances or getting interrupted, using various tools such as [Spot Instance Advisor](https://aws.amazon.com/ec2/spot/instance-advisor/) provided by EC2
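The Spot-with-On-Demand-fallback behavior mentioned above is expressed in Karpenter through the `karpenter.sh/capacity-type` requirement. The following NodePool fragment is a minimal sketch under the assumption of a standard Karpenter setup; when both values are allowed, Karpenter generally prefers the cheaper Spot capacity and falls back to On-Demand.

```
# Fragment of a Karpenter NodePool spec (illustrative)
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]
```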

## Getting started
<a name="sagemaker-hyperpod-spot-instance-getstart"></a>

### Prerequisites
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq"></a>

Before you begin, ensure you have:

#### AWS CLI installed and configured
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq-cli"></a>

Set up your AWS credentials and region:

```
aws configure
```

Refer to the [AWS credentials documentation](https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html) for detailed instructions.

#### IAM Role for SageMaker HyperPod execution
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq-iam"></a>

To update the cluster, you must first create [AWS Identity and Access Management](https://aws.amazon.com/iam/) (IAM) permissions for Karpenter. For instructions, see [Create an IAM role for HyperPod autoscaling with Karpenter](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling-iam.html).

#### VPC and EKS Cluster Setup
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq-cluster"></a>

**2.1 Create VPC and EKS Cluster**

Follow the [HyperPod EKS setup guide](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-install-packages-using-helm-chart.html) to:

1. Create a VPC with subnets in multiple Availability Zones

1. Create an EKS cluster

1. Install [required dependencies](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-install-packages-using-helm-chart.html) using Helm charts

**2.2 Set Environment Variables**

```
export EKS_CLUSTER_ARN="arn:aws:eks:REGION:ACCOUNT_ID:cluster/CLUSTER_NAME"
export EXECUTION_ROLE="arn:aws:iam::ACCOUNT_ID:role/SageMakerExecutionRole"
export BUCKET_NAME="your-s3-bucket-name"
export SECURITY_GROUP="sg-xxxxx"
export SUBNET="subnet-xxxxx"
export SUBNET1="subnet-xxxxx"
export SUBNET2="subnet-xxxxx"
export SUBNET3="subnet-xxxxx"
```

#### Service quotas for the Spot instances
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq-quota"></a>

Verify you have the required quotas for the instances you will create in the SageMaker HyperPod cluster. To review your quotas, on the Service Quotas console, choose AWS services in the navigation pane, then choose SageMaker. For example, the following screenshot shows the available quota for c5 instances.

![\[An image containing cost region information.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Screenshot-cluster-quota.png)


#### Check Spot Availability
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq-availability"></a>

Before creating Spot instance groups, check availability in different Availability Zones:

```
aws ec2 get-spot-placement-scores \
  --region us-west-2 \
  --instance-types c5.2xlarge \
  --target-capacity 10 \
  --single-availability-zone \
  --region-names us-west-2
```

**Tip**: Target Availability Zones with higher placement scores for better availability. You can also check the Spot Instance Advisor and EC2 Spot pricing. Choose the Availability Zone with the best placement score, then configure the instance group with the associated subnet so that instances launch in that zone.

### Creating an Instance Group (No Autoscaling)
<a name="sagemaker-hyperpod-spot-instance-getstart-create"></a>

**CreateCluster (Spot)**

```
aws sagemaker create-cluster \
    --cluster-name clusterNameHere \
    --orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
    --node-provisioning-mode "Continuous" \
    --cluster-role 'arn:aws:iam::YOUR-ACCOUNT-ID:role/SageMakerHyperPodRole' \
    --instance-groups '[{
        "InstanceGroupName": "auto-spot-c5-2x-az1",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 2,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
        "SourceS3Uri": "s3://'$BUCKET_NAME'",
        "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET1'"]
         }
    }]' \
    --vpc-config '{
        "SecurityGroupIds": ["'$SECURITY_GROUP'"],
        "Subnets": ["'$SUBNET'"] 
    }'
```

**Update Cluster (Spot + On-Demand)**

```
aws sagemaker update-cluster \
   --cluster-name "my-cluster" \
   --instance-groups '[{
        "InstanceGroupName": "auto-spot-c5-x-az3",
        "InstanceType": "ml.c5.xlarge",
        "InstanceCount": 2,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$BUCKET_NAME'",
            "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET3'"]
        }
    },
    {
        "InstanceGroupName": "auto-spot-c5-2x-az2",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 2,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
        "SourceS3Uri": "s3://'$BUCKET_NAME'",
        "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET2'"]
         }
    },
    {   
        "InstanceGroupName": "auto-ondemand-c5-2x-az1",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 2,
        "LifeCycleConfig": {
        "SourceS3Uri": "s3://'$BUCKET_NAME'",
        "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET1'"]
         }
    }]'
```

`CapacityRequirements` cannot be modified once an Instance Group is created.

**Describe Cluster**

```
aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region us-west-2
```

```
## Sample Response
{
    "ClusterName": "my-cluster",
    "InstanceGroups": [
        {
            "InstanceGroupName": "ml.c5.2xlarge",
            "InstanceType": "ml.c5.2xlarge",
            "InstanceCount": 5,
            "CurrentCount": 3,
            "CapacityRequirements": { "Spot": {} },
            "ExecutionRole": "arn:aws:iam::account:role/SageMakerExecutionRole",
            "InstanceStorageConfigs": [...],
            "OverrideVpcConfig": {...}
        }
        // Other IGs
    ]
}
```

**DescribeClusterNode**

```
aws sagemaker describe-cluster-node --cluster-name $HP_CLUSTER_NAME --node-id i-1234567890abcdef1 --region us-west-2
```

```
## Sample Response
{
  "NodeDetails": {
    "InstanceId": "i-1234567890abcdef1",
    "InstanceGroupName": "ml.c5.2xlarge",
    "CapacityType": "Spot",
    "InstanceStatus": {...}
  }
}
```

### Using Console
<a name="sagemaker-hyperpod-spot-instance-getstart-console"></a>

#### Create and configure a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-spot-instance-getstart-console-create"></a>

To begin, launch and configure your SageMaker HyperPod EKS cluster and verify that continuous provisioning mode is enabled on cluster creation. Complete the following steps:

1. On the SageMaker AI console, choose HyperPod clusters in the navigation pane.

1. Choose Create HyperPod cluster and Orchestrated on Amazon EKS.

1. For Setup options, select Custom setup.

1. For Name, enter a name.

1. For Instance recovery, select Automatic.

1. For Instance provisioning mode, select Use continuous provisioning.

1. For Capacity type, select Spot.

1. Choose Submit.

The following screenshot shows the cluster creation flow in the console.

![\[An image containing the creation cluster flow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Screenshot-create-cluster.png)


This setup creates the necessary configuration such as virtual private cloud (VPC), subnets, security groups, and EKS cluster, and installs operators in the cluster. You can also provide existing resources such as an EKS cluster if you want to use an existing cluster instead of creating a new one. This setup will take around 20 minutes.

#### Adding new Spot Instance Group to the same cluster
<a name="sagemaker-hyperpod-spot-instance-getstart-console-add"></a>

To add a Spot instance group to your existing HyperPod EKS cluster, complete the following steps:

1. On the SageMaker AI console, choose HyperPod clusters in the navigation pane.

1. Select an existing HyperPod cluster orchestrated with Amazon EKS (ensure continuous provisioning is enabled).

1. Choose Edit.

1. On the Edit cluster page, choose Create instance group.

1. In the instance group configuration, for capacity type, select Spot instance.

1. Choose Create instance group. 

1. Choose Submit.

The following screenshot shows the instance group creation flow in the console:

![\[An image containing the instance group creation flow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Screenshot-instance-group.png)


### Using CloudFormation
<a name="sagemaker-hyperpod-spot-instance-getstart-cfn"></a>

```
Resources:
  TestCluster:
    Type: AWS::SageMaker::Cluster
    Properties:
      ClusterName: "SampleCluster"
      InstanceGroups:
        - InstanceGroupName: group1
          InstanceType: ml.c5.2xlarge
          InstanceCount: 1
          LifeCycleConfig:
            SourceS3Uri: "s3://'$BUCKET_NAME'"
            OnCreate: "on_create_noop.sh"
          ExecutionRole: "'$EXECUTION_ROLE'"
          ThreadsPerCore: 1
          CapacityRequirements:
            Spot: {}
      VpcConfig:
        Subnets:
          - "'$SUBNET1'"
        SecurityGroupIds:
          - "'$SECURITY_GROUP'"
      Orchestrator:
        Eks:
          ClusterArn:
            '$EKS_CLUSTER_ARN'
      NodeProvisioningMode: "Continuous"
      NodeRecovery: "Automatic"
```

Please see [https://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started-eks-console-create-cluster-cfn.html](https://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started-eks-console-create-cluster-cfn.html) for details.

### Karpenter based Autoscaling
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter"></a>

#### Create cluster role
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-role"></a>

**Step 1: Navigate to IAM Console**

1. Go to the **AWS Management Console** → **IAM** service

1. Click **Roles** in the left sidebar

1. Click **Create role**

**Step 2: Set up Trust Policy**

1. Select Custom trust policy (instead of AWS service)

1. Replace the default JSON with this trust policy:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "hyperpod.sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

Choose **Next**.

**Step 3: Create Custom Permissions Policy**

Since these are specific SageMaker permissions, you'll need to create a custom policy:

1. Click **Create policy** (opens new tab)

1. Click the **JSON** tab

1. Enter this policy:

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "sagemaker:BatchAddClusterNodes",
           "sagemaker:BatchDeleteClusterNodes"
         ],
         "Resource": "*"
       }
     ]
   }
   ```

1. Click **Next**

1. Give it a name like `SageMakerHyperPodRolePolicy`

1. Click **Create policy**

**Step 4: Attach the Policy to Role**

1. Go back to your role creation tab

1. Refresh the policies list

1. Search for and select your newly created policy

1. Click **Next**

**Step 5: Name and Create Role**

1. Enter a role name (e.g., `SageMakerHyperPodRole`)

1. Add a description if desired

1. Review the trust policy and permissions

1. Click **Create role**

#### Verification
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-verify"></a>

After creation, you can verify the role by:
+ Checking that the Trust relationships tab shows the HyperPod service principal
+ Checking that the Permissions tab shows your custom policy
+ Noting the role ARN, which will be available for use with HyperPod

The role ARN format will be:

```
 arn:aws:iam::YOUR-ACCOUNT-ID:role/SageMakerHyperPodRole
```
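If you assemble the ARN in a setup script, it is a simple string built from your account ID and role name; the account ID below is a placeholder:

```python
# Placeholder account ID -- substitute your own.
account_id = "111122223333"
role_name = "SageMakerHyperPodRole"

role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
print(role_arn)
```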

#### Create Cluster with AutoScaling
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-create-cluster"></a>

For better availability, create instance groups in multiple Availability Zones by configuring `Subnets`. You can also include On-Demand instance groups as a fallback.

For autoscaling, keep `InstanceCount` at `0`; Karpenter scales the group up as needed.

```
aws sagemaker create-cluster \
    --cluster-name clusterNameHere \
    --orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
    --node-provisioning-mode "Continuous" \
    --cluster-role 'arn:aws:iam::YOUR-ACCOUNT-ID:role/SageMakerHyperPodRole' \
    --instance-groups '[{
        "InstanceGroupName": "auto-spot-c5-2x-az1",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 0,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$BUCKET_NAME'",
            "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET1'"]
        }
    }]' \
    --vpc-config '{
        "SecurityGroupIds": ["'$SECURITY_GROUP'"],
        "Subnets": ["'$SUBNET'"]
    }' \
    --auto-scaling '{
        "Mode": "Enable",
        "AutoScalerType": "Karpenter"
    }'
```

#### Update Cluster (Spot + On-Demand)
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-update-cluster"></a>

```
aws sagemaker update-cluster \
   --cluster-name "my-cluster" \
   --instance-groups '[{
        "InstanceGroupName": "auto-spot-c5-x-az3",
        "InstanceType": "ml.c5.xlarge",
        "InstanceCount": 2,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$BUCKET_NAME'",
            "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET3'"]
        }
    },
    {
        "InstanceGroupName": "auto-spot-c5-2x-az2",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 2,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$BUCKET_NAME'",
            "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET2'"]
         }
    },
    {   
        "InstanceGroupName": "auto-ondemand-c5-2x-az1",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 2,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$BUCKET_NAME'",
            "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET1'"]
         }
    }]'
```

#### Create HyperpodNodeClass
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-create-class"></a>

`HyperpodNodeClass` is a custom resource that maps to pre-created instance groups in SageMaker HyperPod, defining constraints around which instance types and Availability Zones are supported for Karpenter’s auto scaling decisions. To use `HyperpodNodeClass`, simply specify the names of the `InstanceGroups` of your SageMaker HyperPod cluster that you want to use as the source for the AWS compute resources to use to scale up your pods in your NodePools. The `HyperpodNodeClass` name that you use here is carried over to the NodePool in the next section where you reference it. This tells the NodePool which `HyperpodNodeClass` to draw resources from. To create a `HyperpodNodeClass`, complete the following steps:

1. Create a YAML file (for example, nodeclass.yaml) similar to the following code. Add `InstanceGroup` names that you used at the time of the SageMaker HyperPod cluster creation. You can also add new instance groups to an existing SageMaker HyperPod EKS cluster.

1. Reference the `HyperPodNodeClass` name in your NodePool configuration.

The following is a sample `HyperpodNodeClass` :

```
apiVersion: karpenter.sagemaker.amazonaws.com/v1
kind: HyperpodNodeClass
metadata:
  name: multiazg6
spec:
  instanceGroups:
    # Names of InstanceGroups in the HyperPod cluster. The instance groups must
    # be created before this step can be completed.
    # MaxItems: 10
    - auto-spot-c5-2x-az1
    - auto-spot-c5-2x-az2
    - auto-spot-c5-x-az3
    - auto-ondemand-c5-2x-az1
```

Karpenter prioritizes Spot instance groups over On-Demand instances, using On-Demand as a fallback when specified in the configuration. Instance selection is sorted by EC2 Spot Placement Scores associated with each subnet's availability zone.
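As a rough illustration of this ordering (not HyperPod's actual implementation), selection can be thought of as sorting Spot groups by placement score and appending On-Demand groups as a fallback; the names and scores below are made up:

```python
# Hypothetical instance groups with made-up placement scores.
groups = [
    {"name": "auto-ondemand-c5-2x-az1", "capacity": "OnDemand", "placement_score": None},
    {"name": "auto-spot-c5-2x-az1", "capacity": "Spot", "placement_score": 7},
    {"name": "auto-spot-c5-2x-az2", "capacity": "Spot", "placement_score": 9},
]

def selection_order(igs):
    # Spot groups first, highest placement score wins; On-Demand as fallback.
    spot = sorted((g for g in igs if g["capacity"] == "Spot"),
                  key=lambda g: g["placement_score"], reverse=True)
    on_demand = [g for g in igs if g["capacity"] == "OnDemand"]
    return spot + on_demand

order = [g["name"] for g in selection_order(groups)]
print(order)
```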

**Apply the configuration to your EKS cluster using `kubectl`:**

```
kubectl apply -f nodeclass.yaml
```

The HyperPod cluster must have AutoScaling enabled, and the AutoScaling status must change to `InService`, before the `HyperpodNodeClass` can be applied. The `HyperpodNodeClass` status also shows each instance group's capacity as Spot or On-Demand. For more information and key considerations, see [Autoscaling on SageMaker HyperPod EKS](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling.html).

**For example**

```
apiVersion: karpenter.sagemaker.amazonaws.com/v1
kind: HyperpodNodeClass
metadata:
  creationTimestamp: "2025-11-30T03:25:04Z"
  name: multiazc6
  uid: ef5609be-15dd-4700-89ea-a3370e023690
spec:
  instanceGroups:
  - spot1
status:
  conditions:
  # true when all IGs in the spec are present in SageMaker cluster, false otherwise
  - lastTransitionTime: "2025-11-20T03:25:04Z"
    message: ""
    observedGeneration: 3
    reason: InstanceGroupReady
    status: "True"
    type: InstanceGroupReady
  # true if subnets of IGs are discoverable, false otherwise
  - lastTransitionTime: "2025-11-20T03:25:04Z"
    message: ""
    observedGeneration: 3
    reason: SubnetsReady
    status: "True"
    type: SubnetsReady
  # true when all dependent resources are Ready [InstanceGroup, Subnets]
  - lastTransitionTime: "2025-11-30T05:47:55Z"
    message: ""
    observedGeneration: 3
    reason: Ready
    status: "True"
    type: Ready
  instanceGroups:
  - instanceTypes:
    - ml.c5.2xlarge
    name: auto-spot-c5-2x-az2
    subnets:
    - id: subnet-03ecc649db2ff20d2
      zone: us-west-2a
      zoneId: usw2-az2
    capacities: {"Spot": {}}
```

#### Create NodePool
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-create-nodepool"></a>

The NodePool sets constraints on the nodes that can be created by Karpenter and the pods that can run on those nodes. The NodePool can be set to perform various actions, such as: 
+ Define labels and taints to limit the pods that can run on nodes Karpenter creates
+ Limit node creation to certain zones, instance types, and computer architectures, and so on

For more information about NodePool, refer to [NodePools](https://karpenter.sh/docs/concepts/nodepools/). SageMaker HyperPod managed Karpenter supports a limited set of well-known Kubernetes and Karpenter requirements, which we explain in this section.

To create a NodePool, complete the following steps:

Create a YAML file named `nodepool.yaml` with your desired NodePool configuration. The following code is a sample configuration that creates a NodePool for the `ml.c5.2xlarge` SageMaker instance type and additionally restricts it to a single zone. Refer to [NodePools](https://karpenter.sh/docs/concepts/nodepools/) for more customizations.

```
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpunodepool
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.sagemaker.amazonaws.com
        kind: HyperpodNodeClass
        name: multiazg6
      expireAfter: Never
      requirements:
        # The instance-type requirement is optional; otherwise Karpenter decides
        # based on the job's resource requirements.
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["ml.c5.2xlarge"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-west-2a"]
```

**Tip**: On an EC2 Spot interruption, HyperPod taints the node to trigger pod eviction. Karpenter's **consolidation** process respects pod disruption budgets and performs normal Kubernetes eviction, but if you set `consolidateAfter: 0`, consolidation can happen **immediately**, leaving very little time for graceful pod eviction. Set it to a nonzero value (up to 2 minutes) to allow graceful pod eviction for any checkpointing needs.
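The `consolidateAfter` delay lives in the NodePool's `disruption` block. A minimal sketch using Karpenter v1 fields (adjust the duration to your checkpointing needs):

```
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    # Give pods up to 2 minutes to checkpoint before consolidation.
    consolidateAfter: 2m
```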

**Apply the NodePool to your cluster:**

```
kubectl apply -f nodepool.yaml
```

**Monitor the NodePool status to ensure the Ready condition in the status is set to True:**

```
kubectl get nodepool gpunodepool -oyaml
```

This example shows how a NodePool can be used to specify the hardware (instance type) and placement (Availability Zone) for pods.

**Launch a simple workload**

The following workload runs a Kubernetes deployment in which each pod requests 1 CPU and 256 MB of memory per replica. The pods have not been spun up yet.

```
kubectl apply -f https://raw.githubusercontent.com/aws/karpenter-provider-aws/refs/heads/main/examples/workloads/inflate.yaml
```

When we apply this, we can see a deployment and a single node launch in our cluster.

**To scale this component, use the following command:**

```
kubectl scale deployment inflate --replicas 10
```

See [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling.html) for more details.

### Managing Node Interruption
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-interrupt"></a>

Spot Instances can be reclaimed at any time. EC2 provides a best-effort 2-minute interruption notice in most cases, but this notice is not guaranteed. In some situations, EC2 may terminate Spot Instances immediately without any advance warning. HyperPod automatically handles both scenarios:
+ With a 2-minute notice: HyperPod attempts graceful pod eviction and controlled capacity replacement when Spot capacity becomes available.
+ Without notice (immediate termination): HyperPod attempts node replacement (when Spot capacity becomes available) without graceful eviction.

**How it works**

When EC2 sends a Spot interruption notice, HyperPod automatically:

1. Detects interruption signal 

1. Taints the node: Prevents new pods from being scheduled on the interrupted instance

1. Gracefully evicts pods: Gives running pods time to complete or checkpoint their work (respecting Kubernetes `terminationGracePeriodSeconds`)

1. Replaces capacity: Automatically attempts to provision the replacement instances (Spot or On-Demand based on availability). 

   Capacity replacement works by automatically provisioning replacement instances. When capacity is not immediately available, the system continues checking until resources become accessible. In the case of non-autoscaling instance groups, HyperPod attempts to scale up within the same instance group until the required capacity becomes available. For Karpenter-based instance groups, Karpenter implements a fallback mechanism to other instance groups configured in the Node class when the primary group cannot accommodate the demand. Additionally, you can configure On-Demand as a fallback option, allowing Karpenter to automatically switch to On-Demand instances if it cannot successfully scale up Spot instance groups.

1. Reschedules workloads: Kubernetes automatically reschedules evicted pods on healthy nodes
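The steps above can be sketched as a simple ordered flow; this is an illustrative model only, not a HyperPod API:

```python
# Illustrative model of the interruption flow above -- not a HyperPod API.
def handle_spot_interruption(node, termination_grace_s, notice_s=120):
    """Return the ordered actions taken for an interrupted Spot node."""
    actions = [f"taint {node}"]
    if notice_s > 0:
        # Graceful eviction is bounded by the pod's terminationGracePeriodSeconds
        # and by the remaining interruption notice.
        actions.append(f"evict pods gracefully ({min(termination_grace_s, notice_s)}s budget)")
    actions.append(f"replace {node} when capacity is available")
    actions.append("reschedule evicted pods on healthy nodes")
    return actions

for step in handle_spot_interruption("node-1", termination_grace_s=30):
    print(step)
```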

### Finding your Usage and Bill
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-bill"></a>

To check your usage and billing for Spot Instances on HyperPod, you can use the Billing and Cost Management console. Go to Billing and Cost Management > Bills.

![\[An image containing cost region information.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Screenshot-cost-region.png)


**To explore usage and billing on Console, go to Billing and Cost Management > Cost Explorer**

![\[An image containing cost and usage.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Screenshot-cost-usage.png)


# Using UltraServers in Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-ultraserver"></a>

SageMaker HyperPod support for UltraServers provides high-performance GPU computing capabilities for AI and machine learning workloads. Built on the NVIDIA GB200 NVL72 architecture, these UltraServers provide NVLink connectivity across 18 GB200 instances in a dual-rack configuration, totaling 72 B200 GPUs. This NVLink fabric allows workloads to use GPU communications that increase usable GPU capacity and addressable memory beyond what's possible with discrete instances, supporting more complex and resource-intensive AI models. The NVLink connectivity is enabled by NVIDIA IMEX technology, which handles the low-level configuration for secure GPU fabric connections across instances within the same rack.

HyperPod simplifies the deployment and management of these GPU clusters through intelligent topology awareness and automated configuration. The platform automatically discovers and labels nodes with their physical location and capacity block information, which supports topology-aware scheduling for distributed workloads. HyperPod abstracts the complex IMEX configuration requirements, allowing you to focus on workload deployment rather than low-level GPU fabric setup. You can choose flexible deployment options including both self-managed nodes and EKS managed node groups. Amazon EKS provides optimized AMIs that include pre-configured NVIDIA drivers, Fabric Manager, IMEX drivers, and all necessary system software for seamless operation.

The integration includes pod placement capabilities that ensure distributed workloads are scheduled optimally across NVL72 domains using standard Kubernetes topology labels. Built-in monitoring and automated recovery features provide operational support, where the AMI health agent detects GPU errors from kernel logs and can automatically remediate issues or replace faulty nodes in managed node groups. This combination of GPU scale, intelligent workload placement, and automated operations helps you focus on your AI/ML innovations rather than infrastructure complexity, while achieving maximum performance from your GPU investments.

To get set up using UltraServers with your HyperPod cluster, see the following steps:

1. Create an [EKS-based HyperPod cluster](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-create-cluster.html). When you choose an instance group, make sure you choose an UltraServer.

1. After your cluster is created, use the following commands to install the operational plugins:

   NVIDIA device plugin v0.17.2

   ```
   kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.2/deployments/static/nvidia-device-plugin.yml
   ```

   Node Feature Discovery (NFD) DaemonSet v0.17.3

   ```
   kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.17.3"
   ```

   GPU feature discovery

   ```
   kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.2/deployments/static/gpu-feature-discovery-daemonset.yaml
   ```

You can now run jobs. The following example demonstrates how to create a domain, configure an IMEX domain, and enable channel allocation. These steps also let you create a pod to provision a channel for NCCL communication.

1. Create a resource specification file to use with kubectl.

   ```
   cat <<EOF > imex-channel-injection.yaml
   ---
   apiVersion: resource.nvidia.com/v1beta1
   kind: ComputeDomain
   metadata:
     name: imex-channel-injection
   spec:
     numNodes: 1
     channel:
       resourceClaimTemplate:
         name: imex-channel-0
   ---
   apiVersion: v1
   kind: Pod
   metadata:
     name: imex-channel-injection
   spec:
     affinity:
       nodeAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           nodeSelectorTerms:
           - matchExpressions:
             - key: nvidia.com/gpu.clique
               operator: Exists
             - key: topology.k8s.aws/ultraserver-id
               operator: In
               values: 
               - <UltraServer-ID>
     containers:
     - name: ctr
       image: ubuntu:22.04
       command: ["bash", "-c"]
       args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
       resources:
         claims:
         - name: imex-channel-0
     resourceClaims:
     - name: imex-channel-0
       resourceClaimTemplateName: imex-channel-0
   EOF
   ```

1. Apply the configuration that you created.

   ```
   kubectl apply -f imex-channel-injection.yaml
   ```

1. To verify that your pod is created, run the `get pods` commands.

   ```
   kubectl get pods
   kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain
   ```

1. You can also check the logs from the pod to see if it allocated a communication channel.

   ```
   kubectl logs imex-channel-injection
   ```

   ```
   total 0
   drwxr-xr-x 2 root root     60 Feb 19 10:43 .
   drwxr-xr-x 6 root root    380 Feb 19 10:43 ..
   crw-rw-rw- 1 root root 507, 0 Feb 19 10:43 channel0
   ```

1. You can also check the logs to verify that the automated IMEX configuration is running with an allocated channel.

   ```
   kubectl logs -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain --tail=-1
   /etc/nvidia-imex/nodes_config.cfg:
   ```

   ```
   IMEX Log initializing at: 8/8/2025 14:23:12.081
   [Aug 8 2025 14:23:12] [INFO] [tid 39] IMEX version 570.124.06 is running with the following configuration options
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Logging level = 4
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Logging file name/path = /var/log/nvidia-imex.log
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Append to log file = 0
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Max Log file size = 1024 (MBs)
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Use Syslog file = 0
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] IMEX Library communication bind interface =
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] IMEX library communication bind port = 50000
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Identified this node as ID 0, using bind IP of '10.115.131.8', and network interface of enP5p9s0
   [Aug 8 2025 14:23:12] [INFO] [tid 39] nvidia-imex persistence file /var/run/nvidia-imex/persist.dat does not exist.  Assuming no previous importers.
   [Aug 8 2025 14:23:12] [INFO] [tid 39] NvGpu Library version matched with GPU Driver version
   [Aug 8 2025 14:23:12] [INFO] [tid 63] Started processing of incoming messages.
   [Aug 8 2025 14:23:12] [INFO] [tid 64] Started processing of incoming messages.
   [Aug 8 2025 14:23:12] [INFO] [tid 65] Started processing of incoming messages.
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Creating gRPC channels to all peers (nPeers = 1).
   [Aug 8 2025 14:23:12] [INFO] [tid 66] Started processing of incoming messages.
   [Aug 8 2025 14:23:12] [INFO] [tid 39] IMEX_WAIT_FOR_QUORUM != FULL, continuing initialization without waiting for connections to all nodes.
   [Aug 8 2025 14:23:12] [INFO] [tid 67] Connection established to node 0 with ip address 10.115.131.8. Number of times connected: 1
   [Aug 8 2025 14:23:12] [INFO] [tid 39] GPU event successfully subscribed
   ```

1. After you've verified everything, delete the workload and remove the configuration.

   ```
   kubectl delete -f imex-channel-injection.yaml
   ```

# IDEs and Notebooks
<a name="sagemaker-hyperpod-eks-cluster-ide"></a>

Amazon SageMaker is introducing a new capability for SageMaker HyperPod EKS clusters that allows AI developers to run their interactive machine learning workloads directly on the HyperPod EKS cluster. This feature introduces a new add-on called Amazon SageMaker Spaces, which enables AI developers to create and manage self-contained environments for running notebooks.

Administrators can use SageMaker HyperPod Console to install the add-on on their cluster, and define default space configurations such as images, compute resources, local storage for notebook settings (additional storage to be attached to their dev spaces), file systems, and initialization scripts. A one-click installation option will be available with default settings to simplify the admin experience. Admins can use the SageMaker HyperPod Console, kubectl, or HyperPod CLI to install the operator, create default settings, and manage all spaces in a centralized location.

AI developers can use the HyperPod CLI to create, update, and delete dev spaces. They have the flexibility to use the default configurations provided by admins or to customize settings. AI developers can access their spaces on HyperPod using their local VS Code IDEs, or through a web browser that hosts their JupyterLab or Code Editor IDE on a custom DNS domain configured by their admins. They can also use Kubernetes port forwarding to access spaces in their web browsers.

## Admin
<a name="admin-cx"></a>
+ [Set up permissions](permission-setup.md)
+ [Install SageMaker AI Spaces Add-on](operator-install.md)
+ [Customize add-on](customization.md)
+ [Add users and set up service accounts](add-user.md)
+ [Limits](ds-limits.md)
+ [Task governance for Interactive Spaces on HyperPod](task-governance.md)
+ [Observability](observability.md)

## Data scientist
<a name="data-scientist-cx"></a>
+ [Create and manage spaces](create-manage-spaces.md)
+ [Web browser access](browser-access.md)
+ [Remote access to SageMaker Spaces](vscode-access.md)

## SageMaker Spaces Managed Instance Pricing
<a name="spaces-managed-instance-pricing"></a>

The SageMaker Spaces Add-on/Operator does not incur any additional charge to the customer. However, to support the SSH-over-SSM tunneling required for the *Remote IDE Connection* feature, SageMaker Spaces uses an AWS-managed instance. This instance is registered as an Advanced On-Premises Instance under SSM, and therefore is billed per compute hour.

For current rates, refer to the “On-Premises Instance Management” section of the [AWS Systems Manager pricing page](https://aws.amazon.com/systems-manager/pricing/).
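As a back-of-envelope estimate, the monthly cost is the hourly rate times the hours the managed instance is registered. The rate below is a placeholder; verify the current advanced-instances rate for your Region on the pricing page:

```python
# Placeholder rate -- verify against the Systems Manager pricing page.
rate_per_hour = 0.00695  # USD per advanced on-premises instance hour (placeholder)
hours_per_month = 730    # average hours in a month

monthly_cost = rate_per_hour * hours_per_month
print(f"~${monthly_cost:.2f}/month per registered managed instance")
```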

# Set up permissions
<a name="permission-setup"></a>

## Roles required for Add-on and its dependencies
<a name="permission-setup-addon"></a>

### IAM Roles Required for SageMaker Spaces on SageMaker HyperPod
<a name="role-hyperpod"></a>

When enabling **SageMaker Spaces** (also known as **SageMaker IDE / Notebooks**) features on a SageMaker HyperPod (EKS) cluster, several IAM roles must be created and assigned. These roles support secure access, routing, remote IDE sessions, and EBS storage provisioning. The following table summarizes these roles and when they are required.

### Role Summary Table
<a name="role-table"></a>


| IAM Role | Required? | Purpose | Who Uses It? | Customization allowed by SageMaker Console? | 
| --- | --- | --- | --- | --- | 
|  Spaces Add-on Execution Role  |  Always required  |  Allows the Spaces controller to manage Spaces, generate presigned URLs, manage SSM sessions  |  Add-on controller pod (privileged)  |  ✔ Yes  | 
|  In-Cluster Router Role  |  Required for WebUI access  |  Allows router pod to perform KMS operations for JWT signing (WebUI authentication)  |  In-cluster router pod (privileged)  |  ✔ Yes  | 
|  SSM Managed Instance Role  |  Required for Remote IDE access  |  Used by SSM agent sidecar for SSH-over-SSM remote IDE sessions  |  SSM Agent in Space IDE Pods (not an add-on pod)  |  ✔ Yes  | 
|  IAM Role for EBS CSI Driver Add-on  |  Always required  |  Allows EBS CSI Driver to create/attach/modify volumes for Spaces workloads  |  EBS CSI Driver Add-on  |  Auto created  | 
|  IAM Role for External DNS Add-on  |  Required for WebUI access  |  It ensures that Space endpoints and in-cluster components can be automatically assigned DNS names in the customer’s Route 53 hosted zones.  |  External DNS Add-on  |  Auto created  | 

### 1. Spaces Add-on Execution Role (Required)
<a name="add-n-execution-role"></a>

The Spaces Add-on Execution Role is always required because it is used by the SageMaker Spaces add-on controller pod, an administrative component installed through the EKS add-on. This role allows the controller to manage Spaces, provision resources, interact with SSM, and generate presigned URLs for both Remote IDE and WebUI access. It also grants the KMS access used to sign requests when authenticating WebUI HTTPS requests. This role can be created automatically when the SageMaker Spaces add-on is installed through the SageMaker Console. For manual creation, AWS provides the `AmazonSageMakerSpacesControllerPolicy` managed policy.

**Reference Trust Policy**

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "pods.eks.amazonaws.com"
      },
      "Action": [
          "sts:AssumeRole",
          "sts:TagSession"
      ],
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "{{accountId}}",
          "aws:SourceArn": "arn:aws:eks:{{region}}:{{accountId}}:cluster/{{eksClusterName}}"
        }
      }
    }
  ]
}
```

### 2. In-Cluster Router Role (Required for WebUI Authentication)
<a name="in-cluster-role"></a>

The In-Cluster Router Role is used by the **router pod**, a privileged component that authenticates Spaces WebUI sessions. The router uses a KMS key to create and sign JWT tokens that authorize user access to specific Spaces. This role allows the router pod to generate data keys and decrypt them. Similar to the controller role, it enforces security using tag- and cluster-based scope restrictions. This role can be generated automatically when the Spaces add-on is installed via the SageMaker Console, but customers may also create it manually.

**Reference Trust Policy**

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "pods.eks.amazonaws.com"
      },
      "Action": [
          "sts:AssumeRole",
          "sts:TagSession"
      ],
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "{{accountId}}",
          "aws:SourceArn": "arn:aws:eks:{{region}}:{{accountId}}:cluster/{{eksClusterName}}"
        }
      }
    }
  ]
}
```

**Reference Permission Policy**

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "KMSDescribeKey",
            "Effect": "Allow",
            "Action": [
                "kms:DescribeKey"
            ],
            "Resource": "arn:aws:kms:{{region}}:{{accountId}}:key/{{kmsKeyId}}"
        },
        {
            "Sid": "KMSKeyOperations",
            "Effect": "Allow",
            "Action": [
                "kms:GenerateDataKey",
                "kms:Decrypt"
            ],
            "Resource": "arn:aws:kms:{{region}}:{{accountId}}:key/{{kmsKeyId}}",
            "Condition": {
                "StringEquals": {
                    "kms:EncryptionContext:sagemaker:component": "amazon-sagemaker-spaces",
                    "kms:EncryptionContext:sagemaker:eks-cluster-arn": "${aws:PrincipalTag/eks-cluster-arn}"
                }
            }
        }
    ]
}
```

### 3. SSM Managed Instance Role (Required for Remote IDE Access)
<a name="ssm-role"></a>

The SSM Managed Instance Role is passed when registering the SSM managed instance for enabling the remote IDE access. This role allows the SSM agent to register the pod as an SSM Managed Instance and use the SSM Session Manager channels for Remote IDE (SSH-over-SSM) connectivity. It can be created automatically when using the AWS SageMaker Console. For manual deployments, customers must create this role and provide it to the Spaces add-on. The controller pod itself does not assume this role; it only provides it when calling `ssm:CreateActivation`.

**Reference Trust Policy**

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ssm.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "{{account}}"
                },
                "ArnEquals": {
                    "aws:SourceArn": "arn:aws:ssm:{{region}}:{{account}}:*"
                }
            }
        }
    ]
}
```

**Reference Permissions Policy**

```
{
  "Version": "2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssm:DescribeAssociation"
      ],
      "Resource": [
        "arn:aws:ssm:{{region}}:{{account}}:association/*",
        "arn:aws:ssm:{{region}}:{{account}}:document/*",
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetDocument",
        "ssm:DescribeDocument"
      ],
      "Resource": "arn:aws:ssm:{{region}}:{{account}}:document/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameter",
        "ssm:GetParameters"
      ],
      "Resource": "arn:aws:ssm:{{region}}:{{account}}:parameter/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:ListInstanceAssociations"
      ],
      "Resource": [
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:PutComplianceItems"
      ],
      "Resource": [
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:UpdateAssociationStatus"
      ],
      "Resource": [
        "arn:aws:ssm:{{region}}:{{account}}:document/*",
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:UpdateInstanceAssociationStatus"
      ],
      "Resource": [
        "arn:aws:ssm:{{region}}:{{account}}:association/*",
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:UpdateInstanceInformation"
      ],
      "Resource": [
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetDeployablePatchSnapshotForInstance",
        "ssm:GetManifest",
        "ssm:ListAssociations",
        "ssm:PutInventory",
        "ssm:PutConfigurePackageResult"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2messages:AcknowledgeMessage",
        "ec2messages:DeleteMessage",
        "ec2messages:FailMessage",
        "ec2messages:GetEndpoint"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2messages:GetMessages",
        "ec2messages:SendReply"
      ],
      "Resource": "*",
      "Condition": {
        "ArnLike": {
          "ssm:SourceInstanceARN": "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
        }
      }
    }
  ]
}
```

### 4. IAM Role for EBS CSI Driver Add-on
<a name="role-ebs-csi"></a>

The IAM role for the EBS CSI Driver is required because the EBS CSI Driver provisions persistent volumes for Spaces workloads. While the AWS-managed [AmazonEBSCSIDriverPolicy](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonEBSCSIDriverPolicy.html) provides baseline permissions, SageMaker HyperPod clusters require [additional capabilities](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-ebs.html#sagemaker-hyperpod-eks-ebs-setup) such as creating fast snapshot restores, tagging cluster-owned volumes, and attaching/detaching volumes for HyperPod-managed nodes. These permissions also include SageMaker-specific APIs such as `sagemaker:AttachClusterNodeVolume`. If the EBS CSI Driver is not installed, this role will now be automatically created by the SageMaker Console during Spaces add-on installation, **requiring no customer action**.
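The HyperPod-specific additions can be sketched as an extra IAM statement like the following. The `sagemaker:AttachClusterNodeVolume` action is named above; the EC2 actions shown are illustrative, so treat the linked setup guide as authoritative:

```
{
    "Sid": "HyperPodEBSAdditions",
    "Effect": "Allow",
    "Action": [
        "sagemaker:AttachClusterNodeVolume",
        "ec2:EnableFastSnapshotRestores",
        "ec2:CreateTags"
    ],
    "Resource": "*"
}
```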

### 5. IAM Role for External DNS Add-on
<a name="role-external-nds"></a>

The External DNS add-on manages DNS records for Services and Ingress resources on the HyperPod cluster. It ensures that Space endpoints and in-cluster components can be automatically assigned DNS names in the customer’s Route 53 hosted zones. Today, customers often install External DNS manually via a 1-click option in the EKS console. As part of improving the SageMaker Spaces experience, this role will now be automatically created by the SageMaker Console during Spaces add-on installation, **requiring no customer action**.
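For reference, External DNS conventionally requires Route 53 permissions along these lines (this mirrors the standard external-dns policy; the console-created role may scope the hosted zone ARN more tightly):

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "route53:ChangeResourceRecordSets",
            "Resource": "arn:aws:route53:::hostedzone/{{hostedZoneId}}"
        },
        {
            "Effect": "Allow",
            "Action": [
                "route53:ListHostedZones",
                "route53:ListResourceRecordSets"
            ],
            "Resource": "*"
        }
    ]
}
```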

## Permission setup for AWS Toolkit to Access SageMaker Spaces
<a name="permission-for-toolkitl"></a>

To allow the AWS VS Code Toolkit resource explorer side panel to discover and connect to SageMaker Spaces, the following IAM permissions are required. These permissions allow the Toolkit to list available SageMaker HyperPod clusters, retrieve cluster details, and obtain a connection token for the associated Amazon EKS cluster.

**Required IAM Policy**

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "SageMakerListClusters",
            "Effect": "Allow",
            "Action": "sagemaker:ListClusters",
            "Resource": "*"
        },
        {
            "Sid": "SageMakerDescribeCluster",
            "Effect": "Allow",
            "Action": "sagemaker:DescribeCluster",
            "Resource": "arn:aws:sagemaker:{{region}}:{{account}}:cluster/cluster-name"
        },
        {
            "Sid": "EksDescribeCluster",
            "Effect": "Allow",
            "Action": "eks:DescribeCluster",
            "Resource": "arn:aws:eks:{{region}}:{{account}}:cluster/cluster-name"
        },
        {
            "Sid": "EksGetToken",
            "Effect": "Allow",
            "Action": "eks:GetToken",
            "Resource": "*"
        }
    ]
}
```

**Scoping Recommendations**
+ Replace `cluster-name` with the specific SageMaker HyperPod cluster(s) your users need to access.
+ The `eks:GetToken` action currently does not support resource-level restrictions and must use `"Resource": "*"`. This is an AWS service limitation. Client-side authentication is performed through [EKS access entries](https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html).

# Install SageMaker AI Spaces Add-on
<a name="operator-install"></a>

## Dependencies
<a name="dependencies"></a>

**Amazon EKS Pod Identity Agent add-on**
+ Required for the operator to obtain AWS credentials
+ **Typically pre-installed** on most EKS clusters
+ Installation: Via EKS add-ons

**Cert-manager**
+ Required for TLS certificate management
+ **Pre-installed** if using HyperPod quick cluster create
+ Installation: Via EKS add-ons

**EBS CSI Driver**
+ Required for Space persistent storage (EBS volumes)
+ **Automatically installed** when using SageMaker console to install
+ Requires IAM role with `AmazonEBSCSIDriverPolicy` plus HyperPod-specific permissions
+ Installation: Via EKS add-ons. Make sure to follow the guide to install the additional permissions needed for HyperPod.
+ Reference: [Using the Amazon EBS CSI driver on HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-ebs.html)

## Additional dependencies for WebUI Access
<a name="-additional-dependencies"></a>

**AWS Load Balancer Controller**
+ **Pre-installed** if using HyperPod quick cluster create
+ Installation: Via Helm
+ Manual installation guide: [Installing the AWS Load Balancer Controller](https://docs.aws.amazon.com/eks/latest/userguide/lbc-helm.html)

**External DNS**
+ Required when using custom domain for WebUI access
+ Manages Route53 DNS records automatically
+ Requires IAM role with Route53 permissions
+ Installation: Via EKS add-ons

## Installation
<a name="installation"></a>

Before you begin, ensure that you have:
+ An active SageMaker HyperPod cluster with at least one worker node running Kubernetes version 1.30 or later
+ At least one worker node with minimum instance type (XX vCPU, YY GiB memory)

### Installing the Amazon SageMaker Spaces add-on
<a name="space-add-on"></a>

You can install the SageMaker Spaces add-on using either quick install for default settings or custom install for advanced configuration.

#### Quick install
<a name="quick-install"></a>

1. Open the Amazon SageMaker console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose your cluster from the clusters list.

1. On the IDE and Notebooks tab, locate Amazon SageMaker Spaces, then choose Quick install.

Quick install automatically:
+ Creates the required IAM roles for the add-on
+ Enables remote access mode with required IAM roles for Systems Manager
+ Installs the add-on and configures pod identity association

#### Custom install
<a name="custom-install"></a>

1. Open the Amazon SageMaker console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose your cluster from the clusters list.

1. On the IDE and Notebooks tab, locate Amazon SageMaker Spaces, then choose Custom install.

1. Configure the following options:

   **IAM roles needed by add-on**
   + Choose whether to create new IAM roles with recommended permissions or use existing roles that have the required permissions (refer to the permission setup sections above)

   **Remote access configuration**
   + Enable to allow users to connect to spaces from local Visual Studio Code using AWS Systems Manager
   + For SSM managed instance role:
     + **Create new role** – The add-on creates and manages the role with required Systems Manager permissions
     + **Use existing role** – Select a pre-configured role with necessary Systems Manager permissions
   + Ensure the Spaces Add-on Execution Role has PassRole permissions for the SSM managed instance role
**Note**  
Enabling remote access activates the AWS Systems Manager advanced-instances tier, which incurs additional per-instance charges. For pricing information, see Systems Manager pricing.

   **Web browser access configuration**
   + Enable to allow users to access spaces through a web browser using Route 53 DNS and SSL certificates
   + **Prerequisites:** Install AWS Load Balancer Controller before enabling browser access
   + **Route 53 hosted zone:** Select an existing hosted zone for a domain or subdomain that you own. The domain or subdomain must be registered and under your control to enable DNS management and SSL certificate validation.

     For more details on domain registration, see [Registering a new domain](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/domain-register.html#domain-register-procedure-section) in the Route 53 Developer Guide.
   + **Subdomain:** Enter subdomain prefix (alphanumeric and hyphens only, maximum 63 characters)
   + **SSL certificate:** Select an existing SSL certificate from AWS Certificate Manager. The certificate must be valid and cover both your subdomain (for example, `subdomain.domain.com`) and wildcard subdomains (for example, `*.subdomain.domain.com`) to support individual space access URLs.
   +  **Token signing key:** Select an AWS KMS asymmetric key for JWT token signing. The key is used to encrypt authentication tokens for secure WebUI access. You can create a new asymmetric key in KMS or select an existing one that your account has access to.
**Note**  
Standard Route 53 charges apply for hosted zones and DNS queries. For pricing information, see Route 53 pricing.
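The PassRole requirement noted under the remote access configuration can be granted to the Spaces Add-on Execution Role with a statement like the following (a sketch; substitute the ARN of your SSM managed instance role):

```
{
    "Effect": "Allow",
    "Action": "iam:PassRole",
    "Resource": "arn:aws:iam::{{accountId}}:role/{{ssmManagedInstanceRole}}",
    "Condition": {
        "StringEquals": {
            "iam:PassedToService": "ssm.amazonaws.com"
        }
    }
}
```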

#### EKS Addon Installation - Jupyter K8s with WebUI
<a name="webui-install"></a>

##### Configuration File
<a name="configure-file"></a>

Create `addon-config.yaml`:

```
jupyter-k8s:
  workspacePodWatching:
    enable: true

jupyter-k8s-aws-hyperpod:
  clusterWebUI:
    enabled: true
    domain: "<DOMAIN_NAME>"
    awsCertificateArn: "<ACM_CERTIFICATE_ARN>"
    kmsEncryptionContext:
      enabled: true
    traefik:
      shouldInstall: true
    auth:
      kmsKeyId: "<KMS_KEY_ARN>"
```

**Replace the following placeholders:**
+ `<DOMAIN_NAME>`: Your domain name (e.g., `jupyter.example.com`)
+ `<ACM_CERTIFICATE_ARN>`: Your ACM certificate ARN (e.g., `arn:aws:acm:us-west-2:111122223333:certificate/12345678-1234-1234-1234-123456789012`)
+ `<KMS_KEY_ARN>`: Your KMS key ARN (e.g., `arn:aws:kms:us-west-2:111122223333:key/12345678-1234-1234-1234-123456789012`)

##### Installation via AWS CLI
<a name="install-via-cli"></a>

```
aws eks create-addon \
  --cluster-name <CLUSTER_NAME> \
  --addon-name amazon-sagemaker-spaces \
  --configuration-values file://addon-config.yaml \
  --resolve-conflicts OVERWRITE \
  --region <AWS_REGION>
```

**To update an existing addon:**

```
aws eks update-addon \
  --cluster-name <CLUSTER_NAME> \
  --addon-name amazon-sagemaker-spaces \
  --configuration-values file://addon-config.yaml \
  --resolve-conflicts OVERWRITE \
  --region <AWS_REGION>
```

##### Installation via AWS Management Console
<a name="install-via-console"></a>

1. Go to **EKS Console** → Select your cluster

1. Click **Add-ons** tab → **Add new**

1. Select **SageMaker Spaces** addon

1. Paste the YAML config above in **Optional configuration settings**

1. Click **Next**, then review the addon settings

1. Click **Create**

##### Verify Installation
<a name="install-verify"></a>

```
# Check addon status
aws eks describe-addon \
  --cluster-name <CLUSTER_NAME> \
  --addon-name amazon-sagemaker-spaces \
  --region <AWS_REGION>
```

##### Customizing ALB Attributes
<a name="customize-alb"></a>

By default, the addon creates a public load balancer for use with the web UI. You can customize the load balancer attributes using the EKS addon properties.

To create an internal ALB, set the scheme to `internal`:

```
jupyter-k8s-aws-hyperpod:
  clusterWebUI:
    enabled: true
    domain: "<DOMAIN_NAME>"
    awsCertificateArn: "<ACM_CERTIFICATE_ARN>"
    alb:
      scheme: "internal"  # Default is "internet-facing"
```

You can also use the `alb.annotations` field to customize ALB settings:

```
jupyter-k8s-aws-hyperpod:
  clusterWebUI:
    enabled: true
    domain: "<DOMAIN_NAME>"
    awsCertificateArn: "<ACM_CERTIFICATE_ARN>"
    alb:
      scheme: "internal"
      annotations:
        alb.ingress.kubernetes.io/security-groups: "<SECURITY_GROUP_ID>"
        alb.ingress.kubernetes.io/subnets: "<SUBNET_ID_1>,<SUBNET_ID_2>"
        alb.ingress.kubernetes.io/load-balancer-attributes: "idle_timeout.timeout_seconds=60"
```

**Common ALB annotations:**
+ `alb.ingress.kubernetes.io/security-groups`: Specify security groups for the ALB
+ `alb.ingress.kubernetes.io/subnets`: Specify subnets for the ALB
+ `alb.ingress.kubernetes.io/load-balancer-attributes`: Set ALB attributes (idle timeout, access logs, etc.)

See [AWS Load Balancer Controller documentation](https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/guide/ingress/annotations/) for all available annotations.

### Upgrade / versioning of add-on
<a name="upgrade-add-on"></a>

```
aws eks update-addon \
  --cluster-name <CLUSTER_NAME> \
  --addon-name amazon-sagemaker-spaces \
  --configuration-values file://addon-config.yaml \
  --resolve-conflicts OVERWRITE \
  --region <AWS_REGION>
```

# Customize add-on
<a name="customization"></a>

## Template
<a name="customization-template"></a>

Templates are reusable workspace configurations that serve as admin-controlled blueprints for workspace creation. They provide defaults for workspace configuration values, and guardrails to control what data scientists can do. Templates exist at a cluster level, and can be re-used across namespaces. 

SageMaker Spaces creates two system templates as a starting point for data scientists, one for Code Editor and one for JupyterLab. These system templates are managed by the add-on and cannot be edited directly. Instead, admins can create new templates and set them as default.

## Task Governance
<a name="customization-governabce"></a>

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: WorkspaceTemplate
metadata:
  name: my-jupyter-template
  namespace: my-namespace
  labels:
    kueue.x-k8s.io/priority-class: <user-input>-priority
spec:
  displayName: "My Custom Jupyter Lab"
  description: "Custom Jupyter Lab with specific configurations"
  defaultImage: "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
  allowedImages:
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu"
  defaultResources:
    requests:
      cpu: "1"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "16Gi"
  primaryStorage:
    defaultSize: "10Gi"
    minSize: "5Gi"
    maxSize: "50Gi"
    defaultStorageClassName: "sagemaker-spaces-default-storage-class"
    defaultMountPath: "/home/sagemaker-user"
  defaultContainerConfig:
    command: ["/opt/amazon/sagemaker/workspace/bin/entrypoint-workspace-jupyterlab"]
  defaultPodSecurityContext:
    fsGroup: 1000
  defaultOwnershipType: "Public"
  defaultAccessStrategy:
    name: "hyperpod-access-strategy"
  allowSecondaryStorages: true
  appType: "jupyterlab"
```

## SMD / Custom images
<a name="customization-image"></a>

Customers can configure image policies through templates by providing a default image and a list of allowed images. Additionally, administrators can choose whether to allow data scientists to bring their own custom images. The system defaults to using the latest SageMaker Distribution, but if you wish to pin to a particular version, you can specify the exact SMD version to use in a template.
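For example, pinning a template to a specific SMD version is a matter of setting the image fields of the `WorkspaceTemplate` spec shown earlier (the version tag below is illustrative):

```
spec:
  defaultImage: "public.ecr.aws/sagemaker/sagemaker-distribution:3.4.2-cpu"
  allowedImages:
    - "public.ecr.aws/sagemaker/sagemaker-distribution:3.4.2-cpu"
    - "public.ecr.aws/sagemaker/sagemaker-distribution:3.4.2-gpu"
```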

Custom image requirements:
+ `curl` installed, if you want to use idle shutdown
+ The in-container application must listen on port 8888
+ For remote access, the dependencies listed under Image requirements (`curl` or `wget`, `unzip`, `tar`, and `gzip`)

## Remote IDE Requirement
<a name="remote-ide-requirement"></a>

### VS Code version requirement
<a name="remote-ide-requirement-vscode"></a>

VS Code version [v1.90](https://code.visualstudio.com/updates/v1_90) or greater is required. We recommend using the [latest stable version of VS Code](https://code.visualstudio.com/updates).

### Operating system requirements
<a name="remote-ide-requirement-operate"></a>

You need one of the following operating systems to remotely connect to Studio spaces:
+ macOS 13 or later
+ Windows 10
  + [Windows 10 support ends on October 14, 2025](https://support.microsoft.com/en-us/windows/windows-10-support-ends-on-october-14-2025-2ca8b313-1946-43d3-b55c-2b95b107f281)
+ Windows 11
+ Linux — install the official [Microsoft VS Code for Linux](https://code.visualstudio.com/docs/setup/linux), not an open-source build

### Local machine prerequisites
<a name="remote-ide-requirement-machine"></a>

Before connecting your local Visual Studio Code to Studio spaces, ensure your local machine has the required dependencies and network access.

**Note**  
Environments with software installation restrictions may prevent users from installing required dependencies. The AWS Toolkit for Visual Studio Code automatically searches for these dependencies when initiating remote connections and will prompt for installation if any are missing. Coordinate with your IT department to ensure these components are available.

**Required local dependencies**

Your local machine must have the following components installed:
+ **[Remote - SSH extension](https://code.visualstudio.com/docs/remote/ssh)** — Standard VS Code Marketplace extension for remote development
+ **[Session Manager plugin](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html)** — Required for secure session management
+ **SSH Client** — Standard component on most machines ([OpenSSH recommended for Windows](https://learn.microsoft.com/en-us/windows-server/administration/openssh/openssh_install_firstuse))
+ **[VS Code command-line interface](https://code.visualstudio.com/docs/configure/command-line)** — Typically included with VS Code installation

**Platform-specific requirements**
+ **Windows users** — PowerShell 5.1 or later is required for SSH terminal connections

**Network connectivity requirements**

Your local machine must have network access to [Session Manager endpoints](https://docs.aws.amazon.com/general/latest/gr/ssm.html). For example, in US East (N. Virginia) (us-east-1) these include:
+ `ssm.us-east-1.amazonaws.com`
+ `ssm.us-east-1.api.aws`
+ `ssmmessages.us-east-1.amazonaws.com`
+ `ec2messages.us-east-1.amazonaws.com`

### Image requirements
<a name="remote-ide-requirement-image"></a>

**SageMaker Distribution images**

When using SageMaker Distribution with remote access, use [SageMaker Distribution](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-distribution.html) version 2.7 or later.

**Custom images**

When you [Bring your own image (BYOI)](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-byoi.html) with remote access, follow the [custom image specifications](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-byoi-specs.html) and make sure the following dependencies are installed:
+ `curl` or `wget` — Required for downloading AWS CLI components
+ `unzip` — Required for extracting AWS CLI installation files
+ `tar` — Required for archive extraction
+ `gzip` — Required for compressed file handling

### Instance requirements
<a name="remote-ide-requirement-instance"></a>
+ **Memory** — 8GB or more
+ Use instances with at least 8GB of memory. The following instance types are *not* supported due to insufficient memory (less than 8GB): `ml.t3.medium`, `ml.c7i.large`, `ml.c6i.large`, `ml.c6id.large`, and `ml.c5.large`. For a more complete list of instance types, see the [Amazon EC2 On-Demand Pricing page](https://aws.amazon.com/ec2/pricing/on-demand/).

## Optimizing Kubernetes Startup Time by Pre-Warming Container Images
<a name="remote-ide-optimize-image"></a>

Container image pulling performance has become a significant bottleneck for many EKS customers, especially as AI/ML workloads rely on increasingly large container images. Pulling and unpacking these large images typically takes several minutes the first time they are used on each EKS node. This delay adds substantial latency when launching SageMaker Spaces and directly impacts user experience, particularly in environments where fast startup is essential, such as notebooks and interactive development jobs.

Image pre-warming is a technique used to preload specific container images onto every node in the EKS/HyperPod cluster before they are needed. Instead of waiting for a pod to trigger the first pull of a large image, the cluster proactively downloads and caches images across all nodes. This ensures that when workloads launch, the required images are already available locally, eliminating long cold-start delays. Image pre-warming improves SageMaker Spaces startup speed and provides a more predictable and responsive experience for end users.

### Pre-Warming via DaemonSet
<a name="remote-ide-optimize-image-dae"></a>

We recommend using a DaemonSet to preload images. A DaemonSet ensures that one pod runs on every node in the cluster. Each container inside the DaemonSet pod references an image you want to cache. When Kubernetes starts the pod, it automatically pulls the images, warming the cache on each node.

The following example shows how to create a DaemonSet that preloads two GPU images. Each container runs a lightweight `sleep infinity` command to keep the pod active with minimal overhead.

```
cat <<EOF | kubectl apply -n "namespace_1" -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-preload-ds
spec:
  selector:
    matchLabels:
      app: image-preloader
  template:
    metadata:
      labels:
        app: image-preloader
    spec:
      containers:
      - name: preloader-3-4-2
        image: public.ecr.aws/sagemaker/sagemaker-distribution:3.4.2-gpu
        command: ["sleep"]
        args: ["infinity"]
        resources:
          requests:
            cpu: 1m
            memory: 16Mi
          limits:
            cpu: 5m
            memory: 32Mi
      - name: preloader-3-3-2
        image: public.ecr.aws/sagemaker/sagemaker-distribution:3.3.2-gpu
        command: ["sleep"]
        args: ["infinity"]
        resources:
          requests:
            cpu: 1m
            memory: 16Mi
          limits:
            cpu: 5m
            memory: 32Mi
EOF
```

### How It Works
<a name="remote-ide-optimize-image-how"></a>
+ Each container references one image.
+ Kubernetes must download each image before starting the container.
+ Once the pod is running on every node, the images are cached locally.
+ Any workload using these images now starts much faster.

## Space default storage (EBS)
<a name="space-storage"></a>

The system uses the EBS CSI driver by default to provision EBS storage volumes for each workspace. SageMaker creates an EBS storage class for use with workspaces, and administrators can customize the default and maximum size of these volumes using template settings. For advanced users working with CLI tools, you can also customize the storage class of the workspace, which allows users to leverage other storage classes including configuring customer-managed KMS keys for their EBS volumes.
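The storage class customization described above can be sketched as a custom `StorageClass` for the EBS CSI driver with a customer-managed KMS key (the class name and key ARN are illustrative):

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: spaces-ebs-cmk
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
parameters:
  type: gp3
  encrypted: "true"
  kmsKeyId: "arn:aws:kms:{{region}}:{{accountId}}:key/{{kmsKeyId}}"
```

Using `WaitForFirstConsumer` delays provisioning until the workspace pod is scheduled, which keeps the volume in the same Availability Zone as the node.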

Note that EBS volumes are bound to a particular AZ, which means workspaces can only be scheduled on nodes in the same AZ as their storage volume. This can lead to scheduling failures if cluster capacity exists but not in the correct AZ.

## Additional storage
<a name="space-additional-storage"></a>

SageMaker Spaces supports attaching additional storage volumes such as Amazon EFS, FSx for Lustre, or S3 Mountpoint to your development spaces. This allows you to access shared datasets, collaborate on projects, or use high-performance storage for your workloads.

### Prerequisites
<a name="space-additional-storage-prereq"></a>

Before attaching additional storage to spaces, you must:

1. **Install the appropriate CSI driver add-on** via [EKS add-ons](https://docs.aws.amazon.com/eks/latest/userguide/workloads-add-ons-available-eks.html) (Amazon EFS CSI Driver, Amazon FSx for Lustre CSI Driver, or Mountpoint for Amazon S3 CSI Driver)

1. **Set up storage resources and PersistentVolumeClaims** following the CSI driver documentation for your specific storage type

1. **Ensure the PVC is available** in the same namespace where you plan to create your space
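As an illustration of step 2, a claim bound to a statically provisioned FSx for Lustre volume might look like the following (the names and size are hypothetical; see the FSx CSI driver documentation for creating the matching PersistentVolume):

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-fsx-pvc
  namespace: my-namespace
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 1200Gi
  volumeName: my-fsx-pv
```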

### Attaching storage to spaces
<a name="space-additional-storage-attach"></a>

Once you have a PersistentVolumeClaim configured, you can attach it to a space using either the HyperPod CLI or kubectl.

**HyperPod CLI**

```
hyp create hyp-space \
    --name my-space \
    --display-name "My Space with FSx" \
    --memory 8Gi \
    --volume name=shared-fsx,mountPath=/shared,persistentVolumeClaimName=my-fsx-pvc
```

**kubectl**

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-space
spec:
  displayName: "My Space with FSx"
  desiredStatus: Running
  volumes:
  - name: shared-fsx
    mountPath: /shared
    persistentVolumeClaimName: my-fsx-pvc
```

### Multiple volumes
<a name="space-additional-storage-multiple"></a>

You can attach multiple additional storage volumes to a single space by specifying multiple `--volume` flags with the CLI or multiple entries in the `volumes` array with kubectl.

**HyperPod CLI**

```
hyp create hyp-space \
    --name my-space \
    --display-name "My Space with Multiple Storage" \
    --memory 8Gi \
    --volume name=shared-efs,mountPath=/shared,persistentVolumeClaimName=my-efs-pvc \
    --volume name=datasets,mountPath=/datasets,persistentVolumeClaimName=my-s3-pvc
```

**kubectl**

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-space
spec:
  displayName: "My Space with Multiple Storage"
  desiredStatus: Running
  volumes:
  - name: shared-efs
    mountPath: /shared
    persistentVolumeClaimName: my-efs-pvc
  - name: datasets
    mountPath: /datasets
    persistentVolumeClaimName: my-s3-pvc
```

## Resource configuration
<a name="space-resource-configuration"></a>

SageMaker Spaces allows you to configure compute resources for your development environments, including CPU, memory, and GPU resources to match your workload requirements.

### GPU configuration
<a name="space-gpu-configuration"></a>

SageMaker Spaces supports both whole GPU allocation and GPU partitioning using NVIDIA Multi-Instance GPU (MIG) technology. This allows you to optimize GPU utilization for different types of machine learning workloads.

#### Whole GPU allocation
<a name="space-gpu-whole"></a>

**HyperPod CLI**

```
hyp create hyp-space \
    --name gpu-space \
    --display-name "GPU Development Space" \
    --image public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu \
    --memory 16Gi \
    --gpu 1 \
    --gpu-limit 1
```

**kubectl**

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: gpu-space
spec:
  displayName: "GPU Development Space"
  image: "public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu"
  desiredStatus: Running
  resources:
    requests:
      memory: "16Gi"
      nvidia.com/gpu: "1"
    limits:
      memory: "16Gi"
      nvidia.com/gpu: "1"
```

#### GPU partitioning (MIG)
<a name="space-gpu-mig"></a>

GPU partitioning using NVIDIA Multi-Instance GPU (MIG) technology allows you to partition a single GPU into smaller, isolated instances. Your HyperPod cluster must have GPU nodes that support MIG and have MIG profiles configured. For more information on setting up MIG on your HyperPod cluster, see [GPU partitioning using NVIDIA MIG](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-gpu-partitioning-setup.html).

**HyperPod CLI**

```
hyp create hyp-space \
    --name mig-space \
    --display-name "MIG GPU Space" \
    --image public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu \
    --memory 8Gi \
    --accelerator-partition-type mig-3g.20gb \
    --accelerator-partition-count 1
```

**kubectl**

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: mig-space
spec:
  displayName: "MIG GPU Space"
  image: "public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu"
  desiredStatus: Running
  resources:
    requests:
      memory: "8Gi"
      nvidia.com/mig-3g.20gb: "1"
    limits:
      memory: "8Gi"
      nvidia.com/mig-3g.20gb: "1"
```

## Lifecycle
<a name="space-lifecycle"></a>

Lifecycle configuration provides startup scripts that run when a workspace is created or started. These scripts let administrators customize the workspace environment during startup. They are bash scripts with a maximum size of 1 KB. If you need a larger setup, we recommend adding a script to the container image and triggering it from the lifecycle configuration.

This functionality is built on [Kubernetes container lifecycle hooks](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/). Note that Kubernetes does not guarantee when the startup script runs relative to the container's entrypoint.
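For example, a minimal lifecycle script might look like the following sketch. The `/opt/startup/setup.sh` path is illustrative: it stands in for a larger setup script baked into your container image.

```shell
#!/bin/bash
# Hypothetical lifecycle script (must stay under the 1 KB size limit).
set -eu

# Small, inline customization is fine directly in the script.
export WORKSPACE_ENV="development"

# Larger setup logic should live in the container image; the lifecycle
# configuration only triggers it (path below is illustrative).
SETUP_SCRIPT="/opt/startup/setup.sh"
if [ -x "$SETUP_SCRIPT" ]; then
  "$SETUP_SCRIPT"
fi

echo "lifecycle configuration complete"
```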

## Idle shutdown
<a name="space-idle-shutdown"></a>

Configure automatic shutdown of idle workspaces to optimize resource usage.

### Specification
<a name="space-idle-shutdown-spec"></a>

```
idleShutdown:
  enabled: true
  idleShutdownTimeoutMinutes: 30
  detection:
    httpGet:
      path: /api/idle
      port: 8888
      scheme: HTTP
```

### Parameters
<a name="space-idle-shutdown-parameter"></a>

**enabled** (boolean, required) - Enables or disables idle shutdown for the workspace.

**idleShutdownTimeoutMinutes** (integer, required) - Number of minutes of inactivity before the workspace shuts down. Minimum value is 1.

**detection** (object, required) - Defines how to detect workspace idle state.

**detection.httpGet** (object, optional) - HTTP endpoint configuration for idle detection. Uses Kubernetes HTTPGetAction specification.
+ **path** - HTTP path to request
+ **port** - Port number or name
+ **scheme** - HTTP or HTTPS (default: HTTP)

### Configuration Locations
<a name="space-idle-shutdown-configure"></a>

**Workspace Configuration**

Define idle shutdown directly in the workspace specification:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-workspace
spec:
  displayName: "Development Workspace"
  image: jupyter/scipy-notebook:latest
  idleShutdown:
    enabled: true
    idleShutdownTimeoutMinutes: 30
    detection:
      httpGet:
        path: /api/idle
        port: 8888
```

**Template Configuration**

Define default idle shutdown behavior in a WorkspaceTemplate:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: WorkspaceTemplate
metadata:
  name: jupyter-template
spec:
  displayName: "Jupyter Template"
  defaultImage: jupyter/scipy-notebook:latest
  defaultIdleShutdown:
    enabled: true
    idleShutdownTimeoutMinutes: 30
    detection:
      httpGet:
        path: /api/idle
        port: 8888
  idleShutdownOverrides:
    allow: true
    minTimeoutMinutes: 60
    maxTimeoutMinutes: 240
```

### Template Inheritance and Overrides
<a name="space-idle-shutdown-inherit"></a>

Workspaces using a template automatically inherit the template's `defaultIdleShutdown` configuration. Workspaces can override this configuration if the template allows it.

**Override Policy**

Templates control override behavior through `idleShutdownOverrides`:

**allow** (boolean, default: true) - Whether workspaces can override the default idle shutdown configuration.

**minTimeoutMinutes** (integer, optional) - Minimum allowed timeout value for workspace overrides.

**maxTimeoutMinutes** (integer, optional) - Maximum allowed timeout value for workspace overrides.
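As an illustration of how these bounds interact, the following sketch checks a workspace's requested timeout against a template's override policy. This is not the operator's actual validation code; the function name and defaults are hypothetical.

```python
def validate_override(timeout, allow, min_timeout=None, max_timeout=None):
    """Check a workspace idleShutdownTimeoutMinutes override against
    a template's idleShutdownOverrides policy (illustrative only)."""
    if not allow:
        return False  # template locks the configuration
    if min_timeout is not None and timeout < min_timeout:
        return False
    if max_timeout is not None and timeout > max_timeout:
        return False
    return True

print(validate_override(60, allow=True, min_timeout=60, max_timeout=240))  # True
print(validate_override(30, allow=True, min_timeout=60, max_timeout=240))  # False: below minimum
print(validate_override(60, allow=False))                                  # False: overrides locked
```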

**Inheritance Example**

Workspace inherits template defaults:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-workspace
spec:
  displayName: "My Workspace"
  templateRef:
    name: jupyter-template
  # Inherits defaultIdleShutdown from template
```

**Override Example**

Workspace overrides template defaults:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-workspace
spec:
  displayName: "My Workspace"
  templateRef:
    name: jupyter-template
  idleShutdown:
    enabled: true
    idleShutdownTimeoutMinutes: 60  # Must be within template bounds
    detection:
      httpGet:
        path: /api/idle
        port: 8888
```

**Locked Configuration**

Prevent workspace overrides:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: WorkspaceTemplate
metadata:
  name: locked-template
spec:
  displayName: "Locked Template"
  defaultImage: jupyter/scipy-notebook:latest
  defaultIdleShutdown:
    enabled: true
    idleShutdownTimeoutMinutes: 30
    detection:
      httpGet:
        path: /api/idle
        port: 8888
  idleShutdownOverrides:
    allow: false  # Workspaces cannot override
```

### Behavior
<a name="space-idle-shutdown-behavior"></a>

When idle shutdown is enabled, the system periodically checks the workspace for activity using the configured HTTP endpoint. If the endpoint indicates the workspace is idle for the specified timeout duration, the workspace automatically stops. You can manually restart the workspace when needed.
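The shutdown decision can be sketched as follows. This is an illustration of the timeout behavior, not the actual controller implementation; `idle_since` stands in for the timestamp at which the detection endpoint first reported the workspace idle.

```python
import datetime

def should_shut_down(idle_since, now, timeout_minutes):
    """Return True once the workspace has been idle for the full timeout."""
    if idle_since is None:  # detection endpoint reported activity
        return False
    return (now - idle_since) >= datetime.timedelta(minutes=timeout_minutes)

now = datetime.datetime(2024, 1, 1, 12, 0)
idle_since = now - datetime.timedelta(minutes=45)
print(should_shut_down(idle_since, now, 30))  # True: idle 45 min with a 30 min timeout
print(should_shut_down(None, now, 30))        # False: workspace is active
```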

## Template updates
<a name="customization-template-updates"></a>

You can manage Spaces within the EKS cluster using client tools such as kubectl or the HyperPod CLI and SDK. Administrators can provision Space Templates that define default Space configurations, while data scientists can customize their integrated development environments without needing to understand the underlying Kubernetes complexity. For detailed usage instructions, see the CLI and SDK documentation at [https://sagemaker-hyperpod-cli.readthedocs.io/en/latest/index.html](https://sagemaker-hyperpod-cli.readthedocs.io/en/latest/index.html).

Administrators can perform CRUD operations on Space Templates, which serve as the base configurations for creating Spaces. Data scientists can perform CRUD operations on Spaces and override various parameters, including the Multi-Instance GPU profiles for specific compute nodes. They can start, stop, and connect to Spaces through remote VS Code access and the web UI. When a Space Template is updated, any subsequently created Space is configured with the settings in the updated template. Compliance checks are performed when existing Spaces are updated or started; if any settings are out of bounds or mismatched, the Spaces fail to update or start.

## Using the hyp CLI and kubectl
<a name="customization-hyp-cli"></a>

Users can perform CRUD operations on templates with the HyperPod CLI:

```
### 1. Create a Space Template
hyp create hyp-space-template --file template.yaml

### 2. List Space Templates
hyp list hyp-space-template
hyp list hyp-space-template --output json

### 3. Describe a Space Template
hyp describe hyp-space-template --name my-template
hyp describe hyp-space-template --name my-template --output json

### 4. Update a Space Template
hyp update hyp-space-template --name my-template --file updated-template.yaml

### 5. Delete a Space Template
hyp delete hyp-space-template --name my-template
```

To create custom templates, you can use our system templates as a starting point. These templates work with SageMaker Distribution (SMD)-like images; however, admins can customize them for the images they use.

Example custom JupyterLab template:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: WorkspaceTemplate
metadata:
  name: my-jupyter-template
  namespace: my-namespace
spec:
  displayName: "My Custom Jupyter Lab"
  description: "Custom Jupyter Lab with specific configurations"
  defaultImage: "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
  allowedImages:
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu"
  defaultResources:
    requests:
      cpu: "1"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "16Gi"
  primaryStorage:
    defaultSize: "10Gi"
    minSize: "5Gi"
    maxSize: "50Gi"
    defaultStorageClassName: "sagemaker-spaces-default-storage-class"
    defaultMountPath: "/home/sagemaker-user"
  defaultContainerConfig:
    command: ["/opt/amazon/sagemaker/workspace/bin/entrypoint-workspace-jupyterlab"]
  defaultPodSecurityContext:
    fsGroup: 1000
  defaultOwnershipType: "Public"
  defaultAccessStrategy:
    name: "hyperpod-access-strategy"
  allowSecondaryStorages: true
  appType: "jupyterlab"
```

Example custom Code Editor template:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: WorkspaceTemplate
metadata:
  name: my-code-editor-template
  namespace: my-namespace
spec:
  displayName: "My Custom Code Editor"
  description: "Custom Code Editor with specific configurations"
  defaultImage: "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
  allowedImages:
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu"
  defaultResources:
    requests:
      cpu: "1"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "16Gi"
  primaryStorage:
    defaultSize: "10Gi"
    minSize: "5Gi"
    maxSize: "50Gi"
    defaultStorageClassName: "sagemaker-spaces-default-storage-class"
    defaultMountPath: "/home/sagemaker-user"
  defaultContainerConfig:
    command: ["/opt/amazon/sagemaker/workspace/bin/entrypoint-workspace-code-editor"]
  defaultPodSecurityContext:
    fsGroup: 1000
  defaultOwnershipType: "Public"
  defaultAccessStrategy:
    name: "hyperpod-access-strategy"
  allowSecondaryStorages: true
  appType: "code-editor"
```

# Add users and set up service accounts
<a name="add-user"></a>

## Fine-grained access control - our recommendation
<a name="add-user-access-control"></a>

Users are differentiated based on their Kubernetes username. The user’s Kubernetes username is defined in their Access Entry. To ensure two human users have distinct usernames, there are two options:

1. Recommended - Multiple human users can use the same role as long as each one has their own distinct session name that will persist between sessions. By default, Kubernetes usernames for IAM roles are in the format `arn:aws:sts::{ACCOUNT_ID}:assumed-role/{ROLE_NAME}/{SESSION_NAME}`. With this default, users will already be differentiated by session name. An Admin has a few ways to enforce unique session names per user.
   + SSO login - Users using SSO login will by default have a session name tied to their AWS username
   + Central credential vending service - Enterprise customers may have an internal credential vending service that users call to get credentials tied to their identity.
   + Role based enforcement - Require IAM users to set their `aws:username` as their role session name when they assume an IAM role in your AWS account. Documentation on how to do this is here: [https://aws.amazon.com/blogs/security/easily-control-naming-individual-iam-role-sessions/](https://aws.amazon.com/blogs/security/easily-control-naming-individual-iam-role-sessions/)

1. If two data scientists use different access entries (different IAM roles or users), they are always counted as different users.
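As an illustration, the default username format means two sessions of the same role are distinct users whenever their session names differ. The account ID, role name, and session names below are placeholders.

```python
def kubernetes_username(account_id, role_name, session_name):
    """Default Kubernetes username format for an assumed IAM role."""
    return (f"arn:aws:sts::{account_id}:assumed-role/"
            f"{role_name}/{session_name}")

# Two data scientists sharing one IAM role are still distinct
# Kubernetes users as long as their session names differ.
alice = kubernetes_username("111122223333", "DataScientistRole", "alice")
bob = kubernetes_username("111122223333", "DataScientistRole", "bob")
print(alice != bob)  # True
```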

**Creating access entry**

Required IAM policy for the data scientist role:
+ `eks:DescribeCluster`

Required access entry policies:
+ `AmazonSagemakerHyperpodSpacePolicy` - scoped to the namespace the data scientist should create spaces in
+ `AmazonSagemakerHyperpodSpaceTemplatePolicy` - scoped to the “jupyter-k8s-shared” namespace

## Private and Public spaces
<a name="add-user-spaces"></a>

We support two sharing patterns: “Public” and “OwnerOnly”. Both the “AccessType” and “OwnershipType” fields use these two values.
+ AccessType: Public spaces can be accessed by anyone with permissions in the namespace, while OwnerOnly spaces can be accessed only by the space creator and administrator users.
+ OwnershipType: Public spaces can be modified or deleted by anyone with permissions in the namespace; OwnerOnly spaces can be modified or deleted only by the creator or an admin.

Admin users are defined by:

1. Membership in the `system:masters` Kubernetes group

1. Membership in the Kubernetes group defined in the `CLUSTER_ADMIN_GROUP` environment variable in the Helm chart

A user’s groups can be configured using EKS access entries. A space can be defined as “Public” or “OwnerOnly” by configuring the spec in the object:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  labels:
    app.kubernetes.io/name: jupyter-k8s
  name: example-workspace
spec:
  displayName: "Example Workspace"
  image: "public.ecr.aws/sagemaker/sagemaker-distribution:3.4.2-cpu"
  desiredStatus: "Running"
  ownershipType: "Public"   # or "OwnerOnly"
  accessType: "Public"      # or "OwnerOnly"
  # more fields here
```

# Limits
<a name="ds-limits"></a>

Spaces run as pods on HyperPod EKS nodes with attached EBS volumes. The number of Spaces that can be deployed per node is constrained by AWS infrastructure limits.

**EBS Volume Limits per Node**

Reference: [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/volume_limits.html](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/volume_limits.html)

EC2 nodes have a maximum number of EBS volumes that can be attached. Since each Space typically uses one EBS volume, this limits how many Spaces with dedicated EBS storage can run on a single node.

**Maximum Pods per HyperPod Node**

Reference: [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)

Each HyperPod instance type supports a maximum number of pods based on available IP addresses from the VPC CNI plugin. Since each Space runs as a pod, this directly caps the number of Spaces per node.

**Impact**

The effective limit for Spaces per node is whichever constraint is reached first. 
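A quick sketch of how the two constraints combine. All numbers are hypothetical; look up the actual EBS attachment limit and max-pods value for your instance type in the referenced pages.

```python
def max_spaces_per_node(ebs_attachment_limit, max_pods, system_pods=0):
    """Effective Spaces-per-node cap: the tighter of the EBS volume
    limit and the pod limit (minus pods reserved for system daemons)."""
    return min(ebs_attachment_limit, max_pods - system_pods)

# Hypothetical node: 26 usable EBS attachment slots, 58 pod slots,
# 4 of which are taken by system daemons.
print(max_spaces_per_node(26, 58, system_pods=4))  # 26: EBS is the bottleneck
print(max_spaces_per_node(26, 20, system_pods=4))  # 16: pod count is the bottleneck
```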

# Task governance for Interactive Spaces on HyperPod
<a name="task-governance"></a>

This section covers how to optimize your shared Amazon SageMaker HyperPod EKS clusters for Interactive Spaces workloads. You'll learn to configure Kueue's task governance features—including quota management, priority scheduling, and resource sharing policies—to ensure your development workloads run without interruption while maintaining fair allocation across your teams' training, evaluation, and batch processing activities.

## How Interactive Space management works
<a name="task-governance-how"></a>

To effectively manage Interactive Spaces in shared HyperPod EKS clusters, implement the following task governance strategies using Kueue's existing capabilities.

**Priority class configuration**

Define dedicated priority classes for Interactive Spaces with high weights (such as 100) to ensure development pods are admitted and scheduled before other task types. This configuration enables Interactive Spaces to preempt lower-priority jobs during cluster load, which is critical for maintaining uninterrupted development workflows.
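Assuming your cluster uses Kueue's `WorkloadPriorityClass` API, a high-weight class for Interactive Spaces might look like the following sketch; the class name and description are illustrative.

```
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: interactive-space-priority
value: 100
description: "High priority for Interactive Space pods"
```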

**Quota sizing and allocation**

Reserve sufficient compute resources in your team's ClusterQueue to handle expected development workloads. During periods when development resources are idle, unused quota resources can be temporarily allocated to other teams' tasks. When development demand increases, these borrowed resources can be reclaimed to prioritize pending Interactive Space pods.

**Resource Sharing Strategies**

Choose between two quota sharing approaches based on your requirements:

*Strict Resource Control*: Disable quota lending and borrowing to guarantee reserved compute capacity is always available for your Interactive Spaces. This approach requires sizing quotas large enough to independently handle peak development demand and may result in idle nodes during low-usage periods.

*Flexible Resource Sharing*: Enable quota lending to allow other teams to utilize idle development resources when needed. However, disable borrowing to ensure Interactive Spaces never run on borrowed, reclaimable resources that could lead to unexpected evictions.
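The flexible-sharing approach can be sketched with a Kueue `ClusterQueue` like the following; the queue name, flavor name, and quantities are illustrative, and `lendingLimit` may require enabling the corresponding Kueue feature gate. Lending is capped at 4 CPUs, while `borrowingLimit: 0` keeps Interactive Spaces off reclaimable capacity.

```
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: dev-team-queue
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources: ["cpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: 8
              borrowingLimit: 0   # never run on borrowed, reclaimable CPUs
              lendingLimit: 4     # let other teams use up to 4 idle CPUs
```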

**Intra-Team Preemption**

Enable intra-team preemption when running mixed workloads (training, evaluation, and Interactive Spaces) under the same quota. This allows Kueue to preempt lower-priority jobs within your team to accommodate high-priority Interactive Space pods, ensuring development work can proceed without depending on external quota borrowing.

## Sample Interactive Space setup
<a name="task-governance-space-setup"></a>

The following example shows how Kueue manages compute resources for Interactive Spaces in a shared Amazon SageMaker HyperPod cluster.

**Cluster configuration and policy setup**

Your cluster has the following configuration:
+ *Team Alpha (Dev Team)*: 8 CPU quota for Interactive Spaces
+ *Team Beta (ML Team)*: 16 CPU quota for training and evaluation
+ *Team Gamma (Research)*: 6 CPU quota for experimentation
+ *Static provisioning*: No autoscaling
+ *Total capacity*: 30 CPUs

The shared CPU pool uses this priority policy:
+ *Interactive Spaces*: Priority 100
+ *Training*: Priority 75
+ *Evaluation*: Priority 50
+ *Batch Processing*: Priority 25

Kueue enforces team quotas and priority classes, with preemption enabled and borrowing disabled for the dev team.

**Initial state: Normal cluster utilization**

In normal operations:
+ *Team Alpha*: Runs 6 Interactive Spaces using 6 CPUs, 2 CPUs idle
+ *Team Beta*: Runs training jobs (12 CPUs) and evaluation (4 CPUs) within its 16 CPU quota
+ *Team Gamma*: Runs research workloads on all 6 CPUs
+ *Resource sharing*: Team Beta borrows Team Alpha's 2 idle CPUs for additional training

**Development spike: Team Alpha requires additional resources**

When Team Alpha's developers need to scale up development work, additional Interactive Space pods require 4 more CPUs. Kueue detects that the new pods are:
+ Within Team Alpha's namespace
+ Priority 100 (Interactive Spaces)
+ Pending admission due to quota constraints

**Kueue's response process**

Kueue follows a three-step process to allocate resources:

1. **Quota check**

   Question: Does Team Alpha have unused quota?
   + *Current usage*: 6 CPUs used, 2 CPUs available
   + *New requirement*: 4 CPUs needed
   + *Result*: Insufficient quota → Proceed to Step 2

1. **Self-preemption within Team Alpha**

   Question: Can lower-priority Team Alpha jobs be preempted?
   + *Available targets*: No lower-priority jobs in Team Alpha
   + *Result*: No preemption possible → Proceed to Step 3

1. **Reclaim borrowed resources**

   Question: Are Team Alpha resources being borrowed by other teams?
   + *Borrowed resources*: Team Beta using 2 CPUs from Team Alpha
   + *Action*: Kueue evicts Team Beta's borrowed training pods, freeing 2 CPUs
   + *Remaining need*: Still need 2 more CPUs → Interactive Spaces remain in NotAdmitted state until resources become available

This approach prioritizes Interactive Spaces while maintaining team quota boundaries and preventing development work from running on unstable borrowed resources.
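The three-step flow above can be sketched numerically. This is an illustration of the arithmetic, not Kueue's actual admission logic; the function name and tuple shape are hypothetical.

```python
def kueue_admission(quota, used, requested, preemptible_cpus, lent_out):
    """Sketch of the three-step flow: unused quota, then intra-team
    preemption, then reclaiming CPUs lent out to other teams."""
    headroom = (quota - used) + preemptible_cpus  # Steps 1-2
    admitted = min(requested, headroom)           # pods that fit in quota
    # Step 3: reclaim borrowed CPUs so admitted pods have physical capacity.
    reclaimed = min(lent_out, admitted)
    pending = requested - admitted                # remain NotAdmitted
    return admitted, reclaimed, pending

# Team Alpha's spike: quota 8, 6 in use, 4 more CPUs requested,
# nothing preemptible, 2 CPUs lent to Team Beta.
print(kueue_admission(quota=8, used=6, requested=4,
                      preemptible_cpus=0, lent_out=2))
# (2, 2, 2): 2 CPUs admitted after reclaiming, 2 pods stay NotAdmitted
```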

# Observability
<a name="observability"></a>

## Standard Kubernetes Monitoring
<a name="observability-monitor"></a>

You can monitor Spaces using standard Kubernetes tools like `kubectl describe` and `kubectl logs`.

**Monitoring Space Status**

```
# List all Spaces with status
kubectl get workspace -A

# Get detailed information about a specific Space
kubectl describe workspace <workspace-name>
```

**Viewing Space Logs**

```
# View workspace container logs
kubectl logs -l workspace.jupyter.org/workspace-name=<workspace-name> -c workspace

# View SSM agent sidecar logs (for remote IDE connectivity)
kubectl logs -l workspace.jupyter.org/workspace-name=<workspace-name> -c ssm-agent-sidecar

# Follow logs in real-time
kubectl logs -l workspace.jupyter.org/workspace-name=<workspace-name> -c workspace -f
```

**Understanding Space Conditions**

Spaces report four condition types in their status:
+ **Available**: `True` when the Space is ready for use. All required resources (pods, services, storage) are running and healthy.
+ **Progressing**: `True` when the Space is being created, updated, or reconciled. Transitions to `False` once stable.
+ **Degraded**: `True` when errors are detected with the Space resources. Check the condition message for details.
+ **Stopped**: `True` when the Space desired status is set to `Stopped`. The pods are terminated but storage and configuration are preserved.

## CloudWatch Logs Integration
<a name="observability-cw"></a>

You can install the CloudWatch logging add-on to send Space logs to Amazon CloudWatch Logs for centralized log management and retention. This enables log aggregation across multiple clusters and integration with CloudWatch Insights for querying and analysis. With this add-on, all of the logs available through the `kubectl logs` commands above are queryable in CloudWatch.

**Reference:** [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci.html)

## HyperPod Observability Add-on
<a name="observability-addon"></a>

The SageMaker HyperPod observability add-on provides comprehensive dashboards for monitoring Space resource utilization. After installing the add-on, you can view Space memory and CPU usage in the **Tasks** tab of the HyperPod console, which displays metrics in Amazon Managed Grafana dashboards.

**Reference:** [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-observability-addon.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-observability-addon.html)

**Key metrics available:**
+ CPU and memory utilization per Space
+ GPU metrics (if applicable)

# Create and manage spaces
<a name="create-manage-spaces"></a>

Data scientists can list all the spaces they have access to, create a space from one of the templates, update a space's image, file system, and other configuration attributes, and delete a space. As a prerequisite, install the HyperPod CLI or use kubectl to create and manage spaces. For more details on the HyperPod CLI, see [the HyperPod CLI README](https://github.com/aws/sagemaker-hyperpod-cli/blob/main/README.md#space). To install kubectl, see [this guide](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html).

## Create space
<a name="create-manage-spaces-create"></a>

**HyperPod CLI**

Create a Jupyter space

```
hyp create hyp-space \
    --name myspace \
    --display-name "My Space" \
    --memory 8Gi \
    --template-ref name=sagemaker-jupyter-template,namespace=jupyter-k8s-system
```

Create a Code Editor space

```
hyp create hyp-space \
    --name myspace \
    --display-name "My Space" \
    --memory 8Gi \
    --template-ref name=sagemaker-code-editor-template,namespace=jupyter-k8s-system
```

**kubectl**

```
kubectl apply -f - <<EOF
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-space
spec:
  displayName: my-space
  desiredStatus: Running
EOF
```

Or you can apply a YAML file directly:

```
kubectl apply -f my-workspace.yaml
```

## List spaces
<a name="create-manage-spaces-list"></a>

**HyperPod CLI**

```
hyp list hyp-space
```

**kubectl**

```
kubectl get workspaces -n <workspace-namespace> 
```

## Describe a space
<a name="create-manage-spaces-describe"></a>

**HyperPod CLI**

```
hyp describe hyp-space --name myspace
```

**kubectl**

```
# Basic Status reporting
kubectl get workspace my-workspace -n <workspace-namespace>

# Enhanced Workspace Information Retrieval 
kubectl get workspace my-workspace -n <workspace-namespace> -o wide

# Complete Workspace Information Retrieval
kubectl get workspace my-workspace -n <workspace-namespace> -o json
kubectl get workspace my-workspace -n <workspace-namespace> -o yaml
```

## Update a space
<a name="create-manage-spaces-update"></a>

**HyperPod CLI**

```
hyp update hyp-space \
    --name myspace \
    --display-name "Updated My Space"
```

**kubectl**

Update the original workspace YAML file as needed, then re-apply it. Make sure the metadata name is not modified. You can also use these kubectl commands to modify fields without reapplying the entire workspace YAML:

```
# Open the Workspace in an editor and modify it
kubectl edit workspace <workspace-name> -n <workspace-namespace>

# Patch a Workspace
kubectl patch workspace <workspace-name> --type='merge' -p \
    '{"spec":{"<field name>":"<desired value>"}}' -n <workspace-namespace>
```

## Start/Stop a space
<a name="create-manage-spaces-stop"></a>

**HyperPod CLI**

```
hyp start hyp-space --name myspace
hyp stop hyp-space --name myspace
```

**kubectl**

You can update the desired status field in the Workspace to start/stop a space.

```
# Start a Workspace
kubectl patch workspace <workspace-name> --type='merge' -p \
    '{"spec":{"desiredStatus":"Running"}}' -n <workspace-namespace>
    
# Stop a Workspace
kubectl patch workspace <workspace-name> --type='merge' -p \
    '{"spec":{"desiredStatus":"Stopped"}}' -n <workspace-namespace>
```

## Get Logs
<a name="create-manage-spaces-log"></a>

**HyperPod CLI**

```
hyp get-logs hyp-space --name myspace
```

**kubectl**

```
# Check Pod Logs
kubectl logs -l workspace.jupyter.org/workspace-name=<workspace-metadata-name>

# Check Pod Events
kubectl describe pod -l workspace.jupyter.org/workspace-name=<workspace-metadata-name>

# Check Operator Logs
kubectl logs -n jupyter-k8s-system deployment/jupyter-k8s-controller-manager
```

## Delete a space
<a name="create-manage-spaces-delete"></a>

**HyperPod CLI**

```
hyp delete hyp-space --name myspace
```

**kubectl**

```
# Delete a Workspace
kubectl delete workspace <workspace-name> -n <namespace>
```

# Web browser access
<a name="browser-access"></a>

Web UI access allows you to connect directly to development spaces running on your SageMaker HyperPod cluster through a secure web browser interface. This provides immediate access to Jupyter Lab and other web-based development environments without requiring local software installation.

## Prerequisites
<a name="browser-access-prereq"></a>

Before setting up web UI access, ensure you have completed the following:
+ *SageMaker Spaces add-on installation*: Follow the [SageMaker Spaces add-on installation ](https://docs.aws.amazon.com/sagemaker/latest/dg/operator-install.html) and enable web UI access during installation
+ *User access to EKS cluster*: Users need EKS Access Entry configured with appropriate permissions. See [Add users and set up service accounts for EKS Access Entry setup details](https://docs.aws.amazon.com/sagemaker/latest/dg/add-user.html)
+ *Development spaces*: Create and start development spaces on your HyperPod cluster
+ *kubectl access*: Ensure kubectl is configured to access your EKS cluster

## Generate Web UI Access URL
<a name="browser-access-url"></a>

**Using HyperPod CLI**

If you have the HyperPod CLI installed, you can use this simplified command:

```
hyp create hyp-space-access --name <space-name> --connection-type web-ui
```

**Using kubectl**

You can also use the `kubectl` command line to create a connection request.

```
kubectl create -f - -o yaml <<EOF
apiVersion: connection.workspace.jupyter.org/v1alpha1
kind: WorkspaceConnection
metadata:
  namespace: <space-namespace>
spec:
  workspaceName: <space-name>
  workspaceConnectionType: web-ui
EOF
```

The URL is in the `status.workspaceConnectionUrl` field of the command output.

## Accessing Your Development Space
<a name="browser-access-develop"></a>

1. *Generate the web UI URL* using one of the methods above

1. *Copy the URL* from the response

1. *Open the URL* in your web browser

1. *Access your development environment* through the web interface

## Supported Development Environments
<a name="browser-access-develop-env"></a>

The web UI provides access to:
+ *Jupyter Lab*
+ *Code Editor*

## Troubleshooting
<a name="browser-access-troubleshooting"></a>

**Cannot generate access URLs**

Check the following:
+ SageMaker Spaces add-on is running: `kubectl get pods -n sagemaker-spaces-system`
+ Development space is running and healthy
+ User has appropriate EKS Access Entry permissions

# Remote access to SageMaker Spaces
<a name="vscode-access"></a>

Remote access allows you to connect your local Visual Studio Code directly to development spaces running on your SageMaker HyperPod cluster. Remote connections use SSM to establish secure, encrypted tunnels between your local machine and the development spaces.

## Prerequisites
<a name="vscode-access-prereq"></a>

Before setting up remote access, ensure you have completed the following:
+ *SageMaker Spaces add-on installation*: Follow [SageMaker Spaces add-on installation](https://docs.aws.amazon.com/sagemaker/latest/dg/operator-install.html) and enable remote access during installation (either Quick install or Custom install with remote access configuration enabled).
+ *User access to EKS cluster*: Users need EKS Access Entry configured with appropriate permissions. See [Add users and set up service accounts for EKS Access Entry setup details](https://docs.aws.amazon.com/sagemaker/latest/dg/add-user.html)
+ *Development spaces*: Create and start development spaces on your HyperPod cluster
+ *kubectl access*: Ensure kubectl is configured to access your EKS cluster

## Generate VS Code remote connection
<a name="vscode-access-remote"></a>

### Using HyperPod CLI
<a name="vscode-access-remote-cli"></a>

If you have the HyperPod CLI installed, you can use this simplified command:

```
hyp create hyp-space-access --name <space-name> --connection-type vscode-remote
```

### Using kubectl
<a name="vscode-access-remote-kubectl"></a>

You can also use the `kubectl` command line to create a connection request.

```
kubectl create -f - -o yaml <<EOF
apiVersion: connection.workspace.jupyter.org/v1alpha1
kind: WorkspaceConnection
metadata:
  namespace: <space-namespace>
spec:
  workspaceName: <space-name>
  workspaceConnectionType: vscode-remote
EOF
```

The URL is in the `status.workspaceConnectionUrl` field of the command output.

## Connecting with VS Code
<a name="vscode-access-remote-vscode"></a>

1. Generate the VS Code connection URL using one of the methods above

1. Copy the VS Code URL from the response

1. Click the URL or paste it into your browser

1. VS Code will prompt to open the remote connection

1. Confirm the connection to establish the remote development environment

## Supported Development Environments
<a name="vscode-access-remote-dev-env"></a>

Remote connections provide access to:
+ *Jupyter Lab*
+ *Code Editor*

## Troubleshooting
<a name="troubleshooting"></a>

**Cannot generate connection URLs**

*Check the following:*
+ SageMaker Spaces add-on is running: `kubectl get pods -n sagemaker-spaces-system`
+ Development space is running and healthy
+ Remote access was enabled during add-on installation
+ User has appropriate EKS Access Entry permissions

# Train and deploy models with HyperPod CLI and SDK
<a name="getting-started-hyperpod-training-deploying-models"></a>

Amazon SageMaker HyperPod helps you train and deploy machine learning models at scale. The AWS HyperPod CLI is a unified command-line interface that simplifies machine learning (ML) workflows on AWS. It abstracts infrastructure complexities and provides a streamlined experience for submitting, monitoring, and managing ML training jobs. The CLI is designed specifically for data scientists and ML engineers who want to focus on model development rather than infrastructure management. This topic walks you through three key scenarios: training a PyTorch model, deploying a custom model using trained artifacts, and deploying a JumpStart model. Designed for first-time users, this concise tutorial ensures you can set up, train, and deploy models effortlessly using either the HyperPod CLI or the SDK. The handshake process between training and inference helps you manage model artifacts effectively. 

## Prerequisites
<a name="prerequisites"></a>

Before you begin using Amazon SageMaker HyperPod, make sure you have:
+ An AWS account with access to Amazon SageMaker HyperPod
+ Python 3.9, 3.10, or 3.11 installed
+ AWS CLI configured with appropriate credentials. 
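Before installing, you can confirm that your interpreter falls within the supported range. The following is a minimal sketch; the supported versions are the ones listed above.

```python
import sys

# Supported interpreter minor versions for the HyperPod CLI and SDK (per the list above)
SUPPORTED = {(3, 9), (3, 10), (3, 11)}

def python_supported(version_info=sys.version_info):
    """Return True if the running Python is a supported minor version."""
    return (version_info[0], version_info[1]) in SUPPORTED

if __name__ == "__main__":
    print("Supported Python:", python_supported())
```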

## Install the HyperPod CLI and SDK
<a name="install-cli-sdk"></a>

Install the required package to access the CLI and SDK:

```
pip install sagemaker-hyperpod
```

This command sets up the tools needed to interact with HyperPod clusters.

## Configure your cluster context
<a name="configure-cluster"></a>

HyperPod operates on clusters optimized for machine learning. Start by listing available clusters to select one for your tasks.

1. List all available clusters:

   ```
   hyp list-cluster
   ```

1. Choose and set your active cluster:

   ```
   hyp set-cluster-context your-eks-cluster-name
   ```

1. Verify the configuration:

   ```
   hyp get-cluster-context
   ```

**Note**  
All subsequent commands target the cluster you've set as your context.
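If you script these steps, the same commands can be driven from Python with `subprocess`. This is a hedged sketch, not part of the SDK: it assumes `hyp` is already on your `PATH`, and the cluster name is whatever `hyp list-cluster` returned for you.

```python
import subprocess

def hyp(*args, check=True):
    """Run a `hyp` CLI command and return its stdout as text."""
    result = subprocess.run(
        ["hyp", *args], capture_output=True, text=True, check=check
    )
    return result.stdout

def set_cluster_context(cluster_name):
    """List clusters, set the active context, then verify it."""
    print(hyp("list-cluster"))                # 1. list available clusters
    hyp("set-cluster-context", cluster_name)  # 2. choose the active cluster
    return hyp("get-cluster-context")         # 3. verify the configuration
```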

## Choose your scenario
<a name="choose-scenario"></a>

For detailed instructions on each scenario, click on the topics below:

**Topics**
+ [Prerequisites](#prerequisites)
+ [Install the HyperPod CLI and SDK](#install-cli-sdk)
+ [Configure your cluster context](#configure-cluster)
+ [Choose your scenario](#choose-scenario)
+ [Train a PyTorch model](train-models-with-hyperpod.md)
+ [Deploy a custom model](deploy-trained-model.md)
+ [Deploy a JumpStart model](deploy-jumpstart-model.md)

# Train a PyTorch model
<a name="train-models-with-hyperpod"></a>

This topic walks you through the process of training a PyTorch model using HyperPod.

In this scenario, let's train a PyTorch model using the `hyp-pytorch-job` template, which simplifies job creation by exposing commonly used parameters. The model artifacts will be stored in an S3 bucket for later use in inference. However, this is optional, and you can choose your preferred storage location.

## Create a training job
<a name="create-training-job"></a>

You can train the model using either the CLI or Python SDK.

### Using the CLI
<a name="using-cli"></a>

Create a training job with the following command:

```
hyp create hyp-pytorch-job \
    --version 1.0 \
    --job-name test-pytorch-job \
    --image pytorch/pytorch:latest \
    --command '["python", "train.py"]' \
    --args '["--epochs", "10", "--batch-size", "32"]' \
    --environment '{"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"}' \
    --pull-policy "IfNotPresent" \
    --instance-type ml.p4d.24xlarge \
    --tasks-per-node 8 \
    --label-selector '{"accelerator": "nvidia", "network": "efa"}' \
    --deep-health-check-passed-nodes-only true \
    --scheduler-type "kueue" \
    --queue-name "training-queue" \
    --priority "high" \
    --max-retry 3 \
    --volumes '["data-vol", "model-vol", "checkpoint-vol"]' \
    --persistent-volume-claims '["shared-data-pvc", "model-registry-pvc"]' \
    --output-s3-uri s3://my-bucket/model-artifacts
```

**Key required parameters explained**:
+ `--job-name`: Unique identifier for your training job
+ `--image`: Docker image containing your training environment

This command starts a training job named `test-pytorch-job`. The `--output-s3-uri` specifies where the trained model artifacts will be stored, for example, `s3://my-bucket/model-artifacts`. Note this location, as you’ll need it for deploying the custom model.
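Several of the flags above (`--command`, `--args`, `--environment`) take JSON-encoded values, so the quoting matters. If you build the command programmatically, `json.dumps` produces exactly the strings the CLI example passes. A sketch, using the values from the example above:

```python
import json
import shlex

command = ["python", "train.py"]
args = ["--epochs", "10", "--batch-size", "32"]
environment = {"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"}

# JSON-encode the structured flags exactly as the CLI example passes them
flags = {
    "--command": json.dumps(command),
    "--args": json.dumps(args),
    "--environment": json.dumps(environment),
}

for flag, value in flags.items():
    # shlex.quote adds the shell quoting you would otherwise type by hand
    print(flag, shlex.quote(value))
```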

### Using the Python SDK
<a name="using-python-sdk"></a>

For programmatic control, use the SDK. Create a Python script to launch the same training job.

```
from sagemaker.hyperpod import HyperPodPytorchJob
from sagemaker.hyperpod.job import (
    ReplicaSpec, Template, Spec, Container, Resources, RunPolicy, Metadata
)

# Define job specifications
nproc_per_node = "1"  # Number of processes per node
replica_specs = [
    ReplicaSpec(
        name="pod",  # Replica name
        template=Template(
            spec=Spec(
                containers=[
                    Container(
                        name="container-name",  # Container name
                        # Training image
                        image="448049793756.dkr.ecr.us-west-2.amazonaws.com/ptjob:mnist",
                        image_pull_policy="Always",  # Always pull image
                        resources=Resources(
                            requests={"nvidia.com/gpu": "0"},  # No GPUs requested
                            limits={"nvidia.com/gpu": "0"},    # No GPU limit
                        ),
                        command=["python", "train.py"],  # Command to run
                        args=["--epochs", "10", "--batch-size", "32"],  # Script arguments
                    )
                ]
            )
        ),
    )
]

# Keep pods after completion
run_policy = RunPolicy(clean_pod_policy="None")

# Create and start the PyTorch job
pytorch_job = HyperPodPytorchJob(
    metadata=Metadata(name="demo"),  # Job name
    nproc_per_node=nproc_per_node,   # Processes per node
    replica_specs=replica_specs,     # Replica specifications
    run_policy=run_policy,           # Run policy
    output_s3_uri="s3://my-bucket/model-artifacts",  # S3 location for artifacts
)

# Launch the job
pytorch_job.create()
```

## Monitor your training job
<a name="monitor-training-job"></a>

Monitor your job's progress with these commands:

### Using the CLI
<a name="monitor-cli"></a>

```
# Check job status
hyp list hyp-pytorch-job

# Get detailed information
hyp describe hyp-pytorch-job --job-name test-pytorch-job

# View logs
hyp get-logs hyp-pytorch-job \
    --pod-name test-pytorch-job-pod-0 \
    --job-name test-pytorch-job
```

**Note**: Training time varies based on model complexity and instance type. Monitor the logs to track progress.

These commands help you verify the job’s status and troubleshoot issues. Once the job completes successfully, the model artifacts are saved to `s3://my-bucket/model-artifacts`.

### Using the Python SDK
<a name="monitor-python-sdk"></a>

Add the following code to your Python script:

```
print("List all pods created for this job:")
print(pytorch_job.list_pods())

print("Check the logs from pod0:")
print(pytorch_job.get_logs_from_pod(pod_name="demo-pod-0"))

print("List all HyperPodPytorchJobs:")
print(HyperPodPytorchJob.list())

print("Describe job:")
print(HyperPodPytorchJob.get(name="demo").model_dump())

pytorch_job.refresh()
print(pytorch_job.status.model_dump())
```

## Next steps
<a name="next-steps"></a>

After training, the model artifacts are stored in the S3 bucket you specified (`s3://my-bucket/model-artifacts`). You can use these artifacts to deploy a model. Currently, you must manually manage the transition from training to inference. This involves:
+ **Locating artifacts**: Check the S3 bucket (`s3://my-bucket/model-artifacts`) to confirm the trained model files are present.
+ **Recording the path**: Note the exact S3 path (e.g., `s3://my-bucket/model-artifacts/test-pytorch-job/model.tar.gz`) for use in the inference setup.
+ **Referencing in deployment**: Provide this S3 path when configuring the custom endpoint to ensure the correct model is loaded.
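When wiring this handoff yourself, a small helper that splits the artifact URI into bucket and key can prevent copy-paste mistakes. A minimal sketch using only the standard library:

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = split_s3_uri(
    "s3://my-bucket/model-artifacts/test-pytorch-job/model.tar.gz"
)
print(bucket)  # my-bucket
print(key)     # model-artifacts/test-pytorch-job/model.tar.gz
```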

# Deploy a custom model
<a name="deploy-trained-model"></a>

After training completes, deploy your model for inference. You can deploy a custom model using either the CLI or the SDK.

## Locate your model artifacts
<a name="locate-model-artifacts"></a>
+ **Check your S3 bucket**: Verify that model artifacts are saved at `s3://my-bucket/model-artifacts/`
+ **Note the exact path**: You'll need the full path (for example, `s3://my-bucket/model-artifacts/test-pytorch-job/model.tar.gz`)

## Deploy using the CLI
<a name="deploy-using-cli"></a>

Run the following command to deploy your custom model:

```
hyp create hyp-custom-endpoint \
    --version 1.0 \
    --env '{"HF_MODEL_ID":"/opt/ml/model", "SAGEMAKER_PROGRAM":"inference.py"}' \
    --model-source-type s3 \
    --model-location test-pytorch-job \
    --s3-bucket-name my-bucket \
    --s3-region us-east-2 \
    --prefetch-enabled true \
    --image-uri 763104351884.dkr.ecr.us-east-2.amazonaws.com/pytorch-inference:latest \
    --model-volume-mount-name model-weights \
    --container-port 8080 \
    --resources-requests '{"cpu": "30000m", "nvidia.com/gpu": 1, "memory": "100Gi"}' \
    --resources-limits '{"nvidia.com/gpu": 1}' \
    --tls-output-s3-uri s3://<bucket_name> \
    --instance-type ml.g5.8xlarge \
    --endpoint-name endpoint-custom-pytorch \
    --model-name pytorch-custom-model
```

This command deploys the trained model as an endpoint named `endpoint-custom-pytorch`. The `--model-location` references the artifact path from the training job.
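The `--resources-requests` values use Kubernetes quantity notation: `30000m` means 30 CPUs (millicores), and `100Gi` is a binary (GiB) memory multiple. A small converter makes the units concrete; this sketch covers only the suffixes used above.

```python
def parse_cpu(quantity):
    """Convert a Kubernetes CPU quantity string to a float number of cores."""
    if quantity.endswith("m"):  # millicores: 1000m == 1 CPU
        return int(quantity[:-1]) / 1000
    return float(quantity)

def parse_memory_gib(quantity):
    """Convert a Kubernetes memory quantity with a Gi suffix to GiB."""
    if quantity.endswith("Gi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported suffix: {quantity}")

print(parse_cpu("30000m"))        # 30.0 CPUs
print(parse_memory_gib("100Gi"))  # 100 GiB
```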

## Deploy using the Python SDK
<a name="deploy-using-sdk"></a>

Create a Python script with the following content:

```
from sagemaker.hyperpod.inference.config.hp_custom_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig, EnvironmentVariables
from sagemaker.hyperpod.inference.hp_custom_endpoint import HPCustomEndpoint

model = Model(
    model_source_type="s3",
    model_location="test-pytorch-job",
    s3_bucket_name="my-bucket",
    s3_region="us-east-2",
    prefetch_enabled=True
)

server = Server(
    instance_type="ml.g5.8xlarge",
    image_uri="763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0",
    container_port=8080,
    model_volume_mount_name="model-weights"
)

resources = {
    "requests": {"cpu": "30000m", "nvidia.com/gpu": 1, "memory": "100Gi"},
    "limits": {"nvidia.com/gpu": 1}
}

env = EnvironmentVariables(
    HF_MODEL_ID="/opt/ml/model",
    SAGEMAKER_PROGRAM="inference.py",
    SAGEMAKER_SUBMIT_DIRECTORY="/opt/ml/model/code",
    MODEL_CACHE_ROOT="/opt/ml/model",
    SAGEMAKER_ENV="1"
)

endpoint_name = SageMakerEndpoint(name="endpoint-custom-pytorch")

tls_config = TlsConfig(tls_certificate_output_s3_uri="s3://<bucket_name>")

custom_endpoint = HPCustomEndpoint(
    model=model,
    server=server,
    resources=resources,
    environment=env,
    sage_maker_endpoint=endpoint_name,
    tls_config=tls_config
)

custom_endpoint.create()
```

## Invoke the endpoint
<a name="invoke-endpoint"></a>

### Using the CLI
<a name="invoke-using-cli"></a>

Test the endpoint with a sample input:

```
hyp invoke hyp-custom-endpoint \
    --endpoint-name endpoint-custom-pytorch \
    --body '{"inputs":"What is the capital of USA?"}'
```

This returns the model’s response, such as “The capital of the USA is Washington, D.C.”

### Using the SDK
<a name="invoke-using-sdk"></a>

Add the following code to your Python script:

```
data = '{"inputs":"What is the capital of USA?"}'
response = custom_endpoint.invoke(body=data).body.read()
print(response)
```

## Manage the endpoint
<a name="manage-endpoint"></a>

### Using the CLI
<a name="manage-using-cli"></a>

List and inspect the endpoint:

```
hyp list hyp-custom-endpoint
hyp get hyp-custom-endpoint --name endpoint-custom-pytorch
```

### Using the SDK
<a name="manage-using-sdk"></a>

Add the following code to your Python script:

```
logs = custom_endpoint.get_logs()
print(logs)
```

## Clean up resources
<a name="cleanup-resources"></a>

When you're done, delete the endpoint to avoid unnecessary costs.

### Using the CLI
<a name="cleanup-using-cli"></a>

```
hyp delete hyp-custom-endpoint --name endpoint-custom-pytorch
```

### Using the SDK
<a name="cleanup-using-sdk"></a>

```
custom_endpoint.delete()
```

## Next steps
<a name="next-steps"></a>

You've successfully deployed and tested a custom model using SageMaker HyperPod. You can now use this endpoint for inference in your applications.

# Deploy a JumpStart model
<a name="deploy-jumpstart-model"></a>

You can deploy a pre-trained JumpStart model for inference using either the CLI or the SDK.

## Using the CLI
<a name="deploy-jumpstart-cli"></a>

Run the following command to deploy a JumpStart model:

```
hyp create hyp-jumpstart-endpoint \
  --version 1.0 \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --instance-type ml.g5.8xlarge \
  --endpoint-name endpoint-test-jscli
```

## Using the SDK
<a name="deploy-jumpstart-sdk"></a>

Create a Python script with the following content:

```
from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint

model=Model(
    model_id='deepseek-llm-r1-distill-qwen-1-5b'
)

server=Server(
    instance_type='ml.g5.8xlarge',
)

endpoint_name=SageMakerEndpoint(name='<endpoint-name>')

# create spec
js_endpoint=HPJumpStartEndpoint(
    model=model,
    server=server,
    sage_maker_endpoint=endpoint_name
)
```

## Invoke the endpoint
<a name="invoke-jumpstart-endpoint"></a>

### Using the CLI
<a name="invoke-jumpstart-cli"></a>

Test the endpoint with a sample input:

```
hyp invoke hyp-jumpstart-endpoint \
    --endpoint-name endpoint-jumpstart \
    --body '{"inputs":"What is the capital of USA?"}'
```

### Using the SDK
<a name="invoke-jumpstart-sdk"></a>

Add the following code to your Python script:

```
data = '{"inputs":"What is the capital of USA?"}'
response = js_endpoint.invoke(body=data).body.read()
print(response)
```

## Manage the endpoint
<a name="manage-jumpstart-endpoint"></a>

### Using the CLI
<a name="manage-jumpstart-cli"></a>

List and inspect the endpoint:

```
hyp list hyp-jumpstart-endpoint
hyp get hyp-jumpstart-endpoint --name endpoint-jumpstart
```

### Using the SDK
<a name="manage-jumpstart-sdk"></a>

Add the following code to your Python script:

```
endpoint_iterator = HPJumpStartEndpoint.list()
for endpoint in endpoint_iterator:
    print(endpoint.name, endpoint.status)

logs = js_endpoint.get_logs()
print(logs)
```

## Clean up resources
<a name="cleanup-jumpstart-resources"></a>

When you're done, delete the endpoint to avoid unnecessary costs.

### Using the CLI
<a name="cleanup-jumpstart-cli"></a>

```
hyp delete hyp-jumpstart-endpoint --name endpoint-jumpstart
```

### Using the SDK
<a name="cleanup-jumpstart-sdk"></a>

```
js_endpoint.delete()
```

## Next steps
<a name="jumpstart-next-steps"></a>

Now that you've trained a PyTorch model, deployed it as a custom endpoint, and deployed a JumpStart model using HyperPod's CLI and SDK, explore advanced features:
+ **Multi-node training**: Scale training across multiple instances
+ **Custom containers**: Build specialized training environments
+ **Integration with SageMaker Pipelines**: Automate your ML workflows
+ **Advanced monitoring**: Set up custom metrics and alerts

For more examples and advanced configurations, visit the [SageMaker HyperPod GitHub repository](https://github.com/aws/amazon-sagemaker-examples).

# Running jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-run-jobs"></a>

The following topics provide procedures and examples of accessing compute nodes and running ML workloads on provisioned SageMaker HyperPod clusters orchestrated with Amazon EKS. Depending on how you have set up the environment on your HyperPod cluster, there are many ways to run ML workloads on HyperPod clusters.

**Note**  
When running jobs via the SageMaker HyperPod CLI or kubectl, HyperPod can track compute utilization (GPU/CPU hours) across namespaces (teams). These metrics power usage reports, which provide:  
+ Visibility into allocated vs. borrowed resource consumption
+ Team resource utilization for auditing (up to 180 days)
+ Cost attribution aligned with Task Governance policies
To use usage reports, you must install the usage report infrastructure. We strongly recommend configuring [Task Governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md) to enforce compute quotas and enable granular cost attribution.  
For more information about setting up and generating usage reports, see [Reporting Compute Usage in HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-usage-reporting.html).

**Tip**  
For a hands-on experience and guidance on how to set up and use a SageMaker HyperPod cluster orchestrated with Amazon EKS, we recommend taking this [Amazon EKS Support in SageMaker HyperPod](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e) workshop.

Data scientist users can train foundation models using the EKS cluster set as the orchestrator for the SageMaker HyperPod cluster. Scientists use the [SageMaker HyperPod CLI](https://github.com/aws/sagemaker-hyperpod-cli) and native `kubectl` commands to find available SageMaker HyperPod clusters, submit training jobs (Pods), and manage their workloads. The SageMaker HyperPod CLI enables job submission using a training job schema file, and provides capabilities for job listing, description, cancellation, and execution. Scientists can use the [Kubeflow Training Operator](https://www.kubeflow.org/docs/components/training/overview/) according to compute quotas managed by HyperPod, and [SageMaker AI-managed MLflow](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html) to manage ML experiments and training runs. 

**Topics**
+ [Installing the SageMaker HyperPod CLI](sagemaker-hyperpod-eks-run-jobs-access-nodes.md)
+ [SageMaker HyperPod CLI commands](sagemaker-hyperpod-eks-hyperpod-cli-reference.md)
+ [Running jobs using the SageMaker HyperPod CLI](sagemaker-hyperpod-eks-run-jobs-hyperpod-cli.md)
+ [Running jobs using `kubectl`](sagemaker-hyperpod-eks-run-jobs-kubectl.md)

# Installing the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-eks-run-jobs-access-nodes"></a>

SageMaker HyperPod provides the [SageMaker HyperPod command line interface](https://github.com/aws/sagemaker-hyperpod-cli) (CLI) package. 

1. Check if the version of Python on your local machine is between 3.8 and 3.11.

1. Check the prerequisites in the `README` markdown file in the [SageMaker HyperPod CLI](https://github.com/aws/sagemaker-hyperpod-cli) package.

1. Clone the SageMaker HyperPod CLI package from GitHub.

   ```
   git clone https://github.com/aws/sagemaker-hyperpod-cli.git
   ```

1. Install the SageMaker HyperPod CLI.

   ```
   cd sagemaker-hyperpod-cli && pip install .
   ```

1. Verify that the SageMaker HyperPod CLI installed successfully by running the following command. 

   ```
   hyperpod --help
   ```

**Note**  
If you are a data scientist and want to use the SageMaker HyperPod CLI, make sure that your IAM role is set up properly by your cluster admins following the instructions at [IAM users for scientists](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-user) and [Setting up Kubernetes role-based access control](sagemaker-hyperpod-eks-setup-rbac.md).

# SageMaker HyperPod CLI commands
<a name="sagemaker-hyperpod-eks-hyperpod-cli-reference"></a>

The following table summarizes the SageMaker HyperPod CLI commands.

**Note**  
For a complete CLI reference, see [README](https://github.com/aws/sagemaker-hyperpod-cli?tab=readme-ov-file#sagemaker-hyperpod-command-line-interface) in the [SageMaker HyperPod CLI GitHub repository](https://github.com/aws/sagemaker-hyperpod-cli).


| SageMaker HyperPod CLI command | Entity  | Description | 
| --- | --- | --- | 
| hyperpod get-clusters | cluster/access | Lists all clusters for which the user has IAM permissions to submit training workloads. Also gives a current snapshot of available instances that are not running any workloads or jobs, along with maximum capacity, grouped by health check status (for example, BurnInPassed) | 
| hyperpod connect-cluster | cluster/access | Configures kubectl to operate on the specified HyperPod cluster and namespace | 
| hyperpod start-job  | job | Submits a job to the targeted cluster. The job name must be unique at the namespace level. Users can override YAML spec values by passing them as CLI arguments | 
| hyperpod get-job | job | Display metadata of the submitted job | 
| hyperpod list-jobs | job | Lists all jobs in the connected cluster and namespace for which the user has IAM permissions to submit training workloads | 
| hyperpod cancel-job | job | Stops and deletes the job, releasing the underlying compute resources. A canceled job cannot be resumed; start a new job if needed | 
| hyperpod list-pods | pod | Lists all the pods in the given job in a namespace | 
| hyperpod get-log | pod | Retrieves the logs of a particular pod in a specified job | 
| hyperpod exec | pod | Runs a bash command in the shell of the specified pod(s) and prints the output | 
| hyperpod --help | utility | Lists all supported commands | 

# Running jobs using the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli"></a>

To run jobs, make sure that you installed Kubeflow Training Operator in the EKS clusters. For more information, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md).

Run the `hyperpod get-clusters` command to get the list of available HyperPod clusters.

```
hyperpod get-clusters
```

Run the `hyperpod connect-cluster` command to configure the SageMaker HyperPod CLI with the EKS cluster orchestrating the HyperPod cluster.

```
hyperpod connect-cluster --cluster-name <hyperpod-cluster-name>
```

Use the `hyperpod start-job` command to run a job. The following shows the command with its required options. 

```
hyperpod start-job \
    --job-name <job-name> \
    --image <docker-image-uri> \
    --entry-script <entrypoint-script> \
    --instance-type <ml.instance.type> \
    --node-count <integer>
```

The `hyperpod start-job` command also comes with various options such as job auto-resume and job scheduling.
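If you generate these invocations from code, a dict of options maps cleanly onto the flag list. A hedged sketch, not part of the CLI itself; the option names are the required ones shown above, and the values are placeholders:

```python
def build_start_job_argv(options):
    """Turn an options dict into a `hyperpod start-job` argument vector."""
    argv = ["hyperpod", "start-job"]
    for flag, value in options.items():
        argv += [f"--{flag}", str(value)]
    return argv

argv = build_start_job_argv({
    "job-name": "my-job",
    "image": "my-registry/my-image:latest",
    "entry-script": "train.py",
    "instance-type": "ml.p4d.24xlarge",
    "node-count": 2,
})
print(" ".join(argv))
```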

## Enabling job auto-resume
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli-enable-auto-resume"></a>

The `hyperpod start-job` command also has the following options to specify job auto-resume. For job auto-resume to work with the SageMaker HyperPod node resiliency features, you must set the value of the `restart-policy` option to `OnFailure`. The job must run under the `kubeflow` namespace or a namespace prefixed with `hyperpod`.
+ [--auto-resume <bool>] – Optional. Enables job auto-resume after failure. The default is `false`.
+ [--max-retry <int>] – Optional. If auto-resume is true, the default value of max-retry is 1 if not specified.
+ [--restart-policy <enum>] – Optional. The PyTorchJob restart policy. Available values are `Always`, `OnFailure`, `Never`, and `ExitCode`. The default value is `OnFailure`. 

```
hyperpod start-job \
    ... // required options \
    --auto-resume true \
    --max-retry 3 \
    --restart-policy OnFailure
```
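The constraints above (restart policy must be `OnFailure`, and the namespace must be `kubeflow` or prefixed with `hyperpod`) can be checked before submission. This is a minimal sketch of the rules as stated, not part of the CLI:

```python
def auto_resume_config_valid(namespace, restart_policy):
    """Check the stated preconditions for HyperPod job auto-resume."""
    namespace_ok = namespace == "kubeflow" or namespace.startswith("hyperpod")
    return namespace_ok and restart_policy == "OnFailure"

print(auto_resume_config_valid("kubeflow", "OnFailure"))         # True
print(auto_resume_config_valid("hyperpod-team-a", "OnFailure"))  # True
print(auto_resume_config_valid("default", "OnFailure"))          # False
```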

## Running jobs with scheduling options
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli-scheduling"></a>

The `hyperpod start-job` command has the following options to set up the job with queuing mechanisms. 

**Note**  
You need [Kueue](https://kueue.sigs.k8s.io/docs/overview/) installed in the EKS cluster. If you haven't installed it, follow the instructions in [Setup for SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md).
+ [--scheduler-type <enum>] – Optional. Specifies the scheduler type. The default is `Kueue`.
+ [--queue-name <string>] – Optional. Specifies the name of the [Local Queue](https://kueue.sigs.k8s.io/docs/concepts/local_queue/) or [Cluster Queue](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/) you want to submit with the job. The queue should be created by cluster admins using `CreateComputeQuota`.
+ [--priority <string>] – Optional. Specifies the name of the [Workload Priority Class](https://kueue.sigs.k8s.io/docs/concepts/workload_priority_class/), which should be created by cluster admins.

```
hyperpod start-job \
    ... // required options
    --scheduler-type Kueue \
    --queue-name high-priority-queue \
    --priority high
```

## Running jobs from a configuration file
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli-from-config"></a>

As an alternative, you can create a job configuration file containing all the parameters required by the job and then pass this configuration file to the `hyperpod start-job` command using the `--config-file` option. In this case:

1. Create your job configuration file with the required parameters. Refer to the job configuration file in the SageMaker HyperPod CLI GitHub repository for a [baseline configuration file](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-run-jobs-hyperpod-cli.html#sagemaker-hyperpod-eks-hyperpod-cli-from-config).

1. Start the job using the configuration file as follows.

   ```
   hyperpod start-job --config-file /path/to/test_job.yaml
   ```

**Tip**  
For a complete list of parameters of the `hyperpod start-job` command, see the [Submitting a Job](https://github.com/aws/sagemaker-hyperpod-cli?tab=readme-ov-file#submitting-a-job) section in the `README.md` of the SageMaker HyperPod CLI GitHub repository.

# Running jobs using `kubectl`
<a name="sagemaker-hyperpod-eks-run-jobs-kubectl"></a>

**Note**  
Training job auto resume requires Kubeflow Training Operator release version `1.7.0`, `1.8.0`, or `1.8.1`.

Note that you should install Kubeflow Training Operator in the clusters using a Helm chart. For more information, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md). Verify that the Kubeflow Training Operator control plane is properly set up by running the following command.

```
kubectl get pods -n kubeflow
```

This should return an output similar to the following.

```
NAME                                             READY   STATUS    RESTARTS   AGE
training-operator-658c68d697-46zmn               1/1     Running   0          90s
```

**To submit a training job**

To run a training job, prepare the job configuration file and run the [kubectl apply](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#apply) command as follows.

```
kubectl apply -f /path/to/training_job.yaml
```

**To describe a training job**

To retrieve the details of the job submitted to the EKS cluster, use the following command. It returns job information such as the job submission time, completion time, job status, and configuration details.

```
kubectl get pytorchjob training-job -n kubeflow -o yaml
```

**To stop a training job and delete EKS resources**

To stop a training job, use `kubectl delete`. The following is an example of stopping the training job created from the configuration file `training_job.yaml`.

```
kubectl delete -f /path/to/training_job.yaml 
```

This should return the following output.

```
pytorchjob.kubeflow.org "training-job" deleted
```

**To enable job auto-resume**

SageMaker HyperPod supports job auto-resume functionality for Kubernetes jobs, integrating with the Kubeflow Training Operator control plane.

Ensure that there are sufficient nodes in the cluster that have passed the SageMaker HyperPod health check. These nodes have the label `sagemaker.amazonaws.com/node-health-status` set to `Schedulable`. We recommend including a node selector in the job YAML file to select nodes with the appropriate configuration as follows.

```
sagemaker.amazonaws.com/node-health-status: Schedulable
```

The following code snippet is an example of how to modify a Kubeflow PyTorch job YAML configuration to enable the job auto-resume functionality. You need to add two annotations and set `restartPolicy` to `OnFailure` as follows.

```
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob 
metadata:
    name: pytorch-simple
    namespace: kubeflow
    annotations:  # config for job auto-resume
        sagemaker.amazonaws.com/enable-job-auto-resume: "true"
        sagemaker.amazonaws.com/job-max-retry-count: "2"
spec:
  pytorchReplicaSpecs:
  ......
  Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
          spec:
              nodeSelector:
                  sagemaker.amazonaws.com/node-health-status: Schedulable
```
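You can sanity-check a manifest for these two annotations before applying it. This sketch operates on the manifest already loaded as a Python dict (for example, via `yaml.safe_load`) and encodes only the requirements stated above:

```python
def auto_resume_enabled(manifest):
    """Return True if the manifest carries the auto-resume annotation
    and every replica spec uses restartPolicy: OnFailure."""
    annotations = manifest.get("metadata", {}).get("annotations", {})
    if annotations.get("sagemaker.amazonaws.com/enable-job-auto-resume") != "true":
        return False
    specs = manifest.get("spec", {}).get("pytorchReplicaSpecs", {})
    return all(
        spec.get("restartPolicy") == "OnFailure" for spec in specs.values()
    )

manifest = {
    "metadata": {"annotations": {
        "sagemaker.amazonaws.com/enable-job-auto-resume": "true",
        "sagemaker.amazonaws.com/job-max-retry-count": "2",
    }},
    "spec": {"pytorchReplicaSpecs": {"Worker": {"restartPolicy": "OnFailure"}}},
}
print(auto_resume_enabled(manifest))  # True
```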

**To check the job auto-resume status**

Run the following command to check the status of job auto-resume.

```
kubectl describe pytorchjob -n kubeflow <job-name>
```

Depending on the failure patterns, you might see two patterns of Kubeflow training job restart as follows.

**Pattern 1**:

```
Start Time:    2024-07-11T05:53:10Z
Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-0
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-1
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m59s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-0
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-1
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m58s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed.
```

**Pattern 2**: 

```
Events:
  Type    Reason                   Age    From                   Message
  ----    ------                   ----   ----                   -------
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-master-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-master-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-master-0
```
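When scripting checks against `kubectl describe` output, restart events can be counted from the Events table. A sketch over output shaped like the samples above:

```python
def count_restarts(describe_output):
    """Count PyTorchJobRestarting warnings in `kubectl describe` output."""
    return sum(
        1 for line in describe_output.splitlines()
        if "PyTorchJobRestarting" in line
    )

sample = """\
  Warning  PyTorchJobRestarting  7m59s  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
  Normal   SuccessfulCreatePod   7m58s  pytorchjob-controller  Created pod: pt-job-1-worker-0
  Warning  PyTorchJobRestarting  7m58s  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed.
"""
print(count_restarts(sample))  # 2
```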

# Using the HyperPod training operator
<a name="sagemaker-eks-operator"></a>

 The Amazon SageMaker HyperPod training operator helps you accelerate generative AI model development by efficiently managing distributed training across large GPU clusters. It introduces intelligent fault recovery, hang job detection, and process-level management capabilities that minimize training disruptions and reduce costs. Unlike traditional training infrastructure that requires complete job restarts when failures occur, this operator implements surgical process recovery to keep your training jobs running smoothly. 

 The operator also works with HyperPod's health monitoring and observability functions, providing real-time visibility into training execution and automatic monitoring of critical metrics like loss spikes and throughput degradation. You can define recovery policies through simple YAML configurations without code changes, allowing you to quickly respond to and recover from unrecoverable training states. These monitoring and recovery capabilities work together to maintain optimal training performance while minimizing operational overhead.

 While Kueue is not required for this training operator, your cluster administrator can install and configure it for enhanced job scheduling capabilities. For more information, see the [official documentation for Kueue](https://kueue.sigs.k8s.io/docs/overview/).

**Note**  
To use the training operator, you must use the latest [ HyperPod AMI release](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html). To upgrade, use the [ UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API operation. If you use [ HyperPod task governance](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html), it must also be the latest version.

## Supported versions
<a name="sagemaker-eks-operator-supported-versions"></a>

 The HyperPod training operator works only with specific versions of Kubernetes, Kueue, and HyperPod. See the following list of compatible versions. 
+ Supported Kubernetes versions – 1.28, 1.29, 1.30, 1.31, 1.32, and 1.33
+ Suggested Kueue versions – [v0.12.2](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.12.2) and [v0.12.3](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.12.3)
+ The latest HyperPod AMI release. To upgrade to the latest AMI release, use the [ UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API.
+ [PyTorch 2.4.0 – 2.7.1](https://github.com/pytorch/pytorch/releases)

**Note**  
We collect certain routine aggregated and anonymized operational metrics to provide essential service availability. The creation of these metrics is fully automated and does not involve human review of the underlying model training workload. These metrics relate to job operations, resource management, and essential service functionality.

# Installing the training operator
<a name="sagemaker-eks-operator-install"></a>

See the following sections to learn how to install the training operator.

## Prerequisites
<a name="sagemaker-eks-operator-prerequisites"></a>

 Before you use the HyperPod training operator, you must have completed the following prerequisites: 
+  [ Created a HyperPod cluster with Amazon EKS orchestration](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-create-cluster.html). 
+ Installed the latest AMI on your HyperPod cluster. For more information, see [SageMaker HyperPod AMI releases for Amazon EKS](sagemaker-hyperpod-release-ami-eks.md).
+ [Installed cert-manager](https://cert-manager.io/docs/installation/).
+  [ Set up the EKS Pod Identity Agent using the console](https://docs.aws.amazon.com/eks/latest/userguide/pod-id-agent-setup.html). If you want to use the AWS CLI, use the following command: 

  ```
  aws eks create-addon \
   --cluster-name my-eks-cluster \
   --addon-name eks-pod-identity-agent \
   --region aws-region
  ```
+ (Optional) If you run your HyperPod cluster nodes in a private VPC, you must set up AWS PrivateLink VPC endpoints for the Amazon SageMaker AI API (`com.amazonaws.aws-region.sagemaker.api`) and the Amazon EKS Auth service (`com.amazonaws.aws-region.eks-auth`). You must also make sure that your cluster nodes run in subnets with a security group that allows traffic to route through the VPC endpoints to communicate with SageMaker AI and Amazon EKS. If these aren't properly set up, the add-on installation can fail. To learn more about setting up VPC endpoints, see [Create a VPC endpoint](https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html#create-interface-endpoint-aws).

## Installing the training operator
<a name="sagemaker-eks-operator-install-operator"></a>

 You can now install the HyperPod training operator through the SageMaker AI console, the Amazon EKS console, or with the AWS CLI. The console methods offer simplified experiences that help you install the operator. The AWS CLI offers a programmatic approach that lets you customize more of your installation.

Between the two console experiences, SageMaker AI provides a one-click installation that creates the IAM execution role, creates the pod identity association, and installs the operator. The Amazon EKS console installation is similar, but it doesn't automatically create the IAM execution role. During this process, you can choose to create a new IAM execution role with information that the console pre-populates. By default, these created roles only have access to the current cluster that you're installing the operator in. Unless you edit the role's permissions to include other clusters, if you remove and reinstall the operator, you must create a new role. 

------
#### [ SageMaker AI console (recommended) ]

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the add-on named **Amazon SageMaker HyperPod training operator**, and choose **install**. During the installation process, SageMaker AI creates an IAM execution role with permissions similar to the [ AmazonSageMakerHyperPodTrainingOperatorAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerHyperPodTrainingOperatorAccess.html) managed policy and creates a pod identity association between your Amazon EKS cluster and your new execution role.

------
#### [ Amazon EKS console ]

**Note**  
If you install the add-on through the Amazon EKS console, first make sure that you've tagged your HyperPod cluster with the key-value pair `SageMaker:true`. Otherwise, the installation will fail.

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Go to your EKS cluster, choose **Add-ons**, then choose **Get more add-ons**.

1. Choose **Amazon SageMaker HyperPod training operator**, then choose **Next**.

1. Under **Version**, the console defaults to the latest version, which we recommend that you use.

1. Under **Add-on access**, choose a pod identity IAM role to use with the training operator add-on. If you don't already have a role, choose **Create recommended role** to create one.

1. During this role creation process, the IAM console pre-populates all of the necessary information, such as the use case, the [ AmazonSageMakerHyperPodTrainingOperatorAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerHyperPodTrainingOperatorAccess.html) managed policy and other required permissions, the role name, and the description. As you go through the steps, review the information, and choose **Create role**.

1. In the EKS console, review your add-on's settings, and then choose **Create**.

------
#### [ CLI ]

1. Make sure that the IAM execution role for your HyperPod cluster has a trust relationship that allows EKS Pod Identity to assume the role, or [create a new IAM role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html) with the following trust policy. Alternatively, you could use the Amazon EKS console to install the add-on, which creates a recommended role.

------
#### [ JSON ]


   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
         "Effect": "Allow",
         "Principal": {
           "Service": "pods.eks.amazonaws.com"
         },
         "Action": [
           "sts:AssumeRole",
           "sts:TagSession",
           "eks-auth:AssumeRoleForPodIdentity"
         ]
       }
     ]
   }
   ```

------

1.  Attach the [ AmazonSageMakerHyperPodTrainingOperatorAccess managed policy](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerHyperPodTrainingOperatorAccess.html) to your created role. 

1.  Then [create a pod identity association](https://docs.aws.amazon.com/eks/latest/userguide/pod-identities.html) between your EKS cluster and your IAM role.

   ```
   aws eks create-pod-identity-association \
   --cluster-name my-eks-cluster \
   --role-arn arn-of-your-execution-role \
   --namespace aws-hyperpod \
   --service-account hp-training-operator-controller-manager \
   --region aws-region
   ```

1.  After you finish the process, you can use the `list-pod-identity-associations` operation to see the association that you created. The following is a sample of what the response might look like. 

   ```
   aws eks list-pod-identity-associations --cluster-name my-eks-cluster
   {
       "associations": [{
           "clusterName": "my-eks-cluster",
           "namespace": "aws-hyperpod",
           "serviceAccount": "hp-training-operator-controller-manager",
           "associationArn": "arn:aws:eks:us-east-2:123456789012:podidentityassociation/my-hyperpod-cluster/a-1a2b3c4d5e6f7g8h9",
           "associationId": "a-1a2b3c4d5e6f7g8h9"
       }]
   }
   ```

1. To install the training operator, use the `create-addon` operation. The `--addon-version` parameter is optional. If you don’t provide one, the default is the latest version. To get the possible versions, use the [ DescribeAddonVersions](https://docs.aws.amazon.com/eks/latest/APIReference/API_DescribeAddonVersions.html) operation.

   ```
   aws eks create-addon \
     --cluster-name my-eks-cluster \
     --addon-name amazon-sagemaker-hyperpod-training-operator \
     --resolve-conflicts OVERWRITE
   ```

------

If you already have the training operator installed on your HyperPod cluster, you can update the EKS add-on to the version that you want. If you want to use [ checkpointless training](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless.html) or [ elastic training](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-elastic-training.html), consider the following:
+ Both checkpointless training and elastic training require the EKS add-on to be on version 1.2.0 or above.
+ The Amazon SageMaker HyperPod training operator maintains backwards compatibility for any EKS add-on version, so you can upgrade from any add-on version to 1.2.0 or above.
+ If you downgrade from versions 1.2.0 or above to a lower version, you must first delete the existing jobs before the downgrade and resubmit the jobs after the downgrade is complete.

------
#### [ Amazon EKS Console ]

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Go to your EKS cluster, and choose **Add-ons**. Then, choose the Amazon SageMaker HyperPod training operator add-on and choose **Edit**.

1. In the **Version** menu, choose the version of the add-on that you want, then choose **Save changes**.

------
#### [ CLI ]

1. First get the list of the supported versions of the add-on for your cluster.

   ```
   aws eks describe-addon-versions \
     --kubernetes-version $(aws eks describe-cluster --name my-eks-cluster --query 'cluster.version' --output text) \
     --addon-name amazon-sagemaker-hyperpod-training-operator \
     --query 'addons[0].addonVersions[].addonVersion' \
     --output table
   ```

1. Then update the add-on to the version that you want.

   ```
   aws eks update-addon \
     --cluster-name my-eks-cluster \
     --addon-name amazon-sagemaker-hyperpod-training-operator \
     --addon-version target-version \
     --resolve-conflicts OVERWRITE
   ```

------

 The training operator comes with a number of options with default values that might fit your use case. We recommend that you try the training operator with the default values before changing them. The following table describes each parameter and its default value.


| Parameter | Description | Default | 
| --- | --- | --- | 
| hpTrainingControllerManager.manager.resources.requests.cpu | How many processors to allocate for the controller | 1 | 
| hpTrainingControllerManager.manager.resources.requests.memory | How much memory to allocate to the controller | 2Gi | 
| hpTrainingControllerManager.manager.resources.limits.cpu | The CPU limit for the controller | 2 | 
| hpTrainingControllerManager.manager.resources.limits.memory | The memory limit for the controller | 4Gi | 
| hpTrainingControllerManager.nodeSelector | Node selector for the controller pods | Default behavior is to select nodes with the label sagemaker.amazonaws.com/compute-type: "HyperPod" | 

## HyperPod elastic agent
<a name="sagemaker-eks-operator-elastic-agent"></a>

The HyperPod elastic agent is an extension of [PyTorch’s ElasticAgent](https://docs.pytorch.org/docs/stable/elastic/agent.html). It orchestrates the lifecycles of training workers on each container and communicates with the HyperPod training operator. To use the HyperPod training operator, you must first install the HyperPod elastic agent into your training image before you can submit and run jobs with the operator. The following is a sample Dockerfile that installs the HyperPod elastic agent, along with an entrypoint script that uses `hyperpodrun` to launch the job.

**Note**  
Both [ checkpointless training](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless.html) and [ elastic training](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-elastic-training.html) require that you use HyperPod elastic agent version 1.1.0 or above.

```
RUN pip install hyperpod-elastic-agent

ENTRYPOINT ["entrypoint.sh"]
# entrypoint.sh
...
hyperpodrun --nnodes=node_count --nproc-per-node=proc_count \
            --rdzv-backend hyperpod \ # Optional
            --inprocess-restart \ # Optional (in-process fault recovery with checkpointless training)
            ... # Other torchrun args
            # pre-train arg_group
            --pre-train-script pre.sh --pre-train-args "pre_1 pre_2 pre_3" \
            # post-train arg_group
            --post-train-script post.sh --post-train-args "post_1 post_2 post_3" \
            training.py --script-args
```

You can now submit jobs with `kubectl`.

### HyperPod elastic agent arguments
<a name="sagemaker-eks-operator-elastic-agent-args"></a>

 The HyperPod elastic agent supports all of the original arguments and adds some additional ones. The following table lists the arguments available in the HyperPod elastic agent. For more information about PyTorch's elastic agent, see the [official documentation](https://docs.pytorch.org/docs/stable/elastic/agent.html). 


| Argument | Description | Default Value | 
| --- | --- | --- | 
| --shutdown-signal | Signal to send to workers for shutdown (SIGTERM or SIGKILL) | "SIGKILL" | 
| --shutdown-timeout | Timeout in seconds between shutdown-signal and SIGKILL signals | 15 | 
| --server-host | Agent server address | "0.0.0.0" | 
| --server-port | Agent server port | 8080 | 
| --server-log-level | Agent server log level | "info" | 
| --server-shutdown-timeout | Server shutdown timeout in seconds | 300 | 
| --pre-train-script | Path to pre-training script | None | 
| --pre-train-args | Arguments for pre-training script | None | 
| --post-train-script | Path to post-training script | None | 
| --post-train-args | Arguments for post-training script | None | 
| --inprocess-restart | Flag specifying whether to use the in-process restart feature | FALSE | 
| --inprocess-timeout | Time in seconds that the agent waits for workers to reach a synchronization barrier before triggering a process-level restart. | None | 

## Task governance (optional)
<a name="sagemaker-eks-operator-task-governance"></a>

The training operator is integrated with [ HyperPod task governance](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance), a robust management system designed to streamline resource allocation and ensure efficient utilization of compute resources across teams and projects for your Amazon EKS clusters. To set up HyperPod task governance, see [Setup for SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md). 

**Note**  
When installing the HyperPod task governance add-on, you must use version v1.3.0-eksbuild.1 or higher.

When submitting a job, make sure that you include your queue name and priority class labels, in the form `hyperpod-ns-team-name-localqueue` and `priority-class-name-priority`. For example, if you're using Kueue, your labels become the following:
+ kueue.x-k8s.io/queue-name: hyperpod-ns-*team-name*-localqueue
+ kueue.x-k8s.io/priority-class: *priority-class-name*-priority

The following is an example of what your configuration file might look like:

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPytorchJob
metadata:
  name: hp-task-governance-sample
  namespace: hyperpod-ns-team-name
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
    kueue.x-k8s.io/priority-class: priority-class-priority
spec:
  nprocPerNode: "1"
  runPolicy:
    cleanPodPolicy: "None"
  replicaSpecs: 
    - name: pods
      replicas: 4
      spares: 2
      template:
        spec:
          containers:
            - name: ptjob
              image: XXXX
              imagePullPolicy: Always
              ports:
                - containerPort: 8080
              resources:
                requests:
                  cpu: "2"
```

Then use the following kubectl command to apply the YAML file.

```
kubectl apply -f task-governance-job.yaml
```

## Kueue (optional)
<a name="sagemaker-eks-operator-kueue"></a>

While you can run jobs directly, your organization can also integrate the training operator with Kueue to allocate resources and schedule jobs. Follow the steps below to install Kueue into your HyperPod cluster.

1. Follow the installation guide in the [ official Kueue documentation](https://kueue.sigs.k8s.io/docs/installation/#install-a-custom-configured-released-version). When you reach the step of configuring `controller_manager_config.yaml`, add the following configuration:

   ```
   externalFrameworks:
   - "HyperPodPytorchJob.v1.sagemaker.amazonaws.com"
   ```

1. Follow the rest of the steps in the official installation guide. After you finish installing Kueue, you can create some sample queues with the `kubectl apply -f sample-queues.yaml` command. Use the following YAML file.

   ```
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ClusterQueue
   metadata:
     name: cluster-queue
   spec:
     namespaceSelector: {}
     preemption:
       withinClusterQueue: LowerPriority
     resourceGroups:
     - coveredResources:
       - cpu
       - nvidia.com/gpu
       - pods
       flavors:
       - name: default-flavor
         resources:
         - name: cpu
           nominalQuota: 16
         - name: nvidia.com/gpu
           nominalQuota: 16
         - name: pods
           nominalQuota: 16
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: LocalQueue
   metadata:
     name: user-queue
     namespace: default
   spec:
     clusterQueue: cluster-queue
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ResourceFlavor
   metadata:
     name: default-flavor
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   description: High priority
   kind: WorkloadPriorityClass
   metadata:
     name: high-priority-class
   value: 1000
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   description: Low Priority
   kind: WorkloadPriorityClass
   metadata:
     name: low-priority-class
   value: 500
   ```

# Using the training operator to run jobs
<a name="sagemaker-eks-operator-usage"></a>

 To run a job with kubectl, create a `job.yaml` file that specifies the job configuration, and run `kubectl apply -f job.yaml` to submit the job. In this YAML file, you can specify custom configurations in the `logMonitoringConfiguration` argument to define automated monitoring rules that analyze log output from the distributed training job to detect problems and recover. 

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    app.kubernetes.io/name: HyperPod
    app.kubernetes.io/managed-by: kustomize
  name: &jobname xxx
  annotations:
    XXX: XXX
    ......
spec:
  nprocPerNode: "X"
  replicaSpecs:
    - name: 'XXX'
      replicas: 16
      template:
        spec:
          nodeSelector:
            beta.kubernetes.io/instance-type: ml.p5.48xlarge
          containers:
            - name: XXX
              image: XXX
              imagePullPolicy: Always
              ports:
                - containerPort: 8080 # This is the port that HyperPodElasticAgent listens to
              resources:
                limits:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                requests:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                  memory: 32000Mi
          ......        
  runPolicy:
    jobMaxRetryCount: 50
    restartPolicy:
      numRestartBeforeFullJobRestart: 3 
      evalPeriodSeconds: 21600 
      maxFullJobRestarts: 1
    cleanPodPolicy: "All"
    logMonitoringConfiguration: 
      - name: "JobStart"
        logPattern: ".*Experiment configuration.*" # This is the start of the training script
        expectedStartCutOffInSeconds: 120 # Expected match in the first 2 minutes
      - name: "JobHangingDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'training_loss_step': (\\d+(\\.\\d+)?).*"
        expectedRecurringFrequencyInSeconds: 300 # If the next batch is not printed within 5 minutes, or if the loss value is not decimal (e.g. nan), consider the job hanging.
        expectedStartCutOffInSeconds: 600 # Allow 10 minutes of job startup time
      - name: "NoS3CheckpointingDetection"
        logPattern: ".*The checkpoint is finalized. All shards is written.*"
        expectedRecurringFrequencyInSeconds: 600 # If the next checkpoint upload to S3 doesn't happen within 10 minutes, consider the job hanging.
        expectedStartCutOffInSeconds: 1800 # Allow 30 minutes for first checkpoint upload
      - name: "LowThroughputDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'samples\\/sec': (\\d+(\\.\\d+)?).*"
        metricThreshold: 80 # 80 samples/sec
        operator: "lteq"
        metricEvaluationDataPoints: 25 # if throughput lower than threshold for 25 datapoints, kill the job
```

If you want to use the log monitoring options, make sure that you’re emitting the training logs to `sys.stdout`. The HyperPod elastic agent monitors training logs in `sys.stdout` and saves them at `/tmp/hyperpod/`. You can use the following command to emit training logs.

```
import logging
import sys

logging.basicConfig(format="%(asctime)s [%(levelname)s] %(name)s: %(message)s", level=logging.INFO, stream=sys.stdout)
```
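After logs reach `sys.stdout`, the monitor applies your `logPattern` rules to them. The following sketch shows how the `JobHangingDetection` pattern from the sample job above extracts the loss value from a hypothetical log line; note that the doubled backslashes in the YAML become single backslashes in the raw regular expression.

```
import re

# The JobHangingDetection logPattern from the sample configuration, with the
# YAML escaping (\\d, \\.) collapsed to a raw Python regex.
pattern = re.compile(r".*\[Epoch 0 Batch \d+.*'training_loss_step': (\d+(\.\d+)?).*")

# Hypothetical training log line in the shape that the pattern expects.
line = "[Epoch 0 Batch 12] {'training_loss_step': 0.8471}"

match = pattern.match(line)
loss = float(match.group(1))  # the first capturing group holds the metric value
```

Testing your patterns against sample log lines like this before submitting a job helps you avoid rules that never match and therefore immediately evaluate to HANGING.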

 The following table describes all of the possible log monitoring configurations: 


| Parameter | Usage | 
| --- | --- | 
| jobMaxRetryCount | Maximum number of restarts at the process level. | 
| restartPolicy: numRestartBeforeFullJobRestart | Maximum number of restarts at the process level before the operator restarts at the job level. | 
| restartPolicy: evalPeriodSeconds | The period, in seconds, over which the restart limit is evaluated. | 
| restartPolicy: maxFullJobRestarts | Maximum number of full job restarts before the job fails. | 
| cleanPodPolicy | Specifies the pods that the operator should clean. Accepted values are All, OnlyComplete, and None. | 
| logMonitoringConfiguration | The log monitoring rules for slow and hanging job detection | 
| expectedRecurringFrequencyInSeconds | Time interval between two consecutive LogPattern matches after which the rule evaluates to HANGING. If not specified, no time constraint exists between consecutive LogPattern matches. | 
| expectedStartCutOffInSeconds | Time to first LogPattern match after which the rule evaluates to HANGING. If not specified, no time constraint exists for the first LogPattern match. | 
| logPattern | Regular expression that identifies log lines that the rule applies to when the rule is active | 
| metricEvaluationDataPoints | Number of consecutive times a rule must evaluate to SLOW before marking a job as SLOW. If not specified, the default is 1. | 
| metricThreshold | Threshold for value extracted by LogPattern with a capturing group. If not specified, metric evaluation is not performed. | 
| operator | The inequality to apply to the monitoring configuration. Accepted values are gt, gteq, lt, lteq, and eq. | 
| stopPattern | Regular expression that identifies the log line at which to deactivate the rule. If not specified, the rule is always active. | 
| faultOnMatch | Indicates whether a match of LogPattern should immediately trigger a job fault. When true, the job will be marked as faulted as soon as the LogPattern is matched, regardless of other rule parameters. When false or not specified, the rule will evaluate to SLOW or HANGING based on other parameters. | 
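To illustrate how `logPattern`, `metricThreshold`, `operator`, and `metricEvaluationDataPoints` combine, the following is a simplified sketch of the evaluation semantics described in the table. It is an illustration only, not the training operator's actual implementation, and the log lines are hypothetical.

```
import re

# Map each accepted operator value to its comparison.
OPERATORS = {
    "gt": lambda v, t: v > t,
    "gteq": lambda v, t: v >= t,
    "lt": lambda v, t: v < t,
    "lteq": lambda v, t: v <= t,
    "eq": lambda v, t: v == t,
}

def evaluate_rule(lines, log_pattern, metric_threshold, operator,
                  metric_evaluation_data_points=1):
    """Return True (SLOW) if the threshold condition holds for the required
    number of consecutive matching data points."""
    compare = OPERATORS[operator]
    pattern = re.compile(log_pattern)
    consecutive = 0
    for line in lines:
        match = pattern.match(line)
        if not match:
            continue  # non-matching lines don't count as data points
        value = float(match.group(1))  # value from the first capturing group
        consecutive = consecutive + 1 if compare(value, metric_threshold) else 0
        if consecutive >= metric_evaluation_data_points:
            return True
    return False

# Hypothetical throughput lines: below the 80 samples/sec threshold 3 times in a row.
lines = [
    "step 1 'samples/sec': 95.0",
    "step 2 'samples/sec': 70.0",
    "step 3 'samples/sec': 65.0",
    "step 4 'samples/sec': 60.0",
]
slow = evaluate_rule(lines, r".*'samples/sec': (\d+(\.\d+)?).*",
                     metric_threshold=80, operator="lteq",
                     metric_evaluation_data_points=3)
```

In this sketch the first data point (95.0) resets the counter, and the next three consecutive points at or below the threshold mark the job as SLOW.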

 For more training resiliency, specify spare node configuration details. If your job fails, the operator works with Kueue to use nodes reserved in advance to continue running the job. Spare node configurations require Kueue, so if you try to submit a job with spare nodes but don’t have Kueue installed, the job will fail. The following example is a sample `job.yaml` file that contains spare node configurations.

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue # Specify the queue to run the job.
  name: hyperpodpytorchjob-sample
spec:
  nprocPerNode: "1"
  runPolicy:
    cleanPodPolicy: "None"
  replicaSpecs: 
    - name: pods
      replicas: 1
      spares: 1 # Specify how many spare nodes to reserve.
      template:
        spec:
          containers:
            - name: XXX
              image: XXX
              
              imagePullPolicy: Always
              ports:
                - containerPort: 8080
              resources:
                requests:
                  nvidia.com/gpu: "0"
                limits:
                  nvidia.com/gpu: "0"
```

## Monitoring
<a name="sagemaker-eks-operator-usage-monitoring"></a>

Amazon SageMaker HyperPod is integrated with [ observability with Amazon Managed Grafana and Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-observability-addon.html), so you can set up monitoring to collect and feed metrics into these observability tools.

Alternatively, you can scrape metrics through Amazon Managed Service for Prometheus without managed observability. To do so, include the metrics that you want to monitor in your `job.yaml` file when you run jobs with `kubectl`.

```
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hyperpod-training-operator
  namespace: aws-hyperpod
spec:
  ......
  endpoints:
    - port: 8081
      path: /metrics
      interval: 15s
```

The following are metrics that the training operator emits, which you can feed into Amazon Managed Service for Prometheus to monitor your training jobs.


| Event | Description | 
| --- | --- | 
| hyperpod_training_operator_jobs_created_total | Total number of jobs that the training operator has run | 
| hyperpod_training_operator_jobs_restart_latency | Current job restart latency | 
| hyperpod_training_operator_jobs_fault_detection_latency | Fault detection latency | 
| hyperpod_training_operator_jobs_deleted_total | Total number of deleted jobs | 
| hyperpod_training_operator_jobs_successful_total | Total number of completed jobs | 
| hyperpod_training_operator_jobs_failed_total | Total number of failed jobs | 
| hyperpod_training_operator_jobs_restarted_total | Total number of auto-restarted jobs | 

## Sample docker configuration
<a name="sagemaker-eks-operator-usage-docker"></a>

The following is a sample entrypoint script that launches training with the `hyperpodrun` command.

```
export AGENT_CMD="--backend=nccl"
exec hyperpodrun --server-host=${AGENT_HOST} --server-port=${AGENT_PORT} \
    --tee=3 --log_dir=/tmp/hyperpod \
    --nnodes=${NNODES} --nproc-per-node=${NPROC_PER_NODE} \
    --pre-train-script=/workspace/echo.sh --pre-train-args='Pre-training script' \
    --post-train-script=/workspace/echo.sh --post-train-args='Post-training script' \
    /workspace/mnist.py --epochs=1000 ${AGENT_CMD}
```

## Sample log monitoring configurations
<a name="sagemaker-eks-operator-usage-log-monitoring"></a>

**Job hang detection**

To detect hanging jobs, use the following configuration. It uses these parameters:
+ `expectedStartCutOffInSeconds` – how long the monitor should wait before expecting the first logs
+ `expectedRecurringFrequencyInSeconds` – the time interval to wait for the next batch of logs

With these settings, the log monitor expects to see a log line matching the regex pattern `.*Train Epoch.*` within 60 seconds after the training job starts. After the first appearance, the monitor expects to see matching log lines every 10 seconds. If the first logs don't appear within 60 seconds or subsequent logs don't appear every 10 seconds, the HyperPod elastic agent treats the container as stuck and coordinates with the training operator to restart the job.

```
runPolicy:
    jobMaxRetryCount: 10
    cleanPodPolicy: "None"
    logMonitoringConfiguration:
      - name: "JobStartGracePeriod"
        # Sample log line: [default0]:2025-06-17 05:51:29,300 [INFO] __main__: Train Epoch: 5 [0/60000 (0%)]       loss=0.8470
        logPattern: ".*Train Epoch.*"  
        expectedStartCutOffInSeconds: 60 
      - name: "JobHangingDetection"
        logPattern: ".*Train Epoch.*"
        expectedRecurringFrequencyInSeconds: 10 # if the next batch is not printed within 10 seconds
```
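The timing semantics of these two parameters can be sketched as follows. This is a simplified illustration of the rule, not the operator's implementation, and the match timestamps are hypothetical.

```
# Simplified sketch of the hang-detection timing semantics: the first match
# must arrive within expectedStartCutOffInSeconds of job start, and each
# later match within expectedRecurringFrequencyInSeconds of the previous one.
def is_hanging(match_times, start_cutoff, recurring_frequency, now):
    if not match_times:
        return now > start_cutoff        # no first log line in time
    if match_times[0] > start_cutoff:
        return True                      # first log line arrived too late
    gaps = [b - a for a, b in zip(match_times, match_times[1:])]
    gaps.append(now - match_times[-1])   # silence since the most recent match
    return any(gap > recurring_frequency for gap in gaps)

# Matches (in seconds after job start) arriving every ~8 seconds: healthy.
healthy = is_hanging([30, 38, 46, 54], start_cutoff=60,
                     recurring_frequency=10, now=60)
# A 25-second silence after the last match: treated as hanging.
hanging = is_hanging([30, 38], start_cutoff=60,
                     recurring_frequency=10, now=63)
```

When the rule evaluates to HANGING, the elastic agent coordinates with the training operator to restart the job, as described above.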

**Training loss spike**

The following monitoring configuration matches training logs that follow the pattern `xxx training_loss_step xx`. It uses the `metricEvaluationDataPoints` parameter, which lets you specify a threshold of consecutive data points before the operator restarts the job. If the training loss value is greater than 2.0 for five consecutive data points, the operator restarts the job.

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "LossSpikeDetection"
      logPattern: ".*training_loss_step (\\d+(?:\\.\\d+)?).*"   # training_loss_step 5.0
      metricThreshold: 2.0
      operator: "gt"
      metricEvaluationDataPoints: 5 # if loss higher than threshold for 5 data points, restart the job
```

**Low TFLOPs detection**

The following monitoring configuration matches training logs that contain the pattern `xx TFLOPs xx`, expected every five seconds. If the TFLOPs value is less than 100 for 5 consecutive data points, the operator restarts the training job.

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "TFLOPs"
      logPattern: ".* (.+)TFLOPs.*"    # Training model, speed: X TFLOPs...
      expectedRecurringFrequencyInSeconds: 5        
      metricThreshold: 100       # if Tflops is less than 100 for 5 data points, restart the job       
      operator: "lt"
      metricEvaluationDataPoints: 5
```

**Training script error log detection**

The following monitoring configuration detects whether the pattern specified in `logPattern` is present in the training logs. As soon as the training operator encounters the error pattern, it treats the match as a fault and restarts the job.

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "GPU Error"
      logPattern: ".*RuntimeError.*out of memory.*"
      faultOnMatch: true
```

# Troubleshooting
<a name="sagemaker-eks-operator-troubleshooting"></a>

See the following sections to learn how to troubleshoot errors when using the training operator.

## I can't install the training operator
<a name="sagemaker-eks-operator-troubleshooting-installation-error"></a>

If you can't install the training operator, make sure that you're using the [supported versions of components](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html#sagemaker-eks-operator-supported-versions). For example, if you get an error that your HyperPod AMI release is incompatible with the training operator, [update to the latest version](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html).

## Incompatible HyperPod task governance version
<a name="sagemaker-eks-operator-troubleshooting-task-governance-version"></a>

During installation, you might get an error message that the version of HyperPod task governance is incompatible. The training operator works only with version v1.3.0-eksbuild.1 or higher. Update your HyperPod task governance add-on and try again. 

## Missing permissions
<a name="sagemaker-eks-operator-troubleshooting-task-missing-permissions"></a>

 While you're setting up the training operator or running jobs, you might receive errors that you're not authorized to run certain operations, such as `DescribeClusterNode`. To resolve these errors, make sure you correctly set up IAM permissions while you're [setting up the Amazon EKS Pod Identity Agent](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-install.html#sagemaker-eks-operator-install-pod-identity).

# Using elastic training in Amazon SageMaker HyperPod
<a name="sagemaker-eks-elastic-training"></a>

 Elastic training is a new Amazon SageMaker HyperPod capability that automatically scales training jobs based on compute resource availability and workload priority. Elastic training jobs can start with minimum compute resources required for model training and dynamically scale up or down through automatic checkpointing and resumption across different node configurations (world size). Scaling is achieved by automatically adjusting the number of data-parallel replicas. During high cluster utilization periods, elastic training jobs can be configured to automatically scale down in response to resource requests from higher-priority jobs, freeing up compute for critical workloads. When resources free up during off-peak periods, elastic training jobs automatically scale back up to accelerate training, then scale back down when higher-priority workloads need resources again. 

Elastic training is built on top of the HyperPod training operator and integrates the following components:
+ [Amazon EKS for Kubernetes orchestration](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks.html)
+ [Amazon SageMaker HyperPod Task Governance](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html) for job queuing, prioritization, and scheduling
+ [PyTorch Distributed Checkpoint (DCP)](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) for scalable state and checkpoint management

**Supported frameworks**
+ PyTorch with Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP)
+ PyTorch Distributed Checkpoint (DCP)

## Prerequisites
<a name="sagemaker-eks-elastic-prereqs"></a>

### SageMaker HyperPod EKS Cluster
<a name="sagemaker-eks-elastic-hyperpod-cluster"></a>

You must have a running SageMaker HyperPod cluster with Amazon EKS orchestration. For information on creating a HyperPod EKS cluster, see:
+ [Getting started with Amazon EKS in SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-create-cluster.html)

### SageMaker HyperPod Training Operator
<a name="sagemaker-eks-elastic-training-operator"></a>

Elastic training is supported in training operator v1.2 and later.

To install the training operator as an EKS add-on, see [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-install.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-install.html).

### (Recommended) Install and configure Task Governance and Kueue
<a name="sagemaker-eks-elastic-task-governance"></a>

We recommend installing and configuring Kueue via [HyperPod Task Governance](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html) to specify workload priorities with elastic training. Kueue provides stronger workload management with queuing, prioritization, gang scheduling, resource tracking, and graceful preemption, which are essential for operating in multi-tenant training environments.
+ Gang scheduling ensures that all required pods of a training job start together. This prevents situations where some pods start while others remain pending, which could cause wasted resources.
+ Gentle preemption allows lower-priority elastic jobs to yield resources to higher-priority workloads. Elastic jobs can scale down gracefully without being forcibly evicted, improving overall cluster stability.

We recommend configuring the following Kueue components:
+ PriorityClasses to define relative job importance
+ ClusterQueues to manage global resource sharing and quotas across teams or workloads
+ LocalQueues to route jobs from individual namespaces into the appropriate ClusterQueue

For more advanced setups, you can also incorporate:
+ Fair-share policies to balance resource usage across multiple teams
+ Custom preemption rules to enforce organizational SLAs or cost controls

Please refer to:
+ [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html)
+ [Kueue Documentation](https://kueue.sigs.k8s.io/)

### (Recommended) Set up user namespaces and resource quotas
<a name="sagemaker-eks-elastic-namespaces-quotas"></a>

When deploying this feature on Amazon EKS, we recommend applying a set of foundational cluster-level configurations to ensure isolation, resource fairness, and operational consistency across teams.

#### Namespace and Access Configuration
<a name="sagemaker-eks-elastic-namespace-access"></a>

Organize your workloads using separate namespaces for each team or project. This allows you to apply fine-grained isolation and governance. We also recommend configuring AWS IAM to Kubernetes RBAC mapping to associate individual IAM users or roles with their corresponding namespaces.

Key practices include:
+ Map IAM roles to Kubernetes service accounts using IAM Roles for Service Accounts (IRSA) when workloads need AWS permissions. [https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html](https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html)
+ Apply RBAC policies to restrict users to only their designated namespaces (e.g., `Role`/`RoleBinding` rather than cluster-wide permissions).

#### Resource and Compute Constraints
<a name="sagemaker-eks-elastic-resource-constraints"></a>

To prevent resource contention and ensure fair scheduling across teams, apply quotas and limits at the namespace level:
+ ResourceQuotas to cap aggregate CPU, memory, storage, and object counts (pods, PVCs, services, etc.).
+ LimitRanges to enforce default and maximum per-pod or per-container CPU and memory limits.
+ PodDisruptionBudgets (PDBs) as needed to define resiliency expectations.
+ Optional: Namespace-level queueing constraints (e.g., via Task Governance or Kueue) to prevent users from over-submitting jobs.

These constraints help maintain cluster stability and support predictable scheduling for distributed training workloads.

#### Auto-scaling
<a name="sagemaker-eks-elastic-autoscaling"></a>

SageMaker HyperPod on EKS supports cluster autoscaling through Karpenter. When Karpenter or a similar resource provisioner is used together with elastic training, both the cluster and the elastic training job may scale up automatically after an elastic training job is submitted. This happens because the elastic training operator takes a greedy approach: as part of elastic job execution, it continuously requests more than the currently available compute resources, up to the maximum limit set by the job, which can trigger node provisioning. Continuous resource provisioners such as Karpenter serve these requests by scaling up the compute cluster.

To keep these scale-ups predictable and under control, we recommend configuring namespace-level ResourceQuotas in the namespaces where elastic training jobs are created. ResourceQuotas help limit the maximum resources that jobs can request, preventing unbounded cluster growth while still allowing elastic behavior within defined limits.

For example, a ResourceQuota for 8 ml.p5.48xlarge instances will have the following form:

```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: <quota-name>
  namespace: <namespace-name>
spec:
  hard:
    nvidia.com/gpu: "64"
    vpc.amazonaws.com/efa: "256"
    requests.cpu: "1536"
    requests.memory: "5120Gi"
    limits.cpu: "1536"
    limits.memory: "5120Gi"
```
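As a quick sanity check, the hard limits above scale linearly with the instance count. The per-instance figures in this sketch are inferred by dividing each example limit by 8; treat them as illustrative values, not official instance specifications.

```python
# Derive ResourceQuota hard limits for N instances from per-instance figures.
# The per-instance values below are inferred from the example quota (each hard
# limit divided by 8); they are illustrative, not official instance specs.
PER_INSTANCE = {
    "nvidia.com/gpu": 8,
    "vpc.amazonaws.com/efa": 32,
    "requests.cpu": 192,
    "requests.memory_gi": 640,
}

def quota_for(instance_count):
    """Scale every per-instance resource by the number of instances."""
    return {resource: amount * instance_count
            for resource, amount in PER_INSTANCE.items()}

print(quota_for(8))
# {'nvidia.com/gpu': 64, 'vpc.amazonaws.com/efa': 256, 'requests.cpu': 1536, 'requests.memory_gi': 5120}
```

Adjusting the quota for a different instance count is then a matter of changing one multiplier rather than editing each limit by hand.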

## Build Training Container
<a name="sagemaker-eks-elastic-build-container"></a>

The HyperPod training operator works with a custom PyTorch launcher provided via the HyperPod Elastic Agent Python package ([https://www.piwheels.org/project/hyperpod-elastic-agent/](https://www.piwheels.org/project/hyperpod-elastic-agent/)). You must install the elastic agent and replace the `torchrun` command with `hyperpodrun` to launch training. For more details, see:

[https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-install.html#sagemaker-eks-operator-elastic-agent](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-install.html#sagemaker-eks-operator-elastic-agent)

An example training container:

```
FROM ...

...

RUN pip install hyperpod-elastic-agent
ENTRYPOINT ["entrypoint.sh"]

# entrypoint.sh ...
hyperpodrun --nnodes=node_count --nproc-per-node=proc_count \
  --rdzv-backend hyperpod \
 # Optional ...
 # Other torchrun args
 # pre-train arg_group
 --pre-train-script pre.sh --pre-train-args "pre_1 pre_2 pre_3" \
 # post-train arg_group
 --post-train-script post.sh --post-train-args "post_1 post_2 post_3" \
 training.py --script-args
```

## Training code modification
<a name="sagemaker-eks-elastic-training-code"></a>

SageMaker HyperPod provides a set of recipes that are already configured to run with an elastic policy.

To enable elastic training for custom PyTorch training scripts, you need to make minor modifications to your training loop. This guide walks you through the changes required to ensure your training job responds to elastic scaling events that occur when compute resource availability changes. During an elastic event (for example, when nodes become available or nodes get preempted), the training job receives an elastic event signal. The signal is used to coordinate a graceful shutdown by saving a checkpoint; training then resumes by restarting from that saved checkpoint with a new world configuration. To enable elastic training with custom training scripts, you need to:

### Detect Elastic Scaling Events
<a name="sagemaker-eks-elastic-detect-events"></a>

In your training loop, check for elastic events during each iteration:

```
from hyperpod_elastic_agent.elastic_event_handler import elastic_event_detected

def train_epoch(model, dataloader, optimizer, args):
    for batch_idx, batch_data in enumerate(dataloader):
        # Forward and backward pass
        loss = model(batch_data).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Handle checkpointing and elastic scaling
        should_checkpoint = (batch_idx + 1) % args.checkpoint_freq == 0
        elastic_event = elastic_event_detected()
        
        # Save checkpoint if scaling-up or scaling down job
        if should_checkpoint or elastic_event:
            save_checkpoint(model, optimizer, scheduler, 
                            checkpoint_dir=args.checkpoint_dir, 
                            step=global_step)
              
            if elastic_event:
                print("Elastic scaling event detected. Checkpoint saved.")
                return
```

### Implement Checkpoint Saving and Checkpoint Loading
<a name="sagemaker-eks-elastic-checkpoint-implementation"></a>

**Note**  
We recommend using PyTorch Distributed Checkpoint (DCP) for saving model and optimizer states, because DCP supports resuming from a checkpoint with different world sizes. Other checkpointing formats may not support loading checkpoints at a different world size, in which case you'll need to implement custom logic to handle dynamic world size changes.

```
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

def save_checkpoint(model, optimizer, lr_scheduler, user_content, checkpoint_path):
    """Save checkpoint using DCP for elastic training."""
    state_dict = {
        "model": model,
        "optimizer": optimizer,
        "lr_scheduler": lr_scheduler,
        **user_content
    }
      
    dcp.save(
        state_dict=state_dict,
        storage_writer=dcp.FileSystemWriter(checkpoint_path)
    )

def load_checkpoint(model, optimizer, lr_scheduler, checkpoint_path):
    """Load checkpoint using DCP with automatic resharding."""
    state_dict = {
        "model": model,
        "optimizer": optimizer,
        "lr_scheduler": lr_scheduler
    }
      
    dcp.load(
        state_dict=state_dict,
        storage_reader=dcp.FileSystemReader(checkpoint_path)
    )
      
    return model, optimizer, lr_scheduler
```

### (Optional) Use stateful dataloaders
<a name="sagemaker-eks-elastic-stateful-dataloaders"></a>

If you're only training for a single epoch (that is, one pass through the entire dataset), the model must see each data sample exactly once. If the training job stops mid-epoch and resumes with a different world size, previously processed data samples will be repeated if the dataloader state is not persisted. A stateful dataloader prevents this by saving and restoring the dataloader's position, ensuring that resumed runs continue from the elastic scaling event without reprocessing any samples. We recommend using [StatefulDataLoader](https://meta-pytorch.org/data/main/torchdata.stateful_dataloader.html), which is a drop‑in replacement for `torch.utils.data.DataLoader` that adds `state_dict()` and `load_state_dict()` methods, enabling mid‑epoch checkpointing of the data loading process.
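To illustrate why dataloader state matters, the following is a minimal pure-Python sketch of the idea; it is not the torchdata API. `TinyStatefulLoader` is a hypothetical class, and the real `StatefulDataLoader` also restores sampler and shuffle RNG state.

```python
# Minimal pure-Python sketch of the stateful-dataloader idea (not the torchdata
# API): the loader records how many samples it has yielded, so a run resumed
# from a checkpoint continues mid-epoch instead of replaying earlier samples.
class TinyStatefulLoader:
    def __init__(self, dataset):
        self.dataset = dataset
        self.position = 0  # number of samples already yielded this epoch

    def __iter__(self):
        while self.position < len(self.dataset):
            sample = self.dataset[self.position]
            self.position += 1
            yield sample

    def state_dict(self):
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]

data = list(range(6))
loader = TinyStatefulLoader(data)
it = iter(loader)
first_half = [next(it) for _ in range(3)]   # [0, 1, 2]
saved = loader.state_dict()                 # checkpointed alongside the model

resumed = TinyStatefulLoader(data)          # fresh process after restart
resumed.load_state_dict(saved)
second_half = list(iter(resumed))           # [3, 4, 5] — no repeated samples
print(first_half, second_half)
```

In a real training script, the loader's `state_dict()` would be saved and restored together with the model and optimizer state in the checkpoint.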

## Submitting elastic training jobs
<a name="sagemaker-eks-elastic-submit-job"></a>

The [HyperPod training operator](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-usage.html) defines a new resource type, `hyperpodpytorchjob`. Elastic training extends this resource type and adds the following fields:

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  name: elastic-training-job
spec:
  elasticPolicy:
    minReplicas: 1
    maxReplicas: 4
    # Increase the pod count in fixed-size steps
    # The pod count equals minReplicas + N * replicaIncrementStep
    replicaIncrementStep: 1
    # ...or provide the exact pod counts required for training
    replicaDiscreteValues: [2,4,8]

    # How long the training operator waits for the job to save a checkpoint
    # and exit during scaling events. The job is force-stopped after this period
    gracefulShutdownTimeoutInSeconds: 600

    # When a scaling event is detected, how long the job controller waits
    # before initiating a scale-up. A delay can prevent frequent
    # scale-ups and scale-downs
    scalingTimeoutInSeconds: 60

    # If a fault occurs, how long elastic training waits for
    # recovery before triggering a scale-down
    faultyScaleDownTimeoutInSeconds: 30
  ...
  replicaSpecs:
    - name: pods
      replicas: 4           # Initial replica count
      maxReplicas: 8        # Max for this replica spec (should match elasticPolicy.maxReplicas)
      ...
```
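The interaction between `minReplicas`, `maxReplicas`, `replicaIncrementStep`, and `replicaDiscreteValues` can be sketched as follows. This is a hypothetical helper, not operator code; in particular, filtering out-of-range discrete values is an assumption here.

```python
# Sketch of how the elasticPolicy fields constrain the allowed replica counts.
# A job uses either replicaIncrementStep (min + N * step, capped at max) or an
# explicit replicaDiscreteValues list — the two are alternatives.
def allowed_replica_counts(min_replicas, max_replicas,
                           increment_step=None, discrete_values=None):
    if discrete_values is not None:
        # Assumed behavior: keep only values within the [min, max] range.
        return sorted(v for v in discrete_values
                      if min_replicas <= v <= max_replicas)
    counts = []
    n = min_replicas
    while n <= max_replicas:
        counts.append(n)
        n += increment_step
    return counts

print(allowed_replica_counts(1, 4, increment_step=1))           # [1, 2, 3, 4]
print(allowed_replica_counts(1, 8, discrete_values=[2, 4, 8]))  # [2, 4, 8]
```

The first call mirrors the example manifest above (`minReplicas: 1`, `maxReplicas: 4`, `replicaIncrementStep: 1`); the second shows the discrete-values alternative.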

### Using kubectl
<a name="sagemaker-eks-elastic-kubectl-apply"></a>

You can then launch the elastic training job with the following command.

```
kubectl apply -f elastic-training-job.yaml
```

### Using SageMaker Recipes
<a name="sagemaker-eks-elastic-sagemaker-recipes"></a>

Elastic training jobs can be launched through [SageMaker HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes).

**Note**  
We have included **46** elastic recipes for **SFT** and **DPO** jobs in HyperPod recipes. You can launch these jobs with a one-line change on top of an existing static launcher script:  
`++recipes.elastic_policy.is_elastic=true`

In addition to the static recipe fields, elastic recipes add the following fields to define elastic behavior:

#### Elastic Policy
<a name="sagemaker-eks-elastic-policy"></a>

The `elastic_policy` field defines the job-level configuration for the elastic training job. It has the following settings:
+ `is_elastic` : `bool` - whether this job is an elastic job
+ `min_nodes` : `int` - the minimum number of nodes used for elastic training
+ `max_nodes` : `int` - the maximum number of nodes used for elastic training
+ `replica_increment_step` : `int` - the increment of pods in fixed-size groups; this field is mutually exclusive with the `scale_config` field defined later
+ `use_graceful_shutdown` : `bool` - whether to use graceful shutdown during scaling events; defaults to `true`
+ `scaling_timeout` : `int` - the waiting time in seconds during a scaling event before timeout
+ `graceful_shutdown_timeout` : `int` - the waiting time in seconds for graceful shutdown

The following is a sample definition of this field; you can also find it in the HyperPod recipes repository in the recipe `recipes_collection/recipes/fine-tuning/llama/llmft_llama3_1_8b_instruct_seq4k_gpu_sft_lora.yaml`:

```
<static recipe>
...
elastic_policy:
  is_elastic: true
  min_nodes: 1
  max_nodes: 16
  use_graceful_shutdown: true
  scaling_timeout: 600
  graceful_shutdown_timeout: 600
```

#### Scale Config
<a name="sagemaker-eks-elastic-scale-config"></a>

The `scale_config` field defines overriding configurations at each specific scale. It's a key-value dictionary, where each key is an integer representing a target scale and each value is a subset of the base recipe. At scale `<key>`, the `<value>` updates the corresponding configurations in the base (static) recipe. The following shows an example of this field:

```
scale_config:   
...
  2:
    trainer:
      num_nodes: 2
    training_config:
      training_args:
        train_batch_size: 128
        micro_train_batch_size: 8
        learning_rate: 0.0004
  3:
    trainer:
      num_nodes: 3
    training_config:
      training_args:
        train_batch_size: 128
        learning_rate: 0.0004
        uneven_batch:
          use_uneven_batch: true
          num_dp_groups_with_small_batch_size: 16
          small_local_batch_size: 5
          large_local_batch_size: 6
 ...
```

The above configuration defines the training configuration at scales 2 and 3. In both cases, we use a learning rate of `4e-4` and a batch size of `128`. But at scale 2 we use a `micro_train_batch_size` of 8, while at scale 3 we use an uneven batch size because the train batch size cannot be evenly divided across 3 nodes.

**Uneven Batch Size**

This field defines the batch distribution behavior when the global batch size cannot be evenly divided by the number of ranks. It's not specific to elastic training, but it enables finer scaling granularity.
+ `use_uneven_batch` : `bool` - whether to use uneven batch distribution
+ `num_dp_groups_with_small_batch_size` : `int` - in uneven batch distribution, some ranks use a smaller local batch size while the others use a larger one. The global batch size should equal `small_local_batch_size * num_dp_groups_with_small_batch_size + (world_size - num_dp_groups_with_small_batch_size) * large_local_batch_size`
+ `small_local_batch_size` : `int` - the smaller local batch size
+ `large_local_batch_size` : `int` - the larger local batch size
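The uneven-batch identity above can be checked with a few lines of arithmetic. The `world_size=24` figure (3 nodes × 8 data-parallel ranks) is an assumption for illustration; the recipe snippet does not state the per-node rank count.

```python
# Check the uneven-batch identity from the field descriptions above:
# global batch = small * num_small_groups + (world_size - num_small_groups) * large
def global_batch_size(world_size, num_small, small, large):
    return small * num_small + (world_size - num_small) * large

# Values from the scale-3 example, assuming 8 data-parallel ranks per node
# (world_size = 24 — an assumption, not stated in the recipe snippet):
print(global_batch_size(world_size=24, num_small=16, small=5, large=6))  # 128
```

With these values, 16 ranks process 5 samples each and the remaining 8 ranks process 6 each, which recovers the configured `train_batch_size` of 128.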

**Monitor training on MLFlow**

HyperPod recipe jobs support observability through MLflow. You can specify MLflow configurations in the recipe:

```
training_config:
  mlflow:
    tracking_uri: "<local_file_path or MLflow server URL>"
    run_id: "<MLflow run ID>"
    experiment_name: "<MLflow experiment name, e.g. llama_exps>"
    run_name: "<run name, e.g. llama3.1_8b>"
```

These configurations map to the corresponding [MLflow setup](https://mlflow.org/docs/latest/ml/tracking/tracking-api/#setup--configuration). The following is a sample MLflow dashboard for an elastic training job.

![\[The following is a sample MLflow dashboard for an elastic training job.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-elastic-sample-dashboard.png)


After defining the elastic recipe, you can use a launcher script, such as `launcher_scripts/llama/run_llmft_llama3_1_8b_instruct_seq4k_gpu_sft_lora.sh`, to launch an elastic training job. This is similar to launching a static job using a HyperPod recipe.

**Note**  
Elastic training jobs launched from recipes support automatic resumption from the latest checkpoint. However, by default, every restart creates a new training directory. To resume from the last checkpoint correctly, make sure the same training directory is reused. You can do this by setting  
`recipes.training_config.training_args.override_training_dir=true`

## Use-case examples and limitations
<a name="sagemaker-eks-elastic-use-cases"></a>

### Scale-up when more resources are available
<a name="sagemaker-eks-elastic-scale-up"></a>

When more resources become available on the cluster (for example, when other workloads complete), the training controller automatically scales up the training job. This behavior is explained below.

To simulate a situation where more resources become available, you can submit a high-priority job and then release the resources by deleting it.

```
# Submit a high-priority job on your cluster. As a result of this command
# resources will not be available for elastic training
kubectl apply -f high_priority_job.yaml

# Submit an elastic job with normal priority
kubectl apply -f hyperpod_job_with_elasticity.yaml

# Wait for training to start....

# Delete high priority job. This command will make additional resources available for
# elastic training
kubectl delete -f high_priority_job.yaml

# Observe the scale-up of elastic job
```

Expected behavior:
+ **The training operator creates a Kueue Workload.** When an elastic training job requests a change in world size, the training operator generates an additional Kueue Workload object representing the new resource requirements.
+ **Kueue admits the Workload.** Kueue evaluates the request based on available resources, priorities, and queue policies. Once approved, the Workload is admitted.
+ **The training operator creates the additional pods.** Upon admission, the operator launches the additional pods required to reach the new world size.
+ **The training operator signals the training script.** When the new pods become ready, the training operator sends a special elastic event signal to the training script.
+ **The training job performs checkpointing to prepare for a graceful shutdown.** The training process periodically checks for the elastic event signal by calling the `elastic_event_detected()` function. Once detected, it initiates a checkpoint. After the checkpoint completes successfully, the training process exits cleanly.
+ **The training operator restarts the job with the new world size.** The operator waits for all processes to exit, then restarts the training job using the updated world size and the latest checkpoint.

**Note:** When Kueue is not used, the training operator skips the first two steps. It immediately attempts to create the additional pods required for the new world size. If sufficient resources are not available in the cluster, these pods will remain in a **Pending** state until capacity becomes available.

![\[The diagram illustrates the resizing and resource timeline.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-elastic-resize-timeline.png)


### Preemption by high priority job
<a name="sagemaker-eks-elastic-preemption"></a>

Elastic jobs can be scaled down automatically when a high-priority job needs resources. To simulate this behavior, submit an elastic training job that uses the maximum number of available resources from the start of training, then submit a high-priority job and observe the preemption behavior.

```
# Submit an elastic job with normal priority
kubectl apply -f hyperpod_job_with_elasticity.yaml

# Submit a high-priority job on your cluster. As a result of this command
# some resources will be reclaimed from the elastic job
kubectl apply -f high_priority_job.yaml

# Observe the scale-down behavior
```

When a high-priority job needs resources, Kueue can preempt lower-priority elastic training Workloads (there can be more than one Workload object associated with an elastic training job). The preemption process follows this sequence:

1. **A high-priority job is submitted.** The job creates a new Kueue Workload, but the Workload cannot be admitted due to insufficient cluster resources.

1. **Kueue preempts one of the elastic training job's Workloads.** Elastic jobs may have multiple active Workloads (one per world-size configuration). Kueue selects one to preempt based on priority and queue policies.

1. **The training operator sends an elastic event signal.** Once preemption is triggered, the training operator notifies the running training process to stop gracefully.

1. **The training process performs checkpointing.** The training job periodically checks for elastic event signals. When detected, it begins a coordinated checkpoint to preserve progress before shutting down.

1. **The training operator cleans up pods and Workloads.** The operator waits for checkpoint completion, then deletes the training pods that were part of the preempted Workload. It also removes the corresponding Workload object from Kueue.

1. **The high-priority workload is admitted.** With resources freed, Kueue admits the high-priority job, allowing it to start execution.  
![\[Preemption timeline for elastic training workloads.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-elastic-preemption-timeline.png)

Preemption can cause the entire training job to pause, which may not be desirable for all workflows. To avoid full-job suspension while still allowing elastic scaling, you can configure two different priority levels within the same training job by defining two `replicaSpec` sections:
+ A primary (fixed) replicaSpec with normal or high priority
  + Contains the minimum required number of replicas needed to keep the training job running.
  + Uses a higher PriorityClass, ensuring these replicas are *never* preempted.
  + Maintains baseline progress even when the cluster is under resource pressure.
+ An elastic (scalable) replicaSpec with lower priority
  + Contains the additional optional replicas that provide extra compute during elastic scaling.
  + Uses a lower PriorityClass, allowing Kueue to preempt these replicas when higher-priority jobs need resources.
  + Ensures only the elastic portion is reclaimed, while the core training continues uninterrupted.

This configuration enables partial preemption, where only the elastic capacity is reclaimed—maintaining training continuity while still supporting fair resource sharing in multi-tenant environments. Example:

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  name: elastic-training-job
spec:
  elasticPolicy:
    minReplicas: 2
    maxReplicas: 8
    replicaIncrementStep: 2
  ...
  replicaSpecs:
    - name: base
      replicas: 2
      template:
        spec:
          priorityClassName: high-priority # set high-priority to avoid evictions
           ...
    - name: elastic
      replicas: 0
      maxReplicas: 6
      template:
        spec:
          priorityClassName: low-priority # Set low-priority for the elastic part
           ...
```

### Handling pod eviction, pod crashes, and hardware degradation
<a name="sagemaker-eks-elastic-pod-eviction"></a>

The HyperPod training operator includes built-in mechanisms to recover the training process when it is unexpectedly interrupted. Interruptions can occur for various reasons, such as training code failures, pod evictions, node failures, hardware degradation, and other runtime issues.

When this happens, the operator automatically attempts to recreate the affected pods and resume training from the latest checkpoint. If recovery is not immediately possible, for example due to insufficient spare capacity, the operator can continue making progress by temporarily reducing the world size and scaling down the elastic training job.

When an elastic training job crashes or loses replicas, the system behaves as follows:
+ **Recovery phase (using spare nodes).** The training controller waits up to `faultyScaleDownTimeoutInSeconds` for resources to become available and attempts to recover the failed replicas by redeploying pods on spare capacity.
+ **Elastic scale-down.** If recovery is not possible within the timeout window, the training operator scales the job down to a smaller world size (if the job's elastic policy permits it). Training then resumes with fewer replicas.
+ **Elastic scale-up.** When additional resources become available again, the operator automatically scales the training job back up to the preferred world size.

This mechanism ensures that training can continue with minimal downtime, even under resource pressure or partial infrastructure failures, while still taking advantage of elastic scaling.

### Use elastic training with other HyperPod features
<a name="sagemaker-eks-elastic-other-features"></a>

Elastic training does not currently support checkpointless training capabilities, HyperPod managed tiered checkpointing, or Spot instances.

**Note**  
We collect certain routine aggregated and anonymized operational metrics to provide essential service availability. The creation of these metrics is fully automated and does not involve human review of the underlying model training workload. These metrics relate to a job and scaling operations, resource management, and essential service functionality.

# Observability for Amazon SageMaker HyperPod cluster orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-cluster-observability"></a>

To achieve comprehensive observability into your Amazon SageMaker HyperPod (SageMaker HyperPod) cluster resources and software components, integrate the cluster with [Amazon CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html), [Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html), and [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html). These tools provide visibility into cluster health, performance metrics, and resource utilization.

The integration with Amazon Managed Service for Prometheus enables the export of metrics related to your HyperPod cluster resources, providing insights into their performance, utilization, and health. The integration with Amazon Managed Grafana enables the visualization of these metrics through various Grafana dashboards that offer an intuitive interface for monitoring and analyzing the cluster's behavior. By leveraging these services, you gain a centralized and unified view of your HyperPod cluster, facilitating proactive monitoring, troubleshooting, and optimization of your distributed training workloads.

**Note**  
While CloudWatch, Amazon Managed Service for Prometheus, and Amazon Managed Grafana focus on operational metrics (for example, system health, training job performance), SageMaker HyperPod Usage Reports complement [Task Governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md) to provide financial and resource accountability insights. These reports track:  
+ Compute utilization (GPU/CPU/Neuron Core hours) across namespaces and teams
+ Cost attribution for allocated vs. borrowed resources
+ Historical trends (up to 180 days) for auditing and optimization
For more information about setting up and generating usage reports, see [Reporting Compute Usage in HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-usage-reporting.html). 

**Tip**  
To find practical examples and solutions, see also the [Observability](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e/en-US/06-observability) section in the [Amazon EKS Support in SageMaker HyperPod workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e).

Proceed to the following topics to set up SageMaker HyperPod cluster observability.

**Topics**
+ [Model observability for training jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-cluster-observability-model.md)
+ [Cluster and task observability](sagemaker-hyperpod-eks-cluster-observability-cluster.md)

# Model observability for training jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-cluster-observability-model"></a>

SageMaker HyperPod clusters orchestrated with Amazon EKS can integrate with the [MLflow application on Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html). Cluster admins set up the MLflow tracking server and connect it with their SageMaker HyperPod clusters. Data scientists can then use MLflow to track experiments and gain insights into model training.

**To set up an MLflow server using the AWS CLI**

A cluster admin must create an MLflow tracking server.

1. Create a SageMaker AI MLflow tracking server, following the instructions at [Create a tracking server using the AWS CLI](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-create-tracking-server-cli.html#mlflow-create-tracking-server-cli-infra-setup).

1. Make sure that the [`AssumeRoleForPodIdentity`](https://docs.aws.amazon.com/eks/latest/APIReference/API_auth_AssumeRoleForPodIdentity.html) permission exists in the IAM execution role for SageMaker HyperPod.

1. If the `eks-pod-identity-agent` add-on is not already installed on your EKS cluster, install it.

   ```
   aws eks create-addon \
       --cluster-name <eks_cluster_name> \
       --addon-name eks-pod-identity-agent \
       --addon-version vx.y.z-eksbuild.1
   ```

1. Create a `trust-relationship.json` file for a new IAM role that allows Pods to call MLflow APIs.

   ```
   cat >trust-relationship.json <<EOF
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
               "Effect": "Allow",
               "Principal": {
                   "Service": "pods.eks.amazonaws.com"
               },
               "Action": [
                   "sts:AssumeRole",
                   "sts:TagSession"
               ]
           }
       ]
   }
   EOF
   ```

   Run the following code to create the role and attach the trust relationship.

   ```
   aws iam create-role --role-name hyperpod-mlflow-role \
       --assume-role-policy-document file://trust-relationship.json \
       --description "allow pods to emit mlflow metrics and put data in s3"
   ```

1. Create the following policy that grants Pods access to call all `sagemaker-mlflow` operations and to put model artifacts in Amazon S3. The tracking server already has S3 permissions, but if the model artifacts are too large, the MLflow code calls Amazon S3 directly to upload them.

   ```
   cat >hyperpod-mlflow-policy.json <<EOF
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker-mlflow:AccessUI",
                   "sagemaker-mlflow:CreateExperiment",
                   "sagemaker-mlflow:SearchExperiments",
                   "sagemaker-mlflow:GetExperiment",
                   "sagemaker-mlflow:GetExperimentByName",
                   "sagemaker-mlflow:DeleteExperiment",
                   "sagemaker-mlflow:RestoreExperiment",
                   "sagemaker-mlflow:UpdateExperiment",
                   "sagemaker-mlflow:CreateRun",
                   "sagemaker-mlflow:DeleteRun",
                   "sagemaker-mlflow:RestoreRun",
                   "sagemaker-mlflow:GetRun",
                   "sagemaker-mlflow:LogMetric",
                   "sagemaker-mlflow:LogBatch",
                   "sagemaker-mlflow:LogModel",
                   "sagemaker-mlflow:LogInputs",
                   "sagemaker-mlflow:SetExperimentTag",
                   "sagemaker-mlflow:SetTag",
                   "sagemaker-mlflow:DeleteTag",
                   "sagemaker-mlflow:LogParam",
                   "sagemaker-mlflow:GetMetricHistory",
                   "sagemaker-mlflow:SearchRuns",
                   "sagemaker-mlflow:ListArtifacts",
                   "sagemaker-mlflow:UpdateRun",
                   "sagemaker-mlflow:CreateRegisteredModel",
                   "sagemaker-mlflow:GetRegisteredModel",
                   "sagemaker-mlflow:RenameRegisteredModel",
                   "sagemaker-mlflow:UpdateRegisteredModel",
                   "sagemaker-mlflow:DeleteRegisteredModel",
                   "sagemaker-mlflow:GetLatestModelVersions",
                   "sagemaker-mlflow:CreateModelVersion",
                   "sagemaker-mlflow:GetModelVersion",
                   "sagemaker-mlflow:UpdateModelVersion",
                   "sagemaker-mlflow:DeleteModelVersion",
                   "sagemaker-mlflow:SearchModelVersions",
                   "sagemaker-mlflow:GetDownloadURIForModelVersionArtifacts",
                   "sagemaker-mlflow:TransitionModelVersionStage",
                   "sagemaker-mlflow:SearchRegisteredModels",
                   "sagemaker-mlflow:SetRegisteredModelTag",
                   "sagemaker-mlflow:DeleteRegisteredModelTag",
                   "sagemaker-mlflow:DeleteModelVersionTag",
                   "sagemaker-mlflow:DeleteRegisteredModelAlias",
                   "sagemaker-mlflow:SetRegisteredModelAlias",
                   "sagemaker-mlflow:GetModelVersionByAlias"
               ],
               "Resource": "arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/<ml tracking server name>"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "s3:PutObject"
               ],
               "Resource": "arn:aws:s3:::<mlflow-s3-bucket_name>/*"
           }
       ]
   }
   EOF
   ```
**Note**  
The ARNs are those of the MLflow tracking server and the S3 bucket that you set up when you created the server by following the instructions in [Set up MLflow infrastructure](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-create-tracking-server-cli.html#mlflow-create-tracking-server-cli-infra-setup).

1. Attach the `mlflow-metrics-emit-policy` policy to the `hyperpod-mlflow-role` using the policy document saved in the previous step.

   ```
   aws iam put-role-policy \
     --role-name hyperpod-mlflow-role \
     --policy-name mlflow-metrics-emit-policy \
     --policy-document file://hyperpod-mlflow-policy.json
   ```

1. Create a Kubernetes service account that Pods use to access the MLflow server.

   ```
   cat >mlflow-service-account.yaml <<EOF
   apiVersion: v1
   kind: ServiceAccount
   metadata:
     name: mlflow-service-account
     namespace: kubeflow
   EOF
   ```

   Run the following command to apply the service account to the EKS cluster.

   ```
   kubectl apply -f mlflow-service-account.yaml
   ```

1. Create a Pod identity association.

   ```
   aws eks create-pod-identity-association \
       --cluster-name <eks_cluster_name> \
       --role-arn arn:aws:iam::111122223333:role/hyperpod-mlflow-role \
       --namespace kubeflow \
       --service-account mlflow-service-account
   ```

**To collect metrics from training jobs to the MLflow server**

Data scientists need to set up the training script and Docker image to emit metrics to the MLflow server.

1. Add the following lines at the beginning of your training script.

   ```
   import os
   import mlflow
   
   # Set the Tracking Server URI using the ARN of the Tracking Server you created
   mlflow.set_tracking_uri(os.environ['MLFLOW_TRACKING_ARN'])
   # Enable autologging in MLflow
   mlflow.autolog()
   ```

1. Build a Docker image with the training script and push to Amazon ECR. Get the ARN of the ECR container. For more information about building and pushing a Docker image, see [Pushing a Docker image](https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html) in the *ECR User Guide*.
**Tip**  
Make sure that you install the `mlflow` and `sagemaker-mlflow` packages in your Dockerfile. To learn more about the installation of the packages, requirements, and compatible versions, see [Install MLflow and the SageMaker AI MLflow plugin](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-track-experiments.html#mlflow-track-experiments-install-plugin).

1. Add a service account to the training job Pods to give them access to `hyperpod-mlflow-role`. This allows the Pods to call MLflow APIs. Use the following SageMaker HyperPod CLI job submission template, and save it with the file name `mlflow-test.yaml`.

   ```
   defaults:
    - override hydra/job_logging: stdout
   
   hydra:
    run:
     dir: .
    output_subdir: null
   
   training_cfg:
    entry_script: ./train.py
    script_args: []
    run:
     name: test-job-with-mlflow # Current run name
     nodes: 2 # Number of nodes to use for current training
     # ntasks_per_node: 1 # Number of devices to use per node
   cluster:
    cluster_type: k8s # currently k8s only
    instance_type: ml.c5.2xlarge
    cluster_config:
     # name of service account associated with the namespace
     service_account_name: mlflow-service-account
     # persistent volume, usually used to mount FSx
     persistent_volume_claims: null
     namespace: kubeflow
     # required node affinity to select nodes with SageMaker HyperPod
     # labels and passed health check if burn-in enabled
     label_selector:
         required:
             sagemaker.amazonaws.com/node-health-status:
                 - Schedulable
         preferred:
             sagemaker.amazonaws.com/deep-health-check-status:
                 - Passed
         weights:
             - 100
     pullPolicy: IfNotPresent # policy to pull container, can be Always, IfNotPresent and Never
     restartPolicy: OnFailure # restart policy
   
   base_results_dir: ./result # Location to store the results, checkpoints and logs.
   container: 111122223333.dkr.ecr.us-west-2.amazonaws.com/tag # container to use
   
   env_vars:
    NCCL_DEBUG: INFO # Logging level for NCCL. Set to "INFO" for debug information
    MLFLOW_TRACKING_ARN: arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/tracking-server-name
   ```

1. Start the job using the YAML file as follows.

   ```
   hyperpod start-job --config-file /path/to/mlflow-test.yaml
   ```

1. Generate a pre-signed URL for the MLflow tracking server. You can open the link in your browser and start tracking your training job.

   ```
   aws sagemaker create-presigned-mlflow-tracking-server-url \
       --tracking-server-name "tracking-server-name" \
       --session-expiration-duration-in-seconds 1800 \
       --expires-in-seconds 300 \
       --region region
   ```
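
The two duration parameters control different things: `--expires-in-seconds` controls how long the presigned URL itself remains usable before it must be regenerated, while `--session-expiration-duration-in-seconds` controls how long the signed-in session lasts after the URL is opened. A small sketch of the resulting expiry times, using the hypothetical values from the command above:

```python
from datetime import datetime, timedelta, timezone

URL_TTL_SECONDS = 300        # --expires-in-seconds: open the URL within 5 minutes
SESSION_TTL_SECONDS = 1800   # --session-expiration-duration-in-seconds: 30-minute session

# Hypothetical issue time, for illustration.
issued_at = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
url_expires_at = issued_at + timedelta(seconds=URL_TTL_SECONDS)
session_expires_at = issued_at + timedelta(seconds=SESSION_TTL_SECONDS)
```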

# Cluster and task observability
<a name="sagemaker-hyperpod-eks-cluster-observability-cluster"></a>

There are two options for monitoring SageMaker HyperPod clusters:

**The SageMaker HyperPod observability add-on**—SageMaker HyperPod provides a comprehensive, out-of-the-box dashboard that gives you insights into foundation model (FM) development tasks and cluster resources. This unified observability solution automatically publishes key metrics to Amazon Managed Service for Prometheus and displays them in Amazon Managed Grafana dashboards. The dashboards are optimized specifically for FM development with deep coverage of hardware health, resource utilization, and task-level performance. With this add-on, you can consolidate health and performance data from NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter, integrated file systems, Kubernetes APIs, Kueue, and SageMaker HyperPod task operators.

**Amazon CloudWatch Container Insights**—Amazon CloudWatch Container Insights collects metrics for compute resources, such as CPU, memory, disk, and network. Container Insights also provides diagnostic information, such as container restart failures, to help you isolate issues and resolve them quickly. You can also set CloudWatch alarms on metrics that Container Insights collects.

**Topics**
+ [Amazon SageMaker HyperPod observability with Amazon Managed Grafana and Amazon Managed Service for Prometheus](sagemaker-hyperpod-observability-addon.md)
+ [Observability with Amazon CloudWatch](sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci.md)

# Amazon SageMaker HyperPod observability with Amazon Managed Grafana and Amazon Managed Service for Prometheus
<a name="sagemaker-hyperpod-observability-addon"></a>

Amazon SageMaker HyperPod (SageMaker HyperPod) provides a comprehensive, out-of-the-box dashboard that gives you insights into foundation model (FM) development tasks and cluster resources. This unified observability solution automatically publishes key metrics to Amazon Managed Service for Prometheus and displays them in Amazon Managed Grafana dashboards. The dashboards are optimized specifically for FM development with deep coverage of hardware health, resource utilization, and task-level performance. With this add-on, you can consolidate health and performance data from NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter, integrated file systems, Kubernetes APIs, Kueue, and SageMaker HyperPod task operators.

## Restricted Instance Group (RIG) support
<a name="hyperpod-observability-addon-rig-support"></a>

The observability add-on also supports clusters that contain Restricted Instance Groups. In RIG clusters, the add-on automatically adapts its deployment strategy to comply with the network isolation and security constraints of restricted nodes. DaemonSet components (node exporter, DCGM exporter, EFA exporter, Neuron monitor, and node collector) run on both standard and restricted nodes. Deployment components (central collector, Kube State Metrics, and Training Metrics Agent) are scheduled with boundary-aware logic to respect network isolation between instance groups. Container log collection with Fluent Bit is not available on restricted nodes.

For information about setting up the add-on on clusters with Restricted Instance Groups, see [Setting up the SageMaker HyperPod observability add-on](hyperpod-observability-addon-setup.md).

**Topics**
+ [Restricted Instance Group (RIG) support](#hyperpod-observability-addon-rig-support)
+ [Setting up the SageMaker HyperPod observability add-on](hyperpod-observability-addon-setup.md)
+ [Amazon SageMaker HyperPod observability dashboards](hyperpod-observability-addon-viewing-dashboards.md)
+ [Exploring SageMaker HyperPod cluster metrics in Amazon Managed Grafana](hyperpod-observability-addon-exploring-metrics.md)
+ [Customizing SageMaker HyperPod cluster metrics dashboards and alerts](hyperpod-observability-addon-customizing.md)
+ [Creating custom SageMaker HyperPod cluster metrics](hyperpod-observability-addon-custom-metrics.md)
+ [SageMaker HyperPod cluster metrics](hyperpod-observability-cluster-metrics.md)
+ [Preconfigured alerts](hyperpod-observability-addon-alerts.md)
+ [Troubleshooting the Amazon SageMaker HyperPod observability add-on](hyperpod-observability-addon-troubleshooting.md)

# Setting up the SageMaker HyperPod observability add-on
<a name="hyperpod-observability-addon-setup"></a>

The following list describes the prerequisites for setting up the observability add-on.

To have metrics for your Amazon SageMaker HyperPod (SageMaker HyperPod) cluster sent to an Amazon Managed Service for Prometheus workspace and to optionally view them in Amazon Managed Grafana, first complete the following prerequisites.
+ To use Amazon Managed Grafana, enable AWS IAM Identity Center (IAM Identity Center) in an AWS Region where Amazon Managed Grafana is available. For instructions, see [Getting started with IAM Identity Center](https://docs.aws.amazon.com/singlesignon/latest/userguide/getting-started.html) in the *AWS IAM Identity Center User Guide*. For a list of AWS Regions where Amazon Managed Grafana is available, see [Supported Regions](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html#AMG-supported-Regions) in the *Amazon Managed Grafana User Guide*.
+ Create at least one user in IAM Identity Center.
+ Ensure that the [Amazon EKS Pod Identity Agent](https://docs.aws.amazon.com/eks/latest/userguide/workloads-add-ons-available-eks.html#add-ons-pod-id) add-on is installed in your Amazon EKS cluster. The Amazon EKS Pod Identity Agent add-on makes it possible for the SageMaker HyperPod observability add-on to get the credentials to interact with Amazon Managed Service for Prometheus and CloudWatch Logs. To check whether your Amazon EKS cluster has the add-on, go to the Amazon EKS console, and check your cluster's **Add-ons** tab. For information about how to install the add-on if it's not installed, see [Create add-on (AWS Management Console)](https://docs.aws.amazon.com/eks/latest/userguide/creating-an-add-on.html#_create_add_on_console) in the *Amazon EKS User Guide*.
**Note**  
The Amazon EKS Pod Identity Agent is required for standard instance groups. For Restricted Instance Groups (RIG), the Pod Identity Agent is not available due to network isolation constraints. The cluster's instance group execution IAM role is used to interact with Amazon Managed Service for Prometheus. For information about how to configure that role, see [Additional prerequisites for Restricted Instance Groups](#hyperpod-observability-addon-rig-prerequisites).
+ Ensure that you have at least one node in your SageMaker HyperPod cluster before installing the SageMaker HyperPod observability add-on. The smallest Amazon EC2 instance size that works in this case is `4xlarge`. This minimum node size requirement ensures that the node can accommodate all the pods that the SageMaker HyperPod observability add-on creates alongside any other pods already running on the cluster.
+ Add the following policies and permissions to your role.
  + [AWS managed policy: AmazonSageMakerHyperPodObservabilityAdminAccess](security-iam-awsmanpol-AmazonSageMakerHyperPodObservabilityAdminAccess.md)
  + [AWS managed policy: AWSGrafanaWorkspacePermissionManagementV2](https://docs.aws.amazon.com/grafana/latest/userguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AWSGrafanaWorkspacePermissionManagementV2)
  + [AWS managed policy: AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html)
  + Additional permissions to set up required IAM roles for Amazon Managed Grafana and Amazon Elastic Kubernetes Service add-on access:

------
#### [ JSON ]

****  

    ```
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CreateRoleAccess",
                "Effect": "Allow",
                "Action": [
                    "iam:CreateRole",
                    "iam:CreatePolicy",
                    "iam:AttachRolePolicy",
                    "iam:ListRoles"
                ],
                "Resource": [
                    "arn:aws:iam::*:role/service-role/AmazonSageMakerHyperPodObservabilityGrafanaAccess*",
                    "arn:aws:iam::*:role/service-role/AmazonSageMakerHyperPodObservabilityAddonAccess*",
                    "arn:aws:iam::*:policy/service-role/HyperPodObservabilityAddonPolicy*",
                    "arn:aws:iam::*:policy/service-role/HyperPodObservabilityGrafanaPolicy*"
                ]
            }
        ]
    }
    ```

------
  + Additional permissions needed to manage IAM Identity Center users for Amazon Managed Grafana:

------
#### [ JSON ]

****  

    ```
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "SSOAccess",
                "Effect": "Allow",
                "Action": [
                    "sso:ListProfileAssociations",
                    "sso-directory:SearchUsers",
                    "sso-directory:SearchGroups",
                    "sso:AssociateProfile",
                    "sso:DisassociateProfile"
                ],
                "Resource": [
                    "*"
                ]
            }
        ]
    }
    ```

------

## Additional prerequisites for Restricted Instance Groups
<a name="hyperpod-observability-addon-rig-prerequisites"></a>

If your cluster contains Restricted Instance Groups, the instance group execution role must have permissions to write metrics to Amazon Managed Service for Prometheus. When you use **Quick setup** to create your cluster with observability enabled, these permissions are added to the execution role automatically.

If you are using **Custom setup** or adding observability to an existing RIG cluster, ensure that the execution role for each Restricted Instance Group has the following permissions:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PrometheusAccess",
            "Effect": "Allow",
            "Action": "aps:RemoteWrite",
            "Resource": "arn:aws:aps:us-east-1:account_id:workspace/workspace-ID"
        }
    ]
}
```

Replace *us-east-1*, *account_id*, and *workspace-ID* with your AWS Region, account ID, and Amazon Managed Service for Prometheus workspace ID.
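
For example, a small helper (hypothetical, for illustration only) that renders the policy document for a given Region, account, and workspace:

```python
import json

def rig_prometheus_policy(region: str, account_id: str, workspace_id: str) -> str:
    """Render the aps:RemoteWrite policy for a Restricted Instance Group
    execution role (illustrative helper, not an AWS API)."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "PrometheusAccess",
            "Effect": "Allow",
            "Action": "aps:RemoteWrite",
            # Amazon Managed Service for Prometheus workspace ARN format
            "Resource": f"arn:aws:aps:{region}:{account_id}:workspace/{workspace_id}",
        }],
    }, indent=4)

# Hypothetical values; substitute your own.
print(rig_prometheus_policy("us-east-1", "111122223333", "ws-EXAMPLE11-2222-3333-4444-555555EXAMPLE"))
```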

After you have met the preceding prerequisites, you can install the observability add-on.

**To quickly install the observability add-on**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the add-on named **HyperPod Monitoring & Observability**, and choose **Quick install**.

**To do a custom install of the observability add-on**

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the add-on named **HyperPod Monitoring & Observability**, and choose **Custom install**.

1. Specify the metrics categories that you want to see. For more information about these metrics categories, see [SageMaker HyperPod cluster metrics](hyperpod-observability-cluster-metrics.md).

1. Specify whether you want to enable Amazon CloudWatch Logs.

1. Specify whether you want the service to create a new Amazon Managed Service for Prometheus workspace.

1. To be able to view the metrics in Amazon Managed Grafana dashboards, check the box labeled **Use an Amazon Managed Grafana workspace**. You can specify your own workspace or let the service create a new one for you. 
**Note**  
Amazon Managed Grafana isn't available in all AWS Regions in which Amazon Managed Service for Prometheus is available. However, you can set up a Grafana workspace in any AWS Region and configure it to get metrics data from a Prometheus workspace that resides in a different AWS Region. For information, see [Use AWS data source configuration to add Amazon Managed Service for Prometheus as a data source](https://docs.aws.amazon.com/grafana/latest/userguide/AMP-adding-AWS-config.html) and [Connect to Amazon Managed Service for Prometheus and open-source Prometheus data sources](https://docs.aws.amazon.com/grafana/latest/userguide/prometheus-data-source.html). 

# Amazon SageMaker HyperPod observability dashboards
<a name="hyperpod-observability-addon-viewing-dashboards"></a>

This topic describes how to view metrics dashboards for your Amazon SageMaker HyperPod (SageMaker HyperPod) clusters and how to add new users to a dashboard. The topic also describes the different types of dashboards.

## Accessing dashboards
<a name="hyperpod-observability-addon-accessing-dashboards"></a>

To view your SageMaker HyperPod cluster's metrics in Amazon Managed Grafana, perform the following steps:

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the **HyperPod Observability** section, and choose **Open dashboard in Grafana**.

## Adding new users to an Amazon Managed Grafana workspace
<a name="hyperpod-observability-addon-adding-users"></a>

For information about how to add users to an Amazon Managed Grafana workspace, see [Use AWS IAM Identity Center with your Amazon Managed Grafana workspace](https://docs.aws.amazon.com/grafana/latest/userguide/authentication-in-AMG-SSO.html) in the *Amazon Managed Grafana User Guide*.

## Observability dashboards
<a name="hyperpod-observability-addon-dashboards.title"></a>

The SageMaker HyperPod observability add-on provides a set of interconnected dashboards in your default Amazon Managed Grafana workspace. Each dashboard provides in-depth insights about different resources and tasks in the clusters for various users such as data scientists, machine learning engineers, and administrators.

### Task dashboard
<a name="hyperpod-observability-addon-task-dashboard"></a>

The Task dashboard provides comprehensive monitoring and visualization of resource utilization metrics for SageMaker HyperPod tasks. The main panel displays a detailed table grouping resource usage by parent tasks, showing CPU, GPU, and memory utilization across pods. Interactive time-series graphs track CPU usage, system memory consumption, GPU utilization percentages, and GPU memory usage for selected pods, allowing you to monitor performance trends over time. The dashboard features powerful filtering capabilities through variables like cluster name, namespace, task type, and specific pods, making it easy to drill down into specific workloads. This monitoring solution is essential for optimizing resource allocation and maintaining performance of machine learning workloads on SageMaker HyperPod.

### Training dashboard
<a name="hyperpod-observability-addon-training-dashboard"></a>

The training dashboard provides comprehensive monitoring of training task health, reliability, and fault management metrics. The dashboard features key performance indicators including task creation counts, success rates, and uptime percentages, along with detailed tracking of both automatic and manual restart events. It offers detailed visualizations of fault patterns through pie charts and heatmaps that break down incidents by type and remediation latency, enabling you to identify recurring issues and optimize task reliability. The interface includes real-time monitoring of critical metrics like system recovery times and fault detection latencies, making it an essential tool for maintaining high availability of training workloads. Additionally, the dashboard's 24-hour trailing window provides historical context for analyzing trends and patterns in training task performance, helping teams proactively address potential issues before they impact production workloads.

### Inference dashboard
<a name="hyperpod-observability-addon-inference-dashboard"></a>

The inference dashboard provides comprehensive monitoring of model deployment performance and health metrics across multiple dimensions. It features a detailed overview of active deployments, real-time monitoring of request rates, success percentages, and latency metrics, enabling you to track model serving performance and identify potential bottlenecks. The dashboard includes specialized panels for both general inference metrics and token-specific metrics for language models, such as time to first token (TTFT) and token throughput, making it particularly valuable for monitoring large language model deployments. Additionally, it provides infrastructure insights through pod and node allocation tracking, while offering detailed error analysis capabilities to help maintain high availability and performance of inference workloads.

### Cluster dashboard
<a name="hyperpod-observability-addon-cluster-dashboard"></a>

The cluster dashboard provides a comprehensive view of cluster health and performance, offering real-time visibility into compute, memory, network, and storage resources across your Amazon SageMaker HyperPod (SageMaker HyperPod) environment. At a glance, you can view critical metrics including total instances, GPU utilization, memory usage, and network performance through an intuitive interface that automatically updates data every few seconds. The dashboard is organized into logical sections, starting with a high-level cluster overview that displays key metrics such as healthy instance percentage and total resource counts, followed by detailed sections for GPU performance, memory utilization, network statistics, and storage metrics. Each section features interactive graphs and panels that allow you to drill down into specific metrics, with customizable time ranges and filtering options by cluster name, instance, or GPU ID.

### File system dashboard
<a name="hyperpod-observability-addon-filesystem-dashboard"></a>

The file-system dashboard provides comprehensive visibility into file system (Amazon FSx for Lustre) performance and health metrics. The dashboard displays critical storage metrics including free capacity, deduplication savings, CPU/memory utilization, disk IOPS, throughput, and client connections across multiple visualizations. It makes it possible for you to monitor both system-level performance indicators like CPU and memory usage, as well as storage-specific metrics such as read/write operations and disk utilization patterns. The interface includes alert monitoring capabilities and detailed time-series graphs for tracking performance trends over time, making it valuable for proactive maintenance and capacity planning. Additionally, through its comprehensive metrics coverage, the dashboard helps identify potential bottlenecks, optimize storage performance, and ensure reliable file system operations for SageMaker HyperPod workloads.

### GPU partition dashboard
<a name="hyperpod-observability-addon-gpu-partition-dashboard"></a>

To monitor GPU partition-specific metrics when using Multi-Instance GPU (MIG) configurations, you need to install or upgrade to the latest version of the SageMaker HyperPod observability add-on. This add-on provides comprehensive monitoring capabilities, including MIG-specific metrics such as partition count, memory usage, and compute utilization per GPU partition.

If you already have the SageMaker HyperPod observability add-on installed but need MIG metrics support, update the add-on to the latest version. This process is non-disruptive and maintains your existing monitoring configuration.

SageMaker HyperPod automatically exposes MIG-specific metrics, including:
+ `nvidia_mig_instance_count`: Number of MIG instances per profile
+ `nvidia_mig_memory_usage`: Memory utilization per MIG instance
+ `nvidia_mig_compute_utilization`: Compute utilization per MIG instance
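In Grafana, you could chart these metrics with PromQL expressions along the following lines. This is a sketch; label names such as `profile` are assumptions and may differ in your workspace.

```
# Average memory utilization per MIG instance over 5 minutes
avg_over_time(nvidia_mig_memory_usage[5m])

# Total MIG instances, grouped by an assumed profile label
sum by (profile) (nvidia_mig_instance_count)
```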

### Cluster Logs dashboard
<a name="hyperpod-observability-addon-cluster-logs-dashboard"></a>

The Cluster Logs dashboard provides a centralized view of CloudWatch Logs for your SageMaker HyperPod cluster. The dashboard queries the `/aws/sagemaker/Clusters/{cluster-name}/{cluster-id}` log group and displays log events with filtering capabilities by instance ID, log stream name, log level (ERROR, WARN, INFO, DEBUG), and free-text search. The dashboard includes an events timeline showing log event distribution over time, a total events counter, a searched events timeline for filtered results, and a detailed logs panel with full log messages, timestamps, and log stream metadata. This dashboard uses CloudWatch as its data source and is useful for debugging cluster issues, monitoring instance health events, and investigating training job failures.
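Outside of Grafana, you can run a comparable search against the same log group with CloudWatch Logs Insights. The following query is a sketch that surfaces the most recent ERROR-level events; it assumes the standard log group naming described above.

```
fields @timestamp, @logStream, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 50
```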

# Exploring SageMaker HyperPod cluster metrics in Amazon Managed Grafana
<a name="hyperpod-observability-addon-exploring-metrics"></a>

After you connect Amazon Managed Grafana to your Amazon Managed Service for Prometheus workspace, you can use Grafana's query editor and visualization tools to explore your metrics data. Amazon Managed Grafana provides multiple ways to interact with Prometheus data, including a comprehensive query editor for building PromQL expressions, a metrics browser for discovering available metrics and labels, and templating capabilities for creating dynamic dashboards. You can perform range queries to visualize time series data over periods and instant queries to retrieve the latest values, with options to format results as time series graphs, tables, or heatmaps. For detailed information about configuring query settings, using the metrics browser, and leveraging templating features, see [Using the Prometheus data source](https://docs.aws.amazon.com/grafana/latest/userguide/using-prometheus-datasource.html).
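For example, a range query over a default Node Exporter metric might look like the following sketch, which plots per-node CPU busy percentage (assuming the instance metrics category is enabled):

```
# CPU busy percentage per node, derived from the Node Exporter idle counter
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```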

# Customizing SageMaker HyperPod cluster metrics dashboards and alerts
<a name="hyperpod-observability-addon-customizing"></a>

Amazon Managed Grafana makes it possible for you to create comprehensive dashboards that visualize your data through panels containing queries connected to your data sources. You can build dashboards from scratch, import existing ones, or export your creations for sharing and backup purposes. Grafana dashboards support dynamic functionality through variables that replace hard-coded values in queries, making your visualizations more flexible and interactive. You can also enhance your dashboards with features like annotations, library panels for reusability, version history management, and custom links to create a complete monitoring and observability solution. For step-by-step guidance on creating, importing, configuring, and managing dashboards, see [Building dashboards](https://docs.aws.amazon.com/grafana/latest/userguide/v10-dash-building-dashboards.html).

# Creating custom SageMaker HyperPod cluster metrics
<a name="hyperpod-observability-addon-custom-metrics"></a>

The Amazon SageMaker HyperPod (SageMaker HyperPod) observability add-on provides hundreds of health, performance, and efficiency metrics out-of-the-box. In addition to those metrics, you might need to monitor custom metrics specific to your applications or business needs that aren't captured by default metrics, such as model-specific performance indicators, data processing statistics, or application-specific measurements. To address this need, you can implement custom metrics collection using OpenTelemetry by integrating a Python code snippet into your application.

To create custom metrics, first run the following shell command to install the core OpenTelemetry components needed to instrument Python applications for observability. This installation makes it possible for Python applications that run on SageMaker HyperPod clusters to emit custom telemetry data. That data gets collected by the OpenTelemetry collector and forwarded to the observability infrastructure.

```
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
```

The following example script configures an OpenTelemetry metrics pipeline that automatically tags metrics with pod and node information to ensure proper attribution within your cluster, and sends those metrics to the SageMaker HyperPod built-in observability stack every second. The script establishes a connection to the SageMaker HyperPod metrics collector, sets up appropriate resource attributes for identification, and provides a meter interface through which you can create various types of metrics (counters, gauges, or histograms) to track any aspect of your application's performance. Custom metrics integrate with the SageMaker HyperPod monitoring dashboards alongside system metrics. This integration allows for comprehensive observability through a single interface where you can create custom alerts, visualizations, and reports to monitor your workload's complete performance profile.

```
import os
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

# Get hostname/pod name
hostname = os.uname()[1]
node_name = os.getenv('NODE_NAME', 'unknown')

collector_endpoint = "hyperpod-otel-collector.hyperpod-observability:4317"

# Configure the OTLP exporter
exporter = OTLPMetricExporter(
    endpoint=collector_endpoint,
    insecure=True,
    timeout=5  # 5-second timeout (the timeout parameter is in seconds)
)

reader = PeriodicExportingMetricReader(
    exporter,
    export_interval_millis=1000
)

resource = Resource.create({
    "service.name": "metric-test",
    "pod.name": hostname,
    "node.name": node_name
})

meter_provider = MeterProvider(
    metric_readers=[reader],
    resource=resource
)
metrics.set_meter_provider(meter_provider)

# Create a meter
meter = metrics.get_meter("test-meter")

# Create a counter
counter = meter.create_counter(
    name="test.counter",
    description="A test counter"
)

counter.add(1, {"pod": hostname, "node": node_name})
```
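When you later query a custom metric in Prometheus or Grafana, note that dots in OTLP metric names (such as `test.counter` above) are typically exposed with underscores (`test_counter`). The following standalone helper is a hypothetical sketch, not part of the add-on, for predicting the Prometheus-side name while building dashboard queries:

```python
import re

def to_prometheus_name(otel_name: str) -> str:
    """Approximate the Prometheus-exposed name of an OTLP metric by
    replacing characters that are not letters, digits, underscores,
    or colons with underscores (dots become underscores)."""
    return re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)

print(to_prometheus_name("test.counter"))      # test_counter
print(to_prometheus_name("model.latency.ms"))  # model_latency_ms
```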

# SageMaker HyperPod cluster metrics
<a name="hyperpod-observability-cluster-metrics"></a>

Amazon SageMaker HyperPod (SageMaker HyperPod) publishes various metrics across nine distinct categories to your Amazon Managed Service for Prometheus workspace. Not all metrics are enabled by default or displayed in your Amazon Managed Grafana workspace. The following table shows which metrics are enabled by default when you install the observability add-on, which categories have additional metrics that can be enabled for more granular cluster information, and where they appear in the Amazon Managed Grafana workspace.


| Metric category | Enabled by default? | Additional advanced metrics available? | Available under which Grafana dashboards? | 
| --- | --- | --- | --- | 
| Training metrics | Yes | Yes | Training | 
| Inference metrics | Yes | No | Inference | 
| Task governance metrics | No | Yes | None. Query your Amazon Managed Service for Prometheus workspace to build your own dashboard. | 
| Scaling metrics | No | Yes | None. Query your Amazon Managed Service for Prometheus workspace to build your own dashboard. | 
| Cluster metrics | Yes | Yes | Cluster | 
| Instance metrics | Yes | Yes | Cluster | 
| Accelerated compute metrics | Yes | Yes | Task, Cluster | 
| Network metrics | No | Yes | Cluster | 
| File system | Yes | No | File system | 

The following tables describe the metrics available for monitoring your SageMaker HyperPod cluster, organized by category.

## Metrics availability on Restricted Instance Groups
<a name="hyperpod-observability-rig-metrics-availability"></a>

When your cluster contains Restricted Instance Groups, most metric categories are available on restricted nodes with the following exceptions and considerations. You can also set up alerting on any metric of your choice.


| Metric category | Available on RIG nodes? | Notes | 
| --- | --- | --- | 
| Training metrics | Yes | Kubeflow and Kubernetes pod metrics are collected. Advanced training KPI metrics (from Training Metrics Agent) are not available from the RIG nodes. | 
| Inference metrics | No | Inference workloads are not supported on Restricted Instance Groups. | 
| Task governance metrics | No | Kueue metrics are collected from the standard nodes only, if any. | 
| Scaling metrics | No | KEDA metrics are collected from the standard nodes only, if any. | 
| Cluster metrics | Yes | Kube State Metrics and API server metrics are available. Kube State Metrics is preferentially scheduled on standard nodes but can run on restricted nodes in RIG-only clusters. | 
| Instance metrics | Yes | Node Exporter and cAdvisor metrics are collected on all nodes including restricted nodes. | 
| Accelerated compute metrics | Yes | DCGM Exporter runs on GPU-enabled restricted nodes. Neuron Monitor runs on Neuron-enabled restricted nodes when advanced mode is enabled. | 
| Network metrics | Yes | EFA Exporter runs on EFA-enabled restricted nodes when advanced mode is enabled. | 
| File system metrics | Yes | FSx for Lustre cluster utilization metrics are supported on Restricted Instance Groups. | 

**Note**  
Container log collection with Fluent Bit is not deployed on restricted nodes. Cluster logs from restricted nodes are available through the SageMaker HyperPod platform independently of the observability add-on. You can view these logs in the Cluster Logs dashboard.

## Training metrics
<a name="hyperpod-observability-training-metrics"></a>

Use these metrics to track the performance of training tasks executed on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Kubeflow metrics | [https://github.com/kubeflow/trainer](https://github.com/kubeflow/trainer) | Yes | Kubeflow | 
| Kubernetes pod metrics | [https://github.com/kubernetes/kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) | Yes | Kubernetes | 
| training\_uptime\_percentage | Percentage of training time out of the total window size | No | SageMaker HyperPod training operator | 
| training\_manual\_recovery\_count | Total number of manual restarts performed on the job | No | SageMaker HyperPod training operator | 
| training\_manual\_downtime\_ms | Total time in milliseconds the job was down due to manual interventions | No | SageMaker HyperPod training operator | 
| training\_auto\_recovery\_count | Total number of automatic recoveries | No | SageMaker HyperPod training operator | 
| training\_auto\_recovery\_downtime | Total infrastructure overhead time in milliseconds during fault recovery | No | SageMaker HyperPod training operator | 
| training\_fault\_count | Total number of faults encountered during training | No | SageMaker HyperPod training operator | 
| training\_fault\_type\_count | Distribution of faults by type | No | SageMaker HyperPod training operator | 
| training\_fault\_recovery\_time\_ms | Recovery time in milliseconds for each type of fault | No | SageMaker HyperPod training operator | 
| training\_time\_ms | Total time in milliseconds spent in actual training | No | SageMaker HyperPod training operator | 
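If you enable the advanced training operator metrics, you might visualize job health with queries along these lines. This is a sketch; the exact exposed names and labels depend on your training operator version.

```
# Percentage of time the job spent training within the window
training_uptime_percentage

# Faults by type over the last hour
increase(training_fault_type_count[1h])
```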

## Inference metrics
<a name="hyperpod-observability-inference-metrics"></a>

Use these metrics to track the performance of inference tasks on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| model\_invocations\_total | Total number of invocation requests to the model | Yes | SageMaker HyperPod inference operator | 
| model\_errors\_total | Total number of errors during model invocation | Yes | SageMaker HyperPod inference operator | 
| model\_concurrent\_requests | Active concurrent model requests | Yes | SageMaker HyperPod inference operator | 
| model\_latency\_milliseconds | Model invocation latency in milliseconds | Yes | SageMaker HyperPod inference operator | 
| model\_ttfb\_milliseconds | Model time to first byte latency in milliseconds | Yes | SageMaker HyperPod inference operator | 
| TGI | These metrics can be used to monitor the performance of TGI, auto-scale deployments, and help identify bottlenecks. For a detailed list of metrics, see [https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md](https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md). | Yes | Model container | 
| LMI | These metrics can be used to monitor the performance of LMI and help identify bottlenecks. For a detailed list of metrics, see [https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md](https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md). | Yes | Model container | 
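As an example of combining the default inference metrics, the following PromQL sketch computes the fraction of model invocations that failed over the last five minutes:

```
rate(model_errors_total[5m]) / rate(model_invocations_total[5m])
```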

## Task governance metrics
<a name="hyperpod-observability-task-governance-metrics"></a>

Use these metrics to monitor task governance and resource allocation on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Kueue | See [https://kueue.sigs.k8s.io/docs/reference/metrics/](https://kueue.sigs.k8s.io/docs/reference/metrics/). | No | Kueue | 

## Scaling metrics
<a name="hyperpod-observability-scaling-metrics"></a>

Use these metrics to monitor auto-scaling behavior and performance on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| KEDA Operator Metrics | See [https://keda.sh/docs/2.17/integrations/prometheus/#operator](https://keda.sh/docs/2.17/integrations/prometheus/#operator). | No | Kubernetes Event-driven Autoscaler (KEDA) | 
| KEDA Webhook Metrics | See [https://keda.sh/docs/2.17/integrations/prometheus/#admission-webhooks](https://keda.sh/docs/2.17/integrations/prometheus/#admission-webhooks). | No | Kubernetes Event-driven Autoscaler (KEDA) | 
| KEDA Metrics Server Metrics | See [https://keda.sh/docs/2.17/integrations/prometheus/#metrics-server](https://keda.sh/docs/2.17/integrations/prometheus/#metrics-server). | No | Kubernetes Event-driven Autoscaler (KEDA) | 

## Cluster metrics
<a name="hyperpod-observability-cluster-health-metrics"></a>

Use these metrics to monitor overall cluster health and resource allocation.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Cluster health | Kubernetes API server metrics. See [https://kubernetes.io/docs/reference/instrumentation/metrics/](https://kubernetes.io/docs/reference/instrumentation/metrics/). | Yes | Kubernetes | 
| Kubestate | See [https://github.com/kubernetes/kube-state-metrics/tree/main/docs#default-resources](https://github.com/kubernetes/kube-state-metrics/tree/main/docs#default-resources). | Limited | Kubernetes | 
| KubeState Advanced | See [https://github.com/kubernetes/kube-state-metrics/tree/main/docs#optional-resources](https://github.com/kubernetes/kube-state-metrics/tree/main/docs#optional-resources). | No | Kubernetes | 

## Instance metrics
<a name="hyperpod-observability-instance-metrics"></a>

Use these metrics to monitor individual instance performance and health.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Node Metrics | See [https://github.com/prometheus/node_exporter?tab=readme-ov-file#enabled-by-default](https://github.com/prometheus/node_exporter?tab=readme-ov-file#enabled-by-default). | Yes | Kubernetes | 
| Container Metrics | Container metrics exposed by cAdvisor. See [https://github.com/google/cadvisor](https://github.com/google/cadvisor). | Yes | Kubernetes | 

## Accelerated compute metrics
<a name="hyperpod-observability-accelerated-compute-metrics"></a>

Use these metrics to monitor the performance, health, and utilization of individual accelerated compute devices in your cluster.

**Note**  
When GPU partitioning with MIG (Multi-Instance GPU) is enabled on your cluster, DCGM metrics automatically provide partition-level granularity for monitoring individual MIG instances. Each MIG partition is exposed as a separate GPU device with its own metrics for temperature, power, memory utilization, and compute activity. This allows you to track resource usage and health for each GPU partition independently, enabling precise monitoring of workloads running on fractional GPU resources. For more information about configuring GPU partitioning, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| NVIDIA GPU | DCGM metrics. See [https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv](https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv). | Limited |  NVIDIA Data Center GPU Manager (DCGM)  | 
|  NVIDIA GPU (advanced)  | DCGM metrics that are commented out in the following CSV file: [https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv](https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv). | No |  NVIDIA Data Center GPU Manager (DCGM)  | 
| AWS Trainium | Neuron metrics. See [https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#neuron-monitor-nc-counters](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#neuron-monitor-nc-counters). | No | AWS Neuron Monitor | 

## Network metrics
<a name="hyperpod-observability-network-metrics"></a>

Use these metrics to monitor the performance and health of the Elastic Fabric Adapters (EFA) in your cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| EFA | See [https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md). | No | Elastic Fabric Adapter | 

## File system metrics
<a name="hyperpod-observability-file-system-metrics"></a>


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| File system | Amazon FSx for Lustre metrics from Amazon CloudWatch: [Monitoring with Amazon CloudWatch](https://docs.aws.amazon.com/fsx/latest/LustreGuide/monitoring-cloudwatch.html). | Yes | Amazon FSx for Lustre | 

# Preconfigured alerts
<a name="hyperpod-observability-addon-alerts"></a>

The Amazon SageMaker HyperPod (SageMaker HyperPod) observability add-on enables default alerts for your cluster and workloads to notify you when the system detects common early indicators of cluster under-performance. These alerts are defined within the Amazon Managed Grafana built-in alerting system. For information about how to modify these preconfigured alerts or create new ones, see [Alerts in Grafana version 10](https://docs.aws.amazon.com/grafana/latest/userguide/v10-alerts.html) in the *Amazon Managed Grafana User Guide*. The following YAML shows the default alerts.

```
groups:
- name: sagemaker_hyperpod_alerts
  rules:
  # GPU_TEMP_ABOVE_80C
  - alert: GPUHighTemperature
    expr: DCGM_FI_DEV_GPU_TEMP > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU Temperature Above 80C"
      description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C."

  # GPU_TEMP_ABOVE_85C  
  - alert: GPUCriticalTemperature  
    expr: DCGM_FI_DEV_GPU_TEMP > 85
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "GPU Temperature Above 85C"
      description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C."

  # GPU_MEMORY_ERROR
  # Any ECC double-bit errors indicate serious memory issues requiring immediate attention
  - alert: GPUMemoryErrorDetected
    expr: DCGM_FI_DEV_ECC_DBE_VOL_TOTAL > 0 or DCGM_FI_DEV_ECC_DBE_AGG_TOTAL > DCGM_FI_DEV_ECC_DBE_AGG_TOTAL offset 5m
    labels:
      severity: critical
    annotations:
      summary: "GPU ECC Double-Bit Error Detected"
      description: "GPU {{ $labels.gpu }} has detected ECC double-bit errors."

  # GPU_POWER_WARNING
  # Sustained power limit violations can impact performance and stability
  - alert: GPUPowerViolation
    expr: DCGM_FI_DEV_POWER_VIOLATION > 100
    for: 5m
    labels:
      severity: warning  
    annotations:
      summary: "GPU Power Violation"
      description: "GPU {{ $labels.gpu }} has been operating at power limit for extended period."

  # GPU_NVLINK_ERROR
  # NVLink errors above threshold indicate interconnect stability issues
  - alert: NVLinkErrorsDetected
    expr: DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL > 0 or DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL > 10
    labels:
      severity: warning
    annotations:
      summary: "NVLink Errors Detected" 
      description: "GPU {{ $labels.gpu }} has detected NVLink errors."

  # GPU_THERMAL_VIOLATION  
  # Immediate alert on thermal violations to prevent hardware damage
  - alert: GPUThermalViolation
    expr: increase(DCGM_FI_DEV_THERMAL_VIOLATION[5m]) > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "GPU Thermal Violation Detected"
      description: "GPU {{ $labels.gpu }} has thermal violations on node {{ $labels.Hostname }}"

  # GPU_XID_ERROR
  # XID errors indicate driver or hardware level GPU issues requiring investigation
  - alert: GPUXidError
    expr: DCGM_FI_DEV_XID_ERRORS > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "GPU XID Error Detected"
      description: "GPU {{ $labels.gpu }} experienced XID error {{ $value }} on node {{ $labels.Hostname }}"

  # MIG_CONFIG_FAILURE
  # MIG configuration failures indicate issues with GPU partitioning setup
  - alert: MIGConfigFailure
    expr: kubelet_node_name{nvidia_com_mig_config_state="failed"} > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "MIG Configuration Failed"
      description: "MIG configuration failed on node {{ $labels.instance }}"

  # DISK_SPACE_WARNING
  # 90% threshold ensures time to respond before complete disk exhaustion
  - alert: NodeDiskSpaceWarning
    expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Disk Usage"
      description: "Node {{ $labels.instance }} disk usage is above 90%"

  # FSX_STORAGE_WARNING
  # 80% FSx utilization allows buffer for burst workloads
  - alert: FsxLustreStorageWarning
    expr: fsx_lustre_storage_used_bytes / fsx_lustre_storage_capacity_bytes * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High FSx Lustre Usage"
      description: "FSx Lustre storage usage is above 80% on file system {{ $labels.filesystem_id }}"
```
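You can add rules of your own in the same format. The following sketch defines a custom alert on the MIG memory metric described earlier in this guide; it assumes the metric reports a percentage, and the 90 threshold is illustrative:

```
  # CUSTOM_MIG_MEMORY_WARNING (example, not installed by default)
  - alert: MIGMemoryHighUsage
    expr: nvidia_mig_memory_usage > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High MIG Instance Memory Usage"
      description: "MIG instance on GPU {{ $labels.gpu }} is above 90% memory usage."
```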

# Troubleshooting the Amazon SageMaker HyperPod observability add-on
<a name="hyperpod-observability-addon-troubleshooting"></a>

Use the following guidance to resolve common issues with the Amazon SageMaker HyperPod (SageMaker HyperPod) observability add-on.

## Troubleshooting missing metrics in Amazon Managed Grafana
<a name="troubleshooting-missing-metrics"></a>

If metrics don't appear in your Amazon Managed Grafana dashboards, perform the following steps to identify and resolve the issue.

### Verify the Amazon Managed Service for Prometheus-Amazon Managed Grafana connection
<a name="verify-amp-grafana-connection"></a>

1. Sign in to the Amazon Managed Grafana console.

1. In the left pane, choose **All workspaces**.

1. In the **Workspaces** table, choose your workspace.

1. In the details page of the workspace, choose the **Data sources** tab.

1. Verify that the Amazon Managed Service for Prometheus data source exists.

1. Check the connection settings:
   + Confirm that the endpoint URL is correct.
   + Verify that IAM authentication is properly configured.
   + Choose **Test connection**. Verify that the status is **Data source is working**.

### Verify the Amazon EKS add-on status
<a name="verify-eks-addon-status"></a>

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Select your cluster.

1. Choose the **Add-ons** tab.

1. Verify that the SageMaker HyperPod observability add-on is listed and that its status is **ACTIVE**.

1. If the status isn't **ACTIVE**, see [Troubleshooting add-on installation failures](#troubleshooting-addon-installation-failures).

### Verify Pod Identity association
<a name="verify-pod-identity-association"></a>

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Select your cluster.

1. On the cluster details page, choose the **Access** tab.

1. In the **Pod Identity associations** table, choose the association that has the following property values:
   + **Namespace**: `hyperpod-observability`
   + **Service account**: `hyperpod-observability-operator-otel-collector`
   + **Add-on**: `amazon-sagemaker-hyperpod-observability`

1. Ensure that the IAM role that is attached to this association has the following permissions.

------
#### [ JSON ]

****  

   ```
   {
    "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "PrometheusAccess",
               "Effect": "Allow",
               "Action": "aps:RemoteWrite",
               "Resource": "arn:aws:aps:us-east-1:111122223333:workspace/workspace-ID"
           },
           {
               "Sid": "CloudwatchLogsAccess",
               "Effect": "Allow",
               "Action": [
                   "logs:CreateLogGroup",
                   "logs:CreateLogStream",
                   "logs:DescribeLogGroups",
                   "logs:DescribeLogStreams",
                   "logs:PutLogEvents",
                   "logs:GetLogEvents",
                   "logs:FilterLogEvents",
                   "logs:GetLogRecord",
                   "logs:StartQuery",
                   "logs:StopQuery",
                   "logs:GetQueryResults"
               ],
               "Resource": [
                   "arn:aws:logs:us-east-1:111122223333:log-group:/aws/sagemaker/Clusters/*",
                   "arn:aws:logs:us-east-1:111122223333:log-group:/aws/sagemaker/Clusters/*:log-stream:*"
               ]
           }
       ]
   }
   ```

------

1. Ensure that the IAM role that is attached to this association has the following trust policy. Verify that the source ARN and source account are correct.

------
#### [ JSON ]

****  

   ```
   {
    "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
               "Effect": "Allow",
               "Principal": {
                   "Service": "pods.eks.amazonaws.com"
               },
               "Action": [
                   "sts:AssumeRole",
                   "sts:TagSession"
               ],
               "Condition": {
                   "StringEquals": {
                       "aws:SourceArn": "arn:aws:eks:us-east-1:111122223333:cluster/cluster-name",
                       "aws:SourceAccount": "111122223333"
                   }
               }
           }
       ]
   }
   ```

------

### Check Amazon Managed Service for Prometheus throttling
<a name="check-amp-throttling"></a>

1. Sign in to the AWS Management Console and open the Service Quotas console at [https://console.aws.amazon.com/servicequotas/](https://console.aws.amazon.com/servicequotas/).

1. In the **Managed quotas** box, search for and select Amazon Managed Service for Prometheus.

1. Choose the **Active series per workspace** quota.

1. In the **Resource-level quotas** tab, select your Amazon Managed Service for Prometheus workspace.

1. Ensure that the utilization is less than your current quota.

1. If you've reached the quota limit, select your workspace by choosing the radio button to its left, and then choose **Request increase at resource level**.

### Verify KV caching and intelligent routing are enabled
<a name="verify-caching-routing"></a>

If the `KVCache Metrics` dashboard is missing, the feature is either not enabled or the port isn't specified in `modelMetrics`. For more information on how to enable this, see steps 1 and 3 in [Configure KV caching and intelligent routing for improved performance](sagemaker-hyperpod-model-deployment-deploy-ftm.md#sagemaker-hyperpod-model-deployment-deploy-ftm-cache-route). 

If the `Intelligent Router Metrics` dashboard is missing, enable the feature to make the dashboard appear. For more information on how to enable this, see [Configure KV caching and intelligent routing for improved performance](sagemaker-hyperpod-model-deployment-deploy-ftm.md#sagemaker-hyperpod-model-deployment-deploy-ftm-cache-route). 

## Troubleshooting add-on installation failures
<a name="troubleshooting-addon-installation-failures"></a>

If the observability add-on fails to install, use the following steps to diagnose and resolve the issue.

### Check health probe status
<a name="check-health-probe-status"></a>

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Select your cluster.

1. Choose the **Add-ons** tab.

1. Choose the failed add-on.

1. Review the **Health issues** section.

1. If the health issue is related to credentials or pod identity, see [Verify Pod Identity association](#verify-pod-identity-association). Also ensure that the pod identity agent add-on is running in the cluster.

1. Check for errors in the manager logs. For instructions, see [Review manager logs](#review-manager-logs).

1. Contact AWS Support with the issue details.

### Review manager logs
<a name="review-manager-logs"></a>

1. Get the logs of the add-on manager pod:

   ```
   kubectl logs -n hyperpod-observability -l control-plane=hyperpod-observability-controller-manager
   ```

1. For urgent issues, contact Support.

## Review all observability pods
<a name="review-all-observability-pods"></a>

All the pods that the SageMaker HyperPod observability add-on creates are in the `hyperpod-observability` namespace. To get the status of these pods, run the following command.

```
kubectl get pods -n hyperpod-observability
```

Look for pods whose status is either `Pending` or `CrashLoopBackOff`. Run the following command to get the logs of these pending or failing pods.

```
kubectl logs -n hyperpod-observability pod-name
```

If you don't find errors in the logs, run the following command to describe the pods and look for errors.

```
kubectl describe -n hyperpod-observability pod pod-name
```

To get more context, run the following two commands to describe the deployments and daemonsets for these pods.

```
kubectl describe -n hyperpod-observability deployment deployment-name
```

```
kubectl describe -n hyperpod-observability daemonset daemonset-name
```
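The triage loop above can be sketched as a simple filter over `kubectl get pods` output. The listing below is invented sample output for illustration; on a live cluster you would pipe the real output of `kubectl get pods -n hyperpod-observability` into the same `awk` filter.

```shell
# Invented sample of "kubectl get pods -n hyperpod-observability" output.
cat <<'EOF' > /tmp/pods.txt
NAME                     READY   STATUS             RESTARTS   AGE
fluent-bit-abc12         1/1     Running            0          3d
node-collector-def34     0/1     Pending            0          10m
central-collector-gh56   0/1     CrashLoopBackOff   7          25m
EOF

# Keep only pods whose STATUS column is Pending or CrashLoopBackOff.
awk 'NR > 1 && ($3 == "Pending" || $3 == "CrashLoopBackOff") {print $1, $3}' /tmp/pods.txt
```

Each pod this filter surfaces is a candidate for the `kubectl logs` and `kubectl describe` commands shown above.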

## Troubleshooting pods that are stuck in the pending status
<a name="pods-stuck-in-pending"></a>

If you see pods that are stuck in the `Pending` status, make sure that the node is large enough to fit all the pods. To verify this, perform the following steps.

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Choose your cluster.

1. Choose the cluster's **Compute** tab.

1. Choose the node with the smallest instance type.

1. In the **Capacity allocation** section, look for the number of available pods.

1. If there are no available pods, you need a larger instance type.

For urgent issues, contact AWS Support.

## Troubleshooting observability on Restricted Instance Groups
<a name="troubleshooting-rig-observability"></a>

Use the following guidance to resolve issues specific to clusters with Restricted Instance Groups.

### Observability pods not starting on restricted nodes
<a name="troubleshooting-rig-pods-not-starting"></a>

If observability pods are not starting on restricted nodes, check the pod status and events:

```
kubectl get pods -n hyperpod-observability -o wide
kubectl describe pod pod-name -n hyperpod-observability
```

Common causes include:
+ **Image pull failures:** The pod events may show image pull errors if the observability container images are not yet allowlisted on the restricted nodes. Ensure that you are running the latest version of the observability add-on. If the issue persists after upgrading, contact Support.
+ **Taint tolerations:** Verify that the pod spec includes the required toleration for restricted nodes. Starting from version `v1.0.5-eksbuild.1`, the add-on automatically adds this toleration when RIG support is enabled. If you are using an older version, upgrade to the latest version.

### Viewing logs for pods on restricted nodes
<a name="troubleshooting-rig-viewing-logs"></a>

The `kubectl logs` command does not work for pods running on restricted nodes. This is an expected limitation because the communication path required for log streaming is not available on restricted nodes.

To view logs from restricted nodes, use the **Cluster Logs** dashboard in Amazon Managed Grafana, which queries CloudWatch Logs directly. You can filter by instance ID, log stream, log level, and free-text search to find relevant log entries.

### DNS resolution failures in clusters with both standard and restricted nodes
<a name="troubleshooting-rig-dns-resolution"></a>

In hybrid clusters (clusters with both standard and restricted instance groups), pods on standard nodes may experience DNS resolution timeouts when trying to reach AWS service endpoints such as Amazon Managed Service for Prometheus or CloudWatch.

**Cause:** The `kube-dns` service has endpoints from both standard CoreDNS pods and RIG CoreDNS pods. Standard node pods cannot reach RIG CoreDNS endpoints due to network isolation. When `kube-proxy` load-balances a DNS request from a standard node pod to a RIG CoreDNS endpoint, the request times out.

**Resolution:** Set `internalTrafficPolicy: Local` on the `kube-dns` service so that pods only reach CoreDNS on their local node:

```
kubectl patch svc kube-dns -n kube-system -p '{"spec":{"internalTrafficPolicy":"Local"}}'
```

After applying this patch, restart the affected observability pods:

```
kubectl delete pods -n hyperpod-observability -l app.kubernetes.io/name=hyperpod-node-collector
```
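To confirm that the patch took effect, you can inspect the service's `internalTrafficPolicy` field. On a live cluster the check would be `kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.internalTrafficPolicy}'`; the snippet below simulates that check against an invented sample of the service JSON.

```shell
# Invented sample of what "kubectl get svc kube-dns -n kube-system -o json"
# might return after the patch. On a live cluster, run kubectl instead.
cat <<'EOF' > /tmp/kube-dns.json
{"metadata": {"name": "kube-dns", "namespace": "kube-system"},
 "spec": {"internalTrafficPolicy": "Local"}}
EOF

# Extract the policy value; expect "Local" after the patch.
grep -o '"internalTrafficPolicy": *"[^"]*"' /tmp/kube-dns.json
```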

### Metrics from restricted nodes not reaching Amazon Managed Service for Prometheus
<a name="troubleshooting-rig-metrics-not-reaching-amp"></a>

If metrics from restricted nodes are not appearing in your Amazon Managed Service for Prometheus workspace:

1. **Verify the execution role permissions.** Ensure that the execution role for the Restricted Instance Group has `aps:RemoteWrite` permission for your Prometheus workspace. For more information, see [Additional prerequisites for Restricted Instance Groups](hyperpod-observability-addon-setup.md#hyperpod-observability-addon-rig-prerequisites).

1. **Check the node collector pod status.** Run the following command and verify that node collector pods are running on restricted nodes:

   ```
   kubectl get pods -n hyperpod-observability | grep node-collector
   ```

1. **Check the central collector deployments.** In clusters with restricted nodes, the add-on deploys one central collector per network boundary. Verify that a central collector exists for each boundary:

   ```
   kubectl get deployments -n hyperpod-observability | grep central-collector
   ```

1. **Check pod events for errors.** Use `kubectl describe` on the collector pods to look for error events:

   ```
   kubectl describe pod collector-pod-name -n hyperpod-observability
   ```

If the issue persists after verifying the above, contact Support.

### Pod Identity verification does not apply to restricted instance group nodes
<a name="troubleshooting-rig-pod-identity"></a>

The [Verify Pod Identity association](#verify-pod-identity-association) troubleshooting steps apply only to standard nodes. On restricted nodes, the add-on uses the cluster instance group execution role for AWS authentication instead of Amazon EKS Pod Identity. If metrics are missing from restricted nodes, verify the execution role permissions instead of the Pod Identity association.

### Fluent Bit not running on restricted nodes
<a name="troubleshooting-rig-fluent-bit"></a>

This is expected behavior. Fluent Bit is intentionally not deployed on restricted nodes. Logs from restricted nodes are published to CloudWatch through the SageMaker HyperPod platform independently of the observability add-on. Use the **Cluster Logs** dashboard in Amazon Managed Grafana to view these logs.

# Observability with Amazon CloudWatch
<a name="sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci"></a>

Use [Amazon CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html) to collect, aggregate, and summarize metrics and logs from the containerized applications and micro-services on the EKS cluster associated with a HyperPod cluster.

Amazon CloudWatch Container Insights collects metrics for compute resources, such as CPU, memory, disk, and network. Container Insights also provides diagnostic information, such as container restart failures, to help you isolate issues and resolve them quickly. You can also set CloudWatch alarms on metrics that Container Insights collects.

To find a complete list of metrics, see [Amazon EKS and Kubernetes Container Insights metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-EKS.html) in the *Amazon EKS User Guide*.

## Install CloudWatch Container Insights
<a name="sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci-setup"></a>

Cluster admin users must set up CloudWatch Container Insights by following the instructions at [Install the CloudWatch agent by using the Amazon CloudWatch Observability EKS add-on or the Helm chart](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Observability-EKS-addon.html) in the *Amazon CloudWatch User Guide*. For more information about the Amazon EKS add-on, see also [Install the Amazon CloudWatch Observability EKS add-on](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-addon.html) in the *Amazon EKS User Guide*.

After the installation is complete, verify that the CloudWatch Observability add-on appears on the EKS cluster's **Add-ons** tab. It might take a couple of minutes for the dashboard to load.

**Note**  
SageMaker HyperPod requires CloudWatch Container Insights version v2.0.1-eksbuild.1 or later.

![\[CloudWatch Observability service card showing status, version, and IAM role information.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-eks-CIaddon.png)


## Access CloudWatch container insights dashboard
<a name="sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci-access-dashboard"></a>

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Choose **Insights**, and then choose **Container Insights**.

1. Select the EKS cluster associated with the HyperPod cluster you're using.

1. View the pod-level and cluster-level metrics.

![\[Performance monitoring dashboard for EKS cluster showing node status, resource utilization, and pod metrics.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-eks-CIdashboard.png)


## Access CloudWatch container insights logs
<a name="sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci-access-log"></a>

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Choose **Logs**, and then choose **Log groups**.

When your HyperPod clusters are integrated with Amazon CloudWatch Container Insights, you can access the relevant log groups in the following format: `/aws/containerinsights/<eks-cluster-name>/*`. Within this log group, you can find and explore various types of logs, such as performance logs, host logs, application logs, and data plane logs.

# Continuous provisioning for enhanced cluster operations on Amazon EKS
<a name="sagemaker-hyperpod-scaling-eks"></a>

Amazon SageMaker HyperPod clusters created with Amazon EKS orchestration support continuous provisioning, a capability that provides greater flexibility and efficiency when running large-scale AI/ML workloads. Continuous provisioning lets you start training quickly, scale seamlessly, perform maintenance without disrupting operations, and gain granular visibility into cluster operations. 

**Note**  
Continuous provisioning is available as an optional configuration for HyperPod clusters created with EKS orchestration. Clusters created with Slurm orchestration use a different scaling model.

## How it works
<a name="sagemaker-hyperpod-scaling-eks-how"></a>

The continuous provisioning system introduces a desired-state architecture that replaces the traditional request-based model. This new architecture enables parallel, non-blocking operations across different resource levels while maintaining system stability and performance. The continuous provisioning system:
+ **Accepts the request**: Records the target instance count for each instance group
+ **Initiates provisioning**: Begins launching instances to meet the target count
+ **Tracks progress**: Monitors each instance launch attempt and records the status
+ **Handles failures**: Automatically retries failed launches

Continuous provisioning is disabled by default. To use this feature, set `--node-provisioning-mode` to `Continuous`.

With continuous provisioning enabled, you can initiate multiple scaling operations simultaneously without waiting for previous operations to complete. This lets you scale different instance groups in the same cluster concurrently and submit multiple scaling requests to the same instance group. 

Continuous provisioning also gives you access to [DescribeClusterEvent](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeClusterEvent.html) and [ListClusterEvent](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusterEvents.html) for detailed event monitoring and operational visibility. 

## Usage metering
<a name="sagemaker-hyperpod-scaling-eks-metering"></a>

HyperPod clusters with continuous provisioning use instance-level metering to provide accurate billing that reflects actual resource usage. This metering approach differs from traditional cluster-level billing by tracking each instance independently.

**Instance-level billing**

With continuous provisioning, billing starts and stops at the individual instance level rather than waiting for cluster-level state changes. This approach provides the following benefits:
+ **Precise billing accuracy**: Billing starts when the lifecycle script execution begins. If the lifecycle script fails, instance provisioning is retried and you are charged for the duration of the lifecycle script runtime.
+ **Independent metering**: Each instance's billing lifecycle is managed separately, preventing cascading billing errors
+ **Real-time billing updates**: Billing starts when an instance begins executing its lifecycle script and stops when the instance enters a terminating state

**Billing lifecycle**

Each instance in your HyperPod cluster follows this billing lifecycle:
+ **Billing starts**: When the instance successfully launches and begins executing its lifecycle configuration script
+ **Billing continues**: Throughout the instance's operational lifetime
+ **Billing stops**: When the instance enters a terminating state, regardless of the reason for termination
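As a toy illustration of the billing window described above, the billed duration runs from the start of the lifecycle script to the moment the instance enters a terminating state. The timestamps below are hypothetical epoch seconds, not real API output.

```shell
# Hypothetical timestamps illustrating the billing window.
start=1700000000   # lifecycle configuration script began executing
stop=1700007200    # instance entered the terminating state (2 hours later)

# Billed time = stop - start, converted to hours.
awk -v s="$start" -v e="$stop" 'BEGIN {printf "billed_hours=%.1f\n", (e - s) / 3600}'
```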

**Note**  
Billing does not start for instances that fail to launch. If an instance launch fails due to insufficient capacity or other issues, you are not charged for that failed attempt. Billing is calculated at the instance level and costs are aggregated and reported under your cluster's Amazon Resource Name (ARN). 

## Create a cluster with continuous provisioning enabled
<a name="sagemaker-hyperpod-scaling-eks-create"></a>

**Note**  
You must have an existing Amazon EKS cluster configured with VPC networking and the required Helm chart installed. Additionally, prepare a lifecycle configuration script and upload it to an Amazon S3 bucket that your execution role can access. For more information, see [Managing SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-operate.md).

The following AWS CLI operation creates a HyperPod cluster with one instance group and continuous provisioning enabled.

```
aws sagemaker create-cluster \
--cluster-name $HP_CLUSTER_NAME \
--orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
--vpc-config '{
   "SecurityGroupIds": ["'$SECURITY_GROUP'"],
   "Subnets": ["'$SUBNET'"]
}' \
--instance-groups '[{
   "InstanceGroupName": "ig-1",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "s3://'$BUCKET_NAME'",
      "OnCreate": "on_create_noop.sh"
   },
   "ExecutionRole": "'$EXECUTION_ROLE'",
   "ThreadsPerCore": 1
}]' \
--node-provisioning-mode Continuous


# Expected output:
{
    "ClusterArn": "arn:aws:sagemaker:us-west-2:<account-id>:cluster/<cluster-id>"
}
```

After you’ve created your cluster, you can use [ListClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusterNodes.html) or [DescribeClusterNode](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeClusterNode.html) to find out more information about the nodes in the cluster. 

Calling these operations will return a [ClusterInstanceStatusDetails](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceStatusDetails.html) object with one of the following values: 
+  **Running**: The node is healthy and registered with the cluster orchestrator (EKS). 
+  **Failure**: The node provisioning failed but the system will automatically retry provisioning with a new EC2 instance. 
+  **Pending**: The node is being provisioned or rebooted. 
+  **ShuttingDown**: The node termination is in progress. The node will either transition to Failure status if termination encounters issues, or will be successfully removed from the cluster. 
+  **SystemUpdating**: The node is undergoing AMI patching, either triggered manually or as part of patching cronjobs. 
+  **DeepHealthCheckInProgress**: [Deep health checks (DHCs)](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md) are being conducted. This can take anywhere from a few minutes to several hours, depending on the nature of the tests. Bad nodes are replaced, and healthy nodes switch to Running. 
+  **NotFound**: Used in [BatchAddClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BatchAddClusterNodes.html) response to indicate a node has been deleted during idempotent replay. 
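One common follow-up is to pull each node's status out of a `ListClusterNodes` response and flag anything that isn't `Running`. The JSON below is invented sample output for illustration; on a live cluster you would save the output of `aws sagemaker list-cluster-nodes --cluster-name $HP_CLUSTER_NAME` instead.

```shell
# Invented sample of a ListClusterNodes response, trimmed to the fields used here.
cat <<'EOF' > /tmp/nodes.json
{"ClusterNodeSummaries": [
  {"InstanceId": "i-aaa", "InstanceStatus": {"Status": "Running"}},
  {"InstanceId": "i-bbb", "InstanceStatus": {"Status": "Pending"}},
  {"InstanceId": "i-ccc", "InstanceStatus": {"Status": "Failure"}}
]}
EOF

# List every reported status, then drop the healthy Running entries.
grep -o '"Status": "[A-Za-z]*"' /tmp/nodes.json | grep -v '"Running"'
```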

## Minimum capacity requirements (MinCount)
<a name="sagemaker-hyperpod-scaling-eks-mincount"></a>

The MinCount feature allows you to specify the minimum number of instances that must be successfully provisioned before an instance group transitions to the `InService` status. This feature provides better control over scaling operations and helps prevent scenarios where partially provisioned instance groups cannot be used effectively for training workloads.

**Important**  
MinCount is not a permanent guarantee of minimum capacity. It only ensures that the specified minimum number of instances are available when the instance group first becomes `InService`. Brief dips below MinCount may occur during normal operations such as unhealthy instance replacements or maintenance activities.

### How MinCount works
<a name="sagemaker-hyperpod-scaling-eks-mincount-how"></a>

When you create or update an instance group with MinCount enabled, the following behavior occurs:
+ **New instance groups**: The instance group remains in `Creating` status until at least MinCount instances are successfully provisioned and ready. Once this threshold is met, the instance group transitions to `InService`.
+ **Existing instance groups**: When updating MinCount on an existing instance group, the status changes to `Updating` until the new MinCount requirement is satisfied.
+ **Continuous scaling**: If TargetCount is greater than MinCount, the continuous scaling system continues attempting to launch additional instances until TargetCount is reached.
+ **Timeout and rollback**: If MinCount cannot be satisfied within 3 hours, the system automatically rolls back the instance group to its last known good state. For more information about rollback behavior, see [Automatic rollback behavior](#sagemaker-hyperpod-scaling-eks-mincount-rollback).

### Instance group status during MinCount operations
<a name="sagemaker-hyperpod-scaling-eks-mincount-status"></a>

Instance groups with MinCount configured exhibit the following status behavior:

Creating  
For new instance groups when CurrentCount < MinCount. The instance group remains in this status until the minimum capacity requirement is met.

Updating  
For existing instance groups when MinCount is modified and CurrentCount < MinCount. The instance group remains in this status until the new minimum capacity requirement is satisfied.

InService  
When MinCount ≤ CurrentCount ≤ TargetCount. The instance group is ready for use and all mutating operations are unblocked.
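The status rule above can be sketched as a small comparison (this is an illustrative model, not an actual API): an instance group stays in `Creating` or `Updating` while CurrentCount < MinCount, and becomes `InService` once the minimum is satisfied.

```shell
# Illustrative model of the MinCount status rule: args are MinCount and CurrentCount.
ig_status() {
  min=$1; current=$2
  if [ "$current" -lt "$min" ]; then
    echo "Creating/Updating"   # minimum capacity not yet met
  else
    echo "InService"           # MinCount satisfied
  fi
}

ig_status 50 42   # below MinCount
ig_status 50 60   # MinCount satisfied
```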

During `Creating` or `Updating` status, the following restrictions apply:
+ Mutating operations such as `BatchAddClusterNodes`, `BatchDeleteClusterNodes`, or `UpdateClusterSoftware` are blocked
+ You can still modify MinCount and TargetCount values to correct configuration errors
+ Cluster and instance group deletion is always permitted

### Automatic rollback behavior
<a name="sagemaker-hyperpod-scaling-eks-mincount-rollback"></a>

If an instance group cannot reach its MinCount within 3 hours, the system automatically initiates a rollback to prevent indefinite waiting:
+ **New instance groups**: MinCount and TargetCount are reset to (0, 0)
+ **Existing instance groups**: MinCount and TargetCount are restored to their values from the last `InService` state
+ **Instance selection for termination**: If instances need to be terminated during rollback, the system selects the unhealthy instances first, then those that were most recently provisioned.
+ **Status transition**: The instance group immediately transitions to `InService` status after rollback initiation, allowing the continuous scaling system to manage capacity according to the rollback settings
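The termination-selection order described above (unhealthy instances first, then the most recently provisioned) can be modeled as a two-key sort. The instance data below is invented for illustration: each line holds an instance ID, a health label, and a hypothetical provision-time epoch.

```shell
# Invented sample instances: id, health, provision-time epoch seconds.
cat <<'EOF' > /tmp/instances.txt
i-aaa healthy 100
i-bbb unhealthy 300
i-ccc healthy 250
EOF

# Termination order: "unhealthy" before "healthy" (reverse lexical on field 2),
# then newest provision time first (reverse numeric on field 3).
sort -k2,2r -k3,3nr /tmp/instances.txt
```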

The 3-hour timeout resets each time MinCount is updated. For example, if you update MinCount multiple times, the timeout period starts fresh from the most recent update.

### MinCount events
<a name="sagemaker-hyperpod-scaling-eks-mincount-events"></a>

The system emits specific events to help you track MinCount operations:
+ **Minimum capacity reached**: Emitted when an instance group successfully reaches its MinCount and transitions to `InService`
+ **Rollback initiated**: Emitted when the 3-hour timeout expires and automatic rollback begins

You can monitor these events using [ListClusterEvents](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusterEvents.html) to track the progress of your MinCount operations.

### API usage
<a name="sagemaker-hyperpod-scaling-eks-mincount-api"></a>

MinCount is specified using the `MinInstanceCount` parameter in instance group configurations:

```
aws sagemaker create-cluster \
--cluster-name $HP_CLUSTER_NAME \
--orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
--vpc-config '{
   "SecurityGroupIds": ["'$SECURITY_GROUP'"],
   "Subnets": ["'$SUBNET'"]
}' \
--instance-groups '{
   "InstanceGroupName": "worker-group",
   "InstanceType": "ml.p4d.24xlarge",
   "InstanceCount": 64,
   "MinInstanceCount": 50,
   "LifeCycleConfig": {
      "SourceS3Uri": "s3://'$BUCKET_NAME'",
      "OnCreate": "on_create.sh"
   },
   "ExecutionRole": "'$EXECUTION_ROLE'"
}' \
--node-provisioning-mode Continuous
```

Key considerations for MinCount usage:
+ `MinInstanceCount` must be between 0 and the `InstanceCount` value (inclusive) of the instance group specified in the [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) or [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) request
+ Setting `MinInstanceCount` to 0 (default) preserves standard continuous scaling behavior
+ Setting `MinInstanceCount` equal to `InstanceCount` provides all-or-nothing scaling behavior
+ MinCount is only available for clusters with `NodeProvisioningMode` set to `Continuous`
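The validation rule in the considerations above can be sketched as a simple range check (an illustrative model, not the service's actual validation code):

```shell
# Illustrative check: MinInstanceCount must satisfy 0 <= min <= InstanceCount.
valid_min() { [ "$1" -ge 0 ] && [ "$1" -le "$2" ]; }

valid_min 50 64 && echo "50/64: valid" || echo "50/64: invalid"
valid_min 70 64 && echo "70/64: valid" || echo "70/64: invalid"
valid_min 64 64 && echo "64/64: valid (all-or-nothing scaling)" || echo "64/64: invalid"
```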

## Flexible instance groups
<a name="sagemaker-hyperpod-scaling-eks-flexible-ig"></a>

Flexible instance groups allow you to specify multiple instance types within a single instance group. This simplifies cluster management by reducing the number of instance groups you need to create and manage, especially for inference workloads that use autoscaling.

With flexible instance groups, HyperPod:
+ Attempts to provision instances using the first instance type in your list
+ Falls back to subsequent instance types if capacity is unavailable
+ Terminates instances of the lowest-priority instance type first during scale-down
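The list-order fallback described above can be sketched as a loop over the instance types. The `has_capacity` stub below is invented for illustration; real capacity checks happen inside HyperPod during provisioning.

```shell
# Invented stub: pretend only ml.g6.48xlarge currently has capacity.
has_capacity() { [ "$1" = "ml.g6.48xlarge" ]; }

# Try each instance type in list order; fall back when capacity is unavailable.
for itype in ml.p5.48xlarge ml.p4d.24xlarge ml.g6.48xlarge; do
  if has_capacity "$itype"; then
    echo "provisioning $itype"
    break
  fi
  echo "no capacity for $itype, falling back"
done
```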

**Note**  
Flexible instance groups are only available for clusters with `NodeProvisioningMode` set to `Continuous`. The `InstanceType` and `InstanceRequirements` properties are mutually exclusive—you can specify one or the other, but not both.

### Create a cluster with a flexible instance group
<a name="sagemaker-hyperpod-scaling-eks-flexible-ig-create"></a>

Use `InstanceRequirements` instead of `InstanceType` to create a flexible instance group. The order of instance types in the list determines the priority for provisioning.

```
aws sagemaker create-cluster \
--cluster-name $HP_CLUSTER_NAME \
--orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
--vpc-config '{
   "SecurityGroupIds": ["'$SECURITY_GROUP'"],
   "Subnets": ["'$SUBNET_AZ1'", "'$SUBNET_AZ2'"]
}' \
--instance-groups '[{
   "InstanceGroupName": "flexible-ig",
   "InstanceRequirements": {
      "InstanceTypes": ["ml.p5.48xlarge", "ml.p4d.24xlarge", "ml.g6.48xlarge"]
   },
   "InstanceCount": 10,
   "LifeCycleConfig": {
      "SourceS3Uri": "s3://'$BUCKET_NAME'",
      "OnCreate": "on_create.sh"
   },
   "ExecutionRole": "'$EXECUTION_ROLE'"
}]' \
--node-provisioning-mode Continuous
```

### Targeted scaling with BatchAddClusterNodes
<a name="sagemaker-hyperpod-scaling-eks-flexible-ig-targeted"></a>

When using flexible instance groups, you can use [BatchAddClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BatchAddClusterNodes.html) to add nodes with specific instance types and availability zones. This is particularly useful when Karpenter autoscaling determines the optimal instance type and availability zone for your workload.

```
aws sagemaker batch-add-cluster-nodes \
--cluster-name $HP_CLUSTER_NAME \
--nodes-to-add '[
   {
      "InstanceGroupName": "flexible-ig",
      "IncrementTargetCountBy": 1,
      "InstanceTypes": ["ml.p5.48xlarge"],
      "AvailabilityZones": ["us-west-2a"]
   }
]'
```

### View flexible instance group details
<a name="sagemaker-hyperpod-scaling-eks-flexible-ig-describe"></a>

Use [DescribeCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html) to view the instance types and per-type breakdown of your flexible instance group. The response includes:
+ `InstanceRequirements` — The current and desired instance types for the instance group
+ `InstanceTypeDetails` — A per-instance-type breakdown showing the count and configuration of each instance type in the group

### Using flexible instance groups with Karpenter autoscaling
<a name="sagemaker-hyperpod-scaling-eks-flexible-ig-autoscaling"></a>

Flexible instance groups integrate with HyperPod's managed Karpenter autoscaling. For more information about setting up Karpenter, see [Autoscaling on SageMaker HyperPod EKS](sagemaker-hyperpod-eks-autoscaling.md). When you reference a flexible instance group in a `HyperPodNodeClass` configuration, Karpenter automatically:
+ Detects the supported instance types from the flexible instance group
+ Selects the optimal instance type and availability zone based on pod requirements and pricing
+ Scales the flexible instance group using targeted `BatchAddClusterNodes` calls with the selected instance type and availability zone

**Note**  
When Karpenter manages scaling, it uses its own selection logic based on pod requirements and pricing to determine which instance type to provision. This differs from the list-order priority used by HyperPod's native provisioning (such as `CreateCluster` and `UpdateCluster`), where the first instance type in the list is always attempted first.

This eliminates the need to create separate instance groups for each instance type and manually configure Karpenter to reference multiple groups.

# Autoscaling on SageMaker HyperPod EKS
<a name="sagemaker-hyperpod-eks-autoscaling"></a>

Amazon SageMaker HyperPod provides a managed, Karpenter-based node autoscaling solution for clusters created with EKS orchestration. [Karpenter](https://karpenter.sh/) is an open-source Kubernetes node lifecycle manager built by AWS that optimizes cluster scaling and cost efficiency. Unlike self-managed Karpenter deployments, SageMaker HyperPod's managed implementation eliminates the operational overhead of installing, configuring, and maintaining Karpenter controllers while providing integrated resilience and fault tolerance. This managed autoscaling solution is built on HyperPod's [continuous provisioning](sagemaker-hyperpod-scaling-eks.md) capabilities and enables you to efficiently scale compute resources for training and inference workloads with automatic failure handling and recovery. 

You pay only for what you use. You're responsible for paying for all compute instances that are automatically provisioned through autoscaling according to standard SageMaker HyperPod pricing. For detailed pricing information, see [Amazon SageMaker AI](https://aws.amazon.com/sagemaker/ai/pricing/).

By enabling Karpenter-based autoscaling with HyperPod, you have access to:
+ **Service managed lifecycle** - HyperPod handles Karpenter installation, updates, and maintenance, eliminating operational overhead.
+ **Just-in-time provisioning** - Karpenter observes your pending pods and provisions the required compute for your workloads from the on-demand pool.
+ **Scale to zero** - Scale down to zero nodes without maintaining dedicated controller infrastructure.
+ **Workload aware node selection** - Karpenter chooses optimal instance types based on pod requirements, availability zones, and pricing to minimize costs.
+ **Automatic node consolidation** - Karpenter regularly evaluates the cluster for optimization opportunities, shifting workloads to eliminate underutilized nodes.
+ **Integrated resilience** - Leverages HyperPod's built-in fault tolerance and node recovery mechanisms.

The following topics explain how to enable HyperPod autoscaling with Karpenter.

**Topics**
+ [Prerequisites](#sagemaker-hyperpod-eks-autoscaling-prereqs)
+ [Create an IAM role for HyperPod autoscaling with Karpenter](sagemaker-hyperpod-eks-autoscaling-iam.md)
+ [Create and configure a HyperPod cluster with Karpenter autoscaling](sagemaker-hyperpod-eks-autoscaling-cluster.md)
+ [Create a NodeClass](sagemaker-hyperpod-eks-autoscaling-nodeclass.md)
+ [Create a NodePool](sagemaker-hyperpod-eks-autoscaling-nodepool.md)
+ [Deploy a workload](sagemaker-hyperpod-eks-autoscaling-workload.md)

## Prerequisites
<a name="sagemaker-hyperpod-eks-autoscaling-prereqs"></a>
+ Continuous provisioning enabled on your HyperPod cluster. Enable continuous provisioning by setting `--node-provisioning-mode` to `Continuous` when creating your SageMaker HyperPod cluster. For more information, see [Continuous provisioning for enhanced cluster operations on Amazon EKS](sagemaker-hyperpod-scaling-eks.md).
+ Health Monitoring Agent version 1.0.742.0 or above installed. Required for HyperPod cluster operations and monitoring. The agent must be configured before you enable Karpenter autoscaling to ensure proper cluster health reporting and node lifecycle management. For more information, see [Health Monitoring System](sagemaker-hyperpod-eks-resiliency-health-monitoring-agent.md).
+ If your Amazon EKS cluster already has Karpenter running on it, the Karpenter `NodePool` and `NodeClaim` versions must be v1.
+ `NodeRecovery` set to automatic. For more information, see [Automatic node recovery](sagemaker-hyperpod-eks-resiliency-node-recovery.md).

# Create an IAM role for HyperPod autoscaling with Karpenter
<a name="sagemaker-hyperpod-eks-autoscaling-iam"></a>

In the following steps, you'll create an IAM role that allows SageMaker HyperPod to manage Kubernetes nodes in your cluster through Karpenter-based autoscaling. This role provides the necessary permissions for HyperPod to add and remove cluster nodes automatically based on workload demand.

**Open the IAM console**

1. Sign in to the AWS Management Console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. In the navigation pane, choose **Roles**.

1. Choose **Create role**.

**Configure the trust policy**

1. For **Trusted entity type**, choose **Custom trust policy**.

1. In the **Custom trust policy** editor, replace the default policy with the following:

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "Service": [
                       "hyperpod.sagemaker.amazonaws.com"
                   ]
               },
               "Action": "sts:AssumeRole"
           }
       ]
   }
   ```

------

1. Choose **Next**.

**Create and attach the permissions policy**

Because SageMaker HyperPod requires specific permissions that aren't available in AWS managed policies, you must create a custom policy.

1. Choose **Create policy**. This opens a new browser tab.

1. Choose the **JSON** tab.

1. Replace the default policy with the following:

------
#### [ JSON ]


   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker:BatchAddClusterNodes",
                   "sagemaker:BatchDeleteClusterNodes"
               ],
               "Resource": "arn:aws:sagemaker:*:*:cluster/*",
               "Condition": {
                   "StringEquals": {
                       "aws:ResourceAccount": "${aws:PrincipalAccount}"
                   }
               }
           },
           {
               "Effect": "Allow",
               "Action": [
                   "kms:CreateGrant",
                   "kms:DescribeKey"
               ],
               "Resource": "arn:aws:kms:*:*:key/*",
               "Condition": {
                   "StringLike": {
                       "kms:ViaService": "sagemaker.*.amazonaws.com"
                   },
                   "Bool": {
                       "kms:GrantIsForAWSResource": "true"
                   },
                   "ForAllValues:StringEquals": {
                       "kms:GrantOperations": [
                           "CreateGrant",
                           "Decrypt",
                           "DescribeKey",
                           "GenerateDataKeyWithoutPlaintext",
                           "ReEncryptTo",
                           "ReEncryptFrom",
                           "RetireGrant"
                       ]
                   }
               }
           }
       ]
   }
   ```

------

1. Choose **Next**.

1. For **Policy name**, enter **SageMakerHyperPodKarpenterPolicy**.

1. (Optional) For **Description**, enter a description for the policy.

1. Choose **Create policy**.

1. Return to the role creation tab and refresh the policy list.

1. Search for and select the **SageMakerHyperPodKarpenterPolicy** that you just created.

1. Choose **Next**.

**Name and create the role**

1. For **Role name**, enter `SageMakerHyperPodKarpenterRole`.

1. (Optional) For **Description**, enter a description for the role.

1. In the **Step 1: Select trusted entities** section, verify that the trust policy shows the correct service principals.

1. In the **Step 2: Add permissions** section, verify that `SageMakerHyperPodKarpenterPolicy` is attached.

1. Choose **Create role**.

**Record the role ARN**

After the role is created successfully:

1. In the **Roles** list, choose the role name `SageMakerHyperPodKarpenterRole`.

1. Copy the **Role ARN** from the **Summary** section. You'll need this ARN when creating your HyperPod cluster.

The role ARN follows this format: `arn:aws:iam::ACCOUNT-ID:role/SageMakerHyperPodKarpenterRole`.
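The same ARN can be assembled from your account ID, which is handy in scripts. A minimal sketch (the account ID below is a placeholder; substitute your own):

```shell
# Placeholder account ID for illustration; substitute your own
ACCOUNT_ID=123456789012
ROLE_ARN="arn:aws:iam::${ACCOUNT_ID}:role/SageMakerHyperPodKarpenterRole"
echo "$ROLE_ARN"
```

Alternatively, `aws iam get-role --role-name SageMakerHyperPodKarpenterRole --query Role.Arn --output text` returns the ARN directly.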

# Create and configure a HyperPod cluster with Karpenter autoscaling
<a name="sagemaker-hyperpod-eks-autoscaling-cluster"></a>

In the following steps, you'll create a SageMaker HyperPod cluster with continuous provisioning enabled and configure it to use Karpenter-based autoscaling.

**Create a HyperPod cluster**

1. Load your environment configuration and extract values from CloudFormation stacks.

   ```
   source .env
   SUBNET1=$(cfn-output $VPC_STACK_NAME PrivateSubnet1)
   SUBNET2=$(cfn-output $VPC_STACK_NAME PrivateSubnet2)
   SUBNET3=$(cfn-output $VPC_STACK_NAME PrivateSubnet3)
   SECURITY_GROUP=$(cfn-output $VPC_STACK_NAME NoIngressSecurityGroup)
   EKS_CLUSTER_ARN=$(cfn-output $EKS_STACK_NAME ClusterArn)
   EXECUTION_ROLE=$(cfn-output $SAGEMAKER_STACK_NAME ExecutionRole)
   SERVICE_ROLE=$(cfn-output $SAGEMAKER_STACK_NAME ServiceRole)
   BUCKET_NAME=$(cfn-output $SAGEMAKER_STACK_NAME Bucket)
   HP_CLUSTER_NAME="hyperpod-eks-test-$(date +%s)"
   EKS_CLUSTER_NAME=$(cfn-output $EKS_STACK_NAME ClusterName)
   HP_CLUSTER_ROLE=$(cfn-output $SAGEMAKER_STACK_NAME ClusterRole)
   ```

1. Upload the node initialization script to your Amazon S3 bucket.

   ```
   aws s3 cp lifecyclescripts/on_create_noop.sh s3://$BUCKET_NAME
   ```

1. Create a cluster configuration file with your environment variables.

   ```
   cat > cluster_config.json << EOF
   {
       "ClusterName": "$HP_CLUSTER_NAME",
       "InstanceGroups": [
           {
               "InstanceCount": 1,
               "InstanceGroupName": "system",
               "InstanceType": "ml.c5.xlarge",
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://$BUCKET_NAME",
                   "OnCreate": "on_create_noop.sh"
               },
               "ExecutionRole": "$EXECUTION_ROLE"
           },
           {
               "InstanceCount": 0,
               "InstanceGroupName": "auto-c5-az1",
               "InstanceType": "ml.c5.xlarge",
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://$BUCKET_NAME",
                   "OnCreate": "on_create_noop.sh"
               },
               "ExecutionRole": "$EXECUTION_ROLE"
           },
           {
               "InstanceCount": 0,
               "InstanceGroupName": "auto-c5-4xaz2",
               "InstanceType": "ml.c5.4xlarge",
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://$BUCKET_NAME",
                   "OnCreate": "on_create_noop.sh"
               },
               "ExecutionRole": "$EXECUTION_ROLE",
               "OverrideVpcConfig": {
                   "SecurityGroupIds": [
                       "$SECURITY_GROUP"
                   ],
                   "Subnets": [
                       "$SUBNET2"
                   ]
               }
           },
           {
               "InstanceCount": 0,
               "InstanceGroupName": "auto-g5-az3",
               "InstanceType": "ml.g5.xlarge",
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://$BUCKET_NAME",
                   "OnCreate": "on_create_noop.sh"
               },
               "ExecutionRole": "$EXECUTION_ROLE",
               "OverrideVpcConfig": {
                   "SecurityGroupIds": [
                       "$SECURITY_GROUP"
                   ],
                   "Subnets": [
                       "$SUBNET3"
                   ]
               }
           }
       ],
       "VpcConfig": {
           "SecurityGroupIds": [
               "$SECURITY_GROUP"
           ],
           "Subnets": [
               "$SUBNET1"
           ]
       },
       "Orchestrator": {
           "Eks": {
               "ClusterArn": "$EKS_CLUSTER_ARN"
           }
       },
       "ClusterRole": "$HP_CLUSTER_ROLE",
       "AutoScaling": {
           "Mode": "Enable",
           "AutoScalerType": "Karpenter"
       },
       "NodeProvisioningMode": "Continuous"
   }
   EOF
   ```

1. Run the following command to create your HyperPod cluster.

   ```
   aws sagemaker create-cluster --cli-input-json file://./cluster_config.json
   ```

1. The cluster creation process takes approximately 20 minutes. Monitor the cluster status until both `ClusterStatus` and `AutoScaling.Status` show `InService`.

1. Save the cluster ARN for subsequent operations.

   ```
   HP_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME \
      --output text --query ClusterArn)
   ```
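Before calling `create-cluster`, it can also help to sanity-check the generated configuration file. The following sketch validates the autoscaling and provisioning fields against a minimal stand-in file (the path and contents are illustrative; point the check at your real `cluster_config.json`):

```shell
# Minimal stand-in for cluster_config.json, used only to illustrate the check
cat > /tmp/cluster_config_check.json << 'EOF'
{
    "ClusterName": "hyperpod-eks-test",
    "AutoScaling": {"Mode": "Enable", "AutoScalerType": "Karpenter"},
    "NodeProvisioningMode": "Continuous"
}
EOF

# Fail fast if autoscaling or continuous provisioning is misconfigured
python3 - << 'EOF'
import json
cfg = json.load(open("/tmp/cluster_config_check.json"))
assert cfg["AutoScaling"] == {"Mode": "Enable", "AutoScalerType": "Karpenter"}
assert cfg["NodeProvisioningMode"] == "Continuous"
print("config OK")
EOF
```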

**Enable Karpenter autoscaling**

1. Run the following command to enable Karpenter-based autoscaling on an existing cluster that has continuous provisioning enabled.

   ```
   aws sagemaker update-cluster \
       --cluster-name $HP_CLUSTER_NAME \
       --auto-scaling Mode=Enable,AutoScalerType=Karpenter \
       --cluster-role $HP_CLUSTER_ROLE
   ```

1. Verify that Karpenter has been successfully enabled:

   ```
   aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --query 'AutoScaling'
   ```

1. Expected output:

   ```
   {
       "Mode": "Enable",
       "AutoScalerType": "Karpenter",
       "Status": "InService"
   }
   ```

Wait for the `Status` to show `InService` before proceeding to configure NodeClass and NodePool.

# Create a NodeClass
<a name="sagemaker-hyperpod-eks-autoscaling-nodeclass"></a>

**Important**  
You must start with 0 nodes in your instance group and let Karpenter handle the autoscaling. If you start with more than 0 nodes, Karpenter will scale them down to 0.

A node class (`NodeClass`) defines infrastructure-level settings that apply to groups of nodes in your Amazon EKS cluster, including network configuration, storage settings, and resource tagging. A `HyperPodNodeClass` is a custom `NodeClass` that maps to pre-created instance groups in SageMaker HyperPod, defining constraints around which instance types and Availability Zones are supported for Karpenter's autoscaling decisions.

**Considerations for creating a node class**
+ You can specify up to 10 instance groups in a `NodeClass`.
+ Instance groups that use `InstanceRequirements` (flexible instance groups) can contain multiple instance types within a single instance group. This simplifies your `NodeClass` configuration because you can reference fewer instance groups to cover the same set of instance types and Availability Zones. For example, instead of creating 6 instance groups (3 instance types × 2 AZs), you can create a single flexible instance group that covers all combinations. Note that `InstanceType` and `InstanceRequirements` are mutually exclusive—you must specify one or the other for each instance group.
+ When using GPU partitioning with MIG (Multi-Instance GPU), Karpenter can automatically provision nodes with MIG-enabled instance groups. Ensure your instance groups include MIG-supported instance types (ml.p4d.24xlarge, ml.p5.48xlarge, or ml.p5e/p5en.48xlarge) and configure the appropriate MIG labels during cluster creation. For more information about configuring GPU partitioning, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).
+ If custom labels are applied to instance groups, you can view them in the `desiredLabels` field when querying the `HyperpodNodeClass` status. This includes MIG configuration labels such as `nvidia.com/mig.config`. When incoming jobs request MIG resources, Karpenter will automatically scale instances with the appropriate MIG labels applied.
+ If you choose to delete an instance group, we recommend removing it from your `NodeClass` before deleting it from your HyperPod cluster. If an instance group is deleted while it is still referenced by a `NodeClass`, the `NodeClass` is marked as not `Ready` for provisioning and won't be used for subsequent scaling operations until the instance group is removed from the `NodeClass`.
+ When you remove instance groups from a `NodeClass`, Karpenter will detect a drift on the nodes that were managed by Karpenter in the instance group(s) and disrupt the nodes based on your disruption budget controls.
+ Subnets used by an instance group should belong to the same AZ. Subnets are specified using `OverrideVpcConfig` at the instance group level; otherwise, the cluster-level `VpcConfig` is used by default.
+ Only on-demand capacity is supported at this time. Instance groups with Training plan or reserved capacity are not supported.
+ Instance groups with deep health checks (DHC) enabled are not supported. A DHC takes around 60-90 minutes to complete, and pods remain in a pending state during that time, which can cause over-provisioning.

The following steps cover how to create a `NodeClass`.

1. Create a YAML file (for example, `nodeclass.yaml`) with your `NodeClass` configuration.

1. Apply the configuration to your cluster using `kubectl`.

1. Reference the `NodeClass` in your `NodePool` configuration.

1. Here's a sample `NodeClass` that uses the ml.c5.xlarge and ml.c5.4xlarge instance types:

   ```
   apiVersion: karpenter.sagemaker.amazonaws.com/v1
   kind: HyperpodNodeClass
   metadata:
     name: sample-nc
   spec:
     instanceGroups:
       # Names of InstanceGroups in the HyperPod cluster. Instance groups must be pre-created.
       # MaxItems: 10
       - auto-c5-az1
       - auto-c5-4xaz2
   ```

1. Apply the configuration:

   ```
   kubectl apply -f nodeclass.yaml
   ```

1. Monitor the `NodeClass` status to ensure the `Ready` condition is set to `True`:

   ```
   kubectl get hyperpodnodeclass sample-nc -o yaml
   ```

   ```
   apiVersion: karpenter.sagemaker.amazonaws.com/v1
   kind: HyperpodNodeClass
   metadata:
     creationTimestamp: "<timestamp>"
     name: sample-nc
     uid: <resource-uid>
   spec:
     instanceGroups:
     - auto-c5-az1
     - auto-c5-4xaz2
   status:
     conditions:
      # true when all IGs in the spec are present in SageMaker cluster, false otherwise
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 3
       reason: InstanceGroupReady
       status: "True"
       type: InstanceGroupReady
      # true if subnets of IGs are discoverable, false otherwise
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 3
       reason: SubnetsReady
       status: "True"
       type: SubnetsReady
      # true when all dependent resources are Ready [InstanceGroup, Subnets]
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 3
       reason: Ready
       status: "True"
       type: Ready
     instanceGroups:
     - desiredLabels:
       - key: <custom_label_key>
         value: <custom_label_value>
       - key: nvidia.com/mig.config
         value: all-1g.5gb
       instanceTypes:
       - ml.c5.xlarge
       name: auto-c5-az1
       subnets:
       - id: <subnet-id>
         zone: <availability-zone-a>
         zoneId: <zone-id-a>
     - instanceTypes:
       - ml.c5.4xlarge
       name: auto-c5-4xaz2
       subnets:
       - id: <subnet-id>
         zone: <availability-zone-b>
         zoneId: <zone-id-b>
     # Flexible instance group with multiple instance types
     - instanceTypes:
       - ml.p5.48xlarge
       - ml.p4d.24xlarge
       - ml.g6.48xlarge
       name: inference-workers
       subnets:
       - id: <subnet-id>
         zone: <availability-zone-a>
         zoneId: <zone-id-a>
       - id: <subnet-id>
         zone: <availability-zone-b>
         zoneId: <zone-id-b>
   ```

# Create a NodePool
<a name="sagemaker-hyperpod-eks-autoscaling-nodepool"></a>

The `NodePool` sets constraints on the nodes that can be created by Karpenter and the pods that can run on those nodes. The `NodePool` can be configured to do things like:
+ Limit node creation to certain zones, instance types, and computer architectures.
+ Define labels or taints to limit the pods that can run on nodes Karpenter creates.

**Note**  
The HyperPod provider supports a limited set of well-known Kubernetes and Karpenter requirements, as explained below.

The following steps cover how to create a `NodePool`.

1. Create a YAML file named `nodepool.yaml` with your desired `NodePool` configuration.

1. You can use the following sample configuration.

   ```
   apiVersion: karpenter.sh/v1
   kind: NodePool
   metadata:
     name: sample-np
   spec:
     template:
       spec:
         nodeClassRef:
           group: karpenter.sagemaker.amazonaws.com
           kind: HyperpodNodeClass
           name: sample-nc
         expireAfter: Never
         requirements:
           - key: node.kubernetes.io/instance-type
             operator: Exists
   ```

1. Apply the `NodePool` to your cluster:

   ```
   kubectl apply -f nodepool.yaml
   ```

1. Monitor the `NodePool` status to ensure the `Ready` condition in status is set to `True`:

   ```
   kubectl get nodepool sample-np -o yaml
   ```

   ```
   apiVersion: karpenter.sh/v1
   kind: NodePool
   metadata:
     name: <nodepool-name>
     uid: <resource-uid>
     ...
   spec:
     disruption:
       budgets:
       - nodes: 90%
       consolidateAfter: 0s
       consolidationPolicy: WhenEmptyOrUnderutilized
     template:
       spec:
         expireAfter: 720h
         nodeClassRef:
            group: karpenter.sagemaker.amazonaws.com
           kind: HyperpodNodeClass
           name: <nodeclass-name>
         requirements:
         - key: node.kubernetes.io/instance-type
           operator: Exists
   status:
     conditions:
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 2
       reason: ValidationSucceeded
       status: "True"
       type: ValidationSucceeded
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 2
       reason: NodeClassReady
       status: "True"
       type: NodeClassReady
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 2
       reason: Ready
       status: "True"
       type: Ready
   ```

**Supported Labels for Karpenter HyperPod Provider**

These are the optional constraints and requirements you can specify in your `NodePool` configuration.


|  Requirement Type  |  Purpose  |  Use Case/Supported Values  |  Recommendation  | 
| --- | --- | --- | --- | 
|  Instance Types (`node.kubernetes.io/instance-type`)  |  Controls which SageMaker instance types Karpenter can choose from  |  Instead of restricting to only ml.c5.xlarge, let Karpenter pick from all available types in your instance groups  |  Leave this undefined or use the `Exists` operator to give Karpenter maximum flexibility in choosing cost-effective instance types  | 
|  Availability Zones (`topology.kubernetes.io/zone`)  |  Controls which AWS availability zones nodes can be created in  |  Specific zone names like us-east-1c. Use when you need pods to run in specific zones for latency or compliance reasons  | n/a | 
|  Architecture (`kubernetes.io/arch`)  |  Specifies CPU architecture  |  Only amd64 (no ARM support currently)  |  n/a  | 

# Deploy a workload
<a name="sagemaker-hyperpod-eks-autoscaling-workload"></a>

The following examples demonstrate how HyperPod autoscaling with Karpenter automatically provisions nodes in response to workload demands. These examples show basic scaling behavior and multi-availability zone distribution patterns.

**Deploy a simple workload**

1. The following Kubernetes deployment includes pods that request 1 CPU and 256Mi of memory per replica. At this point, no pods are spun up yet.

   ```
   kubectl apply -f https://raw.githubusercontent.com/aws/karpenter-provider-aws/refs/heads/main/examples/workloads/inflate.yaml
   ```

1. To test the scale up process, run the following command. Karpenter will add new nodes to the cluster.

   ```
   kubectl scale deployment inflate --replicas 10
   ```

1. To test the scale down process, run the following command. Karpenter will remove nodes from the cluster.

   ```
   kubectl scale deployment inflate --replicas 0
   ```

**Deploy a workload across multiple AZs**

1. Run the following command to deploy a Kubernetes deployment whose pods must spread evenly across different Availability Zones with a maximum skew of 1.

   ```
   kubectl apply -f https://raw.githubusercontent.com/aws/karpenter-provider-aws/refs/heads/main/examples/workloads/spread-zone.yaml
   ```

1. Run the following command to adjust the number of pods:

   ```
   kubectl scale deployment zone-spread --replicas 15
   ```

   Karpenter will add new nodes to the cluster, with at least one node in a different Availability Zone.

For more examples, see [Karpenter example workloads](https://github.com/aws/karpenter-provider-aws/tree/main/examples/workloads) on GitHub.

# Using topology-aware scheduling in Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-topology"></a>

Data transfer efficiency is a critical factor in high-performance computing (HPC) and machine learning workloads. When using UltraServers with Amazon SageMaker HyperPod, SageMaker HyperPod automatically applies topology labels to your resources. Topology-aware scheduling helps allocate resources to minimize data transfer overheads by considering both instance topology (how resources are connected within an instance) and network topology (how instances are connected with each other). For more information about instance topology, see [Amazon EC2 instance topology](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-topology.html).

Topology-aware scheduling works with both clusters on Slurm and Amazon EKS. For general information about how topology works with Slurm, see the [Topology guide in the Slurm documentation](https://slurm.schedmd.com/topology.html).

In Amazon SageMaker HyperPod, data transfer overheads typically come from three main sources:
+ **GPU-to-GPU data transfer**: Modern technologies like NVLink and NVLink switches allow high-throughput data transfer between GPUs without involving other compute resources. This is extremely efficient but usually limited to a single instance.
+ **GPU-to-CPU data transfer**: Non-uniform memory access (NUMA) systems have multiple system buses on a single motherboard. In a typical EC2 instance architecture like p5.48xlarge, there are two different system buses, each with a CPU and 4 GPUs. For optimal performance, processes that load or read data to/from GPUs should execute on a CPU connected to the same system bus as the GPU.
+ **Network communications between instances**: Instances transfer data through a chain of network switches. The shortest path typically corresponds to the lowest latency.

## UltraServer architecture
<a name="sagemaker-hyperpod-topology-ultraserver-architecture"></a>

SageMaker HyperPod supports UltraServer architecture with p6e-gb200.36xlarge instances. An UltraServer contains up to 18 p6e-gb200.36xlarge instances, with 4 GPUs on each instance. All GPUs across all nodes are interconnected through NVLink switches, enabling data transfer between any two GPUs without using network interfaces.

This architecture provides a significant performance boost compared to individual instances. To leverage this architecture effectively, jobs should be submitted to compute nodes from a single UltraServer.

## EKS topology label
<a name="sagemaker-hyperpod-topology-eks-scheduling"></a>

In accordance with EC2 instance topology, HyperPod automatically labels your nodes with the following labels:
+ **topology.kubernetes.io/region** - the AWS Region that the node resides in.
+ **topology.kubernetes.io/zone** - the Availability Zone that the node resides in.
+ **topology.k8s.aws/network-node-layer** - NetworkNodes describes the network node set of an instance. In each network node set, the network nodes are listed in hierarchical order from top to bottom; the network node that is connected to the instance is the last one in the list. There are up to three network node layers, and each layer is tagged with a label: `topology.k8s.aws/network-node-layer-1`, `topology.k8s.aws/network-node-layer-2`, and `topology.k8s.aws/network-node-layer-3`.
+ **topology.k8s.aws/ultraserver-id** - An identifier used to label each of the instances belonging to the same NVLink domain in an UltraServer. To learn more about using UltraServers with SageMaker HyperPod, see [Using UltraServers in Amazon SageMaker HyperPod](sagemaker-hyperpod-ultraserver.md).

Using these labels, you can use topology-aware scheduling in HyperPod task governance to apply topology labels and annotations to optimize training efficiency of your workloads. For more information, see [Using topology-aware scheduling in Amazon SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-tasks-scheduling.md).
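As a sketch, standard Kubernetes scheduling constraints can also consume these labels directly. For example, a pod topology spread constraint that balances pods across lowest-layer network nodes might look like the following (field values are illustrative):

```
apiVersion: v1
kind: Pod
metadata:
  name: topology-aware-pod
  labels:
    app: training
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.k8s.aws/network-node-layer-3
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: training
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:stable
      command: ["sleep", "3600"]
```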

## Slurm network topology plugins
<a name="sagemaker-hyperpod-topology-slurm-plugins"></a>

Slurm provides built-in plugins for network topology awareness. SageMaker HyperPod automatically selects and configures the appropriate topology plugin based on the instance types in your cluster.

### Automatic topology selection
<a name="sagemaker-hyperpod-topology-auto-selection"></a>

When you create a HyperPod Slurm cluster, the system inspects all instance groups and their associated instance types, identifies the GPU communication characteristics of each instance type, and configures Slurm with the appropriate topology plugin. This process runs automatically and does not require any configuration.

HyperPod manages topology through a dynamically generated `topology.conf` file. As the cluster evolves through scaling operations or node replacements, HyperPod continuously reconciles the topology configuration to reflect the current cluster state. For more information, see [Dynamic topology updates](#sagemaker-hyperpod-topology-dynamic-updates).

### Using the topology/tree plugin
<a name="sagemaker-hyperpod-topology-tree"></a>

The `topology/tree` plugin models hierarchical communication structures with multiple bandwidth tiers. Tree topology enables Slurm to place jobs in a way that minimizes cross-tier communication and maximizes locality.

Tree topology is used for instance types with hierarchical interconnects, where distributed training workloads benefit from locality-aware placement. This includes instance types such as `ml.p5.48xlarge`, `ml.p5e.48xlarge`, and `ml.p5en.48xlarge`.

SageMaker HyperPod automatically configures the `topology/tree` plugin when your cluster uses these instance types. The generated `topology.conf` maps nodes into a switch hierarchy that reflects the communication tiers of your hardware.

Ensure your `slurm.conf` includes:

```
TopologyPlugin=topology/tree
```

#### Configuration
<a name="sagemaker-hyperpod-topology-tree-config"></a>

SageMaker HyperPod automatically configures the `topology/tree` plugin based on information provided by Amazon EC2. For more details about Amazon EC2 topology, see [Amazon EC2 instance topology](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-topology.html).

When the `topology/tree` plugin is used, the Slurm `topology.conf` looks like the following:

```
SwitchName=nn-6fe9d8a965d34d181 Switches=nn-0b53107754517bf0e

SwitchName=nn-0b53107754517bf0e Switches=nn-424c855d4ad825aa4,nn-95acd7c656329fc30

SwitchName=nn-424c855d4ad825aa4 Nodes=ip-10-1-111-198
SwitchName=nn-95acd7c656329fc30 Nodes=ip-10-1-53-231
```

#### Usage
<a name="sagemaker-hyperpod-topology-tree-usage"></a>

When the `topology/tree` plugin is configured, Slurm tries to allocate machines that are close to each other. You can force Slurm to allocate machines on a single switch by passing the `--switch` command line parameter to `sbatch` or `srun`:

```
sbatch --switch=1 ....
```

### Using the topology/block plugin
<a name="sagemaker-hyperpod-topology-block"></a>

NVIDIA developed a `topology/block` plugin that provides hierarchical scheduling across blocks of nodes with the following characteristics:
+ A block is a consecutive range of nodes
+ Blocks cannot overlap with each other
+ All nodes in a block are allocated to a job before the next block is used
+ The planning block size is the smallest block size configured
+ Every higher block level size is a power-of-two multiple of the previous one

This plugin allocates nodes based on the defined network topology.

Block topology models uniform, high-bandwidth communication domains where all GPUs participate in a single high-speed domain with near-uniform latency. Block topology treats all nodes as part of a single cohesive communication unit. UltraServer architecture in SageMaker HyperPod supports the block plugin.

Block topology is used for instance types such as `ml.p6e-gb200.NVL72` and `ml.p6e-gb300.NVL72`.

#### Configuration
<a name="sagemaker-hyperpod-topology-block-config"></a>

SageMaker HyperPod automatically configures the `topology/block` plugin. If you want to configure the plugin manually, specify the following in the `topology.conf` file in your Slurm configuration directory:

```
BlockName=us1 Nodes=ultraserver1-[0-17]
BlockName=us2 Nodes=ultraserver2-[0-17]
BlockSizes=18
```

Ensure your `slurm.conf` includes:

```
TopologyPlugin=topology/block
```

#### Usage
<a name="sagemaker-hyperpod-topology-block-usage"></a>

When submitting jobs, you can use the following additional arguments with `sbatch` and `srun` commands:
+ `--segment=N`: Specify the number of nodes to group together. The size of the segment must be less than or equal to the planning block size.
+ `--exclusive=topo`: Request that no other jobs be placed on the same block. This is useful for benchmarking and performance-sensitive applications.

The following are sample scenarios you might consider when thinking about allocating blocks.

**Allocate a whole block of nodes on an empty system**

```
sbatch -N18
```

**Allocate two blocks of nodes on an empty system**

```
sbatch -N36
```

**Allocate 18 nodes on one block and 6 nodes on another block**

```
sbatch -N24
```

**Allocate 12 nodes on one block and 12 nodes on another block**

```
sbatch -N24 --segment=12
```

**With --exclusive=topo, job must be placed on block with no other jobs**

```
sbatch -N12 --exclusive=topo
```

### Topology selection for clusters with mixed instance types
<a name="sagemaker-hyperpod-topology-mixed-instances"></a>

HyperPod currently uses Slurm 24.11, which supports only a single topology configuration per cluster. This means that per-partition topology selection is not supported, multiple topology models cannot coexist within a single cluster, and all nodes must conform to a single topology definition.

When your cluster contains multiple instance types, HyperPod selects a topology that is compatible across all of them. The following table shows an example of how HyperPod resolves topology for a cluster with mixed instance types.


| Instance group | Instance type | Preferred topology | 
| --- | --- | --- | 
| IG-1 | ml.p5.48xlarge | Tree | 
| IG-2 | ml.p6e-gb300.NVL72 | Block | 

In this example, block topology is optimal for ml.p6e-gb300.NVL72, but tree topology is compatible with both ml.p5.48xlarge and ml.p6e-gb300.NVL72. HyperPod selects tree topology as the cluster-wide topology to ensure that all nodes can participate in scheduling correctly and no instance type is excluded or misrepresented.

**Important**  
For workloads where topology-aware scheduling is critical to performance, we recommend creating separate clusters for each instance type rather than combining different instance types in a single cluster. This ensures that each cluster uses the optimal topology for its hardware, delivering the best possible workload performance. For example, instead of combining ml.p5.48xlarge and ml.p6e-gb300.NVL72 instances in a single cluster where tree topology is selected as a compatible compromise, create a dedicated cluster for each instance type so that each uses its ideal topology model.

### Disable or change topology plugin
<a name="sagemaker-hyperpod-topology-disable-change"></a>

When a Slurm cluster is created, HyperPod automatically selects the optimal topology plugin. To manually change the topology plugin, update the `TopologyPlugin` value in `slurm.conf` on the controller node.

Example:

```
# Set this value to disable topology plugin
TopologyPlugin=topology/flat
```

### Dynamic topology updates
<a name="sagemaker-hyperpod-topology-dynamic-updates"></a>

Topology-aware scheduling continuously maintains topology correctness as your cluster changes. The topology is automatically recalculated and the `topology.conf` file is regenerated when any of the following events occur:
+ **Scale-up**: New nodes are added to the cluster.
+ **Scale-down**: Nodes are removed from the cluster.
+ **Node replacement**: Failed or unhealthy nodes are replaced, or nodes are manually replaced using the [BatchReplaceClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BatchReplaceClusterNodes.html) API.

When the topology is updated, new nodes are incorporated into the correct topology structure, removed nodes are pruned, and the Slurm configuration is updated without requiring manual intervention. This ensures that the topology always reflects the actual cluster state.

**Note**  
Advanced users can override the topology behavior by logging into the Slurm controller node and manually modifying `slurm.conf` and `topology.conf`. However, manual changes may be overwritten by HyperPod during subsequent cluster updates, including scaling operations, node replacements, and other cluster lifecycle events. If you modify these files manually, verify your changes after any cluster update.

## Best practices for UltraServer topology
<a name="sagemaker-hyperpod-topology-best-practices"></a>

For optimal performance with UltraServer architecture in SageMaker HyperPod:
+ **Set appropriate block sizes**: Configure `BlockSizes=18` (or 17 if one node is spare) to match the UltraServer architecture.
+ **Use segments for better availability**: Use `--segment=16`, `--segment=8`, or `--segment=9` with `srun` and `sbatch` commands to improve job scheduling flexibility.
+ **Consider job size and segment size**:
  + If `BlockSizes=18`, jobs with up to 18 instances will always run on a single UltraServer.
  + If `BlockSizes=16`, jobs with fewer than 16 instances will always run on a single UltraServer, while jobs with 18 instances may run on one or two UltraServers.

When thinking about segment sizes, consider the following:
+ With `--segment=1`, each instance can run on a separate UltraServer.
+ With `-N 18 --segment 9`, 9 nodes will be placed on one UltraServer, and another 9 nodes can be placed on the same or another UltraServer.
+ With `-N 24 --segment 8`, the job can run on 2 or 3 UltraServers, with every 8 nodes placed together on the same server.
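For example, a batch script for the `-N 18 --segment 9` case above might look like the following sketch (the job name and command are placeholders):

```
#!/bin/bash
#SBATCH --job-name=segment-demo
#SBATCH -N 18            # 18 nodes total
#SBATCH --segment=9      # placed as two contiguous 9-node segments

# Each 9-node segment lands together on one UltraServer; the two segments
# may share an UltraServer or use two.
srun hostname
```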

## Limitations in SageMaker HyperPod topology-aware scheduling
<a name="sagemaker-hyperpod-topology-limitations"></a>

The `topology/block` plugin has limitations with heterogeneous clusters (clusters with different instance types):
+ Only nodes listed in blocks are schedulable by Slurm.
+ Every block must have at least `BlockSizes[0]` nodes.

For heterogeneous clusters, consider these alternatives:
+ Do not use the block plugin with heterogeneous clusters. Instead, isolate UltraServer nodes in a different partition.
+ Create a separate cluster with UltraServers only in the same VPC and use Slurm's multicluster setup.

# Deploying models on Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-model-deployment"></a>

Amazon SageMaker HyperPod now extends beyond training to deliver a comprehensive inference platform that combines the flexibility of Kubernetes with the operational excellence of AWS managed services. Deploy, scale, and optimize your machine learning models with enterprise-grade reliability using the same HyperPod compute throughout the entire model lifecycle.

Amazon SageMaker HyperPod offers flexible deployment interfaces that allow you to deploy models through multiple methods including kubectl, Python SDK, Amazon SageMaker Studio UI, or HyperPod CLI. The service provides advanced autoscaling capabilities with dynamic resource allocation that automatically adjusts based on demand. Additionally, it includes comprehensive observability and monitoring features that track critical metrics such as time-to-first-token, latency, and GPU utilization to help you optimize performance.

**Note**  
When deploying on GPU-enabled instances, you can use GPU partitioning with Multi-Instance GPU (MIG) technology to run multiple inference workloads on a single GPU. This allows for better GPU utilization and cost optimization. For more information about configuring GPU partitioning, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).

**Unified infrastructure for training and inference**

Maximize your GPU utilization by seamlessly transitioning compute resources between training and inference workloads. This reduces the total cost of ownership while maintaining operational continuity.

**Enterprise-ready deployment options**

Deploy models from multiple sources including open-weights and gated models from Amazon SageMaker JumpStart and custom models from Amazon S3 and Amazon FSx with support for both single-node and multi-node inference architectures.

**Managed tiered Key-value (KV) caching and intelligent routing**

KV caching saves the precomputed key-value vectors after processing previous tokens. When the next token is processed, the vectors don't need to be recalculated. Through a two-tier caching architecture, you can configure an L1 cache that uses CPU memory for low-latency local reuse, and an L2 cache that leverages Redis to enable scalable, node-level cache sharing.

Intelligent routing analyzes incoming requests and directs them to the inference instance most likely to have relevant cached key-value pairs. The system examines the request and then routes it based on one of the following routing strategies:

1. `prefixaware` — Subsequent requests with the same prompt prefix are routed to the same instance.

1. `kvaware` — Incoming requests are routed to the instance with the highest KV cache hit rate.

1. `session` — Requests from the same user session are routed to the same instance.

1. `roundrobin` — Even distribution of requests without considering the state of the KV cache.
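As a toy illustration of the prefix-aware idea, a request's prompt prefix can be hashed onto a fixed set of instances so that identical prefixes always land on the same one. The instance names below are hypothetical; the actual router is managed by HyperPod:

```
# Toy prefix-aware picker: hash the prompt prefix onto one of three
# hypothetical instances so identical prefixes reuse the same KV cache.
instances="inst-0 inst-1 inst-2"
prefix="You are a helpful assistant."

h=$(printf '%s' "$prefix" | cksum | cut -d' ' -f1)   # stable checksum of the prefix
set -- $instances
shift $(( h % 3 ))
echo "Route to: $1"
```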

For more information on how to enable this feature, see [Configure KV caching and intelligent routing for improved performance](sagemaker-hyperpod-model-deployment-deploy-ftm.md#sagemaker-hyperpod-model-deployment-deploy-ftm-cache-route).

**Built-in tiered storage support for the L2 KV cache**

Building on the existing KV cache infrastructure, HyperPod integrates SageMaker managed tiered storage as an additional L2 backend option alongside Redis. This provides a more scalable and efficient option for cache offloading, which is particularly beneficial for high-throughput LLM inference workloads, while maintaining compatibility with existing vLLM model servers and routing capabilities.

**Note**  
**Data encryption:** KV cache data (attention keys and values) is stored unencrypted at rest to optimize inference latency and improve performance. For workloads with strict encryption-at-rest requirements, consider application-layer encryption of prompts and responses, or disable caching.  
**Data isolation:** When using managed tiered storage as the L2 cache backend, multiple inference deployments within a cluster share cache storage with no isolation. L2 KV cache data (attention keys and values) from different deployments is not separated. For workloads requiring data isolation (multi-tenant scenarios, different data classification levels), deploy to separate clusters or use dedicated Redis instances.

**Multi-instance type deployment with automatic failover**

HyperPod Inference supports multi-instance type deployment to improve deployment reliability and resource utilization. Specify a prioritized list of instance types in your deployment configuration, and the system automatically selects from available alternatives when your preferred instance type lacks capacity. The Kubernetes scheduler uses `preferredDuringSchedulingIgnoredDuringExecution` node affinity to evaluate instance types in priority order, placing workloads on the highest-priority available instance type while ensuring deployment even when preferred resources are unavailable. This capability prevents deployment failures due to capacity constraints while maintaining your cost and performance preferences, ensuring continuous service availability even during cluster capacity fluctuations.

**Custom node affinity for granular scheduling control**

HyperPod Inference supports custom node affinity to control workload placement beyond instance type selection. Specify node selection criteria such as availability zone distribution, capacity type filtering (on-demand vs. spot), or custom node labels through the `nodeAffinity` field. The system supports mandatory placement constraints using `requiredDuringSchedulingIgnoredDuringExecution` and optional preferences through `preferredDuringSchedulingIgnoredDuringExecution`, providing full control over pod scheduling decisions while maintaining deployment flexibility.

**Note**  
We collect certain routine operational metrics to provide essential service availability. The creation of these metrics is fully automated and does not involve human review of the underlying model inference workload. These metrics relate to deployment operations, resource management, and endpoint registration.

**Topics**
+ [Setting up your HyperPod clusters for model deployment](sagemaker-hyperpod-model-deployment-setup.md)
+ [Deploy foundation models and custom fine-tuned models](sagemaker-hyperpod-model-deployment-deploy.md)
+ [Autoscaling policies for your HyperPod inference model deployment](sagemaker-hyperpod-model-deployment-autoscaling.md)
+ [Implementing inference observability on HyperPod clusters](sagemaker-hyperpod-model-deployment-observability.md)
+ [Task governance for model deployment on HyperPod](sagemaker-hyperpod-model-deployment-task-gov.md)
+ [HyperPod inference troubleshooting](sagemaker-hyperpod-model-deployment-ts.md)
+ [Amazon SageMaker HyperPod Inference release notes](sagemaker-hyperpod-inference-release-notes.md)

# Setting up your HyperPod clusters for model deployment
<a name="sagemaker-hyperpod-model-deployment-setup"></a>

This guide shows you how to enable inference capabilities on Amazon SageMaker HyperPod clusters. You'll set up the infrastructure, permissions, and operators that machine learning engineers need to deploy and manage inference endpoints.

**Note**  
To create a cluster with the inference operator pre-installed, see [Create an EKS-orchestrated SageMaker HyperPod cluster](sagemaker-hyperpod-quickstart.md#sagemaker-hyperpod-quickstart-eks). To install the inference operator on an existing cluster, continue with the following procedures.

You can install the inference operator using the SageMaker AI console for a streamlined experience, or use the AWS CLI for more control. This guide covers both installation methods.

## Method 1: Install HyperPod Inference Add-on through SageMaker AI console (Recommended)
<a name="sagemaker-hyperpod-model-deployment-setup-ui"></a>

The SageMaker AI console provides the most streamlined experience with two installation options:
+ **Quick Install:** Automatically creates all required resources with optimized defaults, including IAM roles, Amazon S3 buckets, and dependency add-ons. A new Studio domain will be created with required permissions to deploy a JumpStart model to the relevant cluster. This option is ideal for getting started quickly with minimal configuration decisions.
+ **Custom Install:** Provides flexibility to specify existing resources or customize configurations while maintaining the one-click experience. You can choose to reuse existing IAM roles, Amazon S3 buckets, or dependency add-ons based on your organizational requirements.

### Prerequisites
<a name="sagemaker-hyperpod-model-deployment-setup-ui-prereqs"></a>
+ An existing HyperPod cluster with Amazon EKS orchestration
+ IAM permissions for Amazon EKS cluster administration
+ kubectl configured for cluster access

### Installation steps
<a name="sagemaker-hyperpod-model-deployment-setup-ui-steps"></a>

1. Navigate to the SageMaker AI console and go to **HyperPod Clusters** → **Cluster Management**.

1. Select your cluster where you want to install the Inference Operator.

1. Navigate to the **Inference** tab. Select **Quick Install** for automated setup or **Custom Install** for configuration flexibility.

1. If choosing Custom Install, specify existing resources or customize settings as needed.

1. Click **Install** to begin the automated installation process.

1. Verify the installation status through the console, or by running the following commands:

   ```
   kubectl get pods -n hyperpod-inference-system
   ```

   ```
   aws eks describe-addon --cluster-name CLUSTER-NAME --addon-name amazon-sagemaker-hyperpod-inference --region REGION
   ```

After the add-on is successfully installed, you can deploy models using the model deployment documentation or navigate to [Verify the inference operator is working](#sagemaker-hyperpod-model-deployment-setup-verify).

## Method 2: Installing the Inference Operator using the AWS CLI
<a name="sagemaker-hyperpod-model-deployment-setup-addon"></a>

The AWS CLI installation method provides more control over the installation process and is suitable for automation and advanced configurations.

### Prerequisites
<a name="sagemaker-hyperpod-model-deployment-setup-prereq-addon"></a>

The inference operator enables deployment and management of machine learning inference endpoints on your Amazon EKS cluster. Before installation, ensure your cluster has the required security configurations and supporting infrastructure. Complete these steps to configure IAM roles, install the AWS Load Balancer Controller, set up Amazon S3 and Amazon FSx CSI drivers, and deploy KEDA and cert-manager:

1. [Connect to your cluster and set up environment variables](#sagemaker-hyperpod-model-deployment-setup-connect-addon)

1. [Configure IAM roles for inference operator](#sagemaker-hyperpod-model-deployment-setup-prepare-addon)

1. [Create the ALB Controller role](#sagemaker-hyperpod-model-deployment-setup-alb-addon)

1. [Create the KEDA operator role](#sagemaker-hyperpod-model-deployment-setup-keda-addon)

1. [Install the dependency EKS Add-Ons](#sagemaker-hyperpod-model-deployment-setup-install-dependencies)

**Note**  
Alternatively, you can use CloudFormation templates to automate the prerequisite setup. For more information, see [Using CloudFormation templates to create the prerequisite stack](#sagemaker-hyperpod-model-deployment-setup-cfn).

### Connect to your cluster and set up environment variables
<a name="sagemaker-hyperpod-model-deployment-setup-connect-addon"></a>

Before proceeding, verify that your AWS credentials are properly configured and have the necessary permissions. Run the following steps using an IAM principal with Administrator privileges and Cluster Admin access to an Amazon EKS cluster. Ensure that you've created a HyperPod cluster by following [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md). Install the helm, eksctl, and kubectl command line utilities.

For Kubernetes administrative access to the Amazon EKS cluster, open the Amazon EKS console and select your cluster. In the **Access** tab, select **IAM Access Entries**. If no entry exists for your IAM principal, select **Create Access Entry**. Select the desired IAM principal and associate the `AmazonEKSClusterAdminPolicy` with it.

1. Configure kubectl to connect to the newly created HyperPod cluster orchestrated by Amazon EKS. Specify the Region and the HyperPod cluster name.

   ```
   export HYPERPOD_CLUSTER_NAME=<hyperpod-cluster-name>
   export REGION=<region>
   
   # S3 bucket where tls certificates will be uploaded
   export BUCKET_NAME="hyperpod-tls-<your-bucket-suffix>" # Bucket should have prefix: hyperpod-tls-*
   
   export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
   --query 'Orchestrator.Eks.ClusterArn' --output text | \
   cut -d'/' -f2)
   aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
   ```
**Note**  
If using a custom bucket name that doesn't start with `hyperpod-tls-`, attach the following policy to your execution role:  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "TLSBucketDeleteObjectsPermission",
               "Effect": "Allow",
               "Action": ["s3:DeleteObject"],
               "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"],
               "Condition": {
                   "StringEquals": {
                       "aws:ResourceAccount": "${aws:PrincipalAccount}"
                   }
               }
           },
           {
               "Sid": "TLSBucketGetObjectAccess",
               "Effect": "Allow",
               "Action": ["s3:GetObject"],
               "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"]
           },
           {
               "Sid": "TLSBucketPutObjectAccess",
               "Effect": "Allow",
               "Action": ["s3:PutObject", "s3:PutObjectTagging"],
               "Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"],
               "Condition": {
                   "StringEquals": {
                       "aws:ResourceAccount": "${aws:PrincipalAccount}"
                   }
               }
           }
       ]
   }
   ```
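   If you template this policy from a shell, note that it mixes a shell placeholder (`${BUCKET_NAME}`) with an IAM policy variable (`${aws:PrincipalAccount}`) that must reach IAM untouched. One way to substitute only the former is a targeted `sed` replacement, shown here with a hypothetical bucket name:

   ```
   export BUCKET_NAME="hyperpod-tls-example"   # hypothetical name
   # Replace only the ${BUCKET_NAME} placeholder; the IAM policy variable
   # ${aws:PrincipalAccount} passes through unchanged.
   printf '"Resource": ["arn:aws:s3:::${BUCKET_NAME}/*"], "Account": "${aws:PrincipalAccount}"\n' \
     | sed "s|\${BUCKET_NAME}|$BUCKET_NAME|g"
   ```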

1. Set default env variables.

   ```
   HYPERPOD_INFERENCE_ROLE_NAME="SageMakerHyperPodInference-$HYPERPOD_CLUSTER_NAME"
   HYPERPOD_INFERENCE_NAMESPACE="hyperpod-inference-system"
   ```

1. Verify connectivity to the Amazon EKS cluster by listing all pods across namespaces.

   ```
   kubectl get pods --all-namespaces
   ```
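   The `cut -d'/' -f2` in the setup step works because an EKS cluster ARN ends in `cluster/<name>`. A quick local illustration with a hypothetical ARN:

   ```
   # Hypothetical account ID and cluster name - the name is the token after 'cluster/'
   ARN="arn:aws:eks:us-west-2:111122223333:cluster/my-eks-cluster"
   echo "$ARN" | cut -d'/' -f2
   ```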

1. (Optional) Install the NVIDIA device plugin to enable GPU support on the cluster.

   ```
   # Install nvidia device plugin
   kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
   # Verify that GPUs are visible to k8s
   kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu
   ```

### Configure IAM roles for inference operator
<a name="sagemaker-hyperpod-model-deployment-setup-prepare-addon"></a>

1. Gather essential AWS resource identifiers and ARNs required for configuring service integrations between Amazon EKS, SageMaker AI, and IAM components.

   ```
   export ACCOUNT_ID=$(aws --region $REGION sts get-caller-identity --query 'Account' --output text)
   export OIDC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
   export EKS_CLUSTER_ROLE=$(aws eks --region $REGION describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.roleArn' --output text)
   ```
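   The `cut -d '/' -f 5` works because the OIDC issuer URL has the form `https://oidc.eks.<region>.amazonaws.com/id/<OIDC_ID>`; splitting on `/` makes the ID the fifth field (the empty field after `https:` counts as the second). A local check with a hypothetical issuer URL:

   ```
   ISSUER="https://oidc.eks.us-west-2.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
   echo "$ISSUER" | cut -d '/' -f 5
   ```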

1. Associate an IAM OIDC identity provider with your EKS cluster.

   ```
   eksctl utils associate-iam-oidc-provider --region=$REGION --cluster=$EKS_CLUSTER_NAME --approve
   ```

1. Create the trust policy required for the HyperPod inference operator IAM role. This policy enables secure cross-service communication between Amazon EKS, SageMaker AI, and other AWS services.

   ```
   # Create trust policy JSON
   cat << EOF > trust-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Principal": {
               "Service": [
                   "sagemaker.amazonaws.com"
               ]
           },
           "Action": "sts:AssumeRole"
       },
       {
           "Effect": "Allow",
           "Principal": {
               "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
               "StringLike": {
                   "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com",
                   "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:hyperpod-inference-system:hyperpod-inference-controller-manager"
               }
           }
       }
   ]
   }
   EOF
   ```

1. Create the execution role for the inference operator.

   ```
   aws iam create-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --assume-role-policy-document file://trust-policy.json
   aws iam attach-role-policy --role-name $HYPERPOD_INFERENCE_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodInferenceAccess
   ```

1. Create a namespace for inference operator resources.

   ```
   kubectl create namespace $HYPERPOD_INFERENCE_NAMESPACE
   ```

### Create the ALB Controller role
<a name="sagemaker-hyperpod-model-deployment-setup-alb-addon"></a>

1. Create the trust policy and permissions policy.

   ```
   # Create trust policy
   cat <<EOF > /tmp/alb-trust-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Principal": {
               "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
               "StringLike": {
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:hyperpod-inference-system:aws-load-balancer-controller",
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
               }
           }
       }
   ]
   }
   EOF
   
   # Create permissions policy
   export ALBController_IAM_POLICY_NAME=HyperPodInferenceALBControllerIAMPolicy
   curl -o AWSLoadBalancerControllerIAMPolicy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.0/docs/install/iam_policy.json
   
   # Create the role
   aws iam create-role \
       --role-name alb-role \
       --assume-role-policy-document file:///tmp/alb-trust-policy.json 
   
   # Create the policy
   ALB_POLICY_ARN=$(aws iam create-policy \
       --policy-name $ALBController_IAM_POLICY_NAME \
       --policy-document file://AWSLoadBalancerControllerIAMPolicy.json \
       --query 'Policy.Arn' \
       --output text)
   
   # Attach the policy to the role
   aws iam attach-role-policy \
       --role-name alb-role \
       --policy-arn $ALB_POLICY_ARN
   ```

1. Apply the tag (`kubernetes.io/role/elb`) to all subnets in the Amazon EKS cluster (both public and private).

   ```
   export VPC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.resourcesVpcConfig.vpcId' --output text)
   
   # Add Tags
   aws ec2 describe-subnets \
   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
   --query 'Subnets[*].SubnetId' --output text | \
   tr '\t' '\n' | \
   xargs -I{} aws ec2 create-tags --resources {} --tags Key=kubernetes.io/role/elb,Value=1
   
   # Verify Tags are added
   aws ec2 describe-subnets \
   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
   --query 'Subnets[*].SubnetId' --output text | \
   tr '\t' '\n' |
   xargs -n1 -I{} aws ec2 describe-tags --filters "Name=resource-id,Values={}" "Name=key,Values=kubernetes.io/role/elb" --query "Tags[0].Value" --output text
   ```
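   The pipeline works because `describe-subnets` emits the IDs tab-separated; `tr` splits them one per line so `xargs` can tag each subnet individually. A local illustration with hypothetical subnet IDs:

   ```
   # tr splits the tab-separated IDs one per line for xargs
   printf 'subnet-0abc\tsubnet-0def\n' | tr '\t' '\n' | \
       xargs -I{} echo "tagging {}"
   ```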

1. Create an Amazon S3 VPC endpoint.

   ```
   aws ec2 create-vpc-endpoint \
       --region ${REGION} \
       --vpc-id ${VPC_ID} \
       --vpc-endpoint-type Gateway \
       --service-name "com.amazonaws.${REGION}.s3" \
       --route-table-ids $(aws ec2 describe-route-tables --region $REGION --filters "Name=vpc-id,Values=${VPC_ID}" --query 'RouteTables[].Associations[].RouteTableId' --output text | tr ' ' '\n' | sort -u | tr '\n' ' ')
   ```
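   The `--route-table-ids` argument deduplicates the route table IDs returned per subnet association; the same space-to-newline, `sort -u` trick can be checked locally with hypothetical IDs:

   ```
   # Duplicate IDs collapse to a unique, space-separated list
   printf 'rtb-0aaa rtb-0bbb rtb-0aaa\n' | tr ' ' '\n' | sort -u | tr '\n' ' '
   ```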

### Create the KEDA operator role
<a name="sagemaker-hyperpod-model-deployment-setup-keda-addon"></a>

1. Create the trust policy and permissions policy.

   ```
   # Create trust policy
   cat <<EOF > /tmp/keda-trust-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Principal": {
               "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
               "StringLike": {
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:hyperpod-inference-system:keda-operator",
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
               }
           }
       }
   ]
   }
   EOF
   
   # Create permissions policy
   cat <<EOF > /tmp/keda-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Action": [
               "cloudwatch:GetMetricData",
               "cloudwatch:GetMetricStatistics",
               "cloudwatch:ListMetrics"
           ],
           "Resource": "*"
       },
       {
           "Effect": "Allow",
           "Action": [
               "aps:QueryMetrics",
               "aps:GetLabels",
               "aps:GetSeries",
               "aps:GetMetricMetadata"
           ],
           "Resource": "*"
       }
   ]
   }
   EOF
   
   # Create the role
   aws iam create-role \
       --role-name keda-operator-role \
       --assume-role-policy-document file:///tmp/keda-trust-policy.json
   
   # Create the policy
   KEDA_POLICY_ARN=$(aws iam create-policy \
       --policy-name KedaOperatorPolicy \
       --policy-document file:///tmp/keda-policy.json \
       --query 'Policy.Arn' \
       --output text)
   
   # Attach the policy to the role
   aws iam attach-role-policy \
       --role-name keda-operator-role \
       --policy-arn $KEDA_POLICY_ARN
   ```

1. If you're using gated models, create an IAM role to access the gated models.

   1. Create an IAM policy.

      ```
      JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-${REGION}-${HYPERPOD_CLUSTER_NAME}"
      
      cat <<EOF > /tmp/trust-policy.json
      {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": {
                  "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
              },
              "Action": "sts:AssumeRoleWithWebIdentity",
              "Condition": {
                  "StringLike": {
                      "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:*:hyperpod-inference-service-account*",
                      "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
                  }
              }
          },
              {
              "Effect": "Allow",
              "Principal": {
                  "Service": "sagemaker.amazonaws.com"
              },
              "Action": "sts:AssumeRole"
          }
      ]
      }
      EOF
      ```

   1. Create an IAM role.

      ```
      # Create the role using existing trust policy
      aws iam create-role \
      --role-name $JUMPSTART_GATED_ROLE_NAME \
      --assume-role-policy-document file:///tmp/trust-policy.json
      
      aws iam attach-role-policy \
      --role-name $JUMPSTART_GATED_ROLE_NAME \
      --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodGatedModelAccess
      ```

      ```
      JUMPSTART_GATED_ROLE_ARN=$(aws iam get-role --role-name $JUMPSTART_GATED_ROLE_NAME --query "Role.Arn" --output text)
      echo $JUMPSTART_GATED_ROLE_ARN
      ```

### Install the dependency EKS Add-Ons
<a name="sagemaker-hyperpod-model-deployment-setup-install-dependencies"></a>

Before installing the inference operator, you must install the following required EKS add-ons on your cluster. The inference operator will fail to install if any of these dependencies are missing. Each add-on has a minimum version requirement for compatibility with the Inference add-on.

**Important**  
Install all dependency add-ons before attempting to install the inference operator. Missing dependencies will cause installation failures with specific error messages.

#### Required Add-ons
<a name="sagemaker-hyperpod-model-deployment-setup-required-addons"></a>

1. **Amazon S3 Mountpoint CSI Driver** (minimum version: v1.14.1-eksbuild.1)

   Required for mounting S3 buckets as persistent volumes in inference workloads.

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name aws-mountpoint-s3-csi-driver \
       --region $REGION \
       --service-account-role-arn $S3_CSI_ROLE_ARN
   ```

   For detailed installation instructions including required IAM permissions, see [Mountpoint for Amazon S3 CSI driver](https://docs.aws.amazon.com/eks/latest/userguide/workloads-add-ons-available-eks.html#mountpoint-for-s3-add-on).

1. **Amazon FSx CSI Driver** (minimum version: v1.6.0-eksbuild.1)

   Required for mounting FSx file systems for high-performance model storage.

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name aws-fsx-csi-driver \
       --region $REGION \
       --service-account-role-arn $FSX_CSI_ROLE_ARN
   ```

   For detailed installation instructions including required IAM permissions, see [Amazon FSx for Lustre CSI driver](https://docs.aws.amazon.com/eks/latest/userguide/workloads-add-ons-available-eks.html#add-ons-aws-fsx-csi-driver).

1. **Metrics Server** (minimum version: v0.7.2-eksbuild.4)

   Required for autoscaling functionality and resource metrics collection.

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name metrics-server \
       --region $REGION
   ```

   For detailed installation instructions, see [Metrics Server](https://docs.aws.amazon.com/eks/latest/userguide/metrics-server.html).

1. **Cert Manager** (minimum version: v1.18.2-eksbuild.2)

   Required for TLS certificate management for secure inference endpoints.

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --region $REGION
   ```

   For detailed installation instructions, see [cert-manager](https://docs.aws.amazon.com/eks/latest/userguide/community-addons.html#addon-cert-manager).

#### Verify Add-on Installation
<a name="sagemaker-hyperpod-model-deployment-setup-verify-dependencies"></a>

After installing the required add-ons, verify they are running correctly:

```
# Check add-on status
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name metrics-server --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION

# Verify pods are running
kubectl get pods -n kube-system | grep -E "(mountpoint|fsx|metrics-server)"
kubectl get pods -n cert-manager
```

All add-ons should show status "ACTIVE" and all pods should be in "Running" state before proceeding with inference operator installation.

**Note**  
If you created your HyperPod cluster using the quick setup or custom setup options, the FSx CSI Driver and Cert Manager may already be installed. Verify their presence using the commands above.

### Installing the Inference Operator with EKS add-on
<a name="sagemaker-hyperpod-model-deployment-setup-install-inference-operator-addon"></a>

The EKS add-on installation method provides a managed experience with automatic updates and integrated dependency validation. This is the recommended approach for installing the inference operator.

**Install the inference operator add-on**

1. Prepare the add-on configuration by gathering all required ARNs and creating the configuration file:

   ```
   # Gather required ARNs
   export EXECUTION_ROLE_ARN=$(aws iam get-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --query "Role.Arn" --output text)
   export HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME --region $REGION --query "ClusterArn" --output text)
   export KEDA_ROLE_ARN=$(aws iam get-role --role-name keda-operator-role --query 'Role.Arn' --output text)
   export ALB_ROLE_ARN=$(aws iam get-role --role-name alb-role --query 'Role.Arn' --output text)
   
   # Verify all ARNs are set correctly
   echo "Execution Role ARN: $EXECUTION_ROLE_ARN"
   echo "HyperPod Cluster ARN: $HYPERPOD_CLUSTER_ARN"
   echo "KEDA Role ARN: $KEDA_ROLE_ARN"
   echo "ALB Role ARN: $ALB_ROLE_ARN"
   echo "TLS S3 Bucket: $BUCKET_NAME"
   ```

1. Create the add-on configuration file with all required settings:

   ```
   cat > addon-config.json << EOF
   {
     "executionRoleArn": "$EXECUTION_ROLE_ARN",
     "tlsCertificateS3Bucket": "$BUCKET_NAME",
     "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
     "jumpstartGatedModelDownloadRoleArn": "$JUMPSTART_GATED_ROLE_ARN",
     "alb": {
       "serviceAccount": {
         "create": true,
         "roleArn": "$ALB_ROLE_ARN"
       }
     },
     "keda": {
       "auth": {
         "aws": {
           "irsa": {
             "roleArn": "$KEDA_ROLE_ARN"
           }
         }
       }
     }
   }
   EOF
   
   # Verify the configuration file
   cat addon-config.json
   ```

1. Install the inference operator add-on (minimum version: v1.0.0-eksbuild.1):

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --configuration-values file://addon-config.json \
       --region $REGION
   ```

1. Monitor the installation progress and verify successful completion:

   ```
   # Check installation status (repeat until status shows "ACTIVE")
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health}" \
       --output table
   
   # Verify pods are running
   kubectl get pods -n hyperpod-inference-system
   
   # Check operator logs for any issues
   kubectl logs -n hyperpod-inference-system deployment/hyperpod-inference-controller-manager --tail=50
   ```

For detailed troubleshooting of installation issues, see [HyperPod inference troubleshooting](sagemaker-hyperpod-model-deployment-ts.md).

To verify the inference operator is working correctly, continue to [Verify the inference operator is working](#sagemaker-hyperpod-model-deployment-setup-verify).

### Using CloudFormation templates to create the prerequisite stack
<a name="sagemaker-hyperpod-model-deployment-setup-cfn"></a>

As an alternative to manually configuring the prerequisites, you can use CloudFormation templates to automate the creation of required IAM roles and policies for the inference operator.

1. Set up input variables. Replace the placeholder values with your own:

   ```
   #!/bin/bash
   set -e
   
   # ===== INPUT VARIABLES =====
   HP_CLUSTER_NAME="my-hyperpod-cluster"  # Replace with your HyperPod cluster name
   REGION="us-east-1"  # Replace with your AWS region
   PREFIX="my-prefix"  # Replace with your resource prefix
   SHORT_PREFIX="12a34d56"  # Replace with your short prefix (maximum 8 characters)
   CREATE_DOMAIN="true"  # Set to "false" if you don't need a SageMaker Studio domain
   STACK_NAME="hyperpod-inference-prerequisites"  # Replace with your stack name
   TEMPLATE_URL="https://aws-sagemaker-hyperpod-cluster-setup-${REGION}-prod.s3.${REGION}.amazonaws.com/templates/main-stack-inference-operator-addon-template.yaml"
   ```

1. Derive cluster and network information:

   ```
   # ===== DERIVE EKS CLUSTER NAME =====
   EKS_CLUSTER_NAME=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'Orchestrator.Eks.ClusterArn' --output text | awk -F'/' '{print $NF}')
   echo "EKS_CLUSTER_NAME=$EKS_CLUSTER_NAME"
   
   # ===== GET VPC AND OIDC =====
   VPC_ID=$(aws eks describe-cluster --name $EKS_CLUSTER_NAME --region $REGION --query 'cluster.resourcesVpcConfig.vpcId' --output text)
   echo "VPC_ID=$VPC_ID"
   
   OIDC_PROVIDER=$(aws eks describe-cluster --name $EKS_CLUSTER_NAME --region $REGION --query 'cluster.identity.oidc.issuer' --output text | sed 's|https://||')
   echo "OIDC_PROVIDER=$OIDC_PROVIDER"
   
   # ===== GET PRIVATE ROUTE TABLES =====
   ALL_ROUTE_TABLES=$(aws ec2 describe-route-tables --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" --query 'RouteTables[].RouteTableId' --output text)
   EKS_PRIVATE_ROUTE_TABLES=""
   for rtb in $ALL_ROUTE_TABLES; do
       HAS_IGW=$(aws ec2 describe-route-tables --region $REGION --route-table-ids $rtb --query 'RouteTables[0].Routes[?GatewayId && starts_with(GatewayId, `igw-`)]' --output text 2>/dev/null)
       if [ -z "$HAS_IGW" ]; then
           EKS_PRIVATE_ROUTE_TABLES="${EKS_PRIVATE_ROUTE_TABLES:+$EKS_PRIVATE_ROUTE_TABLES,}$rtb"
       fi
   done
   echo "EKS_PRIVATE_ROUTE_TABLES=$EKS_PRIVATE_ROUTE_TABLES"
   
   # ===== CHECK S3 VPC ENDPOINT =====
   S3_ENDPOINT_EXISTS=$(aws ec2 describe-vpc-endpoints --region $REGION --filters "Name=vpc-id,Values=$VPC_ID" "Name=service-name,Values=com.amazonaws.$REGION.s3" --query 'VpcEndpoints[0].VpcEndpointId' --output text)
   CREATE_S3_ENDPOINT_STACK=$([ "$S3_ENDPOINT_EXISTS" == "None" ] && echo "true" || echo "false")
   echo "CREATE_S3_ENDPOINT_STACK=$CREATE_S3_ENDPOINT_STACK"
   
   # ===== GET HYPERPOD DETAILS =====
   HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'ClusterArn' --output text)
   echo "HYPERPOD_CLUSTER_ARN=$HYPERPOD_CLUSTER_ARN"
   
   # ===== GET DEFAULT VPC FOR DOMAIN =====
   DOMAIN_VPC_ID=$(aws ec2 describe-vpcs --region $REGION --filters "Name=isDefault,Values=true" --query 'Vpcs[0].VpcId' --output text)
   echo "DOMAIN_VPC_ID=$DOMAIN_VPC_ID"
   
   DOMAIN_SUBNET_IDS=$(aws ec2 describe-subnets --region $REGION --filters "Name=vpc-id,Values=$DOMAIN_VPC_ID" --query 'Subnets[0].SubnetId' --output text)
   echo "DOMAIN_SUBNET_IDS=$DOMAIN_SUBNET_IDS"
   
   # ===== GET INSTANCE GROUPS =====
   INSTANCE_GROUPS=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region $REGION --query 'InstanceGroups[].InstanceGroupName' --output json | python3 -c "import sys, json; groups = json.load(sys.stdin); print('[' + ','.join([f'\\\\\\\"' + g + '\\\\\\\"' for g in groups]) + ']')")
   echo "INSTANCE_GROUPS=$INSTANCE_GROUPS"
   ```

1. Create parameters file and deploy stack:

   ```
   # ===== CREATE PARAMETERS JSON =====
   cat > /tmp/cfn-params.json << EOF
   [
     {"ParameterKey":"ResourceNamePrefix","ParameterValue":"$PREFIX"},
     {"ParameterKey":"ResourceNameShortPrefix","ParameterValue":"$SHORT_PREFIX"},
     {"ParameterKey":"VpcId","ParameterValue":"$VPC_ID"},
     {"ParameterKey":"EksPrivateRouteTableIds","ParameterValue":"$EKS_PRIVATE_ROUTE_TABLES"},
     {"ParameterKey":"EKSClusterName","ParameterValue":"$EKS_CLUSTER_NAME"},
     {"ParameterKey":"OIDCProviderURLWithoutProtocol","ParameterValue":"$OIDC_PROVIDER"},
     {"ParameterKey":"HyperPodClusterArn","ParameterValue":"$HYPERPOD_CLUSTER_ARN"},
     {"ParameterKey":"HyperPodClusterName","ParameterValue":"$HP_CLUSTER_NAME"},
     {"ParameterKey":"CreateDomain","ParameterValue":"$CREATE_DOMAIN"},
     {"ParameterKey":"DomainVpcId","ParameterValue":"$DOMAIN_VPC_ID"},
     {"ParameterKey":"DomainSubnetIds","ParameterValue":"$DOMAIN_SUBNET_IDS"},
     {"ParameterKey":"CreateS3EndpointStack","ParameterValue":"$CREATE_S3_ENDPOINT_STACK"},
     {"ParameterKey":"TieredStorageConfig","ParameterValue":"{\"Mode\":\"Enable\",\"InstanceMemoryAllocationPercentage\":20}"},
     {"ParameterKey":"TieredKVCacheConfig","ParameterValue":"{\"KVCacheMode\":\"Enable\",\"InstanceGroup\":$INSTANCE_GROUPS,\"NVMeMode\":\"Enable\"}"}
   ]
   EOF
   
   echo -e "\n===== CREATING CLOUDFORMATION STACK ====="
   aws cloudformation create-stack \
       --region $REGION \
       --stack-name $STACK_NAME \
       --template-url $TEMPLATE_URL \
       --parameters file:///tmp/cfn-params.json \
       --capabilities CAPABILITY_NAMED_IAM
   ```

1. Monitor the stack creation status:

   ```
   aws cloudformation describe-stacks \
       --stack-name $STACK_NAME \
       --region $REGION \
       --query 'Stacks[0].StackStatus'
   ```

1. Once the stack is created successfully, retrieve the output values for use in the inference operator installation:

   ```
   aws cloudformation describe-stacks \
       --stack-name $STACK_NAME \
       --region $REGION \
       --query 'Stacks[0].Outputs'
   ```

After the CloudFormation stack is created, continue with [Installing the Inference Operator with EKS add-on](#sagemaker-hyperpod-model-deployment-setup-install-inference-operator-addon) to install the inference operator.
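The route-table loop in step 2 builds its comma-separated ID list with the `${VAR:+...}` parameter expansion, which prepends a comma only when the list is already non-empty, so the result has no leading or trailing comma. A standalone sketch using hypothetical route table IDs:

```
# Stand-in route table IDs (in the live script these come from describe-route-tables)
ALL_ROUTE_TABLES="rtb-aaa rtb-bbb rtb-ccc"

LIST=""
for rtb in $ALL_ROUTE_TABLES; do
    # ${LIST:+$LIST,} expands to "$LIST," only when LIST is non-empty
    LIST="${LIST:+$LIST,}$rtb"
done
echo "$LIST"    # rtb-aaa,rtb-bbb,rtb-ccc
```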

## Method 3: Helm chart installation
<a name="sagemaker-hyperpod-model-deployment-setup-helm"></a>

**Note**  
For the simplest installation experience, we recommend using [Method 1: Install HyperPod Inference Add-on through SageMaker AI console (Recommended)](#sagemaker-hyperpod-model-deployment-setup-ui) or [Method 2: Installing the Inference Operator using the AWS CLI](#sagemaker-hyperpod-model-deployment-setup-addon). Helm chart installation may be deprecated in a future release.

### Prerequisites
<a name="sagemaker-hyperpod-model-deployment-setup-prereq"></a>

Before proceeding, verify that your AWS credentials are properly configured and have the necessary permissions. The following steps must be run by an IAM principal with administrator privileges and cluster admin access to an Amazon EKS cluster. Verify that you've created a HyperPod cluster by following [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md). Also verify that you have installed the helm, eksctl, and kubectl command line utilities.

For Kubernetes administrative access to the Amazon EKS cluster, go to the Amazon EKS console and select the cluster you are using. On the **Access** tab, look under **IAM access entries**. If there isn't an entry for your IAM principal, choose **Create access entry**, select the desired IAM principal, and associate the `AmazonEKSClusterAdminPolicy` with it.

1. Configure kubectl to connect to the newly created HyperPod cluster orchestrated by Amazon EKS cluster. Specify the Region and HyperPod cluster name.

   ```
   export HYPERPOD_CLUSTER_NAME=<hyperpod-cluster-name>
   export REGION=<region>
   
   # S3 bucket where tls certificates will be uploaded
   BUCKET_NAME="<Enter name of your s3 bucket>" # This should be bucket name, not URI
   
   export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
   --query 'Orchestrator.Eks.ClusterArn' --output text | \
   cut -d'/' -f2)
   aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
   ```

1. Set default env variables.

   ```
   LB_CONTROLLER_POLICY_NAME="AWSLoadBalancerControllerIAMPolicy-$HYPERPOD_CLUSTER_NAME"
   LB_CONTROLLER_ROLE_NAME="aws-load-balancer-controller-$HYPERPOD_CLUSTER_NAME"
   S3_MOUNT_ACCESS_POLICY_NAME="S3MountpointAccessPolicy-$HYPERPOD_CLUSTER_NAME"
   S3_CSI_ROLE_NAME="SM_HP_S3_CSI_ROLE-$HYPERPOD_CLUSTER_NAME"
   KEDA_OPERATOR_POLICY_NAME="KedaOperatorPolicy-$HYPERPOD_CLUSTER_NAME"
   KEDA_OPERATOR_ROLE_NAME="keda-operator-role-$HYPERPOD_CLUSTER_NAME"
   HYPERPOD_INFERENCE_ROLE_NAME="HyperpodInferenceRole-$HYPERPOD_CLUSTER_NAME"
   HYPERPOD_INFERENCE_SA_NAME="hyperpod-inference-operator-controller"
   HYPERPOD_INFERENCE_SA_NAMESPACE="hyperpod-inference-system"
   JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-$HYPERPOD_CLUSTER_NAME"
   FSX_CSI_ROLE_NAME="AmazonEKSFSxLustreCSIDriverFullAccess-$HYPERPOD_CLUSTER_NAME"
   ```

1. Verify connectivity to the Amazon EKS cluster by listing all pods across namespaces.

   ```
   kubectl get pods --all-namespaces
   ```

1. (Optional) Install the NVIDIA device plugin to enable GPU support on the cluster.

   ```
   #Install nvidia device plugin
   kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
   # Verify that GPUs are visible to k8s
   kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia.com/gpu
   ```
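Step 1 above extracts the EKS cluster name from the cluster ARN with `cut -d'/' -f2`, and the CloudFormation section uses `awk -F'/' '{print $NF}'` for the same job. Both take the resource name after the slash; `awk` with `$NF` is slightly more robust if an ARN ever contains more than one slash. A standalone sketch with a hypothetical ARN:

```
# Hypothetical EKS cluster ARN (account ID and cluster name are placeholders)
EKS_ARN="arn:aws:eks:us-east-1:111122223333:cluster/my-eks-cluster"

# cut -d'/' -f2 takes the second slash-delimited field
NAME_CUT=$(echo "$EKS_ARN" | cut -d'/' -f2)

# awk -F'/' '{print $NF}' takes the last slash-delimited field
NAME_AWK=$(echo "$EKS_ARN" | awk -F'/' '{print $NF}')

echo "$NAME_CUT"   # my-eks-cluster
echo "$NAME_AWK"   # my-eks-cluster
```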

### Prepare your environment for inference operator installation
<a name="sagemaker-hyperpod-model-deployment-setup-prepare"></a>

1. Gather essential AWS resource identifiers and ARNs required for configuring service integrations between Amazon EKS, SageMaker AI, and IAM components.

   ```
   %%bash -x
   
   export ACCOUNT_ID=$(aws --region $REGION sts get-caller-identity --query 'Account' --output text)
   export OIDC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
   export EKS_CLUSTER_ROLE=$(aws eks --region $REGION describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.roleArn' --output text)
   ```

1. Associate an IAM OIDC identity provider with your EKS cluster.

   ```
   eksctl utils associate-iam-oidc-provider --region=$REGION --cluster=$EKS_CLUSTER_NAME --approve
   ```

1. Create the trust policy required for the HyperPod inference operator IAM role. This policy enables secure cross-service communication between Amazon EKS, SageMaker AI, and other AWS services.

   ```
   %%bash -x
   
   # Create trust policy JSON
   cat << EOF > trust-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
   {
       "Effect": "Allow",
       "Principal": {
           "Service": [
               "sagemaker.amazonaws.com"
           ]
       },
       "Action": "sts:AssumeRole"
   },
   {
       "Effect": "Allow",
       "Principal": {
           "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}"
       },
       "Action": "sts:AssumeRoleWithWebIdentity",
       "Condition": {
           "StringLike": {
               "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com",
               "oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:hyperpod-inference-system:hyperpod-inference-controller-manager"
           }
       }
   }
   ]
   }
   EOF
   ```

1. Create the execution role for the inference operator and attach the managed policy.

   ```
   aws iam create-role --role-name $HYPERPOD_INFERENCE_ROLE_NAME --assume-role-policy-document file://trust-policy.json
   aws iam attach-role-policy --role-name $HYPERPOD_INFERENCE_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodInferenceAccess
   ```

1. Download and create the IAM policy required for the AWS Load Balancer Controller to manage Application Load Balancers and Network Load Balancers in your EKS cluster.

   ```
   %%bash -x 
   
   export ALBController_IAM_POLICY_NAME=HyperPodInferenceALBControllerIAMPolicy
   
   curl -o AWSLoadBalancerControllerIAMPolicy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/v2.13.0/docs/install/iam_policy.json
   aws iam create-policy --policy-name $ALBController_IAM_POLICY_NAME --policy-document file://AWSLoadBalancerControllerIAMPolicy.json
   ```

1. Create an IAM service account that links the Kubernetes service account with the IAM policy, enabling the AWS Load Balancer Controller to assume the necessary AWS permissions through IRSA (IAM Roles for Service Accounts).

   ```
   %%bash -x 
   
   export ALB_POLICY_ARN="arn:aws:iam::$ACCOUNT_ID:policy/$ALBController_IAM_POLICY_NAME"
   
   # Create IAM service account with gathered values
   eksctl create iamserviceaccount \
   --approve \
   --override-existing-serviceaccounts \
   --name=aws-load-balancer-controller \
   --namespace=kube-system \
   --cluster=$EKS_CLUSTER_NAME \
   --attach-policy-arn=$ALB_POLICY_ARN \
   --region=$REGION
   
   # Print the values for verification
   echo "Cluster Name: $EKS_CLUSTER_NAME"
   echo "Region: $REGION"
   echo "Policy ARN: $ALB_POLICY_ARN"
   ```

1. Apply the `kubernetes.io/role/elb` tag to the public subnets in the Amazon EKS cluster's VPC so that the load balancer controller can discover them.

   ```
   export VPC_ID=$(aws --region $REGION eks describe-cluster --name $EKS_CLUSTER_NAME --query 'cluster.resourcesVpcConfig.vpcId' --output text)
   
   # Add Tags
   aws ec2 describe-subnets \
   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
   --query 'Subnets[*].SubnetId' --output text | \
   tr '\t' '\n' | \
   xargs -I{} aws ec2 create-tags --resources {} --tags Key=kubernetes.io/role/elb,Value=1
   
   # Verify Tags are added
   aws ec2 describe-subnets \
   --filters "Name=vpc-id,Values=${VPC_ID}" "Name=map-public-ip-on-launch,Values=true" \
   --query 'Subnets[*].SubnetId' --output text | \
   tr '\t' '\n' |
   xargs -n1 -I{} aws ec2 describe-tags --filters "Name=resource-id,Values={}" "Name=key,Values=kubernetes.io/role/elb" --query "Tags[0].Value" --output text
   ```

1. Create a Namespace for KEDA and the Cert Manager.

   ```
   kubectl create namespace keda
   kubectl create namespace cert-manager
   ```

1. Create an Amazon S3 VPC endpoint.

   ```
   aws ec2 create-vpc-endpoint \
   --vpc-id ${VPC_ID} \
   --vpc-endpoint-type Gateway \
   --service-name "com.amazonaws.${REGION}.s3" \
   --route-table-ids $(aws ec2 describe-route-tables --filters "Name=vpc-id,Values=${VPC_ID}" --query 'RouteTables[].Associations[].RouteTableId' --output text | tr ' ' '\n' | sort -u | tr '\n' ' ')
   ```

1. Configure S3 storage access:

   1. Create an IAM policy that grants the necessary S3 permissions for using Mountpoint for Amazon S3, which enables file system access to S3 buckets from within the cluster.

      ```
      %%bash -x
      
      export S3_CSI_BUCKET_NAME="<bucketname_for_mounting_through_filesystem>"
      
      cat <<EOF> s3accesspolicy.json
      {
      "Version": "2012-10-17",
      "Statement": [
          
          {
              "Sid": "MountpointAccess",
              "Effect": "Allow",
              "Action": [
                  "s3:ListBucket",
                  "s3:GetObject",
                  "s3:PutObject",
                  "s3:AbortMultipartUpload",
                  "s3:DeleteObject"
              ],
              "Resource": [
                      "arn:aws:s3:::${S3_CSI_BUCKET_NAME}",
                      "arn:aws:s3:::${S3_CSI_BUCKET_NAME}/*"
              ]
          }
      ]
      }
      EOF
      
      aws iam create-policy \
      --policy-name S3MountpointAccessPolicy \
      --policy-document file://s3accesspolicy.json
      
      cat <<EOF> s3accesstrustpolicy.json
      {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": {
                  "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}"
              },
              "Action": "sts:AssumeRoleWithWebIdentity",
              "Condition": {
                  "StringEquals": {
                      "oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}:aud": "sts.amazonaws.com",
                      "oidc.eks.$REGION.amazonaws.com/id/${OIDC_ID}:sub": "system:serviceaccount:kube-system:s3-csi-driver-sa"
                  }
              }
          }
      ]
      }
      EOF
      
      aws iam create-role --role-name $S3_CSI_ROLE_NAME --assume-role-policy-document file://s3accesstrustpolicy.json
      
      aws iam attach-role-policy --role-name $S3_CSI_ROLE_NAME --policy-arn "arn:aws:iam::$ACCOUNT_ID:policy/S3MountpointAccessPolicy"
      ```

   1. (Optional) Create an IAM service account for the Amazon S3 CSI driver. The Amazon S3 CSI driver requires an IAM service account with appropriate permissions to mount S3 buckets as persistent volumes in your Amazon EKS cluster. This step creates the necessary IAM role and Kubernetes service account with the required S3 access policy.

      ```
      %%bash -x 
      
      export S3_CSI_ROLE_NAME="SM_HP_S3_CSI_ROLE-$REGION"
      export S3_CSI_POLICY_ARN=$(aws iam list-policies --query 'Policies[?PolicyName==`S3MountpointAccessPolicy`]' | jq '.[0].Arn' |  tr -d '"')
      
      eksctl create iamserviceaccount \
      --name s3-csi-driver-sa \
      --namespace kube-system \
      --cluster $EKS_CLUSTER_NAME \
      --attach-policy-arn $S3_CSI_POLICY_ARN \
      --approve \
      --role-name $S3_CSI_ROLE_NAME \
      --region $REGION 
      
      kubectl label serviceaccount s3-csi-driver-sa app.kubernetes.io/component=csi-driver app.kubernetes.io/instance=aws-mountpoint-s3-csi-driver app.kubernetes.io/managed-by=EKS app.kubernetes.io/name=aws-mountpoint-s3-csi-driver -n kube-system --overwrite
      ```

   1. (Optional) Install the Amazon S3 CSI driver add-on. This driver enables your pods to mount S3 buckets as persistent volumes, providing direct access to S3 storage from within your Kubernetes workloads.

      ```
      %%bash -x
      
      export S3_CSI_ROLE_ARN=$(aws iam get-role --role-name $S3_CSI_ROLE_NAME  --query 'Role.Arn' --output text)
      eksctl create addon --name aws-mountpoint-s3-csi-driver --cluster $EKS_CLUSTER_NAME --service-account-role-arn $S3_CSI_ROLE_ARN --force
      ```

   1. (Optional) Create a Persistent Volume Claim (PVC) for S3 storage. This PVC enables your pods to request and use S3 storage as if it were a traditional file system.

      ```
      %%bash -x 
      
      cat <<EOF> pvc_s3.yaml
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: s3-claim
      spec:
        accessModes:
          - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
        storageClassName: "" # required for static provisioning
        resources:
          requests:
            storage: 1200Gi # ignored, required
        volumeName: s3-pv
      EOF
      
      kubectl apply -f pvc_s3.yaml
      ```

1. (Optional) Configure FSx storage access. Create an IAM service account for the Amazon FSx CSI driver. This service account will be used by the FSx CSI driver to interact with the Amazon FSx service on behalf of your cluster.

   ```
   %%bash -x 
   
   
   eksctl create iamserviceaccount \
   --name fsx-csi-controller-sa \
   --namespace kube-system \
   --cluster $EKS_CLUSTER_NAME \
   --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \
   --approve \
   --role-name FSXLCSI-${EKS_CLUSTER_NAME}-${REGION} \
   --region $REGION
   ```
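The heredocs in this section interpolate shell variables into JSON, so a typo or unset variable silently produces a policy document that IAM will reject. A quick sanity check is to run the generated file through `python3 -m json.tool` before calling `aws iam create-role`; the sketch below uses hypothetical identifier values in place of the real account, Region, and OIDC ID:

```
# Hypothetical values standing in for the real account/OIDC identifiers
ACCOUNT_ID="111122223333"
REGION="us-east-1"
OIDC_ID="EXAMPLED539D4633E53DE1B71EXAMPLE"

cat <<EOF > /tmp/check-trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/oidc.eks.${REGION}.amazonaws.com/id/${OIDC_ID}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity"
    }
  ]
}
EOF

# json.tool exits non-zero on malformed JSON, catching mis-quoted variables
python3 -m json.tool /tmp/check-trust-policy.json > /dev/null && echo "trust policy JSON is valid"
```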

### Create the KEDA operator role
<a name="sagemaker-hyperpod-model-deployment-setup-keda"></a>

1. Create the trust policy and permissions policy.

   ```
   # Create trust policy
   cat <<EOF > /tmp/keda-trust-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Principal": {
               "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
           },
           "Action": "sts:AssumeRoleWithWebIdentity",
           "Condition": {
               "StringLike": {
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:kube-system:keda-operator",
                   "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
               }
           }
       }
   ]
   }
   EOF
   # Create permissions policy
   cat <<EOF > /tmp/keda-policy.json
   {
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Action": [
               "cloudwatch:GetMetricData",
               "cloudwatch:GetMetricStatistics",
               "cloudwatch:ListMetrics"
           ],
           "Resource": "*"
       },
       {
           "Effect": "Allow",
           "Action": [
               "aps:QueryMetrics",
               "aps:GetLabels",
               "aps:GetSeries",
               "aps:GetMetricMetadata"
           ],
           "Resource": "*"
       }
   ]
   }
   EOF
   # Create the role
   aws iam create-role \
   --role-name keda-operator-role \
   --assume-role-policy-document file:///tmp/keda-trust-policy.json
   # Create the policy
   KEDA_POLICY_ARN=$(aws iam create-policy \
   --policy-name KedaOperatorPolicy \
   --policy-document file:///tmp/keda-policy.json \
   --query 'Policy.Arn' \
   --output text)
   # Attach the policy to the role
   aws iam attach-role-policy \
   --role-name keda-operator-role \
   --policy-arn $KEDA_POLICY_ARN
   ```

1. If you're using gated models, create an IAM role to access the gated models.

   1. Create the trust policy and IAM role for gated model access.

      ```
      %%bash -s $REGION
      
      JUMPSTART_GATED_ROLE_NAME="JumpstartGatedRole-${REGION}-${HYPERPOD_CLUSTER_NAME}"
      
      cat <<EOF > /tmp/trust-policy.json
      {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": {
                  "Federated": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID"
              },
              "Action": "sts:AssumeRoleWithWebIdentity",
              "Condition": {
                  "StringLike": {
                      "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:sub": "system:serviceaccount:*:hyperpod-inference-service-account*",
                      "oidc.eks.$REGION.amazonaws.com/id/$OIDC_ID:aud": "sts.amazonaws.com"
                  }
              }
          },
              {
              "Effect": "Allow",
              "Principal": {
                  "Service": "sagemaker.amazonaws.com"
              },
              "Action": "sts:AssumeRole"
          }
      ]
      }
      EOF
      
      # Create the role and attach the managed policy
      aws iam create-role \
      --role-name $JUMPSTART_GATED_ROLE_NAME \
      --assume-role-policy-document file:///tmp/trust-policy.json
      
      aws iam attach-role-policy \
      --role-name $JUMPSTART_GATED_ROLE_NAME \
      --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerHyperPodGatedModelAccess
      ```

      ```
      JUMPSTART_GATED_ROLE_ARN=$(aws iam get-role --role-name $JUMPSTART_GATED_ROLE_NAME --query "Role.Arn" --output text)
      echo $JUMPSTART_GATED_ROLE_ARN
      ```
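The gated-model trust policy above relies on a `StringLike` wildcard to match any namespace and any service-account suffix. Shell glob patterns follow similar wildcard semantics, so you can sanity-check a candidate OIDC subject string locally; the subject value below is hypothetical:

```
# Hypothetical OIDC subject for a gated-model service account
SUB="system:serviceaccount:default:hyperpod-inference-service-account-abc123"

# Same wildcard pattern as the trust policy's StringLike condition
case "$SUB" in
    system:serviceaccount:*:hyperpod-inference-service-account*) RESULT="match" ;;
    *) RESULT="no-match" ;;
esac
echo "$RESULT"    # match
```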

### Install the inference operator
<a name="sagemaker-hyperpod-model-deployment-setup-install"></a>

1. Install the HyperPod inference operator. This step gathers the required AWS resource identifiers and generates the Helm installation command with the appropriate configuration parameters.

   Access the helm chart from [https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart).

   ```
   git clone https://github.com/aws/sagemaker-hyperpod-cli
   cd sagemaker-hyperpod-cli
   cd helm_chart/HyperPodHelmChart
   helm dependencies update charts/inference-operator
   ```

   ```
   %%bash -x
   
   HYPERPOD_INFERENCE_ROLE_ARN=$(aws iam get-role --role-name=$HYPERPOD_INFERENCE_ROLE_NAME --query "Role.Arn" --output text)
   echo $HYPERPOD_INFERENCE_ROLE_ARN
   
   S3_CSI_ROLE_ARN=$(aws iam get-role --role-name=$S3_CSI_ROLE_NAME --query "Role.Arn" --output text)
   echo $S3_CSI_ROLE_ARN
   
   HYPERPOD_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME --query "ClusterArn" --output text)
   
   # Verify values
   echo "Cluster Name: $EKS_CLUSTER_NAME"
   echo "Execution Role: $HYPERPOD_INFERENCE_ROLE_ARN"
   echo "Hyperpod ARN: $HYPERPOD_CLUSTER_ARN"
   # Run the HyperPod inference operator installation.
   
   helm install hyperpod-inference-operator charts/inference-operator \
   -n kube-system \
   --set region=$REGION \
   --set eksClusterName=$EKS_CLUSTER_NAME \
   --set hyperpodClusterArn=$HYPERPOD_CLUSTER_ARN \
   --set executionRoleArn=$HYPERPOD_INFERENCE_ROLE_ARN \
   --set s3.serviceAccountRoleArn=$S3_CSI_ROLE_ARN \
   --set s3.node.serviceAccount.create=false \
   --set keda.podIdentity.aws.irsa.roleArn="arn:aws:iam::$ACCOUNT_ID:role/keda-operator-role" \
   --set tlsCertificateS3Bucket="s3://$BUCKET_NAME" \
   --set alb.region=$REGION \
   --set alb.clusterName=$EKS_CLUSTER_NAME \
   --set alb.vpcId=$VPC_ID
   
   # For JumpStart Gated Model usage, Add
   # --set jumpstartGatedModelDownloadRoleArn=$JUMPSTART_GATED_ROLE_ARN
   ```

1. Configure the service account annotations for IAM integration. This annotation enables the operator's service account to assume the necessary IAM permissions for managing inference endpoints and interacting with AWS services.

   ```
   %%bash -x 
   
   EKS_CLUSTER_ROLE_NAME=$(echo $EKS_CLUSTER_ROLE | sed 's/.*\///')
   
   # Annotate service account
   kubectl annotate serviceaccount hyperpod-inference-operator-controller-manager \
   -n hyperpod-inference-system \
   eks.amazonaws.com/role-arn=arn:aws:iam::${ACCOUNT_ID}:role/${EKS_CLUSTER_ROLE_NAME} \
   --overwrite
   ```
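The `sed 's/.*\///'` in step 2 strips everything up to and including the last slash, turning a role ARN into a bare role name for the annotation. A standalone sketch with a hypothetical ARN:

```
# Hypothetical cluster role ARN (account ID and role name are placeholders)
EKS_CLUSTER_ROLE="arn:aws:iam::111122223333:role/my-eks-cluster-role"

# Greedy .* matches up to the last slash, so only the final path segment remains
EKS_CLUSTER_ROLE_NAME=$(echo "$EKS_CLUSTER_ROLE" | sed 's/.*\///')
echo "$EKS_CLUSTER_ROLE_NAME"   # my-eks-cluster-role
```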

## Verify the inference operator is working
<a name="sagemaker-hyperpod-model-deployment-setup-verify"></a>

Follow these steps to verify that your inference operator installation is working correctly by deploying and testing a simple model.

**Deploy a test model to verify the operator**

1. Create a model deployment configuration file. This creates a Kubernetes manifest file that defines a JumpStart model deployment for the HyperPod inference operator.

   ```
   cat <<EOF > simple_model_install.yaml
   ---
   apiVersion: inference.sagemaker.aws.amazon.com/v1
   kind: JumpStartModel
   metadata:
     name: testing-deployment-bert
     namespace: default
   spec:
     model:
       modelId: "huggingface-eqa-bert-base-cased"
     sageMakerEndpoint:
       name: "hp-inf-ep-for-testing"
     server:
       instanceType: "ml.c5.2xlarge"
     environmentVariables:
       - name: SAMPLE_ENV_VAR
         value: "sample_value"
     maxDeployTimeInSeconds: 1800
   EOF
   ```

1. Deploy the model and clean up the configuration file.

   ```
   kubectl create -f simple_model_install.yaml
   rm -f simple_model_install.yaml
   ```

1. Verify the service account configuration to ensure the operator can assume AWS permissions.

   ```
   # Get the service account details
   kubectl get serviceaccount -n hyperpod-inference-system
   
   # Check if the service account has the AWS annotations
   kubectl describe serviceaccount hyperpod-inference-operator-controller-manager -n hyperpod-inference-system
   ```
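If you want to script the annotation check in step 3, grep the `describe serviceaccount` output for the `role-arn` key. The sketch below uses a hypothetical canned excerpt in place of live `kubectl` output so it can run anywhere:

```
# Canned excerpt of 'kubectl describe serviceaccount' output (role ARN is hypothetical)
DESCRIBE_OUTPUT='Name:         hyperpod-inference-operator-controller-manager
Namespace:    hyperpod-inference-system
Annotations:  eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/my-eks-cluster-role'

# grep -q exits 0 only when the annotation key is present
if printf '%s\n' "$DESCRIBE_OUTPUT" | grep -q 'eks.amazonaws.com/role-arn'; then
    echo "role-arn annotation present"
else
    echo "role-arn annotation missing"
fi
```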

**Configure deployment settings (if using Studio UI)**

1. Review the recommended instance type under **Deployment settings**.

1. If modifying the **Instance type**, ensure compatibility with your HyperPod cluster. Contact your admin if compatible instances aren't available.

1. For GPU-partitioned instances with MIG enabled, select an appropriate **GPU partition** from available MIG profiles to optimize GPU utilization. For more information, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).

1. If using task governance, configure priority settings for model deployment preemption capabilities.

1. Enter the namespace provided by your admin. Contact your admin for the correct namespace if needed.

## (Optional) Set up user access through the JumpStart UI in SageMaker AI Studio Classic
<a name="sagemaker-hyperpod-model-deployment-setup-optional-js"></a>

For more background on setting up SageMaker HyperPod access for Studio Classic users and configuring fine-grained Kubernetes RBAC permissions for data scientist users, read [Setting up an Amazon EKS cluster in Studio](sagemaker-hyperpod-studio-setup-eks.md) and [Setting up Kubernetes role-based access control](sagemaker-hyperpod-eks-setup-rbac.md).

1. Identify the IAM role that Data Scientist users will use to manage and deploy models to SageMaker HyperPod from SageMaker AI Studio Classic. This is typically the User Profile Execution Role or Domain Execution Role for the Studio Classic user.

   ```
   %%bash -x
   
   export DATASCIENTIST_ROLE_NAME="<Execution Role Name used in SageMaker Studio Classic>"
   
   export DATASCIENTIST_POLICY_NAME="HyperPodUIAccessPolicy"
   export EKS_CLUSTER_ARN=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
     --query 'Orchestrator.Eks.ClusterArn' --output text)
   
   export DATASCIENTIST_HYPERPOD_NAMESPACE="team-namespace"
   ```

1. Attach an Identity Policy enabling Model Deployment access.

   ```
   %%bash -x
   
   # Create access policy
   cat << EOF > hyperpod-deployment-ui-access-policy.json
   {
       "Version": "2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "DescribeHyperpodClusterPermissions",
               "Effect": "Allow",
               "Action": [
                   "sagemaker:DescribeCluster"
               ],
               "Resource": "$HYPERPOD_CLUSTER_ARN"
           },
           {
               "Sid": "UseEksClusterPermissions",
               "Effect": "Allow",
               "Action": [
                   "eks:DescribeCluster",
                   "eks:AccessKubernetesApi",
                   "eks:MutateViaKubernetesApi",
                   "eks:DescribeAddon"
               ],
               "Resource": "$EKS_CLUSTER_ARN"
           },
           {
               "Sid": "ListPermission",
               "Effect": "Allow",
               "Action": [
                   "sagemaker:ListClusters",
                   "sagemaker:ListEndpoints"
               ],
               "Resource": "*"
           },
           {
               "Sid": "SageMakerEndpointAccess",
               "Effect": "Allow",
               "Action": [
                   "sagemaker:DescribeEndpoint",
                   "sagemaker:InvokeEndpoint"
               ],
               "Resource": "arn:aws:sagemaker:$REGION:$ACCOUNT_ID:endpoint/*"
           }
       ]
   }
   EOF
   
   aws iam put-role-policy --role-name $DATASCIENTIST_ROLE_NAME --policy-name $DATASCIENTIST_POLICY_NAME --policy-document file://hyperpod-deployment-ui-access-policy.json
   ```

1. Create an EKS access entry for the user, mapping them to Kubernetes groups.

   ```
   %%bash -x
   
   aws eks create-access-entry --cluster-name $EKS_CLUSTER_NAME \
       --principal-arn "arn:aws:iam::$ACCOUNT_ID:role/$DATASCIENTIST_ROLE_NAME" \
       --kubernetes-groups '["hyperpod-scientist-user-namespace-level","hyperpod-scientist-user-cluster-level"]'
   ```

1. Create Kubernetes RBAC policies for the user.

   ```
   %%bash -x
   
   cat << EOF > cluster_level_config.yaml
   kind: ClusterRole
   apiVersion: rbac.authorization.k8s.io/v1
   metadata:
     name: hyperpod-scientist-user-cluster-role
   rules:
   - apiGroups: [""]
     resources: ["pods"]
     verbs: ["list"]
   - apiGroups: [""]
     resources: ["nodes"]
     verbs: ["list"]
   - apiGroups: [""]
     resources: ["namespaces"]
     verbs: ["list"]
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: ClusterRoleBinding
   metadata:
     name: hyperpod-scientist-user-cluster-role-binding
   subjects:
   - kind: Group
     name: hyperpod-scientist-user-cluster-level
     apiGroup: rbac.authorization.k8s.io
   roleRef:
     kind: ClusterRole
     name: hyperpod-scientist-user-cluster-role
     apiGroup: rbac.authorization.k8s.io
   EOF
   
   
   kubectl apply -f cluster_level_config.yaml
   
   
   cat << EOF > namespace_level_role.yaml
   kind: Role
   apiVersion: rbac.authorization.k8s.io/v1
   metadata:
     namespace: $DATASCIENTIST_HYPERPOD_NAMESPACE
     name: hyperpod-scientist-user-namespace-level-role
   rules:
   - apiGroups: [""]
     resources: ["pods"]
     verbs: ["create", "get"]
   - apiGroups: [""]
     resources: ["nodes"]
     verbs: ["get", "list"]
   - apiGroups: [""]
     resources: ["pods/log"]
     verbs: ["get", "list"]
   - apiGroups: [""]
     resources: ["pods/exec"]
     verbs: ["get", "create"]
   - apiGroups: ["kubeflow.org"]
     resources: ["pytorchjobs", "pytorchjobs/status"]
     verbs: ["get", "list", "create", "delete", "update", "describe"]
   - apiGroups: [""]
     resources: ["configmaps"]
     verbs: ["create", "update", "get", "list", "delete"]
   - apiGroups: [""]
     resources: ["secrets"]
     verbs: ["create", "get", "list", "delete"]
   - apiGroups: [ "inference.sagemaker.aws.amazon.com" ]
     resources: [ "inferenceendpointconfig", "inferenceendpoint", "jumpstartmodel" ]
     verbs: [ "get", "list", "create", "delete", "update", "describe" ]
   - apiGroups: [ "autoscaling" ]
     resources: [ "horizontalpodautoscalers" ]
     verbs: [ "get", "list", "watch", "create", "update", "patch", "delete" ]
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: RoleBinding
   metadata:
     namespace: $DATASCIENTIST_HYPERPOD_NAMESPACE
     name: hyperpod-scientist-user-namespace-level-role-binding
   subjects:
   - kind: Group
     name: hyperpod-scientist-user-namespace-level
     apiGroup: rbac.authorization.k8s.io
   roleRef:
     kind: Role
     name: hyperpod-scientist-user-namespace-level-role
     apiGroup: rbac.authorization.k8s.io
   EOF
   
   
   kubectl apply -f namespace_level_role.yaml
   ```
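
Access only works if the group names in the EKS access entry exactly match the `subjects` of the ClusterRoleBinding and RoleBinding above. A quick consistency sketch, grepping a saved copy of the binding subjects (file path and contents are illustrative):

```shell
# Groups granted in the EKS access entry
ACCESS_ENTRY_GROUPS="hyperpod-scientist-user-namespace-level hyperpod-scientist-user-cluster-level"

# Subjects as they appear in the applied bindings (sample copy)
cat > /tmp/binding_subjects.txt <<'EOF'
- kind: Group
  name: hyperpod-scientist-user-cluster-level
- kind: Group
  name: hyperpod-scientist-user-namespace-level
EOF

# Every access-entry group must appear as a binding subject; otherwise the
# user authenticates but ends up with no permissions
MISSING=""
for G in $ACCESS_ENTRY_GROUPS; do
  grep -q "name: $G\$" /tmp/binding_subjects.txt || MISSING="$MISSING $G"
done
[ -z "$MISSING" ] && echo "all groups bound" || echo "unbound groups:$MISSING"
```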

# Deploy foundation models and custom fine-tuned models
<a name="sagemaker-hyperpod-model-deployment-deploy"></a>

Whether you're deploying pre-trained open-weights or gated foundation models from Amazon SageMaker JumpStart or your own custom or fine-tuned models stored in Amazon S3 or Amazon FSx, SageMaker HyperPod provides the flexible, scalable infrastructure you need for production inference workloads.





|  | Deploy open-weights and gated foundation models from JumpStart | Deploy custom and fine-tuned models from Amazon S3 and Amazon FSx | 
| --- | --- | --- | 
| Description |  Deploy from a comprehensive catalog of pre-trained foundation models with automatic optimization and scaling policies tailored to each model family.  | Bring your own custom and fine-tuned models and leverage SageMaker HyperPod's enterprise infrastructure for production-scale inference. Choose between cost-effective storage with Amazon S3 or a high-performance file system with Amazon FSx. | 
| Key benefits | [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html) |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html)  | 
| Deployment options |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html)  |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-model-deployment-deploy.html)  | 

The following sections step you through deploying models from Amazon SageMaker JumpStart and from Amazon S3 and Amazon FSx.

**Topics**
+ [Deploy models from JumpStart using Amazon SageMaker Studio](sagemaker-hyperpod-model-deployment-deploy-js-ui.md)
+ [Deploy models from JumpStart using kubectl](sagemaker-hyperpod-model-deployment-deploy-js-kubectl.md)
+ [Deploy custom fine-tuned models from Amazon S3 and Amazon FSx using kubectl](sagemaker-hyperpod-model-deployment-deploy-ftm.md)
+ [Deploy custom fine-tuned models using the Python SDK and HPCLI](deploy-trained-model.md) 
+ [Deploy models from Amazon SageMaker JumpStart using the Python SDK and HPCLI](deploy-jumpstart-model.md) 

# Deploy models from JumpStart using Amazon SageMaker Studio
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui"></a>

The following steps show you how to deploy models from JumpStart using Amazon SageMaker Studio.

## Prerequisites
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-prereqs"></a>

Verify that you've set up inference capabilities on your Amazon SageMaker HyperPod clusters. For more information, see [Setting up your HyperPod clusters for model deployment](sagemaker-hyperpod-model-deployment-setup.md). 

## Create a HyperPod deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-create"></a>

1. In Amazon SageMaker Studio, open the **JumpStart** landing page from the left navigation pane. 

1. Under **All public models**, choose a model you want to deploy.
**Note**  
If you’ve selected a gated model, you’ll have to accept the End User License Agreement (EULA).

1. Choose **SageMaker HyperPod**.

1. Under **Deployment settings**, JumpStart will recommend an instance for deployment. You can modify these settings if necessary.

   1. If you modify **Instance type**, ensure it’s compatible with the chosen **HyperPod cluster**. If there aren’t any compatible instances, you’ll need to select a new **HyperPod cluster** or contact your admin to add compatible instances to the cluster.

   1. To prioritize the model deployment, install the task governance add-on, create compute allocations, and set up task rankings for the cluster policy. Once this is done, you should see an option to select a priority for the model deployment, which can be used to preempt other deployments and tasks on the cluster. 

   1. Enter the namespace to which your admin has provided you access. You may have to directly reach out to your admin to get the exact namespace. Once a valid namespace is provided, the **Deploy** button should be enabled to deploy the model.

   1. If your instance type is partitioned (MIG enabled), select a **GPU partition type**.

   1. To speed up LLM inference, you can enable L2 KV cache or intelligent routing. By default, only the L1 KV cache is enabled. For more details on KV caching and intelligent routing, see [SageMaker HyperPod model deployment](sagemaker-hyperpod-model-deployment.md).

1. Choose **Deploy** and wait for the **Endpoint** to be created.

1. After the **Endpoint** has been created, select **Test inference**.

## Edit a HyperPod deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-edit"></a>

1. In Amazon SageMaker Studio, select **Compute** and then **HyperPod clusters** from the left navigation pane. 

1. Under **Deployments**, choose the HyperPod cluster deployment you want to modify.

1. From the vertical ellipsis icon (⋮), choose **Edit**.

1. Under **Deployment settings**, you can enable or disable **Auto-scaling**, and change the number of **Max replicas**.

1. Select **Save**.

1. The **Status** will change to **Updating**. Once it changes back to **In service**, your changes are complete and you’ll see a message confirming it.

## Delete a HyperPod deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-js-ui-delete"></a>

1. In Amazon SageMaker Studio, select **Compute** and then **HyperPod clusters** from the left navigation pane. 

1. Under **Deployments**, choose the HyperPod cluster deployment you want to modify.

1. From the vertical ellipsis icon (⋮), choose **Delete**.

1. In the **Delete HyperPod deployment window**, select the checkbox.

1. Choose **Delete**.

1. The **Status** will change to **Deleting**. Once the HyperPod deployment has been deleted, you’ll see a message confirming it.

# Deploy models from JumpStart using kubectl
<a name="sagemaker-hyperpod-model-deployment-deploy-js-kubectl"></a>

The following steps show you how to deploy a JumpStart model to a HyperPod cluster using kubectl.

The following instructions contain code cells and commands designed to run in a terminal. Ensure you have configured your environment with AWS credentials before executing these commands. 

## Prerequisites
<a name="kubectl-prerequisites"></a>

Before you begin, verify that you've: 
+ Set up inference capabilities on your Amazon SageMaker HyperPod clusters. For more information, see [Setting up your HyperPod clusters for model deployment](sagemaker-hyperpod-model-deployment-setup.md).
+ Installed [kubectl](https://kubernetes.io/docs/reference/kubectl/) utility and configured [jq](https://jqlang.org/) in your terminal.

## Setup and configuration
<a name="kubectl-prerequisites-setup-and-configuration"></a>

1. Select your Region.

   ```
   export REGION=<region>
   ```

1. View all SageMaker public hub models and HyperPod clusters.

1. Select a `JumpStartModel` from the SageMaker public hub. The public hub has a large number of models available, so you can use `NextToken` to iteratively list all of them.

   ```
   aws sagemaker list-hub-contents --hub-name SageMakerPublicHub --hub-content-type Model --query '{Models: HubContentSummaries[].{ModelId:HubContentName,Version:HubContentVersion}, NextToken: NextToken}' --output json
   ```

   ```
   export MODEL_ID="deepseek-llm-r1-distill-qwen-1-5b"
   export MODEL_VERSION="2.0.4"
   ```

1. Configure the model ID and cluster name you’ve selected into the variables below.
**Note**  
Check with your cluster admin to ensure permissions are granted for this role or user. You can run `aws sts get-caller-identity --query "Arn"` to check which role or user you are using in your terminal.

   ```
   aws sagemaker list-clusters --output table
   
   # Select the cluster name where you want to deploy the model.
   export HYPERPOD_CLUSTER_NAME="<insert cluster name here>"
   
   # Select the instance that is relevant for your model deployment and exists within the selected cluster.
   # List available instances in your HyperPod cluster
   aws sagemaker describe-cluster --cluster-name=$HYPERPOD_CLUSTER_NAME --query "InstanceGroups[].{InstanceType:InstanceType,Count:CurrentCount}" --output table
   
   # List supported instance types for the selected model
   aws sagemaker describe-hub-content --hub-name SageMakerPublicHub --hub-content-type Model --hub-content-name "$MODEL_ID" --output json | jq -r '.HubContentDocument | fromjson | {Default: .DefaultInferenceInstanceType, Supported: .SupportedInferenceInstanceTypes}'
   
   
   # Select an instance type from the cluster that is compatible with the model.
   # Make sure that the selected instance type is either the default or a supported instance type for the JumpStart model.
   export INSTANCE_TYPE="<Instance_type_In_cluster>"
   ```

1. Confirm with the cluster admin which namespace you are permitted to use. The admin should have created a `hyperpod-inference` service account in your namespace.

   ```
   export CLUSTER_NAMESPACE="default"
   ```

1. Set a name for the endpoint and the custom object to be created.

   ```
   export SAGEMAKER_ENDPOINT_NAME="deepseek-qwen-1-5b-test"
   ```

1. The following is an example of a `deepseek-llm-r1-distill-qwen-1-5b` model deployment from JumpStart. Create a similar deployment YAML file based on the model you selected in the step above.
**Note**  
If your cluster uses GPU partitioning with MIG, you can request specific MIG profiles by adding the `acceleratorPartitionType` field to the server specification. For more information, see [Task Submission with MIG](sagemaker-hyperpod-eks-gpu-partitioning-task-submission.md).

   ```
   cat << EOF > jumpstart_model.yaml
   ---
   apiVersion: inference.sagemaker.aws.amazon.com/v1
   kind: JumpStartModel
   metadata:
     name: $SAGEMAKER_ENDPOINT_NAME
     namespace: $CLUSTER_NAMESPACE 
   spec:
     sageMakerEndpoint:
       name: $SAGEMAKER_ENDPOINT_NAME
     model:
       modelHubName: SageMakerPublicHub
       modelId: $MODEL_ID
       modelVersion: $MODEL_VERSION
     server:
       instanceType: $INSTANCE_TYPE
       # Optional: Specify GPU partition profile for MIG-enabled instances
       # acceleratorPartitionType: "1g.10gb"
     metrics:
       enabled: true
     environmentVariables:
       - name: SAMPLE_ENV_VAR
         value: "sample_value"
     maxDeployTimeInSeconds: 1800
     autoScalingSpec:
       cloudWatchTrigger:
         name: "SageMaker-Invocations"
         namespace: "AWS/SageMaker"
         useCachedMetrics: false
         metricName: "Invocations"
         targetValue: 10
         minValue: 0.0
         metricCollectionPeriod: 30
         metricStat: "Sum"
         metricType: "Average"
         dimensions:
           - name: "EndpointName"
             value: "$SAGEMAKER_ENDPOINT_NAME"
           - name: "VariantName"
             value: "AllTraffic"
   EOF
   ```
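
Note that the manifest above depends on the heredoc delimiter being unquoted: with `<< EOF`, the shell substitutes the exported variables as it writes the file, while `<< 'EOF'` would leave literal `$MODEL_ID`-style strings behind. A minimal illustration with a hypothetical value:

```shell
export SAGEMAKER_ENDPOINT_NAME="demo-endpoint"   # hypothetical value

# Unquoted EOF: the variable is expanded when the file is written
cat << EOF > /tmp/expanded.yaml
metadata:
  name: $SAGEMAKER_ENDPOINT_NAME
EOF

grep "name:" /tmp/expanded.yaml   # name: demo-endpoint
```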

## Deploy your model
<a name="kubectl-deploy-your-model"></a>

**Update your kubernetes configuration and deploy your model**

1. Configure kubectl to connect to the HyperPod cluster orchestrated by Amazon EKS.

   ```
   export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
     --query 'Orchestrator.Eks.ClusterArn' --output text | \
     cut -d'/' -f2)
   aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
   ```

1. Deploy your JumpStart model.

   ```
   kubectl apply -f jumpstart_model.yaml
   ```
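
The `cut -d'/' -f2` step relies on the shape of an EKS cluster ARN, which ends in `cluster/<cluster-name>`. A self-contained sketch with a hypothetical ARN:

```shell
# Hypothetical EKS cluster ARN (region, account, and name are placeholders)
SAMPLE_ARN="arn:aws:eks:us-west-2:111122223333:cluster/my-hyperpod-eks"

# Split on '/' and keep the second field: the cluster name
CLUSTER_NAME=$(echo "$SAMPLE_ARN" | cut -d'/' -f2)
echo "$CLUSTER_NAME"   # my-hyperpod-eks
```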

**Monitor the status of your model deployment**

1. Verify that the model is successfully deployed.

   ```
   kubectl describe JumpStartModel $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Verify that the endpoint is successfully created.

   ```
   aws sagemaker describe-endpoint --endpoint-name=$SAGEMAKER_ENDPOINT_NAME --output table
   ```

1. Invoke your model endpoint. You can programmatically retrieve example payloads from the `JumpStartModel` object.

   ```
   aws sagemaker-runtime invoke-endpoint \
     --endpoint-name $SAGEMAKER_ENDPOINT_NAME \
     --content-type "application/json" \
     --body '{"inputs": "What is AWS SageMaker?"}' \
     --region $REGION \
     --cli-binary-format raw-in-base64-out \
     /dev/stdout
   ```
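
Because `--cli-binary-format raw-in-base64-out` sends the `--body` string as-is, malformed JSON is rejected by the model server rather than by the CLI. If you assemble the body dynamically, a local validation step catches mistakes early; a sketch using `python3 -m json.tool` as the validator:

```shell
# The same request body as above, built into a variable
BODY='{"inputs": "What is AWS SageMaker?"}'

# json.tool exits non-zero on malformed JSON, so this flags bad payloads
# before any endpoint call is made
if echo "$BODY" | python3 -m json.tool > /dev/null 2>&1; then
  BODY_OK=yes
else
  BODY_OK=no
fi
echo "valid JSON: $BODY_OK"
```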

## Manage your deployment
<a name="kubectl-manage-your-deployment"></a>

Delete your JumpStart model deployment once you no longer need it.

```
kubectl delete JumpStartModel $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
```

**Troubleshooting**

Use these debugging commands if your deployment isn't working as expected.

1. Check the status of Kubernetes deployment. This command inspects the underlying Kubernetes deployment object that manages the pods running your model. Use this to troubleshoot pod scheduling, resource allocation, and container startup issues.

   ```
   kubectl describe deployment $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check the status of your JumpStart model resource. This command examines the custom `JumpStartModel` resource that manages the high-level model configuration and deployment lifecycle. Use this to troubleshoot model-specific issues like configuration errors or SageMaker AI endpoint creation problems.

   ```
   kubectl describe JumpStartModel $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check the status of all Kubernetes objects. This command provides a comprehensive overview of all related Kubernetes resources in your namespace. Use this for a quick health check to see the overall state of pods, services, deployments, and custom resources associated with your model deployment.

   ```
   kubectl get pods,svc,deployment,JumpStartModel,sagemakerendpointregistration -n $CLUSTER_NAMESPACE
   ```

# Deploy custom fine-tuned models from Amazon S3 and Amazon FSx using kubectl
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm"></a>

The following steps show you how to deploy models stored on Amazon S3 or Amazon FSx to an Amazon SageMaker HyperPod cluster using kubectl. 

The following instructions contain code cells and commands designed to run in a terminal. Ensure you have configured your environment with AWS credentials before executing these commands.

## Prerequisites
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-prereqs"></a>

Before you begin, verify that you've: 
+ Set up inference capabilities on your Amazon SageMaker HyperPod clusters. For more information, see [Setting up your HyperPod clusters for model deployment](sagemaker-hyperpod-model-deployment-setup.md).
+ Installed [kubectl](https://kubernetes.io/docs/reference/kubectl/) utility and configured [jq](https://jqlang.org/) in your terminal.

## Setup and configuration
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-setup"></a>

Replace all placeholder values with your actual resource identifiers.

1. Select your Region in your environment.

   ```
   export REGION=<region>
   ```

1. Initialize your cluster name. This identifies the HyperPod cluster where your model will be deployed.
**Note**  
Check with your cluster admin to ensure permissions are granted for this role or user. You can run `aws sts get-caller-identity --query "Arn"` to check which role or user you are using in your terminal.

   ```
   # Specify your hyperpod cluster name here
   HYPERPOD_CLUSTER_NAME="<Hyperpod_cluster_name>"
   
   # NOTE: For sample deployment, we use g5.8xlarge for deepseek-r1 1.5b model which has sufficient memory and GPU
   instance_type="ml.g5.8xlarge"
   ```

1. Initialize your cluster namespace. Your cluster admin should have already created a `hyperpod-inference` service account in your namespace.

   ```
   export CLUSTER_NAMESPACE="<namespace>"
   ```

1. Create a CRD using one of the following options:

------
#### [ Using Amazon FSx as the model source ]

   1. Set up a SageMaker endpoint name.

      ```
      export SAGEMAKER_ENDPOINT_NAME="deepseek15b-fsx"
      ```

   1. Configure the Amazon FSx file system ID to be used.

      ```
      export FSX_FILE_SYSTEM_ID="fs-1234abcd"
      ```

   1. The following is an example yaml file for creating an endpoint with Amazon FSx and a DeepSeek model.
**Note**  
For clusters with GPU partitioning enabled, replace `nvidia.com/gpu` with the appropriate MIG resource name such as `nvidia.com/mig-1g.10gb`. For more information, see [Task Submission with MIG](sagemaker-hyperpod-eks-gpu-partitioning-task-submission.md).

      ```
      cat <<EOF> deploy_fsx_cluster_inference.yaml
      ---
      apiVersion: inference.sagemaker.aws.amazon.com/v1
      kind: InferenceEndpointConfig
      metadata:
        name: $SAGEMAKER_ENDPOINT_NAME
        namespace: $CLUSTER_NAMESPACE
      spec:
        modelName: deepseek15b
        instanceType: ml.g5.24xlarge
        invocationEndpoint: v1/chat/completions
        replicas: 2
        modelSourceConfig:
          fsxStorage:
            fileSystemId: $FSX_FILE_SYSTEM_ID
          modelLocation: deepseek-1-5b
          modelSourceType: fsx
        worker:
          environmentVariables:
          - name: HF_MODEL_ID
            value: /opt/ml/model
          - name: SAGEMAKER_PROGRAM
            value: inference.py
          - name: SAGEMAKER_SUBMIT_DIRECTORY
            value: /opt/ml/model/code
          - name: MODEL_CACHE_ROOT
            value: /opt/ml/model
          - name: SAGEMAKER_ENV
            value: '1'
          image: 763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0
          modelInvocationPort:
            containerPort: 8080
            name: http
          modelVolumeMount:
            mountPath: /opt/ml/model
            name: model-weights
          resources:
            limits:
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
            requests:
              cpu: 30000m
              memory: 100Gi
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
      EOF
      ```

------
#### [ Using Amazon S3 as the model source ]

   1. Set up a SageMaker endpoint name.

      ```
      export SAGEMAKER_ENDPOINT_NAME="deepseek15b-s3"
      ```

   1. Configure the Amazon S3 bucket location where the model is located.

      ```
      export S3_MODEL_LOCATION="deepseek-qwen-1-5b"
      ```

   1. The following is an example yaml file for creating an endpoint with Amazon S3 and a DeepSeek model.
**Note**  
For clusters with GPU partitioning enabled, replace `nvidia.com/gpu` with the appropriate MIG resource name such as `nvidia.com/mig-1g.10gb`. For more information, see [Task Submission with MIG](sagemaker-hyperpod-eks-gpu-partitioning-task-submission.md).

      ```
      cat <<EOF> deploy_s3_inference.yaml
      ---
      apiVersion: inference.sagemaker.aws.amazon.com/v1alpha1
      kind: InferenceEndpointConfig
      metadata:
        name: $SAGEMAKER_ENDPOINT_NAME
        namespace: $CLUSTER_NAMESPACE
      spec:
        modelName: deepseek15b
        endpointName: $SAGEMAKER_ENDPOINT_NAME
        instanceType: ml.g5.8xlarge
        invocationEndpoint: invocations
        modelSourceConfig:
          modelSourceType: s3
          s3Storage:
            bucketName: $S3_MODEL_LOCATION
            region: $REGION
          modelLocation: deepseek15b
          prefetchEnabled: true
        worker:
          resources:
            limits:
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
            requests:
              nvidia.com/gpu: 1
              # For MIG-enabled instances, use: nvidia.com/mig-1g.10gb: 1
              cpu: 25600m
              memory: 102Gi
          image: 763104351884.dkr.ecr.us-east-2.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu124
          modelInvocationPort:
            containerPort: 8000
            name: http
          modelVolumeMount:
            name: model-weights
            mountPath: /opt/ml/model
          environmentVariables:
            - name: PYTHONHASHSEED
              value: "123"
            - name: OPTION_ROLLING_BATCH
              value: "vllm"
            - name: SERVING_CHUNKED_READ_TIMEOUT
              value: "480"
            - name: DJL_OFFLINE
              value: "true"
            - name: NUM_SHARD
              value: "1"
            - name: SAGEMAKER_PROGRAM
              value: "inference.py"
            - name: SAGEMAKER_SUBMIT_DIRECTORY
              value: "/opt/ml/model/code"
            - name: MODEL_CACHE_ROOT
              value: "/opt/ml/model"
            - name: SAGEMAKER_MODEL_SERVER_WORKERS
              value: "1"
            - name: SAGEMAKER_MODEL_SERVER_TIMEOUT
              value: "3600"
            - name: OPTION_TRUST_REMOTE_CODE
              value: "true"
            - name: OPTION_ENABLE_REASONING
              value: "true"
            - name: OPTION_REASONING_PARSER
              value: "deepseek_r1"
            - name: SAGEMAKER_CONTAINER_LOG_LEVEL
              value: "20"
            - name: SAGEMAKER_ENV
              value: "1"
            - name: MODEL_SERVER_TYPE
              value: "vllm"
            - name: SESSION_KEY
              value: "x-user-id"
      EOF
      ```

------
#### [ Using Amazon S3 with KV caching and intelligent routing ]

   1. Set up a SageMaker endpoint name.

      ```
      export SAGEMAKER_ENDPOINT_NAME="deepseek15b-s3"
      ```

   1. Configure the Amazon S3 bucket location where the model is located.

      ```
      export S3_MODEL_LOCATION="deepseek-qwen-1-5b"
      ```

   1. The following is an example YAML file for creating an endpoint with Amazon S3 and a Llama model, with KV caching and intelligent routing enabled.

      ```
      cat <<EOF> deploy_s3_inference.yaml
      ---
      apiVersion: inference.sagemaker.aws.amazon.com/v1
      kind: InferenceEndpointConfig
      metadata:
        name: $SAGEMAKER_ENDPOINT_NAME
        namespace: $CLUSTER_NAMESPACE
      spec:
        modelName: Llama-3.1-8B-Instruct
        instanceType: ml.g5.24xlarge
        invocationEndpoint: v1/chat/completions
        replicas: 2
        modelSourceConfig:
          modelSourceType: s3
          s3Storage:
            bucketName: bugbash-ada-resources
            region: us-west-2
          modelLocation: models/Llama-3.1-8B-Instruct
          prefetchEnabled: false
        kvCacheSpec:
          enableL1Cache: true
      #    enableL2Cache: true
      #    l2CacheSpec:
      #      l2CacheBackend: redis/sagemaker
      #      l2CacheLocalUrl: redis://redis.redis-system.svc.cluster.local:6379
        intelligentRoutingSpec:
          enabled: true
        tlsConfig:
          tlsCertificateOutputS3Uri: s3://sagemaker-lmcache-fceb9062-tls-6f6ee470
        metrics:
          enabled: true
          modelMetrics:
            port: 8000
        loadBalancer:
          healthCheckPath: /health
        worker:
          resources:
            limits:
              nvidia.com/gpu: "4"
            requests:
              cpu: "6"
              memory: 30Gi
              nvidia.com/gpu: "4"
          image: lmcache/vllm-openai:latest
          args:
            - "/opt/ml/model"
            - "--max-model-len"
            - "20000"
            - "--tensor-parallel-size"
            - "4"
          modelInvocationPort:
            containerPort: 8000
            name: http
          modelVolumeMount:
            name: model-weights
            mountPath: /opt/ml/model
          environmentVariables:
            - name: PYTHONHASHSEED
              value: "123"
            - name: OPTION_ROLLING_BATCH
              value: "vllm"
            - name: SERVING_CHUNKED_READ_TIMEOUT
              value: "480"
            - name: DJL_OFFLINE
              value: "true"
            - name: NUM_SHARD
              value: "1"
            - name: SAGEMAKER_PROGRAM
              value: "inference.py"
            - name: SAGEMAKER_SUBMIT_DIRECTORY
              value: "/opt/ml/model/code"
            - name: MODEL_CACHE_ROOT
              value: "/opt/ml/model"
            - name: SAGEMAKER_MODEL_SERVER_WORKERS
              value: "1"
            - name: SAGEMAKER_MODEL_SERVER_TIMEOUT
              value: "3600"
            - name: OPTION_TRUST_REMOTE_CODE
              value: "true"
            - name: OPTION_ENABLE_REASONING
              value: "true"
            - name: OPTION_REASONING_PARSER
              value: "deepseek_r1"
            - name: SAGEMAKER_CONTAINER_LOG_LEVEL
              value: "20"
            - name: SAGEMAKER_ENV
              value: "1"
            - name: MODEL_SERVER_TYPE
              value: "vllm"
            - name: SESSION_KEY
              value: "x-user-id"
      EOF
      ```

------

## Configure KV caching and intelligent routing for improved performance
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-cache-route"></a>

1. Enable KV caching by setting `enableL1Cache` and `enableL2Cache` to `true`. Then, set `l2CacheBackend` to `redis` or `tieredstorage` and, if you chose `redis`, update `l2CacheLocalUrl` with the Redis cluster URL.

   ```
     kvCacheSpec:
       enableL1Cache: true
       enableL2Cache: true
       l2CacheSpec:
         l2CacheBackend: <redis | tieredstorage>
         l2CacheLocalUrl: <redis cluster URL if l2CacheBackend is redis >
   ```
**Note**  
If the Redis cluster is not in the same Amazon VPC as the HyperPod cluster, encryption of data in transit is not guaranteed.
**Note**  
`l2CacheLocalUrl` is not required if `tieredstorage` is selected.
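
For instance, a Redis-backed configuration would look like the following (the Redis URL is a placeholder for your own cluster endpoint):

```
  kvCacheSpec:
    enableL1Cache: true
    enableL2Cache: true
    l2CacheSpec:
      l2CacheBackend: redis
      l2CacheLocalUrl: redis://my-redis-cluster.example.com:6379
```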

1. Enable intelligent routing by setting `enabled` to `true` under `intelligentRoutingSpec`. You can specify which routing strategy to use under `routingStrategy`. If no routing strategy is specified, it defaults to `prefixaware`.

   ```
   intelligentRoutingSpec:
       enabled: true
       routingStrategy: <routing strategy to use>
   ```

1. Enable router metrics and caching metrics by setting `enabled` to `true` under `metrics`. The `port` value needs to be the same as the `containerPort` value under `modelInvocationPort`.

   ```
   metrics:
       enabled: true
       modelMetrics:
         port: <port value>
       ...
       modelInvocationPort:
         containerPort: <port value>
   ```

## Deploy your model from Amazon S3 or Amazon FSx
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-deploy"></a>

1. Get the Amazon EKS cluster name from the HyperPod cluster ARN for kubectl authentication.

   ```
   export EKS_CLUSTER_NAME=$(aws --region $REGION sagemaker describe-cluster --cluster-name $HYPERPOD_CLUSTER_NAME \
     --query 'Orchestrator.Eks.ClusterArn' --output text | \
     cut -d'/' -f2)
   aws eks update-kubeconfig --name $EKS_CLUSTER_NAME --region $REGION
   ```
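
   The `cut -d'/' -f2` step relies on the fact that an EKS cluster ARN ends in `cluster/<cluster-name>`, so splitting on `/` and taking the second field yields the cluster name. A local illustration with a made-up ARN:

   ```
   # Hypothetical ARN for illustration only
   ARN="arn:aws:eks:us-west-2:123456789012:cluster/my-eks-cluster"
   # Same extraction as above: split on '/' and take field 2
   EKS_CLUSTER_NAME=$(echo "$ARN" | cut -d'/' -f2)
   echo "$EKS_CLUSTER_NAME"   # my-eks-cluster
   ```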

1. Deploy your InferenceEndpointConfig model with one of the following options:

------
#### [ Deploy with Amazon FSx as a source ]

   ```
   kubectl apply -f deploy_fsx_lustre_inference.yaml
   ```

------
#### [ Deploy with Amazon S3 as a source ]

   ```
   kubectl apply -f deploy_s3_inference.yaml
   ```

------

## Verify the status of your deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-verify"></a>

1. Check if the model successfully deployed.

   ```
   kubectl describe InferenceEndpointConfig $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check that the endpoint is successfully created.

   ```
   kubectl describe SageMakerEndpointRegistration $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Test the deployed endpoint to verify it's working correctly. This step confirms that your model is successfully deployed and can process inference requests.

   ```
   aws sagemaker-runtime invoke-endpoint \
     --endpoint-name $SAGEMAKER_ENDPOINT_NAME \
     --content-type "application/json" \
     --body '{"inputs": "What is AWS SageMaker?"}' \
     --region $REGION \
     --cli-binary-format raw-in-base64-out \
     /dev/stdout
   ```
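
   The `--cli-binary-format raw-in-base64-out` flag tells AWS CLI v2 to treat the `--body` value as raw text rather than expecting it to be base64-encoded already. A local sketch of the difference (no AWS call is made here):

   ```
   BODY='{"inputs": "What is AWS SageMaker?"}'
   # Without the flag, CLI v2 expects the blob pre-encoded, like this:
   printf '%s' "$BODY" | base64
   # With raw-in-base64-out, the raw JSON is passed through unchanged:
   printf '%s\n' "$BODY"
   ```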

## Manage your deployment
<a name="sagemaker-hyperpod-model-deployment-deploy-ftm-manage"></a>

When you're finished testing your deployment, use the following commands to clean up your resources.

**Note**  
Verify that you no longer need the deployed model or stored data before proceeding.

**Clean up your resources**

1. Delete the inference deployment and associated Kubernetes resources. This stops the running model containers and removes the SageMaker endpoint.

   ```
   kubectl delete inferenceendpointconfig $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Verify the cleanup was done successfully.

   ```
   # Check that Kubernetes resources are removed
   kubectl get pods,svc,deployment,InferenceEndpointConfig,sagemakerendpointregistration -n $CLUSTER_NAMESPACE
   ```

   ```
   # Verify SageMaker endpoint is deleted (should return error or empty)
   aws sagemaker describe-endpoint --endpoint-name $SAGEMAKER_ENDPOINT_NAME --region $REGION
   ```
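
   If you script the cleanup, you can poll until `describe-endpoint` starts failing, which indicates the endpoint is gone. A runnable sketch of that pattern, with a hypothetical stub function standing in for the real `aws sagemaker describe-endpoint` call:

   ```
   # endpoint_exists is a stand-in for:
   #   aws sagemaker describe-endpoint --endpoint-name $SAGEMAKER_ENDPOINT_NAME --region $REGION
   endpoint_exists() { [ -e /tmp/endpoint_marker ]; }
   rm -f /tmp/endpoint_marker   # simulate the endpoint having been deleted

   STATUS=pending
   for attempt in 1 2 3; do
     if ! endpoint_exists; then
       STATUS=deleted
       echo "endpoint deleted"
       break
     fi
     sleep 5
   done
   ```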

**Troubleshooting**

Use these debugging commands if your deployment isn't working as expected.

1. Check the Kubernetes deployment status.

   ```
   kubectl describe deployment $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check the InferenceEndpointConfig status to see the high-level deployment state and any configuration issues.

   ```
   kubectl describe InferenceEndpointConfig $SAGEMAKER_ENDPOINT_NAME -n $CLUSTER_NAMESPACE
   ```

1. Check status of all Kubernetes objects. Get a comprehensive view of all related Kubernetes resources in your namespace. This gives you a quick overview of what's running and what might be missing.

   ```
   kubectl get pods,svc,deployment,InferenceEndpointConfig,sagemakerendpointregistration -n $CLUSTER_NAMESPACE
   ```

# Autoscaling policies for your HyperPod inference model deployment
<a name="sagemaker-hyperpod-model-deployment-autoscaling"></a>

The following information provides practical examples and configurations for implementing autoscaling policies on Amazon SageMaker HyperPod inference model deployments. 

You'll learn how to configure automatic scaling using the built-in `autoScalingSpec` in your deployment YAML files, as well as how to create standalone KEDA `ScaledObject` configurations for advanced scaling scenarios. The examples cover scaling triggers based on CloudWatch metrics, Amazon SQS queue lengths, Prometheus queries, and resource utilization metrics like CPU and memory. 

## Using autoScalingSpec in deployment YAML
<a name="sagemaker-hyperpod-model-deployment-autoscaling-yaml"></a>

The Amazon SageMaker HyperPod inference operator provides built-in autoscaling capabilities for model deployments using metrics from CloudWatch and Amazon Managed Prometheus (AMP). The following deployment YAML example includes an `autoScalingSpec` section that defines the configuration values for scaling your model deployment.

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: JumpStartModel
metadata:
  name: deepseek-sample624
  namespace: ns-team-a
spec:
  sageMakerEndpoint:
    name: deepsek7bsme624
  model:
    modelHubName: SageMakerPublicHub
    modelId: deepseek-llm-r1-distill-qwen-1-5b
    modelVersion: 2.0.4
  server:
    instanceType: ml.g5.8xlarge
  metrics:
    enabled: true
  environmentVariables:
    - name: SAMPLE_ENV_VAR
      value: "sample_value"
  maxDeployTimeInSeconds: 1800
  tlsConfig:
    tlsCertificateOutputS3Uri: "s3://{USER}-tls-bucket-{REGION}/certificates"
  autoScalingSpec:
    minReplicaCount: 0
    maxReplicaCount: 5
    pollingInterval: 15
    initialCooldownPeriod: 60
    cooldownPeriod: 120
    scaleDownStabilizationTime: 60
    scaleUpStabilizationTime: 0
    cloudWatchTrigger:
        name: "SageMaker-Invocations"
        namespace: "AWS/SageMaker"
        useCachedMetrics: false
        metricName: "Invocations"
        targetValue: 10.5
        activationTargetValue: 5.0
        minValue: 0.0
        metricCollectionStartTime: 300
        metricCollectionPeriod: 30
        metricStat: "Sum"
        metricType: "Average"
        dimensions:
          - name: "EndpointName"
            value: "deepsek7bsme624"
          - name: "VariantName"
            value: "AllTraffic"
    prometheusTrigger: 
        name: "Prometheus-Trigger"
        useCachedMetrics: false
        serverAddress: http://<prometheus-host>:9090
        query: sum(rate(http_requests_total{deployment="my-deployment"}[2m]))
        targetValue: 10.0
        activationTargetValue: 5.0
        namespace: "namespace"
        customHeaders: "X-Client-Id=cid"
        metricType: "Value"
```

### Explanation of fields used in deployment YAML
<a name="sagemaker-hyperpod-model-deployment-autoscaling-fields"></a>

`minReplicaCount` (Optional, Integer)  
Specifies the minimum number of model deployment replicas to maintain in the cluster. During scale-down events, the deployment scales down to this minimum number of pods. Must be greater than or equal to 0. Default: 1.

`maxReplicaCount` (Optional, Integer)  
Specifies the maximum number of model deployment replicas to maintain in the cluster. Must be greater than or equal to `minReplicaCount`. During scale-up events, the deployment scales up to this maximum number of pods. Default: 5.

`pollingInterval` (Optional, Integer)  
The time interval in seconds for querying metrics. Minimum: 0. Default: 30 seconds.

`cooldownPeriod` (Optional, Integer)  
The time interval in seconds to wait before scaling down from 1 to 0 pods during a scale-down event. Only applies when `minReplicaCount` is set to 0. Minimum: 0. Default: 300 seconds.

`initialCooldownPeriod` (Optional, Integer)  
The time interval in seconds to wait before scaling down from 1 to 0 pods during initial deployment. Only applies when `minReplicaCount` is set to 0. Minimum: 0. Default: 300 seconds.

`scaleDownStabilizationTime` (Optional, Integer)  
The stabilization time window in seconds after a scale-down trigger activates before scaling down occurs. Minimum: 0. Default: 300 seconds.

`scaleUpStabilizationTime` (Optional, Integer)  
The stabilization time window in seconds after a scale-up trigger activates before scaling up occurs. Minimum: 0. Default: 0 seconds.

`cloudWatchTrigger`  
The trigger configuration for CloudWatch metrics used in autoscaling decisions. The following fields are available in `cloudWatchTrigger`:  
+ `name` (Optional, String) - Name for the CloudWatch trigger. If not provided, uses the default format: `<model-deployment-name>-scaled-object-cloudwatch-trigger`.
+ `useCachedMetrics` (Optional, Boolean) - Determines whether to cache metrics queried by KEDA. KEDA queries metrics using the pollingInterval, while the Horizontal Pod Autoscaler (HPA) requests metrics from KEDA every 15 seconds. When set to true, queried metrics are cached and used to serve HPA requests. Default: true.
+ `namespace` (Required, String) - The CloudWatch namespace for the metric to query.
+ `metricName` (Required, String) - The name of the CloudWatch metric.
+ `dimensions` (Optional, List) - The list of dimensions for the metric. Each dimension includes a name (dimension name - String) and value (dimension value - String).
+ `targetValue` (Required, Float) - The target value for the CloudWatch metric used in autoscaling decisions.
+ `activationTargetValue` (Optional, Float) - The target value for the CloudWatch metric used when scaling from 0 to 1 pod. Only applies when `minReplicaCount` is set to 0. Default: 0.
+ `minValue` (Optional, Float) - The value to use when the CloudWatch query returns no data. Default: 0.
+ `metricCollectionStartTime` (Optional, Integer) - The start time for the metric query, calculated as T-metricCollectionStartTime. Must be greater than or equal to metricCollectionPeriod. Default: 300 seconds.
+ `metricCollectionPeriod` (Optional, Integer) - The duration for the metric query in seconds. Must be a CloudWatch-supported value (1, 5, 10, 30, or a multiple of 60). Default: 300 seconds.
+ `metricStat` (Optional, String) - The statistic type for the CloudWatch query. Default: `Average`.
+ `metricType` (Optional, String) - Defines how the metric is used for scaling calculations. Default: `Average`. Allowed values: `Average`, `Value`.
  + **Average**: Desired replicas = ceil (Metric Value) / (targetValue)
  + **Value**: Desired replicas = (current replicas) × ceil (Metric Value) / (targetValue)

`prometheusTrigger`  
The trigger configuration for Amazon Managed Prometheus (AMP) metrics used in autoscaling decisions. The following fields are available in `prometheusTrigger`:  
+ `name` (Optional, String) - Name for the Prometheus trigger. If not provided, uses the default format: `<model-deployment-name>-scaled-object-prometheus-trigger`.
+ `useCachedMetrics` (Optional, Boolean) - Determines whether to cache metrics queried by KEDA. KEDA queries metrics using the pollingInterval, while the Horizontal Pod Autoscaler (HPA) requests metrics from KEDA every 15 seconds. When set to true, queried metrics are cached and used to serve HPA requests. Default: true.
+ `serverAddress` (Required, String) - The address of the AMP server. Must use the format: `https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>`.
+ `query` (Required, String) - The PromQL query used for the metric. Must return a scalar value.
+ `targetValue` (Required, Float) - The target value for the Prometheus metric used in autoscaling decisions.
+ `activationTargetValue` (Optional, Float) - The target value for the Prometheus metric used when scaling from 0 to 1 pod. Only applies when `minReplicaCount` is set to 0. Default: 0.
+ `namespace` (Optional, String) - The namespace to use for namespaced queries. Default: empty string (`""`).
+ `customHeaders` (Optional, String) - Custom headers to include when querying the Prometheus endpoint. Default: empty string ("").
+ `metricType` (Optional, String) - Defines how the metric is used for scaling calculations. Default: `Average`. Allowed values: `Average`, `Value`.
  + **Average**: Desired replicas = ceil (Metric Value) / (targetValue)
  + **Value**: Desired replicas = (current replicas) × ceil (Metric Value) / (targetValue)
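
The two `metricType` formulas can be checked with quick arithmetic. With an assumed metric value of 21, a `targetValue` of 10.5, and 2 current replicas:

```
CALC=$(awk 'BEGIN {
  metric = 21; target = 10.5; current = 2
  d = int(metric / target); if (d * target < metric) d++   # ceil(metric / target)
  print d "," current * d
}')
AVG=${CALC%,*}   # Average mode result
VAL=${CALC#*,}   # Value mode result
echo "Average: $AVG, Value: $VAL"
```

Here `Average` mode targets 2 replicas, while `Value` mode multiplies by the current replica count and targets 4.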

## Using KEDA ScaledObject YAML definitions through kubectl
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl"></a>

In addition to configuring autoscaling through the `autoScalingSpec` section in your deployment YAML, you can create and apply standalone KEDA `ScaledObject` YAML definitions using kubectl.

This approach provides greater flexibility for complex scaling scenarios and allows you to manage autoscaling policies independently from your model deployments. KEDA `ScaledObject` configurations support a [wide range of scaling triggers](https://keda.sh/docs/2.17/scalers/), including CloudWatch metrics, Amazon SQS queue lengths, Prometheus queries, and resource-based metrics such as CPU and memory utilization. You can apply these configurations to existing model deployments by referencing the deployment name in the `scaleTargetRef` section of the `ScaledObject` specification.

**Note**  
Ensure that the KEDA operator role provided during the HyperPod inference operator installation has adequate permissions to query the metrics defined in the scaled object triggers.

### CloudWatch metrics
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-cw"></a>

The following KEDA YAML policy uses CloudWatch metrics as a trigger to perform autoscaling on a Kubernetes deployment. The policy queries the number of invocations for a SageMaker endpoint and scales the number of deployment pods. The complete list of parameters supported by KEDA for the `aws-cloudwatch` trigger can be found at [https://keda.sh/docs/2.17/scalers/aws-cloudwatch/](https://keda.sh/docs/2.17/scalers/aws-cloudwatch/).

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: aws-cloudwatch
    metadata:
      namespace: AWS/SageMaker
      metricName: Invocations
      targetMetricValue: "1"
      minMetricValue: "1"
      awsRegion: "us-west-2"
      dimensionName: EndpointName;VariantName
      dimensionValue: $ENDPOINT_NAME;$VARIANT_NAME
      metricStatPeriod: "30" # seconds
      metricStat: "Sum"
      identityOwner: operator
```

### Amazon SQS metrics
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-sqs"></a>

The following KEDA YAML policy uses Amazon SQS metrics as a trigger to perform autoscaling on a Kubernetes deployment. The policy monitors the number of messages in an Amazon SQS queue and scales the number of deployment pods accordingly. The complete list of parameters supported by KEDA for the `aws-sqs-queue` trigger can be found at [https://keda.sh/docs/2.17/scalers/aws-sqs/](https://keda.sh/docs/2.17/scalers/aws-sqs/).

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.eu-west-1.amazonaws.com/account_id/QueueName
      queueLength: "5"  # Default: "5"
      awsRegion: "us-west-1"
      scaleOnInFlight: true
      identityOwner: operator
```
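
With this trigger, `queueLength` acts as a target number of messages per pod: KEDA steers toward roughly ceil(visible messages / queueLength) pods, clamped between `minReplicaCount` and `maxReplicaCount`. A quick arithmetic sketch with illustrative numbers (17 messages, `queueLength` of 5, the min/max from the policy above):

```
DESIRED=$(awk 'BEGIN {
  messages = 17; per_pod = 5; min_r = 1; max_r = 4
  d = int(messages / per_pod); if (d * per_pod < messages) d++   # ceil
  if (d < min_r) d = min_r                                        # clamp to minReplicaCount
  if (d > max_r) d = max_r                                        # clamp to maxReplicaCount
  print d
}')
echo "$DESIRED"   # 4
```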

### Prometheus metrics
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-prometheus"></a>

The following KEDA YAML policy uses Prometheus metrics as a trigger to perform autoscaling on a Kubernetes deployment. The policy queries the HTTP request rate for a deployment and scales the number of deployment pods. The complete list of parameters supported by KEDA for the `prometheus` trigger can be found at [https://keda.sh/docs/2.17/scalers/prometheus/](https://keda.sh/docs/2.17/scalers/prometheus/).

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://<prometheus-host>:9090
      query: avg(rate(http_requests_total{deployment="$DEPLOYMENT_NAME"}[2m])) # Note: query must return a vector/scalar single element response
      threshold: '100.50'
      namespace: example-namespace  # for namespaced queries, eg. Thanos
      customHeaders: X-Client-Id=cid,X-Tenant-Id=tid,X-Organization-Id=oid # Optional. Custom headers to include in query. In case of auth header, use the custom authentication or relevant authModes.
      unsafeSsl: "false" #  Default is `false`, Used for skipping certificate check when having self-signed certs for Prometheus endpoint    
      timeout: 1000 # Custom timeout for the HTTP client used in this scaler
      identityOwner: operator
```

### CPU metrics
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-cpu"></a>

The following KEDA YAML policy uses the CPU metric as a trigger to perform autoscaling on a Kubernetes deployment. The policy monitors the CPU utilization of the deployment pods and scales the deployment accordingly. The complete list of parameters supported by KEDA for the `cpu` trigger can be found at [https://keda.sh/docs/2.17/scalers/cpu/](https://keda.sh/docs/2.17/scalers/cpu/).

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: cpu
    metricType: Utilization # Allowed types are 'Utilization' or 'AverageValue'
    metadata:
        value: "60"
        containerName: "" # Optional. You can use this to target a specific container
```

### Memory metrics
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-memory"></a>

The following KEDA YAML policy uses the memory metric as a trigger to perform autoscaling on a Kubernetes deployment. The policy monitors the memory utilization of the deployment pods and scales the deployment accordingly. The complete list of parameters supported by KEDA for the `memory` trigger can be found at [https://keda.sh/docs/2.17/scalers/memory/](https://keda.sh/docs/2.17/scalers/memory/).

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 1 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  triggers:
  - type: memory
    metricType: Utilization # Allowed types are 'Utilization' or 'AverageValue'
    metadata:
        value: "60"
        containerName: "" # Optional. You can use this to target a specific container in a pod
```

## Sample Prometheus policy for scaling down to 0 pods
<a name="sagemaker-hyperpod-model-deployment-autoscaling-kubectl-sample"></a>

The following KEDA YAML policy uses a Prometheus metrics query as a trigger to perform autoscaling on a Kubernetes deployment. This policy uses a `minReplicaCount` of 0, which enables KEDA to scale the deployment down to 0 pods. When `minReplicaCount` is set to 0, you need to provide an activation criterion to bring up the first pod after the pods scale down to 0. For the Prometheus trigger, this value is provided by `activationThreshold`. For the SQS queue, it comes from `activationQueueLength`.

**Note**  
While using `minReplicaCount` of 0, make sure the activation does not depend on a metric that is being generated by the pods. When the pods scale down to 0, that metric will never be generated and the pods will not scale up again.

```
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invocations-scaledobject # name of the scaled object that will be created by this
  namespace: ns-team-a # namespace that this scaled object targets
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: $DEPLOYMENT_NAME # name of the model deployment
  minReplicaCount: 0 # minimum number of pods to be maintained
  maxReplicaCount: 4 # maximum number of pods to scale to
  pollingInterval: 10
  cooldownPeriod:  30
  initialCooldownPeriod:  180 # time before scaling down the pods after initial deployment
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://<prometheus-host>:9090
      query: sum(rate(http_requests_total{deployment="my-deployment"}[2m])) # Note: query must return a vector/scalar single element response
      threshold: '100.50'
      activationThreshold: '5.5' # Required if minReplicaCount is 0 for initial scaling
      namespace: example-namespace
      timeout: 1000
      identityOwner: operator
```
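
The activation logic is separate from the scaling threshold: when the deployment is at 0 replicas, KEDA only brings up the first pod once the metric exceeds `activationThreshold`; after that, `threshold` drives the usual scaling. A sketch of that 0-to-1 decision using the values from the policy above and an assumed metric reading of 6.0:

```
REPLICAS=$(awk 'BEGIN {
  metric = 6.0; activation = 5.5; current = 0
  # At 0 replicas, only activationThreshold matters for the 0 -> 1 step
  if (current == 0 && metric > activation) current = 1
  print current
}')
echo "$REPLICAS"   # 1
```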

**Note**  
The CPU and Memory triggers can scale to 0 only when you define at least one additional scaler that is not CPU or Memory (for example, SQS + CPU, or Prometheus + CPU). 

# Implementing inference observability on HyperPod clusters
<a name="sagemaker-hyperpod-model-deployment-observability"></a>

Amazon SageMaker HyperPod provides comprehensive inference observability capabilities that enable data scientists and machine learning engineers to monitor and optimize their deployed models. This solution is enabled through SageMaker HyperPod Observability and automatically collects performance metrics for inference workloads, delivering production-ready monitoring through integrated [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/oss/) dashboards.

With metrics enabled by default, the platform captures essential model performance data including invocation latency, concurrent requests, error rates, and token-level metrics, while providing standard Prometheus endpoints for customers who prefer to implement custom observability solutions.

**Note**  
This topic contains a deep dive into implementing inference observability on HyperPod clusters. For a more general reference, see [Cluster and task observability](sagemaker-hyperpod-eks-cluster-observability-cluster.md).

This guide provides step-by-step instructions for implementing and using inference observability on your HyperPod clusters. You'll learn how to configure metrics in your deployment YAML files, access monitoring dashboards based on your role (administrator, data scientist, or machine learning engineer), integrate with custom observability solutions using Prometheus endpoints, and troubleshoot common monitoring issues.

## Supported inference metrics
<a name="sagemaker-hyperpod-model-deployment-observability-metrics"></a>

**Invocation metrics**

These metrics capture model inference request and response data, providing universal visibility regardless of your model type or serving framework. When inference metrics are enabled, these metrics are calculated at invocation time and exported to your monitoring infrastructure.
+ `model_invocations_total` - Total number of invocation requests to the model 
+ `model_errors_total` - Total number of errors during model invocation
+ `model_concurrent_requests` - Active concurrent model requests
+ `model_latency_milliseconds` - Model invocation latency in milliseconds
+ `model_ttfb_milliseconds` - Model time to first byte latency in milliseconds
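
Once these metrics are scraped into Prometheus, the counters above support standard rate queries. For example (a sketch; the `resource_name` label comes from the metric dimensions described later in this topic):

```
# Requests per second per deployment, over a 5-minute window
sum(rate(model_invocations_total[5m])) by (resource_name)

# Error ratio across all deployments over the same window
sum(rate(model_errors_total[5m])) / sum(rate(model_invocations_total[5m]))
```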

**Model container metrics**

These metrics provide insights into the internal operations of your model containers, including token processing, queue management, and framework-specific performance indicators. The metrics available depend on your model serving framework:
+ [TGI container metrics](https://huggingface.co/docs/text-generation-inference/en/reference/metrics) 
+ [LMI container metrics](https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md) 

**Metric dimensions**

All inference metrics include comprehensive labels that enable detailed filtering and analysis across your deployments:
+ **Cluster Identity:**
  + `cluster_id` - The unique ID of the HyperPod cluster
  + `cluster_name` - The name of the HyperPod cluster
+ **Resource Identity:**
  + `resource_name` - Deployment name (for example, "jumpstart-model-deployment")
  + `resource_type` - Type of deployment (jumpstart, inference-endpoint)
  + `namespace` - Kubernetes namespace for multi-tenancy
+ **Model Characteristics:**
  + `model_name` - Specific model identifier (for example, "llama-2-7b-chat")
  + `model_version` - Model version for A/B testing and rollbacks
  + `model_container_type` - Serving framework (TGI, LMI, -)
+ **Infrastructure Context:**
  + `pod_name` - Individual pod identifier for debugging
  + `node_name` - Kubernetes node for resource correlation
  + `instance_type` - EC2 instance type for cost analysis
+ **Operational Context:**
  + `metric_source` - Collection point (reverse-proxy, model-container)
  + `task_type` - Workload classification (inference)

## Configure metrics in deployment YAML
<a name="sagemaker-hyperpod-model-deployment-observability-yaml"></a>

Amazon SageMaker HyperPod enables inference metrics by default for all model deployments, providing immediate observability without additional configuration. You can customize metrics behavior by modifying the deployment YAML configuration to enable or disable metrics collection based on your specific requirements.

**Deploy a model from JumpStart**

Use the following YAML configuration to deploy a JumpStart model with metrics enabled:

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: JumpStartModel
metadata:
  name: mistral-model
  namespace: ns-team-a
spec:
  model:
    modelId: "huggingface-llm-mistral-7b-instruct"
    modelVersion: "3.19.0"
  metrics:
    enabled: true # Default: true (can be set to false to disable)
  replicas: 2
  sageMakerEndpoint:
    name: "mistral-model-sm-endpoint"
  server:
    instanceType: "ml.g5.12xlarge"
    executionRole: "arn:aws:iam::123456789:role/SagemakerRole"
  tlsConfig:
    tlsCertificateOutputS3Uri: s3://hyperpod/mistral-model/certs/
```

**Deploy custom and fine-tuned models from Amazon S3 or Amazon FSx**

Configure custom inference endpoints with detailed metrics settings using the following YAML:

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: inferenceendpoint-deepseeks
  namespace: ns-team-a
spec:
  modelName: deepseeks
  modelVersion: 1.0.1
  metrics:
    enabled: true # Default: true (can be set to false to disable)
    metricsScrapeIntervalSeconds: 30 # Optional: if overriding the default 15s
    modelMetricsConfig:
        port: 8000 # Optional: if overriding; defaults to the containerPort value under modelInvocationPort in the InferenceEndpointConfig spec (8080)
        path: "/custom-metrics" # Optional: if overriding the default "/metrics"
  endpointName: deepseek-sm-endpoint
  instanceType: ml.g5.12xlarge
  modelSourceConfig:
    modelSourceType: s3
    s3Storage:
      bucketName: model-weights
      region: us-west-2
    modelLocation: deepseek
    prefetchEnabled: true
  invocationEndpoint: invocations
  worker:
    resources:
      limits:
        nvidia.com/gpu: 1
      requests:
        nvidia.com/gpu: 1
        cpu: 25600m
        memory: 102Gi
    image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.32.0-lmi14.0.0-cu124
    modelInvocationPort:
      containerPort: 8080
      name: http
    modelVolumeMount:
      name: model-weights
      mountPath: /opt/ml/model
    environmentVariables: ...
  tlsConfig:
    tlsCertificateOutputS3Uri: s3://hyperpod/inferenceendpoint-deepseeks4/certs/
```

**Note**  
To disable metrics for specific deployments, set `metrics.enabled: false` in your YAML configuration.

## Monitor and troubleshoot inference workloads by role
<a name="sagemaker-hyperpod-model-deployment-observability-role"></a>

Amazon SageMaker HyperPod provides comprehensive observability capabilities that support different user workflows, from initial cluster setup to advanced performance troubleshooting. Use the following guidance based on your role and monitoring requirements.

**HyperPod admin**

**Your responsibility:** Enable observability infrastructure and ensure system health across the entire cluster.

**What you need to know:**
+ Cluster-wide observability provides infrastructure metrics for all workloads
+ One-click setup deploys monitoring stack with pre-configured dashboards
+ Infrastructure metrics are separate from model-specific inference metrics

**What you need to do:**

1. Navigate to the HyperPod console.

1. Select your cluster.

1. Go to the HyperPod cluster details page, where you will see a new option to install the HyperPod observability add-on.

1. Choose the **Quick install** option. After 1-2 minutes, all the steps complete and you will see the Grafana dashboard and Prometheus workspace details.

This single action automatically deploys the EKS Add-on, configures observability operators, and provisions pre-built dashboards in Grafana.

**Data scientist**

**Your responsibility:** Deploy models efficiently and monitor their basic performance.

**What you need to know:**
+ Metrics are automatically enabled when you deploy models
+ Grafana dashboards provide immediate visibility into model performance
+ You can filter dashboards to focus on your specific deployments

**What you need to do:**

1. Deploy your model using your preferred method:

   1. Amazon SageMaker Studio UI

   1. HyperPod CLI commands

   1. Python SDK in notebooks

   1. kubectl with YAML configurations

1. Access your model metrics:

   1. Open Amazon SageMaker Studio

   1. Navigate to HyperPod Cluster and open Grafana Dashboard

   1. Select Inference Dashboard

   1. Apply filters to view your specific model deployment

1. Monitor key performance indicators:

   1. Track model latency and throughput

   1. Monitor error rates and availability

   1. Review resource utilization trends

After this is complete, you'll have immediate visibility into your model's performance without additional configuration, enabling quick identification of deployment issues or performance changes.

**Machine learning engineer (MLE)**

**Your responsibility:** Maintain production model performance and resolve complex performance issues.

**What you need to know:**
+ Advanced metrics include model container details like queue depths and token metrics
+ Correlation analysis across multiple metric types reveals root causes
+ Auto-scaling configurations directly impact performance during traffic spikes

**Hypothetical scenario:** A customer's chat model experiences intermittent slow responses. Users are complaining about 5-10 second delays. The MLE can leverage inference observability for systematic performance investigation.

**What you need to do:**

1. Examine the Grafana dashboard to understand the scope and severity of the performance issue:

   1. High latency alert active since 09:30

   1. P99 latency: 8.2s (normal: 2.1s)

   1. Affected time window: 09:30-10:15 (45 minutes)

1. Correlate multiple metrics to understand the system behavior during the incident:

   1. Concurrent requests: Spiked to 45 (normal: 15-20)

   1. Pod scaling: KEDA scaled 2→5 pods during incident

   1. GPU utilization: Remained normal (85-90%)

   1. Memory usage: Normal (24GB/32GB)

1. Examine the distributed system behavior since the infrastructure metrics appear normal:

   1. Node-level view: All pods concentrated on same node (poor distribution)

   1. Model container metrics: TGI queue depth shows 127 requests (normal: 5-10)

   ```
   Available in Grafana dashboard under "Model Container Metrics" panel
   Metric: tgi_queue_size{resource_name="customer-chat-llama"}
   Current value: 127 requests queued (indicates backlog)
   ```

1. Identify interconnected configuration issues:

   1. KEDA scaling policy: Too slow (30s polling interval)

   1. Scaling timeline: Scaling response lagged behind the traffic spike by 45 seconds

1. Implement targeted fixes based on the analysis:

   1. Updated KEDA polling interval: 30s → 15s

   1. Increased maxReplicas in scaling configuration

   1. Adjusted scaling thresholds to scale earlier (15 vs 20 concurrent requests)

You can systematically diagnose complex performance issues using comprehensive metrics, implement targeted fixes, and establish preventive measures to maintain consistent production model performance.
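The fixes in the scenario above could be expressed in a KEDA `ScaledObject` along these lines. This is a sketch: the resource names, Prometheus server address, and replica counts are illustrative assumptions, not values from the scenario's actual configuration.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: customer-chat-llama-scaler   # illustrative name
spec:
  scaleTargetRef:
    name: customer-chat-llama        # the deployment to scale
  pollingInterval: 15                # reduced from 30s for faster scaling response
  minReplicaCount: 2
  maxReplicaCount: 10                # raised ceiling to absorb traffic spikes
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # illustrative address
        query: sum(model_concurrent_requests{resource_name="customer-chat-llama"})
        threshold: "15"              # scale earlier (was 20 concurrent requests)
```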

## Implement your own observability integration
<a name="sagemaker-hyperpod-model-deployment-observability-diy"></a>

Amazon SageMaker HyperPod exposes inference metrics through industry-standard Prometheus endpoints, enabling integration with your existing observability infrastructure. Use this approach when you prefer to implement custom monitoring solutions or integrate with third-party observability platforms instead of using the built-in Grafana and Prometheus stack.

**Access inference metrics endpoints**

**What you need to know:**
+ Inference metrics are automatically exposed on standardized Prometheus endpoints
+ Metrics are available regardless of your model type or serving framework
+ Standard Prometheus scraping practices apply for data collection

**Inference metrics endpoint configuration:**
+ **Port:** 9113
+ **Path:** /metrics
+ **Full endpoint:** http://pod-ip:9113/metrics

**Available inference metrics:**
+ `model_invocations_total` - Total number of invocation requests to the model
+ `model_errors_total` - Total number of errors during model invocation
+ `model_concurrent_requests` - Active concurrent requests per model
+ `model_latency_milliseconds` - Model invocation latency in milliseconds
+ `model_ttfb_milliseconds` - Model time to first byte latency in milliseconds
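As an illustration of consuming these endpoints with your own tooling, the following sketch parses a small sample of Prometheus text-format output. The sample payload and label values are illustrative, not real endpoint output, and the parser handles only the simple cases shown (no timestamps, no commas inside label values).

```python
def parse_prometheus_text(payload):
    """Return {metric_name: [(labels_dict, value), ...]} from exposition text."""
    metrics = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        if "{" in name_part:
            name, _, label_blob = name_part.partition("{")
            pairs = (p.split("=", 1) for p in label_blob.rstrip("}").split(",") if p)
            labels = {k: v.strip('"') for k, v in pairs}
        else:
            name, labels = name_part, {}
        metrics.setdefault(name, []).append((labels, float(value)))
    return metrics

# Illustrative payload in the same shape as the metrics listed above
sample = """\
# HELP model_invocations_total Total invocation requests
model_invocations_total{model="customer-chat-llama"} 1042
model_latency_milliseconds{model="customer-chat-llama",quantile="0.99"} 8200
"""
parsed = parse_prometheus_text(sample)
print(parsed["model_invocations_total"][0][1])  # prints 1042.0
```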

**Access model container metrics**

**What you need to know:**
+ Model containers expose additional metrics specific to their serving framework
+ These metrics provide internal container insights like token processing and queue depths
+ Endpoint configuration varies by model container type

**For JumpStart model deployments using Text Generation Inference (TGI) containers:**
+ **Port:** 8080 (model container port)
+ **Path:** /metrics
+ **Documentation:** [https://huggingface.co/docs/text-generation-inference/en/reference/metrics](https://huggingface.co/docs/text-generation-inference/en/reference/metrics)

**For JumpStart model deployments using Large Model Inference (LMI) containers:**
+ **Port:** 8080 (model container port)
+ **Path:** /server/metrics
+ **Documentation:** [https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md](https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md)

**For custom inference endpoints (BYOD):**
+ **Port:** Customer-configured (defaults to `WorkerConfig.ModelInvocationPort.ContainerPort` in the `InferenceEndpointConfig` spec, typically 8080)
+ **Path:** Customer-configured (default /metrics)

**Implement custom observability integration**

With a custom observability integration, you're responsible for:

1. **Metrics Scraping:** Implement Prometheus-compatible scraping from the endpoints above

1. **Data Export:** Configure export to your chosen observability platform

1. **Alerting:** Set up alerting rules based on your operational requirements

1. **Dashboards:** Create visualization dashboards for your monitoring needs
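As a sketch of the scraping step, a Prometheus job that discovers model pods and collects from the inference metrics port could look like the following. The job name and relabeling rule are assumptions to adapt to your environment:

```yaml
scrape_configs:
  - job_name: hyperpod-inference-metrics   # illustrative job name
    metrics_path: /metrics
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pod targets exposing the inference metrics port (9113);
      # narrow further by namespace or label as needed.
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "9113"
        action: keep
```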

## Troubleshoot inference observability issues
<a name="sagemaker-hyperpod-model-deployment-observability-troubleshoot"></a>

**The dashboard shows no data**

If the Grafana dashboard is empty and all panels show "No data," perform the following steps to investigate:

1. Verify Administrator has inference observability installed:

   1. Navigate to HyperPod Console > Select cluster > Check if "Observability" status shows "Enabled"

   1. Verify Grafana workspace link is accessible from cluster overview

   1. Confirm Amazon Managed Prometheus workspace is configured and receiving data

1. Verify HyperPod Observability is enabled:

   ```
   hyp observability view      
   ```

1. Verify model metrics are enabled:

   ```
   kubectl get jumpstartmodel -n <namespace> customer-chat-llama -o jsonpath='{.status.metricsStatus}'
   # Expected: enabled: true, state: Enabled
   ```


1. Check the metrics endpoint:

   ```
   kubectl port-forward pod/customer-chat-llama-xxx 9113:9113
   curl localhost:9113/metrics | grep model_invocations_total
   # Expected: model_invocations_total{...} metrics
   ```

1. Check the logs:

   ```
   # Model Container
   kubectl logs customer-chat-llama-xxx -c customer-chat-llama
   # Look for: OOM errors, CUDA errors, model loading failures

   # Proxy/SideCar
   kubectl logs customer-chat-llama-xxx -c sidecar-reverse-proxy
   # Look for: DNS resolution issues, upstream connection failures

   # Metrics Exporter Sidecar
   kubectl logs customer-chat-llama-xxx -c otel-collector
   # Look for: Metrics collection issues, export failures
   ```

**Other common issues**


| Issue | Solution | Action | 
| --- | --- | --- | 
|  Inference observability is not installed  |  Install inference observability through the console  |  "Enable Observability" in HyperPod console  | 
|  Metrics disabled in model  |  Update model configuration  |  Add `metrics: {enabled: true}` to model spec  | 
|  AMP workspace not configured  |  Fix data source connection  |  Verify AMP workspace ID in Grafana data sources  | 
|  Network connectivity  |  Check security groups/NACLs  |  Ensure pods can reach AMP endpoints  | 

# Task governance for model deployment on HyperPod
<a name="sagemaker-hyperpod-model-deployment-task-gov"></a>

This section covers how to optimize your shared Amazon SageMaker HyperPod EKS clusters for real-time inference workloads. You'll learn to configure Kueue's task governance features—including quota management, priority scheduling, and resource sharing policies—to ensure your inference workloads get the GPU resources they need during traffic spikes while maintaining fair allocation across your teams' training, evaluation, and testing activities. For more general information on task governance, see [SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md).

## How inference workload management works
<a name="sagemaker-hyperpod-model-deployment-task-gov-how"></a>

To effectively manage real-time inference traffic spikes in shared HyperPod EKS clusters, implement the following task governance strategies using Kueue's existing capabilities.

**Priority class configuration**

Define dedicated priority classes for inference workloads with high weights (such as 100) to ensure inference pods are admitted and scheduled before other task types. This configuration enables inference workloads to preempt lower-priority jobs during cluster load, which is critical for maintaining low-latency requirements during traffic surges.

**Quota sizing and allocation**

Reserve sufficient GPU resources in your team's `ClusterQueue` to handle expected inference spikes. During periods of low inference traffic, unused quota resources can be temporarily allocated to other teams' tasks. When inference demand increases, these borrowed resources can be reclaimed to prioritize pending inference pods. For more information, see [Cluster Queue](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/).

**Resource Sharing Strategies**

Choose between two quota sharing approaches based on your requirements:

1. **Strict Resource Control:** Disable quota lending and borrowing to guarantee reserved GPU capacity is always available for your workloads. This approach requires sizing quotas large enough to independently handle peak demand and may result in idle nodes during low-traffic periods.

1. **Flexible Resource Sharing:** Enable quota borrowing to utilize idle resources from other teams when needed. Borrowed pods are marked as preemptible and may be evicted if the lending team reclaims capacity.

**Intra-Team Preemption**

Enable intra-team preemption when running mixed workloads (evaluation, training, and inference) under the same quota. This allows Kueue to preempt lower-priority jobs within your team to accommodate high-priority inference pods, ensuring real-time inference can run without depending on external quota borrowing. For more information, see [Preemption](https://kueue.sigs.k8s.io/docs/concepts/preemption/).
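A `ClusterQueue` that enables intra-team preemption and cohort-based borrowing might be sketched as follows. The queue name, cohort, flavor, and quota values are illustrative assumptions:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-b-queue              # illustrative name
spec:
  namespaceSelector: {}
  cohort: shared-gpu-pool         # enables lending/borrowing within the cohort
  preemption:
    withinClusterQueue: LowerPriority   # preempt lower-priority jobs in the same team
    reclaimWithinCohort: Any            # reclaim lent quota from borrowing teams
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: p4-gpu            # illustrative ResourceFlavor name
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 20    # team's reserved GPUs
              borrowingLimit: 10  # may borrow up to 10 idle GPUs from the cohort
```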

## Sample inference workload setup
<a name="sagemaker-hyperpod-model-deployment-task-gov-example"></a>

The following example shows how Kueue manages GPU resources in a shared Amazon SageMaker HyperPod cluster.

**Cluster configuration and policy setup**  
Your cluster has the following configuration:
+ **Team A**: 10 P4 GPU quota
+ **Team B**: 20 P4 GPU quota
+ **Static provisioning**: No autoscaling
+ **Total capacity**: 30 P4 GPUs

The shared GPU pool uses this priority policy:

1. **Real-time inference**: Priority 100

1. **Training**: Priority 75

1. **Evaluation**: Priority 50

Kueue enforces team quotas and priority classes, with preemption and quota borrowing enabled.
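The priority policy above could be expressed with Kueue `WorkloadPriorityClass` resources (names are illustrative; workloads reference a class through the `kueue.x-k8s.io/priority-class` label):

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: realtime-inference
value: 100
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: training
value: 75
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: evaluation
value: 50
```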

**Initial state: Normal cluster utilization**  
In normal operations:
+ Team A runs training and evaluation jobs on all 10 P4 GPUs
+ Team B runs real-time inference (10 P4s) and evaluation (10 P4s) within its 20 GPU quota
+ The cluster is fully utilized with all jobs admitted and running

**Inference spike: Team B requires additional GPUs**  
When Team B experiences a traffic spike, additional inference pods require 5 more P4 GPUs. Kueue detects that the new pods are:
+ Within Team B's namespace
+ Priority 100 (real-time inference)
+ Pending admission due to quota constraints

**Kueue's response process chooses between two options:**  
**Option 1: Quota borrowing** - If Team A uses only 6 of its 10 P4s, Kueue can admit Team B's pods using the idle 4 P4s. However, these borrowed resources are preemptible—if Team A submits jobs to reach its full quota, Kueue evicts Team B's borrowed inference pods.

**Option 2: Self-preemption (Recommended)** - Team B runs low-priority evaluation jobs (priority 50). When high-priority inference pods are waiting, Kueue preempts the evaluation jobs within Team B's quota and admits the inference pods. This approach provides safe resource allocation with no external eviction risk.

Kueue follows a three-step process to allocate resources:

1. **Quota check**

   Question: Does Team B have unused quota?
   + Yes → Admit the pods
   + No → Proceed to Step 2

1. **Self-preemption within Team B**

   Question: Can lower-priority Team B jobs be preempted?
   + Yes → Preempt evaluation jobs (priority 50), free 5 P4s, and admit inference pods
   + No → Proceed to Step 3

   This approach keeps workloads within Team B's guaranteed quota, avoiding external eviction risks.

1. **Borrowing from other teams**

   Question: Is there idle, borrowable quota from other teams?
   + Yes → Admit using borrowed quota (marked as preemptible)
   + No → Pod remains in `NotAdmitted` state
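The three-step decision flow above can be modeled as a small function. This is a toy illustration of the logic, not Kueue's actual implementation:

```python
def admit(requested, team_free_quota, preemptible_within_team, idle_borrowable):
    """Mirror the three-step admission flow for a GPU request."""
    if requested <= team_free_quota:
        return "admitted-within-quota"            # Step 1: unused team quota
    if requested <= team_free_quota + preemptible_within_team:
        return "admitted-after-self-preemption"   # Step 2: evict lower-priority team jobs
    if requested <= team_free_quota + preemptible_within_team + idle_borrowable:
        return "admitted-with-borrowed-quota"     # Step 3: borrowed, preemptible
    return "not-admitted"

# Team B's spike: needs 5 GPUs, quota is full, but 10 GPUs of
# lower-priority evaluation jobs are preemptible within the team.
print(admit(5, 0, 10, 4))  # prints admitted-after-self-preemption
```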

# HyperPod inference troubleshooting
<a name="sagemaker-hyperpod-model-deployment-ts"></a>

This troubleshooting guide addresses common issues that can occur during Amazon SageMaker HyperPod inference deployment and operation. These problems typically involve VPC networking configuration, IAM permissions, Kubernetes resource management, and operator connectivity issues that can prevent successful model deployment or cause deployments to fail or remain in pending states.

This troubleshooting guide uses the following terminology: **Troubleshooting steps** are diagnostic procedures to identify and investigate problems, **Resolution** provides the specific actions to fix identified issues, and **Verification** confirms that the solution worked correctly.

**Topics**
+ [Inference operator installation failures through SageMaker AI console](sagemaker-hyperpod-model-deployment-ts-console-cfn-failures.md)
+ [Inference operator installation failures through AWS CLI](sagemaker-hyperpod-model-deployment-ts-cli.md)
+ [Certificate download timeout](sagemaker-hyperpod-model-deployment-ts-certificate.md)
+ [Model deployment issues](sagemaker-hyperpod-model-deployment-ts-deployment-issues.md)
+ [VPC ENI permission issue](sagemaker-hyperpod-model-deployment-ts-permissions.md)
+ [IAM trust relationship issue](sagemaker-hyperpod-model-deployment-ts-trust.md)
+ [Missing NVIDIA GPU plugin error](sagemaker-hyperpod-model-deployment-ts-gpu.md)
+ [Inference operator fails to start](sagemaker-hyperpod-model-deployment-ts-startup.md)

# Inference operator installation failures through SageMaker AI console
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-failures"></a>

**Overview:** When installing the inference operator through the SageMaker AI console using Quick Install or Custom Install, the underlying CloudFormation stacks may fail due to various issues. This section covers common failure scenarios and their resolutions.

## Inference operator add-on installation failure through Quick or Custom install
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-stack-failed"></a>

**Problem:** The HyperPod cluster creation completes successfully, but the inference operator add-on installation fails.

**Common causes:**
+ Pod capacity limits exceeded on cluster nodes. The inference operator installation requires a minimum of 13 pods. The minimum recommended instance type is `ml.c5.4xlarge`.
+ IAM permission issues
+ Resource quota constraints
+ Network or VPC configuration problems

### Symptoms and diagnosis
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-symptoms"></a>

**Symptoms:**
+ Inference operator add-on shows CREATE_FAILED or DEGRADED status in console
+ CloudFormation stack associated with the add-on is in CREATE_FAILED state
+ Installation progress stops or shows error messages

**Diagnostic steps:**

1. Check the inference operator add-on status:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

1. Check for pod limit issues:

   ```
   # Check current pod count per node
   kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, allocatable: .status.allocatable.pods, capacity: .status.capacity.pods}'
   
   # Check pods running on each node
   kubectl get pods --all-namespaces -o wide | awk '{print $8}' | sort | uniq -c
   
   # Check for pod evictions or failures
   kubectl get events --all-namespaces --sort-by='.lastTimestamp' | grep -i "pod\|limit\|quota"
   ```

1. Check CloudFormation stack status (if using console installation):

   ```
   # List CloudFormation stacks related to the cluster
   aws cloudformation list-stacks \
       --region $REGION \
       --query "StackSummaries[?contains(StackName, '$EKS_CLUSTER_NAME') && StackStatus=='CREATE_FAILED'].{Name:StackName,Status:StackStatus,Reason:StackStatusReason}" \
       --output table
   
   # Get detailed stack events
   aws cloudformation describe-stack-events \
       --stack-name <stack-name> \
       --region $REGION \
       --query "StackEvents[?ResourceStatus=='CREATE_FAILED']" \
       --output table
   ```

### Resolution
<a name="sagemaker-hyperpod-model-deployment-ts-console-cfn-resolution"></a>

To resolve the installation failure, save the current configuration, delete the failed add-on, fix the underlying issue, and then reinstall the inference operator through the SageMaker AI console (recommended) or the AWS CLI.

**Step 1: Save the current configuration**
+ Extract and save the add-on configuration before deletion:

  ```
  # Save the current configuration
  aws eks describe-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION \
      --query 'addon.configurationValues' \
      --output text > addon-config-backup.json
  
  # Verify the configuration was saved
  cat addon-config-backup.json
  
  # Pretty print for readability
  cat addon-config-backup.json | jq '.'
  ```

**Step 2: Delete the failed add-on**
+ Delete the inference operator add-on:

  ```
  aws eks delete-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION
  
  # Wait for deletion to complete
  echo "Waiting for add-on deletion..."
  aws eks wait addon-deleted \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --region $REGION 2>/dev/null || sleep 60
  ```

**Step 3: Fix the underlying issue**

Choose the appropriate resolution based on the failure cause:

If the issue is pod limit exceeded:

```
# The inference operator requires a minimum of 13 pods.
# The minimum recommended instance type is ml.c5.4xlarge.
#
# Option 1: Add instance group with higher pod capacity
# Different instance types support different maximum pod counts
# For example: m5.large (29 pods), m5.xlarge (58 pods), m5.2xlarge (58 pods)
aws sagemaker update-cluster \
    --cluster-name $HYPERPOD_CLUSTER_NAME \
    --region $REGION \
    --instance-groups '[{"InstanceGroupName":"worker-group-2","InstanceType":"ml.m5.xlarge","InstanceCount":2}]'

# Option 2: Scale existing node group to add more nodes
aws eks update-nodegroup-config \
    --cluster-name $EKS_CLUSTER_NAME \
    --nodegroup-name <nodegroup-name> \
    --scaling-config minSize=2,maxSize=10,desiredSize=5 \
    --region $REGION

# Option 3: Clean up unused pods
kubectl delete pods --field-selector status.phase=Failed --all-namespaces
kubectl delete pods --field-selector status.phase=Succeeded --all-namespaces
```

**Step 4: Reinstall the inference operator**

After fixing the underlying issue, reinstall the inference operator using one of the following methods:
+ **SageMaker AI console with Custom Install (recommended):** Reuse existing IAM roles and TLS bucket from your previous installation. For steps, see [Method 1: Install HyperPod Inference Add-on through SageMaker AI console (Recommended)](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-ui).
+ **AWS CLI with saved configuration:** Use the configuration you backed up in Step 1 to reinstall the add-on. For the full CLI installation steps, see [Method 2: Installing the Inference Operator using the AWS CLI](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-addon).

  ```
  aws eks create-addon \
      --cluster-name $EKS_CLUSTER_NAME \
      --addon-name amazon-sagemaker-hyperpod-inference \
      --addon-version v1.0.0-eksbuild.1 \
      --configuration-values file://addon-config-backup.json \
      --region $REGION
  ```
+ **SageMaker AI console with Quick Install:** Creates new IAM roles, TLS bucket, and dependency add-ons automatically. For steps, see [Method 1: Install HyperPod Inference Add-on through SageMaker AI console (Recommended)](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-ui).

**Step 5: Verify successful installation**

```
# Check add-on status
aws eks describe-addon \
    --cluster-name $EKS_CLUSTER_NAME \
    --addon-name amazon-sagemaker-hyperpod-inference \
    --region $REGION \
    --query "addon.{Status:status,Health:health}" \
    --output table

# Verify pods are running
kubectl get pods -n hyperpod-inference-system

# Check operator logs
kubectl logs -n hyperpod-inference-system deployment/hyperpod-inference-controller-manager --tail=50
```

## Cert-manager installation failed due to Kueue webhook not ready
<a name="sagemaker-hyperpod-model-deployment-ts-console-kueue-webhook-race"></a>

**Problem:** The cert-manager add-on installation fails with a webhook error because the Task Governance (Kueue) webhook service has no available endpoints. This is a race condition that occurs when cert-manager tries to create resources before the Task Governance webhook pods are fully running. This can happen when Task Governance add-on is being installed along with the Inference operator during cluster creation.

### Symptoms and diagnosis
<a name="sagemaker-hyperpod-model-deployment-ts-console-kueue-symptoms"></a>

**Error message:**

```
AdmissionRequestDenied
Internal error occurred: failed calling webhook "mdeployment.kb.io": failed to call webhook: 
Post "https://kueue-webhook-service.kueue-system.svc:443/mutate-apps-v1-deployment?timeout=10s": 
no endpoints available for service "kueue-webhook-service"
```

**Root cause:**
+ Task Governance add-on installs and registers a mutating webhook that intercepts all Deployment creations
+ Cert-manager add-on tries to create Deployment resources before Task Governance webhook pods are ready
+ Kubernetes admission control calls the Task Governance webhook, but it has no endpoints (pods not running yet)

**Diagnostic step:**

1. Check cert-manager add-on status:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

### Resolution
<a name="sagemaker-hyperpod-model-deployment-ts-console-kueue-resolution"></a>

**Solution: Delete and reinstall cert-manager**

The Task Governance webhook becomes ready within 60 seconds. Simply delete and reinstall the cert-manager add-on:

1. Delete the failed cert-manager add-on:

   ```
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --region $REGION
   ```

1. Wait 30-60 seconds for the Task Governance webhook to become ready, then reinstall the cert-manager add-on:

   ```
   sleep 60
   
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --region $REGION
   ```

# Inference operator installation failures through AWS CLI
<a name="sagemaker-hyperpod-model-deployment-ts-cli"></a>

**Overview:** When installing the inference operator through the AWS CLI, add-on installation may fail due to missing dependencies. This section covers common CLI installation failure scenarios and their resolutions.

## Inference add-on installation failed due to missing CSI drivers
<a name="sagemaker-hyperpod-model-deployment-ts-missing-csi-drivers"></a>

**Problem:** The inference operator add-on creation fails because required CSI driver dependencies are not installed on the EKS cluster.

**Symptoms and diagnosis:**

**Error messages:**

The following errors appear in the add-on creation logs or inference operator logs:

```
S3 CSI driver not installed (missing CSIDriver s3.csi.aws.com). 
Please install the required CSI driver and see the troubleshooting guide for more information.

FSx CSI driver not installed (missing CSIDriver fsx.csi.aws.com). 
Please install the required CSI driver and see the troubleshooting guide for more information.
```

**Diagnostic steps:**

1. Check if CSI drivers are installed:

   ```
   # Check for S3 CSI driver
   kubectl get csidriver s3.csi.aws.com
   kubectl get pods -n kube-system | grep mountpoint
   
   # Check for FSx CSI driver  
   kubectl get csidriver fsx.csi.aws.com
   kubectl get pods -n kube-system | grep fsx
   ```

1. Check EKS add-on status:

   ```
   # List all add-ons
   aws eks list-addons --cluster-name $EKS_CLUSTER_NAME --region $REGION
   
   # Check specific CSI driver add-ons
   aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION 2>/dev/null || echo "S3 CSI driver not installed"
   aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION 2>/dev/null || echo "FSx CSI driver not installed"
   ```

1. Check inference operator add-on status:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

**Resolution:**

**Step 1: Install missing S3 CSI driver**

1. Create IAM role for S3 CSI driver (if not already created):

   ```
   # Set up service account role ARN (from installation steps)
   export S3_CSI_ROLE_ARN=$(aws iam get-role --role-name $S3_CSI_ROLE_NAME --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
   echo "S3 CSI Role ARN: $S3_CSI_ROLE_ARN"
   ```

1. Install S3 CSI driver add-on:

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name aws-mountpoint-s3-csi-driver \
       --addon-version v1.14.1-eksbuild.1 \
       --service-account-role-arn $S3_CSI_ROLE_ARN \
       --region $REGION
   ```

1. Verify S3 CSI driver installation:

   ```
   # Wait for add-on to be active
   aws eks wait addon-active --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION
   
   # Verify CSI driver is available
   kubectl get csidriver s3.csi.aws.com
   kubectl get pods -n kube-system | grep mountpoint
   ```

**Step 2: Install missing FSx CSI driver**

1. Create IAM role for FSx CSI driver (if not already created):

   ```
   # Set up service account role ARN (from installation steps)
   export FSX_CSI_ROLE_ARN=$(aws iam get-role --role-name $FSX_CSI_ROLE_NAME --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
   echo "FSx CSI Role ARN: $FSX_CSI_ROLE_ARN"
   ```

1. Install FSx CSI driver add-on:

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name aws-fsx-csi-driver \
       --addon-version v1.6.0-eksbuild.1 \
       --service-account-role-arn $FSX_CSI_ROLE_ARN \
       --region $REGION
   
   # Wait for add-on to be active
   aws eks wait addon-active --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION
   
   # Verify FSx CSI driver is running
   kubectl get pods -n kube-system | grep fsx
   ```

**Step 3: Verify all dependencies**

After installing the missing dependencies, verify they are running correctly before retrying the inference operator installation:

```
# Check all required add-ons are active
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-mountpoint-s3-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name aws-fsx-csi-driver --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name metrics-server --region $REGION
aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION

# Verify all pods are running
kubectl get pods -n kube-system | grep -E "(mountpoint|fsx|metrics-server)"
kubectl get pods -n cert-manager
```

## Inference Custom Resource Definitions are missing during model deployment
<a name="sagemaker-hyperpod-model-deployment-ts-crd-not-exist"></a>

**Problem:** Custom Resource Definitions (CRDs) are missing when you attempt to create model deployments. This issue occurs when you previously installed and deleted the inference add-on without cleaning up model deployments that have finalizers.

**Symptoms and diagnosis:**

**Root cause:**

If you delete the inference add-on without first removing all model deployments, custom resources with finalizers remain in the cluster. These finalizers must complete before you can delete the CRDs. The add-on deletion process doesn't wait for CRD deletion to complete, which causes the CRDs to remain in a terminating state and prevents new installations.

**To diagnose this issue**

1. Check whether CRDs exist.

   ```
   kubectl get crd | grep inference.sagemaker.aws.amazon.com
   ```

1. Check for stuck custom resources.

   ```
   # Check for JumpStartModel resources
   kubectl get jumpstartmodels -A
   
   # Check for InferenceEndpointConfig resources
   kubectl get inferenceendpointconfigs -A
   ```

1. Inspect finalizers on stuck resources.

   ```
   # Example for a specific JumpStartModel
   kubectl get jumpstartmodels <model-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
   
   # Example for a specific InferenceEndpointConfig
   kubectl get inferenceendpointconfigs <config-name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
   ```

**Resolution:**

Manually remove the finalizers from all model deployments that weren't deleted when you removed the inference add-on. Complete the following steps for each stuck custom resource.

**To remove finalizers from JumpStartModel resources**

1. List all JumpStartModel resources across all namespaces.

   ```
   kubectl get jumpstartmodels -A
   ```

1. For each JumpStartModel resource, remove the finalizers by patching the resource to set metadata.finalizers to an empty array.

   ```
   kubectl patch jumpstartmodels <model-name> -n <namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

   The following example shows how to patch a resource named kv-l1-only.

   ```
   kubectl patch jumpstartmodels kv-l1-only -n default -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

1. Verify that the model instance is deleted.

   ```
   kubectl get jumpstartmodels -A
   ```

   After the stuck resources are removed and the CRD finishes terminating, this command returns the following error, which confirms that the CRD no longer exists.

   ```
   Error from server (NotFound): Unable to list "inference.sagemaker.aws.amazon.com/v1, Resource=jumpstartmodels": the server could not find the requested resource (get jumpstartmodels.inference.sagemaker.aws.amazon.com)
   ```

1. Verify that the JumpStartModel CRD is removed.

   ```
   kubectl get crd | grep jumpstartmodels.inference.sagemaker.aws.amazon.com
   ```

   If the CRD is successfully removed, this command returns no output.

**To remove finalizers from InferenceEndpointConfig resources**

1. List all InferenceEndpointConfig resources across all namespaces.

   ```
   kubectl get inferenceendpointconfigs -A
   ```

1. For each InferenceEndpointConfig resource, remove the finalizers.

   ```
   kubectl patch inferenceendpointconfigs <config-name> -n <namespace> -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

   The following example shows how to patch a resource named my-inference-config.

   ```
   kubectl patch inferenceendpointconfigs my-inference-config -n default -p '{"metadata":{"finalizers":[]}}' --type=merge
   ```

1. Verify that the config instance is deleted.

   ```
   kubectl get inferenceendpointconfigs -A
   ```

   After the stuck resources are removed and the CRD finishes terminating, this command returns the following error, which confirms that the CRD no longer exists.

   ```
   Error from server (NotFound): Unable to list "inference.sagemaker.aws.amazon.com/v1, Resource=inferenceendpointconfigs": the server could not find the requested resource (get inferenceendpointconfigs.inference.sagemaker.aws.amazon.com)
   ```

1. Verify that the InferenceEndpointConfig CRD is removed.

   ```
   kubectl get crd | grep inferenceendpointconfigs.inference.sagemaker.aws.amazon.com
   ```

   If the CRD is successfully removed, this command returns no output.

**To reinstall the inference add-on**

After you clean up all stuck resources and verify that the CRDs are removed, reinstall the inference add-on. For more information, see [Installing the Inference Operator with EKS add-on](sagemaker-hyperpod-model-deployment-setup.md#sagemaker-hyperpod-model-deployment-setup-install-inference-operator-addon).

**Verification:**

1. Verify that the inference add-on is successfully installed.

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health}" \
       --output table
   ```

   The Status should be ACTIVE and the Health should be HEALTHY.

1. Verify that CRDs are properly installed.

   ```
   kubectl get crd | grep inference.sagemaker.aws.amazon.com
   ```

   You should see the inference-related CRDs listed in the output.

1. Test creating a new model deployment to confirm that the issue is resolved.

   ```
   # Create a test deployment using your preferred method
   kubectl apply -f <your-model-deployment.yaml>
   ```

**Prevention:**

To prevent this issue, complete the following steps before you uninstall the inference add-on.

1. Delete all model deployments.

   ```
   # Delete all JumpStartModel resources
   kubectl delete jumpstartmodels --all -A
   
   # Delete all InferenceEndpointConfig resources
   kubectl delete inferenceendpointconfigs --all -A
   
   # Wait for all resources to be fully deleted
   kubectl get jumpstartmodels -A
   kubectl get inferenceendpointconfigs -A
   ```

1. Verify that all custom resources are deleted.

1. After you confirm that all resources are cleaned up, delete the inference add-on.
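
   Assuming the same `$EKS_CLUSTER_NAME` and `$REGION` variables used in the earlier examples, you can delete the add-on as follows.

   ```
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION
   ```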

## Inference add-on installation failed due to missing cert-manager
<a name="sagemaker-hyperpod-model-deployment-ts-missing-cert-manager"></a>

**Problem:** The inference operator add-on creation fails because the cert-manager EKS Add-On is not installed, resulting in missing Custom Resource Definitions (CRDs).

**Symptoms and diagnosis:**

**Error messages:**

The following errors appear in the add-on creation logs or inference operator logs:

```
Missing required CRD: certificaterequests.cert-manager.io. 
The cert-manager add-on is not installed. Please install cert-manager and see the troubleshooting guide for more information.
```

**Diagnostic steps:**

1. Check if cert-manager is installed:

   ```
   # Check for cert-manager CRDs
   kubectl get crd | grep cert-manager
   kubectl get pods -n cert-manager
   
   # Check EKS add-on status
   aws eks describe-addon --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION 2>/dev/null || echo "Cert-manager not installed"
   ```

1. Check inference operator add-on status:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,Issues:issues}" \
       --output json
   ```

**Resolution:**

**Step 1: Install cert-manager add-on**

1. Install the cert-manager EKS add-on:

   ```
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name cert-manager \
       --addon-version v1.18.2-eksbuild.2 \
       --region $REGION
   ```

1. Verify cert-manager installation:

   ```
   # Wait for add-on to be active
   aws eks wait addon-active --cluster-name $EKS_CLUSTER_NAME --addon-name cert-manager --region $REGION
   
   # Verify cert-manager pods are running
   kubectl get pods -n cert-manager
   
   # Verify CRDs are installed
   kubectl get crd | grep cert-manager | wc -l
   # Expected: Should show multiple cert-manager CRDs
   ```

**Step 2: Retry inference operator installation**

1. After cert-manager is installed, retry the inference operator installation:

   ```
   # Delete the failed add-on if it exists
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION 2>/dev/null || echo "Add-on not found, proceeding with installation"
   
   # Wait for deletion to complete
   sleep 30
   
   # Reinstall the inference operator add-on
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --addon-version v1.0.0-eksbuild.1 \
       --configuration-values file://addon-config.json \
       --region $REGION
   ```

1. Monitor the installation:

   ```
   # Check installation status
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health}" \
       --output table
   
   # Verify inference operator pods are running
   kubectl get pods -n hyperpod-inference-system
   ```

## Inference add-on installation failed due to missing ALB Controller
<a name="sagemaker-hyperpod-model-deployment-ts-missing-alb"></a>

**Problem:** The inference operator add-on creation fails because the AWS Load Balancer Controller is not installed or not properly configured for the inference add-on.

**Symptoms and diagnosis:**

**Error messages:**

The following errors appear in the add-on creation logs or inference operator logs:

```
ALB Controller not installed (missing aws-load-balancer-controller pods). 
Please install the Application Load Balancer Controller and see the troubleshooting guide for more information.
```

**Diagnostic steps:**

1. Check if ALB Controller is installed:

   ```
   # Check for ALB Controller pods
   kubectl get pods -n kube-system | grep aws-load-balancer-controller
   kubectl get pods -n hyperpod-inference-system | grep aws-load-balancer-controller
   
   # Check ALB Controller service account
   kubectl get serviceaccount aws-load-balancer-controller -n kube-system 2>/dev/null || echo "ALB Controller service account not found"
   kubectl get serviceaccount aws-load-balancer-controller -n hyperpod-inference-system 2>/dev/null || echo "ALB Controller service account not found in inference namespace"
   ```

1. Check inference operator add-on configuration:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,ConfigurationValues:configurationValues}" \
       --output json
   ```

**Resolution:**

Choose one of the following options based on your setup:

**Option 1: Let the inference add-on install ALB Controller (Recommended)**
+ Ensure the ALB role is created and properly configured in your add-on configuration:

  ```
  # Verify ALB role exists
  export ALB_ROLE_ARN=$(aws iam get-role --role-name alb-role --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
  echo "ALB Role ARN: $ALB_ROLE_ARN"
  
  # Update your addon-config.json to enable ALB
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "enabled": true,
      "serviceAccount": {
        "create": true,
        "roleArn": "$ALB_ROLE_ARN"
      }
    },
    "keda": {
      "auth": {
        "aws": {
          "irsa": {
            "roleArn": "$KEDA_ROLE_ARN"
          }
        }
      }
    }
  }
  EOF
  ```

**Option 2: Use existing ALB Controller installation**
+ If you already have ALB Controller installed, configure the add-on to use the existing installation:

  ```
  # Update your addon-config.json to disable ALB installation
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "enabled": false
    },
    "keda": {
      "auth": {
        "aws": {
          "irsa": {
            "roleArn": "$KEDA_ROLE_ARN"
          }
        }
      }
    }
  }
  EOF
  ```

**Retry inference operator installation**

1. Reinstall the inference operator add-on with the updated configuration:

   ```
   # Delete the failed add-on if it exists
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION 2>/dev/null || echo "Add-on not found, proceeding with installation"
   
   # Wait for deletion to complete
   sleep 30
   
   # Reinstall with updated configuration
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --addon-version v1.0.0-eksbuild.1 \
       --configuration-values file://addon-config.json \
       --region $REGION
   ```

1. Verify ALB Controller is working:

   ```
   # Check ALB Controller pods
   kubectl get pods -n hyperpod-inference-system | grep aws-load-balancer-controller
   kubectl get pods -n kube-system | grep aws-load-balancer-controller
   
   # Check service account annotations
   kubectl describe serviceaccount aws-load-balancer-controller -n hyperpod-inference-system 2>/dev/null
   kubectl describe serviceaccount aws-load-balancer-controller -n kube-system 2>/dev/null
   ```

## Inference add-on installation failed due to missing KEDA operator
<a name="sagemaker-hyperpod-model-deployment-ts-missing-keda"></a>

**Problem:** The inference operator add-on creation fails because the KEDA (Kubernetes Event Driven Autoscaler) operator is not installed or not properly configured for the inference add-on.

**Symptoms and diagnosis:**

**Error messages:**

The following errors appear in the add-on creation logs or inference operator logs:

```
KEDA operator not installed (missing keda-operator pods). 
KEDA can be installed separately in any namespace or via the Inference addon.
```

**Diagnostic steps:**

1. Check if KEDA operator is installed:

   ```
   # Check for KEDA operator pods in common namespaces
   kubectl get pods -n keda-system | grep keda-operator 2>/dev/null || echo "KEDA not found in keda-system namespace"
   kubectl get pods -n kube-system | grep keda-operator 2>/dev/null || echo "KEDA not found in kube-system namespace"
   kubectl get pods -n hyperpod-inference-system | grep keda-operator 2>/dev/null || echo "KEDA not found in inference namespace"
   
   # Check for KEDA CRDs
   kubectl get crd | grep keda 2>/dev/null || echo "KEDA CRDs not found"
   
   # Check KEDA service account
   kubectl get serviceaccount keda-operator -A 2>/dev/null || echo "KEDA service account not found"
   ```

1. Check inference operator add-on configuration:

   ```
   aws eks describe-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION \
       --query "addon.{Status:status,Health:health,ConfigurationValues:configurationValues}" \
       --output json
   ```

**Resolution:**

Choose one of the following options based on your setup:

**Option 1: Let the inference add-on install KEDA (Recommended)**
+ Ensure the KEDA role is created and properly configured in your add-on configuration:

  ```
  # Verify KEDA role exists
  export KEDA_ROLE_ARN=$(aws iam get-role --role-name keda-operator-role --query 'Role.Arn' --output text 2>/dev/null || echo "Role not found")
  echo "KEDA Role ARN: $KEDA_ROLE_ARN"
  
  # Update your addon-config.json to enable KEDA
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "serviceAccount": {
        "create": true,
        "roleArn": "$ALB_ROLE_ARN"
      }
    },
    "keda": {
      "enabled": true,
      "auth": {
        "aws": {
          "irsa": {
            "roleArn": "$KEDA_ROLE_ARN"
          }
        }
      }
    }
  }
  EOF
  ```

**Option 2: Use existing KEDA installation**
+ If you already have KEDA installed, configure the add-on to use the existing installation:

  ```
  # Update your addon-config.json to disable KEDA installation
  cat > addon-config.json << EOF
  {
    "executionRoleArn": "$EXECUTION_ROLE_ARN",
    "tlsCertificateS3Bucket": "$BUCKET_NAME",
    "hyperpodClusterArn": "$HYPERPOD_CLUSTER_ARN",
    "alb": {
      "serviceAccount": {
        "create": true,
        "roleArn": "$ALB_ROLE_ARN"
      }
    },
    "keda": {
      "enabled": false
    }
  }
  EOF
  ```

**Retry inference operator installation**

1. Reinstall the inference operator add-on with the updated configuration:

   ```
   # Delete the failed add-on if it exists
   aws eks delete-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --region $REGION 2>/dev/null || echo "Add-on not found, proceeding with installation"
   
   # Wait for deletion to complete
   sleep 30
   
   # Reinstall with updated configuration
   aws eks create-addon \
       --cluster-name $EKS_CLUSTER_NAME \
       --addon-name amazon-sagemaker-hyperpod-inference \
       --addon-version v1.0.0-eksbuild.1 \
       --configuration-values file://addon-config.json \
       --region $REGION
   ```

1. Verify KEDA is working:

   ```
   # Check KEDA pods
   kubectl get pods -n hyperpod-inference-system | grep keda
   kubectl get pods -n kube-system | grep keda
   kubectl get pods -n keda-system | grep keda 2>/dev/null
   
   # Check KEDA CRDs
   kubectl get crd | grep scaledobjects
   kubectl get crd | grep scaledjobs
   
   # Check KEDA service account annotations
   kubectl describe serviceaccount keda-operator -n hyperpod-inference-system 2>/dev/null
   kubectl describe serviceaccount keda-operator -n kube-system 2>/dev/null
   kubectl describe serviceaccount keda-operator -n keda-system 2>/dev/null
   ```

# Certificate download timeout
<a name="sagemaker-hyperpod-model-deployment-ts-certificate"></a>

When you deploy a SageMaker AI endpoint, the creation process fails because the certificate authority (CA) certificate can't be downloaded in a VPC environment. For detailed configuration steps, refer to the [Admin guide](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/SageMakerHyperpod/hyperpod-inference/Hyperpod_Inference_Admin_Notebook.ipynb).

**Error message:**

The following error appears in the SageMaker AI endpoint CloudWatch logs: 

```
Error downloading CA certificate: Connect timeout on endpoint URL: "https://****.s3.<REGION>.amazonaws.com/****/***.pem"
```

**Root cause:**
+ This issue occurs when the inference operator cannot access the self-signed certificate in Amazon S3 within your VPC
+ Proper configuration of the Amazon S3 VPC endpoint is essential for certificate access

**Resolution:**

1. If you don't have an Amazon S3 VPC endpoint:
   + Create an Amazon S3 VPC endpoint following the configuration in section 5.3 of the [Admin guide](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/SageMakerHyperpod/hyperpod-inference/Hyperpod_Inference_Admin_Notebook.ipynb).

1. If you already have an Amazon S3 VPC endpoint:
   + Ensure that the subnet route table points to the VPC endpoint (for a gateway endpoint), or that private DNS is enabled (for an interface endpoint).
   + Confirm that your Amazon S3 VPC endpoint matches the configuration described in the endpoint creation step in section 5.3 of the Admin guide.
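
To confirm that an Amazon S3 VPC endpoint exists in your VPC and to check its type, you can run a command like the following. This assumes the `$VPC_ID` and `$REGION` variables are set for your environment.

```
aws ec2 describe-vpc-endpoints \
    --filters Name=vpc-id,Values=$VPC_ID \
              Name=service-name,Values=com.amazonaws.$REGION.s3 \
    --query "VpcEndpoints[].{Id:VpcEndpointId,Type:VpcEndpointType,State:State}" \
    --output table
```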

# Model deployment issues
<a name="sagemaker-hyperpod-model-deployment-ts-deployment-issues"></a>

**Overview:** This section covers common issues that occur during model deployment, including pending states, failed deployments, and monitoring deployment progress.

## Model deployment stuck in pending state
<a name="sagemaker-hyperpod-model-deployment-ts-pending"></a>

When deploying a model, the deployment remains in a "Pending" state for an extended period. This indicates that the inference operator is unable to initiate the model deployment in your HyperPod cluster.

**Components affected:**

During normal deployment, the inference operator should:
+ Deploy model pod
+ Create load balancer
+ Create SageMaker AI endpoint

**Troubleshooting steps:**

1. Check the inference operator pod status:

   ```
   kubectl get pods -n hyperpod-inference-system
   ```

   Expected output example:

   ```
   NAME                                                           READY   STATUS    RESTARTS   AGE
   hyperpod-inference-operator-controller-manager-65c49967f5-894fg   1/1     Running   0         6d13h
   ```

1. Review the inference operator logs for error messages. Using the deployment name avoids needing the exact pod name.

   ```
   kubectl logs deployment/hyperpod-inference-operator-controller-manager -n hyperpod-inference-system
   ```

**What to look for:**
+ Error messages in the operator logs
+ Status of the operator pod
+ Any deployment-related warnings or failures

**Note**  
A healthy deployment should progress beyond the "Pending" state within a reasonable time. If issues persist, review the inference operator logs for specific error messages to determine the root cause.

## Model deployment failed state troubleshooting
<a name="sagemaker-hyperpod-model-deployment-ts-failed"></a>

When a model deployment enters a "Failed" state, the failure could occur in one of three components:
+ Model pod deployment
+ Load balancer creation
+ SageMaker AI endpoint creation

**Troubleshooting steps:**

1. Check the inference operator status:

   ```
   kubectl get pods -n hyperpod-inference-system
   ```

   Expected output:

   ```
   NAME                                                           READY   STATUS    RESTARTS   AGE
   hyperpod-inference-operator-controller-manager-65c49967f5-894fg   1/1     Running   0         6d13h
   ```

1. Review the operator logs:

   ```
   kubectl logs deployment/hyperpod-inference-operator-controller-manager -n hyperpod-inference-system
   ```

**What to look for:**

The operator logs will indicate which component failed:
+ Model pod deployment failures
+ Load balancer creation issues
+ SageMaker AI endpoint errors

## Checking model deployment progress
<a name="sagemaker-hyperpod-model-deployment-ts-progress"></a>

To monitor the progress of your model deployment and identify potential issues, you can use kubectl commands to check the status of various components. This helps determine whether the deployment is progressing normally or has encountered problems during the model pod creation, load balancer setup, or SageMaker AI endpoint configuration phases.

**Method 1: Check the JumpStart model status**

```
kubectl describe jumpstartmodel.inference.sagemaker.aws.amazon.com/<model-name> -n <namespace>
```

**Key status indicators to monitor:**

1. Deployment Status
   + Look for `Status.State`: Should show `DeploymentComplete`
   + Check `Status.Deployment Status.Available Replicas`
   + Monitor `Status.Conditions` for deployment progress

1. SageMaker AI Endpoint Status
   + Check `Status.Endpoints.Sagemaker.State`: Should show `CreationCompleted`
   + Verify `Status.Endpoints.Sagemaker.Endpoint Arn`

1. TLS Certificate Status
   + View `Status.Tls Certificate` details
   + Check certificate expiration in `Last Cert Expiry Time`

**Method 2: Check the inference endpoint configuration**

```
kubectl describe inferenceendpointconfig.inference.sagemaker.aws.amazon.com/<deployment_name> -n <namespace>
```

**Common status states:**
+ `DeploymentInProgress`: Initial deployment phase
+ `DeploymentComplete`: Successful deployment
+ `Failed`: Deployment failed

**Note**  
Monitor the Events section for any warnings or errors. Check replica count matches expected configuration. Verify all conditions show `Status: True` for a healthy deployment.
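
One way to surface these warnings is to list Kubernetes events in the deployment's namespace, filtered to warnings and sorted by time.

```
kubectl get events -n <namespace> --field-selector type=Warning --sort-by=.lastTimestamp
```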

# VPC ENI permission issue
<a name="sagemaker-hyperpod-model-deployment-ts-permissions"></a>

SageMaker AI endpoint creation fails because of insufficient permissions to create network interfaces in the VPC.

**Error message:**

```
Please ensure that the execution role for variant AllTraffic has sufficient permissions for creating an endpoint variant within a VPC
```

**Root cause:**

The inference operator's execution role lacks the Amazon EC2 permissions required to create network interfaces (ENIs) in the VPC.

**Resolution:**

Add the following IAM permission to the inference operator's execution role:

```
{
    "Effect": "Allow",
    "Action": [
        "ec2:CreateNetworkInterfacePermission",
        "ec2:DeleteNetworkInterfacePermission"
     ],
    "Resource": "*"
}
```

**Verification:**

After adding the permission:

1. Delete the failed endpoint (if exists)

1. Retry the endpoint creation

1. Monitor the deployment status for successful completion
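
If you prefer the AWS CLI for these steps, the following commands show one way to delete the failed endpoint and then monitor the replacement. The endpoint name is a placeholder for your own.

```
# Delete the failed endpoint
aws sagemaker delete-endpoint --endpoint-name <endpoint-name> --region $REGION

# After retrying creation, monitor the endpoint status until it reaches InService
aws sagemaker describe-endpoint --endpoint-name <endpoint-name> --region $REGION \
    --query "EndpointStatus"
```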

**Note**  
This permission is essential for SageMaker AI endpoints running in VPC mode. Ensure the execution role has all other necessary VPC-related permissions as well.

# IAM trust relationship issue
<a name="sagemaker-hyperpod-model-deployment-ts-trust"></a>

HyperPod inference operator fails to start with an STS AssumeRoleWithWebIdentity error, indicating an IAM trust relationship configuration problem.

**Error message:**

```
failed to enable inference watcher for HyperPod cluster *****: operation error SageMaker: UpdateClusterInference, 
get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, 
operation error STS: AssumeRoleWithWebIdentity, https response error StatusCode: 403, RequestID: ****, 
api error AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
```

**Resolution:**

Update the trust relationship of the inference operator's IAM execution role with the following configuration.

Replace the following placeholders:
+ `<ACCOUNT_ID>`: Your AWS account ID
+ `<REGION>`: Your AWS Region
+ `<OIDC_ID>`: Your Amazon EKS cluster's OIDC provider ID
+ `<namespace>` and `<service-account-name>`: The namespace and name of the inference operator's Kubernetes service account

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringLike": {
                    "oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>:sub": "system:serviceaccount:<namespace>:<service-account-name>",
                    "oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>:aud": "sts.amazonaws.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

**Verification:**

After updating the trust relationship:

1. Verify the role configuration in IAM console

1. Restart the inference operator if necessary

1. Monitor operator logs for successful startup
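
To confirm the role's current trust policy from the command line, you can inspect it with the AWS CLI. The role name is a placeholder for your operator execution role.

```
aws iam get-role --role-name <execution-role-name> \
    --query "Role.AssumeRolePolicyDocument" --output json
```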

# Missing NVIDIA GPU plugin error
<a name="sagemaker-hyperpod-model-deployment-ts-gpu"></a>

Model deployment fails with GPU insufficiency error despite having available GPU nodes. This occurs when the NVIDIA device plugin is not installed in the HyperPod cluster.

**Error message:**

```
0/15 nodes are available: 10 node(s) didn't match Pod's node affinity/selector, 
5 Insufficient nvidia.com/gpu. preemption: 0/15 nodes are available: 
10 Preemption is not helpful for scheduling, 5 No preemption victims found for incoming pod.
```

**Root cause:**
+ Kubernetes cannot detect GPU resources without the NVIDIA device plugin
+ Results in scheduling failures for GPU workloads

**Resolution:**

Install the NVIDIA GPU plugin by running:

```
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/refs/tags/v0.17.1/deployments/static/nvidia-device-plugin.yml
```

**Verification steps:**

1. Check the plugin deployment status:

   ```
   kubectl get pods -n kube-system | grep nvidia-device-plugin
   ```

1. Verify GPU resources are now visible:

   ```
   kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\\.com/gpu
   ```

1. Retry model deployment

**Note**  
Ensure that NVIDIA drivers are installed on GPU nodes. Plugin installation is a one-time setup per cluster and may require cluster admin privileges.

# Inference operator fails to start
<a name="sagemaker-hyperpod-model-deployment-ts-startup"></a>

The inference operator pod fails to start, producing the following error message. This occurs when the permission policy on the operator execution role isn't authorized to perform `sts:AssumeRoleWithWebIdentity`, so the operator component running on the control plane doesn't start.

**Error message:**

```
Warning Unhealthy 5m46s (x22 over 49m) kubelet Startup probe failed: Get "http://10.1.100.59:8081/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```

**Root cause:**
+ The permission policy of the inference operator execution role doesn't grant access to the required resources, including the ECR authorization token.

**Resolution:**

Update the `HyperpodInferenceAccessPolicy-ml-cluster` policy on the execution role (`EXECUTION_ROLE_ARN`) for the HyperPod inference operator to include all resources, as shown in the following JSON.

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken"
            ],
            "Resource": "*"
        }
    ]
}
```

------

**Verification steps:**

1. Update the policy.

1. Terminate the HyperPod inference operator pod.

1. Confirm that the pod restarts without throwing any exceptions.
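
As a sketch, the pod can be recreated by restarting the operator deployment. This assumes the deployment name and namespace shown elsewhere in this guide.

```
# Recreate the operator pod
kubectl rollout restart deployment/hyperpod-inference-operator-controller-manager \
    -n hyperpod-inference-system

# Confirm the new pod reaches the Running state
kubectl get pods -n hyperpod-inference-system
```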

# Amazon SageMaker HyperPod Inference release notes
<a name="sagemaker-hyperpod-inference-release-notes"></a>

This topic covers release notes that track updates, fixes, and new features for Amazon SageMaker HyperPod Inference. SageMaker HyperPod Inference enables you to deploy and scale machine learning models on your HyperPod clusters with enterprise-grade reliability. For general Amazon SageMaker HyperPod platform releases, updates, and improvements, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

For information about SageMaker HyperPod Inference capabilities and deployment options, see [Deploying models on Amazon SageMaker HyperPod](sagemaker-hyperpod-model-deployment.md).

## SageMaker HyperPod Inference release notes: v3.1
<a name="sagemaker-hyperpod-inference-release-notes-20260403"></a>

**Release Date:** April 3, 2026

**Summary**

Inference Operator v3.1 introduces custom Kubernetes pod configuration, custom certificate support, and per-pod request limits.

**Key Features**
+ **Custom Kubernetes Pod Configuration** – Added a new `kubernetes` field to the `InferenceEndpointConfig` CRD that allows users to customize inference pod configurations:
  + **Custom init containers** – Run user-defined init containers before the inference server starts (for example, cache warming, GDS setup). Init containers are injected after the operator's prefetch container.
  + **Custom volumes** – Add additional volumes (`emptyDir`, `hostPath`, `configMap`, etc.) to the pod spec, which can be referenced by init containers via `volumeMounts`.
  + **Custom scheduler name** – Specify a custom Kubernetes scheduler for pod placement.
+ **Custom Certificates** – Use your own ACM certificates for inference endpoints instead of operator-generated self-signed certificates, configured via `customCertificateConfig`. Supports publicly trusted ACM certificates, AWS Private CA certificates, and certificates imported from external CAs. The operator monitors certificate health and supports automatic renewal detection.
+ **Request Limits** – Control request handling per pod via the new `RequestLimits` configuration under `Worker`, with the following configurable fields:
  + `maxConcurrentRequests` – Maximum concurrent in-flight requests per pod.
  + `maxQueueSize` – Requests to queue when the concurrency limit is reached before rejecting.
  + `overflowStatusCode` – HTTP status code returned when limits are exceeded (default: 429).

For detailed information including prerequisites and upgrade instructions, see the sections below.

### Prerequisites
<a name="sagemaker-hyperpod-inference-v3-1-prerequisites"></a>

To use the Custom Certificates feature, add the following permissions to your Inference Operator execution role:

```
{  
    "Sid": "ACMCertificateAccess",  
    "Effect": "Allow",  
    "Action": [  
        "acm:DescribeCertificate",  
        "acm:GetCertificate"  
    ],  
    "Resource": "arn:aws:acm:*:*:certificate/*"  
}
```

### Upgrade to v3.1
<a name="sagemaker-hyperpod-inference-v3-1-upgrade"></a>

If you already have the Inference Operator installed via Helm, use the following commands to upgrade:

```
helm get values -n kube-system hyperpod-inference-operator > current-values.yaml

cd sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/charts/inference-operator

helm upgrade hyperpod-inference-operator . -n kube-system \
  -f current-values.yaml --set image.tag=v3.1
    
# Verification
kubectl get deployment hyperpod-inference-operator-controller-manager \
  -n hyperpod-inference-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```

## SageMaker HyperPod Inference release notes: v3.0
<a name="sagemaker-hyperpod-inference-release-notes-20260223"></a>

**Release Date:** February 23, 2026

**Summary**

Inference Operator 3.0 introduces EKS Add-on integration for simplified lifecycle management, Node Affinity support for granular scheduling control, and improved resource tagging. Existing Helm-based installations can be migrated to the EKS Add-on using the provided migration script. Update your Inference Operator execution role with new tagging permissions before upgrading.

**Key Features**
+ **EKS Add-on Integration** – Enterprise-grade lifecycle management with simplified installation experience
+ **Node Affinity** – Granular scheduling control for excluding spot instances, preferring availability zones, or targeting nodes with custom labels
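
As a hedged sketch of Node Affinity, assuming standard Kubernetes affinity syntax is surfaced in the endpoint spec (the exact field path may differ), the following excludes spot capacity and prefers a specific Availability Zone:

```
# Hypothetical sketch -- exact placement in the endpoint spec may differ
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: eks.amazonaws.com/capacityType
              operator: NotIn
              values: ["SPOT"]          # never schedule on spot instances
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        preference:
          matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: ["us-east-1a"]    # prefer this Availability Zone
```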

For detailed information including prerequisites, upgrade instructions, and migration guidance, see the sections below.

### Prerequisites
<a name="sagemaker-hyperpod-inference-v3-0-prerequisites"></a>

Before upgrading the Helm chart to v3.0, you must add tagging permissions to your Inference Operator execution role. As part of improving resource tagging and security, the Inference Operator now tags ALB, S3, and ACM resources. Add the following permissions to your Inference Operator execution role:

```
{
    "Sid": "CertificateTaggingPermission",
    "Effect": "Allow",
    "Action": [
        "acm:AddTagsToCertificate"
    ],
    "Resource": "arn:aws:acm:*:*:certificate/*"
},
{
    "Sid": "S3PutObjectTaggingAccess",
    "Effect": "Allow",
    "Action": [
        "s3:PutObjectTagging"
    ],
    "Resource": [
        "arn:aws:s3:::<TLS_BUCKET>/*"
    ]
}
```

Replace `<TLS_BUCKET>` with the name of your TLS bucket.

### Upgrade to v3.0
<a name="sagemaker-hyperpod-inference-v3-0-upgrade"></a>

If you already have the Inference Operator installed via Helm, use the following commands to upgrade:

```
helm get values -n kube-system hyperpod-inference-operator \
> current-values.yaml

cd sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/\
charts/inference-operator

helm upgrade hyperpod-inference-operator . -n kube-system \
  -f current-values.yaml --set image.tag=v3.0
    
# Verification
kubectl get deployment hyperpod-inference-operator-controller-manager \
  -n hyperpod-inference-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```

### Helm to EKS Add-on Migration
<a name="sagemaker-hyperpod-inference-v3-0-migration"></a>

If you installed the Inference Operator through Helm before v3.0, we recommend migrating to the EKS Add-on so that you receive timely updates as new Inference Operator features are released. The migration script migrates the SageMaker HyperPod Inference Operator from a Helm-based installation to an EKS Add-on installation.

**Overview:** The script takes a cluster name and region as parameters, retrieves the existing Helm installation configuration, and migrates to EKS Add-on deployment. It creates new IAM roles for the Inference Operator, ALB Controller, and KEDA Operator.

Before migrating the Inference Operator, the script ensures that the required dependencies (S3 CSI driver, FSx CSI driver, cert-manager, and metrics-server) exist. If they don't, it deploys them as Add-ons.

After the Inference Operator Add-on migration completes, the script also migrates S3, FSx, and other dependencies (ALB, KEDA, cert-manager, metrics-server) if they were originally installed via the Inference Operator Helm chart. Use `--skip-dependencies-migration` to skip this step for S3 CSI driver, FSx CSI driver, cert-manager, and metrics-server. Note that ALB and KEDA are installed as part of the Add-on in the same namespace as Inference Operator, and will be migrated as part of the Inference Operator Add-on.

**Important**  
Do not deploy new models during the migration; they will not be deployed until the migration is complete. Once the Inference Operator Add-on is in the ACTIVE state, you can deploy new models again. Migration typically takes 15 to 20 minutes and usually completes within 30 minutes if only a few models are currently deployed.

**Migration Prerequisites:**
+ AWS CLI configured with appropriate credentials
+ kubectl configured with access to your EKS cluster
+ Helm installed
+ Existing Helm installation of hyperpod-inference-operator

**Note**  
Endpoints that are already running will not be interrupted during the migration process. Existing endpoints will continue to serve traffic without disruption throughout the migration.

**Getting the Migration Script:**

```
git clone https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/\
charts/inference-operator/migration
```

**Usage:**

```
./helm_to_addon.sh [OPTIONS] \
  --cluster-name <cluster-name> (Required) \
  --region <region> (Required) \
  --helm-namespace kube-system (Optional) \
  --auto-approve (Optional) \
  --skip-dependencies-migration (Optional) \
  --s3-mountpoint-role-arn <s3-mountpoint-role-arn> (Optional) \
  --fsx-role-arn <fsx-role-arn> (Optional)
```

**Options:**
+ `--cluster-name NAME` – EKS cluster name (required)
+ `--region REGION` – AWS region (required)
+ `--helm-namespace NAMESPACE` – Namespace where Helm chart is installed (default: kube-system) (optional)
+ `--s3-mountpoint-role-arn ARN` – S3 Mountpoint CSI driver IAM role ARN (optional)
+ `--fsx-role-arn ARN` – FSx CSI driver IAM role ARN (optional)
+ `--auto-approve` – Skip confirmation prompts. Mutually exclusive with `--step-by-step` (optional)
+ `--step-by-step` – Pause after each major step for review. Mutually exclusive with `--auto-approve` (optional)
+ `--skip-dependencies-migration` – Skip migration of Helm-installed dependencies to Add-ons. Use this if the dependencies were NOT installed via the Inference Operator Helm chart, or if you want to manage them separately (optional)

**Examples:**

Basic migration (migrates dependencies):

```
./helm_to_addon.sh \
  --cluster-name my-cluster \
  --region us-east-1
```

Auto-approve without prompts:

```
./helm_to_addon.sh \
  --cluster-name my-cluster \
  --region us-east-1 \
  --auto-approve
```

Skip dependency migration for the FSx CSI driver, S3 Mountpoint CSI driver, cert-manager, and metrics-server:

```
./helm_to_addon.sh \
  --cluster-name my-cluster \
  --region us-east-1 \
  --skip-dependencies-migration
```

Provide existing S3 and FSx IAM roles:

```
./helm_to_addon.sh \
  --cluster-name my-cluster \
  --region us-east-1 \
  --s3-mountpoint-role-arn arn:aws:iam::123456789012:role/s3-csi-role \
  --fsx-role-arn arn:aws:iam::123456789012:role/fsx-csi-role
```

**Backup Location:**

Backups are stored in `/tmp/hyperpod-migration-backup-<timestamp>/`

Backups enable safe migration and recovery:
+ **Rollback on Failure** – If migration fails, the script can automatically restore your cluster to its pre-migration state using the backed up configurations
+ **Audit Trail** – Provides a complete record of what existed before migration for troubleshooting and compliance
+ **Configuration Reference** – Allows you to compare pre-migration and post-migration configurations
+ **Manual Recovery** – If needed, you can manually inspect and restore specific resources from the backup directory

**Rollback:**

If migration fails, the script prompts for user confirmation before initiating rollback to restore the previous state.

## SageMaker HyperPod Inference release notes: v2.3
<a name="sagemaker-hyperpod-inference-release-notes-20260203"></a>

**What's new**

This release introduces new optional fields in the Custom Resource Definitions (CRDs) to enhance deployment configuration flexibility.

**Features**
+ **Multi Instance Types**
  + **Enhanced deployment reliability** – Supports multi-instance type configurations with automatic failover to alternative instance types when preferred options lack capacity
  + **Intelligent resource scheduling** – Uses Kubernetes node affinity to prioritize instance types while guaranteeing deployment even when preferred resources are unavailable
  + **Optimized cost and performance** – Maintains your instance type preferences and prevents capacity-related failures during cluster fluctuations

**Bug Fixes**

Changes to the `invocationEndpoint` field in the `InferenceEndpointConfig` spec now take effect:
+ If the `invocationEndpoint` field is patched or updated, dependent resources, such as the `Ingress`, the Load Balancer, the `SageMakerEndpointRegistration`, and the SageMaker Endpoint, are updated with the normalized value.
+ The `invocationEndpoint` value you provide is stored as-is in the `InferenceEndpointConfig` spec itself. When this value is used to create a Load Balancer and, if enabled, a SageMaker Endpoint, it is normalized to have exactly one leading forward slash.
  + `v1/chat/completions` is normalized to `/v1/chat/completions` for the `Ingress`, AWS Load Balancer, and SageMaker Endpoint. For the `SageMakerEndpointRegistration`, it appears in its spec as `v1/chat/completions`.
  + `///invoke` is normalized to `/invoke` for the `Ingress`, AWS Load Balancer, and SageMaker Endpoint. For the `SageMakerEndpointRegistration`, it appears in its spec as `invoke`.

**Installing Helm:**

Follow: [https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart)

If you only want to install the inference operator, after step 1 (`Set Up Your Helm Environment`), run `cd HyperPodHelmChart/charts/inference-operator`. Because you are then in the inference operator chart directory itself, replace `helm_chart/HyperPodHelmChart` with `.` wherever it appears in the commands.

**Upgrade Operator to v2.3 in case already installed:**

```
cd sagemaker-hyperpod-cli/helm_chart/HyperPodHelmChart/\
charts/inference-operator

helm get values -n kube-system hyperpod-inference-operator \
> current-values.yaml

helm upgrade hyperpod-inference-operator . \
  -n kube-system \
  -f current-values.yaml \
  --set image.tag=v2.3
```

# HyperPod in Studio
<a name="sagemaker-hyperpod-studio"></a>

You can launch machine learning workloads on Amazon SageMaker HyperPod clusters and view HyperPod cluster information in Amazon SageMaker Studio. The increased visibility into cluster details and hardware metrics can help your team identify the right candidate for your pre-training or fine-tuning workloads. 

A set of commands are available to help you get started when you launch Studio IDEs on a HyperPod cluster. You can work on your training scripts, use Docker containers for the training scripts, and submit jobs to the cluster, all from within the Studio IDEs. The following sections provide information on how to set this up, how to discover clusters and monitor their tasks, how to view cluster information, and how to connect to HyperPod clusters in IDEs within Studio.

**Topics**
+ [Setting up HyperPod in Studio](sagemaker-hyperpod-studio-setup.md)
+ [HyperPod tabs in Studio](sagemaker-hyperpod-studio-tabs.md)
+ [Connecting to HyperPod clusters and submitting tasks to clusters](sagemaker-hyperpod-studio-open.md)
+ [Troubleshooting](sagemaker-hyperpod-studio-troubleshoot.md)

# Setting up HyperPod in Studio
<a name="sagemaker-hyperpod-studio-setup"></a>

To access your clusters through Amazon SageMaker Studio, you need to set them up according to your choice of cluster orchestrator. In the following sections, choose the setup that matches your orchestrator.

The instructions assume that you already have your cluster set up. For information on the cluster orchestrators and how to set up, start with the HyperPod orchestrator pages:
+  [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md) 
+  [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) 

**Topics**
+ [Setting up a Slurm cluster in Studio](sagemaker-hyperpod-studio-setup-slurm.md)
+ [Setting up an Amazon EKS cluster in Studio](sagemaker-hyperpod-studio-setup-eks.md)

# Setting up a Slurm cluster in Studio
<a name="sagemaker-hyperpod-studio-setup-slurm"></a>

The following instructions describe how to set up a HyperPod Slurm cluster in Studio.

1. Create a domain or have one ready. For information on creating a domain, see [Guide to getting set up with Amazon SageMaker AI](gs.md).

1. (Optional) Create and attach a custom FSx for Lustre volume to your domain. 

   1. Ensure that your FSx Lustre file system exists in the same VPC as your intended domain, and is in one of the subnets present in the domain.

   1. You can follow the instructions in [Adding a custom file system to a domain](domain-custom-file-system.md). 

1. (Optional) We recommend that you add tags to your clusters to ensure a smoother workflow. For information on how to add tags, see [Edit a SageMaker HyperPod cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-edit-clusters) to update your cluster using the SageMaker AI console.

   1. Associate your FSx for Lustre file system with your Studio domain using a tag. This helps you identify the file system when launching your Studio spaces. To do so, add the following tag to your cluster to identify it with the FSx file system ID, `fs-id`. 

      Tag Key = “`hyperpod-cluster-filesystem`”, Tag Value = “`fs-id`”.

   1. Associate your [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html) workspace with your Studio domain using a tag. This lets you link directly to your Grafana workspace from your cluster in Studio. To do so, add the following tag to your cluster to identify it with your Grafana workspace ID, `ws-id`.

      Tag Key = “`grafana-workspace`”, Tag Value = “`ws-id`”.

1. Add the following permission to your execution role. 

   For information on SageMaker AI execution roles and how to edit them, see [Understanding domain space permissions and execution roles](execution-roles-and-spaces.md). 

   To learn how to attach policies to an IAM user or group, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "ssm:StartSession",
                   "ssm:TerminateSession"
               ],
               "Resource": "*"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker:CreateCluster",
                   "sagemaker:ListClusters"
               ],
               "Resource": "*"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "cloudwatch:PutMetricData",
                   "cloudwatch:GetMetricData"
               ],
               "Resource": "*"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker:DescribeCluster",
                   "sagemaker:DescribeClusterNode",
                   "sagemaker:ListClusterNodes",
                   "sagemaker:UpdateCluster",
                   "sagemaker:UpdateClusterSoftware"
               ],
               "Resource": "arn:aws:sagemaker:us-east-1:111122223333:cluster/*"
           }
       ]
   }
   ```

------

1. Add a tag to this IAM role, with Tag Key = “`SSMSessionRunAs`” and Tag Value = “`os user`”. The `os user` here is the same user that you set up for the Slurm cluster. Manage access to SageMaker HyperPod clusters at an IAM role or user level by using the Run As feature in [AWS Systems Manager Agent (SSM Agent)](https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent.html). With this feature, you can start each SSM session using the operating system (OS) user associated with the IAM role or user. 

   For information on how to add tags to your execution role, see [Tag IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_tags_roles.html).

1. [Turn on Run As support for Linux and macOS managed nodes](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-preferences-run-as.html). The Run As settings are account-wide and are required for all SSM sessions to start successfully.

1. (Optional) [Restrict task view in Studio for Slurm clusters](#sagemaker-hyperpod-studio-setup-slurm-restrict-tasks-view). For information on viewable tasks in Studio, see [Tasks](sagemaker-hyperpod-studio-tabs.md#sagemaker-hyperpod-studio-tabs-tasks).

In Amazon SageMaker Studio, you can view your clusters in **HyperPod clusters** (under **Compute**).

## Restrict task view in Studio for Slurm clusters
<a name="sagemaker-hyperpod-studio-setup-slurm-restrict-tasks-view"></a>

You can restrict users to viewing only the Slurm tasks they are authorized to view, without requiring manual input of namespaces or additional permission checks. The restriction is applied based on the user's IAM role, providing a streamlined and secure user experience. The following section provides information on how to restrict task view in Studio for Slurm clusters. For information on viewable tasks in Studio, see [Tasks](sagemaker-hyperpod-studio-tabs.md#sagemaker-hyperpod-studio-tabs-tasks). 

All Studio users can view, manage, and interact with all Slurm cluster tasks by default. To restrict this, you can manage access to SageMaker HyperPod clusters at an IAM role or user level by using the **Run As** feature in [AWS Systems Manager Agent (SSM Agent)](https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent.html).

You can do this by tagging IAM roles with specific identifiers, such as their username or group. When a user accesses Studio, the Session Manager uses the Run As feature to execute commands as a specific Slurm user account that matches their IAM role tags. The Slurm configuration can be set up to limit task visibility based on the user account. The Studio UI will automatically filter tasks visible to that specific user account when commands are executed through the Run As feature. Once set up, each user assuming the role with the specified identifiers will have those Slurm tasks filtered based on the Slurm configuration. For information on how to add tags to your execution role, see [Tag IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_tags_roles.html).

# Setting up an Amazon EKS cluster in Studio
<a name="sagemaker-hyperpod-studio-setup-eks"></a>

The following instructions describe how to set up an Amazon EKS cluster in Studio.

1. Create a domain or have one ready. For information on creating a domain, see [Guide to getting set up with Amazon SageMaker AI](gs.md).

1. Add the following permission to your execution role. 

   For information on SageMaker AI execution roles and how to edit them, see [Understanding domain space permissions and execution roles](execution-roles-and-spaces.md). 

   To learn how to attach policies to an IAM user or group, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "DescribeHyperpodClusterPermissions",
               "Effect": "Allow",
               "Action": [
                   "sagemaker:DescribeCluster"
               ],
               "Resource": "arn:aws:sagemaker:us-east-1:111122223333:cluster/cluster-name"
           },
           {
               "Effect": "Allow",
               "Action": "ec2:Describe*",
               "Resource": "*"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "ecr:CompleteLayerUpload",
                   "ecr:GetAuthorizationToken",
                   "ecr:UploadLayerPart",
                   "ecr:InitiateLayerUpload",
                   "ecr:BatchCheckLayerAvailability",
                   "ecr:PutImage"
               ],
               "Resource": "*"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "cloudwatch:PutMetricData",
                   "cloudwatch:GetMetricData"
               ],
               "Resource": "*"
           },
           {
               "Sid": "UseEksClusterPermissions",
               "Effect": "Allow",
               "Action": [
                   "eks:DescribeCluster",
                   "eks:AccessKubernetesApi",
                   "eks:DescribeAddon"
               ],
               "Resource": "arn:aws:eks:us-east-1:111122223333:cluster/cluster-name"
           },
           {
               "Sid": "ListClustersPermission",
               "Effect": "Allow",
               "Action": [
                   "sagemaker:ListClusters"
               ],
               "Resource": "*"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "ssm:StartSession",
                   "ssm:TerminateSession"
               ],
               "Resource": "*"
           }
       ]
   }
   ```

------

1. [Grant IAM users access to Kubernetes with EKS access entries](https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html).

   1. Navigate to the Amazon EKS cluster associated with your HyperPod cluster.

   1. Choose the **Access** tab and [create an access entry](https://docs.aws.amazon.com/eks/latest/userguide/creating-access-entries.html) for the execution role you created. 

      1. In step 1, select the execution role that you created in the **IAM** principal dropdown.

      1. In step 2, select a policy name and choose the access scope that you want to grant to users. 

1. (Optional) To ensure a smoother experience, we recommend that you add tags to your clusters. For information on how to add tags, see [Edit a SageMaker HyperPod cluster](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-edit-clusters) to update your cluster using the SageMaker AI console.

   1. Associate your [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html) workspace with your Studio domain using a tag. This lets you link directly to your Grafana workspace from your cluster in Studio. To do so, add the following tag to your cluster to identify it with your Grafana workspace ID, `ws-id`.

     Tag Key = “`grafana-workspace`”, Tag Value = “`ws-id`”.

1. (Optional) [Restrict task view in Studio for EKS clusters](#sagemaker-hyperpod-studio-setup-eks-restrict-tasks-view). For information on viewable tasks in Studio, see [Tasks](sagemaker-hyperpod-studio-tabs.md#sagemaker-hyperpod-studio-tabs-tasks).

## Restrict task view in Studio for EKS clusters
<a name="sagemaker-hyperpod-studio-setup-eks-restrict-tasks-view"></a>

You can restrict Kubernetes namespace permissions for users, so that they will only have access to view tasks belonging to a specified namespace. The following provides information on how to restrict the task view in Studio for EKS clusters. For information on viewable tasks in Studio, see [Tasks](sagemaker-hyperpod-studio-tabs.md#sagemaker-hyperpod-studio-tabs-tasks). 

Users have visibility into all EKS cluster tasks by default. You can restrict users' visibility to tasks in specified namespaces, ensuring that users can access the resources they need while maintaining strict access controls.

Once the restriction is applied, you need to provide the namespace to each user assuming the role. Studio only displays the jobs of a namespace after the user enters a namespace that they have permission to view in the **Tasks** tab. 
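
A namespace-scoped variant can make this concrete. The following sketch (the `team-a` namespace and group names are hypothetical) grants one team read access to pods and events in a single namespace only:

```
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-task-viewer
  namespace: team-a                  # hypothetical namespace
rules:
- apiGroups: [""]
  resources: ["pods", "events"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-task-viewer-binding
  namespace: team-a
subjects:
- kind: Group
  name: team-a-viewers               # hypothetical group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-task-viewer
  apiGroup: rbac.authorization.k8s.io
```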

The following configuration allows administrators to grant specific, limited access to data scientists for viewing tasks within the cluster. This configuration grants the following permissions:
+ List and get pods
+ List and get events
+ Get Custom Resource Definitions (CRDs)

**YAML Configuration:**

```
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pods-events-crd-cluster-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["get", "list"]
- apiGroups: ["apiextensions.k8s.io"]
  resources: ["customresourcedefinitions"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pods-events-crd-cluster-role-binding
subjects:
- kind: Group
  name: pods-events-crd-cluster-level
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: pods-events-crd-cluster-role
  apiGroup: rbac.authorization.k8s.io
```

1. Save the YAML configuration to a file named `cluster-role.yaml`.

1. Apply the configuration using [https://kubernetes.io/docs/reference/kubectl/](https://kubernetes.io/docs/reference/kubectl/):

   ```
   kubectl apply -f cluster-role.yaml
   ```

1. Verify the configuration:

   ```
   kubectl get clusterrole pods-events-crd-cluster-role
   kubectl get clusterrolebinding pods-events-crd-cluster-role-binding
   ```

1. Assign users to the `pods-events-crd-cluster-level` group through your identity provider or IAM.
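
If you map IAM principals through the legacy `aws-auth` ConfigMap rather than EKS access entries, an entry such as the following sketch (the role ARN and username are placeholders) places an IAM role in the `pods-events-crd-cluster-level` group:

```
# Hypothetical aws-auth fragment -- role ARN and username are placeholders
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/DataScientistRole
      username: data-scientist
      groups:
        - pods-events-crd-cluster-level
```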

# HyperPod tabs in Studio
<a name="sagemaker-hyperpod-studio-tabs"></a>

In Amazon SageMaker Studio you can navigate to one of your clusters in **HyperPod clusters** (under **Compute**) and view your list of clusters. The displayed clusters contain information like tasks, hardware metrics, settings, and metadata details. This visibility can help your team identify the right candidate for your pre-training or fine-tuning workloads. The following sections describe each type of information.

## Tasks
<a name="sagemaker-hyperpod-studio-tabs-tasks"></a>

Amazon SageMaker HyperPod provides a view of your cluster tasks. Tasks are operations or jobs that are sent to the cluster. These can be machine learning operations, like training, running experiments, or inference. The following section provides information on your HyperPod cluster tasks.

In Amazon SageMaker Studio, you can navigate to one of your clusters in **HyperPod clusters** (under **Compute**) and view the **Tasks** information on your cluster. If you are having any issues with viewing tasks, see [Troubleshooting](sagemaker-hyperpod-studio-troubleshoot.md).

The task table includes:

------
#### [ For Slurm clusters ]

For Slurm clusters, the tasks currently in the Slurm job scheduler queue are shown in the table. The information shown for each task includes the task name, status, job ID, partition, run time, nodes, created by, and actions.

For a list and details of past jobs, use the [https://slurm.schedmd.com/sacct.html](https://slurm.schedmd.com/sacct.html) command in JupyterLab or a Code Editor terminal. The `sacct` command is used to view *historical information* about jobs that have *finished* or are *complete* in the system. It provides accounting information, including job resource usage, such as memory, and exit status. 

By default, all Studio users can view, manage, and interact with all available Slurm tasks. To restrict the viewable tasks to Studio users, see [Restrict task view in Studio for Slurm clusters](sagemaker-hyperpod-studio-setup-slurm.md#sagemaker-hyperpod-studio-setup-slurm-restrict-tasks-view).

------
#### [ For Amazon EKS clusters ]

For Amazon EKS clusters, kubeflow (PyTorch, MPI, TensorFlow) tasks are shown in the table. PyTorch tasks are shown by default. You can sort for PyTorch, MPI, and TensorFlow under **Task type**. The information that is shown for each task includes the task name, status, namespace, priority class, and creation time. 

By default, all users can view jobs across all namespaces. To restrict the viewable Kubernetes namespaces available to Studio users, see [Restrict task view in Studio for EKS clusters](sagemaker-hyperpod-studio-setup-eks.md#sagemaker-hyperpod-studio-setup-eks-restrict-tasks-view). If a user cannot view the tasks and is asked to provide a namespace, they need to get that information from the administrator. 

------

## Metrics
<a name="sagemaker-hyperpod-studio-tabs-metrics"></a>

Amazon SageMaker HyperPod provides a view of your Slurm or Amazon EKS cluster utilization metrics. The following provides information on your HyperPod cluster metrics. 

For Amazon EKS clusters, you need to install the Amazon CloudWatch Observability EKS add-on to view the following metrics. For more information, see [Install the Amazon CloudWatch Observability EKS add-on](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-addon.html).

In Amazon SageMaker Studio, you can navigate to one of your clusters in **HyperPod clusters** (under **Compute**) and view the **Metrics** details on your cluster. Metrics provides a comprehensive view of cluster utilization metrics, including hardware, team, and task metrics. This includes compute availability and usage, team allocation and utilization, and task run and wait time information. 

## Settings
<a name="sagemaker-hyperpod-studio-tabs-settings"></a>

Amazon SageMaker HyperPod provides a view of your cluster settings. The following provides information on your HyperPod cluster settings.

In Amazon SageMaker Studio you can navigate to one of your clusters in **HyperPod clusters** (under **Compute**) and view the **Settings** information on your cluster. The information includes the following:
+ **Instances** details, including instance ID, status, instance type, and instance group
+ **Instance groups** details, including instance group name, type, counts, and compute information
+ **Orchestration** details, including the orchestrator, version, and certification authority
+ **Cluster resiliency** details
+ **Security** details, including subnets and security groups

## Details
<a name="sagemaker-hyperpod-studio-tabs-details"></a>

Amazon SageMaker HyperPod provides a view of your cluster metadata details. The following paragraph provides information on how to get your HyperPod cluster details.

In Amazon SageMaker Studio, you can navigate to one of your clusters in **HyperPod clusters** (under **Compute**) and view the **Details** on your cluster. This includes the tags, logs, and metadata.

# Connecting to HyperPod clusters and submitting tasks to clusters
<a name="sagemaker-hyperpod-studio-open"></a>

You can launch machine learning workloads on HyperPod clusters within Amazon SageMaker Studio IDEs. When you launch Studio IDEs on a HyperPod cluster, a set of commands are available to help you get started. You can work on your training scripts, use Docker containers for the training scripts, and submit jobs to the cluster, all from within the Studio IDEs. The following section provides information on how to connect your cluster to Studio IDEs.

In Amazon SageMaker Studio you can navigate to one of your clusters in **HyperPod clusters** (under **Compute**) and view your list of clusters. You can connect your cluster to an IDE listed under **Actions**. 

You can also choose your custom file system from the list of options. For information on how to get this set up, see [Setting up HyperPod in Studio](sagemaker-hyperpod-studio-setup.md).

Alternatively, you can create a space and launch an IDE using the AWS CLI. Use the following commands to do so. The following example creates a `Private` `JupyterLab` space for `user-profile-name` with the `fs-id` FSx for Lustre file system attached.

1. Create a space using the [create-space](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-space.html) AWS CLI command.

   ```
   aws sagemaker create-space \
   --region your-region \
   --ownership-settings "OwnerUserProfileName=user-profile-name" \
   --space-sharing-settings "SharingType=Private" \
   --space-settings "AppType=JupyterLab,CustomFileSystems=[{FSxLustreFileSystem={FileSystemId=fs-id}}]"
   ```

1. Create the app using the [create-app](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sagemaker/create-app.html) AWS CLI command.

   ```
   aws sagemaker create-app \
   --region your-region \
   --space-name space-name \
   --resource-spec '{"ec2InstanceType":"'"instance-type"'","appEnvironmentArn":"'"image-arn"'"}'
   ```
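If you prefer to script these two steps, the same requests can be assembled with the AWS SDK for Python (Boto3). The helper below only builds the request parameters so you can inspect them before calling `create_space`; the domain ID, space name, user profile name, and file system ID are placeholders.

```python
# Sketch: assemble Boto3 request parameters that mirror the CLI flags above.
# The domain ID, space name, owner, and file system ID are placeholders.

def build_create_space_request(domain_id, space_name, owner, fsx_file_system_id):
    """Return kwargs for sagemaker_client.create_space(**kwargs)."""
    return {
        "DomainId": domain_id,
        "SpaceName": space_name,
        "OwnershipSettings": {"OwnerUserProfileName": owner},
        "SpaceSharingSettings": {"SharingType": "Private"},
        "SpaceSettings": {
            "AppType": "JupyterLab",
            "CustomFileSystems": [
                {"FSxLustreFileSystem": {"FileSystemId": fsx_file_system_id}}
            ],
        },
    }

request = build_create_space_request(
    "d-xxxxxxxxxxxx", "my-space", "user-profile-name", "fs-id"
)
# When ready, pass the result to a real client, for example:
# boto3.client("sagemaker", region_name="your-region").create_space(**request)
print(request["SpaceSettings"]["AppType"])
```

A subsequent `create_app` call follows the same pattern, with the space name and resource spec from the second CLI example.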

Once you have your applications open, you can submit tasks directly to the clusters you are connected to. 

# Troubleshooting
<a name="sagemaker-hyperpod-studio-troubleshoot"></a>

The following section lists troubleshooting solutions for HyperPod in Studio.

**Topics**
+ [Tasks tab](#sagemaker-hyperpod-studio-troubleshoot-tasks)
+ [Metrics tab](#sagemaker-hyperpod-studio-troubleshoot-metrics)

## Tasks tab
<a name="sagemaker-hyperpod-studio-troubleshoot-tasks"></a>

If you get a "Custom Resource Definition (CRD) is not configured on the cluster" error in the **Tasks** tab, do the following.
+ Grant `EKSAdminViewPolicy` and `ClusterAccessRole` policies to your domain execution role. 

  For information on how to add tags to your execution role, see [Tag IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_tags_roles.html).

  To learn how to attach policies to an IAM user or group, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).

If the tasks grid for Slurm metrics doesn't stop loading in the **Tasks** tab, do the following.
+ Ensure that `RunAs` is enabled in your [AWS Session Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html) preferences and that the role you are using has the `SSMSessionRunAs` tag attached. 
  + To enable `RunAs`, navigate to the **Preferences** tab in the [Systems Manager console](https://console.aws.amazon.com/systems-manager/session-manager). 
  + For details, see [Turn on Run As support for Linux and macOS managed nodes](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-preferences-run-as.html).

For restricted task view in Studio for EKS clusters:
+ If your execution role doesn’t have permissions to list namespaces for EKS clusters.
  + See [Restrict task view in Studio for EKS clusters](sagemaker-hyperpod-studio-setup-eks.md#sagemaker-hyperpod-studio-setup-eks-restrict-tasks-view).
+ If users are experiencing issues with access for EKS clusters.

  1. Verify RBAC is enabled by running the following AWS CLI command.

     ```
     kubectl api-versions | grep rbac
     ```

     This should return `rbac.authorization.k8s.io/v1`.

  1. Check if the `ClusterRole` and `ClusterRoleBinding` exist by running the following commands.

     ```
     kubectl get clusterrole pods-events-crd-cluster-role
     kubectl get clusterrolebinding pods-events-crd-cluster-role-binding
     ```

  1. Verify user group membership. Ensure the user is correctly assigned to the `pods-events-crd-cluster-level` group in your identity provider or IAM.
+ If a user can't see any resources.
  + Verify group membership and ensure the `ClusterRoleBinding` is correctly applied.
+ If users can see resources in all namespaces.
  + If namespace restriction is required, consider using `Role` and `RoleBinding` instead of `ClusterRole` and `ClusterRoleBinding`.
+ If configuration appears correct, but permissions aren't applied.
  + Check if there are any `NetworkPolicies` or `PodSecurityPolicies` interfering with access.

## Metrics tab
<a name="sagemaker-hyperpod-studio-troubleshoot-metrics"></a>

If no Amazon CloudWatch metrics are displayed in the **Metrics** tab, do the following.
+ The **Metrics** section of HyperPod cluster details uses CloudWatch to fetch the data. To see metrics in this section, you need to have enabled [Cluster and task observability](sagemaker-hyperpod-eks-cluster-observability-cluster.md). Contact your administrator to configure metrics.

# SageMaker HyperPod references
<a name="sagemaker-hyperpod-ref"></a>

Find more information and references about using SageMaker HyperPod in the following topics.

**Topics**
+ [SageMaker HyperPod pricing](#sagemaker-hyperpod-ref-pricing)
+ [SageMaker HyperPod APIs](#sagemaker-hyperpod-ref-api)
+ [SageMaker HyperPod Slurm configuration](#sagemaker-hyperpod-ref-slurm-configuration)
+ [SageMaker HyperPod DLAMI](#sagemaker-hyperpod-ref-hyperpod-ami)
+ [SageMaker HyperPod API permissions reference](#sagemaker-hyperpod-ref-api-permissions)
+ [SageMaker HyperPod commands in AWS CLI](#sagemaker-hyperpod-ref-cli)
+ [SageMaker HyperPod Python modules in AWS SDK for Python (Boto3)](#sagemaker-hyperpod-ref-boto3)

## SageMaker HyperPod pricing
<a name="sagemaker-hyperpod-ref-pricing"></a>

The following topics provide information about SageMaker HyperPod pricing. To find more details on price per hour for using SageMaker HyperPod instances, see also [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/). 

**Capacity requests**

You can allocate on-demand or reserved compute capacity with SageMaker AI for use on SageMaker HyperPod. On-demand cluster creation allocates available capacity from the SageMaker AI on-demand capacity pool. Alternatively, you can request reserved capacity to ensure access by submitting a ticket for a quota increase. Inbound capacity requests are prioritized by SageMaker AI, and you receive an estimated time for capacity allocation.

**Service billing**

When you provision compute capacity on SageMaker HyperPod, you are billed for the duration of the capacity allocation. SageMaker HyperPod billing appears in your monthly bills with a line item for the type of capacity allocation (on-demand or reserved), the instance type, and the instance usage time. 

To submit a ticket for a quota increase, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

## SageMaker HyperPod APIs
<a name="sagemaker-hyperpod-ref-api"></a>

The following list is a full set of SageMaker HyperPod APIs for submitting action requests in JSON format to SageMaker AI through AWS CLI or AWS SDK for Python (Boto3).
+ [BatchDeleteClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BatchDeleteClusterNodes.html)
+ [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html)
+ [DeleteCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteCluster.html)
+ [DescribeCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html)
+ [DescribeClusterNode](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeClusterNode.html)
+ [ListClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusterNodes.html)
+ [ListClusters](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusters.html)
+ [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html)
+ [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html)

## SageMaker HyperPod Slurm configuration
<a name="sagemaker-hyperpod-ref-slurm-configuration"></a>

HyperPod supports two approaches for configuring Slurm on your cluster. Choose the approach that best fits your needs.


|  |  |  | 
| --- |--- |--- |
| Approach | Description | Recommended For | 
| API-driven configuration | Define Slurm configuration directly in the CreateCluster and UpdateCluster API requests | New clusters; simplified management | 
| Legacy configuration | Use a separate provisioning\_parameters.json file stored in Amazon S3 | Existing clusters; backward compatibility | 

### API-driven Slurm configuration (Recommended)
<a name="sagemaker-hyperpod-ref-slurm-api-driven"></a>

With API-driven configuration, you define Slurm node types, partition assignments, and filesystem mounts directly in the CreateCluster and UpdateCluster API requests. This approach provides:
+ **Single source of truth** – All configuration in the API request
+ **No S3 file management** – No need to create or maintain `provisioning_parameters.json`
+ **Built-in validation** – API validates Slurm topology before cluster creation
+ **Drift detection** – Detects unauthorized changes to `slurm.conf`
+ **Per-instance-group storage** – Configure different FSx filesystems for different instance groups
+ **FSx for OpenZFS support** – Mount OpenZFS filesystems in addition to FSx for Lustre

#### SlurmConfig (per instance group)
<a name="sagemaker-hyperpod-ref-slurm-config"></a>

Add `SlurmConfig` to each instance group to define the Slurm node type and partition assignment.

```
"SlurmConfig": {
    "NodeType": "Controller | Login | Compute",
    "PartitionNames": ["string"]
}
```

**Parameters:**
+ `NodeType` – Required. The Slurm node type for this instance group. Valid values:
  + `Controller` – Slurm controller (head) node. Runs the `slurmctld` daemon. Exactly one instance group must have this node type.
  + `Login` – Login node for user access. Optional. At most one instance group can have this node type.
  + `Compute` – Worker nodes that execute jobs. Can have multiple instance groups with this node type.
**Important**  
`NodeType` is immutable. Once set during cluster creation, it cannot be changed. To use a different node type, create a new instance group.
+ `PartitionNames` – Conditional. An array of Slurm partition names. Required for `Compute` node types; not allowed for `Controller` or `Login` node types. Currently supports a single partition name per instance group.
**Note**  
All nodes are automatically added to the universal `dev` partition in addition to their specified partition.

**Example:**

```
{
    "InstanceGroupName": "gpu-compute",
    "InstanceType": "ml.p4d.24xlarge",
    "InstanceCount": 8,
    "SlurmConfig": {
        "NodeType": "Compute",
        "PartitionNames": ["gpu-training"]
    },
    "LifeCycleConfig": {
        "SourceS3Uri": "s3://sagemaker-bucket/lifecycle/src/",
        "OnCreate": "on_create.sh"
    },
    "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodRole"
}
```
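The node type rules above (exactly one `Controller`, at most one `Login`, and `PartitionNames` only on `Compute` groups) can be checked locally before you call CreateCluster. The following sketch is an illustrative validator based only on the constraints documented here; it is not part of the HyperPod API or SDK.

```python
# Sketch: validate the SlurmConfig topology rules described above before
# calling CreateCluster. This mirrors only the documented constraints.

def validate_slurm_topology(instance_groups):
    controllers = [g for g in instance_groups
                   if g.get("SlurmConfig", {}).get("NodeType") == "Controller"]
    logins = [g for g in instance_groups
              if g.get("SlurmConfig", {}).get("NodeType") == "Login"]
    if len(controllers) != 1:
        raise ValueError("exactly one instance group must be the Controller")
    if len(logins) > 1:
        raise ValueError("at most one instance group can be the Login node")
    for group in instance_groups:
        cfg = group.get("SlurmConfig", {})
        names = cfg.get("PartitionNames")
        if cfg.get("NodeType") == "Compute":
            # Currently a single partition name per instance group.
            if not names or len(names) != 1:
                raise ValueError(
                    f"{group['InstanceGroupName']}: Compute groups need "
                    "exactly one partition name")
        elif names:
            raise ValueError(
                f"{group['InstanceGroupName']}: PartitionNames is only "
                "valid on Compute groups")

groups = [
    {"InstanceGroupName": "controller",
     "SlurmConfig": {"NodeType": "Controller"}},
    {"InstanceGroupName": "gpu-compute",
     "SlurmConfig": {"NodeType": "Compute",
                     "PartitionNames": ["gpu-training"]}},
]
validate_slurm_topology(groups)  # raises ValueError on an invalid topology
```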

#### Orchestrator.Slurm (cluster level)
<a name="sagemaker-hyperpod-ref-slurm-orchestrator"></a>

Add `Orchestrator.Slurm` to the cluster configuration to specify how HyperPod manages the `slurm.conf` file.

```
"Orchestrator": {
    "Slurm": {
        "SlurmConfigStrategy": "Managed | Overwrite | Merge"
    }
}
```

**Parameters:**
+ `SlurmConfigStrategy` – Required when `Orchestrator.Slurm` is provided. Controls how HyperPod manages the `slurm.conf` file on the controller node. Valid values:
  + `Managed` (default) – HyperPod fully controls the partition-node mappings in `slurm.conf`. Drift detection is enabled: if the current `slurm.conf` differs from the expected configuration, UpdateCluster fails with an error. Use this strategy when you want HyperPod to be the single source of truth for Slurm configuration.
  + `Overwrite` – HyperPod forces the API configuration to be applied, overwriting any manual changes to `slurm.conf`. Drift detection is disabled. Use this strategy to recover from drift or reset the cluster to a known state.
  + `Merge` – HyperPod preserves manual `slurm.conf` changes and merges them with the API configuration. Drift detection is disabled. Use this strategy if you need to make manual Slurm configuration changes that should persist across updates.

**Note**  
If `Orchestrator.Slurm` is omitted from the request, the default behavior is `Managed` strategy.

**Tip**  
You can change `SlurmConfigStrategy` at any time using UpdateCluster. There is no lock-in to a specific strategy.

**Example:**

```
{
    "ClusterName": "my-hyperpod-cluster",
    "InstanceGroups": [...],
    "Orchestrator": {
        "Slurm": {
            "SlurmConfigStrategy": "Managed"
        }
    }
}
```
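To make the three strategies concrete, the following sketch models the documented update behavior on plain dictionaries that stand in for `slurm.conf` contents. This is an illustrative model of the semantics only, not how HyperPod is implemented.

```python
# Sketch: model how each SlurmConfigStrategy treats drift between the
# expected (API-defined) configuration and the current slurm.conf contents.
# Dictionaries stand in for slurm.conf key-value pairs; illustrative only.

def apply_update(strategy, expected, current):
    if strategy == "Managed":
        if current != expected:
            raise RuntimeError("drift detected: UpdateCluster fails")
        return dict(expected)
    if strategy == "Overwrite":
        return dict(expected)           # manual edits are discarded
    if strategy == "Merge":
        merged = dict(expected)
        merged.update(current)          # manual edits are preserved
        return merged
    raise ValueError(f"unknown strategy: {strategy}")

expected = {"PartitionName=gpu-training": "Nodes=ip-10-0-0-[1-8]"}
drifted = dict(expected, MaxJobCount="20000")  # a manual edit

print(apply_update("Overwrite", expected, drifted) == expected)   # True
print("MaxJobCount" in apply_update("Merge", expected, drifted))  # True
```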

#### SlurmConfigStrategy comparison
<a name="sagemaker-hyperpod-ref-slurm-strategy-comparison"></a>


|  |  |  |  | 
| --- |--- |--- |--- |
| Strategy | Drift Detection | Manual Changes | Use Case | 
| Managed | Enabled – blocks updates if drift detected | Blocked | HyperPod managed | 
| Overwrite | Disabled | Overwritten | Recovery from drift; reset to known state | 
| Merge | Disabled | Preserved | Advanced users with custom slurm.conf needs | 

#### FSx configuration via InstanceStorageConfigs
<a name="sagemaker-hyperpod-ref-slurm-fsx-config"></a>

With API-driven configuration, you can configure FSx filesystems per instance group using `InstanceStorageConfigs`. This allows different instance groups to mount different filesystems.

**Prerequisites:**
+ Your cluster must use a custom VPC (via `VpcConfig`). FSx filesystems reside in your VPC, and the platform-managed VPC cannot reach them.
+ At least one instance group must have `SlurmConfig` with `NodeType: Controller`.

##### FsxLustreConfig
<a name="sagemaker-hyperpod-ref-slurm-fsx-lustre"></a>

Configure FSx for Lustre filesystem mounting for an instance group.

```
"InstanceStorageConfigs": [
    {
        "FsxLustreConfig": {
            "DnsName": "string",
            "MountPath": "string",
            "MountName": "string"
        }
    }
]
```

**Parameters:**
+ `DnsName` – Required. The DNS name of the FSx for Lustre filesystem. Example: `fs-0abc123def456789.fsx.us-west-2.amazonaws.com`
+ `MountPath` – Optional. The local mount path on the instance. Default: `/fsx`
+ `MountName` – Required. The mount name of the FSx for Lustre filesystem. You can find this in the Amazon FSx console or by running `aws fsx describe-file-systems`.

##### FsxOpenZfsConfig
<a name="sagemaker-hyperpod-ref-slurm-fsx-openzfs"></a>

Configure FSx for OpenZFS filesystem mounting for an instance group.

```
"InstanceStorageConfigs": [
    {
        "FsxOpenZfsConfig": {
            "DnsName": "string",
            "MountPath": "string"
        }
    }
]
```

**Parameters:**
+ `DnsName` – Required. The DNS name of the FSx for OpenZFS filesystem. Example: `fs-0xyz987654321.fsx.us-west-2.amazonaws.com`
+ `MountPath` – Optional. The local mount path on the instance. Default: `/home`

**Note**  
Each instance group can have at most one `FsxLustreConfig` and one `FsxOpenZfsConfig`.

**Example with multiple filesystems:**

```
{
    "InstanceGroupName": "gpu-compute",
    "InstanceType": "ml.p4d.24xlarge",
    "InstanceCount": 4,
    "SlurmConfig": {
        "NodeType": "Compute",
        "PartitionNames": ["gpu-training"]
    },
    "InstanceStorageConfigs": [
        {
            "FsxLustreConfig": {
                "DnsName": "fs-0abc123def456789.fsx.us-west-2.amazonaws.com",
                "MountPath": "/fsx",
                "MountName": "abcdefgh"
            }
        },
        {
            "FsxOpenZfsConfig": {
                "DnsName": "fs-0xyz987654321.fsx.us-west-2.amazonaws.com",
                "MountPath": "/shared"
            }
        },
        {
            "EbsVolumeConfig": {
                "VolumeSizeInGB": 500
            }
        }
    ],
    "LifeCycleConfig": {
        "SourceS3Uri": "s3://sagemaker-bucket/lifecycle/src/",
        "OnCreate": "on_create.sh"
    },
    "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodRole"
}
```
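The defaults and limits above can be captured in a small helper. The following sketch assembles an `InstanceStorageConfigs` array with the documented default mount paths applied (`/fsx` for Lustre, `/home` for OpenZFS) and at most one config of each type; the helper itself and its DNS names are illustrative placeholders.

```python
# Sketch: assemble an InstanceStorageConfigs array, applying the documented
# default mount paths and the at-most-one-of-each rule. Illustrative only;
# the DNS names used below are placeholders.

def build_storage_configs(lustre=None, openzfs=None, ebs_size_gb=None):
    configs = []
    if lustre:
        configs.append({"FsxLustreConfig": {
            "DnsName": lustre["DnsName"],
            "MountName": lustre["MountName"],
            "MountPath": lustre.get("MountPath", "/fsx"),    # default /fsx
        }})
    if openzfs:
        configs.append({"FsxOpenZfsConfig": {
            "DnsName": openzfs["DnsName"],
            "MountPath": openzfs.get("MountPath", "/home"),  # default /home
        }})
    if ebs_size_gb:
        configs.append({"EbsVolumeConfig": {"VolumeSizeInGB": ebs_size_gb}})
    return configs

configs = build_storage_configs(
    lustre={"DnsName": "fs-0abc123def456789.fsx.us-west-2.amazonaws.com",
            "MountName": "abcdefgh"},
    openzfs={"DnsName": "fs-0xyz987654321.fsx.us-west-2.amazonaws.com",
             "MountPath": "/shared"},
)
print(configs[0]["FsxLustreConfig"]["MountPath"])  # /fsx (default applied)
```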

**Important**  
FSx configuration changes only apply during node provisioning. Existing nodes retain their original FSx configuration. To apply new FSx configuration to all nodes, scale down the instance group to 0, then scale back up.

#### Complete API-driven configuration example
<a name="sagemaker-hyperpod-ref-slurm-complete-example"></a>

The following example shows a complete CreateCluster request using API-driven Slurm configuration:

```
{
    "ClusterName": "ml-training-cluster",
    "InstanceGroups": [
        {
            "InstanceGroupName": "controller",
            "InstanceType": "ml.c5.xlarge",
            "InstanceCount": 1,
            "SlurmConfig": {
                "NodeType": "Controller"
            },
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://sagemaker-us-west-2-111122223333/lifecycle/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodRole",
            "ThreadsPerCore": 2
        },
        {
            "InstanceGroupName": "login",
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "SlurmConfig": {
                "NodeType": "Login"
            },
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://sagemaker-us-west-2-111122223333/lifecycle/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodRole",
            "ThreadsPerCore": 2
        },
        {
            "InstanceGroupName": "gpu-compute",
            "InstanceType": "ml.p4d.24xlarge",
            "InstanceCount": 8,
            "SlurmConfig": {
                "NodeType": "Compute",
                "PartitionNames": ["gpu-training"]
            },
            "InstanceStorageConfigs": [
                {
                    "FsxLustreConfig": {
                        "DnsName": "fs-0abc123def456789.fsx.us-west-2.amazonaws.com",
                        "MountPath": "/fsx",
                        "MountName": "abcdefgh"
                    }
                }
            ],
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://sagemaker-us-west-2-111122223333/lifecycle/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodRole",
            "ThreadsPerCore": 2,
            "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"]
        },
        {
            "InstanceGroupName": "cpu-compute",
            "InstanceType": "ml.c5.18xlarge",
            "InstanceCount": 4,
            "SlurmConfig": {
                "NodeType": "Compute",
                "PartitionNames": ["cpu-preprocessing"]
            },
            "InstanceStorageConfigs": [
                {
                    "FsxLustreConfig": {
                        "DnsName": "fs-0abc123def456789.fsx.us-west-2.amazonaws.com",
                        "MountPath": "/fsx",
                        "MountName": "abcdefgh"
                    }
                }
            ],
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://sagemaker-us-west-2-111122223333/lifecycle/src/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodRole",
            "ThreadsPerCore": 2
        }
    ],
    "Orchestrator": {
        "Slurm": {
            "SlurmConfigStrategy": "Managed"
        }
    },
    "VpcConfig": {
        "SecurityGroupIds": ["sg-0abc123def456789a"],
        "Subnets": ["subnet-0abc123def456789a", "subnet-0abc123def456789b"]
    },
    "Tags": [
        {
            "Key": "Project",
            "Value": "ML-Training"
        }
    ]
}
```

To learn more about using API-driven configuration, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

### Legacy configuration: provisioning\_parameters.json
<a name="sagemaker-hyperpod-ref-provisioning-forms"></a>

**Note**  
The `provisioning_parameters.json` approach is the legacy method for configuring Slurm on HyperPod. For new clusters, we recommend using the API-driven configuration approach described above. The legacy approach remains fully supported for backward compatibility.

With the legacy approach, you create a Slurm configuration file named `provisioning_parameters.json` and upload it to Amazon S3 as part of your lifecycle scripts. HyperPod reads this file during cluster creation to configure Slurm nodes.

#### Configuration form for provisioning\_parameters.json
<a name="sagemaker-hyperpod-ref-provisioning-forms-slurm"></a>

The following code is the Slurm configuration form you should prepare to properly set up Slurm nodes on your HyperPod cluster. Complete this form and upload it as part of your lifecycle scripts during cluster creation. To learn how to prepare this form during the HyperPod cluster creation process, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

```
// Save as provisioning_parameters.json.
{
    "version": "1.0.0",
    "workload_manager": "slurm",
    "controller_group": "string",
    "login_group": "string",
    "worker_groups": [
        {
            "instance_group_name": "string",
            "partition_name": "string"
        }
    ],
    "fsx_dns_name": "string",
    "fsx_mountname": "string"
}
```

**Parameters:**
+ `version` – Required. The version of the HyperPod provisioning parameter form. Keep it set to `1.0.0`.
+ `workload_manager` – Required. Specifies which workload manager to configure on the HyperPod cluster. Keep it set to `slurm`.
+ `controller_group` – Required. The name of the HyperPod cluster instance group you want to assign to the Slurm controller (head) node.
+ `login_group` – Optional. The name of the HyperPod cluster instance group you want to assign to the Slurm login node.
+ `worker_groups` – Required. Sets up Slurm worker (compute) nodes on the HyperPod cluster.
  + `instance_group_name` – Required. The name of the HyperPod instance group you want to assign to a Slurm worker (compute) node.
  + `partition_name` – Required. The partition name to assign to the node.
+ `fsx_dns_name` – Optional. To set up your Slurm nodes on the HyperPod cluster to communicate with Amazon FSx, specify the FSx DNS name.
+ `fsx_mountname` – Optional. To set up your Slurm nodes on the HyperPod cluster to communicate with Amazon FSx, specify the FSx mount name.
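If you are migrating from the legacy file to API-driven configuration, the field mapping is mechanical. The following sketch converts a `provisioning_parameters.json` structure into per-group `SlurmConfig` entries; it is an illustrative aid based on the parameter descriptions above, not a supported migration tool.

```python
# Sketch: map a legacy provisioning_parameters.json structure onto the
# API-driven SlurmConfig fields. Illustrative only; not a supported tool.

def legacy_to_api_driven(params):
    """Return {instance_group_name: SlurmConfig} from legacy parameters."""
    slurm_configs = {params["controller_group"]: {"NodeType": "Controller"}}
    if params.get("login_group"):
        slurm_configs[params["login_group"]] = {"NodeType": "Login"}
    for worker in params["worker_groups"]:
        slurm_configs[worker["instance_group_name"]] = {
            "NodeType": "Compute",
            "PartitionNames": [worker["partition_name"]],
        }
    return slurm_configs

legacy = {
    "version": "1.0.0",
    "workload_manager": "slurm",
    "controller_group": "controller",
    "worker_groups": [
        {"instance_group_name": "gpu-compute",
         "partition_name": "gpu-training"},
    ],
}
print(legacy_to_api_driven(legacy)["gpu-compute"]["PartitionNames"])
```

The cluster-wide `fsx_dns_name` and `fsx_mountname` fields map to a per-group `FsxLustreConfig` under `InstanceStorageConfigs` in the API-driven approach.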

### Comparison: API-driven vs. legacy configuration
<a name="sagemaker-hyperpod-ref-slurm-comparison"></a>


|  |  |  | 
| --- |--- |--- |
| Feature | API-driven (Recommended) | Legacy (provisioning\_parameters.json) | 
| Configuration location | CreateCluster API request | S3 file | 
| FSx for Lustre | Yes – Per instance group | Yes – Cluster-wide only | 
| FSx for OpenZFS | Yes – Per instance group | No – Not supported | 
| Built-in validation | Yes | No | 
| Drift detection | Yes – (Managed strategy) | No | 
| S3 file management | Not required | Required | 
| Lifecycle script complexity | Simplified | Full SLURM setup required | 

## SageMaker HyperPod DLAMI
<a name="sagemaker-hyperpod-ref-hyperpod-ami"></a>

SageMaker HyperPod runs a DLAMI based on:
+ [AWS Deep Learning Base GPU AMI (Ubuntu 20.04)](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-20-04/) for orchestration with Slurm.
+ Amazon Linux 2 based AMI for orchestration with Amazon EKS.

The SageMaker HyperPod DLAMI is bundled with additional packages that support open source tools such as Slurm and Kubernetes, their dependencies, and SageMaker HyperPod cluster software packages that support resiliency features such as cluster health checks and auto-resume. To keep up with the HyperPod software updates that the HyperPod service team distributes through DLAMIs, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

## SageMaker HyperPod API permissions reference
<a name="sagemaker-hyperpod-ref-api-permissions"></a>

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

When you set up access control to allow running SageMaker HyperPod API operations and write a permissions policy that cloud administrators can attach to IAM identities, use the following table as a reference.


|  |  |  | 
| --- |--- |--- |
| Amazon SageMaker API Operations | Required Permissions (API Actions) | Resources | 
| CreateCluster | sagemaker:CreateCluster | arn:aws:sagemaker:region:account-id:cluster/cluster-id | 
| DeleteCluster | sagemaker:DeleteCluster | arn:aws:sagemaker:region:account-id:cluster/cluster-id | 
| DescribeCluster | sagemaker:DescribeCluster | arn:aws:sagemaker:region:account-id:cluster/cluster-id | 
| DescribeClusterNode | sagemaker:DescribeClusterNode | arn:aws:sagemaker:region:account-id:cluster/cluster-id | 
| ListClusterNodes | sagemaker:ListClusterNodes | arn:aws:sagemaker:region:account-id:cluster/cluster-id | 
| ListClusters | sagemaker:ListClusters | arn:aws:sagemaker:region:account-id:cluster/cluster-id | 
| UpdateCluster | sagemaker:UpdateCluster | arn:aws:sagemaker:region:account-id:cluster/cluster-id | 
| UpdateClusterSoftware | sagemaker:UpdateClusterSoftware | arn:aws:sagemaker:region:account-id:cluster/cluster-id | 

For a complete list of permissions and resource types for SageMaker APIs, see [Actions, resources, and condition keys for Amazon SageMaker AI](https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazonsagemaker.html) in the *AWS Service Authorization Reference*.
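As an example of applying the table, the following sketch assembles a minimal identity-based policy that allows only the read-only cluster actions on a single cluster. The Region, account ID, and cluster ID are placeholders.

```python
import json

# Sketch: a minimal identity-based policy allowing only the read-only
# HyperPod cluster actions from the table above on a single cluster.
# The Region, account ID, and cluster ID are placeholders.

cluster_arn = "arn:aws:sagemaker:us-west-2:111122223333:cluster/cluster-id"
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
            ],
            "Resource": cluster_arn,
        }
    ],
}
print(json.dumps(policy, indent=4))
```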

## SageMaker HyperPod commands in AWS CLI
<a name="sagemaker-hyperpod-ref-cli"></a>

The following are the AWS CLI commands for SageMaker HyperPod to run the core [HyperPod API operations](#sagemaker-hyperpod-ref-api).
+ [batch-delete-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/batch-delete-cluster-nodes.html)
+ [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-cluster.html)
+ [delete-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-cluster.html)
+ [describe-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster.html)
+ [describe-cluster-node](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster-node.html)
+ [list-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-cluster-nodes.html)
+ [list-clusters](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-clusters.html)
+ [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html)
+ [update-cluster-software](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster-software.html)

## SageMaker HyperPod Python modules in AWS SDK for Python (Boto3)
<a name="sagemaker-hyperpod-ref-boto3"></a>

The following are the methods of the AWS SDK for Python (Boto3) client for SageMaker AI to run the core [HyperPod API operations](#sagemaker-hyperpod-ref-api).
+ [batch\_delete\_cluster\_nodes](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/batch_delete_cluster_nodes.html)
+ [create\_cluster](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_cluster.html)
+ [delete\_cluster](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/delete_cluster.html)
+ [describe\_cluster](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/describe_cluster.html)
+ [describe\_cluster\_node](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/describe_cluster_node.html)
+ [list\_cluster\_nodes](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/list_cluster_nodes.html)
+ [list\_clusters](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/list_clusters.html)
+ [update\_cluster](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/update_cluster.html)
+ [update\_cluster\_software](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/update_cluster_software.html)

# Amazon SageMaker HyperPod release notes
<a name="sagemaker-hyperpod-release-notes"></a>

This topic covers release notes that track updates, fixes, and new features for Amazon SageMaker HyperPod, including general feature releases, updates, and improvements.

The HyperPod AMI releases are documented separately and include information about key components, such as general AMI releases, versions, and dependencies. For information related to HyperPod AMI releases, see [Amazon SageMaker HyperPod AMI](sagemaker-hyperpod-release-ami.md).

## SageMaker HyperPod release notes: April 16, 2026
<a name="sagemaker-hyperpod-release-notes-20260416"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features**
+ **Flexible instance groups** – You can now create instance groups with multiple instance types using the new `InstanceRequirements` parameter. This enables priority-based provisioning, where HyperPod attempts to provision the highest-priority instance type first and falls back to lower-priority types if capacity is unavailable. Flexible instance groups simplify Karpenter auto-scaling configurations by reducing the number of instance groups needed. You can specify up to 20 instance types per instance group. For more information, see [Flexible instance groups](sagemaker-hyperpod-scaling-eks.md#sagemaker-hyperpod-scaling-eks-flexible-ig).

## SageMaker HyperPod release notes: January 25, 2026
<a name="sagemaker-hyperpod-release-notes-20260125"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features**
+ Released the new SageMaker HyperPod AMI for Amazon EKS 1.34. For more information, see [SageMaker Hyperpod AMI releases for Amazon EKS: January 25, 2026](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20260125).

For more information, see [Kubernetes v1.34](https://kubernetes.io/blog/2025/08/27/kubernetes-v1-34-release/).

## SageMaker HyperPod release notes: November 07, 2025
<a name="sagemaker-hyperpod-release-notes-20251107"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features**
+ Applied security patches. For more information, see [SageMaker HyperPod AMI releases for Amazon EKS: November 07, 2025](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20251107).

## SageMaker HyperPod release notes: September 29, 2025
<a name="sagemaker-hyperpod-release-notes-20250929"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features**
+ Released the new SageMaker HyperPod AMI for Amazon EKS 1.33. For more information, see [SageMaker HyperPod AMI releases for Amazon EKS: September 29, 2025](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20250929).
**Important**  
The Dynamic Resource Allocation beta Kubernetes API is enabled by default in this release.  
This API improves scheduling and monitoring workloads that require resources such as GPUs.
This API was developed by the open source Kubernetes community and might change in future versions of Kubernetes. Before you use the API, review the [Kubernetes documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) and understand how it affects your workloads.
HyperPod is not releasing a HyperPod Amazon Linux 2 AMI for Kubernetes 1.33. AWS recommends that you migrate to AL2023. For more information, see [Upgrade from Amazon Linux 2 to AL2023](https://docs.aws.amazon.com/eks/latest/userguide/al2023.html).

For more information, see [Kubernetes v1.33](https://kubernetes.io/blog/2025/04/23/kubernetes-v1-33-release/).

## SageMaker HyperPod release notes: August 4, 2025
<a name="sagemaker-hyperpod-release-notes-20250804"></a>

SageMaker HyperPod releases new public AMIs for EKS orchestration. Public AMIs can be used by themselves, or they can be used to create custom AMIs. For more information about the public AMIs, see [Public AMI releases](sagemaker-hyperpod-release-public-ami.md). For more information about creating a custom AMI, see [Custom Amazon Machine Images (AMIs) for SageMaker HyperPod clusters](hyperpod-custom-ami-support.md). 

## SageMaker HyperPod release notes: July 31, 2025
<a name="sagemaker-hyperpod-release-notes-20250731"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features and improvements**
+ Released a new AMI that updates the operating system from Amazon Linux 2 to Amazon Linux 2023 for EKS clusters. Key upgrades include Linux Kernel 6.1, Python 3.10, NVIDIA Driver 560.35.03, and DNF package manager replacing YUM.
**Important**  
The update from Amazon Linux 2 to AL2023 introduces significant changes that might affect compatibility with software and configurations designed for AL2. We strongly recommend testing your applications with AL2023 before fully upgrading your clusters.

  For more information about the new AMI and how to upgrade your clusters, see [SageMaker HyperPod AMI releases for Amazon EKS: July 31, 2025](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20250731).

## SageMaker HyperPod release notes: May 13, 2025
<a name="sagemaker-hyperpod-release-notes-20250513"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features and improvements**
+ Released an updated AMI that supports Ubuntu 22.04 LTS for Slurm clusters. This release includes several system and software component upgrades to provide improved performance, updated features, and enhanced security.
**Important**  
The update from Ubuntu 20.04 LTS to Ubuntu 22.04 LTS introduces changes that might affect compatibility with software and configurations designed for Ubuntu 20.04.

  For more information, see:
  + [Key updates in the Ubuntu 22.04 AMI](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-ami-slurm-ubuntu22-updates)
  + [Upgrading to the Ubuntu 22.04 AMI](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-ami-slurm-ubuntu22-upgrade)
  + [Troubleshooting upgrade failures](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-ami-slurm-ubuntu22-troubleshoot)

## SageMaker HyperPod release notes: May 1, 2025
<a name="sagemaker-hyperpod-release-notes-20250501"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features**
+ Added usage reporting for EKS-orchestrated clusters, allowing organizations to implement transparent, usage-based cost allocation across teams, projects, or departments. This feature complements HyperPod’s [Task Governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md) functionality to ensure fair cost distribution in shared multi-tenant AI/ML environments. For more information, see [Reporting Compute Usage in HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-usage-reporting.html).

## SageMaker HyperPod release notes: April 28, 2025
<a name="sagemaker-hyperpod-release-notes-20250428"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md) and [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features and improvements**
+ Upgraded NVIDIA driver from version 550.144.03 to 550.163.01. This upgrade is to address Common Vulnerabilities and Exposures (CVEs) present in the [NVIDIA GPU Display Security Bulletin for April 2025](https://nvidia.custhelp.com/app/answers/detail/a_id/5630).

For information about related AMI releases, see [SageMaker HyperPod AMI releases for Slurm: April 28, 2025](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20250428) and [SageMaker HyperPod AMI releases for Amazon EKS: April 28, 2025](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20250428).

## SageMaker HyperPod release notes: April 18, 2025
<a name="sagemaker-hyperpod-release-notes-20250418"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features**
+ Released new SageMaker HyperPod AMI for Amazon EKS 1.32.1. For more information, see [SageMaker HyperPod AMI releases for Amazon EKS: April 18, 2025](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20250418).

## SageMaker HyperPod release notes: April 10, 2025
<a name="sagemaker-hyperpod-release-notes-20250410"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features and improvements**
+ Added a Direct Preference Optimization (DPO) recipe tutorial for SageMaker HyperPod with Slurm orchestration. This fine-tuning tutorial provides step-by-step guidance for optimizing model alignment using the DPO method on GPU-powered SageMaker HyperPod Slurm clusters. For more information, see [HyperPod Slurm cluster DPO tutorial (GPU)](hyperpod-gpu-slurm-dpo-tutorial.md).

## SageMaker HyperPod release notes: April 03, 2025
<a name="sagemaker-hyperpod-release-notes-20250403"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md) and [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features and improvements**
+ Added a [Quickstart](sagemaker-hyperpod-quickstart.md) page for deploying SageMaker HyperPod clusters. The page uses streamlined setup workflows from SageMaker HyperPod’s specialized workshops and automates deployment using prebuilt AWS CloudFormation templates. It supports infrastructure preferences such as Slurm or Amazon EKS for easy configuration and deployment of baseline clusters.
+ SageMaker HyperPod now supports the following instance types for both Slurm and Amazon EKS clusters.
  + New instance types: I3en, M7i, and R7i instances. For the full list of supported instances, see the `InstanceType` field in [ClusterInstanceGroupDetails](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceGroupDetails.html).

## SageMaker HyperPod release notes: March 16, 2025
<a name="sagemaker-hyperpod-release-notes-20250316"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md) and [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features and improvements**
+ Added the following IAM condition keys for more granular access control in the [CreateCluster](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_CreateCluster.html) and [UpdateCluster](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_UpdateCluster.html) API operations.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-notes.html)

## SageMaker HyperPod release notes: February 20, 2025
<a name="sagemaker-hyperpod-release-notes-20250220"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md) and [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features and improvements**
+ Added support for deleting instance groups from your SageMaker HyperPod cluster. For more information, see [Delete instance groups](smcluster-scale-down.md#smcluster-remove-instancegroup) from EKS-orchestrated clusters and [Scale down a cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-scale-down) for Slurm-orchestrated clusters. 

## SageMaker HyperPod release notes: February 18, 2025
<a name="sagemaker-hyperpod-release-notes-20250218"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md) and [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features**
+ This release of SageMaker HyperPod incorporates a security update for the NVIDIA Container Toolkit (from version 1.17.3 to version 1.17.4). For more information, see the [v1.17.4 release note](https://github.com/NVIDIA/nvidia-container-toolkit/releases/tag/v1.17.4). 
**Note**  
For all container workloads in NVIDIA Container Toolkit version 1.17.4, mounting of CUDA compatibility libraries is now disabled. To ensure compatibility with multiple CUDA versions in container workflows, update your `LD_LIBRARY_PATH` to include your CUDA compatibility libraries. You can find the specific steps in [If you use a CUDA compatibility layer](inference-gpu-drivers.md#collapsible-cuda-compat).
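
As a minimal sketch, the adjustment might look like the following; `/usr/local/cuda/compat` is a common location for the compatibility libraries but is an assumption that depends on your container image.

```shell
# Prepend the CUDA compatibility libraries to the dynamic linker search path.
# /usr/local/cuda/compat is an assumed path; verify it inside your image.
export LD_LIBRARY_PATH="/usr/local/cuda/compat${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

The `${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}` expansion avoids a trailing colon when the variable was previously unset.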

For information about related AMI releases, see [SageMaker HyperPod AMI releases for Slurm: February 18, 2025](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20250218) and [SageMaker HyperPod AMI releases for Amazon EKS: February 18, 2025](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20250218).

## SageMaker HyperPod release notes: February 06, 2025
<a name="sagemaker-hyperpod-release-notes-20250206"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md) and [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).

**New features and improvements**
+ Enhanced SageMaker HyperPod multi-AZ support: You can specify different subnets and security groups, spanning different Availability Zones, for individual instance groups within your cluster. For more information about SageMaker HyperPod multi-AZ support, see [Setting up SageMaker HyperPod clusters across multiple AZs](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-multiple-availability-zones).

## SageMaker HyperPod release notes: January 22, 2025
<a name="sagemaker-hyperpod-release-notes-20250122"></a>

**AMI releases**
+ [SageMaker HyperPod AMI releases for Amazon EKS: January 22, 2025](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20250122)

## SageMaker HyperPod release notes: January 09, 2025
<a name="sagemaker-hyperpod-release-notes-20250109"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features and improvements**
+ Added IPv6 support: Clusters can use IPv6 addressing when configured with IPv6-enabled VPC and subnets. For more information, see [Setting up SageMaker HyperPod with a custom Amazon VPC](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-optional-vpc).

## SageMaker HyperPod release notes: December 21, 2024
<a name="sagemaker-hyperpod-release-notes-20241221"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ SageMaker HyperPod now supports the following instance types for both Slurm and Amazon EKS clusters.
  + New instance types: C6gn, C6i, M6i, R6i.
  + New Trainium instance types: Trn1 and Trn1n.

**Improvements**
+ Enhanced error logging visibility when Slurm interrupts jobs, and prevented unnecessary job step termination during Slurm-initiated job cancellations.
+ Updated base DLAMI for p5en for both Slurm and Amazon EKS clusters.

**AMI releases**
+ [SageMaker HyperPod AMI releases for Slurm: December 21, 2024](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20241221)
+ [SageMaker HyperPod AMI releases for Amazon EKS: December 21, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20241221)

## SageMaker HyperPod release notes: December 13, 2024
<a name="sagemaker-hyperpod-release-notes-20241213"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New feature**
+ SageMaker HyperPod releases a set of Amazon CloudWatch metrics to monitor the health and performance of SageMaker HyperPod Slurm clusters. These metrics cover CPU, GPU, and memory utilization, as well as cluster instance information such as node counts and failed nodes. This monitoring feature is enabled by default, and the metrics are published under the `/aws/sagemaker/Clusters` CloudWatch namespace. You can also set up CloudWatch alarms based on these metrics to proactively detect and address potential issues within your Slurm-based HyperPod clusters. For more information, see [Amazon SageMaker HyperPod Slurm metrics](smcluster-slurm-metrics.md).
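
  As a hedged sketch, an alarm on one of these metrics could be created as follows. The metric name `FaultyNodeCount`, the `ClusterName` dimension, and the SNS topic ARN are illustrative assumptions; verify the exact metric and dimension names published in the `/aws/sagemaker/Clusters` namespace before using them.

  ```shell
  # Hypothetical sketch: alarm when the faulty-node count exceeds zero.
  # Metric and dimension names are assumptions; confirm them in CloudWatch.
  aws cloudwatch put-metric-alarm \
    --alarm-name hyperpod-faulty-nodes \
    --namespace "/aws/sagemaker/Clusters" \
    --metric-name FaultyNodeCount \
    --dimensions Name=ClusterName,Value=my-cluster \
    --statistic Maximum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-west-2:111122223333:my-alerts-topic
  ```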

**AMI releases**
+ [SageMaker HyperPod AMI releases for Amazon EKS: December 13, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20241213)

## SageMaker HyperPod release notes: November 24, 2024
<a name="sagemaker-hyperpod-release-notes-20241124"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ Added support for configuring SageMaker HyperPod clusters across multiple Availability Zones. For more information about SageMaker HyperPod multi-AZ support, see [Setting up SageMaker HyperPod clusters across multiple AZs](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-multiple-availability-zones).

**AMI releases**
+ [SageMaker HyperPod AMI releases for Slurm: November 24, 2024](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20241124)
+ [SageMaker HyperPod AMI releases for Amazon EKS: November 24, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20241124)

## SageMaker HyperPod release notes: November 15, 2024
<a name="sagemaker-hyperpod-release-notes-20241115"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md). For more information, see [SageMaker HyperPod AMI releases for Amazon EKS: November 15, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20241115).

**New features and improvements**
+ Added support for Trn1 and Trn1n instance types for both Amazon EKS and Slurm orchestrated clusters.
+ Improved log management for Slurm clusters:
  +  Implemented log rotation: weekly or daily based on size.
  +  Set log retention to 3 weeks.
  +  Compressed logs to reduce storage impact.
  +  Continued uploading logs to CloudWatch for long-term retention.
**Note**  
Some logs are still stored in syslogs.
+ Adjusted Fluent Bit settings to prevent tracking issues with files containing long lines.

**Bug fixes**
+ Prevented unintended truncation of the Slurm configuration file `slurm.conf` during Slurm controller node updates.

**AMI releases**
+ [SageMaker HyperPod AMI releases for Slurm: November 15, 2024](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20241115)
+ [SageMaker HyperPod AMI releases for Amazon EKS: November 15, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20241115)

## SageMaker HyperPod release notes: November 11, 2024
<a name="sagemaker-hyperpod-release-notes-20241111"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md). 

**New feature**
+ SageMaker HyperPod AMI now supports G6e instance types.

**AMI releases**
+ [SageMaker HyperPod AMI releases for Slurm: November 11, 2024](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20241111)
+ [SageMaker HyperPod AMI releases for Amazon EKS: November 11, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20241111)

## SageMaker HyperPod release notes: October 31, 2024
<a name="sagemaker-hyperpod-release-notes-20241031"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ Added scaling down SageMaker HyperPod clusters at the instance group level and instance level for both Amazon EKS and Slurm orchestrated clusters. For more information about scaling down Amazon EKS clusters, see [Scaling down a SageMaker HyperPod cluster](smcluster-scale-down.md). For more information about scaling down Slurm clusters, see *Scale down a cluster* in [Managing SageMaker HyperPod Slurm clusters using the AWS CLI](sagemaker-hyperpod-operate-slurm-cli-command.md).
+ SageMaker HyperPod now supports the P5e instance type for both Amazon EKS and Slurm orchestrated clusters. 

## SageMaker HyperPod release notes: October 21, 2024
<a name="sagemaker-hyperpod-release-notes-20241021"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New feature**
+ SageMaker HyperPod now supports the P5e[n], G6, Gr6, and Trn2[n] instance types for both Slurm and Amazon EKS clusters.

**AMI releases**
+ [SageMaker HyperPod AMI releases for Slurm: October 21, 2024](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20241021)
+ [SageMaker HyperPod AMI releases for Amazon EKS: October 21, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20241021)

## SageMaker HyperPod release notes: September 10, 2024
<a name="sagemaker-hyperpod-release-notes-20240910"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md) and [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ Added Amazon EKS support in SageMaker HyperPod. To learn more, see [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md).
+ Added support for managing SageMaker HyperPod clusters through CloudFormation and Terraform. For more information about managing HyperPod clusters through CloudFormation, see [CloudFormation documentation for `AWS::SageMaker::Cluster`](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-sagemaker-cluster.html). To learn about managing HyperPod clusters through Terraform, see [Terraform documentation for `awscc_sagemaker_cluster`](https://registry.terraform.io/providers/hashicorp/awscc/latest/docs/data-sources/sagemaker_cluster).

**AMI releases**
+ [SageMaker HyperPod AMI releases for Slurm: September 10, 2024](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20240910)
+ [SageMaker HyperPod AMI releases for Amazon EKS: September 10, 2024](sagemaker-hyperpod-release-ami-eks.md#sagemaker-hyperpod-release-ami-eks-20240910)

## SageMaker HyperPod release notes: August 20, 2024
<a name="sagemaker-hyperpod-release-notes-20240820"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ Enhanced the [SageMaker HyperPod auto-resume functionality](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm.html#sagemaker-hyperpod-resiliency-slurm-auto-resume), extending the resiliency capability for Slurm nodes attached with Generic RESources (GRES).

  When [Generic Resources (GRES)](https://slurm.schedmd.com/gres.html) are attached to a Slurm node, Slurm typically doesn't permit changes in the node allocation, such as replacing nodes, and thus doesn't allow a failed job to resume. Unless explicitly forbidden, the HyperPod auto-resume functionality automatically requeues any faulty job associated with GRES-enabled nodes. This process involves stopping the job, placing it back into the job queue, and then restarting the job from the beginning.

**Other changes**
+ Pre-packaged [`slurmrestd`](https://slurm.schedmd.com/slurmrestd.html) in the SageMaker HyperPod AMI.
+ Changed the default values for `ResumeTimeout` and `UnkillableStepTimeout` from 60 seconds to 300 seconds in `slurm.conf` to improve system responsiveness and job handling.
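+ In `slurm.conf` terms, the new defaults described above correspond to the following settings (shown here only to illustrate the change; both parameters are standard Slurm options):

  ```
  # Defaults in this release (previously 60 seconds)
  ResumeTimeout=300
  UnkillableStepTimeout=300
  ```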
+ Made minor improvements to health checks for the NVIDIA Data Center GPU Manager (DCGM) and the NVIDIA System Management Interface (nvidia-smi).

**Bug fixes**
+ Fixed an issue so that the HyperPod auto-resume plug-in can use idle nodes to resume a job.

## SageMaker HyperPod release notes: June 20, 2024
<a name="sagemaker-hyperpod-release-notes-20240620"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ Added a new capability of attaching additional storage to SageMaker HyperPod cluster instances. With this capability, you can configure supplementary storage at the instance group configuration level during the cluster creation or update processes, either through the SageMaker HyperPod console or the [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) and [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) APIs. The additional EBS volume is attached to each instance within a SageMaker HyperPod cluster and mounted to `/opt/sagemaker`. To learn more about implementing it in your SageMaker HyperPod cluster, see the updated documentation on the following pages.
  + [Getting started with SageMaker HyperPod](smcluster-getting-started-slurm.md)
  + [SageMaker HyperPod Slurm cluster operations](sagemaker-hyperpod-operate-slurm.md)

  Note that you must update the HyperPod cluster software to use this capability. After patching the cluster software, you can use this capability on existing SageMaker HyperPod clusters created before June 20, 2024 by adding new instance groups. The capability is fully effective for any SageMaker HyperPod clusters created after June 20, 2024.

**Upgrade steps**
+ Run the following command to call the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API to update your existing HyperPod clusters with the latest HyperPod DLAMI. To find more instructions, see [Update the SageMaker HyperPod platform software of a cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software). 
**Important**  
Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see [Use the backup script provided by SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup).

  ```
   aws sagemaker update-cluster-software --cluster-name your-cluster-name
  ```
**Note**  
You must run the AWS CLI command to update your HyperPod cluster; updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.
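
After running the update, you can check the cluster's status with the `describe-cluster` AWS CLI operation, for example to confirm that it returns to service. The cluster name below is a placeholder; `--query` simply filters the JSON response down to the status field.

```shell
# Check the overall cluster status after the software update.
aws sagemaker describe-cluster \
  --cluster-name your-cluster-name \
  --query 'ClusterStatus'
```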

## SageMaker HyperPod release notes: April 24, 2024
<a name="sagemaker-hyperpod-release-notes-20240424"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**Bug fixes**
+ Fixed a bug with the `ThreadsPerCore` parameter in the [ClusterInstanceGroupSpecification](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceGroupSpecification.html) API. With the fix, the [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) and [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) APIs properly take and apply the user input through `ThreadsPerCore`. This fix is effective on HyperPod clusters created after April 24, 2024. If you had issues with this bug and want to get this fix applied to your cluster, you need to create a new cluster. Make sure that you back up and restore your work while moving to a new cluster following the instructions at [Use the backup script provided by SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup).

## SageMaker HyperPod release notes: March 27, 2024
<a name="sagemaker-hyperpod-release-notes-20240327"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**HyperPod software patch**

The HyperPod service team distributes software patches through [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami). See the following details about the latest HyperPod DLAMI.
+ In this release of the HyperPod DLAMI, Slurm is built with the REST service (`slurmrestd`) and JSON, YAML, and JWT support.
+ Upgraded [Slurm](https://slurm.schedmd.com/documentation.html) to v23.11.3.

**Improvements**
+ Increased auto-resume service timeout to 60 minutes.
+ Improved instance replacement process to not restart the Slurm controller.
+ Improved error messages from running lifecycle scripts, such as download errors and instance health check errors on instance start-up.

**Bug fixes**
+ Fixed a bug with chrony service that caused an issue with time synchronization.
+ Fixed a bug with parsing `slurm.conf`.
+ Fixed an issue with [NVIDIA `go-dcgm`](https://github.com/NVIDIA/go-dcgm) library.

## SageMaker HyperPod release notes: March 14, 2024
<a name="sagemaker-hyperpod-release-notes-20240314"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**Improvements**
+ HyperPod now properly supports passing partition names provided through `provisioning_parameters.json` and creates partitions appropriately based on the provided inputs. For more information about `provisioning_parameters.json`, see [Legacy configuration: provisioning\_parameters.json](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-provisioning-forms) and [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

**AMI releases**
+ [SageMaker HyperPod AMI releases for Slurm: March 14, 2024](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20240314)

## SageMaker HyperPod release notes: February 15, 2024
<a name="sagemaker-hyperpod-release-notes-20240215"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ Added a new `UpdateClusterSoftware` API for SageMaker HyperPod security patching. When security patches become available, we recommend that you update existing SageMaker HyperPod clusters in your account by running `aws sagemaker update-cluster-software --cluster-name your-cluster-name`. To stay current with future security patches, continue to monitor this Amazon SageMaker HyperPod release notes page. To learn how the `UpdateClusterSoftware` API works, see [Update the SageMaker HyperPod platform software of a cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software).

## SageMaker HyperPod release notes: November 29, 2023
<a name="sagemaker-hyperpod-release-notes-20231129"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features**
+ Launched Amazon SageMaker HyperPod at AWS re:Invent 2023.

**AMI releases**
+ [SageMaker HyperPod AMI release for Slurm: November 29, 2023](sagemaker-hyperpod-release-ami-slurm.md#sagemaker-hyperpod-release-ami-slurm-20231129)

# Amazon SageMaker HyperPod AMI
<a name="sagemaker-hyperpod-release-ami"></a>

Amazon SageMaker HyperPod Amazon Machine Images (AMIs) are specialized machine images for distributed machine learning workloads and high-performance computing. These AMIs enhance base images with essential components including GPU drivers and AWS Neuron accelerator support.

Key components added to HyperPod AMIs include:
+ [Public AMIs](sagemaker-hyperpod-release-public-ami.md) with support for [building custom AMIs](hyperpod-custom-ami-support.md)
+ Advanced orchestration tools:
  + [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md)
  + [Orchestrating SageMaker HyperPod clusters with Amazon EKS](sagemaker-hyperpod-eks.md)
+ Cluster management dependencies
+ Built-in resiliency features:
  + cluster health check
  + auto-resume capabilities
+ Support for HyperPod cluster management and configuration

These enhancements are built upon the following base Deep Learning AMIs (DLAMIs):
+ [AWS Deep Learning Base GPU AMI (Ubuntu 20.04)](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-20-04/) for orchestration with Slurm.
+ Amazon Linux 2 or Amazon Linux 2023 based AMI for orchestration with Amazon EKS.

Choose your HyperPod AMIs based on your orchestration preference:
+ For Slurm orchestration, see [SageMaker HyperPod AMI releases for Slurm](sagemaker-hyperpod-release-ami-slurm.md).
+ For Amazon EKS orchestration, see [SageMaker HyperPod AMI releases for Amazon EKS](sagemaker-hyperpod-release-ami-eks.md).

For information about Amazon SageMaker HyperPod feature releases, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

# Update your AMI version in your SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-release-ami-update"></a>

Amazon SageMaker HyperPod Amazon Machine Images (AMIs) are specialized machine images for distributed machine learning workloads and high-performance computing. Each AMI comes pre-loaded with drivers, machine learning frameworks, training libraries, and performance monitoring tools. By updating the AMI version in your cluster, you can use the latest versions of these components and packages for your training jobs and workflows.

When updating the AMI version within your cluster, you can process the update immediately, schedule a one-time update, or use a cron expression to create a recurring schedule. You can also choose to update all of the instances in an instance group or update in batches. If you choose batches, you set the percentage or number of instances that SageMaker AI should upgrade at a time, and you set an interval for how long SageMaker AI should wait between batches.

If you choose to update in batches, you can also include a list of alarms and metrics. During the wait interval, SageMaker AI observes these metrics; if any metric exceeds its threshold, the corresponding alarm goes into the `ALARM` state and SageMaker AI rolls back the AMI update. To use automatic rollbacks, your IAM execution role must have the `cloudwatch:DescribeAlarms` permission.

**Note**  
Updating your cluster in batches is available only for HyperPod clusters integrated with Amazon EKS. Also, if you’re creating multiple schedules, we recommend that you have a time buffer in between schedules. If schedules overlap, updates might fail.
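The batch size, wait interval, and alarm-based rollback options described above map to the deployment configuration accepted by the `UpdateClusterSoftware` API. The following JSON is a sketch only: the alarm name is a placeholder, and you should confirm the exact request shape against the `UpdateClusterSoftware` API reference. As written, it requests 20% batches, a 5-minute wait between batches, and rollback if the alarm fires:

```
{
  "AutoRollbackConfiguration": [
    { "AlarmName": "my-hyperpod-update-alarm" }
  ],
  "RollingUpdatePolicy": {
    "MaximumBatchSize": { "Type": "CAPACITY_PERCENTAGE", "Value": 20 }
  },
  "WaitIntervalInSeconds": 300
}
```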

For more information about each AMI release for your HyperPod cluster, see [Amazon SageMaker HyperPod AMI](sagemaker-hyperpod-release-ami.md). For more information about general HyperPod releases, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

You can use the SageMaker AI API or CLI operations to update your cluster or see scheduled updates for a specific cluster. If you're using the AWS console, follow these steps:

**Note**  
Updating your AMI with the AWS console is available only for clusters integrated with Amazon EKS. If you have a Slurm cluster, you must use the SageMaker AI API or CLI operations.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the left, expand **HyperPod Clusters**, and choose **Cluster Management**.

1. Choose the cluster that you want to update, then choose **Details**, and **Update AMI**.



To create and manage update schedules programmatically, use the following API operations:
+ [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) – Create a cluster while specifying an update schedule.
+ [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) – Update a cluster to add an update schedule.
+ [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) – Update the platform software of a cluster.
+ [DescribeCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html) – See an update schedule that you created for a cluster.
+ [DescribeClusterNode](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeClusterNode.html) and [ListClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusterNodes.html) – See when the cluster was last updated.

## Required permissions
<a name="sagemaker-hyperpod-release-ami-update-permissions"></a>

Depending on how you configured your [Pod Disruption Budget](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) in your Amazon EKS cluster, HyperPod evicts pods, releases nodes, and prevents any update scheduling during the AMI update process. If any constraints within the budget are violated, HyperPod skips that node during the AMI update. For SageMaker HyperPod to correctly evict pods, you must add the necessary permissions to the HyperPod service-linked role. The following YAML file contains the necessary permissions.

```
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: hyperpod-patching
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hyperpod-patching
subjects:
- kind: User
  name: hyperpod-service-linked-role
roleRef:
  kind: ClusterRole
  name: hyperpod-patching
  apiGroup: rbac.authorization.k8s.io
```

Use the following commands to apply the permissions.

```
git clone https://github.com/aws/sagemaker-hyperpod-cli.git 

cd sagemaker-hyperpod-cli/helm_chart

helm upgrade hyperpod-dependencies HyperPodHelmChart --namespace kube-system --install
```

## Cron expressions
<a name="sagemaker-hyperpod-release-ami-update-cron"></a>

To configure a one-time update at a certain time or a recurring schedule, use cron expressions. Cron expressions consist of six fields separated by white space. All six fields are required.

```
cron(Minutes Hours Day-of-month Month Day-of-week Year)
```


| **Fields** | **Values** | **Wildcards** | 
| --- | --- | --- | 
|  Minutes  |  00 – 59  |  N/A  | 
|  Hours  |  00 – 23  |  N/A  | 
|  Day-of-month  |  01 – 31  | ? | 
|  Month  |  01 – 12  | \* / | 
|  Day-of-week  |  1 – 7 or MON-SUN  | ? # L | 
|  Year  |  Current year – 2099  | \* | 

**Wildcards**
+ The **\*** (asterisk) wildcard includes all values in the field. In the `Hours` field, **\*** would include every hour.
+ The **/** (forward slash) wildcard specifies increments. In the `Month` field, you could enter **\*/3** to specify every 3rd month.
+ The **?** (question mark) wildcard specifies one or another. In the `Day-of-month` field you could enter **7**, and if you didn't care what day of the week the seventh was, you could enter **?** in the Day-of-week field.
+ The **L** wildcard in the `Day-of-week` field specifies the last instance of the specified day of the week within a month. For example, `5L` means the last Friday of the month.
+ The **#** (hash) wildcard in the `Day-of-week` field specifies a certain instance of the specified day of the week within a month. For example, `3#2` would be the second Tuesday of the month: the 3 refers to Tuesday because it is the third day of each week, and the 2 refers to the second day of that type within the month.

You can use cron expressions for the following scenarios:
+ One-time schedule that runs at a certain time and day. You can use the `?` wildcard to denote that day-of-month or day-of-week don't matter.

  ```
  cron(30 14 ? 12 MON 2024)
  ```

  ```
  cron(30 14 15 12 ? 2024)
  ```
+ A weekly schedule that runs at a certain time and day. The following example creates a schedule that runs at 12:00pm on every Monday regardless of day-of-month.

  ```
  cron(00 12 ? * 1 *)
  ```
+ Monthly schedule that runs every month regardless of the day-of-week. The following schedule runs at 12:30pm on the 15th of every month.

  ```
  cron(30 12 15 * ? *)
  ```
+ A monthly schedule that uses day-of-week.

  ```
  cron(30 12 ? * MON *)
  ```
+ To create a schedule that runs every Nth month, use the `/` wildcard. The following two examples create monthly schedules that run every 3 months, one using day-of-month and one using day-of-week.

  ```
  cron(30 12 15 */3 ? *)
  ```

  ```
  cron(30 12 ? */3 MON *)
  ```
+ A schedule that runs on a certain instance of the specified day of the week. The following example creates a schedule that runs at 12:30pm on the second Monday of every month.

  ```
  cron(30 12 ? * 1#2 *)
  ```
+ A schedule that runs on the last instance of the specified day of the week. The following schedule runs at 12:30pm on the last Monday of every month.

  ```
  cron(30 12 ? * 1L *)
  ```
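Malformed expressions in the scenarios above are easiest to catch before you schedule an update. The following shell function is a minimal sketch, not a full cron parser: it verifies only the `cron( ... )` wrapper and that exactly six fields are present, and the function name is illustrative.

```shell
is_valid_hyperpod_cron() {
  # The expression must be wrapped as cron( ... ).
  case "$1" in
    "cron("*")") ;;
    *) return 1 ;;
  esac
  body=${1#"cron("}
  body=${body%")"}
  set -f            # keep * and ? literal while word-splitting the fields
  # shellcheck disable=SC2086
  set -- $body
  set +f
  [ "$#" -eq 6 ]    # the format requires all six fields
}

is_valid_hyperpod_cron "cron(30 12 ? * 1#2 *)" && echo "valid"    # prints "valid"
is_valid_hyperpod_cron "cron(30 12 15 *)" || echo "invalid"       # prints "invalid"
```

A check like this only guards against missing fields; per-field ranges and wildcard placement are still validated by the service when you submit the schedule.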

# SageMaker HyperPod AMI releases for Slurm
<a name="sagemaker-hyperpod-release-ami-slurm"></a>

The following release notes track the latest updates for Amazon SageMaker HyperPod AMI releases for Slurm orchestration. These HyperPod AMIs are built upon [AWS Deep Learning Base GPU AMI (Ubuntu 22.04)](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-22-04/). The HyperPod service team distributes software patches through [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami). For HyperPod AMI releases for Amazon EKS orchestration, see [SageMaker HyperPod AMI releases for Amazon EKS](sagemaker-hyperpod-release-ami-eks.md). For information about Amazon SageMaker HyperPod feature releases, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

**Note**  
To update existing HyperPod clusters with the latest DLAMI, see [Update the SageMaker HyperPod platform software of a cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software).

## SageMaker HyperPod AMI releases for Slurm: April 20, 2026
<a name="sagemaker-hyperpod-release-ami-slurm-20260420"></a>

 **AMI general updates** 
+ Released updates for SageMaker HyperPod AMI for Slurm version 24.11.
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Slurm support** 

This release includes the following updates:

------
#### [ Slurm v24.11 ]
+ Slurm 24.11 (ARM64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx28
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.2
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 1.45.1
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx28
  + nvidia-imex version: 580.126.09-1
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + git version: 2.34.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1b1344-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1
+ Slurm 24.11 (x86\_64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx28
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.2
  + aws Neuronx DKMS version: 2.26.10.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 1.45.0
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + stress version: 1.0.5
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx28
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1b1344-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1

------

## SageMaker HyperPod AMI releases for Slurm: March 01, 2026
<a name="sagemaker-hyperpod-release-ami-slurm-20260301"></a>

 **AMI general updates** 
+ Released updates for SageMaker HyperPod AMI for Slurm version 24.11.
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Slurm support** 

This release includes the following updates:

------
#### [ Slurm v24.11 ]
+ Slurm 24.11 (ARM64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx26
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 1.45.1
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx26
  + nvidia-imex version: 580.126.09-1
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + git version: 2.34.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1b1344-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1
+ Slurm 24.11 (x86\_64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx26
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + aws Neuronx DKMS version: 2.26.5.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 1.45.0
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + stress version: 1.0.5
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx26
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1b1344-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1

------

## SageMaker HyperPod AMI releases for Slurm: February 12, 2026
<a name="sagemaker-hyperpod-release-ami-slurm-20260212"></a>

 **AMI general updates** 
+ Released updates for SageMaker HyperPod AMI for Slurm version 24.11.
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Slurm support** 

This release includes the following updates:

------
#### [ Slurm v24.11 ]
+ Slurm 24.11 (ARM64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx25
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 1.45.1
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx25
  + nvidia-imex version: 580.126.09-1
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + git version: 2.34.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.0b1337-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1
+ Slurm 24.11 (x86\_64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx25
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 1.45.0
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + stress version: 1.0.5
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx25
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.0b1337-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1

------

## SageMaker HyperPod AMI releases for Slurm: January 25, 2026
<a name="sagemaker-hyperpod-release-ami-slurm-20260125"></a>

 **AMI general updates** 
+ Released updates for SageMaker HyperPod AMI for Slurm version 24.11.
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Slurm support** 

This release includes the following updates:

------
#### [ Slurm v24.11 ]
+ Slurm 24.11 (ARM64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx25
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 2.3.1amzn3.0
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx25
  + nvidia-imex version: 580.126.09-1
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + git version: 2.34.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300063.0b1323-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1
+ Slurm 24.11 (x86\_64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx25
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 2.3.1amzn2.0
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + stress version: 1.0.5
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx25
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300063.0b1323-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1

------

## SageMaker HyperPod AMI releases for Slurm: December 29, 2025
<a name="sagemaker-hyperpod-release-ami-slurm-20251229"></a>

 **AMI general updates** 
+ Released updates for SageMaker HyperPod AMI for Slurm version 24.11.
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Slurm support** 

This release includes the following updates:

------
#### [ Slurm v24.11 ]
+ Slurm 24.11 (ARM64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx25
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 2.3.1amzn3.0
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx25
  + nvidia-imex version: 580.105.08-1
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + git version: 2.34.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.0b1304-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1
+ Slurm 24.11 (x86\_64):
  + Linux Kernel version: 6.8
  + Glibc version: 2.35
  + OpenSSL version: 3.0.2
  + FSx Lustre Client version: 2.15.6-1fsx25
  + Runc version: 1.3.4
  + Containerd version: containerd containerd.io v2.2.1
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.6, 12.8, 12.9, 13.0
  + EFA Installer version: 2.3.1amzn2.0
  + Python version: 3.10.12
  + Slurm version: 24.11.0
  + nvme-cli version: 1.16
  + stress version: 1.0.5
  + collectd version: 5.12.0.
  + lustre-client version: 2.15.6-1fsx25
  + systemd version: 249
  + openssh version: 8.9
  + sudo version: 1.9.9
  + ufw version: 0.36.1
  + gcc version: 11.4.0
  + cmake version: 3.22.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.0b1304-1
  + nfs-utils version: 1:2.6.1-1ubuntu1.2
  + iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
  + lvm2 version: 2.03.11
  + ec2-instance-connect version: 1.1.14-0ubuntu1.1
  + rdma-core version: 60.0-1

------

## SageMaker HyperPod AMI releases for Slurm: November 22, 2025
<a name="sagemaker-hyperpod-release-ami-slurm-20251128"></a>

 **AMI general updates** 
+ Released updates for SageMaker HyperPod AMI for Slurm version 24.11.
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Slurm support** 

This release includes the following updates:

------
#### [ Slurm (arm64) ]
+ Linux Kernel version: 6.8
+ Glibc version: 2.35
+ OpenSSL version: 3.0.2
+ FSx Lustre Client version: 2.15.6-1fsx21
+ Runc version: 1.3.3
+ Containerd version: containerd containerd.io v2.1.5
+ NVIDIA Driver version: 580.95.05
+ CUDA version: 12.6, 12.8, 12.9, 13.0
+ EFA Installer version: 2.1.0amzn5.0
+ Python version: 3.10.12
+ Slurm version: 24.11.0
+ nvme-cli version: 1.16
+ collectd version: 5.12.0.
+ lustre-client version: 2.15.6-1fsx21
+ nvidia-imex version: 580.95.05-1
+ systemd version: 249
+ openssh version: 8.9
+ sudo version: 1.9.9
+ ufw version: 0.36.1
+ gcc version: 11.4.0
+ cmake version: 3.22.1
+ git version: 2.34.1
+ make version: 4.3
+ cloudwatch-agent version: 1.300062.0b1304-1
+ nfs-utils version: 1:2.6.1-1ubuntu1.2
+ iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
+ lvm2 version: 2.03.11
+ ec2-instance-connect version: 1.1.14-0ubuntu1.1
+ rdma-core version: 58.amzn0-1

------
#### [ Slurm (x86\_64) ]
+ Linux Kernel version: 6.8
+ Glibc version: 2.35
+ OpenSSL version: 3.0.2
+ FSx Lustre Client version: 2.15.6-1fsx21
+ Runc version: 1.3.3
+ Containerd version: containerd containerd.io v2.1.5
+ aws Neuronx DKMS version: 2.24.7.0
+ NVIDIA Driver version: 580.95.05
+ CUDA version: 12.6, 12.8, 12.9, 13.0
+ EFA Installer version: 2.3.1amzn1.0
+ Python version: 3.10.12
+ Slurm version: 24.11.0
+ nvme-cli version: 1.16
+ stress version: 1.0.5
+ collectd version: 5.12.0.
+ lustre-client version: 2.15.6-1fsx21
+ systemd version: 249
+ openssh version: 8.9
+ sudo version: 1.9.9
+ ufw version: 0.36.1
+ gcc version: 11.4.0
+ cmake version: 3.22.1
+ make version: 4.3
+ cloudwatch-agent version: 1.300062.0b1304-1
+ nfs-utils version: 1:2.6.1-1ubuntu1.2
+ iscsi-initiator-utils version: 2.1.5-1ubuntu1.1
+ lvm2 version: 2.03.11
+ ec2-instance-connect version: 1.1.14-0ubuntu1.1
+ rdma-core version: 59.amzn0-1

------

## SageMaker HyperPod release notes: November 07, 2025
<a name="sagemaker-hyperpod-release-notes-20251107"></a>

**The AMI includes the following:**
+ Supported AWS service: Amazon EC2
+ Operating System: Ubuntu 22.04
+ Compute Architecture: ARM64
+ Updated packages: NVIDIA Driver: 580.95.05
+ CUDA Versions: cuda-12.6, cuda-12.8, cuda-12.9, cuda-13.0
+ Security fixes: [ Runc Security patch](https://aws.amazon.com/security/security-bulletins/rss/aws-2025-024/)

## SageMaker HyperPod release notes: September 29, 2025
<a name="sagemaker-hyperpod-release-notes-20250929"></a>

**The AMI includes the following:**
+ Supported AWS service: Amazon EC2
+ Operating System: Ubuntu 22.04
+ Compute Architecture: ARM64
+ Updated packages: NVIDIA Driver: 570.172.08
+ Security fixes

## SageMaker HyperPod release notes: August 12, 2025
<a name="sagemaker-hyperpod-release-notes-20250812"></a>

**The AMI includes the following:**
+ Supported AWS service: Amazon EC2
+ Operating System: Ubuntu 22.04
+ Compute Architecture: ARM64
+ Latest available version is installed for the following packages:
  + Linux Kernel: 6.8
  + FSx Lustre
  + Docker
  + AWS CLI v2 at `/usr/bin/aws`
  + NVIDIA DCGM
  + Nvidia container toolkit:
    + Version command: `nvidia-container-cli -V`
  + Nvidia-docker2:
    + Version command: `nvidia-docker version`
  + Nvidia-IMEX: v570.172.08-1
+ NVIDIA Driver: 570.158.01
+ NVIDIA CUDA 12.4, 12.5, 12.6, 12.8 stack:
  + CUDA, NCCL, and cuDNN installation directories: `/usr/local/cuda-xx.x/`
    + Example: `/usr/local/cuda-12.6/`, `/usr/local/cuda-12.8/`
  + Compiled NCCL Version:
    + For CUDA directory of 12.4, compiled NCCL Version 2.22.3+CUDA12.4
    + For CUDA directory of 12.5, compiled NCCL Version 2.22.3+CUDA12.5
    + For CUDA directory of 12.6, compiled NCCL Version 2.24.3+CUDA12.6
    + For CUDA directory of 12.8, compiled NCCL Version 2.27.5+CUDA12.8
  + Default CUDA: 12.8
    + PATH `/usr/local/cuda` points to CUDA 12.8
    + Updated below env vars:
      + `LD_LIBRARY_PATH` to have `/usr/local/cuda-12.8/lib:/usr/local/cuda-12.8/lib64:/usr/local/cuda-12.8:/usr/local/cuda-12.8/targets/sbsa-linux/lib:/usr/local/cuda-12.8/nvvm/lib64:/usr/local/cuda-12.8/extras/CUPTI/lib64`
      + `PATH` to have `/usr/local/cuda-12.8/bin/:/usr/local/cuda-12.8/include/`
      + For a different CUDA version, update `LD_LIBRARY_PATH` accordingly.
+ EFA installer: 1.42.0
+ Nvidia GDRCopy: 2.5.1
+ AWS OFI NCCL plugin comes with EFA installer
  + Paths `/opt/amazon/ofi-nccl/lib/aarch64-linux-gnu` and `/opt/amazon/ofi-nccl/efa` are added to `LD_LIBRARY_PATH`.
+ AWS CLI v2 at `/usr/local/bin/aws2` and AWS CLI v1 at `/usr/bin/aws`
+ EBS volume type: gp3
+ Python: `/usr/bin/python3.10`
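The environment-variable layout described above can be switched to another installed CUDA version with a short snippet. This is a sketch: `CUDA_VERSION` and `CUDA_HOME` are illustrative variables (not set by the AMI itself), and the path list mirrors the `LD_LIBRARY_PATH` entries shown for CUDA 12.8.

```shell
# Point the environment at CUDA 12.6 instead of the default 12.8.
# Paths follow the /usr/local/cuda-xx.x layout described above.
CUDA_VERSION=12.6
export CUDA_HOME="/usr/local/cuda-${CUDA_VERSION}"
export PATH="${CUDA_HOME}/bin:${PATH}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib:${CUDA_HOME}/lib64:${CUDA_HOME}/targets/sbsa-linux/lib:${CUDA_HOME}/nvvm/lib64:${CUDA_HOME}/extras/CUPTI/lib64:${LD_LIBRARY_PATH:-}"
```

Note that exported variables affect only the current shell session; to make the change persistent for training jobs, set them in your lifecycle script or shell profile.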

## SageMaker HyperPod release notes: May 27, 2025
<a name="sagemaker-hyperpod-release-notes-20250527"></a>

SageMaker HyperPod releases the following for [Orchestrating SageMaker HyperPod clusters with Slurm](sagemaker-hyperpod-slurm.md).

**New features and improvements**
+ Updated base AMI to `Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04) 20250523` with the following key components:
  + NVIDIA Driver: 570.133.20
  + CUDA: 12.8 (default), with support for CUDA 12.4-12.6
  + NCCL Version: 2.26.5
  + EFA Installer: 1.40.0
  + AWS OFI NCCL: 1.14.2-aws
+ Updated Neuron SDK packages:
  + aws-neuronx-collectives: 2.25.65.0-9858ac9a1 (from 2.24.59.0-838c7fc8b)
  + aws-neuronx-dkms: 2.21.37.0 (from 2.20.28.0)
  + aws-neuronx-runtime-lib: 2.25.57.0-166c7a468 (from 2.24.53.0-f239092cc)
  + aws-neuronx-tools: 2.23.9.0 (from 2.22.61.0)

**Important notes**
+ NVIDIA Container Toolkit 1.17.4 now disables mounting of CUDA compatibility libraries.
+ Updated EFA configuration from 1.37 to 1.38, and EFA now includes the AWS OFI NCCL plugin, which is located in the `/opt/amazon/ofi-nccl` directory instead of the original `/opt/aws-ofi-nccl/` path. (Released on February 18, 2025)
+ Kernel version is pinned for stability and driver compatibility.

## SageMaker HyperPod AMI releases for Slurm: May 13, 2025
<a name="sagemaker-hyperpod-release-ami-slurm-20250513"></a>

Amazon SageMaker HyperPod released an updated AMI that supports Ubuntu 22.04 LTS for Slurm clusters. AWS regularly updates AMIs to ensure you have access to the most current software stack. Upgrading to the latest AMI provides enhanced security through comprehensive package updates, improved performance and stability for your workloads, and compatibility with new instance types and latest kernel features.

**Important**  
The update from Ubuntu 20.04 LTS to Ubuntu 22.04 LTS introduces changes that might affect compatibility with software and configurations designed for Ubuntu 20.04.

**Topics**
+ [Key updates in the Ubuntu 22.04 AMI](#sagemaker-hyperpod-ami-slurm-ubuntu22-updates)
+ [Upgrading to the Ubuntu 22.04 AMI](#sagemaker-hyperpod-ami-slurm-ubuntu22-upgrade)
+ [Troubleshooting upgrade failures](#sagemaker-hyperpod-ami-slurm-ubuntu22-troubleshoot)

### Key updates in the Ubuntu 22.04 AMI
<a name="sagemaker-hyperpod-ami-slurm-ubuntu22-updates"></a>

The following table lists the component versions of the Ubuntu 22.04 AMI compared to the previous AMI.


**Component versions of the Ubuntu 22.04 AMI compared to the previous AMI**  

| Component | Previous version | Updated version | 
| --- | --- | --- | 
|  **Ubuntu OS**  |  20.04 LTS  |  22.04 LTS  | 
|  **Slurm**  |  24.11  |  24.11 (unchanged)  | 
|  **Python**  |  3.8 (default)  |  3.10 (default)  | 
|  **Elastic Fabric Adapter (EFA) on Amazon FSx**  |  Not supported  |  Supported  | 
|  **Linux kernel**  |  5.15  |  6.8  | 
|  **GNU C Library (glibc)**  |  2.31  |  2.35  | 
|  **GNU Compiler Collection (GCC)**  |  9.4.0  |  11.4.0  | 
|  **libc6**  |  ≤ 2.31  |  ≥ 2.35 supported  | 
|  **Network File System (NFS)**  |  1:1.3.4  |  1:2.6.1  | 

**Note**  
Although the Slurm version (24.11) remains unchanged, the underlying OS and library updates in this AMI may affect your system behavior and workload compatibility. You must test your workloads before upgrading production clusters.

### Upgrading to the Ubuntu 22.04 AMI
<a name="sagemaker-hyperpod-ami-slurm-ubuntu22-upgrade"></a>

Before upgrading your cluster to the Ubuntu 22.04 AMI, complete these preparation steps and review the upgrade requirements. To troubleshoot upgrade failures, see [Troubleshooting upgrade failures](#sagemaker-hyperpod-ami-slurm-ubuntu22-troubleshoot).

#### Review Python compatibility
<a name="sagemaker-hyperpod-ami-slurm-ubuntu22-python-compatibility"></a>

The Ubuntu 22.04 AMI uses Python 3.10 as the default version, upgraded from Python 3.8. Although Python 3.10 maintains compatibility with most Python 3.8 code, you should test your existing workloads before upgrading. If your workloads require Python 3.8, you can install it using a command like the following in your lifecycle script. Note that Ubuntu 22.04 uses `apt`, not `yum`, and does not ship `python3.8` in its default repositories, so you might need an additional package source such as the deadsnakes PPA:

```
apt-get install -y python3.8
```

Before upgrading your cluster, make sure to do the following:

1. Test your code compatibility with Python 3.10.

1. Verify your lifecycle scripts work in the new environment.

1. Check that all dependencies are compatible with the new Python version.

1. If you created your HyperPod cluster by copying the default lifecycle script from GitHub, add the following command to your `setup_mariadb_accounting.sh` file before upgrading to Ubuntu 22. For the complete script, see [setup_mariadb_accounting.sh on GitHub](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/setup_mariadb_accounting.sh).

   ```
   apt-get -y -o DPkg::Lock::Timeout=120 update && apt-get -y -o DPkg::Lock::Timeout=120 install apg
   ```
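A quick pre-upgrade check along these lines can catch an unexpected default interpreter before your workloads run. This is a hedged sketch, not part of the official lifecycle scripts; the expected version string and the assumption that `python3` is on the PATH are both yours to adjust:

```shell
# Hypothetical pre-upgrade check: compare the default python3 major.minor
# version against the version your workloads were validated on
# (3.10 on the Ubuntu 22.04 AMI).
expected_matches() {
  # $1 = expected major.minor, $2 = actual major.minor
  [ "$1" = "$2" ]
}

ACTUAL="$(python3 -c 'import sys; print("%d.%d" % (sys.version_info[0], sys.version_info[1]))' 2>/dev/null || echo "unknown")"
if expected_matches "3.10" "$ACTUAL"; then
  echo "default python3 is $ACTUAL, as validated"
else
  echo "default python3 is $ACTUAL; re-test workloads before upgrading" >&2
fi
```

You could run a check like this at the start of a lifecycle script so that a mismatched interpreter surfaces in the CloudWatch logs rather than as a downstream training failure.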

#### Upgrade your Slurm cluster
<a name="sagemaker-hyperpod-ami-slurm-ubuntu22-upgrade-cluster"></a>

You can upgrade your Slurm cluster to use the new AMI in two ways:

1. Create a new cluster using the [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) API.

1. Update an existing cluster's software using the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API.

#### Validated configurations
<a name="sagemaker-hyperpod-ami-slurm-ubuntu22-validation"></a>

AWS has tested a wide range of distributed training workloads and infrastructure features on G5, G6, G6e, P4d, P5, and Trn1 instances, including:
+ Distributed training with PyTorch (e.g., FSDP, NeMo, LLaMA, MNIST).
+ Accelerator testing across instance types with NVIDIA GPUs (P and G series) and AWS Neuron (Trn1).
+ Resiliency features that include [auto-resume](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm.html#sagemaker-hyperpod-resiliency-slurm-auto-resume) and [deep health checks](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-resiliency-deep-health-checks.html).

#### Cluster downtime and availability
<a name="sagemaker-hyperpod-ami-slurm-ubuntu22-downtime-availability"></a>

During the upgrade process, the cluster will be unavailable. To minimize disruption, do the following:
+ Test the upgrade process on smaller clusters.
+ Create checkpoints before the upgrade, then restart training workloads from existing checkpoints after the upgrade completes.

### Troubleshooting upgrade failures
<a name="sagemaker-hyperpod-ami-slurm-ubuntu22-troubleshoot"></a>

When an upgrade fails, first determine if the failure is related to lifecycle scripts. These scripts commonly fail due to syntax errors, missing dependencies, or incorrect configurations.

To investigate failures related to lifecycle scripts, check CloudWatch logs. All SageMaker HyperPod events and logs are stored under the log group: `/aws/sagemaker/Clusters/[ClusterName]/[ClusterID]`. Look specifically at the log stream `LifecycleConfig/[instance-group-name]/[instance-id]`, which provides detailed information about any errors during script execution.
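As a sketch, you can assemble those names in a lifecycle or debugging script and then pull the stream with the AWS CLI. The cluster name `my-cluster`, cluster ID `abc123`, instance group `worker-group`, and instance ID below are placeholders:

```shell
# Build the CloudWatch log group and log stream names following the
# patterns documented above. All specific values here are placeholders.
log_group() { printf '/aws/sagemaker/Clusters/%s/%s' "$1" "$2"; }
log_stream() { printf 'LifecycleConfig/%s/%s' "$1" "$2"; }

GROUP="$(log_group my-cluster abc123)"
STREAM="$(log_stream worker-group i-0123456789abcdef0)"

# Fetch the lifecycle script output (requires AWS credentials and
# permissions for CloudWatch Logs):
# aws logs get-log-events --log-group-name "$GROUP" --log-stream-name "$STREAM"
echo "$GROUP"
echo "$STREAM"
```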

If the upgrade failure is unrelated to lifecycle scripts, collect relevant information including the cluster ARN, error logs, and timestamps, then contact [AWS support](https://aws.amazon.com/premiumsupport/) for further assistance.

## SageMaker HyperPod AMI releases for Slurm: May 07, 2025
<a name="sagemaker-hyperpod-release-ami-slurm-20250507"></a>

Amazon SageMaker HyperPod for Slurm released a major OS version upgrade to Ubuntu 22.04 (from the earlier Ubuntu 20.04). For more information, see the DLAMI Ubuntu 22.04 [release notes](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-22-04/) for `Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 22.04) 20250503`.

Key package upgrades:
+ Ubuntu 22.04 LTS (from 20.04)
+ Python version:
  + Python 3.10 is now the default Python version in the Slurm AMI for Ubuntu 22.04.
  + This upgrade provides access to the latest features, performance improvements, and bug fixes introduced in Python 3.10.
+ Support for EFA on FSx
+ New Linux Kernel version 6.8 (updated from 5.15)
+ Glibc version: 2.35 (updated from 2.31)
+ GCC version: 11.4.0 (updated from 9.4.0)
+ Support for newer libc6 versions: ≥ 2.35 (from ≤ 2.31)
+ NFS version: 1:2.6.1 (updated from 1:1.3.4)

## SageMaker HyperPod AMI releases for Slurm: April 28, 2025
<a name="sagemaker-hyperpod-release-ami-slurm-20250428"></a>

**Improvements for Slurm**
+ Upgraded NVIDIA driver from version 550.144.03 to 550.163.01. This upgrade addresses Common Vulnerabilities and Exposures (CVEs) listed in the [NVIDIA GPU Display Security Bulletin for April 2025](https://nvidia.custhelp.com/app/answers/detail/a_id/5630).

**Amazon SageMaker HyperPod DLAMI for Slurm support**

------
#### [ Installed the latest version of AWS Neuron SDK ]
+ **aws-neuronx-collectives:** 2.24.59.0-838c7fc8b
+ **aws-neuronx-dkms:** 2.20.28.0
+ **aws-neuronx-runtime-lib:** 2.24.53.0-f239092cc
+ **aws-neuronx-tools:** 2.22.61.0

------

## SageMaker HyperPod AMI releases for Slurm: February 18, 2025
<a name="sagemaker-hyperpod-release-ami-slurm-20250218"></a>

**Improvements for Slurm**
+ Upgraded Slurm version to 24.11.
+ Upgraded Elastic Fabric Adapter (EFA) version from 1.37.0 to 1.38.0.
+ The EFA now includes the AWS OFI NCCL plugin. You can find this plugin in the `/opt/amazon/ofi-nccl` directory, rather than the original `/opt/aws-ofi-nccl/` location. If you need to update your `LD_LIBRARY_PATH` environment variable, make sure to modify the path to point to the new `/opt/amazon/ofi-nccl` location for the OFI NCCL plugin.
+ Removed the emacs package from these DLAMIs. You can install it yourself from GNU Emacs.
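If you set `LD_LIBRARY_PATH` yourself, a lifecycle script could prepend the new plugin location along these lines. This is a sketch; the `lib` subdirectory under `/opt/amazon/ofi-nccl` is an assumption, so confirm the directory layout on your AMI:

```shell
# Prepend the new OFI NCCL plugin location to LD_LIBRARY_PATH.
# The lib/ subdirectory is an assumption; verify the layout on your AMI.
OFI_NCCL_HOME=/opt/amazon/ofi-nccl
export LD_LIBRARY_PATH="${OFI_NCCL_HOME}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
echo "$LD_LIBRARY_PATH"
```

The `${VAR:+...}` expansion avoids leaving a trailing colon when `LD_LIBRARY_PATH` was previously unset.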

**Amazon SageMaker HyperPod DLAMI for Slurm support**

------
#### [ Installed the latest version of AWS Neuron SDK 2.19 ]
+ **aws-neuronx-collectives/unknown:** 2.23.135.0-3e70920f2 amd64
+ **aws-neuronx-dkms/unknown:** 2.19.64.0 amd64
+ **aws-neuronx-runtime-lib/unknown:** 2.23.112.0-9b5179492 amd64
+ **aws-neuronx-tools/unknown:** 2.20.204.0 amd64

------

## SageMaker HyperPod AMI releases for Slurm: December 21, 2024
<a name="sagemaker-hyperpod-release-ami-slurm-20241221"></a>

**SageMaker HyperPod DLAMI for Slurm support**

------
#### [ Deep Learning Slurm AMI ]
+ **NVIDIA driver:** 550.127.05
+ **EFA driver:** 2.13.0-1
+ Installed the latest version of AWS Neuron SDK
  + **aws-neuronx-collectives:** 2.22.33.0
  + **aws-neuronx-dkms:** 2.18.20.0
  + **aws-neuronx-oci-hook:** 2.5.8.0
  + **aws-neuronx-runtime-lib:** 2.22.19.0
  + **aws-neuronx-tools:** 2.19.0.0

------

## SageMaker HyperPod AMI releases for Slurm: November 24, 2024
<a name="sagemaker-hyperpod-release-ami-slurm-20241124"></a>

**AMI general updates**
+ Released in `MEL` (Melbourne) Region.
+ Updated SageMaker HyperPod base DLAMI to the following versions:
  + Slurm: 2024-11-22.

## SageMaker HyperPod AMI releases for Slurm: November 15, 2024
<a name="sagemaker-hyperpod-release-ami-slurm-20241115"></a>

**AMI general updates**
+ Installed latest `libnvidia-nscq-xxx` package.

**SageMaker HyperPod DLAMI for Slurm support**

------
#### [ Deep Learning Slurm AMI ]
+ **NVIDIA driver:** 550.127.05
+ **EFA driver:** 2.13.0-1
+ Installed the latest version of AWS Neuron SDK
  + **aws-neuronx-collectives:** v2.22.33.0-d2128d1aa
  + **aws-neuronx-dkms:** v2.17.17.0
  + **aws-neuronx-oci-hook:** v2.4.4.0
  + **aws-neuronx-runtime-lib:** v2.21.41.0
  + **aws-neuronx-tools:** v2.18.3.0

------

## SageMaker HyperPod AMI releases for Slurm: November 11, 2024
<a name="sagemaker-hyperpod-release-ami-slurm-20241111"></a>

**AMI general updates**
+ Updated SageMaker HyperPod base DLAMI to the following version:
  + Slurm: 2024-10-23.

## SageMaker HyperPod AMI releases for Slurm: October 21, 2024
<a name="sagemaker-hyperpod-release-ami-slurm-20241021"></a>

**AMI general updates**
+ Updated SageMaker HyperPod base DLAMI to the following versions:
  + Slurm: 2024-09-27.

## SageMaker HyperPod AMI releases for Slurm: September 10, 2024
<a name="sagemaker-hyperpod-release-ami-slurm-20240910"></a>

**SageMaker HyperPod DLAMI for Slurm support**

------
#### [ Deep Learning Slurm AMI ]
+ Installed the NVIDIA driver v550.90.07
+ Installed the EFA driver v2.10
+ Installed the latest version of AWS Neuron SDK
  + **aws-neuronx-collectives:** v2.21.46.0
  + **aws-neuronx-dkms:** v2.17.17.0
  + **aws-neuronx-oci-hook:** v2.4.4.0
  + **aws-neuronx-runtime-lib:** v2.21.41.0
  + **aws-neuronx-tools:** v2.18.3.0

------

## SageMaker HyperPod AMI releases for Slurm: March 14, 2024
<a name="sagemaker-hyperpod-release-ami-slurm-20240314"></a>

**HyperPod DLAMI for Slurm software patch**
+ Upgraded [Slurm](https://slurm.schedmd.com/documentation.html) to v23.11.1
+ Added [OpenPMIx](https://openpmix.github.io/code/getting-the-reference-implementation) v4.2.6 for enabling [Slurm with PMIx](https://slurm.schedmd.com/mpi_guide.html#pmix).
+ Built upon the [AWS Deep Learning Base GPU AMI (Ubuntu 20.04)](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-20-04/) released on 2023-10-26
+ A complete list of pre-installed packages in this HyperPod DLAMI in addition to the base AMI
  + [Slurm](https://slurm.schedmd.com/documentation.html): v23.11.1
  + [OpenPMIx ](https://openpmix.github.io/code/getting-the-reference-implementation): v4.2.6
  + Munge: v0.5.15
  + `aws-neuronx-dkms`: v2.*
  + `aws-neuronx-collectives`: v2.*
  + `aws-neuronx-runtime-lib`: v2.*
  + `aws-neuronx-tools`: v2.*
  + SageMaker HyperPod software packages to support features such as cluster health check and auto-resume

**Upgrade steps**
+ Run the following command to call the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API to update your existing HyperPod clusters with the latest HyperPod DLAMI. For more instructions, see [Update the SageMaker HyperPod platform software of a cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software).
**Important**  
Back up your work before running this API. The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre. For more information, see [Use the backup script provided by SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup).

  ```
   aws sagemaker update-cluster-software --cluster-name your-cluster-name
  ```
**Note**  
You must run the AWS CLI command to update your HyperPod cluster. Updating the HyperPod software through the SageMaker HyperPod console UI is currently not available.

## SageMaker HyperPod AMI release for Slurm: November 29, 2023
<a name="sagemaker-hyperpod-release-ami-slurm-20231129"></a>

**HyperPod DLAMI for Slurm software patch**

The HyperPod service team distributes software patches through [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami). See the following details about the latest HyperPod DLAMI.
+ Built upon the [AWS Deep Learning Base GPU AMI (Ubuntu 20.04)](https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-20-04/) released on 2023-10-18
+ A complete list of pre-installed packages in this HyperPod DLAMI in addition to the base AMI
  + [Slurm](https://slurm.schedmd.com/documentation.html): v23.02.3
  + Munge: v0.5.15
  + `aws-neuronx-dkms`: v2.*
  + `aws-neuronx-collectives`: v2.*
  + `aws-neuronx-runtime-lib`: v2.*
  + `aws-neuronx-tools`: v2.*
  + SageMaker HyperPod software packages to support features such as cluster health check and auto-resume

# SageMaker HyperPod AMI releases for Amazon EKS
<a name="sagemaker-hyperpod-release-ami-eks"></a>

The following release notes track the latest updates for Amazon SageMaker HyperPod AMI releases for Amazon EKS orchestration. Each release note includes a summarized list of packages pre-installed or pre-configured in the SageMaker HyperPod DLAMIs for Amazon EKS support. Each DLAMI is built on AL2023 and supports a specific Kubernetes version. For HyperPod DLAMI releases for Slurm orchestration, see [SageMaker HyperPod AMI releases for Slurm](sagemaker-hyperpod-release-ami-slurm.md). For information about Amazon SageMaker HyperPod feature releases, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

## SageMaker HyperPod AMI releases for Amazon EKS: April 20, 2026
<a name="sagemaker-hyperpod-release-ami-eks-20260420"></a>

**AMI general updates**
+ Released updates for the SageMaker HyperPod AMI for Amazon EKS versions 1.28, 1.29, 1.30, 1.31, 1.32, 1.33, and 1.34.
+ Base DLAMI release notes are available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

**SageMaker HyperPod DLAMI for Amazon EKS support**

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+  **AL2 is now deprecated. Kubernetes AMI is based on AL2023.** 
+ AL2 (x86_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.50 Python/3.10.17 Linux/5.10.248-247.988.amzn2.x86_64 botocore/1.42.60
  + aws Neuronx DKMS version: 2.26.5.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + EFA Installer version: 1.45.0
  + ENA Driver version: 2.16.1g
  + Python version: 3.7.16
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.26.10.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + EFA Installer version: 1.45.0
  + ENA Driver version: 2.16.1g
  + Python version: 3.9.25
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.29 ]
+  **AL2 is now deprecated. Kubernetes AMI is based on AL2023.** 
+ AL2 (x86_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.50 Python/3.10.17 Linux/5.10.248-247.988.amzn2.x86_64 botocore/1.42.60
  + aws Neuronx DKMS version: 2.26.5.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + EFA Installer version: 1.45.0
  + ENA Driver version: 2.16.1g
  + Python version: 3.7.16
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.26.10.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + EFA Installer version: 1.45.0
  + ENA Driver version: 2.16.1g
  + Python version: 3.9.25
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.30 ]
+  **AL2 is now deprecated. Kubernetes AMI is based on AL2023.** 
+ AL2 (x86_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.50 Python/3.10.17 Linux/5.10.248-247.988.amzn2.x86_64 botocore/1.42.60
  + aws Neuronx DKMS version: 2.26.5.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + EFA Installer version: 1.45.0
  + ENA Driver version: 2.16.1g
  + Python version: 3.7.16
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.26.10.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + EFA Installer version: 1.45.0
  + ENA Driver version: 2.16.1g
  + Python version: 3.9.25
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.31 ]
+  **AL2 is now deprecated. Kubernetes AMI is based on AL2023.** 
+ AL2 (x86_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.50 Python/3.10.17 Linux/5.10.248-247.988.amzn2.x86_64 botocore/1.42.60
  + aws Neuronx DKMS version: 2.26.5.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + EFA Installer version: 1.45.0
  + ENA Driver version: 2.16.1g
  + Python version: 3.7.16
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.26.10.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + EFA Installer version: 1.45.0
  + ENA Driver version: 2.16.1g
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + EFA Installer version: 1.43.3
  + ENA Driver version: 2.16.1g
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.

------
#### [ Kubernetes v1.32 ]
+  **AL2 is now deprecated. Kubernetes AMI is based on AL2023.** 
+ AL2 (x86_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.50 Python/3.10.17 Linux/5.10.248-247.988.amzn2.x86_64 botocore/1.42.60
  + aws Neuronx DKMS version: 2.26.5.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + EFA Installer version: 1.45.0
  + ENA Driver version: 2.16.1g
  + Python version: 3.7.16
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.26.10.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + EFA Installer version: 1.45.0
  + ENA Driver version: 2.16.1g
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + EFA Installer version: 1.43.3
  + ENA Driver version: 2.16.1g
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.

------
#### [ Kubernetes v1.33 ]
+ AL2023 (x86_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.26.10.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + EFA Installer version: 1.45.0
  + ENA Driver version: 2.16.1g
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + EFA Installer version: 1.43.3
  + ENA Driver version: 2.16.1g
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58.

------
#### [ Kubernetes v1.34 ]
+ AL2023 (x86_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.26.10.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + EFA Installer version: 1.45.0
  + ENA Driver version: 2.16.1g
  + Python version: 3.9.25
  + Kubernetes version: v1.34.2-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + EFA Installer version: 1.43.3
  + ENA Driver version: 2.16.1g
  + Python version: 3.9.25
  + Kubernetes version: v1.34.2-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------

## SageMaker HyperPod AMI releases for Amazon EKS: March 01, 2026
<a name="sagemaker-hyperpod-release-ami-eks-20260301"></a>

 **AMI general updates** 
+ Released updates for the SageMaker HyperPod AMIs for Amazon EKS Kubernetes versions 1.28, 1.29, 1.30, 1.31, 1.32, 1.33, and 1.34.
+ The base DLAMI release notes are available in the [DLAMI developer guide](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Amazon EKS support** 

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+ **AL2 is now deprecated. The Kubernetes AMI is now based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.44 Python/3.10.17 Linux/5.10.248-247.988.amzn2.x86\_64 botocore/1.42.54
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.29 ]
+ **AL2 is now deprecated. The Kubernetes AMI is now based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.44 Python/3.10.17 Linux/5.10.248-247.988.amzn2.x86\_64 botocore/1.42.54
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.30 ]
+ **AL2 is now deprecated. The Kubernetes AMI is now based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.44 Python/3.10.17 Linux/5.10.248-247.988.amzn2.x86\_64 botocore/1.42.54
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.31 ]
+ **AL2 is now deprecated. The Kubernetes AMI is now based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.44 Python/3.10.17 Linux/5.10.248-247.988.amzn2.x86\_64 botocore/1.42.54
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.32 ]
+ **AL2 is now deprecated. The Kubernetes AMI is now based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.46 Python/3.10.17 Linux/5.10.248-247.988.amzn2.x86\_64 botocore/1.42.56
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.33 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.34 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.34.2-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.16.1g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.34.2-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.2
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300064.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
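When validating that cluster nodes picked up this AMI, it can help to compare a component version probed on a node against the value listed above. The following is a minimal sketch using GNU `sort -V`; the sample values and the probe command mentioned in the comments (such as `containerd --version`) are illustrative assumptions, not part of this release.

```shell
#!/bin/sh
# Hedged sketch: check whether an installed version meets the release-notes value.
# "installed" would normally come from a probe on the node, for example:
#   containerd --version | awk '{print $NF}'    (illustrative)
installed="2.1.5"   # sample value standing in for the probed version
required="2.1.5"    # value taken from the release notes above

# sort -V orders version strings numerically; the first line is the smallest,
# so if "required" sorts first (or ties), the installed version is new enough.
lowest=$(printf '%s\n' "$installed" "$required" | sort -V | head -n 1)
if [ "$lowest" = "$required" ]; then
  echo "OK: installed $installed >= required $required"
else
  echo "STALE: installed $installed < required $required"
fi
```

The same comparison works for any of the dotted versions listed in this section (for example `lustre-client 2.15.6` or `rdma-core 60.0`), since `sort -V` handles multi-component version strings.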

## SageMaker HyperPod AMI releases for Amazon EKS: February 12, 2026
<a name="sagemaker-hyperpod-release-ami-eks-20260212"></a>

 **AMI general updates** 
+ Released updates for the SageMaker HyperPod AMIs for Amazon EKS Kubernetes versions 1.28, 1.29, 1.30, 1.31, 1.32, 1.33, and 1.34.
+ The base DLAMI release notes are available in the [DLAMI developer guide](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Amazon EKS support** 

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+ **AL2 is now deprecated. The Kubernetes AMI is now based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.31 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.41
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.29 ]
+ **AL2 is now deprecated. The Kubernetes AMI is now based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.31 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.41
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.30 ]
+ **AL2 is now deprecated. The Kubernetes AMI is now based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.31 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.41
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.31 ]
+ **AL2 is now deprecated. The Kubernetes AMI is now based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.31 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.41
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.32 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.31 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.41
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.7.16
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.33 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.34 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.45.0
  + Python version: 3.9.25
  + Kubernetes version: v1.34.2-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + EFA Installer version: 1.43.3
  + Python version: 3.9.25
  + Kubernetes version: v1.34.2-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.1
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
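A quick way to confirm that a given node is actually running the component versions listed above is to query each tool directly. The following is a minimal sketch, assuming shell access to a cluster node (for example through SSM Session Manager or SSH); the labels are informal and tools that are not installed on a particular node are simply skipped:

```shell
#!/bin/sh
# Print installed versions for a few of the components tracked in these
# release notes. Tools that are absent on a node are silently skipped.
show() {
  label=$1; shift
  if command -v "$1" >/dev/null 2>&1; then
    # First line of the tool's version output, prefixed with our label.
    printf '%s: %s\n' "$label" "$("$@" 2>/dev/null | head -n 1)"
  fi
}

show "Linux Kernel"         uname -r
show "OpenSSL"              openssl version
show "Runc"                 runc --version
show "Containerd"           containerd --version
show "Kubernetes (kubelet)" kubelet --version
show "NVIDIA Driver"        nvidia-smi --query-gpu=driver_version --format=csv,noheader
show "git"                  git --version
show "sudo"                 sudo --version
```

Compare the printed values against the table for the AMI version your instance group uses; a mismatch usually means the node is on an older AMI release.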

## SageMaker HyperPod AMI releases for Amazon EKS: January 25, 2026
<a name="sagemaker-hyperpod-release-ami-eks-20260125"></a>

 **AMI general updates** 
+ Released updates for the SageMaker HyperPod AMI for Amazon EKS versions 1.28, 1.29, 1.30, 1.31, 1.32, 1.33, and 1.34.
+ The base DLAMI release notes are available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Amazon EKS support** 

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.21 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.31
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.29 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.21 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.31
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.30 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.21 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.31
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.31 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.21 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.31
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.32 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: 25.0.14, build 0bab007
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.21 Python/3.10.17 Linux/5.10.247-246.989.amzn2.x86\_64 botocore/1.42.31
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.211.01
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.33 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.34 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.34.2-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.4
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.5
  + NVIDIA Driver version: 580.126.09
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.34.2-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.126.09
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300062.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------

## SageMaker HyperPod AMI releases for Amazon EKS: December 29, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20251229"></a>

 **AMI general updates** 
+ Released updates for the SageMaker HyperPod AMI for Amazon EKS versions 1.28, 1.29, 1.30, 1.31, 1.32, and 1.33.
+ The base DLAMI release notes are available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

 **SageMaker HyperPod DLAMI for Amazon EKS support** 

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.4 Python/3.10.17 Linux/5.10.245-245.983.amzn2.x86\_64 botocore/1.42.14
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.28.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.29 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.4 Python/3.10.17 Linux/5.10.245-245.983.amzn2.x86\_64 botocore/1.42.14
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.29.15-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.30 ]
+  **AL2 is now deprecated. The Kubernetes AMI is based on AL2023.** 
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.4 Python/3.10.17 Linux/5.10.245-245.983.amzn2.x86\_64 botocore/1.42.14
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.30.14-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0

------
#### [ Kubernetes v1.31 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.4 Python/3.10.17 Linux/5.10.245-245.983.amzn2.x86\_64 botocore/1.42.14
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.4
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.4
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.31.13-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.105.08
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.32 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.29
  + aws CLI v2 version: aws-cli/1.44.4 Python/3.10.17 Linux/5.10.245-245.983.amzn2.x86\_64 botocore/1.42.14
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.4
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.4
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.32.9-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.105.08
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.33 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.4
  + aws Neuronx DKMS version: 2.25.4.0
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 60.0
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd/v2 2.1.4
  + NVIDIA Driver version: 580.105.08
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.25
  + Kubernetes version: v1.33.5-eks-ecaa3a6
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.105.08
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------

## SageMaker HyperPod AMI releases for Amazon EKS: November 22, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20251128"></a>

**AMI general updates**
+ Released updates for SageMaker HyperPod AMI for Amazon EKS versions 1.28, 1.29, 1.30, 1.31, 1.32, and 1.33.
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).
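As a reading aid (not part of the release notes themselves): the Kubernetes versions in the listings below carry an EKS build suffix, for example `v1.33.5-eks-ecaa3a6`. Assuming the `<upstream>-eks-<build>` format shown in the listings, a minimal shell sketch for splitting such a string:

```shell
# Split an EKS-style Kubernetes version string into its upstream version
# and EKS build identifier using shell parameter expansion.
full="v1.33.5-eks-ecaa3a6"   # example value taken from the listings below
upstream="${full%%-eks-*}"   # strip the "-eks-<build>" suffix -> "v1.33.5"
build="${full##*-eks-}"      # keep only the build identifier  -> "ecaa3a6"
echo "$upstream $build"      # prints: v1.33.5 ecaa3a6
```

The upstream part is what you compare against Kubernetes release documentation; the suffix only identifies the EKS build.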

**SageMaker HyperPod DLAMI for Amazon EKS support**

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws CLI v2 version: aws-cli/1.42.71 Python/3.10.17 Linux/5.10.245-241.978.amzn2.x86\_64 botocore/1.40.71
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.28.15-eks-473151a
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.28.15-eks-473151a
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59

------
#### [ Kubernetes v1.29 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws CLI v2 version: aws-cli/1.42.71 Python/3.10.17 Linux/5.10.245-241.978.amzn2.x86\_64 botocore/1.40.71
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.29.15-eks-473151a
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.29.15-eks-473151a
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59

------
#### [ Kubernetes v1.30 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.13, build 0bab007
  + Runc version: 1.3.2
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws CLI v2 version: aws-cli/1.42.69 Python/3.10.17 Linux/5.10.245-241.976.amzn2.x86\_64 botocore/1.40.69
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.30.11-eks-473151a
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.30.11-eks-473151a
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59

------
#### [ Kubernetes v1.31 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws CLI v2 version: aws-cli/1.42.71 Python/3.10.17 Linux/5.10.245-241.978.amzn2.x86\_64 botocore/1.40.71
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.31.7-eks-473151a
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.31.13-eks-113cf36
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.31.13-eks-113cf36
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.95.05
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.32 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + Linux Kernel version: 5.10
  + Glibc version: 2.26
  + OpenSSL version: 1.0.2k-fips
  + FSx Lustre Client version: 2.12.8
  + Docker version: Docker version 25.0.13, build 0bab007
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws CLI v2 version: aws-cli/1.42.74 Python/3.10.17 Linux/5.10.245-241.978.amzn2.x86\_64 botocore/1.40.74
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 570.195.03
  + CUDA version: 12.2
  + ENA Driver version: 2.15.0g
  + Python version: 3.7.16
  + Kubernetes version: v1.32.3-eks-473151a
  + iptables-services version: 1.8.4
  + nginx version: 1.20.1
  + nvme-cli version: 1.11.1
  + epel-release version: 7
  + stress version: 1.0.4
  + collectd version: 5.8.1
  + acl version: 2.2.51
  + rsyslog version: 8.24.0
  + lustre-client version: 2.12.8
  + systemd version: 219
  + openssh version: 7.4
  + sudo version: 1.8.23
  + gcc version: 7.3.1
  + cmake version: 2.8.12.2
  + git version: 2.47.3
  + make version: 3.82
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 1.3.0
  + lvm2 version: 2.02.187
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.32.9-eks-113cf36
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.32.9-eks-113cf36
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.95.05
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------
#### [ Kubernetes v1.33 ]
+ AL2023 (x86\_64):
  + Linux Kernel version: 6.1
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + aws Neuronx DKMS version: 2.24.7.0
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.33.5-eks-113cf36
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 59
+ AL2023 (ARM64):
  + Linux Kernel version: 6.12
  + Glibc version: 2.34
  + OpenSSL version: 3.2.2
  + FSx Lustre Client version: 2.15.6
  + Runc version: 1.3.3
  + Containerd version: containerd github.com/containerd/containerd 1.7.27
  + NVIDIA Driver version: 580.95.05
  + CUDA version: 12.8
  + ENA Driver version: 2.15.0g
  + Python version: 3.9.24
  + Kubernetes version: v1.33.5-eks-113cf36
  + iptables-services version: 1.8.8
  + nginx version: 1.28.0
  + nvme-cli version: 2.13 1.13
  + stress version: 1.0.7
  + collectd version: 5.12.0
  + acl version: 2.3.1
  + lustre-client version: 2.15.6
  + nvidia-imex version: 580.95.05
  + systemd version: 252
  + openssh version: 8.7
  + sudo version: 1.9.15
  + gcc version: 11.5.0
  + cmake version: 3.22.2
  + git version: 2.50.1
  + make version: 4.3
  + cloudwatch-agent version: 1.300060.1
  + nfs-utils version: 2.5.4
  + lvm2 version: 2.03.16
  + ec2-instance-connect version: 1.1
  + aws-cfn-bootstrap version: 2.0
  + rdma-core version: 58

------

## SageMaker HyperPod AMI releases for Amazon EKS: November 07, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20251107"></a>

**AMI general updates**
+ Released updates for SageMaker HyperPod AMI for Amazon EKS versions 1.28, 1.29, 1.30, 1.31, 1.32, and 1.33. 
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/appendix-ami-release-notes.html#appendix-ami-release-notes-base).

**SageMaker HyperPod DLAMI for Amazon EKS support**

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.28.15
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.28.15
+ Package updates include boto3, botocore, pip, regex, psutil, and nvidia container toolkit components.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.29 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.29.15
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.29.15
+ Package updates include kernel updates, glibc updates, and various system libraries.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.30 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.30.11
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.30.11
+ Package updates include kernel livepatch updates and system library updates.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.31 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.31.7
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.31.13
+ AL2023 (ARM64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.31.13
  + Kernel version: 6.12.46-66.121.amzn2023.aarch64
+ Package updates include extensive system library updates, kernel updates, and boost library updates.
+ Added packages: apr-util-lmdb, kernel-livepatch-6.1.156-177.286

------
#### [ Kubernetes v1.32 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.32.3
  + AWS IAM Authenticator version: v0.6.29
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.32.9
+ AL2023 (ARM64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.32.9
  + Kernel version: 6.12.46-66.121.amzn2023.aarch64
+ Package updates include kernel livepatch updates and system library updates.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.33 ]
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.33.5
  + Kernel version: 6.1.155-176.282.amzn2023.x86\_64
+ AL2023 (ARM64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.33.5
  + Kernel version: 6.12.46-66.121.amzn2023.aarch64
+ Package updates include extensive system library updates, kernel updates, and boost library updates.
+ Added packages: apr-util-lmdb, kernel-livepatch updates

------

**Note**  
The runc version has been upgraded to 1.3.2. For more information, see the [Security bulletin](https://aws.amazon.com/security/security-bulletins/rss/aws-2025-024/).
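To check whether a node already carries a runc at or above the patched release, one illustrative approach (a sketch, not an official procedure) is to compare version strings with `sort -V`:

```shell
# version_ge A B: succeeds when version A >= version B under version-sort ordering.
version_ge() {
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Compare an installed runc version against the patched 1.3.2. The value is
# hard-coded here for illustration; on a node you would take it from the
# first line of `runc --version`.
installed="1.3.3"
if version_ge "$installed" "1.3.2"; then
  echo "runc $installed includes the fix"
else
  echo "runc $installed predates 1.3.2; upgrade required"
fi
```

`sort -V` orders dotted version strings numerically per component, which avoids the pitfalls of plain lexicographic comparison (for example, `1.10` sorting before `1.3`).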

## SageMaker HyperPod AMI releases for Amazon EKS: October 29, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20251029"></a>

**AMI general updates**
+ Released updates for SageMaker HyperPod AMI for Amazon EKS versions 1.28, 1.29, 1.30, 1.31, 1.32, and 1.33. 
+ Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/aws-deep-learning-ami-baseoss-aml2-2025-10-14.html).

**SageMaker HyperPod DLAMI for Amazon EKS support**

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86\_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.28.15
+ AL2023 (x86\_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.28.15
+ Package updates include boto3, botocore, pip, regex, psutil, and nvidia container toolkit components.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.29 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.29.15
+ AL2023 (x86_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.29.15
+ Package updates include kernel updates, glibc updates, and various system libraries.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.30 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.30.11
+ AL2023 (x86_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.30.11
+ Package updates include kernel livepatch updates and system library updates.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.31 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.31.7
+ AL2023 (x86_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.31.13
+ AL2023 (arm):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.31.13
  + Kernel version: 6.12.46-66.121.amzn2023.aarch64
+ Package updates include extensive system library updates, kernel updates, and boost library updates.
+ Added packages: apr-util-lmdb, kernel-livepatch-6.1.156-177.286

------
#### [ Kubernetes v1.32 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ AL2 (x86_64):
  + NVIDIA driver version: 570.195.03
  + CUDA version: 12.8
  + Kubernetes version: 1.32.3
+ AL2023 (x86_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.32.9
+ AL2023 (arm):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.32.9
  + Kernel version: 6.12.46-66.121.amzn2023.aarch64
+ Package updates include kernel livepatch updates and system library updates.
+ Added package: annotated-doc 0.0.3

------
#### [ Kubernetes v1.33 ]
+ AL2023 (x86_64):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.33.5
  + Kernel version: 6.1.155-176.282.amzn2023.x86_64
+ AL2023 (arm):
  + NVIDIA driver version: 580.95.05
  + CUDA version: 13.0
  + Kubernetes version: 1.33.5
  + Kernel version: 6.12.46-66.121.amzn2023.aarch64
+ Package updates include extensive system library updates, kernel updates, and boost library updates.
+ Added packages: apr-util-lmdb, kernel-livepatch updates

------

## SageMaker HyperPod AMI releases for Amazon EKS: October 22, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20251022"></a>

**AL2 (x86_64)**

**Note**  
Amazon Linux 2 is now deprecated. The Kubernetes AMI is based on AL2023.

Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/aws-deep-learning-ami-baseoss-aml2-2025-10-14.html).
+ EKS versions 1.28 - 1.32
+ This release contains CVE patches for affected NVIDIA Driver packages found in the [Nvidia October Security Bulletin](https://nvidia.custhelp.com/app/answers/detail/a_id/5703).
+ NVIDIA SMI

  ```
  NVIDIA-SMI 570.195.03             
  Driver Version: 570.195.03     
  CUDA Version: 12.8
  ```
+ Major versions  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html)
+ Added packages: No packages were added in this release.
+ Updated packages  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html)
+ Removed packages: No packages were removed in this release.

**AL2023 (x86_64)**

Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/aws-deep-learning-ami-gpubaseoss-al2023-2025-10-14.html).
+ EKS versions 1.28 - 1.32. No release for EKS version 1.33.
+ This release contains CVE patches for affected NVIDIA Driver packages found in the [Nvidia October Security Bulletin](https://nvidia.custhelp.com/app/answers/detail/a_id/5703).
+ NVIDIA SMI

  ```
  NVIDIA-SMI 580.95.05             
  Driver Version: 580.95.05  
  CUDA Version: 13.0
  ```
+ Major versions  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html)
+ Added packages: No packages were added in this release.
+ Updated packages  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html)
+ Removed packages: No packages were removed in this release.

**AL2023 (ARM64)**

Base DLAMI release note is available [here](https://docs.aws.amazon.com//dlami/latest/devguide/aws-deep-learning-ami-gpubaseossarm64-al2023-2025-10-14.html).
+ EKS versions 1.31 - 1.33.
+ This release contains CVE patches for affected NVIDIA Driver packages found in the [Nvidia October Security Bulletin](https://nvidia.custhelp.com/app/answers/detail/a_id/5703).
+ NVIDIA SMI

  ```
  NVIDIA-SMI 580.95.05        
  Driver Version: 580.95.05    
  CUDA Version: 13.0
  ```
+ Major versions  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html)
+ Added packages: No packages were added in this release.
+ Updated packages  
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html)
+ Removed packages: No packages were removed in this release.
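After patching, you may want to confirm that a node actually runs the driver/CUDA pair listed above. The sketch below is illustrative: `parse_versions` pulls the two version numbers out of `nvidia-smi` banner text, and the sample input mirrors the AL2023 (x86_64) values above. On a live node you would pipe `nvidia-smi` itself into the function.

```shell
# Extract "driver cuda" from nvidia-smi banner text read on stdin.
parse_versions() {
  input=$(cat)
  driver=$(printf '%s\n' "$input" | sed -n 's/.*Driver Version: *\([0-9.]*\).*/\1/p' | head -n1)
  cuda=$(printf '%s\n' "$input" | sed -n 's/.*CUDA Version: *\([0-9.]*\).*/\1/p' | head -n1)
  printf '%s %s\n' "$driver" "$cuda"
}

# Sample banner copied from the release notes above; on a node, run:
#   nvidia-smi | parse_versions
printf 'NVIDIA-SMI 580.95.05\nDriver Version: 580.95.05\nCUDA Version: 13.0\n' | parse_versions
# prints: 580.95.05 13.0
```

The `sed` patterns match both the multi-line layout shown here and the single-line banner that `nvidia-smi` prints on a real node.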

## SageMaker HyperPod AMI releases for Amazon EKS: September 29, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250929"></a>

**AMI general updates**
+ Released the new SageMaker HyperPod AMI for Amazon EKS 1.33.
**Important**  
The Dynamic Resource Allocation beta Kubernetes API is enabled by default in this release.  
This API improves scheduling and monitoring workloads that require resources such as GPUs.
This API was developed by the open source Kubernetes community and might change in future versions of Kubernetes. Before you use the API, review the [Kubernetes documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) and understand how it affects your workloads.
HyperPod is not releasing an Amazon Linux 2 AMI for Kubernetes 1.33. AWS recommends that you migrate to AL2023. For more information, see [Upgrade from Amazon Linux 2 to AL2023](https://docs.aws.amazon.com/eks/latest/userguide/al2023.html).

For more information, see [Kubernetes v1.33](https://kubernetes.io/blog/2025/04/23/kubernetes-v1-33-release/).
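Because Dynamic Resource Allocation is a beta API, you may want to confirm your cluster actually serves the `resource.k8s.io` API group before depending on it. The helper below is a hedged sketch: it only filters `kubectl api-resources` output, and the "enabled"/"disabled" labels are ours, not Kubernetes terminology.

```shell
# Report whether text on stdin mentions the resource.k8s.io API group.
has_dra() { grep -q 'resource\.k8s\.io' && echo enabled || echo disabled; }

# On a cluster with kubectl configured (not run here):
#   kubectl api-resources | has_dra
printf 'resourceclaims    resource.k8s.io/v1beta1    true    ResourceClaim\n' | has_dra
# prints: enabled
```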

**SageMaker HyperPod DLAMI for Amazon EKS support**

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ NVIDIA SMI:
  + NVIDIA driver version: 570.172.08
  + CUDA version: 12.8
+ Packages:
  + Languages and core libraries:
    + GCC: 11.5.0-5.amzn2023.0.5
    + GCC 14: 14.2.1-7.amzn2023.0.1
    + Java: 17.0.16+8-1.amzn2023.1
    + Perl: 5.32.1-477.amzn2023.0.7
    + Python: 3.9.23-1.amzn2023.0.3
    + Go: 3.2.0-37.amzn2023
    + Rust: 1.89.0-1.amzn2023.0.2
  + Core Libraries:
    + GlibC: 2.34-196.amzn2023.0.1
    + OpenSSL: 3.2.2-1.amzn2023.0.1
    + Zlib: 1.2.11-33.amzn2023.0.5
    + XZ Utils: 5.2.5-9.amzn2023.0.2
    + Util-linux: 2.37.4-1.amzn2023.0.4
  + Neuron:
    + aws-neuronx-dkms: 2.23.9.0-dkms
    + aws-neuronx-tools: 2.25.145.0-1
  + EFA:
    + efa driver: 2.17.2-1.amzn2023
    + efa config: 1.18-1.amzn2023
    + efa nv peermem: 1.2.2-1.amzn2023
    + efa profile: 1.7-1.amzn2023
  + kernel:
    + kernel: 6.1.148-173.267.amzn2023
    + kernel development: 6.1.148-173.267.amzn2023
    + kernel headers: 6.1.148-173.267.amzn2023
    + kernel tools: 6.1.148-173.267.amzn2023
    + kernel modules extra: 6.1.148-173.267.amzn2023
    + kernel livepatch: 1.0-0.amzn2023
  + Nvidia:
    + nvidia container toolkit: 1.17.8-1
    + nvidia container toolkit base: 1.17.8-1
    + libnvidia-container: 1.17.8-1 (with tools)
    + nvidia fabric manager: 570.172.08-1
    + libnvidia-nscq: 570.172.08-1

------
#### [ Kubernetes v1.29 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ NVIDIA SMI:
  + NVIDIA driver version: 570.172.08
  + CUDA version: 12.8
+ Packages:
  + Languages and core libraries:
    + GCC: 11.5.0-5.amzn2023.0.5
    + GCC 14: 14.2.1-7.amzn2023.0.1
    + Java: 17.0.16+8-1.amzn2023.1
    + Perl: 5.32.1-477.amzn2023.0.7
    + Python: 3.9.23-1.amzn2023.0.3
    + Go: 3.2.0-37.amzn2023
    + Rust: 1.89.0-1.amzn2023.0.2
  + Core Libraries:
    + GlibC: 2.34-196.amzn2023.0.1
    + OpenSSL: 3.2.2-1.amzn2023.0.1
    + Zlib: 1.2.11-33.amzn2023.0.5
    + XZ Utils: 5.2.5-9.amzn2023.0.2
    + Util-linux: 2.37.4-1.amzn2023.0.4
  + Neuron:
    + aws-neuronx-dkms: 2.23.9.0-dkms
    + aws-neuronx-tools: 2.25.145.0-1
  + EFA:
    + efa driver: 2.17.2-1.amzn2023
    + efa config: 1.18-1.amzn2023
    + efa nv peermem: 1.2.2-1.amzn2023
    + efa profile: 1.7-1.amzn2023
  + kernel:
    + kernel: 6.1.148-173.267.amzn2023
    + kernel development: 6.1.148-173.267.amzn2023
    + kernel headers: 6.1.148-173.267.amzn2023
    + kernel tools: 6.1.148-173.267.amzn2023
    + kernel modules extra: 6.1.148-173.267.amzn2023
    + kernel livepatch: 1.0-0.amzn2023
  + Nvidia:
    + nvidia container toolkit: 1.17.8-1
    + nvidia container toolkit base: 1.17.8-1
    + libnvidia-container: 1.17.8-1 (with tools)
    + nvidia fabric manager: 570.172.08-1
    + libnvidia-nscq: 570.172.08-1

------
#### [ Kubernetes v1.30 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ NVIDIA SMI:
  + NVIDIA driver version: 570.172.08
  + CUDA version: 12.8
+ Packages:
  + Languages and core libraries:
    + GCC: 11.5.0-5.amzn2023.0.5
    + GCC 14: 14.2.1-7.amzn2023.0.1
    + Java: 17.0.16+8-1.amzn2023.1
    + Perl: 5.32.1-477.amzn2023.0.7
    + Python: 3.9.23-1.amzn2023.0.3
    + Go: 3.2.0-37.amzn2023
    + Rust: 1.89.0-1.amzn2023.0.2
  + Core Libraries:
    + GlibC: 2.34-196.amzn2023.0.1
    + OpenSSL: 3.2.2-1.amzn2023.0.1
    + Zlib: 1.2.11-33.amzn2023.0.5
    + XZ Utils: 5.2.5-9.amzn2023.0.2
    + Util-linux: 2.37.4-1.amzn2023.0.4
  + Neuron:
    + aws-neuronx-dkms: 2.23.9.0-dkms
    + aws-neuronx-tools: 2.25.145.0-1
  + EFA:
    + efa driver: 2.17.2-1.amzn2023
    + efa config: 1.18-1.amzn2023
    + efa nv peermem: 1.2.2-1.amzn2023
    + efa profile: 1.7-1.amzn2023
  + kernel:
    + kernel: 6.1.148-173.267.amzn2023
    + kernel development: 6.1.148-173.267.amzn2023
    + kernel headers: 6.1.148-173.267.amzn2023
    + kernel tools: 6.1.148-173.267.amzn2023
    + kernel modules extra: 6.1.148-173.267.amzn2023
    + kernel livepatch: 1.0-0.amzn2023
  + Nvidia:
    + nvidia container toolkit: 1.17.8-1
    + nvidia container toolkit base: 1.17.8-1
    + libnvidia-container: 1.17.8-1 (with tools)
    + nvidia fabric manager: 570.172.08-1
    + libnvidia-nscq: 570.172.08-1

------
#### [ Kubernetes v1.31 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ NVIDIA SMI:
  + NVIDIA driver version: 570.172.08
  + CUDA version: 12.8
+ Packages:
  + Languages and core libraries:
    + GCC: 11.5.0-5.amzn2023.0.5
    + GCC 14: 14.2.1-7.amzn2023.0.1
    + Java: 17.0.16+8-1.amzn2023.1
    + Perl: 5.32.1-477.amzn2023.0.7
    + Python: 3.9.23-1.amzn2023.0.3
    + Go: 3.2.0-37.amzn2023
    + Rust: 1.89.0-1.amzn2023.0.2
  + Core Libraries:
    + GlibC: 2.34-196.amzn2023.0.1
    + OpenSSL: 3.2.2-1.amzn2023.0.1
    + Zlib: 1.2.11-33.amzn2023.0.5
    + XZ Utils: 5.2.5-9.amzn2023.0.2
    + Util-linux: 2.37.4-1.amzn2023.0.4
  + Neuron:
    + aws-neuronx-dkms: 2.23.9.0-dkms
    + aws-neuronx-tools: 2.25.145.0-1
  + EFA:
    + efa driver: 2.17.2-1.amzn2023
    + efa config: 1.18-1.amzn2023
    + efa nv peermem: 1.2.2-1.amzn2023
    + efa profile: 1.7-1.amzn2023
  + kernel:
    + kernel: 6.1.148-173.267.amzn2023
    + kernel development: 6.1.148-173.267.amzn2023
    + kernel headers: 6.1.148-173.267.amzn2023
    + kernel tools: 6.1.148-173.267.amzn2023
    + kernel modules extra: 6.1.148-173.267.amzn2023
    + kernel livepatch: 1.0-0.amzn2023
  + Nvidia:
    + nvidia container toolkit: 1.17.8-1
    + nvidia container toolkit base: 1.17.8-1
    + libnvidia-container: 1.17.8-1 (with tools)
    + nvidia fabric manager: 570.172.08-1
    + libnvidia-nscq: 570.172.08-1

------
#### [ Kubernetes v1.32 ]
+ **Amazon Linux 2 is now deprecated. Kubernetes AMI is based on AL2023.**
+ NVIDIA SMI:
  + NVIDIA driver version: 570.172.08
  + CUDA version: 12.8
+ Packages:
  + Languages and core libraries:
    + GCC: 11.5.0-5.amzn2023.0.5
    + GCC 14: 14.2.1-7.amzn2023.0.1
    + Java: 17.0.16+8-1.amzn2023.1
    + Perl: 5.32.1-477.amzn2023.0.7
    + Python: 3.9.23-1.amzn2023.0.3
    + Go: 3.2.0-37.amzn2023
    + Rust: 1.89.0-1.amzn2023.0.2
  + Core Libraries:
    + GlibC: 2.34-196.amzn2023.0.1
    + OpenSSL: 3.2.2-1.amzn2023.0.1
    + Zlib: 1.2.11-33.amzn2023.0.5
    + XZ Utils: 5.2.5-9.amzn2023.0.2
    + Util-linux: 2.37.4-1.amzn2023.0.4
  + Neuron:
    + aws-neuronx-dkms: 2.23.9.0-dkms
    + aws-neuronx-tools: 2.25.145.0-1
  + EFA:
    + efa driver: 2.17.2-1.amzn2023
    + efa config: 1.18-1.amzn2023
    + efa nv peermem: 1.2.2-1.amzn2023
    + efa profile: 1.7-1.amzn2023
  + kernel:
    + kernel: 6.1.148-173.267.amzn2023
    + kernel development: 6.1.148-173.267.amzn2023
    + kernel headers: 6.1.148-173.267.amzn2023
    + kernel tools: 6.1.148-173.267.amzn2023
    + kernel modules extra: 6.1.148-173.267.amzn2023
    + kernel livepatch: 1.0-0.amzn2023
  + Nvidia:
    + nvidia container toolkit: 1.17.8-1
    + nvidia container toolkit base: 1.17.8-1
    + libnvidia-container: 1.17.8-1 (with tools)
    + nvidia fabric manager: 570.172.08-1
    + libnvidia-nscq: 570.172.08-1

------
#### [ Kubernetes v1.33 ]

The following table lists the components in this AMI release and their versions.


| Component | AL2023 (x86_64) | AL2023 (arm64) | 
| --- | --- | --- | 
| EKS | v1.33.4 | v1.33.4 | 
| amazon-ssm-agent | 3.3.2299.0-1.amzn2023 | 3.3.2299.0-1.amzn2023 | 
| aws-neuronx-dkms | 2.23.9.0-dkms | N/A | 
| containerd | 1.7.27-1.eks.amzn2023.0.4 | 1.7.27-1.eks.amzn2023.0.4 | 
| efa | 2.17.2-1.amzn2023 | 2.17.2-1.amzn2023 | 
| ena | 2.14.1g | 2.14.1g | 
| kernel | 6.12.40-64.114.amzn2023 | N/A | 
| kernel6.12 | N/A | 6.12.40-64.114.amzn2023 | 
| kmod-nvidia-latest-dkms | 570.172.08-1.amzn2023 | 570.172.08-1.el9 | 
| nvidia-container-toolkit | 1.17.8-1 | 1.17.8-1 | 
| runc | 1.2.6-1.amzn2023.0.1 | 1.2.6-1.amzn2023.0.1 | 
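The table above can be checked against a running node with `rpm -q`. The helper below is a hypothetical spot-check: the package name and expected version in the example come from the x86_64 column, and the "ok"/"mismatch" wording is ours.

```shell
# Compare one line of `rpm -q` output on stdin against an expected version.
expect_version() {  # usage: rpm -q PKG | expect_version EXPECTED
  read -r line
  case "$line" in
    *"$1"*) echo ok ;;
    *)      echo "mismatch: $line" ;;
  esac
}

# On an AMI node (not run here):
#   rpm -q containerd | expect_version 1.7.27-1.eks.amzn2023.0.4
echo 'containerd-1.7.27-1.eks.amzn2023.0.4.x86_64' | expect_version 1.7.27-1.eks.amzn2023.0.4
# prints: ok
```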

------

## SageMaker HyperPod AMI releases for Amazon EKS: August 25, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250825"></a>

**SageMaker HyperPod DLAMI for Amazon EKS support**

This release includes the following updates:

------
#### [ Kubernetes v1.28 ]

**NVIDIA SMI:**
+ Nvidia Driver Version: 570.172.08
+ CUDA Version: 12.8

**Added Packages:**
+ kernel-livepatch-5.10.240-238.955.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Updated Packages:**
+ gdk-pixbuf2.x86_64: 2.36.12-3.amzn2 → 2.36.12-3.amzn2.0.2
+ kernel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-devel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-headers.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-tools.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ libgs.x86_64: 9.54.0-9.amzn2.0.11 → 9.54.0-9.amzn2.0.12
+ microcode_ctl.x86_64: 2:2.1-47.amzn2.4.24 → 2:2.1-47.amzn2.4.25
+ pam.x86_64: 1.1.8-23.amzn2.0.2 → 1.1.8-23.amzn2.0.4

**Removed Packages:**
+ kernel-livepatch-5.10.239-236.958.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Repository Changed:**
+ libnvidia-container-tools.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ libnvidia-container1.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit-base.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit

------
#### [ Kubernetes v1.29 ]

**NVIDIA SMI:**
+ Nvidia Driver Version: 570.172.08
+ CUDA Version: 12.8

**Added Packages:**
+ kernel-livepatch-5.10.240-238.955.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Updated Packages:**
+ gdk-pixbuf2.x86_64: 2.36.12-3.amzn2 → 2.36.12-3.amzn2.0.2
+ kernel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-devel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-headers.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-tools.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ libgs.x86_64: 9.54.0-9.amzn2.0.11 → 9.54.0-9.amzn2.0.12
+ microcode_ctl.x86_64: 2:2.1-47.amzn2.4.24 → 2:2.1-47.amzn2.4.25
+ pam.x86_64: 1.1.8-23.amzn2.0.2 → 1.1.8-23.amzn2.0.4

**Removed Packages:**
+ kernel-livepatch-5.10.239-236.958.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Repository Changed:**
+ libnvidia-container-tools.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ libnvidia-container1.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit-base.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit

------
#### [ Kubernetes v1.30 ]

**NVIDIA SMI:**
+ Nvidia Driver Version: 570.172.08
+ CUDA Version: 12.8

**Added Packages:**
+ kernel-livepatch-5.10.240-238.955.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Updated Packages:**
+ aws-neuronx-dkms.noarch: 2.22.2.0-dkms → 2.23.9.0-dkms
+ efa.x86_64: 2.15.3-1.amzn2 → 2.17.2-1.amzn2
+ efa-nv-peermem.x86_64: 1.2.1-1.amzn2 → 1.2.2-1.amzn2
+ gdk-pixbuf2.x86_64: 2.36.12-3.amzn2 → 2.36.12-3.amzn2.0.2
+ ibacm.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ infiniband-diags.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ kernel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-devel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-headers.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-tools.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ libfabric-aws.x86_64: 2.1.0amzn3.0-1.amzn2 → 2.1.0amzn5.0-1.amzn2
+ libfabric-aws-devel.x86_64: 2.1.0amzn3.0-1.amzn2 → 2.1.0amzn5.0-1.amzn2
+ libgs.x86_64: 9.54.0-9.amzn2.0.11 → 9.54.0-9.amzn2.0.12
+ libibumad.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libibverbs.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libibverbs-core.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libibverbs-utils.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libnccl-ofi.x86_64: 1.15.0-1.amzn2 → 1.16.2-1.amzn2
+ librdmacm.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ librdmacm-utils.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ microcode_ctl.x86_64: 2:2.1-47.amzn2.4.24 → 2:2.1-47.amzn2.4.25
+ pam.x86_64: 1.1.8-23.amzn2.0.2 → 1.1.8-23.amzn2.0.4
+ rdma-core.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ rdma-core-devel.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2

**Removed Packages:**
+ kernel-livepatch-5.10.239-236.958.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Repository Changed:**
+ libnvidia-container-tools.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ libnvidia-container1.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit-base.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit

------
#### [ Kubernetes v1.31 ]

**NVIDIA SMI:**
+ Nvidia Driver Version: 570.172.08
+ CUDA Version: 12.8

**Added Packages:**
+ kernel-livepatch-5.10.240-238.955.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Updated Packages:**
+ gdk-pixbuf2.x86_64: 2.36.12-3.amzn2 → 2.36.12-3.amzn2.0.2
+ kernel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-devel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-headers.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-tools.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ libgs.x86_64: 9.54.0-9.amzn2.0.11 → 9.54.0-9.amzn2.0.12
+ microcode_ctl.x86_64: 2:2.1-47.amzn2.4.24 → 2:2.1-47.amzn2.4.25
+ pam.x86_64: 1.1.8-23.amzn2.0.2 → 1.1.8-23.amzn2.0.4

**Removed Packages:**
+ kernel-livepatch-5.10.239-236.958.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Repository Changed:**
+ libnvidia-container-tools.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ libnvidia-container1.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit-base.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit

------
#### [ Kubernetes v1.32 ]

**NVIDIA SMI:**
+ Nvidia Driver Version: 570.172.08
+ CUDA Version: 12.8

**Added Packages:**
+ kernel-livepatch-5.10.240-238.955.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Updated Packages:**
+ aws-neuronx-dkms.noarch: 2.22.2.0-dkms → 2.23.9.0-dkms
+ efa.x86_64: 2.15.3-1.amzn2 → 2.17.2-1.amzn2
+ efa-nv-peermem.x86_64: 1.2.1-1.amzn2 → 1.2.2-1.amzn2
+ gdk-pixbuf2.x86_64: 2.36.12-3.amzn2 → 2.36.12-3.amzn2.0.2
+ ibacm.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ infiniband-diags.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ kernel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-devel.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-headers.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ kernel-tools.x86_64: 5.10.239-236.958.amzn2 → 5.10.240-238.955.amzn2
+ libfabric-aws.x86_64: 2.1.0amzn3.0-1.amzn2 → 2.1.0amzn5.0-1.amzn2
+ libfabric-aws-devel.x86_64: 2.1.0amzn3.0-1.amzn2 → 2.1.0amzn5.0-1.amzn2
+ libgs.x86_64: 9.54.0-9.amzn2.0.11 → 9.54.0-9.amzn2.0.12
+ libibumad.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libibverbs.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libibverbs-core.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libibverbs-utils.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ libnccl-ofi.x86_64: 1.15.0-1.amzn2 → 1.16.2-1.amzn2
+ librdmacm.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ librdmacm-utils.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ microcode_ctl.x86_64: 2:2.1-47.amzn2.4.24 → 2:2.1-47.amzn2.4.25
+ pam.x86_64: 1.1.8-23.amzn2.0.2 → 1.1.8-23.amzn2.0.4
+ rdma-core.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2
+ rdma-core-devel.x86_64: 57.amzn1-1.amzn2.0.2 → 58.amzn0-1.amzn2.0.2

**Removed Packages:**
+ kernel-livepatch-5.10.239-236.958.x86_64 1.0-0.amzn2 amzn2extra-kernel-5.10

**Repository Changed:**
+ libnvidia-container-tools.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ libnvidia-container1.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit
+ nvidia-container-toolkit-base.x86_64: cuda-rhel8-x86_64 → nvidia-container-toolkit

------

## SageMaker HyperPod AMI releases for Amazon EKS: August 12, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250812"></a>

**The AMI includes the following:**
+ Supported AWS Service: Amazon EC2
+ Operating System: Amazon Linux 2023
+ Compute Architecture: ARM64
+ Latest available version is installed for the following packages:
  + Linux Kernel: 6.12
  + FSx Lustre
  + Docker
  + AWS CLI v2 at `/usr/bin/aws`
  + NVIDIA DCGM
  + Nvidia container toolkit:
    + Version command: `nvidia-container-cli -V`
  + Nvidia-docker2:
    + Version command: `nvidia-docker version`
  + Nvidia-IMEX: v570.172.08-1
+ NVIDIA Driver: 570.158.01
+ NVIDIA CUDA 12.4, 12.5, 12.6, 12.8 stack:
  + CUDA, NCCL, and cuDNN installation directories: `/usr/local/cuda-xx.x/`
    + Example: `/usr/local/cuda-12.8/`
  + Compiled NCCL version:
    + For CUDA directory 12.4, compiled NCCL version 2.22.3+CUDA12.4
    + For CUDA directory 12.5, compiled NCCL version 2.22.3+CUDA12.5
    + For CUDA directory 12.6, compiled NCCL version 2.24.3+CUDA12.6
    + For CUDA directory 12.8, compiled NCCL version 2.27.5+CUDA12.8
  + Default CUDA: 12.8
    + The `/usr/local/cuda` path points to CUDA 12.8
    + The following environment variables are updated:
      + `LD_LIBRARY_PATH` includes `/usr/local/cuda-12.8/lib:/usr/local/cuda-12.8/lib64:/usr/local/cuda-12.8:/usr/local/cuda-12.8/targets/sbsa-linux/lib:/usr/local/cuda-12.8/nvvm/lib64:/usr/local/cuda-12.8/extras/CUPTI/lib64`
      + `PATH` includes `/usr/local/cuda-12.8/bin/:/usr/local/cuda-12.8/include/`
      + For a different CUDA version, update `LD_LIBRARY_PATH` and `PATH` accordingly.
+ EFA installer: 1.42.0
+ Nvidia GDRCopy: 2.5.1
+ AWS OFI NCCL plugin comes with EFA installer
  + Paths `/opt/amazon/ofi-nccl/lib` and `/opt/amazon/ofi-nccl/efa` are added to `LD_LIBRARY_PATH`.
+ AWS CLI v2 at `/usr/local/bin/aws`
+ EBS volume type: gp3
+ Python: `/usr/bin/python3.9`
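Switching a shell session to one of the other installed CUDA stacks can be sketched as below. The `cuda_ld_path` helper is ours: it rebuilds the `LD_LIBRARY_PATH` entries listed above for the default 12.8 stack, parameterized by version; 12.6 is one of the versions this AMI installs.

```shell
# Build the LD_LIBRARY_PATH entries for a given CUDA version, mirroring the
# directory layout documented for the default /usr/local/cuda-12.8 stack.
cuda_ld_path() {
  root="/usr/local/cuda-$1"
  printf '%s' "$root/lib:$root/lib64:$root:$root/targets/sbsa-linux/lib:$root/nvvm/lib64:$root/extras/CUPTI/lib64"
}

# Point the current shell at the CUDA 12.6 stack instead of the default 12.8.
export LD_LIBRARY_PATH="$(cuda_ld_path 12.6)${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export PATH="/usr/local/cuda-12.6/bin:$PATH"
```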

## SageMaker HyperPod AMI releases for Amazon EKS: August 6, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250806"></a>

**SageMaker HyperPod DLAMI for Amazon EKS support**

The AMIs include the following updates:

------
#### [ K8s v1.28 ]
+ **Neuron packages:**
  + **aws-neuronx-collectives:** 2.27.34.0+ec8cd5e8b-1
  + **aws-neuronx-dkms:** 2.23.9.0-dkms
  + **aws-neuronx-runtime-lib:** 2.27.23.0+8deec4dbf-1
  + **aws-neuronx-k8-plugin:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler:** 2.27.7.0-1
  + **aws-neuronx-tools:** 2.25.145.0-1

------
#### [ K8s v1.29 ]
+ **Neuron packages:**
  + **aws-neuronx-collectives:** 2.27.34.0+ec8cd5e8b-1
  + **aws-neuronx-dkms:** 2.23.9.0-dkms
  + **aws-neuronx-runtime-lib:** 2.27.23.0+8deec4dbf-1
  + **aws-neuronx-k8-plugin:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler:** 2.27.7.0-1
  + **aws-neuronx-tools:** 2.25.145.0-1

------
#### [ K8s v1.30 ]
+ **Neuron packages:**
  + **aws-neuronx-collectives:** 2.27.34.0+ec8cd5e8b-1
  + **aws-neuronx-dkms:** 2.23.9.0-dkms
  + **aws-neuronx-runtime-lib:** 2.27.23.0+8deec4dbf-1
  + **aws-neuronx-k8-plugin:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler:** 2.27.7.0-1
  + **aws-neuronx-tools:** 2.25.145.0-1

------
#### [ K8s v1.31 ]
+ **Neuron packages:**
  + **aws-neuronx-collectives:** 2.27.34.0+ec8cd5e8b-1
  + **aws-neuronx-dkms:** 2.23.9.0-dkms
  + **aws-neuronx-runtime-lib:** 2.27.23.0+8deec4dbf-1
  + **aws-neuronx-k8-plugin:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler:** 2.27.7.0-1
  + **aws-neuronx-tools:** 2.25.145.0-1

------
#### [ K8s v1.32 ]
+ **Neuron packages:**
  + **aws-neuronx-collectives:** 2.27.34.0+ec8cd5e8b-1
  + **aws-neuronx-dkms:** 2.23.9.0-dkms
  + **aws-neuronx-runtime-lib:** 2.27.23.0+8deec4dbf-1
  + **aws-neuronx-k8-plugin:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler:** 2.27.7.0-1
  + **aws-neuronx-tools:** 2.25.145.0-1

------

**Important**  
Deep Learning Base OSS Nvidia Driver AMI (Amazon Linux 2) Version 70.3  
Deep Learning Base Proprietary Nvidia Driver AMI (Amazon Linux 2) Version 68.4  
Latest CUDA 12.8 support  
Upgraded the NVIDIA driver from 570.158.01 to 570.172.08 to fix CVEs listed in the NVIDIA Security Bulletin for July

## SageMaker HyperPod AMI releases for Amazon EKS: July 31, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250731"></a>

Amazon SageMaker HyperPod now supports a new AMI for Amazon EKS clusters that updates the base operating system to Amazon Linux 2023. This release provides several improvements from Amazon Linux 2 (AL2). HyperPod releases new AMIs regularly, and we recommend that you run all of your HyperPod clusters on the latest and most secure versions of AMIs to address vulnerabilities and phase out outdated software and libraries.

### Key upgrades
<a name="sagemaker-hyperpod-release-ami-eks-20250731-specs"></a>
+ **Operating System**: Amazon Linux 2023 (updated from Amazon Linux 2, or AL2)
+ **Package Manager**: DNF is the default package management tool, replacing YUM used in AL2
+ **Networking Service**: `systemd-networkd` manages network interfaces, replacing ISC `dhclient` used in AL2
+ **Linux Kernel**: Version 6.1, updated from the kernel used in AL2
+ **Glibc**: Version 2.34, updated from the version in AL2
+ **GCC**: Version 11.5.0, updated from the version in AL2
+ **NFS**: Version 1:2.6.1, updated from version 1:1.3.4 in AL2
+ **NVIDIA Driver**: Version 570.172.08, a newer driver version
+ **Python**: Version 3.9, replacing Python 2.7 used in AL2
+ **NVME**: Version 1.11.1, a newer version of the NVMe driver

### Before you upgrade
<a name="sagemaker-hyperpod-release-ami-eks-20250731-prereqs"></a>

There are a few important things to know before upgrading. With AL2023, several packages have been added, upgraded or removed compared to AL2. We strongly recommend that you test your applications with AL2023 before upgrading your clusters. For a comprehensive list of all package changes in AL2023, see [Package changes in Amazon Linux 2023](https://docs.aws.amazon.com/linux/al2023/release-notes/compare-packages.html).

The following are some of the significant changes between AL2 and AL2023:
+ **Python 3.10**: The most significant update, apart from the operating system, is the Python version upgrade. After upgrading, clusters have Python 3.10 as default. While some Python 3.8 distributed training workloads might be compatible with Python 3.10, we strongly recommend that you test your specific workloads separately. If migration to Python 3.10 proves challenging but you still want to upgrade your cluster for other new features, you can install an older Python version by using the command `yum install python-xx.x` with [lifecycle scripts](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-lifecycle-best-practices-slurm.html) before running any workloads. Ensure you test both your existing lifecycle scripts and application code for compatibility.
+ **NVIDIA runtime enforcement**: AL2023 strictly enforces the NVIDIA container runtime requirements, causing containers with hard-coded NVIDIA environment variables (such as `NVIDIA_VISIBLE_DEVICES: "all"`) to fail on CPU-only nodes (whereas AL2 ignored these settings when no GPU drivers are present). You can override the enforcement by setting `NVIDIA_VISIBLE_DEVICES: "void"` in your pod specification or by using CPU-only images.
+ **cgroup v2**: AL2023 features the next generation of unified control group hierarchy (cgroup v2). cgroup v2 is used for container runtimes and is also used by `systemd`. While AL2023 still includes code that can make the system run using cgroup v1, this isn't a recommended configuration.
+ **Amazon VPC CNI and `eksctl` versions**: AL2023 also requires your Amazon VPC CNI version to be 1.16.2 or greater and your `eksctl` version to be 0.176.0 or greater.
+ **EFA on FSx for Lustre**: You can now use EFA on FSx for Lustre, which enables you to achieve application performance comparable to on-premises AI/ML or HPC (high performance computing) clusters, while benefiting from the scalability, flexibility and elasticity of cloud computing.

Additionally, upgrading to AL2023 requires at least version `1.0.643.0_1.0.192.0` of the Health Monitoring Agent. Complete the following procedure to update the Health Monitoring Agent:

1. If you use HyperPod lifecycle scripts from the GitHub repository [awsome-distributed-training](https://github.com/aws-samples/awsome-distributed-training), make sure to pull the latest version. Earlier versions are not compatible with AL2023. The new lifecycle script ensures that `containerd` uses the additional mounted storage for pulling container images in AL2023.

1. Pull in the latest version of the [HyperPod CLI git repository](https://github.com/aws/sagemaker-hyperpod-cli/tree/main).

1. Update dependencies with the following command: `helm dependencies update helm_chart/HyperPodHelmChart`

1. As mentioned in step 4 of the [README of HyperPodHelmChart](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#step-four-whenever-you-want-to-upgrade-the-installation-of-helm-charts), run the following command to upgrade the version of the dependencies running on the cluster: `helm upgrade dependencies helm_chart/HyperPodHelmChart --namespace kube-system`

### Workloads that have been tested on upgraded EKS clusters
<a name="sagemaker-hyperpod-release-ami-eks-20250731-tested"></a>

The following are some use cases where the upgrade has been tested:
+ **Backwards compatibility**: Popular distributed training jobs involving PyTorch should be backwards compatible on the new AMI. However, since your workloads may depend on specific Python or Linux libraries, we recommend first testing on a smaller scale or subset of nodes before upgrading your larger clusters.
+ **Accelerator testing**: Jobs across various instance types, using both NVIDIA accelerators (the P and G instance families) and AWS Neuron accelerators (Trn instances), have been tested.

### How to upgrade your AMI and associated workloads
<a name="sagemaker-hyperpod-release-ami-eks-20250731-upgrade"></a>

You can upgrade to the new AMI using one of the following methods:
+ Use the [create-cluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) API to create a new cluster with the latest AMI.
+ Use the [update-cluster-software](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API to upgrade your existing cluster. Note that this option re-runs any lifecycle scripts.
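For example, you can start the in-place upgrade with the AWS CLI; the cluster name below is a placeholder for your own cluster:

```shell
# Upgrade an existing HyperPod cluster to the latest AMI.
# Replace my-hyperpod-cluster with the name of your cluster.
aws sagemaker update-cluster-software --cluster-name my-hyperpod-cluster
```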

The cluster is unavailable during the update process. We recommend planning for this downtime and restarting the training workload from an existing checkpoint after the upgrade completes. As a best practice, we recommend that you perform testing on a smaller cluster before upgrading your larger clusters.

If the update command fails, first identify the cause of the failure. For lifecycle script failures, make the necessary corrections to your scripts and retry. For any other issues that cannot be resolved, contact [AWS Support](https://aws.amazon.com/premiumsupport/).

### Troubleshooting
<a name="sagemaker-hyperpod-release-ami-eks-20250731-troubleshooting"></a>

Use the following section to help with troubleshooting any issues you encounter when upgrading to AL2023.

**How do I fix errors such as `"nvml error: driver not loaded: unknown"` on CPU-only cluster nodes?**

If containers that worked on CPU AL2 Amazon EKS nodes now fail on AL2023, your container image may have hard-coded NVIDIA environment variables. You can check for hard-coded environment variables with the following command:

```
docker inspect image:tag | grep -i nvidia
```

AL2023 strictly enforces these requirements whereas AL2 was more lenient on CPU-only nodes. One solution is to override the AL2023 enforcement by setting certain NVIDIA environment variables in your Amazon EKS pod specification, as shown in the following example:

```yaml
containers:
- name: your-container
  image: your-image:tag
  env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "void"
  - name: NVIDIA_DRIVER_CAPABILITIES
    value: ""
```

Alternatively, you can use CPU-only container images (such as `pytorch/pytorch:latest-cpu`) or build custom images without NVIDIA dependencies.
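If the image itself is clean, the variables may instead be set in the pod specification. One quick way to check is to print the declared container environment; the pod name and namespace below are placeholders:

```shell
# Print the env entries declared for every container in a pod's spec.
kubectl get pod your-pod -n your-namespace \
  -o jsonpath='{.spec.containers[*].env}'
```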

## SageMaker HyperPod AMI releases for Amazon EKS: July 15, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250715"></a>

**SageMaker HyperPod DLAMI for Amazon EKS support**

The AMIs include the following updates:

------
#### [ K8s v1.28 ]
+ **NVIDIA Driver:** 550.163.01
+ **CUDA:** 12.4
+ **EFA Installer:** 1.38.0
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.26.43.0_47cc904ea-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.16.2.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.16.1.0_0a6506a47-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.26.26.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.26.26.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.26.42.0_2ff3b5c7d-1
  + **aws-neuronx-tools.x86_64:** 2.24.54.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.29 ]
+ **NVIDIA Driver:** 550.163.01
+ **CUDA:** 12.4
+ **EFA Installer:** 1.38.0
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.26.43.0_47cc904ea-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.16.2.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.16.1.0_0a6506a47-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.26.26.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.26.26.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.26.42.0_2ff3b5c7d-1
  + **aws-neuronx-tools.x86_64:** 2.24.54.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.30 ]
+ **NVIDIA Driver:** 550.163.01
+ **CUDA:** 12.4
+ **EFA Installer:** 1.38.0
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.26.43.0_47cc904ea-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.16.2.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.16.1.0_0a6506a47-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.26.26.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.26.26.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.26.42.0_2ff3b5c7d-1
  + **aws-neuronx-tools.x86_64:** 2.24.54.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.31 ]
+ **NVIDIA Driver:** 550.163.01
+ **CUDA:** 12.4
+ **EFA Installer:** 1.38.0
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.26.43.0_47cc904ea-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.16.2.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.16.1.0_0a6506a47-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.26.26.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.26.26.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.26.42.0_2ff3b5c7d-1
  + **aws-neuronx-tools.x86_64:** 2.24.54.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.32 ]
+ **NVIDIA Driver:** 550.163.01
+ **CUDA:** 12.4
+ **EFA Installer:** 1.38.0
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.26.43.0_47cc904ea-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.16.2.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.16.1.0_0a6506a47-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.26.26.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.26.26.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.26.42.0_2ff3b5c7d-1
  + **aws-neuronx-tools.x86_64:** 2.24.54.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------

## SageMaker HyperPod AMI releases for Amazon EKS: June 09, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250609"></a>

**SageMaker HyperPod DLAMI for Amazon EKS support**

------
#### [ Neuron SDK Updates ]
+ **aws-neuronx-dkms.noarch:** 2.21.37.0 (from 2.20.74.0)

------

## SageMaker HyperPod AMI releases for Amazon EKS: May 22, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250522"></a>

**AMI general updates**

**SageMaker HyperPod DLAMI for Amazon EKS support**

------
#### [ Deep Learning Base AMI AL2 ]
+ **Latest NVIDIA Driver:** 550.163.01
+ **CUDA Stack updates:**
  + **Default CUDA:** 12.1
  + **NCCL Version:** 2.22.3
+ **EFA Installer:** 1.38.0
+ **AWS OFI NCCL:** 1.13.2
+ **Linux Kernel:** 5.10
+ **GDRCopy:** 2.4

**Important**  
+ **NVIDIA Container Toolkit 1.17.4:** mounting of CUDA compat libraries is now disabled.
+ **EFA 1.37 to 1.38:** the AWS OFI NCCL plugin is now located in `/opt/amazon/ofi-nccl`; the previous location `/opt/aws-ofi-nccl/` is deprecated.

------
#### [ Neuron SDK Updates ]
+ **aws-neuronx-dkms.noarch:** 2.20.74.0 (from 2.20.28.0)
+ **aws-neuronx-collectives.x86_64:** 2.25.65.0_9858ac9a1-1 (from 2.24.59.0_838c7fc8b-1)
+ **aws-neuronx-runtime-lib.x86_64:** 2.25.57.0_166c7a468-1 (from 2.24.53.0_f239092cc-1)
+ **aws-neuronx-tools.x86_64:** 2.23.9.0 (from 2.22.61.0)
+ **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.15.12.0 (from 0.14.12.0)
+ **aws-neuronx-gpsimd-tools.x86_64:** 0.15.1.0_5d31b6a3f (from 0.14.6.0_241eb69f4)
+ **aws-neuronx-k8-plugin.x86_64:** 2.25.24.0 (from 2.24.23.0)
+ **aws-neuronx-k8-scheduler.x86_64:** 2.25.24.0 (from 2.24.23.0)

**Support notes:**
+ AMI components, including CUDA versions, may be removed or changed based on the framework support policy.
+ The kernel version is pinned for compatibility. Avoid kernel updates unless they are required for security patches.
+ For EC2 instances with multiple network cards, refer to the EFA configuration guide for proper setup.

------

## SageMaker HyperPod AMI releases for Amazon EKS: May 07, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250507"></a>

------
#### [ Installed the latest version of AWS Neuron SDK ]
+ **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0

------

## SageMaker HyperPod AMI releases for Amazon EKS: April 28, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250428"></a>

**Improvements for K8s**
+ Upgraded NVIDIA driver from version 550.144.03 to 550.163.01. This upgrade is to address Common Vulnerabilities and Exposures (CVEs) present in the [NVIDIA GPU Display Security Bulletin for April 2025](https://nvidia.custhelp.com/app/answers/detail/a_id/5630).

**SageMaker HyperPod DLAMI for Amazon EKS support**

------
#### [ Installed the latest version of AWS Neuron SDK ]
+ **aws-neuronx-dkms.noarch:** 2.20.28.0-dkms
+ **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
+ **aws-neuronx-tools.x86_64:** 2.18.3.0-1
+ **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
+ **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
+ **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
+ **aws-neuron-runtime.x86_64:** 1.6.24.0-1
+ **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
+ **aws-neuron-tools.x86_64:** 2.1.4.0-1
+ **aws-neuronx-collectives.x86_64:** 2.24.59.0_838c7fc8b-1
+ **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
+ **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.14.12.0-1
+ **aws-neuronx-gpsimd-tools.x86_64:** 0.14.6.0_241eb69f4-1
+ **aws-neuronx-k8-plugin.x86_64:** 2.24.23.0-1
+ **aws-neuronx-k8-scheduler.x86_64:** 2.24.23.0-1
+ **aws-neuronx-runtime-lib.x86_64:** 2.24.53.0_f239092cc-1
+ **aws-neuronx-tools.x86_64:** 2.22.61.0-1
+ **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------

## SageMaker HyperPod AMI releases for Amazon EKS: April 18, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250418"></a>

**AMI general updates**
+ New SageMaker HyperPod AMI for Amazon EKS 1.32.1.

**SageMaker HyperPod DLAMI for Amazon EKS support**

The AMIs include the following:

------
#### [ Deep Learning EKS AMI 1.32.1 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.32.1
  + Containerd Version: 1.7.27
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.29
+ **Amazon SSM Agent:** 3.3.1611.0 
+ **Linux Kernel:** 5.10.235
+ **OSS Nvidia driver:** 550.163.01
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.38.0
+ **GDRCopy:** 2.4.1-1
+ **Nvidia container toolkit:** 1.17.6
+ **AWS OFI NCCL:** 1.13.2
+ **aws-neuronx-tools:** 2.18.3.0
+ **aws-neuronx-runtime-lib:** 2.24.53.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.20.28.0
+ **aws-neuronx-collectives:** 2.24.59.0

------

## SageMaker HyperPod AMI releases for Amazon EKS: February 18, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250218"></a>

**Improvements for K8s**
+ Upgraded Nvidia container toolkit from version 1.17.3 to version 1.17.4.
+ Fixed the issue where customers were unable to connect to nodes after a reboot.
+ Upgraded Elastic Fabric Adapter (EFA) version from 1.37.0 to 1.38.0.
+ EFA now includes the AWS OFI NCCL plugin in the `/opt/amazon/ofi-nccl` directory instead of the original `/opt/aws-ofi-nccl/` path. If you set the `LD_LIBRARY_PATH` environment variable, make sure it points to the new `/opt/amazon/ofi-nccl` location.
+ Removed the `emacs` package from these DLAMIs. You can install Emacs from the GNU project.
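If your training scripts export the deprecated plugin path through `LD_LIBRARY_PATH`, a minimal adjustment is the following (assuming the plugin's shared libraries live under `lib/` in the new prefix):

```shell
# EFA 1.38 moved the AWS OFI NCCL plugin; prepend the new location.
# The old /opt/aws-ofi-nccl/ path is deprecated.
export LD_LIBRARY_PATH="/opt/amazon/ofi-nccl/lib:${LD_LIBRARY_PATH:-}"
echo "$LD_LIBRARY_PATH"
```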

**SageMaker HyperPod DLAMI for Amazon EKS support**

------
#### [ Installed the latest version of the Neuron SDK ]
+ **aws-neuronx-dkms.noarch:** 2.19.64.0-dkms
+ **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
+ **aws-neuronx-tools.x86_64:** 2.18.3.0-1
+ **aws-neuronx-collectives.x86_64:** 2.23.135.0_3e70920f2-1
+ **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
+ **aws-neuronx-gpsimd-customop-lib.x86_64**
+ **aws-neuronx-gpsimd-tools.x86_64:** 0.13.2.0_94ba34927-1
+ **aws-neuronx-k8-plugin.x86_64:** 2.23.45.0-1
+ **aws-neuronx-k8-scheduler.x86_64:** 2.23.45.0-1
+ **aws-neuronx-runtime-lib.x86_64:** 2.23.112.0_9b5179492-1
+ **aws-neuronx-tools.x86_64:** 2.20.204.0-1
+ **tensorflow-model-server-neuronx.x86_64**

------

## SageMaker HyperPod AMI releases for Amazon EKS: January 22, 2025
<a name="sagemaker-hyperpod-release-ami-eks-20250122"></a>

**AMI general updates**
+ New SageMaker HyperPod AMI for Amazon EKS 1.31.2.

**SageMaker HyperPod DLAMI for Amazon EKS support**

The AMIs include the following:

------
#### [ Deep Learning EKS AMI 1.31 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.31.2
  + Containerd Version: 1.7.23
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.26
+ **Amazon SSM Agent:** 3.3.987
+ **Linux Kernel:** 5.10.230
+ **OSS Nvidia driver:** 550.127.05
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.37.0
+ **GDRCopy:** 2.4.1-1
+ **Nvidia container toolkit:** 1.17.3
+ **AWS OFI NCCL:** 1.13.0
+ **aws-neuronx-tools:** 2.18.3
+ **aws-neuronx-runtime-lib:** 2.23.112.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.18.20.0
+ **aws-neuronx-collectives:** 2.23.133.0

------

## SageMaker HyperPod AMI releases for Amazon EKS: December 21, 2024
<a name="sagemaker-hyperpod-release-ami-eks-20241221"></a>

**SageMaker HyperPod DLAMI for Amazon EKS support**

The AMIs include the following:

------
#### [ K8s v1.28 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.28.15
  + Containerd Version: 1.7.23
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.26
+ **Amazon SSM Agent:** 3.3.987
+ **Linux Kernel:** 5.10.228
+ **OSS NVIDIA driver:** 550.127.05
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.37.0
+ **GDRCopy:** 2.4
+ **NVIDIA container toolkit:** 1.17.3
+ **AWS OFI NCCL:** 1.13.0
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.23.112.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.18.20.0
+ **aws-neuronx-collectives:** 2.23.135.0

------
#### [ K8s v1.29 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.29.10
  + Containerd Version: 1.7.23
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.26
+ **Amazon SSM Agent:** 3.3.987
+ **Linux Kernel:** 5.15.0
+ **OSS Nvidia driver:** 550.127.05
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.37.0
+ **GDRCopy:** 2.4
+ **Nvidia container toolkit:** 1.17.3
+ **AWS OFI NCCL:** 1.13.0
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.23.112.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.18.20.0
+ **aws-neuronx-collectives:** 2.23.135.0

------
#### [ K8s v1.30 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.30.6
  + Containerd Version: 1.7.23
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.26
+ **Amazon SSM Agent:** 3.3.987.0
+ **Linux Kernel:** 5.10.228
+ **OSS Nvidia driver:** 550.127.05
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.37.0
+ **GDRCopy:** 2.4
+ **Nvidia container toolkit:** 1.17.3
+ **AWS OFI NCCL:** 1.13.0
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.23.112.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.18.20.0
+ **aws-neuronx-collectives:** 2.23.135.0

------

## SageMaker HyperPod AMI releases for Amazon EKS: December 13, 2024
<a name="sagemaker-hyperpod-release-ami-eks-20241213"></a>

**SageMaker HyperPod DLAMI for Amazon EKS upgrade**
+ Updated SSM Agent to version `3.3.1311.0`.

## SageMaker HyperPod AMI releases for Amazon EKS: November 24, 2024
<a name="sagemaker-hyperpod-release-ami-eks-20241124"></a>

**AMI general updates**
+ Released in `MEL` (Melbourne) Region.
+ Updated SageMaker HyperPod base DLAMI to the following versions:
  + Kubernetes: 2024-11-01.

## SageMaker HyperPod AMI releases for Amazon EKS: November 15, 2024
<a name="sagemaker-hyperpod-release-ami-eks-20241115"></a>

**SageMaker HyperPod DLAMI for Amazon EKS support**

The AMIs include the following:

------
#### [ Deep Learning EKS AMI 1.28 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.28.15
  + Containerd Version: 1.7.23
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.26
+ **Amazon SSM Agent:** 3.3.987
+ **Linux Kernel:** 5.10.228
+ **OSS NVIDIA driver:** 550.127.05
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.34.0
+ **GDRCopy:** 2.4
+ **NVIDIA container toolkit:** 1.17.3
+ **AWS OFI NCCL:** 1.11.0
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.22.19.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.18.20.0
+ **aws-neuronx-collectives:** 2.22.33.0

------
#### [ Deep Learning EKS AMI 1.29 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.29.10
  + Containerd Version: 1.7.23
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.26
+ **Amazon SSM Agent:** 3.3.987
+ **Linux Kernel:** 5.10.228
+ **OSS Nvidia driver:** 550.127.05
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.34.0
+ **GDRCopy:** 2.4
+ **Nvidia container toolkit:** 1.17.3
+ **AWS OFI NCCL:** 1.11.0
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.22.19.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.18.20.0
+ **aws-neuronx-collectives:** 2.22.33.0

------
#### [ Deep Learning EKS AMI 1.30 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.30.6
  + Containerd Version: 1.7.23
  + Runc Version: 1.1.14
  + AWS IAM Authenticator: 0.6.26
+ **Amazon SSM Agent:** 3.3.987
+ **Linux Kernel:** 5.10.228
+ **OSS Nvidia driver:** 550.127.05
+ **NVIDIA CUDA:** 12.4
+ **EFA Installer:** 1.34.0
+ **GDRCopy:** 2.4
+ **Nvidia container toolkit:** 1.17.3
+ **AWS OFI NCCL:** 1.11.0
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.22.19.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.18.20.0
+ **aws-neuronx-collectives:** 2.22.33.0

------

## SageMaker HyperPod AMI releases for Amazon EKS: November 11, 2024
<a name="sagemaker-hyperpod-release-ami-eks-20241111"></a>

**AMI general updates**
+ Updated SageMaker HyperPod DLAMI with Amazon EKS versions 1.28.13, 1.29.8, 1.30.4.

## SageMaker HyperPod AMI releases for Amazon EKS: October 21, 2024
<a name="sagemaker-hyperpod-release-ami-eks-20241021"></a>

**AMI general updates**
+ Updated SageMaker HyperPod base DLAMI to the following versions:
  + Amazon EKS: 1.28.11, 1.29.6, 1.30.2.

## SageMaker HyperPod AMI releases for Amazon EKS: September 10, 2024
<a name="sagemaker-hyperpod-release-ami-eks-20240910"></a>

**SageMaker HyperPod DLAMI for Amazon EKS support**

The AMIs include the following:

------
#### [ Deep Learning EKS AMI 1.28 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.28.11
  + Containerd Version: 1.7.20
  + Runc Version: 1.1.11
  + AWS IAM Authenticator: 0.6.21
+ **Amazon SSM Agent:** 3.3.380
+ **Linux Kernel:** 5.10.223
+ **OSS NVIDIA driver:** 535.183.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.32.0
+ **GDRCopy:** 2.4
+ **NVIDIA container toolkit:** 1.16.1
+ **AWS OFI NCCL:** 1.9.1
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.21.41.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.17.17.0
+ **aws-neuronx-collectives:** 2.21.46.0

------
#### [ Deep Learning EKS AMI 1.29 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.29.6
  + Containerd Version: 1.7.20
  + Runc Version: 1.1.11
  + AWS IAM Authenticator: 0.6.21
+ **Amazon SSM Agent:** 3.3.380
+ **Linux Kernel:** 5.10.223
+ **OSS Nvidia driver:** 535.183.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.32.0
+ **GDRCopy:** 2.4
+ **Nvidia container toolkit:** 1.16.1
+ **AWS OFI NCCL:** 1.9.1
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.21.41.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.17.17.0
+ **aws-neuronx-collectives:** 2.21.46.0

------
#### [ Deep Learning EKS AMI 1.30 ]
+ **Amazon EKS Components**
  + Kubernetes Version: 1.30.2
  + Containerd Version: 1.7.20
  + Runc Version: 1.1.11
  + AWS IAM Authenticator: 0.6.21
+ **Amazon SSM Agent:** 3.3.380
+ **Linux Kernel:** 5.10.223
+ **OSS Nvidia driver:** 535.183.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.32.0
+ **GDRCopy:** 2.4
+ **Nvidia container toolkit:** 1.16.1
+ **AWS OFI NCCL:** 1.9.1
+ **aws-neuronx-tools:** 2.18.3.0-1
+ **aws-neuronx-runtime-lib:** 2.21.41.0
+ **aws-neuronx-oci-hook:** 2.4.4.0-1
+ **aws-neuronx-dkms:** 2.17.17.0
+ **aws-neuronx-collectives:** 2.21.46.0

------

# Public AMI releases
<a name="sagemaker-hyperpod-release-public-ami"></a>

The following release notes track the latest updates for Amazon SageMaker HyperPod public AMI releases for Amazon EKS orchestration. Each release note includes a summarized list of packages pre-installed or pre-configured in the SageMaker HyperPod DLAMIs for Amazon EKS support. Each DLAMI is built on AL2023 and supports a specific Kubernetes version. For information about Amazon SageMaker HyperPod feature releases, see [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

This page is regularly updated to provide comprehensive AMI lifecycle management information, including security vulnerabilities, deprecation announcements, and patching recommendations. As part of a commitment to maintaining secure and up-to-date infrastructure, SageMaker AI continuously monitors all HyperPod public AMIs for critical vulnerabilities using automated scanning workflows. When critical security issues are identified, AMIs are systematically deprecated with appropriate migration guidance. Regular updates include Common Vulnerabilities and Exposures (CVE) remediation status, compliance findings, and recommended actions so that you can maintain secure HyperPod environments while minimizing operational disruption during AMI transitions.

## SageMaker HyperPod public AMI releases: August 04, 2025
<a name="sagemaker-hyperpod-release-public-ami-2025-08-04"></a>

Amazon SageMaker HyperPod now supports new public AMIs for Amazon EKS clusters. The AMIs include the following:

------
#### [ K8s v1.32 ]

AMI Name: HyperPod EKS 1.32 x86_64 AMI Amazon Linux 2 2025080407
+ **Amazon EKS Components**
  + Kubernetes Version: 1.32.3
  + Containerd Version: 1.7.23
  + Runc Version: 1.2.6
  + AWS IAM Authenticator: 0.6.29
+ **Amazon SSM Agent:** 3.3.2299.0
+ **Linux Kernel:** 5.10.238-234.956.amzn2.x86_64
+ **OSS NVIDIA driver:** 550.163.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.38.0
+ **GDRCopy:** 2.4.1
+ **NVIDIA container toolkit:** 1.17.8
+ **AWS OFI NCCL:** 1.13.0-aws
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.27.34.0_ec8cd5e8b-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.17.1.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.17.0.0_aacc27699-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.27.7.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.27.23.0_8deec4dbf-1
  + **aws-neuronx-tools.x86_64:** 2.25.145.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.30 ]

AMI Name: HyperPod EKS 1.30 x86_64 AMI Amazon Linux 2 2025080407
+ **Amazon EKS Components**
  + Kubernetes Version: 1.30.11
  + Containerd Version: 1.7._
  + Runc Version: 1.2.6
  + AWS IAM Authenticator: 0.6.28
+ **Amazon SSM Agent:** 3.3.2299.0
+ **Linux Kernel:** 5.10.238-234.956.amzn2.x86_64
+ **OSS NVIDIA driver:** 550.163.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.38.0
+ **GDRCopy:** 2.4.1
+ **NVIDIA container toolkit:** 1.17.8
+ **AWS OFI NCCL:** 1.13.0-aws
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.27.34.0_ec8cd5e8b-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.17.1.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.17.0.0_aacc27699-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.27.7.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.27.23.0_8deec4dbf-1
  + **aws-neuronx-tools.x86_64:** 2.25.145.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.31 ]

AMI Name: HyperPod EKS 1.31 x86_64 AMI Amazon Linux 2 2025080407
+ **Amazon EKS Components**
  + Kubernetes Version: 1.31.7
  + Containerd Version: 1.7._
  + Runc Version: 1.2.6
  + AWS IAM Authenticator: 0.6.28
+ **Amazon SSM Agent:** 3.3.2299.0
+ **Linux Kernel:** 5.10.238-234.956.amzn2.x86_64
+ **OSS NVIDIA driver:** 550.163.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.38.0
+ **GDRCopy:** 2.4.1
+ **NVIDIA container toolkit:** 1.17.8
+ **AWS OFI NCCL:** 1.13.0-aws
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.27.34.0_ec8cd5e8b-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.17.1.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.17.0.0_aacc27699-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.27.7.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.27.23.0_8deec4dbf-1
  + **aws-neuronx-tools.x86_64:** 2.25.145.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.29 ]

AMI Name: HyperPod EKS 1.29 x86_64 AMI Amazon Linux 2 2025080407
+ **Amazon EKS Components**
  + Kubernetes Version: 1.29.15
  + Containerd Version: 1.7._
  + Runc Version: 1.2.6
  + AWS IAM Authenticator: 0.6.28
+ **Amazon SSM Agent:** 3.3.2299.0
+ **Linux Kernel:** 5.10.238-234.956.amzn2.x86_64
+ **OSS NVIDIA driver:** 550.163.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.38.0
+ **GDRCopy:** 2.4.1
+ **NVIDIA container toolkit:** 1.17.8
+ **AWS OFI NCCL:** 1.13.0-aws
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.27.34.0_ec8cd5e8b-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.17.1.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.17.0.0_aacc27699-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.27.7.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.27.23.0_8deec4dbf-1
  + **aws-neuronx-tools.x86_64:** 2.25.145.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------
#### [ K8s v1.28 ]

AMI Name: HyperPod EKS 1.28 x86_64 AMI Amazon Linux 2 2025080407
+ **Amazon EKS Components**
  + Kubernetes Version: 1.28.15
  + Containerd Version: 1.7._
  + Runc Version: 1.2.6
  + AWS IAM Authenticator: 0.6.28
+ **Amazon SSM Agent:** 3.3.2299.0
+ **Linux Kernel:** 5.10.238-234.956.amzn2.x86_64
+ **OSS NVIDIA driver:** 550.163.01
+ **NVIDIA CUDA:** 12.2
+ **EFA Installer:** 1.38.0
+ **GDRCopy:** 2.4.1
+ **NVIDIA container toolkit:** 1.17.8
+ **AWS OFI NCCL:** 1.13.0-aws
+ **Neuron packages:**
  + **aws-neuronx-dkms.noarch:** 2.22.2.0-dkms
  + **aws-neuronx-oci-hook.x86_64:** 2.4.4.0-1
  + **aws-neuronx-tools.x86_64:** 2.18.3.0-1
  + **aws-neuron-dkms.noarch:** 2.3.26.0-dkms
  + **aws-neuron-k8-plugin.x86_64:** 1.9.3.0-1
  + **aws-neuron-k8-scheduler.x86_64:** 1.9.3.0-1
  + **aws-neuron-runtime.x86_64:** 1.6.24.0-1
  + **aws-neuron-runtime-base.x86_64:** 1.6.21.0-1
  + **aws-neuron-tools.x86_64:** 2.1.4.0-1
  + **aws-neuronx-collectives.x86_64:** 2.27.34.0_ec8cd5e8b-1
  + **aws-neuronx-gpsimd-customop.x86_64:** 0.2.3.0-1
  + **aws-neuronx-gpsimd-customop-lib.x86_64:** 0.17.1.0-1
  + **aws-neuronx-gpsimd-tools.x86_64:** 0.17.0.0_aacc27699-1
  + **aws-neuronx-k8-plugin.x86_64:** 2.27.7.0-1
  + **aws-neuronx-k8-scheduler.x86_64:** 2.27.7.0-1
  + **aws-neuronx-runtime-lib.x86_64:** 2.27.23.0_8deec4dbf-1
  + **aws-neuronx-tools.x86_64:** 2.25.145.0-1
  + **tensorflow-model-server-neuron.x86_64:** 2.8.0.2.3.0.0-0
  + **tensorflow-model-server-neuronx.x86_64:** 2.10.1.2.12.2.0-0

------