

# Orchestrating SageMaker HyperPod clusters with Amazon EKS
<a name="sagemaker-hyperpod-eks"></a>

SageMaker HyperPod is a SageMaker AI-managed service that enables large-scale training of foundation models on long-running and resilient compute clusters, integrating with Amazon EKS for orchestrating the HyperPod compute resources. You can run uninterrupted training jobs spanning weeks or months at scale using Amazon EKS clusters with HyperPod resiliency features that check for various hardware failures and automatically recover faulty nodes. 

Key features for cluster admin users include the following.
+ Provisioning resilient HyperPod clusters and attaching them to an EKS control plane
+ Enabling dynamic capacity management, such as adding more nodes, updating software, and deleting clusters
+ Enabling access to the cluster instances directly through `kubectl` or SSM/SSH
+ Offering [resiliency capabilities](sagemaker-hyperpod-eks-resiliency.md), including basic health checks, deep health checks, a health-monitoring agent, and support for PyTorch job auto-resume
+ Integrating with observability tools such as [Amazon CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html), [Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html), and [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html)
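For example, cluster admins can open an interactive shell on a HyperPod node through AWS Systems Manager (SSM). The sketch below assembles the SSM session target from hypothetical identifiers (`CLUSTER_ID`, `INSTANCE_GROUP`, and `INSTANCE_ID` are placeholders to replace with your own values) and prints the resulting command rather than running it, since an actual session requires the Session Manager plugin and IAM permissions.

```shell
# Placeholders: substitute your own cluster ID, instance group name, and instance ID.
CLUSTER_ID="abc123example"
INSTANCE_GROUP="worker-group-1"
INSTANCE_ID="i-0123456789abcdef0"

# HyperPod SSM targets follow the sagemaker-cluster:<cluster-id>_<group>-<instance-id> pattern.
TARGET="sagemaker-cluster:${CLUSTER_ID}_${INSTANCE_GROUP}-${INSTANCE_ID}"

# Print the command to run against your own cluster.
echo "aws ssm start-session --target ${TARGET}"
```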

For data scientist users, EKS support in HyperPod enables the following.
+ Running containerized workloads for training foundation models on the HyperPod cluster
+ Running inference on the EKS cluster, leveraging the integration between HyperPod and EKS
+ Leveraging the job auto-resume capability for [Kubeflow PyTorch training (PyTorchJob)](https://www.kubeflow.org/docs/components/training/user-guides/pytorch/)

**Note**  
Amazon EKS enables user-managed orchestration of tasks and infrastructure on SageMaker HyperPod through the Amazon EKS Control Plane. Ensure that user access to the cluster through the Kubernetes API Server endpoint follows the principle of least-privilege, and that network egress from the HyperPod cluster is secured.  
To learn more about securing access to the Amazon EKS API Server, see [Control network access to cluster API server endpoint](https://docs.aws.amazon.com/eks/latest/userguide/cluster-endpoint.html).  
To learn more about securing network access on HyperPod, see [Setting up SageMaker HyperPod with a custom Amazon VPC](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-optional-vpc).

The high-level architecture of Amazon EKS support in HyperPod involves a 1-to-1 mapping between an EKS cluster (control plane) and a HyperPod cluster (worker nodes) within a VPC, as shown in the following diagram.

![\[EKS and HyperPod VPC architecture with control plane, cluster nodes, and AWS services.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-eks-diagram.png)


# Managing SageMaker HyperPod clusters orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-operate"></a>

This section provides guidance on managing SageMaker HyperPod through the SageMaker AI console UI or the AWS Command Line Interface (CLI). It explains how to perform various tasks related to SageMaker HyperPod, whether you prefer a visual interface or working with commands.

**Topics**
+ [Getting started with Amazon EKS support in SageMaker HyperPod](sagemaker-hyperpod-eks-prerequisites.md)
+ [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md)
+ [Setting up Kubernetes role-based access control](sagemaker-hyperpod-eks-setup-rbac.md)
+ [Custom Amazon Machine Images (AMIs) for SageMaker HyperPod clusters](hyperpod-custom-ami-support.md)
+ [Managing SageMaker HyperPod EKS clusters using the SageMaker console](sagemaker-hyperpod-eks-operate-console-ui.md)
+ [Creating SageMaker HyperPod clusters using CloudFormation templates](smcluster-getting-started-eks-console-create-cluster-cfn.md)
+ [Managing SageMaker HyperPod EKS clusters using the AWS CLI](sagemaker-hyperpod-eks-operate-cli-command.md)
+ [HyperPod managed tiered checkpointing](managed-tier-checkpointing.md)
+ [SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md)
+ [Usage reporting for cost attribution in SageMaker HyperPod](sagemaker-hyperpod-usage-reporting.md)
+ [Configuring storage for SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-setup-storage.md)
+ [Using the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters](sagemaker-hyperpod-eks-ebs.md)
+ [Configuring custom Kubernetes labels and taints in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-custom-labels-and-taints.md)

# Getting started with Amazon EKS support in SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-prerequisites"></a>

In addition to the general [Prerequisites for using SageMaker HyperPod](sagemaker-hyperpod-prerequisites.md), check the following requirements and considerations for orchestrating SageMaker HyperPod clusters using Amazon EKS.

**Important**  
You can set up resources configuration for creating SageMaker HyperPod clusters using the AWS Management Console and CloudFormation. For more information, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md) and [Creating SageMaker HyperPod clusters using CloudFormation templates](smcluster-getting-started-eks-console-create-cluster-cfn.md).

**Requirements**

**Note**  
Before creating a HyperPod cluster, you need a running Amazon EKS cluster that is configured with an Amazon VPC and has the required packages installed using Helm.
+ If using the SageMaker AI console, you can create an Amazon EKS cluster within the HyperPod cluster console page. For more information, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md).
+ If using AWS CLI, you should create an Amazon EKS cluster before creating a HyperPod cluster to associate with. For more information, see [Create an Amazon EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html) in the Amazon EKS User Guide.

When provisioning your Amazon EKS cluster, consider the following:

1. **Kubernetes version support**
   + SageMaker HyperPod supports Kubernetes versions 1.28, 1.29, 1.30, 1.31, 1.32, 1.33, and 1.34.

1. **Amazon EKS cluster authentication mode**
   + The Amazon EKS cluster authentication modes supported by SageMaker HyperPod are `API` and `API_AND_CONFIG_MAP`.

1. **Networking**
   + SageMaker HyperPod requires the Amazon VPC Container Network Interface (CNI) plug-in version 1.18.3 or later.
**Note**  
[AWS VPC CNI plugin for Kubernetes](https://github.com/aws/amazon-vpc-cni-k8s) is the only CNI supported by SageMaker HyperPod.
   + The [type of the subnet](https://docs.aws.amazon.com/vpc/latest/userguide/configure-subnets.html#subnet-types) in your VPC must be private for HyperPod clusters.

1. **IAM roles**
   + Ensure the necessary IAM roles for HyperPod are set up as guided in the [AWS Identity and Access Management for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md) section.

1. **Amazon EKS cluster add-ons**
   + You can continue using the various add-ons provided by Amazon EKS such as [Kube-proxy](https://docs.aws.amazon.com/eks/latest/userguide/add-ons-kube-proxy.html), [CoreDNS](https://docs.aws.amazon.com/eks/latest/userguide/add-ons-coredns.html), the [Amazon VPC Container Network Interface (CNI)](https://docs.aws.amazon.com/eks/latest/userguide/add-ons-vpc-cni.html) plugin, Amazon EKS pod identity, the GuardDuty agent, the Amazon FSx Container Storage Interface (CSI) driver, the Mountpoint for Amazon S3 CSI driver, the AWS Distro for OpenTelemetry, and the CloudWatch Observability agent.
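Given the CNI requirement above, it can help to sanity-check the plug-in version before attaching HyperPod. The snippet below is a minimal sketch of a semantic version comparison using `sort -V`; the `INSTALLED` value is a stand-in for the version you would read from the `aws-node` daemonset image tag on your own cluster.

```shell
# Minimal sketch: compare an installed VPC CNI version against the 1.18.3 minimum.
# In practice you would read the installed version from your cluster, for example from
# the aws-node daemonset's container image tag.
MIN_CNI="1.18.3"
INSTALLED="1.18.5"   # example value; substitute your cluster's actual version

# sort -V orders version strings; if the minimum sorts first, the requirement is met.
if [ "$(printf '%s\n%s\n' "$MIN_CNI" "$INSTALLED" | sort -V | head -n1)" = "$MIN_CNI" ]; then
  echo "VPC CNI version OK"
else
  echo "VPC CNI version too old" >&2
fi
```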

**Considerations for configuring SageMaker HyperPod clusters with Amazon EKS**
+ You must use distinct IAM roles based on the type of your nodes. For HyperPod nodes, use a role based on [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod). For Amazon EKS nodes, see [Amazon EKS node IAM role](https://docs.aws.amazon.com/eks/latest/userguide/create-node-role.html).
+ You can provision and mount additional Amazon EBS volumes on SageMaker HyperPod nodes using two approaches: use [InstanceStorageConfigs](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceGroupSpecification.html#sagemaker-Type-ClusterInstanceGroupSpecification-InstanceStorageConfigs) for cluster-level volume provisioning (available when creating or updating instance groups), or use the Amazon Elastic Block Store (Amazon EBS) Container Storage Interface (CSI) driver for dynamic pod-level volume management. With `InstanceStorageConfigs`, set the [local path](https://kubernetes.io/docs/concepts/storage/volumes/#local) to `/opt/sagemaker` to properly mount the volumes to your Amazon EKS pods. For information about how to deploy the [Amazon EBS CSI](https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html) controller on HyperPod nodes, see [Using the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters](sagemaker-hyperpod-eks-ebs.md).
+ If you use instance-type labels for defining scheduling constraints, ensure that you use the SageMaker AI ML instance types prefixed with `ml.`. For example, for P5 instances, use `ml.p5.48xlarge` instead of `p5.48xlarge`.
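For instance, a pod spec that pins a workload to P5 nodes would use the `ml.`-prefixed type in its `nodeSelector`. The fragment below is a hypothetical example (the pod name, container name, and image are placeholders); it assumes HyperPod nodes carry the standard `node.kubernetes.io/instance-type` label.

```shell
# Write a hypothetical pod spec fragment; note the ml.-prefixed instance type.
cat <<'EOF' > gpu-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: ml.p5.48xlarge
  containers:
  - name: train
    image: <your-training-image>
EOF
echo "wrote gpu-pod.yaml"
```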

**Considerations for configuring network for SageMaker HyperPod clusters with Amazon EKS**
+ Each HyperPod cluster instance supports one Elastic Network Interface (ENI). For the maximum number of Pods per instance type, refer to the following table.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ Only Pods with `hostNetwork = true` have access to the Amazon EC2 Instance Metadata Service (IMDS) by default. Use the Amazon EKS Pod identity or the [IAM roles for service accounts (IRSA)](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html) to manage access to the AWS credentials for Pods.
+ EKS-orchestrated HyperPod clusters support both IPv4 and IPv6 addressing modes. You can configure IPv6 when using an IPv6 Amazon EKS cluster in an IPv6-enabled VPC and subnet environment. For more information, see [Setting up SageMaker HyperPod with a custom Amazon VPC](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-optional-vpc).
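As a sketch of the Pod Identity approach mentioned above, the script below associates an IAM role with a Kubernetes service account so pods receive AWS credentials without IMDS access. It is written to a file here rather than executed, since running it requires AWS credentials and an existing cluster; the cluster name, namespace, service account, and role ARN are all hypothetical placeholders.

```shell
# Hypothetical values throughout; requires AWS credentials and an existing EKS cluster to run.
cat <<'EOF' > pod-identity.sh
#!/bin/sh
# Grant pods using the "training-sa" service account AWS credentials via EKS Pod Identity.
aws eks create-pod-identity-association \
    --cluster-name <eks-cluster-name> \
    --namespace kubeflow \
    --service-account training-sa \
    --role-arn arn:aws:iam::<AWS_ACCOUNT_ID>:role/TrainingPodRole
EOF
chmod +x pod-identity.sh
```

Note that EKS Pod Identity also requires the Pod Identity Agent add-on to be installed on the cluster.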

**Considerations for using the HyperPod cluster resiliency features**
+ Node auto-replacement is not supported for CPU instances.
+ The HyperPod health monitoring agent needs to be installed for node auto-recovery to work. The agent can be installed using Helm. For more information, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md).
+ The HyperPod deep health checks and the health monitoring agent support GPU and Trn instances.
+ SageMaker AI applies the following taint to nodes when they are undergoing deep health checks:

  ```
  effect: NoSchedule
  key: sagemaker.amazonaws.com/node-health-status
  value: Unschedulable
  ```
**Note**  
You cannot add custom taints to nodes in instance groups with `DeepHealthChecks` turned on.
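If you run system pods (for example, monitoring daemonsets) that must stay scheduled on nodes while deep health checks are in progress, they would need a toleration matching that taint. The fragment below is a sketch mirroring the taint values shown above; whether to tolerate the taint at all is a design choice for your workload.

```shell
# Write a toleration fragment that mirrors the deep-health-check taint shown above.
cat <<'EOF' > toleration-snippet.yaml
tolerations:
- key: sagemaker.amazonaws.com/node-health-status
  operator: Equal
  value: Unschedulable
  effect: NoSchedule
EOF
echo "wrote toleration-snippet.yaml"
```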

Once your Amazon EKS cluster is running, configure your cluster using the Helm package manager as instructed in [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md) before creating your HyperPod cluster.

# Installing packages on the Amazon EKS cluster using Helm
<a name="sagemaker-hyperpod-eks-install-packages-using-helm-chart"></a>

Before creating a SageMaker HyperPod cluster and attaching it to an Amazon EKS cluster, you should install packages using [Helm](https://helm.sh/), a package manager for Kubernetes. Helm is an open-source tool for setting up an installation process for Kubernetes clusters. It enables the automation and streamlining of dependency installations and simplifies various setups needed for preparing the Amazon EKS cluster as the orchestrator (control plane) for a SageMaker HyperPod cluster.

The SageMaker HyperPod service team provides a Helm chart package, which bundles key dependencies such as device and EFA plug-ins, the [Kubeflow Training Operator](https://www.kubeflow.org/docs/components/training/), and associated permission configurations.

**Important**  
This Helm installation step is required. If you set up your Amazon EKS cluster using the [AWS Management Console](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md) or [CloudFormation](smcluster-getting-started-eks-console-create-cluster-cfn.md), you can skip this step because the installation is handled automatically during the setup process. If you set up the cluster directly using the APIs, use the provided Helm chart to configure your Amazon EKS cluster. Failure to configure your Amazon EKS cluster using the provided Helm chart might result in the SageMaker HyperPod cluster not functioning correctly or the creation process failing entirely. The `aws-hyperpod` namespace name cannot be modified.

1. [Install Helm](https://helm.sh/docs/intro/install/) on your local machine.

1. Download the Helm charts provided by SageMaker HyperPod located at `helm_chart/HyperPodHelmChart` in the [SageMaker HyperPod CLI repository](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart).

   ```
   git clone https://github.com/aws/sagemaker-hyperpod-cli.git
   cd sagemaker-hyperpod-cli/helm_chart
   ```

1. Update the dependencies of the Helm chart, preview the changes that will be made to your Kubernetes cluster, and install the Helm chart.

   ```
   helm dependencies update HyperPodHelmChart
   ```

   ```
   helm install hyperpod-dependencies HyperPodHelmChart --namespace kube-system --dry-run
   ```

   ```
   helm install hyperpod-dependencies HyperPodHelmChart --namespace kube-system
   ```
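After the install completes, a quick way to confirm the release and its resiliency components is to list the Helm release and the pods in the `aws-hyperpod` namespace. The sketch below writes these checks to a script rather than running them, since they require `helm` and `kubectl` configured against your cluster.

```shell
# Requires kubectl and helm configured against your EKS cluster to run.
cat <<'EOF' > verify-helm.sh
#!/bin/sh
helm list --namespace kube-system           # expect: hyperpod-dependencies deployed
kubectl get pods --namespace aws-hyperpod   # expect: health-monitoring agent pods running
EOF
chmod +x verify-helm.sh
```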

In summary, the Helm installation sets up various components for your Amazon EKS cluster, including job scheduling and queueing (Kueue), storage management, MLflow integration, and Kubeflow. Additionally, the charts install the following required components for integrating with the SageMaker HyperPod cluster resiliency features.
+ **Health monitoring agent** – This installs the health-monitoring agent provided by SageMaker HyperPod. It is required if you want your HyperPod cluster to be monitored. Health-monitoring agents are provided as Docker images, and the image is preset in the `values.yaml` file provided with the Helm charts. The agent supports GPU-based instances and Trainium-accelerator-based instances (`trn1`, `trn1n`, `inf2`). It is installed to the `aws-hyperpod` namespace. To find your supported URI, see [Supported Regions and their ECR URIs in the sagemaker-hyperpod-cli repository on GitHub](https://github.com/aws/sagemaker-hyperpod-cli/blob/main/helm_chart/readme.md#6-notes).
+ **Deep health check** – This sets up a `ClusterRole`, a `ServiceAccount` (`deep-health-check-service-account`) in the `aws-hyperpod` namespace, and a `ClusterRoleBinding` to enable the SageMaker HyperPod deep health check feature. For more information about the Kubernetes RBAC file for deep health check, see the configuration file [deep-health-check-rbac.yaml](https://github.com/aws/sagemaker-hyperpod-cli/blob/main/helm_chart/HyperPodHelmChart/charts/deep-health-check/templates/deep-health-check-rbac.yaml) in the SageMaker HyperPod CLI GitHub repository.
+ **`job-auto-restart`** – This sets up a `ClusterRole`, a `ServiceAccount` (`job-auto-restart`) in the `aws-hyperpod` namespace, and a `ClusterRoleBinding` to enable the auto-restart feature for PyTorch training jobs in SageMaker HyperPod. For more information about the Kubernetes RBAC file for `job-auto-restart`, see the configuration file [job-auto-restart-rbac.yaml](https://github.com/aws/sagemaker-hyperpod-cli/blob/main/helm_chart/HyperPodHelmChart/charts/job-auto-restart/templates/job-auto-restart-rbac.yaml) in the SageMaker HyperPod CLI GitHub repository.
+ **Kubeflow MPI operator** – The [MPI Operator](https://github.com/kubeflow/mpi-operator) is a Kubernetes operator that simplifies running distributed Machine Learning (ML) and High-Performance Computing (HPC) workloads using the Message Passing Interface (MPI) on Kubernetes clusters. It installs MPI Operator v0.5. It is installed to the `mpi-operator` namespace.
+ **`nvidia-device-plugin`** – This is a Kubernetes device plug-in that allows you to automatically expose NVIDIA GPUs for consumption by containers in your Amazon EKS cluster. It allows Kubernetes to allocate and provide access to the requested GPUs for that container. Required when using an instance type with GPU.
+ **`neuron-device-plugin`** – This is a Kubernetes device plug-in that allows you to automatically expose AWS Inferentia chips for consumption by containers in your Amazon EKS cluster. It allows Kubernetes to access and utilize the AWS Inferentia chips on the cluster nodes. Required when using a Neuron instance type.
+ **`aws-efa-k8s-device-plugin`** – This is a Kubernetes device plug-in that enables the use of AWS Elastic Fabric Adapter (EFA) on Amazon EKS clusters. EFA is a network device that provides low-latency and high-throughput communication between instances in a cluster. Required when using an EFA supported instance type.

For more information about the installation procedure using the provided Helm charts, see the [README file in the SageMaker HyperPod CLI repository](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart).

# Setting up Kubernetes role-based access control
<a name="sagemaker-hyperpod-eks-setup-rbac"></a>

Cluster admin users also need to set up [Kubernetes role-based access control (RBAC)](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) for data scientist users to use the [SageMaker HyperPod CLI](https://github.com/aws/sagemaker-hyperpod-cli) to run workloads on HyperPod clusters orchestrated with Amazon EKS.

## Option 1: Set up RBAC using Helm chart
<a name="sagemaker-hyperpod-eks-setup-rbac-helm"></a>

The SageMaker HyperPod service team provides a Helm sub-chart for setting up RBAC. To learn more, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md).

## Option 2: Set up RBAC manually
<a name="sagemaker-hyperpod-eks-setup-rbac-manual"></a>

Create a `ClusterRole` and `ClusterRoleBinding` with minimum privileges, and create a `Role` and `RoleBinding` with mutation permissions scoped to a namespace.

**To create `ClusterRole` & `ClusterRoleBinding` for data scientist IAM role**

Create a cluster-level configuration file `cluster_level_config.yaml` as follows.

```
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: hyperpod-scientist-user-cluster-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: hyperpod-scientist-user-cluster-role-binding
subjects:
- kind: Group
  name: hyperpod-scientist-user-cluster-level
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: hyperpod-scientist-user-cluster-role # this must match the name of the Role or ClusterRole you wish to bind to
  apiGroup: rbac.authorization.k8s.io
```

Apply the configuration to the EKS cluster.

```
kubectl apply -f cluster_level_config.yaml
```
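To confirm the binding behaves as intended, you can impersonate the scientist group with `kubectl auth can-i`. The sketch below writes the checks to a script, since running them requires access to your cluster; the impersonated user name is a placeholder.

```shell
# Requires kubectl configured against your EKS cluster to run.
cat <<'EOF' > check-rbac.sh
#!/bin/sh
GROUP="hyperpod-scientist-user-cluster-level"
kubectl auth can-i list pods --as=scientist-user --as-group="$GROUP"    # expect: yes
kubectl auth can-i delete pods --as=scientist-user --as-group="$GROUP"  # expect: no
EOF
chmod +x check-rbac.sh
```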

**To create `Role` and `RoleBinding` in a namespace**

This is the namespace where the training operator runs training jobs and that the resiliency features monitor by default. Job auto-resume is supported only in the `kubeflow` namespace or in namespaces prefixed with `aws-hyperpod`.

Create a role configuration file `namespace_level_role.yaml` as follows. This example creates a role in the `kubeflow` namespace.

```
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: kubeflow
  name: hyperpod-scientist-user-namespace-level-role
###
#  1) add/list/describe/delete pods
#  2) get/list/watch/create/patch/update/delete/describe kubeflow pytorch job
#  3) get pod log
###
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "get"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["get", "create"]
- apiGroups: ["kubeflow.org"]
  resources: ["pytorchjobs", "pytorchjobs/status"]
  verbs: ["get", "list", "create", "delete", "update", "describe"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["create", "update", "get", "list", "delete"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create", "get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: kubeflow
  name: hyperpod-scientist-user-namespace-level-role-binding
subjects:
- kind: Group
  name: hyperpod-scientist-user-namespace-level
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: hyperpod-scientist-user-namespace-level-role # this must match the name of the Role or ClusterRole you wish to bind to
  apiGroup: rbac.authorization.k8s.io
```

Apply the configuration to the EKS cluster.

```
kubectl apply -f namespace_level_role.yaml
```

## Create an access entry for Kubernetes groups
<a name="sagemaker-hyperpod-eks-setup-rbac-access-entry"></a>

After you have set up RBAC using one of the two options above, run the following sample command, replacing the placeholder values with your own information.

```
aws eks create-access-entry \
    --cluster-name <eks-cluster-name> \
    --principal-arn arn:aws:iam::<AWS_ACCOUNT_ID_SCIENTIST_USER>:role/ScientistUserRole \
    --kubernetes-groups '["hyperpod-scientist-user-namespace-level","hyperpod-scientist-user-cluster-level"]'
```

For the `principal-arn` parameter, use the IAM role of your scientist users. For more information, see [IAM users for scientists](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-user).

# Custom Amazon Machine Images (AMIs) for SageMaker HyperPod clusters
<a name="hyperpod-custom-ami-support"></a>

Using base Amazon Machine Images (AMIs) provided and made public by Amazon SageMaker HyperPod, you can build custom AMIs. With a custom AMI, you can create specialized environments for AI workloads with pre-configured software stacks, driver customizations, proprietary dependencies, and security agents. This capability eliminates the need for complex post-launch bootstrapping using lifecycle configuration scripts.

With custom AMIs, you can standardize environments across different stages, accelerate startup times, and have full control over your runtime environment while leveraging SageMaker HyperPod's infrastructure capabilities and scaling advantages. This helps you maintain control over your AI infrastructure while still benefiting from SageMaker HyperPod's optimized base runtime.

You can build upon the SageMaker HyperPod performance-tuned base images by adding security agents, compliance tools, and specialized libraries while preserving all the distributed training benefits. This capability removes the previously required choice between infrastructure optimization and organizational security policies.

The custom AMI experience integrates seamlessly with established enterprise security workflows. Security teams build hardened images using SageMaker HyperPod's public AMIs as a base, and AI platform teams can specify these custom AMIs when creating or updating clusters through the SageMaker HyperPod APIs. The APIs validate image compatibility, handle necessary permissions, and maintain backwards compatibility so existing workflows continue functioning. Organizations with stringent security protocols can eliminate the error-prone alternative of installing security agents at runtime through lifecycle scripts. By aligning with enterprise security practices rather than forcing organizations to adapt their protocols to SageMaker HyperPod's limitations, custom AMIs remove a common barrier to adoption for security-conscious organizations running critical AI workloads.

For release notes on updates to the public AMIs, see [Public AMI releases](sagemaker-hyperpod-release-public-ami.md). To learn how to get started with building a custom AMI and using it in your HyperPod clusters, see the following topics.

**Topics**
+ [Build a custom AMI](hyperpod-custom-ami-how-to.md)
+ [Cluster management with custom AMIs](hyperpod-custom-ami-cluster-management.md)

# Build a custom AMI
<a name="hyperpod-custom-ami-how-to"></a>

The following page explains how to build a custom Amazon Machine Image (AMI) using Amazon SageMaker HyperPod base AMIs. You begin by selecting a base AMI, and then you create your own customized AMI using any of the common methods for creating new images, such as the AWS CLI.

## Select a SageMaker HyperPod base AMI
<a name="hyperpod-custom-ami-select-base"></a>

You can select a SageMaker HyperPod base AMI through one of the following methods.

### AWS console selection
<a name="hyperpod-custom-ami-console-selection"></a>

You can select public SageMaker HyperPod AMIs through the AWS console or by using the `DescribeImages` API call. SageMaker HyperPod AMIs are public and visible in every AWS account. You can find them in the Amazon EC2 AMI catalog by applying a filter to search for public AMIs owned by Amazon.

To find SageMaker HyperPod AMIs in the console:

1. Sign in to the Amazon EC2 console.

1. In the left navigation pane, choose **AMIs**.

1. For the **Image type** dropdown, select **Public images**.

1. In the search bar filters, set the **Owner alias** filter to **amazon**.

1. Search for AMIs prefixed with **HyperPod EKS** and select the AMI (preferably the latest) that works for your use case. For instance, you can choose between an AMI for Kubernetes 1.31 and one for Kubernetes 1.30.

### Fetch latest public AMI ID through the AWS CLI
<a name="hyperpod-custom-ami-cli-fetch"></a>

If you want to always use the latest released public AMI, it is more efficient to use the public SageMaker HyperPod SSM parameter, which contains the AMI ID of the latest release.

The following example shows how to retrieve the latest AMI ID using the AWS CLI:

```
aws ssm get-parameter \
  --name "/aws/service/sagemaker-hyperpod/ami/x86_64/eks-1.31-amazon-linux-2/latest/ami-id" \
  --region us-east-1 \
  --query "Parameter.Value" \
  --output text
```

**Note**  
Replace the parameter name with the corresponding Kubernetes version as required. For example, if you want to use Kubernetes 1.30, use the following parameter: `/aws/service/sagemaker-hyperpod/ami/x86_64/eks-1.30-amazon-linux-2/latest/ami-id`.
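The parameter name can be assembled from the Kubernetes version. The sketch below assumes the `sagemaker-hyperpod` naming scheme used in the retrieval example above; adjust the architecture or OS segments if yours differ.

```shell
# Build the SSM parameter name for a target Kubernetes version (assumed naming scheme).
K8S_VERSION="1.30"
PARAM="/aws/service/sagemaker-hyperpod/ami/x86_64/eks-${K8S_VERSION}-amazon-linux-2/latest/ami-id"
echo "$PARAM"
```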

## Build your custom AMI
<a name="hyperpod-custom-ami-build"></a>

After you have selected a SageMaker HyperPod public AMI, use that as the base AMI to build your own custom AMI with one of the following methods. Note that this is not an exhaustive list for building AMIs. You can use any method of your choice for building AMIs. SageMaker HyperPod does not have any specific recommendation.
+ **AWS Management Console**: You can launch an Amazon EC2 instance using the SageMaker HyperPod AMI, make desired customizations, and then create an AMI from that instance.
+ **AWS CLI**: You can also use the `aws ec2 create-image` command to create an AMI from an existing Amazon EC2 instance after performing the customization.
+ **HashiCorp Packer**: Packer is an open-source tool from HashiCorp that enables you to create identical machine images for multiple platforms from a single source configuration. It supports creating AMIs for AWS, as well as images for other cloud providers and virtualization platforms.
+ **Image Builder**: EC2 Image Builder is a fully managed AWS service that makes it easier to automate the creation, maintenance, validation, sharing, and deployment of Linux or Windows Server images. For more information, see the [EC2 Image Builder User Guide](https://docs.aws.amazon.com/imagebuilder/latest/userguide/what-is-image-builder.html).

### Build a custom AMI with customer managed AWS KMS encryption
<a name="hyperpod-custom-ami-build-kms"></a>

The following sections describe how to build a custom AMI with a customer managed AWS KMS key to encrypt your HyperPod cluster volumes. For more information about customer managed keys in HyperPod and granting the required IAM and KMS key policy permissions, see [Customer managed AWS KMS key encryption for SageMaker HyperPod](smcluster-cmk.md). If you plan to use a custom AMI that is encrypted with a customer managed key, ensure that you also encrypt your HyperPod cluster's Amazon EBS root volume with the same key.

#### AWS CLI example: Create a new AMI using EC2 Image Builder and a HyperPod base image
<a name="hyperpod-custom-ami-cli-example"></a>

The following example shows how to create an AMI using Image Builder with AWS KMS encryption:

```
aws imagebuilder create-image-recipe \
    --name "hyperpod-custom-recipe" \
    --semantic-version "1.0.0" \
    --parent-image "<hyperpod-base-image-id>" \
    --components '[{"componentArn":"<component-arn>"}]' \
    --block-device-mappings 'deviceName="/dev/xvda",ebs={volumeSize=100,volumeType=gp3,encrypted=true,kmsKeyId=arn:aws:kms:us-east-1:111122223333:key/key-id,deleteOnTermination=true}'
```

#### Amazon EC2 console: Create a new AMI from an Amazon EC2 instance
<a name="hyperpod-custom-ami-console-example"></a>

To create an AMI from an Amazon EC2 instance using the Amazon EC2 console:

1. Right-click on your customized Amazon EC2 instance and choose **Create Image**.

1. In the **Encryption** section, select **Encrypt snapshots**.

1. Select your KMS key from the dropdown. For example: `arn:aws:kms:us-east-2:111122223333:key/<your-kms-key-id>` or use the key alias: `alias/<your-hyperpod-key>`.

#### AWS CLI example: Create a new AMI from an Amazon EC2 instance
<a name="hyperpod-custom-ami-cli-create-image"></a>

Use the `aws ec2 create-image` command with AWS KMS encryption:

```
aws ec2 create-image \
    --instance-id "<instance-id>" \
    --name "MyCustomHyperPodAMI" \
    --description "Custom HyperPod AMI" \
    --block-device-mappings '[
        {
            "DeviceName": "/dev/xvda",
            "Ebs": {
                "Encrypted": true,
                "KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id",
                "VolumeType": "gp2"
            }
        }
    ]'
```

# Cluster management with custom AMIs
<a name="hyperpod-custom-ami-cluster-management"></a>

After the custom AMI is built, you can use it for creating or updating an Amazon SageMaker HyperPod cluster. You can also scale up or add instance groups that use the new AMI.

## Permissions required for cluster operations
<a name="hyperpod-custom-ami-permissions"></a>

Add the following permissions to the cluster admin user who operates and configures SageMaker HyperPod clusters. The following policy example includes the minimum set of permissions for cluster administrators to run the SageMaker HyperPod core APIs and manage SageMaker HyperPod clusters with custom AMI.

AMI and AMI EBS snapshot sharing permissions are included in the following policy through the `ModifyImageAttribute` and `ModifySnapshotAttribute` API actions. To scope down the sharing permissions, you can take the following steps:
+ Add tags to the AMI and its EBS snapshot to control sharing permissions. For example, tag the AMI with `AllowSharing` set to `true`.
+ Add a condition key to the policy so that sharing is allowed only for AMIs with those tags.

The following scoped-down policy allows sharing only for AMIs tagged with `AllowSharing` set to `true`.
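
For example, you can apply the tag to both the AMI and its backing snapshot with the `aws ec2 create-tags` command. The `<ami-id>` and `<snapshot-id>` values are placeholders for your own resources:

```
aws ec2 create-tags \
    --resources <ami-id> <snapshot-id> \
    --tags Key=AllowSharing,Value=true
```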

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::111122223333:role/your-execution-role-name"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreateCluster",
                "sagemaker:DeleteCluster",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "sagemaker:UpdateCluster",
                "sagemaker:UpdateClusterSoftware",
                "sagemaker:BatchDeleteClusterNodes",
                "eks:DescribeCluster",
                "eks:CreateAccessEntry",
                "eks:DescribeAccessEntry",
                "eks:DeleteAccessEntry",
                "eks:AssociateAccessPolicy",
                "iam:CreateServiceLinkedRole",
                "ec2:DescribeImages",
                "ec2:DescribeSnapshots"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:ModifyImageAttribute",
                "ec2:ModifySnapshotAttribute"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "ec2:ResourceTag/AllowSharing": "true"
                }
            }
        }
    ]
}
```

------

**Important**  
If you plan to use an encrypted custom AMI, then make sure that your KMS key meets the permissions described in [Customer managed AWS KMS key encryption for SageMaker HyperPod](smcluster-cmk.md). Additionally, ensure that your custom AMI's KMS key is also used to encrypt your cluster's Amazon EBS root volume.

## Create a cluster
<a name="hyperpod-custom-ami-api-create"></a>

You can specify your custom AMI in the `ImageId` field for the `CreateCluster` operation.

The following examples show how to create a cluster with a custom AMI, both with and without an AWS KMS customer managed key for encrypting the cluster volumes.

------
#### [ Standard example ]

The following example shows how to create a cluster with a custom AMI.

```
aws sagemaker create-cluster \
   --cluster-name <exampleClusterName> \
   --orchestrator 'Eks={ClusterArn='<eks_cluster_arn>'}' \
   --node-provisioning-mode Continuous \
   --instance-groups '[{
   "InstanceGroupName": "<exampleGroupName>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ImageId": "<your_custom_ami>",
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "InstanceStorageConfigs": [
        {
            "EbsVolumeConfig": {
                "VolumeSizeInGB": 200
            }
        }
   ]
}]' \
   --vpc-config '{
   "SecurityGroupIds": ["<security_group>"],
   "Subnets": ["<subnet>"]
}'
```

------
#### [ Customer managed key example ]

The following example shows how to create a cluster with a custom AMI while specifying your own AWS KMS customer managed key for encrypting the cluster's Amazon EBS volumes. It is possible to specify different customer managed keys for the root volume and the instance storage volume. If you don't use customer managed keys in the `InstanceStorageConfigs` field, then an AWS owned KMS key is used to encrypt the volumes. If you use different keys for the root volume and secondary instance storage volumes, then set the required KMS key policies on both of your keys.

```
aws sagemaker create-cluster \
   --cluster-name <exampleClusterName> \
   --orchestrator 'Eks={ClusterArn='<eks_cluster_arn>'}' \
   --node-provisioning-mode Continuous \
   --instance-groups '[{
   "InstanceGroupName": "<exampleGroupName>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ImageId": "<your_custom_ami>",
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "InstanceStorageConfigs": [
            {
                "EbsVolumeConfig": {
                    "RootVolume": true,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id"
                }
            },
            {
                "EbsVolumeConfig": {
                    "VolumeSizeInGB": 100,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id"
                }
            }
   ]
}]' \
   --vpc-config '{
   "SecurityGroupIds": ["<security_group>"],
   "Subnets": ["<subnet>"]
}'
```

------

## Update the cluster software
<a name="hyperpod-custom-ami-api-update"></a>

If you want to update an existing instance group on your cluster with your custom AMI, use the `UpdateClusterSoftware` operation and specify your custom AMI in the `ImageId` field. Unless you specify the name of a specific instance group in your request, the new image is applied to all of the instance groups in your cluster.

The following example shows how to update a cluster's platform software with a custom AMI:

```
aws sagemaker update-cluster-software \
   --cluster-name <exampleClusterName> \
   --instance-groups <instanceGroupToUpdate> \
   --image-id <customAmiId>
```

## Scale up an instance group
<a name="hyperpod-custom-ami-scale-up"></a>

The following examples show how to scale up an instance group for a cluster using a custom AMI, both with and without using an AWS KMS customer managed key for encryption.

------
#### [ Standard example ]

The following example shows how to scale up an instance group with a custom AMI.

```
aws sagemaker update-cluster \
    --cluster-name <exampleClusterName> \
    --instance-groups '[{
   "InstanceGroupName": "<exampleGroupName>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "ImageId": "<your_custom_ami>"
}]'
```

------
#### [ Customer managed key example ]

The following example shows how to update and scale up your cluster with a custom AMI while specifying your own AWS KMS customer managed key for encrypting the cluster's Amazon EBS volumes. It is possible to specify different customer managed keys for the root volume and the instance storage volume. If you don't use customer managed keys in the `InstanceStorageConfigs` field, then an AWS owned KMS key is used to encrypt the volumes. If you use different keys for the root volume and secondary instance storage volumes, then set the required KMS key policies on both of your keys.

```
aws sagemaker update-cluster \
    --cluster-name <exampleClusterName> \
    --instance-groups '[{
   "InstanceGroupName": "<exampleGroupName>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "ImageId": "<your_custom_ami>",
   "InstanceStorageConfigs": [
            {
                "EbsVolumeConfig": {
                    "RootVolume": true,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id"
                }
            },
            {
                "EbsVolumeConfig": {
                    "VolumeSizeInGB": 100,
                    "VolumeKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/key-id"
                }
            }
   ]
}]'
```

------

## Add an instance group
<a name="hyperpod-custom-ami-add-instance-group"></a>

The following example shows how to add an instance group to a cluster using a custom AMI:

```
aws sagemaker update-cluster \
   --cluster-name "<exampleClusterName>" \
   --instance-groups '[{
   "InstanceGroupName": "<exampleGroupName>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "ImageId": "<your_custom_ami>"
}, {
   "InstanceGroupName": "<exampleGroupName2>",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 1,
   "LifeCycleConfig": {
      "SourceS3Uri": "<s3://amzn-s3-demo-bucket>",
      "OnCreate": "on_create_noop.sh"
   },
   "ExecutionRole": "<arn:aws:iam::444455556666:role/Admin>",
   "ThreadsPerCore": 1,
   "ImageId": "<your_custom_ami>"
}]'
```

# Managing SageMaker HyperPod EKS clusters using the SageMaker console
<a name="sagemaker-hyperpod-eks-operate-console-ui"></a>

The following topics provide guidance on how to manage SageMaker HyperPod in the SageMaker AI console.

**Topics**
+ [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md)
+ [Browsing, viewing, and editing SageMaker HyperPod clusters](sagemaker-hyperpod-eks-operate-console-ui-browse-view-edit.md)
+ [Deleting a SageMaker HyperPod cluster](sagemaker-hyperpod-eks-operate-console-ui-delete-cluster.md)

# Creating a SageMaker HyperPod cluster with Amazon EKS orchestration
<a name="sagemaker-hyperpod-eks-operate-console-ui-create-cluster"></a>

The following tutorial demonstrates how to create a new SageMaker HyperPod cluster and set it up with Amazon EKS orchestration through the SageMaker AI console UI.

**Topics**
+ [Create cluster](#smcluster-getting-started-eks-console-create-cluster-page)
+ [Deploy resources](#smcluster-getting-started-eks-console-create-cluster-deploy)

## Create cluster
<a name="smcluster-getting-started-eks-console-create-cluster-page"></a>

To navigate to the **SageMaker HyperPod Clusters** page and choose Amazon EKS orchestration, follow these steps.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **HyperPod Clusters** in the left navigation pane and then **Cluster Management**.

1. On the **SageMaker HyperPod Clusters** page, choose **Create HyperPod cluster**. 

1. On the **Create HyperPod cluster** drop-down, choose **Orchestrated by Amazon EKS**.

1. On the EKS cluster creation page, you will see two options. Choose the option that best fits your needs.

   1. **Quick setup** - To get started immediately with default settings, choose **Quick setup**. With this option, SageMaker AI creates new resources such as a VPC, subnets, security groups, an Amazon S3 bucket, an IAM role, and an FSx for Lustre file system in the process of creating your cluster.

   1. **Custom setup** - To integrate with existing AWS resources or have specific networking, security, or storage requirements, choose **Custom setup**. With this option, you can choose to use the existing resources or create new ones, and you can customize the configuration that best fits your needs.

## Quick setup
<a name="smcluster-getting-started-eks-console-create-cluster-default"></a>

On the **Quick setup** section, follow these steps to create your HyperPod cluster with Amazon EKS orchestration.

### General settings
<a name="smcluster-getting-started-eks-console-create-cluster-default-general"></a>

Specify a name for the new cluster. You can’t change the name after the cluster is created.

### Instance groups
<a name="smcluster-getting-started-eks-console-create-cluster-default-instance-groups"></a>

To add an instance group, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group. Follow these steps to add an instance group.

1. For **Instance group type**, choose **Standard** or **Restricted Instance Group (RIG)**. Typically, you will choose **Standard**, which provides a general purpose computing environment without additional security restrictions. **Restricted Instance Group (RIG)** is a specialized environment for foundation model customization such as Amazon Nova. For more information about setting up RIG for Amazon Nova model customization, see Amazon Nova customization on SageMaker HyperPod in the [Amazon Nova 1.0 user guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp.html) or the [Amazon Nova 2.0 user guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp.html).

1. For **Name**, specify a name for the instance group.

1.  For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

1. For **Instance type**, choose the instance for the instance group.
**Important**  
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this tutorial, enter **1**.

1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

1. For **Instance deep health checks**, choose your option. Deep health checks monitor instance health during creation and after software updates, automatically recovering faulty instances through reboots or replacements when enabled.

1. If your instance type supports GPU partitioning with Multi-Instance GPU (MIG), you can enable GPU partition configuration for the instance group. GPU partitioning allows you to divide GPUs into smaller, isolated partitions for improved resource utilization. For more information, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).

   1. Toggle **Use GPU partition** to enable GPU partitioning for this instance group.

   1. Select a **GPU partition profile** from the available options for your instance type. Each profile defines the GPU slice configuration and memory allocation.

1. Choose **Add instance group**.
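
After the cluster is running, you can confirm from a cluster node that the additional EBS volume from the storage step is mounted. This sketch assumes the default mount path `/opt/sagemaker`:

```
# On a cluster node, list mounted file systems and filter for the
# additional EBS volume's default mount path.
df -h | grep -E 'Filesystem|/opt/sagemaker'
```

If the volume is mounted, the output includes a row whose **Mounted on** column is `/opt/sagemaker`.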

### Quick setup defaults
<a name="smcluster-getting-started-eks-console-create-cluster-default-settings"></a>

This section lists all the default settings for your cluster creation, including all the new AWS resources that will be created during the cluster creation process. Review the default settings.

## Custom setup
<a name="smcluster-getting-started-eks-console-create-cluster-custom"></a>

On the **Custom setup** section, follow these steps to create your first HyperPod cluster with Amazon EKS orchestration.

### General settings
<a name="smcluster-getting-started-eks-console-create-cluster-custom-general"></a>

Specify a name for the new cluster. You can’t change the name after the cluster is created.

For **Instance recovery**, choose **Automatic - *recommended*** or **None**. 

### Networking
<a name="smcluster-getting-started-eks-console-create-cluster-custom-network"></a>

Configure network settings within the cluster and in-and-out of the cluster. For orchestration of SageMaker HyperPod cluster with Amazon EKS, the VPC is automatically set to the one configured with the EKS cluster you selected.

1. For **VPC**, choose your own VPC if you already have one that gives SageMaker AI access to your VPC. To create a new VPC, follow the instructions at [Create a VPC](https://docs.aws.amazon.com/vpc/latest/userguide/create-vpc.html) in the *Amazon Virtual Private Cloud User Guide*. You can leave it as **None** to use the default SageMaker AI VPC.

1. For **VPC IPv4 CIDR block**, enter the starting IP of your VPC.

1. For **Availability Zones**, choose the Availability Zones (AZ) where HyperPod will create subnets for your cluster. Choose AZs that match the location of your accelerated compute capacity.

1. For **Security group(s)**, choose security groups that are either attached to the Amazon EKS cluster or whose inbound traffic is permitted by the security group associated with the Amazon EKS cluster. To create new security groups, go to the Amazon VPC console.

### Orchestration
<a name="smcluster-getting-started-eks-console-create-cluster-custom-orchestration"></a>

Follow these steps to create or select an Amazon EKS cluster to use as an orchestrator. 

1. For **EKS cluster**, choose either to create a new Amazon EKS cluster or to use an existing one.

   If you need to create a new EKS cluster, you can create it from the EKS cluster section without having to open the Amazon EKS console.
**Note**  
The VPC subnet you choose for HyperPod has to be private.   
After submitting a new EKS cluster creation request, wait until the EKS cluster becomes `Active`.

1. For **Kubernetes version**, choose a version from the drop-down menu. For more information about Kubernetes versions, see [Understand the Kubernetes version lifecycle on EKS](https://docs.aws.amazon.com//eks/latest/userguide/kubernetes-versions.html) from the *Amazon EKS User Guide*.

1. For **Operators**, choose **Use default Helm charts and add-ons** or **Don't install operators**. The option defaults to **Use default Helm charts and add-ons**, which will be used to install operators on the EKS cluster. For more information about the default Helm charts and add-ons, see [https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart/HyperPodHelmChart](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart/HyperPodHelmChart) from the GitHub repository. For more information, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md).

1. For **Enabled operators**, view the list of enabled operators. To edit the operators, clear the check box at the top, and then choose the operators to enable for the EKS cluster.
**Note**  
To use HyperPod with EKS, you must install Helm charts and add-ons that enable operators on the EKS cluster. These components configure EKS as the control plane for HyperPod and provide the necessary setup for workload management and orchestration.
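
If you manage the EKS cluster yourself, the dependencies can also be installed with Helm from the repository linked above. The release name and namespace in this sketch are illustrative assumptions; check the repository README for the current instructions:

```
# Clone the repository that contains the HyperPod Helm chart.
git clone https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli/helm_chart

# Resolve chart dependencies, then install the chart on the EKS cluster.
helm dependency update HyperPodHelmChart
helm install hyperpod-dependencies HyperPodHelmChart --namespace kube-system
```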

### Instance groups
<a name="smcluster-getting-started-eks-console-create-cluster-custom-instance-groups"></a>

To add an instance group, choose **Add group**. Each instance group can be configured differently, and you can create a heterogeneous cluster that consists of multiple instance groups with various instance types. To deploy a cluster, you must add at least one instance group. Follow these steps to add an instance group.

1. For **Instance group type**, choose **Standard** or **Restricted Instance Group (RIG)**. Typically, you will choose **Standard**, which provides a general purpose computing environment without additional security restrictions. **Restricted Instance Group (RIG)** is a specialized environment for foundation model customization such as Amazon Nova. For more information about setting up RIG for Amazon Nova model customization, see Amazon Nova customization on SageMaker HyperPod in the [Amazon Nova 1.0 user guide](https://docs.aws.amazon.com//nova/latest/userguide/nova-hp.html) or the [Amazon Nova 2.0 user guide](https://docs.aws.amazon.com//nova/latest/nova2-userguide/nova-hp.html).

1. For **Name**, specify a name for the instance group.

1.  For **Instance capacity**, choose either on-demand capacity or a training plan to reserve your compute resources.

1. For **Instance type**, choose the instance for the instance group.
**Important**  
Ensure that you choose an instance type with sufficient quotas and enough unassigned IP addresses for your account. To view or request additional quotas, see [SageMaker HyperPod quotas](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-quotas).

1. For **Instance quantity**, specify an integer not exceeding the instance quota for cluster usage. For this tutorial, enter **1**.

1. For **Target Availability Zone**, choose the Availability Zone where your instances will be provisioned. The Availability Zone should correspond to the location of your accelerated compute capacity.

1. For **Additional storage volume per instance (GB) - optional**, specify an integer between 1 and 16384 to set the size of an additional Elastic Block Store (EBS) volume in gigabytes (GB). The EBS volume is attached to each instance of the instance group. The default mount path for the additional EBS volume is `/opt/sagemaker`. After the cluster is successfully created, you can SSH into the cluster instances (nodes) and verify if the EBS volume is mounted correctly by running the `df -h` command. Attaching an additional EBS volume provides stable, off-instance, and independently persisting storage, as described in the [Amazon EBS volumes](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-volumes.html) section in the *Amazon Elastic Block Store User Guide*.

1. For **Instance deep health checks**, choose your option. Deep health checks monitor instance health during creation and after software updates, automatically recovering faulty instances through reboots or replacements when enabled. To learn more, see [Deep health checks](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md).

1. For **Use GPU partition - optional**, if your instance type supports GPU partitioning with Multi-Instance GPU (MIG), you can enable this option to configure the GPU partition profile for the instance group. GPU partitioning allows you to divide GPUs into smaller, isolated partitions for improved resource utilization. For more information, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).

   1. Toggle **Use GPU partition** to enable GPU partitioning for this instance group.

   1. Select a **GPU partition profile** from the available options for your instance type. Each profile defines the GPU slice configuration and memory allocation.

1. Choose **Add instance group**.

### Lifecycle scripts
<a name="smcluster-getting-started-eks-console-create-cluster-custom-lifecycle"></a>

You can choose to use the default lifecycle scripts or the custom lifecycle scripts, which will be stored in your Amazon S3 bucket. You can view the default lifecycle scripts in the [Awesome Distributed Training GitHub repository](https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts). To learn more about the lifecycle scripts, see [Customizing SageMaker HyperPod clusters using lifecycle scripts](sagemaker-hyperpod-lifecycle-best-practices-slurm.md).

1. For **Lifecycle scripts**, choose to use default or custom lifecycle scripts.

1. For **S3 bucket for lifecycle scripts**, choose to create a new bucket or use an existing bucket to store the lifecycle scripts.
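
If you choose custom lifecycle scripts, upload them to the S3 bucket before you create the cluster. The bucket name and local directory in this sketch are placeholders:

```
# Upload the lifecycle scripts directory to the S3 bucket that the
# cluster configuration references.
aws s3 cp ./LifecycleScripts/ s3://amzn-s3-demo-bucket/LifecycleScripts/ --recursive
```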

### Permissions
<a name="smcluster-getting-started-eks-console-create-cluster-custom-permissions"></a>

Choose or create an IAM role that allows HyperPod to run and access necessary AWS resources on your behalf. For more information, see [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod).

### Storage
<a name="smcluster-getting-started-eks-console-create-cluster-custom-storage"></a>

Configure the FSx for Lustre file system to be provisioned on the HyperPod cluster.

1. For **File system**, choose to use an existing FSx for Lustre file system, create a new one, or not provision a file system.

1. For **Throughput per unit of storage**, choose the throughput that will be available per TiB of provisioned storage.

1. For **Storage capacity**, enter a capacity value in TB.

1. For **Data compression type**, choose **LZ4** to enable data compression.

1. For **Lustre version**, view the value that's recommended for the new file systems.

### Tags - optional
<a name="smcluster-getting-started-eks-console-create-cluster-tags"></a>

For **Tags - *optional***, add key and value pairs to the new cluster and manage the cluster as an AWS resource. To learn more, see [Tagging your AWS resources](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).

## Deploy resources
<a name="smcluster-getting-started-eks-console-create-cluster-deploy"></a>

After you complete the cluster configurations using either **Quick setup** or **Custom setup**, choose the following option to start resource provisioning and cluster creation.
+  **Submit** - SageMaker AI will start provisioning the default configuration resources and creating the cluster. 
+ **Download CloudFormation template parameters** - Download the configuration parameter JSON file and run an AWS CLI command to deploy the CloudFormation stack that provisions the configured resources and creates the cluster. You can edit the downloaded parameter JSON file if needed. If you choose this option, see the instructions in [Creating SageMaker HyperPod clusters using CloudFormation templates](smcluster-getting-started-eks-console-create-cluster-cfn.md).

# Browsing, viewing, and editing SageMaker HyperPod clusters
<a name="sagemaker-hyperpod-eks-operate-console-ui-browse-view-edit"></a>

Use the following instructions to browse, view, and edit SageMaker HyperPod clusters orchestrated by Amazon EKS in the SageMaker AI console.

**Topics**
+ [To browse your SageMaker HyperPod clusters](#sagemaker-hyperpod-eks-operate-console-ui-browse-clusters)
+ [To view details of each SageMaker HyperPod cluster](#sagemaker-hyperpod-eks-operate-console-ui-view-details-of-clusters)
+ [To edit a SageMaker HyperPod cluster](#sagemaker-hyperpod-eks-operate-console-ui-edit-clusters)

## To browse your SageMaker HyperPod clusters
<a name="sagemaker-hyperpod-eks-operate-console-ui-browse-clusters"></a>

On the **SageMaker HyperPod Clusters** page in the SageMaker AI console, all created clusters are listed under the **Clusters** section, which provides a summary view of the clusters, their ARNs, status, and creation time.

## To view details of each SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-console-ui-view-details-of-clusters"></a>

Under **Clusters** on the SageMaker HyperPod page in the SageMaker AI console, the cluster names are activated as links. Choose the cluster name link to see details of each cluster.
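
Equivalently, you can retrieve the same information with the AWS CLI:

```
# List all HyperPod clusters in the current AWS Region.
aws sagemaker list-clusters

# Show the details of a specific cluster.
aws sagemaker describe-cluster --cluster-name <exampleClusterName>
```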

## To edit a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-console-ui-edit-clusters"></a>

1. Under **Clusters** in the main pane of the SageMaker HyperPod console, choose the cluster you want to update.

1. Select your cluster, and choose **Edit**.

1. On the **Edit <your-cluster>** page, you can edit the configurations of existing instance groups, add more instance groups, delete instance groups, and change tags for the cluster. After making changes, choose **Submit**. 

   1. In the **Configure instance groups** section, you can add more instance groups by choosing **Create instance group**.

   1. In the **Configure instance groups** section, choose an instance group, and then choose **Edit** to change its configuration or **Delete** to remove the instance group permanently.
**Important**  
When deleting an instance group, consider the following points:  
Your SageMaker HyperPod cluster must always maintain at least one instance group.
Ensure all critical data is backed up before removal.
The removal process cannot be undone.
**Note**  
Deleting an instance group will terminate all compute resources associated with that group.

   1. In the **Tags** section, you can update tags for the cluster.

# Deleting a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-console-ui-delete-cluster"></a>

Use the following instructions to delete SageMaker HyperPod clusters orchestrated by Amazon EKS in the SageMaker AI console.

1. Under **Clusters** in the main pane of the SageMaker HyperPod console, choose the cluster you want to delete.

1. Select your cluster, and choose **Delete**.

1. In the pop-up window for cluster deletion, review the cluster information carefully to confirm that you chose the right cluster to delete.

1. After you review the cluster information, choose **Yes, delete cluster**.

1. In the text field to confirm this deletion, type **delete**.

1. Choose **Delete** on the lower right corner of the pop-up window to finish sending the cluster deletion request.

**Note**  
When cluster deletion fails due to attached SageMaker HyperPod task governance policies, you will need to [Delete policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete.md).
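
If you prefer the AWS CLI, you can send the same deletion request with the `DeleteCluster` operation:

```
aws sagemaker delete-cluster --cluster-name <exampleClusterName>
```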

# Creating SageMaker HyperPod clusters using CloudFormation templates
<a name="smcluster-getting-started-eks-console-create-cluster-cfn"></a>

You can create SageMaker HyperPod clusters using the CloudFormation templates for HyperPod. You must install the AWS CLI to proceed.

**Topics**
+ [Configure resources in the console and deploy using CloudFormation](#smcluster-getting-started-eks-console-create-cluster-deploy-console)
+ [Configure and deploy resources using CloudFormation](#smcluster-getting-started-eks-console-create-cluster-deploy-cfn)

## Configure resources in the console and deploy using CloudFormation
<a name="smcluster-getting-started-eks-console-create-cluster-deploy-console"></a>

You can configure resources using the AWS Management Console and deploy using the CloudFormation templates.

Follow these steps.

1. At the end of the tutorial in [Getting started with SageMaker HyperPod using the SageMaker AI console](smcluster-getting-started-slurm-console.md), choose **Download CloudFormation template parameters** *instead of choosing **Submit***. The tutorial contains important configuration information that you need to create your cluster successfully.
**Important**  
If you choose **Submit**, you will not be able to deploy a cluster with the same name until you delete the cluster.

   After you choose **Download CloudFormation template parameters**, the **Using the configuration file to create the cluster using the AWS CLI** window will appear on the right side of the page.

1. On the **Using the configuration file to create the cluster using the AWS CLI** window, choose **Download configuration parameters file**. The file will be downloaded to your machine. You can edit the configuration JSON file based on your needs or leave it as-is, if no change is required.

1. In the terminal, navigate to the directory that contains the downloaded parameters file (for example, `params.json`). You reference it as `file://params.json` in the next step.
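
   Before running `create-stack`, you can sanity-check that the parameters file is valid JSON in the `ParameterKey`/`ParameterValue` shape that the `--parameters` option expects. A sketch (the key name and file contents are illustrative; point the check at your downloaded file instead):

   ```shell
   # For illustration, write a minimal parameters file (real files come from the console download).
   printf '%s\n' '[{"ParameterKey": "HyperPodClusterName", "ParameterValue": "ml-cluster"}]' > sample-params.json
   # Fail early on malformed JSON -- cheaper than a failed stack deployment.
   python3 -m json.tool sample-params.json > /dev/null && echo "sample-params.json is valid JSON"
   ```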

1. Run the [create-stack](https://docs.aws.amazon.com/cli/latest/reference/cloudformation/create-stack.html) AWS CLI command to deploy the CloudFormation stack that provisions the configured resources and creates the HyperPod cluster.

   ```
   aws cloudformation create-stack \
       --stack-name my-stack \
       --template-url https://aws-sagemaker-hyperpod-cluster-setup.amazonaws.com/templates-slurm/main-stack-slurm-based-template.yaml \
       --parameters file://params.json \
       --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
   ```

1. To view the status of the resources provisioning, navigate to the [CloudFormation console](https://console.aws.amazon.com/cloudformation).

   After the cluster creation completes, view the new cluster under **Clusters** in the main pane of the SageMaker HyperPod console. You can check its status under the **Status** column.

1. After the status of the cluster changes to `InService`, you can start logging in to the cluster nodes. To access the cluster nodes and start running ML workloads, see [Jobs on SageMaker HyperPod clusters](sagemaker-hyperpod-run-jobs-slurm.md).
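
You can also watch provisioning from the CLI instead of the console. The sketch below shows the `describe-stacks` query you would run (the stack name is a placeholder) and demonstrates the extraction against a canned response so it runs offline:

```shell
# Real command (requires AWS credentials):
#   aws cloudformation describe-stacks --stack-name my-stack \
#       --query 'Stacks[0].StackStatus' --output text
# The same extraction, applied to a canned response:
cat > sample-describe-stacks.json <<'EOF'
{"Stacks": [{"StackName": "my-stack", "StackStatus": "CREATE_COMPLETE"}]}
EOF
python3 -c 'import json; print(json.load(open("sample-describe-stacks.json"))["Stacks"][0]["StackStatus"])'
```

In scripts, `aws cloudformation wait stack-create-complete --stack-name my-stack` blocks until creation finishes, which avoids hand-rolled polling.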

## Configure and deploy resources using CloudFormation
<a name="smcluster-getting-started-eks-console-create-cluster-deploy-cfn"></a>

You can configure and deploy resources using the CloudFormation templates for SageMaker HyperPod.

Follow these steps.

1. Download a CloudFormation template for SageMaker HyperPod from the [sagemaker-hyperpod-cluster-setup](https://github.com/aws/sagemaker-hyperpod-cluster-setup) GitHub repository.

1. Run the [create-stack](https://docs.aws.amazon.com/cli/latest/reference/cloudformation/create-stack.html) AWS CLI command to deploy the CloudFormation stack that provisions the configured resources and creates the HyperPod cluster.

   ```
   aws cloudformation create-stack \
       --stack-name my-stack \
       --template-url URL_of_the_file_that_contains_the_template_body \
       --parameters file://params.json \
       --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
   ```

1. To view the status of the resources provisioning, navigate to the CloudFormation console.

   After the cluster creation completes, view the new cluster under **Clusters** in the main pane of the SageMaker HyperPod console. You can check its status under the **Status** column.

1. After the status of the cluster changes to `InService`, you can start logging in to the cluster nodes.

# Managing SageMaker HyperPod EKS clusters using the AWS CLI
<a name="sagemaker-hyperpod-eks-operate-cli-command"></a>

The following topics provide guidance on writing SageMaker HyperPod API request files in JSON format and running them using AWS CLI commands.

**Topics**
+ [Creating a SageMaker HyperPod cluster](sagemaker-hyperpod-eks-operate-cli-command-create-cluster.md)
+ [Retrieving SageMaker HyperPod cluster details](sagemaker-hyperpod-eks-operate-cli-command-cluster-details.md)
+ [Updating SageMaker HyperPod cluster configuration](sagemaker-hyperpod-eks-operate-cli-command-update-cluster.md)
+ [Updating the SageMaker HyperPod platform software](sagemaker-hyperpod-eks-operate-cli-command-update-cluster-software.md)
+ [Accessing SageMaker HyperPod cluster nodes](sagemaker-hyperpod-eks-operate-access-through-terminal.md)
+ [Scaling down a SageMaker HyperPod cluster](smcluster-scale-down.md)
+ [Deleting a SageMaker HyperPod cluster](sagemaker-hyperpod-eks-operate-cli-command-delete-cluster.md)

# Creating a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-cli-command-create-cluster"></a>

Learn how to create SageMaker HyperPod clusters orchestrated by Amazon EKS using the AWS CLI.

1. Before creating a SageMaker HyperPod cluster:

   1. Ensure that you have an existing Amazon EKS cluster up and running. For detailed instructions about how to set up an Amazon EKS cluster, see [Create an Amazon EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/create-cluster.html) in the *Amazon EKS User Guide*.

   1. Install the Helm chart as instructed in [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md). If you create an [Amazon Nova SageMaker HyperPod cluster](https://docs.aws.amazon.com/nova/latest/nova2-userguide/nova-hp-cluster.html), you need a separate Helm chart.

1. Prepare a lifecycle configuration script and upload it to an Amazon S3 bucket, such as `s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/`.

   For a quick start, download the sample script [on_create.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/7.sagemaker-hyperpod-eks/LifecycleScripts/base-config/on_create.sh) from the AWSome Distributed Training GitHub repository, and upload it to the S3 bucket. You can also include additional setup instructions, a series of setup scripts, or commands to run during the HyperPod cluster provisioning stage.
**Important**  
If you create an [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod) attaching only the managed [AmazonSageMakerClusterInstanceRolePolicy](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-cluster.html), your cluster has access only to Amazon S3 buckets with names that begin with the prefix `sagemaker-`.

   If you create a restricted instance group, you don't need to download and run the lifecycle script. Instead, you need to run `install_rig_dependencies.sh`. 

   The prerequisites to run the `install_rig_dependencies.sh` script include:
   + AWS Node (CNI) and CoreDNS must both be enabled. These are standard EKS add-ons that are not managed by the standard SageMaker HyperPod Helm chart, but you can enable them in the EKS console under **Add-ons**.
   + The standard SageMaker HyperPod Helm chart must be installed before running this script.

   The `install_rig_dependencies.sh` script performs the following actions.
   + `aws-node` (CNI): Creates a new `rig-aws-node` DaemonSet and patches the existing `aws-node` DaemonSet to avoid RIG nodes.
   + `coredns`: Converts CoreDNS to a DaemonSet for RIGs to support multi-RIG use and prevent overloading.
   + Training operators: Updates them with RIG worker taint tolerations and node affinity favoring non-RIG instances.
   + Elastic Fabric Adapter (EFA): Updates the EFA plugin to tolerate the RIG worker taint and to use the correct container images for each Region.
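
   For non-restricted instance groups, the `on_create.sh` entrypoint mentioned earlier can start out minimal. A sketch (the contents are illustrative; replace the placeholder steps with your own setup commands):

   ```shell
   # Write a minimal lifecycle entrypoint script (contents are illustrative).
   printf '%s\n' \
     '#!/bin/bash' \
     'set -euo pipefail' \
     'echo "HyperPod lifecycle configuration starting"' \
     '# Add package installs, agent setup, or other node bootstrap steps here.' \
     'echo "HyperPod lifecycle configuration complete"' \
     > on_create.sh
   chmod +x on_create.sh
   # Dry-run locally before uploading with, for example:
   #   aws s3 cp on_create.sh s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/
   ./on_create.sh
   ```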

1. Prepare a [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) API request file in JSON format. For `ExecutionRole`, provide the ARN of the IAM role you created with the managed `AmazonSageMakerClusterInstanceRolePolicy` from the section [IAM role for SageMaker HyperPod](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-role-for-hyperpod).
**Note**  
Ensure that your SageMaker HyperPod cluster is deployed within the same Virtual Private Cloud (VPC) as your Amazon EKS cluster. The subnets and security groups specified in the SageMaker HyperPod cluster configuration must allow network connectivity and communication with the Amazon EKS cluster's API server endpoint.

   ```
   // create_cluster.json
   {
       "ClusterName": "string",
       "InstanceGroups": [{
           "InstanceGroupName": "string",
           "InstanceType": "string",
           "InstanceCount": number,
           "LifeCycleConfig": {
               "SourceS3Uri": "s3://amzn-s3-demo-bucket-sagemaker/lifecycle-script-directory/src/",
               "OnCreate": "on_create.sh"
           },
           "ExecutionRole": "string",
           "ThreadsPerCore": number,
           "OnStartDeepHealthChecks": [
               "InstanceStress", "InstanceConnectivity"
           ]
       }],
       "RestrictedInstanceGroups": [ 
         { 
            "EnvironmentConfig": { 
               "FSxLustreConfig": { 
                  "PerUnitStorageThroughput": number,
                  "SizeInGiB": number
               }
            },
            "ExecutionRole": "string",
            "InstanceCount": number,
            "InstanceGroupName": "string",
            "InstanceStorageConfigs": [ 
               { ... }
            ],
            "InstanceType": "string",
            "OnStartDeepHealthChecks": [ "string" ],
            "OverrideVpcConfig": { 
               "SecurityGroupIds": [ "string" ],
               "Subnets": [ "string" ]
            },
            "ScheduledUpdateConfig": { 
               "DeploymentConfig": { 
                  "AutoRollbackConfiguration": [ 
                     { 
                        "AlarmName": "string"
                     }
                  ],
                  "RollingUpdatePolicy": { 
                     "MaximumBatchSize": { 
                        "Type": "string",
                        "Value": number
                     },
                     "RollbackMaximumBatchSize": { 
                        "Type": "string",
                        "Value": number
                     }
                  },
                  "WaitIntervalInSeconds": number
               },
               "ScheduleExpression": "string"
            },
            "ThreadsPerCore": number,
            "TrainingPlanArn": "string"
         }
      ],
       "VpcConfig": {
           "SecurityGroupIds": ["string"],
           "Subnets": ["string"]
       },
       "Tags": [{
           "Key": "string",
           "Value": "string"
       }],
       "Orchestrator": {
           "Eks": {
               "ClusterArn": "string",
               "KubernetesConfig": {
                   "Labels": {
                       "nvidia.com/mig.config": "all-3g.40gb"
                   }
               }
           }
       },
       "NodeRecovery": "Automatic"
   }
   ```
**Flexible instance groups**  
Instead of specifying a single `InstanceType`, you can use the `InstanceRequirements` parameter to specify multiple instance types for an instance group. Note the following:  
`InstanceType` and `InstanceRequirements` are mutually exclusive. You must specify one or the other, but not both.
`InstanceRequirements.InstanceTypes` is an ordered list that determines provisioning priority. SageMaker HyperPod attempts to provision the first instance type in the list and falls back to subsequent types if capacity is unavailable. You can specify up to 20 instance types, and the list must not contain duplicates.
Flexible instance groups require continuous node provisioning mode.
The following example shows an instance group using `InstanceRequirements`:  

   ```
   {
       "InstanceGroupName": "flexible-ig",
       "InstanceRequirements": {
           "InstanceTypes": ["ml.p5.48xlarge", "ml.p4d.24xlarge", "ml.g6.48xlarge"]
       },
       "InstanceCount": 10,
       "LifeCycleConfig": {
           "SourceS3Uri": "s3://amzn-s3-demo-bucket-sagemaker/lifecycle-script-directory/src/",
           "OnCreate": "on_create.sh"
       },
       "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster"
   }
   ```

   Note the following when configuring a new SageMaker HyperPod cluster associated with an EKS cluster.
   + You can configure up to 20 instance groups under the `InstanceGroups` parameter.
   + For `Orchestrator.Eks.ClusterArn`, specify the ARN of the EKS cluster you want to use as the orchestrator.
   + For `OnStartDeepHealthChecks`, add `InstanceStress` and `InstanceConnectivity` to enable [Deep health checks](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md).
   + For `NodeRecovery`, specify `Automatic` to enable automatic node recovery. SageMaker HyperPod replaces or reboots instances (nodes) when issues are found by the health-monitoring agent.
   + For the `Tags` parameter, you can add custom tags for managing the SageMaker HyperPod cluster as an AWS resource. You can add tags to your cluster in the same way you add them in other AWS services that support tagging. To learn more about tagging AWS resources in general, see the [Tagging AWS Resources User Guide](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).
   + For the `VpcConfig` parameter, specify the information of the VPC used in the EKS cluster. The subnets must be private.
   + For `Orchestrator.Eks.KubernetesConfig.Labels`, you can optionally specify Kubernetes labels to apply to the nodes. To enable GPU partitioning with Multi-Instance GPU (MIG), add the `nvidia.com/mig.config` label with the desired MIG profile. For example, `"nvidia.com/mig.config": "all-3g.40gb"` configures all GPUs with the 3g.40gb partition profile. For more information about GPU partitioning and available profiles, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).

1. Run the [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-cluster.html) command as follows.
**Important**  
When running the `create-cluster` command with the `--cli-input-json` parameter, you must include the `file://` prefix before the complete path to the JSON file. This prefix is required to ensure that the AWS CLI recognizes the input as a file path. Omitting the `file://` prefix results in a parsing parameter error.

   ```
   aws sagemaker create-cluster \
       --cli-input-json file://complete/path/to/create_cluster.json
   ```

   This should return the ARN of the new cluster.
**Important**  
You can use the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) operation to remove a restricted instance group (RIG). When a RIG is scaled down to 0, the FSx for Lustre file system won't be deleted. To completely remove the FSx for Lustre file system, you must remove the RIG entirely.  
Removing a RIG will not delete any artifacts stored in the service-managed Amazon S3 bucket. However, you should ensure all artifacts in the FSx for Lustre file system are fully synchronized to Amazon S3 before removal. We recommend waiting at least 30 minutes after job completion to ensure complete synchronization of all artifacts from the FSx for Lustre file system to the service-managed Amazon S3 bucket.
**Important**  
When using an onboarded On-Demand Capacity Reservation (ODCR), you must map your instance group to the same Availability Zone ID (AZ ID) as the ODCR by setting `OverrideVpcConfig` with a subnet in the matching AZ ID.  
Verify the `OverrideVpcConfig` configuration before deployment to avoid incurring duplicate charges for both the ODCR and On-Demand capacity.
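
Putting the pieces above together, a minimal `create_cluster.json` for a single GPU instance group might look like the following sketch. All names, ARNs, and IDs are placeholders; add `RestrictedInstanceGroups`, deep health checks, and Kubernetes labels as needed:

```shell
# Minimal create_cluster.json sketch (all names, ARNs, and IDs are placeholders).
cat > create_cluster.json <<'EOF'
{
    "ClusterName": "ml-cluster",
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.g5.8xlarge",
        "InstanceCount": 2,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/",
            "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster",
        "ThreadsPerCore": 1
    }],
    "VpcConfig": {
        "SecurityGroupIds": ["sg-1234567890abcdef0"],
        "Subnets": ["subnet-1234567890abcdef0"]
    },
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "arn:aws:eks:us-west-2:111122223333:cluster/my-eks-cluster"
        }
    },
    "NodeRecovery": "Automatic"
}
EOF
# Validate the JSON before submitting the request.
python3 -m json.tool create_cluster.json > /dev/null && echo "create_cluster.json is valid"
# aws sagemaker create-cluster --cli-input-json file://create_cluster.json
```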

# Retrieving SageMaker HyperPod cluster details
<a name="sagemaker-hyperpod-eks-operate-cli-command-cluster-details"></a>

Learn how to retrieve SageMaker HyperPod cluster details using the AWS CLI.

## Describe a cluster
<a name="sagemaker-hyperpod-eks-operate-cli-command-describe-cluster"></a>

Run [describe-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster.html) to check the status of the cluster. You can specify either the name or the ARN of the cluster.

```
aws sagemaker describe-cluster --cluster-name your-hyperpod-cluster
```

After the status of the cluster changes to **InService**, proceed to the next step. You can also use this API to retrieve failure messages from other HyperPod API operations.
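
In a script, you can poll `describe-cluster` until the cluster reaches `InService`. A sketch (the cluster name is a placeholder; the canned `get_status` stands in for the real call so the sketch runs without AWS credentials):

```shell
# Poll the cluster status until it reaches InService.
# Replace the canned echo with the real call:
#   aws sagemaker describe-cluster --cluster-name your-hyperpod-cluster \
#       --query 'ClusterStatus' --output text
get_status() {
    echo "InService"   # canned value so the sketch runs offline
}
until [ "$(get_status)" = "InService" ]; do
    sleep 30
done
echo "Cluster is InService"
```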

## List details of cluster nodes
<a name="sagemaker-hyperpod-eks-operate-cli-command-list-cluster-nodes"></a>

Run [list-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-cluster-nodes.html) to check the key information of the cluster nodes.

```
aws sagemaker list-cluster-nodes --cluster-name your-hyperpod-cluster
```

The response includes an `InstanceId` for each node, which is what you need to log in to the nodes using `aws ssm`.
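
A sketch of pulling the instance IDs out of the response (the response below is abbreviated and the IDs are placeholders):

```shell
# Canned list-cluster-nodes response (abbreviated; IDs are placeholders).
cat > nodes.json <<'EOF'
{"ClusterNodeSummaries": [
  {"InstanceGroupName": "worker-group-1", "InstanceId": "i-111222333444555aa"},
  {"InstanceGroupName": "worker-group-1", "InstanceId": "i-666777888999000bb"}
]}
EOF
# With the real CLI, the equivalent extraction is:
#   aws sagemaker list-cluster-nodes --cluster-name your-hyperpod-cluster \
#       --query 'ClusterNodeSummaries[].InstanceId' --output text
python3 -c 'import json; print(" ".join(n["InstanceId"] for n in json.load(open("nodes.json"))["ClusterNodeSummaries"]))'
```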

## Describe details of a cluster node
<a name="sagemaker-hyperpod-eks-operate-cli-command-describe-cluster-node"></a>

Run [describe-cluster-node](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-cluster-node.html) to retrieve details of a cluster node. You can get the cluster node ID from list-cluster-nodes output. You can specify either the name or the ARN of the cluster.

```
aws sagemaker describe-cluster-node \
    --cluster-name your-hyperpod-cluster \
    --node-id i-111222333444555aa
```

## List clusters
<a name="sagemaker-hyperpod-eks-operate-cli-command-list-clusters"></a>

Run [list-clusters](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-clusters.html) to list all clusters in your account.

```
aws sagemaker list-clusters
```

You can also add flags to filter the list of clusters. To learn more about the underlying API operation and additional flags for filtering, see the [ListClusters](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusters.html) API reference.

# Updating SageMaker HyperPod cluster configuration
<a name="sagemaker-hyperpod-eks-operate-cli-command-update-cluster"></a>

Run [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) to update the configuration of a cluster.

**Note**  
Important considerations:  
You cannot change the EKS cluster that your HyperPod cluster is associated with after the cluster is created. 
If deep health checks are running on the cluster, this API will not function as expected. You might encounter an error message stating that deep health checks are in progress. To update the cluster, you should wait until the deep health checks finish.

1. Create an [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) API request file in JSON format. Make sure that you specify the right cluster name and instance group name to update. For each instance group, you can change the instance type, the number of instances, the lifecycle configuration entrypoint script, and the path to the script.
**Note**  
You can use `UpdateCluster` to scale down or remove entire instance groups from your SageMaker HyperPod cluster. For additional instructions on how to scale down or delete instance groups, see [Scaling down a SageMaker HyperPod cluster](smcluster-scale-down.md).

   1. For `ClusterName`, specify the name of the cluster you want to update.

   1. For `InstanceGroupName`

      1. To update an existing instance group, specify the name of the instance group you want to update.

      1. To add a new instance group, specify a new name not existing in your cluster.

   1. For `InstanceType`

      1. To update an existing instance group, you must match the instance type you initially specified to the group.

      1. To add a new instance group, specify an instance type you want to configure the group with.

      For instance groups that use `InstanceRequirements` instead of `InstanceType`, you can add or remove instance types from the `InstanceTypes` list. However, you cannot remove an instance type that has active nodes running on it. You also cannot switch between `InstanceType` and `InstanceRequirements` when updating an existing instance group. `InstanceType` and `InstanceRequirements` are mutually exclusive.

   1. For `InstanceCount`

      1. To update an existing instance group, specify an integer that corresponds to your desired number of instances. You can provide a higher or lower value (down to 0) to scale the instance group up or down.

      1. To add a new instance group, specify an integer greater than or equal to 1.

   1. For `LifeCycleConfig`, you can change the values of both `SourceS3Uri` and `OnCreate` as needed to update the instance group.

   1. For `ExecutionRole`

      1. For updating an existing instance group, keep using the same IAM role you attached during cluster creation.

      1. For adding a new instance group, specify an IAM role you want to attach.

   1. For `ThreadsPerCore`

      1. For updating an existing instance group, keep using the same value you specified during cluster creation.

      1. For adding a new instance group, you can choose any value from the allowed options per instance type. For more information, search the instance type and see the **Valid threads per core** column in the reference table at [CPU cores and threads per CPU core per instance type](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/cpu-options-supported-instances-values.html) in the *Amazon EC2 User Guide*.

   1. For `OnStartDeepHealthChecks`, add `InstanceStress` and `InstanceConnectivity` to enable [Deep health checks](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md).

   1. For `NodeRecovery`, specify `Automatic` to enable automatic node recovery. SageMaker HyperPod replaces or reboots instances (nodes) when issues are found by the health-monitoring agent.

   The following code snippet is a JSON request file template you can use. For more information about the request syntax and parameters of this API, see the [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) API reference.

   ```
   // update_cluster.json
   {
       // Required
       "ClusterName": "name-of-cluster-to-update",
       // Required
       "InstanceGroups": [{
           "InstanceGroupName": "string",
           "InstanceType": "string",
           "InstanceCount": number,
           "LifeCycleConfig": {
               "SourceS3Uri": "string",
               "OnCreate": "string"
           },
           "ExecutionRole": "string",
           "ThreadsPerCore": number,
           "OnStartDeepHealthChecks": [
               "InstanceStress", "InstanceConnectivity"
           ]
       }],
       "NodeRecovery": "Automatic"
   }
   ```

1. Run the following `update-cluster` command to submit the request. 

   ```
   aws sagemaker update-cluster \
       --cli-input-json file://complete/path/to/update_cluster.json
   ```
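
For example, a request file that keeps an existing instance group and adds a new one might look like the following sketch. All names and the role ARN are placeholders, and the file lists the full set of instance groups you want the cluster to have:

```shell
# update_cluster.json sketch: keep worker-group-1 and add worker-group-2 (placeholder values).
cat > update_cluster.json <<'EOF'
{
    "ClusterName": "ml-cluster",
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.g5.8xlarge",
        "InstanceCount": 2,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/",
            "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster"
    }, {
        "InstanceGroupName": "worker-group-2",
        "InstanceType": "ml.g5.8xlarge",
        "InstanceCount": 1,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://amzn-s3-demo-bucket/Lifecycle-scripts/base-config/",
            "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "arn:aws:iam::111122223333:role/iam-role-for-cluster"
    }],
    "NodeRecovery": "Automatic"
}
EOF
# Validate the JSON before submitting the request.
python3 -m json.tool update_cluster.json > /dev/null && echo "update_cluster.json is valid"
# aws sagemaker update-cluster --cli-input-json file://update_cluster.json
```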

# Updating the SageMaker HyperPod platform software
<a name="sagemaker-hyperpod-eks-operate-cli-command-update-cluster-software"></a>

When you create your SageMaker HyperPod cluster, SageMaker HyperPod selects an Amazon Machine Image (AMI) corresponding to the Kubernetes version of your Amazon EKS cluster.

Run [update-cluster-software](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster-software.html) to update existing clusters with software and security patches provided by the SageMaker HyperPod service. For `--cluster-name`, specify either the name or the ARN of the cluster to update.

**Important**  
When this API is called, SageMaker HyperPod doesn’t drain or redistribute the jobs (Pods) running on the nodes. Make sure to check if there are any jobs running on the nodes before calling this API.
The patching process replaces the root volume with the updated AMI, which means that your previous data stored in the instance root volume will be lost. Make sure that you back up your data from the instance root volume to Amazon S3 or Amazon FSx for Lustre.
All cluster nodes experience downtime (nodes appear as `<NotReady>` in the output of `kubectl get node`) while the patching is in progress. We recommend that you terminate all workloads before patching and resume them after the patch completes.   
If the security patch fails, you can retrieve failure messages by running the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html) API as instructed at [Describe a cluster](sagemaker-hyperpod-eks-operate-cli-command-cluster-details.md#sagemaker-hyperpod-eks-operate-cli-command-describe-cluster).

```
aws sagemaker update-cluster-software --cluster-name your-hyperpod-cluster
```

**Rolling upgrades with flexible instance groups**  
For instance groups that use `InstanceRequirements` with multiple instance types, rolling upgrades spread each instance type proportionally across batches. For example, if an instance group has 100 instances (10 P5 and 90 G6) and you configure a 10% batch size, each batch contains 1 P5 instance and 9 G6 instances.

When calling the `UpdateClusterSoftware` API, SageMaker HyperPod updates the Kubernetes version of the nodes by selecting the latest [SageMaker HyperPod DLAMI](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami) based on the Kubernetes version of your Amazon EKS cluster. It then runs the lifecycle scripts in the Amazon S3 bucket that you specified during cluster creation or update.

You can verify the kubelet version of a node by running the `kubectl describe node` command.
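
The kubelet version lives in the node object's `status.nodeInfo` field. A sketch of extracting it, demonstrated against a canned node object so it runs without cluster access (the version string is a placeholder):

```shell
# Canned node object (abbreviated; the version string is a placeholder).
cat > node.json <<'EOF'
{"status": {"nodeInfo": {"kubeletVersion": "v1.30.4-eks-abc1234"}}}
EOF
# Against a real cluster, the equivalent query is:
#   kubectl get node NODE_NAME -o jsonpath='{.status.nodeInfo.kubeletVersion}'
python3 -c 'import json; print(json.load(open("node.json"))["status"]["nodeInfo"]["kubeletVersion"])'
```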

The Kubernetes version of SageMaker HyperPod cluster nodes does not automatically update when you update your Amazon EKS cluster version. After updating the Kubernetes version for your Amazon EKS cluster, you must use the `UpdateClusterSoftware` API to update your SageMaker HyperPod cluster nodes to the same Kubernetes version.

We recommend that you update your SageMaker HyperPod cluster after updating your Amazon EKS cluster, and that you avoid having more than one version difference between the Amazon EKS cluster version and the SageMaker HyperPod cluster node version.

The SageMaker HyperPod service team regularly rolls out new [SageMaker HyperPod DLAMIs](sagemaker-hyperpod-ref.md#sagemaker-hyperpod-ref-hyperpod-ami) to enhance security and improve the user experience. We recommend that you always update to the latest SageMaker HyperPod DLAMI. For future DLAMI updates for security patching, follow the [Amazon SageMaker HyperPod release notes](sagemaker-hyperpod-release-notes.md).

**Note**  
You can only run this API programmatically. The patching functionality is not implemented in the SageMaker HyperPod console UI.

# Accessing SageMaker HyperPod cluster nodes
<a name="sagemaker-hyperpod-eks-operate-access-through-terminal"></a>

You can directly access the nodes of a SageMaker HyperPod cluster in service using the AWS CLI commands for AWS Systems Manager (SSM). Run `aws ssm start-session` with the host name of the node in format of `sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id]`. You can retrieve the cluster ID, the instance ID, and the instance group name from the [SageMaker HyperPod console](sagemaker-hyperpod-operate-slurm-console-ui.md#sagemaker-hyperpod-operate-slurm-console-ui-view-details-of-clusters) or by running `describe-cluster` and `list-cluster-nodes` from the [AWS CLI commands for SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-list-cluster-nodes). For example, if your cluster ID is `aa11bbbbb222`, the cluster node name is `controller-group`, and the cluster node ID is `i-111222333444555aa`, the SSM `start-session` command should be the following.

**Note**  
If you haven't set up AWS Systems Manager, follow the instructions provided at [Setting up AWS Systems Manager and Run As for cluster user access control](sagemaker-hyperpod-prerequisites.md#sagemaker-hyperpod-prerequisites-ssm).

```
$ aws ssm start-session \
    --target sagemaker-cluster:aa11bbbbb222_controller-group-i-111222333444555aa \
    --region us-west-2
Starting session with SessionId: s0011223344aabbccdd
root@ip-111-22-333-444:/usr/bin#
```
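
The target string can be assembled from the pieces returned by `describe-cluster` and `list-cluster-nodes`. A sketch using the placeholder values from the example above:

```shell
# Assemble the SSM target string from its parts (values are placeholders).
CLUSTER_ID="aa11bbbbb222"
INSTANCE_GROUP="controller-group"
INSTANCE_ID="i-111222333444555aa"
TARGET="sagemaker-cluster:${CLUSTER_ID}_${INSTANCE_GROUP}-${INSTANCE_ID}"
echo "$TARGET"
# Then connect with:
#   aws ssm start-session --target "$TARGET" --region us-west-2
```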

# Scaling down a SageMaker HyperPod cluster
<a name="smcluster-scale-down"></a>

You can scale down the number of instances running on your Amazon SageMaker HyperPod cluster. You might want to scale down a cluster for various reasons, such as reduced resource utilization or cost optimization.

The following page outlines two main approaches to scaling down:
+ **Scale down at the instance group level:** This approach uses the `UpdateCluster` API, with which you can:
  + Scale down the instance counts for specific instance groups independently. SageMaker AI handles the termination of nodes in a way that reaches the new target instance counts you've set for each group. See [Scale down an instance group](#smcluster-scale-down-updatecluster).
  + Completely delete instance groups from your cluster. See [Delete instance groups](#smcluster-remove-instancegroup).
+ **Scale down at the instance level:** This approach uses the `BatchDeleteClusterNodes` API, with which you can specify the individual nodes you want to terminate. See [Scale down at the instance level](#smcluster-scale-down-batchdelete).

**Note**  
When scaling down at the instance level with `BatchDeleteClusterNodes`, you can only terminate a maximum of 99 instances at a time. `UpdateCluster` supports terminating any number of instances.

## Important considerations
<a name="smcluster-scale-down-considerations"></a>
+ When scaling down a cluster, you should ensure that the remaining resources are sufficient to handle your workload and that any necessary data migration or rebalancing is properly handled to avoid disruptions. 
+ Make sure to back up your data to Amazon S3 or an FSx for Lustre file system before invoking the API on a worker node group. This can help prevent any potential data loss from the instance root volume. For more information about backup, see [Use the backup script provided by SageMaker HyperPod](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software-backup).
+ To invoke these scale-down APIs on an existing cluster, you must first patch the cluster by running the [UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API. For more information about patching a cluster, see [Update the SageMaker HyperPod platform software of a cluster](sagemaker-hyperpod-operate-slurm-cli-command.md#sagemaker-hyperpod-operate-slurm-cli-command-update-cluster-software).
+ Metering and billing for on-demand instances stops automatically after scale-down. To stop metering for scaled-down reserved instances, reach out to your AWS account team for support.
+ You can use the released capacity from the scaled-down reserved instances to scale up another SageMaker HyperPod cluster.

## Scale down at the instance group level
<a name="smcluster-scale-down-or-delete"></a>

The [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) operation allows you to make changes to the configuration of your SageMaker HyperPod cluster, such as scaling down the number of instances of an instance group or removing entire instance groups. This can be useful when you want to adjust the resources allocated to your cluster based on changes in your workload, optimize costs, or change the instance type of an instance group.

### Scale down an instance group
<a name="smcluster-scale-down-updatecluster"></a>

Use this approach when you have an instance group that is idle and it's safe to terminate any of the instances for scaling down. When you submit an `UpdateCluster` request to scale down, HyperPod randomly chooses instances for termination and scales down to the specified number of nodes for the instance group.

**Scale-down behavior with flexible instance groups**  
For instance groups that use `InstanceRequirements` with multiple instance types, HyperPod terminates the lowest-priority instance types first during scale-down. The priority is determined by the order of instance types in the `InstanceTypes` list, where the first type has the highest priority. This protects higher-priority instances, which are typically higher-performance, during scale-down operations.
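
As an illustration of this ordering (a model of the documented behavior, not the service's actual implementation), the following sketch shows which instances would be chosen first; the function name and instance data are hypothetical.

```python
def pick_terminations(instances, instance_types_priority, count):
    """Model of the documented ordering: instances whose type appears
    later in the priority list are terminated first."""
    rank = {t: i for i, t in enumerate(instance_types_priority)}
    # Highest rank index = lowest priority; those sort to the front.
    ordered = sorted(instances, key=lambda inst: rank[inst["InstanceType"]], reverse=True)
    return [inst["InstanceId"] for inst in ordered[:count]]

instances = [
    {"InstanceId": "i-aaa", "InstanceType": "ml.p5.48xlarge"},
    {"InstanceId": "i-bbb", "InstanceType": "ml.g5.8xlarge"},
    {"InstanceId": "i-ccc", "InstanceType": "ml.g5.8xlarge"},
]
# With ml.p5.48xlarge listed first (highest priority), the g5 nodes go first.
picked = pick_terminations(instances, ["ml.p5.48xlarge", "ml.g5.8xlarge"], 2)
```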

**Note**  
When you scale the number of instances in an instance group down to 0, all the instances within that group will be terminated. However, the instance group itself will still exist as part of the SageMaker HyperPod cluster. You can scale the instance group back up at a later time, using the same instance group configuration.   
Alternatively, you can choose to remove an instance group permanently. For more information, see [Delete instance groups](#smcluster-remove-instancegroup).

**To scale down with `UpdateCluster`**

1. Follow the steps outlined in [Updating SageMaker HyperPod cluster configuration](sagemaker-hyperpod-eks-operate-cli-command-update-cluster.md). When you reach step **1.d** where you specify the **InstanceCount** field, enter a number that is smaller than the current number of instances to scale down the cluster.

1. Run the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) AWS CLI command to submit your request.

The following is an example of an `UpdateCluster` JSON object. Consider the case where your instance group currently has 2 running instances. If you set the **InstanceCount** field to 1, as shown in the example, then HyperPod randomly selects one of the instances and terminates it.

```
{
  "ClusterName": "name-of-cluster-to-update",
  "InstanceGroups": [
    {
      "InstanceGroupName": "training-instances",
      "InstanceType": "instance-type",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://amzn-s3-demo-bucket/training-script.py",
        "OnCreate": "s3://amzn-s3-demo-bucket/setup-script.sh"
      },
      "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerRole",
      "ThreadsPerCore": number-of-threads,
      "OnStartDeepHealthChecks": [
        "InstanceStress",
        "InstanceConnectivity"
      ]
    }
  ],
  "NodeRecovery": "Automatic"
}
```

### Delete instance groups
<a name="smcluster-remove-instancegroup"></a>

You can use the [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) operation to remove entire instance groups from your SageMaker HyperPod cluster when they are no longer needed. This goes beyond simple scaling down, allowing you to completely eliminate specific instance groups from your cluster's configuration.

**Note**  
When removing an instance group:  
+ All instances within the targeted group are terminated.
+ The entire group configuration is deleted from the cluster.
+ Any workloads running on that instance group are stopped.

**To delete instance groups with `UpdateCluster`**

1. When following the steps outlined in [Updating SageMaker HyperPod cluster configuration](sagemaker-hyperpod-eks-operate-cli-command-update-cluster.md):

   1. Set the optional `InstanceGroupsToDelete` parameter in your `UpdateCluster` JSON to the list of instance group names that you want to delete.

   1. When you specify the `InstanceGroups` list, ensure that it no longer contains the specifications of the instance groups you are removing.

1. Run the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) AWS CLI command to submit your request.
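
To illustrate the shape of the resulting request, the following sketch builds an `UpdateCluster` payload from an existing group list. The helper name and the minimal group dictionaries are illustrative; a real request includes the full group specifications.

```python
def build_delete_groups_request(cluster_name, current_groups, groups_to_delete):
    """Drop the deleted groups' specs from InstanceGroups and list their
    names in InstanceGroupsToDelete, mirroring the steps above."""
    remaining = [g for g in current_groups
                 if g["InstanceGroupName"] not in groups_to_delete]
    if not remaining:
        raise ValueError("A HyperPod cluster must keep at least one instance group.")
    return {
        "ClusterName": cluster_name,
        "InstanceGroups": remaining,
        "InstanceGroupsToDelete": list(groups_to_delete),
    }

groups = [{"InstanceGroupName": n} for n in
          ("training", "prototype-training", "inference-serving")]
request = build_delete_groups_request("my-cluster", groups, ["prototype-training"])
```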

**Important**  
+ Your SageMaker HyperPod cluster must always maintain at least one instance group.
+ Ensure all critical data is backed up before removal.
+ The removal process cannot be undone.

The following is an example of an `UpdateCluster` JSON object. Consider the case where a cluster currently has three instance groups: a *training*, a *prototype-training*, and an *inference-serving* group. You want to delete the *prototype-training* group.

```
{
  "ClusterName": "name-of-cluster-to-update",
  "InstanceGroups": [
    {
      "InstanceGroupName": "training",
      "InstanceType": "instance-type",
      "InstanceCount": number-of-instances,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://amzn-s3-demo-bucket/training-script.py",
        "OnCreate": "s3://amzn-s3-demo-bucket/setup-script.sh"
      },
      "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerRole",
      "ThreadsPerCore": number-of-threads,
      "OnStartDeepHealthChecks": [
        "InstanceStress",
        "InstanceConnectivity"
      ]
    },
    {
      "InstanceGroupName": "inference-serving",
      "InstanceType": "instance-type",
      "InstanceCount": 2,
      [...]
    }
  ],
  "InstanceGroupsToDelete": [ "prototype-training" ],
  "NodeRecovery": "Automatic"
}
```

## Scale down at the instance level
<a name="smcluster-scale-down-batchdelete"></a>

The `BatchDeleteClusterNodes` operation allows you to scale down a SageMaker HyperPod cluster by specifying the individual nodes you want to terminate. `BatchDeleteClusterNodes` provides more granular control for targeted node removal and cluster optimization. For example, you might use `BatchDeleteClusterNodes` to delete targeted nodes for maintenance, rolling upgrades, or rebalancing resources geographically.

**API request and response**

When you submit a `BatchDeleteClusterNodes` request, SageMaker HyperPod deletes nodes by their instance IDs. The API accepts a request with the cluster name and a list of node IDs to be deleted. 

The response includes two sections: 
+ `Failed`: A list of errors of type [`BatchDeleteClusterNodesError`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BatchDeleteClusterNodesError.html), one per instance ID.
+ `Successful`: The list of instance IDs that were successfully terminated.

**Validation and error handling**

The API performs various validations, such as:
+ Verifying the node ID format (prefix of `i-` and Amazon EC2 instance ID structure). 
+ Checking the node list length, with a limit of 99 or fewer node IDs in a single `BatchDeleteClusterNodes` request.
+ Ensuring a valid SageMaker HyperPod cluster with the input cluster-name is present and that no cluster-level operations (update, system update, patching, or deletion) are in progress.
+ Handling cases where instances are not found, have invalid status, or are in use.
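
You can mirror some of these checks client-side before submitting a request. The sketch below is only an approximation: the regular expression approximates the EC2 instance ID structure and is not the service's exact validation.

```python
import re

NODE_ID_PATTERN = re.compile(r"^i-[0-9a-f]{8,17}$")  # approximate EC2 ID shape
MAX_NODES_PER_REQUEST = 99

def validate_node_ids(node_ids):
    """Return a list of client-side validation errors (empty if none)."""
    errors = []
    if len(node_ids) > MAX_NODES_PER_REQUEST:
        errors.append(
            f"Too many node IDs: {len(node_ids)} > {MAX_NODES_PER_REQUEST}")
    for node_id in node_ids:
        if not NODE_ID_PATTERN.match(node_id):
            errors.append(f"Invalid node ID format: {node_id}")
    return errors
```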

**API Response Codes**
+ The API returns a `200` status code for fully successful requests (for example, all input nodes pass validation) and for partially successful requests (for example, some input nodes fail validation).
+ If the entire request fails (for example, all input nodes fail validation), the API returns a `400` Bad Request response with the appropriate error messages and error codes.

**Example**

The following is an example of **scaling down a cluster at the instance level** using the AWS CLI:

```
aws sagemaker batch-delete-cluster-nodes --cluster-name "cluster-name" --node-ids '["i-111112222233333", "i-444445555566666"]'
```

# Deleting a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-eks-operate-cli-command-delete-cluster"></a>

Run [delete-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-cluster.html) to delete a cluster. You can specify either the name or the ARN of the cluster.

```
aws sagemaker delete-cluster --cluster-name your-hyperpod-cluster
```

This API cleans up only the SageMaker HyperPod resources; it doesn't delete any resources associated with the EKS cluster, such as the Amazon EKS cluster itself, EKS Pod Identities, Amazon FSx volumes, EKS add-ons, or the initial configuration you added to your EKS cluster. If you want to clean up all resources, make sure that you also clean up the EKS resources separately.

Make sure that you first delete the SageMaker HyperPod resources, followed by the EKS resources. Performing the deletion in the reverse order may result in lingering resources.

**Important**  
When this API is called, SageMaker HyperPod doesn’t drain or redistribute the jobs (Pods) running on the nodes. Make sure to check if there are any jobs running on the nodes before calling this API.
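
One way to perform that check is to inspect pod phases before calling `delete-cluster`. The following sketch parses the JSON output of `kubectl get pods --all-namespaces -o json`; the helper name is illustrative.

```python
import json

def count_active_pods(kubectl_json):
    """Count pods that have not finished, from `kubectl get pods
    --all-namespaces -o json` output. Pods in the Succeeded or Failed
    phase are done; everything else may still be running work."""
    pods = json.loads(kubectl_json)["items"]
    finished = {"Succeeded", "Failed"}
    return sum(1 for pod in pods if pod["status"]["phase"] not in finished)

# Sample output with one Running, one Succeeded, and one Pending pod.
sample = json.dumps({"items": [
    {"status": {"phase": "Running"}},
    {"status": {"phase": "Succeeded"}},
    {"status": {"phase": "Pending"}},
]})
active = count_active_pods(sample)
```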

# HyperPod managed tiered checkpointing
<a name="managed-tier-checkpointing"></a>

This section explains how managed tiered checkpointing works and the benefits it provides for large-scale model training.

Amazon SageMaker HyperPod managed tiered checkpointing helps you train large-scale generative AI models more efficiently. It uses multiple storage tiers, including your cluster’s CPU memory. This approach reduces your time to recovery and minimizes loss in training progress. It also uses underutilized memory resources in your training infrastructure.

Managed tiered checkpointing saves checkpoints to memory at a high frequency and periodically persists them to durable storage, maintaining both performance and reliability during your training process.
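
The training-loop example later in this guide uses an in-memory frequency of 10 steps and an S3 frequency of 50 steps. That schedule can be sketched as a pure function; the name `checkpoint_tiers` is illustrative, not part of the library.

```python
def checkpoint_tiers(step, memory_freq=10, s3_freq=50):
    """Return which tiers a checkpoint at this step targets: memory at a
    high frequency, durable storage (S3) at a lower frequency."""
    tiers = []
    if step % memory_freq == 0:
        tiers.append("memory")
    if step % s3_freq == 0:
        tiers.append("s3")
    return tiers
```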

This guide covers how to set up, configure, and use managed tiered checkpointing with PyTorch frameworks on Amazon EKS HyperPod clusters.

## How managed tiered checkpointing works
<a name="managed-tier-checkpointing-works"></a>

Managed tiered checkpointing uses a multi-tier storage approach. CPU memory serves as the primary tier to store model checkpoints. Secondary tiers include persistent storage options like Amazon S3.

When you save a checkpoint, the system stores it in allocated memory space across your cluster nodes. It automatically replicates data across adjacent compute nodes for enhanced reliability. This replication strategy protects against single or multiple node failures while providing fast access for recovery operations.

The system also periodically saves checkpoints to persistent storage according to your configuration. This ensures long-term durability of your training progress.

Key components include:
+ **Memory management system**: A memory management daemon that provides disaggregated memory as a service for checkpoint storage
+ **HyperPod Python library**: Interfaces with the disaggregated storage APIs and provides utilities for saving, loading, and managing checkpoints across tiers
+ **Checkpoint replication**: Automatically replicates checkpoints across multiple nodes for fault tolerance

The system integrates seamlessly with PyTorch training loops through simple API calls. It requires minimal changes to your existing code.

## Benefits
<a name="managed-tier-checkpointing-benefits"></a>

Managed tiered checkpointing delivers several advantages for large-scale model training:
+ **Improved usability**: Manages checkpoint save, replication, persistence, and recovery
+ **Faster checkpoint operations**: Memory-based storage provides faster save and load times compared to disk-based checkpointing, leading to faster recovery
+ **Fault tolerance**: Automatic checkpoint replication across nodes protects against hardware node failures
+ **Minimal code changes**: Simple API integration requires only minor modifications to existing training scripts
+ **Improved training throughput**: Reduced checkpoint overhead means more time spent on actual training

**Topics**
+ [How managed tiered checkpointing works](#managed-tier-checkpointing-works)
+ [Benefits](#managed-tier-checkpointing-benefits)
+ [Set up managed tiered checkpointing](managed-tier-checkpointing-setup.md)
+ [Removing managed tiered checkpointing](managed-tier-checkpointing-remove.md)
+ [Security considerations for managed tiered checkpointing](managed-tier-security-considerations.md)

# Set up managed tiered checkpointing
<a name="managed-tier-checkpointing-setup"></a>

This section describes the setup process for managed tiered checkpointing for Amazon SageMaker HyperPod. You’ll learn how to enable the capability on your cluster and implement checkpointing in your training code.

**Topics**
+ [Prerequisites](#managed-tier-checkpointing-setup-prerequisites)
+ [Step 1: Enable managed tiered checkpointing for your cluster](#managed-tier-checkpointing-setup-step-enable-for-cluster)
+ [Step 2: Install the Python library in your training image](#managed-tier-checkpointing-setup-step-install-library)
+ [Step 3: Save checkpoints in your training loop](#managed-tier-checkpointing-setup-step-save-checkpoint-in-loop)
+ [Step 4: Load checkpoints for recovery](#managed-tier-checkpointing-setup-step-load-checkpoint)
+ [Validate your managed tiered checkpointing operations](#managed-tier-checkpointing-setup-validation)

## Prerequisites
<a name="managed-tier-checkpointing-setup-prerequisites"></a>

Before setting up managed tiered checkpointing, ensure you have:
+ An Amazon EKS HyperPod cluster with sufficient CPU memory available for checkpoint allocation
+ PyTorch training workloads or PyTorch Distributed Checkpoint (DCP) jobs (both are supported)
+ Appropriate IAM permissions for cluster management, including:
  + Amazon CloudWatch and Amazon S3 write permissions for the training pod to read/write checkpoints and push metrics
  + These permissions can be configured via [EKS OIDC setup](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html)

## Step 1: Enable managed tiered checkpointing for your cluster
<a name="managed-tier-checkpointing-setup-step-enable-for-cluster"></a>

**Important**  
You must opt in to use managed tiered checkpointing.

Enable managed tiered checkpointing through the HyperPod APIs when creating or updating your cluster. The service automatically installs the memory management system when you specify the `TieredStorageConfig` parameter.

For new clusters, you can use the [create-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-cluster.html) AWS CLI command.

```
aws sagemaker create-cluster \
    --cluster-name cluster-name \
    --orchestrator "Eks={ClusterArn=eks-cluster-arn}" \
    --instance-groups '[{
        "InstanceGroupName": "instance-group-name",
        "InstanceType": "instance-type",
        "InstanceCount": instance-count,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3-path-to-lifecycle-scripts",
            "OnCreate": "lifecycle-script-name"
        },
        "ExecutionRole": "instance-group-iam-role",
        "ThreadsPerCore": threads-per-core,
        "InstanceStorageConfigs": [
            { "EbsVolumeConfig": {"VolumeSizeInGB": volume-size} }
        ]
    }]' \
    --vpc-config '{
        "SecurityGroupIds": ["security-group-ids"],
        "Subnets": ["subnets"]
    }' \
    --tiered-storage-config '{
        "Mode": "Enable"
    }'
```

The optional `InstanceMemoryAllocationPercentage` parameter of `TieredStorageConfig` specifies the percentage (an integer) of cluster memory to allocate for checkpointing. The valid range is 20-100.
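
As a quick model of that range check (a sketch, not part of the API), you can compute the per-instance checkpoint allocation like this:

```python
def checkpoint_memory_bytes(instance_memory_bytes, percentage):
    """Validate the 20-100 range documented for
    InstanceMemoryAllocationPercentage and compute the memory
    reserved for checkpointing on one instance."""
    if not 20 <= percentage <= 100:
        raise ValueError(
            "InstanceMemoryAllocationPercentage must be between 20 and 100")
    return instance_memory_bytes * percentage // 100

# Example: 20% of a 100 GiB instance.
reserved = checkpoint_memory_bytes(100 * 1024**3, 20)
```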

## Step 2: Install the Python library in your training image
<a name="managed-tier-checkpointing-setup-step-install-library"></a>

Install the [Amazon SageMaker checkpointing library](https://pypi.org/project/amzn-sagemaker-checkpointing/) and its dependencies in your training image by adding it to your Dockerfile:

```
# Add this line to your training image Dockerfile
RUN pip install amzn-sagemaker-checkpointing s3torchconnector tenacity torch boto3
```

## Step 3: Save checkpoints in your training loop
<a name="managed-tier-checkpointing-setup-step-save-checkpoint-in-loop"></a>

In your training loop, you can asynchronously save checkpoints using PyTorch Distributed Checkpoint (DCP). The following example shows how.

```
import os
import time

import torch
import torch.distributed as dist
from torch.distributed.checkpoint import async_save, load
from amzn_sagemaker_checkpointing.checkpointing.filesystem.filesystem import (
    SageMakerTieredStorageWriter,
    SageMakerTieredStorageReader
)
# The import path for SageMakerCheckpointConfig may vary by library version.
from amzn_sagemaker_checkpointing.config.sagemaker_checkpoint_config import (
    SageMakerCheckpointConfig
)

# Initialize distributed training
dist.init_process_group(backend="nccl")

# Configure checkpointing
checkpoint_config = SageMakerCheckpointConfig(
    # Unique ID for your training job 
    # Allowed characters in ID include: alphanumeric, hyphens, and underscores
    namespace=os.environ.get('TRAINING_JOB_NAME', f'job-{int(time.time())}'),

    # Number of distributed processes/available GPUs
    world_size=dist.get_world_size(),

    # S3 storage location, required for SageMakerTieredStorageReader for read fallbacks
    # Required for SageMakerTieredStorageWriter when save_to_s3 is True
    s3_tier_base_path="s3://my-bucket/checkpoints"
)

# Your model and optimizer
model = MyModel()
optimizer = torch.optim.AdamW(model.parameters())

# Training loop
future = None
epoch = 0  # tracked by your training logic
in_memory_ckpt_freq = 10
s3_ckpt_freq = 50

for training_step in range(1000):
    # ... training code ...
    
    # Save checkpoint
    if (training_step % in_memory_ckpt_freq == 0 or 
        training_step % s3_ckpt_freq == 0):
        # Create state dictionary
        state_dict = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": training_step,
            "epoch": epoch
        }
        
        # Create storage writer for current step
        checkpoint_config.save_to_s3 = training_step % s3_ckpt_freq == 0
        storage_writer = SageMakerTieredStorageWriter(
            checkpoint_config=checkpoint_config,
            step=training_step
        )

        # wait for previous checkpoint to get completed
        if future is not None:
            exc = future.exception()
            if exc:
                print(f"Failure in saving previous checkpoint:{str(exc)}")
                # Handle failures as required
            else:
                result = future.result()
                # Process results from save, if required
        
        # Async save checkpoint using PyTorch DCP
        future = async_save(state_dict=state_dict, storage_writer=storage_writer)
        
        # Continue training while checkpoint saves in background
```

## Step 4: Load checkpoints for recovery
<a name="managed-tier-checkpointing-setup-step-load-checkpoint"></a>

The following is an example of loading a checkpoint.

```
# Create state dictionary template
state_dict = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "step": 0,
    "epoch": 0
}

# Load latest checkpoint
storage_reader = SageMakerTieredStorageReader(checkpoint_config=checkpoint_config)
load(state_dict, storage_reader=storage_reader)

# Load specific checkpoint step
storage_reader = SageMakerTieredStorageReader(
    checkpoint_config=checkpoint_config, 
    step=500  # Omit step to load the latest available checkpoint.
)
try:
    load(state_dict, storage_reader=storage_reader)
except BaseException as e:
    print(f"Checkpoint load failed: {str(e)}")
    # Add additional exception handling
```

## Validate your managed tiered checkpointing operations
<a name="managed-tier-checkpointing-setup-validation"></a>

You can validate your managed tiered checkpointing operations with logs.

**Custom logging (optional)**

You can integrate checkpointing logs with other logs by passing a custom logger to the library. For example, you can add a custom logger to your training code so that all logs from the library are also collected in the training logger.

**Enhanced service logging (optional)**

For enhanced debugging and service visibility, you can mount the checkpointing log path `/var/log/sagemaker_checkpointing` from within your pod to the path `/var/logs/sagemaker_checkpointing` on your host. This collects the library-specific logs separately, giving the service team enhanced visibility for debugging and support.
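
Assuming a standard Kubernetes `hostPath` volume, the mount described above could look like the following Pod spec fragment; the pod and image names are illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-pod            # illustrative name
spec:
  containers:
    - name: training
      image: my-training-image  # illustrative image
      volumeMounts:
        - name: checkpoint-logs
          # Library log path inside the pod
          mountPath: /var/log/sagemaker_checkpointing
  volumes:
    - name: checkpoint-logs
      hostPath:
        # Host path described in this section
        path: /var/logs/sagemaker_checkpointing
        type: DirectoryOrCreate
```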

# Removing managed tiered checkpointing
<a name="managed-tier-checkpointing-remove"></a>

This section explains how to disable managed tiered checkpointing when you no longer need it.

To disable managed tiered checkpointing, use the [update-cluster](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/update-cluster.html) AWS CLI command to update your cluster configuration:

```
aws sagemaker update-cluster \
    --cluster-name cluster-name \
    --tiered-storage-config '{ "Mode": "Disable" }'
```

This removes the memory management daemon from your cluster. The daemon is implemented as a standard Kubernetes DaemonSet and follows standard Kubernetes lifecycle management.

# Security considerations for managed tiered checkpointing
<a name="managed-tier-security-considerations"></a>

This section covers important security considerations when using managed tiered checkpointing. It includes Python pickle usage, Amazon S3 encryption, and network endpoint security.

**Python pickle usage**

Managed tiered checkpointing uses Python’s pickle module to deserialize checkpoint data stored in Amazon S3. This implementation has important security implications:
+ **Extended trust boundary**: When using managed tiered checkpointing with Amazon S3, the Amazon S3 bucket becomes part of your cluster’s trust boundary.
+ **Code execution risk**: Python’s pickle module can execute arbitrary code during deserialization. If an unauthorized user gains write access to your checkpoint Amazon S3 bucket, they could potentially craft malicious pickle data that executes when loaded by managed tiered checkpointing.

**Best practices for Amazon S3 storage**

When using managed tiered checkpointing with Amazon S3 storage:
+ **Restrict Amazon S3 bucket access**: Ensure that only authorized users and roles associated with your training cluster have access to the Amazon S3 bucket used for checkpointing.
+ **Implement bucket policies**: Configure appropriate bucket policies to prevent unauthorized access or modifications.
+ **Validate access patterns**: Implement logging for validating access patterns to your checkpoint Amazon S3 buckets.
+ **Validate bucket names**: Use caution with bucket name selection to avoid potential bucket hijacking.

**Network endpoints**

Managed tiered checkpointing enables network endpoints on each of your compute nodes on the following ports: 9200/TCP, 9209/UDP, 9210/UDP, 9219/UDP, 9220/UDP, 9229/UDP, 9230/UDP, 9239/UDP, 9240/UDP. These ports are necessary for the checkpointing service to function and maintain data synchronization.

By default, SageMaker’s network configuration restricts access to these endpoints for security purposes. We recommend that you maintain these default restrictions.

When configuring your network settings for your nodes and VPC, follow AWS best practices for VPCs, security groups, and ACLs. For more information, see the following:
+ [Amazon SageMaker HyperPod prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-prerequisites.html#sagemaker-hyperpod-prerequisites-optional-vpcCluster)
+ [VPC security best practices](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-best-practices.html)

# SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance"></a>

SageMaker HyperPod task governance is a robust management system designed to streamline resource allocation and ensure efficient utilization of compute resources across teams and projects for your Amazon EKS clusters. This provides administrators with the capability to set:
+ Priority levels for various tasks
+ Compute allocation for each team
+ How each team lends and borrows idle compute
+ Whether a team can preempt its own tasks

HyperPod task governance also provides Amazon EKS cluster observability, offering real-time visibility into cluster capacity. This includes compute availability and usage, team allocation and utilization, and task run and wait time information, setting you up for informed decision-making and proactive resource management.

The following sections cover how to set up, understand key concepts, and use HyperPod task governance for your Amazon EKS clusters.

**Topics**
+ [Setup for SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md)
+ [Dashboard](sagemaker-hyperpod-eks-operate-console-ui-governance-metrics.md)
+ [Tasks](sagemaker-hyperpod-eks-operate-console-ui-governance-tasks.md)
+ [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md)
+ [Example HyperPod task governance AWS CLI commands](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md)
+ [Troubleshoot](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md)
+ [Attribution document for Amazon SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-attributions.md)

# Setup for SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-setup"></a>

The following section provides information on how to get set up with the Amazon CloudWatch Observability EKS and SageMaker HyperPod task governance add-ons.

Ensure that you have the minimum permission policy for HyperPod cluster administrators with Amazon EKS, in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin). This includes permissions to run the SageMaker HyperPod core APIs and manage SageMaker HyperPod clusters within your AWS account, performing the tasks in [Managing SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-operate.md). 

**Topics**
+ [Dashboard setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard.md)
+ [Task governance setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-task-governance.md)

# Dashboard setup
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard"></a>

Use the following information to get set up with the Amazon SageMaker HyperPod Amazon CloudWatch Observability EKS add-on. This sets you up with a detailed visual dashboard that provides a view into metrics for your EKS cluster hardware, team allocation, and tasks.

If you have issues setting up, see [Troubleshoot](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md) for known troubleshooting solutions.

**Topics**
+ [HyperPod Amazon CloudWatch Observability EKS add-on prerequisites](#hp-eks-dashboard-prerequisites)
+ [HyperPod Amazon CloudWatch Observability EKS add-on setup](#hp-eks-dashboard-setup)

## HyperPod Amazon CloudWatch Observability EKS add-on prerequisites
<a name="hp-eks-dashboard-prerequisites"></a>

The following section includes the prerequisites needed before installing the Amazon EKS Observability add-on.
+ Ensure that you have the minimum permission policy for HyperPod cluster administrators, in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin).
+ Attach the `CloudWatchAgentServerPolicy` IAM policy to your worker nodes. To do so, enter the following command. Replace `my-worker-node-role` with the IAM role used by your Kubernetes worker nodes.

  ```
  aws iam attach-role-policy \
  --role-name my-worker-node-role \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
  ```

## HyperPod Amazon CloudWatch Observability EKS add-on setup
<a name="hp-eks-dashboard-setup"></a>

Use the following options to set up the Amazon SageMaker HyperPod Amazon CloudWatch Observability EKS add-on.

------
#### [ Setup using the SageMaker AI console ]

The following permissions are required for setup and visualizing the HyperPod task governance dashboard. This section expands upon the permissions listed in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin). 

To manage task governance, use the sample policy:

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListClusters",
                "sagemaker:DescribeCluster",
                "sagemaker:ListComputeQuotas",
                "sagemaker:CreateComputeQuota",
                "sagemaker:UpdateComputeQuota",
                "sagemaker:DescribeComputeQuota",
                "sagemaker:DeleteComputeQuota",
                "sagemaker:ListClusterSchedulerConfigs",
                "sagemaker:DescribeClusterSchedulerConfig",
                "sagemaker:CreateClusterSchedulerConfig",
                "sagemaker:UpdateClusterSchedulerConfig",
                "sagemaker:DeleteClusterSchedulerConfig",
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:DescribeAddon",
                "eks:DescribeCluster",
                "eks:DescribeAccessEntry",
                "eks:ListAssociatedAccessPolicies",
                "eks:AssociateAccessPolicy",
                "eks:DisassociateAccessPolicy"
            ],
            "Resource": "*"
        }
    ]
}
```

------

To grant permissions to manage Amazon CloudWatch Observability Amazon EKS and view the HyperPod cluster dashboard through the SageMaker AI console, use the sample policy below:

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:UpdateAddon",
                "eks:DescribeAddon",
                "eks:DescribeAddonVersions",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "sagemaker:ListComputeQuotas",
                "sagemaker:DescribeComputeQuota",
                "sagemaker:ListClusterSchedulerConfigs",
                "sagemaker:DescribeClusterSchedulerConfig",
                "eks:DescribeCluster",
                "cloudwatch:GetMetricData",
                "eks:AccessKubernetesApi"
            ],
            "Resource": "*"
        }
    ]
}
```

------

Navigate to the **Dashboard** tab in the SageMaker HyperPod console to install the Amazon CloudWatch Observability EKS add-on. To ensure that task governance related metrics are included in the **Dashboard**, select the **Kueue metrics** checkbox. Enabling Kueue metrics incurs CloudWatch **Metrics** costs after the free-tier limit is reached. For more information, see **Metrics** in [Amazon CloudWatch Pricing](https://aws.amazon.com/cloudwatch/pricing/).

------
#### [ Setup using the EKS AWS CLI ]

Use the following EKS AWS CLI command to install the add-on:

```
aws eks create-addon --cluster-name cluster-name \
    --addon-name amazon-cloudwatch-observability \
    --configuration-values "configuration json"
```

The following is an example of the configuration values JSON:

```
{
    "agent": {
        "config": {
            "logs": {
                "metrics_collected": {
                    "kubernetes": {
                        "kueue_container_insights": true,
                        "enhanced_container_insights": true
                    },
                    "application_signals": { }
                }
            },
            "traces": {
                "traces_collected": {
                    "application_signals": { }
                }
            }
        }
    }
}
```
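
If the configuration JSON is long, you can store it in a file and pass it with the AWS CLI `file://` prefix instead of an inline string (the file name below is a placeholder):

```
aws eks create-addon --cluster-name cluster-name \
    --addon-name amazon-cloudwatch-observability \
    --configuration-values file://configuration.json
```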

------
#### [ Setup using the EKS Console UI ]

1. Navigate to the [EKS console](https://console.aws.amazon.com/eks/home#/clusters).

1. Choose your cluster.

1. Choose **Add-ons**.

1. Find the **Amazon CloudWatch Observability** add-on and install version 2.4.0 or later.

1. Include the following JSON in the **Configuration values** field:

   ```
   {
       "agent": {
           "config": {
               "logs": {
                   "metrics_collected": {
                       "kubernetes": {
                           "kueue_container_insights": true,
                           "enhanced_container_insights": true
                       },
                       "application_signals": { }
                   }
               },
               "traces": {
                   "traces_collected": {
                       "application_signals": { }
                   }
               }
           }
       }
   }
   ```

------

After the Amazon CloudWatch Observability EKS add-on is installed successfully, you can view your EKS cluster metrics under the HyperPod console **Dashboard** tab.

# Task governance setup
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-setup-task-governance"></a>

This section describes how to set up the Amazon SageMaker HyperPod task governance EKS add-on, including granting the permissions that allow you to set task prioritization, allocate compute to teams, control how idle compute is shared, and configure task preemption for teams.

If you run into issues during setup, see [Troubleshoot](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md) for known troubleshooting solutions.

**Topics**
+ [Kueue Settings](#hp-eks-task-governance-kueue-settings)
+ [HyperPod Task governance prerequisites](#hp-eks-task-governance-prerequisites)
+ [HyperPod task governance setup](#hp-eks-task-governance-setup)

## Kueue Settings
<a name="hp-eks-task-governance-kueue-settings"></a>

The HyperPod task governance EKS add-on installs [Kueue](https://github.com/kubernetes-sigs/kueue/tree/main/apis/kueue) on your HyperPod EKS clusters. Kueue is a Kubernetes-native system that manages quotas and how jobs consume them.


| EKS HyperPod task governance add-on version | Version of Kueue that is installed as part of the add-on | 
| --- | --- | 
|  v1.1.3  |  v0.12.0  | 

**Note**  
Kueue v0.12.0 and higher don't include kueue-rbac-proxy as part of the installation. Previous versions might have kueue-rbac-proxy installed. For example, if you're using Kueue v0.8.1, you might have kueue-rbac-proxy v0.18.1.

HyperPod task governance leverages Kueue for Kubernetes-native job queueing, scheduling, and quota management, and is installed with the HyperPod task governance EKS add-on. When installed, HyperPod creates and modifies SageMaker AI-managed Kubernetes resources such as `KueueManagerConfig`, `ClusterQueues`, `LocalQueues`, `WorkloadPriorityClasses`, `ResourceFlavors`, and `ValidatingAdmissionPolicies`. While Kubernetes administrators can modify the state of these resources, any changes made to a SageMaker AI-managed resource may be overwritten by the service.

The following information outlines the configuration settings utilized by the HyperPod task governance add-on for setting up Kueue.

```
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
health:
  healthProbeBindAddress: :8081
metrics:
  bindAddress: :8443
  enableClusterQueueResources: true
webhook:
  port: 9443
manageJobsWithoutQueueName: false
leaderElection:
  leaderElect: true
  resourceName: c1f6bfd2.kueue.x-k8s.io
controller:
  groupKindConcurrency:
    Job.batch: 5
    Pod: 5
    Workload.kueue.x-k8s.io: 5
    LocalQueue.kueue.x-k8s.io: 1
    ClusterQueue.kueue.x-k8s.io: 1
    ResourceFlavor.kueue.x-k8s.io: 1
clientConnection:
  qps: 50
  burst: 100
integrations:
  frameworks:
  - "batch/job"
  - "kubeflow.org/mpijob"
  - "ray.io/rayjob"
  - "ray.io/raycluster"
  - "jobset.x-k8s.io/jobset"
  - "kubeflow.org/mxjob"
  - "kubeflow.org/paddlejob"
  - "kubeflow.org/pytorchjob"
  - "kubeflow.org/tfjob"
  - "kubeflow.org/xgboostjob"
  - "pod"
  - "deployment"
  - "statefulset"
  - "leaderworkerset.x-k8s.io/leaderworkerset"
  podOptions:
    namespaceSelector:
      matchExpressions:
        - key: kubernetes.io/metadata.name
          operator: NotIn
          values: [ kube-system, kueue-system ]
fairSharing:
  enable: true
  preemptionStrategies: [LessThanOrEqualToFinalShare, LessThanInitialShare]
resources:
  excludeResourcePrefixes: []
```

For more information about each configuration entry, see [Configuration](https://kueue.sigs.k8s.io/docs/reference/kueue-config.v1beta1/#Configuration) in the Kueue documentation.

## HyperPod Task governance prerequisites
<a name="hp-eks-task-governance-prerequisites"></a>
+ Ensure that you have the minimum permission policy for HyperPod cluster administrators, described in [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin). This includes permissions to run the SageMaker HyperPod core APIs, manage SageMaker HyperPod clusters within your AWS account, and perform the tasks in [Managing SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-operate.md). 
+ Your Kubernetes version must be 1.30 or later. For instructions, see [Update existing clusters to the new Kubernetes version](https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html).
+ If you already have Kueue installed in your cluster, uninstall it before installing the EKS add-on.
+ A HyperPod node must already exist in the EKS cluster before installing the HyperPod task governance add-on. 

## HyperPod task governance setup
<a name="hp-eks-task-governance-setup"></a>

The following provides information on how to get set up with HyperPod task governance.

------
#### [ Setup using the SageMaker AI console ]

The following provides information on how to get set up with HyperPod task governance using the SageMaker HyperPod console.

If you have already granted permissions to manage the Amazon CloudWatch Observability EKS add-on and view the HyperPod cluster dashboard through the SageMaker AI console, as described in [HyperPod Amazon CloudWatch Observability EKS add-on setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard.md#hp-eks-dashboard-setup), you already have all of the required permissions. Otherwise, use the sample policy below to grant permissions to manage the HyperPod task governance add-on and view the HyperPod cluster dashboard through the SageMaker AI console.

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:UpdateAddon",
                "eks:DescribeAddon",
                "eks:DescribeAddonVersions",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "eks:DescribeCluster",
                "eks:AccessKubernetesApi"
            ],
            "Resource": "*"
        }
    ]
}
```

------

Navigate to the **Dashboard** tab in the SageMaker HyperPod console to install the Amazon SageMaker HyperPod task governance add-on. 

------
#### [ Setup using the Amazon EKS AWS CLI ]

Use the following [create-addon](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/create-addon.html) EKS AWS CLI command to install the HyperPod task governance add-on:

```
aws eks create-addon --region region --cluster-name cluster-name --addon-name amazon-sagemaker-hyperpod-taskgovernance
```

------

If the installation was successful, you can view the **Policies** tab in the SageMaker HyperPod console. You can also use the following [describe-addon](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/eks/describe-addon.html) EKS AWS CLI command to check the status. 

```
aws eks describe-addon --region region --cluster-name cluster-name --addon-name amazon-sagemaker-hyperpod-taskgovernance
```

# Dashboard
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-metrics"></a>

Amazon SageMaker HyperPod task governance provides a comprehensive dashboard view of your Amazon EKS cluster utilization metrics, including hardware, team, and task metrics. The following provides information on your HyperPod EKS cluster dashboard.

To view the dashboard, you must install the EKS add-on. For more information, see [Dashboard setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-dashboard.md).

In the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/), under **HyperPod Clusters**, you can navigate to the HyperPod console and view your list of HyperPod clusters in your Region. Choose your cluster and navigate to the **Dashboard** tab. The dashboard contains the following metrics. You can download the data for a section by choosing the corresponding **Export**.

**Utilization**

Provides point-in-time and trend-based health metrics for critical compute resources in the EKS cluster. By default, **All Instance Groups** are shown. Use the dropdown menu to filter your instance groups. The metrics included in this section are:
+ Number of total, running, and pending recovery instances. The number of pending recovery instances refers to the number of instances that need attention for recovery.
+ GPUs, GPU memory, vCPUs, and vCPU memory.
+ GPU utilization, GPU memory utilization, vCPU utilization, and vCPU memory utilization.
+ An interactive graph of your GPU and vCPU utilization. 

**Teams**

Provides information about team-specific resource management. This includes:
+ Instance and GPU allocation.
+ GPU utilization rates.
+ Borrowed GPU statistics.
+ Task status (running or pending).
+ A bar chart view of GPU utilization versus compute allocation across teams.
+ Team detailed GPU and vCPU-related information. By default, the information displayed includes **All teams**. You can filter by team and instances by choosing the dropdown menus. In the interactive plot you can filter by time.

**Tasks**

**Note**  
To view your HyperPod EKS cluster tasks in the dashboard:  
Configure Kubernetes Role-Based Access Control (RBAC) for data scientist users in the designated HyperPod namespace to authorize task execution on Amazon EKS-orchestrated clusters. Namespaces follow the format `hyperpod-ns-team-name`. To establish RBAC permissions, refer to the [team role creation instructions](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).
Ensure that your job is submitted with the appropriate namespace and priority class labels. For a comprehensive example, see [Submit a job to SageMaker AI-managed queue and namespace](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md#hp-eks-cli-start-job).

Provides information on task-related metrics. This includes number of running, pending, and preempted tasks, and run and wait time statistics. By default, the information displayed includes **All teams**. You can filter by team by choosing the dropdown menu. In the interactive plot you can filter by time.
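
For example, a job that should appear under a team in the dashboard carries the team namespace and Kueue labels similar to the following sketch (the team name `team-name` and the priority class `training-priority` are placeholders):

```
apiVersion: batch/v1
kind: Job
metadata:
  name: example-task
  namespace: hyperpod-ns-team-name
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
    kueue.x-k8s.io/priority-class: training-priority
```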

# Tasks
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-tasks"></a>

The following provides information on Amazon SageMaker HyperPod EKS cluster tasks. Tasks are operations or jobs that are sent to the cluster, such as machine learning operations like training, running experiments, or inference. The viewable task details include status, run time, and how much compute is being used per task. 

In the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/), under **HyperPod Clusters**, you can navigate to the HyperPod console and view your list of HyperPod clusters in your Region. Choose your cluster and navigate to the **Tasks** tab.

For the **Tasks** tab to be viewable by anyone besides the administrator, the administrator needs to [add an access entry to the EKS cluster for the IAM role](https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html). 

**Note**  
To view your HyperPod EKS cluster tasks in the dashboard:  
Configure Kubernetes Role-Based Access Control (RBAC) for data scientist users in the designated HyperPod namespace to authorize task execution on Amazon EKS-orchestrated clusters. Namespaces follow the format `hyperpod-ns-team-name`. To establish RBAC permissions, refer to the [team role creation instructions](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).
Ensure that your job is submitted with the appropriate namespace and priority class labels. For a comprehensive example, see [Submit a job to SageMaker AI-managed queue and namespace](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md#hp-eks-cli-start-job).

For EKS clusters, Kubeflow (PyTorch, MPI, TensorFlow) tasks are shown. By default, PyTorch tasks are shown. You can filter for PyTorch, MPI, or TensorFlow tasks by choosing the dropdown menu or using the search field. The information that is shown for each task includes the task name, status, namespace, priority class, and creation time. 

# Using topology-aware scheduling in Amazon SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-tasks-scheduling"></a>

Topology-aware scheduling in Amazon SageMaker HyperPod task governance optimizes the training efficiency of distributed machine learning workloads by placing pods based on the physical network topology of your Amazon EC2 instances. By considering the hierarchical structure of AWS infrastructure, including Availability Zones, network blocks, and physical racks, topology-aware scheduling ensures that pods requiring frequent communication are scheduled in close proximity to minimize network latency. This intelligent placement is particularly beneficial for large-scale machine learning training jobs that involve intensive pod-to-pod communication, resulting in reduced training times and more efficient resource utilization across your cluster.

**Note**  
To use topology-aware scheduling, make sure that your version of HyperPod task governance is v1.2.2-eksbuild.1 or higher.

Topology-aware scheduling supports the following instance types:
+ ml.p3dn.24xlarge
+ ml.p4d.24xlarge
+ ml.p4de.24xlarge
+ ml.p5.48xlarge
+ ml.p5e.48xlarge
+ ml.p5en.48xlarge
+ ml.p6e-gb200.36xlarge
+ ml.trn1.2xlarge
+ ml.trn1.32xlarge
+ ml.trn1n.32xlarge
+ ml.trn2.48xlarge
+ ml.trn2u.48xlarge

Topology-aware scheduling integrates with your existing HyperPod workflows while providing flexible topology preferences through both kubectl YAML files and the HyperPod CLI. HyperPod task governance automatically configures cluster nodes with topology labels and works with HyperPod task governance policies and resource borrowing mechanisms, ensuring that topology-aware scheduling doesn't disrupt your current operational processes. With built-in support for both preferred and required topology specifications, you can fine-tune workload placement to match your specific performance requirements while maintaining the flexibility to fall back to standard scheduling when topology constraints cannot be satisfied.

By leveraging topology-aware labels in HyperPod, you can enhance your machine learning workloads through intelligent pod placement that considers the physical network infrastructure. HyperPod task governance automatically optimizes pod scheduling based on the hierarchical data center topology, which directly translates to reduced network latency and improved training performance for distributed ML tasks. This topology awareness is particularly valuable for large-scale machine learning workloads, as it minimizes communication overhead by strategically placing related pods closer together in the network hierarchy. The result is optimized communication network latency between pods, more efficient resource utilization, and better overall performance for compute-intensive AI/ML applications, all achieved without you needing to manually manage complex network topology configurations.

The following are labels for the available topology network layers that HyperPod task governance can schedule pods in:
+ topology.k8s.aws/network-node-layer-1
+ topology.k8s.aws/network-node-layer-2
+ topology.k8s.aws/network-node-layer-3
+ topology.k8s.aws/ultraserver-id

To use topology-aware scheduling, include the following labels in your YAML file:
+ kueue.x-k8s.io/podset-required-topology - indicates that all pods of the job must be scheduled within the same topology layer.
+ kueue.x-k8s.io/podset-preferred-topology - indicates that scheduling all pods of the job within the same topology layer is preferred but not required. HyperPod task governance tries to schedule the pods within one layer before trying the next topology layer.

If resources don’t share the same topology label, the job is suspended and placed in the wait list. Once Kueue sees that there are enough resources, it admits and runs the job.

The following example demonstrates how to use the labels in your YAML files:

```
apiVersion: batch/v1
kind: Job
metadata:
  name: test-tas-job
  namespace: hyperpod-ns-team-name
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
    kueue.x-k8s.io/priority-class: PRIORITY_CLASS-priority
spec:
  parallelism: 10
  completions: 10
  suspend: true
  template:
    metadata:
      labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
      annotations:
        kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
        # or:
        # kueue.x-k8s.io/podset-preferred-topology: "topology.k8s.aws/network-node-layer-3"
    spec:
      # Use nodeSelector instead of the annotation above, but not both at once:
      # nodeSelector:
      #   topology.k8s.aws/network-node-layer-3: TOPOLOGY_LABEL_VALUE
      containers:
        - name: dummy-job
          image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
          args: ["3600s"]
          resources:
            requests:
              cpu: "100"
      restartPolicy: Never
```

The following table explains the new parameters you can use in the kubectl YAML file.


| Parameter | Description | 
| --- | --- | 
| kueue.x-k8s.io/queue-name | The name of the queue to use to run the job. The format of this queue-name must be hyperpod-ns-team-name-localqueue. | 
| kueue.x-k8s.io/priority-class | Lets you specify a priority for pod scheduling. This specification is optional. | 
| annotations | Contains the topology annotation that you attach to the job. Available topologies are kueue.x-k8s.io/podset-required-topology and kueue.x-k8s.io/podset-preferred-topology. You can use either an annotation or nodeSelector, but not both at the same time. | 
| nodeSelector | Specifies the network layer that represents the layer of Amazon EC2 instance placement. Use either this field or an annotation, but not both at the same time. In your YAML file, you can also use the nodeSelector parameter to choose the exact layer for your pods. To get the value of your label, use the [DescribeInstanceTopology](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceTopology.html) API operation. | 

You can also use the HyperPod CLI to run your job with topology-aware scheduling. For more information about the HyperPod CLI, see [SageMaker HyperPod CLI commands](sagemaker-hyperpod-eks-hyperpod-cli-reference.md).

```
hyp create hyp-pytorch-job \                                            
  --version 1.1 \
  --job-name sample-pytorch-job \
  --image 123456789012.dkr.ecr.us-west-2.amazonaws.com/ptjob:latest \
  --pull-policy "Always" \
  --tasks-per-node 1 \
  --max-retry 1 \
  --priority high-priority \
  --namespace hyperpod-ns-team-name \
  --queue-name hyperpod-ns-team-name-localqueue \
  --preferred-topology-label topology.k8s.aws/network-node-layer-1
```

The following is an example configuration file that you might use to run a PyTorchJob with topology labels. The file is largely similar if you want to run MPI or TensorFlow jobs; if you run those instead, remember to change the configuration file accordingly, such as using the correct kind and image instead of PyTorchJob. If you’re running a PyTorchJob, you can assign different topologies to the master and worker nodes. A PyTorchJob always has one master node, so we recommend that you apply topology constraints to the worker pods instead.

```
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  annotations: {}
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
  name: tas-test-pytorch-job
  namespace: hyperpod-ns-team-name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
        spec:
          containers:
          - command:
            - python3
            - /opt/pytorch-mnist/mnist.py
            - --epochs=1
            image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            imagePullPolicy: Always
            name: pytorch
    Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
        metadata:
          # annotations:
            # kueue.x-k8s.io/podset-required-topology: "topology.k8s.aws/network-node-layer-3"
          labels:
            kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
        spec:
          containers:
          - command:
            - python3
            - /opt/pytorch-mnist/mnist.py
            - --epochs=1
            image: docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            imagePullPolicy: Always
            name: pytorch
            resources:
              limits:
                cpu: 1
              requests:
                memory: 200Mi
                cpu: 1
          #nodeSelector:
          #  topology.k8s.aws/network-node-layer-3: xxxxxxxxxxx
```

To see the topologies for your cluster, use the [DescribeInstanceTopology](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceTopology.html) API operation. By default, the topologies are hidden in the AWS Management Console and Amazon SageMaker Studio. Follow these steps to see them in the interface that you’re using.

**SageMaker Studio**

1. In SageMaker Studio, navigate to your cluster.

1. In the Tasks view, choose the options menu in the Name column, then choose **Manage columns**.

1. Select **Requested topology** and **Topology constraint** to add the columns to see the topology information in the list of Kubernetes pods.

**AWS Management Console**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Under **HyperPod clusters**, choose **Cluster management**.

1. Choose the **Tasks** tab, then choose the gear icon.

1. Under instance attributes, toggle **Requested topology** and **Topology constraint**.

1. Choose **Confirm** to see the topology information in the table.

# Using gang scheduling in Amazon SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-tasks-gang-scheduling"></a>

In distributed ML training, a job often requires multiple pods running concurrently across nodes with pod-to-pod communication. HyperPod task governance uses Kueue's `waitForPodsReady` feature to implement gang scheduling. When enabled, the workload is monitored by Kueue until all of its pods are ready, meaning scheduled, running, and passing the optional readiness probe. If not all pods of the workload are ready within the configured timeout, the workload is evicted and requeued.

Gang scheduling provides the following benefits:
+ **Prevents resource waste** — Kueue evicts and requeues the workload if all pods do not become ready, ensuring resources are not held indefinitely by partially running workloads.
+ **Avoids deadlocks** — Prevents jobs from holding partial resources and blocking each other indefinitely.
+ **Automatic recovery** — If pods aren't ready within the timeout, the workload is evicted and requeued with configurable exponential backoff, rather than hanging indefinitely.

## Activate gang scheduling
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-tasks-gang-scheduling-activate"></a>

To activate gang scheduling, you must have a HyperPod Amazon EKS cluster with the task governance Amazon EKS add-on installed. The add-on status must be `Active` or `Degraded`.

**Note**  
Gang scheduling can also be configured directly using `kubectl` by editing the Kueue configuration on the cluster.
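
As a sketch, the settings described in the next section correspond to the `waitForPodsReady` block of the Kueue `Configuration` resource. The values shown below are illustrative, not a verbatim excerpt of what the add-on writes:

```
waitForPodsReady:
  enable: true
  timeout: 5m
  recoveryTimeout: 5m
  blockAdmission: false
  requeuingStrategy:
    timestamp: Eviction
    backoffLimitCount: 10
    backoffBaseSeconds: 60
    backoffMaxSeconds: 3600
```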

**Activate gang scheduling (SageMaker AI console)**

1. Open the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/) and navigate to your HyperPod cluster.

1. Choose the **Policy management** tab.

1. In the **Task governance** section, open **Actions**, then choose **Configure gang scheduling**.

1. Toggle gang scheduling on and configure the settings.

1. Choose **Save**. The Kueue controller restarts to apply the change.

## Gang scheduling configuration settings
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-tasks-gang-scheduling-settings"></a>

The following table describes the configuration settings for gang scheduling.


| Setting | Description | Default | 
| --- | --- | --- | 
| timeout | How long Kueue waits for all pods to become ready before evicting and requeuing the workload. | 5m | 
| recoveryTimeout | How long Kueue waits for a pod to recover after a node failure before requeuing the workload. Set to 0s to disable. Defaults to the value of timeout if not set. | 5m | 
| blockAdmission | When enabled, workloads are admitted sequentially. No new workload is admitted until all pods of the current one are ready. Prevents deadlocks on resource-constrained clusters. | Off | 
| requeuingStrategy timestamp | Whether requeue order uses Creation (original submission time, preserves queue position) or Eviction (time of last eviction, effectively deprioritizing repeatedly failing jobs). | Eviction | 
| requeuingStrategy backoffLimitCount | Maximum requeue attempts before Kueue permanently deactivates the workload. Leave empty for unlimited retries. | Unlimited | 
| requeuingStrategy backoffBaseSeconds | The base time in seconds for exponential backoff when requeuing a workload after each consecutive timeout. The exponent is 2. | 60s | 
| requeuingStrategy backoffMaxSeconds | Cap on the exponential backoff delay. Once reached, Kueue continues requeuing at this fixed interval. | 3600s | 
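
To make the backoff settings concrete, the following minimal Python sketch computes the approximate requeue delay after each consecutive timeout, assuming the delay is the base multiplied by 2 raised to the number of prior attempts and capped at `backoffMaxSeconds`. This mirrors the table above, not Kueue's exact implementation, which also adds random jitter:

```python
def requeue_delay(attempt, backoff_base_seconds=60, backoff_max_seconds=3600):
    """Approximate delay in seconds before the attempt-th requeue (attempt starts at 1)."""
    return min(backoff_base_seconds * 2 ** (attempt - 1), backoff_max_seconds)

# Delays grow 60s, 120s, 240s, ... until the 3600s cap is reached.
print([requeue_delay(n) for n in range(1, 8)])  # [60, 120, 240, 480, 960, 1920, 3600]
```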

**Note**  
Modifying gang scheduling settings restarts the Kueue controller, which may temporarily delay job admission. This applies whether you are enabling, disabling, or updating any value. Running jobs are not interrupted.

**Note**  
Gang scheduling is cluster-wide. It applies to all Kueue-managed workloads on the cluster, not just specific teams or queues.

# Policies
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies"></a>

Amazon SageMaker HyperPod task governance simplifies how your Amazon EKS cluster resources are allocated and how tasks are prioritized. The following provides information on HyperPod EKS cluster policies. For information on how to set up task governance, see [Task governance setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup-task-governance.md).

The policies are divided up into **Compute prioritization** and **Compute allocation**. The policy concepts below will be organized in the context of these policies.

**Compute prioritization**, or cluster policy, determines how idle compute is borrowed and how tasks are prioritized by teams.
+ **Idle compute allocation** defines how idle compute is allocated across teams. That is, how unused compute can be borrowed from teams. When choosing an **Idle compute allocation**, you can choose between:
  + **First-come first-serve**: When applied, teams are not prioritized against each other and each incoming task is equally likely to obtain over-quota resources. Tasks are prioritized based on order of submission. This means a user may be able to use 100% of the idle compute if they request it first.
  + **Fair-share**: When applied, teams borrow idle compute based on their assigned **Fair-share weight**. These weights are defined in **Compute allocation**. For more information on how this can be used, see [Sharing idle compute resources examples](#hp-eks-task-governance-policies-examples).
+ **Task prioritization** defines how tasks are queued as compute becomes available. When choosing a **Task prioritization**, you can choose between:
  + **First-come first-serve**: When applied, tasks are queued in the order they are requested.
  + **Task ranking**: When applied, tasks are queued in the order defined by their prioritization. If this option is chosen, you must add priority classes along with the weights at which they should be prioritized. Tasks of the same priority class are executed on a first-come first-serve basis. When preemption is enabled in **Compute allocation**, higher priority tasks within a team preempt the team's lower priority tasks.

    When data scientists submit jobs to the cluster, they use the priority class name in the YAML file. The priority class is in the format `priority-class-name-priority`. For an example, see [Submit a job to SageMaker AI-managed queue and namespace](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md#hp-eks-cli-start-job).
  + **Priority classes**: These classes establish a relative priority for tasks when borrowing capacity. When a task is running using borrowed quota, it may be preempted by another task of higher priority than it, if no more capacity is available for the incoming task. If **Preemption** is enabled in the **Compute allocation**, a higher priority task may also preempt tasks within its own team.
+ **Unallocated resource sharing** enables teams to borrow compute resources that are not allocated to any team through compute quota. When enabled, unallocated cluster capacity becomes available for teams to borrow automatically. For more information, see [How unallocated resource sharing works](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-how-it-works).
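
The effect of fair-share weights on borrowing can be illustrated with a toy Python sketch. This shows only the ordering idea, not Kueue's actual fair-sharing algorithm: teams that have borrowed less relative to their weight are admitted sooner.

```python
def fair_share_order(teams):
    """Toy ordering: sort teams by borrowed compute divided by fair-share weight.

    teams maps team name -> (fair_share_weight, borrowed_gpus).
    A weight of 0 always sorts last (lowest priority).
    """
    def key(name):
        weight, borrowed = teams[name]
        return borrowed / weight if weight else float("inf")
    return sorted(teams, key=key)

# Both teams borrowed 4 GPUs, but team-a's higher weight gives it priority.
teams = {"team-a": (10, 4), "team-b": (5, 4), "team-c": (0, 0)}
print(fair_share_order(teams))  # ['team-a', 'team-b', 'team-c']
```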

**Compute allocation**, or compute quota, defines a team’s compute allocation and what weight (or priority level) a team is given for fair-share idle compute allocation. 
+ **Team name**: The team name. A corresponding **Namespace** is created with the format `hyperpod-ns-team-name`. 
+ **Members**: Members of the team namespace. You will need to set up a Kubernetes role-based access control (RBAC) for data scientist users that you want to be part of this team, to run tasks on HyperPod clusters orchestrated with Amazon EKS. To set up a Kubernetes RBAC, use the instructions in [create team role](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).
+ **Fair-share weight**: This is the level of prioritization assigned to the team when **Fair-share** is applied for **Idle compute allocation**. The highest priority has a weight of 100 and the lowest priority has a weight of 0. Higher weight enables a team to access unutilized resources within shared capacity sooner. A zero weight signifies the lowest priority, implying this team will always be at a disadvantage compared to other teams. 

  The fair-share weight provides a comparative edge to this team when vying for available resources against others. Admission prioritizes scheduling tasks from teams with the highest weights and the lowest borrowing. For example, if Team A has a weight of 10 and Team B has a weight of 5, Team A has priority in accessing unutilized resources, as its jobs are scheduled earlier than Team B's.
+ **Task preemption**: Compute is reclaimed from a running task based on priority. By default, a team lending idle compute can preempt the borrowing tasks of other teams when it needs that compute back. 
+ **Lending and borrowing**: How idle compute is being lent by the team and if the team can borrow from other teams.
  + **Percentage-based borrow limit**: The limit of idle compute that a team is allowed to borrow, expressed as a percentage of their guaranteed quota. A team can borrow up to 10,000% of allocated compute. The value you provide here is interpreted as a percentage. For example, a value of 500 will be interpreted as 500%. This percentage applies uniformly across all resource types (CPU, GPU, Memory) and instance types in the team's quota.
  + **Absolute borrow limit**: The limit of idle compute that a team is allowed to borrow, defined as absolute resource values per instance type. This provides granular control over borrowing behavior for specific instance types. You need to specify absolute limits using the same schema as **Compute quota**, including instance count, accelerators, vCPU, memory, or accelerator partitions. You can specify absolute limits for one or more instance types in your team's quota.
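
To make the percentage interpretation concrete, the following is a minimal arithmetic sketch; the guaranteed quota of 8 instances and the borrow limit of 500 are hypothetical values:

```
# Hypothetical team quota and percentage-based borrow limit
GUARANTEED_INSTANCES=8
BORROW_LIMIT=500    # interpreted as 500%

# Maximum idle compute this team may borrow on top of its guaranteed quota
MAX_BORROW=$((GUARANTEED_INSTANCES * BORROW_LIMIT / 100))
echo "The team can borrow up to ${MAX_BORROW} additional instances"
```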

For information on how these concepts are used, such as priority classes and namespaces, see [Example HyperPod task governance AWS CLI commands](sagemaker-hyperpod-eks-operate-console-ui-governance-cli.md).

## Sharing idle compute resources examples
<a name="hp-eks-task-governance-policies-examples"></a>

The total reserved quota should not surpass the cluster's available capacity for that resource, to ensure proper quota management. For example, if a cluster comprises 20 `ml.c5.2xlarge` instances, the cumulative quota assigned to teams should remain under 20. 

If the **Compute allocation** policies for teams allow for **Lend and Borrow** or **Lend**, idle capacity is shared between these teams. For example, suppose Team A and Team B both have **Lend and Borrow** enabled. Team A has a quota of 6 but is using only 2 for its jobs, and Team B has a quota of 5 and is using 4 for its jobs. If a job requiring 4 resources is submitted to Team B, 3 resources are borrowed from Team A. 

If any team's **Compute allocation** policy is set to **Don't Lend**, the team would not be able to borrow any additional capacity beyond its own allocations.
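
The Team A/Team B example above can be sketched as simple arithmetic; the numbers mirror the example, and the actual scheduling decision is made by HyperPod task governance:

```
# Quotas and usage from the example above
TEAM_A_QUOTA=6; TEAM_A_USED=2
TEAM_B_QUOTA=5; TEAM_B_USED=4
JOB_REQUEST=4                                   # job submitted to Team B

TEAM_B_FREE=$((TEAM_B_QUOTA - TEAM_B_USED))     # 1 resource still free in Team B's quota
BORROWED=$((JOB_REQUEST - TEAM_B_FREE))         # 3 resources borrowed from Team A
TEAM_A_IDLE=$((TEAM_A_QUOTA - TEAM_A_USED))     # 4 idle resources, enough to lend 3
echo "Borrowed from Team A: ${BORROWED}"
```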

## How unallocated resource sharing works
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-how-it-works"></a>

Unallocated resource sharing automatically manages the pool of resources that are not allocated to any compute quota in your cluster. This means HyperPod continuously monitors your cluster state and automatically updates to the correct configuration over time.

**Initial Setup**
+ When you set `IdleResourceSharing` to `Enabled` in your ClusterSchedulerConfig (by default it is `Disabled`), HyperPod task governance begins monitoring your cluster and calculates available idle resources by subtracting team quotas from total node capacity.
+ Unallocated resource sharing ClusterQueues are created to represent the borrowable resource pool.
+ When you first enable unallocated resource sharing, infrastructure setup takes several minutes. You can monitor the progress through the policy `Status` and `DetailedStatus` in ClusterSchedulerConfig.

**Ongoing Reconciliation**
+ HyperPod task governance continuously monitors for changes such as node additions or removals and cluster queue quota updates.
+ When changes occur, unallocated resource sharing recalculates quota and updates ClusterQueues. Reconciliation typically completes within seconds.

**Monitoring**

You can verify that unallocated resource sharing is fully configured by checking for unallocated resource sharing ClusterQueues:

```
kubectl get clusterqueue | grep hyperpod-ns-idle-resource-sharing
```

When you see ClusterQueues with names like `hyperpod-ns-idle-resource-sharing-cq-1`, unallocated resource sharing is active. Note that multiple unallocated resource sharing ClusterQueues may exist depending on the number of resource flavors in your cluster. 

## Node eligibility for unallocated resource sharing
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-node-eligibility"></a>

Unallocated resource sharing only includes nodes that meet the following requirements:

1. **Node Ready Status**
   + Nodes must be in `Ready` status to contribute to the unallocated resource pool.
   + Nodes in `NotReady` or other non-ready states are excluded from capacity calculations.
   + When a node becomes `Ready`, it is automatically included in the next reconciliation cycle.

1. **Node Schedulable Status**
   + Nodes with `spec.unschedulable: true` are excluded from unallocated resource sharing.
   + When a node becomes schedulable again, it is automatically included in the next reconciliation cycle.

1. **MIG Configuration (GPU nodes only)**
   + For GPU nodes with MIG (Multi-Instance GPU) partitioning, the `nvidia.com/mig.config.state` label must show `success` for the node to contribute MIG profiles to unallocated resource sharing.
   + These nodes will be retried automatically once MIG configuration completes successfully.

1. **Supported Instance Types**
   + The instance must be a supported SageMaker HyperPod instance type.
   + See the list of instance types supported by SageMaker HyperPod clusters.
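
The checks above can be sketched as a simple eligibility test. The attribute values below are hypothetical; in practice you would read them from the node object, for example with `kubectl get node NODE-NAME -o json`:

```
# Hypothetical node attributes
READY="True"           # status.conditions Ready condition
UNSCHEDULABLE="false"  # spec.unschedulable
MIG_STATE="success"    # nvidia.com/mig.config.state label (MIG GPU nodes only)

ELIGIBLE="false"
if [ "$READY" = "True" ] && [ "$UNSCHEDULABLE" != "true" ] && [ "$MIG_STATE" = "success" ]; then
    ELIGIBLE="true"
fi
echo "eligible for unallocated resource sharing: ${ELIGIBLE}"
```

A node failing any check is simply excluded from the pool and is picked up again in a later reconciliation cycle once the condition clears.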

**Topics**
+ [Sharing idle compute resources examples](#hp-eks-task-governance-policies-examples)
+ [How unallocated resource sharing works](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-how-it-works)
+ [Node eligibility for unallocated resource sharing](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-idle-resource-sharing-node-eligibility)
+ [Create policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-create.md)
+ [Edit policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-edit.md)
+ [Delete policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete.md)
+ [Allocating compute quota in Amazon SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation.md)

# Create policies
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-create"></a>

You can create your **Cluster policy** and **Compute allocation** configurations in the **Policies** tab. The following instructions show you how to:
+ Create your **Cluster policy** to define how tasks are prioritized and idle compute is allocated.
+ Create a **Compute allocation** to define a compute allocation policy for a team.
**Note**  
When you create a **Compute allocation** you will need to set up a Kubernetes role-based access control (RBAC) for data scientist users in the corresponding namespace to run tasks on HyperPod clusters orchestrated with Amazon EKS. The namespaces have the format `hyperpod-ns-team-name`. To set up a Kubernetes RBAC, use the instructions in [create team role](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).

For information about the HyperPod task governance EKS cluster policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

**Create HyperPod task governance policies**

This procedure assumes that you have already created an Amazon EKS cluster set up with HyperPod. If you have not already done so, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md).

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, under **HyperPod Clusters**, choose **Cluster Management**.

1. Choose your Amazon EKS cluster listed under **SageMaker HyperPod clusters**.

1. Choose the **Policies** tab.

1. To create your **Cluster policy**:

   1. Choose the corresponding **Edit** to update how tasks are prioritized and idle compute is allocated.

   1. After you have made your changes, choose **Submit**.

1. To create a **Compute allocation**:

   1. Choose the corresponding **Create**. This takes you to the compute allocation creation page.

   1. After you have made your changes, choose **Submit**.

# Edit policies
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-edit"></a>

You can edit your **Cluster policy** and **Compute allocation** configurations in the **Policies** tab. The following instructions show you how to:
+ Edit your **Cluster policy** to update how tasks are prioritized and idle compute is allocated.
+ Edit a **Compute allocation** to update an existing compute allocation policy for a team.
**Note**  
When you create a **Compute allocation** you will need to set up a Kubernetes role-based access control (RBAC) for data scientist users in the corresponding namespace to run tasks on HyperPod clusters orchestrated with Amazon EKS. The namespaces have the format `hyperpod-ns-team-name`. To set up a Kubernetes RBAC, use the instructions in [create team role](https://github.com/aws/sagemaker-hyperpod-cli/tree/main/helm_chart#5-create-team-role).

For more information about the HyperPod task governance EKS cluster policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

**Edit HyperPod task governance policies**

This procedure assumes that you have already created an Amazon EKS cluster set up with HyperPod. If you have not already done so, see [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](sagemaker-hyperpod-eks-operate-console-ui-create-cluster.md).

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, under **HyperPod Clusters**, choose **Cluster Management**.

1. Choose your Amazon EKS cluster listed under **SageMaker HyperPod clusters**.

1. Choose the **Policies** tab.

1. To edit your **Cluster policy**:

   1. Choose the corresponding **Edit** to update how tasks are prioritized and idle compute is allocated.

   1. After you have made your changes, choose **Submit**.

1. To edit your **Compute allocation**:

   1. Choose the configuration you wish to edit under **Compute allocation**. This takes you to the configuration details page.

   1. If you wish to edit these configurations, choose **Edit**.

   1. After you have made your changes, choose **Submit**.

# Delete policies
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete"></a>

You can delete your **Cluster policy** and **Compute allocation** configurations using the SageMaker AI console or AWS CLI. The following page provides instructions on how to delete your SageMaker HyperPod task governance policies and configurations.

For more information about the HyperPod task governance EKS cluster policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

**Note**  
If you are having issues with listing or deleting task governance policies, you may need to update your cluster administrator minimum set of permissions. See the **Amazon EKS** tab in the [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin) section. For additional information, see [Deleting clusters](sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot.md#hp-eks-troubleshoot-delete-policies).

## Delete HyperPod task governance policies (console)
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete-console"></a>

The following uses the SageMaker AI console to delete your HyperPod task governance policies.

**Note**  
You cannot delete your **Cluster policy** (`ClusterSchedulerConfig`) using the SageMaker AI console. To learn how to do so using the AWS CLI, see [Delete HyperPod task governance policies (AWS CLI)](#sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete-cli).

**To delete task governance policies (console)**

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. On the left navigation pane, under **HyperPod Clusters**, choose **Cluster Management**.

1. Choose your Amazon EKS cluster listed under **SageMaker HyperPod clusters**.

1. Choose the **Policies** tab.

1. To delete your **Compute allocation** (`ComputeQuota`):

   1. In the **Compute allocation** section, select the configuration you want to delete.

   1. In the **Actions** dropdown menu, choose **Delete**.

   1. Follow the instructions in the UI to complete the task.

## Delete HyperPod task governance policies (AWS CLI)
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete-cli"></a>

The following uses the AWS CLI to delete your HyperPod task governance policies.

**Note**  
If you are having issues using the following commands, you may need to update your AWS CLI. For more information, see [Installing or updating to the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).

**To delete task governance policies (AWS CLI)**

First set your variables for the AWS CLI commands that follow.

```
REGION=aws-region
```

1. Get the *cluster-arn* associated with the policies you wish to delete. You can use the following AWS CLI command to list the clusters in your AWS Region.

   ```
   aws sagemaker list-clusters \
       --region ${REGION}
   ```

1. To delete your compute allocations (`ComputeQuota`):

   1. List all of the compute quotas associated with the HyperPod cluster.

      ```
      aws sagemaker list-compute-quotas \
          --cluster-arn cluster-arn \
          --region ${REGION}
      ```

   1. For each `compute-quota-id` you wish to delete, run the following command to delete the compute quota.

      ```
      aws sagemaker delete-compute-quota \
          --compute-quota-id compute-quota-id \
          --region ${REGION}
      ```

1. To delete your cluster policies (`ClusterSchedulerConfig`):

   1. List all of the cluster policies associated with the HyperPod cluster.

      ```
      aws sagemaker list-cluster-scheduler-configs \
          --cluster-arn cluster-arn \
          --region ${REGION}
      ```

   1. For each `cluster-scheduler-config-id` you wish to delete, run the following command to delete the cluster policy.

      ```
      aws sagemaker delete-cluster-scheduler-config \
          --cluster-scheduler-config-id scheduler-config-id \
          --region ${REGION}
      ```

# Allocating compute quota in Amazon SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation"></a>

Cluster administrators can decide how the organization uses purchased compute. Doing so reduces waste and idle resources. You can allocate compute quota such that teams can borrow unused resources from each other. Compute quota allocation in HyperPod task governance lets administrators allocate resources at the instance level and at a more granular resource level. This capability provides flexible and efficient resource management for teams by allowing granular control over individual compute resources instead of requiring entire instance allocations. Allocating at a granular level eliminates inefficiencies of traditional instance-level allocation. Through this approach, you can optimize resource utilization and reduce idle compute.

Compute quota allocation supports three types of resource allocation: accelerators, vCPU, and memory. Accelerators are components in accelerated computing instances that perform functions such as floating point calculations, graphics processing, or data pattern matching. Accelerators include GPUs, Trainium accelerators, and Neuron cores. For multi-team GPU sharing, different teams can receive specific GPU allocations from the same instance type, maximizing utilization of accelerator hardware. For memory-intensive workloads that require additional RAM for data preprocessing or model caching, you can allocate memory quota beyond the default GPU-to-memory ratio. For CPU-heavy preprocessing tasks that need substantial CPU resources alongside GPU training, you can allocate vCPU independently.

Once you provide a value, HyperPod task governance calculates the ratio using the formula **allocated resource divided by the total amount of resources available in the instance**. HyperPod task governance then uses this ratio to apply default allocations to other resources, but you can override these defaults and customize them based on your use case. The following are sample scenarios of how HyperPod task governance allocates resources based on your values:
+ **Only accelerator specified** - HyperPod task governance applies the default ratio to vCPU and memory based on the accelerator values.
+ **Only vCPU specified** - HyperPod task governance calculates the ratio and applies it to memory. Accelerators are set to 0.
+ **Only memory specified** - HyperPod task governance calculates the ratio and applies it to vCPU because compute is required to run memory-specified workloads. Accelerators are set to 0.
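
As a worked example of the ratio calculation, suppose you allocate only accelerators on an `ml.g5.24xlarge` (4 GPUs, 96 vCPUs, 384 GiB of memory). The derived defaults below are illustrative; HyperPod task governance performs the actual calculation:

```
TOTAL_ACCELERATORS=4; TOTAL_VCPU=96; TOTAL_MEMORY_GIB=384   # ml.g5.24xlarge totals
ALLOCATED_ACCELERATORS=2                                    # only accelerators specified

# ratio = allocated resource / total amount available on the instance
RATIO_PCT=$((ALLOCATED_ACCELERATORS * 100 / TOTAL_ACCELERATORS))
DEFAULT_VCPU=$((TOTAL_VCPU * RATIO_PCT / 100))              # default vCPU allocation
DEFAULT_MEMORY=$((TOTAL_MEMORY_GIB * RATIO_PCT / 100))      # default memory allocation
echo "${RATIO_PCT}% -> ${DEFAULT_VCPU} vCPUs, ${DEFAULT_MEMORY} GiB"
```

You can override these derived defaults per resource when they don't fit your workload.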

To programmatically control quota allocation, you can use the [ComputeQuotaResourceConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ComputeQuotaResourceConfig.html) object and specify your allocations.

```
{
    "ComputeQuotaConfig": {
        "ComputeQuotaResources": [{
            "InstanceType": "ml.g5.24xlarge",
            "Accelerators": "16",
            "vCpu": "200.0",
            "MemoryInGiB": "2.0"
        }]
    }
}
```

To see all of the allocations, including the defaults, use the [DescribeComputeQuota](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeComputeQuota.html) operation. To update your allocations, use the [UpdateComputeQuota](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateComputeQuota.html) operation.

You can also use the HyperPod CLI to allocate compute quotas. For more information about the HyperPod CLI, see [Running jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-run-jobs.md). The following example demonstrates how to set compute quotas using the HyperPod CLI.

```
hyp create hyp-pytorch-job --version 1.1 --job-name sample-job \
--image 123456789012.dkr.ecr.us-west-2.amazonaws.com/ptjob:latest \
--pull-policy "Always" \
--tasks-per-node 1 \
--max-retry 1 \
--priority high-priority \
--namespace hyperpod-ns-team-name \
--queue-name hyperpod-ns-team-name-localqueue \
--instance-type sample-instance-type \
--accelerators 1 \
--vcpu 3 \
--memory 1 \
--accelerators-limit 1 \
--vcpu-limit 4 \
--memory-limit 2
```

To allocate quotas using the AWS console, follow these steps.

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Under HyperPod clusters, choose **Cluster management**.

1. Under **Compute allocations**, choose **Create**.

1. If you don’t already have instances, choose **Add allocation** to add an instance.

1. Under **Allocations**, choose to allocate by instances or individual resources. If you allocate by individual resources, SageMaker AI automatically assigns allocations to other resources by the ratio that you chose. To override this ratio-based allocation, use the corresponding toggle to override that compute.

1. Repeat steps 4 and 5 to configure additional instances.

After allocating compute quota, you can then submit jobs through the HyperPod CLI or `kubectl`. HyperPod efficiently schedules workloads based on available quota. 

# Allocating GPU partition quota
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation-gpu-partitions"></a>

You can extend compute quota allocation to support GPU partitioning, enabling fine-grained resource sharing at the GPU partition level. When GPU partitioning is enabled on supported GPUs in the cluster, each physical GPU can be partitioned into multiple isolated GPUs with defined compute, memory, and streaming multiprocessor allocations. For more information about GPU partitioning, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md). You can allocate specific GPU partitions to teams, allowing multiple teams to share a single GPU while maintaining hardware-level isolation and predictable performance.

For example, an ml.p5.48xlarge instance with 8 H100 GPUs can be partitioned into GPU partitions, and you can allocate individual partitions to different teams based on their task requirements. When you specify GPU partition allocations, HyperPod task governance calculates proportional vCPU and memory quotas based on the GPU partition, similar to GPU-level allocation. This approach maximizes GPU utilization by eliminating idle capacity and enabling cost-effective resource sharing across multiple concurrent tasks on the same physical GPU.
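
For a rough sense of scale, the partition math can be sketched as follows; this assumes a MIG profile that supports 7 partitions per GPU, and the counts are illustrative:

```
GPUS_PER_INSTANCE=8    # for example, an ml.p5.48xlarge with 8 GPUs
PARTITIONS_PER_GPU=7   # assumes a MIG profile yielding 7 partitions per GPU

TOTAL_PARTITIONS=$((GPUS_PER_INSTANCE * PARTITIONS_PER_GPU))
echo "Up to ${TOTAL_PARTITIONS} isolated GPU partitions to allocate across teams"
```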

## Creating compute quotas
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation-gpu-partitions-creating"></a>

The following example creates a compute quota that allocates four `mig-1g.5gb` GPU partitions on `ml.p4d.24xlarge` instances to a team, with lending and borrowing enabled.

```
aws sagemaker create-compute-quota \
  --name "fractional-gpu-quota" \
  --compute-quota-config '{
    "ComputeQuotaResources": [
      {
        "InstanceType": "ml.p4d.24xlarge",
        "AcceleratorPartition": {
            "Count": 4,
            "Type": "mig-1g.5gb"
        }
      }
    ],
    "ResourceSharingConfig": { 
      "Strategy": "LendAndBorrow", 
      "BorrowLimit": 100 
    }
  }'
```

## Verifying quota resources
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-policies-compute-allocation-gpu-partitions-verifying"></a>

You can use the following `kubectl` commands to verify that the corresponding Kueue resources were created.

```
# Check ClusterQueue
kubectl get clusterqueues
kubectl describe clusterqueue QUEUE_NAME

# Check ResourceFlavors
kubectl get resourceflavor
kubectl describe resourceflavor FLAVOR_NAME
```

# Example HyperPod task governance AWS CLI commands
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-cli"></a>

You can use HyperPod with EKS through `kubectl` or through the HyperPod CLI, and you can run these commands from Studio or the AWS CLI. The following SageMaker HyperPod task governance examples show how to view cluster details using the HyperPod AWS CLI commands. For more information, including how to install the CLI, see the [HyperPod CLI GitHub repository](https://github.com/aws/sagemaker-hyperpod-cli).

**Topics**
+ [Get cluster accelerator device quota information](#hp-eks-cli-get-clusters)
+ [Submit a job to SageMaker AI-managed queue and namespace](#hp-eks-cli-start-job)
+ [List jobs](#hp-eks-cli-list-jobs)
+ [Get job detailed information](#hp-eks-cli-get-job)
+ [Suspend and unsuspend jobs](#hp-eks-cli-patch-job)
+ [Debugging jobs](#hp-eks-cli-other)

## Get cluster accelerator device quota information
<a name="hp-eks-cli-get-clusters"></a>

The following example command gets the information on the cluster accelerator device quota.

```
hyperpod get-clusters -n hyperpod-ns-test-team
```

The namespace in this example, `hyperpod-ns-test-team`, is created in Kubernetes based on the team name provided, `test-team`, when the compute allocation is created. For more information, see [Edit policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-edit.md).

Example response:

```
[
    {
        "Cluster": "hyperpod-eks-test-cluster-id",
        "InstanceType": "ml.g5.xlarge",
        "TotalNodes": 2,
        "AcceleratorDevicesAvailable": 1,
        "NodeHealthStatus=Schedulable": 2,
        "DeepHealthCheckStatus=Passed": "N/A",
        "Namespaces": {
            "hyperpod-ns-test-team": {
                "TotalAcceleratorDevices": 1,
                "AvailableAcceleratorDevices": 1
            }
        }
    }
]
```

## Submit a job to SageMaker AI-managed queue and namespace
<a name="hp-eks-cli-start-job"></a>

The following example command submits a job to your HyperPod cluster. If you have access to only one team, the HyperPod AWS CLI automatically assigns the queue for you. If multiple queues are discovered, the CLI displays all viable options for you to select from.

```
hyperpod start-job --job-name hyperpod-cli-test \
    --job-kind kubeflow/PyTorchJob \
    --image docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd \
    --entry-script /opt/pytorch-mnist/mnist.py \
    --pull-policy IfNotPresent \
    --instance-type ml.g5.xlarge \
    --node-count 1 \
    --tasks-per-node 1 \
    --results-dir ./result \
    --priority training-priority
```

The priority classes are defined in the **Cluster policy**, which defines how tasks are prioritized and idle compute is allocated. When a data scientist submits a job, they use one of the priority class names with the format `priority-class-name-priority`. In this example, `training-priority` refers to the priority class named “training”. For more information on policy concepts, see [Policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies.md).

If a priority class is not specified, the job is treated as a low priority job, with a task ranking value of 0. 

If a priority class is specified, but does not correspond to one of the priority classes defined in the **Cluster policy**, the submission fails and an error message provides the defined set of priority classes.

You can also submit the job using a YAML configuration file using the following command: 

```
hyperpod start-job --config-file ./yaml-configuration-file-name.yaml
```

The following is an example YAML configuration file that is equivalent to submitting a job as discussed above.

```
defaults:
  - override hydra/job_logging: stdout
hydra:
  run:
    dir: .
  output_subdir: null
training_cfg:
  entry_script: /opt/pytorch-mnist/mnist.py
  script_args: []
  run:
    name: hyperpod-cli-test
    nodes: 1
    ntasks_per_node: 1
cluster:
  cluster_type: k8s
  instance_type: ml.g5.xlarge
  custom_labels:
    kueue.x-k8s.io/priority-class: training-priority
  cluster_config:
    label_selector:
      required:
        sagemaker.amazonaws.com/node-health-status:
          - Schedulable
      preferred:
        sagemaker.amazonaws.com/deep-health-check-status:
          - Passed
      weights:
        - 100
    pullPolicy: IfNotPresent
base_results_dir: ./result
container: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-bc09cfd
env_vars:
  NCCL_DEBUG: INFO
```

Alternatively, you can submit a job using `kubectl` to ensure the task appears in the **Dashboard** tab. The following is an example kubectl command.

```
kubectl apply -f ./yaml-configuration-file-name.yaml
```

When submitting the job, include your queue name and priority class labels. For example, with the queue name `hyperpod-ns-team-name-localqueue` and priority class `priority-class-name-priority`, you must include the following labels:
+ `kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue` 
+ `kueue.x-k8s.io/priority-class: priority-class-name-priority`

The following YAML configuration snippet demonstrates how to add labels to your original configuration file to ensure your task appears in the **Dashboard** tab:

```
metadata:
    name: job-name
    namespace: hyperpod-ns-team-name
    labels:
        kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
        kueue.x-k8s.io/priority-class: priority-class-name-priority
```

## List jobs
<a name="hp-eks-cli-list-jobs"></a>

The following command lists the jobs and their details.

```
hyperpod list-jobs
```

Example response:

```
{
    "jobs": [
        {
            "Name": "hyperpod-cli-test",
            "Namespace": "hyperpod-ns-test-team",
            "CreationTime": "2024-11-18T21:21:15Z",
            "Priority": "training",
            "State": "Succeeded"
        }
    ]
}
```

## Get job detailed information
<a name="hp-eks-cli-get-job"></a>

The following command provides a job’s details. If no namespace is specified, HyperPod AWS CLI will fetch a namespace managed by SageMaker AI that you have access to.

```
hyperpod get-job --job-name hyperpod-cli-test
```

Example response:

```
{
    "Name": "hyperpod-cli-test",
    "Namespace": "hyperpod-ns-test-team",
    "Label": {
        "app": "hyperpod-cli-test",
        "app.kubernetes.io/managed-by": "Helm",
        "kueue.x-k8s.io/priority-class": "training"
    },
    "CreationTimestamp": "2024-11-18T21:21:15Z",
    "Status": {
        "completionTime": "2024-11-18T21:25:24Z",
        "conditions": [
            {
                "lastTransitionTime": "2024-11-18T21:21:15Z",
                "lastUpdateTime": "2024-11-18T21:21:15Z",
                "message": "PyTorchJob hyperpod-cli-test is created.",
                "reason": "PyTorchJobCreated",
                "status": "True",
                "type": "Created"
            },
            {
                "lastTransitionTime": "2024-11-18T21:21:17Z",
                "lastUpdateTime": "2024-11-18T21:21:17Z",
                "message": "PyTorchJob hyperpod-ns-test-team/hyperpod-cli-test is running.",
                "reason": "PyTorchJobRunning",
                "status": "False",
                "type": "Running"
            },
            {
                "lastTransitionTime": "2024-11-18T21:25:24Z",
                "lastUpdateTime": "2024-11-18T21:25:24Z",
                "message": "PyTorchJob hyperpod-ns-test-team/hyperpod-cli-test successfully completed.",
                "reason": "PyTorchJobSucceeded",
                "status": "True",
                "type": "Succeeded"
            }
        ],
            "replicaStatuses": {
                "Worker": {
                    "selector": "training.kubeflow.org/job-name=hyperpod-cli-test,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=worker",
                    "succeeded": 1
                }
            },
        "startTime": "2024-11-18T21:21:15Z"
    },
    "ConsoleURL": "https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/cluster-management/hyperpod-eks-test-cluster-id“
}
```

## Suspend and unsuspend jobs
<a name="hp-eks-cli-patch-job"></a>

If you want to remove a submitted job from the scheduler, the HyperPod AWS CLI provides the `suspend` command to temporarily remove the job from orchestration. A suspended job is no longer scheduled until you manually add it back with the `unsuspend` command.

To temporarily suspend a job:

```
hyperpod patch-job suspend --job-name hyperpod-cli-test
```

To add a job back to the queue:

```
hyperpod patch-job unsuspend --job-name hyperpod-cli-test
```

## Debugging jobs
<a name="hp-eks-cli-other"></a>

The HyperPod AWS CLI also provides commands, such as `list-pods` and `get-logs`, to help you debug job submission issues. For more information, see the HyperPod AWS CLI GitHub repository.

# Troubleshoot
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-troubleshoot"></a>

The following page contains known solutions for troubleshooting your HyperPod EKS clusters.

**Topics**
+ [Dashboard tab](#hp-eks-troubleshoot-dashboard)
+ [Tasks tab](#hp-eks-troubleshoot-tasks)
+ [Policies](#hp-eks-troubleshoot-policies)
+ [Deleting clusters](#hp-eks-troubleshoot-delete-policies)
+ [Unallocated resource sharing](#hp-eks-troubleshoot-unallocated-resource-sharing)

## Dashboard tab
<a name="hp-eks-troubleshoot-dashboard"></a>

**The EKS add-on fails to install**

For the EKS add-on installation to succeed, your cluster must run Kubernetes version 1.30 or later. To update, see [Update Kubernetes version](https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html).

For the EKS add-on installation to succeed, all of the nodes need to be in **Ready** status and all of the pods need to be in **Running** status. 

To check the status of your nodes, use the [list-cluster-nodes](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-cluster-nodes.html) AWS CLI command or navigate to your EKS cluster in the [EKS console](https://console.aws.amazon.com/eks/home#/clusters) and view the status of your nodes. Resolve the issue for each node or reach out to your administrator. If the node status is **Unknown**, delete the node. Once all node statuses are **Ready**, retry installing the EKS add-on in HyperPod from the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

To check the status of your pods, use the [Kubernetes CLI](https://kubernetes.io/docs/reference/kubectl/) command `kubectl get pods -n cloudwatch-agent` or navigate to your EKS cluster in the [EKS console](https://console.aws.amazon.com/eks/home#/clusters) and view the status of your pods in the `cloudwatch-agent` namespace. Resolve the issue for each pod or reach out to your administrator. Once all pod statuses are **Running**, retry installing the EKS add-on in HyperPod from the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

For more troubleshooting, see [Troubleshooting the Amazon CloudWatch Observability EKS add-on](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Observability-EKS-addon.html#Container-Insights-setup-EKS-addon-troubleshoot).

## Tasks tab
<a name="hp-eks-troubleshoot-tasks"></a>

If you see an error message stating that the **Custom Resource Definition (CRD) is not configured on the cluster**, grant the `EKSAdminViewPolicy` and `ClusterAccessRole` policies to your domain execution role.
+ For information on how to get your execution role, see [Get your execution role](sagemaker-roles.md#sagemaker-roles-get-execution-role).
+ To learn how to attach policies to an IAM user or group, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).

## Policies
<a name="hp-eks-troubleshoot-policies"></a>

The following lists solutions to policy-related errors when using the HyperPod APIs or console.
+ If the policy is in `CreateFailed` or `CreateRollbackFailed` status, you need to delete the failed policy and create a new one.
+ If the policy is in `UpdateFailed` status, retry the update with the same policy ARN.
+ If the policy is in `UpdateRollbackFailed` status, you need to delete the failed policy and then create a new one.
+ If the policy is in `DeleteFailed` or `DeleteRollbackFailed` status, retry the delete with the same policy ARN.
  + If you run into an error while trying to delete the **Compute prioritization** (cluster policy) using the HyperPod console, try deleting the `cluster-scheduler-config` resource using the API. To check the status of the resource, go to the details page of a compute allocation.

For more details about the failure, use the describe API.
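The remediation rules above can be sketched as a small helper. This is illustrative only, not part of any AWS SDK; the status strings are the SageMaker policy statuses named in the list above.

```python
# Illustrative mapping of failed HyperPod policy statuses to the
# remediation steps described above. A sketch, not part of the AWS SDK.

RETRY_SAME_ARN = "retry the operation with the same policy ARN"
DELETE_AND_RECREATE = "delete the failed policy and create a new one"

def remediation_for(status: str) -> str:
    """Return the suggested remediation for a policy status."""
    if status in ("CreateFailed", "CreateRollbackFailed", "UpdateRollbackFailed"):
        return DELETE_AND_RECREATE
    if status in ("UpdateFailed", "DeleteFailed", "DeleteRollbackFailed"):
        return RETRY_SAME_ARN
    return "no action needed"

print(remediation_for("UpdateFailed"))
# retry the operation with the same policy ARN
```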

## Deleting clusters
<a name="hp-eks-troubleshoot-delete-policies"></a>

The following lists known solutions to errors relating to deleting clusters.
+ When cluster deletion fails due to attached SageMaker HyperPod task governance policies, you must delete the policies first. See [Delete policies](sagemaker-hyperpod-eks-operate-console-ui-governance-policies-delete.md).
+ When cluster deletion fails because the following permissions are missing, update your cluster administrator's minimum set of permissions. See the **Amazon EKS** tab in the [IAM users for cluster admin](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-admin) section.
  + `sagemaker:ListComputeQuotas`
  + `sagemaker:ListClusterSchedulerConfig`
  + `sagemaker:DeleteComputeQuota`
  + `sagemaker:DeleteClusterSchedulerConfig`

## Unallocated resource sharing
<a name="hp-eks-troubleshoot-unallocated-resource-sharing"></a>

If your unallocated resource pool capacity is less than expected:

1. **Check node ready status**

   ```
   kubectl get nodes
   ```

   Verify all nodes show `Ready` status in the STATUS column.

1. **Check node schedulable status**

   ```
   kubectl get nodes -o custom-columns=NAME:.metadata.name,UNSCHEDULABLE:.spec.unschedulable
   ```

   Verify nodes show `<none>` or `false` (not `true`).

1. **List unallocated resource sharing ClusterQueues:**

   ```
   kubectl get clusterqueue | grep hyperpod-ns-idle-resource-sharing
   ```

   This shows all unallocated resource sharing ClusterQueues. If the ClusterQueues are not showing up, check the `FailureReason` field of the ClusterSchedulerConfig policy for failure messages to continue debugging.

1. **Verify unallocated resource sharing quota:**

   ```
   kubectl describe clusterqueue hyperpod-ns-idle-resource-sharing-<index>
   ```

   Check the `spec.resourceGroups[].flavors[].resources` section to see the quota allocated for each resource flavor.

   Multiple unallocated resource sharing ClusterQueues may exist depending on the number of resource flavors in your cluster. 

1. **Check MIG configuration status (GPU nodes):**

   ```
   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/mig\.config\.state}{"\n"}{end}'
   ```

   Verify MIG-enabled nodes show `success` state.

# Attribution document for Amazon SageMaker HyperPod task governance
<a name="sagemaker-hyperpod-eks-operate-console-ui-governance-attributions"></a>

The following lists attributions and third-party licenses for material used in Amazon SageMaker HyperPod task governance.

**Topics**
+ [[base-files](https://packages.debian.org/bookworm/base-files)](#hp-eks-task-governance-attributions-base-files)
+ [[netbase](https://packages.debian.org/source/stable/netbase)](#hp-eks-task-governance-attributions-netbase)
+ [[golang-lru](https://github.com/hashicorp/golang-lru)](#hp-eks-task-governance-attributions-golang-lru)

## [base-files](https://packages.debian.org/bookworm/base-files)
<a name="hp-eks-task-governance-attributions-base-files"></a>

```
This is the Debian prepackaged version of the Debian Base System
Miscellaneous files. These files were written by Ian Murdock
<imurdock@debian.org> and Bruce Perens <bruce@pixar.com>.

This package was first put together by Bruce Perens <Bruce@Pixar.com>,
from his own sources.

The GNU Public Licenses in /usr/share/common-licenses were taken from
ftp.gnu.org and are copyrighted by the Free Software Foundation, Inc.

The Artistic License in /usr/share/common-licenses is the one coming
from Perl and its SPDX name is "Artistic License 1.0 (Perl)".


Copyright © 1995-2011 Software in the Public Interest.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

On Debian systems, the complete text of the GNU General
Public License can be found in `/usr/share/common-licenses/GPL'.
```

## [netbase](https://packages.debian.org/source/stable/netbase)
<a name="hp-eks-task-governance-attributions-netbase"></a>

```
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Comment:
 This package was created by Peter Tobias tobias@et-inf.fho-emden.de on
 Wed, 24 Aug 1994 21:33:28 +0200 and maintained by Anthony Towns
 <ajt@debian.org> until 2001.
 It is currently maintained by Marco d'Itri <md@linux.it>.

Files: *
Copyright:
 Copyright © 1994-1998 Peter Tobias
 Copyright © 1998-2001 Anthony Towns
 Copyright © 2002-2022 Marco d'Itri
License: GPL-2
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License, version 2, as
 published by the Free Software Foundation.
 .
 This program is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 GNU General Public License for more details.
 .
 You should have received a copy of the GNU General Public License along
 with this program; if not, write to the Free Software Foundation,
 Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 .
 On Debian systems, the complete text of the GNU General Public License
 version 2 can be found in '/usr/share/common-licenses/GPL-2'.
```

## [golang-lru](https://github.com/hashicorp/golang-lru)
<a name="hp-eks-task-governance-attributions-golang-lru"></a>

```
Copyright © 2014 HashiCorp, Inc.

Mozilla Public License, version 2.0

1. Definitions

1.1. "Contributor"

     means each individual or legal entity that creates, contributes to the
     creation of, or owns Covered Software.

1.2. "Contributor Version"

     means the combination of the Contributions of others (if any) used by a
     Contributor and that particular Contributor's Contribution.

1.3. "Contribution"

     means Covered Software of a particular Contributor.

1.4. "Covered Software"

     means Source Code Form to which the initial Contributor has attached the
     notice in Exhibit A, the Executable Form of such Source Code Form, and
     Modifications of such Source Code Form, in each case including portions
     thereof.

1.5. "Incompatible With Secondary Licenses"
     means

     a. that the initial Contributor has attached the notice described in
        Exhibit B to the Covered Software; or

     b. that the Covered Software was made available under the terms of
        version 1.1 or earlier of the License, but not also under the terms of
        a Secondary License.

1.6. "Executable Form"

     means any form of the work other than Source Code Form.

1.7. "Larger Work"

     means a work that combines Covered Software with other material, in a
     separate file or files, that is not Covered Software.

1.8. "License"

     means this document.

1.9. "Licensable"

     means having the right to grant, to the maximum extent possible, whether
     at the time of the initial grant or subsequently, any and all of the
     rights conveyed by this License.

1.10. "Modifications"

     means any of the following:

     a. any file in Source Code Form that results from an addition to,
        deletion from, or modification of the contents of Covered Software; or

     b. any new file in Source Code Form that contains any Covered Software.

1.11. "Patent Claims" of a Contributor

      means any patent claim(s), including without limitation, method,
      process, and apparatus claims, in any patent Licensable by such
      Contributor that would be infringed, but for the grant of the License,
      by the making, using, selling, offering for sale, having made, import,
      or transfer of either its Contributions or its Contributor Version.

1.12. "Secondary License"

      means either the GNU General Public License, Version 2.0, the GNU Lesser
      General Public License, Version 2.1, the GNU Affero General Public
      License, Version 3.0, or any later versions of those licenses.

1.13. "Source Code Form"

      means the form of the work preferred for making modifications.

1.14. "You" (or "Your")

      means an individual or a legal entity exercising rights under this
      License. For legal entities, "You" includes any entity that controls, is
      controlled by, or is under common control with You. For purposes of this
      definition, "control" means (a) the power, direct or indirect, to cause
      the direction or management of such entity, whether by contract or
      otherwise, or (b) ownership of more than fifty percent (50%) of the
      outstanding shares or beneficial ownership of such entity.


2. License Grants and Conditions

2.1. Grants

     Each Contributor hereby grants You a world-wide, royalty-free,
     non-exclusive license:

     a. under intellectual property rights (other than patent or trademark)
        Licensable by such Contributor to use, reproduce, make available,
        modify, display, perform, distribute, and otherwise exploit its
        Contributions, either on an unmodified basis, with Modifications, or
        as part of a Larger Work; and

     b. under Patent Claims of such Contributor to make, use, sell, offer for
        sale, have made, import, and otherwise transfer either its
        Contributions or its Contributor Version.

2.2. Effective Date

     The licenses granted in Section 2.1 with respect to any Contribution
     become effective for each Contribution on the date the Contributor first
     distributes such Contribution.

2.3. Limitations on Grant Scope

     The licenses granted in this Section 2 are the only rights granted under
     this License. No additional rights or licenses will be implied from the
     distribution or licensing of Covered Software under this License.
     Notwithstanding Section 2.1(b) above, no patent license is granted by a
     Contributor:

     a. for any code that a Contributor has removed from Covered Software; or

     b. for infringements caused by: (i) Your and any other third party's
        modifications of Covered Software, or (ii) the combination of its
        Contributions with other software (except as part of its Contributor
        Version); or

     c. under Patent Claims infringed by Covered Software in the absence of
        its Contributions.

     This License does not grant any rights in the trademarks, service marks,
     or logos of any Contributor (except as may be necessary to comply with
     the notice requirements in Section 3.4).

2.4. Subsequent Licenses

     No Contributor makes additional grants as a result of Your choice to
     distribute the Covered Software under a subsequent version of this
     License (see Section 10.2) or under the terms of a Secondary License (if
     permitted under the terms of Section 3.3).

2.5. Representation

     Each Contributor represents that the Contributor believes its
     Contributions are its original creation(s) or it has sufficient rights to
     grant the rights to its Contributions conveyed by this License.

2.6. Fair Use

     This License is not intended to limit any rights You have under
     applicable copyright doctrines of fair use, fair dealing, or other
     equivalents.

2.7. Conditions

     Sections 3.1, 3.2, 3.3, and 3.4 are conditions of the licenses granted in
     Section 2.1.


3. Responsibilities

3.1. Distribution of Source Form

     All distribution of Covered Software in Source Code Form, including any
     Modifications that You create or to which You contribute, must be under
     the terms of this License. You must inform recipients that the Source
     Code Form of the Covered Software is governed by the terms of this
     License, and how they can obtain a copy of this License. You may not
     attempt to alter or restrict the recipients' rights in the Source Code
     Form.

3.2. Distribution of Executable Form

     If You distribute Covered Software in Executable Form then:

     a. such Covered Software must also be made available in Source Code Form,
        as described in Section 3.1, and You must inform recipients of the
        Executable Form how they can obtain a copy of such Source Code Form by
        reasonable means in a timely manner, at a charge no more than the cost
        of distribution to the recipient; and

     b. You may distribute such Executable Form under the terms of this
        License, or sublicense it under different terms, provided that the
        license for the Executable Form does not attempt to limit or alter the
        recipients' rights in the Source Code Form under this License.

3.3. Distribution of a Larger Work

     You may create and distribute a Larger Work under terms of Your choice,
     provided that You also comply with the requirements of this License for
     the Covered Software. If the Larger Work is a combination of Covered
     Software with a work governed by one or more Secondary Licenses, and the
     Covered Software is not Incompatible With Secondary Licenses, this
     License permits You to additionally distribute such Covered Software
     under the terms of such Secondary License(s), so that the recipient of
     the Larger Work may, at their option, further distribute the Covered
     Software under the terms of either this License or such Secondary
     License(s).

3.4. Notices

     You may not remove or alter the substance of any license notices
     (including copyright notices, patent notices, disclaimers of warranty, or
     limitations of liability) contained within the Source Code Form of the
     Covered Software, except that You may alter any license notices to the
     extent required to remedy known factual inaccuracies.

3.5. Application of Additional Terms

     You may choose to offer, and to charge a fee for, warranty, support,
     indemnity or liability obligations to one or more recipients of Covered
     Software. However, You may do so only on Your own behalf, and not on
     behalf of any Contributor. You must make it absolutely clear that any
     such warranty, support, indemnity, or liability obligation is offered by
     You alone, and You hereby agree to indemnify every Contributor for any
     liability incurred by such Contributor as a result of warranty, support,
     indemnity or liability terms You offer. You may include additional
     disclaimers of warranty and limitations of liability specific to any
     jurisdiction.

4. Inability to Comply Due to Statute or Regulation

   If it is impossible for You to comply with any of the terms of this License
   with respect to some or all of the Covered Software due to statute,
   judicial order, or regulation then You must: (a) comply with the terms of
   this License to the maximum extent possible; and (b) describe the
   limitations and the code they affect. Such description must be placed in a
   text file included with all distributions of the Covered Software under
   this License. Except to the extent prohibited by statute or regulation,
   such description must be sufficiently detailed for a recipient of ordinary
   skill to be able to understand it.

5. Termination

5.1. The rights granted under this License will terminate automatically if You
     fail to comply with any of its terms. However, if You become compliant,
     then the rights granted under this License from a particular Contributor
     are reinstated (a) provisionally, unless and until such Contributor
     explicitly and finally terminates Your grants, and (b) on an ongoing
     basis, if such Contributor fails to notify You of the non-compliance by
     some reasonable means prior to 60 days after You have come back into
     compliance. Moreover, Your grants from a particular Contributor are
     reinstated on an ongoing basis if such Contributor notifies You of the
     non-compliance by some reasonable means, this is the first time You have
     received notice of non-compliance with this License from such
     Contributor, and You become compliant prior to 30 days after Your receipt
     of the notice.

5.2. If You initiate litigation against any entity by asserting a patent
     infringement claim (excluding declaratory judgment actions,
     counter-claims, and cross-claims) alleging that a Contributor Version
     directly or indirectly infringes any patent, then the rights granted to
     You by any and all Contributors for the Covered Software under Section
     2.1 of this License shall terminate.

5.3. In the event of termination under Sections 5.1 or 5.2 above, all end user
     license agreements (excluding distributors and resellers) which have been
     validly granted by You or Your distributors under this License prior to
     termination shall survive termination.

6. Disclaimer of Warranty

   Covered Software is provided under this License on an "as is" basis,
   without warranty of any kind, either expressed, implied, or statutory,
   including, without limitation, warranties that the Covered Software is free
   of defects, merchantable, fit for a particular purpose or non-infringing.
   The entire risk as to the quality and performance of the Covered Software
   is with You. Should any Covered Software prove defective in any respect,
   You (not any Contributor) assume the cost of any necessary servicing,
   repair, or correction. This disclaimer of warranty constitutes an essential
   part of this License. No use of  any Covered Software is authorized under
   this License except under this disclaimer.

7. Limitation of Liability

   Under no circumstances and under no legal theory, whether tort (including
   negligence), contract, or otherwise, shall any Contributor, or anyone who
   distributes Covered Software as permitted above, be liable to You for any
   direct, indirect, special, incidental, or consequential damages of any
   character including, without limitation, damages for lost profits, loss of
   goodwill, work stoppage, computer failure or malfunction, or any and all
   other commercial damages or losses, even if such party shall have been
   informed of the possibility of such damages. This limitation of liability
   shall not apply to liability for death or personal injury resulting from
   such party's negligence to the extent applicable law prohibits such
   limitation. Some jurisdictions do not allow the exclusion or limitation of
   incidental or consequential damages, so this exclusion and limitation may
   not apply to You.

8. Litigation

   Any litigation relating to this License may be brought only in the courts
   of a jurisdiction where the defendant maintains its principal place of
   business and such litigation shall be governed by laws of that
   jurisdiction, without reference to its conflict-of-law provisions. Nothing
   in this Section shall prevent a party's ability to bring cross-claims or
   counter-claims.

9. Miscellaneous

   This License represents the complete agreement concerning the subject
   matter hereof. If any provision of this License is held to be
   unenforceable, such provision shall be reformed only to the extent
   necessary to make it enforceable. Any law or regulation which provides that
   the language of a contract shall be construed against the drafter shall not
   be used to construe this License against a Contributor.


10. Versions of the License

10.1. New Versions

      Mozilla Foundation is the license steward. Except as provided in Section
      10.3, no one other than the license steward has the right to modify or
      publish new versions of this License. Each version will be given a
      distinguishing version number.

10.2. Effect of New Versions

      You may distribute the Covered Software under the terms of the version
      of the License under which You originally received the Covered Software,
      or under the terms of any subsequent version published by the license
      steward.

10.3. Modified Versions

      If you create software not governed by this License, and you want to
      create a new license for such software, you may create and use a
      modified version of this License if you rename the license and remove
      any references to the name of the license steward (except to note that
      such modified license differs from this License).

10.4. Distributing Source Code Form that is Incompatible With Secondary
      Licenses If You choose to distribute Source Code Form that is
      Incompatible With Secondary Licenses under the terms of this version of
      the License, the notice described in Exhibit B of this License must be
      attached.

Exhibit A - Source Code Form License Notice

      This Source Code Form is subject to the
      terms of the Mozilla Public License, v.
      2.0. If a copy of the MPL was not
      distributed with this file, You can
      obtain one at
      http://mozilla.org/MPL/2.0/.

If it is not possible or desirable to put the notice in a particular file,
then You may include the notice in a location (such as a LICENSE file in a
relevant directory) where a recipient would be likely to look for such a
notice.

You may add additional accurate notices of copyright ownership.

Exhibit B - "Incompatible With Secondary Licenses" Notice

      This Source Code Form is "Incompatible
      With Secondary Licenses", as defined by
      the Mozilla Public License, v. 2.0.
```

# Usage reporting for cost attribution in SageMaker HyperPod
<a name="sagemaker-hyperpod-usage-reporting"></a>

Usage reporting in SageMaker HyperPod EKS-orchestrated clusters provides granular visibility into compute resource consumption. This capability allows organizations to implement transparent cost attribution, allocating cluster costs to teams, projects, or departments based on their actual usage. By tracking metrics such as GPU hours, CPU hours, and Neuron Core utilization, captured in *both team-level aggregates and task-specific breakdowns*, usage reporting complements HyperPod's [Task Governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md) functionality, ensuring fair cost distribution in shared multi-tenant clusters by:
+ Eliminating guesswork in cost allocation
+ Directly linking expenses to measurable resource consumption
+ Enforcing usage-based accountability in shared infrastructure environments
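Cost attribution from reported usage reduces to simple arithmetic. The sketch below assumes a hypothetical flat hourly rate per instance type; HyperPod usage reports provide the hours, not the prices, and the function name and rate are illustrative only.

```python
# A minimal sketch of usage-based cost attribution, assuming a
# hypothetical flat hourly rate per instance type. The rate and
# function name are illustrative; HyperPod reports hours, not prices.

def attribute_cost(allocated_hours: float, borrowed_hours: float,
                   hourly_rate: float) -> float:
    """Charge a team for both allocated-quota and borrowed usage."""
    return (allocated_hours + borrowed_hours) * hourly_rate

# For example, 170 allocated plus 30 borrowed GPU hours at a
# hypothetical $10/hour rate:
print(attribute_cost(170, 30, hourly_rate=10.0))  # 2000.0
```

Because billing covers both allocated and borrowed hours, teams are accountable for all capacity they actually consume, not just their reserved quota.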

## Prerequisites
<a name="sagemaker-hyperpod-usage-reporting-prerequisites"></a>

To use this capability:
+ You need:
  + An active **SageMaker HyperPod environment** with a running EKS-orchestrated cluster.
  + (Strongly recommended) **Task Governance configured** with compute quotas and priority rules. For setup instructions, see [Task Governance setup](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md).
+ Familiarize yourself with these core concepts:
  + **Allocated compute quota:** Resources reserved for a team based on predefined quotas in their Task Governance policies. This is *guaranteed capacity* for their workloads.
  + **Borrowed compute:** Idle resources from the shared cluster pool that teams can temporarily use *beyond their allocated quota*. Borrowed compute is assigned dynamically based on priority rules in the Task Governance policies and availability of unused resources.
  + **Compute usage:** The measurement of resources (GPU, CPU, Neuron Core hours) consumed by a team, tracked as:
    + **Allocated utilization**: Usage within the team's quota.
    + **Borrowed utilization**: Usage beyond the quota, drawn from the shared pool.
  + **Cost attribution:** The process of allocating cluster costs to teams based on their *actual compute usage*, including both resources consumed within their predefined quota and resources temporarily used from the shared cluster pool beyond their quota.

## Report types
<a name="sagemaker-hyperpod-usage-reporting-report-types"></a>

HyperPod's usage reports provide varying operational granularity:
+ **Summary reports** provide organization-wide visibility into compute usage, aggregating total GPU/CPU/Neuron Core hours per team (namespace) while distinguishing between *regular usage* (resources from a team's allocated quota) and *borrowed compute* (overflow capacity from shared pools).
+ **Detailed reports** offer task-level breakdowns by team, tracking exact compute hours spent running specific tasks – including preempted tasks, hourly utilization patterns, and namespace-specific allocations.

**Important**  
HyperPod usage reporting tracks compute utilization across *all Kubernetes namespaces* in a cluster—including those managed by Task Governance, default namespaces, and namespaces created **outside of Task Governance** (e.g., via direct Kubernetes API calls or external tools). This infrastructure-level monitoring ensures comprehensive usage-based accountability, preventing gaps in cost attribution for shared clusters regardless of how namespaces are managed.

## Report formats and time range
<a name="sagemaker-hyperpod-usage-reporting-formats"></a>

Using the Python script provided in [Generate reports](sagemaker-hyperpod-usage-reporting-generate.md), administrators can generate usage reports on demand in CSV or PDF formats, selecting time ranges from daily snapshots to 180-day (6-month) historical windows.

**Note**  
You can configure the historical window to extend beyond the default 180-day maximum when setting up the reporting infrastructure. For more information on configuring the data retention period, see [Install Usage Report Infrastructure using CloudFormation](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#install-usage-report-infrastructure-using-cloudformation). 

## Illustrative use cases
<a name="sagemaker-hyperpod-usage-reporting-use-cases"></a>

This capability addresses critical scenarios in multi-tenant AI/ML environments such as:

1. **Cost allocation for shared clusters**: An administrator manages a HyperPod cluster shared by 20 teams training generative AI models. Using a *summary usage report*, they analyze daily GPU utilization over 180 days and discover Team A consumed 200 GPU hours of a specific instance type—170 from their allocated quota and 30 from borrowed compute. The administrator invoices Team A based on this reported usage.

1. **Auditing and dispute resolution**: A finance team questions cost attribution accuracy, citing inconsistencies. The administrator can export a *detailed task-level report* to audit discrepancies. By cross-referencing timestamps, instance types, and preempted jobs within the team's namespace, the report transparently reconciles disputed usage data.

# Report details and data breakdown
<a name="sagemaker-hyperpod-usage-reporting-content"></a>

SageMaker HyperPod's usage reports provide two distinct lenses for analyzing compute resource consumption: **summary reports** for cost allocation and **detailed reports** for granular auditing. Summary reports aggregate cluster-wide usage by team or namespace, highlighting trends in allocated versus borrowed compute across GPU, CPU, and Neuron Core resources. Detailed reports drill into individual tasks, exposing metrics such as execution windows, task status, and priority-class utilization. In this section, we break down the structure of these reports, explain their key metrics, and demonstrate how administrators and finance teams can cross-reference summary trends with task-level data to validate cost attribution accuracy, resolve discrepancies, and optimize shared infrastructure.

## Common report headers
<a name="sagemaker-hyperpod-usage-reporting-content-headers"></a>

Both summary and detailed reports include the following metadata to contextualize the usage data:
+ **ClusterName:** The EKS-orchestrated HyperPod cluster name where resources were consumed.
+ **Type:** The report category (`Summary Utilization Report` or `Detailed Utilization Report`).
+ **Date Generated:** When the report was created (e.g., `2025-04-18`).
+ **Date Range (UTC):** The timeframe covered (e.g., `2025-04-16 to 2025-04-18`).
+ **Missing data periods:** Gaps in data collection due to cluster downtime or monitoring issues (e.g., `2025-04-16 00:00:00 to 2025-04-19 00:00:00`).

## Summary reports
<a name="sagemaker-hyperpod-usage-reporting-content-summary"></a>

Summary reports provide a high-level, per-day overview of compute resource consumption across teams/namespaces and instance types, distinguishing between allocated (reserved quota) and borrowed (shared pool) utilization. These reports are ideal for invoice generation, cost attribution statements, or capacity forecasting.

*Example: A summary report might show that Team A used 200 GPU hours—170 from their allocated quota and 30 borrowed.*

Here's a structured breakdown of the key columns in a summary report:
+ **Date:** The date of the reported usage (e.g., `2025-04-18`).
+ **Namespace:** The Kubernetes namespace associated with the team (e.g., `hyperpod-ns-ml-team`).
+ **Team:** The owning team/department (e.g., `ml-team`).
+ **Instance Type:** The compute instance type used (e.g., `ml.g5.4xlarge`).
+ **Total/Allocated/Borrowed Utilization (Hours):** The breakdown of GPU, CPU, or Neuron Core usage by category.

  Where:
  + **Total utilization = Allocated utilization + Borrowed utilization**
  + **Allocated utilization** is the actual GPU, CPU, or Neuron Core hours a team has used, capped at 100% of their allocated quota.
  + **Borrowed utilization** is the actual GPU, CPU, or Neuron Core hours a team has used *beyond their allocated quota*, drawn from the shared cluster pool based on Task Governance priority rules and resource availability.

Example: 72 GPU hours total (48 allocated, 24 borrowed).

**Note**  
Only total utilization is displayed for namespaces not managed by Task Governance.
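The allocated/borrowed split above follows directly from a team's quota. The following is an illustrative sketch of that accounting rule, not the report generator's actual code; `used_hours` and `quota_hours` are hypothetical inputs:

```python
def split_utilization(used_hours: float, quota_hours: float) -> dict:
    """Split total usage into allocated (capped at quota) and borrowed hours."""
    allocated = min(used_hours, quota_hours)
    borrowed = max(used_hours - quota_hours, 0.0)
    return {"total": used_hours, "allocated": allocated, "borrowed": borrowed}

# Team A from the cost-allocation example: 200 GPU hours against a 170-hour quota
print(split_utilization(200, 170))
# {'total': 200, 'allocated': 170, 'borrowed': 30}
```

The same rule reproduces the 72-hour example above: `split_utilization(72, 48)` yields 48 allocated and 24 borrowed hours.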

## Detailed reports
<a name="sagemaker-hyperpod-usage-reporting-content-detailed"></a>

Detailed reports provide forensic-level visibility into compute usage, breaking down resource consumption by task and exposing granular metrics such as task execution windows, status (e.g., Succeeded, Failed), and priority-class usage. These reports are ideal for validating billing discrepancies or ensuring compliance with governance policies.

Here's a structured breakdown of the key columns in a detailed report:
+ **Date:** The date of the reported usage (e.g., `2025-04-18`).
+ **Period Start/End:** The exact execution window (UTC) for the task (e.g., `19:54:34`).
+ **Namespace:** The Kubernetes namespace associated with the team (e.g., `hyperpod-ns-ml-team`).
+ **Team:** The owning team/department (e.g., `ml-team`).
+ **Task:** The identifier for the job/pod (e.g., `pytorchjob-ml-pytorch-job-2p5zt-db686`).
+ **Instance:** The compute instance used (e.g., `ml.g5.4xlarge`).
+ **Status:** Task outcome (Succeeded, Failed, Preempted).
+ **Total Utilization:** Total consumption (hours and instance count) of GPU, CPU, or Neuron Core resources.
+ **Priority Class:** The priority tier assigned (e.g., training-priority).
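A detailed CSV report with columns like those above can be aggregated per team using only the Python standard library. The header spelling below is illustrative and may differ from your generated reports; check an exported file before adapting this sketch:

```python
import csv
import io
from collections import defaultdict

# Synthetic rows shaped like a detailed report (header names are illustrative)
report = io.StringIO(
    "Date,Namespace,Team,Task,Instance,Status,Total Utilization (Hours)\n"
    "2025-04-18,hyperpod-ns-ml-team,ml-team,pytorchjob-a,ml.g5.4xlarge,Succeeded,4.5\n"
    "2025-04-18,hyperpod-ns-ml-team,ml-team,pytorchjob-b,ml.g5.4xlarge,Preempted,1.5\n"
)

# Sum task-level hours into a per-team total for cross-checking summary reports
hours_by_team = defaultdict(float)
for row in csv.DictReader(report):
    hours_by_team[row["Team"]] += float(row["Total Utilization (Hours)"])

print(dict(hours_by_team))
# {'ml-team': 6.0}
```

Aggregating task rows this way lets you reconcile a summary report's per-team totals against the underlying task-level data.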

# Generate reports
<a name="sagemaker-hyperpod-usage-reporting-generate"></a>

This guide provides step-by-step instructions to configure and manage usage reporting for your SageMaker HyperPod clusters. Follow these procedures to deploy infrastructure, generate custom reports, and remove resources when no longer needed.

## Set up usage reporting
<a name="sagemaker-hyperpod-usage-reporting-install"></a>

**Note**  
Before configuring the SageMaker HyperPod usage report infrastructure in your SageMaker HyperPod cluster, ensure you have met all prerequisites detailed in the [Prerequisites](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#prerequisites) section of the usage report README.

Usage reporting in HyperPod requires:
+ Deploying SageMaker HyperPod usage report AWS resources using a CloudFormation stack
+ Installing the SageMaker HyperPod usage report Kubernetes operator via a Helm chart

You can find comprehensive installation instructions in the [SageMaker HyperPod usage report GitHub repository](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md). Specifically, follow the steps in the [Set up](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#set-up-usage-reporting) section.

## Generate usage reports on demand
<a name="sagemaker-hyperpod-usage-reporting-use"></a>

Once the usage reporting infrastructure and Kubernetes operator are installed, job data for your SageMaker HyperPod cluster is automatically collected and stored in the S3 bucket you configured during setup. The operator continuously captures detailed usage metrics in the background, creating raw data files in the `raw` directory of your designated S3 bucket.

To generate an on-demand usage report, you can use the `run.py` script provided in the [SageMaker HyperPod usage report GitHub repository](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md) to extract and export usage metrics. Specifically, you can find the script and comprehensive instructions for generating a report in the [Generate Reports](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#generate-reports) section.

The script allows you to:
+ Specify custom date ranges for report generation
+ Choose between detailed and summary report types
+ Export reports in CSV or PDF formats
+ Direct reports to a specific S3 location

## Clean up usage reporting resources
<a name="sagemaker-hyperpod-usage-reporting-cleanup"></a>

When you no longer need your SageMaker HyperPod usage reporting infrastructure, follow the steps in [Clean Up Resources](https://github.com/awslabs/sagemaker-hyperpod-usage-report/blob/main/README.md#clean-up-resources) to clean up the Kubernetes operator and AWS resources (in that order). Proper resource deletion helps prevent unnecessary costs.

# Configuring storage for SageMaker HyperPod clusters orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-setup-storage"></a>

Cluster admins need to configure storage so that data scientist users can manage input and output data and store checkpoints during training on SageMaker HyperPod clusters.

**Handling large datasets (input/output data)**
+ **Data access and management**: Data scientists often work with large datasets that are required for training machine learning models. Specifying storage parameters in the job submission allows them to define where these datasets are located (e.g., Amazon S3 buckets, persistent volumes in Kubernetes) and how they are accessed during the job execution.
+ **Performance optimization**: The efficiency of accessing input data can significantly impact the performance of the training job. By optimizing storage parameters, data scientists can ensure that data is read and written efficiently, reducing I/O bottlenecks.

**Storing checkpoints**
+ **Checkpointing in training**: During long-running training jobs, it’s common practice to save checkpoints—intermediate states of the model. This allows data scientists to resume training from a specific point in case of a failure, rather than starting from scratch.
+ **Data recovery and experimentation**: By specifying the storage location for checkpoints, data scientists can ensure that these checkpoints are securely stored, potentially in a distributed storage system that offers redundancy and high availability. This is crucial for recovering from interruptions and for experimenting with different training strategies.
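As a minimal, framework-agnostic illustration of the checkpoint/resume pattern described above (real training jobs would use their framework's checkpoint APIs and a shared mount such as FSx for Lustre; the paths here are stand-ins):

```python
import json
import os
import tempfile

# Stand-in for a shared checkpoint directory (e.g., an FSx for Lustre mount path)
ckpt_dir = tempfile.mkdtemp()
ckpt_path = os.path.join(ckpt_dir, "checkpoint.json")

def save_checkpoint(step: int, state: dict) -> None:
    # Write to a temp file and rename atomically so a crash mid-write
    # cannot leave a corrupt checkpoint behind
    tmp = ckpt_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, ckpt_path)

def load_checkpoint() -> dict:
    # Resume from the last saved state, or start fresh if none exists
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            return json.load(f)
    return {"step": 0, "state": {}}

save_checkpoint(100, {"loss": 0.25})
resumed = load_checkpoint()
print(resumed["step"])
# 100
```

The atomic-rename detail matters on shared storage: a job restarted after a fault must never read a half-written checkpoint.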

**Tip**  
For a hands-on experience and guidance on how to set up storage for SageMaker HyperPod cluster orchestrated with Amazon EKS, see the following sections in the [Amazon EKS Support in SageMaker HyperPod workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e).  
[Set up Amazon FSx for Lustre on SageMaker HyperPod](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e/en-US/01-cluster/06-fsx-for-lustre)
[Set up a mountpoint for Amazon S3](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e/en-US/01-cluster/09-s3-mountpoint) using [Mountpoint for Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mountpoint.html)

# Using the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters
<a name="sagemaker-hyperpod-eks-ebs"></a>

SageMaker HyperPod supports the Amazon Elastic Block Store (Amazon EBS) Container Storage Interface (CSI) driver, which manages the lifecycle of Amazon EBS volumes as storage for the Kubernetes volumes that you create. With the Amazon EBS CSI driver, you can create, attach, and manage your Amazon EBS volumes for your machine learning workloads running on SageMaker HyperPod clusters with Amazon EKS orchestration.

**Topics**
+ [Key storage capabilities](#sagemaker-hyperpod-eks-ebs-features)
+ [Use cases](#sagemaker-hyperpod-eks-ebs-use)
+ [Setting up the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters](#sagemaker-hyperpod-eks-ebs-setup)
+ [Using the APIs](#sagemaker-hyperpod-eks-ebs-setup-apis)

## Key storage capabilities
<a name="sagemaker-hyperpod-eks-ebs-features"></a>

The Amazon EBS CSI driver on SageMaker HyperPod supports the following storage capabilities.
+ Static provisioning: Associates pre-created Amazon EBS volumes with Kubernetes [persistent volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) for use in your pods.
+ Dynamic provisioning: Automatically creates Amazon EBS volumes and associated persistent volumes from [persistent volume claims](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims). Parameters can be passed via [storage classes](https://kubernetes.io/docs/concepts/storage/storage-classes/) for fine-grained control over volume creation.
+ Volume resizing: Expands existing volumes by updating the [persistent volume claim](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims) size specification without disrupting running workloads. This can be essential for handling growing model repositories or adapting to larger nodes without service disruption.
+ Volume snapshots: Creates point-in-time snapshots of volumes for backup, recovery, and data versioning.
+ Block volumes: Provides raw block device access for high-performance applications requiring direct storage access.
+ Volume modification: Changes volume properties such as type, input or output operations per second (IOPS), or throughput using [volume attributes classes](https://kubernetes.io/docs/concepts/storage/volume-attributes-classes/).
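As an illustration of dynamic provisioning, a minimal StorageClass and PersistentVolumeClaim might look like the following. The class name and size are placeholders; `ebs.csi.aws.com` is the EBS CSI driver's standard provisioner name:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3                      # placeholder class name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data                # placeholder claim name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ebs-gp3
  resources:
    requests:
      storage: 100Gi                 # placeholder size
```

Pods that reference the claim trigger volume creation when they are scheduled, because of the `WaitForFirstConsumer` binding mode.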

For more information about the Amazon EBS CSI driver, see [Use Kubernetes volume storage with Amazon EBS](https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html) from the *Amazon EKS User Guide*.

For more information about storage to pods in your cluster, see [Storage](https://kubernetes.io/docs/concepts/storage/) from the *Kubernetes Documentation*.

## Use cases
<a name="sagemaker-hyperpod-eks-ebs-use"></a>

The Amazon EBS CSI driver integration enables several key use cases for both training and inference workloads on SageMaker HyperPod EKS clusters.

**Training workloads**
+ Dataset storage: Provision volumes for training datasets that persist across pod restarts
+ Checkpoint storage: Save model checkpoints and intermediate training results
+ Shared artifacts: Access common datasets and model artifacts across multiple training jobs

**Inference workloads**
+ Model storage: Dynamically provision appropriately sized volumes based on model requirements
+ Container caching: Create ephemeral storage for improved inference performance
+ Event logging: Store inference results and logs with persistent storage

## Setting up the Amazon EBS CSI driver on SageMaker HyperPod EKS clusters
<a name="sagemaker-hyperpod-eks-ebs-setup"></a>

The Amazon Elastic Block Store (Amazon EBS) Container Storage Interface (CSI) driver allows you to dynamically provision and manage Amazon EBS volumes for your containerized workloads running on SageMaker HyperPod clusters with EKS orchestration. This section walks you through installing and configuring the Amazon EBS CSI driver to enable persistent storage for your machine learning workloads.

### Prerequisites
<a name="sagemaker-hyperpod-eks-ebs-setup-prerequisite"></a>

Before you begin, do the following:
+ [Install and configure the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html)
+ [Create a SageMaker HyperPod cluster with Amazon EKS orchestration](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-create-cluster.html)
+ Install version [v1.47.0](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/CHANGELOG.md#v1470) of the Amazon EBS CSI driver

### Additional permissions
<a name="sagemaker-hyperpod-eks-ebs-setup-permissions"></a>

To set up the Amazon EBS CSI driver add-on, follow the instructions in [Use Kubernetes volume storage with Amazon EBS](https://docs.aws.amazon.com/eks/latest/userguide/ebs-csi.html) from the *Amazon EKS User Guide*. You should also add the following additional permissions to the IAM role used to run the driver add-on. Note that this is the IAM role specified in your service account configuration for the driver add-on, not the HyperPod cluster execution role.

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement":
    [
        {
            "Effect": "Allow",
            "Action":
            [
                "sagemaker:AttachClusterNodeVolume",
                "sagemaker:DetachClusterNodeVolume"
            ],
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:cluster/*"
        },
        {
            "Effect": "Allow",
            "Action":
            [
                "eks:DescribeCluster"
            ],
            "Resource": "arn:aws:eks:us-east-1:111122223333:cluster/my-cluster-name"
        }
    ]
}
```

------

## Using the APIs
<a name="sagemaker-hyperpod-eks-ebs-setup-apis"></a>

As an alternative, you can use the [AttachClusterNodeVolume](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AttachClusterNodeVolume.html) and [DetachClusterNodeVolume](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DetachClusterNodeVolume.html) API operations to attach and detach your Amazon EBS volumes to SageMaker HyperPod EKS cluster instances.

**Key requirements for using these APIs include the following.**
+ Both the Amazon EBS volume and SageMaker HyperPod EKS cluster must be owned by the same AWS account.
+ The calling principal needs specific minimum permissions to successfully perform the attach or detach operation. For more information about the minimum permissions, see the following sections.
+ After attaching a volume to your HyperPod node, follow the instructions in [Accessing SageMaker HyperPod cluster nodes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-access-through-terminal.html) to access the cluster node, and [Make a volume available for use](https://docs.aws.amazon.com/ebs/latest/userguide/ebs-using-volumes.html) to mount the attached volume.
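The `Resource` ARNs scoped in this section's permission policies follow the standard SageMaker, EKS, and EC2 ARN formats. A small helper for constructing them can reduce copy-paste errors when templating policies; the region, account ID, and cluster name below are placeholders:

```python
def sagemaker_cluster_arn(region: str, account: str, cluster: str = "*") -> str:
    # "*" scopes the statement to all HyperPod clusters in the account
    return f"arn:aws:sagemaker:{region}:{account}:cluster/{cluster}"

def eks_cluster_arn(region: str, account: str, cluster: str) -> str:
    return f"arn:aws:eks:{region}:{account}:cluster/{cluster}"

def ec2_volume_arn(region: str, account: str, volume: str = "*") -> str:
    return f"arn:aws:ec2:{region}:{account}:volume/{volume}"

print(sagemaker_cluster_arn("us-east-1", "111122223333"))
# arn:aws:sagemaker:us-east-1:111122223333:cluster/*
```

Scoping the EKS statement to a named cluster while leaving the volume resource as `*`, as the policies in this section do, limits the blast radius of the permission without enumerating every volume.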

### Required permissions for `sagemaker:AttachClusterNodeVolume`
<a name="sagemaker-hyperpod-eks-ebs-setup-apis-attach"></a>

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement":
    [
        {
            "Effect": "Allow",
            "Action":
            [
                "sagemaker:AttachClusterNodeVolume"
            ],
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:cluster/*"
        },
        {
            "Effect": "Allow",
            "Action":
            [
                "eks:DescribeCluster"
            ],
            "Resource": "arn:aws:eks:us-east-1:111122223333:cluster/my-cluster-name"
        },
        {
            "Effect": "Allow",
            "Action":
            [
                "ec2:AttachVolume",
                "ec2:DescribeVolumes"
            ],
            "Resource": "arn:aws:ec2:us-east-1:111122223333:volume/*"
        }
    ]
}
```

------

### Required permissions for `sagemaker:DetachClusterNodeVolume`
<a name="sagemaker-hyperpod-eks-ebs-setup-apis-detach"></a>

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement":
    [
        {
            "Effect": "Allow",
            "Action":
            [
                "sagemaker:DetachClusterNodeVolume"
            ],
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:cluster/*"
        },
        {
            "Effect": "Allow",
            "Action":
            [
                "eks:DescribeCluster"
            ],
            "Resource": "arn:aws:eks:us-east-1:111122223333:cluster/my-cluster-name"
        },
        {
            "Effect": "Allow",
            "Action":
            [
                "ec2:DetachVolume",
                "ec2:DescribeVolumes"
            ],
            "Resource": "arn:aws:ec2:us-east-1:111122223333:volume/*"
        }
    ]
}
```

------

### Required permissions for AWS KMS keys
<a name="sagemaker-hyperpod-eks-ebs-setup-apis-kms"></a>

Add the following AWS KMS permissions only if you're using customer managed KMS keys to encrypt your Amazon EBS volumes attached to HyperPod cluster nodes. These permissions are not required if you're using AWS-managed KMS keys (the default encryption option).

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Id": "key-default-1",
    "Statement":
    [
        {
            "Effect": "Allow",
            "Principal":
            {
                "AWS": "arn:aws:iam::111122223333:role/caller-role"
            },
            "Action": "kms:DescribeKey",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Principal":
            {
                "AWS": "arn:aws:iam::111122223333:role/caller-role"
            },
            "Action": "kms:CreateGrant",
            "Resource": "*",
            "Condition":
            {
                "StringEquals":
                {
                    "kms:CallerAccount": "111122223333",
                    "kms:ViaService": "ec2.us-east-1.amazonaws.com"
                },
                "ForAnyValue:StringEquals":
                {
                    "kms:EncryptionContextKeys": "aws:ebs:id"
                },
                "Bool":
                {
                    "kms:GrantIsForAWSResource": true
                },
                "ForAllValues:StringEquals":
                {
                    "kms:GrantOperations":
                    [
                        "Decrypt"
                    ]
                }
            }
        }
    ]
}
```

------

**Note**  
These AWS KMS permissions are not required for `sagemaker:DetachClusterNodeVolume` when detaching a Cluster Auto Volume Attachment (CAVA) volume encrypted with customer managed KMS keys.

# Configuring custom Kubernetes labels and taints in Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints"></a>

Amazon SageMaker HyperPod clusters with Amazon Elastic Kubernetes Service (Amazon EKS) orchestrator support custom Kubernetes labels and taints for nodes within instance groups. Labels and taints are fundamental scheduling and organization mechanisms in Kubernetes that give you fine-grained control over pod placement and resource utilization.

Labels are key-value pairs that can be attached to Kubernetes objects, allowing you to organize and select resources based on attributes. Taints, working in conjunction with tolerations, are node-specific properties that influence pod scheduling by repelling pods that don't have matching tolerations. Together, these mechanisms enable you to isolate workloads, assign them according to hardware specifications, and ensure optimal resource utilization.

## Common use cases
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-use-cases"></a>

The following are common scenarios where custom labels and taints are beneficial:
+ **Preventing system pods on expensive instances** - Apply taints to GPU instances to prevent system pods and other non-critical workloads from consuming expensive compute resources
+ **Integration with existing tooling** - Apply labels that match your organization's established infrastructure patterns and node affinity configurations

## Configuring labels and taints
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-configure"></a>

You can configure custom Kubernetes labels and taints at the instance group level using the `KubernetesConfig` parameter in your cluster configuration. Labels and taints are applied to all nodes in the instance group and persist throughout the cluster's lifecycle.

The `KubernetesConfig` parameter is declarative, meaning you specify the complete desired state of labels and taints for an instance group. SageMaker HyperPod then reconciles the actual state of the nodes to match this desired state.
+ **Adding labels or taints** - Include the new labels or taints in the `KubernetesConfig` along with any existing ones you want to keep
+ **Updating labels or taints** - Modify the values in the `KubernetesConfig` for the labels or taints you want to change, and include all others you want to keep
+ **Removing labels or taints** - Omit the labels or taints you want to remove from the `KubernetesConfig`, keeping only those you want to retain
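These declarative semantics amount to a simple diff: keys present only in the desired map are added, keys with changed values are updated, and keys present only in the current map are removed. The following is an illustrative model of that behavior, not HyperPod's reconciliation code:

```python
def diff_labels(current: dict, desired: dict) -> dict:
    """Model declarative reconciliation: compute add/update/remove sets."""
    return {
        "add": {k: v for k, v in desired.items() if k not in current},
        "update": {k: v for k, v in desired.items() if k in current and current[k] != v},
        "remove": [k for k in current if k not in desired],
    }

current = {"env": "prod", "team": "ml-training", "gpu-type": "a100"}
desired = {"env": "prod", "team": "ml-platform", "cost-center": "ml-ops"}
print(diff_labels(current, desired))
# {'add': {'cost-center': 'ml-ops'}, 'update': {'team': 'ml-platform'}, 'remove': ['gpu-type']}
```

Note that `gpu-type` is removed simply because it was omitted from the desired configuration, which is why updates must always include every label and taint you want to keep.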

### Creating a cluster with labels and taints
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-create"></a>

When creating a new SageMaker HyperPod cluster, include the `KubernetesConfig` parameter in your instance group configuration. The following example shows how to create a cluster with custom labels and taints:

```
{
    "ClusterName": "my-cluster",
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p4d.24xlarge",
        "InstanceCount": 4,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://my-bucket/lifecycle-config.sh",
            "OnCreate": "on-create.sh"
        },
        "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
        "ThreadsPerCore": 1,
        "KubernetesConfig": { 
            "Labels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100"
            },
            "Taints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            },
            {
                "key": "dedicated",
                "value": "ml-workloads",
                "effect": "NoExecute"
            }]
        }
    }],
    "VpcConfig": {
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0", "subnet-0123456789abcdef1"]
    },
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "arn:aws:eks:us-west-2:123456789012:cluster/my-eks-cluster"
        }
    }
}
```

In this example:
+ **Labels** - Three custom labels are applied: `env=prod`, `team=ml-training`, and `gpu-type=a100`
+ **Taints** - Two taints are configured to prevent unwanted pod scheduling

### Updating labels and taints on an existing cluster
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-update"></a>

You can modify labels and taints on an existing cluster using the `UpdateCluster` API. The following example shows how to update the `KubernetesConfig` for an instance group:

```
{
    "ClusterName": "my-cluster",
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "KubernetesConfig": { 
            "Labels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100",
                "cost-center": "ml-ops"
            },
            "Taints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }]
        }
    }]
}
```

When you update labels and taints, SageMaker HyperPod applies the changes to all nodes in the instance group. The service manages the transition from current to desired state, which you can monitor using the `DescribeCluster` API.

## Monitoring label and taint application
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-monitor"></a>

SageMaker HyperPod provides APIs to monitor the status of labels and taints as they are applied to your cluster nodes.

### Checking cluster-level status
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-describe-cluster"></a>

Use the `DescribeCluster` API to view the current and desired states of labels and taints at the instance group level. The following example shows the response structure:

```
{
    "ClusterName": "my-cluster",
    "ClusterStatus": "InService",
    "InstanceGroups": [{
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p4d.24xlarge",
        "CurrentInstanceCount": 4,
        "TargetInstanceCount": 4,
        "KubernetesConfig": {
            "CurrentLabels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100"
            },
            "DesiredLabels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100"
            },
            "CurrentTaints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }],
            "DesiredTaints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }]
        }
    }]
}
```

When the `CurrentLabels` match `DesiredLabels` and `CurrentTaints` match `DesiredTaints`, all nodes in the instance group have the specified configuration applied. If they differ, the cluster is still in the process of applying the changes.
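That convergence check can be scripted over a `DescribeCluster` response. The sketch below assumes only the response shape shown in the example above; the sample `group` dictionary is a trimmed-down stand-in:

```python
def labels_and_taints_converged(instance_group: dict) -> bool:
    """True when an instance group's current labels/taints match the desired state."""
    k8s = instance_group.get("KubernetesConfig", {})
    return (
        k8s.get("CurrentLabels") == k8s.get("DesiredLabels")
        and k8s.get("CurrentTaints") == k8s.get("DesiredTaints")
    )

# Trimmed stand-in for one entry of the DescribeCluster "InstanceGroups" list
group = {
    "InstanceGroupName": "worker-group-1",
    "KubernetesConfig": {
        "CurrentLabels": {"env": "prod"},
        "DesiredLabels": {"env": "prod"},
        "CurrentTaints": [{"key": "gpu", "value": "true", "effect": "NoSchedule"}],
        "DesiredTaints": [{"key": "gpu", "value": "true", "effect": "NoSchedule"}],
    },
}
print(labels_and_taints_converged(group))
# True
```

Polling this predicate after an `UpdateCluster` call is a straightforward way to wait for the configuration to settle before scheduling workloads that depend on it.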

### Checking individual node status
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-describe-node"></a>

For node-level details, use the `DescribeClusterNode` API to check the label and taint configuration of individual nodes. The following example shows the response structure:

```
{
    "NodeDetails": { 
        "InstanceId": "i-0123456789abcdef0",
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p4d.24xlarge",
        "InstanceStatus": {
            "Status": "Running",
            "Message": "Node is healthy"
        },
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://my-bucket/lifecycle-config.sh",
            "OnCreate": "on-create.sh"
        },
        "LaunchTime": 1699564800.0,
        "KubernetesConfig": {
            "CurrentLabels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100"
            },
            "DesiredLabels": {
                "env": "prod",
                "team": "ml-training",
                "gpu-type": "a100"
            },
            "CurrentTaints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }],
            "DesiredTaints": [{
                "key": "gpu",
                "value": "true",
                "effect": "NoSchedule"
            }]
        }
    }
}
```

Node-level monitoring is useful for troubleshooting when labels or taints are not applying correctly to specific nodes, or when you need to verify the configuration of a particular instance.

## Reserved prefixes
<a name="sagemaker-hyperpod-eks-custom-labels-and-taints-reserved-prefixes"></a>

Certain prefixes are reserved for system use and should not be used for custom labels or taints. The following prefixes are reserved:
+ `kubernetes.io/` - Reserved for Kubernetes core components
+ `k8s.io/` - Reserved for Kubernetes core components
+ `sagemaker.amazonaws.com/` - Reserved for SageMaker HyperPod
+ `eks.amazonaws.com/` - Reserved for Amazon EKS
+ `k8s.aws/` - Reserved for Amazon EKS
+ `karpenter.sh/` - Reserved for Karpenter autoscaling

Labels and taints with these prefixes are managed by system components and should not be overwritten with custom values.
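A quick client-side check against these reserved prefixes before calling `CreateCluster` or `UpdateCluster` might look like the following sketch (the prefix tuple mirrors the list above):

```python
# System-reserved label/taint key prefixes listed above
RESERVED_PREFIXES = (
    "kubernetes.io/",
    "k8s.io/",
    "sagemaker.amazonaws.com/",
    "eks.amazonaws.com/",
    "k8s.aws/",
    "karpenter.sh/",
)

def reserved_keys(labels: dict) -> list:
    """Return label/taint keys that use a system-reserved prefix."""
    return [k for k in labels if k.startswith(RESERVED_PREFIXES)]

print(reserved_keys({"env": "prod", "kubernetes.io/arch": "amd64"}))
# ['kubernetes.io/arch']
```

Rejecting such keys before submitting the request avoids clashing with labels and taints managed by system components.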

# Checkpointless training in Amazon SageMaker HyperPod
<a name="sagemaker-eks-checkpointless"></a>

Checkpointless training on Amazon SageMaker HyperPod enables faster recovery from training infrastructure faults. The following documentation helps you get started with checkpointless training and fine-tuning for NeMo-supported models.

Checkpointless training has the following prerequisites:
+ [Getting started with Amazon EKS support in SageMaker HyperPod](sagemaker-hyperpod-eks-prerequisites.md)
+ [Installing the training operator](sagemaker-eks-operator-install.md). You must install v1.2.0 or above.

Checkpointless training on SageMaker HyperPod is built on top of the [NVIDIA NeMo framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/core/exp_manager.html#experiment-manager). You can run checkpointless training with pre-created SageMaker HyperPod recipes. If you're familiar with NeMo, the process of using the checkpointless training recipes is similar. With minor changes, you can start training a model using checkpointless training features that enable you to recover quickly from training faults.

The following HyperPod recipes are pre-configured with checkpointless training optimizations. You can specify your data paths as part of the recipe and use the associated launch script to run training (see the quick start guide below):


| Model | Method | Size | Nodes | Instance | Accelerator | Recipe | Script | Tutorial | 
| --- | --- | --- | --- | --- | --- | --- | --- | --- | 
| GPT OSS | Full finetune example | 120b | 16 | p5.48xlarge | GPU H100 | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_full_fine_tuning.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_full_fine_tuning.sh) | [link](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html) | 
| GPT OSS | LoRA example | 120b | 2 | p5.48xlarge | GPU H100 | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_lora.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_lora.sh) | [link](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-peft.html) | 
| Llama3 | Pretrain example | 70b | 16 | p5.48xlarge | GPU H100 | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/training/llama/checkpointless_llama3_70b_pretrain.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/llama/run_checkpointless_llama3_70b_pretrain.sh) | [link](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-pretraining-llama3.html) | 
| Llama3 | LoRA example | 70b | 2 | p5.48xlarge | GPU H100 | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/llama/checkpointless_llama3_70b_lora.yaml) | [link](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/launcher_scripts/llama/run_checkpointless_llama3_70b_lora.sh) | [link](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-peft-llama.html) | 

The following quick-start guide provides tutorials for using checkpointless training recipes:

**Getting started examples**
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Full Finetuning GPT OSS 120b](sagemaker-eks-checkpointless-recipes-finetune.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA GPT OSS 120b](sagemaker-eks-checkpointless-recipes-peft.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining Llama 3 70b](sagemaker-eks-checkpointless-recipes-pretraining-llama3.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA Llama 3 70b](sagemaker-eks-checkpointless-recipes-peft-llama.md)

If you’d like to pre-train or fine-tune custom models, see [Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining or Finetuning Custom Models](sagemaker-eks-checkpointless-recipes-custom.md).

To learn more about incorporating specific checkpointless training components, see [HyperPod checkpointless training features](sagemaker-eks-checkpointless-features.md).

# Amazon SageMaker HyperPod checkpointless training tutorials
<a name="sagemaker-eks-checkpointless-recipes"></a>

[HyperPod checkpointless training recipes](https://github.com/aws/sagemaker-hyperpod-checkpointless-training) are predefined job configurations with checkpointless training features enabled. Using these recipes makes it easier to get started with checkpointless training on HyperPod.

**Topics**
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Full Finetuning GPT OSS 120b](sagemaker-eks-checkpointless-recipes-finetune.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA GPT OSS 120b](sagemaker-eks-checkpointless-recipes-peft.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining Llama 3 70b](sagemaker-eks-checkpointless-recipes-pretraining-llama3.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA Llama 3 70b](sagemaker-eks-checkpointless-recipes-peft-llama.md)
+ [Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining or Finetuning Custom Models](sagemaker-eks-checkpointless-recipes-custom.md)

# Tutorials - Amazon SageMaker HyperPod Checkpointless Full Finetuning GPT OSS 120b
<a name="sagemaker-eks-checkpointless-recipes-finetune"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-finetune-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2.0 or later)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ Picked a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [source](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection)
+ [Downloaded the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and converted them to a [NeMo-supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face)
+ Set up your environment, as described in the following section
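As a quick sanity check before staging data in shared storage, you can gate files on the supported formats by extension. A minimal sketch; the `.json.gz` extension for JSONGZ is an assumption, and your data may use different naming:

```shell
# Print yes if a file name matches one of the supported data formats, else no.
is_supported_format() {
  case "$1" in
    *.json|*.json.gz|*.arrow) echo yes ;;
    *)                        echo no ;;
  esac
}

is_supported_format /data/datasets/train.json   # yes
```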

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-finetune-recipes-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure your Python version is 3.10 or later and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:

   1. Method 1: SageMaker HyperPod recipes method:

      ```
      # install SageMaker HyperPod Recipes.
      git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
      cd sagemaker-hyperpod-recipes
      pip3 install -r requirements.txt
      ```

   1. Method 2: kubectl with pre-defined job yaml method

      ```
      # install SageMaker HyperPod checkpointless training.
      git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
      cd sagemaker-hyperpod-checkpointless-training
      ```
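Step 1's Python requirement can be verified up front. The following sketch checks a `MAJOR.MINOR` version string against the documented range before you create the virtual environment:

```shell
# Print yes if a version like "3.12" or "3.12.1" is >= 3.10 and < 3.14, else no.
supported_python() {
  major=${1%%.*}
  minor=${1#*.}; minor=${minor%%.*}
  if [ "$major" -eq 3 ] && [ "$minor" -ge 10 ] && [ "$minor" -lt 14 ]; then
    echo yes
  else
    echo no
  fi
}

# Check the interpreter that will create the venv.
supported_python "$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
```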

You can now launch the checkpointless training recipe using either the NeMo-style launcher or kubectl.

## Launch the training job with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-finetune-launcher"></a>

You can use the Amazon SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.

1. Update `launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_full_fine_tuning.sh`

   Set `CONTAINER` to a Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   MODEL_NAME_OR_PATH="${MODEL_NAME_OR_PATH}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_full_fine_tuning \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.resume.restore_config.path="${MODEL_NAME_OR_PATH}" \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1
   ```

1. Launch the training job

   ```
   bash launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_full_fine_tuning.sh
   ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods

NAME                             READY   STATUS             RESTARTS        AGE
gpt-oss-120b-worker-0             0/1    Running               0            36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name of pod>
```

After the job `STATUS` changes to `Running`, you can examine the logs by using the following command.

```
kubectl logs <name of pod>
```

The `STATUS` turns to `Completed` when you run `kubectl get pods` again.
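The check-describe-logs flow above can be captured as a small helper that maps a pod's `STATUS` to the next command to run. This is only a sketch; the pod name is the example worker from the output above, and you would substitute your own:

```shell
# Map a kubectl STATUS value to the next troubleshooting command (sketch).
next_step() {
  case "$1" in
    Pending|ContainerCreating) echo "kubectl describe pod gpt-oss-120b-worker-0" ;;
    Running)                   echo "kubectl logs gpt-oss-120b-worker-0" ;;
    Completed|Succeeded)       echo "done" ;;
    *)                         echo "kubectl get pods" ;;
  esac
}

next_step Running   # → kubectl logs gpt-oss-120b-worker-0
```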

## Launch the training job with kubectl and a pre-defined yaml
<a name="sagemaker-eks-checkpointless-recipes-finetune-kubectl"></a>

Another option is to launch the training through kubectl with a pre-defined job yaml.

1. Update `examples/gpt_oss/launch/full_finetune_gpt_oss_120b_checkpointless_p5.yaml`:
   + `image`: A Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights in NeMo format that you downloaded in the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html#sagemaker-eks-checkpointless-recipes-finetune-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset stored in the shared storage.

1. Submit the job using kubectl with `full_finetune_gpt_oss_120b_checkpointless_p5.yaml`:

   ```
   kubectl apply -f examples/gpt_oss/launch/full_finetune_gpt_oss_120b_checkpointless_p5.yaml
   ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods

NAME                             READY   STATUS             RESTARTS        AGE
gpt-oss-120b-worker-0             0/1    Running               0            36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name of pod>
```

After the job `STATUS` changes to `Running`, you can examine the logs by using the following command.

```
kubectl logs <name of pod>
```

The `STATUS` turns to `Completed` when you run `kubectl get pods` again.

# Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA GPT OSS 120b
<a name="sagemaker-eks-checkpointless-recipes-peft"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-peft-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2.0 or later)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ Picked a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [source](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection)
+ [Downloaded the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and converted them to a [NeMo-supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face)
+ Set up your environment, as described in the following section

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-peft-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure your Python version is 3.10 or later and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:
   + SageMaker HyperPod recipes method:

     ```
     # install SageMaker HyperPod Recipes.
     git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
     cd sagemaker-hyperpod-recipes
     pip3 install -r requirements.txt
     ```
   + kubectl with pre-defined job yaml method

     ```
     # install SageMaker HyperPod checkpointless training.
     git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
     cd sagemaker-hyperpod-checkpointless-training
     ```
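The clone commands above use SSH URLs, which require a GitHub SSH key. If your environment doesn't have one configured, the equivalent HTTPS URL works with `git clone` as well. A small helper to convert one form to the other:

```shell
# Rewrite a git@github.com SSH clone URL into its HTTPS form.
ssh_to_https() {
  printf '%s\n' "$1" | sed 's|^git@github.com:|https://github.com/|'
}

ssh_to_https git@github.com:aws/sagemaker-hyperpod-recipes.git
# → https://github.com/aws/sagemaker-hyperpod-recipes.git
```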

You can now launch the checkpointless training recipe using either the NeMo-style launcher or kubectl.

## Launch the training job with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-peft-recipes-launcher"></a>

You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.

1. Update `launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_lora.sh`

   Set `CONTAINER` to a Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   MODEL_NAME_OR_PATH="${MODEL_NAME_OR_PATH}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=fine-tuning/gpt_oss/checkpointless_gpt_oss_120b_lora \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.resume.restore_config.path="${MODEL_NAME_OR_PATH}" \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1
   ```

1. Launch the training job

   ```
   bash launcher_scripts/gpt_oss/run_checkpointless_gpt_oss_120b_lora.sh
   ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods

NAME                             READY   STATUS             RESTARTS        AGE
gpt-oss-120b-worker-0             0/1    Running               0            36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name of pod>
```

After the job `STATUS` changes to `Running`, you can examine the logs by using the following command.

```
kubectl logs <name of pod>
```

The `STATUS` turns to `Completed` when you run `kubectl get pods` again.

## Launch the training job with kubectl and a pre-defined yaml
<a name="sagemaker-eks-checkpointless-recipes-peft-kubectl"></a>

Another option is to launch the training through kubectl with a pre-defined job yaml.

1. Update `examples/gpt_oss/launch/peft_gpt_oss_120b_checkpointless_p5.yaml`:
   + `image`: A Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights in NeMo format that you downloaded in the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-peft.html#sagemaker-eks-checkpointless-recipes-peft-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset stored in the shared storage.

1. Submit the job using kubectl with `peft_gpt_oss_120b_checkpointless_p5.yaml`:

   ```
   kubectl apply -f examples/gpt_oss/launch/peft_gpt_oss_120b_checkpointless_p5.yaml
   ```

After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

```
kubectl get pods

NAME                                             READY   STATUS             RESTARTS        AGE
gpt-120b-lora-checkpointless-worker-0             0/1    Running               0            36s
```

If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

```
kubectl describe pod <name of pod>
```

After the job `STATUS` changes to `Running`, you can examine the logs by using the following command.

```
kubectl logs <name of pod>
```

The `STATUS` turns to `Completed` when you run `kubectl get pods` again.

# Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining Llama 3 70b
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2.0 or later)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ Picked a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [source](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection)
+ [Downloaded the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and converted them to a [NeMo-supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face)
+ Set up your environment, as described in the following section

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure your Python version is 3.10 or later and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:

   1. Method 1: SageMaker HyperPod recipes method:

      ```
      # install SageMaker HyperPod Recipes.
      git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
      cd sagemaker-hyperpod-recipes
      pip3 install -r requirements.txt
      ```

   1. Method 2: kubectl with pre-defined job yaml method

      ```
      # install SageMaker HyperPod checkpointless training.
      git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
      cd sagemaker-hyperpod-checkpointless-training
      ```

You can now launch the checkpointless training recipe using either the NeMo-style launcher or kubectl.

## Method 1: Launch the training job with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-recipes-launcher"></a>

You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.

1. Update `launcher_scripts/llama/run_checkpointless_llama3_70b_pretrain.sh`

   Set `CONTAINER` to a Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=training/llama/checkpointless_llama3_70b_pretrain \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.data.global_batch_size=16 \
       recipes.data.micro_batch_size=4 \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1
   ```

1. Launch the training job

   ```
   bash launcher_scripts/llama/run_checkpointless_llama3_70b_pretrain.sh
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods

   NAME                             READY   STATUS             RESTARTS        AGE
   llama-3-70b-worker-0             0/1    Running               0            36s
   ```

1. If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job `STATUS` changes to `Running`, you can examine the logs by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   The `STATUS` turns to `Completed` when you run `kubectl get pods` again.
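The `recipes.data.global_batch_size=16` and `recipes.data.micro_batch_size=4` overrides in the launch script above must combine cleanly with the job's data-parallel size: under standard NeMo/Megatron semantics, the global batch size equals micro batch size × data-parallel ranks × gradient-accumulation steps. A hedged sketch of that sanity check; the data-parallel size of 4 here is only an example:

```shell
# Print the implied gradient-accumulation steps, or "invalid" if the
# global batch size doesn't divide evenly.  args: global micro dp_ranks
grad_accum_steps() {
  if [ $(( $1 % ($2 * $3) )) -eq 0 ]; then
    echo $(( $1 / ($2 * $3) ))
  else
    echo invalid
  fi
}

grad_accum_steps 16 4 4   # → 1
```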

## Method 2: Launch the training job with kubectl and a pre-defined yaml
<a name="sagemaker-eks-checkpointless-recipes-pretraining-llama3-kubectl"></a>

Another option is to launch the training through kubectl with a pre-defined job yaml.

1. Update the `examples/llama3/launch/pretrain_llama3_70b_checkpointless_p5.yaml`
   + `image`: A Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights in NeMo format that you downloaded in the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html#sagemaker-eks-checkpointless-recipes-finetune-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset stored in the shared storage.

1. Submit the job using kubectl with `pretrain_llama3_70b_checkpointless_p5.yaml`

   ```
   kubectl apply -f examples/llama3/launch/pretrain_llama3_70b_checkpointless_p5.yaml
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods

   NAME                                             READY   STATUS             RESTARTS        AGE
   llama3-pretrain-checkpointless-worker-0             0/1    Running               0            36s
   ```

1. If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job `STATUS` changes to `Running`, you can examine the logs by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   The `STATUS` turns to `Completed` when you run `kubectl get pods` again.

# Tutorials - Amazon SageMaker HyperPod Checkpointless PEFT-LoRA Llama 3 70b
<a name="sagemaker-eks-checkpointless-recipes-peft-llama"></a>

The following sequence of steps is required to run checkpointless training recipes on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up the HyperPod training operator (v1.2.0 or later)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location. It can be an Amazon FSx file system or an NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ Picked a supported checkpointless training recipe for Llama 70B or GPT-OSS 120B from the [source](https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection)
+ [Downloaded the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and converted them to a [NeMo-supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face)
+ Set up your environment, as described in the following section

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure your Python version is 3.10 or later and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. [Install Helm](https://helm.sh/docs/intro/install/)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies using one of the following methods:

   1. Method 1: SageMaker HyperPod recipes method:

      ```
      # install SageMaker HyperPod Recipes.
      git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
      cd sagemaker-hyperpod-recipes
      pip3 install -r requirements.txt
      ```

   1. Method 2: kubectl with pre-defined job yaml method

      ```
      # install SageMaker HyperPod checkpointless training.
      git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
      cd sagemaker-hyperpod-checkpointless-training
      ```

You can now launch the checkpointless training recipe using either the NeMo-style launcher or kubectl.

## Method 1: Launch the training job with the recipes launcher
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-recipes-launcher"></a>

You can use the SageMaker HyperPod recipes to submit your training job. Using the recipes involves updating `k8s.yaml` and `config.yaml`, and then running the launch script.

1. Update `launcher_scripts/llama/run_checkpointless_llama3_70b_lora.sh`

   Set `CONTAINER` to a Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).

   ```
   #!/bin/bash
   
   SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}
   TRAIN_DIR="${TRAIN_DIR}"
   VAL_DIR="${VAL_DIR}"
   EXP_DIR="${EXP_DIR}"
   LOG_DIR="${LOG_DIR}"
   CONTAINER_MOUNT="/data"
   CONTAINER="${CONTAINER}"
   MODEL_NAME_OR_PATH="${MODEL_NAME_OR_PATH}"
   
   HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
       recipes=fine-tuning/llama/checkpointless_llama3_70b_lora \
       recipes.dataset.dataset_path="${TRAIN_DIR}" \
       recipes.exp_manager.exp_dir="${EXP_DIR}" \
       recipes.log_dir="${LOG_DIR}" \
       recipes.resume.restore_config.path="${MODEL_NAME_OR_PATH}" \
       base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
       git.use_default=false \
       cluster=k8s \
       cluster_type=k8s \
       container="${CONTAINER}" \
       +cluster.hostNetwork=true \
       +cluster.persistent_volume_claims.0.claimName=fsx-claim \
       +cluster.persistent_volume_claims.0.mountPath="${CONTAINER_MOUNT}" \
       +recipes.dataset.val_dataset_path="${VAL_DIR}" \
       ++recipes.callbacks.3.test_fault_config.fault_prob_between_lock=1
   ```

1. Launch the training job

   ```
   bash launcher_scripts/llama/run_checkpointless_llama3_70b_lora.sh
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods

   NAME                             READY   STATUS             RESTARTS        AGE
   llama-3-70b-worker-0             0/1    Running               0            36s
   ```

1. If the `STATUS` is `Pending` or `ContainerCreating`, run the following command to get more details.

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job `STATUS` changes to `Running`, you can examine the logs by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   The `STATUS` turns to `Completed` when you run `kubectl get pods` again.

## Method 2: Launch the training job with kubectl and a pre-defined yaml
<a name="sagemaker-eks-checkpointless-recipes-peft-llama-kubectl"></a>

Another option is to launch the training through kubectl with a pre-defined job yaml.

1. Update the `examples/llama3/launch/peft_llama3_70b_checkpointless_p5.yaml`
   + `image`: A Deep Learning container. To find the most recent release of the checkpointless training container, see [checkpointless training release notes](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-release-notes.html).
   + `resume.restore_config.path=<path_to_pretrained_weights>`: The path to the pretrained model weights in NeMo format that you downloaded in the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless-recipes-finetune.html#sagemaker-eks-checkpointless-recipes-finetune-prereqs) step.
   + `dataset.dataset_path=<path_to_dataset>`: The path to the dataset stored in the shared storage.

1. Submit the job using kubectl with `peft_llama3_70b_checkpointless_p5.yaml`

   ```
   kubectl apply -f examples/llama3/launch/peft_llama3_70b_checkpointless_p5.yaml
   ```

1. After you've submitted the training job, you can use the following command to verify that it was submitted successfully.

   ```
   kubectl get pods
   
   NAME                                      READY   STATUS    RESTARTS   AGE
   llama3-70b-lora-checkpointless-worker-0   0/1     Running   0          36s
   ```

1. If the STATUS is Pending or ContainerCreating, run the following command to get more details.

   ```
   kubectl describe pod <name of pod>
   ```

1. After the job STATUS changes to Running, you can examine the log by using the following command.

   ```
   kubectl logs <name of pod>
   ```

   When the job finishes, the STATUS changes to Completed in the `kubectl get pods` output.

# Tutorials - Amazon SageMaker HyperPod Checkpointless Pretraining or Finetuning Custom Models
<a name="sagemaker-eks-checkpointless-recipes-custom"></a>

The following sequence of steps is required to run checkpointless training with your custom model on HyperPod.

## Prerequisites
<a name="sagemaker-eks-checkpointless-recipes-custom-prereqs"></a>

Before you start setting up your environment, make sure you have:
+ [ Enabled Amazon EKS support in Amazon SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Set up HyperPod training operator (v1.2+)](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html)
+ A shared storage location. It can be an Amazon FSx file system or NFS system that's accessible from the cluster nodes.
+ Data in one of the following formats:
  + JSON
  + JSONGZ (Compressed JSON)
  + ARROW
+ [Download the Hugging Face model weights](https://huggingface.co/docs/hub/models-downloading) and convert them to a [NeMo-supported format](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/hf-integration.html#importing-from-hugging-face).
+ Set up your environment

## Kubernetes environment setup
<a name="sagemaker-eks-checkpointless-recipes-custom-kubernetes"></a>

To set up your Kubernetes environment, do the following:

1. Set up the virtual environment. Make sure you're using a Python version greater than or equal to 3.10 and lower than 3.14.

   ```
   python3 -m venv ${PWD}/venv
   source venv/bin/activate
   ```

1. [ Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html)

1. Connect to your Kubernetes cluster

   ```
   aws eks update-kubeconfig --region "${CLUSTER_REGION}" --name "${CLUSTER_NAME}"
   ```

1. Install dependencies

   ```
   # install SageMaker HyperPod checkpointless training.
   git clone git@github.com:aws/sagemaker-hyperpod-checkpointless-training.git
   cd sagemaker-hyperpod-checkpointless-training
   ```
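The Python version requirement above can be expressed as a quick sanity check (illustrative only; the `version_ok` helper is ours, not part of any SageMaker tooling):

```python
import sys

def version_ok(major, minor):
    """Return True if Python (major, minor) is within the supported range [3.10, 3.14)."""
    return (3, 10) <= (major, minor) < (3, 14)

# Warn early if the current interpreter is outside the supported range.
if not version_ok(sys.version_info.major, sys.version_info.minor):
    print(f"Unsupported Python version: {sys.version}")
```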

## Checkpointless training modification instructions
<a name="sagemaker-eks-checkpointless-recipes-custom-modification-instructions"></a>

To incrementally adopt checkpointless training for custom models, follow the integration guide (here we use Llama 3 70B pretraining as an example), which involves:
+ Fast communicator creation
+ Memory-mapped dataloader (MMAP)
+ In-process and checkpointless recovery

### Component 1: Fast communicator creation
<a name="sagemaker-eks-checkpointless-recipes-custom-component1"></a>

This optimization reduces the time needed to establish connections between workers. No code changes are needed; it only requires setting environment variables.

```
# Enable rootless features
export HPCT_USE_ROOTLESS=1
sysctl -w net.ipv4.ip_local_port_range="20000 65535"

hyperpodrun --nproc_per_node=8 \
            ...
            --inprocess-restart \
            ...
```

The full change can be found in the [Llama 3 70B pretrain launch job config](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/llama3/launch/pretrain_llama3_70b_checkpointless_p5.yaml).

### Component 2: Memory-mapped dataloader (MMAP)
<a name="sagemaker-eks-checkpointless-recipes-custom-component2"></a>

MMAP uses caches to store prefetched data samples and enables training to start immediately, without waiting for data preprocessing. Adopting it requires minimal code changes: wrap the existing dataloader.

```
data_module = MMAPDataModule(
  data_module=base_data_module,
  mmap_config=CacheResumeMMAPConfig(cache_dir=…)
)
```

### Components 3 and 4: In-process and checkpointless recovery
<a name="sagemaker-eks-checkpointless-recipes-custom-components3-4"></a>

This enables failure recovery without restarting training processes or loading from checkpoints. Additional code changes are needed: update the strategy and training config, and wrap the existing main function.

```
@HPWrapper(
  health_check=CudaHealthCheck(),
  hp_api_factory=HPAgentK8sAPIFactory(),
  abort_timeout=60.0,
...)
def run_main(
  cfg,
  caller: Optional[HPCallWrapper] = None):
...


CheckpointlessMegatronStrategy(
  **self.cfg.strategy,
  ddp=self.ddp,
)
```

The full change can be found in the [Llama 3 70B pretrain entry script](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/llama3/llama3_70b_pretrain_checkpointless.py) and the corresponding training config change can be found in the [Llama 3 70B training config](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/llama3/config/llama3_70b_peft_checkpointless.yaml).

### Launch training
<a name="sagemaker-eks-checkpointless-recipes-custom-launch"></a>

You can now launch the checkpointless training using kubectl.

```
kubectl apply -f your_job_config.yaml
```

# HyperPod checkpointless training features
<a name="sagemaker-eks-checkpointless-features"></a>

See the following pages to learn about the training features in checkpointless training.

**Topics**
+ [Amazon SageMaker HyperPod checkpointless training repositories](#sagemaker-eks-checkpointless-repositories)
+ [Collective communication initialization improvements](sagemaker-eks-checkpointless-features-communication.md)
+ [Memory mapped dataloader](sagemaker-eks-checkpointless-features-mmap.md)
+ [In-process recovery and checkpointless training](sagemaker-eks-checkpointless-in-process-recovery.md)

## Amazon SageMaker HyperPod checkpointless training repositories
<a name="sagemaker-eks-checkpointless-repositories"></a>

[HyperPod checkpointless training](https://github.com/aws/sagemaker-hyperpod-checkpointless-training) accelerates recovery from cluster faults in large-scale distributed training environments through framework-level optimizations. These optimizations are delivered via a base container image that includes enhanced NCCL initialization improvements, data loading optimizations, and in-process and checkpointless recovery components. The HyperPod checkpointless training package is built on this foundation.

Checkpointless training is enabled via three optimization tracks that run in concert:
+ **Communication initialization improvements (NCCL and Gloo)** - Eliminate communication bottlenecks by decentralizing rank peer and ring information (red box below).
+ **Data loading optimizations** - Reduce the time required to serve the first batch of data during restart operations (orange boxes below).
+ **Program restart overhead reduction** - Minimize restart costs and enable checkpointless replenishment through process recovery on healthy nodes (blue and green boxes below).

![\[alt text not found\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-optimization-tracks.png)


# Collective communication initialization improvements
<a name="sagemaker-eks-checkpointless-features-communication"></a>

NCCL and Gloo are fundamental communication libraries that enable collective operations (such as all-reduce and broadcast) across distributed training processes. However, traditional NCCL and Gloo initialization can create bottlenecks during fault recovery.

The standard recovery process requires all processes to connect to a centralized TCPStore and coordinate through a root process, introducing an expensive overhead that becomes particularly problematic during restarts. This centralized design creates three critical issues: coordination overhead from mandatory TCPStore connections, recovery delays as each restart must repeat the full initialization sequence, and a single point of failure in the root process itself. The result is an expensive, centralized coordination step every time training initializes or restarts.

HyperPod checkpointless training eliminates these coordination bottlenecks, enabling faster recovery from faults by making initialization "rootless" and "TCPStoreless."

## Rootless configurations
<a name="sagemaker-eks-checkpointless-features-communication-rootless-config"></a>

To enable rootless initialization, set the following environment variables.

```
export HPCT_USE_ROOTLESS=1
sysctl -w net.ipv4.ip_local_port_range="20000 65535"
```

`HPCT_USE_ROOTLESS`: 0 or 1. Use to turn rootless on or off.

`sysctl -w net.ipv4.ip_local_port_range="20000 65535"`: Sets the system port range.

See [the example](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/llama3/launch/pretrain_llama3_70b_checkpointless_p5.yaml#L111-L113) for enabling Rootless.

## Rootless
<a name="sagemaker-eks-checkpointless-features-communication-rootless"></a>

HyperPod checkpointless training offers novel initialization methods, Rootless and TCPStoreless, for NCCL and Gloo process groups.

The implementation of these optimizations involves modifying NCCL, Gloo, and PyTorch:
+ Extending third-party library APIs to enable Rootless and Storeless NCCL and Gloo optimizations while maintaining backward compatibility
+ Updating process group backends to conditionally use optimized paths and handle in-process recovery issues
+ Bypassing expensive TCPStore creation at the PyTorch distributed layer while maintaining symmetric address patterns through global group counters

The following graph shows the architecture of the distributed training libraries and the changes made in checkpointless training.

![\[The following graph shows the architecture of the distributed training libraries and the changes made in checkpointless training.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-training-libraries.png)


### NCCL and Gloo
<a name="sagemaker-eks-checkpointless-features-communication-nccl-gloo"></a>

These are independent packages that perform the core functionality of collective communications. They provide key APIs, such as `ncclCommInitRank`, to initialize communication networks, manage the underlying resources, and perform collective communications. With the custom changes made in NCCL and Gloo, the Rootless and Storeless optimizations (for example, skipping the connection to the TCPStore) speed up initialization of the communication network. You can flexibly switch between the original code paths and the optimized code paths.

### PyTorch process group backend
<a name="sagemaker-eks-checkpointless-features-communication-pytorch"></a>

The process group backends, specifically ProcessGroupNCCL and ProcessGroupGloo, implement the ProcessGroup APIs by invoking the APIs of their corresponding underlying libraries. Since we extend the third-party libraries' APIs, we have to invoke them properly and make code path switches based on customers' configurations.

In addition to optimization code paths, we also change the process group backend to support in-process recovery.

# Memory mapped dataloader
<a name="sagemaker-eks-checkpointless-features-mmap"></a>

Another restart overhead stems from data loading: the training cluster remains idle while the dataloader initializes, downloads data from remote file systems, and processes it into batches.

To address this, we introduce the memory-mapped (MMAP) dataloader, which caches prefetched batches in persistent memory, ensuring they remain available even after a fault-induced restart. This approach eliminates dataloader setup time and enables training to resume immediately using cached batches, while the dataloader concurrently reinitializes and fetches subsequent data in the background. The data cache resides on each rank that requires training data and maintains two types of batches: recently consumed batches that have been used for training, and prefetched batches ready for immediate use.

![\[This image illustrates the MMAP Dataloader, caches, and consumed batches.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-mmap-dataloader.png)


The MMAP dataloader offers the following two features:
+ **Data Prefetching** - Proactively fetches and caches data generated by the dataloader
+ **Persistent Caching** - Stores both consumed and prefetched batches in a temporary filesystem that survives process restarts

Using the cache, the training job will benefit from:
+ **Reduced Memory Footprint** - Leverages memory-mapped I/O to maintain a single shared copy of data in host CPU memory, eliminating redundant copies across GPU processes (e.g., reduces from 8 copies to 1 on a p5 instance with 8 GPUs)
+ **Faster Recovery** - Reduces Mean Time to Restart (MTTR) by enabling training to resume immediately from cached batches, eliminating the wait for dataloader reinitialization and first-batch generation
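To build intuition for the single-shared-copy behavior, the following sketch uses Python's standard `mmap` module (illustrative only; `roundtrip_mmap_cache` is our name and is not part of the HyperPod MMAP dataloader API):

```python
import mmap
import os
import tempfile

def roundtrip_mmap_cache(payload: bytes) -> bytes:
    """Write a payload into a memory-mapped cache file and read it back.

    Illustrative only: this shows the OS-level mmap mechanism that a
    persistent, shared host-memory cache relies on, not the HyperPod
    MMAP dataloader API itself.
    """
    size = max(len(payload), 1)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "pdl_cache_demo.bin")
        with open(path, "wb") as f:
            f.write(b"\x00" * size)
        with open(path, "r+b") as f:
            # The mapping is backed by the page cache, so every process
            # mapping the same file shares a single copy in host memory.
            with mmap.mmap(f.fileno(), size) as mm:
                mm[: len(payload)] = payload
                return bytes(mm[: len(payload)])
```

Because all mappings of the same file resolve to the same physical pages, one cached batch can serve every GPU process on a node, which is the mechanism behind the 8-copies-to-1 reduction described above.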

## MMAP configurations
<a name="sagemaker-eks-checkpointless-features-communication-mmap-config"></a>

To use MMAP, pass your original data module into `MMAPDataModule`:

```
data_module=MMAPDataModule(
    data_module=MY_DATA_MODULE(...),
    mmap_config=CacheResumeMMAPConfig(
        cache_dir=self.cfg.mmap.cache_dir,
        checkpoint_frequency=self.cfg.mmap.checkpoint_frequency),
)
```

`CacheResumeMMAPConfig`: MMAP Dataloader parameters control cache directory location, size limits, and data fetching delegation. By default, only TP rank 0 per node fetches data from the source, while other ranks in the same data replication group read from the shared cache, eliminating redundant transfers.

`MMAPDataModule`: It wraps the original data module and returns the mmap dataloader for both train and validation.

See [the example](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/gpt_oss/gpt_oss_120b_full_finetune_checkpointless.py#L101-L109) for enabling MMAP.

## API reference
<a name="sagemaker-eks-checkpointless-mmap-reference"></a>

### CacheResumeMMAPConfig
<a name="sagemaker-eks-checkpointless-mmap-reference-cacheresume"></a>

```
class hyperpod_checkpointless_training.dataloader.config.CacheResumeMMAPConfig(
  cache_dir='/dev/shm/pdl_cache',
  prefetch_length=10,
  val_prefetch_length=10,
  lookback_length=2,
  checkpoint_frequency=None,
  model_parallel_group=None,
  enable_batch_encryption=False)
```

Configuration class for cache-resume memory-mapped (MMAP) dataloader functionality in HyperPod checkpointless training.

This configuration enables efficient data loading with caching and prefetching capabilities, allowing training to resume quickly after failures by maintaining cached data batches in memory-mapped files.

**Parameters**
+ **`cache_dir`** (str, optional) – Directory path for storing cached data batches. Default: "/dev/shm/pdl_cache"
+ **`prefetch_length`** (int, optional) – Number of batches to prefetch ahead during training. Default: 10
+ **`val_prefetch_length`** (int, optional) – Number of batches to prefetch ahead during validation. Default: 10
+ **`lookback_length`** (int, optional) – Number of previously used batches to keep in cache for potential reuse. Default: 2
+ **`checkpoint_frequency`** (int, optional) – Frequency of model checkpointing steps. Used for cache performance optimization. Default: None
+ **`model_parallel_group`** (object, optional) – Process group for model parallelism. If None, will be created automatically. Default: None
+ **`enable_batch_encryption`** (bool, optional) – Whether to enable encryption for cached batch data. Default: False

**Methods**

```
create(dataloader_init_callable,
       parallel_state_util,
       step,
       is_data_loading_rank,
       create_model_parallel_group_callable,
       name='Train',
       is_val=False,
       cached_len=0)
```

Creates and returns a configured MMAP dataloader instance.

**Parameters**
+ **`dataloader_init_callable`** (Callable) – Function to initialize the underlying dataloader
+ **`parallel_state_util`** (object) – Utility for managing parallel state across processes
+ **`step`** (int) – The data step to resume from during training
+ **`is_data_loading_rank`** (Callable) – Function that returns True if the current rank should load data
+ **`create_model_parallel_group_callable`** (Callable) – Function to create the model parallel process group
+ **`name`** (str, optional) – Name identifier for the dataloader. Default: "Train"
+ **`is_val`** (bool, optional) – Whether this is a validation dataloader. Default: False
+ **`cached_len`** (int, optional) – Length of cached data if resuming from an existing cache. Default: 0

Returns `CacheResumePrefetchedDataLoader` or `CacheResumeReadDataLoader` – Configured MMAP dataloader instance

Raises `ValueError` if the step parameter is `None`.

**Example**

```
from hyperpod_checkpointless_training.dataloader.config import CacheResumeMMAPConfig

# Create configuration
config = CacheResumeMMAPConfig(
    cache_dir="/tmp/training_cache",
    prefetch_length=20,
    checkpoint_frequency=100,
    enable_batch_encryption=False
)

# Create dataloader
dataloader = config.create(
    dataloader_init_callable=my_dataloader_init,
    parallel_state_util=parallel_util,
    step=current_step,
    is_data_loading_rank=lambda: rank == 0,
    create_model_parallel_group_callable=create_mp_group,
    name="TrainingData"
)
```

**Notes**
+ The cache directory should have sufficient space and fast I/O performance (e.g., /dev/shm for in-memory storage).
+ Setting `checkpoint_frequency` improves cache performance by aligning cache management with model checkpointing
+ For validation dataloaders (`is_val=True`), the step is reset to 0 and cold start is forced
+ Different dataloader implementations are used based on whether the current rank is responsible for data loading

### MMAPDataModule
<a name="sagemaker-eks-checkpointless-mmap-reference-mmapdatamodule"></a>

```
class hyperpod_checkpointless_training.dataloader.mmap_data_module.MMAPDataModule(  
    data_module,  
    mmap_config,  
    parallel_state_util=MegatronParallelStateUtil(),  
    is_data_loading_rank=None)
```

A PyTorch Lightning DataModule wrapper that applies memory-mapped (MMAP) data loading capabilities to existing DataModules for checkpointless training.

This class wraps an existing PyTorch Lightning DataModule and enhances it with MMAP functionality, enabling efficient data caching and fast recovery during training failures. It maintains compatibility with the original DataModule interface while adding checkpointless training capabilities.

**Parameters**

`data_module` (pl.LightningDataModule)  
The underlying DataModule to wrap (e.g., LLMDataModule)

`mmap_config` (MMAPConfig)  
The MMAP configuration object that defines caching behavior and parameters

`parallel_state_util` (MegatronParallelStateUtil, optional)  
Utility for managing parallel state across distributed processes. Default: MegatronParallelStateUtil()

`is_data_loading_rank` (Callable, optional)  
Function that returns True if the current rank should load data. If None, defaults to `parallel_state_util.is_tp_0`. Default: None

**Attributes**

`global_step` (int)  
Current global training step, used for resuming from checkpoints

`cached_train_dl_len` (int)  
Cached length of the training dataloader

`cached_val_dl_len` (int)  
Cached length of the validation dataloader

**Methods**

```
setup(stage=None)
```

Setup the underlying data module for the specified training stage.

`stage` (str, optional)  
Stage of training ('fit', 'validate', 'test', or 'predict'). Default: None

```
train_dataloader()
```

Create the training DataLoader with MMAP wrapping.

*Returns:* DataLoader – MMAP-wrapped training DataLoader with caching and prefetching capabilities

```
val_dataloader()
```

Create the validation DataLoader with MMAP wrapping.

*Returns:* DataLoader – MMAP-wrapped validation DataLoader with caching capabilities

```
test_dataloader()
```

Create the test DataLoader if the underlying data module supports it.

*Returns:* DataLoader or None – Test DataLoader from the underlying data module, or None if not supported

```
predict_dataloader()
```

Create the predict DataLoader if the underlying data module supports it.

*Returns:* DataLoader or None – Predict DataLoader from the underlying data module, or None if not supported

```
load_checkpoint(checkpoint)
```

Load checkpoint information to resume training from a specific step.

`checkpoint` (dict)  
Checkpoint dictionary containing the 'global_step' key

```
get_underlying_data_module()
```

Get the underlying wrapped data module.

*Returns:* pl.LightningDataModule – The original data module that was wrapped

```
state_dict()
```

Get the state dictionary of the MMAP DataModule for checkpointing.

*Returns:* dict – Dictionary containing cached dataloader lengths

```
load_state_dict(state_dict)
```

Load the state dictionary to restore MMAP DataModule state.

`state_dict` (dict)  
State dictionary to load

**Properties**

```
data_sampler
```

Expose the underlying data module's data sampler to NeMo framework.

*Returns:* object or None – The data sampler from the underlying data module

**Example**

```
from hyperpod_checkpointless_training.dataloader.mmap_data_module import MMAPDataModule  
from hyperpod_checkpointless_training.dataloader.config import CacheResumeMMAPConfig  
from my_project import MyLLMDataModule  

# Create MMAP configuration  
mmap_config = CacheResumeMMAPConfig(  
    cache_dir="/tmp/training_cache",  
    prefetch_length=20,  
    checkpoint_frequency=100  
)  

# Create original data module  
original_data_module = MyLLMDataModule(  
    data_path="/path/to/data",  
    batch_size=32  
)  

# Wrap with MMAP capabilities  
mmap_data_module = MMAPDataModule(  
    data_module=original_data_module,  
    mmap_config=mmap_config  
)  

# Use in PyTorch Lightning Trainer  
trainer = pl.Trainer()  
trainer.fit(model, data=mmap_data_module)  

# Resume from checkpoint  
checkpoint = {"global_step": 1000}  
mmap_data_module.load_checkpoint(checkpoint)
```

**Notes**
+ The wrapper delegates most attribute access to the underlying data module using `__getattr__`
+ Only data loading ranks actually initialize and use the underlying data module; other ranks use fake dataloaders
+ Cached dataloader lengths are maintained to optimize performance during training resumption

# In-process recovery and checkpointless training
<a name="sagemaker-eks-checkpointless-in-process-recovery"></a>

HyperPod checkpointless training uses model redundancy to enable fault-tolerant training. The core principle is that model and optimizer states are fully replicated across multiple node groups, with weight updates and optimizer state changes synchronously replicated within each group. When a failure occurs, healthy replicas complete their optimizer steps and transmit the updated model/optimizer states to recovering replicas.

This model redundancy-based approach enables several fault handling mechanisms:
+ **In-process recovery:** processes remain active despite faults, keeping all model and optimizer states in GPU memory with the latest values
+ **Graceful abort handling:** controlled aborts and resource cleanup for affected operations
+ **Code block re-execution:** re-running only the affected code segments within a Re-executable Code Block (RCB)
+ **Checkpointless recovery with no lost training progress:** since processes persist and states remain in memory, no training progress is lost; when a fault occurs training resumes from the previous step, as opposed to resuming from the last saved checkpoint

**Checkpointless Configurations**

Here is the core snippet of checkpointless training.

```
from hyperpod_checkpointless_training.inprocess.train_utils import wait_rank

wait_rank()

def main():
    @HPWrapper(
        health_check=CudaHealthCheck(),
        hp_api_factory=HPAgentK8sAPIFactory(),
        abort_timeout=60.0,
        checkpoint_manager=PEFTCheckpointManager(enable_offload=True),
        abort=CheckpointlessAbortManager.get_default_checkpointless_abort(),
        finalize=CheckpointlessFinalizeCleanup(),
    )
    def run_main(cfg, caller: Optional[HPCallWrapper] = None):
        ...
        trainer = Trainer(
            strategy=CheckpointlessMegatronStrategy(...,
                num_distributed_optimizer_instances=2),
            callbacks=[..., CheckpointlessCallback(...)],
            )
        trainer.fresume = resume
        trainer._checkpoint_connector = CheckpointlessCompatibleConnector(trainer)
        trainer.wrapper = caller
```
+ `wait_rank`: All ranks will wait for the rank information from the HyperpodTrainingOperator infrastructure.
+ `HPWrapper`: Python function wrapper that enables restart capabilities for a Re-executable Code Block (RCB). The implementation uses a context manager rather than a Python decorator because decorators cannot determine the number of RCBs to monitor at runtime.
+ `CudaHealthCheck`: Ensures the CUDA context for the current process is in a healthy state by synchronizing with the GPU. Uses the device specified by the `LOCAL_RANK` environment variable, or defaults to the main thread's CUDA device if `LOCAL_RANK` is not set.
+ `HPAgentK8sAPIFactory`: This API enables checkpointless training to query the training status of other pods in the Kubernetes training cluster. It also provides an infrastructure-level barrier that ensures all ranks successfully complete abort and restart operations before proceeding.
+ `CheckpointManager`: Manages in-memory checkpoints and peer-to-peer recovery for checkpointless fault tolerance. It has the following core responsibilities:
  + **In-Memory Checkpoint Management**: Saves and manages NeMo model checkpoints in memory for fast recovery without disk I/O during checkpointless recovery scenarios.
  + **Recovery Feasibility Validation**: Determines if checkpointless recovery is possible by validating global step consistency, rank health, and model state integrity.
  + **Peer-to-Peer Recovery Orchestration**: Coordinates checkpoint transfer between healthy and failed ranks using distributed communication for fast recovery.
  + **RNG State Management**: Preserves and restores random number generator states across Python, NumPy, PyTorch, and Megatron for deterministic recovery.
  + **[Optional] Checkpoint Offload**: Offloads the in-memory checkpoint to CPU if the GPU does not have enough memory capacity.
+ `PEFTCheckpointManager`: It extends `CheckpointManager` by keeping the base model weights for PEFT finetuning.
+ `CheckpointlessAbortManager`: Manages abort operations in a background thread when an error is encountered. By default, it aborts TransformerEngine, Checkpointing, TorchDistributed, and DataLoader. Users can register custom abort handlers as needed. After the abort completes, all communication must cease and all processes and threads must terminate to prevent resource leaks.
+ `CheckpointlessFinalizeCleanup`: Handles final cleanup operations in the main thread for components that cannot be safely aborted or cleaned up in the background thread.
+ `CheckpointlessMegatronStrategy`: This inherits from `MegatronStrategy` in NeMo. Note that checkpointless training requires `num_distributed_optimizer_instances` to be at least 2 so that there is optimizer replication. The strategy also takes care of essential attribute registration and process group initialization (e.g., rootless).
+ `CheckpointlessCallback`: Lightning callback that integrates NeMo training with checkpointless training's fault tolerance system. It has the following core responsibilities:
  + **Training Step Lifecycle Management**: Tracks training progress and coordinates with ParameterUpdateLock to enable/disable checkpointless recovery based on training state (first step vs subsequent steps).
  + **Checkpoint State Coordination**: Manages in-memory PEFT base model checkpoint saving/restoring.
+ `CheckpointlessCompatibleConnector`: A PyTorch Lightning `CheckpointConnector` that attempts to pre-load the checkpoint file to memory, with the source determined in this priority:
  + Try checkpointless recovery first
  + If checkpointless recovery returns None, fall back to the parent's `resume_start()`

See [the example](https://github.com/aws/sagemaker-hyperpod-checkpointless-training/blob/main/examples/gpt_oss/gpt_oss_120b_full_finetune.py) for how to add checkpointless training features to your code.

**Concepts**

This section introduces checkpointless training concepts. Checkpointless training on Amazon SageMaker HyperPod supports in-process recovery. The API interface follows a format similar to the NVRx APIs.

**Concept - Re-Executable Code Block (RCB)**

When a failure occurs, healthy processes remain alive, but a portion of the code must be re-executed to recover the training states and Python stacks. A Re-executable Code Block (RCB) is a specific code segment that re-runs during failure recovery. In the configuration example above, the RCB encompasses the entire training script (i.e., everything under `main()`), meaning that each failure recovery restarts the training script while preserving the in-memory model and optimizer states.
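The restart semantics of an RCB can be sketched as a plain retry loop (illustrative only; `run_rcb` and `max_restarts` are our names, not the `HPWrapper` API):

```python
def run_rcb(block, max_restarts=3):
    """Re-execute a code block on failure.

    State created outside the block (analogous to in-memory model and
    optimizer states) survives each restart; only the block itself re-runs.
    """
    restarts = 0
    while True:
        try:
            return block()
        except RuntimeError:
            restarts += 1
            if restarts > max_restarts:
                raise  # escalate when in-process recovery keeps failing
```

In real checkpointless training, the RCB is the wrapped `main()` function, and recovery additionally aborts collectives, cleans up resources, and restores states from healthy replicas before re-execution.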

**Concept - Fault control**

A fault controller module receives notifications when failures occur during checkpointless training. This fault controller includes the following components:
+ **Fault detection module:** Receives infrastructure fault notifications
+ **RCB definition APIs:** Enables users to define the re-executable code block (RCB) in their code
+ **Restart module:** Terminates the RCB, cleans up resources, and restarts the RCB

![\[This image illustrates how a fault controller module receives notifications when failure occurs during checkpointless training.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-fault-controller-module.png)


**Concept - Model redundancy**

Large model training usually requires a large enough data parallel size to train models efficiently. In traditional data parallelism like PyTorch DDP and Horovod, the model is fully replicated. More advanced sharded data parallelism techniques like the DeepSpeed ZeRO optimizer and FSDP also support hybrid sharding mode, which allows sharding the model/optimizer states within the sharding group and fully replicating across replication groups. NeMo also has this hybrid sharding feature through the `num_distributed_optimizer_instances` argument, which allows redundancy.

However, adding redundancy indicates that the model will not be fully sharded across the entire cluster, resulting in higher device memory usage. The amount of redundant memory will vary depending on the specific model sharding techniques implemented by the user. The low-precision model weights, gradients, and activation memory will not be affected, since they are sharded through model parallelism. The high-precision master model weights/gradients and optimizer states will be affected. Adding one redundant model replica increases device memory usage by roughly the equivalent of one DCP checkpoint size.

Hybrid sharding breaks the collectives across the entire DP groups into relatively smaller collectives. Previously there was a reduce-scatter and an all-gather across the entire DP group. After the hybrid sharding, the reduce-scatter is only running inside each model replica, and there will be an all-reduce across model replica groups. The all-gather is also running inside each model replica. As a result, the entire communication volume remains roughly unchanged, but collectives are running with smaller groups, so we expect better latency.
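As a back-of-envelope illustration of the redundancy cost, assuming an Adam-style optimizer with fp32 master weights, momentum, and variance (roughly 12 bytes per parameter; the helper name is ours and the byte count is an assumption, not a HyperPod formula):

```python
def redundant_state_gib(num_params, num_optimizer_instances, bytes_per_param=12):
    """Cluster-wide extra memory (GiB) from replicating high-precision
    master weights and optimizer states across optimizer instances.

    Assumes fp32 master weights + Adam momentum + variance (~12 B/param);
    low-precision weights, gradients, and activations are unaffected.
    """
    extra_replicas = num_optimizer_instances - 1
    return num_params * bytes_per_param * extra_replicas / 2**30
```

Under these assumptions, one extra optimizer instance for a 2^30-parameter (~1B) model adds about 12 GiB of state cluster-wide, consistent with the observation that each redundant replica costs roughly one checkpoint's worth of memory.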

**Concept - Failure and Restart Types**

The following table lists the failure types and their associated recovery mechanisms. Checkpointless training first attempts recovery in process, then falls back to a process-level restart. It resorts to a job-level restart only in the event of a catastrophic failure (for example, multiple nodes failing at the same time).


| Failure Type | Cause | Recovery Type | Recovery Mechanism | 
| --- | --- | --- | --- | 
| In-process failure | Code-level errors, exceptions | In-Process Recovery (IPR) | Rerun RCB within existing process; healthy processes remain active | 
| Process restart failure | Corrupted CUDA context, terminated process | Process Level Restart (PLR) | SageMaker HyperPod training operator restarts processes; skips K8s pod restart | 
| Node replacement failure | Permanent node/GPU hardware failure | Job Level Restart (JLR) | Replace failed node; restart entire training job | 

**Concept - Atomic lock protection for optimizer step**

Model execution is divided into three phases: forward propagation, backward propagation, and optimizer step. Recovery behavior varies based on the failure timing:
+ **Forward/backward propagation:** Roll back to the beginning of the current training step and broadcast model states to replacement node(s)
+ **Optimizer step:** Allow healthy replicas to complete the step under lock protection, then broadcast the updated model states to replacement node(s)

This strategy ensures completed optimizer updates are never discarded, helping reduce fault recovery time.
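Conceptually, the lock protection can be sketched as follows. This is a simplified single-process illustration using a plain `threading.Lock`, not the actual `ParameterUpdateLock` used by HyperPod:

```python
import threading

parameter_update_lock = threading.Lock()
completed_steps = []

def training_step(step: int, fail_before_optimizer: bool = False):
    # Forward/backward: a fault here rolls the step back; no state was mutated.
    if fail_before_optimizer:
        raise RuntimeError(f"fault during forward/backward at step {step}")
    # The optimizer step runs under the lock, so a completed update is
    # never discarded by a concurrently detected fault.
    with parameter_update_lock:
        completed_steps.append(step)

training_step(0)
try:
    training_step(1, fail_before_optimizer=True)
except RuntimeError:
    # Roll back to the start of step 1 and rerun it after recovery.
    training_step(1)

print(completed_steps)  # [0, 1]
```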

![\[This image illustrates how a failure is handled depending on whether it occurs before or during the optimizer step.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-optimizer.png)


## Checkpointless Training Flow Diagram
<a name="sagemaker-eks-checkpointless-training-flow"></a>

![\[This diagram illustrates the checkpointless training flow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-checkpointless-training-flow.png)


The following steps outline the failure detection and checkpointless recovery process:

1. Training loop starts

1. Fault occurs

1. Evaluate whether checkpointless resume is feasible
   + If feasible, attempt checkpointless resume
     + If the resume fails, fall back to checkpoint loading from storage
     + If the resume succeeds, training continues from the recovered state
   + If not feasible, fall back to checkpoint loading from storage

1. Clean up resources - abort all process groups and backends and free resources in preparation for restart.

1. Resume training loop - a new training loop begins, and the process returns to step 1.
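The feasibility decision above can be sketched as follows (an illustrative Python sketch; the `recover` function and its arguments are hypothetical, not part of the HyperPod API):

```python
def recover(feasible: bool, peer_load_ok: bool) -> str:
    """Illustrative recovery decision: prefer checkpointless resume,
    fall back to loading a checkpoint from storage otherwise."""
    if feasible and peer_load_ok:
        return "checkpointless-resume"
    return "load-from-storage"

# Feasible and peer-to-peer transfer succeeds: no disk I/O needed.
print(recover(feasible=True, peer_load_ok=True))    # checkpointless-resume
# Feasible but the resume attempt fails: fall back to storage.
print(recover(feasible=True, peer_load_ok=False))   # load-from-storage
# Not feasible (e.g., too many replicas lost): fall back immediately.
print(recover(feasible=False, peer_load_ok=True))   # load-from-storage
```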

## API reference
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference"></a>

### wait\_rank
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-wait_rank"></a>

```
hyperpod_checkpointless_training.inprocess.train_utils.wait_rank()
```

Waits for and retrieves rank information from HyperPod, then updates the current process environment with distributed training variables.

This function obtains the correct rank assignment and environment variables for distributed training. It ensures that each process gets the appropriate configuration for its role in the distributed training job.

**Parameters**

None

**Returns**

**None**

**Behavior**
+ **Process Check**: Skips execution if called from a subprocess (only runs in MainProcess)
+ **Environment Retrieval**: Gets current `RANK` and `WORLD_SIZE` from environment variables
+ **HyperPod Communication**: Calls `hyperpod_wait_rank_info()` to retrieve rank information from HyperPod
+ **Environment Update**: Updates the current process environment with worker-specific environment variables received from HyperPod

**Environment Variables**

The function reads the following environment variables:
+ **RANK** (*int*) – Current process rank (default: -1 if not set)
+ **WORLD\_SIZE** (*int*) – Total number of processes in the distributed job (default: 0 if not set)

**Raises**
+ **AssertionError** – If the response from HyperPod is not in the expected format or if required fields are missing

**Example**

```
from hyperpod_checkpointless_training.inprocess.train_utils import wait_rank  

# Call before initializing distributed training  
wait_rank()  

# Now environment variables are properly set for this rank  
import torch.distributed as dist  
dist.init_process_group(backend='nccl')
```

**Notes**
+ Only executes in the main process; subprocess calls are automatically skipped
+ The function blocks until HyperPod provides the rank information

### HPWrapper
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-HPWrapper"></a>

```
class hyperpod_checkpointless_training.inprocess.wrap.HPWrapper(  
    *,  
    abort=Compose(HPAbortTorchDistributed()),  
    finalize=None,  
    health_check=None,  
    hp_api_factory=None,  
    abort_timeout=None,  
    enabled=True,  
    trace_file_path=None,  
    async_raise_before_abort=True,  
    early_abort_communicator=False,  
    checkpoint_manager=None,  
    check_memory_status=True)
```

Python function wrapper that enables restart capabilities for a Re-executable Code Block (RCB) in HyperPod checkpointless training.

This wrapper provides fault tolerance and automatic recovery capabilities by monitoring training execution and coordinating restarts across distributed processes when failures occur. It uses a context manager approach rather than a decorator to maintain global resources throughout the training lifecycle.

**Parameters**
+ **abort** (*Abort*, *optional*) – Asynchronously aborts execution when failures are detected. Default: `Compose(HPAbortTorchDistributed())`
+ **finalize** (*Finalize*, *optional*) – Rank-local finalize handler executed during restart. Default: `None`
+ **health\_check** (*HealthCheck*, *optional*) – Rank-local health check executed during restart. Default: `None`
+ **hp\_api\_factory** (*Callable*, *optional*) – Factory function for creating a HyperPod API to interact with HyperPod. Default: `None`
+ **abort\_timeout** (*float*, *optional*) – Timeout for the abort call in the fault controlling thread. Default: `None`
+ **enabled** (*bool*, *optional*) – Enables the wrapper functionality. When `False`, the wrapper becomes a pass-through. Default: `True`
+ **trace\_file\_path** (*str*, *optional*) – Path to the trace file for VizTracer profiling. Default: `None`
+ **async\_raise\_before\_abort** (*bool*, *optional*) – Enable raise before abort in the fault controlling thread. Default: `True`
+ **early\_abort\_communicator** (*bool*, *optional*) – Abort the communicator (NCCL/Gloo) before aborting the dataloader. Default: `False`
+ **checkpoint\_manager** (*Any*, *optional*) – Manager for handling checkpoints during recovery. Default: `None`
+ **check\_memory\_status** (*bool*, *optional*) – Enable memory status checking and logging. Default: `True`

**Methods**

```
def __call__(self, fn)
```

*Wraps a function to enable restart capabilities.*

**Parameters:**
+ **fn** (*Callable*) – The function to wrap with restart capabilities

**Returns:**
+ **Callable** – Wrapped function with restart capabilities, or original function if disabled

**Example**

```
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper
from hyperpod_checkpointless_training.inprocess.health_check import CudaHealthCheck
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager import CheckpointManager
from hyperpod_checkpointless_training.nemo_plugins.patches import patch_megatron_optimizer
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_connector import CheckpointlessCompatibleConnector
from hyperpod_checkpointless_training.inprocess.train_utils import HPAgentK8sAPIFactory
from hyperpod_checkpointless_training.inprocess.abort import CheckpointlessFinalizeCleanup, CheckpointlessAbortManager

@HPWrapper(
    health_check=CudaHealthCheck(),
    hp_api_factory=HPAgentK8sAPIFactory(),
    abort_timeout=60.0,
    checkpoint_manager=CheckpointManager(enable_offload=False),
    abort=CheckpointlessAbortManager.get_default_checkpointless_abort(),
    finalize=CheckpointlessFinalizeCleanup(),
)
def training_function():
    # Your training code here
    pass
```

**Notes**
+ The wrapper requires `torch.distributed` to be available
+ When `enabled=False`, the wrapper becomes a pass-through and returns the original function unchanged
+ The wrapper maintains global resources like monitoring threads throughout the training lifecycle
+ Supports VizTracer profiling when `trace_file_path` is provided
+ Integrates with HyperPod for coordinated fault handling across distributed training

### HPCallWrapper
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-HPCallWrapper"></a>

```
class hyperpod_checkpointless_training.inprocess.wrap.HPCallWrapper(wrapper)
```

Monitors and manages the state of a Restart Code Block (RCB) during execution.

This class handles the lifecycle of RCB execution, including failure detection, coordination with other ranks for restarts, and cleanup operations. It manages distributed synchronization and ensures consistent recovery across all training processes.

**Parameters**
+ **wrapper** (*HPWrapper*) – The parent wrapper containing global in-process recovery settings

**Attributes**
+ **step\_upon\_restart** (*int*) – Counter that tracks steps since the last restart, used for determining the restart strategy

**Methods**

```
def initialize_barrier()
```

Wait for HyperPod barrier synchronization after encountering an exception from RCB.

```
def start_hp_fault_handling_thread()
```

Start the fault handling thread for monitoring and coordinating failures.

```
def handle_fn_exception(call_ex)
```

Process exceptions from the execution function or RCB.

**Parameters:**
+ **call\_ex** (*Exception*) – Exception from the monitoring function

```
def restart(term_ex)
```

Execute restart handler including finalization, garbage collection, and health checks.

**Parameters:**
+ **term\_ex** (*RankShouldRestart*) – Termination exception triggering the restart

```
def launch(fn, *a, **kw)
```

*Execute the RCB with proper exception handling.*

**Parameters:**
+ **fn** (*Callable*) – Function to be executed
+ **a** – Function arguments
+ **kw** – Function keyword arguments

```
def run(fn, a, kw)
```

Main execution loop that handles restarts and barrier synchronization.

**Parameters:**
+ **fn** (*Callable*) – Function to be executed
+ **a** – Function arguments
+ **kw** – Function keyword arguments

```
def shutdown()
```

Shutdown fault handling and monitoring threads.

**Notes**
+ Automatically handles `RankShouldRestart` exceptions for coordinated recovery
+ Manages memory tracking and aborts, garbage collection during restarts
+ Supports both in-process recovery and PLR (Process-Level Restart) strategies based on failure timing

### CudaHealthCheck
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-cudahealthcheck"></a>

```
class hyperpod_checkpointless_training.inprocess.health_check.CudaHealthCheck(timeout=datetime.timedelta(seconds=30))
```

Ensures that the CUDA context for the current process is in a healthy state during checkpointless training recovery.

This health check synchronizes with the GPU to verify that the CUDA context is not corrupted after a training failure. It performs GPU synchronization operations to detect any issues that might prevent successful training resumption. The health check is executed after distributed groups are destroyed and finalization is complete.

**Parameters**
+ **timeout** (*datetime.timedelta*, *optional*) – Timeout duration for GPU synchronization operations. Default: `datetime.timedelta(seconds=30)`

**Methods**

```
__call__(state, train_ex=None)
```

Execute the CUDA health check to verify GPU context integrity.

**Parameters:**
+ **state** (*HPState*) – Current HyperPod state containing rank and distributed information
+ **train\_ex** (*Exception*, *optional*) – The original training exception that triggered the restart. Default: `None`

**Returns:**
+ **tuple** – A tuple containing `(state, train_ex)` unchanged if health check passes

**Raises:**
+ **TimeoutError** – If GPU synchronization times out, indicating a potentially corrupted CUDA context

**State Preservation**: Returns the original state and exception unchanged if all checks pass

**Example**

```
import datetime  
from hyperpod_checkpointless_training.inprocess.health_check import CudaHealthCheck  
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
  
# Create CUDA health check with custom timeout  
cuda_health_check = CudaHealthCheck(  
    timeout=datetime.timedelta(seconds=60)  
)  
  
# Use with HPWrapper for fault-tolerant training  
@HPWrapper(  
    health_check=cuda_health_check,  
    enabled=True  
)  
def training_function():  
    # Your training code here  
    pass
```

**Notes**
+ Uses threading to implement timeout protection for GPU synchronization
+ Designed to detect corrupted CUDA contexts that could prevent successful training resumption
+ Should be used as part of the fault tolerance pipeline in distributed training scenarios

### HPAgentK8sAPIFactory
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-HPAgentK8sAPIFactory"></a>

```
class hyperpod_checkpointless_training.inprocess.train_utils.HPAgentK8sAPIFactory()
```

Factory class for creating HPAgentK8sAPI instances that communicate with HyperPod infrastructure for distributed training coordination.

This factory provides a standardized way to create and configure HPAgentK8sAPI objects that handle communication between training processes and the HyperPod control plane. It encapsulates the creation of the underlying socket client and API instance, ensuring consistent configuration across different parts of the training system.

**Methods**

```
__call__()
```

Create and return an HPAgentK8sAPI instance configured for HyperPod communication.

**Returns:**
+ **HPAgentK8sAPI** – Configured API instance for communicating with HyperPod infrastructure

**Example**

```
from hyperpod_checkpointless_training.inprocess.train_utils import HPAgentK8sAPIFactory  
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
from hyperpod_checkpointless_training.inprocess.health_check import CudaHealthCheck  
  
# Create the factory  
hp_api_factory = HPAgentK8sAPIFactory()  
  
# Use with HPWrapper for fault-tolerant training  
hp_wrapper = HPWrapper(  
    hp_api_factory=hp_api_factory,  
    health_check=CudaHealthCheck(),  
    abort_timeout=60.0,  
    enabled=True  
)  
  
@hp_wrapper  
def training_function():  
    # Your distributed training code here  
    pass
```

**Notes**
+ Designed to work seamlessly with HyperPod's Kubernetes-based infrastructure
+ Essential for coordinated fault handling and recovery in distributed training scenarios

### CheckpointManager
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointManager"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager.CheckpointManager(  
    enable_checksum=False,  
    enable_offload=False)
```

Manages in-memory checkpoints and peer-to-peer recovery for checkpointless fault tolerance in distributed training.

This class provides the core functionality for HyperPod checkpointless training by managing NeMo model checkpoints in memory, validating recovery feasibility, and orchestrating peer-to-peer checkpoint transfer between healthy and failed ranks. It eliminates the need for disk I/O during recovery, significantly reducing mean time to recovery (MTTR).

**Parameters**
+ **enable\_checksum** (*bool*, *optional*) – Enable model state checksum validation for integrity checks during recovery. Default: `False`
+ **enable\_offload** (*bool*, *optional*) – Enable checkpoint offloading from GPU to CPU memory to reduce GPU memory usage. Default: `False`

**Attributes**
+ **global\_step** (*int* or *None*) – Current training step associated with the saved checkpoint
+ **rng\_states** (*list* or *None*) – Stored random number generator states for deterministic recovery
+ **checksum\_manager** (*MemoryChecksumManager*) – Manager for model state checksum validation
+ **parameter\_update\_lock** (*ParameterUpdateLock*) – Lock for coordinating parameter updates during recovery

**Methods**

```
save_checkpoint(trainer)
```

Save NeMo model checkpoint in memory for potential checkpointless recovery.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Notes:**
+ Called by CheckpointlessCallback at batch end or during exception handling
+ Creates recovery points without disk I/O overhead
+ Stores complete model, optimizer, and scheduler states

```
delete_checkpoint()
```

Delete the in-memory checkpoint and perform cleanup operations.

**Notes:**
+ Clears checkpoint data, RNG states, and cached tensors
+ Performs garbage collection and CUDA cache cleanup
+ Called after successful recovery or when checkpoint is no longer needed

```
try_checkpointless_load(trainer)
```

Attempt checkpointless recovery by loading state from peer ranks.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Returns:**
+ **dict** or **None** – Restored checkpoint if successful, None if fallback to disk needed

**Notes:**
+ Main entry point for checkpointless recovery
+ Validates recovery feasibility before attempting P2P transfer
+ Always cleans up in-memory checkpoints after recovery attempt

```
checkpointless_recovery_feasible(trainer, include_checksum_verification=True)
```

Determine if checkpointless recovery is possible for the current failure scenario.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance
+ **include\_checksum\_verification** (*bool*, *optional*) – Whether to include checksum validation. Default: `True`

**Returns:**
+ **bool** – True if checkpointless recovery is feasible, False otherwise

**Validation Criteria:**
+ Global step consistency across healthy ranks
+ Sufficient healthy replicas available for recovery
+ Model state checksum integrity (if enabled)
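The criteria can be sketched as a rank-local check (illustrative only; the real method coordinates the decision across ranks with distributed collectives, and this `recovery_feasible` helper is hypothetical):

```python
def recovery_feasible(healthy_replica_steps, min_healthy_replicas, checksums_ok=True):
    """Sketch of the documented criteria: healthy replicas must agree on
    the global step, enough of them must remain, and (optionally)
    checksums must validate."""
    if not healthy_replica_steps:
        return False
    step_consistent = len(set(healthy_replica_steps)) == 1
    enough_replicas = len(healthy_replica_steps) >= min_healthy_replicas
    return step_consistent and enough_replicas and checksums_ok

print(recovery_feasible([1200, 1200], min_healthy_replicas=1))  # True
print(recovery_feasible([1200, 1199], min_healthy_replicas=1))  # False (step mismatch)
print(recovery_feasible([1200], min_healthy_replicas=2))        # False (too few replicas)
```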

```
store_rng_states()
```

Store all random number generator states for deterministic recovery.

**Notes:**
+ Captures Python, NumPy, PyTorch CPU/GPU, and Megatron RNG states
+ Essential for maintaining training determinism after recovery

```
load_rng_states()
```

Restore all RNG states for deterministic recovery continuation.

**Notes:**
+ Restores all previously stored RNG states
+ Ensures training continues with identical random sequences
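The effect of the store/load pair can be illustrated with Python's standard `random` module alone (the actual implementation also captures NumPy, PyTorch CPU/GPU, and Megatron generator states):

```python
import random

random.seed(1234)
saved_state = random.getstate()          # analogous to store_rng_states()

before = [random.random() for _ in range(3)]

random.setstate(saved_state)             # analogous to load_rng_states()
after = [random.random() for _ in range(3)]

# Restoring the state replays the identical random sequence.
print(before == after)  # True
```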

```
maybe_offload_checkpoint()
```

Offload checkpoint from GPU to CPU memory if offload is enabled.

**Notes:**
+ Reduces GPU memory usage for large models
+ Only executes if `enable_offload=True`
+ Maintains checkpoint accessibility for recovery

**Example**

```
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager import CheckpointManager  
# Use with HPWrapper for complete fault tolerance  
@HPWrapper(  
    checkpoint_manager=CheckpointManager(),  
    enabled=True  
)  
def training_function():  
    # Training code with automatic checkpointless recovery  
    pass
```

**Validation**: Verifies checkpoint integrity using checksums (if enabled)

**Notes**
+ Uses distributed communication primitives for efficient P2P transfer
+ Automatically handles tensor dtype conversions and device placement
+ **MemoryChecksumManager** – Handles model state integrity validation

### PEFTCheckpointManager
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-PEFTCheckpointManager"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager.PEFTCheckpointManager(  
    *args,  
    **kwargs)
```

Manages checkpoints for PEFT (Parameter-Efficient Fine-Tuning) with separate base and adapter handling for optimized checkpointless recovery.

This specialized checkpoint manager extends CheckpointManager to optimize PEFT workflows by separating base model weights from adapter parameters.

**Parameters**

Inherits all parameters from **CheckpointManager**:
+ **enable\_checksum** (*bool*, *optional*) – Enable model state checksum validation. Default: `False`
+ **enable\_offload** (*bool*, *optional*) – Enable checkpoint offloading to CPU memory. Default: `False`

**Additional Attributes**
+ **params\_to\_save** (*set*) – Set of parameter names that should be saved as adapter parameters
+ **base\_model\_weights** (*dict* or *None*) – Cached base model weights, saved once and reused
+ **base\_model\_keys\_to\_extract** (*list* or *None*) – Keys for extracting base model tensors during P2P transfer

**Methods**

```
maybe_save_base_model(trainer)
```

Save base model weights once, filtering out adapter parameters.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Notes:**
+ Only saves base model weights on first call; subsequent calls are no-ops
+ Filters out adapter parameters to store only frozen base model weights
+ Base model weights are preserved across multiple training sessions

```
save_checkpoint(trainer)
```

Save NeMo PEFT adapter model checkpoint in memory for potential checkpointless recovery.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Notes:**
+ Automatically calls `maybe_save_base_model()` if base model not yet saved
+ Filters checkpoint to include only adapter parameters and training state
+ Significantly reduces checkpoint size compared to full model checkpoints

```
try_base_model_checkpointless_load(trainer)
```

Attempt PEFT base model weights checkpointless recovery by loading state from peer ranks.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Returns:**
+ **dict** or **None** – Restored base model checkpoint if successful, None if fallback needed

**Notes:**
+ Used during model initialization to recover base model weights
+ Does not clean up base model weights after recovery (preserves for reuse)
+ Optimized for model-weights-only recovery scenarios

```
try_checkpointless_load(trainer)
```

Attempt PEFT adapter weights checkpointless recovery by loading state from peer ranks.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Returns:**
+ **dict** or **None** – Restored adapter checkpoint if successful, None if fallback needed

**Notes:**
+ Recovers only adapter parameters, optimizer states, and schedulers
+ Automatically loads optimizer and scheduler states after successful recovery
+ Cleans up adapter checkpoints after recovery attempt

```
is_adapter_key(key)
```

Check if state dict key belongs to adapter parameters.

**Parameters:**
+ **key** (*str* or *tuple*) – State dict key to check

**Returns:**
+ **bool** – True if key is adapter parameter, False if base model parameter

**Detection Logic:**
+ Checks if key is in `params_to_save` set
+ Identifies keys containing ".adapter." substring
+ Identifies keys ending with ".adapters"
+ For tuple keys, checks if parameter requires gradients
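For string keys, the detection rules can be sketched as follows (illustrative; the real method also handles tuple keys, and `is_adapter_key_sketch` with its sample key names is a hypothetical stand-in):

```python
# Hypothetical example of explicitly registered adapter parameter names.
ADAPTER_PARAMS_TO_SAVE = {"decoder.layers.0.custom_adapter_weight"}

def is_adapter_key_sketch(key: str, params_to_save=ADAPTER_PARAMS_TO_SAVE) -> bool:
    """Mirror the documented detection rules for string keys."""
    return (key in params_to_save
            or ".adapter." in key
            or key.endswith(".adapters"))

print(is_adapter_key_sketch("decoder.layers.0.adapter.lora_a"))          # True
print(is_adapter_key_sketch("decoder.layers.0.self_attention.adapters")) # True
print(is_adapter_key_sketch("decoder.layers.0.mlp.weight"))              # False
```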

```
maybe_offload_checkpoint()
```

Offload base model weights from GPU to CPU memory.

**Notes:**
+ Extends parent method to handle base model weight offloading
+ Adapter weights are typically small and don't require offloading
+ Sets internal flag to track offload state

**Notes**
+ Designed specifically for Parameter-Efficient Fine-Tuning scenarios (LoRA, Adapters, etc.)
+ Automatically handles separation of base model and adapter parameters

**Example**

```
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager import PEFTCheckpointManager  
# Use with HPWrapper for complete fault tolerance  
@HPWrapper(  
    checkpoint_manager=PEFTCheckpointManager(),  
    enabled=True  
)  
def training_function():  
    # Training code with automatic checkpointless recovery  
    pass
```

### CheckpointlessAbortManager
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessAbortManager"></a>

```
class hyperpod_checkpointless_training.inprocess.abort.CheckpointlessAbortManager()
```

Factory class for creating and managing abort component compositions for checkpointless fault tolerance.

This utility class provides static methods to create, customize, and manage abort component compositions used during fault handling in HyperPod checkpointless training. It simplifies the configuration of abort sequences that handle cleanup of distributed training components, data loaders, and framework-specific resources during failure recovery.

**Parameters**

None (all methods are static)

**Static Methods**

```
get_default_checkpointless_abort()
```

Get the default abort compose instance containing all standard abort components.

**Returns:**
+ **Compose** – Default composed abort instance with all abort components

**Default Components:**
+ **AbortTransformerEngine()** – Cleans up TransformerEngine resources
+ **HPCheckpointingAbort()** – Handles checkpointing system cleanup
+ **HPAbortTorchDistributed()** – Aborts PyTorch distributed operations
+ **HPDataLoaderAbort()** – Stops and cleans up data loaders

```
create_custom_abort(*abort_instances)
```

*Create a custom abort compose with only the specified abort instances.*

**Parameters:**
+ **abort\_instances** (*Abort*) – Variable number of abort instances to include in the compose

**Returns:**
+ **Compose** – New composed abort instance containing only the specified components

**Raises:**
+ **ValueError** – If no abort instances are provided

```
override_abort(abort_compose, abort_type, new_abort)
```

Replace a specific abort component in a Compose instance with a new component.

**Parameters:**
+ **abort\_compose** (*Compose*) – The original Compose instance to modify
+ **abort\_type** (*type*) – The type of abort component to replace (e.g., `HPCheckpointingAbort`)
+ **new\_abort** (*Abort*) – The new abort instance to use as replacement

**Returns:**
+ **Compose** – New Compose instance with the specified component replaced

**Raises:**
+ **ValueError** – If `abort_compose` doesn't have an `instances` attribute

**Example**

```
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper
from hyperpod_checkpointless_training.inprocess.health_check import CudaHealthCheck
from hyperpod_checkpointless_training.inprocess.abort import CheckpointlessFinalizeCleanup, CheckpointlessAbortManager

# Use the default abort composition with HPWrapper
@HPWrapper(
    abort=CheckpointlessAbortManager.get_default_checkpointless_abort(),
    health_check=CudaHealthCheck(),
    finalize=CheckpointlessFinalizeCleanup(),
    enabled=True
)
def training_function():
    trainer.fit(...)
```

**Notes**
+ Custom configurations allow fine-tuned control over cleanup behavior
+ Abort operations are critical for proper resource cleanup during fault recovery

### CheckpointlessFinalizeCleanup
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessFinalizeCleanup"></a>

```
class hyperpod_checkpointless_training.inprocess.abort.CheckpointlessFinalizeCleanup()
```

Performs comprehensive cleanup after fault detection to prepare for in-process recovery during checkpointless training.

This finalize handler executes framework-specific cleanup operations including Megatron/TransformerEngine abort, DDP cleanup, module reloading, and memory cleanup by destroying training component references. It ensures that the training environment is properly reset for successful in-process recovery without requiring full process termination.

**Parameters**

None

**Attributes**
+ **trainer** (*pytorch\_lightning.Trainer* or *None*) – Reference to the PyTorch Lightning trainer instance

**Methods**

```
__call__(*a, **kw)
```

Execute comprehensive cleanup operations for in-process recovery preparation.

**Parameters:**
+ **a** – Variable positional arguments (inherited from Finalize interface)
+ **kw** – Variable keyword arguments (inherited from Finalize interface)

**Cleanup Operations:**
+ **Megatron Framework Cleanup** – Calls `abort_megatron()` to clean up Megatron-specific resources
+ **TransformerEngine Cleanup** – Calls `abort_te()` to clean up TransformerEngine resources
+ **RoPE Cleanup** – Calls `cleanup_rope()` to clean up rotary position embedding resources
+ **DDP Cleanup** – Calls `cleanup_ddp()` to clean up DistributedDataParallel resources
+ **Module Reloading** – Calls `reload_megatron_and_te()` to reload framework modules
+ **Lightning Module Cleanup** – Optionally clears Lightning module to reduce GPU memory
+ **Memory Cleanup** – Destroys training component references to free memory

```
register_attributes(trainer)
```

*Register the trainer instance for use during cleanup operations.*

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance to register

**Integration with CheckpointlessCallback**

```
from hyperpod_checkpointless_training.nemo_plugins.callbacks import CheckpointlessCallback  
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
  
# The strategy automatically integrates with HPWrapper  
@HPWrapper(  
    ...  
    finalize=CheckpointlessFinalizeCleanup(),   
)  
def training_function():  
    trainer.fit(...)
```

**Notes**
+ Cleanup operations are executed in a specific order to avoid dependency issues
+ Memory cleanup uses garbage collection introspection to find target objects
+ All cleanup operations are designed to be idempotent and safe to retry

### CheckpointlessMegatronStrategy
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessMegatronStrategy"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.megatron_strategy.CheckpointlessMegatronStrategy(*args, **kwargs)
```

NeMo Megatron strategy with integrated checkpointless recovery capabilities for fault-tolerant distributed training.

Note that checkpointless training requires `num_distributed_optimizer_instances` to be at least 2 so that optimizer state is replicated. The strategy also takes care of essential attribute registration and process group initialization.

**Parameters**

Inherits all parameters from **MegatronStrategy**:
+ Standard NeMo MegatronStrategy initialization parameters
+ Distributed training configuration options
+ Model parallelism settings

**Attributes**
+ **base\_store** (*torch.distributed.TCPStore* or *None*) – Distributed store for process group coordination

**Methods**

```
setup(trainer)
```

Initialize the strategy and register fault tolerance components with the trainer.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Setup Operations:**
+ **Parent Setup** – Calls parent MegatronStrategy setup
+ **Fault Injection Registration** – Registers HPFaultInjectionCallback hooks if present
+ **Finalize Registration** – Registers trainer with finalize cleanup handlers
+ **Abort Registration** – Registers trainer with abort handlers that support it

```
setup_distributed()
```

Initialize process group using either TCPStore with prefix or rootless connection.

```
load_model_state_dict(checkpoint, strict=True)
```

Load model state dict with checkpointless recovery compatibility.

**Parameters:**
+ **checkpoint** (*Mapping[str, Any]*) – Checkpoint dictionary containing model state
+ **strict** (*bool*, *optional*) – Whether to strictly enforce state dict key matching. Default: `True`

```
get_wrapper()
```

Get the HPCallWrapper instance for fault tolerance coordination.

**Returns:**
+ **HPCallWrapper** – The wrapper instance attached to the trainer for fault tolerance

```
is_peft()
```

Check whether PEFT (Parameter-Efficient Fine-Tuning) is enabled in the training configuration by looking for PEFT callbacks.

**Returns:**
+ **bool** – True if PEFT callback is present, False otherwise

```
teardown()
```

Override PyTorch Lightning native teardown to delegate cleanup to abort handlers.

**Example**

```
from hyperpod_checkpointless_training.inprocess.wrap import HPWrapper  
  
# The strategy automatically integrates with HPWrapper  
@HPWrapper(  
    checkpoint_manager=checkpoint_manager,  
    enabled=True  
)  
def training_function():  
    trainer = pl.Trainer(strategy=CheckpointlessMegatronStrategy())  
    trainer.fit(model, datamodule)
```

### CheckpointlessCallback
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessCallback"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.callbacks.CheckpointlessCallback(  
    enable_inprocess=False,  
    enable_checkpointless=False,  
    enable_checksum=False,  
    clean_tensor_hook=False,  
    clean_lightning_module=False)
```

Lightning callback that integrates NeMo training with checkpointless training's fault tolerance system.

This callback manages step tracking, checkpoint saving, and parameter update coordination for in-process recovery capabilities. It serves as the primary integration point between PyTorch Lightning training loops and HyperPod checkpointless training mechanisms, coordinating fault tolerance operations throughout the training lifecycle.

**Parameters**
+ **enable\_inprocess** (*bool*, *optional*) – Enable in-process recovery capabilities. Default: `False`
+ **enable\_checkpointless** (*bool*, *optional*) – Enable checkpointless recovery (requires `enable_inprocess=True`). Default: `False`
+ **enable\_checksum** (*bool*, *optional*) – Enable model state checksum validation (requires `enable_checkpointless=True`). Default: `False`
+ **clean\_tensor\_hook** (*bool*, *optional*) – Clear tensor hooks from all GPU tensors during cleanup (expensive operation). Default: `False`
+ **clean\_lightning\_module** (*bool*, *optional*) – Enable Lightning module cleanup to free GPU memory after each restart. Default: `False`

**Attributes**
+ **tried\_adapter\_checkpointless** (*bool*) – Flag to track if adapter checkpointless restore has been attempted

**Methods**

```
get_wrapper_from_trainer(trainer)
```

Get the HPCallWrapper instance from the trainer for fault tolerance coordination.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Returns:**
+ **HPCallWrapper** – The wrapper instance for fault tolerance operations

```
on_train_batch_start(trainer, pl_module, batch, batch_idx, *args, **kwargs)
```

Called at the start of each training batch to manage step tracking and recovery.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance
+ **pl\_module** (*pytorch\_lightning.LightningModule*) – Lightning module being trained
+ **batch** – Current training batch data
+ **batch\_idx** (*int*) – Index of the current batch
+ **args** – Additional positional arguments
+ **kwargs** – Additional keyword arguments

```
on_train_batch_end(trainer, pl_module, outputs, batch, batch_idx)
```

Release the parameter update lock at the end of each training batch.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance
+ **pl\_module** (*pytorch\_lightning.LightningModule*) – Lightning module being trained
+ **outputs** (*STEP\_OUTPUT*) – Training step outputs
+ **batch** (*Any*) – Current training batch data
+ **batch\_idx** (*int*) – Index of the current batch

**Notes:**
+ Lock release timing ensures checkpointless recovery can proceed after parameter updates complete
+ Only executes when both `enable_inprocess` and `enable_checkpointless` are True
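The lock timing described in the notes can be illustrated with plain `threading` primitives. This is a sketch of the coordination pattern only; `ParameterUpdateGuard` is a hypothetical name, not the actual ParameterUpdateLock implementation:

```python
import threading

class ParameterUpdateGuard:
    """Sketch: hold a lock for the duration of a parameter update so recovery
    never observes a half-applied update (hypothetical, for illustration)."""

    def __init__(self):
        self._lock = threading.Lock()

    def begin_update(self):   # analogous to on_train_batch_start
        self._lock.acquire()

    def end_update(self):     # analogous to on_train_batch_end
        self._lock.release()

    def safe_to_recover(self) -> bool:
        # Recovery may proceed only when no update is in flight.
        if self._lock.acquire(blocking=False):
            self._lock.release()
            return True
        return False

guard = ParameterUpdateGuard()
guard.begin_update()
print(guard.safe_to_recover())  # False: an update is in flight
guard.end_update()
print(guard.safe_to_recover())  # True: recovery can proceed
```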

```
get_peft_callback(trainer)
```

Retrieve the PEFT callback from the trainer's callback list.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance

**Returns:**
+ **PEFT** or **None** – PEFT callback instance if found, None otherwise

```
_try_adapter_checkpointless_restore(trainer, params_to_save)
```

Attempt checkpointless restore for PEFT adapter parameters.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer*) – PyTorch Lightning trainer instance
+ **params\_to\_save** (*set*) – Set of parameter names to save as adapter parameters

**Notes:**
+ Only executes once per training session (controlled by `tried_adapter_checkpointless` flag)
+ Configures checkpoint manager with adapter parameter information

**Example**

```
from hyperpod_checkpointless_training.nemo_plugins.callbacks import CheckpointlessCallback  
from hyperpod_checkpointless_training.nemo_plugins.checkpoint_manager import CheckpointManager  
import pytorch_lightning as pl  
  
# Create checkpoint manager  
checkpoint_manager = CheckpointManager(  
    enable_checksum=True,  
    enable_offload=True  
)  
  
# Create checkpointless callback with full fault tolerance  
checkpointless_callback = CheckpointlessCallback(  
    enable_inprocess=True,  
    enable_checkpointless=True,  
    enable_checksum=True,  
    clean_tensor_hook=True,  
    clean_lightning_module=True  
)  
  
# Use with PyTorch Lightning trainer  
trainer = pl.Trainer(  
    callbacks=[checkpointless_callback],  
    strategy=CheckpointlessMegatronStrategy()  
)  
  
# Training with fault tolerance  
trainer.fit(model, datamodule=data_module)
```

**Memory Management**
+ **clean\_tensor\_hook**: Removes tensor hooks during cleanup (expensive but thorough)
+ **clean\_lightning\_module**: Frees Lightning module GPU memory during restarts
+ Both options help reduce memory footprint during fault recovery
+ Coordinates with ParameterUpdateLock for thread-safe parameter update tracking

### CheckpointlessCompatibleConnector
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessCompatibleConnector"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.checkpoint_connector.CheckpointlessCompatibleConnector()
```

PyTorch Lightning checkpoint connector that integrates checkpointless recovery with traditional disk-based checkpoint loading.

This connector extends PyTorch Lightning's `_CheckpointConnector` to provide seamless integration between checkpointless recovery and standard checkpoint restoration. It attempts checkpointless recovery first, then falls back to disk-based checkpoint loading if checkpointless recovery is not feasible or fails.
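The recovery priority can be sketched as a generic try-peers-first, fall-back-to-disk routine. All names below (`restore_state`, `load_from_peers`, `load_from_disk`) are illustrative assumptions, not the connector's real internals:

```python
def restore_state(load_from_peers, load_from_disk, checkpoint_path=None):
    """Sketch of the connector's priority: try checkpointless (peer-to-peer)
    recovery first; fall back to a disk checkpoint only if that fails."""
    try:
        state = load_from_peers()
        if state is not None:
            return state, "checkpointless"
    except RuntimeError:
        pass  # peer recovery infeasible; fall through to disk loading
    if checkpoint_path is None:
        return None, "fresh-start"
    return load_from_disk(checkpoint_path), "disk"

def no_healthy_peer():
    raise RuntimeError("no surviving replica holds the state")

# Peer recovery fails here, so the disk checkpoint is used instead:
state, source = restore_state(no_healthy_peer, lambda path: {"step": 1200}, "ckpt/last.pt")
print(source)  # disk
```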

**Parameters**

Inherits all parameters from **\_CheckpointConnector**

**Methods**

```
resume_start(checkpoint_path=None)
```

Attempt to pre-load checkpoint with checkpointless recovery priority.

**Parameters:**
+ **checkpoint\_path** (*str* or *None*, *optional*) – Path to disk checkpoint for fallback. Default: `None`

```
resume_end()
```

Complete the checkpoint loading process and perform post-load operations.

**Notes**
+ Extends PyTorch Lightning's internal `_CheckpointConnector` class with checkpointless recovery support
+ Maintains full compatibility with standard PyTorch Lightning checkpoint workflows

### CheckpointlessAutoResume
<a name="sagemaker-eks-checkpointless-in-process-recovery-reference-CheckpointlessAutoResume"></a>

```
class hyperpod_checkpointless_training.nemo_plugins.resume.CheckpointlessAutoResume()
```

Extends NeMo's AutoResume with delayed setup to enable checkpointless recovery validation before checkpoint path resolution.

This class implements a two-phase initialization strategy that allows checkpointless recovery validation to occur before falling back to traditional disk-based checkpoint loading. It conditionally delays AutoResume setup to prevent premature checkpoint path resolution, enabling the CheckpointManager to first validate whether checkpointless peer-to-peer recovery is feasible.
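The two-phase strategy can be sketched as a deferred-setup pattern. `DelayedSetup` and its return values are illustrative only, not NeMo or HyperPod APIs:

```python
class DelayedSetup:
    """Sketch of the two-phase pattern: defer checkpoint-path resolution until
    checkpointless feasibility is known, or until setup is forced."""

    def __init__(self):
        self.resolved_path = None
        self._pending = True

    def setup(self, force_setup=False, checkpointless_feasible=None):
        # Phase 1: feasibility unknown and setup not forced -> do nothing yet.
        if self._pending and not force_setup and checkpointless_feasible is None:
            return "deferred"
        self._pending = False
        # Phase 2: resolve a disk path only when peer recovery is ruled out.
        if checkpointless_feasible:
            return "checkpointless"
        self.resolved_path = "ckpt/last.pt"  # illustrative path resolution
        return "disk"

resume = DelayedSetup()
print(resume.setup())                               # deferred
print(resume.setup(checkpointless_feasible=False))  # disk
```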

**Parameters**

Inherits all parameters from **AutoResume**

**Methods**

```
setup(trainer, model=None, force_setup=False)
```

Conditionally delay AutoResume setup to enable checkpointless recovery validation.

**Parameters:**
+ **trainer** (*pytorch\_lightning.Trainer* or *lightning.fabric.Fabric*) – PyTorch Lightning trainer or Fabric instance
+ **model** (*optional*) – Model instance for setup. Default: `None`
+ **force\_setup** (*bool*, *optional*) – If True, bypass delay and execute AutoResume setup immediately. Default: `False`

**Example**

```
from hyperpod_checkpointless_training.nemo_plugins.resume import CheckpointlessAutoResume  
from hyperpod_checkpointless_training.nemo_plugins.megatron_strategy import CheckpointlessMegatronStrategy  
import pytorch_lightning as pl  
  
# Create trainer with checkpointless auto-resume  
trainer = pl.Trainer(  
    strategy=CheckpointlessMegatronStrategy(),  
    resume=CheckpointlessAutoResume()  
)
```

**Notes**
+ Extends NeMo's AutoResume class with delay mechanism for enabling checkpointless recovery
+ Works in conjunction with `CheckpointlessCompatibleConnector` for complete recovery workflow

# Special considerations
<a name="sagemaker-eks-checkpointless-considerations"></a>

We collect certain routine aggregated and anonymized operational metrics to provide essential service availability. The creation of these metrics is fully automated and does not involve human review of the underlying model training workload. These metrics relate to job operations, resource management, and essential service functionality. 

Note that HyperPod checkpointless training is currently incompatible with HyperPod managed tiered checkpointing and with elastic training.

Checkpointless training recipes for GPT OSS 120B and Llama models are provided to simplify getting started. These recipes have been verified on ml.p5 instances. Using other instance types may require additional modifications to the underlying recipes. These recipes can be adapted to full finetuning workflows as well. For custom models, we recommend reviewing the [getting started examples](https://docs.aws.amazon.com/sagemaker-eks-checkpointless-recipes-custom).

# Appendix
<a name="sagemaker-eks-checkpointless-appendix"></a>

**Monitor training results via HyperPod recipes**

SageMaker HyperPod recipes offer TensorBoard integration to analyze training behavior. These recipes also incorporate VizTracer, a low-overhead tool for tracing and visualizing Python code execution. For more information, see [VizTracer](https://github.com/gaogaotiantian/viztracer).

The TensorBoard logs are generated and stored within the `log_dir`. To access and analyze these logs locally, use the following procedure:

1. Download the TensorBoard experiment folder from your training environment to your local machine.

1. Open a terminal or command prompt on your local machine.

1. Navigate to the directory containing the downloaded experiment folder.

1. Launch TensorBoard by running the command:

   ```
   tensorboard --port=<port> --bind_all --logdir experiment
   ```

1. Open your web browser and visit `http://localhost:<port>`.

You can now see the status and visualizations of your training jobs within the TensorBoard interface, which helps you monitor and analyze the training process and gain insights into the behavior and performance of your models. For more information about monitoring and analyzing training with TensorBoard, see the [NVIDIA NeMo Framework User Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/core/exp_manager.html#experiment-manager).

**VizTracer**

To enable VizTracer, you can modify your recipe by setting the environment variable `ENABLE_VIZTRACER` to `1`. After the training has completed, your VizTracer profile is in the experiment folder `log_dir/viztracer_xxx.json`. To analyze your profile, you can download it and open it using the **vizviewer** tool:

```
vizviewer --port <port> viztracer_xxx.json
```

This command launches vizviewer on the specified port. You can view your VizTracer profile by going to `http://localhost:<port>` in your browser and begin analyzing the training. For more information about using VizTracer, see the [VizTracer documentation](https://viztracer.readthedocs.io/en/latest/installation.html).

# Release notes
<a name="sagemaker-eks-checkpointless-release-notes"></a>

See the following release notes to track the latest updates for SageMaker HyperPod checkpointless training.

**The SageMaker HyperPod checkpointless training v1.0.1**

Date: April 10, 2026

**Bug Fixes**
+ Fixed incorrect CUDA device binding in the fault handling thread. The fault handling thread now correctly sets the CUDA device context by using `LOCAL_RANK`. This fix prevents device mismatch errors during in-process fault recovery.

**The SageMaker HyperPod checkpointless training v1.0.0**

Date: December 03, 2025

**SageMaker HyperPod checkpointless training Features**
+ **Collective Communication Initialization Improvements**: Offers novel initialization methods, Rootless and TCPStoreless, for NCCL and Gloo.
+ **Memory-mapped (MMAP) dataloader**: Caches (persists) prefetched batches so that they are available even when a fault causes a restart of the training job.
+ **Checkpointless**: Enables faster recovery from cluster training faults in large-scale distributed training environments by making framework-level optimizations.
+ **Built on NVIDIA NeMo and PyTorch Lightning**: Leverages these powerful frameworks for efficient and flexible model training
  + [NVIDIA NeMo](https://github.com/NVIDIA-NeMo/NeMo)
  + [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/)
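The MMAP dataloader idea — persisting prefetched batches so a restarted process can replay them instead of re-fetching — can be sketched with a simple file-backed cache (illustrative only; the actual implementation memory-maps the cache and is not shown here):

```python
import os
import pickle
import tempfile

class PersistentBatchCache:
    """File-backed stand-in for the MMAP batch cache: prefetched batches are
    persisted so a restarted process can replay them instead of re-fetching."""

    def __init__(self, path):
        self.path = path

    def save(self, batches):
        with open(self.path, "wb") as f:
            pickle.dump(batches, f)

    def load(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, "rb") as f:
            return pickle.load(f)

cache_path = os.path.join(tempfile.mkdtemp(), "batches.bin")
PersistentBatchCache(cache_path).save([[1, 2], [3, 4]])

# A fresh cache object (simulating a restarted process) recovers the batches:
print(PersistentBatchCache(cache_path).load())  # [[1, 2], [3, 4]]
```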

**SageMaker HyperPod Checkpointless training Docker container**

Checkpointless training on HyperPod is built on top of the [NVIDIA NeMo framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html). HyperPod checkpointless training aims to recover faster from cluster training faults in large-scale distributed training environments by making framework-level optimizations, delivered in a base container image with NCCL and PyTorch optimizations.

**Availability**

Currently, images are available only in the following Regions:

```
eu-north-1
ap-south-1
us-east-2
eu-west-1
eu-central-1
sa-east-1
us-east-1
eu-west-2
ap-northeast-1
us-west-2
us-west-1
ap-southeast-1
ap-southeast-2
```

Images are not available in the following three opt-in Regions:

```
ap-southeast-3
ap-southeast-4
eu-south-2
```

**Container details**

Checkpointless training Docker container for PyTorch v2.6.0 with CUDA v12.9

```
963403601044.dkr.ecr.eu-north-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
423350936952.dkr.ecr.ap-south-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
556809692997.dkr.ecr.us-east-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
942446708630.dkr.ecr.eu-west-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
391061375763.dkr.ecr.eu-central-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
311136344257.dkr.ecr.sa-east-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
327873000638.dkr.ecr.us-east-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
016839105697.dkr.ecr.eu-west-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
356859066553.dkr.ecr.ap-northeast-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
920498770698.dkr.ecr.us-west-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
827510180725.dkr.ecr.us-west-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
885852567298.dkr.ecr.ap-southeast-1.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
304708117039.dkr.ecr.ap-southeast-2.amazonaws.com/hyperpod-checkpointless-training:v1.0.1
```

**Pre-installed packages**

```
PyTorch: v2.6.0
CUDA: v12.9
NCCL: v2.27.5
EFA: v1.43.0
AWS-OFI-NCCL: v1.16.0
Libfabric: v2.1
Megatron: v0.15.0
NeMo: v2.6.0rc0
```

# Using GPU partitions in Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-gpu-partitioning"></a>

Cluster administrators can choose how to maximize GPU utilization across their organization. You can enable GPU partitioning with NVIDIA Multi-Instance GPU (MIG) technology to partition GPU resources into smaller, isolated instances for better resource utilization. This capability lets you run multiple smaller tasks concurrently on a single GPU instead of dedicating the entire device to a single, often underutilized task, eliminating wasted compute power and memory.

GPU partitioning with MIG technology allows you to partition a single supported GPU into up to seven separate GPU partitions. Each GPU partition has dedicated memory, cache, and compute resources, providing predictable isolation.

## Benefits
<a name="sagemaker-hyperpod-eks-gpu-partitioning-benefits"></a>
+ **Improved GPU utilization** - Maximize compute efficiency by partitioning GPUs based on compute and memory requirements
+ **Task isolation** - Each GPU partition operates independently with dedicated memory, cache, and compute resources
+ **Task flexibility** - Support a mix of tasks on a single physical GPU, all running in parallel
+ **Flexible setup management** - Support both do-it-yourself (DIY) Kubernetes configuration using the Kubernetes command-line client `kubectl`, and a managed solution that uses custom labels to easily configure and apply the labels associated with GPU partitions

**Important**  
GPU partitioning with MIG is not supported with flexible instance groups (instance groups that use `InstanceRequirements`). To use MIG, create an instance group with a single `InstanceType`.

## Supported Instance Types
<a name="sagemaker-hyperpod-eks-gpu-partitioning-instance-types"></a>

GPU partitioning with MIG technology is supported on the following HyperPod instance types:

**A100 GPU Instances** - [https://aws.amazon.com/ec2/instance-types/p4/](https://aws.amazon.com/ec2/instance-types/p4/)
+ **ml.p4d.24xlarge** - 8 NVIDIA A100 GPUs (40GB HBM2 per GPU)
+ **ml.p4de.24xlarge** - 8 NVIDIA A100 GPUs (80GB HBM2e per GPU)

**H100 GPU Instances** - [https://aws.amazon.com/ec2/instance-types/p5/](https://aws.amazon.com/ec2/instance-types/p5/)
+ **ml.p5.48xlarge** - 8 NVIDIA H100 GPUs (80GB HBM3 per GPU)

**H200 GPU Instances** - [https://aws.amazon.com/ec2/instance-types/p5/](https://aws.amazon.com/ec2/instance-types/p5/)
+ **ml.p5e.48xlarge** - 8 NVIDIA H200 GPUs (141GB HBM3e per GPU)
+ **ml.p5en.48xlarge** - 8 NVIDIA H200 GPUs (141GB HBM3e per GPU)

**B200 GPU Instances** - [https://aws.amazon.com/ec2/instance-types/p6/](https://aws.amazon.com/ec2/instance-types/p6/)
+ **ml.p6b.48xlarge** - 8 NVIDIA B200 GPUs

## GPU partitions
<a name="sagemaker-hyperpod-eks-gpu-partitioning-profiles"></a>

NVIDIA MIG profiles define how GPUs are partitioned. Each profile specifies the compute and memory allocation per MIG instance. The following are the MIG profiles associated with each GPU type:

**A100 GPU (ml.p4d.24xlarge)**


| Profile | Memory (GB) | Instances per GPU | Total per ml.p4d.24xlarge | 
| --- | --- | --- | --- | 
| `1g.5gb` | 5 | 7 | 56 | 
| `2g.10gb` | 10 | 3 | 24 | 
| `3g.20gb` | 20 | 2 | 16 | 
| `4g.20gb` | 20 | 1 | 8 | 
| `7g.40gb` | 40 | 1 | 8 | 

**H100 GPU (ml.p5.48xlarge)**


| Profile | Memory (GB) | Instances per GPU | Total per ml.p5.48xlarge | 
| --- | --- | --- | --- | 
| `1g.10gb` | 10 | 7 | 56 | 
| `1g.20gb` | 20 | 4 | 32 | 
| `2g.20gb` | 20 | 3 | 24 | 
| `3g.40gb` | 40 | 2 | 16 | 
| `4g.40gb` | 40 | 1 | 8 | 
| `7g.80gb` | 80 | 1 | 8 | 

**H200 GPU (ml.p5e.48xlarge and ml.p5en.48xlarge)**


| Profile | Memory (GB) | Instances per GPU | Total per ml.p5en.48xlarge | 
| --- | --- | --- | --- | 
| `1g.18gb` | 18 | 7 | 56 | 
| `1g.35gb` | 35 | 4 | 32 | 
| `2g.35gb` | 35 | 3 | 24 | 
| `3g.71gb` | 71 | 2 | 16 | 
| `4g.71gb` | 71 | 1 | 8 | 
| `7g.141gb` | 141 | 1 | 8 | 
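The per-instance totals in the tables above are simply the instances-per-GPU count multiplied by the 8 GPUs in each instance; a quick sanity check:

```python
# Instances-per-GPU values taken from the H100 (ml.p5.48xlarge) table above.
GPUS_PER_INSTANCE = 8
h100_profiles = {
    "1g.10gb": 7, "1g.20gb": 4, "2g.20gb": 3,
    "3g.40gb": 2, "4g.40gb": 1, "7g.80gb": 1,
}

def total_partitions(instances_per_gpu, gpus=GPUS_PER_INSTANCE):
    """Total MIG partitions on one instance when every GPU uses one profile."""
    return instances_per_gpu * gpus

for profile, per_gpu in h100_profiles.items():
    print(f"{profile}: {total_partitions(per_gpu)} partitions per ml.p5.48xlarge")
```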

**Topics**
+ [Benefits](#sagemaker-hyperpod-eks-gpu-partitioning-benefits)
+ [Supported Instance Types](#sagemaker-hyperpod-eks-gpu-partitioning-instance-types)
+ [GPU partitions](#sagemaker-hyperpod-eks-gpu-partitioning-profiles)
+ [Setting up GPU partitions on Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning-setup.md)
+ [Node Lifecycle and Labels](sagemaker-hyperpod-eks-gpu-partitioning-labels.md)
+ [Task Submission with MIG](sagemaker-hyperpod-eks-gpu-partitioning-task-submission.md)

# Setting up GPU partitions on Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup"></a>

**Topics**
+ [Prerequisites](#sagemaker-hyperpod-eks-gpu-partitioning-setup-prerequisites)
+ [Creating a Cluster with MIG Configuration](#sagemaker-hyperpod-eks-gpu-partitioning-setup-create-cluster)
+ [Adding GPU operator to an existing cluster](#sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator)
+ [Updating MIG Configuration](#sagemaker-hyperpod-eks-gpu-partitioning-setup-update)
+ [Verifying MIG Configuration](#sagemaker-hyperpod-eks-gpu-partitioning-setup-verify)
+ [Common Commands for Debugging MIG Configuration](#sagemaker-hyperpod-eks-gpu-partitioning-setup-debug-commands)
+ [Using SageMaker AI Console](#sagemaker-hyperpod-eks-gpu-partitioning-setup-console)

## Prerequisites
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-prerequisites"></a>
+ HyperPod Amazon EKS cluster with supported GPU instances
+ NVIDIA GPU Operator installed
+ Appropriate IAM permissions for cluster management

## Creating a Cluster with MIG Configuration
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-create-cluster"></a>

### Using AWS CLI
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-create-cluster-cli"></a>

```
aws sagemaker create-cluster \
  --cluster-name my-mig-cluster \
  --orchestrator 'Eks={ClusterArn=arn:aws:eks:region:account:cluster/cluster-name}' \
  --instance-groups '{
    "InstanceGroupName": "gpu-group",
    "InstanceType": "ml.p4d.24xlarge",
    "InstanceCount": 1,
    "LifeCycleConfig": {
       "SourceS3Uri": "s3://my-bucket",
       "OnCreate": "on_create_script.sh"
    },
    "KubernetesConfig": {
       "Labels": {
          "nvidia.com/mig.config": "all-1g.5gb"
       }
    },
    "ExecutionRole": "arn:aws:iam::account:role/execution-role",
    "ThreadsPerCore": 1
  }' \
  --vpc-config '{
     "SecurityGroupIds": ["sg-12345"],
     "Subnets": ["subnet-12345"]
  }' \
  --node-provisioning-mode Continuous
```

### Using CloudFormation
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-create-cluster-cfn"></a>

```
{
  "ClusterName": "my-mig-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "gpu-group",
      "InstanceType": "ml.p4d.24xlarge",
      "InstanceCount": 1,
      "KubernetesConfig": {
        "Labels": {
          "nvidia.com/mig.config": "all-2g.10gb"
        }
      },
      "ExecutionRole": "arn:aws:iam::account:role/execution-role"
    }
  ],
  "Orchestrator": {
    "Eks": {
      "ClusterArn": "arn:aws:eks:region:account:cluster/cluster-name"
    }
  },
  "NodeProvisioningMode": "Continuous"
}
```

## Adding GPU operator to an existing cluster
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator"></a>

### Install GPU Operator
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator-install"></a>

Replace `{$AWS_REGION}` with your cluster region (e.g., us-east-1, us-west-2).

```
helm install gpuo helm_chart/HyperPodHelmChart/charts/gpu-operator \
-f helm_chart/HyperPodHelmChart/charts/gpu-operator/regional-values/values-{$AWS_REGION}.yaml \
-n kube-system
```

### Verify Installation (Wait 2-3 minutes)
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator-verify"></a>

Check all GPU operator pods are running:

```
kubectl get pods -n kube-system | grep -E "(gpu-operator|nvidia-)"
```

**Expected pods:**
+ gpu-operator-\* - 1 instance (cluster controller)
+ nvidia-device-plugin-daemonset-\* - 1 per GPU node (all GPU instances)
+ nvidia-mig-manager-\* - 1 per MIG-capable node (A100/H100)

### Remove Old Device Plugin
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator-remove"></a>

Disable the existing nvidia-device-plugin:

```
helm upgrade dependencies helm_chart/HyperPodHelmChart \
--set nvidia-device-plugin.devicePlugin.enabled=false \
-n kube-system
```

### Verify GPU Resources
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-add-operator-verify-gpu"></a>

Confirm that the nodes show GPU capacity. The output should display `nvidia.com/gpu: 8` (or your actual GPU count).

```
kubectl describe nodes | grep "nvidia.com/gpu"
```

## Updating MIG Configuration
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update"></a>

**Preparing Nodes Before MIG Updates**  
Before updating MIG configurations on your instance group, you must prepare the nodes to prevent workload disruption. Follow these steps to safely drain workloads from the nodes that will be reconfigured.

### Step 1: Identify Nodes in the Instance Group
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-identify"></a>

First, identify all nodes that belong to the instance group you want to update:

```
# List all nodes in the instance group
kubectl get nodes -l sagemaker.amazonaws.com/instance-group-name=INSTANCE_GROUP_NAME

# Example:
kubectl get nodes -l sagemaker.amazonaws.com/instance-group-name=p4d-group
```

This command returns a list of all nodes in the specified instance group. Make note of each node name for the following steps.

### Step 2: Cordon and Drain Each Node
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-drain"></a>

For each node identified in Step 1, perform the following actions:

#### Cordon the Node
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-drain-cordon"></a>

Cordoning prevents new pods from being scheduled on the node:

```
# Cordon a single node
kubectl cordon NODE_NAME

# Example:
kubectl cordon hyperpod-i-014a41a7001adca60
```

#### Drain Workload Pods from the Node
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-drain-evict"></a>

Drain the node to evict all workload pods while preserving system pods:

```
# Drain the node (ignore DaemonSets and evict pods)
kubectl drain NODE_NAME \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=300

# Example:
kubectl drain hyperpod-i-014a41a7001adca60 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --force \
  --grace-period=300
```

**Command Options Explained:**
+ `--ignore-daemonsets` - Allows the drain operation to proceed even if DaemonSet pods are present
+ `--delete-emptydir-data` - Deletes pods using emptyDir volumes (required for draining to succeed)
+ `--force` - Forces deletion of pods not managed by a controller (use with caution)
+ `--grace-period=300` - Gives pods 5 minutes to terminate gracefully
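If you script draining across many nodes, the invocation above can be assembled programmatically. This is an illustrative helper, not part of any HyperPod tooling:

```python
import shlex

def build_drain_command(node_name, grace_period=300):
    """Assemble the kubectl drain invocation described above."""
    args = [
        "kubectl", "drain", node_name,
        "--ignore-daemonsets",     # proceed even if DaemonSet pods are present
        "--delete-emptydir-data",  # evict pods that use emptyDir volumes
        "--force",                 # evict pods not managed by a controller
        f"--grace-period={grace_period}",
    ]
    return shlex.join(args)

print(build_drain_command("hyperpod-i-014a41a7001adca60"))
```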

**Important**  
+ The drain operation may take several minutes depending on the number of pods and their termination grace periods.
+ System pods in the following namespaces will remain running: `kube-system`, `cert-manager`, `kubeflow`, `hyperpod-inference-system`, `kube-public`, `mpi-operator`, `gpu-operator`, `aws-hyperpod`, `jupyter-k8s-system`, `hyperpod-observability`, `kueue-system`, and `keda`.
+ DaemonSet pods will remain on the node (they are ignored by design).

### Step 3: Verify No Workload Pods are Running
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-verify"></a>

After draining, verify that no workload pods remain on the nodes (excluding system namespaces):

```
# Check for any remaining pods outside system namespaces
kubectl get pods --all-namespaces --field-selector spec.nodeName=NODE_NAME \
  | grep -v "kube-system" \
  | grep -v "cert-manager" \
  | grep -v "kubeflow" \
  | grep -v "hyperpod-inference-system" \
  | grep -v "kube-public" \
  | grep -v "mpi-operator" \
  | grep -v "gpu-operator" \
  | grep -v "aws-hyperpod" \
  | grep -v "jupyter-k8s-system" \
  | grep -v "hyperpod-observability" \
  | grep -v "kueue-system" \
  | grep -v "keda"

# Example:
kubectl get pods --all-namespaces --field-selector spec.nodeName=hyperpod-i-014a41a7001adca60 \
  | grep -v "kube-system" \
  | grep -v "cert-manager" \
  | grep -v "kubeflow" \
  | grep -v "hyperpod-inference-system" \
  | grep -v "kube-public" \
  | grep -v "mpi-operator" \
  | grep -v "gpu-operator" \
  | grep -v "aws-hyperpod" \
  | grep -v "jupyter-k8s-system" \
  | grep -v "hyperpod-observability" \
  | grep -v "kueue-system" \
  | grep -v "keda"
```

**Expected Output:** If the node is properly drained, this command should return no results (or only show the header row). If any pods are still running, investigate why they weren't evicted and manually delete them if necessary.
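The same filtering can be done in one pass instead of a chain of `grep -v` calls. This is an illustrative helper; the namespace list mirrors the system namespaces called out in Step 2:

```python
SYSTEM_NAMESPACES = {
    "kube-system", "cert-manager", "kubeflow", "hyperpod-inference-system",
    "kube-public", "mpi-operator", "gpu-operator", "aws-hyperpod",
    "jupyter-k8s-system", "hyperpod-observability", "kueue-system", "keda",
}

def workload_pods(pod_rows):
    """Filter `kubectl get pods -A` rows down to non-system (workload) pods.
    Each row is (namespace, pod_name); illustrative helper, not HyperPod tooling."""
    return [row for row in pod_rows if row[0] not in SYSTEM_NAMESPACES]

rows = [
    ("kube-system", "coredns-abc"),
    ("default", "training-job-0"),
    ("gpu-operator", "nvidia-mig-manager-xyz"),
]
print(workload_pods(rows))  # [('default', 'training-job-0')]
```

A drained node is verified when this filter returns an empty list for that node's pods.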

### Step 4: Verify Node Readiness Status
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-readiness"></a>

Before proceeding with the MIG update, confirm that all nodes are cordoned:

```
# Check node status - should show "SchedulingDisabled"
kubectl get nodes -l sagemaker.amazonaws.com/instance-group-name=INSTANCE_GROUP_NAME
```

Nodes should show `SchedulingDisabled` in the STATUS column, indicating they are cordoned and ready for the MIG update.

### Update MIG Profile on Existing Cluster
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-update-change"></a>

You can change MIG profiles on existing clusters:

```
aws sagemaker update-cluster \
  --cluster-name my-mig-cluster \
  --instance-groups '{
    "InstanceGroupName": "gpu-group",
    "InstanceType": "ml.p4d.24xlarge",
    "InstanceCount": 1,
    "KubernetesConfig": {
       "Labels": {
          "nvidia.com/mig.config": "all-3g.20gb"
       }
    },
    "ExecutionRole": "arn:aws:iam::account:role/execution-role"
  }'
```

**Note**  
If jobs are already running on a node, the MIG partitioning fails, and you receive an error message prompting you to drain the nodes before re-attempting the MIG partitioning.

## Verifying MIG Configuration
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-verify"></a>

After cluster creation or update, verify the MIG configuration:

```
# Update kubeconfig
aws eks update-kubeconfig --name your-eks-cluster --region us-east-2

# Check MIG labels
kubectl get node NODE_NAME -o=jsonpath='{.metadata.labels}' | grep mig

# Check available MIG resources
kubectl describe node NODE_NAME | grep -A 10 "Allocatable:"
```

## Common Commands for Debugging MIG Configuration
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-debug-commands"></a>

Use the following commands to troubleshoot and validate MIG configuration in your cluster:

```
# Check GPU Operator status
kubectl get pods -n gpu-operator-resources

# View MIG configuration
kubectl exec -n gpu-operator-resources nvidia-driver-XXXXX -- nvidia-smi mig -lgi

# Check device plugin configuration
kubectl logs -n gpu-operator-resources nvidia-device-plugin-XXXXX

# Monitor node events
kubectl get events --field-selector involvedObject.name=NODE_NAME
```

**Note**  
Replace `nvidia-driver-XXXXX` and `nvidia-device-plugin-XXXXX` with the actual pod names from your cluster, and `NODE_NAME` with your node's name.

## Using SageMaker AI Console
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-console"></a>

### Creating a New Cluster with MIG
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-console-create"></a>

1. Navigate to **Amazon SageMaker AI** > **HyperPod Clusters** > **Cluster Management** > **Create HyperPod cluster**

1. Select **Orchestrated by EKS**

1. Choose **Custom setup** and verify **GPU Operator** is enabled by default

1. Under the **Instance groups** section, click **Add group**

1. Configure the instance group, then navigate to **Advanced Configuration**, enable the **Use GPU partition** toggle, and choose your desired **MIG configuration** from the dropdown

1. Click **Add Instance group** and complete the remaining cluster configuration

1. Click **Submit** to create the cluster

### Updating MIG Configuration on Existing Cluster
<a name="sagemaker-hyperpod-eks-gpu-partitioning-setup-console-update"></a>

1. Navigate to **Amazon SageMaker AI** > **HyperPod Clusters** > **Cluster Management**

1. Select your existing cluster and click **Edit** on the instance group you want to modify

1. In **Advanced configuration**, toggle **Use GPU partition** if not already enabled and select a different **MIG configuration** from the dropdown

1. Click **Save changes**

# Node Lifecycle and Labels
<a name="sagemaker-hyperpod-eks-gpu-partitioning-labels"></a>

Amazon SageMaker HyperPod performs deep health checks on cluster instances during the creation and update of HyperPod clusters, before GPU partitioning begins. The HyperPod health-monitoring agent then continuously monitors the health status of GPU-partitioned instances.

## MIG Configuration States
<a name="sagemaker-hyperpod-eks-gpu-partitioning-labels-states"></a>

Nodes with GPU partition configuration go through several states:
+ **Pending** - Node is being configured with a MIG profile
+ **Configuring** - GPU Operator is applying MIG partitioning
+ **Success** - GPU partitioning completed successfully
+ **Failed** - GPU partitioning encountered an error

## Monitoring Node States
<a name="sagemaker-hyperpod-eks-gpu-partitioning-labels-monitoring"></a>

```
# Check node health status
kubectl get nodes -l sagemaker.amazonaws.com/node-health-status=Schedulable

# Monitor MIG configuration progress
kubectl get node NODE_NAME -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'

# Check for configuration errors
kubectl describe node NODE_NAME | grep -A 5 "Conditions:"
```
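When scripting against these states, you can read the `nvidia.com/mig.config.state` label directly from a node's labels. A minimal sketch, assuming the label values mirror the states described above (the helper function is illustrative, not part of HyperPod):

```python
def mig_state(labels: dict) -> str:
    """Return the MIG configuration state recorded on a node's labels,
    or 'unknown' if the label is absent."""
    return labels.get("nvidia.com/mig.config.state", "unknown")

# Example labels as they might appear on a successfully partitioned node
labels = {
    "nvidia.com/mig.config": "all-3g.20gb",
    "nvidia.com/mig.config.state": "success",
}
print(mig_state(labels))  # success
```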

## Custom Labels and Taints
<a name="sagemaker-hyperpod-eks-gpu-partitioning-labels-custom"></a>

You can manage MIG configuration with custom labels and taints to label your GPU partitions and apply them across instances:

```
{
  "KubernetesConfig": {
    "Labels": {
      "nvidia.com/mig.config": "all-2g.10gb",
      "task-type": "inference",
      "environment": "production"
    },
    "Taints": [
      {
        "Key": "gpu-task",
        "Value": "mig-enabled",
        "Effect": "NoSchedule"
      }
    ]
  }
}
```
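A workload that should land on these tainted, labeled nodes needs a matching toleration and node selector. The following pod spec fragment is a minimal sketch; the label and taint values match the example above, while the pod name and image are placeholders:

```
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference-pod
spec:
  nodeSelector:
    task-type: inference
  tolerations:
  - key: gpu-task
    operator: Equal
    value: mig-enabled
    effect: NoSchedule
  containers:
  - name: app
    image: my-inference-image:latest
    resources:
      limits:
        nvidia.com/mig-2g.10gb: 1
```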

# Task Submission with MIG
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission"></a>

**Topics**
+ [Using Kubernetes YAML](#sagemaker-hyperpod-eks-gpu-partitioning-task-submission-kubectl)
+ [Using HyperPod CLI](#sagemaker-hyperpod-eks-gpu-partitioning-task-submission-cli)
+ [Model Deployment with MIG](#sagemaker-hyperpod-eks-gpu-partitioning-task-submission-deployment)
+ [Using HyperPod CLI for Model Deployment](#sagemaker-hyperpod-eks-gpu-partitioning-task-submission-hyperpod-cli)

## Using Kubernetes YAML
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-kubectl"></a>

```
apiVersion: batch/v1
kind: Job
metadata:
  name: mig-job
  namespace: default
spec:
  template:
    spec:
      containers:
      - name: pytorch
        image: pytorch/pytorch:latest
        resources:
          requests:
            nvidia.com/mig-1g.5gb: 1
            cpu: "100m"
            memory: "128Mi"
          limits:
            nvidia.com/mig-1g.5gb: 1
      restartPolicy: Never
```

## Using HyperPod CLI
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-cli"></a>

Use the HyperPod CLI to deploy JumpStart models with MIG support. The following example demonstrates the new CLI parameters for GPU partitioning:

```
# Deploy JumpStart model with MIG
hyp create hyp-jumpstart-endpoint \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --instance-type ml.p5.48xlarge \
  --accelerator-partition-type mig-2g.10gb \
  --accelerator-partition-validation True \
  --endpoint-name my-endpoint \
  --tls-certificate-output-s3-uri s3://certificate-bucket/ \
  --namespace default
```

## Model Deployment with MIG
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-deployment"></a>

HyperPod Inference allows deploying models on MIG profiles through Studio Classic, `kubectl`, and the HyperPod CLI. To deploy JumpStart models with `kubectl`, the CRDs provide a `spec.server.acceleratorPartitionType` field that deploys the model to the desired MIG profile. HyperPod runs validations to ensure that models can be deployed on the MIG profile selected in the CRD. If you want to disable the MIG validation checks, set `spec.server.validations.acceleratorPartitionValidation` to `False`.

### JumpStart Models
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-jumpstart"></a>

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: JumpStartModel
metadata:
  name: deepseek-model
  namespace: default
spec:
  sageMakerEndpoint:
    name: deepseek-endpoint
  model:
    modelHubName: SageMakerPublicHub
    modelId: deepseek-llm-r1-distill-qwen-1-5b
  server:
    acceleratorPartitionType: mig-7g.40gb
    instanceType: ml.p4d.24xlarge
```

### Deploy model from Amazon S3 using InferenceEndpointConfig
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-s3"></a>

InferenceEndpointConfig allows you to deploy a custom model from Amazon S3. To deploy a model on MIG, specify the MIG profile in the `requests` and `limits` of `spec.worker.resources`. The following is a simple deployment example:

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: custom-model
  namespace: default
spec:
  replicas: 1
  modelName: my-model
  endpointName: my-endpoint
  instanceType: ml.p4d.24xlarge
  modelSourceConfig:
    modelSourceType: s3
    s3Storage:
      bucketName: my-model-bucket
      region: us-east-2
    modelLocation: model-path
  worker:
    resources:
      requests:
        nvidia.com/mig-3g.20gb: 1
        cpu: "5600m"
        memory: "10Gi"
      limits:
        nvidia.com/mig-3g.20gb: 1
```

### Deploy model from FSx for Lustre using InferenceEndpointConfig
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-fsx"></a>

InferenceEndpointConfig allows you to deploy a custom model from FSx for Lustre. To deploy a model on MIG, specify the MIG profile in the `requests` and `limits` of `spec.worker.resources`. The following is a simple deployment example:

```
apiVersion: inference.sagemaker.aws.amazon.com/v1
kind: InferenceEndpointConfig
metadata:
  name: custom-model
  namespace: default
spec:
  replicas: 1
  modelName: my-model
  endpointName: my-endpoint
  instanceType: ml.p4d.24xlarge
  modelSourceConfig:
    modelSourceType: fsx
    fsxStorage:
      fileSystemId: fs-xxxxx
    modelLocation: location-on-fsx
  worker:
    resources:
      requests:
        nvidia.com/mig-3g.20gb: 1
        cpu: "5600m"
        memory: "10Gi"
      limits:
        nvidia.com/mig-3g.20gb: 1
```

### Using Studio Classic UI
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-studio"></a>

#### Deploying JumpStart Models with MIG
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-studio-deploy"></a>

1. Open **Studio Classic** and navigate to **JumpStart**

1. Browse or search for your desired model (e.g., "DeepSeek", "Llama", etc.)

1. Click on the model card and select **Deploy**

1. In the deployment configuration:
   + Choose **HyperPod** as the deployment target
   + Select your MIG-enabled cluster from the dropdown
   + Under **Instance configuration**:
     + Select instance type (e.g., `ml.p4d.24xlarge`)
     + Choose **GPU Partition Type** from available options
     + Configure **Instance count** and **Auto-scaling** settings

1. Review and click **Deploy**

1. Monitor deployment progress in the **Endpoints** section

#### Model Configuration Options
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-studio-config"></a>

**Endpoint Settings:**
+ **Endpoint name** - Unique identifier for your deployment
+ **Variant name** - Configuration variant (default: AllTraffic)
+ **Instance type** - Must support GPU partition (p series)
+ **MIG profile** - GPU partition
+ **Initial instance count** - Number of instances to deploy
+ **Auto-scaling** - Enable for dynamic scaling based on traffic

**Advanced Configuration:**
+ **Model data location** - Amazon S3 path for custom models
+ **Container image** - Custom inference container (optional)
+ **Environment variables** - Model-specific configurations
+ **Amazon VPC configuration** - Network isolation settings

#### Monitoring Deployed Models
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-studio-monitor"></a>

1. Navigate to **Studio Classic** > **Deployments** > **Endpoints**

1. Select your MIG-enabled endpoint

1. View metrics including:
   + **MIG utilization** - Per GPU partition usage
   + **Memory consumption** - Per GPU partition
   + **Inference latency** - Request processing time
   + **Throughput** - Requests per second

1. Set up **Amazon CloudWatch alarms** for automated monitoring

1. Configure **auto-scaling policies** based on MIG utilization

## Using HyperPod CLI for Model Deployment
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-hyperpod-cli"></a>

### JumpStart Deployment
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-hyperpod-cli-jumpstart"></a>

The HyperPod CLI JumpStart command includes two new fields for MIG support:
+ `--accelerator-partition-type` - Specifies the MIG configuration (for example, `mig-4g.20gb`)
+ `--accelerator-partition-validation` - Validates compatibility between the model and the MIG profile (default: `true`)

```
hyp create hyp-jumpstart-endpoint \
  --version 1.1 \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --instance-type ml.p4d.24xlarge \
  --endpoint-name js-test \
  --accelerator-partition-type "mig-4g.20gb" \
  --accelerator-partition-validation true \
  --tls-certificate-output-s3-uri s3://my-bucket/certs/
```

### Custom Endpoint Deployment
<a name="sagemaker-hyperpod-eks-gpu-partitioning-task-submission-hyperpod-cli-custom"></a>

For deploying via custom endpoint, use the existing fields `--resources-requests` and `--resources-limits` to enable MIG profile functionality:

```
hyp create hyp-custom-endpoint \
  --namespace default \
  --metadata-name deepseek15b-mig-10-14-v2 \
  --endpoint-name deepseek15b-mig-endpoint \
  --instance-type ml.p4d.24xlarge \
  --model-name deepseek15b-mig \
  --model-source-type s3 \
  --model-location deep-seek-15b \
  --prefetch-enabled true \
  --tls-certificate-output-s3-uri s3://sagemaker-bucket \
  --image-uri lmcache/vllm-openai:v0.3.7 \
  --container-port 8080 \
  --model-volume-mount-path /opt/ml/model \
  --model-volume-mount-name model-weights \
  --s3-bucket-name model-storage-123456789 \
  --s3-region us-east-2 \
  --invocation-endpoint invocations \
  --resources-requests '{"cpu":"5600m","memory":"10Gi","nvidia.com/mig-3g.20gb":"1"}' \
  --resources-limits '{"nvidia.com/mig-3g.20gb":"1"}' \
  --env '{
    "OPTION_ROLLING_BATCH":"vllm",
    "SERVING_CHUNKED_READ_TIMEOUT":"480",
    "DJL_OFFLINE":"true",
    "NUM_SHARD":"1",
    "SAGEMAKER_PROGRAM":"inference.py",
    "SAGEMAKER_SUBMIT_DIRECTORY":"/opt/ml/model/code",
    "MODEL_CACHE_ROOT":"/opt/ml/model",
    "SAGEMAKER_MODEL_SERVER_WORKERS":"1",
    "SAGEMAKER_MODEL_SERVER_TIMEOUT":"3600",
    "OPTION_TRUST_REMOTE_CODE":"true",
    "OPTION_ENABLE_REASONING":"true",
    "OPTION_REASONING_PARSER":"deepseek_r1",
    "SAGEMAKER_CONTAINER_LOG_LEVEL":"20",
    "SAGEMAKER_ENV":"1"
  }'
```

# Cluster resiliency features for SageMaker HyperPod cluster orchestration with Amazon EKS
<a name="sagemaker-hyperpod-eks-resiliency"></a>

SageMaker HyperPod provides the following cluster resiliency features. 

**Topics**
+ [Health Monitoring System](sagemaker-hyperpod-eks-resiliency-health-monitoring-agent.md)
+ [Basic health checks](sagemaker-hyperpod-eks-resiliency-basic-health-check.md)
+ [Deep health checks](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md)
+ [Automatic node recovery](sagemaker-hyperpod-eks-resiliency-node-recovery.md)
+ [Resilience-related Kubernetes labels by SageMaker HyperPod](sagemaker-hyperpod-eks-resiliency-node-labels.md)
+ [Manually quarantine, replace, or reboot a node](sagemaker-hyperpod-eks-resiliency-manual.md)
+ [Suggested resilience configurations](sagemaker-hyperpod-eks-resiliency-config-tips.md)

# Health Monitoring System
<a name="sagemaker-hyperpod-eks-resiliency-health-monitoring-agent"></a>

The SageMaker HyperPod health monitoring system includes two components:

1. Monitoring agents installed on your nodes, which include the health-monitoring agent (HMA) that serves as an on-host health monitor, and a set of out-of-node health monitors.

1. The Node Recovery System managed by SageMaker HyperPod. The health monitoring system continuously monitors node health status through the monitoring agents, and then automatically takes action through the Node Recovery System when it detects a fault.

![\[This image illustrates how the health monitoring system integrates with the HyperPod cluster.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-resilience-event.png)


## Health checks done by the SageMaker HyperPod health-monitoring agent
<a name="sagemaker-hyperpod-eks-resiliency-health-monitoring-agent-list-of-checks"></a>

The SageMaker HyperPod health-monitoring agent checks the following.

**NVIDIA GPUs**
+ [DCGM policy violation notifications](https://docs.nvidia.com/datacenter/dcgm/3.0/user-guide/feature-overview.html#notifications)
+ Errors in the `nvidia-smi` output
+ Various errors in the logs generated by the Amazon Elastic Compute Cloud (EC2) platform
+ GPU Count validation — if there’s a mismatch between the expected number of GPUs in a particular instance type (for example: 8 GPUs in ml.p5.48xlarge instance type) and the count returned by `nvidia-smi`, then HMA reboots the node 

**AWS Trainium**
+ Errors in the output from the [AWS Neuron monitor](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html)
+ Outputs generated by the Neuron node problem detector (For more information about the AWS Neuron node problem detector, see [Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters](https://aws.amazon.com/blogs/machine-learning/node-problem-detection-and-recovery-for-aws-neuron-nodes-within-amazon-eks-clusters/).)
+ Various errors in the logs generated by the Amazon EC2 platform
+ Neuron device count validation — if there’s a mismatch between the expected number of Neuron devices in a particular instance type and the count returned by `neuron-ls`, then HMA reboots the node

The above checks are passive, background health checks that HyperPod runs continuously on your nodes. In addition to these checks, HyperPod also runs deep (or active) health checks during the creation and update of HyperPod clusters. Learn more about [Deep health checks](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-resiliency-deep-health-checks.html).

## Fault Detection
<a name="sagemaker-hyperpod-eks-resiliency-health-monitoring-fault-detection"></a>

When SageMaker HyperPod detects a fault, it implements a four-part response:

1. **Node Labels**

   1. Health Status: `sagemaker.amazonaws.com/node-health-status`

   1. Fault Type: `sagemaker.amazonaws.com/fault-types` label for high-level categorization

   1. Fault Reason: `sagemaker.amazonaws.com/fault-reasons` label for detailed fault information

1. **Node Taint**

   1. `sagemaker.amazonaws.com/node-health-status=Unschedulable:NoSchedule`

1. **Node Annotation**

   1. Fault details: `sagemaker.amazonaws.com/fault-details`

   1. Records up to 20 faults with timestamps that occurred on the node

1. **Node Conditions**([Kubernetes Node Condition](https://kubernetes.io/docs/reference/node/node-status/#condition))

   1. Reflects current health status in node conditions:
      + Type: Same as fault type
      + Status: `True`
      + Reason: Same as fault reason
      + LastTransitionTime: Fault occurrence time

![\[This image illustrates how the health monitoring system responds when it detects a fault.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-resilience-workflow.png)
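You can read this four-part response back from a node's metadata. The following is a minimal sketch against a sample node object (the values are hypothetical; the label and annotation keys match those listed above):

```python
import json

PREFIX = "sagemaker.amazonaws.com/"

# Sample node metadata shaped like `kubectl get node -o json`;
# the values are hypothetical, the keys match the labels and
# annotation described above.
node = {
    "metadata": {
        "labels": {
            PREFIX + "node-health-status": "UnschedulablePendingReboot",
            PREFIX + "fault-types": "XidHardwareFailure",
        },
        "annotations": {
            PREFIX + "fault-details": '[{"reason": "XidHardwareFailure"}]',
        },
    }
}

def node_health(meta: dict) -> dict:
    """Summarize HyperPod health labels and the fault-details annotation."""
    labels = meta["metadata"].get("labels", {})
    annotations = meta["metadata"].get("annotations", {})
    return {
        "status": labels.get(PREFIX + "node-health-status"),
        "fault_type": labels.get(PREFIX + "fault-types"),
        "fault_details": json.loads(annotations.get(PREFIX + "fault-details", "[]")),
    }

summary = node_health(node)
print(summary["status"])  # UnschedulablePendingReboot
```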


## Logs generated by the SageMaker HyperPod health-monitoring agent
<a name="sagemaker-hyperpod-eks-resiliency-health-monitoring-agent-health-check-results"></a>

The SageMaker HyperPod health-monitoring agent is an out-of-the-box health check feature that runs continuously on all HyperPod clusters. The health-monitoring agent publishes detected health events on GPU or Trainium instances to CloudWatch under the cluster log group `/aws/sagemaker/Clusters/`.

The detection logs from the HyperPod health-monitoring agent are created as separate log streams named `SagemakerHealthMonitoringAgent` for each node. You can query the detection logs using CloudWatch Logs Insights as follows.

```
fields @timestamp, @message
| filter @message like /HealthMonitoringAgentDetectionEvent/
```

This should return an output similar to the following.

```
2024-08-21T11:35:35.532-07:00
    {"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z","reason":"XidHardwareFailure","message":"Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2024-08-21T11:35:35.532-07:00
    {"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z","reason":"XidHardwareFailure","message":"Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
```
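Each detection entry is a JSON object, so you can post-process exported log lines to pull out the fault reason and message. A minimal sketch (the parsing helper is ours; the sample line is abbreviated from the output above, including the literal `"details: "` key as emitted in the logs):

```python
import json

# A detection log entry, abbreviated from the sample output above.
line = ('{"level":"info","ts":"2024-08-21T18:35:35Z",'
        '"msg":"NPD caught event: %v",'
        '"details: ":{"severity":"warn","reason":"XidHardwareFailure",'
        '"message":"NVLink: fatal error detected on link 6"},'
        '"HealthMonitoringAgentDetectionEvent":"HealthEvent"}')

def parse_detection(entry: str):
    """Return (reason, message) from one HMA detection log entry."""
    event = json.loads(entry)
    details = event["details: "]  # key name as emitted in the logs
    return details["reason"], details["message"]

reason, message = parse_detection(line)
print(reason)  # XidHardwareFailure
```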

# Basic health checks
<a name="sagemaker-hyperpod-eks-resiliency-basic-health-check"></a>

SageMaker HyperPod performs a set of *basic health checks* on cluster instances during the creation and update of HyperPod clusters. These basic health checks are orchestrator-agnostic, so these checks are applicable regardless of the underlying orchestration platforms supported by SageMaker HyperPod (Amazon EKS or Slurm).

The basic health checks monitor cluster instances for issues related to devices such as accelerators (GPU and Trainium cores) and network devices (Elastic Fabric Adapter, or EFA). To find the list of basic cluster health checks, see [Cluster health checks](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm.html#sagemaker-hyperpod-resiliency-slurm-cluster-health-check).

# Deep health checks
<a name="sagemaker-hyperpod-eks-resiliency-deep-health-checks"></a>

SageMaker HyperPod performs *deep health checks* on cluster instances during the creation and update of HyperPod clusters. You can also request deep health checks on-demand for a SageMaker HyperPod cluster using [StartClusterHealthCheck](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StartClusterHealthCheck.html) API. The deep health checks ensure the reliability and stability of the SageMaker HyperPod clusters by testing the underlying hardware and infrastructure components. This proactive approach helps identify and mitigate potential issues early in the cluster lifecycle.

## List of deep health checks done by SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-resiliency-deep-health-checks-list"></a>

SageMaker HyperPod runs the following deep health checks.

**Instance-level deep health checks**


| Category | Utility name | Instance type compatibility | Description | 
| --- | --- | --- | --- | 
| Accelerator | GPU/NVLink count | GPU | Verifies GPU/NVLink counts. | 
| Accelerator | [DCGM diagnostics](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/dcgm-diagnostics.html) level 4 | GPU | Assesses the health and functionality of NVIDIA GPUs by running DCGM (NVIDIA Data Center GPU Manager) diagnostics at level 4, including additional memory tests. | 
| Accelerator | Neuron sysfs | Trainium | For Trainium-powered instances, the health of the Neuron devices is determined by reading counters from [Neuron sysfs](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-sysfs-user-guide.html) propagated directly by the Neuron driver. | 
| Accelerator | Neuron hardware check | Trainium | Runs a training workload and verifies the results to test the hardware. | 
| Accelerator | NCCOM local test | Trainium | Evaluates the performance of collective communication operations on single Trainium nodes | 
| Network | EFA | GPU and Trainium | Runs latency and bandwidth benchmarking on the attached EFA device. | 

**Cluster-level deep health checks**


| Category | Utility name | Instance type compatibility | Description | 
| --- | --- | --- | --- | 
| Accelerator | NCCL test | GPU | Verifies the performance of collective communication operations on multiple NVIDIA GPUs | 
| Accelerator | NCCOM cluster test | Trainium | Verifies the performance of collective communication operations on multiple Trainium nodes | 

**Deep health checks with flexible instance groups**  
For instance groups that use `InstanceRequirements` with multiple instance types, deep health checks behave as follows:
+ Instance-level deep health checks run only on eligible GPU instance types. CPU instance types within a flexible instance group are skipped.
+ Cluster-level connectivity tests (such as NCCL AllReduce) run only between instances of the same type within the instance group. This ensures accurate test results that reflect the networking capabilities of each instance type.
+ If deep health checks are enabled, at least one instance type in the flexible instance group must support deep health checks.
## Logs from the deep health checks
<a name="sagemaker-hyperpod-eks-resiliency-deep-health-checks-log"></a>

The following are example logs from the SageMaker HyperPod deep health checks.

**Cluster-level logs** 

The cluster-level deep health check logs are stored in your CloudWatch log group at `/aws/sagemaker/Clusters/<cluster_name>/<cluster_id>`.

The log streams are logged at `DeepHealthCheckResults/<log_stream_id>`.

As shown in the following example, the deep health check output logs identify the faulty instance and the cause of the failure.

```
{
    "level": "error",
    "ts": "2024-06-18T21:15:22Z",
    "msg": "Encountered FaultyInstance. Replace the Instance. Region: us-west-2, InstanceType: p4d.24xlarge. ERROR:Bandwidth has less than threshold: Expected minimum threshold :80,NCCL Test output Bw: 30"
}
```

**Instance-level logs** 

The instance-level deep health check logs are stored at `/var/log/aws/clusters/sagemaker-deep-health-check.log` on each node. SSH into the node and open the log file by running the following command.

```
cat /var/log/aws/clusters/sagemaker-deep-health-check.log
```

The following is an example output of the hardware stress, [NVIDIA DCGM](https://developer.nvidia.com/dcgm) stress, and EFA connectivity test.

```
# Hardware Stress Test output

2024-08-20T21:53:58Z info Executing Hardware stress check with command: stress-ng, and args: [--cpu 32 --vm 2 --hdd 1 --fork 8 --switch 4 --timeout 60 --metrics]

2024-08-20T21:54:58Z info stress-ng success

2024-08-20T21:54:58Z    info    GpuPci Count check success

# DCGM Stress Test

2024-08-20T22:25:02Z    info    DCGM diagnostic health summary: dcgmCheckLevel: 0 dcgmVersion: 3.3.7 gpuDriverVersion: 535.183.01, gpuDeviceIds: [2237] replacementRequired: false rebootRequired:false

# EFA Loopback Test

2024-08-20T22:26:28Z    info    EFA Loopback check passed for device: rdmap0s29 . Output summary is MaxBw: 58.590000, AvgBw: 32.420000, MaxTypicalLat: 30.870000, MinTypicalLat: 20.080000, AvgLat: 21.630000
```

The following is an example output of the NCCL connectivity test.

```
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong

#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

           8             2     float     sum      -1    353.9    0.00    0.00      0    304.2    0.00    0.00      0
          16             4     float     sum      -1    352.8    0.00    0.00      0    422.9    0.00    0.00      0
          32             8     float     sum      -1    520.0    0.00    0.00      0    480.3    0.00    0.00      0
          64            16     float     sum      -1    563.0    0.00    0.00      0    416.1    0.00    0.00      0
         128            32     float     sum      -1    245.1    0.00    0.00      0    308.4    0.00    0.00      0
         256            64     float     sum      -1    310.8    0.00    0.00      0    304.9    0.00    0.00      0
         512           128     float     sum      -1    304.9    0.00    0.00      0    300.8    0.00    0.00      0
        1024           256     float     sum      -1    509.3    0.00    0.00      0    495.4    0.00    0.00      0
        2048           512     float     sum      -1    530.3    0.00    0.00      0    420.0    0.00    0.00      0
        4096          1024     float     sum      -1    391.2    0.01    0.01      0    384.5    0.01    0.01      0
        8192          2048     float     sum      -1    328.5    0.02    0.02      0    253.2    0.03    0.03      0
       16384          4096     float     sum      -1    497.6    0.03    0.03      0    490.9    0.03    0.03      0
       32768          8192     float     sum      -1    496.7    0.07    0.07      0    425.0    0.08    0.08      0
       65536         16384     float     sum      -1    448.0    0.15    0.15      0    501.0    0.13    0.13      0
      131072         32768     float     sum      -1    577.4    0.23    0.23      0    593.4    0.22    0.22      0
      262144         65536     float     sum      -1    757.8    0.35    0.35      0    721.6    0.36    0.36      0
      524288        131072     float     sum      -1   1057.1    0.50    0.50      0   1019.1    0.51    0.51      0
     1048576        262144     float     sum      -1   1460.5    0.72    0.72      0   1435.6    0.73    0.73      0
     2097152        524288     float     sum      -1   2450.6    0.86    0.86      0   2583.1    0.81    0.81      0
     4194304       1048576     float     sum      -1   4344.5    0.97    0.97      0   4419.3    0.95    0.95      0
     8388608       2097152     float     sum      -1   8176.5    1.03    1.03      0   8197.8    1.02    1.02      0
    16777216       4194304     float     sum      -1    15312    1.10    1.10      0    15426    1.09    1.09      0
    33554432       8388608     float     sum      -1    30149    1.11    1.11      0    29941    1.12    1.12      0
    67108864      16777216     float     sum      -1    57819    1.16    1.16      0    58635    1.14    1.14      0
   134217728      33554432     float     sum      -1   115699    1.16    1.16      0   115331    1.16    1.16      0
   268435456      67108864     float     sum      -1   227507    1.18    1.18      0   228047    1.18    1.18      0
   536870912     134217728     float     sum      -1   453751    1.18    1.18      0   456595    1.18    1.18      0
  1073741824     268435456     float     sum      -1   911719    1.18    1.18      0   911808    1.18    1.18      0
  2147483648     536870912     float     sum      -1  1804971    1.19    1.19      0  1806895    1.19    1.19      0

2024-08-20T16:22:43.831-07:00

# Out of bounds values : 0 OK

2024-08-20T16:22:43.831-07:00

# Avg bus bandwidth    : 0.488398 

2024-08-20T23:22:43Z    info    Nccl test successful. Summary: NcclMaxAlgoBw: 1.190000, NcclAvgAlgoBw: 0.488398, NcclThresholdAlgoBw: 1.180000, NcclOutOfBoundError: OK, NcclOperations: all_reduce_perf, NcclTotalDevices: 2, NcclNodes: 2, NcclClusterMessage:
```
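The final summary line is key-value formatted, so you can extract the measured bandwidth figures and compare them against the reported threshold. A minimal sketch (the parser is ours, written against the summary line shown above):

```python
import re

# The final summary line from the NCCL connectivity test output above
summary = ("Nccl test successful. Summary: NcclMaxAlgoBw: 1.190000, "
           "NcclAvgAlgoBw: 0.488398, NcclThresholdAlgoBw: 1.180000, "
           "NcclOutOfBoundError: OK, NcclOperations: all_reduce_perf, "
           "NcclTotalDevices: 2, NcclNodes: 2")

def nccl_metrics(line: str) -> dict:
    """Extract the numeric Nccl*Bw metrics from a deep health check summary."""
    return {k: float(v) for k, v in re.findall(r"(Nccl\w+Bw): ([\d.]+)", line)}

m = nccl_metrics(summary)
# A plausible pass condition: the measured max algorithm bandwidth
# meets the reported threshold (consistent with the failure log shown
# earlier, which flags bandwidth below a minimum threshold).
print(m["NcclMaxAlgoBw"] >= m["NcclThresholdAlgoBw"])  # True
```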

# Automatic node recovery
<a name="sagemaker-hyperpod-eks-resiliency-node-recovery"></a>

During cluster creation or update, cluster admin users can select the node (instance) recovery option between `Automatic` (Recommended) and `None` at the cluster level. If set to `Automatic`, SageMaker HyperPod reboots or replaces faulty nodes automatically. 

**Important**  
We recommend setting the `Automatic` option.

Automatic node recovery runs when issues are found by the health-monitoring agent, basic health checks, or deep health checks. If set to `None`, the health-monitoring agent labels the instances when a fault is detected, but it does not automatically initiate any repair or recovery actions on the affected nodes. This option is not recommended.

# Resilience-related Kubernetes labels by SageMaker HyperPod
<a name="sagemaker-hyperpod-eks-resiliency-node-labels"></a>

*Labels* are key-value pairs that are attached to [Kubernetes objects](https://kubernetes.io/docs/concepts/overview/working-with-objects/#kubernetes-objects). SageMaker HyperPod introduces the following labels for the health checks it provides.

## Node health status labels
<a name="sagemaker-hyperpod-eks-resiliency-node-labels-health-status"></a>

The `node-health-status` labels represent the health status of a node and can be used as a node selector filter to target healthy nodes.


| Label | Description | 
| --- | --- | 
| sagemaker.amazonaws.com/node-health-status: Schedulable | The node has passed basic health checks and is available for running workloads. This health check is the same as the [currently available SageMaker HyperPod resiliency features for Slurm clusters](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm.html). | 
| sagemaker.amazonaws.com/node-health-status: Unschedulable | The node is running deep health checks and is not available for running workloads. | 
| sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReplacement | The node has failed deep health checks or health-monitoring agent checks and requires a replacement. If automatic node recovery is enabled, the node will be automatically replaced by SageMaker HyperPod. | 
| sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReboot | The node has failed deep health checks or health-monitoring agent checks and requires a reboot. If automatic node recovery is enabled, the node will be automatically rebooted by SageMaker HyperPod. | 
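For example, a pod spec can use a node selector so that workloads schedule only onto nodes that have passed basic health checks; the pod name and image below are placeholders:

```
apiVersion: v1
kind: Pod
metadata:
  name: training-pod
spec:
  nodeSelector:
    sagemaker.amazonaws.com/node-health-status: Schedulable
  containers:
  - name: app
    image: my-training-image:latest
```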

## Deep health check labels
<a name="sagemaker-hyperpod-eks-resiliency-node-labels-deep-health-check"></a>

The `deep-health-check-status` labels represent the progress of deep health checks on a specific node. These labels help Kubernetes users quickly filter nodes by the overall progress of deep health checks.


| Label | Description | 
| --- | --- | 
| sagemaker.amazonaws.com/deep-health-check-status: InProgress | The node is running deep health checks and is not available for running workloads. | 
| sagemaker.amazonaws.com/deep-health-check-status: Passed | The node has successfully completed deep health checks and health-monitoring agent checks, and is available for running workloads. | 
| sagemaker.amazonaws.com/deep-health-check-status: Failed | The node has failed deep health checks or health-monitoring agent checks and requires a reboot or replacement. If automatic node recovery is enabled, the node will be automatically rebooted or replaced by SageMaker HyperPod. | 

## Fault type and reason labels
<a name="sagemaker-hyperpod-eks-resiliency-node-labels-fault-type-and-reason"></a>

The following describes the `fault-type` and `fault-reason` labels.
+ `fault-type` labels represent high-level fault categories when health checks fail. These are populated for failures identified during both deep health and health-monitoring agent checks.
+ `fault-reason` labels represent the detailed fault reason associated with a `fault-type`.
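As an illustration of working with these labels, the following sketch tallies faulty nodes by their fault-type label from output shaped like `kubectl get nodes -o json`. The exact label key and the `GpuError` value are assumptions for the example, not documented values:

```python
import json
from collections import Counter

# Assumed label key, following the prefix used by the other HyperPod node labels.
FAULT_TYPE_LABEL = "sagemaker.amazonaws.com/fault-type"

# Hypothetical sample in the shape of `kubectl get nodes -o json`.
nodes_json = json.loads("""
{
  "items": [
    {"metadata": {"name": "node-a", "labels": {"sagemaker.amazonaws.com/fault-type": "GpuError"}}},
    {"metadata": {"name": "node-b", "labels": {"sagemaker.amazonaws.com/fault-type": "GpuError"}}},
    {"metadata": {"name": "node-c", "labels": {}}}
  ]
}
""")

def count_fault_types(nodes):
    """Tally nodes by their fault-type label, skipping nodes without one."""
    counts = Counter()
    for node in nodes["items"]:
        fault = node["metadata"].get("labels", {}).get(FAULT_TYPE_LABEL)
        if fault:
            counts[fault] += 1
    return dict(counts)

print(count_fault_types(nodes_json))  # {'GpuError': 2}
```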

## How SageMaker HyperPod labels nodes
<a name="sagemaker-hyperpod-eks-resiliency-node-how-it-labels"></a>

The following topics cover how labeling is done depending on various cases.

**Topics**
+ [When a node is added to a SageMaker HyperPod cluster with deep health check config disabled](#sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-dhc-is-off)
+ [When a node is added to a SageMaker HyperPod cluster with deep health check config enabled](#sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-dhc-is-on)
+ [When there are any compute failures on nodes](#sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-node-fails)

### When a node is added to a SageMaker HyperPod cluster with deep health check config disabled
<a name="sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-dhc-is-off"></a>

When a new node is added into a cluster, and if deep health check is not enabled for the instance group, SageMaker HyperPod runs the same health checks as the [currently available SageMaker HyperPod health checks for Slurm clusters](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-resiliency-slurm.html). 

If the health check passes, the nodes will be marked with the following label.

```
sagemaker.amazonaws.com/node-health-status: Schedulable
```

If the health check doesn't pass, the nodes will be terminated and replaced. This behavior is the same as the way SageMaker HyperPod health check works for Slurm clusters. 

### When a node is added to a SageMaker HyperPod cluster with deep health check config enabled
<a name="sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-dhc-is-on"></a>

When a new node is added into a SageMaker HyperPod cluster, and if the deep health check test is enabled for the instance group, HyperPod first taints the node and starts the 1-2 hour deep health check (stress test) on the node. There are three possible node label outcomes after the deep health check. 

1. When the deep health check test passes

   ```
   sagemaker.amazonaws.com/node-health-status: Schedulable
   ```

1. When the deep health check test fails, and the instance needs to be replaced

   ```
   sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReplacement
   ```

1. When the deep health check test fails, and the instance needs to be rebooted to rerun the deep health check

   ```
   sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReboot
   ```

If an instance fails the deep health check test, it is rebooted or replaced as indicated by its label. If the deep health check test succeeds, the taint on the node is removed.

### When there are any compute failures on nodes
<a name="sagemaker-hyperpod-eks-resiliency-node-how-it-labels-when-node-fails"></a>

The SageMaker HyperPod health-monitoring agent also continuously monitors the health status of each node. When it detects a failure (such as a GPU failure or driver crash), the agent marks the node with one of the following labels.

1. When the node is unhealthy and needs to be replaced

   ```
   sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReplacement
   ```

1. When the node is unhealthy and needs to be rebooted

   ```
   sagemaker.amazonaws.com/node-health-status: UnschedulablePendingReboot
   ```

The health-monitoring agent also taints the node when it detects any node health issues.

# Manually quarantine, replace, or reboot a node
<a name="sagemaker-hyperpod-eks-resiliency-manual"></a>

Learn how to manually quarantine, replace, and reboot a faulty node in SageMaker HyperPod clusters orchestrated with Amazon EKS.

**To quarantine a node and force delete a training pod**

```
kubectl cordon <node-name>
```

After quarantining the node, force delete the pod. This is useful when a pod is stuck terminating for more than 30 minutes or when `kubectl describe pod` shows `Node is not ready` in the Events section.

```
kubectl delete pods <pod-name> --grace-period=0 --force
```

SageMaker HyperPod offers two methods for manual node recovery. The preferred approach is using the SageMaker HyperPod Reboot and Replace APIs, which provide a faster and more transparent recovery process that works across all orchestrators. Alternatively, you can use kubectl commands to label nodes for reboot and replace operations. Both methods activate the same SageMaker HyperPod recovery processes.

**To reboot a node using the Reboot API**

To reboot a node, use the `BatchRebootClusterNodes` API.

The following example runs the reboot operation on two instances of a cluster using the AWS Command Line Interface:

```
aws sagemaker batch-reboot-cluster-nodes \
    --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
    --node-ids i-0123456789abcdef0 i-0fedcba9876543210
```

**To replace a node using the Replace API**

To replace a node, use the `BatchReplaceClusterNodes` API.

The following example runs the replace operation on two instances of a cluster using the AWS Command Line Interface:

```
aws sagemaker batch-replace-cluster-nodes \
    --cluster-name arn:aws:sagemaker:ap-northeast-1:123456789:cluster/test-cluster \
    --node-ids i-0123456789abcdef0 i-0fedcba9876543210
```

**Karpenter-managed clusters**  
For SageMaker HyperPod clusters using Karpenter for node provisioning, the `BatchReplaceClusterNodes` API does not guarantee that a replacement node will be created. The specified node *will* be terminated, but replacement depends on Karpenter's pod-demand-based provisioning model. Karpenter only creates new nodes when there are pods in a `Pending` state that cannot be scheduled on existing nodes.  
If the workload from the deleted node can be rescheduled onto remaining nodes in the cluster (for example, if those nodes have sufficient capacity), Karpenter does not provision a replacement. To ensure a replacement node is created, verify that your workload configuration (such as pod anti-affinity rules or resource requests) requires a new node for the displaced pods.  
We are aware of this limitation and are actively working on a solution to enforce node replacement when requested through the API.

**To replace a node using kubectl**

Label the node to replace with `sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement`, which triggers the SageMaker HyperPod [Automatic node recovery](sagemaker-hyperpod-eks-resiliency-node-recovery.md). Note that you also need to activate automatic node recovery during cluster creation or update.

```
kubectl label nodes <node-name> \
   sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement
```

**To reboot a node using kubectl**

Label the node to reboot with `sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot`, which triggers the SageMaker HyperPod [Automatic node recovery](sagemaker-hyperpod-eks-resiliency-node-recovery.md). Note that you also need to activate automatic node recovery during cluster creation or update.

```
kubectl label nodes <node-name> \
   sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot
```

After the `UnschedulablePendingReplacement` or `UnschedulablePendingReboot` label is applied, you should see the node terminated or rebooted within a few minutes. 

# Suggested resilience configurations
<a name="sagemaker-hyperpod-eks-resiliency-config-tips"></a>

When deep health checks are enabled, whenever a new instance is added to the HyperPod cluster (either during cluster creation or through automatic node replacement), the new instance goes through the deep health check process (instance-level stress tests) for about 1-2 hours. The following are suggested resilience configuration combinations for common cases.

1. **Case**: You have additional spare nodes within a cluster as backup resources (you are not using the full capacity), or you can wait about 2 hours for the deep health check process to obtain less error-prone instances.

   **Recommendation**: Enable the deep health check config throughout the cluster lifecycle. Node auto-recovery config is enabled by default.

1. **Case**: When you don't have additional backup nodes (capacity is fully used for some training load). You want to get the replacement nodes as soon as possible to resume the training job. 

   **Recommendation**: Enable the deep health check during cluster creation, then turn off the deep health check config after the cluster is created. The node auto-recovery config is enabled by default.

1. **Case**: You don't have additional backup nodes, and you don't want to wait for the 1-2 hour deep health check process (small clusters).

   **Recommendation**: Disable the deep health check config throughout the cluster lifecycle. The node auto-recovery config is enabled by default.

If you want to resume the training job from a failure immediately, make sure that you have additional spare nodes as backup resources in the cluster.

# Spot instances in Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-spot"></a>

Amazon SageMaker HyperPod supports Amazon EC2 Spot Instances, enabling significant cost savings for fault-tolerant and stateless AI/ML workloads. Use cases include batch inference and training jobs, hyperparameter tuning, and experimental workloads. You can also use Spot Instances to automatically scale your compute capacity when this low-cost capacity is available and scale back to On-Demand capacity when the added Spot capacity is reclaimed.

By default, Spot Instances on HyperPod work with HyperPod’s [continuous provisioning feature](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-scaling-eks.html), which enables SageMaker HyperPod to automatically provision remaining capacity in the background while workloads start immediately on available instances. When node provisioning encounters failures due to capacity constraints or other issues, SageMaker HyperPod automatically retries in the background until clusters reach their desired scale, so your autoscaling operations remain resilient and non-blocking. You can also use Spot Instances with [Karpenter-based](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling.html) autoscaling.

**Key Capabilities & Concepts to consider**
+ Capture up to 90% cost savings compared to On-Demand instances
+ Use Spot Instances for jobs that can handle interruptions and where job start and completion times are flexible
+ When using Karpenter for automatic scaling, you can configure HyperPod to automatically fall back to On-Demand when Spot capacity is interrupted or unavailable
+ Access a wide range of CPU, GPU, and accelerator instance types supported by HyperPod
+ Capacity availability depends on supply from EC2 and varies by region and instance type
+ You can estimate the likelihood of obtaining desired instances, or of being interrupted, using tools such as the [Spot Instance Advisor](https://aws.amazon.com/ec2/spot/instance-advisor/) provided by EC2

## Getting started
<a name="sagemaker-hyperpod-spot-instance-getstart"></a>

### Prerequisites
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq"></a>

Before you begin, ensure you have:

#### AWS CLI installed and configured
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq-cli"></a>

Set up your AWS credentials and region:

```
aws configure
```

Refer to the [AWS credentials documentation](https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html) for detailed instructions.

#### IAM Role for SageMaker HyperPod execution
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq-iam"></a>

To update the cluster, you must first create [AWS Identity and Access Management](https://aws.amazon.com/iam/) (IAM) permissions for Karpenter. For instructions, see [Create an IAM role for HyperPod autoscaling with Karpenter](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling-iam.html).

#### VPC and EKS Cluster Setup
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq-cluster"></a>

**2.1 Create VPC and EKS Cluster**

Follow the [HyperPod EKS setup guide](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-install-packages-using-helm-chart.html) to:

1. Create a VPC with subnets in multiple Availability Zones

1. Create an EKS cluster

1. Install [required dependencies](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-install-packages-using-helm-chart.html) using Helm charts

**2.2 Set Environment Variables**

```
export EKS_CLUSTER_ARN="arn:aws:eks:REGION:ACCOUNT_ID:cluster/CLUSTER_NAME"
export EXECUTION_ROLE="arn:aws:iam::ACCOUNT_ID:role/SageMakerExecutionRole"
export BUCKET_NAME="your-s3-bucket-name"
export SECURITY_GROUP="sg-xxxxx"
export SUBNET="subnet-xxxxx"
export SUBNET1="subnet-xxxxx"
export SUBNET2="subnet-xxxxx"
export SUBNET3="subnet-xxxxx"
```

#### Service quotas for the Spot instances
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq-quota"></a>

Verify you have the required quotas for the instances you will create in the SageMaker HyperPod cluster. To review your quotas, open the Service Quotas console, choose AWS services in the navigation pane, then choose SageMaker. For example, the following screenshot shows the available quota for c5 instances.

![\[An image containing cost region information.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Screenshot-cluster-quota.png)


#### Check Spot Availability
<a name="sagemaker-hyperpod-spot-instance-getstart-prereq-availability"></a>

Before creating Spot instance groups, check availability in different Availability Zones:

```
aws ec2 get-spot-placement-scores \
  --region us-west-2 \
  --instance-types c5.2xlarge \
  --target-capacity 10 \
  --single-availability-zone \
  --region-names us-west-2
```

**Tip**: Target Availability Zones with higher placement scores for better availability. You can also check the Spot Instance Advisor and EC2 Spot pricing for availability. Select an Availability Zone with a better availability score, and configure the instance group with the associated subnet to launch instances in that AZ.
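As a sketch of acting on this tip programmatically, the following example picks the Availability Zone with the highest score from a response shaped like the `get-spot-placement-scores` output; the sample scores are illustrative:

```python
import json

# Hypothetical response in the shape returned by
# `aws ec2 get-spot-placement-scores` (scores range from 1 to 10).
response = json.loads("""
{
  "SpotPlacementScores": [
    {"Region": "us-west-2", "AvailabilityZoneId": "usw2-az1", "Score": 5},
    {"Region": "us-west-2", "AvailabilityZoneId": "usw2-az2", "Score": 8},
    {"Region": "us-west-2", "AvailabilityZoneId": "usw2-az3", "Score": 3}
  ]
}
""")

def best_zone(scores):
    """Return the Availability Zone ID with the highest placement score."""
    top = max(scores["SpotPlacementScores"], key=lambda s: s["Score"])
    return top["AvailabilityZoneId"]

print(best_zone(response))  # usw2-az2
```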

### Creating an Instance Group (No Autoscaling)
<a name="sagemaker-hyperpod-spot-instance-getstart-create"></a>

**CreateCluster (Spot)**

```
aws sagemaker create-cluster \
    --cluster-name clusterNameHere \
    --orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
    --node-provisioning-mode "Continuous" \
    --cluster-role 'arn:aws:iam::YOUR-ACCOUNT-ID:role/SageMakerHyperPodRole' \
    --instance-groups '[{
        "InstanceGroupName": "auto-spot-c5-2x-az1",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 2,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
        "SourceS3Uri": "s3://'$BUCKET_NAME'",
        "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET1'"]
         }
    }]' \
    --vpc-config '{
        "SecurityGroupIds": ["'$SECURITY_GROUP'"],
        "Subnets": ["'$SUBNET'"] 
    }'
```

**Update Cluster (Spot + On-Demand)**

```
aws sagemaker update-cluster \
   --cluster-name "my-cluster" \
   --instance-groups '[{
        "InstanceGroupName": "auto-spot-c5-x-az3",
        "InstanceType": "ml.c5.xlarge",
        "InstanceCount": 2,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$BUCKET_NAME'",
            "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET3'"]
        }
    },
    {
        "InstanceGroupName": "auto-spot-c5-2x-az2",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 2,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
        "SourceS3Uri": "s3://'$BUCKET_NAME'",
        "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET2'"]
         }
    },
    {   
        "InstanceGroupName": "auto-ondemand-c5-2x-az1",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 2,
        "LifeCycleConfig": {
        "SourceS3Uri": "s3://'$BUCKET_NAME'",
        "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET1'"]
         }
    }]'
```

`CapacityRequirements` cannot be modified after an instance group is created.

**Describe Cluster**

```
aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --region us-west-2
```

```
## Sample Response
{
    "ClusterName": "my-cluster",
    "InstanceGroups": [
        {
            "InstanceGroupName": "ml.c5.2xlarge",
            "InstanceType": "ml.c5.xlarge",
            "InstanceCount": 5,
            "CurrentCount": 3,
            "CapacityRequirements": { "Spot": {} },
            "ExecutionRole": "arn:aws:iam::account:role/SageMakerExecutionRole",
            "InstanceStorageConfigs": [...],
            "OverrideVpcConfig": {...}
        }
        // Other IGs
    ]
}
```
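As a sketch, you can compare `CurrentCount` against the desired `InstanceCount` in a response shaped like the sample above to see which groups are still provisioning; the group names here are illustrative:

```python
# Hypothetical cluster description shaped like the describe-cluster sample.
cluster = {
    "ClusterName": "my-cluster",
    "InstanceGroups": [
        {"InstanceGroupName": "spot-group", "InstanceCount": 5, "CurrentCount": 3},
        {"InstanceGroupName": "ondemand-group", "InstanceCount": 2, "CurrentCount": 2},
    ],
}

def pending_groups(cluster):
    """List instance groups still provisioning toward their desired count."""
    return [
        g["InstanceGroupName"]
        for g in cluster["InstanceGroups"]
        if g["CurrentCount"] < g["InstanceCount"]
    ]

print(pending_groups(cluster))  # ['spot-group']
```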

**DescribeClusterNode**

```
aws sagemaker describe-cluster-node --cluster-name $HP_CLUSTER_NAME --node-id i-1234567890abcdef1 --region us-west-2
```

```
## Sample Response
{
  "NodeDetails": {
    "InstanceId": "i-1234567890abcdef1",
    "InstanceGroupName": "ml.c5.2xlarge",
    "CapacityType": "Spot",
    "InstanceStatus": {...}
  }
}
```
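Building on the `CapacityType` field shown above, the following sketch counts Spot nodes per instance group from a list of node summaries; the node data is illustrative:

```python
# Hypothetical node summaries in the shape of the NodeDetails sample above.
nodes = [
    {"InstanceId": "i-01", "InstanceGroupName": "spot-group", "CapacityType": "Spot"},
    {"InstanceId": "i-02", "InstanceGroupName": "spot-group", "CapacityType": "Spot"},
    {"InstanceId": "i-03", "InstanceGroupName": "ondemand-group", "CapacityType": "OnDemand"},
]

def spot_count_by_group(nodes):
    """Count Spot-capacity nodes per instance group."""
    counts = {}
    for n in nodes:
        if n.get("CapacityType") == "Spot":
            counts[n["InstanceGroupName"]] = counts.get(n["InstanceGroupName"], 0) + 1
    return counts

print(spot_count_by_group(nodes))  # {'spot-group': 2}
```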

### Using Console
<a name="sagemaker-hyperpod-spot-instance-getstart-console"></a>

#### Create and configure a SageMaker HyperPod cluster
<a name="sagemaker-hyperpod-spot-instance-getstart-console-create"></a>

To begin, launch and configure your SageMaker HyperPod EKS cluster and verify that continuous provisioning mode is enabled on cluster creation. Complete the following steps:

1. On the SageMaker AI console, choose HyperPod clusters in the navigation pane.

1. Choose Create HyperPod cluster and Orchestrated on Amazon EKS.

1. For Setup options, select Custom setup.

1. For Name, enter a name.

1. For Instance recovery, select Automatic.

1. For Instance provisioning mode, select Use continuous provisioning.

1. For Capacity type, select Spot.

1. Choose Submit.

The following screenshot shows the cluster creation flow in the console:

![\[An image containing the creation cluster flow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Screenshot-create-cluster.png)


This setup creates the necessary configuration such as virtual private cloud (VPC), subnets, security groups, and EKS cluster, and installs operators in the cluster. You can also provide existing resources such as an EKS cluster if you want to use an existing cluster instead of creating a new one. This setup will take around 20 minutes.

#### Adding new Spot Instance Group to the same cluster
<a name="sagemaker-hyperpod-spot-instance-getstart-console-add"></a>

To add a Spot instance group to your existing HyperPod EKS cluster, complete the following steps:

1. On the SageMaker AI console, choose HyperPod clusters in the navigation pane.

1. Select an existing HyperPod cluster with Amazon EKS orchestration (ensure continuous provisioning is enabled).

1. Choose Edit.

1. On the Edit cluster page, choose Create instance group.

1. For the capacity type, select Spot instance in the instance group configuration.

1. Choose Create instance group.

1. Choose Submit.

The following screenshot shows the instance group creation flow in the console:

![\[An image containing the instance group creation flow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Screenshot-instance-group.png)


### Using CloudFormation
<a name="sagemaker-hyperpod-spot-instance-getstart-cfn"></a>

```
Resources:
  TestCluster:
    Type: AWS::SageMaker::Cluster
    Properties:
      ClusterName: "SampleCluster"
      InstanceGroups:
        - InstanceGroupName: group1
          InstanceType: ml.c5.2xlarge
          InstanceCount: 1
          LifeCycleConfig:
            SourceS3Uri: "s3://'$BUCKET_NAME'"
            OnCreate: "on_create_noop.sh"
          ExecutionRole: "'$EXECUTION_ROLE'"
          ThreadsPerCore: 1
          CapacityRequirements:
            Spot: {}
      VpcConfig:
        Subnets:
          - "'$SUBNET1'"
        SecurityGroupIds:
          - "'$SECURITY_GROUP'"
      Orchestrator:
        Eks:
          ClusterArn:
            '$EKS_CLUSTER_ARN'
      NodeProvisioningMode: "Continuous"
      NodeRecovery: "Automatic"
```

For details, see [https://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started-eks-console-create-cluster-cfn.html](https://docs.aws.amazon.com/sagemaker/latest/dg/smcluster-getting-started-eks-console-create-cluster-cfn.html).

### Karpenter based Autoscaling
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter"></a>

#### Create cluster role
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-role"></a>

**Step 1: Navigate to IAM Console**

1. Go to the **AWS Management Console** → **IAM** service

1. Click **Roles** in the left sidebar

1. Click **Create role**

**Step 2: Set up Trust Policy**

1. Select Custom trust policy (instead of AWS service)

1. Replace the default JSON with this trust policy:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "hyperpod.sagemaker.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

Click **Next**.

**Step 3: Create Custom Permissions Policy**

Since these are specific SageMaker permissions, you'll need to create a custom policy:

1. Click **Create policy** (opens new tab)

1. Click the **JSON** tab

1. Enter this policy:

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "sagemaker:BatchAddClusterNodes",
           "sagemaker:BatchDeleteClusterNodes"
         ],
         "Resource": "*"
       }
     ]
   }
   ```

1. Click **Next**

1. Give it a name like `SageMakerHyperPodRolePolicy`

1. Click **Create policy**

**Step 4: Attach the Policy to Role**

1. Go back to your role creation tab

1. Refresh the policies list

1. Search for and select your newly created policy

1. Click **Next**

**Step 5: Name and Create Role**

1. Enter a role name (e.g., `SageMakerHyperPodRole`)

1. Add a description if desired

1. Review the trust policy and permissions

1. Click **Create role**

#### Verification
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-verify"></a>

After creation, you can verify by:
+ Checking the Trust relationships tab shows the hyperpod service
+ Checking the Permissions tab shows your custom policy
+ The role ARN will be available for use with HyperPod

The role ARN format will be:

```
 arn:aws:iam::YOUR-ACCOUNT-ID:role/SageMakerHyperPodRole
```

#### Create Cluster with AutoScaling:
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-create-cluster"></a>

For better availability, create instance groups in multiple Availability Zones by configuring subnets. You can also include On-Demand instance groups for fallback. For autoscaling, keep the `InstanceCount` of each group at 0.

```
aws sagemaker create-cluster \
    --cluster-name clusterNameHere \
    --orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
    --node-provisioning-mode "Continuous" \
    --cluster-role 'arn:aws:iam::YOUR-ACCOUNT-ID:role/SageMakerHyperPodRole' \
    --instance-groups '[{
        "InstanceGroupName": "auto-spot-c5-2x-az1",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 0,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
        "SourceS3Uri": "s3://'$BUCKET_NAME'",
        "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET1'"]
         }
    }]' \
    --vpc-config '{
        "SecurityGroupIds": ["'$SECURITY_GROUP'"],
        "Subnets": ["'$SUBNET'"]
    }' \
    --auto-scaling '{
        "Mode": "Enable",
        "AutoScalerType": "Karpenter"
    }'
```

#### Update Cluster (Spot + On-Demand)
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-update-cluster"></a>

```
aws sagemaker update-cluster \
   --cluster-name "my-cluster" \
   --instance-groups '[{
        "InstanceGroupName": "auto-spot-c5-x-az3",
        "InstanceType": "ml.c5.xlarge",
        "InstanceCount": 2,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://'$BUCKET_NAME'",
            "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET3'"]
        }
    },
    {
        "InstanceGroupName": "auto-spot-c5-2x-az2",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 2,
        "CapacityRequirements": { "Spot": {} },
        "LifeCycleConfig": {
        "SourceS3Uri": "s3://'$BUCKET_NAME'",
        "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET2'"]
         }
    },
    {   
        "InstanceGroupName": "auto-ondemand-c5-2x-az1",
        "InstanceType": "ml.c5.2xlarge",
        "InstanceCount": 2,
        "LifeCycleConfig": {
        "SourceS3Uri": "s3://'$BUCKET_NAME'",
        "OnCreate": "on_create_noop.sh"
        },
        "ExecutionRole": "'$EXECUTION_ROLE'",
        "ThreadsPerCore": 1,
        "OverrideVpcConfig": {
            "SecurityGroupIds": ["'$SECURITY_GROUP'"],
            "Subnets": ["'$SUBNET1'"]
         }
    }]'
```

#### Create HyperpodNodeClass
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-create-class"></a>

`HyperpodNodeClass` is a custom resource that maps to pre-created instance groups in SageMaker HyperPod, defining constraints around which instance types and Availability Zones are supported for Karpenter’s auto scaling decisions. To use `HyperpodNodeClass`, specify the names of the `InstanceGroups` of your SageMaker HyperPod cluster that you want to use as the source of AWS compute resources for scaling up the pods in your NodePools. The `HyperpodNodeClass` name that you use here is carried over to the NodePool in the next section, where you reference it. This tells the NodePool which `HyperpodNodeClass` to draw resources from. To create a `HyperpodNodeClass`, complete the following steps:

1. Create a YAML file (for example, nodeclass.yaml) similar to the following code. Add `InstanceGroup` names that you used at the time of the SageMaker HyperPod cluster creation. You can also add new instance groups to an existing SageMaker HyperPod EKS cluster.

1. Reference the `HyperpodNodeClass` name in your NodePool configuration.

The following is a sample `HyperpodNodeClass`:

```
apiVersion: karpenter.sagemaker.amazonaws.com/v1
kind: HyperpodNodeClass
metadata:
  name: multiazg6
spec:
  instanceGroups:
    # Name of an InstanceGroup in the HyperPod cluster. The InstanceGroup needs to be
    # created before this step can be completed.
    # MaxItems: 10
    - auto-spot-c5-2x-az1
    - auto-spot-c5-2x-az2
    - auto-spot-c5-x-az3
    - auto-ondemand-c5-2x-az1
```

Karpenter prioritizes Spot instance groups over On-Demand instances, using On-Demand as a fallback when specified in the configuration. Instance selection is sorted by EC2 Spot Placement Scores associated with each subnet's availability zone.

**Apply the configuration to your EKS cluster using `kubectl`:**

```
kubectl apply -f nodeclass.yaml
```

The HyperPod cluster must have autoscaling enabled, and the autoscaling status must change to `InService`, before the `HyperpodNodeClass` can be applied. The `HyperpodNodeClass` status also shows instance group capacities as Spot or On-Demand. For more information and key considerations, see [Autoscaling on SageMaker HyperPod EKS](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling.html).

**For example**

```
apiVersion: karpenter.sagemaker.amazonaws.com/v1
kind: HyperpodNodeClass
metadata:
  creationTimestamp: "2025-11-30T03:25:04Z"
  name: multiazc6
  uid: ef5609be-15dd-4700-89ea-a3370e023690
spec:
  instanceGroups:
  - spot1
status:
  conditions:
  // true when all IGs in the spec are present in SageMaker cluster, false otherwise
  - lastTransitionTime: "2025-11-20T03:25:04Z"
    message: ""
    observedGeneration: 3
    reason: InstanceGroupReady
    status: "True"
    type: InstanceGroupReady
  // true if subnets of IGs are discoverable, false otherwise
  - lastTransitionTime: "2025-11-20T03:25:04Z"
    message: ""
    observedGeneration: 3
    reason: SubnetsReady
    status: "True"
    type: SubnetsReady
  // true when all dependent resources are Ready [InstanceGroup, Subnets]
  - lastTransitionTime: "2025-11-30T05:47:55Z"
    message: ""
    observedGeneration: 3
    reason: Ready
    status: "True"
    type: Ready
  instanceGroups:
  - instanceTypes:
    - ml.c5.2xlarge
    name: auto-spot-c5-2x-az2
    subnets:
    - id: subnet-03ecc649db2ff20d2
      zone: us-west-2a
      zoneId: usw2-az2
    capacities: {"Spot": {}}
```
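As a sketch, a readiness check over a status block shaped like the example above simply looks for the `Ready` condition:

```python
# Hypothetical status block shaped like the HyperpodNodeClass example above.
status = {
    "conditions": [
        {"type": "InstanceGroupReady", "status": "True"},
        {"type": "SubnetsReady", "status": "True"},
        {"type": "Ready", "status": "True"},
    ]
}

def is_ready(status):
    """Return True when the Ready condition reports status 'True'."""
    return any(
        c["type"] == "Ready" and c["status"] == "True"
        for c in status.get("conditions", [])
    )

print(is_ready(status))  # True
```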

#### Create NodePool
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-create-nodepool"></a>

The NodePool sets constraints on the nodes that can be created by Karpenter and the pods that can run on those nodes. The NodePool can be set to perform various actions, such as: 
+ Define labels and taints to limit the pods that can run on nodes Karpenter creates
+ Limit node creation to certain zones, instance types, compute architectures, and so on

For more information about NodePools, refer to [NodePools](https://karpenter.sh/docs/concepts/nodepools/). SageMaker HyperPod managed Karpenter supports a limited set of well-known Kubernetes and Karpenter requirements, which we explain in this section.

To create a NodePool, complete the following steps:

Create a YAML file named `nodepool.yaml` with your desired NodePool configuration. The following code is a sample configuration that creates a NodePool. We specify the NodePool to include our ml.c5.2xlarge SageMaker instance type, and we additionally restrict it to one zone. Refer to [NodePools](https://karpenter.sh/docs/concepts/nodepools/) for more customizations.

```
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
 name: gpunodepool
spec:
 template:
   spec:
     nodeClassRef:
      group: karpenter.sagemaker.amazonaws.com
      kind: HyperpodNodeClass
      name: multiazg6
     expireAfter: Never
     requirements:
        - key: node.kubernetes.io/instance-type
          operator: Exists
        - key: "node.kubernetes.io/instance-type" # Optional; otherwise Karpenter decides based on the job's resource requirements
          operator: In
          values: ["ml.c5.2xlarge"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-west-2a"]
```

**Tip**: On an EC2 Spot interruption, HyperPod taints the node to trigger pod eviction. Karpenter’s **consolidation** process respects pod disruption budgets and performs normal Kubernetes eviction, but if you set `consolidateAfter: 0`, consolidation can happen **immediately**, leaving very little time for graceful pod eviction. Set it to a nonzero value of up to 2 minutes to allow graceful pod eviction for any checkpointing needs.
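The tip above corresponds to a `disruption` block in the NodePool spec. The following is an illustrative sketch (the 2-minute value is an example, not a requirement):

```
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m  # nonzero delay gives pods time to checkpoint before eviction
```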

**Apply the NodePool to your cluster:**

```
kubectl apply -f nodepool.yaml
```

**Monitor the NodePool status to ensure the Ready condition in the status is set to True:**

```
kubectl get nodepool gpunodepool -oyaml
```

This example shows how a NodePool can be used to specify the hardware (instance type) and placement (Availability Zone) for pods.

**Launch a simple workload**

The following workload runs a Kubernetes Deployment whose pods request 1 CPU and 256 MB of memory per replica. The pods have not been spun up yet.

```
kubectl apply -f https://raw.githubusercontent.com/aws/karpenter-provider-aws/refs/heads/main/examples/workloads/inflate.yaml
```

After we apply this, a deployment and a single node launch in the cluster.

**To scale this component, use the following command:**

```
kubectl scale deployment inflate --replicas 10
```
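To watch Karpenter provision capacity for the scaled deployment, you can inspect the NodeClaims and nodes it creates. The following commands assume the `gpunodepool` NodePool created earlier:

```
# List the NodeClaims Karpenter created in response to pending pods
kubectl get nodeclaims

# List the nodes launched for this NodePool
kubectl get nodes -l karpenter.sh/nodepool=gpunodepool
```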

See [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-autoscaling.html) for more details.

### Managing Node Interruption
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-interrupt"></a>

Spot Instances can be reclaimed at any time. EC2 provides a best-effort 2-minute interruption notice in most cases, but this notice is not guaranteed. In some situations, EC2 may terminate Spot Instances immediately without any advance warning. HyperPod automatically handles both scenarios:
+ With the 2-minute notice: Automatically performs graceful pod eviction and controlled capacity replacement when Spot capacity becomes available.
+ Without notice (immediate termination): Automatically reattempts node replacement (when Spot capacity becomes available) without graceful eviction.

**How it works**

When EC2 sends a Spot interruption notice, HyperPod automatically:

1. Detects interruption signal 

1. Taints the node: Prevents new pods from being scheduled on the interrupted instance

1. Gracefully evicts pods: Gives running pods time to complete or checkpoint their work (respecting Kubernetes `terminationGracePeriodSeconds`)

1. Replaces capacity: Automatically attempts to provision the replacement instances (Spot or On-Demand based on availability). 

   Capacity replacement works by automatically provisioning replacement instances. When capacity is not immediately available, the system continues checking until resources become accessible. In the case of non-autoscaling instance groups, HyperPod attempts to scale up within the same instance group until the required capacity becomes available. For Karpenter-based instance groups, Karpenter implements a fallback mechanism to other instance groups configured in the Node class when the primary group cannot accommodate the demand. Additionally, you can configure On-Demand as a fallback option, allowing Karpenter to automatically switch to On-Demand instances if it cannot successfully scale up Spot instance groups.

1. Reschedules workloads: Kubernetes automatically reschedules evicted pods on healthy nodes
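To benefit from graceful eviction during the notice window, your pods should set `terminationGracePeriodSeconds` and, optionally, a `preStop` hook that triggers checkpointing. The following sketch is illustrative; the pod name, image, and checkpoint script are hypothetical:

```
apiVersion: v1
kind: Pod
metadata:
  name: training-worker        # hypothetical name
spec:
  terminationGracePeriodSeconds: 120   # align with the 2-minute Spot notice
  containers:
  - name: trainer
    image: my-training-image:latest    # hypothetical image
    lifecycle:
      preStop:
        exec:
          # hypothetical script that saves a checkpoint before the pod is terminated
          command: ["/bin/sh", "-c", "/opt/save-checkpoint.sh"]
```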

### Finding your Usage and Bill
<a name="sagemaker-hyperpod-spot-instance-getstart-karpenter-bill"></a>

To check your usage and billing for Spot Instances on HyperPod, you can use the AWS Billing and Cost Management console. Go to **Billing and Cost Management** > **Bills**.

![\[An image containing cost region information.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Screenshot-cost-region.png)


**To explore usage and billing in the console, go to Billing and Cost Management > Cost Explorer**

![\[An image containing cost and usage.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/Screenshot-cost-usage.png)
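You can also query usage programmatically with the Cost Explorer API. The following AWS CLI sketch retrieves monthly SageMaker costs grouped by purchase type; the dates are placeholders, and output depends on your account's usage:

```
aws ce get-cost-and-usage \
  --time-period Start=2025-11-01,End=2025-12-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --filter '{"Dimensions":{"Key":"SERVICE","Values":["Amazon SageMaker"]}}' \
  --group-by Type=DIMENSION,Key=PURCHASE_TYPE
```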


# Using UltraServers in Amazon SageMaker HyperPod
<a name="sagemaker-hyperpod-ultraserver"></a>

SageMaker HyperPod support for UltraServers provides high-performance GPU computing capabilities for AI and machine learning workloads. Built on the NVIDIA GB200 NVL72 architecture, these UltraServers provide NVLink connectivity across 18 GB200 instances in a dual-rack configuration, totaling 72 B200 GPUs. This NVLink fabric allows workloads to use GPU communications that increase usable GPU capacity and addressable memory beyond what's possible with discrete instances, supporting more complex and resource-intensive AI models. The NVLink connectivity is enabled by NVIDIA IMEX technology, which handles the low-level configuration for secure GPU fabric connections across instances within the same rack.

HyperPod simplifies the deployment and management of these GPU clusters through intelligent topology awareness and automated configuration. The platform automatically discovers and labels nodes with their physical location and capacity block information, which supports topology-aware scheduling for distributed workloads. HyperPod abstracts the complex IMEX configuration requirements, allowing you to focus on workload deployment rather than low-level GPU fabric setup. You can choose flexible deployment options including both self-managed nodes and EKS managed node groups. Amazon EKS provides optimized AMIs that include pre-configured NVIDIA drivers, Fabric Manager, IMEX drivers, and all necessary system software for seamless operation.

The integration includes pod placement capabilities that ensure distributed workloads are scheduled optimally across NVL72 domains using standard Kubernetes topology labels. Built-in monitoring and automated recovery features provide operational support, where the AMI health agent detects GPU errors from kernel logs and can automatically remediate issues or replace faulty nodes in managed node groups. This combination of GPU scale, intelligent workload placement, and automated operations helps you focus on your AI/ML innovations rather than infrastructure complexity, while achieving maximum performance from your GPU investments.

To get set up using UltraServers with your HyperPod cluster, see the following steps:

1. Create an [EKS-based HyperPod cluster](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-create-cluster.html). When you choose an instance group, make sure you choose an UltraServer. 

1. After your cluster is created, use the following commands to install the operational plugins:

   NVIDIA device plugin v0.17.2

   ```
   kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.2/deployments/static/nvidia-device-plugin.yml
   ```

   Node Feature Discovery (NFD) DaemonSet v0.17.3

   ```
   kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.17.3"
   ```

   GPU feature discovery

   ```
   kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.2/deployments/static/gpu-feature-discovery-daemonset.yaml
   ```

You can now run jobs. The following example demonstrates how to create a domain, configure an IMEX domain, and enable channel allocation. These steps also let you create a pod to provision a channel for NCCL communication.

1. Create a resource specification file to use with `kubectl`.

   ```
   cat <<EOF > imex-channel-injection.yaml
   ---
   apiVersion: resource.nvidia.com/v1beta1
   kind: ComputeDomain
   metadata:
     name: imex-channel-injection
   spec:
     numNodes: 1
     channel:
       resourceClaimTemplate:
         name: imex-channel-0
   ---
   apiVersion: v1
   kind: Pod
   metadata:
     name: imex-channel-injection
   spec:
     affinity:
       nodeAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           nodeSelectorTerms:
           - matchExpressions:
             - key: nvidia.com/gpu.clique
               operator: Exists
             - key: topology.k8s.aws/ultraserver-id
               operator: In
               values: 
               - <UltraServer-ID>
     containers:
     - name: ctr
       image: ubuntu:22.04
       command: ["bash", "-c"]
       args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
       resources:
         claims:
         - name: imex-channel-0
     resourceClaims:
     - name: imex-channel-0
       resourceClaimTemplateName: imex-channel-0
   EOF
   ```

1. Apply the configuration that you created.

   ```
   kubectl apply -f imex-channel-injection.yaml
   ```

1. To verify that your pod is created, run the `get pods` commands.

   ```
   kubectl get pods
   kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain
   ```

1. You can also check the logs from the pod to see if it allocated a communication channel.

   ```
   kubectl logs imex-channel-injection
   ```

   ```
   total 0
   drwxr-xr-x 2 root root     60 Feb 19 10:43 .
   drwxr-xr-x 6 root root    380 Feb 19 10:43 ..
   crw-rw-rw- 1 root root 507, 0 Feb 19 10:43 channel0
   ```

1. You can also check the logs to verify that the automated IMEX configuration is running with an allocated channel.

   ```
   kubectl logs -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain --tail=-1
   /etc/nvidia-imex/nodes_config.cfg:
   ```

   ```
   IMEX Log initializing at: 8/8/2025 14:23:12.081
   [Aug 8 2025 14:23:12] [INFO] [tid 39] IMEX version 570.124.06 is running with the following configuration options
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Logging level = 4
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Logging file name/path = /var/log/nvidia-imex.log
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Append to log file = 0
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Max Log file size = 1024 (MBs)
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Use Syslog file = 0
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] IMEX Library communication bind interface =
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] IMEX library communication bind port = 50000
   
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Identified this node as ID 0, using bind IP of '10.115.131.8', and network interface of enP5p9s0
   [Aug 8 2025 14:23:12] [INFO] [tid 39] nvidia-imex persistence file /var/run/nvidia-imex/persist.dat does not exist.  Assuming no previous importers.
   [Aug 8 2025 14:23:12] [INFO] [tid 39] NvGpu Library version matched with GPU Driver version
   [Aug 8 2025 14:23:12] [INFO] [tid 63] Started processing of incoming messages.
   [Aug 8 2025 14:23:12] [INFO] [tid 64] Started processing of incoming messages.
   [Aug 8 2025 14:23:12] [INFO] [tid 65] Started processing of incoming messages.
   [Aug 8 2025 14:23:12] [INFO] [tid 39] Creating gRPC channels to all peers (nPeers = 1).
   [Aug 8 2025 14:23:12] [INFO] [tid 66] Started processing of incoming messages.
   [Aug 8 2025 14:23:12] [INFO] [tid 39] IMEX_WAIT_FOR_QUORUM != FULL, continuing initialization without waiting for connections to all nodes.
   [Aug 8 2025 14:23:12] [INFO] [tid 67] Connection established to node 0 with ip address 10.115.131.8. Number of times connected: 1
   [Aug 8 2025 14:23:12] [INFO] [tid 39] GPU event successfully subscribed
   ```

1. After you've verified everything, delete the workload and remove the configuration.

   ```
   kubectl delete -f imex-channel-injection.yaml
   ```

# IDEs and Notebooks
<a name="sagemaker-hyperpod-eks-cluster-ide"></a>

Amazon SageMaker is introducing a new capability for SageMaker HyperPod EKS clusters that allows AI developers to run their interactive machine learning workloads directly on the HyperPod EKS cluster. This feature introduces a new add-on, Amazon SageMaker Spaces, that enables AI developers to create and manage self-contained environments for running notebooks.

Administrators can use SageMaker HyperPod Console to install the add-on on their cluster, and define default space configurations such as images, compute resources, local storage for notebook settings (additional storage to be attached to their dev spaces), file systems, and initialization scripts. A one-click installation option will be available with default settings to simplify the admin experience. Admins can use the SageMaker HyperPod Console, kubectl, or HyperPod CLI to install the operator, create default settings, and manage all spaces in a centralized location.

AI developers can use the HyperPod CLI to create, update, and delete dev spaces. They have the flexibility to use the default configurations provided by admins or customize settings. AI developers can access their spaces on HyperPod using their local VS Code IDE, or through a web browser that hosts their JupyterLab or Code Editor IDE on a custom DNS domain configured by their admins. They can also use Kubernetes port forwarding to access spaces in their web browsers.
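Port forwarding to a space might look like the following sketch; the namespace, service name, and port are hypothetical and depend on how your admin configured the add-on:

```
# Forward local port 8888 to the space's JupyterLab service,
# then open http://localhost:8888 in a browser
kubectl port-forward svc/my-space-jupyter 8888:8888 -n my-spaces-namespace
```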

## Admin
<a name="admin-cx"></a>
+ [Set up permissions](permission-setup.md)
+ [Install SageMaker AI Spaces Add-on](operator-install.md)
+ [Customize add-on](customization.md)
+ [Add users and set up service accounts](add-user.md)
+ [Limits](ds-limits.md)
+ [Task governance for Interactive Spaces on HyperPod](task-governance.md)
+ [Observability](observability.md)

## Data scientist
<a name="data-scientist-cx"></a>
+ [Create and manage spaces](create-manage-spaces.md)
+ [Web browser access](browser-access.md)
+ [Remote access to SageMaker Spaces](vscode-access.md)

## SageMaker Spaces Managed Instance Pricing
<a name="spaces-managed-instance-pricing"></a>

The SageMaker Spaces Add-on/Operator does not incur any additional charge to the customer. However, to support the SSH-over-SSM tunneling required for the *Remote IDE Connection* feature, SageMaker Spaces uses an AWS-managed instance. This instance is registered as an Advanced On-Premises Instance under SSM, and therefore is billed per compute hour.

Please refer to the “On-Premises Instance Management” rate on the AWS Systems Manager pricing page: [https://aws.amazon.com/systems-manager/pricing/](https://aws.amazon.com/systems-manager/pricing/)

# Set up permissions
<a name="permission-setup"></a>

## Roles required for Add-on and its dependencies
<a name="permission-setup-addon"></a>

### IAM Roles Required for SageMaker Spaces on SageMaker HyperPod
<a name="role-hyperpod"></a>

When enabling **SageMaker Spaces** (a.k.a. SageMaker IDE / Notebooks) features on a SageMaker HyperPod (EKS) cluster, several IAM roles must be created and assigned. These roles support secure access, routing, remote IDE sessions, and EBS storage provisioning. The following table summarizes the five roles and when they are required.

### Role Summary Table
<a name="role-table"></a>


| IAM Role | Required? | Purpose | Who Uses It? | Customization allowed by SageMaker Console? | 
| --- | --- | --- | --- | --- | 
|  Spaces Add-on Execution Role  |  Always required  |  Allows the Spaces controller to manage Spaces, generate presigned URLs, manage SSM sessions  |  Add-on controller pod (privileged)  |  ✔ Yes  | 
|  In-Cluster Router Role  |  Required for WebUI access  |  Allows router pod to perform KMS operations for JWT signing (WebUI authentication)  |  In-cluster router pod (privileged)  |  ✔ Yes  | 
|  SSM Managed Instance Role  |  Required for Remote IDE access  |  Used by SSM agent sidecar for SSH-over-SSM remote IDE sessions  |  SSM Agent in Space IDE Pods (not an add-on pod)  |  ✔ Yes  | 
|  IAM Role for EBS CSI Driver Add-on  |  Always required  |  Allows EBS CSI Driver to create/attach/modify volumes for Spaces workloads  |  EBS CSI Driver Add-on  |  Auto created  | 
|  IAM Role for External DNS Add-on  |  Required for WebUI access  |  It ensures that Space endpoints and in-cluster components can be automatically assigned DNS names in the customer’s Route 53 hosted zones.  |  External DNS Add-on  |  Auto created  | 

### 1. Spaces Add-on Execution Role (Required)
<a name="add-n-execution-role"></a>

The Spaces Add-on Execution Role is always required because it is used by the SageMaker Spaces add-on controller pod, an administrative component installed through the EKS add-on. This role allows the controller to manage Spaces, provision resources, interact with SSM, and generate presigned URLs for both Remote IDE and WebUI access. It also supports the KMS access used to sign requests when authenticating WebUI HTTPS requests. This role can be created automatically when the SageMaker Spaces add-on is installed through the SageMaker console. For manual creation, AWS provides the `AmazonSageMakerSpacesControllerPolicy` managed policy.

**Reference Trust Policy**

```
{
  "Version": "2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "pods.eks.amazonaws.com"
      },
      "Action": [
          "sts:AssumeRole",
          "sts:TagSession"
      ],
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "{{accountId}}",
          "aws:SourceArn": "arn:aws:eks:{{region}}:{{accountId}}:cluster/{{eksClusterName}}"
        }
      }
    }
  ]
}
```

### 2. In-Cluster Router Role (Required for WebUI Authentication)
<a name="in-cluster-role"></a>

The In-Cluster Router Role is used by the **router pod**, a privileged component that authenticates Spaces WebUI sessions. The router uses a KMS key to create and sign JWT tokens that authorize user access to specific Spaces. This role allows the router pod to generate data keys and decrypt them. Similar to the controller role, it enforces security using tag- and cluster-based scope restrictions. This role can be generated automatically when the Spaces add-on is installed via the SageMaker console, but customers may create it manually.

**Reference Trust Policy**

```
{
  "Version": "2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "pods.eks.amazonaws.com"
      },
      "Action": [
          "sts:AssumeRole",
          "sts:TagSession"
      ],
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "{{accountId}}",
          "aws:SourceArn": "arn:aws:eks:{{region}}:{{accountId}}:cluster/{{eksClusterName}}"
        }
      }
    }
  ]
}
```

**Reference Permission Policy**

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "KMSDescribeKey",
            "Effect": "Allow",
            "Action": [
                "kms:DescribeKey"
            ],
            "Resource": "arn:aws:kms:{{region}}:{{accountId}}:key/{{kmsKeyId}}"
        },
        {
            "Sid": "KMSKeyOperations",
            "Effect": "Allow",
            "Action": [
                "kms:GenerateDataKey",
                "kms:Decrypt"
            ],
            "Resource": "arn:aws:kms:{{region}}:{{accountId}}:key/{{kmsKeyId}}",
            "Condition": {
                "StringEquals": {
                    "kms:EncryptionContext:sagemaker:component": "amazon-sagemaker-spaces",
                    "kms:EncryptionContext:sagemaker:eks-cluster-arn": "${aws:PrincipalTag/eks-cluster-arn}"
                }
            }
        }
    ]
}
```

### 3. SSM Managed Instance Role (Required for Remote IDE Access)
<a name="ssm-role"></a>

The SSM Managed Instance Role is passed when registering the SSM managed instance for enabling the remote IDE access. This role allows the SSM agent to register the pod as an SSM Managed Instance and use the SSM Session Manager channels for Remote IDE (SSH-over-SSM) connectivity. It can be created automatically when using the AWS SageMaker Console. For manual deployments, customers must create this role and provide it to the Spaces add-on. The controller pod itself does not assume this role; it only provides it when calling `ssm:CreateActivation`.

**Reference Trust Policy**

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ssm.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "{{account}}"
                },
                "ArnEquals": {
                    "aws:SourceArn": "arn:aws:ssm:{{region}}:{{account}}:*"
                }
            }
        }
    ]
}
```

**Reference Permissions Policy**

```
{
  "Version": "2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssm:DescribeAssociation"
      ],
      "Resource": [
        "arn:aws:ssm:{{region}}:{{account}}:association/*",
        "arn:aws:ssm:{{region}}:{{account}}:document/*",
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetDocument",
        "ssm:DescribeDocument"
      ],
      "Resource": "arn:aws:ssm:{{region}}:{{account}}:document/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameter",
        "ssm:GetParameters"
      ],
      "Resource": "arn:aws:ssm:{{region}}:{{account}}:parameter/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:ListInstanceAssociations"
      ],
      "Resource": [
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:PutComplianceItems"
      ],
      "Resource": [
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:UpdateAssociationStatus"
      ],
      "Resource": [
        "arn:aws:ssm:{{region}}:{{account}}:document/*",
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:UpdateInstanceAssociationStatus"
      ],
      "Resource": [
        "arn:aws:ssm:{{region}}:{{account}}:association/*",
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:UpdateInstanceInformation"
      ],
      "Resource": [
        "arn:aws:ec2:{{region}}:{{account}}:instance/*",
        "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetDeployablePatchSnapshotForInstance",
        "ssm:GetManifest",
        "ssm:ListAssociations",
        "ssm:PutInventory",
        "ssm:PutConfigurePackageResult"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2messages:AcknowledgeMessage",
        "ec2messages:DeleteMessage",
        "ec2messages:FailMessage",
        "ec2messages:GetEndpoint"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2messages:GetMessages",
        "ec2messages:SendReply"
      ],
      "Resource": "*",
      "Condition": {
        "ArnLike": {
          "ssm:SourceInstanceARN": "arn:aws:ssm:{{region}}:{{account}}:managed-instance/*"
        }
      }
    }
  ]
}
```

### 4. IAM Role for EBS CSI Driver Add-on
<a name="role-ebs-csi"></a>

The IAM role for the EBS CSI Driver is required because the EBS CSI Driver provisions persistent volumes for Spaces workloads. While the AWS-managed [AmazonEBSCSIDriverPolicy](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonEBSCSIDriverPolicy.html) provides baseline permissions, SageMaker HyperPod clusters require [additional capabilities](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-ebs.html#sagemaker-hyperpod-eks-ebs-setup) such as creating fast snapshot restores, tagging cluster-owned volumes, and attaching/detaching volumes for HyperPod-managed nodes. These permissions also include SageMaker-specific APIs such as `sagemaker:AttachClusterNodeVolume`. If the EBS CSI Driver is not installed, this role is automatically created by the SageMaker console during Spaces add-on installation, **requiring no customer action**.

### 5. IAM Role for External DNS Add-on
<a name="role-external-nds"></a>

The External DNS add-on manages DNS records for Services and Ingress resources on the HyperPod cluster. It ensures that Space endpoints and in-cluster components can be automatically assigned DNS names in the customer’s Route 53 hosted zones. Today, customers often install External DNS manually via a 1-click option in the EKS console. As part of improving the SageMaker Spaces experience, this role is automatically created by the SageMaker console during Spaces add-on installation, **requiring no customer action**.

## Permission setup for AWS Toolkit to Access SageMaker Spaces
<a name="permission-for-toolkitl"></a>

To allow the AWS VS Code Toolkit resource explorer side panel to discover and connect to SageMaker Spaces, the following IAM permissions are required. These permissions allow the Toolkit to list available SageMaker HyperPod clusters, retrieve cluster details, and obtain a connection token for the associated Amazon EKS cluster.

**Required IAM Policy**

```
{
    "Version": "2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "SageMakerListClusters",
            "Effect": "Allow",
            "Action": "sagemaker:ListClusters",
            "Resource": "*"
        },
        {
            "Sid": "SageMakerDescribeCluster",
            "Effect": "Allow",
            "Action": "sagemaker:DescribeCluster",
            "Resource": "arn:aws:sagemaker:{{region}}:{{account}}:cluster/cluster-name"
        },
        {
            "Sid": "EksDescribeCluster",
            "Effect": "Allow",
            "Action": "eks:DescribeCluster",
            "Resource": "arn:aws:eks:{{region}}:{{account}}:cluster/cluster-name"
        },
        {
            "Sid": "EksGetToken",
            "Effect": "Allow",
            "Action": "eks:GetToken",
            "Resource": "*"
        }
    ]
}
```

**Scoping Recommendations**
+ Replace `cluster-name` with the specific SageMaker HyperPod cluster(s) your users need to access.
+ The `eks:GetToken` action currently does not support resource-level restrictions and must use `Resource: "*"`. This is an AWS service limitation. Client-side authentication is performed through [EKS access entries](https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html).

# Install SageMaker AI Spaces Add-on
<a name="operator-install"></a>

## Dependencies
<a name="dependencies"></a>

**Amazon EKS Pod Identity Agent add-on**
+ Required for the operator to obtain AWS credentials
+ **Typically pre-installed** on most EKS clusters
+ Installation: Via EKS add-ons

**Cert-manager**
+ Required for TLS certificate management
+ **Pre-installed** if using HyperPod quick cluster create
+ Installation: Via EKS add-ons

**EBS CSI Driver**
+ Required for Space persistent storage (EBS volumes)
+ **Automatically installed** when using SageMaker console to install
+ Requires an IAM role with the `AmazonEBSCSIDriverPolicy` plus HyperPod-specific permissions
+ Installation: Via EKS add-ons. However, make sure to follow the guide to install the additional permissions needed for HyperPod. 
+ Reference: [Using the Amazon EBS CSI driver on HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-ebs.html)
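If you install the EBS CSI Driver yourself rather than through the SageMaker console, the add-on can be created with a command like the following sketch; the cluster name and role ARN are placeholders:

```
aws eks create-addon \
  --cluster-name my-eks-cluster \
  --addon-name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::111122223333:role/EbsCsiDriverRole
```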

## Additional dependencies for WebUI Access
<a name="-additional-dependencies"></a>

**AWS Load Balancer Controller**
+ **Pre-installed** if using HyperPod quick cluster create
+ Installation: Via Helm
+ Manual installation guide: [Installing the AWS Load Balancer Controller](https://docs.aws.amazon.com/eks/latest/userguide/lbc-helm.html)

**External DNS**
+ Required when using custom domain for WebUI access
+ Manages Route53 DNS records automatically
+ Requires IAM role with Route53 permissions
+ Installation: Via EKS add-ons

## Installation
<a name="installation"></a>

Before you begin, ensure that you have:
+ An active SageMaker HyperPod cluster with at least one worker node running Kubernetes version 1.30 or later
+ At least one worker node with minimum instance type (XX vCPU, YY GiB memory)

### Installing the Amazon SageMaker Spaces add-on
<a name="space-add-on"></a>

You can install the SageMaker Spaces add-on using either quick install for default settings or custom install for advanced configuration.

#### Quick install
<a name="quick-install"></a>

1. Open the Amazon SageMaker console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose your cluster from the clusters list.

1. On the IDE and Notebooks tab, locate Amazon SageMaker Spaces, then choose Quick install.

Quick install automatically:
+ Creates the required IAM roles for the add-on
+ Enables remote access mode with required IAM roles for Systems Manager
+ Installs the add-on and configures pod identity association

#### Custom install
<a name="custom-install"></a>

1. Open the Amazon SageMaker console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose your cluster from the clusters list.

1. On the IDE and Notebooks tab, locate Amazon SageMaker Spaces, then choose Custom install.

1. Configure the following options:

   **IAM roles needed by add-on**
   + Choose whether to create new IAM roles with recommended permissions or use existing ones with the required permissions (Refer to Admin Permission Set up section above)

   **Remote access configuration**
   + Enable to allow users to connect to spaces from local Visual Studio Code using AWS Systems Manager
   + For SSM managed instance role:
     + **Create new role** – The add-on creates and manages the role with required Systems Manager permissions
     + **Use existing role** – Select a pre-configured role with necessary Systems Manager permissions
   + Ensure the Spaces Add-on Execution Role has PassRole permissions for the SSM managed instance role
**Note**  
Enabling remote access activates the AWS Systems Manager advanced-instances tier, which incurs additional per-instance charges. For pricing information, see Systems Manager pricing.
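A minimal PassRole statement for the Spaces Add-on Execution Role might look like the following sketch, scoped to the SSM managed instance role; the placeholders follow the conventions of the trust policies above:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PassSSMManagedInstanceRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::{{accountId}}:role/{{ssmManagedInstanceRoleName}}",
      "Condition": {
        "StringEquals": {
          "iam:PassedToService": "ssm.amazonaws.com"
        }
      }
    }
  ]
}
```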

   **Web browser access configuration**
   + Enable to allow users to access spaces through a web browser using Route 53 DNS and SSL certificates
   + **Prerequisites:** Install AWS Load Balancer Controller before enabling browser access
   + **Route 53 hosted zone:** Select an existing hosted zone for a domain or subdomain that you own. The domain or subdomain must be registered and under your control to enable DNS management and SSL certificate validation.

     For more details on domain registration, see [Registering a new domain](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/domain-register.html#domain-register-procedure-section) in the Route 53 Developer Guide.
   + **Subdomain:** Enter subdomain prefix (alphanumeric and hyphens only, maximum 63 characters)
   + **SSL certificate:** Select an existing SSL certificate from AWS Certificate Manager. The certificate must be valid and cover both your subdomain (e.g., `subdomain.domain.com`) and wildcard subdomains (e.g., `*.subdomain.domain.com`) to support individual space access URLs.
   + **Token signing key:** Select an AWS KMS asymmetric key for JWT token signing. The key is used to sign authentication tokens for secure WebUI access. You can create a new asymmetric key in KMS or select an existing one that your account has access to.
**Note**  
Standard Route 53 charges apply for hosted zones and DNS queries. For pricing information, see Route 53 pricing.
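
The subdomain constraint above (alphanumeric characters and hyphens, at most 63 characters) matches standard DNS label rules and can be checked locally before you configure the add-on. A minimal sketch (the helper name is ours, not part of the add-on; it additionally rejects leading and trailing hyphens per RFC 1123):

```python
import re

# DNS labels: 1-63 chars, alphanumeric and hyphens,
# not starting or ending with a hyphen (RFC 1123).
_LABEL_RE = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")

def is_valid_subdomain_prefix(label: str) -> bool:
    """Return True if `label` is usable as the subdomain prefix."""
    return bool(_LABEL_RE.match(label))
```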

#### EKS Addon Installation - Jupyter K8s with WebUI
<a name="webui-install"></a>

##### Configuration File
<a name="configure-file"></a>

Create `addon-config.yaml`:

```
jupyter-k8s:
  workspacePodWatching:
    enable: true

jupyter-k8s-aws-hyperpod:
  clusterWebUI:
    enabled: true
    domain: "<DOMAIN_NAME>"
    awsCertificateArn: "<ACM_CERTIFICATE_ARN>"
    kmsEncryptionContext:
      enabled: true
    traefik:
      shouldInstall: true
    auth:
      kmsKeyId: "<KMS_KEY_ARN>"
```

**Replace the following placeholders:**
+ `<DOMAIN_NAME>`: Your domain name (e.g., `jupyter.example.com`)
+ `<ACM_CERTIFICATE_ARN>`: Your ACM certificate ARN (e.g., `arn:aws:acm:us-west-2:111122223333:certificate/12345678-1234-1234-1234-123456789012`)
+ `<KMS_KEY_ARN>`: Your KMS key ARN (e.g., `arn:aws:kms:us-west-2:111122223333:key/12345678-1234-1234-1234-123456789012`)

##### Installation via AWS CLI
<a name="install-via-cli"></a>

```
aws eks create-addon \
  --cluster-name <CLUSTER_NAME> \
  --addon-name amazon-sagemaker-spaces \
  --configuration-values file://addon-config.yaml \
  --resolve-conflicts OVERWRITE \
  --region <AWS_REGION>
```

**To update existing addon:**

```
aws eks update-addon \
  --cluster-name <CLUSTER_NAME> \
  --addon-name amazon-sagemaker-spaces \
  --configuration-values file://addon-config.yaml \
  --resolve-conflicts OVERWRITE \
  --region <AWS_REGION>
```

##### Installation via AWS Management Console
<a name="install-via-console"></a>

1. Go to the **EKS console** and select your cluster

1. Choose the **Add-ons** tab, then choose **Add new**

1. Select the **SageMaker Spaces** add-on

1. Paste the YAML config above in **Optional configuration settings**

1. Choose **Next**, then review the add-on settings

1. Choose **Create**

##### Verify Installation
<a name="install-verify"></a>

```
# Check addon status
aws eks describe-addon \
  --cluster-name <CLUSTER_NAME> \
  --addon-name amazon-sagemaker-spaces \
  --region <AWS_REGION>
```

##### Customizing ALB Attributes
<a name="customize-alb"></a>

By default, the addon creates a public load balancer for use with the web UI. You can customize the load balancer attributes using the EKS addon properties.

To create an internal ALB, set the scheme to `internal`:

```
jupyter-k8s-aws-hyperpod:
  clusterWebUI:
    enabled: true
    domain: "<DOMAIN_NAME>"
    awsCertificateArn: "<ACM_CERTIFICATE_ARN>"
    alb:
      scheme: "internal"  # Default is "internet-facing"
```

You can also use the `alb.annotations` field to customize ALB settings:

```
jupyter-k8s-aws-hyperpod:
  clusterWebUI:
    enabled: true
    domain: "<DOMAIN_NAME>"
    awsCertificateArn: "<ACM_CERTIFICATE_ARN>"
    alb:
      scheme: "internal"
      annotations:
        alb.ingress.kubernetes.io/security-groups: "<SECURITY_GROUP_ID>"
        alb.ingress.kubernetes.io/subnets: "<SUBNET_ID_1>,<SUBNET_ID_2>"
        alb.ingress.kubernetes.io/load-balancer-attributes: "idle_timeout.timeout_seconds=60"
```

**Common ALB annotations:**
+ `alb.ingress.kubernetes.io/security-groups`: Specify security groups for the ALB
+ `alb.ingress.kubernetes.io/subnets`: Specify subnets for the ALB
+ `alb.ingress.kubernetes.io/load-balancer-attributes`: Set ALB attributes (idle timeout, access logs, etc.)

See [AWS Load Balancer Controller documentation](https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/guide/ingress/annotations/) for all available annotations.

### Upgrade / versioning of add-on
<a name="upgrade-add-on"></a>

```
aws eks update-addon \
  --cluster-name <CLUSTER_NAME> \
  --addon-name amazon-sagemaker-spaces \
  --configuration-values file://addon-config.yaml \
  --resolve-conflicts OVERWRITE \
  --region <AWS_REGION>
```

# Customize add-on
<a name="customization"></a>

## Template
<a name="customization-template"></a>

Templates are reusable workspace configurations that serve as admin-controlled blueprints for workspace creation. They provide defaults for workspace configuration values, and guardrails to control what data scientists can do. Templates exist at a cluster level, and can be re-used across namespaces. 

SageMaker Spaces creates two system templates as a starting point for data scientists, one for Code Editor and one for JupyterLab. These system templates are managed by the add-on and cannot be edited directly. Instead, admins can create new templates and set them as the default.

## Task Governance
<a name="customization-governabce"></a>

Templates can integrate with HyperPod task governance by attaching a Kueue priority-class label to the template, as in the following example:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: WorkspaceTemplate
metadata:
  name: my-jupyter-template
  namespace: my-namespace
  labels:
    kueue.x-k8s.io/priority-class: <user-input>-priority
spec:
  displayName: "My Custom Jupyter Lab"
  description: "Custom Jupyter Lab with specific configurations"
  defaultImage: "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
  allowedImages:
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu"
  defaultResources:
    requests:
      cpu: "1"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "16Gi"
  primaryStorage:
    defaultSize: "10Gi"
    minSize: "5Gi"
    maxSize: "50Gi"
    defaultStorageClassName: "sagemaker-spaces-default-storage-class"
    defaultMountPath: "/home/sagemaker-user"
  defaultContainerConfig:
    command: ["/opt/amazon/sagemaker/workspace/bin/entrypoint-workspace-jupyterlab"]
  defaultPodSecurityContext:
    fsGroup: 1000
  defaultOwnershipType: "Public"
  defaultAccessStrategy:
    name: "hyperpod-access-strategy"
  allowSecondaryStorages: true
  appType: "jupyterlab"
```

## SMD / Custom images
<a name="customization-image"></a>

Customers can configure image policies through templates by providing a default image and a list of allowed images. Additionally, administrators can choose whether to allow data scientists to bring their own custom images. The system defaults to using the latest SageMaker Distribution, but if you wish to pin to a particular version, you can specify the exact SMD version to use in a template.
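
For example, to pin a template to a specific SageMaker Distribution version rather than `latest`, set explicit image tags in the template (version `3.4.2` here is only an example):

```
  defaultImage: "public.ecr.aws/sagemaker/sagemaker-distribution:3.4.2-cpu"
  allowedImages:
    - "public.ecr.aws/sagemaker/sagemaker-distribution:3.4.2-cpu"
    - "public.ecr.aws/sagemaker/sagemaker-distribution:3.4.2-gpu"
```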

Custom image requirements:
+ `curl` must be installed if you want to use idle shutdown
+ The app must listen on port 8888
+ For remote access, the image must also meet the dependencies listed in the Image requirements section below

## Remote IDE Requirement
<a name="remote-ide-requirement"></a>

### VS Code version requirement
<a name="remote-ide-requirement-vscode"></a>

VS Code version [v1.90](https://code.visualstudio.com/updates/v1_90) or greater is required. We recommend using the [latest stable version of VS Code](https://code.visualstudio.com/updates).

### Operating system requirements
<a name="remote-ide-requirement-operate"></a>

You need one of the following operating systems to remotely connect to Studio spaces:
+ macOS 13 or later
+ Windows 10
  + [Windows 10 support ends on October 14, 2025](https://support.microsoft.com/en-us/windows/windows-10-support-ends-on-october-14-2025-2ca8b313-1946-43d3-b55c-2b95b107f281)
+ Windows 11
+ Linux
  + Install the official [Microsoft VS Code for Linux](https://code.visualstudio.com/docs/setup/linux), not an open-source build

### Local machine prerequisites
<a name="remote-ide-requirement-machine"></a>

Before connecting your local Visual Studio Code to Studio spaces, ensure your local machine has the required dependencies and network access.

**Note**  
Environments with software installation restrictions may prevent users from installing required dependencies. The AWS Toolkit for Visual Studio Code automatically searches for these dependencies when initiating remote connections and will prompt for installation if any are missing. Coordinate with your IT department to ensure these components are available.

**Required local dependencies**

Your local machine must have the following components installed:
+ **[Remote - SSH extension](https://code.visualstudio.com/docs/remote/ssh)** — Standard VS Code Marketplace extension for remote development
+ **[Session Manager plugin](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html)** — Required for secure session management
+ **SSH Client** — Standard component on most machines ([OpenSSH recommended for Windows](https://learn.microsoft.com/en-us/windows-server/administration/openssh/openssh_install_firstuse))
+ **[VS Code command-line interface](https://code.visualstudio.com/docs/configure/command-line)** — Typically included with VS Code installation

**Platform-specific requirements**
+ **Windows users** — PowerShell 5.1 or later is required for SSH terminal connections

**Network connectivity requirements**

Your local machine must have network access to [Session Manager endpoints](https://docs.aws.amazon.com/general/latest/gr/ssm.html). For example, in US East (N. Virginia) (us-east-1) these are:
+ `ssm.us-east-1.amazonaws.com`
+ `ssm.us-east-1.api.aws`
+ `ssmmessages.us-east-1.amazonaws.com`
+ `ec2messages.us-east-1.amazonaws.com`

### Image requirements
<a name="remote-ide-requirement-image"></a>

**SageMaker Distribution images**

When using SageMaker Distribution with remote access, use [SageMaker Distribution](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-distribution.html) version 2.7 or later.

**Custom images**

When you [Bring your own image (BYOI)](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-byoi.html) with remote access, ensure that you follow the [custom image specifications](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated-byoi-specs.html) and ensure the following dependencies are installed:
+ `curl` or `wget` — Required for downloading AWS CLI components
+ `unzip` — Required for extracting AWS CLI installation files
+ `tar` — Required for archive extraction
+ `gzip` — Required for compressed file handling

### Instance requirements
<a name="remote-ide-requirement-instance"></a>
+ **Memory** — Use instances with at least 8 GB of memory. The following instance types are *not* supported due to insufficient memory (less than 8 GB): `ml.t3.medium`, `ml.c7i.large`, `ml.c6i.large`, `ml.c6id.large`, and `ml.c5.large`. For a complete list of instance types, see the [Amazon EC2 On-Demand Pricing page](https://aws.amazon.com/ec2/pricing/on-demand/).

## Optimizing Kubernetes Startup Time by Pre-Warming Container Images
<a name="remote-ide-optimize-image"></a>

Container image pulling performance has become a significant bottleneck for many EKS customers, especially as AI/ML workloads rely on increasingly large container images. Pulling and unpacking these large images typically takes several minutes the first time they are used on each EKS node. This delay adds substantial latency when launching SageMaker Spaces and directly impacts user experience—particularly in environments where fast startup is essential, such as notebooks and interactive development jobs.

Image pre-warming is a technique used to preload specific container images onto every node in the EKS/HyperPod cluster before they are needed. Instead of waiting for a pod to trigger the first pull of a large image, the cluster proactively downloads and caches images across all nodes. This ensures that when workloads launch, the required images are already available locally, eliminating long cold-start delays. Image pre-warming improves SageMaker Spaces startup speed and provides a more predictable and responsive experience for end users.

### Pre-Warming via DaemonSet
<a name="remote-ide-optimize-image-dae"></a>

We recommend using a DaemonSet to preload images. A DaemonSet ensures that one pod runs on every node in the cluster. Each container inside the DaemonSet pod references an image you want to cache. When Kubernetes starts the pod, it automatically pulls the images, warming the cache on each node.

The following example shows how to create a DaemonSet that preloads two GPU images. Each container runs a lightweight `sleep infinity` command to keep the pod active with minimal overhead.

```
cat <<EOF | kubectl apply -n <NAMESPACE> -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-preload-ds
spec:
  selector:
    matchLabels:
      app: image-preloader
  template:
    metadata:
      labels:
        app: image-preloader
    spec:
      containers:
      - name: preloader-3-4-2
        image: public.ecr.aws/sagemaker/sagemaker-distribution:3.4.2-gpu
        command: ["sleep"]
        args: ["infinity"]
        resources:
          requests:
            cpu: 1m
            memory: 16Mi
          limits:
            cpu: 5m
            memory: 32Mi
      - name: preloader-3-3-2
        image: public.ecr.aws/sagemaker/sagemaker-distribution:3.3.2-gpu
        command: ["sleep"]
        args: ["infinity"]
        resources:
          requests:
            cpu: 1m
            memory: 16Mi
          limits:
            cpu: 5m
            memory: 32Mi
EOF
```

### How It Works
<a name="remote-ide-optimize-image-how"></a>
+ Each container references one image.
+ Kubernetes must download each image before starting the container.
+ Once the pod is running on every node, the images are cached locally.
+ Any workload using these images now starts much faster.

## Space default storage (EBS)
<a name="space-storage"></a>

The system uses the EBS CSI driver by default to provision EBS storage volumes for each workspace. SageMaker creates an EBS storage class for use with workspaces, and administrators can customize the default and maximum size of these volumes using template settings. Advanced users working with CLI tools can also customize the workspace's storage class, which allows the use of other storage classes, including classes configured with customer-managed KMS keys for EBS volumes.

Note that EBS volumes are bound to a particular AZ, which means workspaces can only be scheduled on nodes in the same AZ as their storage volume. This can lead to scheduling failures if cluster capacity exists but not in the correct AZ.
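
If AZ-bound volumes cause scheduling failures in your cluster, one common Kubernetes mitigation is a StorageClass that uses `WaitForFirstConsumer` volume binding, which delays volume provisioning until the pod is scheduled, so the volume is created in the pod's AZ. A sketch (the class name is illustrative, not the SageMaker-managed default class):

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: spaces-ebs-wait-for-consumer  # hypothetical name
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
```

This only helps with newly provisioned volumes; an existing volume remains bound to its AZ.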

## Additional storage
<a name="space-additional-storage"></a>

SageMaker Spaces supports attaching additional storage volumes such as Amazon EFS, FSx for Lustre, or S3 Mountpoint to your development spaces. This allows you to access shared datasets, collaborate on projects, or use high-performance storage for your workloads.

### Prerequisites
<a name="space-additional-storage-prereq"></a>

Before attaching additional storage to spaces, you must:

1. **Install the appropriate CSI driver add-on** via [EKS add-ons](https://docs.aws.amazon.com/eks/latest/userguide/workloads-add-ons-available-eks.html) (Amazon EFS CSI Driver, Amazon FSx for Lustre CSI Driver, or Mountpoint for Amazon S3 CSI Driver)

1. **Set up storage resources and PersistentVolumeClaims** following the CSI driver documentation for your specific storage type

1. **Ensure the PVC is available** in the same namespace where you plan to create your space

### Attaching storage to spaces
<a name="space-additional-storage-attach"></a>

Once you have a PersistentVolumeClaim configured, you can attach it to a space using either the HyperPod CLI or kubectl.

**HyperPod CLI**

```
hyp create hyp-space \
    --name my-space \
    --display-name "My Space with FSx" \
    --memory 8Gi \
    --volume name=shared-fsx,mountPath=/shared,persistentVolumeClaimName=my-fsx-pvc
```

**kubectl**

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-space
spec:
  displayName: "My Space with FSx"
  desiredStatus: Running
  volumes:
  - name: shared-fsx
    mountPath: /shared
    persistentVolumeClaimName: my-fsx-pvc
```

### Multiple volumes
<a name="space-additional-storage-multiple"></a>

You can attach multiple additional storage volumes to a single space by specifying multiple `--volume` flags with the CLI or multiple entries in the `volumes` array with kubectl.

**HyperPod CLI**

```
hyp create hyp-space \
    --name my-space \
    --display-name "My Space with Multiple Storage" \
    --memory 8Gi \
    --volume name=shared-efs,mountPath=/shared,persistentVolumeClaimName=my-efs-pvc \
    --volume name=datasets,mountPath=/datasets,persistentVolumeClaimName=my-s3-pvc
```

**kubectl**

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-space
spec:
  displayName: "My Space with Multiple Storage"
  desiredStatus: Running
  volumes:
  - name: shared-efs
    mountPath: /shared
    persistentVolumeClaimName: my-efs-pvc
  - name: datasets
    mountPath: /datasets
    persistentVolumeClaimName: my-s3-pvc
```

## Resource configuration
<a name="space-resource-configuration"></a>

SageMaker Spaces allows you to configure compute resources for your development environments, including CPU, memory, and GPU resources to match your workload requirements.

### GPU configuration
<a name="space-gpu-configuration"></a>

SageMaker Spaces supports both whole GPU allocation and GPU partitioning using NVIDIA Multi-Instance GPU (MIG) technology. This allows you to optimize GPU utilization for different types of machine learning workloads.

#### Whole GPU allocation
<a name="space-gpu-whole"></a>

**HyperPod CLI**

```
hyp create hyp-space \
    --name gpu-space \
    --display-name "GPU Development Space" \
    --image public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu \
    --memory 16Gi \
    --gpu 1 \
    --gpu-limit 1
```

**kubectl**

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: gpu-space
spec:
  displayName: "GPU Development Space"
  image: "public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu"
  desiredStatus: Running
  resources:
    requests:
      memory: "16Gi"
      nvidia.com/gpu: "1"
    limits:
      memory: "16Gi"
      nvidia.com/gpu: "1"
```

#### GPU partitioning (MIG)
<a name="space-gpu-mig"></a>

GPU partitioning using NVIDIA Multi-Instance GPU (MIG) technology allows you to partition a single GPU into smaller, isolated instances. Your HyperPod cluster must have GPU nodes that support MIG and have MIG profiles configured. For more information on setting up MIG on your HyperPod cluster, see [GPU partitioning using NVIDIA MIG](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-gpu-partitioning-setup.html).

**HyperPod CLI**

```
hyp create hyp-space \
    --name mig-space \
    --display-name "MIG GPU Space" \
    --image public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu \
    --memory 8Gi \
    --accelerator-partition-type mig-3g.20gb \
    --accelerator-partition-count 1
```

**kubectl**

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: mig-space
spec:
  displayName: "MIG GPU Space"
  image: "public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu"
  desiredStatus: Running
  resources:
    requests:
      memory: "8Gi"
      nvidia.com/mig-3g.20gb: "1"
    limits:
      memory: "8Gi"
      nvidia.com/mig-3g.20gb: "1"
```

## Lifecycle
<a name="space-lifecycle"></a>

Lifecycle configuration provides startup scripts that run when a workspace is created or started. These scripts allow administrators to customize the workspace environment during startup. They are bash scripts with a maximum size of 1 KB. If you need a larger setup script, we recommend adding the script to the container image and triggering it from the lifecycle configuration.
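
Because lifecycle scripts are capped at 1 KB, it can help to validate the size before applying a configuration. A minimal sketch (the helper is ours, and it assumes the limit means 1024 bytes):

```python
def fits_lifecycle_limit(script: str, limit_bytes: int = 1024) -> bool:
    """Return True if the script fits within the lifecycle size limit."""
    return len(script.encode("utf-8")) <= limit_bytes
```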

This functionality is implemented using [Kubernetes container lifecycle hooks](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/). Note that Kubernetes does not guarantee when the startup script runs relative to the container's entrypoint.

## Idle shutdown
<a name="space-idle-shutdown"></a>

Configure automatic shutdown of idle workspaces to optimize resource usage.

### Idle shutdown
<a name="space-idle-shutdown-spec"></a>

```
idleShutdown:
  enabled: true
  idleShutdownTimeoutMinutes: 30
  detection:
    httpGet:
      path: /api/idle
      port: 8888
      scheme: HTTP
```

### Parameters
<a name="space-idle-shutdown-parameter"></a>

**enabled** (boolean, required) - Enables or disables idle shutdown for the workspace.

**idleShutdownTimeoutMinutes** (integer, required) - Number of minutes of inactivity before the workspace shuts down. Minimum value is 1.

**detection** (object, required) - Defines how to detect workspace idle state.

**detection.httpGet** (object, optional) - HTTP endpoint configuration for idle detection. Uses Kubernetes HTTPGetAction specification.
+ **path** - HTTP path to request
+ **port** - Port number or name
+ **scheme** - HTTP or HTTPS (default: HTTP)
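
The timeout semantics can be illustrated with a small sketch: the workspace stops once the detection endpoint has reported it idle continuously for at least `idleShutdownTimeoutMinutes` (the function name is ours, for illustration only):

```python
from datetime import datetime, timedelta

def should_shut_down(idle_since: datetime, now: datetime,
                     timeout_minutes: int) -> bool:
    """True once the workspace has been idle for the full timeout."""
    return now - idle_since >= timedelta(minutes=timeout_minutes)
```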

### Configuration Locations
<a name="space-idle-shutdown-configure"></a>

**Workspace Configuration**

Define idle shutdown directly in the workspace specification:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-workspace
spec:
  displayName: "Development Workspace"
  image: jupyter/scipy-notebook:latest
  idleShutdown:
    enabled: true
    idleShutdownTimeoutMinutes: 30
    detection:
      httpGet:
        path: /api/idle
        port: 8888
```

**Template Configuration**

Define default idle shutdown behavior in a WorkspaceTemplate:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: WorkspaceTemplate
metadata:
  name: jupyter-template
spec:
  displayName: "Jupyter Template"
  defaultImage: jupyter/scipy-notebook:latest
  defaultIdleShutdown:
    enabled: true
    idleShutdownTimeoutMinutes: 30
    detection:
      httpGet:
        path: /api/idle
        port: 8888
  idleShutdownOverrides:
    allow: true
    minTimeoutMinutes: 60
    maxTimeoutMinutes: 240
```

### Template Inheritance and Overrides
<a name="space-idle-shutdown-inherit"></a>

Workspaces using a template automatically inherit the template's `defaultIdleShutdown` configuration. Workspaces can override this configuration if the template allows it.

**Override Policy**

Templates control override behavior through `idleShutdownOverrides`:

**allow** (boolean, default: true) - Whether workspaces can override the default idle shutdown configuration.

**minTimeoutMinutes** (integer, optional) - Minimum allowed timeout value for workspace overrides.

**maxTimeoutMinutes** (integer, optional) - Maximum allowed timeout value for workspace overrides.
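
The override policy amounts to a bounds check; a sketch of the logic (ours, not the controller's actual implementation):

```python
from typing import Optional

def override_allowed(timeout_minutes: int, allow: bool = True,
                     min_timeout: Optional[int] = None,
                     max_timeout: Optional[int] = None) -> bool:
    """Check a workspace timeout override against template policy."""
    if not allow:
        return False  # template locks the default configuration
    if min_timeout is not None and timeout_minutes < min_timeout:
        return False
    if max_timeout is not None and timeout_minutes > max_timeout:
        return False
    return True
```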

**Inheritance Example**

Workspace inherits template defaults:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-workspace
spec:
  displayName: "My Workspace"
  templateRef:
    name: jupyter-template
  # Inherits defaultIdleShutdown from template
```

**Override Example**

Workspace overrides template defaults:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-workspace
spec:
  displayName: "My Workspace"
  templateRef:
    name: jupyter-template
  idleShutdown:
    enabled: true
    idleShutdownTimeoutMinutes: 60  # Must be within template bounds
    detection:
      httpGet:
        path: /api/idle
        port: 8888
```

**Locked Configuration**

Prevent workspace overrides:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: WorkspaceTemplate
metadata:
  name: locked-template
spec:
  displayName: "Locked Template"
  defaultImage: jupyter/scipy-notebook:latest
  defaultIdleShutdown:
    enabled: true
    idleShutdownTimeoutMinutes: 30
    detection:
      httpGet:
        path: /api/idle
        port: 8888
  idleShutdownOverrides:
    allow: false  # Workspaces cannot override
```

### Behavior
<a name="space-idle-shutdown-behavior"></a>

When idle shutdown is enabled, the system periodically checks the workspace for activity using the configured HTTP endpoint. If the endpoint indicates the workspace is idle for the specified timeout duration, the workspace automatically stops. You can manually restart the workspace when needed.

## Template updates
<a name="customization-template-updates"></a>

Client tools such as kubectl and the HyperPod CLI and SDK can be used to manage Spaces within the EKS cluster. Administrators can provision Space Templates with default Space configurations, while Data Scientists can customize their integrated development environments without needing to understand the underlying Kubernetes complexity. For detailed usage instructions, see the [CLI and SDK documentation](https://sagemaker-hyperpod-cli.readthedocs.io/en/latest/index.html).

Administrators can perform CRUD operations on Space Templates, which serve as the base configurations when creating a Space. Data Scientists can perform CRUD operations on Spaces and override various parameters, including the Multi-Instance GPU profiles for specific compute nodes. They can start, stop, and connect to the Spaces via remote VSCode access and the Web UI. When a Space Template is updated, any subsequently created Space will be configured with the settings in the updated template. Compliance checks will be performed when existing Spaces are updated or started. If any settings are out of bounds or mismatched, the Spaces will fail to update or start.

## Using the hyp CLI and kubectl
<a name="customization-hyp-cli"></a>

Users can perform CRUD operations on templates with the HyperPod CLI:

```
### 1. Create a Space Template
hyp create hyp-space-template --file template.yaml

### 2. List Space Templates
hyp list hyp-space-template
hyp list hyp-space-template --output json

### 3. Describe a Space Template
hyp describe hyp-space-template --name my-template
hyp describe hyp-space-template --name my-template --output json

### 4. Update a Space Template
hyp update hyp-space-template --name my-template --file updated-template.yaml

### 5. Delete a Space Template
hyp delete hyp-space-template --name my-template
```

To create custom templates, you can use the system templates as a starting point. The following templates work for SMD-based images; they can be customized for whatever images admins allow.

Example custom JupyterLab template:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: WorkspaceTemplate
metadata:
  name: my-jupyter-template
  namespace: my-namespace
spec:
  displayName: "My Custom Jupyter Lab"
  description: "Custom Jupyter Lab with specific configurations"
  defaultImage: "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
  allowedImages:
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu"
  defaultResources:
    requests:
      cpu: "1"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "16Gi"
  primaryStorage:
    defaultSize: "10Gi"
    minSize: "5Gi"
    maxSize: "50Gi"
    defaultStorageClassName: "sagemaker-spaces-default-storage-class"
    defaultMountPath: "/home/sagemaker-user"
  defaultContainerConfig:
    command: ["/opt/amazon/sagemaker/workspace/bin/entrypoint-workspace-jupyterlab"]
  defaultPodSecurityContext:
    fsGroup: 1000
  defaultOwnershipType: "Public"
  defaultAccessStrategy:
    name: "hyperpod-access-strategy"
  allowSecondaryStorages: true
  appType: "jupyterlab"
```

Example custom Code Editor template:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: WorkspaceTemplate
metadata:
  name: my-code-editor-template
  namespace: my-namespace
spec:
  displayName: "My Custom Code Editor"
  description: "Custom Code Editor with specific configurations"
  defaultImage: "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
  allowedImages:
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-cpu"
    - "public.ecr.aws/sagemaker/sagemaker-distribution:latest-gpu"
  defaultResources:
    requests:
      cpu: "1"
      memory: "4Gi"
    limits:
      cpu: "4"
      memory: "16Gi"
  primaryStorage:
    defaultSize: "10Gi"
    minSize: "5Gi"
    maxSize: "50Gi"
    defaultStorageClassName: "sagemaker-spaces-default-storage-class"
    defaultMountPath: "/home/sagemaker-user"
  defaultContainerConfig:
    command: ["/opt/amazon/sagemaker/workspace/bin/entrypoint-workspace-code-editor"]
  defaultPodSecurityContext:
    fsGroup: 1000
  defaultOwnershipType: "Public"
  defaultAccessStrategy:
    name: "hyperpod-access-strategy"
  allowSecondaryStorages: true
  appType: "code-editor"
```

# Add users and set up service accounts
<a name="add-user"></a>

## Fine-grained access control - our recommendation
<a name="add-user-access-control"></a>

Users are differentiated based on their Kubernetes username, which is defined in their access entry. To ensure that two human users have distinct usernames, there are two options:

1. Recommended - Multiple human users can use the same role as long as each one has their own distinct session name that will persist between sessions. By default, Kubernetes usernames for IAM roles are in the format `arn:aws:sts::{ACCOUNT_ID}:assumed-role/{ROLE_NAME}/{SESSION_NAME}`. With this default, users will already be differentiated by session name. An Admin has a few ways to enforce unique session names per user.
   + SSO login - Users using SSO login will by default have a session name tied to their AWS username
   + Central credential vending service - Enterprise customers may have an internal credential vending service that users call to get credentials tied to their identity.
   + Role based enforcement - Require IAM users to set their `aws:username` as their role session name when they assume an IAM role in your AWS account. For details, see [Easily control naming individual IAM role sessions](https://aws.amazon.com/blogs/security/easily-control-naming-individual-iam-role-sessions/).

1. If two data scientists use different access entries (different IAM roles or users), they will always be counted as different users.
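
For the role-based enforcement option above, the linked blog post describes restricting the role's trust policy so the session name must match the caller's IAM username; a sketch (the principal ARN is a placeholder):

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<ACCOUNT_ID>:root" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringLike": { "sts:RoleSessionName": "${aws:username}" }
      }
    }
  ]
}
```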

**Creating access entry**

Required IAM policy for data scientist role:
+ `eks:DescribeCluster`

Required access entry policies
+ `AmazonSagemakerHyperpodSpacePolicy` - scoped to the namespace where the data scientist should create spaces
+ `AmazonSagemakerHyperpodSpaceTemplatePolicy` - scoped to the `jupyter-k8s-shared` namespace

## Private and Public spaces
<a name="add-user-spaces"></a>

We support two sharing patterns: “Public” and “OwnerOnly”. Both the “AccessType” and “OwnershipType” fields use these two values.
+ AccessType: Public spaces can be accessed by anyone with permissions in the namespace, while OwnerOnly spaces can be accessed only by the space creator and administrator users.
+ OwnershipType: Public spaces can be modified or deleted by anyone with permissions in the namespace, while OwnerOnly spaces can be modified or deleted only by the creator or an admin.

Admin users are defined as either of the following:

1. Members of the `system:masters` Kubernetes group

1. Members of the Kubernetes group defined in the `CLUSTER_ADMIN_GROUP` environment variable in the Helm chart

A user’s groups can be configured using EKS access entries. A space can be defined as `Public` or `OwnerOnly` by configuring the spec in the Workspace object:

```
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  labels:
    app.kubernetes.io/name: jupyter-k8s
  name: example-workspace
spec:
  displayName: "Example Workspace"
  image: "public.ecr.aws/sagemaker/sagemaker-distribution:3.4.2-cpu"
  desiredStatus: "Running"
  ownershipType: "Public"  # or "OwnerOnly"
  accessType: "Public"     # or "OwnerOnly"
  # more fields here
```

# Limits
<a name="ds-limits"></a>

Spaces run as pods on HyperPod EKS nodes with attached EBS volumes. The number of Spaces that can be deployed per node is constrained by AWS infrastructure limits.

**EBS Volume Limits per Node**

Reference: [https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/volume_limits.html](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/volume_limits.html)

EC2 nodes have a maximum number of EBS volumes that can be attached. Since each Space typically uses one EBS volume, this limits how many Spaces with dedicated EBS storage can run on a single node.

**Maximum Pods per HyperPod Node**

Reference: [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)

Each HyperPod instance type supports a maximum number of pods based on available IP addresses from the VPC CNI plugin. Since each Space runs as a pod, this directly caps the number of Spaces per node.

**Impact**

The effective limit for Spaces per node is whichever constraint is reached first. 
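As a rough sketch, the effective cap is simply the minimum of the two limits. The figures below are placeholders; look up the actual EBS attachment limit and max-pods value for your instance type.

```python
def max_spaces_per_node(ebs_volume_limit: int, max_pods: int, system_pods: int = 0) -> int:
    """Each Space runs as one pod with (typically) one dedicated EBS volume,
    so the effective cap is whichever limit is reached first."""
    return min(ebs_volume_limit, max_pods - system_pods)

# Placeholder figures: 26 attachable volumes, 58 max pods, 4 system pods
print(max_spaces_per_node(ebs_volume_limit=26, max_pods=58, system_pods=4))  # 26
```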

# Task governance for Interactive Spaces on HyperPod
<a name="task-governance"></a>

This section covers how to optimize your shared Amazon SageMaker HyperPod EKS clusters for Interactive Spaces workloads. You'll learn to configure Kueue's task governance features—including quota management, priority scheduling, and resource sharing policies—to ensure your development workloads run without interruption while maintaining fair allocation across your teams' training, evaluation, and batch processing activities.

## How Interactive Space management works
<a name="task-governance-how"></a>

To effectively manage Interactive Spaces in shared HyperPod EKS clusters, implement the following task governance strategies using Kueue's existing capabilities.

**Priority class configuration**

Define dedicated priority classes for Interactive Spaces with high weights (such as 100) to ensure development pods are admitted and scheduled before other task types. This configuration enables Interactive Spaces to preempt lower-priority jobs during cluster load, which is critical for maintaining uninterrupted development workflows.

**Quota sizing and allocation**

Reserve sufficient compute resources in your team's ClusterQueue to handle expected development workloads. During periods when development resources are idle, unused quota resources can be temporarily allocated to other teams' tasks. When development demand increases, these borrowed resources can be reclaimed to prioritize pending Interactive Space pods.

**Resource sharing strategies**

Choose between two quota sharing approaches based on your requirements:

*Strict Resource Control*: Disable quota lending and borrowing to guarantee reserved compute capacity is always available for your Interactive Spaces. This approach requires sizing quotas large enough to independently handle peak development demand and may result in idle nodes during low-usage periods.

*Flexible Resource Sharing*: Enable quota lending to allow other teams to utilize idle development resources when needed. However, disable borrowing to ensure Interactive Spaces never run on borrowed, reclaimable resources that could lead to unexpected evictions.

**Intra-team preemption**

Enable intra-team preemption when running mixed workloads (training, evaluation, and Interactive Spaces) under the same quota. This allows Kueue to preempt lower-priority jobs within your team to accommodate high-priority Interactive Space pods, ensuring development work can proceed without depending on external quota borrowing.

## Sample Interactive Space setup
<a name="task-governance-space-setup"></a>

The following example shows how Kueue manages compute resources for Interactive Spaces in a shared Amazon SageMaker HyperPod cluster.

**Cluster configuration and policy setup**

Your cluster has the following configuration:
+ *Team Alpha (Dev Team)*: 8 CPU quota for Interactive Spaces
+ *Team Beta (ML Team)*: 16 CPU quota for training and evaluation
+ *Team Gamma (Research)*: 6 CPU quota for experimentation
+ *Static provisioning*: No autoscaling
+ *Total capacity*: 30 CPUs

The shared CPU pool uses this priority policy:
+ *Interactive Spaces*: Priority 100
+ *Training*: Priority 75
+ *Evaluation*: Priority 50
+ *Batch Processing*: Priority 25

Kueue enforces team quotas and priority classes, with preemption enabled and borrowing disabled for the dev team.

**Initial state: Normal cluster utilization**

In normal operations:
+ *Team Alpha*: Runs 6 Interactive Spaces using 6 CPUs, 2 CPUs idle
+ *Team Beta*: Runs training jobs (12 CPUs) and evaluation (4 CPUs) within its 16 CPU quota
+ *Team Gamma*: Runs research workloads on all 6 CPUs
+ *Resource sharing*: Team Beta borrows Team Alpha's 2 idle CPUs for additional training

**Development spike: Team Alpha requires additional resources**

When Team Alpha's developers need to scale up development work, additional Interactive Space pods require 4 more CPUs. Kueue detects that the new pods are:
+ Within Team Alpha's namespace
+ Priority 100 (Interactive Spaces)
+ Pending admission due to quota constraints

**Kueue's response process**

Kueue follows a three-step process to allocate resources:

1. **Quota check**

   Question: Does Team Alpha have unused quota?
   + *Current usage*: 6 CPUs used, 2 CPUs available
   + *New requirement*: 4 CPUs needed
   + *Result*: Insufficient quota → Proceed to Step 2

1. **Self-preemption within Team Alpha**

   Question: Can lower-priority Team Alpha jobs be preempted?
   + *Available targets*: No lower-priority jobs in Team Alpha
   + *Result*: No preemption possible → Proceed to Step 3

1. **Reclaim borrowed resources**

   Question: Are Team Alpha resources being borrowed by other teams?
   + *Borrowed resources*: Team Beta using 2 CPUs from Team Alpha
   + *Action*: Kueue evicts Team Beta's borrowed training pods, freeing 2 CPUs
   + *Remaining need*: Still need 2 more CPUs → Interactive Spaces remain in NotAdmitted state until resources become available

This approach prioritizes Interactive Spaces while maintaining team quota boundaries and preventing development work from running on unstable borrowed resources.
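The three-step response can be sketched as plain decision logic. This illustrates the behavior described above, not the Kueue API; the function and argument names are hypothetical.

```python
def admit_spaces(quota, used, needed, preemptible, lent):
    """Decide how many CPUs of pending Interactive Space pods get admitted."""
    freed = 0
    # Step 1: unused quota (Team Alpha's 2 idle CPUs are lent out, so none are free)
    # Step 2: preempt lower-priority jobs within the team
    freed += preemptible
    # Step 3: reclaim this team's CPUs currently lent to other teams
    freed += lent
    # Borrowing is disabled, so admission is also capped by remaining quota
    headroom = max(quota - used, 0)
    admitted = min(freed, headroom, needed)
    return admitted, needed - admitted  # (admitted CPUs, CPUs still pending)

# Team Alpha's spike: quota 8, 6 CPUs in use, 4 more needed,
# nothing preemptible in-team, 2 CPUs lent to Team Beta.
print(admit_spaces(quota=8, used=6, needed=4, preemptible=0, lent=2))  # (2, 2)
```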

# Observability
<a name="observability"></a>

## Standard Kubernetes Monitoring
<a name="observability-monitor"></a>

You can monitor Spaces using standard Kubernetes tools such as `kubectl describe` and `kubectl logs`.

**Monitoring Space Status**

```
# List all Spaces with status
kubectl get workspace -A

# Get detailed information about a specific Space
kubectl describe workspace <workspace-name>
```

**Viewing Space Logs**

```
# View workspace container logs
kubectl logs -l workspace.jupyter.org/workspace-name=<workspace-name> -c workspace

# View SSM agent sidecar logs (for remote IDE connectivity)
kubectl logs -l workspace.jupyter.org/workspace-name=<workspace-name> -c ssm-agent-sidecar

# Follow logs in real-time
kubectl logs -l workspace.jupyter.org/workspace-name=<workspace-name> -c workspace -f
```

**Understanding Space Conditions**

Spaces report four condition types in their status:
+ **Available**: `True` when the Space is ready for use. All required resources (pods, services, storage) are running and healthy.
+ **Progressing**: `True` when the Space is being created, updated, or reconciled. Transitions to `False` once stable.
+ **Degraded**: `True` when errors are detected with the Space resources. Check the condition message for details.
+ **Stopped**: `True` when the Space desired status is set to `Stopped`. The pods are terminated but storage and configuration are preserved.
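When scripting health checks, these conditions can be read straight from the Workspace status. The sketch below parses a sample payload shaped like `kubectl get workspace <name> -o json` output; the payload itself is illustrative.

```python
import json

# Illustrative status payload shaped like the output of
# `kubectl get workspace <name> -o json`.
status_json = '''
{"status": {"conditions": [
  {"type": "Available",   "status": "True"},
  {"type": "Progressing", "status": "False"},
  {"type": "Degraded",    "status": "False"},
  {"type": "Stopped",     "status": "False"}
]}}
'''

def condition_true(status: dict, cond_type: str) -> bool:
    """Return True when the named condition reports status "True"."""
    return any(c["type"] == cond_type and c["status"] == "True"
               for c in status.get("conditions", []))

status = json.loads(status_json)["status"]
print(condition_true(status, "Available"))  # True for a healthy, running Space
```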

## CloudWatch Logs Integration
<a name="observability-cw"></a>

You can install the CloudWatch logging add-on to send Space logs to Amazon CloudWatch Logs for centralized log management and retention. This enables log aggregation across multiple clusters and integration with CloudWatch Logs Insights for querying and analysis. With this add-on installed, all of the `kubectl` logs shown above are also queryable in CloudWatch.

**Reference**: [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci.html)

## HyperPod Observability Add-on
<a name="observability-addon"></a>

The SageMaker HyperPod observability add-on provides comprehensive dashboards for monitoring Space resource utilization. After installing the add-on, you can view Space memory and CPU usage in the **Tasks** tab of the HyperPod console, which displays metrics in Amazon Managed Grafana dashboards.

**Reference**: [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-observability-addon.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-observability-addon.html)

**Key metrics available:**
+ CPU and memory utilization per Space
+ GPU metrics (if applicable)

# Create and manage spaces
<a name="create-manage-spaces"></a>

Data scientists can list the spaces they have access to, create a space from one of the templates, update a space to change its image, file system, and other configuration attributes, and delete a space. As a prerequisite, install the HyperPod CLI or use `kubectl` to create and manage spaces. For details on the HyperPod CLI, see the [HyperPod CLI README](https://github.com/aws/sagemaker-hyperpod-cli/blob/main/README.md#space). To use `kubectl` commands, see [Install kubectl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html).

## Create space
<a name="create-manage-spaces-create"></a>

**HyperPod CLI**

Create a Jupyter space

```
hyp create hyp-space \
    --name myspace \
    --display-name "My Space" \
    --memory 8Gi \
    --template-ref name=sagemaker-jupyter-template,namespace=jupyter-k8s-system
```

Create a Code Editor space

```
hyp create hyp-space \
    --name myspace \
    --display-name "My Space" \
    --memory 8Gi \
    --template-ref name=sagemaker-code-editor-template,namespace=jupyter-k8s-system
```

**kubectl**

```
kubectl apply -f - <<EOF
apiVersion: workspace.jupyter.org/v1alpha1
kind: Workspace
metadata:
  name: my-space
spec:
  displayName: my-space
  desiredStatus: Running
EOF
```

or you can simply apply the yaml file

```
kubectl apply -f my-workspace.yaml
```

## List spaces
<a name="create-manage-spaces-list"></a>

**HyperPod CLI**

```
hyp list hyp-space
```

**kubectl**

```
kubectl get workspaces -n <workspace-namespace> 
```

## Describe a space
<a name="create-manage-spaces-describe"></a>

**HyperPod CLI**

```
hyp describe hyp-space --name myspace
```

**kubectl**

```
# Basic Status reporting
kubectl get workspace my-workspace -n <workspace-namespace>

# Enhanced Workspace Information Retrieval 
kubectl get workspace my-workspace -n <workspace-namespace> -o wide

# Complete Workspace Information Retrieval
kubectl get workspace my-workspace -n <workspace-namespace> -o json
kubectl get workspace my-workspace -n <workspace-namespace> -o yaml
```

## Update a space
<a name="create-manage-spaces-update"></a>

**HyperPod CLI**

```
hyp update hyp-space \
    --name myspace \
    --display-name "Updated My Space"
```

**kubectl**

Update the original workspace YAML file as needed, then re-apply it. Make sure the metadata name is not modified. You can also use these `kubectl` commands to modify fields without re-applying the entire workspace YAML:

```
# Open the Workspace in your default editor and modify it
kubectl edit workspace <workspace-name> -n <workspace-namespace>

# Patch a Workspace
kubectl patch workspace <workspace-name> --type='merge' -p \
    '{"spec":{"<field name>":"<desired value>"}}' -n <workspace-namespace>
```

## Start/Stop a space
<a name="create-manage-spaces-stop"></a>

**HyperPod CLI**

```
hyp start hyp-space --name myspace
hyp stop hyp-space --name myspace
```

**kubectl**

You can start or stop a space by updating the `desiredStatus` field in the Workspace.

```
# Start a Workspace
kubectl patch workspace <workspace-name> --type='merge' -p \
    '{"spec":{"desiredStatus":"Running"}}' -n <workspace-namespace>
    
# Stop a Workspace
kubectl patch workspace <workspace-name> --type='merge' -p \
    '{"spec":{"desiredStatus":"Stopped"}}' -n <workspace-namespace>
```

## Get Logs
<a name="create-manage-spaces-log"></a>

**HyperPod CLI**

```
hyp get-logs hyp-space --name myspace
```

**kubectl**

```
# Check Pod Logs
kubectl logs -l workspace.jupyter.org/workspace-name=<workspace-metadata-name>

# Check Pod Events
kubectl describe pod -l workspace.jupyter.org/workspace-name=<workspace-metadata-name>

# Check Operator Logs
kubectl logs -n jupyter-k8s-system deployment/jupyter-k8s-controller-manager
```

## Delete a space
<a name="create-manage-spaces-delete"></a>

**HyperPod CLI**

```
hyp delete hyp-space --name myspace
```

**kubectl**

```
# Delete a Workspace
kubectl delete workspace <workspace-name> -n <namespace>
```

# Web browser access
<a name="browser-access"></a>

Web UI access allows you to connect directly to development spaces running on your SageMaker HyperPod cluster through a secure web browser interface. This provides immediate access to Jupyter Lab and other web-based development environments without requiring local software installation.

## Prerequisites
<a name="browser-access-prereq"></a>

Before setting up web UI access, ensure you have completed the following:
+ *SageMaker Spaces add-on installation*: Follow the [SageMaker Spaces add-on installation](https://docs.aws.amazon.com/sagemaker/latest/dg/operator-install.html) instructions and enable web UI access during installation
+ *User access to EKS cluster*: Users need an EKS access entry configured with appropriate permissions. See [Add users and set up service accounts](https://docs.aws.amazon.com/sagemaker/latest/dg/add-user.html) for EKS access entry setup details
+ *Development spaces*: Create and start development spaces on your HyperPod cluster
+ *kubectl access*: Ensure kubectl is configured to access your EKS cluster

## Generate Web UI Access URL
<a name="browser-access-url"></a>

**Using HyperPod CLI**

If you have the HyperPod CLI installed, you can use this simplified command:

```
hyp create hyp-space-access --name <space-name> --connection-type web-ui
```

**Using kubectl**

You can also use the `kubectl` command line to create a connection request.

```
kubectl create -f - -o yaml <<EOF
apiVersion: connection.workspace.jupyter.org/v1alpha1
kind: WorkspaceConnection
metadata:
  namespace: <space-namespace>
spec:
  workspaceName: <space-name>
  workspaceConnectionType: web-ui
EOF
```

The URL is available in the `status.workspaceConnectionUrl` field of the command output.
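To capture the URL in a script, you can run the create command with `-o json` and parse the result. The snippet below works on a sample payload with the same `status.workspaceConnectionUrl` field; the URL shown is a placeholder.

```python
import json

# Placeholder output of the WorkspaceConnection create command run with -o json
connection_json = '{"status": {"workspaceConnectionUrl": "https://example.com/session"}}'

url = json.loads(connection_json)["status"]["workspaceConnectionUrl"]
print(url)  # https://example.com/session
```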

## Accessing Your Development Space
<a name="browser-access-develop"></a>

1. *Generate the web UI URL* using one of the methods above

1. *Copy the URL* from the response

1. *Open the URL* in your web browser

1. *Access your development environment* through the web interface

## Supported Development Environments
<a name="browser-access-develop-env"></a>

The web UI provides access to:
+ *Jupyter Lab*
+ *Code Editor*

## Troubleshooting
<a name="browser-access-troubleshooting"></a>

**Cannot generate access URLs**

Check the following:
+ SageMaker Spaces add-on is running: `kubectl get pods -n sagemaker-spaces-system`
+ Development space is running and healthy
+ User has appropriate EKS Access Entry permissions

# Remote access to SageMaker Spaces
<a name="vscode-access"></a>

Remote access allows you to connect your local Visual Studio Code directly to development spaces running on your SageMaker HyperPod cluster. Remote connections use SSM to establish secure, encrypted tunnels between your local machine and the development spaces.

## Prerequisites
<a name="vscode-access-prereq"></a>

Before setting up remote access, ensure you have completed the following:
+ *SageMaker Spaces add-on installation*: Follow [SageMaker Spaces add-on installation](https://docs.aws.amazon.com/sagemaker/latest/dg/operator-install.html) and enable remote access during installation (either Quick install or Custom install with remote access configuration enabled).
+ *User access to EKS cluster*: Users need an EKS access entry configured with appropriate permissions. See [Add users and set up service accounts](https://docs.aws.amazon.com/sagemaker/latest/dg/add-user.html) for EKS access entry setup details
+ *Development spaces*: Create and start development spaces on your HyperPod cluster
+ *kubectl access*: Ensure kubectl is configured to access your EKS cluster

## Generate VS Code remote connection
<a name="vscode-access-remote"></a>

### Using HyperPod CLI
<a name="vscode-access-remote-cli"></a>

If you have the HyperPod CLI installed, you can use this simplified command:

```
hyp create hyp-space-access --name <space-name> --connection-type vscode-remote
```

### Using kubectl
<a name="vscode-access-remote-kubectl"></a>

You can also use the `kubectl` command line to create a connection request.

```
kubectl create -f - -o yaml <<EOF
apiVersion: connection.workspace.jupyter.org/v1alpha1
kind: WorkspaceConnection
metadata:
  namespace: <space-namespace>
spec:
  workspaceName: <space-name>
  workspaceConnectionType: vscode-remote
EOF
```

The URL is available in the `status.workspaceConnectionUrl` field of the command output.

## Connecting with VS Code
<a name="vscode-access-remote-vscode"></a>

1. Generate the VS Code connection URL using one of the methods above

1. Copy the VS Code URL from the response

1. Click the URL or paste it into your browser

1. VS Code will prompt to open the remote connection

1. Confirm the connection to establish the remote development environment

## Supported Development Environments
<a name="vscode-access-remote-dev-env"></a>

Remote access provides connectivity to:
+ *Jupyter Lab*
+ *Code Editor*

## Troubleshooting
<a name="troubleshooting"></a>

**Cannot generate connection URLs**

*Check the following:*
+ SageMaker Spaces add-on is running: `kubectl get pods -n sagemaker-spaces-system`
+ Development space is running and healthy
+ Remote access was enabled during add-on installation
+ User has appropriate EKS Access Entry permissions

# Train and deploy models with HyperPod CLI and SDK
<a name="getting-started-hyperpod-training-deploying-models"></a>

Amazon SageMaker HyperPod helps you train and deploy machine learning models at scale. The AWS HyperPod CLI is a unified command-line interface that simplifies machine learning (ML) workflows on AWS. It abstracts infrastructure complexities and provides a streamlined experience for submitting, monitoring, and managing ML training jobs. The CLI is designed specifically for data scientists and ML engineers who want to focus on model development rather than infrastructure management. This topic walks you through three key scenarios: training a PyTorch model, deploying a custom model using trained artifacts, and deploying a JumpStart model. Designed for first-time users, this concise tutorial ensures you can set up, train, and deploy models effortlessly using either the HyperPod CLI or the SDK. The handshake process between training and inference helps you manage model artifacts effectively. 

## Prerequisites
<a name="prerequisites"></a>

Before you begin using Amazon SageMaker HyperPod, make sure you have:
+ An AWS account with access to Amazon SageMaker HyperPod
+ Python 3.9, 3.10, or 3.11 installed
+ AWS CLI configured with appropriate credentials

## Install the HyperPod CLI and SDK
<a name="install-cli-sdk"></a>

Install the required package to access the CLI and SDK:

```
pip install sagemaker-hyperpod
```

This command sets up the tools needed to interact with HyperPod clusters.

## Configure your cluster context
<a name="configure-cluster"></a>

HyperPod operates on clusters optimized for machine learning. Start by listing available clusters to select one for your tasks.

1. List all available clusters:

   ```
   hyp list-cluster
   ```

1. Choose and set your active cluster:

   ```
   hyp set-cluster-context your-eks-cluster-name
   ```

1. Verify the configuration:

   ```
   hyp get-cluster-context
   ```

**Note**  
All subsequent commands target the cluster you've set as your context.

## Choose your scenario
<a name="choose-scenario"></a>

For detailed instructions on each scenario, see the topics below:

**Topics**
+ [Prerequisites](#prerequisites)
+ [Install the HyperPod CLI and SDK](#install-cli-sdk)
+ [Configure your cluster context](#configure-cluster)
+ [Choose your scenario](#choose-scenario)
+ [Train a PyTorch model](train-models-with-hyperpod.md)
+ [Deploy a custom model](deploy-trained-model.md)
+ [Deploy a JumpStart model](deploy-jumpstart-model.md)

# Train a PyTorch model
<a name="train-models-with-hyperpod"></a>

This topic walks you through the process of training a PyTorch model using HyperPod.

In this scenario, let's train a PyTorch model using the `hyp-pytorch-job` template, which simplifies job creation by exposing commonly used parameters. The model artifacts will be stored in an S3 bucket for later use in inference. However, this is optional, and you can choose your preferred storage location.

## Create a training job
<a name="create-training-job"></a>

You can train the model using either the CLI or Python SDK.

### Using the CLI
<a name="using-cli"></a>

Create a training job with the following command:

```
hyp create hyp-pytorch-job \
    --version 1.0 \
    --job-name test-pytorch-job \
    --image pytorch/pytorch:latest \
    --command '["python", "train.py"]' \
    --args '["--epochs", "10", "--batch-size", "32"]' \
    --environment '{"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"}' \
    --pull-policy "IfNotPresent" \
    --instance-type ml.p4d.24xlarge \
    --tasks-per-node 8 \
    --label-selector '{"accelerator": "nvidia", "network": "efa"}' \
    --deep-health-check-passed-nodes-only true \
    --scheduler-type "kueue" \
    --queue-name "training-queue" \
    --priority "high" \
    --max-retry 3 \
    --volumes '["data-vol", "model-vol", "checkpoint-vol"]' \
    --persistent-volume-claims '["shared-data-pvc", "model-registry-pvc"]' \
    --output-s3-uri s3://my-bucket/model-artifacts
```

**Key required parameters explained**:
+ `--job-name`: Unique identifier for your training job
+ `--image`: Docker image containing your training environment

This command starts a training job named `test-pytorch-job`. The `--output-s3-uri` specifies where the trained model artifacts will be stored, for example, `s3://my-bucket/model-artifacts`. Note this location, as you’ll need it for deploying the custom model.

### Using the Python SDK
<a name="using-python-sdk"></a>

For programmatic control, use the SDK. Create a Python script to launch the same training job.

```
from sagemaker.hyperpod import HyperPodPytorchJob
from sagemaker.hyperpod.job import (
    ReplicaSpec, Template, Spec, Container, Resources, RunPolicy, Metadata
)

# Define job specifications
nproc_per_node = "1"  # Number of processes per node

replica_specs = [
    ReplicaSpec(
        name="pod",  # Replica name
        template=Template(
            spec=Spec(
                containers=[
                    Container(
                        name="container-name",  # Container name
                        # Training image
                        image="448049793756.dkr.ecr.us-west-2.amazonaws.com/ptjob:mnist",
                        image_pull_policy="Always",  # Always pull the image
                        resources=Resources(
                            requests={"nvidia.com/gpu": "0"},  # No GPUs requested
                            limits={"nvidia.com/gpu": "0"},    # No GPU limit
                        ),
                        command=["python", "train.py"],  # Command to run
                        args=["--epochs", "10", "--batch-size", "32"],  # Script arguments
                    )
                ]
            )
        ),
    )
]

# Keep pods after completion
run_policy = RunPolicy(clean_pod_policy="None")

# Create and start the PyTorch job
pytorch_job = HyperPodPytorchJob(
    metadata=Metadata(name="demo"),  # Job name
    nproc_per_node=nproc_per_node,   # Processes per node
    replica_specs=replica_specs,     # Replica specifications
    run_policy=run_policy,           # Run policy
    output_s3_uri="s3://my-bucket/model-artifacts",  # S3 location for artifacts
)

# Launch the job
pytorch_job.create()
```

## Monitor your training job
<a name="monitor-training-job"></a>

Monitor your job's progress with these commands:

### Using the CLI
<a name="monitor-cli"></a>

```
# Check job status
hyp list hyp-pytorch-job

# Get detailed information
hyp describe hyp-pytorch-job --job-name test-pytorch-job

# View logs
hyp get-logs hyp-pytorch-job \
    --pod-name test-pytorch-job-pod-0 \
    --job-name test-pytorch-job
```

**Note**: Training time varies based on model complexity and instance type. Monitor the logs to track progress.

These commands help you verify the job’s status and troubleshoot issues. Once the job completes successfully, the model artifacts are saved to `s3://my-bucket/model-artifacts`.

### Using the Python SDK
<a name="monitor-python-sdk"></a>

Add the following code to your Python script:

```
print("List all pods created for this job:")
print(pytorch_job.list_pods())

print("Check the logs from pod0:")
print(pytorch_job.get_logs_from_pod(pod_name="demo-pod-0"))

print("List all HyperPodPytorchJobs:")
print(HyperPodPytorchJob.list())

print("Describe job:")
print(HyperPodPytorchJob.get(name="demo").model_dump())

pytorch_job.refresh()
print(pytorch_job.status.model_dump())
```

## Next steps
<a name="next-steps"></a>

After training, the model artifacts are stored in the S3 bucket you specified (`s3://my-bucket/model-artifacts`). You can use these artifacts to deploy a model. Currently, you must manually manage the transition from training to inference. This involves:
+ **Locating artifacts**: Check the S3 bucket (`s3://my-bucket/model-artifacts`) to confirm the trained model files are present.
+ **Recording the path**: Note the exact S3 path (e.g., `s3://my-bucket/model-artifacts/test-pytorch-job/model.tar.gz`) for use in the inference setup.
+ **Referencing in deployment**: Provide this S3 path when configuring the custom endpoint to ensure the correct model is loaded.
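As a quick sanity check before deployment, you can compose and verify the artifact URI. The helper below is illustrative and uses the example bucket and job name from this walkthrough.

```python
def artifact_uri(bucket: str, prefix: str, job_name: str,
                 filename: str = "model.tar.gz") -> str:
    """Compose the S3 path where a training job's packaged artifacts land."""
    return f"s3://{bucket}/{prefix}/{job_name}/{filename}"

uri = artifact_uri("my-bucket", "model-artifacts", "test-pytorch-job")
print(uri)  # s3://my-bucket/model-artifacts/test-pytorch-job/model.tar.gz
```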

# Deploy a custom model
<a name="deploy-trained-model"></a>

After training completes, deploy your model for inference. You can deploy a custom model using either the CLI or the SDK.

## Locate your model artifacts
<a name="locate-model-artifacts"></a>
+ **Check your S3 bucket**: Verify that model artifacts are saved at `s3://my-bucket/model-artifacts/`
+ **Note the exact path**: You'll need the full path (for example, `s3://my-bucket/model-artifacts/test-pytorch-job/model.tar.gz`)

## Deploy using the CLI
<a name="deploy-using-cli"></a>

Run the following command to deploy your custom model:

```
hyp create hyp-custom-endpoint \
    --version 1.0 \
    --env '{"HF_MODEL_ID":"/opt/ml/model", "SAGEMAKER_PROGRAM":"inference.py"}' \
    --model-source-type s3 \
    --model-location test-pytorch-job \
    --s3-bucket-name my-bucket \
    --s3-region us-east-2 \
    --prefetch-enabled true \
    --image-uri 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:latest \
    --model-volume-mount-name model-weights \
    --container-port 8080 \
    --resources-requests '{"cpu": "30000m", "nvidia.com/gpu": 1, "memory": "100Gi"}' \
    --resources-limits '{"nvidia.com/gpu": 1}' \
    --tls-output-s3-uri s3://<bucket_name> \
    --instance-type ml.g5.8xlarge \
    --endpoint-name endpoint-custom-pytorch \
    --model-name pytorch-custom-model
```

This command deploys the trained model as an endpoint named `endpoint-custom-pytorch`. The `--model-location` references the artifact path from the training job.

## Deploy using the Python SDK
<a name="deploy-using-sdk"></a>

Create a Python script with the following content:

```
from sagemaker.hyperpod.inference.config.hp_custom_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig, EnvironmentVariables
from sagemaker.hyperpod.inference.hp_custom_endpoint import HPCustomEndpoint

model = Model(
    model_source_type="s3",
    model_location="test-pytorch-job",
    s3_bucket_name="my-bucket",
    s3_region="us-east-2",
    prefetch_enabled=True
)

server = Server(
    instance_type="ml.g5.8xlarge",
    image_uri="763104351884.dkr.ecr.us-east-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.3.1-gpu-py311-cu124-ubuntu22.04-v2.0",
    container_port=8080,
    model_volume_mount_name="model-weights"
)

resources = {
    "requests": {"cpu": "30000m", "nvidia.com/gpu": 1, "memory": "100Gi"},
    "limits": {"nvidia.com/gpu": 1}
}

env = EnvironmentVariables(
    HF_MODEL_ID="/opt/ml/model",
    SAGEMAKER_PROGRAM="inference.py",
    SAGEMAKER_SUBMIT_DIRECTORY="/opt/ml/model/code",
    MODEL_CACHE_ROOT="/opt/ml/model",
    SAGEMAKER_ENV="1"
)

endpoint_name = SageMakerEndpoint(name="endpoint-custom-pytorch")

tls_config = TlsConfig(tls_certificate_output_s3_uri="s3://<bucket_name>")

custom_endpoint = HPCustomEndpoint(
    model=model,
    server=server,
    resources=resources,
    environment=env,
    sage_maker_endpoint=endpoint_name,
    tls_config=tls_config
)

custom_endpoint.create()
```

## Invoke the endpoint
<a name="invoke-endpoint"></a>

### Using the CLI
<a name="invoke-using-cli"></a>

Test the endpoint with a sample input:

```
hyp invoke hyp-custom-endpoint \
    --endpoint-name endpoint-custom-pytorch \
    --body '{"inputs":"What is the capital of USA?"}'
```

This returns the model’s response, such as “The capital of the USA is Washington, D.C.”

### Using the SDK
<a name="invoke-using-sdk"></a>

Add the following code to your Python script:

```
data = '{"inputs":"What is the capital of USA?"}'
response = custom_endpoint.invoke(body=data).body.read()
print(response)
```

## Manage the endpoint
<a name="manage-endpoint"></a>

### Using the CLI
<a name="manage-using-cli"></a>

List and inspect the endpoint:

```
hyp list hyp-custom-endpoint
hyp get hyp-custom-endpoint --name endpoint-custom-pytorch
```

### Using the SDK
<a name="manage-using-sdk"></a>

Add the following code to your Python script:

```
logs = custom_endpoint.get_logs()
print(logs)
```

## Clean up resources
<a name="cleanup-resources"></a>

When you're done, delete the endpoint to avoid unnecessary costs.

### Using the CLI
<a name="cleanup-using-cli"></a>

```
hyp delete hyp-custom-endpoint --name endpoint-custom-pytorch
```

### Using the SDK
<a name="cleanup-using-sdk"></a>

```
custom_endpoint.delete()
```

## Next steps
<a name="next-steps"></a>

You've successfully deployed and tested a custom model using SageMaker HyperPod. You can now use this endpoint for inference in your applications.

# Deploy a JumpStart model
<a name="deploy-jumpstart-model"></a>

You can deploy a pre-trained JumpStart model for inference using either the CLI or the SDK.

## Using the CLI
<a name="deploy-jumpstart-cli"></a>

Run the following command to deploy a JumpStart model:

```
hyp create hyp-jumpstart-endpoint \
  --version 1.0 \
  --model-id deepseek-llm-r1-distill-qwen-1-5b \
  --instance-type ml.g5.8xlarge \
  --endpoint-name endpoint-test-jscli
```

## Using the SDK
<a name="deploy-jumpstart-sdk"></a>

Create a Python script with the following content:

```
from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import Model, Server, SageMakerEndpoint, TlsConfig
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint

model=Model(
    model_id='deepseek-llm-r1-distill-qwen-1-5b'
)

server=Server(
    instance_type='ml.g5.8xlarge',
)

endpoint_name=SageMakerEndpoint(name='<endpoint-name>')

# create the endpoint from the spec
js_endpoint=HPJumpStartEndpoint(
    model=model,
    server=server,
    sage_maker_endpoint=endpoint_name
)
js_endpoint.create()
```

## Invoke the endpoint
<a name="invoke-jumpstart-endpoint"></a>

### Using the CLI
<a name="invoke-jumpstart-cli"></a>

Test the endpoint with a sample input:

```
hyp invoke hyp-jumpstart-endpoint \
    --endpoint-name endpoint-jumpstart \
    --body '{"inputs":"What is the capital of USA?"}'
```

### Using the SDK
<a name="invoke-jumpstart-sdk"></a>

Add the following code to your Python script:

```
data = '{"inputs":"What is the capital of USA?"}'
response = js_endpoint.invoke(body=data).body.read()
print(response)
```

## Manage the endpoint
<a name="manage-jumpstart-endpoint"></a>

### Using the CLI
<a name="manage-jumpstart-cli"></a>

List and inspect the endpoint:

```
hyp list hyp-jumpstart-endpoint
hyp get hyp-jumpstart-endpoint --name endpoint-jumpstart
```

### Using the SDK
<a name="manage-jumpstart-sdk"></a>

Add the following code to your Python script:

```
endpoint_iterator = HPJumpStartEndpoint.list()
for endpoint in endpoint_iterator:
    print(endpoint.name, endpoint.status)

logs = js_endpoint.get_logs()
print(logs)
```

## Clean up resources
<a name="cleanup-jumpstart-resources"></a>

When you're done, delete the endpoint to avoid unnecessary costs.

### Using the CLI
<a name="cleanup-jumpstart-cli"></a>

```
hyp delete hyp-jumpstart-endpoint --name endpoint-jumpstart
```

### Using the SDK
<a name="cleanup-jumpstart-sdk"></a>

```
js_endpoint.delete()
```

## Next steps
<a name="jumpstart-next-steps"></a>

Now that you've trained a PyTorch model, deployed it as a custom endpoint, and deployed a JumpStart model using HyperPod's CLI and SDK, explore advanced features:
+ **Multi-node training**: Scale training across multiple instances
+ **Custom containers**: Build specialized training environments
+ **Integration with SageMaker Pipelines**: Automate your ML workflows
+ **Advanced monitoring**: Set up custom metrics and alerts

For more examples and advanced configurations, visit the [SageMaker HyperPod GitHub repository](https://github.com/aws/amazon-sagemaker-examples).

# Running jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-run-jobs"></a>

The following topics provide procedures and examples of accessing compute nodes and running ML workloads on provisioned SageMaker HyperPod clusters orchestrated with Amazon EKS. Depending on how you have set up the environment on your HyperPod cluster, there are many ways to run ML workloads on HyperPod clusters.

**Note**  
When running jobs via the SageMaker HyperPod CLI or kubectl, HyperPod can track compute utilization (GPU/CPU hours) across namespaces (teams). These metrics power usage reports, which provide:  
+ Visibility into allocated versus borrowed resource consumption
+ Team-level resource utilization for auditing (up to 180 days)
+ Cost attribution aligned with Task Governance policies

To use usage reports, you must install the usage report infrastructure. We strongly recommend configuring [Task Governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md) to enforce compute quotas and enable granular cost attribution.  
For more information about setting up and generating usage reports, see [Reporting Compute Usage in HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-usage-reporting.html).

**Tip**  
For a hands-on experience and guidance on how to set up and use a SageMaker HyperPod cluster orchestrated with Amazon EKS, we recommend taking this [Amazon EKS Support in SageMaker HyperPod](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e) workshop.

Data scientist users can train foundation models using the EKS cluster set as the orchestrator for the SageMaker HyperPod cluster. Scientists use the [SageMaker HyperPod CLI](https://github.com/aws/sagemaker-hyperpod-cli) and native `kubectl` commands to find available SageMaker HyperPod clusters, submit training jobs (Pods), and manage their workloads. The SageMaker HyperPod CLI supports job submission through a training job schema file and provides capabilities for listing, describing, canceling, and executing jobs. Scientists can use the [Kubeflow Training Operator](https://www.kubeflow.org/docs/components/training/overview/) according to compute quotas managed by HyperPod, and [SageMaker AI-managed MLflow](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html) to manage ML experiments and training runs. 

**Topics**
+ [Installing the SageMaker HyperPod CLI](sagemaker-hyperpod-eks-run-jobs-access-nodes.md)
+ [SageMaker HyperPod CLI commands](sagemaker-hyperpod-eks-hyperpod-cli-reference.md)
+ [Running jobs using the SageMaker HyperPod CLI](sagemaker-hyperpod-eks-run-jobs-hyperpod-cli.md)
+ [Running jobs using `kubectl`](sagemaker-hyperpod-eks-run-jobs-kubectl.md)

# Installing the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-eks-run-jobs-access-nodes"></a>

SageMaker HyperPod provides the [SageMaker HyperPod command line interface](https://github.com/aws/sagemaker-hyperpod-cli) (CLI) package. 

1. Check if the version of Python on your local machine is between 3.8 and 3.11.

1. Check the prerequisites in the `README` markdown file in the [SageMaker HyperPod CLI](https://github.com/aws/sagemaker-hyperpod-cli) package.

1. Clone the SageMaker HyperPod CLI package from GitHub.

   ```
   git clone https://github.com/aws/sagemaker-hyperpod-cli.git
   ```

1. Install the SageMaker HyperPod CLI.

   ```
   cd sagemaker-hyperpod-cli && pip install .
   ```

1. Test if the SageMaker HyperPod CLI is successfully installed by running the following command. 

   ```
   hyperpod --help
   ```

**Note**  
If you are a data scientist and want to use the SageMaker HyperPod CLI, make sure that your IAM role is set up properly by your cluster admins following the instructions at [IAM users for scientists](sagemaker-hyperpod-prerequisites-iam.md#sagemaker-hyperpod-prerequisites-iam-cluster-user) and [Setting up Kubernetes role-based access control](sagemaker-hyperpod-eks-setup-rbac.md).

# SageMaker HyperPod CLI commands
<a name="sagemaker-hyperpod-eks-hyperpod-cli-reference"></a>

The following table summarizes the SageMaker HyperPod CLI commands.

**Note**  
For a complete CLI reference, see [README](https://github.com/aws/sagemaker-hyperpod-cli?tab=readme-ov-file#sagemaker-hyperpod-command-line-interface) in the [SageMaker HyperPod CLI GitHub repository](https://github.com/aws/sagemaker-hyperpod-cli).


| SageMaker HyperPod CLI command | Entity  | Description | 
| --- | --- | --- | 
| hyperpod get-clusters | cluster/access | Lists all clusters that the user has IAM permissions to submit training workloads to. Also gives a current snapshot of the available instances that are not running any workloads or jobs, along with maximum capacity, grouped by health check status (for example, BurnInPassed) | 
| hyperpod connect-cluster | cluster/access | Configures kubectl to operate on the specified HyperPod cluster and namespace | 
| hyperpod start-job  | job | Submits a job to the targeted cluster. The job name must be unique at the namespace level. Users can override the YAML spec by passing values as CLI arguments | 
| hyperpod get-job | job | Display metadata of the submitted job | 
| hyperpod list-jobs | job | Lists all jobs in the connected cluster/namespace to which the user has been added with IAM permissions to submit training workloads | 
| hyperpod cancel-job | job | Stops and deletes the job and releases the underlying compute resources. A canceled job cannot be resumed; start a new job if needed | 
| hyperpod list-pods | pod | Lists all the pods in the given job in a namespace | 
| hyperpod get-log | pod | Retrieves the logs of a particular pod in a specified job | 
| hyperpod exec | pod | Runs a bash command in the shell of the specified pod(s) and publishes the output | 
| hyperpod --help | utility | Lists all supported commands | 

# Running jobs using the SageMaker HyperPod CLI
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli"></a>

To run jobs, make sure that you have installed the Kubeflow Training Operator in the EKS clusters. For more information, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md).

Run the `hyperpod get-clusters` command to get the list of available HyperPod clusters.

```
hyperpod get-clusters
```

Run the `hyperpod connect-cluster` command to configure the SageMaker HyperPod CLI with the EKS cluster orchestrating the HyperPod cluster.

```
hyperpod connect-cluster --cluster-name <hyperpod-cluster-name>
```

Use the `hyperpod start-job` command to run a job. The following shows the command with its required options. 

```
hyperpod start-job \
    --job-name <job-name> \
    --image <docker-image-uri> \
    --entry-script <entrypoint-script> \
    --instance-type <ml.instance.type> \
    --node-count <integer>
```
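For illustration, the required options can be assembled programmatically before shelling out to the CLI. Every value in this sketch is a hypothetical placeholder, not a real image or job.

```python
# Sketch: build the `hyperpod start-job` invocation from its required options.
# All values below are hypothetical placeholders.
required = {
    "--job-name": "demo-pytorch-job",
    "--image": "123456789012.dkr.ecr.us-west-2.amazonaws.com/train:latest",
    "--entry-script": "/opt/train/train.py",
    "--instance-type": "ml.g5.8xlarge",
    "--node-count": "2",
}

# Join the pieces with shell line continuations for readability.
parts = ["hyperpod start-job"] + [f"{flag} {value}" for flag, value in required.items()]
command = " \\\n    ".join(parts)
print(command)
```

You could pass `command` to `subprocess.run(command, shell=True)` in an automation script, or simply print it and paste it into a terminal.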

The `hyperpod start-job` command also comes with various options such as job auto-resume and job scheduling.

## Enabling job auto-resume
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli-enable-auto-resume"></a>

The `hyperpod start-job` command has the following options to configure job auto-resume. For job auto-resume to work with the SageMaker HyperPod node resiliency features, you must set the value of the `restart-policy` option to `OnFailure`. The job must run under the `kubeflow` namespace or a namespace prefixed with `hyperpod`.
+ [--auto-resume <bool>]: Optional. Enables job auto-resume after failure. The default is `false`.
+ [--max-retry <int>]: Optional. If `auto-resume` is `true`, the default value for `max-retry` is 1 if not specified.
+ [--restart-policy <enum>]: Optional. The PyTorchJob restart policy. Available values are `Always`, `OnFailure`, `Never`, or `ExitCode`. The default value is `OnFailure`. 

```
hyperpod start-job \
    ... // required options \
    --auto-resume true \
    --max-retry 3 \
    --restart-policy OnFailure
```

## Running jobs with scheduling options
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli-scheduling"></a>

The `hyperpod start-job` command has the following options to set up the job with queuing mechanisms. 

**Note**  
You need [Kueue](https://kueue.sigs.k8s.io/docs/overview/) installed in the EKS cluster. If you haven't installed it, follow the instructions in [Setup for SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md).
+ [--scheduler-type <enum>]: Optional. Specifies the scheduler type. The default is `Kueue`.
+ [--queue-name <string>]: Optional. Specifies the name of the [Local Queue](https://kueue.sigs.k8s.io/docs/concepts/local_queue/) or [Cluster Queue](https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/) to submit with the job. The queue should be created by cluster admins using `CreateComputeQuota`.
+ [--priority <string>]: Optional. Specifies the name of the [Workload Priority Class](https://kueue.sigs.k8s.io/docs/concepts/workload_priority_class/), which should be created by cluster admins.

```
hyperpod start-job \
    ... // required options
    --scheduler-type Kueue \
    --queue-name high-priority-queue \
    --priority high
```

## Running jobs from a configuration file
<a name="sagemaker-hyperpod-eks-run-jobs-hyperpod-cli-from-config"></a>

As an alternative, you can create a job configuration file containing all the parameters required by the job, and then pass this configuration file to the `hyperpod start-job` command using the `--config-file` option. To do so:

1. Create your job configuration file with the required parameters. For a baseline example, see the [sample job configuration file](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-run-jobs-hyperpod-cli.html#sagemaker-hyperpod-eks-hyperpod-cli-from-config).

1. Start the job using the configuration file as follows.

   ```
   hyperpod start-job --config-file /path/to/test_job.yaml
   ```

**Tip**  
For a complete list of parameters of the `hyperpod start-job` command, see the [Submitting a Job](https://github.com/aws/sagemaker-hyperpod-cli?tab=readme-ov-file#submitting-a-job) section in the `README.md` of the SageMaker HyperPod CLI GitHub repository.

# Running jobs using `kubectl`
<a name="sagemaker-hyperpod-eks-run-jobs-kubectl"></a>

**Note**  
Training job auto resume requires Kubeflow Training Operator release version `1.7.0`, `1.8.0`, or `1.8.1`.

Note that you should install the Kubeflow Training Operator in the clusters using a Helm chart. For more information, see [Installing packages on the Amazon EKS cluster using Helm](sagemaker-hyperpod-eks-install-packages-using-helm-chart.md). Verify that the Kubeflow Training Operator control plane is properly set up by running the following command.

```
kubectl get pods -n kubeflow
```

This should return an output similar to the following.

```
NAME                                             READY   STATUS    RESTARTS   AGE
training-operator-658c68d697-46zmn               1/1     Running   0          90s
```
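In an automation script, the sample output above can be checked programmatically. A minimal sketch (the pod name and output text are hypothetical):

```python
# Sketch: parse `kubectl get pods -n kubeflow` output and confirm that a
# training-operator pod is Running. The sample text mirrors the output above.
sample = """NAME                                             READY   STATUS    RESTARTS   AGE
training-operator-658c68d697-46zmn               1/1     Running   0          90s"""

rows = [line.split() for line in sample.splitlines()[1:]]  # skip the header row
running = [name for name, ready, status, *_ in rows
           if name.startswith("training-operator") and status == "Running"]
print("training operator ready:", bool(running))
```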

**To submit a training job**

To run a training job, prepare the job configuration file and run the [`kubectl apply`](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#apply) command as follows.

```
kubectl apply -f /path/to/training_job.yaml
```

**To describe a training job**

To retrieve the details of the job submitted to the EKS cluster, use the following command. It returns job information such as the job submission time, completion time, job status, and configuration details.

```
kubectl get -o yaml training-job -n kubeflow
```

**To stop a training job and delete EKS resources**

To stop a training job, use `kubectl delete`. The following is an example of stopping the training job created from the configuration file `pytorch_job_simple.yaml`.

```
kubectl delete -f /path/to/training_job.yaml 
```

This should return the following output.

```
pytorchjob.kubeflow.org "training-job" deleted
```

**To enable job auto-resume**

SageMaker HyperPod supports job auto-resume functionality for Kubernetes jobs, integrating with the Kubeflow Training Operator control plane.

Ensure that there are sufficient nodes in the cluster that have passed the SageMaker HyperPod health check. The nodes should have the label `sagemaker.amazonaws.com/node-health-status` set to `Schedulable`. We recommend including a node selector in the job YAML file to select nodes with the appropriate configuration, as follows.

```
sagemaker.amazonaws.com/node-health-status: Schedulable
```
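The same label filter can be applied client-side, for example to count schedulable capacity before submitting a job. The node data in this sketch is hypothetical sample output, shaped like the items in `kubectl get nodes -o json`.

```python
# Sketch: filter nodes by the HyperPod health label, mimicking the
# nodeSelector above. Node names and label values are hypothetical.
HEALTH_LABEL = "sagemaker.amazonaws.com/node-health-status"

nodes = [
    {"name": "hyperpod-i-0abc", "labels": {HEALTH_LABEL: "Schedulable"}},
    {"name": "hyperpod-i-0def", "labels": {HEALTH_LABEL: "Unschedulable"}},
]

schedulable = [n["name"] for n in nodes
               if n["labels"].get(HEALTH_LABEL) == "Schedulable"]
print(schedulable)
```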

The following code snippet is an example of how to modify a Kubeflow PyTorch job YAML configuration to enable the job auto-resume functionality. You need to add two annotations and set `restartPolicy` to `OnFailure` as follows.

```
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob 
metadata:
    name: pytorch-simple
    namespace: kubeflow
    annotations:  # config for job auto-resume
      sagemaker.amazonaws.com/enable-job-auto-resume: "true"
      sagemaker.amazonaws.com/job-max-retry-count: "2"
spec:
  pytorchReplicaSpecs:
  ......
  Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
          spec:
              nodeSelector:
                  sagemaker.amazonaws.com/node-health-status: Schedulable
```

**To check the job auto-resume status**

Run the following command to check the status of job auto-resume.

```
kubectl describe pytorchjob -n kubeflow <job-name>
```

Depending on the failure patterns, you might see two patterns of Kubeflow training job restart as follows.

**Pattern 1**:

```
Start Time:    2024-07-11T05:53:10Z
Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-0
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-1
  Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m59s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-0
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-1
  Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-master-0
  Warning  PyTorchJobRestarting     7m58s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed.
```

**Pattern 2**: 

```
Events:
  Type    Reason                   Age    From                   Message
  ----    ------                   ----   ----                   -------
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      19m    pytorchjob-controller  Created pod: pt-job-2-master-0
  Normal  SuccessfulCreateService  19m    pytorchjob-controller  Created service: pt-job-2-master-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-worker-0
  Normal  SuccessfulCreatePod      4m48s  pytorchjob-controller  Created pod: pt-job-2-master-0
```
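A quick way to tell the two patterns apart when scanning events in a script: Pattern 1 contains `PyTorchJobRestarting` warnings (the controller restarted failed replicas), while Pattern 2 only shows pods being re-created (for example, after node replacement). The heuristic below is a sketch against hypothetical event text, not an official status check.

```python
def classify_restart(events: str) -> str:
    """Heuristically classify auto-resume behavior from
    `kubectl describe pytorchjob` event text."""
    if "PyTorchJobRestarting" in events:
        return "pattern-1: controller restarted failed replicas"
    # More pod creations than service creations suggests pods were re-created.
    if events.count("SuccessfulCreatePod") > events.count("SuccessfulCreateService"):
        return "pattern-2: pods re-created on replacement nodes"
    return "no restart detected"

# Hypothetical condensed event text for each pattern.
pattern_1 = "Warning  PyTorchJobRestarting  7m59s  pytorchjob-controller  ..."
pattern_2 = ("Normal  SuccessfulCreatePod      19m    ...\n"
             "Normal  SuccessfulCreateService  19m    ...\n"
             "Normal  SuccessfulCreatePod      4m48s  ...")

print(classify_restart(pattern_1))
print(classify_restart(pattern_2))
```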

# Using the HyperPod training operator
<a name="sagemaker-eks-operator"></a>

 The Amazon SageMaker HyperPod training operator helps you accelerate generative AI model development by efficiently managing distributed training across large GPU clusters. It introduces intelligent fault recovery, hang job detection, and process-level management capabilities that minimize training disruptions and reduce costs. Unlike traditional training infrastructure that requires complete job restarts when failures occur, this operator implements surgical process recovery to keep your training jobs running smoothly. 

 The operator also works with HyperPod's health monitoring and observability functions, providing real-time visibility into training execution and automatic monitoring of critical metrics like loss spikes and throughput degradation. You can define recovery policies through simple YAML configurations without code changes, allowing you to quickly respond to and recover from unrecoverable training states. These monitoring and recovery capabilities work together to maintain optimal training performance while minimizing operational overhead.

 While Kueue is not required for this training operator, your cluster administrator can install and configure it for enhanced job scheduling capabilities. For more information, see the [official documentation for Kueue](https://kueue.sigs.k8s.io/docs/overview/).

**Note**  
To use the training operator, you must use the latest [ HyperPod AMI release](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-release-ami-eks.html). To upgrade, use the [ UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API operation. If you use [ HyperPod task governance](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html), it must also be the latest version.

## Supported versions
<a name="sagemaker-eks-operator-supported-versions"></a>

 The HyperPod training operator works only with specific versions of Kubernetes, Kueue, and HyperPod. The following is the complete list of compatible versions. 
+ Supported Kubernetes versions – 1.28, 1.29, 1.30, 1.31, 1.32, and 1.33
+ Suggested Kueue versions – [v0.12.2](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.12.2) and [v0.12.3](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.12.3)
+ The latest HyperPod AMI release. To upgrade to the latest AMI release, use the [ UpdateClusterSoftware](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html) API.
+ [PyTorch 2.4.0 – 2.7.1](https://github.com/pytorch/pytorch/releases)

**Note**  
We collect certain routine aggregated and anonymized operational metrics to provide essential service availability. The creation of these metrics is fully automated and does not involve human review of the underlying model training workload. These metrics relate to job operations, resource management, and essential service functionality.

# Installing the training operator
<a name="sagemaker-eks-operator-install"></a>

See the following sections to learn about how to install the training operator.

## Prerequisites
<a name="sagemaker-eks-operator-prerequisites"></a>

 Before you use the HyperPod training operator, you must have completed the following prerequisites: 
+  [ Created a HyperPod cluster with Amazon EKS orchestration](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-create-cluster.html). 
+ Installed the latest AMI on your HyperPod cluster. For more information, see [SageMaker HyperPod AMI releases for Amazon EKS](sagemaker-hyperpod-release-ami-eks.md).
+ [Installed cert-manager](https://cert-manager.io/docs/installation/).
+  [ Set up the EKS Pod Identity Agent using the console](https://docs.aws.amazon.com/eks/latest/userguide/pod-id-agent-setup.html). If you want to use the AWS CLI, use the following command: 

  ```
  aws eks create-addon \ 
   --cluster-name my-eks-cluster \
   --addon-name eks-pod-identity-agent \
   --region AWS Region
  ```
+ (Optional) If you run your HyperPod cluster nodes in a private VPC, you must set up AWS PrivateLink VPC endpoints for the Amazon SageMaker AI API (`com.amazonaws.aws-region.sagemaker.api`) and the Amazon EKS Auth service (`com.amazonaws.aws-region.eks-auth`). You must also make sure that your cluster nodes run in subnets whose security group allows traffic to route through the VPC endpoints to communicate with SageMaker AI and Amazon EKS. If these aren't properly set up, the add-on installation can fail. To learn more about setting up VPC endpoints, see [Create a VPC endpoint](https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html#create-interface-endpoint-aws).

## Installing the training operator
<a name="sagemaker-eks-operator-install-operator"></a>

 You can now install the HyperPod training operator through the SageMaker AI console, the Amazon EKS console, or with the AWS CLI. The console methods offer simplified experiences that help you install the operator. The AWS CLI offers a programmatic approach that lets you customize more of your installation.

Between the two console experiences, SageMaker AI provides a one-click installation that creates the IAM execution role, creates the pod identity association, and installs the operator. The Amazon EKS console installation is similar, but that method doesn't automatically create the IAM execution role. During that process, you can choose to create a new IAM execution role with information that the console pre-populates. By default, these created roles only have access to the cluster that you're installing the operator in. Unless you edit the role's permissions to include other clusters, you must create a new role if you remove and reinstall the operator. 

------
#### [ SageMaker AI console (recommended) ]

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the add-on named **Amazon SageMaker HyperPod training operator**, and choose **install**. During the installation process, SageMaker AI creates an IAM execution role with permissions similar to the [ AmazonSageMakerHyperPodTrainingOperatorAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerHyperPodTrainingOperatorAccess.html) managed policy and creates a pod identity association between your Amazon EKS cluster and your new execution role.

------
#### [ Amazon EKS console ]

**Note**  
If you install the add-on through the Amazon EKS cluster, first make sure that you've tagged your HyperPod cluster with the key-value pair `SageMaker:true`. Otherwise, the installation will fail.

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Go to your EKS cluster, choose **Add-ons**, and then choose **Get more add-ons**.

1. Choose **Amazon SageMaker HyperPod training operator**, and then choose **Next**.

1. Under **Version**, the console defaults to the latest version, which we recommend that you use.

1. Under **Add-on access**, choose a pod identity IAM role to use with the training operator add-on. If you don't already have a role, choose **Create recommended role** to create one.

1. During this role creation process, the IAM console pre-populates all of the necessary information, such as the use case, the [ AmazonSageMakerHyperPodTrainingOperatorAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerHyperPodTrainingOperatorAccess.html) managed policy and other required permissions, the role name, and the description. As you go through the steps, review the information, and choose **Create role**.

1. In the EKS console, review your add-on's settings, and then choose **Create**.

------
#### [ CLI ]

1. Make sure that the IAM execution role for your HyperPod cluster has a trust relationship that allows EKS Pod Identity to assume the role, or [create a new IAM role](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html) with the following trust policy. Alternatively, you can use the Amazon EKS console to install the add-on, which creates a recommended role.

------
#### [ JSON ]


   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
         "Effect": "Allow",
         "Principal": {
           "Service": "pods.eks.amazonaws.com"
         },
         "Action": [
           "sts:AssumeRole",
           "sts:TagSession",
           "eks-auth:AssumeRoleForPodIdentity"
         ]
       }
     ]
   }
   ```

------

1.  Attach the [ AmazonSageMakerHyperPodTrainingOperatorAccess managed policy](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerHyperPodTrainingOperatorAccess.html) to your created role. 

1. Then [create a pod identity association](https://docs.aws.amazon.com/eks/latest/userguide/pod-identities.html) between your EKS cluster and your new IAM role.

   ```
   aws eks create-pod-identity-association \
   --cluster-name my-eks-cluster \
   --role-arn ARN of your execution role \
   --namespace aws-hyperpod \
   --service-account hp-training-operator-controller-manager \
   --region AWS Region
   ```

1. After you finish the process, you can use the `ListPodIdentityAssociations` operation to see the association that you created. The following is a sample response of what it might look like. 

   ```
   aws eks list-pod-identity-associations --cluster-name my-eks-cluster
   {
       "associations": [{
           "clusterName": "my-eks-cluster",
           "namespace": "aws-hyperpod",
           "serviceAccount": "hp-training-operator-controller-manager",
           "associationArn": "arn:aws:eks:us-east-2:123456789012:podidentityassociation/my-hyperpod-cluster/a-1a2b3c4d5e6f7g8h9",
           "associationId": "a-1a2b3c4d5e6f7g8h9"
       }]
   }
   ```

1. To install the training operator, use the `create-addon` operation. The `--addon-version` parameter is optional. If you don’t provide one, the default is the latest version. To get the possible versions, use the [ DescribeAddonVersions](https://docs.aws.amazon.com/eks/latest/APIReference/API_DescribeAddonVersions.html) operation.

   ```
   aws eks create-addon \
     --cluster-name my-eks-cluster \
     --addon-name amazon-sagemaker-hyperpod-training-operator \
     --resolve-conflicts OVERWRITE
   ```

------

If you already have the training operator installed on your HyperPod cluster, you can update the EKS add-on to the version that you want. If you want to use [ checkpointless training](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless.html) or [ elastic training](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-elastic-training.html), consider the following:
+ Both checkpointless training and elastic training require the EKS add-on to be on version 1.2.0 or above.
+ The Amazon SageMaker HyperPod training operator maintains backwards compatibility for any EKS add-on version, so you can upgrade from any add-on version to 1.2.0 or above.
+ If you downgrade from versions 1.2.0 or above to a lower version, you must first delete the existing jobs before the downgrade and resubmit the jobs after the downgrade is complete.

------
#### [ Amazon EKS Console ]

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Go to your EKS cluster, and choose **Add-ons**. Then, choose the Amazon SageMaker HyperPod training operator add-on and choose **Edit**.

1. In the **Version** menu, choose the version of the add-on that you want, then choose **Save changes**.

------
#### [ CLI ]

1. First get the list of the supported versions of the add-on for your cluster.

   ```
   aws eks describe-addon-versions \
     --kubernetes-version $(aws eks describe-cluster --name my-eks-cluster --query 'cluster.version' --output text) \
     --addon-name amazon-sagemaker-hyperpod-training-operator \
     --query 'addons[0].addonVersions[].addonVersion' \
     --output table
   ```

1. Then update the add-on to the version that you want.

   ```
   aws eks update-addon \
     --cluster-name my-eks-cluster \
     --addon-name amazon-sagemaker-hyperpod-training-operator \
     --addon-version target-version \
     --resolve-conflicts OVERWRITE
   ```

------

The training operator comes with a number of configuration options that have default values suitable for most use cases. We recommend that you try the training operator with the default values before changing them. The following table describes each parameter and when you might want to configure it.


| Parameter | Description | Default | 
| --- | --- | --- | 
| hpTrainingControllerManager.manager.resources.requests.cpu | How many processors to allocate for the controller | 1 | 
| hpTrainingControllerManager.manager.resources.requests.memory | How much memory to allocate to the controller | 2Gi | 
| hpTrainingControllerManager.manager.resources.limits.cpu | The CPU limit for the controller | 2 | 
| hpTrainingControllerManager.manager.resources.limits.memory | The memory limit for the controller | 4Gi | 
| hpTrainingControllerManager.nodeSelector | Node selector for the controller pods | Default behavior is to select nodes with the label sagemaker.amazonaws.com/compute-type: "HyperPod" | 
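
If you decide to change these defaults, you can pass them when you install or update the add-on with the `--configuration-values` option of `aws eks create-addon` or `aws eks update-addon`. The following JSON document is a sketch that uses the parameter names from the table above; verify the exact schema that your add-on version accepts with the `aws eks describe-addon-configuration` operation.

```
{
  "hpTrainingControllerManager": {
    "manager": {
      "resources": {
        "requests": { "cpu": "1", "memory": "2Gi" },
        "limits": { "cpu": "2", "memory": "4Gi" }
      }
    }
  }
}
```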

## HyperPod elastic agent
<a name="sagemaker-eks-operator-elastic-agent"></a>

The HyperPod elastic agent is an extension of [PyTorch’s ElasticAgent](https://docs.pytorch.org/docs/stable/elastic/agent.html). It orchestrates the lifecycles of training workers in each container and communicates with the HyperPod training operator. To use the HyperPod training operator, you must first install the HyperPod elastic agent into your training image before you can submit and run jobs using the operator. The following is a sample Dockerfile that installs the elastic agent, together with an entrypoint script that uses `hyperpodrun` to launch the job.

**Note**  
Both [ checkpointless training](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-checkpointless.html) and [ elastic training](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-elastic-training.html) require that you use HyperPod elastic agent version 1.1.0 or above.

```
RUN pip install hyperpod-elastic-agent

ENTRYPOINT ["entrypoint.sh"]

# entrypoint.sh
# --rdzv-backend hyperpod is optional.
# --inprocess-restart is optional and enables in-process fault recovery
# with checkpointless training.
# The pre-train and post-train argument groups run scripts before and
# after training. Other torchrun arguments are also accepted.
hyperpodrun --nnodes=node_count --nproc-per-node=proc_count \
            --rdzv-backend hyperpod \
            --inprocess-restart \
            --pre-train-script pre.sh --pre-train-args "pre_1 pre_2 pre_3" \
            --post-train-script post.sh --post-train-args "post_1 post_2 post_3" \
            training.py --script-args
```

You can now submit jobs with `kubectl`.

### HyperPod elastic agent arguments
<a name="sagemaker-eks-operator-elastic-agent-args"></a>

The HyperPod elastic agent supports all of the original PyTorch elastic agent arguments and adds some additional ones. The following table lists all of the arguments available in the HyperPod elastic agent. For more information about PyTorch's elastic agent, see the [official documentation](https://docs.pytorch.org/docs/stable/elastic/agent.html).


| Argument | Description | Default Value | 
| --- | --- | --- | 
| --shutdown-signal | Signal to send to workers for shutdown (SIGTERM or SIGKILL) | "SIGKILL" | 
| --shutdown-timeout | Timeout in seconds between shutdown-signal and SIGKILL signals | 15 | 
| --server-host | Agent server address | "0.0.0.0" | 
| --server-port | Agent server port | 8080 | 
| --server-log-level | Agent server log level | "info" | 
| --server-shutdown-timeout | Server shutdown timeout in seconds | 300 | 
| --pre-train-script | Path to pre-training script | None | 
| --pre-train-args | Arguments for pre-training script | None | 
| --post-train-script | Path to post-training script | None | 
| --post-train-args | Arguments for post-training script | None | 
| --inprocess-restart | Flag specifying whether to use the in-process restart feature | FALSE | 
| --inprocess-timeout | Time in seconds that the agent waits for workers to reach a synchronization barrier before triggering a process-level restart. | None | 

## Task governance (optional)
<a name="sagemaker-eks-operator-task-governance"></a>

The training operator is integrated with [ HyperPod task governance](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance), a robust management system designed to streamline resource allocation and ensure efficient utilization of compute resources across teams and projects for your Amazon EKS clusters. To set up HyperPod task governance, see [Setup for SageMaker HyperPod task governance](sagemaker-hyperpod-eks-operate-console-ui-governance-setup.md). 

**Note**  
When installing the HyperPod task governance add-on, you must use version v1.3.0-eksbuild.1 or higher.

When submitting a job, make sure that you include the queue name label `hyperpod-ns-team-name-localqueue` and the priority class label `priority-class-name-priority`. For example, if you're using Kueue, your labels become the following:
+ kueue.x-k8s.io/queue-name: hyperpod-ns-*team-name*-localqueue
+ kueue.x-k8s.io/priority-class: *priority-class*-name-priority

The following is an example of what your configuration file might look like:

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPytorchJob
metadata:
  name: hp-task-governance-sample
  namespace: hyperpod-ns-team-name
  labels:
    kueue.x-k8s.io/queue-name: hyperpod-ns-team-name-localqueue
    kueue.x-k8s.io/priority-class: priority-class-priority
spec:
  nprocPerNode: "1"
  runPolicy:
    cleanPodPolicy: "None"
  replicaSpecs: 
    - name: pods
      replicas: 4
      spares: 2
      template:
        spec:
          containers:
            - name: ptjob
              image: XXXX
              imagePullPolicy: Always
              ports:
                - containerPort: 8080
              resources:
                requests:
                  cpu: "2"
```

Then use the following kubectl command to apply the YAML file.

```
kubectl apply -f task-governance-job.yaml
```

## Kueue (optional)
<a name="sagemaker-eks-operator-kueue"></a>

While you can run jobs directly, your organization can also integrate the training operator with Kueue to allocate resources and schedule jobs. Follow the steps below to install Kueue into your HyperPod cluster.

1. Follow the installation guide in the [ official Kueue documentation](https://kueue.sigs.k8s.io/docs/installation/#install-a-custom-configured-released-version). When you reach the step of configuring `controller_manager_config.yaml`, add the following configuration:

   ```
   externalFrameworks:
   - "HyperPodPytorchJob.v1.sagemaker.amazonaws.com"
   ```

1. Follow the rest of the steps in the official installation guide. After you finish installing Kueue, you can create some sample queues with the `kubectl apply -f sample-queues.yaml` command. Use the following YAML file.

   ```
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ClusterQueue
   metadata:
     name: cluster-queue
   spec:
     namespaceSelector: {}
     preemption:
       withinClusterQueue: LowerPriority
     resourceGroups:
     - coveredResources:
       - cpu
       - nvidia.com/gpu
       - pods
       flavors:
       - name: default-flavor
         resources:
         - name: cpu
           nominalQuota: 16
         - name: nvidia.com/gpu
           nominalQuota: 16
         - name: pods
           nominalQuota: 16
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: LocalQueue
   metadata:
     name: user-queue
     namespace: default
   spec:
     clusterQueue: cluster-queue
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   kind: ResourceFlavor
   metadata:
     name: default-flavor
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   description: High priority
   kind: WorkloadPriorityClass
   metadata:
     name: high-priority-class
   value: 1000
   ---
   apiVersion: kueue.x-k8s.io/v1beta1
   description: Low Priority
   kind: WorkloadPriorityClass
   metadata:
     name: low-priority-class
   value: 500
   ```

# Using the training operator to run jobs
<a name="sagemaker-eks-operator-usage"></a>

To run a job with kubectl, create a `job.yaml` file that specifies the job configuration, then run `kubectl apply -f job.yaml` to submit the job. In this YAML file, you can specify custom configurations in the `logMonitoringConfiguration` argument to define automated monitoring rules that analyze log output from the distributed training job to detect problems and recover.

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    app.kubernetes.io/name: HyperPod
    app.kubernetes.io/managed-by: kustomize
  name: &jobname xxx
  annotations:
    XXX: XXX
    ......
spec:
  nprocPerNode: "X"
  replicaSpecs:
    - name: 'XXX'
      replicas: 16
      template:
        spec:
          nodeSelector:
            beta.kubernetes.io/instance-type: ml.p5.48xlarge
          containers:
            - name: XXX
              image: XXX
              imagePullPolicy: Always
              ports:
                - containerPort: 8080 # This is the port that HyperPodElasticAgent listens to
              resources:
                limits:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                requests:
                  nvidia.com/gpu: 8
                  hugepages-2Mi: 5120Mi
                  memory: 32000Mi
          ......        
  runPolicy:
    jobMaxRetryCount: 50
    restartPolicy:
      numRestartBeforeFullJobRestart: 3 
      evalPeriodSeconds: 21600 
      maxFullJobRestarts: 1
    cleanPodPolicy: "All"
    logMonitoringConfiguration: 
      - name: "JobStart"
        logPattern: ".*Experiment configuration.*" # This is the start of the training script
        expectedStartCutOffInSeconds: 120 # Expected match in the first 2 minutes
      - name: "JobHangingDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'training_loss_step': (\\d+(\\.\\d+)?).*"
        expectedRecurringFrequencyInSeconds: 300 # If next batch is not printed within 5 minute, consider it hangs. Or if loss is not decimal (e.g. nan) for 2 minutes, mark it hang as well.
        expectedStartCutOffInSeconds: 600 # Allow 10 minutes of job startup time
      - name: "NoS3CheckpointingDetection"
        logPattern: ".*The checkpoint is finalized. All shards is written.*"
        expectedRecurringFrequencyInSeconds: 600 # If next checkpoint s3 upload doesn't happen within 10 mins, mark it hang.
        expectedStartCutOffInSeconds: 1800 # Allow 30 minutes for first checkpoint upload
      - name: "LowThroughputDetection"
        logPattern: ".*\\[Epoch 0 Batch \\d+.*'samples\\/sec': (\\d+(\\.\\d+)?).*"
        metricThreshold: 80 # 80 samples/sec
        operator: "lteq"
        metricEvaluationDataPoints: 25 # if throughput lower than threshold for 25 datapoints, kill the job
```

If you want to use the log monitoring options, make sure that you’re emitting training logs to `sys.stdout`. The HyperPod elastic agent monitors `sys.stdout`, which is saved at `/tmp/hyperpod/`. You can use the following logging configuration to emit training logs to `sys.stdout`.

```
import logging
import sys

logging.basicConfig(
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
    level=logging.INFO,
    stream=sys.stdout,
)
```

The following table describes all of the possible restart policy and log monitoring configurations:


| Parameter | Usage | 
| --- | --- | 
| jobMaxRetryCount | Maximum number of restarts at the process level. | 
| restartPolicy: numRestartBeforeFullJobRestart | Maximum number of restarts at the process level before the operator restarts at the job level. | 
| restartPolicy: evalPeriodSeconds | The period of evaluating the restart limit in seconds | 
| restartPolicy: maxFullJobRestarts | Maximum number of full job restarts before the job fails. | 
| cleanPodPolicy | Specifies the pods that the operator should clean. Accepted values are All, OnlyComplete, and None. | 
| logMonitoringConfiguration | The log monitoring rules for slow and hanging job detection | 
| expectedRecurringFrequencyInSeconds | Time interval between two consecutive LogPattern matches after which the rule evaluates to HANGING. If not specified, no time constraint exists between consecutive LogPattern matches. | 
| expectedStartCutOffInSeconds | Time to first LogPattern match after which the rule evaluates to HANGING. If not specified, no time constraint exists for the first LogPattern match. | 
| logPattern | Regular expression that identifies log lines that the rule applies to when the rule is active | 
| metricEvaluationDataPoints | Number of consecutive times a rule must evaluate to SLOW before marking a job as SLOW. If not specified, the default is 1. | 
| metricThreshold | Threshold for value extracted by LogPattern with a capturing group. If not specified, metric evaluation is not performed. | 
| operator | The inequality to apply to the monitoring configuration. Accepted values are gt, gteq, lt, lteq, and eq. | 
| stopPattern | Regular expression that identifies the log line at which to deactivate the rule. If not specified, the rule will always be active. | 
| faultOnMatch | Indicates whether a match of LogPattern should immediately trigger a job fault. When true, the job will be marked as faulted as soon as the LogPattern is matched, regardless of other rule parameters. When false or not specified, the rule will evaluate to SLOW or HANGING based on other parameters. | 
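
Before submitting a job, you can sanity-check a `logPattern` and threshold locally. The following Python sketch mimics how a metric rule might evaluate sample log lines, using the semantics described above: extract the capturing group, compare it against `metricThreshold` with the `lteq` operator, and flag the job as SLOW after `metricEvaluationDataPoints` consecutive breaches. This is an illustration of the rule semantics, not the operator's actual implementation, and the sample log line format is hypothetical.

```python
import re

# Rule settings mirroring the LowThroughputDetection example above
log_pattern = re.compile(r".*'samples/sec': (\d+(\.\d+)?).*")
metric_threshold = 80.0       # 80 samples/sec
evaluation_data_points = 25   # consecutive breaches before the job is marked SLOW

def is_slow(lines):
    """Return True if the threshold is breached for N consecutive matching lines."""
    consecutive = 0
    for line in lines:
        match = log_pattern.search(line)
        if not match:
            continue
        value = float(match.group(1))
        # "lteq" operator: a value at or below the threshold counts as a breach
        consecutive = consecutive + 1 if value <= metric_threshold else 0
        if consecutive >= evaluation_data_points:
            return True
    return False

sample = ["[Epoch 0 Batch 1] {'samples/sec': 75.2}"] * 25
print(is_slow(sample))  # True: 25 consecutive data points below the threshold
```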

 For more training resiliency, specify spare node configuration details. If your job fails, the operator works with Kueue to use nodes reserved in advance to continue running the job. Spare node configurations require Kueue, so if you try to submit a job with spare nodes but don’t have Kueue installed, the job will fail. The following example is a sample `job.yaml` file that contains spare node configurations.

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue # Specify the queue to run the job.
  name: hyperpodpytorchjob-sample
spec:
  nprocPerNode: "1"
  runPolicy:
    cleanPodPolicy: "None"
  replicaSpecs: 
    - name: pods
      replicas: 1
      spares: 1 # Specify how many spare nodes to reserve.
      template:
        spec:
          containers:
            - name: XXX
              image: XXX
              
              imagePullPolicy: Always
              ports:
                - containerPort: 8080
              resources:
                requests:
                  nvidia.com/gpu: "0"
                limits:
                  nvidia.com/gpu: "0"
```

## Monitoring
<a name="sagemaker-eks-operator-usage-monitoring"></a>

Amazon SageMaker HyperPod is integrated with [ observability with Amazon Managed Grafana and Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-observability-addon.html), so you can set up monitoring to collect and feed metrics into these observability tools.

Alternatively, you can scrape metrics through Amazon Managed Service for Prometheus without managed observability. To do so, include the metrics that you want to monitor into your `job.yaml` file when you run jobs with `kubectl`.

```
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hyperpod-training-operator
  namespace: aws-hyperpod
spec:
  ......
  endpoints:
    - port: 8081
      path: /metrics
      interval: 15s
```

The following are metrics that the training operator emits. You can feed them into Amazon Managed Service for Prometheus to monitor your training jobs.


| Metric | Description | 
| --- | --- | 
| hyperpod_training_operator_jobs_created_total | Total number of jobs that the training operator has run | 
| hyperpod_training_operator_jobs_restart_latency | Current job restart latency | 
| hyperpod_training_operator_jobs_fault_detection_latency | Fault detection latency | 
| hyperpod_training_operator_jobs_deleted_total | Total number of deleted jobs | 
| hyperpod_training_operator_jobs_successful_total | Total number of completed jobs | 
| hyperpod_training_operator_jobs_failed_total | Total number of failed jobs | 
| hyperpod_training_operator_jobs_restarted_total | Total number of auto-restarted jobs | 
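
Once these metrics are flowing into Amazon Managed Service for Prometheus, you can query and alert on them. The following PromQL queries are a sketch that assumes the metric names above are written with underscores per the Prometheus naming convention; the exact label sets depend on your scrape configuration.

```
# Job restart rate over the last 15 minutes
sum(rate(hyperpod_training_operator_jobs_restarted_total[15m]))

# Ratio of failed jobs to created jobs
sum(hyperpod_training_operator_jobs_failed_total)
  / sum(hyperpod_training_operator_jobs_created_total)
```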

## Sample docker configuration
<a name="sagemaker-eks-operator-usage-docker"></a>

The following is a sample entrypoint script that launches training with the `hyperpodrun` command.

```
export AGENT_CMD="--backend=nccl"
exec hyperpodrun --server-host=${AGENT_HOST} --server-port=${AGENT_PORT} \
    --tee=3 --log_dir=/tmp/hyperpod \
    --nnodes=${NNODES} --nproc-per-node=${NPROC_PER_NODE} \
    --pre-train-script=/workspace/echo.sh --pre-train-args='Pre-training script' \
    --post-train-script=/workspace/echo.sh --post-train-args='Post-training script' \
    /workspace/mnist.py --epochs=1000 ${AGENT_CMD}
```

## Sample log monitoring configurations
<a name="sagemaker-eks-operator-usage-log-monitoring"></a>

**Job hang detection**

To detect hanging jobs, use the following configuration. It uses the following parameters:
+ expectedStartCutOffInSeconds – how long the monitor should wait before expecting the first logs
+ expectedRecurringFrequencyInSeconds – the time interval to wait for the next batch of logs

With these settings, the log monitor expects to see a log line matching the regex pattern `.*Train Epoch.*` within 60 seconds after the training job starts. After the first appearance, the monitor expects to see matching log lines every 10 seconds. If the first logs don't appear within 60 seconds or subsequent logs don't appear every 10 seconds, the HyperPod elastic agent treats the container as stuck and coordinates with the training operator to restart the job.

```
runPolicy:
    jobMaxRetryCount: 10
    cleanPodPolicy: "None"
    logMonitoringConfiguration:
      - name: "JobStartGracePeriod"
        # Sample log line: [default0]:2025-06-17 05:51:29,300 [INFO] __main__: Train Epoch: 5 [0/60000 (0%)]       loss=0.8470
        logPattern: ".*Train Epoch.*"  
        expectedStartCutOffInSeconds: 60 
      - name: "JobHangingDetection"
        logPattern: ".*Train Epoch.*"
        expectedRecurringFrequencyInSeconds: 10 # if the next batch is not printed within 10 seconds
```

**Training loss spike**

The following monitoring configuration watches for training logs that match the pattern `training_loss_step` and extracts the loss value with a capturing group. It uses the parameter `metricEvaluationDataPoints`, which lets you specify how many data points must breach the threshold before the operator acts. If the training loss value is greater than 2.0 for five consecutive data points, the operator restarts the job.

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "LossSpikeDetection"
      logPattern: ".*training_loss_step (\\d+(?:\\.\\d+)?).*"   # training_loss_step 5.0
      metricThreshold: 2.0
      operator: "gt"
      metricEvaluationDataPoints: 5 # if loss higher than threshold for 5 data points, restart the job
```
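
You can verify locally that your training logs match the `logPattern` and that the capturing group extracts the metric you expect. The following Python sketch tests the loss pattern above against a hypothetical sample log line:

```python
import re

# Same pattern as the LossSpikeDetection rule above
pattern = re.compile(r".*training_loss_step (\d+(?:\.\d+)?).*")

# Hypothetical log line in the expected format
line = "[default0]: step 120 training_loss_step 5.0 lr 0.0003"
match = pattern.search(line)
print(match.group(1))                # the extracted loss value, "5.0"
print(float(match.group(1)) > 2.0)   # True: the "gt" operator would count this data point
```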

**Low TFLOPs detection**

The following monitoring configuration expects training logs that match the pattern `TFLOPs` every five seconds. If the TFLOPs value is less than 100 for 5 consecutive data points, the operator restarts the training job.

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "TFLOPs"
      logPattern: ".* (.+)TFLOPs.*"    # Training model, speed: X TFLOPs...
      expectedRecurringFrequencyInSeconds: 5        
      metricThreshold: 100       # if Tflops is less than 100 for 5 data points, restart the job       
      operator: "lt"
      metricEvaluationDataPoints: 5
```

**Training script error log detection**

The following monitoring configuration detects whether the pattern specified in `logPattern` is present in the training logs. As soon as the training operator encounters the error pattern, it treats the match as a fault and restarts the job.

```
runPolicy:
  jobMaxRetryCount: 10
  cleanPodPolicy: "None"
  logMonitoringConfiguration:
    - name: "GPU Error"
      logPattern: ".*RuntimeError.*out of memory.*"
      faultOnMatch: true
```

# Troubleshooting
<a name="sagemaker-eks-operator-troubleshooting"></a>

See the following sections to learn how to troubleshoot errors when using the training operator.

## I can't install the training operator
<a name="sagemaker-eks-operator-troubleshooting-installation-error"></a>

If you can't install the training operator, make sure that you're using the [ supported versions of components](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html#sagemaker-eks-operator-supported-versions). For example, if you get an error that your HyperPod AMI release is incompatible with the training operator, [ update to the latest version](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateClusterSoftware.html).

## Incompatible HyperPod task governance version
<a name="sagemaker-eks-operator-troubleshooting-task-governance-version"></a>

During installation, you might get an error message that the version of HyperPod task governance is incompatible. The training operator works only with version v1.3.0-eksbuild.1 or higher. Update your HyperPod task governance add-on and try again. 

## Missing permissions
<a name="sagemaker-eks-operator-troubleshooting-task-missing-permissions"></a>

 While you're setting up the training operator or running jobs, you might receive errors that you're not authorized to run certain operations, such as `DescribeClusterNode`. To resolve these errors, make sure you correctly set up IAM permissions while you're [setting up the Amazon EKS Pod Identity Agent](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-install.html#sagemaker-eks-operator-install-pod-identity).

# Using elastic training in Amazon SageMaker HyperPod
<a name="sagemaker-eks-elastic-training"></a>

 Elastic training is a new Amazon SageMaker HyperPod capability that automatically scales training jobs based on compute resource availability and workload priority. Elastic training jobs can start with minimum compute resources required for model training and dynamically scale up or down through automatic checkpointing and resumption across different node configurations (world size). Scaling is achieved by automatically adjusting the number of data-parallel replicas. During high cluster utilization periods, elastic training jobs can be configured to automatically scale down in response to resource requests from higher-priority jobs, freeing up compute for critical workloads. When resources free up during off-peak periods, elastic training jobs automatically scale back up to accelerate training, then scale back down when higher-priority workloads need resources again. 

Elastic training is built on top of the HyperPod training operator and integrates the following components:
+ [Amazon EKS for Kubernetes orchestration](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks.html)
+ [ Amazon SageMaker HyperPod Task Governance ](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html)for job queuing, prioritization, and scheduling
+ [PyTorch Distributed Checkpoint (DCP)](https://docs.pytorch.org/docs/stable/distributed.checkpoint.html) for scalable state and checkpoint management

**Supported frameworks**
+ PyTorch with Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP)
+ PyTorch Distributed Checkpoint (DCP)

## Prerequisites
<a name="sagemaker-eks-elastic-prereqs"></a>

### SageMaker HyperPod EKS Cluster
<a name="sagemaker-eks-elastic-hyperpod-cluster"></a>

You must have a running SageMaker HyperPod cluster with Amazon EKS orchestration. For information on creating a HyperPod EKS cluster, see:
+ [Getting started with Amazon EKS in SageMaker HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-prerequisites.html)
+ [Creating a SageMaker HyperPod cluster with Amazon EKS orchestration](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-create-cluster.html)

### SageMaker HyperPod Training Operator
<a name="sagemaker-eks-elastic-training-operator"></a>

Elastic training is supported in training operator v1.2.0 and above.

To install the training operator as an EKS add-on, see [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-install.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-install.html).

### (Recommended) Install and configure Task Governance and Kueue
<a name="sagemaker-eks-elastic-task-governance"></a>

We recommend installing and configuring Kueue via [HyperPod Task Governance](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html) to specify workload priorities with elastic training. Kueue provides stronger workload management with queuing, prioritization, gang scheduling, resource tracking, and graceful preemption, which are essential for operating in multi-tenant training environments.
+ Gang scheduling ensures that all required pods of a training job start together. This prevents situations where some pods start while others remain pending, which could cause wasted resources.
+ Gentle preemption allows lower-priority elastic jobs to yield resources to higher-priority workloads. Elastic jobs can scale down gracefully without being forcibly evicted, improving overall cluster stability.

We recommend configuring the following Kueue components:
+ PriorityClasses to define relative job importance
+ ClusterQueues to manage global resource sharing and quotas across teams or workloads
+ LocalQueues to route jobs from individual namespaces into the appropriate ClusterQueue

For more advanced setups, you can also incorporate:
+ Fair-share policies to balance resource usage across multiple teams
+ Custom preemption rules to enforce organizational SLAs or cost controls

For more information, see:
+ [https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-eks-operate-console-ui-governance.html)
+ [Kueue Documentation](https://kueue.sigs.k8s.io/)

### (Recommended) Setup user namespaces and resource quotas
<a name="sagemaker-eks-elastic-namespaces-quotas"></a>

When deploying this feature on Amazon EKS, we recommend applying a set of foundational cluster-level configurations to ensure isolation, resource fairness, and operational consistency across teams.

#### Namespace and Access Configuration
<a name="sagemaker-eks-elastic-namespace-access"></a>

Organize your workloads using separate namespaces for each team or project. This allows you to apply fine-grained isolation and governance. We also recommend configuring AWS IAM to Kubernetes RBAC mapping to associate individual IAM users or roles with their corresponding namespaces.

Key practices include:
+ Map IAM roles to Kubernetes service accounts using IAM Roles for Service Accounts (IRSA) when workloads need AWS permissions. [https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html](https://docs.aws.amazon.com/eks/latest/userguide/access-entries.html)
+ Apply RBAC policies to restrict users to only their designated namespaces (e.g., `Role`/`RoleBinding` rather than cluster-wide permissions).
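
As a sketch of the namespace-scoped RBAC pattern, the following Role and RoleBinding grant a hypothetical `team-a` group access to training jobs and pod logs only within its own namespace. The group name and namespace are placeholders for your own IAM-to-RBAC mapping, and the `hyperpodpytorchjobs` resource name assumes the `HyperPodPyTorchJob` custom resource shown earlier in this guide.

```
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: training-job-editor
  namespace: hyperpod-ns-team-a
rules:
  - apiGroups: ["sagemaker.amazonaws.com"]
    resources: ["hyperpodpytorchjobs"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: training-job-editor-binding
  namespace: hyperpod-ns-team-a
subjects:
  - kind: Group
    name: team-a
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: training-job-editor
  apiGroup: rbac.authorization.k8s.io
```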

#### Resource and Compute Constraints
<a name="sagemaker-eks-elastic-resource-constraints"></a>

To prevent resource contention and ensure fair scheduling across teams, apply quotas and limits at the namespace level:
+ ResourceQuotas to cap aggregate CPU, memory, storage, and object counts (pods, PVCs, services, etc.).
+ LimitRanges to enforce default and maximum per-pod or per-container CPU and memory limits.
+ PodDisruptionBudgets (PDBs) as needed to define resiliency expectations.
+ Optional: Namespace-level queueing constraints (e.g., via Task Governance or Kueue) to prevent users from over-submitting jobs.

These constraints help maintain cluster stability and support predictable scheduling for distributed training workloads.
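
For example, a minimal LimitRange sketch that sets default and maximum per-container CPU and memory in a team namespace might look like the following; the namespace name is a placeholder, and the values should be adjusted to your instance types.

```
apiVersion: v1
kind: LimitRange
metadata:
  name: team-limits
  namespace: hyperpod-ns-team-a
spec:
  limits:
    - type: Container
      default:           # applied when a container specifies no limits
        cpu: "2"
        memory: 4Gi
      defaultRequest:    # applied when a container specifies no requests
        cpu: "1"
        memory: 2Gi
      max:               # hard per-container ceiling
        cpu: "192"
        memory: 2048Gi
```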

#### Auto-scaling
<a name="sagemaker-eks-elastic-autoscaling"></a>

SageMaker HyperPod on EKS supports cluster autoscaling through Karpenter. When Karpenter or a similar resource provisioner is used together with elastic training, both the cluster and the elastic training job may scale up automatically after an elastic job is submitted. This occurs because the elastic training operator takes a greedy approach: it continuously requests more than the available compute resources until it reaches the maximum limit set by the job, and continuous resource provisioners like Karpenter serve those requests by provisioning additional nodes.

To keep these scale-ups predictable and under control, we recommend configuring namespace-level ResourceQuotas in the namespaces where elastic training jobs are created. ResourceQuotas help limit the maximum resources that jobs can request, preventing unbounded cluster growth while still allowing elastic behavior within defined limits.

For example, a ResourceQuota for 8 ml.p5.48xlarge instances (each with 8 GPUs, 192 vCPUs, and 32 EFA interfaces) has the following form:

```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: <quota-name>
  namespace: <namespace-name>
spec:
  hard:
    nvidia.com/gpu: "64"
    vpc.amazonaws.com/efa: "256"
    requests.cpu: "1536"
    requests.memory: "5120Gi"
    limits.cpu: "1536"
    limits.memory: "5120Gi"
```

## Build Training Container
<a name="sagemaker-eks-elastic-build-container"></a>

The HyperPod training operator works with a custom PyTorch launcher provided by the HyperPod elastic agent Python package ([https://www.piwheels.org/project/hyperpod-elastic-agent/](https://www.piwheels.org/project/hyperpod-elastic-agent/)). You must install the elastic agent and replace the `torchrun` command with `hyperpodrun` to launch training. For more details, see:

[https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-install.html#sagemaker-eks-operator-elastic-agent](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-install.html#sagemaker-eks-operator-elastic-agent)

An example training container:

```
FROM ...

...

RUN pip install hyperpod-elastic-agent
ENTRYPOINT ["entrypoint.sh"]

# entrypoint.sh
# --rdzv-backend hyperpod is optional, and other torchrun arguments are
# also accepted. The pre-train and post-train argument groups run scripts
# before and after training.
hyperpodrun --nnodes=node_count --nproc-per-node=proc_count \
  --rdzv-backend hyperpod \
  --pre-train-script pre.sh --pre-train-args "pre_1 pre_2 pre_3" \
  --post-train-script post.sh --post-train-args "post_1 post_2 post_3" \
  training.py --script-args
```

## Training code modification
<a name="sagemaker-eks-elastic-training-code"></a>

SageMaker HyperPod provides a set of recipes that are already configured to run with an elastic policy.

To enable elastic training for custom PyTorch training scripts, you need to make minor modifications to your training loop. This guide walks you through the changes required to make your training job respond to elastic scaling events, which occur when compute resource availability changes. During an elastic event (for example, when nodes become available or get preempted), the training job receives an elastic event signal. The signal is used to coordinate a graceful shutdown by saving a checkpoint; training then resumes from that saved checkpoint with a new world configuration. To enable elastic training with custom training scripts, you need to:

### Detect Elastic Scaling Events
<a name="sagemaker-eks-elastic-detect-events"></a>

In your training loop, check for elastic events during each iteration:

```
from hyperpod_elastic_agent.elastic_event_handler import elastic_event_detected

def train_epoch(model, dataloader, optimizer, args):
    for batch_idx, batch_data in enumerate(dataloader):
        # Forward and backward pass
        loss = model(batch_data).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Handle checkpointing and elastic scaling
        should_checkpoint = (batch_idx + 1) % args.checkpoint_freq == 0
        elastic_event = elastic_event_detected()
        
        # Save a checkpoint on the regular schedule or when a scaling event occurs
        if should_checkpoint or elastic_event:
            save_checkpoint(model, optimizer, scheduler, 
                            checkpoint_dir=args.checkpoint_dir, 
                            step=global_step)
              
            if elastic_event:
                print("Elastic scaling event detected. Checkpoint saved.")
                return
```

### Implement Checkpoint Saving and Checkpoint Loading
<a name="sagemaker-eks-elastic-checkpoint-implementation"></a>

Note: We recommend using PyTorch Distributed Checkpoint (DCP) for saving model and optimizer states, as DCP supports resuming from a checkpoint with different world sizes. Other checkpointing formats may not support checkpoint loading on different world sizes, in which case you'll need to implement custom logic to handle dynamic world size changes.

```
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

def save_checkpoint(model, optimizer, lr_scheduler, user_content, checkpoint_path):
    """Save checkpoint using DCP for elastic training."""
    state_dict = {
        "model": model,
        "optimizer": optimizer,
        "lr_scheduler": lr_scheduler,
        **user_content
    }
      
    dcp.save(
        state_dict=state_dict,
        storage_writer=dcp.FileSystemWriter(checkpoint_path)
    )

def load_checkpoint(model, optimizer, lr_scheduler, checkpoint_path):
    """Load checkpoint using DCP with automatic resharding."""
    state_dict = {
        "model": model,
        "optimizer": optimizer,
        "lr_scheduler": lr_scheduler
    }
      
    dcp.load(
        state_dict=state_dict,
        storage_reader=dcp.FileSystemReader(checkpoint_path)
    )
      
    return model, optimizer, lr_scheduler
```

### (Optional) Use stateful dataloaders
<a name="sagemaker-eks-elastic-stateful-dataloaders"></a>

If you're only training for a single-epoch (i.e., one single pass through the entire dataset), the model must see each data sample exactly once. If the training job stops mid-epoch and resumes with a different world size, previously processed data samples will be repeated if the dataloader state is not persisted. A stateful dataloader prevents this by saving and restoring the dataloader's position, ensuring that resumed runs continue from the elastic scaling event without reprocessing any samples. We recommend using [StatefulDataLoader](https://meta-pytorch.org/data/main/torchdata.stateful_dataloader.html), which is a drop‑in replacement for `torch.utils.data.DataLoader` that adds `state_dict()` and `load_state_dict()` methods, enabling mid‑epoch checkpointing of the data loading process.
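The contract that makes this work can be illustrated with a plain-Python sketch. The `ResumableLoader` class below is hypothetical and only mimics the `state_dict()`/`load_state_dict()` interface; in real training code, use `torchdata.stateful_dataloader.StatefulDataLoader` directly.

```python
# Illustrative only: a minimal loader exposing the state_dict()/load_state_dict()
# contract that StatefulDataLoader adds on top of torch.utils.data.DataLoader.
class ResumableLoader:
    def __init__(self, samples):
        self.samples = samples
        self.position = 0  # index of the next sample to yield

    def __iter__(self):
        while self.position < len(self.samples):
            sample = self.samples[self.position]
            self.position += 1
            yield sample

    def state_dict(self):
        # Persist this dict alongside the model/optimizer checkpoint.
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]

loader = ResumableLoader(["a", "b", "c", "d"])
it = iter(loader)
next(it), next(it)              # two samples consumed before the elastic event
saved = loader.state_dict()     # saved together with the checkpoint

resumed = ResumableLoader(["a", "b", "c", "d"])
resumed.load_state_dict(saved)  # the restarted run skips already-seen samples
print(list(resumed))            # ['c', 'd']
```

Note that after an elastic restart with a different world size, the real StatefulDataLoader also has to account for the changed sharding across ranks; this sketch only shows the single-process resume pattern.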

## Submitting elastic training jobs
<a name="sagemaker-eks-elastic-submit-job"></a>

The [HyperPod training operator](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator-usage.html) defines a new resource type, `hyperpodpytorchjob`. Elastic training extends this resource type and adds the fields shown below:

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  name: elastic-training-job
spec:
  elasticPolicy:
    minReplicas: 1
    maxReplicas: 4
    # Increase the pod count in fixed-size increments:
    # the pod count equals minReplicas + N * replicaIncrementStep
    replicaIncrementStep: 1
    # ...or provide the exact pod counts that are valid for training
    replicaDiscreteValues: [2,4,8]

    # How long the training operator waits for the job to save a checkpoint
    # and exit during scaling events. The job is force-stopped after this period
    gracefulShutdownTimeoutInSeconds: 600

    # When a scaling event is detected, how long the job controller waits
    # before initiating a scale-up. A delay can prevent frequent
    # scale-ups and scale-downs
    scalingTimeoutInSeconds: 60

    # In case of faults, how long elastic training should wait for
    # recovery before triggering a scale-down
    faultyScaleDownTimeoutInSeconds: 30
  ...
  replicaSpecs:
    - name: pods
      replicas: 4           # Initial replica count
      maxReplicas: 8        # Max for this replica spec (should match elasticPolicy.maxReplicas)
      ...
```
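To reason about which world sizes a given `elasticPolicy` permits, the two options above can be sketched as follows. This is illustrative only; in particular, whether the operator clips `replicaDiscreteValues` to the `minReplicas`/`maxReplicas` window is an assumption here, not documented behavior.

```python
# Sketch: enumerate the pod counts an elasticPolicy admits under each option.
def sizes_with_step(min_replicas, max_replicas, step):
    """World sizes of the form minReplicas + N * replicaIncrementStep."""
    return list(range(min_replicas, max_replicas + 1, step))

def sizes_with_discrete_values(values, min_replicas, max_replicas):
    """Explicit replicaDiscreteValues, kept within the [min, max] window."""
    return sorted(v for v in values if min_replicas <= v <= max_replicas)

print(sizes_with_step(1, 4, 1))                     # [1, 2, 3, 4]
print(sizes_with_discrete_values([2, 4, 8], 1, 4))  # [2, 4]
```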

### Using kubectl
<a name="sagemaker-eks-elastic-kubectl-apply"></a>

You can subsequently launch elastic training with the following command.

```
kubectl apply -f elastic-training-job.yaml
```

### Using SageMaker Recipes
<a name="sagemaker-eks-elastic-sagemaker-recipes"></a>

Elastic training jobs can be launched through [SageMaker HyperPod recipes](https://github.com/aws/sagemaker-hyperpod-recipes).

**Note**  
We have included **46** elastic recipes for **SFT** and **DPO** jobs in the HyperPod recipes repository. You can launch these jobs with a one-line change on top of an existing static launcher script:  
`++recipes.elastic_policy.is_elastic=true`

In addition to the static recipe fields, elastic recipes add the following fields to define the elastic behavior:

#### Elastic Policy
<a name="sagemaker-eks-elastic-policy"></a>

The `elastic_policy` field defines the job-level configuration for the elastic training job. It has the following configurations:
+ `is_elastic` : `bool` - whether this job is an elastic job
+ `min_nodes` : `int` - the minimum number of nodes used for elastic training
+ `max_nodes` : `int` - the maximum number of nodes used for elastic training
+ `replica_increment_step` : `int` - the increment size for fixed-size pod groups. This field is mutually exclusive with the `scale_config` field defined later.
+ `use_graceful_shutdown` : `bool` - whether to use graceful shutdown during scaling events. Defaults to `true`.
+ `scaling_timeout` : `int` - the waiting time in seconds during a scaling event before timing out
+ `graceful_shutdown_timeout` : `int` - the waiting time in seconds for graceful shutdown

The following is a sample definition of this field. You can also find it in the HyperPod recipes repository, in `recipes_collection/recipes/fine-tuning/llama/llmft_llama3_1_8b_instruct_seq4k_gpu_sft_lora.yaml`:

```
<static recipe>
...
elastic_policy:
  is_elastic: true
  min_nodes: 1
  max_nodes: 16
  use_graceful_shutdown: true
  scaling_timeout: 600
  graceful_shutdown_timeout: 600
```

#### Scale Config
<a name="sagemaker-eks-elastic-scale-config"></a>

The `scale_config` field defines overriding configurations at each specific scale. It's a key-value dictionary where each key is an integer representing a target scale and each value is a subset of the base recipe. At scale `<key>`, the `<value>` entries override the corresponding configurations in the base (static) recipe. The following shows an example of this field:

```
scale_config:   
...
  2:
    trainer:
      num_nodes: 2
    training_config:
      training_args:
        train_batch_size: 128
        micro_train_batch_size: 8
        learning_rate: 0.0004
  3:
    trainer:
      num_nodes: 3
    training_config:
      training_args:
        train_batch_size: 128
        learning_rate: 0.0004
        uneven_batch:
          use_uneven_batch: true
          num_dp_groups_with_small_batch_size: 16
          small_local_batch_size: 5
          large_local_batch_size: 6
 ...
```

The above configuration defines the training configuration at scales 2 and 3. In both cases, we use a learning rate of `4e-4` and a batch size of `128`. At scale 2, we use a `micro_train_batch_size` of 8, while at scale 3 we use an uneven batch size, because the train batch size cannot be evenly divided across 3 nodes.

**Uneven Batch Size**

This field defines the batch distribution behavior when the global batch size cannot be evenly divided by the number of ranks. It's not specific to elastic training, but it enables finer scaling granularity.
+ `use_uneven_batch` : `bool` - whether to use uneven batch distribution
+ `num_dp_groups_with_small_batch_size` : `int` - in uneven batch distribution, some ranks use a smaller local batch size while the others use a larger one. The global batch size should equal `small_local_batch_size * num_dp_groups_with_small_batch_size + (world_size - num_dp_groups_with_small_batch_size) * large_local_batch_size`
+ `small_local_batch_size` : `int` - the smaller local batch size
+ `large_local_batch_size` : `int` - the larger local batch size
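The scale-3 values in the example above can be derived mechanically. The helper below is a sketch; it assumes 8 data-parallel ranks per node (so a world size of 24 at 3 nodes), and your actual rank count depends on the instance type and parallelism layout.

```python
# Sketch: split a global batch across ranks when it does not divide evenly.
def uneven_batch_split(global_batch_size, world_size):
    large = -(-global_batch_size // world_size)   # ceiling division
    small = large - 1
    num_small = large * world_size - global_batch_size
    # Sanity check: the two groups together reproduce the global batch size
    assert small * num_small + large * (world_size - num_small) == global_batch_size
    return {
        "small_local_batch_size": small,
        "large_local_batch_size": large,
        "num_dp_groups_with_small_batch_size": num_small,
    }

# 128 samples over 3 nodes x 8 ranks = 24 ranks: 16 ranks take 5, 8 ranks take 6
print(uneven_batch_split(128, 24))
```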

**Monitor training on MLflow**

HyperPod recipe jobs support observability through MLflow. You can specify MLflow configurations in the recipe:

```
training_config:
  mlflow:
    tracking_uri: "<local_file_path or MLflow server URL>"
    run_id: "<MLflow run ID>"
    experiment_name: "<MLflow experiment name, e.g. llama_exps>"
    run_name: "<run name, e.g. llama3.1_8b>"
```

These configurations map to the corresponding [MLflow setup](https://mlflow.org/docs/latest/ml/tracking/tracking-api/#setup--configuration) options. The following is a sample MLflow dashboard for an elastic training job.

![\[The following is a sample MLflow dashboard for an elastic training job.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-elastic-sample-dashboard.png)


After defining the elastic recipes, you can use the launcher scripts, such as `launcher_scripts/llama/run_llmft_llama3_1_8b_instruct_seq4k_gpu_sft_lora.sh`, to launch an elastic training job. This is similar to launching a static job using a HyperPod recipe.

**Note**  
Elastic training jobs launched from recipes automatically resume from the latest checkpoint. However, by default, every restart creates a new training directory. To resume from the last checkpoint correctly, make sure the same training directory is reused. You can do this by setting  
`recipes.training_config.training_args.override_training_dir=true`

## Use-case examples and limitations
<a name="sagemaker-eks-elastic-use-cases"></a>

### Scale-up when more resources are available
<a name="sagemaker-eks-elastic-scale-up"></a>

When more resources become available on the cluster (for example, when other workloads complete), the training controller automatically scales up the training job. This behavior is explained below.

To simulate a situation where more resources become available, you can submit a high-priority job and then release the resources by deleting it.

```
# Submit a high-priority job on your cluster. As a result of this command,
# resources will not be available for elastic training
kubectl apply -f high_priority_job.yaml

# Submit an elastic job with normal priority
kubectl apply -f hyperpod_job_with_elasticity.yaml

# Wait for training to start....

# Delete the high-priority job. This command makes additional resources
# available for elastic training
kubectl delete -f high_priority_job.yaml

# Observe the scale-up of elastic job
```

Expected behavior:
+ **The training operator creates a Kueue Workload.** When an elastic training job requests a change in world size, the training operator generates an additional Kueue Workload object representing the new resource requirements.
+ **Kueue admits the Workload.** Kueue evaluates the request based on available resources, priorities, and queue policies. Once approved, the Workload is admitted.
+ **The training operator creates the additional pods.** Upon admission, the operator launches the additional pods required to reach the new world size.
+ **The training operator signals the training script.** When the new pods become ready, the training operator sends a special elastic event signal to the training script.
+ **The training job performs checkpointing to prepare for a graceful shutdown.** The training process periodically checks for the elastic event signal by calling the `elastic_event_detected()` function. Once detected, it initiates a checkpoint. After the checkpoint completes successfully, the training process exits cleanly.
+ **The training operator restarts the job with the new world size.** The operator waits for all processes to exit, then restarts the training job using the updated world size and the latest checkpoint.

**Note:** When Kueue is not used, the training operator skips the first two steps. It immediately attempts to create the additional pods required for the new world size. If sufficient resources are not available in the cluster, these pods will remain in a **Pending** state until capacity becomes available.

![\[The diagram illustrates the resizing and resource timeline.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-elastic-resize-timeline.png)


### Preemption by high priority job
<a name="sagemaker-eks-elastic-preemption"></a>

Elastic jobs can be scaled down automatically when a high-priority job needs resources. To simulate this behavior, you can submit an elastic training job that uses the maximum number of available resources from the start of training, then submit a high-priority job and observe the preemption behavior.

```
# Submit an elastic job with normal priority
kubectl apply -f hyperpod_job_with_elasticity.yaml

# Submit a high-priority job on your cluster. As a result of this command,
# some of the resources used by the elastic job will be preempted
kubectl apply -f high_priority_job.yaml

# Observe the scale-down behavior
```

When a high-priority job needs resources, Kueue can preempt lower-priority elastic training Workloads (there can be more than one Workload object associated with an elastic training job). The preemption process follows this sequence:

1. A high-priority job is submitted. The job creates a new Kueue Workload, but the Workload cannot be admitted due to insufficient cluster resources.

1. Kueue preempts one of the elastic training job's Workloads. Elastic jobs may have multiple active Workloads (one per world-size configuration). Kueue selects one to preempt based on priority and queue policies.

1. The training operator sends an elastic event signal. Once preemption is triggered, the training operator notifies the running training process to stop gracefully.

1. The training process performs checkpointing. The training job periodically checks for elastic event signals. When detected, it begins a coordinated checkpoint to preserve progress before shutting down.

1. The training operator cleans up pods and Workloads. The operator waits for checkpoint completion, then deletes the training pods that were part of the preempted Workload. It also removes the corresponding Workload object from Kueue.

1. The high-priority Workload is admitted. With resources freed, Kueue admits the high-priority job, allowing it to start execution.  
![\[Preemption timeline for elastic training workloads.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod/hyperpod-elastic-preemption-timeline.png)

Preemption can cause the entire training job to pause, which may not be desirable for all workflows. To avoid full-job suspension while still allowing elastic scaling, customers can configure two different priority levels within the same training job by defining two `replicaSpec` sections:
+ A primary (fixed) replicaSpec with normal or high priority
  + Contains the minimum required number of replicas needed to keep the training job running.
  + Uses a higher PriorityClass, ensuring these replicas are *never* preempted.
  + Maintains baseline progress even when the cluster is under resource pressure.
+ An elastic (scalable) replicaSpec with lower priority
  + Contains the additional optional replicas that provide extra compute during elastic scaling.
  + Uses a lower PriorityClass, allowing Kueue to preempt these replicas when higher-priority jobs need resources.
  + Ensures only the elastic portion is reclaimed, while the core training continues uninterrupted.

This configuration enables partial preemption: only the elastic capacity is reclaimed, which maintains training continuity while still supporting fair resource sharing in multi-tenant environments. Example:

```
apiVersion: sagemaker.amazonaws.com/v1
kind: HyperPodPyTorchJob
metadata:
  name: elastic-training-job
spec:
  elasticPolicy:
    minReplicas: 2
    maxReplicas: 8
    replicaIncrementStep: 2
  ...
  replicaSpecs:
    - name: base
      replicas: 2
      template:
        spec:
          priorityClassName: high-priority # set high-priority to avoid evictions
           ...
    - name: elastic
      replicas: 0
      maxReplicas: 6
      template:
        spec:
          priorityClassName: low-priority # Set low-priority for elastic part
           ...
```

### Handling pod eviction, pod crashes, and hardware degradation
<a name="sagemaker-eks-elastic-pod-eviction"></a>

The HyperPod training operator includes built-in mechanisms to recover the training process when it is unexpectedly interrupted. Interruptions can occur for various reasons, such as training code failures, pod evictions, node failures, hardware degradation, and other runtime issues.

When this happens, the operator automatically attempts to recreate the affected pods and resume training from the latest checkpoint. If recovery is not immediately possible, for example due to insufficient spare capacity, the operator can continue making progress by temporarily reducing the world size and scaling down the elastic training job.

When an elastic training job crashes or loses replicas, the system behaves as follows:
+ **Recovery phase (using spare nodes).** The training controller waits up to `faultyScaleDownTimeoutInSeconds` for resources to become available and attempts to recover the failed replicas by redeploying pods on spare capacity.
+ **Elastic scale-down.** If recovery is not possible within the timeout window, the training operator scales the job down to a smaller world size (if the job's elastic policy permits it). Training then resumes with fewer replicas.
+ **Elastic scale-up.** When additional resources become available again, the operator automatically scales the training job back up to the preferred world size.
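The sequence above can be summarized as a small decision function. This is an illustrative sketch of the described behavior, not the operator's actual (internal) logic:

```python
# Sketch: the fault-recovery decision described above, as a pure function.
def next_action(seconds_since_fault, faulty_scale_down_timeout,
                spare_replicas, failed_replicas, smaller_world_size_allowed):
    if spare_replicas >= failed_replicas:
        return "recover-on-spare-capacity"
    if seconds_since_fault < faulty_scale_down_timeout:
        return "wait-for-capacity"
    if smaller_world_size_allowed:
        return "elastic-scale-down"
    return "keep-waiting"  # no smaller valid world size; pods stay pending

# With faultyScaleDownTimeoutInSeconds: 30 and no spare capacity:
print(next_action(10, 30, 0, 2, True))  # wait-for-capacity
print(next_action(45, 30, 0, 2, True))  # elastic-scale-down
```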

This mechanism ensures that training can continue with minimal downtime, even under resource pressure or partial infrastructure failures, while still taking advantage of elastic scaling.

### Use elastic training with other HyperPod features
<a name="sagemaker-eks-elastic-other-features"></a>

Elastic training does not currently support checkpointless training capabilities, HyperPod managed tiered checkpointing, or Spot instances.

**Note**  
We collect certain routine aggregated and anonymized operational metrics to provide essential service availability. The creation of these metrics is fully automated and does not involve human review of the underlying model training workload. These metrics relate to job and scaling operations, resource management, and essential service functionality.

# Observability for Amazon SageMaker HyperPod cluster orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-cluster-observability"></a>

To achieve comprehensive observability into your Amazon SageMaker HyperPod (SageMaker HyperPod) cluster resources and software components, integrate the cluster with [Amazon CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html), [Amazon Managed Service for Prometheus](https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html), and [Amazon Managed Grafana](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html). These tools provide visibility into cluster health, performance metrics, and resource utilization.

The integration with Amazon Managed Service for Prometheus enables the export of metrics related to your HyperPod cluster resources, providing insights into their performance, utilization, and health. The integration with Amazon Managed Grafana enables the visualization of these metrics through various Grafana dashboards that offer an intuitive interface for monitoring and analyzing the cluster's behavior. By leveraging these services, you gain a centralized and unified view of your HyperPod cluster, facilitating proactive monitoring, troubleshooting, and optimization of your distributed training workloads.

**Note**  
While CloudWatch, Amazon Managed Service for Prometheus, and Amazon Managed Grafana focus on operational metrics (for example, system health, training job performance), SageMaker HyperPod Usage Reports complement [Task Governance](sagemaker-hyperpod-eks-operate-console-ui-governance.md) to provide financial and resource accountability insights. These reports track:  
+ Compute utilization (GPU/CPU/Neuron Core hours) across namespaces/teams
+ Cost attribution for allocated vs. borrowed resources
+ Historical trends (up to 180 days) for auditing and optimization

For more information about setting up and generating usage reports, see [Reporting Compute Usage in HyperPod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod-usage-reporting.html). 

**Tip**  
To find practical examples and solutions, see also the [Observability](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e/en-US/06-observability) section in the [Amazon EKS Support in SageMaker HyperPod workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/2433d39e-ccfe-4c00-9d3d-9917b729258e).

Proceed to the following topics to set up for SageMaker HyperPod cluster observability.

**Topics**
+ [Model observability for training jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-cluster-observability-model.md)
+ [Cluster and task observability](sagemaker-hyperpod-eks-cluster-observability-cluster.md)

# Model observability for training jobs on SageMaker HyperPod clusters orchestrated by Amazon EKS
<a name="sagemaker-hyperpod-eks-cluster-observability-model"></a>

SageMaker HyperPod clusters orchestrated with Amazon EKS can integrate with the [MLflow application on Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow.html). Cluster admins set up the MLflow server and connect it to their SageMaker HyperPod clusters. Data scientists can then track training runs and gain insights into their models.

**To set up an MLflow server using AWS CLI**

A cluster admin must create an MLflow tracking server.

1. Create a SageMaker AI MLflow tracking server, following the instructions at [Create a tracking server using the AWS CLI](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-create-tracking-server-cli.html#mlflow-create-tracking-server-cli-infra-setup).

1. Make sure that the [`AssumeRoleForPodIdentity`](https://docs.aws.amazon.com/eks/latest/APIReference/API_auth_AssumeRoleForPodIdentity.html) permission exists in the IAM execution role for SageMaker HyperPod.

1. If the `eks-pod-identity-agent` add-on is not already installed on your EKS cluster, install the add-on on the EKS cluster.

   ```
   aws eks create-addon \
       --cluster-name <eks_cluster_name> \
       --addon-name eks-pod-identity-agent \
       --addon-version vx.y.z-eksbuild.1
   ```

1. Create a `trust-relationship.json` file for a new role that Pods can use to call MLflow APIs.

   ```
   cat >trust-relationship.json <<EOF
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
               "Effect": "Allow",
               "Principal": {
                   "Service": "pods.eks.amazonaws.com"
               },
               "Action": [
                   "sts:AssumeRole",
                   "sts:TagSession"
               ]
           }
       ]
   }
   EOF
   ```

   Run the following code to create the role and attach the trust relationship.

   ```
   aws iam create-role --role-name hyperpod-mlflow-role \
       --assume-role-policy-document file://trust-relationship.json \
       --description "allow pods to emit mlflow metrics and put data in s3"
   ```

1. Create the following policy that grants Pods access to call all `sagemaker-mlflow` operations and to put model artifacts in Amazon S3. The S3 permission already exists within the tracking server, but if the model artifacts are too large, the MLflow code calls S3 directly to upload them.

   ```
   cat >hyperpod-mlflow-policy.json <<EOF
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker-mlflow:AccessUI",
                   "sagemaker-mlflow:CreateExperiment",
                   "sagemaker-mlflow:SearchExperiments",
                   "sagemaker-mlflow:GetExperiment",
                   "sagemaker-mlflow:GetExperimentByName",
                   "sagemaker-mlflow:DeleteExperiment",
                   "sagemaker-mlflow:RestoreExperiment",
                   "sagemaker-mlflow:UpdateExperiment",
                   "sagemaker-mlflow:CreateRun",
                   "sagemaker-mlflow:DeleteRun",
                   "sagemaker-mlflow:RestoreRun",
                   "sagemaker-mlflow:GetRun",
                   "sagemaker-mlflow:LogMetric",
                   "sagemaker-mlflow:LogBatch",
                   "sagemaker-mlflow:LogModel",
                   "sagemaker-mlflow:LogInputs",
                   "sagemaker-mlflow:SetExperimentTag",
                   "sagemaker-mlflow:SetTag",
                   "sagemaker-mlflow:DeleteTag",
                   "sagemaker-mlflow:LogParam",
                   "sagemaker-mlflow:GetMetricHistory",
                   "sagemaker-mlflow:SearchRuns",
                   "sagemaker-mlflow:ListArtifacts",
                   "sagemaker-mlflow:UpdateRun",
                   "sagemaker-mlflow:CreateRegisteredModel",
                   "sagemaker-mlflow:GetRegisteredModel",
                   "sagemaker-mlflow:RenameRegisteredModel",
                   "sagemaker-mlflow:UpdateRegisteredModel",
                   "sagemaker-mlflow:DeleteRegisteredModel",
                   "sagemaker-mlflow:GetLatestModelVersions",
                   "sagemaker-mlflow:CreateModelVersion",
                   "sagemaker-mlflow:GetModelVersion",
                   "sagemaker-mlflow:UpdateModelVersion",
                   "sagemaker-mlflow:DeleteModelVersion",
                   "sagemaker-mlflow:SearchModelVersions",
                   "sagemaker-mlflow:GetDownloadURIForModelVersionArtifacts",
                   "sagemaker-mlflow:TransitionModelVersionStage",
                   "sagemaker-mlflow:SearchRegisteredModels",
                   "sagemaker-mlflow:SetRegisteredModelTag",
                   "sagemaker-mlflow:DeleteRegisteredModelTag",
                   "sagemaker-mlflow:DeleteModelVersionTag",
                   "sagemaker-mlflow:DeleteRegisteredModelAlias",
                   "sagemaker-mlflow:SetRegisteredModelAlias",
                   "sagemaker-mlflow:GetModelVersionByAlias"
               ],
               "Resource": "arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/<ml tracking server name>"
           },
           {
               "Effect": "Allow",
               "Action": [
                   "s3:PutObject"
               ],
               "Resource": "arn:aws:s3:::<mlflow-s3-bucket_name>"
           }
       ]
   }
   EOF
   ```
**Note**  
The ARNs are those of the MLflow tracking server and the S3 bucket that you set up when you created the server following the instructions in [Set up MLflow infrastructure](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-create-tracking-server-cli.html#mlflow-create-tracking-server-cli-infra-setup).

1. Attach the `mlflow-metrics-emit-policy` policy to the `hyperpod-mlflow-role` using the policy document saved in the previous step.

   ```
   aws iam put-role-policy \
     --role-name hyperpod-mlflow-role \
     --policy-name mlflow-metrics-emit-policy \
     --policy-document file://hyperpod-mlflow-policy.json
   ```

1. Create a Kubernetes service account for Pods to access the MLflow server. 

   ```
   cat >mlflow-service-account.yaml <<EOF
   apiVersion: v1
   kind: ServiceAccount
   metadata:
     name: mlflow-service-account
     namespace: kubeflow
   EOF
   ```

   Run the following command to apply to the EKS cluster.

   ```
   kubectl apply -f mlflow-service-account.yaml
   ```

1. Create a Pod identity association.

   ```
   aws eks create-pod-identity-association \
       --cluster-name EKS_CLUSTER_NAME \
       --role-arn arn:aws:iam::111122223333:role/hyperpod-mlflow-role \
       --namespace kubeflow \
       --service-account mlflow-service-account
   ```

**To collect metrics from training jobs to the MLflow server**

Data scientists need to set up the training script and Docker image to emit metrics to the MLflow server.

1. Add the following lines at the beginning of your training script.

   ```
   import mlflow
   import os
   
   # Set the Tracking Server URI using the ARN of the Tracking Server you created
   mlflow.set_tracking_uri(os.environ['MLFLOW_TRACKING_ARN'])
   # Enable autologging in MLflow
   mlflow.autolog()
   ```

1. Build a Docker image with the training script and push to Amazon ECR. Get the ARN of the ECR container. For more information about building and pushing a Docker image, see [Pushing a Docker image](https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html) in the *ECR User Guide*.
**Tip**  
Make sure that you install the `mlflow` and `sagemaker-mlflow` packages in the Dockerfile. To learn more about installing the packages, requirements, and compatible package versions, see [Install MLflow and the SageMaker AI MLflow plugin](https://docs.aws.amazon.com/sagemaker/latest/dg/mlflow-track-experiments.html#mlflow-track-experiments-install-plugin).

1. Add a service account to the training job Pods to give them access to the `hyperpod-mlflow-role`. This allows the Pods to call MLflow APIs. Run the following SageMaker HyperPod CLI job submission template. Create it with the file name `mlflow-test.yaml`.

   ```
   defaults:
    - override hydra/job_logging: stdout
   
   hydra:
    run:
     dir: .
    output_subdir: null
   
   training_cfg:
    entry_script: ./train.py
    script_args: []
    run:
     name: test-job-with-mlflow # Current run name
     nodes: 2 # Number of nodes to use for current training
     # ntasks_per_node: 1 # Number of devices to use per node
   cluster:
    cluster_type: k8s # currently k8s only
    instance_type: ml.c5.2xlarge
    cluster_config:
     # name of service account associated with the namespace
     service_account_name: mlflow-service-account
     # persistent volume, usually used to mount FSx
     persistent_volume_claims: null
     namespace: kubeflow
     # required node affinity to select nodes with SageMaker HyperPod
     # labels and passed health check if burn-in enabled
     label_selector:
         required:
             sagemaker.amazonaws.com/node-health-status:
                 - Schedulable
         preferred:
             sagemaker.amazonaws.com/deep-health-check-status:
                 - Passed
         weights:
             - 100
     pullPolicy: IfNotPresent # policy to pull container, can be Always, IfNotPresent and Never
     restartPolicy: OnFailure # restart policy
   
   base_results_dir: ./result # Location to store the results, checkpoints and logs.
   container: 111122223333.dkr.ecr.us-west-2.amazonaws.com/tag # container to use
   
   env_vars:
    NCCL_DEBUG: INFO # Logging level for NCCL. Set to "INFO" for debug information
    MLFLOW_TRACKING_ARN: arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/tracking-server-name
   ```

1. Start the job using the YAML file as follows.

   ```
   hyperpod start-job --config-file /path/to/mlflow-test.yaml
   ```

1. Generate a presigned URL for the MLflow tracking server. You can open the link in your browser and start tracking your training job.

   ```
   aws sagemaker create-presigned-mlflow-tracking-server-url \
       --tracking-server-name "tracking-server-name" \
       --session-expiration-duration-in-seconds 1800 \
       --expires-in-seconds 300 \
       --region region
   ```
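
The training script in the first step trusts that `MLFLOW_TRACKING_ARN` is injected through the job template's `env_vars` section. Because autologging fails in non-obvious ways when the tracking URI is wrong, you might add a fail-fast check before calling `mlflow.set_tracking_uri`. The following is a minimal stdlib sketch; the ARN pattern is illustrative, not an official format specification.

```python
import os
import re

# Illustrative pattern for a SageMaker MLflow tracking server ARN, e.g.
# arn:aws:sagemaker:us-west-2:111122223333:mlflow-tracking-server/my-server
ARN_PATTERN = re.compile(
    r"^arn:aws[\w-]*:sagemaker:[a-z0-9-]+:\d{12}:mlflow-tracking-server/[\w.-]+$"
)

def tracking_arn_from_env(var: str = "MLFLOW_TRACKING_ARN") -> str:
    """Return the tracking server ARN from the environment, failing fast
    with a clear message if it is missing or malformed."""
    arn = os.environ.get(var)
    if not arn:
        raise RuntimeError(
            f"{var} is not set; check the env_vars section of your job template"
        )
    if not ARN_PATTERN.match(arn):
        raise RuntimeError(f"{var} does not look like a tracking server ARN: {arn}")
    return arn
```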

# Cluster and task observability
<a name="sagemaker-hyperpod-eks-cluster-observability-cluster"></a>

There are two options for monitoring SageMaker HyperPod clusters:

**The SageMaker HyperPod observability add-on**—SageMaker HyperPod provides a comprehensive, out-of-the-box dashboard that gives you insights into foundation model (FM) development tasks and cluster resources. This unified observability solution automatically publishes key metrics to Amazon Managed Service for Prometheus and displays them in Amazon Managed Grafana dashboards. The dashboards are optimized specifically for FM development with deep coverage of hardware health, resource utilization, and task-level performance. With this add-on, you can consolidate health and performance data from NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter, integrated file systems, Kubernetes APIs, Kueue, and SageMaker HyperPod task operators.

**Amazon CloudWatch Container Insights**—Amazon CloudWatch Container Insights collects metrics for compute resources, such as CPU, memory, disk, and network. Container Insights also provides diagnostic information, such as container restart failures, to help you isolate issues and resolve them quickly. You can also set CloudWatch alarms on metrics that Container Insights collects.

**Topics**
+ [Amazon SageMaker HyperPod observability with Amazon Managed Grafana and Amazon Managed Service for Prometheus](sagemaker-hyperpod-observability-addon.md)
+ [Observability with Amazon CloudWatch](sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci.md)

# Amazon SageMaker HyperPod observability with Amazon Managed Grafana and Amazon Managed Service for Prometheus
<a name="sagemaker-hyperpod-observability-addon"></a>

Amazon SageMaker HyperPod (SageMaker HyperPod) provides a comprehensive, out-of-the-box dashboard that gives you insights into foundation model (FM) development tasks and cluster resources. This unified observability solution automatically publishes key metrics to Amazon Managed Service for Prometheus and displays them in Amazon Managed Grafana dashboards. The dashboards are optimized specifically for FM development with deep coverage of hardware health, resource utilization, and task-level performance. With this add-on, you can consolidate health and performance data from NVIDIA DCGM, instance-level Kubernetes node exporters, Elastic Fabric Adapter, integrated file systems, Kubernetes APIs, Kueue, and SageMaker HyperPod task operators.

## Restricted Instance Group (RIG) support
<a name="hyperpod-observability-addon-rig-support"></a>

The observability add-on also supports clusters that contain Restricted Instance Groups. In RIG clusters, the add-on automatically adapts its deployment strategy to comply with the network isolation and security constraints of restricted nodes. DaemonSet components (node exporter, DCGM exporter, EFA exporter, Neuron monitor, and node collector) run on both standard and restricted nodes. Deployment components (central collector, Kube State Metrics, and Training Metrics Agent) are scheduled with boundary-aware logic to respect network isolation between instance groups. Container log collection with Fluent Bit is not available on restricted nodes.

For information about setting up the add-on on clusters with Restricted Instance Groups, see [Setting up the SageMaker HyperPod observability add-on](hyperpod-observability-addon-setup.md).

**Topics**
+ [Restricted Instance Group (RIG) support](#hyperpod-observability-addon-rig-support)
+ [Setting up the SageMaker HyperPod observability add-on](hyperpod-observability-addon-setup.md)
+ [Amazon SageMaker HyperPod observability dashboards](hyperpod-observability-addon-viewing-dashboards.md)
+ [Exploring SageMaker HyperPod cluster metrics in Amazon Managed Grafana](hyperpod-observability-addon-exploring-metrics.md)
+ [Customizing SageMaker HyperPod cluster metrics dashboards and alerts](hyperpod-observability-addon-customizing.md)
+ [Creating custom SageMaker HyperPod cluster metrics](hyperpod-observability-addon-custom-metrics.md)
+ [SageMaker HyperPod cluster metrics](hyperpod-observability-cluster-metrics.md)
+ [Preconfigured alerts](hyperpod-observability-addon-alerts.md)
+ [Troubleshooting the Amazon SageMaker HyperPod observability add-on](hyperpod-observability-addon-troubleshooting.md)

# Setting up the SageMaker HyperPod observability add-on
<a name="hyperpod-observability-addon-setup"></a>

The following list describes the prerequisites for setting up the observability add-on.

To have metrics for your Amazon SageMaker HyperPod (SageMaker HyperPod) cluster sent to an Amazon Managed Service for Prometheus workspace, and to optionally view them in Amazon Managed Grafana, complete the following prerequisites.
+ To use Amazon Managed Grafana, enable AWS IAM Identity Center (IAM Identity Center) in an AWS Region where Amazon Managed Grafana is available. For instructions, see [Getting started with IAM Identity Center](https://docs.aws.amazon.com/singlesignon/latest/userguide/getting-started.html) in the *AWS IAM Identity Center User Guide*. For a list of AWS Regions where Amazon Managed Grafana is available, see [Supported Regions](https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Service-Grafana.html#AMG-supported-Regions) in the *Amazon Managed Grafana User Guide*.
+ Create at least one user in IAM Identity Center.
+ Ensure that the [Amazon EKS Pod Identity Agent](https://docs.aws.amazon.com/eks/latest/userguide/workloads-add-ons-available-eks.html#add-ons-pod-id) add-on is installed in your Amazon EKS cluster. The Amazon EKS Pod Identity Agent add-on makes it possible for the SageMaker HyperPod observability add-on to get the credentials to interact with Amazon Managed Service for Prometheus and CloudWatch Logs. To check whether your Amazon EKS cluster has the add-on, go to the Amazon EKS console, and check your cluster's **Add-ons** tab. For information about how to install the add-on if it's not installed, see [Create add-on (AWS Management Console)](https://docs.aws.amazon.com/eks/latest/userguide/creating-an-add-on.html#_create_add_on_console) in the *Amazon EKS User Guide*.
**Note**  
The Amazon EKS Pod Identity Agent is required for standard instance groups. For Restricted Instance Groups (RIG), the Pod Identity Agent is not available due to network isolation constraints. The cluster's instance group execution IAM role is used to interact with Amazon Managed Service for Prometheus. For information about how to configure that role, see [Additional prerequisites for Restricted Instance Groups](#hyperpod-observability-addon-rig-prerequisites).
+ Ensure that you have at least one node in your SageMaker HyperPod cluster before installing the SageMaker HyperPod observability add-on. The smallest Amazon EC2 instance size that works in this case is `4xlarge`. This minimum size ensures that the node can accommodate all the pods that the SageMaker HyperPod observability add-on creates alongside any other pods already running on the cluster.
+ Add the following policies and permissions to your role.
  + [AWS managed policy: AmazonSageMakerHyperPodObservabilityAdminAccess](security-iam-awsmanpol-AmazonSageMakerHyperPodObservabilityAdminAccess.md)
  + [AWS managed policy: AWSGrafanaWorkspacePermissionManagementV2](https://docs.aws.amazon.com/grafana/latest/userguide/security-iam-awsmanpol.html#security-iam-awsmanpol-AWSGrafanaWorkspacePermissionManagementV2)
  + [AWS managed policy: AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html)
  + Additional permissions to set up required IAM roles for Amazon Managed Grafana and Amazon Elastic Kubernetes Service add-on access:

------
#### [ JSON ]


    ```
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CreateRoleAccess",
                "Effect": "Allow",
                "Action": [
                    "iam:CreateRole",
                    "iam:CreatePolicy",
                    "iam:AttachRolePolicy",
                    "iam:ListRoles"
                ],
                "Resource": [
                    "arn:aws:iam::*:role/service-role/AmazonSageMakerHyperPodObservabilityGrafanaAccess*",
                    "arn:aws:iam::*:role/service-role/AmazonSageMakerHyperPodObservabilityAddonAccess*",
                    "arn:aws:iam::*:policy/service-role/HyperPodObservabilityAddonPolicy*",
                    "arn:aws:iam::*:policy/service-role/HyperPodObservabilityGrafanaPolicy*"
                ]
            }
        ]
    }
    ```

------
  + Additional permissions needed to manage IAM Identity Center users for Amazon Managed Grafana:

------
#### [ JSON ]


    ```
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "SSOAccess",
                "Effect": "Allow",
                "Action": [
                    "sso:ListProfileAssociations",
                    "sso-directory:SearchUsers",
                    "sso-directory:SearchGroups",
                    "sso:AssociateProfile",
                    "sso:DisassociateProfile"
                ],
                "Resource": [
                    "*"
                ]
            }
        ]
    }
    ```

------

## Additional prerequisites for Restricted Instance Groups
<a name="hyperpod-observability-addon-rig-prerequisites"></a>

If your cluster contains Restricted Instance Groups, the instance group execution role must have permissions to write metrics to Amazon Managed Service for Prometheus. When you use **Quick setup** to create your cluster with observability enabled, these permissions are added to the execution role automatically.

If you are using **Custom setup** or adding observability to an existing RIG cluster, ensure that the execution role for each Restricted Instance Group has the following permissions:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PrometheusAccess",
            "Effect": "Allow",
            "Action": "aps:RemoteWrite",
            "Resource": "arn:aws:aps:us-east-1:account_id:workspace/workspace-ID"
        }
    ]
}
```

Replace *us-east-1*, *account\_id*, and *workspace-ID* with your AWS Region, account ID, and Amazon Managed Service for Prometheus workspace ID.
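
If you manage IAM policies in code, a small helper can render this statement with your values substituted. The following is a sketch; the workspace ID shown is a placeholder.

```python
import json

def rig_prometheus_policy(region: str, account_id: str, workspace_id: str) -> str:
    """Render the RIG execution-role policy for a given Region, account,
    and Amazon Managed Service for Prometheus workspace ID."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PrometheusAccess",
                "Effect": "Allow",
                "Action": "aps:RemoteWrite",
                "Resource": f"arn:aws:aps:{region}:{account_id}:workspace/{workspace_id}",
            }
        ],
    }
    return json.dumps(policy, indent=4)

# Placeholder values; substitute your own Region, account, and workspace ID.
print(rig_prometheus_policy("us-east-1", "111122223333", "ws-EXAMPLE1-2345"))
```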

After you ensure that you have met the above prerequisites, you can install the observability add-on.

**To quickly install the observability add-on**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the add-on named **HyperPod Monitoring & Observability**, and choose **Quick install**.

**To do a custom-install of the observability add-on**

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the add-on named **HyperPod Monitoring & Observability**, and choose **Custom install**.

1. Specify the metrics categories that you want to see. For more information about these metrics categories, see [SageMaker HyperPod cluster metrics](hyperpod-observability-cluster-metrics.md).

1. Specify whether you want to enable Amazon CloudWatch Logs.

1. Specify whether you want the service to create a new Amazon Managed Service for Prometheus workspace.

1. To be able to view the metrics in Amazon Managed Grafana dashboards, check the box labeled **Use an Amazon Managed Grafana workspace**. You can specify your own workspace or let the service create a new one for you. 
**Note**  
Amazon Managed Grafana isn't available in all AWS Regions in which Amazon Managed Service for Prometheus is available. However, you can set up a Grafana workspace in any AWS Region and configure it to get metrics data from a Prometheus workspace that resides in a different AWS Region. For information, see [Use AWS data source configuration to add Amazon Managed Service for Prometheus as a data source](https://docs.aws.amazon.com/grafana/latest/userguide/AMP-adding-AWS-config.html) and [Connect to Amazon Managed Service for Prometheus and open-source Prometheus data sources](https://docs.aws.amazon.com/grafana/latest/userguide/prometheus-data-source.html). 

# Amazon SageMaker HyperPod observability dashboards
<a name="hyperpod-observability-addon-viewing-dashboards"></a>

This topic describes how to view metrics dashboards for your Amazon SageMaker HyperPod (SageMaker HyperPod) clusters and how to add new users to a dashboard. The topic also describes the different types of dashboards.

## Accessing dashboards
<a name="hyperpod-observability-addon-accessing-dashboards"></a>

To view your SageMaker HyperPod cluster's metrics in Amazon Managed Grafana, perform the following steps:

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Go to your cluster's details page.

1. On the **Dashboard** tab, locate the **HyperPod Observability** section, and choose **Open dashboard in Grafana**.

## Adding new users to an Amazon Managed Grafana workspace
<a name="hyperpod-observability-addon-adding-users"></a>

For information about how to add users to an Amazon Managed Grafana workspace, see [Use AWS IAM Identity Center with your Amazon Managed Grafana workspace](https://docs.aws.amazon.com/grafana/latest/userguide/authentication-in-AMG-SSO.html) in the *Amazon Managed Grafana User Guide*.

## Observability dashboards
<a name="hyperpod-observability-addon-dashboards.title"></a>

The SageMaker HyperPod observability add-on provides a set of interconnected dashboards in your default Amazon Managed Grafana workspace. Each dashboard provides in-depth insights about different resources and tasks in the clusters for various users such as data scientists, machine learning engineers, and administrators.

### Task dashboard
<a name="hyperpod-observability-addon-task-dashboard"></a>

The Task dashboard provides comprehensive monitoring and visualization of resource utilization metrics for SageMaker HyperPod tasks. The main panel displays a detailed table grouping resource usage by parent tasks, showing CPU, GPU, and memory utilization across pods. Interactive time-series graphs track CPU usage, system memory consumption, GPU utilization percentages, and GPU memory usage for selected pods, allowing you to monitor performance trends over time. The dashboard features powerful filtering capabilities through variables like cluster name, namespace, task type, and specific pods, making it easy to drill down into specific workloads. This monitoring solution is essential for optimizing resource allocation and maintaining performance of machine learning workloads on SageMaker HyperPod.

### Training dashboard
<a name="hyperpod-observability-addon-training-dashboard"></a>

The training dashboard provides comprehensive monitoring of training task health, reliability, and fault management metrics. The dashboard features key performance indicators including task creation counts, success rates, and uptime percentages, along with detailed tracking of both automatic and manual restart events. It offers detailed visualizations of fault patterns through pie charts and heatmaps that break down incidents by type and remediation latency, enabling you to identify recurring issues and optimize task reliability. The interface includes real-time monitoring of critical metrics like system recovery times and fault detection latencies, making it an essential tool for maintaining high availability of training workloads. Additionally, the dashboard's 24-hour trailing window provides historical context for analyzing trends and patterns in training task performance, helping teams proactively address potential issues before they impact production workloads.

### Inference dashboard
<a name="hyperpod-observability-addon-inference-dashboard"></a>

The inference dashboard provides comprehensive monitoring of model deployment performance and health metrics across multiple dimensions. It features a detailed overview of active deployments, real-time monitoring of request rates, success percentages, and latency metrics, enabling you to track model serving performance and identify potential bottlenecks. The dashboard includes specialized panels for both general inference metrics and token-specific metrics for language models, such as time to first token (TTFT) and token throughput, making it particularly valuable for monitoring large language model deployments. Additionally, it provides infrastructure insights through pod and node allocation tracking, while offering detailed error analysis capabilities to help maintain high availability and performance of inference workloads.

### Cluster dashboard
<a name="hyperpod-observability-addon-cluster-dashboard"></a>

The cluster dashboard provides a comprehensive view of cluster health and performance, offering real-time visibility into compute, memory, network, and storage resources across your Amazon SageMaker HyperPod (SageMaker HyperPod) environment. At a glance, you can view critical metrics including total instances, GPU utilization, memory usage, and network performance through an intuitive interface that automatically updates data every few seconds. The dashboard is organized into logical sections, starting with a high-level cluster overview that displays key metrics such as healthy instance percentage and total resource counts, followed by detailed sections for GPU performance, memory utilization, network statistics, and storage metrics. Each section features interactive graphs and panels that allow you to drill down into specific metrics, with customizable time ranges and filtering options by cluster name, instance, or GPU ID.

### File system dashboard
<a name="hyperpod-observability-addon-filesystem-dashboard"></a>

The file-system dashboard provides comprehensive visibility into file system (Amazon FSx for Lustre) performance and health metrics. The dashboard displays critical storage metrics including free capacity, deduplication savings, CPU/memory utilization, disk IOPS, throughput, and client connections across multiple visualizations. It makes it possible for you to monitor both system-level performance indicators like CPU and memory usage, as well as storage-specific metrics such as read/write operations and disk utilization patterns. The interface includes alert monitoring capabilities and detailed time-series graphs for tracking performance trends over time, making it valuable for proactive maintenance and capacity planning. Additionally, through its comprehensive metrics coverage, the dashboard helps identify potential bottlenecks, optimize storage performance, and ensure reliable file system operations for SageMaker HyperPod workloads.

### GPU partition dashboard
<a name="hyperpod-observability-addon-gpu-partition-dashboard"></a>

To monitor GPU partition-specific metrics when using Multi-Instance GPU (MIG) configurations, you need to install or upgrade to the latest version of the SageMaker HyperPod observability add-on. The add-on provides comprehensive monitoring capabilities, including MIG-specific metrics such as partition count, memory usage, and compute utilization per GPU partition.

If you already have the SageMaker HyperPod observability add-on installed but need MIG metrics support, update the add-on to the latest version. This process is non-disruptive and maintains your existing monitoring configuration.

SageMaker HyperPod automatically exposes MIG-specific metrics, including:
+ `nvidia_mig_instance_count`: Number of MIG instances per profile
+ `nvidia_mig_memory_usage`: Memory utilization per MIG instance
+ `nvidia_mig_compute_utilization`: Compute utilization per MIG instance
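
To spot-check these metrics outside Grafana, you can query them directly from the Amazon Managed Service for Prometheus query API. The following sketch builds an instant-query URL for `nvidia_mig_instance_count`; the workspace ID is a placeholder, and real requests must be SigV4-signed (for example, with `awscurl` or a signing HTTP client), which is omitted here.

```python
from urllib.parse import urlencode

# Placeholder values; substitute your own Region and workspace ID.
region = "us-west-2"
workspace_id = "ws-EXAMPLE1-2345"

# Amazon Managed Service for Prometheus query endpoint for the workspace.
base = f"https://aps-workspaces.{region}.amazonaws.com/workspaces/{workspace_id}/api/v1/query"

# PromQL instant query: MIG instance count summed per profile.
params = urlencode({"query": "sum by (profile) (nvidia_mig_instance_count)"})
query_url = f"{base}?{params}"
print(query_url)
```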

### Cluster Logs dashboard
<a name="hyperpod-observability-addon-cluster-logs-dashboard"></a>

The Cluster Logs dashboard provides a centralized view of CloudWatch Logs for your SageMaker HyperPod cluster. The dashboard queries the `/aws/sagemaker/Clusters/{cluster-name}/{cluster-id}` log group and displays log events with filtering capabilities by instance ID, log stream name, log level (ERROR, WARN, INFO, DEBUG), and free-text search. The dashboard includes an events timeline showing log event distribution over time, a total events counter, a searched events timeline for filtered results, and a detailed logs panel with full log messages, timestamps, and log stream metadata. This dashboard uses CloudWatch as its data source and is useful for debugging cluster issues, monitoring instance health events, and investigating training job failures.
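
If you query these logs outside Grafana, for example with CloudWatch Logs Insights or the AWS CLI, the log group name can be derived from the same pattern. A trivial sketch with placeholder values:

```python
# Derive the CloudWatch Logs log group that the Cluster Logs dashboard
# queries for a given SageMaker HyperPod cluster.

def cluster_log_group(cluster_name: str, cluster_id: str) -> str:
    return f"/aws/sagemaker/Clusters/{cluster_name}/{cluster_id}"

# Placeholder cluster name and ID.
print(cluster_log_group("my-hyperpod-cluster", "abc123example"))
# -> /aws/sagemaker/Clusters/my-hyperpod-cluster/abc123example
```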

# Exploring SageMaker HyperPod cluster metrics in Amazon Managed Grafana
<a name="hyperpod-observability-addon-exploring-metrics"></a>

After you connect Amazon Managed Grafana to your Amazon Managed Service for Prometheus workspace, you can use Grafana's query editor and visualization tools to explore your metrics data. Amazon Managed Grafana provides multiple ways to interact with Prometheus data, including a comprehensive query editor for building PromQL expressions, a metrics browser for discovering available metrics and labels, and templating capabilities for creating dynamic dashboards. You can perform range queries to visualize time series data over periods and instant queries to retrieve the latest values, with options to format results as time series graphs, tables, or heatmaps. For detailed information about configuring query settings, using the metrics browser, and leveraging templating features, see [Using the Prometheus data source](https://docs.aws.amazon.com/grafana/latest/userguide/using-prometheus-datasource.html).

# Customizing SageMaker HyperPod cluster metrics dashboards and alerts
<a name="hyperpod-observability-addon-customizing"></a>

Amazon Managed Grafana makes it possible for you to create comprehensive dashboards that visualize your data through panels containing queries connected to your data sources. You can build dashboards from scratch, import existing ones, or export your creations for sharing and backup purposes. Grafana dashboards support dynamic functionality through variables that replace hard-coded values in queries, making your visualizations more flexible and interactive. You can also enhance your dashboards with features like annotations, library panels for reusability, version history management, and custom links to create a complete monitoring and observability solution. For step-by-step guidance on creating, importing, configuring, and managing dashboards, see [Building dashboards](https://docs.aws.amazon.com/grafana/latest/userguide/v10-dash-building-dashboards.html).

# Creating custom SageMaker HyperPod cluster metrics
<a name="hyperpod-observability-addon-custom-metrics"></a>

The Amazon SageMaker HyperPod (SageMaker HyperPod) observability add-on provides hundreds of health, performance, and efficiency metrics out-of-the-box. In addition to those metrics, you might need to monitor custom metrics specific to your applications or business needs that aren't captured by default metrics, such as model-specific performance indicators, data processing statistics, or application-specific measurements. To address this need, you can implement custom metrics collection using OpenTelemetry by integrating a Python code snippet into your application.

To create custom metrics, first run the following shell command to install the core OpenTelemetry components needed to instrument Python applications for observability. This installation makes it possible for Python applications that run on SageMaker HyperPod clusters to emit custom telemetry data. That data gets collected by the OpenTelemetry collector and forwarded to the observability infrastructure.

```
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
```

The following example script configures an OpenTelemetry metrics pipeline that automatically tags metrics with pod and node information, ensuring proper attribution within your cluster, and sends these metrics to the SageMaker HyperPod built-in observability stack every second. The script establishes a connection to the SageMaker HyperPod metrics collector, sets up appropriate resource attributes for identification, and provides a meter interface through which you can create various types of metrics (counters, gauges, or histograms) to track any aspect of your application's performance. Custom metrics integrate with the SageMaker HyperPod monitoring dashboards alongside system metrics. This integration allows for comprehensive observability through a single interface where you can create custom alerts, visualizations, and reports to monitor your workload's complete performance profile.

```
import os
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

# Get hostname/pod name
hostname = os.uname()[1]
node_name = os.getenv('NODE_NAME', 'unknown')

collector_endpoint = "hyperpod-otel-collector.hyperpod-observability:4317"

# Configure the OTLP exporter
exporter = OTLPMetricExporter(
    endpoint=collector_endpoint,
    insecure=True,
    timeout=5  # 5-second timeout (this parameter is in seconds)
)

reader = PeriodicExportingMetricReader(
    exporter,
    export_interval_millis=1000
)

resource = Resource.create({
    "service.name": "metric-test",
    "pod.name": hostname,
    "node.name": node_name
})

meter_provider = MeterProvider(
    metric_readers=[reader],
    resource=resource
)
metrics.set_meter_provider(meter_provider)

# Create a meter
meter = metrics.get_meter("test-meter")

# Create a counter
counter = meter.create_counter(
    name="test.counter",
    description="A test counter"
)

counter.add(1, {"pod": hostname, "node": node_name})

# Flush and shut down the meter provider so the final data points are
# exported before the process exits.
meter_provider.force_flush()
meter_provider.shutdown()
```

# SageMaker HyperPod cluster metrics
<a name="hyperpod-observability-cluster-metrics"></a>

Amazon SageMaker HyperPod (SageMaker HyperPod) publishes various metrics across 9 distinct categories to your Amazon Managed Service for Prometheus workspace. Not all metrics are enabled by default or displayed in your Amazon Managed Grafana workspace. The following table shows which metrics are enabled by default when you install the observability add-on, which categories have additional metrics that can be enabled for more granular cluster information, and where they appear in the Amazon Managed Grafana workspace.


| Metric category | Enabled by default? | Additional advanced metrics available? | Available under which Grafana dashboards? | 
| --- | --- | --- | --- | 
| Training metrics | Yes | Yes | Training | 
| Inference metrics | Yes | No | Inference | 
| Task governance metrics | No | Yes | None. Query your Amazon Managed Service for Prometheus workspace to build your own dashboard. | 
| Scaling metrics | No | Yes | None. Query your Amazon Managed Service for Prometheus workspace to build your own dashboard. | 
| Cluster metrics | Yes | Yes | Cluster | 
| Instance metrics | Yes | Yes | Cluster | 
| Accelerated compute metrics | Yes | Yes | Task, Cluster | 
| Network metrics | No | Yes | Cluster | 
| File system | Yes | No | File system | 

The following tables describe the metrics available for monitoring your SageMaker HyperPod cluster, organized by category.

## Metrics availability on Restricted Instance Groups
<a name="hyperpod-observability-rig-metrics-availability"></a>

When your cluster contains Restricted Instance Groups, most metrics categories are available on restricted nodes with the following exceptions and considerations. You can also set up alerting on any metric of your choice.


| Metric category | Available on RIG nodes? | Notes | 
| --- | --- | --- | 
| Training metrics | Yes | Kubeflow and Kubernetes pod metrics are collected. Advanced training KPI metrics (from Training Metrics Agent) are not available from the RIG nodes. | 
| Inference metrics | No | Inference workloads are not supported on Restricted Instance Groups. | 
| Task governance metrics | No | Kueue metrics are collected from the standard nodes only, if any. | 
| Scaling metrics | No | KEDA metrics are collected from the standard nodes only, if any. | 
| Cluster metrics | Yes | Kube State Metrics and API server metrics are available. Kube State Metrics is preferentially scheduled on standard nodes but can run on restricted nodes in RIG-only clusters. | 
| Instance metrics | Yes | Node Exporter and cAdvisor metrics are collected on all nodes including restricted nodes. | 
| Accelerated compute metrics | Yes | DCGM Exporter runs on GPU-enabled restricted nodes. Neuron Monitor runs on Neuron-enabled restricted nodes when advanced mode is enabled. | 
| Network metrics | Yes | EFA Exporter runs on EFA-enabled restricted nodes when advanced mode is enabled. | 
| File system metrics | Yes | FSx for Lustre cluster utilization metrics are supported on Restricted Instance Groups. | 

**Note**  
Container log collection with Fluent Bit is not deployed on restricted nodes. Cluster logs from restricted nodes are available through the SageMaker HyperPod platform independently of the observability add-on. You can view these logs in the Cluster Logs dashboard.

## Training metrics
<a name="hyperpod-observability-training-metrics"></a>

Use these metrics to track the performance of training tasks executed on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Kubeflow metrics | [https://github.com/kubeflow/trainer](https://github.com/kubeflow/trainer) | Yes | Kubeflow | 
| Kubernetes pod metrics | [https://github.com/kubernetes/kube-state-metrics](https://github.com/kubernetes/kube-state-metrics) | Yes | Kubernetes | 
| training\_uptime\_percentage | Percentage of training time out of the total window size | No | SageMaker HyperPod training operator | 
| training\_manual\_recovery\_count | Total number of manual restarts performed on the job | No | SageMaker HyperPod training operator | 
| training\_manual\_downtime\_ms | Total time in milliseconds the job was down due to manual interventions | No | SageMaker HyperPod training operator | 
| training\_auto\_recovery\_count | Total number of automatic recoveries | No | SageMaker HyperPod training operator | 
| training\_auto\_recovery\_downtime | Total infrastructure overhead time in milliseconds during fault recovery | No | SageMaker HyperPod training operator | 
| training\_fault\_count | Total number of faults encountered during training | No | SageMaker HyperPod training operator | 
| training\_fault\_type\_count | Distribution of faults by type | No | SageMaker HyperPod training operator | 
| training\_fault\_recovery\_time\_ms | Recovery time in milliseconds for each type of fault | No | SageMaker HyperPod training operator | 
| training\_time\_ms | Total time in milliseconds spent in actual training | No | SageMaker HyperPod training operator | 
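
After you enable the training operator metrics, you can graph them in Grafana with queries such as the following. This is a sketch; the `fault_type` label name is illustrative and may differ in your deployment.

```
# Faults in the last hour, grouped by type
sum(increase(training_fault_type_count[1h])) by (fault_type)

# Current share of wall-clock time spent training
training_uptime_percentage
```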

## Inference metrics
<a name="hyperpod-observability-inference-metrics"></a>

Use these metrics to track the performance of inference tasks on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| model\_invocations\_total | Total number of invocation requests to the model | Yes | SageMaker HyperPod inference operator | 
| model\_errors\_total | Total number of errors during model invocation | Yes | SageMaker HyperPod inference operator | 
| model\_concurrent\_requests | Active concurrent model requests | Yes | SageMaker HyperPod inference operator | 
| model\_latency\_milliseconds | Model invocation latency in milliseconds | Yes | SageMaker HyperPod inference operator | 
| model\_ttfb\_milliseconds | Model time to first byte latency in milliseconds | Yes | SageMaker HyperPod inference operator | 
| TGI | These metrics can be used to monitor the performance of TGI, auto-scale deployments, and help identify bottlenecks. For a detailed list of metrics, see [https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md](https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md). | Yes | Model container | 
| LMI | These metrics can be used to monitor the performance of LMI and help identify bottlenecks. For a detailed list of metrics, see [https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md](https://github.com/deepjavalibrary/djl-serving/blob/master/prometheus/README.md). | Yes | Model container | 
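
For example, you can derive request rates and error ratios from the counters above with queries such as the following (a sketch; add label matchers for a specific model as needed):

```
# Invocation rate over the last 5 minutes
sum(rate(model_invocations_total[5m]))

# Fraction of invocations that returned an error
sum(rate(model_errors_total[5m])) / sum(rate(model_invocations_total[5m]))
```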

## Task governance metrics
<a name="hyperpod-observability-task-governance-metrics"></a>

Use these metrics to monitor task governance and resource allocation on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Kueue | See [https://kueue.sigs.k8s.io/docs/reference/metrics/](https://kueue.sigs.k8s.io/docs/reference/metrics/). | No | Kueue | 

## Scaling metrics
<a name="hyperpod-observability-scaling-metrics"></a>

Use these metrics to monitor auto-scaling behavior and performance on the SageMaker HyperPod cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| KEDA Operator Metrics | See [https://keda.sh/docs/2.17/integrations/prometheus/\#operator](https://keda.sh/docs/2.17/integrations/prometheus/#operator). | No | Kubernetes Event-driven Autoscaler (KEDA) | 
| KEDA Webhook Metrics | See [https://keda.sh/docs/2.17/integrations/prometheus/\#admission-webhooks](https://keda.sh/docs/2.17/integrations/prometheus/#admission-webhooks). | No | Kubernetes Event-driven Autoscaler (KEDA) | 
| KEDA Metrics Server Metrics | See [https://keda.sh/docs/2.17/integrations/prometheus/\#metrics-server](https://keda.sh/docs/2.17/integrations/prometheus/#metrics-server). | No | Kubernetes Event-driven Autoscaler (KEDA) | 

## Cluster metrics
<a name="hyperpod-observability-cluster-health-metrics"></a>

Use these metrics to monitor overall cluster health and resource allocation.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Cluster health | Kubernetes API server metrics. See [https://kubernetes.io/docs/reference/instrumentation/metrics/](https://kubernetes.io/docs/reference/instrumentation/metrics/). | Yes | Kubernetes | 
| Kube State Metrics | See [https://github.com/kubernetes/kube-state-metrics/tree/main/docs\#default-resources](https://github.com/kubernetes/kube-state-metrics/tree/main/docs#default-resources). | Limited | Kubernetes | 
| Kube State Metrics (advanced) | See [https://github.com/kubernetes/kube-state-metrics/tree/main/docs\#optional-resources](https://github.com/kubernetes/kube-state-metrics/tree/main/docs#optional-resources). | No | Kubernetes | 

## Instance metrics
<a name="hyperpod-observability-instance-metrics"></a>

Use these metrics to monitor individual instance performance and health.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| Node Metrics | See [https://github.com/prometheus/node\_exporter?tab=readme-ov-file\#enabled-by-default](https://github.com/prometheus/node_exporter?tab=readme-ov-file#enabled-by-default). | Yes | Kubernetes | 
| Container Metrics | Container metrics exposed by cAdvisor. See [https://github.com/google/cadvisor](https://github.com/google/cadvisor). | Yes | Kubernetes | 

## Accelerated compute metrics
<a name="hyperpod-observability-accelerated-compute-metrics"></a>

Use these metrics to monitor the performance, health, and utilization of individual accelerated compute devices in your cluster.

**Note**  
When GPU partitioning with MIG (Multi-Instance GPU) is enabled on your cluster, DCGM metrics automatically provide partition-level granularity for monitoring individual MIG instances. Each MIG partition is exposed as a separate GPU device with its own metrics for temperature, power, memory utilization, and compute activity. This allows you to track resource usage and health for each GPU partition independently, enabling precise monitoring of workloads running on fractional GPU resources. For more information about configuring GPU partitioning, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| NVIDIA GPU | DCGM metrics. See [https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv](https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv). | Limited |  NVIDIA Data Center GPU Manager (DCGM)  | 
|  NVIDIA GPU (advanced)  | DCGM metrics that are commented out in the following CSV file: [https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv](https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv). | No |  NVIDIA Data Center GPU Manager (DCGM)  | 
| AWS Trainium | Neuron metrics. See [https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html\#neuron-monitor-nc-counters](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#neuron-monitor-nc-counters). | No | AWS Neuron Monitor | 

## Network metrics
<a name="hyperpod-observability-network-metrics"></a>

Use these metrics to monitor the performance and health of the Elastic Fabric Adapters (EFA) in your cluster.


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| EFA | See [https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation\_and\_observability/3.efa-node-exporter/README.md](https://github.com/aws-samples/awsome-distributed-training/blob/main/4.validation_and_observability/3.efa-node-exporter/README.md). | No | Elastic Fabric Adapter | 

## File system metrics
<a name="hyperpod-observability-file-system-metrics"></a>


| Metric name or type | Description | Enabled by default? | Metric source | 
| --- | --- | --- | --- | 
| File system | Amazon FSx for Lustre metrics from Amazon CloudWatch. See [Monitoring with Amazon CloudWatch](https://docs.aws.amazon.com/fsx/latest/LustreGuide/monitoring-cloudwatch.html). | Yes | Amazon FSx for Lustre | 

# Preconfigured alerts
<a name="hyperpod-observability-addon-alerts"></a>

The Amazon SageMaker HyperPod (SageMaker HyperPod) observability add-on enables default alerts for your cluster and workloads to notify you when the system detects common early indicators of cluster under-performance. These alerts are defined within the Amazon Managed Grafana built-in alerting system. For information about how to modify these pre-configured alerts or create new ones, see [Alerts in Grafana version 10](https://docs.aws.amazon.com/grafana/latest/userguide/v10-alerts.html) in the *Amazon Managed Grafana User Guide*. The following YAML shows the default alerts.

```
groups:
- name: sagemaker_hyperpod_alerts
  rules:
  # GPU_TEMP_ABOVE_80C
  - alert: GPUHighTemperature
    expr: DCGM_FI_DEV_GPU_TEMP > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU Temperature Above 80C"
      description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C."

  # GPU_TEMP_ABOVE_85C  
  - alert: GPUCriticalTemperature  
    expr: DCGM_FI_DEV_GPU_TEMP > 85
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "GPU Temperature Above 85C"
      description: "GPU {{ $labels.gpu }} temperature is {{ $value }}°C."

  # GPU_MEMORY_ERROR
  # Any ECC double-bit errors indicate serious memory issues requiring immediate attention
  - alert: GPUMemoryErrorDetected
    expr: DCGM_FI_DEV_ECC_DBE_VOL_TOTAL > 0 or DCGM_FI_DEV_ECC_DBE_AGG_TOTAL > DCGM_FI_DEV_ECC_DBE_AGG_TOTAL offset 5m
    labels:
      severity: critical
    annotations:
      summary: "GPU ECC Double-Bit Error Detected"
      description: "GPU {{ $labels.gpu }} has detected ECC double-bit errors."

  # GPU_POWER_WARNING
  # Sustained power limit violations can impact performance and stability
  - alert: GPUPowerViolation
    expr: DCGM_FI_DEV_POWER_VIOLATION > 100
    for: 5m
    labels:
      severity: warning  
    annotations:
      summary: "GPU Power Violation"
      description: "GPU {{ $labels.gpu }} has been operating at power limit for extended period."

  # GPU_NVLINK_ERROR
  # NVLink errors above threshold indicate interconnect stability issues
  - alert: NVLinkErrorsDetected
    expr: DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL > 0 or DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL > 10
    labels:
      severity: warning
    annotations:
      summary: "NVLink Errors Detected" 
      description: "GPU {{ $labels.gpu }} has detected NVLink errors."

  # GPU_THERMAL_VIOLATION  
  # Immediate alert on thermal violations to prevent hardware damage
  - alert: GPUThermalViolation
    expr: increase(DCGM_FI_DEV_THERMAL_VIOLATION[5m]) > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "GPU Thermal Violation Detected"
      description: "GPU {{ $labels.gpu }} has thermal violations on node {{ $labels.Hostname }}"

  # GPU_XID_ERROR
  # XID errors indicate driver or hardware level GPU issues requiring investigation
  - alert: GPUXidError
    expr: DCGM_FI_DEV_XID_ERRORS > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "GPU XID Error Detected"
      description: "GPU {{ $labels.gpu }} experienced XID error {{ $value }} on node {{ $labels.Hostname }}"

  # MIG_CONFIG_FAILURE
  # MIG configuration failures indicate issues with GPU partitioning setup
  - alert: MIGConfigFailure
    expr: kubelet_node_name{nvidia_com_mig_config_state="failed"} > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "MIG Configuration Failed"
      description: "MIG configuration failed on node {{ $labels.instance }}"

  # DISK_SPACE_WARNING
  # 90% threshold ensures time to respond before complete disk exhaustion
  - alert: NodeDiskSpaceWarning
    expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High Disk Usage"
      description: "Node {{ $labels.instance }} disk usage is above 90%"

  # FSX_STORAGE_WARNING
  # 80% FSx utilization allows buffer for burst workloads
  - alert: FsxLustreStorageWarning
    expr: fsx_lustre_storage_used_bytes / fsx_lustre_storage_capacity_bytes * 100 > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High FSx Lustre Usage"
      description: "FSx Lustre storage usage is above 80% on file system {{ $labels.filesystem_id }}"
```
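
You can extend the default group with rules of your own in the same format. The following rule is a hypothetical example that warns when GPUs sit idle; `DCGM_FI_DEV_GPU_UTIL` is a standard DCGM field, and the 5% threshold and 30-minute window are illustrative values to adjust for your workloads.

```
  # GPU_SUSTAINED_IDLE (example custom rule)
  - alert: GPUSustainedIdle
    expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL) < 5
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "GPU Utilization Below 5%"
      description: "GPUs on node {{ $labels.Hostname }} have been mostly idle for 30 minutes."
```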

# Troubleshooting the Amazon SageMaker HyperPod observability add-on
<a name="hyperpod-observability-addon-troubleshooting"></a>

Use the following guidance to resolve common issues with the Amazon SageMaker HyperPod (SageMaker HyperPod) observability add-on.

## Troubleshooting missing metrics in Amazon Managed Grafana
<a name="troubleshooting-missing-metrics"></a>

If metrics don't appear in your Amazon Managed Grafana dashboards, perform the following steps to identify and resolve the issue.

### Verify the Amazon Managed Service for Prometheus-Amazon Managed Grafana connection
<a name="verify-amp-grafana-connection"></a>

1. Sign in to the Amazon Managed Grafana console.

1. In the left pane, choose **All workspaces**.

1. In the **Workspaces** table, choose your workspace.

1. In the details page of the workspace, choose the **Data sources** tab.

1. Verify that the Amazon Managed Service for Prometheus data source exists.

1. Check the connection settings:
   + Confirm that the endpoint URL is correct.
   + Verify that IAM authentication is properly configured.
   + Choose **Test connection**. Verify that the status is **Data source is working**.

### Verify the Amazon EKS add-on status
<a name="verify-eks-addon-status"></a>

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home\#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Select your cluster.

1. Choose the **Add-ons** tab.

1. Verify that the SageMaker HyperPod observability add-on is listed and that its status is **ACTIVE**.

1. If the status isn't **ACTIVE**, see [Troubleshooting add-on installation failures](#troubleshooting-addon-installation-failures).

### Verify Pod Identity association
<a name="verify-pod-identity-association"></a>

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home\#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Select your cluster.

1. On the cluster details page, choose the **Access** tab.

1. In the **Pod Identity associations** table, choose the association that has the following property values:
   + **Namespace**: `hyperpod-observability`
   + **Service account**: `hyperpod-observability-operator-otel-collector`
   + **Add-on**: `amazon-sagemaker-hyperpod-observability`

1. Ensure that the IAM role that is attached to this association has the following permissions.

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "PrometheusAccess",
               "Effect": "Allow",
               "Action": "aps:RemoteWrite",
               "Resource": "arn:aws:aps:us-east-1:111122223333:workspace/workspace-ID"
           },
           {
               "Sid": "CloudwatchLogsAccess",
               "Effect": "Allow",
               "Action": [
                   "logs:CreateLogGroup",
                   "logs:CreateLogStream",
                   "logs:DescribeLogGroups",
                   "logs:DescribeLogStreams",
                   "logs:PutLogEvents",
                   "logs:GetLogEvents",
                   "logs:FilterLogEvents",
                   "logs:GetLogRecord",
                   "logs:StartQuery",
                   "logs:StopQuery",
                   "logs:GetQueryResults"
               ],
               "Resource": [
                   "arn:aws:logs:us-east-1:111122223333:log-group:/aws/sagemaker/Clusters/*",
                   "arn:aws:logs:us-east-1:111122223333:log-group:/aws/sagemaker/Clusters/*:log-stream:*"
               ]
           }
       ]
   }
   ```

------

1. Ensure that the IAM role that is attached to this association has the following trust policy. Verify that the source ARN and source account are correct.

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "AllowEksAuthToAssumeRoleForPodIdentity",
               "Effect": "Allow",
               "Principal": {
                   "Service": "pods.eks.amazonaws.com"
               },
               "Action": [
                   "sts:AssumeRole",
                   "sts:TagSession"
               ],
               "Condition": {
                   "StringEquals": {
                       "aws:SourceArn": "arn:aws:eks:us-east-1:111122223333:cluster/cluster-name",
                       "aws:SourceAccount": "111122223333"
                   }
               }
           }
       ]
   }
   ```

------
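
You can also inspect the association from the command line. The following sketch assumes the AWS CLI is configured; substitute your own cluster name:

```
aws eks list-pod-identity-associations \
    --cluster-name my-eks-cluster \
    --namespace hyperpod-observability
```

The response includes the association ID, which you can pass to `aws eks describe-pod-identity-association` to see the attached IAM role ARN.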

### Check Amazon Managed Service for Prometheus throttling
<a name="check-amp-throttling"></a>

1. Sign in to the AWS Management Console and open the Service Quotas console at [https://console.aws.amazon.com/servicequotas/](https://console.aws.amazon.com/servicequotas/).

1. In the **Managed quotas** box, search for and select Amazon Managed Service for Prometheus.

1. Choose the **Active series per workspace** quota.

1. In the **Resource-level quotas** tab, select your Amazon Managed Service for Prometheus workspace.

1. Ensure that the utilization is less than your current quota.

1. If you've reached the quota limit, select your workspace by choosing the radio button to its left, and then choose **Request increase at resource level**.

### Verify KV caching and intelligent routing are enabled
<a name="verify-caching-routing"></a>

If the `KVCache Metrics` dashboard is missing, the feature is either not enabled or the metrics port isn't specified in the `modelMetrics` configuration. For more information about how to enable it, see steps 1 and 3 in [Configure KV caching and intelligent routing for improved performance](sagemaker-hyperpod-model-deployment-deploy-ftm.md#sagemaker-hyperpod-model-deployment-deploy-ftm-cache-route). 

If the `Intelligent Router Metrics` dashboard is missing, enable the feature to make the dashboard appear. For more information about how to enable it, see [Configure KV caching and intelligent routing for improved performance](sagemaker-hyperpod-model-deployment-deploy-ftm.md#sagemaker-hyperpod-model-deployment-deploy-ftm-cache-route). 

## Troubleshooting add-on installation failures
<a name="troubleshooting-addon-installation-failures"></a>

If the observability add-on fails to install, use the following steps to diagnose and resolve the issue.

### Check health probe status
<a name="check-health-probe-status"></a>

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home\#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Select your cluster.

1. Choose the **Add-ons** tab.

1. Choose the failed add-on.

1. Review the **Health issues** section.

1. If the health issue is related to credentials or pod identity, see [Verify Pod Identity association](#verify-pod-identity-association). Also ensure that the pod identity agent add-on is running in the cluster.

1. Check for errors in the manager logs. For instructions, see [Review manager logs](#review-manager-logs).

1. Contact AWS Support with the issue details.

### Review manager logs
<a name="review-manager-logs"></a>

1. Get the logs of the add-on manager pod:

   ```
   kubectl logs -n hyperpod-observability -l control-plane=hyperpod-observability-controller-manager
   ```

1. For urgent issues, contact Support.

## Review all observability pods
<a name="review-all-observability-pods"></a>

All the pods that the SageMaker HyperPod observability add-on creates are in the `hyperpod-observability` namespace. To get the status of these pods, run the following command.

```
kubectl get pods -n hyperpod-observability
```

Look for pods whose status is either `Pending` or `CrashLoopBackOff`. Run the following command to get the logs of these pending or failing pods.

```
kubectl logs -n hyperpod-observability pod-name
```

If you don't find errors in the logs, run the following command to describe the pods and look for errors.

```
kubectl describe -n hyperpod-observability pod pod-name
```

To get more context, run the two following commands to describe the deployments and daemonsets for these pods.

```
kubectl describe -n hyperpod-observability deployment deployment-name
```

```
kubectl describe -n hyperpod-observability daemonset daemonset-name
```

## Troubleshooting pods that are stuck in the pending status
<a name="pods-stuck-in-pending"></a>

If pods are stuck in the `Pending` status, make sure that the nodes are large enough to fit all the pods. To verify this, perform the following steps.

1. Open the Amazon EKS console at [https://console.aws.amazon.com/eks/home\#/clusters](https://console.aws.amazon.com/eks/home#/clusters).

1. Choose your cluster.

1. Choose the cluster's **Compute** tab.

1. Choose the node with the smallest instance type.

1. In the capacity allocation section, look for the number of available pods.

1. If no pod capacity is available, you need a larger instance type.

For urgent issues, contact AWS Support.

## Troubleshooting observability on Restricted Instance Groups
<a name="troubleshooting-rig-observability"></a>

Use the following guidance to resolve issues specific to clusters with Restricted Instance Groups.

### Observability pods not starting on restricted nodes
<a name="troubleshooting-rig-pods-not-starting"></a>

If observability pods are not starting on restricted nodes, check the pod status and events:

```
kubectl get pods -n hyperpod-observability -o wide
kubectl describe pod pod-name -n hyperpod-observability
```

Common causes include:
+ **Image pull failures:** The pod events may show image pull errors if the observability container images are not yet allowlisted on the restricted nodes. Ensure that you are running the latest version of the observability add-on. If the issue persists after upgrading, contact Support.
+ **Taint tolerations:** Verify that the pod spec includes the required toleration for restricted nodes. Starting with version `v1.0.5-eksbuild.1`, the add-on automatically adds this toleration when RIG support is enabled. If you are using an older version, upgrade to the latest version.

### Viewing logs for pods on restricted nodes
<a name="troubleshooting-rig-viewing-logs"></a>

The `kubectl logs` command does not work for pods running on restricted nodes. This is an expected limitation because the communication path required for log streaming is not available on restricted nodes.

To view logs from restricted nodes, use the **Cluster Logs** dashboard in Amazon Managed Grafana, which queries CloudWatch Logs directly. You can filter by instance ID, log stream, log level, and free-text search to find relevant log entries.

### DNS resolution failures in clusters with both standard and restricted nodes
<a name="troubleshooting-rig-dns-resolution"></a>

In hybrid clusters (clusters with both standard and restricted instance groups), pods on standard nodes may experience DNS resolution timeouts when trying to reach AWS service endpoints such as Amazon Managed Service for Prometheus or CloudWatch.

**Cause:** The `kube-dns` service has endpoints from both standard CoreDNS pods and RIG CoreDNS pods. Standard node pods cannot reach RIG CoreDNS endpoints due to network isolation. When `kube-proxy` load-balances a DNS request from a standard node pod to a RIG CoreDNS endpoint, the request times out.

**Resolution:** Set `internalTrafficPolicy: Local` on the `kube-dns` service so that pods only reach CoreDNS on their local node:

```
kubectl patch svc kube-dns -n kube-system -p '{"spec":{"internalTrafficPolicy":"Local"}}'
```

After applying this patch, restart the affected observability pods:

```
kubectl delete pods -n hyperpod-observability -l app.kubernetes.io/name=hyperpod-node-collector
```
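
To confirm that the patch took effect, you can read the field back; the command should print `Local`:

```
kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.internalTrafficPolicy}'
```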

### Metrics from restricted nodes not reaching Amazon Managed Service for Prometheus
<a name="troubleshooting-rig-metrics-not-reaching-amp"></a>

If metrics from restricted nodes are not appearing in your Amazon Managed Service for Prometheus workspace:

1. **Verify the execution role permissions.** Ensure that the execution role for the Restricted Instance Group has `aps:RemoteWrite` permission for your Prometheus workspace. For more information, see [Additional prerequisites for Restricted Instance Groups](hyperpod-observability-addon-setup.md#hyperpod-observability-addon-rig-prerequisites).

1. **Check the node collector pod status.** Run the following command and verify that node collector pods are running on restricted nodes:

   ```
   kubectl get pods -n hyperpod-observability | grep node-collector
   ```

1. **Check the central collector deployments.** In clusters with restricted nodes, the add-on deploys one central collector per network boundary. Verify that a central collector exists for each boundary:

   ```
   kubectl get deployments -n hyperpod-observability | grep central-collector
   ```

1. **Check pod events for errors.** Use `kubectl describe` on the collector pods to look for error events:

   ```
   kubectl describe pod collector-pod-name -n hyperpod-observability
   ```

If the issue persists after verifying the above, contact Support.

### Pod Identity verification does not apply to restricted instance group nodes
<a name="troubleshooting-rig-pod-identity"></a>

The [Verify Pod Identity association](#verify-pod-identity-association) troubleshooting steps apply only to standard nodes. On restricted nodes, the add-on uses the cluster instance group execution role for AWS authentication instead of Amazon EKS Pod Identity. If metrics are missing from restricted nodes, verify the execution role permissions instead of the Pod Identity association.

### Fluent Bit not running on restricted nodes
<a name="troubleshooting-rig-fluent-bit"></a>

This is expected behavior. Fluent Bit is intentionally not deployed on restricted nodes. Logs from restricted nodes are published to CloudWatch through the SageMaker HyperPod platform independently of the observability add-on. Use the **Cluster Logs** dashboard in Amazon Managed Grafana to view these logs.

# Observability with Amazon CloudWatch
<a name="sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci"></a>

Use [Amazon CloudWatch Container Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights.html) to collect, aggregate, and summarize metrics and logs from the containerized applications and micro-services on the EKS cluster associated with a HyperPod cluster.

Container Insights collects metrics for compute resources, such as CPU, memory, disk, and network. It also provides diagnostic information, such as container restart failures, to help you isolate issues and resolve them quickly. You can also set CloudWatch alarms on metrics that Container Insights collects.

To find a complete list of metrics, see [Amazon EKS and Kubernetes Container Insights metrics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics-EKS.html) in the *Amazon EKS User Guide*.

## Install CloudWatch Container Insights
<a name="sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci-setup"></a>

Cluster admin users must set up CloudWatch Container Insights by following the instructions at [Install the CloudWatch agent by using the Amazon CloudWatch Observability EKS add-on or the Helm chart](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/install-CloudWatch-Observability-EKS-addon.html) in the *CloudWatch User Guide*. For more information about the Amazon EKS add-on, see also [Install the Amazon CloudWatch Observability EKS add-on](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-setup-EKS-addon.html) in the *Amazon EKS User Guide*.

After the installation completes, verify that the CloudWatch Observability add-on is visible on the EKS cluster **Add-ons** tab. It might take a couple of minutes for the dashboard to load.

**Note**  
SageMaker HyperPod requires version v2.0.1-eksbuild.1 or later of the Amazon CloudWatch Observability EKS add-on.

![\[CloudWatch Observability service card showing status, version, and IAM role information.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-eks-CIaddon.png)


## Access the CloudWatch Container Insights dashboard
<a name="sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci-access-dashboard"></a>

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Choose **Insights**, and then choose **Container Insights**.

1. Select the EKS cluster set up with the HyperPod cluster you're using.

1. View the Pod/Cluster level metrics.

![\[Performance monitoring dashboard for EKS cluster showing node status, resource utilization, and pod metrics.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hyperpod-eks-CIdashboard.png)


## Access CloudWatch Container Insights logs
<a name="sagemaker-hyperpod-eks-cluster-observability-cluster-cloudwatch-ci-access-log"></a>

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. Choose **Logs**, and then choose **Log groups**.

When your HyperPod cluster is integrated with Amazon CloudWatch Container Insights, you can access the relevant log groups in the following format: `/aws/containerinsights/<eks-cluster-name>/*`. Within this log group, you can find and explore various types of logs, such as Performance logs, Host logs, Application logs, and Data plane logs.
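
You can also list these log groups from the command line with the AWS CLI (a sketch; substitute your EKS cluster name):

```
aws logs describe-log-groups \
    --log-group-name-prefix /aws/containerinsights/my-eks-cluster
```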

# Continuous provisioning for enhanced cluster operations on Amazon EKS
<a name="sagemaker-hyperpod-scaling-eks"></a>

Amazon SageMaker HyperPod clusters created with Amazon EKS orchestration now support continuous provisioning, a capability that enables greater flexibility and efficiency when running large-scale AI/ML workloads. Continuous provisioning lets you start training quickly, scale seamlessly, perform maintenance without disrupting operations, and gain granular visibility into cluster operations.

**Note**  
Continuous provisioning is available as an optional configuration for HyperPod clusters created with EKS orchestration. Clusters created with Slurm orchestration use a different scaling model.

## How it works
<a name="sagemaker-hyperpod-scaling-eks-how"></a>

The continuous provisioning system introduces a desired-state architecture that replaces the traditional request-based model. This new architecture enables parallel, non-blocking operations across different resource levels while maintaining system stability and performance. The continuous provisioning system:
+ **Accepts the request**: Records the target instance count for each instance group
+ **Initiates provisioning**: Begins launching instances to meet the target count
+ **Tracks progress**: Monitors each instance launch attempt and records the status
+ **Handles failures**: Automatically retries failed launches

Continuous provisioning is disabled by default. To use this feature, set `--node-provisioning-mode` to `Continuous`.

With continuous provisioning enabled, you can initiate multiple scaling operations simultaneously without waiting for previous operations to complete. This lets you scale different instance groups in the same cluster concurrently and submit multiple scaling requests to the same instance group. 

Continuous provisioning also gives you access to [DescribeClusterEvent](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeClusterEvent.html) and [ListClusterEvents](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusterEvents.html) for detailed event monitoring and operational visibility. 
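
For example (the cluster name is a placeholder), you can retrieve recent cluster events with the AWS CLI:

```
# List recent events for a HyperPod cluster (cluster name is a placeholder)
aws sagemaker list-cluster-events \
   --cluster-name my-hyperpod-cluster \
   --max-results 20
```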

## Usage metering
<a name="sagemaker-hyperpod-scaling-eks-metering"></a>

HyperPod clusters with continuous provisioning use instance-level metering to provide accurate billing that reflects actual resource usage. This metering approach differs from traditional cluster-level billing by tracking each instance independently.

**Instance-level billing**

With continuous provisioning, billing starts and stops at the individual instance level rather than waiting for cluster-level state changes. This approach provides the following benefits:
+ **Precise billing accuracy**: Billing starts when the lifecycle script execution begins. If the lifecycle script fails, provisioning of the instance is retried, and you are charged for the duration of the lifecycle script runtime.
+ **Independent metering**: Each instance's billing lifecycle is managed separately, preventing cascading billing errors
+ **Real-time billing updates**: Billing starts when an instance begins executing its lifecycle script and stops when the instance enters a terminating state

**Billing lifecycle**

Each instance in your HyperPod cluster follows this billing lifecycle:
+ **Billing starts**: When the instance successfully launches and begins executing its lifecycle configuration script
+ **Billing continues**: Throughout the instance's operational lifetime
+ **Billing stops**: When the instance enters a terminating state, regardless of the reason for termination

**Note**  
Billing does not start for instances that fail to launch. If an instance launch fails due to insufficient capacity or other issues, you are not charged for that failed attempt. Billing is calculated at the instance level and costs are aggregated and reported under your cluster's Amazon Resource Name (ARN). 

## Create a cluster with continuous provisioning enabled
<a name="sagemaker-hyperpod-scaling-eks-create"></a>

**Note**  
You must have an existing Amazon EKS cluster configured with VPC networking and the required Helm chart installed. Additionally, prepare a lifecycle configuration script and upload it to an Amazon S3 bucket that your execution role can access. For more information, see [Managing SageMaker HyperPod clusters orchestrated by Amazon EKS](sagemaker-hyperpod-eks-operate.md).

The following AWS CLI operation creates a HyperPod cluster with one instance group and continuous provisioning enabled.

```
aws sagemaker create-cluster \
--cluster-name $HP_CLUSTER_NAME \
--orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
--vpc-config '{
   "SecurityGroupIds": ["'$SECURITY_GROUP'"],
   "Subnets": ["'$SUBNET'"]
}' \
--instance-groups '{
   "InstanceGroupName": "ig-1",
   "InstanceType": "ml.c5.2xlarge",
   "InstanceCount": 2,
   "LifeCycleConfig": {
      "SourceS3Uri": "s3://'$BUCKET_NAME'",
      "OnCreate": "on_create_noop.sh"
   },
   "ExecutionRole": "'$EXECUTION_ROLE'",
   "ThreadsPerCore": 1
}' \
--node-provisioning-mode Continuous
```

Expected output:

```
{
    "ClusterArn": "arn:aws:sagemaker:us-west-2:<account-id>:cluster/<cluster-id>"
}
```

After you've created your cluster, you can use [ListClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusterNodes.html) or [DescribeClusterNode](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeClusterNode.html) to get more information about the nodes in the cluster. 

Calling these operations will return a [ClusterInstanceStatusDetails](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ClusterInstanceStatusDetails.html) object with one of the following values: 
+  **Running**: The node is healthy and registered with the cluster orchestrator (EKS). 
+  **Failure**: The node provisioning failed but the system will automatically retry provisioning with a new EC2 instance. 
+  **Pending**: The node is being provisioned or rebooted. 
+  **ShuttingDown**: The node termination is in progress. The node will either transition to Failure status if termination encounters issues, or will be successfully removed from the cluster. 
+  **SystemUpdating**: The node is undergoing AMI patching, either triggered manually or as part of patching cronjobs. 
+  **DeepHealthCheckInProgress**: [Deep health checks (DHCs)](sagemaker-hyperpod-eks-resiliency-deep-health-checks.md) are being conducted. This can take anywhere from a few minutes to several hours, depending on the nature of the tests. Bad nodes are replaced, and healthy nodes switch to Running. 
+  **NotFound**: Used in a [BatchAddClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BatchAddClusterNodes.html) response to indicate a node has been deleted during idempotent replay. 
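
For example (the cluster name is a placeholder, and the query path is an assumption based on the `ListClusterNodes` response shape), you can summarize the status of each node with:

```
# Summarize the status of each node in a HyperPod cluster
# (cluster name and query fields are assumptions based on the ListClusterNodes API)
aws sagemaker list-cluster-nodes \
   --cluster-name my-hyperpod-cluster \
   --query 'ClusterNodeSummaries[].{Group:InstanceGroupName,Id:InstanceId,Status:InstanceStatus.Status}' \
   --output table
```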

## Minimum capacity requirements (MinCount)
<a name="sagemaker-hyperpod-scaling-eks-mincount"></a>

The MinCount feature allows you to specify the minimum number of instances that must be successfully provisioned before an instance group transitions to the `InService` status. This feature provides better control over scaling operations and helps prevent scenarios where partially provisioned instance groups cannot be used effectively for training workloads.

**Important**  
MinCount is not a permanent guarantee of minimum capacity. It only ensures that the specified minimum number of instances are available when the instance group first becomes `InService`. Brief dips below MinCount may occur during normal operations such as unhealthy instance replacements or maintenance activities.

### How MinCount works
<a name="sagemaker-hyperpod-scaling-eks-mincount-how"></a>

When you create or update an instance group with MinCount enabled, the following behavior occurs:
+ **New instance groups**: The instance group remains in `Creating` status until at least MinCount instances are successfully provisioned and ready. Once this threshold is met, the instance group transitions to `InService`.
+ **Existing instance groups**: When updating MinCount on an existing instance group, the status changes to `Updating` until the new MinCount requirement is satisfied.
+ **Continuous scaling**: If TargetCount is greater than MinCount, the continuous scaling system continues attempting to launch additional instances until TargetCount is reached.
+ **Timeout and rollback**: If MinCount cannot be satisfied within 3 hours, the system automatically rolls back the instance group to its last known good state. For more information about rollback behavior, see [Automatic rollback behavior](#sagemaker-hyperpod-scaling-eks-mincount-rollback).

### Instance group status during MinCount operations
<a name="sagemaker-hyperpod-scaling-eks-mincount-status"></a>

Instance groups with MinCount configured exhibit the following status behavior:

Creating  
For new instance groups when CurrentCount < MinCount. The instance group remains in this status until the minimum capacity requirement is met.

Updating  
For existing instance groups when MinCount is modified and CurrentCount < MinCount. The instance group remains in this status until the new minimum capacity requirement is satisfied.

InService  
When MinCount ≤ CurrentCount ≤ TargetCount. The instance group is ready for use and all mutating operations are unblocked.

During `Creating` or `Updating` status, the following restrictions apply:
+ Mutating operations such as `BatchAddClusterNodes`, `BatchDeleteClusterNodes`, or `UpdateClusterSoftware` are blocked
+ You can still modify MinCount and TargetCount values to correct configuration errors
+ Cluster and Instance group deletion is always permitted

### Automatic rollback behavior
<a name="sagemaker-hyperpod-scaling-eks-mincount-rollback"></a>

If an instance group cannot reach its MinCount within 3 hours, the system automatically initiates a rollback to prevent indefinite waiting:
+ **New instance groups**: MinCount and TargetCount are reset to (0, 0)
+ **Existing instance groups**: MinCount and TargetCount are restored to their values from the last `InService` state
+ **Instance selection for termination**: If instances need to be terminated during rollback, the system selects the unhealthy instances first, then those that were most recently provisioned.
+ **Status transition**: The instance group immediately transitions to `InService` status after rollback initiation, allowing the continuous scaling system to manage capacity according to the rollback settings

The 3-hour timeout resets each time MinCount is updated. For example, if you update MinCount multiple times, the timeout period starts fresh from the most recent update.

### MinCount events
<a name="sagemaker-hyperpod-scaling-eks-mincount-events"></a>

The system emits specific events to help you track MinCount operations:
+ **Minimum capacity reached**: Emitted when an instance group successfully reaches its MinCount and transitions to `InService`
+ **Rollback initiated**: Emitted when the 3-hour timeout expires and automatic rollback begins

You can monitor these events using [ListClusterEvents](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ListClusterEvents.html) to track the progress of your MinCount operations.

### API usage
<a name="sagemaker-hyperpod-scaling-eks-mincount-api"></a>

MinCount is specified using the `MinInstanceCount` parameter in instance group configurations:

```
aws sagemaker create-cluster \
--cluster-name $HP_CLUSTER_NAME \
--orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
--vpc-config '{
   "SecurityGroupIds": ["'$SECURITY_GROUP'"],
   "Subnets": ["'$SUBNET'"]
}' \
--instance-groups '{
   "InstanceGroupName": "worker-group",
   "InstanceType": "ml.p4d.24xlarge",
   "InstanceCount": 64,
   "MinInstanceCount": 50,
   "LifeCycleConfig": {
      "SourceS3Uri": "s3://'$BUCKET_NAME'",
      "OnCreate": "on_create.sh"
   },
   "ExecutionRole": "'$EXECUTION_ROLE'"
}' \
--node-provisioning-mode Continuous
```

Key considerations for MinCount usage:
+ `MinInstanceCount` must be between 0 and `InstanceCount` (inclusive) value of the instance group specified in [CreateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateCluster.html) or [UpdateCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateCluster.html) request
+ Setting `MinInstanceCount` to 0 (default) preserves standard continuous scaling behavior
+ Setting `MinInstanceCount` equal to `InstanceCount` provides all-or-nothing scaling behavior
+ MinCount is only available for clusters with `NodeProvisioningMode` set to `Continuous`

## Flexible instance groups
<a name="sagemaker-hyperpod-scaling-eks-flexible-ig"></a>

Flexible instance groups allow you to specify multiple instance types within a single instance group. This simplifies cluster management by reducing the number of instance groups you need to create and manage, especially for inference workloads that use autoscaling.

With flexible instance groups, HyperPod:
+ Attempts to provision instances using the first instance type in your list
+ Falls back to subsequent instance types if capacity is unavailable
+ Terminates instances of the lowest-priority instance type first during scale-down

**Note**  
Flexible instance groups are only available for clusters with `NodeProvisioningMode` set to `Continuous`. The `InstanceType` and `InstanceRequirements` properties are mutually exclusive—you can specify one or the other, but not both.

### Create a cluster with a flexible instance group
<a name="sagemaker-hyperpod-scaling-eks-flexible-ig-create"></a>

Use `InstanceRequirements` instead of `InstanceType` to create a flexible instance group. The order of instance types in the list determines the priority for provisioning.

```
aws sagemaker create-cluster \
--cluster-name $HP_CLUSTER_NAME \
--orchestrator 'Eks={ClusterArn='$EKS_CLUSTER_ARN'}' \
--vpc-config '{
   "SecurityGroupIds": ["'$SECURITY_GROUP'"],
   "Subnets": ["'$SUBNET_AZ1'", "'$SUBNET_AZ2'"]
}' \
--instance-groups '[{
   "InstanceGroupName": "flexible-ig",
   "InstanceRequirements": {
      "InstanceTypes": ["ml.p5.48xlarge", "ml.p4d.24xlarge", "ml.g6.48xlarge"]
   },
   "InstanceCount": 10,
   "LifeCycleConfig": {
      "SourceS3Uri": "s3://'$BUCKET_NAME'",
      "OnCreate": "on_create.sh"
   },
   "ExecutionRole": "'$EXECUTION_ROLE'"
}]' \
--node-provisioning-mode Continuous
```

### Targeted scaling with BatchAddClusterNodes
<a name="sagemaker-hyperpod-scaling-eks-flexible-ig-targeted"></a>

When using flexible instance groups, you can use [BatchAddClusterNodes](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_BatchAddClusterNodes.html) to add nodes with specific instance types and availability zones. This is particularly useful when Karpenter autoscaling determines the optimal instance type and availability zone for your workload.

```
aws sagemaker batch-add-cluster-nodes \
--cluster-name $HP_CLUSTER_NAME \
--nodes-to-add '[
   {
      "InstanceGroupName": "flexible-ig",
      "IncrementTargetCountBy": 1,
      "InstanceTypes": ["ml.p5.48xlarge"],
      "AvailabilityZones": ["us-west-2a"]
   }
]'
```

### View flexible instance group details
<a name="sagemaker-hyperpod-scaling-eks-flexible-ig-describe"></a>

Use [DescribeCluster](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeCluster.html) to view the instance types and per-type breakdown of your flexible instance group. The response includes:
+ `InstanceRequirements` — The current and desired instance types for the instance group
+ `InstanceTypeDetails` — A per-instance-type breakdown showing the count and configuration of each instance type in the group
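
For example (the cluster name is a placeholder, and the query path is an assumption based on the fields described above), you can inspect the per-type breakdown with:

```
# Show the instance-type breakdown of the flexible instance group
# (cluster name and field paths are assumptions based on the DescribeCluster API)
aws sagemaker describe-cluster \
   --cluster-name my-hyperpod-cluster \
   --query 'InstanceGroups[?InstanceGroupName==`flexible-ig`].InstanceTypeDetails'
```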

### Using flexible instance groups with Karpenter autoscaling
<a name="sagemaker-hyperpod-scaling-eks-flexible-ig-autoscaling"></a>

Flexible instance groups integrate with HyperPod's managed Karpenter autoscaling. For more information about setting up Karpenter, see [Autoscaling on SageMaker HyperPod EKS](sagemaker-hyperpod-eks-autoscaling.md). When you reference a flexible instance group in a `HyperPodNodeClass` configuration, Karpenter automatically:
+ Detects the supported instance types from the flexible instance group
+ Selects the optimal instance type and availability zone based on pod requirements and pricing
+ Scales the flexible instance group using targeted `BatchAddClusterNodes` calls with the selected instance type and availability zone

**Note**  
When Karpenter manages scaling, it uses its own selection logic based on pod requirements and pricing to determine which instance type to provision. This differs from the list-order priority used by HyperPod's native provisioning (such as `CreateCluster` and `UpdateCluster`), where the first instance type in the list is always attempted first.

This eliminates the need to create separate instance groups for each instance type and manually configure Karpenter to reference multiple groups.

# Autoscaling on SageMaker HyperPod EKS
<a name="sagemaker-hyperpod-eks-autoscaling"></a>

Amazon SageMaker HyperPod provides a managed, Karpenter-based node autoscaling solution for clusters created with EKS orchestration. [Karpenter](https://karpenter.sh/) is an open-source Kubernetes node lifecycle manager built by AWS that optimizes cluster scaling and cost efficiency. Unlike self-managed Karpenter deployments, SageMaker HyperPod's managed implementation eliminates the operational overhead of installing, configuring, and maintaining Karpenter controllers while providing integrated resilience and fault tolerance. This managed autoscaling solution is built on HyperPod's [continuous provisioning](sagemaker-hyperpod-scaling-eks.md) capabilities and enables you to efficiently scale compute resources for training and inference workloads with automatic failure handling and recovery. 

You pay only for what you use. You're responsible for paying for all compute instances that are automatically provisioned through autoscaling according to standard SageMaker HyperPod pricing. For detailed pricing information, see [Amazon SageMaker AI](https://aws.amazon.com/sagemaker/ai/pricing/).

By enabling Karpenter-based autoscaling with HyperPod, you have access to:
+ **Service-managed lifecycle** - HyperPod handles Karpenter installation, updates, and maintenance, eliminating operational overhead.
+ **Just-in-time provisioning** - Karpenter observes your pending pods and provisions the required compute for your workloads from the on-demand pool.
+ **Scale to zero** - Scale down to zero nodes without maintaining dedicated controller infrastructure.
+ **Workload-aware node selection** - Karpenter chooses optimal instance types based on pod requirements, availability zones, and pricing to minimize costs.
+ **Automatic node consolidation** - Karpenter regularly evaluates the cluster for optimization opportunities, shifting workloads to eliminate underutilized nodes.
+ **Integrated resilience** - Leverages HyperPod's built-in fault tolerance and node recovery mechanisms.

The following topics explain how to enable HyperPod autoscaling with Karpenter.

**Topics**
+ [Prerequisites](#sagemaker-hyperpod-eks-autoscaling-prereqs)
+ [Create an IAM role for HyperPod autoscaling with Karpenter](sagemaker-hyperpod-eks-autoscaling-iam.md)
+ [Create and configure a HyperPod cluster with Karpenter autoscaling](sagemaker-hyperpod-eks-autoscaling-cluster.md)
+ [Create a NodeClass](sagemaker-hyperpod-eks-autoscaling-nodeclass.md)
+ [Create a NodePool](sagemaker-hyperpod-eks-autoscaling-nodepool.md)
+ [Deploy a workload](sagemaker-hyperpod-eks-autoscaling-workload.md)

## Prerequisites
<a name="sagemaker-hyperpod-eks-autoscaling-prereqs"></a>
+ Continuous provisioning enabled on your HyperPod cluster. Enable continuous provisioning by setting `--node-provisioning-mode` to `Continuous` when creating your SageMaker HyperPod cluster. For more information, see [Continuous provisioning for enhanced cluster operations on Amazon EKS](sagemaker-hyperpod-scaling-eks.md).
+ Health Monitoring Agent version 1.0.742.0_1.0.241.0 or above installed. Required for HyperPod cluster operations and monitoring. The agent must be configured before enabling Karpenter autoscaling to ensure proper cluster health reporting and node lifecycle management. For more information, see [Health Monitoring System](sagemaker-hyperpod-eks-resiliency-health-monitoring-agent.md).
+ If your Amazon EKS cluster already has Karpenter running on it, the Karpenter `NodePool` and `NodeClaim` versions must be v1.
+ `NodeRecovery` set to automatic. For more information, see [Automatic node recovery](sagemaker-hyperpod-eks-resiliency-node-recovery.md).

# Create an IAM role for HyperPod autoscaling with Karpenter
<a name="sagemaker-hyperpod-eks-autoscaling-iam"></a>

In the following steps, you'll create an IAM role that allows SageMaker HyperPod to manage Kubernetes nodes in your cluster through Karpenter-based autoscaling. This role provides the necessary permissions for HyperPod to add and remove cluster nodes automatically based on workload demand.

**Open the IAM console**

1. Sign in to the AWS Management Console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. In the navigation pane, choose **Roles**.

1. Choose **Create role**.

**Configure the trust policy**

1. For **Trusted entity type**, choose **Custom trust policy**.

1. In the **Custom trust policy** editor, replace the default policy with the following:

------
#### [ JSON ]

****  

   ```
   {
        "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "Service": [
                       "hyperpod.sagemaker.amazonaws.com"
                   ]
               },
               "Action": "sts:AssumeRole"
           }
       ]
   }
   ```

------

1. Choose **Next**.

**Create and attach the permissions policy**

Because SageMaker HyperPod requires specific permissions that aren't available in AWS managed policies, you must create a custom policy.

1. Choose **Create policy**. This opens a new browser tab.

1. Choose the **JSON** tab.

1. Replace the default policy with the following:

------
#### [ JSON ]

****  

   ```
   {
        "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "sagemaker:BatchAddClusterNodes",
                   "sagemaker:BatchDeleteClusterNodes"
               ],
               "Resource": "arn:aws:sagemaker:*:*:cluster/*",
               "Condition": {
                   "StringEquals": {
                       "aws:ResourceAccount": "${aws:PrincipalAccount}"
                   }
               }
           },
           {
               "Effect": "Allow",
               "Action": [
                   "kms:CreateGrant",
                   "kms:DescribeKey"
               ],
               "Resource": "arn:aws:kms:*:*:key/*",
               "Condition": {
                   "StringLike": {
                       "kms:ViaService": "sagemaker.*.amazonaws.com"
                   },
                   "Bool": {
                       "kms:GrantIsForAWSResource": "true"
                   },
                   "ForAllValues:StringEquals": {
                       "kms:GrantOperations": [
                           "CreateGrant",
                           "Decrypt",
                           "DescribeKey",
                           "GenerateDataKeyWithoutPlaintext",
                           "ReEncryptTo",
                           "ReEncryptFrom",
                           "RetireGrant"
                       ]
                   }
               }
           }
       ]
   }
   ```

------

1. Choose **Next**.

1. For **Policy name**, enter **SageMakerHyperPodKarpenterPolicy**.

1. (Optional) For **Description**, enter a description for the policy.

1. Choose **Create policy**.

1. Return to the role creation tab and refresh the policy list.

1. Search for and select the **SageMakerHyperPodKarpenterPolicy** that you just created.

1. Choose **Next**.

**Name and create the role**

1. For **Role name**, enter `SageMakerHyperPodKarpenterRole`.

1. (Optional) For **Description**, enter a description for the role.

1. In the **Step 1: Select trusted entities** section, verify that the trust policy shows the correct service principals.

1. In the **Step 2: Add permissions** section, verify that `SageMakerHyperPodKarpenterPolicy` is attached.

1. Choose **Create role**.

**Record the role ARN**

After the role is created successfully:

1. In the **Roles** list, choose the role name `SageMakerHyperPodKarpenterRole`.

1. Copy the **Role ARN** from the **Summary** section. You'll need this ARN when creating your HyperPod cluster.

The role ARN follows this format: `arn:aws:iam::ACCOUNT-ID:role/SageMakerHyperPodKarpenterRole`.
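
If you prefer the AWS CLI over the console, the steps above can be sketched as follows. This sketch attaches the permissions as an inline policy for brevity, and the `file://` names are placeholders for the trust and permissions policy JSON documents shown earlier.

```
# Create the role with the HyperPod service trust policy
# (file names are placeholders for the JSON documents shown above)
aws iam create-role \
   --role-name SageMakerHyperPodKarpenterRole \
   --assume-role-policy-document file://trust-policy.json

# Attach the permissions as an inline policy
aws iam put-role-policy \
   --role-name SageMakerHyperPodKarpenterRole \
   --policy-name SageMakerHyperPodKarpenterPolicy \
   --policy-document file://permissions-policy.json

# Retrieve the role ARN for use when creating the HyperPod cluster
aws iam get-role --role-name SageMakerHyperPodKarpenterRole \
   --query 'Role.Arn' --output text
```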

# Create and configure a HyperPod cluster with Karpenter autoscaling
<a name="sagemaker-hyperpod-eks-autoscaling-cluster"></a>

In the following steps, you'll create a SageMaker HyperPod cluster with continuous provisioning enabled and configure it to use Karpenter-based autoscaling.

**Create a HyperPod cluster**

1. Load your environment configuration and extract values from CloudFormation stacks.

   ```
   source .env
   SUBNET1=$(cfn-output $VPC_STACK_NAME PrivateSubnet1)
   SUBNET2=$(cfn-output $VPC_STACK_NAME PrivateSubnet2)
   SUBNET3=$(cfn-output $VPC_STACK_NAME PrivateSubnet3)
   SECURITY_GROUP=$(cfn-output $VPC_STACK_NAME NoIngressSecurityGroup)
   EKS_CLUSTER_ARN=$(cfn-output $EKS_STACK_NAME ClusterArn)
   EXECUTION_ROLE=$(cfn-output $SAGEMAKER_STACK_NAME ExecutionRole)
   SERVICE_ROLE=$(cfn-output $SAGEMAKER_STACK_NAME ServiceRole)
   BUCKET_NAME=$(cfn-output $SAGEMAKER_STACK_NAME Bucket)
   HP_CLUSTER_NAME="hyperpod-eks-test-$(date +%s)"
   EKS_CLUSTER_NAME=$(cfn-output $EKS_STACK_NAME ClusterName)
   HP_CLUSTER_ROLE=$(cfn-output $SAGEMAKER_STACK_NAME ClusterRole)
   ```

1. Upload the node initialization script to your Amazon S3 bucket.

   ```
   aws s3 cp lifecyclescripts/on_create_noop.sh s3://$BUCKET_NAME
   ```

1. Create a cluster configuration file with your environment variables.

   ```
   cat > cluster_config.json << EOF
   {
       "ClusterName": "$HP_CLUSTER_NAME",
       "InstanceGroups": [
           {
               "InstanceCount": 1,
               "InstanceGroupName": "system",
               "InstanceType": "ml.c5.xlarge",
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://$BUCKET_NAME",
                   "OnCreate": "on_create_noop.sh"
               },
               "ExecutionRole": "$EXECUTION_ROLE"
           },
           {
               "InstanceCount": 0,
               "InstanceGroupName": "auto-c5-az1",
               "InstanceType": "ml.c5.xlarge",
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://$BUCKET_NAME",
                   "OnCreate": "on_create_noop.sh"
               },
               "ExecutionRole": "$EXECUTION_ROLE"
           },
           {
               "InstanceCount": 0,
               "InstanceGroupName": "auto-c5-4xaz2",
               "InstanceType": "ml.c5.4xlarge",
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://$BUCKET_NAME",
                   "OnCreate": "on_create_noop.sh"
               },
               "ExecutionRole": "$EXECUTION_ROLE",
               "OverrideVpcConfig": {
                   "SecurityGroupIds": [
                       "$SECURITY_GROUP"
                   ],
                   "Subnets": [
                       "$SUBNET2"
                   ]
               }
           },
           {
               "InstanceCount": 0,
               "InstanceGroupName": "auto-g5-az3",
               "InstanceType": "ml.g5.xlarge",
               "LifeCycleConfig": {
                   "SourceS3Uri": "s3://$BUCKET_NAME",
                   "OnCreate": "on_create_noop.sh"
               },
               "ExecutionRole": "$EXECUTION_ROLE",
               "OverrideVpcConfig": {
                   "SecurityGroupIds": [
                       "$SECURITY_GROUP"
                   ],
                   "Subnets": [
                       "$SUBNET3"
                   ]
               }
           }
       ],
       "VpcConfig": {
           "SecurityGroupIds": [
               "$SECURITY_GROUP"
           ],
           "Subnets": [
               "$SUBNET1"
           ]
       },
       "Orchestrator": {
           "Eks": {
               "ClusterArn": "$EKS_CLUSTER_ARN"
           }
       },
       "ClusterRole": "$HP_CLUSTER_ROLE",
       "AutoScaling": {
           "Mode": "Enable",
           "AutoScalerType": "Karpenter"
       },
       "NodeProvisioningMode": "Continuous"
   }
   EOF
   ```

1. Run the following command to create your HyperPod cluster.

   ```
   aws sagemaker create-cluster --cli-input-json file://./cluster_config.json
   ```

1. The cluster creation process takes approximately 20 minutes. Monitor the cluster status until both `ClusterStatus` and `AutoScaling.Status` show `InService`.
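
   You can poll both status fields with a command like the following (the query path is an assumption based on the `DescribeCluster` response shape):

   ```
   # Poll cluster and autoscaler status (query path assumed from DescribeCluster)
   aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME \
      --query '{Cluster:ClusterStatus,AutoScaling:AutoScaling.Status}'
   ```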

1. Save the cluster ARN for subsequent operations.

   ```
   HP_CLUSTER_ARN=$(aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME \
      --output text --query ClusterArn)
   ```

**Enable Karpenter autoscaling**

1. Run the following command to enable Karpenter-based autoscaling on any pre-existing cluster that has continuous node provisioning mode.

   ```
   aws sagemaker update-cluster \
       --cluster-name $HP_CLUSTER_NAME \
       --auto-scaling Mode=Enable,AutoScalerType=Karpenter \
       --cluster-role $HP_CLUSTER_ROLE
   ```

1. Verify that Karpenter has been successfully enabled:

   ```
   aws sagemaker describe-cluster --cluster-name $HP_CLUSTER_NAME --query 'AutoScaling'
   ```

1. Expected output:

   ```
   {
       "Mode": "Enable",
       "AutoScalerType": "Karpenter",
       "Status": "InService"
   }
   ```

Wait for the `Status` to show `InService` before proceeding to configure NodeClass and NodePool.

# Create a NodeClass
<a name="sagemaker-hyperpod-eks-autoscaling-nodeclass"></a>

**Important**  
You must start with 0 nodes in your instance group and let Karpenter handle the autoscaling. If you start with more than 0 nodes, Karpenter will scale them down to 0.

A node class (`NodeClass`) defines infrastructure-level settings that apply to groups of nodes in your Amazon EKS cluster, including network configuration, storage settings, and resource tagging. A `HyperPodNodeClass` is a custom `NodeClass` that maps to pre-created instance groups in SageMaker HyperPod, defining constraints around which instance types and Availability Zones are supported for Karpenter's autoscaling decisions.

**Considerations for creating a node class**
+ You can specify up to 10 instance groups in a `NodeClass`.
+ Instance groups that use `InstanceRequirements` (flexible instance groups) can contain multiple instance types within a single instance group. This simplifies your `NodeClass` configuration because you can reference fewer instance groups to cover the same set of instance types and Availability Zones. For example, instead of creating 6 instance groups (3 instance types × 2 AZs), you can create a single flexible instance group that covers all combinations. Note that `InstanceType` and `InstanceRequirements` are mutually exclusive—you must specify one or the other for each instance group.
+ When using GPU partitioning with MIG (Multi-Instance GPU), Karpenter can automatically provision nodes with MIG-enabled instance groups. Ensure your instance groups include MIG-supported instance types (ml.p4d.24xlarge, ml.p5.48xlarge, or ml.p5e/p5en.48xlarge) and configure the appropriate MIG labels during cluster creation. For more information about configuring GPU partitioning, see [Using GPU partitions in Amazon SageMaker HyperPod](sagemaker-hyperpod-eks-gpu-partitioning.md).
+ If custom labels are applied to instance groups, you can view them in the `desiredLabels` field when querying the `HyperpodNodeClass` status. This includes MIG configuration labels such as `nvidia.com/mig.config`. When incoming jobs request MIG resources, Karpenter will automatically scale instances with the appropriate MIG labels applied.
+ If you choose to delete an instance group, we recommend removing it from your `NodeClass` before deleting it from your HyperPod cluster. If an instance group is deleted while it is still referenced in a `NodeClass`, the `NodeClass` is marked as not `Ready` for provisioning and won't be used for subsequent scaling operations until the instance group is removed from the `NodeClass`.
+ When you remove instance groups from a `NodeClass`, Karpenter will detect a drift on the nodes that were managed by Karpenter in the instance group(s) and disrupt the nodes based on your disruption budget controls.
+ Subnets used by an instance group must belong to the same Availability Zone. Subnets are specified either with `OverrideVpcConfig` at the instance group level or with the cluster-level `VpcConfig`, which is used by default.
+ Only On-Demand capacity is supported at this time. Instance groups that use training plans or reserved capacity are not supported.
+ Instance groups with deep health checks (DHC) enabled are not supported. A DHC takes around 60-90 minutes to complete, and pods remain in a pending state during that time, which can cause over-provisioning.
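
How aggressively Karpenter disrupts nodes, including after drift, is controlled through the `disruption` block of the `NodePool` described later in this section. A minimal sketch (standard Karpenter v1 schema) that limits disruption to 10% of nodes at a time:

```
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      - nodes: "10%"
```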

The following steps cover how to create a `NodeClass`.

1. Create a YAML file (for example, `nodeclass.yaml`) with your `NodeClass` configuration.

1. Apply the configuration to your cluster using `kubectl`.

1. Reference the `NodeClass` in your `NodePool` configuration.

1. Here's a sample `NodeClass` that uses the ml.c5.xlarge and ml.c5.4xlarge instance types:

   ```
   apiVersion: karpenter.sagemaker.amazonaws.com/v1
   kind: HyperpodNodeClass
   metadata:
     name: sample-nc
   spec:
     instanceGroups:
       # Names of instance groups in the HyperPod cluster. Instance groups need to be pre-created.
       # MaxItems: 10
       - auto-c5-az1
       - auto-c5-4xaz2
   ```

1. Apply the configuration:

   ```
   kubectl apply -f nodeclass.yaml
   ```

1. Monitor the `NodeClass` status to ensure that the `Ready` condition is set to `True`:

   ```
   kubectl get hyperpodnodeclass sample-nc -o yaml
   ```

   ```
   apiVersion: karpenter.sagemaker.amazonaws.com/v1
   kind: HyperpodNodeClass
   metadata:
     creationTimestamp: "<timestamp>"
     name: sample-nc
     uid: <resource-uid>
   spec:
     instanceGroups:
     - auto-c5-az1
     - auto-c5-4xaz2
     - inference-workers
   status:
     conditions:
     # true when all IGs in the spec are present in the SageMaker cluster, false otherwise
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 3
       reason: InstanceGroupReady
       status: "True"
       type: InstanceGroupReady
     # true if subnets of IGs are discoverable, false otherwise
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 3
       reason: SubnetsReady
       status: "True"
       type: SubnetsReady
     # true when all dependent resources are Ready [InstanceGroup, Subnets]
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 3
       reason: Ready
       status: "True"
       type: Ready
     instanceGroups:
     - desiredLabels:
       - key: <custom_label_key>
         value: <custom_label_value>
       - key: nvidia.com/mig.config
         value: all-1g.5gb
       instanceTypes:
       - ml.c5.xlarge
       name: auto-c5-az1
       subnets:
       - id: <subnet-id>
         zone: <availability-zone-a>
         zoneId: <zone-id-a>
     - instanceTypes:
       - ml.c5.4xlarge
       name: auto-c5-4xaz2
       subnets:
       - id: <subnet-id>
         zone: <availability-zone-b>
         zoneId: <zone-id-b>
     # Flexible instance group with multiple instance types
     - instanceTypes:
       - ml.p5.48xlarge
       - ml.p4d.24xlarge
       - ml.g6.48xlarge
       name: inference-workers
       subnets:
       - id: <subnet-id>
         zone: <availability-zone-a>
         zoneId: <zone-id-a>
       - id: <subnet-id>
         zone: <availability-zone-b>
         zoneId: <zone-id-b>
   ```

# Create a NodePool
<a name="sagemaker-hyperpod-eks-autoscaling-nodepool"></a>

The `NodePool` sets constraints on the nodes that can be created by Karpenter and the pods that can run on those nodes. The `NodePool` can be configured to do things like:
+ Limit node creation to certain zones, instance types, and CPU architectures.
+ Define labels or taints to limit the pods that can run on nodes Karpenter creates.

**Note**  
The HyperPod provider supports a limited set of well-known Kubernetes and Karpenter requirements, as explained below.

The following steps cover how to create a `NodePool`.

1. Create a YAML file named nodepool.yaml with your desired `NodePool` configuration.

1. You can use the sample configuration below.

   ```
   apiVersion: karpenter.sh/v1
   kind: NodePool
   metadata:
     name: sample-np
   spec:
     template:
       spec:
         nodeClassRef:
           group: karpenter.sagemaker.amazonaws.com
           kind: HyperpodNodeClass
           name: sample-nc
         expireAfter: Never
         requirements:
           - key: node.kubernetes.io/instance-type
             operator: Exists
   ```

1. Apply the `NodePool` to your cluster:

   ```
   kubectl apply -f nodepool.yaml
   ```

1. Monitor the `NodePool` status to ensure the `Ready` condition in status is set to `True`:

   ```
   kubectl get nodepool sample-np -o yaml
   ```

   ```
   apiVersion: karpenter.sh/v1
   kind: NodePool
   metadata:
     name: <nodepool-name>
     uid: <resource-uid>
     ...
   spec:
     disruption:
       budgets:
       - nodes: 90%
       consolidateAfter: 0s
       consolidationPolicy: WhenEmptyOrUnderutilized
     template:
       spec:
         expireAfter: 720h
         nodeClassRef:
           group: karpenter.sagemaker.amazonaws.com
           kind: HyperpodNodeClass
           name: <nodeclass-name>
         requirements:
         - key: node.kubernetes.io/instance-type
           operator: Exists
   status:
     conditions:
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 2
       reason: ValidationSucceeded
       status: "True"
       type: ValidationSucceeded
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 2
       reason: NodeClassReady
       status: "True"
       type: NodeClassReady
     - lastTransitionTime: "<timestamp>"
       message: ""
       observedGeneration: 2
       reason: Ready
       status: "True"
       type: Ready
   ```

**Supported Labels for Karpenter HyperPod Provider**

These are the optional constraints and requirements you can specify in your `NodePool` configuration.


|  Requirement Type  |  Purpose  |  Use Case/Supported Values  |  Recommendation  | 
| --- | --- | --- | --- | 
|  Instance Types (`node.kubernetes.io/instance-type`)  |  Controls which SageMaker instance types Karpenter can choose from  |  Any instance type available in the instance groups referenced by your `NodeClass`  |  Leave this undefined or use the Exists operator to give Karpenter maximum flexibility in choosing cost-effective instance types  | 
|  Availability Zones (`topology.kubernetes.io/zone`)  |  Controls which AWS availability zones nodes can be created in  |  Specific zone names like us-east-1c. Use when you need pods to run in specific zones for latency or compliance reasons  | n/a | 
|  Architecture (`kubernetes.io/arch`)  |  Specifies CPU architecture  |  Only amd64 (no ARM support currently)  |  n/a  | 
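
Combining the supported labels above, a `NodePool` `requirements` block might look like the following sketch (the zone name is an example value):

```
requirements:
  - key: node.kubernetes.io/instance-type
    operator: Exists
  - key: topology.kubernetes.io/zone
    operator: In
    values: ["us-east-1c"]
  - key: kubernetes.io/arch
    operator: In
    values: ["amd64"]
```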

# Deploy a workload
<a name="sagemaker-hyperpod-eks-autoscaling-workload"></a>

The following examples demonstrate how HyperPod autoscaling with Karpenter automatically provisions nodes in response to workload demands. These examples show basic scaling behavior and multi-availability zone distribution patterns.

**Deploy a simple workload**

1. The following Kubernetes deployment creates pods that each request 1 CPU and 256M of memory. At this point, no pods are spun up yet.

   ```
   kubectl apply -f https://raw.githubusercontent.com/aws/karpenter-provider-aws/refs/heads/main/examples/workloads/inflate.yaml
   ```

1. To test the scale up process, run the following command. Karpenter will add new nodes to the cluster.

   ```
   kubectl scale deployment inflate --replicas 10
   ```

1. To test the scale down process, run the following command. Karpenter will remove nodes from the cluster.

   ```
   kubectl scale deployment inflate --replicas 0
   ```
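
The `inflate` manifest referenced above is roughly equivalent to the following sketch: a Deployment that starts at 0 replicas and whose pods each request 1 CPU and 256M of memory. The image and labels here are assumptions; see the upstream manifest for the exact definition.

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 1
              memory: 256M
```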

**Deploy a workload across multiple AZs**

1. Run the following command to deploy a Kubernetes deployment whose pods must spread evenly across Availability Zones with a maximum skew of 1.

   ```
   kubectl apply -f https://raw.githubusercontent.com/aws/karpenter-provider-aws/refs/heads/main/examples/workloads/spread-zone.yaml
   ```

1. Run the following command to adjust the number of pods:

   ```
   kubectl scale deployment zone-spread --replicas 15
   ```

   Karpenter will add new nodes to the cluster, with at least one node in a different Availability Zone.
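
   The even spread comes from a pod-level topology spread constraint such as the following sketch (standard Kubernetes API; the label selector is an assumption about the example manifest):

   ```
   topologySpreadConstraints:
     - maxSkew: 1
       topologyKey: topology.kubernetes.io/zone
       whenUnsatisfiable: DoNotSchedule
       labelSelector:
         matchLabels:
           app: zone-spread
   ```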

For more examples, see [Karpenter example workloads](https://github.com/aws/karpenter-provider-aws/tree/main/examples/workloads) on GitHub.