Dashboard setup - Amazon SageMaker AI

Use the following information to set up the Amazon CloudWatch Observability EKS add-on for Amazon SageMaker HyperPod. The add-on gives you a detailed visual dashboard with metrics for your EKS cluster hardware, team allocation, and tasks.

If you run into issues during setup, see Troubleshoot for known solutions.

HyperPod Amazon CloudWatch Observability EKS add-on prerequisites

Complete the following prerequisites before installing the Amazon CloudWatch Observability EKS add-on.

  • If you have not already done so, follow the instructions in IAM users for cluster admin to ensure that you have the minimum permissions for HyperPod cluster administrative tasks.

  • Attach the CloudWatchAgentServerPolicy IAM policy to your worker nodes. To do so, enter the following command. Replace my-worker-node-role with the IAM role used by your Kubernetes worker nodes.

    aws iam attach-role-policy \
      --role-name my-worker-node-role \
      --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
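To confirm that the attachment took effect, you can list the policies attached to the role. The following is a minimal sketch: the helper name `has_cloudwatch_agent_policy` is illustrative and not part of the AWS CLI, and `my-worker-node-role` is the same placeholder used above.

```shell
# Illustrative helper (hypothetical name): exits 0 if the
# CloudWatchAgentServerPolicy managed policy is attached to the given role.
has_cloudwatch_agent_policy() {
  role_name="$1"
  # With --output text, the list of ARNs is printed tab-separated on one line,
  # so split it into one ARN per line before matching the whole line.
  aws iam list-attached-role-policies \
    --role-name "$role_name" \
    --query "AttachedPolicies[].PolicyArn" \
    --output text |
    tr '\t' '\n' |
    grep -qx "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"
}
```

For example, `has_cloudwatch_agent_policy my-worker-node-role && echo "attached"` prints `attached` only when the policy is present on the role.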

HyperPod Amazon CloudWatch Observability EKS add-on setup

Use the following options to set up the Amazon SageMaker HyperPod Amazon CloudWatch Observability EKS add-on.

Setup using the SageMaker AI console

The following permissions are required to set up and view the HyperPod task governance dashboard. This section expands on the permissions listed in IAM users for cluster admin.

To manage task governance, use the following sample policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:ListClusters",
                "sagemaker:DescribeCluster",
                "sagemaker:ListComputeQuotas",
                "sagemaker:CreateComputeQuota",
                "sagemaker:UpdateComputeQuota",
                "sagemaker:DescribeComputeQuota",
                "sagemaker:DeleteComputeQuota",
                "sagemaker:ListClusterSchedulerConfigs",
                "sagemaker:DescribeClusterSchedulerConfig",
                "sagemaker:CreateClusterSchedulerConfig",
                "sagemaker:UpdateClusterSchedulerConfig",
                "sagemaker:DeleteClusterSchedulerConfig",
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:DescribeAddon",
                "eks:DescribeCluster",
                "eks:DescribeAccessEntry",
                "eks:ListAssociatedAccessPolicies",
                "eks:AssociateAccessPolicy",
                "eks:DisassociateAccessPolicy"
            ],
            "Resource": "*"
        }
    ]
}

To grant permissions to manage the Amazon CloudWatch Observability EKS add-on and view the HyperPod cluster dashboard through the SageMaker AI console, use the following sample policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "eks:ListAddons",
                "eks:CreateAddon",
                "eks:UpdateAddon",
                "eks:DescribeAddon",
                "eks:DescribeAddonVersions",
                "sagemaker:DescribeCluster",
                "sagemaker:DescribeClusterNode",
                "sagemaker:ListClusterNodes",
                "sagemaker:ListClusters",
                "sagemaker:ListComputeQuotas",
                "sagemaker:DescribeComputeQuota",
                "sagemaker:ListClusterSchedulerConfigs",
                "sagemaker:DescribeClusterSchedulerConfig",
                "eks:DescribeCluster",
                "cloudwatch:GetMetricData",
                "eks:AccessKubernetesApi"
            ],
            "Resource": "*"
        }
    ]
}
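One way to apply a sample policy like these is to save it to a local file, create a customer managed policy from it, and attach that policy to the IAM role used for console access. The following sketch assumes a role-based setup; the helper name `create_and_attach_policy`, the policy name, the file path, and the role name are all placeholders introduced here for illustration. For an IAM user, `aws iam attach-user-policy` would be the corresponding command.

```shell
# Illustrative helper (hypothetical name): create a customer managed policy
# from a local JSON file and attach it to an IAM role.
# Arguments: <policy-name> <policy-json-file> <role-name>
create_and_attach_policy() {
  policy_name="$1"
  policy_file="$2"
  role_name="$3"
  # create-policy returns the new policy's ARN, which attach-role-policy needs.
  policy_arn="$(aws iam create-policy \
    --policy-name "$policy_name" \
    --policy-document "file://$policy_file" \
    --query 'Policy.Arn' \
    --output text)" || return 1
  aws iam attach-role-policy \
    --role-name "$role_name" \
    --policy-arn "$policy_arn"
}
```

A call such as `create_and_attach_policy hyperpod-dashboard-access policy.json my-console-role` would create the policy and attach it in one step; all three arguments are example values.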

Navigate to the Dashboard tab in the SageMaker HyperPod console to install the Amazon CloudWatch Observability EKS add-on. To ensure that task governance metrics are included in the dashboard, select the Kueue metrics checkbox. Enabling Kueue metrics incurs CloudWatch Metrics charges once the free-tier limit is reached. For more information, see Metrics in Amazon CloudWatch Pricing.

Setup using the EKS AWS CLI

Use the following EKS AWS CLI command to install the add-on:

aws eks create-addon --cluster-name cluster-name --addon-name amazon-cloudwatch-observability --configuration-values "configuration json"

The following is an example of the configuration values JSON:

{
    "agent": {
        "config": {
            "logs": {
                "metrics_collected": {
                    "kubernetes": {
                        "kueue_container_insights": true,
                        "enhanced_container_insights": true
                    },
                    "application_signals": { }
                }
            },
            "traces": {
                "traces_collected": {
                    "application_signals": { }
                }
            }
        }
    }
}
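Passing multi-line JSON inline can be awkward to quote in a shell, so one option is to save the configuration to a file and reference it with the AWS CLI `file://` prefix. The following sketch also waits for the add-on to become active. The helper name `install_observability_addon` and the file name `config.json` are illustrative assumptions.

```shell
# Illustrative helper (hypothetical name): install the add-on using a
# configuration file, then block until the add-on reports an active status.
# Arguments: <cluster-name> <config-json-file>
install_observability_addon() {
  cluster_name="$1"
  config_file="$2"
  aws eks create-addon \
    --cluster-name "$cluster_name" \
    --addon-name amazon-cloudwatch-observability \
    --configuration-values "file://$config_file" || return 1
  # The waiter polls the add-on status and returns once it is active.
  aws eks wait addon-active \
    --cluster-name "$cluster_name" \
    --addon-name amazon-cloudwatch-observability
}
```

For example, `install_observability_addon cluster-name config.json` installs the add-on with the configuration stored in `config.json`.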

Setup using the EKS Console UI

  1. Navigate to the EKS console.

  2. Choose your cluster.

  3. Choose Add-ons.

  4. Find the Amazon CloudWatch Observability add-on and install version 2.4.0 or later.

  5. Enter the following JSON under Configuration values:

    {
        "agent": {
            "config": {
                "logs": {
                    "metrics_collected": {
                        "kubernetes": {
                            "kueue_container_insights": true,
                            "enhanced_container_insights": true
                        },
                        "application_signals": { }
                    }
                },
                "traces": {
                    "traces_collected": {
                        "application_signals": { }
                    }
                }
            }
        }
    }

Once the Amazon CloudWatch Observability EKS add-on has been successfully installed, you can view your EKS cluster metrics on the Dashboard tab in the HyperPod console.
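To check from the command line that the installed add-on meets the 2.4.0 minimum given in the console steps, you can compare the version reported by `aws eks describe-addon`. The following sketch assumes version strings of the form `vX.Y.Z-eksbuild.N`; the helper names `version_at_least` and `addon_version_ok` are illustrative.

```shell
# Illustrative helper (hypothetical name): exits 0 if version $1 is at least
# version $2. A leading "v" and any "-suffix" are ignored, and the remaining
# major.minor.patch fields are compared numerically.
version_at_least() {
  have="${1#v}"; have="${have%%-*}"
  want="${2#v}"; want="${want%%-*}"
  # After a numeric sort, the smaller version comes first; the requirement is
  # met when that smallest entry equals the wanted version.
  lowest="$(printf '%s\n%s\n' "$want" "$have" |
    sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)"
  [ "$lowest" = "$want" ]
}

# Illustrative helper (hypothetical name): exits 0 if the add-on installed on
# the given cluster is version 2.4.0 or later.
addon_version_ok() {
  cluster_name="$1"
  installed="$(aws eks describe-addon \
    --cluster-name "$cluster_name" \
    --addon-name amazon-cloudwatch-observability \
    --query 'addon.addonVersion' \
    --output text)" || return 1
  version_at_least "$installed" 2.4.0
}
```

For example, `addon_version_ok cluster-name || echo "add-on is older than 2.4.0"` prints the message when the installed version is below the minimum or the lookup fails.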