Troubleshoot

This page contains known solutions for troubleshooting your HyperPod EKS clusters.

Dashboard tab

The EKS add-on fails to install

For the EKS add-on installation to succeed, your cluster must run Kubernetes version 1.30 or later. To update, see Update Kubernetes version.
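
To confirm the version before retrying, you can query the EKS control plane with the AWS CLI; the cluster name below is a placeholder:

  # Print the Kubernetes version of your EKS cluster (replace my-cluster with your cluster name)
  aws eks describe-cluster --name my-cluster --query "cluster.version" --output text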

For the EKS add-on installation to succeed, all of the nodes need to be in Ready status and all of the pods need to be in Running status.

To check the status of your nodes, use the list-cluster-nodes AWS CLI command, or navigate to your EKS cluster in the EKS console and view the status of your nodes. Resolve the issue for each node or reach out to your administrator. If a node's status is Unknown, delete the node. Once all node statuses are Ready, retry installing the EKS add-on in HyperPod from the Amazon SageMaker AI console.
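
For example, the following commands list the instances in your HyperPod cluster, check node status from the Kubernetes side, and remove a node stuck in Unknown status. The cluster name and node name are placeholders:

  # List the nodes of your HyperPod cluster (replace my-hyperpod-cluster)
  aws sagemaker list-cluster-nodes --cluster-name my-hyperpod-cluster

  # Check node status as Kubernetes sees it
  kubectl get nodes

  # Delete a node whose status is Unknown (replace the node name)
  kubectl delete node ip-10-0-0-1.us-west-2.compute.internal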

To check the status of your pods, use the Kubernetes CLI command kubectl get pods -n cloudwatch-agent, or navigate to your EKS cluster in the EKS console and view the status of the pods in the cloudwatch-agent namespace. Resolve the issue for each pod or reach out to your administrator. Once all pod statuses are Running, retry installing the EKS add-on in HyperPod from the Amazon SageMaker AI console.
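
To dig into a pod that isn't Running, the standard kubectl inspection commands apply; the pod name below is a placeholder:

  # List the pods in the cloudwatch-agent namespace
  kubectl get pods -n cloudwatch-agent

  # Show events and state for a failing pod (replace the pod name)
  kubectl describe pod cloudwatch-agent-xxxxx -n cloudwatch-agent

  # Read the pod's logs for error details
  kubectl logs cloudwatch-agent-xxxxx -n cloudwatch-agent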

For more troubleshooting, see Troubleshooting the Amazon CloudWatch Observability EKS add-on.

Tasks tab

If you see an error message stating that the Custom Resource Definition (CRD) is not configured on the cluster, grant the EKSAdminViewPolicy and ClusterAccessRole policies to your domain execution role.
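
As a minimal sketch, you can grant an EKS access policy to a role through an EKS access entry. The cluster name, account ID, role name, and the AmazonEKSAdminViewPolicy ARN below are assumptions; adapt them to the policies named above and to your environment:

  # Create an access entry for the domain execution role, if one doesn't exist yet
  aws eks create-access-entry \
      --cluster-name my-eks-cluster \
      --principal-arn arn:aws:iam::111122223333:role/MyDomainExecutionRole

  # Associate the view policy with the role, scoped to the whole cluster
  aws eks associate-access-policy \
      --cluster-name my-eks-cluster \
      --principal-arn arn:aws:iam::111122223333:role/MyDomainExecutionRole \
      --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSAdminViewPolicy \
      --access-scope type=cluster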

Policies

The following lists solutions to errors related to policies when using the HyperPod APIs or console.

  • If the policy is in CreateFailed or CreateRollbackFailed status, you need to delete the failed policy and create a new one.

  • If the policy is in UpdateFailed status, retry the update with the same policy ARN.

  • If the policy is in UpdateRollbackFailed status, you need to delete the failed policy and then create a new one.

  • If the policy is in DeleteFailed or DeleteRollbackFailed status, retry the delete with the same policy ARN.

    • If you run into an error while trying to delete the compute prioritization (cluster policy) using the HyperPod console, try deleting the cluster-scheduler-config resource using the API instead, as shown in the example below. To check the status of the resource, go to the details page of a compute allocation.

To see more details about the failure, use the corresponding describe API.
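
For example, assuming the AWS CLI and a placeholder resource ID, the following inspects a cluster policy's status and failure reason and then retries the delete directly through the API:

  # Describe the resource to see its status and failure reason (replace the ID)
  aws sagemaker describe-cluster-scheduler-config \
      --cluster-scheduler-config-id abc12defghij

  # Delete the cluster policy directly through the API
  aws sagemaker delete-cluster-scheduler-config \
      --cluster-scheduler-config-id abc12defghij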