Troubleshoot
This page contains known solutions for troubleshooting your HyperPod EKS clusters.
Dashboard tab
The EKS add-on fails to install
For the EKS add-on installation to succeed, your cluster must be running Kubernetes version 1.30 or later. To update, see Update Kubernetes version.
For the EKS add-on installation to succeed, all of the nodes need to be in Ready status and all of the pods need to be in Running status.
To check the status of your nodes, use the list-cluster-nodes AWS CLI command or navigate to your EKS cluster in the EKS console.
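For example, assuming my-hyperpod-cluster is a placeholder for your own HyperPod cluster name, you can list the nodes and their status with the AWS CLI:

# List the nodes in the HyperPod cluster and review their status (cluster name is a placeholder)
aws sagemaker list-cluster-nodes --cluster-name my-hyperpod-cluster

# Alternatively, check node readiness directly from Kubernetes
kubectl get nodes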
To check the status of your pods, use the Kubernetes CLI command kubectl get pods -n cloudwatch-agent, or navigate to your EKS cluster in the EKS console and review the cloudwatch-agent pods. Resolve the issues with the pods or reach out to your administrator to resolve them. Once all pod statuses are Running, retry installing the EKS add-on in HyperPod from the Amazon SageMaker AI console.
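For example, assuming your kubeconfig already points to the EKS cluster and that the pod name shown is a placeholder, you can inspect the pods as follows:

# List the pods and check the STATUS column
kubectl get pods -n cloudwatch-agent

# Describe a pod that is not Running to see the events explaining why (pod name is a placeholder)
kubectl describe pod cloudwatch-agent-abc12 -n cloudwatch-agent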
For more troubleshooting, see Troubleshooting the Amazon CloudWatch Observability EKS add-on.
Tasks tab
If you see an error message stating that the Custom Resource Definition (CRD) is not configured on the cluster, grant the EKSAdminViewPolicy and ClusterAccessRole policies to your domain execution role (see the sketch after the following list).
- For information on how to get your execution role, see Get your execution role.
- To learn how to attach policies to an IAM user or group, see Adding and removing IAM identity permissions.
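As a rough sketch, assuming your EKS cluster uses access entries and that the cluster name, account ID, and role name below are placeholders (confirm the exact access policy ARNs that apply to your setup), granting view access to the execution role could look like the following:

# Create an access entry for the domain execution role on the EKS cluster (names and ARNs are placeholders)
aws eks create-access-entry \
    --cluster-name my-eks-cluster \
    --principal-arn arn:aws:iam::111122223333:role/MyDomainExecutionRole

# Associate the admin view access policy with that role, scoped to the whole cluster
aws eks associate-access-policy \
    --cluster-name my-eks-cluster \
    --principal-arn arn:aws:iam::111122223333:role/MyDomainExecutionRole \
    --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSAdminViewPolicy \
    --access-scope type=cluster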
Policies
The following lists solutions to errors related to policies when using the HyperPod APIs or console.
- If the policy is in CreateFailed or CreateRollbackFailed status, you need to delete the failed policy and create a new one.
- If the policy is in UpdateFailed status, retry the update with the same policy ARN.
- If the policy is in UpdateRollbackFailed status, you need to delete the failed policy and then create a new one.
- If the policy is in DeleteFailed or DeleteRollbackFailed status, retry the delete with the same policy ARN.
- If you ran into an error while trying to delete the compute prioritization, or cluster policy, using the HyperPod console, try to delete the cluster-scheduler-config using the API, as shown in the sketch after this list. To check the status of the resource, go to the details page of a compute allocation.
- To see more details about the failure, use the describe API.
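As a minimal sketch, assuming the cluster policy is backed by a cluster scheduler config and that the ID shown is a placeholder, you could inspect the failed resource and then delete it with the SageMaker AWS CLI:

# Describe the cluster scheduler config to see its status and failure details (ID is a placeholder)
aws sagemaker describe-cluster-scheduler-config --cluster-scheduler-config-id sched-config-1234

# Delete the failed cluster scheduler config behind the cluster policy (ID is a placeholder)
aws sagemaker delete-cluster-scheduler-config --cluster-scheduler-config-id sched-config-1234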