Troubleshoot - Amazon SageMaker AI

Troubleshoot

The following section lists troubleshooting solutions for HyperPod in Studio.

Tasks tab

If you get the message "Custom Resource Definition (CRD) is not configured on the cluster" while in the Tasks tab.

If the tasks grid for Slurm metrics doesn’t stop loading in the Tasks tab.

For restricted task view in Studio for EKS clusters:

  • If your execution role doesn’t have permissions to list namespaces for EKS clusters.

  • If users are experiencing issues with access for EKS clusters.

    1. Verify RBAC is enabled by running the following kubectl command.

      kubectl api-versions | grep rbac

      This should return rbac.authorization.k8s.io/v1.

    2. Check if the ClusterRole and ClusterRoleBinding exist by running the following commands.

      kubectl get clusterrole pods-events-crd-cluster-role
      kubectl get clusterrolebinding pods-events-crd-cluster-role-binding

    3. Verify user group membership. Ensure the user is correctly assigned to the pods-events-crd-cluster-level group in your identity provider or IAM.

  • If users can't see any resources.

    • Verify group membership and ensure the ClusterRoleBinding is correctly applied.

  • If users can see resources in all namespaces.

    • If namespace restriction is required, consider using Role and RoleBinding instead of ClusterRole and ClusterRoleBinding.

  • If configuration appears correct, but permissions aren't applied.

    • Check if there are any NetworkPolicies or PodSecurityPolicies interfering with access.
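
The RBAC checks above can be run in one pass. The following is a minimal sketch, not an official script. It assumes kubectl is already configured for the EKS cluster. The function names check_restricted_task_view and check_group_access are hypothetical. The second function additionally assumes that the Kubernetes group name matches the pods-events-crd-cluster-level group named above, that pods are among the resources the ClusterRole grants, and that the caller is allowed to impersonate users and groups.

```shell
# Resource names taken from the troubleshooting steps above.
CLUSTER_ROLE="pods-events-crd-cluster-role"
CLUSTER_ROLE_BINDING="pods-events-crd-cluster-role-binding"
# Assumption: the Kubernetes group name matches the identity provider group.
ACCESS_GROUP="pods-events-crd-cluster-level"

# Hypothetical helper: prints one PASS/FAIL line per check and
# returns non-zero if any check fails.
check_restricted_task_view() {
  local failed=0

  # Step 1: the RBAC API group must be served by the cluster.
  if kubectl api-versions | grep -qxF 'rbac.authorization.k8s.io/v1'; then
    echo "PASS: RBAC API is available"
  else
    echo "FAIL: rbac.authorization.k8s.io/v1 not returned by the cluster"
    failed=1
  fi

  # Step 2: the ClusterRole and ClusterRoleBinding must exist.
  if kubectl get clusterrole "$CLUSTER_ROLE" >/dev/null 2>&1; then
    echo "PASS: ClusterRole $CLUSTER_ROLE exists"
  else
    echo "FAIL: ClusterRole $CLUSTER_ROLE not found or not readable"
    failed=1
  fi

  if kubectl get clusterrolebinding "$CLUSTER_ROLE_BINDING" >/dev/null 2>&1; then
    echo "PASS: ClusterRoleBinding $CLUSTER_ROLE_BINDING exists"
  else
    echo "FAIL: ClusterRoleBinding $CLUSTER_ROLE_BINDING not found or not readable"
    failed=1
  fi

  return "$failed"
}

# Hypothetical helper: asks the API server whether a member of the group
# may list pods in a namespace. Prints "yes" or "no".
# Usage: check_group_access <user> [namespace]
check_group_access() {
  local user="$1"
  local ns="${2:-default}"
  kubectl auth can-i list pods --namespace "$ns" --as "$user" --as-group "$ACCESS_GROUP"
}
```

After loading these definitions into a shell, run check_restricted_task_view to cover steps 1 and 2. Running check_group_access with a user name and a namespace can help tell apart the "can't see any resources" and "can see resources in all namespaces" cases, because it reports what the API server would allow for a member of that group.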

Metrics tab

If no Amazon CloudWatch metrics are displayed in the Metrics tab.

  • The Metrics section of HyperPod cluster details uses CloudWatch to fetch the data. To see the metrics in this section, Cluster observability must be enabled. Contact your administrator to configure metrics.