Manually quarantine, replace, or reboot a node - Amazon SageMaker AI

Manually quarantine, replace, or reboot a node

Learn how to manually quarantine, replace, and reboot a faulty node in SageMaker HyperPod clusters orchestrated with Amazon EKS.

To quarantine a node and force delete a training pod

kubectl cordon <node-name>

After quarantining the node, force delete the pod. This is useful when a pod has been stuck in termination for more than 30 minutes, or when kubectl describe pod shows 'Node is not ready' under Events.

kubectl delete pods <pod-name> --grace-period=0 --force
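The two commands above can be combined into one small helper. This is a minimal sketch, not part of the original procedure: the function name quarantine_and_evict and the KUBECTL override variable are hypothetical, introduced only so the commands can be previewed (for example with KUBECTL=echo) before they are run against a real cluster. It assumes the pod is in the current namespace, as the original command does.

```shell
#!/usr/bin/env bash
# Sketch: cordon a node, then force delete one pod that is stuck on it.
# quarantine_and_evict and KUBECTL are hypothetical names for illustration.

quarantine_and_evict() {
  local node="$1" pod="$2"
  local kubectl="${KUBECTL:-kubectl}"

  if [ -z "$node" ] || [ -z "$pod" ]; then
    echo "usage: quarantine_and_evict <node-name> <pod-name>" >&2
    return 2
  fi

  # Mark the node unschedulable so no new pods are placed on it.
  "$kubectl" cordon "$node" || return 1

  # Skip graceful termination for the stuck pod.
  "$kubectl" delete pods "$pod" --grace-period=0 --force
}
```

Running KUBECTL=echo quarantine_and_evict <node-name> <pod-name> prints the two commands without executing them, which is a cheap way to check the arguments first.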

To replace a node

Label the node to replace with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement, which triggers SageMaker HyperPod automatic node recovery. Note that automatic node recovery must also be activated for the cluster, during cluster creation or update.

kubectl label nodes <node-name> \
    sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement

To reboot a node

Label the node to reboot with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot, which triggers SageMaker HyperPod automatic node recovery. Note that automatic node recovery must also be activated for the cluster, during cluster creation or update.

kubectl label nodes <node-name> \
    sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot
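Because the replace and reboot procedures differ only in the label value, they can be wrapped in a single helper. The following is a sketch under stated assumptions: set_node_health_action and the KUBECTL override are hypothetical names, not part of SageMaker HyperPod or kubectl; only the label key and the two values come from this page.

```shell
#!/usr/bin/env bash
# Sketch: request a replace or reboot by applying the node-health-status label.
# set_node_health_action and KUBECTL are hypothetical names for illustration.

set_node_health_action() {
  local node="$1" action="$2" status
  local kubectl="${KUBECTL:-kubectl}"

  # Map the requested action to the label value documented above.
  case "$action" in
    replace) status="UnschedulablePendingReplacement" ;;
    reboot)  status="UnschedulablePendingReboot" ;;
    *)       status="" ;;
  esac

  if [ -z "$node" ] || [ -z "$status" ]; then
    echo "usage: set_node_health_action <node-name> replace|reboot" >&2
    return 2
  fi

  "$kubectl" label nodes "$node" \
    "sagemaker.amazonaws.com/node-health-status=${status}"
}
```

One general kubectl behavior to keep in mind: if a node already carries this label key with a different value, kubectl label refuses the change unless --overwrite is added. Whether that applies depends on the labels already present on your nodes.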

After the UnschedulablePendingReplacement or UnschedulablePendingReboot label is applied, you should see the node terminated or rebooted within a few minutes.
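One way to follow this from the command line is to list the nodes with the health label shown as an extra column and stream updates as they happen. This is a minimal sketch: watch_node_health and the KUBECTL override are hypothetical names, while -L (label columns) and --watch are standard kubectl get options. What the STATUS and label columns show during recovery is not described on this page, so treat the output as something to observe rather than a fixed sequence.

```shell
#!/usr/bin/env bash
# Sketch: watch node status alongside the node-health-status label.
# watch_node_health and KUBECTL are hypothetical names for illustration.

watch_node_health() {
  local kubectl="${KUBECTL:-kubectl}"

  # -L adds the label's value as a column; --watch keeps streaming changes
  # until interrupted with Ctrl+C.
  "$kubectl" get nodes -L sagemaker.amazonaws.com/node-health-status --watch
}
```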