Manually quarantine, replace, or reboot a node
Learn how to manually quarantine, replace, and reboot a faulty node in SageMaker HyperPod clusters orchestrated with Amazon EKS.
To quarantine a node and force delete a training pod
kubectl cordon <node-name>
After quarantining the node, force delete the pod. This is useful when a pod has been stuck in termination for more than 30 minutes, or when kubectl describe pod shows ‘Node is not ready’ in Events.
kubectl delete pods <pod-name> --grace-period=0 --force
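The two commands above can be combined into one helper. The following is a minimal sketch, not an official SageMaker HyperPod tool: the function name quarantine_and_force_delete and its optional namespace argument are assumptions made for illustration. It uses only the kubectl cordon and kubectl delete pods commands shown above, plus the standard --namespace flag.

```shell
# Hypothetical helper: cordon a node, then force delete one pod.
# Usage: quarantine_and_force_delete <node-name> <pod-name> [namespace]
quarantine_and_force_delete() {
  local node pod namespace
  node="$1"
  pod="$2"
  namespace="${3:-default}"   # assumption: fall back to the "default" namespace

  if [ -z "$node" ] || [ -z "$pod" ]; then
    echo "usage: quarantine_and_force_delete <node-name> <pod-name> [namespace]" >&2
    return 2
  fi

  # Mark the node unschedulable so that no new pods are placed on it.
  kubectl cordon "$node" || return 1

  # Skip the graceful termination period and remove the pod object immediately.
  kubectl delete pods "$pod" --namespace "$namespace" --grace-period=0 --force
}
```

For example, quarantine_and_force_delete <node-name> <pod-name> cordons the node and then force deletes the pod from the default namespace. The function stops after a failed cordon so that a pod is not force deleted from a node that is still schedulable.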
To replace a node
Label the node to replace with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement, which triggers SageMaker HyperPod automatic node recovery. Note that automatic node recovery must also be activated during cluster creation or update.
kubectl label nodes <node-name> \
    sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplacement
To reboot a node
Label the node to reboot with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot, which triggers SageMaker HyperPod automatic node recovery. Note that automatic node recovery must also be activated during cluster creation or update.
kubectl label nodes <node-name> \
    sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReboot
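Because the replace and reboot procedures differ only in the label value, a small wrapper can help avoid typos in the long label string. This is a sketch under stated assumptions: the function name mark_node_for_recovery and its replace/reboot keywords are hypothetical, while the label key and the two values are the ones given in this section.

```shell
# Hypothetical helper: request a node replacement or reboot by labeling the node.
# Usage: mark_node_for_recovery <node-name> <replace|reboot>
mark_node_for_recovery() {
  local node action status
  node="$1"
  action="$2"

  if [ -z "$node" ]; then
    echo "usage: mark_node_for_recovery <node-name> <replace|reboot>" >&2
    return 2
  fi

  # Map the keyword to the label value documented in this section.
  case "$action" in
    replace) status="UnschedulablePendingReplacement" ;;
    reboot)  status="UnschedulablePendingReboot" ;;
    *)
      echo "usage: mark_node_for_recovery <node-name> <replace|reboot>" >&2
      return 2
      ;;
  esac

  # kubectl refuses to change a label that already has a value unless
  # --overwrite is passed; it is left out here to match the commands above.
  kubectl label nodes "$node" "sagemaker.amazonaws.com/node-health-status=${status}"
}
```

For example, mark_node_for_recovery <node-name> reboot applies the UnschedulablePendingReboot label. As with the plain commands, the label only has an effect if automatic node recovery is activated on the cluster.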
After the UnschedulablePendingReplacement or UnschedulablePendingReboot label is applied, you should see the node terminated or rebooted within a few minutes.
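One way to follow the progress is to list the nodes with the health label shown as an extra column. The sketch below assumes nothing beyond standard kubectl; the function name show_node_health is hypothetical, and --label-columns is the long form of the kubectl get -L flag.

```shell
# Hypothetical helper: list nodes with the node-health-status label as a column.
# Usage: show_node_health [node-name ...] [extra kubectl get flags, e.g. --watch]
show_node_health() {
  # With no arguments, all nodes in the cluster are listed.
  kubectl get nodes "$@" --label-columns sagemaker.amazonaws.com/node-health-status
}
```

Calling show_node_health --watch keeps the listing open and prints updates as node status changes.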