SageMaker HyperPod provides the following cluster resiliency features:
SageMaker HyperPod health-monitoring agent
Basic health checks
Deep health checks
Automatic node recovery
Resilience-related Kubernetes labels by SageMaker HyperPod
Manually quarantine, replace, or reboot a node
Suggested resilience configurations