StartClusterHealthCheck
Starts deep health checks for a SageMaker HyperPod cluster. You can use the DescribeClusterNode API to track the progress of the deep health checks. Unhealthy nodes are automatically rebooted or replaced. For details, see Resilience-related Kubernetes labels by SageMaker HyperPod.
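As a rough illustration of tracking progress through DescribeClusterNode, the following sketch polls a node until it leaves an in-progress state. Several things here are assumptions rather than statements from this page: the helper name wait_for_deep_health_check is hypothetical, the status string "DeepHealthCheckInProgress" and the response path NodeDetails.InstanceStatus.Status are assumed, and the describe call is passed in as a callable (for example the describe_cluster_node method of a boto3 SageMaker client) so the sketch does not depend on any particular SDK version.

```python
import time


def wait_for_deep_health_check(
    describe_node,
    cluster_name,
    node_id,
    in_progress_status="DeepHealthCheckInProgress",  # assumed status value
    poll_seconds=30.0,
    max_polls=120,
    sleep=time.sleep,
):
    """Poll a DescribeClusterNode-style callable until the node is no longer
    reported as being in a deep health check, then return the last status.

    describe_node is any callable accepting ClusterName and NodeId keyword
    arguments and returning a dict shaped like the assumed response.
    """
    for _ in range(max_polls):
        response = describe_node(ClusterName=cluster_name, NodeId=node_id)
        # Assumed response shape: NodeDetails -> InstanceStatus -> Status.
        status = response["NodeDetails"]["InstanceStatus"]["Status"]
        if status != in_progress_status:
            return status
        sleep(poll_seconds)
    raise TimeoutError(
        f"node {node_id} still reported {in_progress_status} after {max_polls} polls"
    )
```

The sketch returns as soon as the node reports any status other than the in-progress one, so a caller that polls immediately after starting the checks may want to first wait until the in-progress status has been observed.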
Request Syntax
{
"ClusterName": "string",
"DeepHealthCheckConfigurations": [
{
"DeepHealthChecks": [ "string" ],
"InstanceGroupName": "string",
"InstanceIds": [ "string" ]
}
]
}
Request Parameters
For information about the parameters that are common to all actions, see Common Parameters.
The request accepts the following data in JSON format.
- ClusterName
The string name or the Amazon Resource Name (ARN) of the SageMaker HyperPod cluster.
Type: String
Length Constraints: Minimum length of 0. Maximum length of 256.
Pattern:
(arn:aws[a-z\-]*:sagemaker:[a-z0-9\-]*:[0-9]{12}:cluster/[a-z0-9]{12})|([a-zA-Z0-9](-*[a-zA-Z0-9]){0,62})
Required: Yes
- DeepHealthCheckConfigurations
A list of configurations containing instance group names, EC2 instance IDs, and deep health checks to perform.
Type: Array of InstanceGroupHealthCheckConfiguration objects
Array Members: Minimum number of 1 item. Maximum number of 99 items.
Required: Yes
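The constraints above can be checked on the client side before a request is sent. The following is a minimal sketch, not an official client: build_request is a hypothetical helper, and the instance group name and the "InstanceStress" check name in the usage example are placeholder values assumed for illustration. Only the ClusterName length and pattern and the array size limits come from this page; the fields inside each configuration object are not validated.

```python
import json
import re

# Pattern and limits copied from the ClusterName and
# DeepHealthCheckConfigurations parameter descriptions.
CLUSTER_NAME_PATTERN = re.compile(
    r"(arn:aws[a-z\-]*:sagemaker:[a-z0-9\-]*:[0-9]{12}:cluster/[a-z0-9]{12})"
    r"|([a-zA-Z0-9](-*[a-zA-Z0-9]){0,62})"
)
MAX_CLUSTER_NAME_LENGTH = 256
MIN_CONFIGURATIONS = 1
MAX_CONFIGURATIONS = 99


def build_request(cluster_name, configurations):
    """Return a request body dict, raising ValueError on a constraint violation."""
    if len(cluster_name) > MAX_CLUSTER_NAME_LENGTH:
        raise ValueError("ClusterName exceeds 256 characters")
    if not CLUSTER_NAME_PATTERN.fullmatch(cluster_name):
        raise ValueError("ClusterName does not match the documented pattern")
    if not MIN_CONFIGURATIONS <= len(configurations) <= MAX_CONFIGURATIONS:
        raise ValueError("DeepHealthCheckConfigurations must have 1 to 99 items")
    return {
        "ClusterName": cluster_name,
        "DeepHealthCheckConfigurations": list(configurations),
    }


if __name__ == "__main__":
    # Placeholder values, assumed for illustration only.
    body = build_request(
        "my-cluster",
        [{"InstanceGroupName": "worker-group", "DeepHealthChecks": ["InstanceStress"]}],
    )
    print(json.dumps(body, indent=2))
```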
Response Syntax
{
"ClusterArn": "string"
}
Response Elements
If the action is successful, the service sends back an HTTP 200 response.
The following data is returned in JSON format by the service.
- ClusterArn
The Amazon Resource Name (ARN) of the SageMaker HyperPod cluster on which the deep health checks were initiated.
Type: String
Length Constraints: Minimum length of 0. Maximum length of 256.
Pattern:
arn:aws[a-z\-]*:sagemaker:[a-z0-9\-]*:[0-9]{12}:cluster/[a-z0-9]{12}
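The ARN pattern above has a fixed layout, so the returned ClusterArn can be split into its parts. The sketch below adds named groups to the documented pattern; parse_cluster_arn and the group names (partition, region, account, cluster_id) are hypothetical and chosen only for illustration.

```python
import re

# The documented ClusterArn pattern, with named groups added for readability.
CLUSTER_ARN_PATTERN = re.compile(
    r"arn:(?P<partition>aws[a-z\-]*):sagemaker:(?P<region>[a-z0-9\-]*):"
    r"(?P<account>[0-9]{12}):cluster/(?P<cluster_id>[a-z0-9]{12})"
)


def parse_cluster_arn(response):
    """Extract the ARN components from a response dict containing ClusterArn."""
    arn = response["ClusterArn"]
    match = CLUSTER_ARN_PATTERN.fullmatch(arn)
    if match is None:
        raise ValueError(f"unexpected ClusterArn format: {arn!r}")
    return match.groupdict()
```

For example, calling parse_cluster_arn on a response whose ClusterArn is arn:aws:sagemaker:us-west-2:123456789012:cluster/abcdefghij12 yields the region us-west-2 and the cluster ID abcdefghij12.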
Errors
For information about the errors that are common to all actions, see Common Error Types.
- ResourceNotFound
The resource being accessed is not found.
HTTP Status Code: 400
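A caller may want to distinguish this error from other failures. The sketch below inspects an error payload shaped like the response attribute of a botocore ClientError, with an Error.Code field and a ResponseMetadata.HTTPStatusCode field. That shape and the helper name is_resource_not_found are assumptions for illustration; only the error name ResourceNotFound and the status code 400 come from this page.

```python
def is_resource_not_found(error_response):
    """Return True if the payload reports the ResourceNotFound error.

    error_response is assumed to be a dict such as:
    {"Error": {"Code": "...", "Message": "..."},
     "ResponseMetadata": {"HTTPStatusCode": 400}}
    """
    code = error_response.get("Error", {}).get("Code")
    status = error_response.get("ResponseMetadata", {}).get("HTTPStatusCode")
    # Match on the error code; accept a missing status, but reject a status
    # that contradicts the documented HTTP 400.
    return code == "ResourceNotFound" and status in (None, 400)
```

Matching on the error code rather than the status alone matters because HTTP 400 is typically shared by many client-side errors.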