SageMaker HyperPod health-monitoring agent - Amazon SageMaker

The SageMaker HyperPod health-monitoring agent continuously monitors the health of each GPU-based or Trainium-based instance. When it detects an instance or GPU failure, the agent marks the instance as unhealthy.

Check the results collected by the SageMaker HyperPod health-monitoring agent

The SageMaker HyperPod health-monitoring agent is an out-of-the-box health check feature that runs continuously on all HyperPod clusters. The agent publishes the health events it detects on GPU or Trn instances to CloudWatch under the cluster log group /aws/sagemaker/Clusters/.

The detection logs from the HyperPod health-monitoring agent are written to a separate log stream named SagemakerHealthMonitoringAgent for each node. You can query the detection logs using CloudWatch Logs Insights as follows.

fields @timestamp, @message | filter @message like /HealthMonitoringAgentDetectionEvent/

This should return an output similar to the following.

2024-08-21T11:35:35.532-07:00 {"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z","reason":"XidHardwareFailure","message":"Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2024-08-21T11:35:35.532-07:00 {"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z","reason":"XidHardwareFailure","message":"Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
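
If you would rather run the same query from a script than from the console, the following is a minimal Python sketch, not an official sample. It assumes a client that behaves like the boto3 CloudWatch Logs client (boto3.client("logs")) and uses its start_query and get_query_results calls. The function names fetch_detection_events and parse_row, and the output keys logged_at, severity, reason, and message, are hypothetical names chosen for illustration. The sketch also assumes that each @message value is a JSON object shaped like the sample output above, including the "details: " key exactly as it appears there.

```python
import json
import time

# Query string taken verbatim from the text above.
QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /HealthMonitoringAgentDetectionEvent/"
)


def parse_row(row):
    """Flatten one Logs Insights result row into a small dict (hypothetical helper)."""
    # Each row is a list of {"field": ..., "value": ...} cells.
    fields = {cell["field"]: cell["value"] for cell in row}
    record = json.loads(fields["@message"])
    # Assumption: the key is "details: " (with a trailing colon and space),
    # as shown in the sample output.
    details = record.get("details: ", {})
    return {
        "logged_at": fields.get("@timestamp"),
        "severity": details.get("severity"),
        "reason": details.get("reason"),
        "message": details.get("message"),
    }


def fetch_detection_events(logs_client, log_group_name, start_time, end_time,
                           poll_seconds=1.0):
    """Run the detection query and return a list of parsed events.

    logs_client is expected to behave like boto3.client("logs").
    start_time and end_time are epoch seconds.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group_name,
        startTime=int(start_time),
        endTime=int(end_time),
        queryString=QUERY,
    )["queryId"]

    # Logs Insights queries run asynchronously, so poll until this one finishes.
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(poll_seconds)

    if response["status"] != "Complete":
        raise RuntimeError(f"query ended with status {response['status']}")

    return [parse_row(row) for row in response["results"]]
```

To try it against a real account, pass boto3.client("logs") as logs_client, together with the full name of your cluster's log group and a time window such as the last hour. Taking the client as a parameter keeps the sketch testable without AWS credentials.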