SageMaker HyperPod health-monitoring agent - Amazon SageMaker


SageMaker HyperPod health-monitoring agent

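The SageMaker HyperPod health-monitoring agent continuously monitors the health of each GPU-based or Trainium-based instance. When it detects an instance or GPU failure, the agent marks the instance as unhealthy.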

Checking the results collected by the SageMaker HyperPod health-monitoring agent

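The SageMaker HyperPod health-monitoring agent is an out-of-the-box health check feature that runs continuously on all HyperPod clusters. The agent publishes the health events it detects on GPU or Trn instances to CloudWatch, under the cluster log group /aws/sagemaker/Clusters/.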

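Detection logs from the HyperPod health-monitoring agent are written to a separate log stream for each node, named SagemakerHealthMonitoringAgent. You can query the detection logs with CloudWatch Logs Insights as follows.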

fields @timestamp, @message | filter @message like /HealthMonitoringAgentDetectionEvent/
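The same query can also be run programmatically. The following is a minimal sketch, assuming a boto3 CloudWatch Logs client (`boto3.client("logs")`) is passed in as `logs_client`. The function name `run_detection_query` and its parameters are hypothetical and chosen only for illustration. The full log group name is not spelled out here, so the caller supplies it.

```python
import time

# The Logs Insights query shown above, unchanged.
DETECTION_QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /HealthMonitoringAgentDetectionEvent/"
)


def run_detection_query(logs_client, log_group_name, lookback_seconds=3600,
                        poll_interval=1.0, max_polls=60):
    """Run the detection query and return rows as {field: value} dicts.

    logs_client is assumed to behave like boto3.client("logs").
    log_group_name is the cluster's log group, supplied by the caller.
    """
    end_time = int(time.time())
    start_time = end_time - lookback_seconds

    # start_query is asynchronous: it only returns a query ID.
    query_id = logs_client.start_query(
        logGroupName=log_group_name,
        startTime=start_time,
        endTime=end_time,
        queryString=DETECTION_QUERY,
    )["queryId"]

    # Poll until the query reaches a terminal status.
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each result row is a list of {"field": ..., "value": ...} cells.
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"Logs Insights query ended with status {status}")
        time.sleep(poll_interval)

    raise TimeoutError("Logs Insights query did not complete in time")
```

Taking the client as an argument keeps the sketch independent of credentials and region configuration, and makes it easy to exercise with a stub client.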

This returns output similar to the following.

2024-08-21T11:35:35.532-07:00 {"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z","reason":"XidHardwareFailure","message":"Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2024-08-21T11:35:35.532-07:00 {"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z","reason":"XidHardwareFailure","message":"Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
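Each detection entry is an optional timestamp prefix followed by a JSON object, so the fields of interest can be pulled out with the standard library. The sketch below assumes the layout of the sample output above, including the `"details: "` key with its trailing colon and space. The function name `parse_detection_event` and the returned dictionary keys are hypothetical, and other event types may carry different fields.

```python
import json
import re

# Matches the NVIDIA Xid fragment seen in the sample, e.g.
# "Xid (PCI:0000:b9:00): 71" -> ("PCI:0000:b9:00", "71").
XID_PATTERN = re.compile(r"Xid \(([^)]*)\): (\d+)")


def parse_detection_event(message):
    """Parse one detection log line; return None if it is not a detection event."""
    start = message.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses the JSON object starting at `start` and
        # tolerates any text that follows it.
        record, _ = json.JSONDecoder().raw_decode(message, start)
    except json.JSONDecodeError:
        return None
    if "HealthMonitoringAgentDetectionEvent" not in record:
        return None

    # Assumption: the details key is spelled "details: " as in the sample;
    # fall back to "details" in case the key is ever normalized.
    details = record.get("details: ") or record.get("details") or {}
    if not isinstance(details, dict):
        details = {}

    xid_match = XID_PATTERN.search(details.get("message", ""))
    return {
        "event_type": record["HealthMonitoringAgentDetectionEvent"],
        "severity": details.get("severity"),
        "reason": details.get("reason"),
        "timestamp": details.get("timestamp"),
        "pci_address": xid_match.group(1) if xid_match else None,
        "xid": int(xid_match.group(2)) if xid_match else None,
    }
```

Applied to the sample entry above, this would yield the reason `XidHardwareFailure`, severity `warn`, and Xid code 71 for the device at `PCI:0000:b9:00`.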