SageMaker HyperPod health monitoring agent - Amazon SageMaker


SageMaker HyperPod health monitoring agent

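The SageMaker HyperPod health monitoring agent continuously monitors the health of each GPU-based or Trainium-based instance. When it detects an instance or GPU failure, the agent marks the instance as unhealthy.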

Checking the results collected by the SageMaker HyperPod health monitoring agent

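The SageMaker HyperPod health monitoring agent is an out-of-the-box health check feature that runs continuously on all HyperPod clusters. For GPU or Trn instances, the agent publishes the health events it detects to CloudWatch, under the cluster log group /aws/sagemaker/Clusters/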

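Detection logs from the HyperPod health monitoring agent are written to a separate log stream for each node, named SagemakerHealthMonitoringAgent. You can query the detection logs with CloudWatch Logs Insights as follows.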

fields @timestamp, @message | filter @message like /HealthMonitoringAgentDetectionEvent/
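For scripted checks, the same query could also be submitted through the CloudWatch Logs API rather than the console. The following is a minimal sketch, not an official procedure. It assumes a client object that behaves like boto3.client("logs") (its start_query and get_query_results calls). The function name run_detection_query and its parameters are hypothetical, and the full log group name must be supplied by the caller.

```python
import time

# The Logs Insights query shown above, unchanged.
QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /HealthMonitoringAgentDetectionEvent/"
)


def run_detection_query(logs_client, log_group_name, start_time, end_time,
                        poll_seconds=1.0, max_polls=60):
    """Run the detection query and return each result row as a dict.

    logs_client    -- assumed to behave like boto3.client("logs")
    log_group_name -- full name of the cluster log group (caller-supplied)
    start_time, end_time -- query window in epoch seconds
    """
    query_id = logs_client.start_query(
        logGroupName=log_group_name,
        startTime=int(start_time),
        endTime=int(end_time),
        queryString=QUERY,
    )["queryId"]

    # Logs Insights queries are asynchronous, so poll until one finishes.
    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs.
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)

    raise TimeoutError(f"query {query_id} did not complete in time")
```

With boto3 installed and credentials configured, a call might look like run_detection_query(boto3.client("logs"), your_log_group, time.time() - 3600, time.time()), where your_log_group is the complete log group name for your cluster.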

The query should return output similar to the following.

2024-08-21T11:35:35.532-07:00 {"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z","reason":"XidHardwareFailure","message":"Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
2024-08-21T11:35:35.532-07:00 {"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z","reason":"XidHardwareFailure","message":"Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}
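Each output line pairs a timestamp with a JSON payload, so the fields of interest can be extracted with a few lines of code. The sketch below assumes the exact layout of the sample above, including the "details: " key with its trailing colon and space, and the Xid message format. The function name parse_detection_event is hypothetical, and real events may differ in shape.

```python
import json
import re

# One line copied from the sample output above.
SAMPLE = r'2024-08-21T11:35:35.532-07:00 {"level":"info","ts":"2024-08-21T18:35:35Z","msg":"NPD caught event: %v","details: ":{"severity":"warn","timestamp":"2024-08-22T20:59:29Z","reason":"XidHardwareFailure","message":"Node condition NvidiaErrorReboot is now: True, reason: XidHardwareFailure, message: \"NVRM: Xid (PCI:0000:b9:00): 71, pid=<unknown>, name=<unknown>, NVLink: fatal error detected on link 6(0x10000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)\""},"HealthMonitoringAgentDetectionEvent":"HealthEvent"}'

# Matches "Xid (<pci address>): <code>" as it appears in the sample message.
XID_PATTERN = re.compile(r"Xid \([^)]*\): (\d+)")


def parse_detection_event(line):
    """Parse one '<timestamp> <json>' line; return None if it is not a detection event."""
    logged_at, _, payload = line.partition(" ")
    event = json.loads(payload)
    if "HealthMonitoringAgentDetectionEvent" not in event:
        return None

    # The key is literally "details: " in the sample output.
    details = event.get("details: ", {})
    message = details.get("message", "")
    xid_match = XID_PATTERN.search(message)
    return {
        "logged_at": logged_at,
        "severity": details.get("severity"),
        "reason": details.get("reason"),
        "message": message,
        # NVIDIA Xid code, if the message carries one in the sample's format.
        "xid": int(xid_match.group(1)) if xid_match else None,
    }


if __name__ == "__main__":
    print(parse_detection_event(SAMPLE))
```

Returning None for lines without the HealthMonitoringAgentDetectionEvent key mirrors the filter in the query, so the parser can be applied to unfiltered log lines as well.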