Observability for SageMaker HyperPod clusters orchestrated by Amazon EKS
To achieve comprehensive observability into your SageMaker HyperPod cluster resources and software components, integrate the cluster with Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana.
The integration with Amazon Managed Service for Prometheus exports metrics from your HyperPod cluster resources, giving you insight into their performance, utilization, and health. The integration with Amazon Managed Grafana visualizes these metrics in Grafana dashboards that offer an intuitive interface for monitoring and analyzing the cluster's behavior. Together, these services give you a centralized view of your HyperPod cluster, which supports proactive monitoring, troubleshooting, and optimization of your distributed training workloads.
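Once metrics are flowing into an Amazon Managed Service for Prometheus workspace, they can be read back through the workspace's Prometheus-compatible query API. The following is a minimal sketch, not a definitive implementation, that only builds the instant-query URL. The helper name `amp_query_url`, the Region `us-west-2`, and the workspace ID `ws-EXAMPLE` are hypothetical placeholders, and `up` is the standard Prometheus target-health metric used purely for illustration. Actual requests to this URL must also be signed with AWS Signature Version 4, which this sketch does not do.

```python
from urllib.parse import urlencode


def amp_query_url(region: str, workspace_id: str, promql: str) -> str:
    """Build the instant-query URL for a Prometheus workspace.

    region, workspace_id, and promql are caller-supplied; the values used
    below are placeholders, not real resources.
    """
    base = f"https://aps-workspaces.{region}.amazonaws.com/workspaces/{workspace_id}"
    # urlencode percent-encodes the PromQL expression so it is safe in a URL.
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical Region and workspace ID, for illustration only.
url = amp_query_url("us-west-2", "ws-EXAMPLE", "up")
print(url)
```

Keeping URL construction separate from signing and sending makes it easy to reuse the same helper with whichever signed HTTP client you prefer.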
Tip
To find practical examples and solutions, see also the Observability
Proceed to the following topics to set up observability for your SageMaker HyperPod cluster.