Amazon SageMaker HyperPod Slurm metrics

Amazon SageMaker HyperPod publishes Amazon CloudWatch metrics that you can use to monitor the health and performance of your HyperPod clusters. These metrics are collected from the Slurm workload manager running on each cluster and are available in the /aws/sagemaker/Clusters CloudWatch namespace.

Cluster level metrics

The following cluster-level metrics are available for HyperPod. These metrics use the ClusterId dimension to identify the specific HyperPod cluster.

CloudWatch metric name	Notes	Amazon EKS Container Insights metric name
cluster_node_count	Total number of nodes in the cluster	cluster_node_count
cluster_idle_node_count	Number of idle nodes in the cluster	N/A
cluster_failed_node_count	Number of failed nodes in the cluster	cluster_failed_node_count
cluster_cpu_count	Total CPU cores in the cluster	node_cpu_limit
cluster_idle_cpu_count	Number of idle CPU cores in the cluster	N/A
cluster_gpu_count	Total GPUs in the cluster	node_gpu_limit
cluster_idle_gpu_count	Number of idle GPUs in the cluster	N/A
cluster_running_task_count	Number of running Slurm jobs in the cluster	N/A
cluster_pending_task_count	Number of pending Slurm jobs in the cluster	N/A
cluster_preempted_task_count	Number of preempted Slurm jobs in the cluster	N/A
cluster_avg_task_wait_time	Average wait time for Slurm jobs in the cluster	N/A
cluster_max_task_wait_time	Maximum wait time for Slurm jobs in the cluster	N/A
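
As an illustration of how the namespace and the ClusterId dimension fit together, the following is a minimal sketch of a CloudWatch GetMetricStatistics request for one cluster-level metric. The cluster ID value, the one-hour window, and the 300-second period are assumptions made for the example; this page does not state the resolution at which the metrics are published. The helper names build_metric_request and fetch_datapoints are hypothetical.

```python
from datetime import datetime, timedelta, timezone

NAMESPACE = "/aws/sagemaker/Clusters"  # namespace given on this page


def build_metric_request(metric_name, cluster_id, end_time, hours=1, period_seconds=300):
    """Return keyword arguments for a CloudWatch GetMetricStatistics call."""
    return {
        "Namespace": NAMESPACE,
        "MetricName": metric_name,
        "Dimensions": [{"Name": "ClusterId", "Value": cluster_id}],
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
        "Period": period_seconds,  # assumed granularity, not documented here
        "Statistics": ["Average"],
    }


def fetch_datapoints(request):
    """Send the request with boto3 (requires AWS credentials and network access)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    response = boto3.client("cloudwatch").get_metric_statistics(**request)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


request = build_metric_request(
    "cluster_pending_task_count",
    "example-cluster-id",  # hypothetical placeholder, replace with your cluster ID
    end_time=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
```

Calling fetch_datapoints(request) would return the datapoints in time order. Building the request separately from sending it keeps the query easy to inspect and reuse for other metric names in the table.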

Instance level metrics

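The following instance-level metrics are available for HyperPod. These metrics also use the ClusterId dimension to identify the specific HyperPod cluster, so each value is an average across all instances in that cluster rather than a reading for a single instance.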

CloudWatch metric name	Notes	Amazon EKS Container Insights metric name
node_gpu_utilization	Average GPU utilization across all instances	node_gpu_utilization
node_gpu_memory_utilization	Average GPU memory utilization across all instances	node_gpu_memory_utilization
node_cpu_utilization	Average CPU utilization across all instances	node_cpu_utilization
node_memory_utilization	Average memory utilization across all instances	node_memory_utilization
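
The four instance-level metrics can be requested together. The following is a minimal sketch that builds the MetricDataQueries list for a CloudWatch GetMetricData call, one query per metric in the table. The cluster ID and the 300-second period are assumptions, and the helper name build_metric_data_queries is hypothetical.

```python
NAMESPACE = "/aws/sagemaker/Clusters"  # namespace given on this page

# Metric names taken from the instance-level table.
INSTANCE_METRICS = [
    "node_gpu_utilization",
    "node_gpu_memory_utilization",
    "node_cpu_utilization",
    "node_memory_utilization",
]


def build_metric_data_queries(cluster_id, period_seconds=300):
    """Build one GetMetricData query per instance-level metric."""
    return [
        {
            # Query IDs must start with a lowercase letter; the metric names qualify.
            "Id": name,
            "MetricStat": {
                "Metric": {
                    "Namespace": NAMESPACE,
                    "MetricName": name,
                    "Dimensions": [{"Name": "ClusterId", "Value": cluster_id}],
                },
                "Period": period_seconds,  # assumed granularity, not documented here
                "Stat": "Average",
            },
            "ReturnData": True,
        }
        for name in INSTANCE_METRICS
    ]


# "example-cluster-id" is a hypothetical placeholder.
queries = build_metric_data_queries("example-cluster-id")
```

The resulting list can be passed as MetricDataQueries, together with StartTime and EndTime, to the boto3 CloudWatch client's get_metric_data method.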
