Amazon SageMaker HyperPod Slurm metrics - Amazon SageMaker AI

Amazon SageMaker HyperPod provides a set of Amazon CloudWatch metrics that you can use to monitor the health and performance of your HyperPod clusters. These metrics are collected from the Slurm workload manager running on your HyperPod clusters and are available in the /aws/sagemaker/Clusters CloudWatch namespace.
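To check which of these metrics a given account is actually emitting, one option is to list the metrics in the namespace. The following is a minimal sketch, not an official procedure: the helper names `list_metrics_request` and `list_hyperpod_metric_names` are hypothetical, and the client passed in is assumed to be a configured boto3 CloudWatch client (`boto3.client("cloudwatch")`).

```python
# Namespace as stated in the paragraph above.
NAMESPACE = "/aws/sagemaker/Clusters"


def list_metrics_request(metric_name=None):
    """Build keyword arguments for the CloudWatch ListMetrics operation.

    Hypothetical helper: it only assembles the request, it does not call AWS.
    """
    params = {"Namespace": NAMESPACE}
    if metric_name is not None:
        params["MetricName"] = metric_name
    return params


def list_hyperpod_metric_names(cloudwatch_client):
    """Return the sorted, distinct metric names found in the namespace.

    `cloudwatch_client` is assumed to be a boto3 CloudWatch client.
    """
    names = set()
    paginator = cloudwatch_client.get_paginator("list_metrics")
    for page in paginator.paginate(**list_metrics_request()):
        for metric in page["Metrics"]:
            names.add(metric["MetricName"])
    return sorted(names)
```

With boto3 installed and credentials configured, `list_hyperpod_metric_names(boto3.client("cloudwatch"))` would return the metric names currently visible in the namespace.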

Cluster level metrics

The following cluster-level metrics are available for HyperPod. These metrics use the ClusterId dimension to identify the specific HyperPod cluster.

| CloudWatch metric name | Notes | Amazon EKS Container Insights metric name |
| --- | --- | --- |
| cluster_node_count | Total number of nodes in the cluster | cluster_node_count |
| cluster_idle_node_count | Number of idle nodes in the cluster | N/A |
| cluster_failed_node_count | Number of failed nodes in the cluster | cluster_failed_node_count |
| cluster_cpu_count | Total CPU cores in the cluster | node_cpu_limit |
| cluster_idle_cpu_count | Number of idle CPU cores in the cluster | N/A |
| cluster_gpu_count | Total GPUs in the cluster | node_gpu_limit |
| cluster_idle_gpu_count | Number of idle GPUs in the cluster | N/A |
| cluster_running_task_count | Number of running Slurm jobs in the cluster | N/A |
| cluster_pending_task_count | Number of pending Slurm jobs in the cluster | N/A |
| cluster_preempted_task_count | Number of preempted Slurm jobs in the cluster | N/A |
| cluster_avg_task_wait_time | Average wait time for Slurm jobs in the cluster | N/A |
| cluster_max_task_wait_time | Maximum wait time for Slurm jobs in the cluster | N/A |
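As an illustration of how the ClusterId dimension scopes a query, the sketch below builds a CloudWatch GetMetricStatistics request for one cluster-level metric and derives an idle fraction from a pair of counts. The function names, the 300-second period, the `Average` statistic, and the cluster ID `example-cluster-id` are assumptions made for the example and are not taken from the HyperPod documentation.

```python
from datetime import datetime, timedelta, timezone

# Namespace as stated at the top of this page.
NAMESPACE = "/aws/sagemaker/Clusters"


def cluster_metric_request(cluster_id, metric_name, start, end, period=300):
    """Build keyword arguments for CloudWatch GetMetricStatistics.

    Hypothetical helper: the period (in seconds) and the statistic are
    assumed defaults, not values prescribed by HyperPod.
    """
    return {
        "Namespace": NAMESPACE,
        "MetricName": metric_name,
        "Dimensions": [{"Name": "ClusterId", "Value": cluster_id}],
        "StartTime": start,
        "EndTime": end,
        "Period": period,
        "Statistics": ["Average"],
    }


def idle_fraction(idle_count, total_count):
    """Return idle_count / total_count, or None when the total is zero or missing."""
    if not total_count:
        return None
    return idle_count / total_count


if __name__ == "__main__":
    end = datetime.now(timezone.utc)
    # "example-cluster-id" is a placeholder, not a real cluster ID.
    request = cluster_metric_request(
        "example-cluster-id", "cluster_idle_node_count", end - timedelta(hours=1), end
    )
    print(request)
```

The resulting dictionary could be passed to a boto3 CloudWatch client as `get_metric_statistics(**request)`. A helper like `idle_fraction` might then combine, for example, `cluster_idle_gpu_count` with `cluster_gpu_count` to estimate how much GPU capacity is unused.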

Instance level metrics

The following instance-level metrics are available for HyperPod. These metrics also use the ClusterId dimension to identify the specific HyperPod cluster.

| CloudWatch metric name | Notes | Amazon EKS Container Insights metric name |
| --- | --- | --- |
| node_gpu_utilization | Average GPU utilization across all instances | node_gpu_utilization |
| node_gpu_memory_utilization | Average GPU memory utilization across all instances | node_gpu_memory_utilization |
| node_cpu_utilization | Average CPU utilization across all instances | node_cpu_utilization |
| node_memory_utilization | Average memory utilization across all instances | node_memory_utilization |
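Because all four instance-level metrics share the same namespace and dimension, they can be requested together. The sketch below assembles a `MetricDataQueries` list for the CloudWatch GetMetricData operation. The helper name `instance_metric_queries`, the 300-second period, and the `Average` statistic are assumptions chosen for illustration.

```python
# Namespace as stated at the top of this page.
NAMESPACE = "/aws/sagemaker/Clusters"

# Instance-level metric names from the table above.
INSTANCE_METRICS = (
    "node_gpu_utilization",
    "node_gpu_memory_utilization",
    "node_cpu_utilization",
    "node_memory_utilization",
)


def instance_metric_queries(cluster_id, period=300, stat="Average"):
    """Build one GetMetricData query per instance-level metric.

    Hypothetical helper: each query reuses the metric name as its Id,
    which works here because the names are lowercase letters and underscores.
    """
    return [
        {
            "Id": name,
            "MetricStat": {
                "Metric": {
                    "Namespace": NAMESPACE,
                    "MetricName": name,
                    "Dimensions": [{"Name": "ClusterId", "Value": cluster_id}],
                },
                "Period": period,  # seconds; an assumed granularity
                "Stat": stat,
            },
            "ReturnData": True,
        }
        for name in INSTANCE_METRICS
    ]
```

The list could be passed to a boto3 CloudWatch client as `get_metric_data(MetricDataQueries=queries, StartTime=..., EndTime=...)`, returning one time series per metric in a single call.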