Amazon EKS and Kubernetes Container Insights with enhanced observability metrics - Amazon CloudWatch

The following tables list the metrics and dimensions that Container Insights with enhanced observability collects for Amazon EKS and Kubernetes. These metrics are in the ContainerInsights namespace. For more information, see Metrics.
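As a rough illustration of how these metrics might be queried, the sketch below builds the keyword arguments for a CloudWatch GetMetricStatistics request against the ContainerInsights namespace. The helper name `build_metric_request`, the cluster name `my-cluster`, and the one-hour window are assumptions made for illustration. The commented-out call at the end assumes that boto3 is installed and that credentials are configured.

```python
from datetime import datetime, timedelta, timezone


def build_metric_request(metric_name, dimensions, end_time, hours=1, period=300):
    """Build keyword arguments for a GetMetricStatistics request.

    `dimensions` is a dict such as {"ClusterName": "my-cluster"}. It has to
    match one of the dimension sets listed for the metric.
    """
    return {
        "Namespace": "ContainerInsights",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": name, "Value": value}
            for name, value in sorted(dimensions.items())
        ],
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
        "Period": period,  # seconds per data point
        "Statistics": ["Average"],
    }


# Hypothetical cluster name and timestamp, used only for illustration.
request = build_metric_request(
    "node_cpu_utilization",
    {"ClusterName": "my-cluster"},
    end_time=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   response = boto3.client("cloudwatch").get_metric_statistics(**request)
```

Building the request separately from sending it keeps the dimension handling easy to inspect before any call is made.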

If you do not see any Container Insights with enhanced observability metrics in your console, be sure that you have completed the setup of Container Insights with enhanced observability. Metrics do not appear before Container Insights with enhanced observability has been set up completely. For more information, see Setting up Container Insights.

If you are using version 1.5.0 or later of the Amazon EKS add-on or version 1.300035.0 of the CloudWatch agent, most metrics listed in the following table are collected for both Linux and Windows nodes. The Metric name column of the table indicates which metrics are not collected for Windows.
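The add-on version threshold can be checked with a simple comparison. The following is a minimal sketch: the helper names are hypothetical, and the parser assumes version strings shaped like `v1.5.0-eksbuild.1` or `1.300035.0`.

```python
def parse_version(version):
    """Turn 'v1.5.0-eksbuild.1' or '1.300035.0' into a tuple of ints."""
    # Drop a leading "v" and any build suffix after the first hyphen.
    core = version.lstrip("v").split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def addon_collects_windows_metrics(addon_version):
    """True if the Amazon EKS add-on version is 1.5.0 or later."""
    return parse_version(addon_version) >= (1, 5, 0)
```

Comparing tuples of integers avoids the string-comparison trap where "1.10.0" would sort before "1.5.0".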

With the earlier version of Container Insights, which delivers aggregated metrics at the cluster and service level, metrics are charged as custom metrics. With Container Insights with enhanced observability for Amazon EKS, metrics are charged per observation instead of per metric stored or log ingested. For more information about CloudWatch pricing, see Amazon CloudWatch Pricing.

Note

On Windows, network metrics such as pod_network_rx_bytes and pod_network_tx_bytes are not collected for host process containers.

Metric name Dimensions Description

cluster_failed_node_count

ClusterName

The number of failed worker nodes in the cluster. A node is considered failed if it is suffering from any node conditions. For more information, see Conditions in the Kubernetes documentation.

cluster_node_count

ClusterName

The total number of worker nodes in the cluster.

namespace_number_of_running_pods

Namespace, ClusterName

ClusterName

The number of pods running per namespace in the resource that is specified by the dimensions that you're using.

node_cpu_limit

ClusterName

ClusterName, InstanceId, NodeName

The maximum number of CPU units that can be assigned to a single node in this cluster.

node_cpu_reserved_capacity

NodeName, ClusterName, InstanceId

ClusterName

The percentage of CPU units that are reserved for node components, such as kubelet, kube-proxy, and Docker.

Formula: node_cpu_request / node_cpu_limit

Note

node_cpu_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

node_cpu_usage_total

ClusterName

ClusterName, InstanceId, NodeName

The number of CPU units being used on the nodes in the cluster.

node_cpu_utilization

NodeName, ClusterName, InstanceId

ClusterName

The total percentage of CPU units being used on the nodes in the cluster.

Formula: node_cpu_usage_total / node_cpu_limit

node_filesystem_utilization

NodeName, ClusterName, InstanceId

ClusterName

The total percentage of file system capacity being used on nodes in the cluster.

Formula: node_filesystem_usage / node_filesystem_capacity

Note

node_filesystem_usage and node_filesystem_capacity are not reported directly as metrics, but are fields in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

node_memory_limit

ClusterName

ClusterName, InstanceId, NodeName

The maximum amount of memory, in bytes, that can be assigned to a single node in this cluster.

node_filesystem_inodes

It is not available on Windows.

ClusterName

ClusterName, InstanceId, NodeName

The total number of inodes (used and unused) on a node.

node_filesystem_inodes_free

It is not available on Windows.

ClusterName

ClusterName, InstanceId, NodeName

The number of unused inodes on a node.

node_memory_reserved_capacity

NodeName, ClusterName, InstanceId

ClusterName

The percentage of memory that is reserved on the nodes in the cluster.

Formula: node_memory_request / node_memory_limit

Note

node_memory_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

node_memory_utilization

NodeName, ClusterName, InstanceId

ClusterName

The percentage of memory currently being used by the node or nodes. It is the node memory usage divided by the node memory limit, expressed as a percentage.

Formula: node_memory_working_set / node_memory_limit

node_memory_working_set

ClusterName

ClusterName, InstanceId, NodeName

The amount of memory, in bytes, being used in the working set of the nodes in the cluster.

node_network_total_bytes

NodeName, ClusterName, InstanceId

ClusterName

The total number of bytes per second transmitted and received over the network per node in a cluster.

Formula: node_network_rx_bytes + node_network_tx_bytes

Note

node_network_rx_bytes and node_network_tx_bytes are not reported directly as metrics, but are fields in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

node_number_of_running_containers

NodeName, ClusterName, InstanceId

ClusterName

The number of running containers per node in a cluster.

node_number_of_running_pods

NodeName, ClusterName, InstanceId

ClusterName

The number of running pods per node in a cluster.

node_status_allocatable_pods

ClusterName

ClusterName, InstanceId, NodeName

The number of pods that can be assigned to a node based on its allocatable resources, which is defined as the remainder of a node's capacity after accounting for system daemons reservations and hard eviction thresholds.

node_status_capacity_pods

ClusterName

ClusterName, InstanceId, NodeName

The number of pods that can be assigned to a node based on its capacity.

node_status_condition_ready

ClusterName

ClusterName, InstanceId, NodeName

Indicates whether the node status condition Ready is true for Amazon EC2 nodes.

node_status_condition_memory_pressure

ClusterName

ClusterName, InstanceId, NodeName

Indicates whether the node status condition MemoryPressure is true.

node_status_condition_pid_pressure

ClusterName

ClusterName, InstanceId, NodeName

Indicates whether the node status condition PIDPressure is true.

node_status_condition_disk_pressure

ClusterName

ClusterName, InstanceId, NodeName

Indicates whether the node status condition DiskPressure is true.

node_status_condition_unknown

ClusterName

ClusterName, InstanceId, NodeName

Indicates whether any of the node status conditions are Unknown.

node_interface_network_rx_dropped

ClusterName

ClusterName, InstanceId, NodeName

The number of packets which were received and subsequently dropped by a network interface on the node.

node_interface_network_tx_dropped

ClusterName

ClusterName, InstanceId, NodeName

The number of packets which were due to be transmitted but were dropped by a network interface on the node.

node_diskio_io_service_bytes_total

It is not available on Windows.

ClusterName

ClusterName, InstanceId, NodeName

The total number of bytes transferred by all I/O operations on the node.

node_diskio_io_serviced_total

It is not available on Windows.

ClusterName

ClusterName, InstanceId, NodeName

The total number of I/O operations on the node.

pod_cpu_reserved_capacity

PodName, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, Service

The CPU capacity that is reserved per pod in a cluster.

Formula: pod_cpu_request / node_cpu_limit

Note

pod_cpu_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_cpu_utilization

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

The percentage of CPU units being used by pods.

Formula: pod_cpu_usage_total / node_cpu_limit

pod_cpu_utilization_over_pod_limit

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

The percentage of CPU units being used by pods relative to the pod limit.

Formula: pod_cpu_usage_total / pod_cpu_limit

pod_memory_reserved_capacity

PodName, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, Service

The percentage of memory that is reserved for pods.

Formula: pod_memory_request / node_memory_limit

Note

pod_memory_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_memory_utilization

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

The percentage of memory currently being used by the pod or pods.

Formula: pod_memory_working_set / node_memory_limit

pod_memory_utilization_over_pod_limit

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

The percentage of memory that is being used by pods relative to the pod limit. If any containers in the pod don't have a memory limit defined, this metric doesn't appear.

Formula: pod_memory_working_set / pod_memory_limit

pod_network_rx_bytes

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

The number of bytes per second being received over the network by the pod.

Formula: sum(pod_interface_network_rx_bytes)

Note

pod_interface_network_rx_bytes is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_network_tx_bytes

PodName, Namespace, ClusterName

Namespace, ClusterName

Service, Namespace, ClusterName

ClusterName

ClusterName, Namespace, PodName, FullPodName

The number of bytes per second being transmitted over the network by the pod.

Formula: sum(pod_interface_network_tx_bytes)

Note

pod_interface_network_tx_bytes is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_cpu_request

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

The CPU requests for the pod.

Formula: sum(container_cpu_request)

Note

pod_cpu_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_memory_request

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

The memory requests for the pod.

Formula: sum(container_memory_request)

Note

pod_memory_request is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_cpu_limit

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

The CPU limit defined for the containers in the pod. If any containers in the pod don't have a CPU limit defined, this metric doesn't appear.

Formula: sum(container_cpu_limit)

Note

pod_cpu_limit is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_memory_limit

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

The memory limit defined for the containers in the pod. If any containers in the pod don't have a memory limit defined, this metric doesn't appear.

Formula: sum(container_memory_limit)

Note

pod_memory_limit is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

pod_status_failed

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Indicates that all containers in the pod have terminated, and at least one container has terminated with a non-zero status or was terminated by the system.

pod_status_ready

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Indicates that all containers in the pod are ready, having reached the condition of ContainersReady.

pod_status_running

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Indicates that all containers in the pod are running.

pod_status_scheduled

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Indicates that the pod has been scheduled to a node.

pod_status_unknown

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Indicates that the status of the pod can't be obtained.

pod_status_pending

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Indicates that the pod has been accepted by the cluster but one or more of the containers has not become ready yet.

pod_status_succeeded

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Indicates that all containers in the pod have successfully terminated and will not be restarted.

pod_number_of_containers

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers defined in the pod specification.

pod_number_of_running_containers

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers in the pod which are currently in the Running state.

pod_container_status_terminated

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers in the pod which are in the Terminated state.

pod_container_status_running

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers in the pod which are in the Running state.

pod_container_status_waiting

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers in the pod which are in the Waiting state.

pod_container_status_waiting_reason_crash_loop_back_off

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers in the pod which are pending because of a CrashLoopBackOff error, where a container repeatedly fails to start.

pod_container_status_waiting_reason_create_container_config_error

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers in the pod which are pending with the reason CreateContainerConfigError. This is because of an error while creating the container configuration.

pod_container_status_waiting_reason_create_container_error

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers in the pod which are pending with the reason CreateContainerError because of an error while creating the container.

pod_container_status_waiting_reason_image_pull_error

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers in the pod which are pending because of ErrImagePull, ImagePullBackOff, or InvalidImageName. These situations are because of an error while pulling the container image.

pod_container_status_waiting_reason_oom_killed

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers in the pod which are in the Terminated state because of running out of memory (OOM killed).

pod_container_status_waiting_reason_start_error

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

Reports the number of containers in the pod which are pending with the reason being StartError because of an error while starting the container.

pod_container_status_terminated_reason_oom_killed

ContainerName, FullPodName, PodName, Namespace, ClusterName

ContainerName, PodName, Namespace, ClusterName

ClusterName

Indicates a pod was terminated for exceeding the memory limit. This metric is only displayed when this issue occurs.

pod_interface_network_rx_dropped

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

The number of packets which were received and subsequently dropped by a network interface for the pod.

pod_interface_network_tx_dropped

ClusterName

PodName, Namespace, ClusterName

Namespace, ClusterName, Service

ClusterName, Namespace, PodName, FullPodName

The number of packets which were due to be transmitted but were dropped for the pod.

pod_memory_working_set

ClusterName

ClusterName, Namespace, PodName

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

The memory in bytes that is currently being used by a pod.

pod_cpu_usage_total

ClusterName

ClusterName, Namespace, PodName

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

The number of CPU units used by a pod.

container_cpu_utilization

ClusterName

PodName, Namespace, ClusterName, ContainerName

PodName, Namespace, ClusterName, ContainerName, FullPodName

The percentage of CPU units being used by the container.

Formula: container_cpu_usage_total / node_cpu_limit

Note

container_cpu_utilization is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

container_cpu_utilization_over_container_limit

ClusterName

PodName, Namespace, ClusterName, ContainerName

PodName, Namespace, ClusterName, ContainerName, FullPodName

The percentage of CPU units being used by the container relative to the container limit. If the container doesn't have a CPU limit defined, this metric doesn't appear.

Formula: container_cpu_usage_total / container_cpu_limit

Note

container_cpu_utilization_over_container_limit is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

container_memory_utilization

ClusterName

PodName, Namespace, ClusterName, ContainerName

PodName, Namespace, ClusterName, ContainerName, FullPodName

The percentage of memory units being used by the container.

Formula: container_memory_working_set / node_memory_limit

Note

container_memory_utilization is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

container_memory_utilization_over_container_limit

ClusterName

PodName, Namespace, ClusterName, ContainerName

PodName, Namespace, ClusterName, ContainerName, FullPodName

The percentage of memory units being used by the container relative to the container limit. If the container doesn't have a memory limit defined, this metric doesn't appear.

Formula: container_memory_working_set / container_memory_limit

Note

container_memory_utilization_over_container_limit is not reported directly as a metric, but is a field in performance log events. For more information, see Relevant fields in performance log events for Amazon EKS and Kubernetes.

container_memory_failures_total

It is not available on Windows.

ClusterName

PodName, Namespace, ClusterName, ContainerName

PodName, Namespace, ClusterName, ContainerName, FullPodName

The number of memory allocation failures experienced by the container.

pod_number_of_container_restarts

PodName, Namespace, ClusterName

The total number of container restarts in a pod.

service_number_of_running_pods

Service, Namespace, ClusterName

ClusterName

The number of pods running the service or services in the cluster.

replicas_desired

ClusterName

PodName, Namespace, ClusterName

The number of pods desired for a workload as defined in the workload specification.

replicas_ready

ClusterName

PodName, Namespace, ClusterName

The number of pods for a workload that have reached the ready status.

status_replicas_available

ClusterName

PodName, Namespace, ClusterName

The number of pods for a workload which are available. A pod is available when it has been ready for the minReadySeconds defined in the workload specification.

status_replicas_unavailable

ClusterName

PodName, Namespace, ClusterName

The number of pods for a workload which are unavailable. A pod is available when it has been ready for the minReadySeconds defined in the workload specification. Pods are unavailable if they have not met this criterion.

apiserver_storage_objects

ClusterName

ClusterName, resource

The number of objects stored in etcd at the time of the last check.

apiserver_request_total

ClusterName

ClusterName, code, verb

The total number of API requests to the Kubernetes API server.

apiserver_request_duration_seconds

ClusterName

ClusterName, verb

Response latency for API requests to the Kubernetes API server.

apiserver_admission_controller_admission_duration_seconds

ClusterName

ClusterName, operation

Admission controller latency in seconds. An admission controller is code which intercepts requests to the Kubernetes API server.

rest_client_request_duration_seconds

ClusterName

ClusterName, operation

Response latency experienced by clients calling the Kubernetes API server. This metric is experimental and may change in future releases of Kubernetes.

rest_client_requests_total

ClusterName

ClusterName, code, method

The total number of API requests to the Kubernetes API server made by clients. This metric is experimental and may change in future releases of Kubernetes.

etcd_request_duration_seconds

ClusterName

ClusterName, operation

Response latency of API calls to etcd. This metric is experimental and may change in future releases of Kubernetes.

apiserver_storage_size_bytes

ClusterName

ClusterName, endpoint

Size of the storage database file physically allocated in bytes. This metric is experimental and may change in future releases of Kubernetes.

apiserver_longrunning_requests

ClusterName

ClusterName, resource

The number of active long-running requests to the Kubernetes API server.

apiserver_current_inflight_requests

ClusterName

ClusterName, request_kind

The number of requests that are being processed by the Kubernetes API server.

apiserver_admission_webhook_admission_duration_seconds

ClusterName

ClusterName, name

Admission webhook latency in seconds. Admission webhooks are HTTP callbacks that receive admission requests and do something with them.

apiserver_admission_step_admission_duration_seconds

ClusterName

ClusterName, operation

Admission sub-step latency in seconds.

apiserver_requested_deprecated_apis

ClusterName

ClusterName, group

Number of requests to deprecated APIs on the Kubernetes API server.

apiserver_request_total_5xx

ClusterName

ClusterName, code, verb

Number of requests to the Kubernetes API server which were responded to with a 5XX HTTP response code.

apiserver_storage_list_duration_seconds

ClusterName

ClusterName, resource

Response latency of listing objects from etcd. This metric is experimental and may change in future releases of Kubernetes.

apiserver_flowcontrol_request_concurrency_limit

ClusterName

ClusterName, priority_level

The number of threads used by the currently executing requests in the API Priority and Fairness subsystem.

apiserver_flowcontrol_rejected_requests_total

ClusterName

ClusterName, reason

Number of requests rejected by API Priority and Fairness subsystem. This metric is experimental and may change in future releases of Kubernetes.

apiserver_current_inqueue_requests

ClusterName

ClusterName, request_kind

The number of requests queued by the Kubernetes API server. This metric is experimental and may change in future releases of Kubernetes.
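The formulas in the preceding table can be sketched as plain arithmetic over the fields of a performance log event. The following is a minimal sketch, not the agent's implementation. The event dictionary and its values are made up, the function names are hypothetical, and scaling each ratio by 100 to express it as a percentage is an assumption, because the table gives only the ratio.

```python
def percent(numerator, denominator):
    """Return numerator / denominator as a percentage, or None if undefined."""
    if numerator is None or not denominator:
        return None
    return 100.0 * numerator / denominator


def pod_limit(container_limits):
    """Sum per-container limits; None if any container has no limit defined."""
    if not container_limits or any(limit is None for limit in container_limits):
        return None
    return sum(container_limits)


def node_metrics(event):
    """Derive node-level metrics from a node performance log event (a dict)."""
    return {
        "node_cpu_reserved_capacity": percent(
            event.get("node_cpu_request"), event.get("node_cpu_limit")
        ),
        "node_cpu_utilization": percent(
            event.get("node_cpu_usage_total"), event.get("node_cpu_limit")
        ),
        "node_memory_utilization": percent(
            event.get("node_memory_working_set"), event.get("node_memory_limit")
        ),
        "node_network_total_bytes": (
            event["node_network_rx_bytes"] + event["node_network_tx_bytes"]
        ),
    }


def pod_memory_utilization_over_pod_limit(working_set, container_memory_limits):
    """pod_memory_working_set / pod_memory_limit, or None when no pod limit exists."""
    return percent(working_set, pod_limit(container_memory_limits))


# Made-up field values, used only for illustration.
node_event = {
    "node_cpu_request": 500,
    "node_cpu_usage_total": 1000,
    "node_cpu_limit": 2000,
    "node_memory_working_set": 1000,
    "node_memory_limit": 4000,
    "node_network_rx_bytes": 100,
    "node_network_tx_bytes": 50,
}
derived = node_metrics(node_event)
```

Returning None when a limit is missing mirrors the table's statement that the over-limit metrics don't appear if any container in the pod has no limit defined.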

NVIDIA GPU metrics

Beginning with version 1.300034.0 of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects NVIDIA GPU metrics from EKS workloads by default. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on version v1.3.0-eksbuild.1 or later. For more information, see Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. The NVIDIA GPU metrics that are collected are listed in the table in this section.

For Container Insights to collect NVIDIA GPU metrics, you must meet the following prerequisites:

  • You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version v1.3.0-eksbuild.1 or later.

  • The NVIDIA device plugin for Kubernetes must be installed in the cluster.

  • The NVIDIA container toolkit must be installed on the nodes of the cluster. For example, the Amazon EKS optimized accelerated AMIs are built with the necessary components.

You can opt out of collecting NVIDIA GPU metrics by setting the accelerated_compute_metrics option in the CloudWatch agent configuration file to false. For more information and an example opt-out configuration, see (Optional) Additional configuration.
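As a rough sketch of what an opt-out might look like, the following snippet generates an agent configuration fragment with accelerated_compute_metrics set to false. The placement of the option under `logs.metrics_collected.kubernetes`, next to `enhanced_container_insights`, is an assumption about the configuration layout. Treat the example in (Optional) Additional configuration as the authoritative version.

```python
import json

# Assumed layout of the relevant part of the CloudWatch agent configuration.
agent_config = {
    "logs": {
        "metrics_collected": {
            "kubernetes": {
                "enhanced_container_insights": True,
                # Opt out of NVIDIA GPU metric collection.
                "accelerated_compute_metrics": False,
            }
        }
    }
}

# Serialize to the JSON text that would go into the configuration file.
config_json = json.dumps(agent_config, indent=2)
```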

Metric name Dimensions Description

container_gpu_memory_total

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, GpuDevice

The total frame buffer size, in bytes, on the GPU(s) allocated to the container.

container_gpu_memory_used

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, GpuDevice

The bytes of frame buffer used on the GPU(s) allocated to the container.

container_gpu_memory_utilization

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, GpuDevice

The percentage of frame buffer used on the GPU(s) allocated to the container.

container_gpu_power_draw

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, GpuDevice

The power usage in watts of the GPU(s) allocated to the container.

container_gpu_temperature

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, GpuDevice

The temperature in degrees Celsius of the GPU(s) allocated to the container.

container_gpu_utilization

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, GpuDevice

The percentage utilization of the GPU(s) allocated to the container.

node_gpu_memory_total

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, GpuDevice

The total frame buffer size, in bytes, on the GPU(s) allocated to the node.

node_gpu_memory_used

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, GpuDevice

The bytes of frame buffer used on the GPU(s) allocated to the node.

node_gpu_memory_utilization

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, GpuDevice

The percentage of frame buffer used on the GPU(s) allocated to the node.

node_gpu_power_draw

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, GpuDevice

The power usage in watts of the GPU(s) allocated to the node.

node_gpu_temperature

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, GpuDevice

The temperature in degrees Celsius of the GPU(s) allocated to the node.

node_gpu_utilization

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, InstanceType, NodeName, GpuDevice

The percentage utilization of the GPU(s) allocated to the node.

pod_gpu_memory_total

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, GpuDevice

The total frame buffer size, in bytes, on the GPU(s) allocated to the pod.

pod_gpu_memory_used

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, GpuDevice

The bytes of frame buffer used on the GPU(s) allocated to the pod.

pod_gpu_memory_utilization

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, GpuDevice

The percentage of frame buffer used on the GPU(s) allocated to the pod.

pod_gpu_power_draw

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, GpuDevice

The power usage in watts of the GPU(s) allocated to the pod.

pod_gpu_temperature

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, GpuDevice

The temperature in degrees Celsius of the GPU(s) allocated to the pod.

pod_gpu_utilization

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, GpuDevice

The percentage utilization of the GPU(s) allocated to the pod.
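CloudWatch identifies a metric by its exact combination of dimensions, so a query has to supply one of the dimension sets listed in the table in full. The sketch below checks a set of dimension names against the sets listed for the container_gpu_* metrics. The constant and helper names are hypothetical, and the sample dimension values are made up.

```python
# Dimension sets listed in the table for the container_gpu_* metrics.
CONTAINER_GPU_DIMENSION_SETS = [
    {"ClusterName"},
    {"ClusterName", "Namespace", "PodName", "ContainerName"},
    {"ClusterName", "Namespace", "PodName", "FullPodName", "ContainerName"},
    {"ClusterName", "Namespace", "PodName", "FullPodName", "ContainerName", "GpuDevice"},
]


def is_valid_dimension_set(dimensions, allowed=CONTAINER_GPU_DIMENSION_SETS):
    """True if the supplied dimension names exactly match a listed set."""
    return set(dimensions) in allowed
```

For example, `is_valid_dimension_set({"ClusterName": "my-cluster"})` passes, while a query that supplies only ClusterName and GpuDevice matches no listed set and would return no data points.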

AWS Neuron metrics for AWS Trainium and AWS Inferentia

Beginning with version 1.300036.0 of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects accelerated computing metrics from AWS Trainium and AWS Inferentia accelerators by default. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on version v1.5.0-eksbuild.1 or later. For more information about the add-on, see Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. For more information about AWS Trainium, see AWS Trainium. For more information about AWS Inferentia, see AWS Inferentia.

For Container Insights to collect AWS Neuron metrics, you must meet the following prerequisites:

  • You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version v1.5.0-eksbuild.1 or later.

  • The Neuron driver must be installed on the nodes of the cluster.

  • The Neuron device plugin must be installed on the cluster. For example, the Amazon EKS optimized accelerated AMIs are built with the necessary components.
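
The add-on version prerequisite can be checked programmatically. The following Python sketch assumes that add-on version strings follow the vMAJOR.MINOR.PATCH-eksbuild.N pattern seen in this section; the function names are hypothetical. It compares versions numerically, because a plain string comparison would rank v1.10.0 below v1.5.0.

```python
import re

_VERSION_RE = re.compile(r"v(\d+)\.(\d+)\.(\d+)-eksbuild\.(\d+)")


def parse_addon_version(version):
    """Turn 'v1.5.0-eksbuild.1' into the tuple (1, 5, 0, 1)."""
    match = _VERSION_RE.fullmatch(version)
    if match is None:
        raise ValueError(f"unrecognized add-on version: {version!r}")
    return tuple(int(part) for part in match.groups())


# Minimum add-on version stated in the prerequisites for Neuron metrics.
MIN_NEURON_ADDON_VERSION = "v1.5.0-eksbuild.1"


def meets_minimum(installed, minimum=MIN_NEURON_ADDON_VERSION):
    """Return True if the installed add-on version is at least the minimum."""
    return parse_addon_version(installed) >= parse_addon_version(minimum)
```

For example, meets_minimum("v1.4.9-eksbuild.3") is False, while meets_minimum("v1.10.0-eksbuild.1") is True.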

The metrics that are collected are listed in the table in this section. The metrics are collected for AWS Trainium, AWS Inferentia, and AWS Inferentia2.

The CloudWatch agent collects these metrics from the Neuron monitor and does the necessary Kubernetes resource correlation to deliver metrics at the pod and container levels.

Metric name Dimensions Description

container_neuroncore_utilization

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice, NeuronCore

The NeuronCore utilization during the captured period of the NeuronCore allocated to the container.

Unit: Percent

container_neuroncore_memory_usage_constants

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice, NeuronCore

The amount of device memory used for constants during training (or weights during inference) by the NeuronCore that is allocated to the container.

Unit: Bytes

container_neuroncore_memory_usage_model_code

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice, NeuronCore

The amount of device memory used for the models' executable code by the NeuronCore that is allocated to the container.

Unit: Bytes

container_neuroncore_memory_usage_model_shared_scratchpad

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice, NeuronCore

The amount of device memory used for the scratchpad shared by the models on the NeuronCore that is allocated to the container. This memory region is reserved for the models.

Unit: Bytes

container_neuroncore_memory_usage_runtime_memory

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice, NeuronCore

The amount of device memory used for the Neuron runtime by the NeuronCore allocated to the container.

Unit: Bytes

container_neuroncore_memory_usage_tensors

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice, NeuronCore

The amount of device memory used for tensors by the NeuronCore allocated to the container.

Unit: Bytes

container_neuroncore_memory_usage_total

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice, NeuronCore

The total amount of memory used by the NeuronCore allocated to the container.

Unit: Bytes

container_neurondevice_hw_ecc_events_total

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, NeuronDevice

The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device allocated to the container.

Unit: Count

pod_neuroncore_utilization

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice, NeuronCore

The NeuronCore utilization during the captured period of the NeuronCore allocated to the pod.

Unit: Percent

pod_neuroncore_memory_usage_constants

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice, NeuronCore

The amount of device memory used for constants during training (or weights during inference) by the NeuronCore that is allocated to the pod.

Unit: Bytes

pod_neuroncore_memory_usage_model_code

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice, NeuronCore

The amount of device memory used for the models' executable code by the NeuronCore that is allocated to the pod.

Unit: Bytes

pod_neuroncore_memory_usage_model_shared_scratchpad

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice, NeuronCore

The amount of device memory used for the scratchpad shared by the models on the NeuronCore that is allocated to the pod. This memory region is reserved for the models.

Unit: Bytes

pod_neuroncore_memory_usage_runtime_memory

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice, NeuronCore

The amount of device memory used for the Neuron runtime by the NeuronCore allocated to the pod.

Unit: Bytes

pod_neuroncore_memory_usage_tensors

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice, NeuronCore

The amount of device memory used for tensors by the NeuronCore allocated to the pod.

Unit: Bytes

pod_neuroncore_memory_usage_total

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice, NeuronCore

The total amount of memory used by the NeuronCore allocated to the pod.

Unit: Bytes

pod_neurondevice_hw_ecc_events_total

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, NeuronDevice

The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device allocated to the pod.

Unit: Count

node_neuroncore_utilization

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceType, InstanceId, NodeName, NeuronDevice, NeuronCore

The NeuronCore utilization during the captured period of the NeuronCore allocated to the node.

Unit: Percent

node_neuroncore_memory_usage_constants

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceType, InstanceId, NodeName, NeuronDevice, NeuronCore

The amount of device memory used for constants during training (or weights during inference) by the NeuronCore that is allocated to the node.

Unit: Bytes

node_neuroncore_memory_usage_model_code

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceType, InstanceId, NodeName, NeuronDevice, NeuronCore

The amount of device memory used for the models' executable code by the NeuronCore that is allocated to the node.

Unit: Bytes

node_neuroncore_memory_usage_model_shared_scratchpad

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceType, InstanceId, NodeName, NeuronDevice, NeuronCore

The amount of device memory used for the scratchpad shared by the models on the NeuronCore that is allocated to the node. This memory region is reserved for the models.

Unit: Bytes

node_neuroncore_memory_usage_runtime_memory

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceType, InstanceId, NodeName, NeuronDevice, NeuronCore

The amount of device memory used for the Neuron runtime by the NeuronCore that is allocated to the node.

Unit: Bytes

node_neuroncore_memory_usage_tensors

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceType, InstanceId, NodeName, NeuronDevice, NeuronCore

The amount of device memory used for tensors by the NeuronCore that is allocated to the node.

Unit: Bytes

node_neuroncore_memory_usage_total

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceType, InstanceId, NodeName, NeuronDevice, NeuronCore

The total amount of memory used by the NeuronCore that is allocated to the node.

Unit: Bytes

node_neuron_execution_errors_total

ClusterName

ClusterName, InstanceId, NodeName

The total number of execution errors on the node. The CloudWatch agent calculates this value by aggregating errors of the following types: generic, numerical, transient, model, runtime, and hardware.

Unit: Count

node_neurondevice_runtime_memory_used_bytes

ClusterName

ClusterName, InstanceId, NodeName

The total Neuron device memory usage in bytes on the node.

Unit: Bytes

node_neuron_execution_latency

ClusterName

ClusterName, InstanceId, NodeName

The latency, in seconds, of an execution on the node, as measured by the Neuron runtime.

Unit: Seconds

node_neurondevice_hw_ecc_events_total

ClusterName

ClusterName, InstanceId, NodeName

ClusterName, InstanceId, NodeName, NeuronDevice

The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device on the node.

Unit: Count
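
The description of node_neuron_execution_errors_total says that the value is an aggregate over six error types. The following Python sketch shows that kind of aggregation under stated assumptions: the input dictionary shape and the function name are hypothetical, and this is not the CloudWatch agent's actual implementation.

```python
# Error types named in the node_neuron_execution_errors_total description.
ERROR_TYPES = ("generic", "numerical", "transient", "model", "runtime", "hardware")


def total_execution_errors(counts_by_type):
    """Sum per-type error counts into one total.

    Types missing from the input count as zero. Keys outside ERROR_TYPES are
    ignored so that unrelated counters do not inflate the total.
    """
    return sum(counts_by_type.get(error_type, 0) for error_type in ERROR_TYPES)


# Hypothetical per-type counts for one node.
sample_counts = {"generic": 2, "transient": 1, "runtime": 3, "hardware": 0}
node_total = total_execution_errors(sample_counts)
```

Because only the total is published as a metric, the per-type breakdown is not available from this metric alone.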

AWS Elastic Fabric Adapter (EFA) metrics

Beginning with version 1.300037.0 of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects AWS Elastic Fabric Adapter (EFA) metrics from Amazon EKS clusters on Linux instances. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on version v1.5.2-eksbuild.1 or later. For more information about the add-on, see Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. For more information about AWS Elastic Fabric Adapter, see Elastic Fabric Adapter.

For Container Insights to collect AWS Elastic Fabric Adapter metrics, you must meet the following prerequisites:

  • You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version v1.5.2-eksbuild.1 or later.

  • The EFA device plugin must be installed on the cluster. For more information, see aws-efa-k8s-device-plugin on GitHub.

The metrics that are collected are listed in the following table.

Metric name Dimensions Description

container_efa_rx_bytes

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, EfaDevice

The number of bytes per second received by the EFA device(s) allocated to the container.

Unit: Bytes/Second

container_efa_tx_bytes

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, EfaDevice

The number of bytes per second transmitted by the EFA device(s) allocated to the container.

Unit: Bytes/Second

container_efa_rx_dropped

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, EfaDevice

The number of packets that were received and then dropped by the EFA device(s) allocated to the container.

Unit: Count/Second

container_efa_rdma_read_bytes

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, EfaDevice

The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the container.

Unit: Bytes/Second

container_efa_rdma_write_bytes

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, EfaDevice

The number of bytes per second transmitted using remote direct memory access write operations by the EFA device(s) allocated to the container.

Unit: Bytes/Second

container_efa_rdma_write_recv_bytes

ClusterName

ClusterName, Namespace, PodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName

ClusterName, Namespace, PodName, FullPodName, ContainerName, EfaDevice

The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the container.

Unit: Bytes/Second

pod_efa_rx_bytes

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, EfaDevice

The number of bytes per second received by the EFA device(s) allocated to the pod.

Unit: Bytes/Second

pod_efa_tx_bytes

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, EfaDevice

The number of bytes per second transmitted by the EFA device(s) allocated to the pod.

Unit: Bytes/Second

pod_efa_rx_dropped

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, EfaDevice

The number of packets that were received and then dropped by the EFA device(s) allocated to the pod.

Unit: Count/Second

pod_efa_rdma_read_bytes

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, EfaDevice

The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the pod.

Unit: Bytes/Second

pod_efa_rdma_write_bytes

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, EfaDevice

The number of bytes per second transmitted using remote direct memory access write operations by the EFA device(s) allocated to the pod.

Unit: Bytes/Second

pod_efa_rdma_write_recv_bytes

ClusterName

ClusterName, Namespace

ClusterName, Namespace, Service

ClusterName, Namespace, PodName

ClusterName, Namespace, PodName, FullPodName

ClusterName, Namespace, PodName, FullPodName, EfaDevice

The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the pod.

Unit: Bytes/Second

node_efa_rx_bytes

ClusterName

ClusterName, InstanceId, NodeName

The number of bytes per second received by the EFA device(s) allocated to the node.

Unit: Bytes/Second

node_efa_tx_bytes

ClusterName

ClusterName, InstanceId, NodeName

The number of bytes per second transmitted by the EFA device(s) allocated to the node.

Unit: Bytes/Second

node_efa_rx_dropped

ClusterName

ClusterName, InstanceId, NodeName

The number of packets that were received and then dropped by the EFA device(s) allocated to the node.

Unit: Count/Second

node_efa_rdma_read_bytes

ClusterName

ClusterName, InstanceId, NodeName

The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the node.

Unit: Bytes/Second

node_efa_rdma_write_bytes

ClusterName

ClusterName, InstanceId, NodeName

The number of bytes per second transmitted using remote direct memory access write operations by the EFA device(s) allocated to the node.

Unit: Bytes/Second

node_efa_rdma_write_recv_bytes

ClusterName

ClusterName, InstanceId, NodeName

The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the node.

Unit: Bytes/Second
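
The EFA byte metrics are reported in Bytes/Second, while network link capacity is usually quoted in gigabits per second. The following Python sketch converts a pair of receive and transmit datapoints to Gbps. The function names and sample values are hypothetical, and the sketch uses decimal gigabits (1 Gb = 10^9 bits).

```python
def bytes_per_second_to_gbps(bytes_per_second):
    """Convert a Bytes/Second value to gigabits per second (decimal)."""
    return bytes_per_second * 8 / 1e9


def efa_throughput_gbps(rx_bytes_per_second, tx_bytes_per_second):
    """Summarize receive, transmit, and combined throughput in Gbps.

    The inputs correspond to datapoints of an *_efa_rx_bytes metric and an
    *_efa_tx_bytes metric taken over the same period.
    """
    rx = bytes_per_second_to_gbps(rx_bytes_per_second)
    tx = bytes_per_second_to_gbps(tx_bytes_per_second)
    return {"rx_gbps": rx, "tx_gbps": tx, "total_gbps": rx + tx}


# Hypothetical datapoints for node_efa_rx_bytes and node_efa_tx_bytes.
summary = efa_throughput_gbps(1.25e9, 2.5e9)
```

The same conversion applies to the RDMA byte metrics, which use the same unit.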

Amazon SageMaker AI HyperPod metrics

Beginning with version v2.0.1-eksbuild.1 of the CloudWatch Observability EKS add-on, Container Insights with enhanced observability for Amazon EKS automatically collects Amazon SageMaker AI HyperPod metrics from Amazon EKS clusters. For more information about the add-on, see Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. For more information about Amazon SageMaker AI HyperPod, see Amazon SageMaker AI HyperPod.

The metrics that are collected are listed in the following table.

Metric name Dimensions Description

hyperpod_node_health_status_unschedulable

ClusterName

ClusterName, InstanceId, NodeName

Indicates if a node is labeled as Unschedulable by Amazon SageMaker AI HyperPod. This means that the node is running deep health checks and is not available for running workloads.

Unit: Count

hyperpod_node_health_status_schedulable

ClusterName

ClusterName, InstanceId, NodeName

Indicates if a node is labeled as Schedulable by Amazon SageMaker AI HyperPod. This means that the node has passed basic health checks or deep health checks and is available for running workloads.

Unit: Count

hyperpod_node_health_status_unschedulable_pending_replacement

ClusterName

ClusterName, InstanceId, NodeName

Indicates if a node is labeled as UnschedulablePendingReplacement by Amazon SageMaker AI HyperPod. This means that the node has failed deep health checks or health monitoring agent checks and requires a replacement.

If automatic node recovery is enabled, the node will be automatically replaced by Amazon SageMaker AI HyperPod.

Unit: Count

hyperpod_node_health_status_unschedulable_pending_reboot

ClusterName

ClusterName, InstanceId, NodeName

Indicates if a node is labeled as UnschedulablePendingReboot by Amazon SageMaker AI HyperPod. This means that the node is running deep health checks and requires a reboot.

If automatic node recovery is enabled, the node will be automatically rebooted by Amazon SageMaker AI HyperPod.

Unit: Count
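
The four HyperPod health status metrics can be combined into a single label per node. The following Python sketch assumes that a datapoint value of 1 or more means the corresponding label currently applies to the node; the table states only that each metric indicates the label, so treat that threshold as an assumption. The function name and input shape are hypothetical.

```python
# Metric names from the table, mapped to the label each one indicates.
STATUS_METRICS = {
    "hyperpod_node_health_status_schedulable": "Schedulable",
    "hyperpod_node_health_status_unschedulable": "Unschedulable",
    "hyperpod_node_health_status_unschedulable_pending_replacement":
        "UnschedulablePendingReplacement",
    "hyperpod_node_health_status_unschedulable_pending_reboot":
        "UnschedulablePendingReboot",
}


def node_health_label(latest_values):
    """Return the single health label indicated by a node's latest datapoints.

    `latest_values` maps metric names to their most recent value for one node.
    Returns None when zero or several labels appear active, because the data
    is then ambiguous (ASSUMPTION: a value >= 1 marks an active label).
    """
    active = [
        label
        for metric, label in STATUS_METRICS.items()
        if latest_values.get(metric, 0) >= 1
    ]
    return active[0] if len(active) == 1 else None


# Hypothetical latest datapoints for one node.
label = node_health_label({
    "hyperpod_node_health_status_schedulable": 0,
    "hyperpod_node_health_status_unschedulable_pending_reboot": 1,
})
```

Returning None for ambiguous input keeps a dashboard or alarm from reporting a definite state when datapoints are missing or overlap.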