Amazon EKS and Kubernetes Container Insights with enhanced observability metrics
The following tables list the metrics and dimensions that Container Insights with
enhanced observability collects for Amazon EKS and Kubernetes. These metrics are in the
ContainerInsights
namespace. For more information, see Metrics.
If you do not see any Container Insights with enhanced observability metrics in your console, be sure that you have completed the setup of Container Insights with enhanced observability. Metrics do not appear before Container Insights with enhanced observability has been set up completely. For more information, see Setting up Container Insights.
If you are using version 1.5.0 or later of the Amazon EKS add-on or version 1.300035.0 of the CloudWatch agent, most metrics listed in the following table are collected for both Linux and Windows nodes. The Metric name column of the table notes which metrics are not collected for Windows.
With the earlier version of Container Insights, which delivers aggregated metrics at the Cluster and Service level, the metrics are charged as custom metrics. With Container Insights with enhanced observability for Amazon EKS, Container Insights metrics are charged per observation instead of per metric stored or log ingested. For more information about CloudWatch pricing, see Amazon CloudWatch Pricing.
Note
On Windows, network metrics such as pod_network_rx_bytes
and pod_network_tx_bytes
are not collected for host process containers.
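Once Container Insights with enhanced observability is set up, the metrics in the tables below can be retrieved programmatically through the CloudWatch GetMetricData API. The helper below is an illustrative sketch, not part of any AWS SDK; the metric name and dimensions are examples and must match what your cluster actually emits:

```python
# Build one entry for the MetricDataQueries parameter of the CloudWatch
# GetMetricData API. Illustrative helper only -- not part of any AWS SDK.
def container_insights_query(query_id, metric_name, dimensions,
                             stat="Average", period=300):
    """Return a MetricDataQuery for a metric in the ContainerInsights namespace."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "ContainerInsights",
                "MetricName": metric_name,
                # CloudWatch expects dimensions as a list of Name/Value pairs.
                "Dimensions": [{"Name": k, "Value": v}
                               for k, v in dimensions.items()],
            },
            "Period": period,
            "Stat": stat,
        },
        "ReturnData": True,
    }

# Example: average pod_network_rx_bytes for a cluster. The cluster name is a
# placeholder; you would pass [query] to a CloudWatch client's get_metric_data.
query = container_insights_query(
    "rx", "pod_network_rx_bytes", {"ClusterName": "my-cluster"}
)
print(query["MetricStat"]["Metric"]["Namespace"])  # → ContainerInsights
```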
Metric name | Dimensions | Description |
---|---|---|
|
|
The number of failed worker nodes in the cluster. A node is considered failed if it is suffering from any node conditions. For more information, see Conditions. |
|
|
The total number of worker nodes in the cluster. |
|
|
The number of pods running per namespace in the resource that is specified by the dimensions that you're using. |
|
|
The maximum number of CPU units that can be assigned to a single node in this cluster. |
|
|
The percentage of CPU units that are reserved for node components, such as kubelet, kube-proxy, and Docker. Formula: Note
|
|
|
The number of CPU units being used on the nodes in the cluster. |
|
|
The total percentage of CPU units being used on the nodes in the cluster. Formula: |
|
|
The total percentage of file system capacity being used on nodes in the cluster. Formula: Note
|
|
|
The maximum amount of memory, in bytes, that can be assigned to a single node in this cluster. |
It is not available on Windows. |
|
The total number of inodes (used and unused) on a node. |
It is not available on Windows. |
|
The number of unused inodes on a node. |
|
|
The percentage of memory currently being used on the nodes in the cluster. Formula: Note
|
|
|
The percentage of memory currently being used by the node or nodes. It is the percentage of node memory usage divided by the node memory limitation. Formula: |
|
|
The amount of memory, in bytes, being used in the working set of the nodes in the cluster. |
|
|
The total number of bytes per second transmitted and received over the network per node in a cluster. Formula: Note
|
|
|
The number of running containers per node in a cluster. |
|
|
The number of running pods per node in a cluster. |
|
|
The number of pods that can be assigned to a node based on its allocatable resources, which is defined as the remainder of a node's capacity after accounting for system daemons reservations and hard eviction thresholds. |
|
|
The number of pods that can be assigned to a node based on its capacity. |
|
|
Indicates whether the node status condition |
|
|
Indicates whether the node status condition |
|
|
Indicates whether the node status condition |
|
|
Indicates whether the node status condition |
|
|
Indicates whether any of the node status conditions are Unknown. |
|
|
The number of packets which were received and subsequently dropped by a network interface on the node. |
|
|
The number of packets which were due to be transmitted but were dropped by a network interface on the node. |
It is not available on Windows. |
|
The total number of bytes transferred by all I/O operations on the node. |
It is not available on Windows. |
|
The total number of I/O operations on the node. |
|
|
The CPU capacity that is reserved per pod in a cluster. Formula: Note
|
|
|
The percentage of CPU units being used by pods. Formula: |
|
|
The percentage of CPU units being used by pods relative to the pod limit. Formula: |
|
|
The percentage of memory that is reserved for pods. Formula: Note
|
|
|
The percentage of memory currently being used by the pod or pods. Formula: |
|
|
The percentage of memory that is being used by pods relative to the pod limit. If any containers in the pod don't have a memory limit defined, this metric doesn't appear. Formula: |
|
|
The number of bytes per second being received over the network by the pod. Formula: Note
|
|
|
The number of bytes per second being transmitted over the network by the pod. Formula: Note
|
|
|
The CPU requests for the pod. Formula: Note
|
|
|
The memory requests for the pod. Formula: Note
|
|
|
The CPU limit defined for the containers in the pod. If any containers in the pod don't have a CPU limit defined, this metric doesn't appear. Formula: Note
|
|
|
The memory limit defined for the containers in the pod. If any containers in the pod don't have a memory limit defined, this metric doesn't appear. Formula: Note
|
|
|
Indicates that all containers in the pod have terminated, and at least one container has terminated with a non-zero status or was terminated by the system. |
|
|
Indicates that all containers in the pod are ready, having reached the
condition of |
|
|
Indicates that all containers in the pod are running. |
|
|
Indicates that the pod has been scheduled to a node. |
|
|
Indicates that the status of the pod can't be obtained. |
|
|
Indicates that the pod has been accepted by the cluster but one or more of the containers has not become ready yet. |
|
|
Indicates that all containers in the pod have successfully terminated and will not be restarted. |
|
|
Reports the number of containers defined in the pod specification. |
|
|
Reports the number of containers in the pod which are currently in the
|
|
|
Reports the number of containers in the pod which are in the
|
|
|
Reports the number of containers in the pod which are in the
|
|
|
Reports the number of containers in the pod which are in the
|
|
|
Reports the number of containers in the pod which are pending because of a
|
|
|
Reports the number of containers in the pod which are pending with the reason
|
|
|
Reports the number of containers in the pod which are pending with the reason
|
|
|
Reports the number of containers in the pod which are pending because of
|
|
|
Reports the number of containers in the pod which are in the
|
|
|
Reports the number of containers in the pod which are pending with the reason
being |
|
|
Indicates a pod was terminated for exceeding the memory limit. This metric is only displayed when this issue occurs. |
|
|
The number of packets which were received and subsequently dropped by a network interface for the pod. |
|
|
The number of packets which were due to be transmitted but were dropped for the pod. |
|
|
The memory in bytes that is currently being used by a pod. |
|
|
The number of CPU units used by a pod. |
|
|
The percentage of CPU units being used by the container. Formula: Note
|
|
|
The percentage of CPU units being used by the container relative to the container limit. If the container doesn't have a CPU limit defined, this metric doesn't appear. Formula: Note
|
|
|
The percentage of memory units being used by the container. Formula: Note
|
|
|
The percentage of memory units being used by the container relative to the container limit. If the container doesn't have a memory limit defined, this metric doesn't appear. Formula: Note
|
It is not available on Windows. |
|
The number of memory allocation failures experienced by the container. |
|
PodName, |
The total number of container restarts in a pod. |
|
Service,
|
The number of pods running the service or services in the cluster. |
|
|
The number of pods desired for a workload as defined in the workload specification. |
|
|
The number of pods for a workload that have reached the ready status. |
|
|
The number of pods for a workload which are available. A pod is available when
it has been ready for the |
|
|
The number of pods for a workload which are unavailable. A pod is available
when it has been ready for the |
|
|
The number of objects stored in etcd at the time of the last check. |
|
|
The total number of API requests to the Kubernetes API server. |
|
|
Response latency for API requests to the Kubernetes API server. |
|
|
Admission controller latency in seconds. An admission controller is code which intercepts requests to the Kubernetes API server. |
|
|
Response latency experienced by clients calling the Kubernetes API server. This metric is experimental and may change in future releases of Kubernetes. |
|
|
The total number of API requests to the Kubernetes API server made by clients. This metric is experimental and may change in future releases of Kubernetes. |
|
|
Response latency of API calls to etcd. This metric is experimental and may change in future releases of Kubernetes. |
|
|
Size of the storage database file physically allocated in bytes. This metric is experimental and may change in future releases of Kubernetes. |
|
|
The number of active long-running requests to the Kubernetes API server. |
|
|
The number of requests that are being processed by the Kubernetes API server. |
|
|
Admission webhook latency in seconds. Admission webhooks are HTTP callbacks that receive admission requests and do something with them. |
|
|
Admission sub-step latency in seconds. |
|
|
Number of requests to deprecated APIs on the Kubernetes API server. |
|
|
Number of requests to the Kubernetes API server which were responded to with a 5XX HTTP response code. |
|
|
Response latency of listing objects from etcd. This metric is experimental and may change in future releases of Kubernetes. |
|
|
The number of threads used by the currently executing requests in the API Priority and Fairness subsystem. |
|
|
Number of requests rejected by API Priority and Fairness subsystem. This metric is experimental and may change in future releases of Kubernetes. |
|
|
The number of requests queued by the Kubernetes API server. This metric is experimental and may change in future releases of Kubernetes. |
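Several of the percentage metrics above follow the same usage-over-capacity pattern, such as the node CPU utilization and the pod-memory-usage-over-pod-limit percentages described earlier. The function below is a rough illustration of that pattern only, not the CloudWatch agent's actual implementation:

```python
def utilization_percent(used, capacity):
    """Usage-over-capacity pattern behind the percentage metrics above
    (illustrative only -- not the CloudWatch agent's implementation)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used / capacity

# Example: a node using 1500 millicores of CPU against a 4000-millicore
# (4 vCPU) limit is at 37.5% utilization.
print(utilization_percent(1500, 4000))  # → 37.5
```

The same arithmetic applies whether the denominator is a node's capacity, a pod's reserved capacity, or a pod's configured limit; metrics that divide by a limit simply don't appear when no limit is defined, as noted in the table.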
NVIDIA GPU metrics
Beginning with version 1.300034.0
of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects NVIDIA GPU metrics
from EKS workloads by default. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on
version v1.3.0-eksbuild.1
or later. For more information, see
Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. The NVIDIA GPU metrics that are collected are listed in the table in this section.
For Container Insights to collect NVIDIA GPU metrics, you must meet the following prerequisites:
You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version v1.3.0-eksbuild.1 or later. The NVIDIA device plugin for Kubernetes must be installed in the cluster. The NVIDIA container toolkit must be installed on the nodes of the cluster. For example, the Amazon EKS optimized accelerated AMIs are built with the necessary components.
You can opt out of collecting NVIDIA GPU metrics by setting the accelerated_compute_metrics option in the CloudWatch agent configuration file to false. For more information and an example opt-out configuration, see
(Optional)
Additional configuration.
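As a sketch of what that opt-out could look like, the relevant fragment of the CloudWatch agent configuration file might resemble the following; verify the exact structure against the agent configuration reference for your version:

```json
{
  "logs": {
    "metrics_collected": {
      "kubernetes": {
        "accelerated_compute_metrics": false
      }
    }
  }
}
```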
Metric name | Dimensions | Description |
---|---|---|
|
|
The total frame buffer size, in bytes, on the GPU(s) allocated to the container. |
|
|
The bytes of frame buffer used on the GPU(s) allocated to the container. |
|
|
The percentage of frame buffer used of the GPU(s) allocated to the container. |
|
|
The power usage in watts of the GPU(s) allocated to the container. |
|
|
The temperature in degrees Celsius of the GPU(s) allocated to the container. |
|
|
The percentage utilization of the GPU(s) allocated to the container. |
|
|
The total frame buffer size, in bytes, on the GPU(s) allocated to the node. |
|
|
The bytes of frame buffer used on the GPU(s) allocated to the node. |
|
|
The percentage of frame buffer used on the GPU(s) allocated to the node. |
|
|
The power usage in watts of the GPU(s) allocated to the node. |
|
|
The temperature in degrees Celsius of the GPU(s) allocated to the node. |
|
|
The percentage utilization of the GPU(s) allocated to the node. |
|
|
The total frame buffer size, in bytes, on the GPU(s) allocated to the pod. |
|
|
The bytes of frame buffer used on the GPU(s) allocated to the pod. |
|
|
The percentage of frame buffer used of the GPU(s) allocated to the pod. |
|
|
The power usage in watts of the GPU(s) allocated to the pod. |
|
|
The temperature in degrees Celsius of the GPU(s) allocated to the pod. |
|
|
The percentage utilization of the GPU(s) allocated to the pod. |
AWS Neuron metrics for AWS Trainium and AWS Inferentia
Beginning with version 1.300036.0
of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects
accelerated computing metrics from AWS Trainium and AWS Inferentia accelerators by default. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on
version v1.5.0-eksbuild.1
or later. For more information about the add-on, see
Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. For more information about AWS Trainium, see AWS Trainium.
For Container Insights to collect AWS Neuron metrics, you must meet the following prerequisites:
You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version v1.5.0-eksbuild.1 or later. The Neuron driver must be installed on the nodes of the cluster. The Neuron device plugin must be installed on the cluster. For example, the Amazon EKS optimized accelerated AMIs are built with the necessary components.
The metrics that are collected are listed in the table in this section. The metrics are collected for AWS Trainium, AWS Inferentia, and AWS Inferentia2.
The CloudWatch agent collects these metrics from the Neuron monitor.
Metric name | Dimensions | Description |
---|---|---|
|
|
NeuronCore utilization, during the captured period of the NeuronCore allocated to the container. Unit: Percent |
|
|
The amount of device memory used for constants during training by the NeuronCore that is allocated to the container (or weights during inference). Unit: Bytes |
|
|
The amount of device memory used for the models' executable code by the NeuronCore that is allocated to the container. Unit: Bytes |
|
|
The amount of device memory used for the shared scratchpad of the models by the NeuronCore that is allocated to the container. This memory region is reserved for the models. Unit: Bytes |
|
|
The amount of device memory used for the Neuron runtime by the NeuronCore allocated to the container. Unit: Bytes |
|
|
The amount of device memory used for tensors by the NeuronCore allocated to the container. Unit: Bytes |
|
|
The total amount of memory used by the NeuronCore allocated to the container. Unit: Bytes |
|
|
The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device on the node. Unit: Count |
|
|
The NeuronCore utilization during the captured period of the NeuronCore allocated to the pod. Unit: Percent |
|
|
The amount of device memory used for constants during training by the NeuronCore that is allocated to the pod (or weights during inference). Unit: Bytes |
|
|
The amount of device memory used for the models' executable code by the NeuronCore that is allocated to the pod. Unit: Bytes |
|
|
The amount of device memory used for the shared scratchpad of the models by the NeuronCore that is allocated to the pod. This memory region is reserved for the models. Unit: Bytes |
|
|
The amount of device memory used for the Neuron runtime by the NeuronCore allocated to the pod. Unit: Bytes |
|
|
The amount of device memory used for tensors by the NeuronCore allocated to the pod. Unit: Bytes |
|
|
The total amount of memory used by the NeuronCore allocated to the pod. Unit: Bytes |
|
|
The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device allocated to a pod. Unit: Count |
|
|
The NeuronCore utilization during the captured period of the NeuronCore allocated to the node. Unit: Percent |
|
|
The amount of device memory used for constants during training by the NeuronCore that is allocated to the node (or weights during inference). Unit: Bytes |
|
|
The amount of device memory used for models' executable code by the NeuronCore that is allocated to the node. Unit: Bytes |
|
|
The amount of device memory used for the shared scratchpad of the models by the NeuronCore that is allocated to the node. This is a memory region reserved for the models. Unit: Bytes |
|
|
The amount of device memory used for the Neuron runtime by the NeuronCore that is allocated to the node. Unit: Bytes |
|
|
The amount of device memory used for tensors by the NeuronCore that is allocated to the node. Unit: Bytes |
|
|
The total amount of memory used by the NeuronCore that is allocated to the node. Unit: Bytes |
|
|
The total number of execution errors on the node. This is calculated by the CloudWatch agent by aggregating
the errors of the following types: Unit: Count |
|
|
The total Neuron device memory usage in bytes on the node. Unit: Bytes |
|
|
In seconds, the latency for an execution on the node as measured by the Neuron runtime. Unit: Seconds |
|
|
The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device on the node. Unit: Count |
AWS Elastic Fabric Adapter (EFA) metrics
Beginning with version 1.300037.0
of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects
AWS Elastic Fabric Adapter (EFA) metrics from Amazon EKS clusters on Linux instances. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on
version v1.5.2-eksbuild.1
or later. For more information about the add-on, see
Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. For more information about AWS Elastic Fabric Adapter, see Elastic Fabric Adapter.
For Container Insights to collect AWS Elastic Fabric Adapter metrics, you must meet the following prerequisites:
You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version v1.5.2-eksbuild.1 or later. The EFA device plugin must be installed on the cluster. For more information, see aws-efa-k8s-device-plugin on GitHub.
The metrics that are collected are listed in the following table.
Metric name | Dimensions | Description |
---|---|---|
|
|
The number of bytes per second received by the EFA device(s) allocated to the container. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted by the EFA device(s) allocated to the container. Unit: Bytes/Second |
|
|
The number of packets that were received and then dropped by the EFA device(s) allocated to the container. Unit: Count/Second |
|
|
The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the container. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted using remote direct memory access read operations by the EFA device(s) allocated to the container. Unit: Bytes/Second |
|
|
The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the container. Unit: Bytes/Second |
|
|
The number of bytes per second received by the EFA device(s) allocated to the pod. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted by the EFA device(s) allocated to the pod. Unit: Bytes/Second |
|
|
The number of packets that were received and then dropped by the EFA device(s) allocated to the pod. Unit: Count/Second |
|
|
The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the pod. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted using remote direct memory access read operations by the EFA device(s) allocated to the pod. Unit: Bytes/Second |
|
|
The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the pod. Unit: Bytes/Second |
|
|
The number of bytes per second received by the EFA device(s) allocated to the node. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted by the EFA device(s) allocated to the node. Unit: Bytes/Second |
|
|
The number of packets that were received and then dropped by the EFA device(s) allocated to the node. Unit: Count/Second |
|
|
The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the node. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted using remote direct memory access read operations by the EFA device(s) allocated to the node. Unit: Bytes/Second |
|
|
The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the node. Unit: Bytes/Second |
Amazon SageMaker AI HyperPod metrics
Beginning with version v2.0.1-eksbuild.1 of the CloudWatch Observability EKS add-on, Container Insights with enhanced observability for Amazon EKS automatically collects
Amazon SageMaker AI HyperPod metrics from Amazon EKS clusters. For more information about the add-on, see
Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. For more information about Amazon SageMaker AI HyperPod,
see Amazon SageMaker AI HyperPod.
The metrics that are collected are listed in the following table.
Metric name | Dimensions | Description |
---|---|---|
|
|
Indicates if a node is labeled as Unit: Count |
|
|
Indicates if a node is labeled as Unit: Count |
|
|
Indicates if a node is labeled as If automatic node recovery is enabled, the node will be automatically replaced by Amazon SageMaker AI HyperPod. Unit: Count |
|
|
Indicates if a node is labeled as If automatic node recovery is enabled, the node will be automatically rebooted by Amazon SageMaker AI HyperPod. Unit: Count |
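Because these node-health metrics indicate conditions that trigger automatic node recovery, you might want a CloudWatch alarm on them. The sketch below builds parameters for the PutMetricAlarm API; the metric name used here is a placeholder, not a real metric name, so substitute the actual name shown for your cluster in the CloudWatch console:

```python
# Illustrative PutMetricAlarm parameters for a HyperPod node-health metric.
# "hyperpod_unschedulable_placeholder" is NOT a real metric name -- replace it
# with the actual metric name from the ContainerInsights namespace.
def hyperpod_health_alarm(cluster_name, metric_name):
    """Alarm whenever any node reports the unhealthy label during a period."""
    return {
        "AlarmName": f"{cluster_name}-hyperpod-node-health",
        "Namespace": "ContainerInsights",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
    }

# You would pass these as keyword arguments to a CloudWatch client's
# put_metric_alarm call.
params = hyperpod_health_alarm("my-cluster", "hyperpod_unschedulable_placeholder")
```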