Amazon EKS and Kubernetes Container Insights metrics
The following tables list the metrics and dimensions that Container Insights collects
for Amazon EKS and Kubernetes. These metrics are in the ContainerInsights
namespace.
For more information, see Metrics.
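As a minimal sketch of how a metric from this namespace might be requested, the following Python builds the request structure used by the CloudWatch `GetMetricData` API. The helper name `build_metric_query` is hypothetical. The dimension set (`ClusterName`, `Namespace`) and its values are assumptions for illustration, so check the Dimensions columns in the tables for the combinations that apply to each metric.

```python
from datetime import datetime, timedelta, timezone

NAMESPACE = "ContainerInsights"  # namespace stated in this section


def build_metric_query(metric_name, dimensions, period_seconds=60, stat="Average"):
    """Return one MetricDataQueries entry for CloudWatch GetMetricData.

    Hypothetical helper: it only assembles a dictionary and makes no API call.
    """
    return {
        # Query IDs must start with a lowercase letter.
        "Id": "m_" + metric_name.lower(),
        "MetricStat": {
            "Metric": {
                "Namespace": NAMESPACE,
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": name, "Value": value}
                    for name, value in sorted(dimensions.items())
                ],
            },
            "Period": period_seconds,
            "Stat": stat,
        },
        "ReturnData": True,
    }


# Assumed dimension names and placeholder values, for illustration only.
query = build_metric_query(
    "pod_network_rx_bytes",
    {"ClusterName": "my-cluster", "Namespace": "default"},
)

# Request the last hour of data.
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)
request = {"MetricDataQueries": [query], "StartTime": start, "EndTime": end}
```

If boto3 is installed and credentials are configured, a dictionary shaped like `request` could be passed to `boto3.client("cloudwatch").get_metric_data(**request)`.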
If you do not see any Container Insights metrics in your console, be sure that you have completed the setup of Container Insights. Metrics do not appear before Container Insights has been set up completely. For more information, see Setting up Container Insights.
If you are using version 1.5.0 or later of the Amazon EKS add-on or version 1.300035.0 of the CloudWatch agent, most metrics listed in the following table are collected for both Linux and Windows nodes. See the Metric Name column of the table to see which metrics are not collected for Windows.
With the original version of Container Insights, the metrics are charged as custom
metrics. With Container Insights with enhanced observability for Amazon EKS, Container Insights
metrics are charged per observation instead of being charged per metric stored or log
ingested. For more information about CloudWatch pricing, see Amazon CloudWatch Pricing.
Note
On Windows, network metrics such as pod_network_rx_bytes and pod_network_tx_bytes are not collected for host process containers.
Metric name | Dimensions with any version of Container Insights | Additional dimensions with Container Insights with enhanced observability for Amazon EKS | Description |
---|---|---|---|
|
|
The number of failed worker nodes in the cluster. A node is considered failed
if it is suffering from any node conditions. For more
information, see Conditions |
|
|
|
The total number of worker nodes in the cluster. |
|
|
|
The number of pods running per namespace in the resource that is specified by the dimensions that you're using. |
|
|
|
|
The maximum number of CPU units that can be assigned to a single node in this cluster. |
|
|
The percentage of CPU units that are reserved for node components, such as kubelet, kube-proxy, and Docker. Formula: Note
|
|
|
|
|
The number of CPU units being used on the nodes in the cluster. |
|
|
The total percentage of CPU units being used on the nodes in the cluster. Formula: |
|
|
|
The total percentage of file system capacity being used on nodes in the cluster. Formula: Note
|
|
|
|
|
The maximum amount of memory, in bytes, that can be assigned to a single node in this cluster. |
This metric is available only with Container Insights with enhanced observability for Amazon EKS. It is not available on Windows. |
|
The total number of inodes (used and unused) on a node. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS. It is not available on Windows. |
|
The number of unused inodes on a node. |
|
|
|
The percentage of memory currently being used on the nodes in the cluster. Formula: Note
|
|
|
|
The percentage of memory currently being used by the node or nodes. It is the node memory usage divided by the node memory limit, expressed as a percentage. Formula: |
|
|
|
|
The amount of memory, in bytes, being used in the working set of the nodes in the cluster. |
|
|
The total number of bytes per second transmitted and received over the network per node in a cluster. Formula: Note
|
|
|
|
The number of running containers per node in a cluster. |
|
|
|
The number of running pods per node in a cluster. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The number of pods that can be assigned to a node based on its allocatable resources, which are defined as the remainder of a node's capacity after accounting for system daemon reservations and hard eviction thresholds. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The number of pods that can be assigned to a node based on its capacity. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Indicates whether the node status condition |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Indicates whether the node status condition |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Indicates whether the node status condition |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Indicates whether the node status condition |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Indicates whether any of the node status conditions are Unknown. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The number of packets which were received and subsequently dropped by a network interface on the node. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The number of packets which were due to be transmitted but were dropped by a network interface on the node. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS. It is not available on Windows. |
|
The total number of bytes transferred by all I/O operations on the node. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS. It is not available on Windows. |
|
The total number of I/O operations on the node. |
|
|
|
|
The CPU capacity that is reserved per pod in a cluster. Formula: Note
|
|
Namespace, Service, Namespace,
|
|
The percentage of CPU units being used by pods. Formula: Note
|
|
Namespace, Service, Namespace,
|
|
The percentage of CPU units being used by pods relative to the pod limit. Formula: Note
|
|
|
|
The percentage of memory that is reserved for pods. Formula: Note
|
|
Namespace, Service, Namespace,
|
|
The percentage of memory currently being used by the pod or pods. Formula: Note
|
|
Namespace, Service, Namespace,
|
|
The percentage of memory that is being used by pods relative to the pod limit. If any containers in the pod don't have a memory limit defined, this metric doesn't appear. Formula: Note
|
|
Namespace, Service, Namespace,
|
|
The number of bytes per second being received over the network by the pod. Formula: Note
|
|
Namespace, Service, Namespace,
|
|
The number of bytes per second being transmitted over the network by the pod. Formula: Note
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The CPU requests for the pod. Formula: Note
|
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The memory requests for the pod. Formula: Note
|
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The CPU limit defined for the containers in the pod. If any containers in the pod don't have a CPU limit defined, this metric doesn't appear. Formula: Note
|
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The memory limit defined for the containers in the pod. If any containers in the pod don't have a memory limit defined, this metric doesn't appear. Formula: Note
|
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Indicates that all containers in the pod have terminated, and at least one container has terminated with a non-zero status or was terminated by the system. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Indicates that all containers in the pod are ready, having reached the
condition of |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Indicates that all containers in the pod are running. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Indicates that the pod has been scheduled to a node. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Indicates that the status of the pod can't be obtained. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Indicates that the pod has been accepted by the cluster but one or more of the containers has not become ready yet. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Indicates that all containers in the pod have successfully terminated and will not be restarted. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Reports the number of containers defined in the pod specification. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Reports the number of containers in the pod which are currently in the
|
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Reports the number of containers in the pod which are in the
|
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Reports the number of containers in the pod which are in the
|
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Reports the number of containers in the pod which are in the
|
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Reports the number of containers in the pod which are pending because of a |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Reports the number of containers in the pod which are pending with the reason |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Reports the number of containers in the pod which are pending with the reason |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Reports the number of containers in the pod which are pending
because of
|
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Reports the number of containers in the pod which are in the
|
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Reports the number of containers in the pod which are pending with the reason being
|
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The number of packets which were received and subsequently dropped by a network interface for the pod. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The number of packets which were due to be transmitted but were dropped for the pod. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The percentage of CPU units being used by the container. Formula: Note
|
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The percentage of CPU units being used by the container relative to the container limit. If the container doesn't have a CPU limit defined, this metric doesn't appear. Formula: Note
|
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The percentage of memory units being used by the container. Formula: Note
|
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The percentage of memory units being used by the container relative to the container limit. If the container doesn't have a memory limit defined, this metric doesn't appear. Formula: Note
|
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS. It is not available on Windows. |
|
The number of memory allocation failures experienced by the container. |
|
|
PodName, |
The total number of container restarts in a pod. |
|
|
Service,
|
The number of pods running the service or services in the cluster. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The number of pods desired for a workload as defined in the workload specification. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The number of pods for a workload that have reached the ready status. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The number of pods for a workload which are available. A pod is available when
it has been ready for the |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The number of pods for a workload which are unavailable. A pod is available
when it has been ready for the |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The number of objects stored in etcd at the time of the last check. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The total number of API requests to the Kubernetes API server. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Response latency for API requests to the Kubernetes API server. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Admission controller latency in seconds. An admission controller is code which intercepts requests to the Kubernetes API server. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Response latency experienced by clients calling the Kubernetes API server. This metric is experimental and may change in future releases of Kubernetes. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The total number of API requests to the Kubernetes API server made by clients. This metric is experimental and may change in future releases of Kubernetes. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Response latency of API calls to Etcd. This metric is experimental and may change in future releases of Kubernetes. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Size of the storage database file physically allocated in bytes. This metric is experimental and may change in future releases of Kubernetes. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The number of active long-running requests to the Kubernetes API server. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The number of requests that are being processed by the Kubernetes API server. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Admission webhook latency in seconds. Admission webhooks are HTTP callbacks that receive admission requests and do something with them. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Admission sub-step latency in seconds. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Number of requests to deprecated APIs on the Kubernetes API server. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Number of requests to the Kubernetes API server which were responded to with a 5XX HTTP response code. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Response latency of listing objects from Etcd. This metric is experimental and may change in future releases of Kubernetes. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
The number of requests queued by the Kubernetes API server. This metric is experimental and may change in future releases of Kubernetes. |
|
This metric is available only with Container Insights with enhanced observability for Amazon EKS |
|
Number of requests rejected by API Priority and Fairness subsystem. This metric is experimental and may change in future releases of Kubernetes. |
NVIDIA GPU metrics
Beginning with version 1.300034.0 of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects NVIDIA GPU metrics from EKS workloads by default. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on version v1.3.0-eksbuild.1 or later. For more information, see Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. The NVIDIA GPU metrics that are collected are listed in the table in this section.
For Container Insights to collect NVIDIA GPU metrics, you must meet the following prerequisites:
You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version v1.3.0-eksbuild.1 or later.
The NVIDIA device plugin for Kubernetes must be installed in the cluster.
The NVIDIA container toolkit must be installed on the nodes of the cluster. For example, the Amazon EKS optimized accelerated AMIs are built with the necessary components.
You can opt out of collecting NVIDIA GPU metrics by setting the accelerated_compute_metrics option in the CloudWatch agent configuration file to false. For more information and an example opt-out configuration, see (Optional) Additional configuration.
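As a rough illustration of the opt-out, the following Python builds a configuration fragment with accelerated_compute_metrics set to false and prints it as JSON. The placement of the option under `logs.metrics_collected.kubernetes`, next to `enhanced_container_insights`, is an assumption here and should be verified against the agent configuration reference for your agent version.

```python
import json

# Assumed layout: the option sits under logs.metrics_collected.kubernetes,
# alongside enhanced_container_insights. Verify this for your agent version.
agent_config = {
    "logs": {
        "metrics_collected": {
            "kubernetes": {
                "enhanced_container_insights": True,
                "accelerated_compute_metrics": False,  # opt out of GPU metrics
            }
        }
    }
}

# Serialize to the JSON text that would go into the agent configuration file.
config_json = json.dumps(agent_config, indent=2)
print(config_json)
```

Generating the fragment from a dictionary rather than editing JSON by hand keeps the boolean as a real `false` value instead of the string "false".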
Metric name | Dimensions | Description |
---|---|---|
|
|
The total frame buffer size, in bytes, on the GPU(s) allocated to the container. |
|
|
The bytes of frame buffer used on the GPU(s) allocated to the container. |
|
|
The percentage of frame buffer used on the GPU(s) allocated to the container. |
|
|
The power usage in watts of the GPU(s) allocated to the container. |
|
|
The temperature in degrees Celsius of the GPU(s) allocated to the container. |
|
|
The percentage utilization of the GPU(s) allocated to the container. |
|
|
The total frame buffer size, in bytes, on the GPU(s) allocated to the node. |
|
|
The bytes of frame buffer used on the GPU(s) allocated to the node. |
|
|
The percentage of frame buffer used on the GPU(s) allocated to the node. |
|
|
The power usage in watts of the GPU(s) allocated to the node. |
|
|
The temperature in degrees Celsius of the GPU(s) allocated to the node. |
|
|
The percentage utilization of the GPU(s) allocated to the node. |
|
|
The total frame buffer size, in bytes, on the GPU(s) allocated to the pod. |
|
|
The bytes of frame buffer used on the GPU(s) allocated to the pod. |
|
|
The percentage of frame buffer used on the GPU(s) allocated to the pod. |
|
|
The power usage in watts of the GPU(s) allocated to the pod. |
|
|
The temperature in degrees Celsius of the GPU(s) allocated to the pod.
|
|
The percentage utilization of the GPU(s) allocated to the pod. |
AWS Neuron metrics for AWS Trainium and AWS Inferentia
Beginning with version 1.300036.0 of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects accelerated computing metrics from AWS Trainium and AWS Inferentia accelerators by default. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on version v1.5.0-eksbuild.1 or later. For more information about the add-on, see Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. For more information about AWS Trainium, see AWS Trainium.
For Container Insights to collect AWS Neuron metrics, you must meet the following prerequisites:
You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version v1.5.0-eksbuild.1 or later.
The Neuron driver must be installed on the nodes of the cluster.
The Neuron device plugin must be installed on the cluster. For example, the Amazon EKS optimized accelerated AMIs are built with the necessary components.
The metrics that are collected are listed in the table in this section. The metrics are collected for AWS Trainium, AWS Inferentia, and AWS Inferentia2.
The CloudWatch agent collects these metrics from the Neuron monitor.
Metric name | Dimensions | Description |
---|---|---|
|
|
The NeuronCore utilization during the captured period of the NeuronCore allocated to the container. Unit: Percent |
|
|
The amount of device memory used for constants during training (or weights during inference) by the NeuronCore that is allocated to the container. Unit: Bytes |
|
|
The amount of device memory used for the models' executable code by the NeuronCore that is allocated to the container. Unit: Bytes |
|
|
The amount of device memory used for the shared scratchpad of the models by the NeuronCore that is allocated to the container. This memory region is reserved for the models. Unit: Bytes |
|
|
The amount of device memory used for the Neuron runtime by the NeuronCore allocated to the container. Unit: Bytes |
|
|
The amount of device memory used for tensors by the NeuronCore allocated to the container. Unit: Bytes |
|
|
The total amount of memory used by the NeuronCore allocated to the container. Unit: Bytes |
|
|
The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device on the node. Unit: Count |
|
|
The NeuronCore utilization during the captured period of the NeuronCore allocated to the pod. Unit: Percent |
|
|
The amount of device memory used for constants during training (or weights during inference) by the NeuronCore that is allocated to the pod. Unit: Bytes |
|
|
The amount of device memory used for the models' executable code by the NeuronCore that is allocated to the pod. Unit: Bytes |
|
|
The amount of device memory used for the shared scratchpad of the models by the NeuronCore that is allocated to the pod. This memory region is reserved for the models. Unit: Bytes |
|
|
The amount of device memory used for the Neuron runtime by the NeuronCore allocated to the pod. Unit: Bytes |
|
|
The amount of device memory used for tensors by the NeuronCore allocated to the pod. Unit: Bytes |
|
|
The total amount of memory used by the NeuronCore allocated to the pod. Unit: Bytes |
|
|
The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device allocated to a pod. Unit: Count |
|
|
The NeuronCore utilization during the captured period of the NeuronCore allocated to the node. Unit: Percent |
|
|
The amount of device memory used for constants during training (or weights during inference) by the NeuronCore that is allocated to the node. Unit: Bytes |
|
|
The amount of device memory used for models' executable code by the NeuronCore that is allocated to the node. Unit: Bytes |
|
|
The amount of device memory used for the shared scratchpad of the models by the NeuronCore that is allocated to the node. This is a memory region reserved for the models. Unit: Bytes |
|
|
The amount of device memory used for the Neuron runtime by the NeuronCore that is allocated to the node. Unit: Bytes |
|
|
The amount of device memory used for tensors by the NeuronCore that is allocated to the node. Unit: Bytes |
|
|
The total amount of memory used by the NeuronCore that is allocated to the node. Unit: Bytes |
|
|
The total number of execution errors on the node. This is calculated by the CloudWatch agent by aggregating
the errors of the following types: Unit: Count |
|
|
The total Neuron device memory usage in bytes on the node. Unit: Bytes |
|
|
The latency, in seconds, for an execution on the node as measured by the Neuron runtime. Unit: Seconds |
|
|
The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device on the node. Unit: Count |
AWS Elastic Fabric Adapter (EFA) metrics
Beginning with version 1.300037.0 of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects AWS Elastic Fabric Adapter (EFA) metrics from Amazon EKS clusters on Linux instances. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on version v1.5.2-eksbuild.1 or later. For more information about the add-on, see Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. For more information about AWS Elastic Fabric Adapter, see Elastic Fabric Adapter.
For Container Insights to collect AWS Elastic Fabric Adapter metrics, you must meet the following prerequisites:
You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version v1.5.2-eksbuild.1 or later.
The EFA device plugin must be installed on the cluster. For more information, see aws-efa-k8s-device-plugin on GitHub.
The metrics that are collected are listed in the following table.
Metric name | Dimensions | Description |
---|---|---|
|
|
The number of bytes per second received by the EFA device(s) allocated to the container. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted by the EFA device(s) allocated to the container. Unit: Bytes/Second |
|
|
The number of packets that were received and then dropped by the EFA device(s) allocated to the container. Unit: Count/Second |
|
|
The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the container. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted using remote direct memory access read operations by the EFA device(s) allocated to the container. Unit: Bytes/Second |
|
|
The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the container. Unit: Bytes/Second |
|
|
The number of bytes per second received by the EFA device(s) allocated to the pod. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted by the EFA device(s) allocated to the pod. Unit: Bytes/Second |
|
|
The number of packets that were received and then dropped by the EFA device(s) allocated to the pod. Unit: Count/Second |
|
|
The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the pod. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted using remote direct memory access read operations by the EFA device(s) allocated to the pod. Unit: Bytes/Second |
|
|
The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the pod. Unit: Bytes/Second |
|
|
The number of bytes per second received by the EFA device(s) allocated to the node. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted by the EFA device(s) allocated to the node. Unit: Bytes/Second |
|
|
The number of packets that were received and then dropped by the EFA device(s) allocated to the node. Unit: Count/Second |
|
|
The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the node. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted using remote direct memory access read operations by the EFA device(s) allocated to the node. Unit: Bytes/Second |
|
|
The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the node. Unit: Bytes/Second |
Amazon SageMaker HyperPod metrics
Beginning with version v2.0.1-eksbuild.1 of the CloudWatch Observability EKS add-on, Container Insights with enhanced observability for Amazon EKS automatically collects Amazon SageMaker HyperPod metrics from Amazon EKS clusters. For more information about the add-on, see Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. For more information about Amazon SageMaker HyperPod, see Amazon SageMaker HyperPod.
The metrics that are collected are listed in the following table.
Metric name | Dimensions | Description |
---|---|---|
|
|
Indicates if a node is labeled as Unit: Count |
|
|
Indicates if a node is labeled as Unit: Count |
|
|
Indicates if a node is labeled as If automatic node recovery is enabled, the node will be automatically replaced by Amazon SageMaker HyperPod. Unit: Count |
|
|
Indicates if a node is labeled as If automatic node recovery is enabled, the node will be automatically rebooted by Amazon SageMaker HyperPod. Unit: Count |
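
The minimum CloudWatch Observability EKS add-on versions quoted in the sections above can be compared against an installed version with a small helper. This is only a sketch: the feature labels and function names are hypothetical, the version format is assumed to be `vMAJOR.MINOR.PATCH-eksbuild.N`, and the add-on version is only one of the prerequisites listed for each metric family.

```python
import re

# Minimum add-on versions quoted in this section.
# The dictionary keys are hypothetical labels, not AWS identifiers.
MIN_ADDON_VERSION = {
    "nvidia_gpu": "v1.3.0-eksbuild.1",
    "aws_neuron": "v1.5.0-eksbuild.1",
    "efa": "v1.5.2-eksbuild.1",
    "sagemaker_hyperpod": "v2.0.1-eksbuild.1",
}

# Assumed version format: optional "v", three numeric parts, optional build suffix.
_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:-eksbuild\.(\d+))?$")


def parse_addon_version(version):
    """Turn 'v1.5.2-eksbuild.1' into a comparable tuple such as (1, 5, 2, 1)."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized add-on version: {version!r}")
    major, minor, patch, build = match.groups()
    return (int(major), int(minor), int(patch), int(build or 0))


def supported_features(installed_version):
    """List the metric families whose minimum add-on version is satisfied."""
    installed = parse_addon_version(installed_version)
    return sorted(
        feature
        for feature, minimum in MIN_ADDON_VERSION.items()
        if installed >= parse_addon_version(minimum)
    )


print(supported_features("v1.5.1-eksbuild.1"))
# prints ['aws_neuron', 'nvidia_gpu']
```

Comparing parsed integer tuples avoids the ordering mistakes that plain string comparison makes, for example treating "v1.10.0" as older than "v1.9.0".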