Amazon EKS and Kubernetes Container Insights with enhanced observability metrics
The following tables list the metrics and dimensions that Container Insights with
enhanced observability collects for Amazon EKS and Kubernetes. These metrics are in the
ContainerInsights
namespace. For more information, see Metrics.
If you do not see any Container Insights with enhanced observability metrics in your console, be sure that you have completed the setup of Container Insights with enhanced observability. Metrics do not appear before Container Insights with enhanced observability has been set up completely. For more information, see Setting up Container Insights.
If you are using version 1.5.0 or later of the Amazon EKS add-on or version 1.300035.0 of the CloudWatch agent, most metrics listed in the following table are collected for both Linux and Windows nodes. The Metric name column of the table notes which metrics are not collected for Windows.
With the earlier version of Container Insights, which delivers aggregated metrics at the Cluster and Service level, the metrics are charged as custom metrics. With Container Insights with enhanced observability for Amazon EKS, Container Insights metrics are charged per observation instead of per metric stored or log ingested. For more information about CloudWatch pricing, see Amazon CloudWatch Pricing.
Note
On Windows, network metrics such as pod_network_rx_bytes
and pod_network_tx_bytes
are not collected for host process containers.
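Once Container Insights with enhanced observability is set up, the metrics in the tables below can be retrieved programmatically through the CloudWatch GetMetricData API. The helper below is an illustrative sketch, not part of any AWS SDK; the metric name and dimensions are examples and must match what your cluster actually emits:

```python
# Build one entry for the MetricDataQueries parameter of the CloudWatch
# GetMetricData API. Illustrative helper only -- not part of any AWS SDK.
def container_insights_query(query_id, metric_name, dimensions,
                             stat="Average", period=300):
    """Return a MetricDataQuery for a metric in the ContainerInsights namespace."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "ContainerInsights",
                "MetricName": metric_name,
                # CloudWatch expects dimensions as a list of Name/Value pairs.
                "Dimensions": [{"Name": k, "Value": v}
                               for k, v in dimensions.items()],
            },
            "Period": period,
            "Stat": stat,
        },
        "ReturnData": True,
    }

# Example: average pod_network_rx_bytes for a cluster. The cluster name is a
# placeholder; you would pass [query] to a CloudWatch client's get_metric_data.
query = container_insights_query(
    "rx", "pod_network_rx_bytes", {"ClusterName": "my-cluster"}
)
print(query["MetricStat"]["Metric"]["Namespace"])  # → ContainerInsights
```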
Metric name | Dimensions | Description |
---|---|---|
|
|
The number of failed worker nodes in the cluster. A node is considered failed if it is suffering from any node conditions. For more information, see Conditions. |
|
|
The total number of worker nodes in the cluster. |
|
|
The number of pods running per namespace in the resource that is specified by the dimensions that you're using. |
|
|
The maximum number of CPU units that can be assigned to a single node in this cluster. |
|
|
The percentage of CPU units that are reserved for node components, such as kubelet, kube-proxy, and Docker. Formula: Note
|
|
|
The number of CPU units being used on the nodes in the cluster. |
|
|
The total percentage of CPU units being used on the nodes in the cluster. Formula: |
|
|
The total percentage of file system capacity being used on nodes in the cluster. Formula: Note
|
|
|
The maximum amount of memory, in bytes, that can be assigned to a single node in this cluster. |
It is not available on Windows. |
|
The total number of inodes (used and unused) on a node. |
It is not available on Windows. |
|
The number of unused inodes on a node. |
|
|
The percentage of memory currently being used on the nodes in the cluster. Formula: Note
|
|
|
The percentage of memory currently being used by the node or nodes. It is the percentage of node memory usage divided by the node memory limitation. Formula: |
|
|
The amount of memory, in bytes, being used in the working set of the nodes in the cluster. |
|
|
The total number of bytes per second transmitted and received over the network per node in a cluster. Formula: Note
|
|
|
The number of running containers per node in a cluster. |
|
|
The number of running pods per node in a cluster. |
|
|
The number of pods that can be assigned to a node based on its allocatable resources, which is defined as the remainder of a node's capacity after accounting for system daemons reservations and hard eviction thresholds. |
|
|
The number of pods that can be assigned to a node based on its capacity. |
|
|
Indicates whether the node status condition |
|
|
Indicates whether the node status condition |
|
|
Indicates whether the node status condition |
|
|
Indicates whether the node status condition |
|
|
Indicates whether any of the node status conditions are Unknown. |
|
|
The number of packets which were received and subsequently dropped by a network interface on the node. |
|
|
The number of packets which were due to be transmitted but were dropped by a network interface on the node. |
It is not available on Windows. |
|
The total number of bytes transferred by all I/O operations on the node. |
It is not available on Windows. |
|
The total number of I/O operations on the node. |
|
|
The CPU capacity that is reserved per pod in a cluster. Formula: Note
|
|
|
The percentage of CPU units being used by pods. Formula: |
|
|
The percentage of CPU units being used by pods relative to the pod limit. Formula: |
|
|
The percentage of memory that is reserved for pods. Formula: Note
|
|
|
The percentage of memory currently being used by the pod or pods. Formula: |
|
|
The percentage of memory that is being used by pods relative to the pod limit. If any containers in the pod don't have a memory limit defined, this metric doesn't appear. Formula: |
|
|
The number of bytes per second being received over the network by the pod. Formula: Note
|
|
|
The number of bytes per second being transmitted over the network by the pod. Formula: Note
|
|
|
The CPU requests for the pod. Formula: Note
|
|
|
The memory requests for the pod. Formula: Note
|
|
|
The CPU limit defined for the containers in the pod. If any containers in the pod don't have a CPU limit defined, this metric doesn't appear. Formula: Note
|
|
|
The memory limit defined for the containers in the pod. If any containers in the pod don't have a memory limit defined, this metric doesn't appear. Formula: Note
|
|
|
Indicates that all containers in the pod have terminated, and at least one container has terminated with a non-zero status or was terminated by the system. |
|
|
Indicates that all containers in the pod are ready, having reached the
condition of |
|
|
Indicates that all containers in the pod are running. |
|
|
Indicates that the pod has been scheduled to a node. |
|
|
Indicates that the status of the pod can't be obtained. |
|
|
Indicates that the pod has been accepted by the cluster but one or more of the containers has not become ready yet. |
|
|
Indicates that all containers in the pod have successfully terminated and will not be restarted. |
|
|
Reports the number of containers defined in the pod specification. |
|
|
Reports the number of containers in the pod which are currently in the
|
|
|
Reports the number of containers in the pod which are in the
|
|
|
Reports the number of containers in the pod which are in the
|
|
|
Reports the number of containers in the pod which are in the
|
|
|
Reports the number of containers in the pod which are pending because of a
|
|
|
Reports the number of containers in the pod which are pending with the reason
|
|
|
Reports the number of containers in the pod which are pending with the reason
|
|
|
Reports the number of containers in the pod which are pending because of
|
|
|
Reports the number of containers in the pod which are in the
|
|
|
Reports the number of containers in the pod which are pending with the reason
being |
|
|
Indicates a pod was terminated for exceeding the memory limit. This metric is only displayed when this issue occurs. |
|
|
The number of packets which were received and subsequently dropped by a network interface for the pod. |
|
|
The number of packets which were due to be transmitted but were dropped for the pod. |
|
|
The memory in bytes that is currently being used by a pod. |
|
|
The number of CPU units used by a pod. |
|
|
The percentage of CPU units being used by the container. Formula: Note
|
|
|
The percentage of CPU units being used by the container relative to the container limit. If the container doesn't have a CPU limit defined, this metric doesn't appear. Formula: Note
|
|
|
The percentage of memory units being used by the container. Formula: Note
|
|
|
The percentage of memory units being used by the container relative to the container limit. If the container doesn't have a memory limit defined, this metric doesn't appear. Formula: Note
|
It is not available on Windows. |
|
The number of memory allocation failures experienced by the container. |
|
PodName, |
The total number of container restarts in a pod. |
|
Service,
|
The number of pods running the service or services in the cluster. |
|
|
The number of pods desired for a workload as defined in the workload specification. |
|
|
The number of pods for a workload that have reached the ready status. |
|
|
The number of pods for a workload which are available. A pod is available when
it has been ready for the |
|
|
The number of pods for a workload which are unavailable. A pod is available
when it has been ready for the |
|
|
The number of objects stored in etcd at the time of the last check. |
|
|
The total number of API requests to the Kubernetes API server. |
|
|
Response latency for API requests to the Kubernetes API server. |
|
|
Admission controller latency in seconds. An admission controller is code which intercepts requests to the Kubernetes API server. |
|
|
Response latency experienced by clients calling the Kubernetes API server. This metric is experimental and may change in future releases of Kubernetes. |
|
|
The total number of API requests to the Kubernetes API server made by clients. This metric is experimental and may change in future releases of Kubernetes. |
|
|
Response latency of API calls to etcd. This metric is experimental and may change in future releases of Kubernetes. |
|
|
Size of the storage database file physically allocated in bytes. This metric is experimental and may change in future releases of Kubernetes. |
|
|
The number of active long-running requests to the Kubernetes API server. |
|
|
The number of requests that are being processed by the Kubernetes API server. |
|
|
Admission webhook latency in seconds. Admission webhooks are HTTP callbacks that receive admission requests and do something with them. |
|
|
Admission sub-step latency in seconds. |
|
|
Number of requests to deprecated APIs on the Kubernetes API server. |
|
|
Number of requests to the Kubernetes API server which were responded to with a 5XX HTTP response code. |
|
|
Response latency of listing objects from etcd. This metric is experimental and may change in future releases of Kubernetes. |
|
|
The number of threads used by the currently executing requests in the API Priority and Fairness subsystem. |
|
|
Number of requests rejected by API Priority and Fairness subsystem. This metric is experimental and may change in future releases of Kubernetes. |
|
|
The number of requests queued by the Kubernetes API server. This metric is experimental and may change in future releases of Kubernetes. |
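Several of the percentage metrics above follow the same usage-over-capacity pattern, such as the node CPU utilization and the pod-memory-usage-over-pod-limit percentages described earlier. The function below is a rough illustration of that pattern only, not the CloudWatch agent's actual implementation:

```python
def utilization_percent(used, capacity):
    """Usage-over-capacity pattern behind the percentage metrics above
    (illustrative only -- not the CloudWatch agent's implementation)."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used / capacity

# Example: a node using 1500 millicores of CPU against a 4000-millicore
# (4 vCPU) limit is at 37.5% utilization.
print(utilization_percent(1500, 4000))  # → 37.5
```

The same arithmetic applies whether the denominator is a node's capacity, a pod's reserved capacity, or a pod's configured limit; metrics that divide by a limit simply don't appear when no limit is defined, as noted in the table.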
NVIDIA GPU metrics
Beginning with version 1.300034.0
of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects NVIDIA GPU metrics
from EKS workloads by default. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on
version v1.3.0-eksbuild.1
or later. For more information, see
Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. The NVIDIA GPU metrics that are collected are listed in the table in this section.
For Container Insights to collect NVIDIA GPU metrics, you must meet the following prerequisites:
You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version v1.3.0-eksbuild.1 or later. The NVIDIA device plugin for Kubernetes must be installed in the cluster. The NVIDIA container toolkit must be installed on the nodes of the cluster. For example, the Amazon EKS optimized accelerated AMIs are built with the necessary components.
You can opt out of collecting NVIDIA GPU metrics by setting the accelerated_compute_metrics option in the CloudWatch agent configuration file to false. For more information and an example opt-out configuration, see
(Optional)
Additional configuration.
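As a sketch of what that opt-out could look like, the relevant fragment of the CloudWatch agent configuration file might resemble the following; verify the exact structure against the agent configuration reference for your version:

```json
{
  "logs": {
    "metrics_collected": {
      "kubernetes": {
        "accelerated_compute_metrics": false
      }
    }
  }
}
```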
Metric name | Dimensions | Description |
---|---|---|
|
|
The total frame buffer size, in bytes, on the GPU(s) allocated to the container. |
|
|
The bytes of frame buffer used on the GPU(s) allocated to the container. |
|
|
The percentage of frame buffer used of the GPU(s) allocated to the container. |
|
|
The power usage in watts of the GPU(s) allocated to the container. |
|
|
The temperature in degrees Celsius of the GPU(s) allocated to the container. |
|
|
The percentage utilization of the GPU(s) allocated to the container. |
|
|
The total frame buffer size, in bytes, on the GPU(s) allocated to the node. |
|
|
The bytes of frame buffer used on the GPU(s) allocated to the node. |
|
|
The percentage of frame buffer used on the GPU(s) allocated to the node. |
|
|
The power usage in watts of the GPU(s) allocated to the node. |
|
|
The temperature in degrees Celsius of the GPU(s) allocated to the node. |
|
|
The percentage utilization of the GPU(s) allocated to the node. |
|
|
The total frame buffer size, in bytes, on the GPU(s) allocated to the pod. |
|
|
The bytes of frame buffer used on the GPU(s) allocated to the pod. |
|
|
The percentage of frame buffer used of the GPU(s) allocated to the pod. |
|
|
The power usage in watts of the GPU(s) allocated to the pod. |
|
|
The temperature in degrees Celsius of the GPU(s) allocated to the pod. |
|
|
The percentage utilization of the GPU(s) allocated to the pod. |
AWS Neuron metrics for AWS Trainium and AWS Inferentia
Beginning with version 1.300036.0
of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects
accelerated computing metrics from AWS Trainium and AWS Inferentia accelerators by default. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on
version v1.5.0-eksbuild.1
or later. For more information about the add-on, see
Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. For more information about AWS Trainium, see AWS Trainium.
For Container Insights to collect AWS Neuron metrics, you must meet the following prerequisites:
You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version v1.5.0-eksbuild.1 or later. The Neuron driver must be installed on the nodes of the cluster. The Neuron device plugin must be installed on the cluster. For example, the Amazon EKS optimized accelerated AMIs are built with the necessary components.
The metrics that are collected are listed in the table in this section. The metrics are collected for AWS Trainium, AWS Inferentia, and AWS Inferentia2.
The CloudWatch agent collects these metrics from the Neuron monitor.
Metric name | Dimensions | Description |
---|---|---|
|
|
NeuronCore utilization, during the captured period of the NeuronCore allocated to the container. Unit: Percent |
|
|
The amount of device memory used for constants during training by the NeuronCore that is allocated to the container (or weights during inference). Unit: Bytes |
|
|
The amount of device memory used for the models' executable code by the NeuronCore that is allocated to the container. Unit: Bytes |
|
|
The amount of device memory used for the shared scratchpad of the models by the NeuronCore that is allocated to the container. This memory region is reserved for the models. Unit: Bytes |
|
|
The amount of device memory used for the Neuron runtime by the NeuronCore allocated to the container. Unit: Bytes |
|
|
The amount of device memory used for tensors by the NeuronCore allocated to the container. Unit: Bytes |
|
|
The total amount of memory used by the NeuronCore allocated to the container. Unit: Bytes |
|
|
The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device on the node. Unit: Count |
|
|
The NeuronCore utilization during the captured period of the NeuronCore allocated to the pod. Unit: Percent |
|
|
The amount of device memory used for constants during training by the NeuronCore that is allocated to the pod (or weights during inference). Unit: Bytes |
|
|
The amount of device memory used for the models' executable code by the NeuronCore that is allocated to the pod. Unit: Bytes |
|
|
The amount of device memory used for the shared scratchpad of the models by the NeuronCore that is allocated to the pod. This memory region is reserved for the models. Unit: Bytes |
|
|
The amount of device memory used for the Neuron runtime by the NeuronCore allocated to the pod. Unit: Bytes |
|
|
The amount of device memory used for tensors by the NeuronCore allocated to the pod. Unit: Bytes |
|
|
The total amount of memory used by the NeuronCore allocated to the pod. Unit: Bytes |
|
|
The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device allocated to a pod. Unit: Count |
|
|
The NeuronCore utilization during the captured period of the NeuronCore allocated to the node. Unit: Percent |
|
|
The amount of device memory used for constants during training by the NeuronCore that is allocated to the node (or weights during inference). Unit: Bytes |
|
|
The amount of device memory used for models' executable code by the NeuronCore that is allocated to the node. Unit: Bytes |
|
|
The amount of device memory used for the shared scratchpad of the models by the NeuronCore that is allocated to the node. This is a memory region reserved for the models. Unit: Bytes |
|
|
The amount of device memory used for the Neuron runtime by the NeuronCore that is allocated to the node. Unit: Bytes |
|
|
The amount of device memory used for tensors by the NeuronCore that is allocated to the node. Unit: Bytes |
|
|
The total amount of memory used by the NeuronCore that is allocated to the node. Unit: Bytes |
|
|
The total number of execution errors on the node. This is calculated by the CloudWatch agent by aggregating
the errors of the following types: Unit: Count |
|
|
The total Neuron device memory usage in bytes on the node. Unit: Bytes |
|
|
In seconds, the latency for an execution on the node as measured by the Neuron runtime. Unit: Seconds |
|
|
The number of corrected and uncorrected ECC events for the on-chip SRAM and device memory of the Neuron device on the node. Unit: Count |
AWS Elastic Fabric Adapter (EFA) metrics
Beginning with version 1.300037.0
of the CloudWatch agent, Container Insights with enhanced observability for Amazon EKS collects
AWS Elastic Fabric Adapter (EFA) metrics from Amazon EKS clusters on Linux instances. The CloudWatch agent must be installed using the CloudWatch Observability EKS add-on
version v1.5.2-eksbuild.1
or later. For more information about the add-on, see
Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. For more information about AWS Elastic Fabric Adapter, see Elastic Fabric Adapter.
For Container Insights to collect AWS Elastic Fabric Adapter metrics, you must meet the following prerequisites:
You must be using Container Insights with enhanced observability for Amazon EKS, with the Amazon CloudWatch Observability EKS add-on version v1.5.2-eksbuild.1 or later. The EFA device plugin must be installed on the cluster. For more information, see aws-efa-k8s-device-plugin on GitHub.
The metrics that are collected are listed in the following table.
Metric name | Dimensions | Description |
---|---|---|
|
|
The number of bytes per second received by the EFA device(s) allocated to the container. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted by the EFA device(s) allocated to the container. Unit: Bytes/Second |
|
|
The number of packets that were received and then dropped by the EFA device(s) allocated to the container. Unit: Count/Second |
|
|
The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the container. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted using remote direct memory access read operations by the EFA device(s) allocated to the container. Unit: Bytes/Second |
|
|
The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the container. Unit: Bytes/Second |
|
|
The number of bytes per second received by the EFA device(s) allocated to the pod. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted by the EFA device(s) allocated to the pod. Unit: Bytes/Second |
|
|
The number of packets that were received and then dropped by the EFA device(s) allocated to the pod. Unit: Count/Second |
|
|
The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the pod. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted using remote direct memory access read operations by the EFA device(s) allocated to the pod. Unit: Bytes/Second |
|
|
The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the pod. Unit: Bytes/Second |
|
|
The number of bytes per second received by the EFA device(s) allocated to the node. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted by the EFA device(s) allocated to the node. Unit: Bytes/Second |
|
|
The number of packets that were received and then dropped by the EFA device(s) allocated to the node. Unit: Count/Second |
|
|
The number of bytes per second received using remote direct memory access read operations by the EFA device(s) allocated to the node. Unit: Bytes/Second |
|
|
The number of bytes per second transmitted using remote direct memory access read operations by the EFA device(s) allocated to the node. Unit: Bytes/Second |
|
|
The number of bytes per second received during remote direct memory access write operations by the EFA device(s) allocated to the node. Unit: Bytes/Second |
Amazon SageMaker AI HyperPod metrics
Beginning with version v2.0.1-eksbuild.1 of the CloudWatch Observability EKS add-on, Container Insights with enhanced observability for Amazon EKS automatically collects
Amazon SageMaker AI HyperPod metrics from Amazon EKS clusters. For more information about the add-on, see
Install the CloudWatch agent with the Amazon CloudWatch Observability EKS add-on or the Helm chart. For more information about Amazon SageMaker AI HyperPod,
see Amazon SageMaker AI HyperPod.
The metrics that are collected are listed in the following table.
Metric name | Dimensions | Description |
---|---|---|
|
|
Indicates if a node is labeled as Unit: Count |
|
|
Indicates if a node is labeled as Unit: Count |
|
|
Indicates if a node is labeled as If automatic node recovery is enabled, the node will be automatically replaced by Amazon SageMaker AI HyperPod. Unit: Count |
|
|
Indicates if a node is labeled as If automatic node recovery is enabled, the node will be automatically rebooted by Amazon SageMaker AI HyperPod. Unit: Count |
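Because these node-health metrics indicate conditions that trigger automatic node recovery, you might want a CloudWatch alarm on them. The sketch below builds parameters for the PutMetricAlarm API; the metric name used here is a placeholder, not a real metric name, so substitute the actual name shown for your cluster in the CloudWatch console:

```python
# Illustrative PutMetricAlarm parameters for a HyperPod node-health metric.
# "hyperpod_unschedulable_placeholder" is NOT a real metric name -- replace it
# with the actual metric name from the ContainerInsights namespace.
def hyperpod_health_alarm(cluster_name, metric_name):
    """Alarm whenever any node reports the unhealthy label during a period."""
    return {
        "AlarmName": f"{cluster_name}-hyperpod-node-health",
        "Namespace": "ContainerInsights",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
    }

# You would pass these as keyword arguments to a CloudWatch client's
# put_metric_alarm call.
params = hyperpod_health_alarm("my-cluster", "hyperpod_unschedulable_placeholder")
```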