


# Detect node health issues and enable automatic node repair
<a name="node-health"></a>

Node health refers to the operational status and capability of a Kubernetes node to effectively run workloads. A healthy node maintains expected network connectivity, has sufficient compute and storage resources, and can successfully run workloads without disruption.

To help maintain healthy nodes in EKS clusters, EKS offers the *node monitoring agent* and *automatic node repair*. These features are automatically enabled with EKS Auto Mode compute. You can also use automatic node repair with EKS managed node groups and Karpenter, and you can use the EKS node monitoring agent with any EKS compute type except AWS Fargate. The EKS node monitoring agent and automatic node repair are most effective when used together, but they can also be used individually in EKS clusters.

**Important**  
The *node monitoring agent* and *node auto repair* are only available on Linux. These features aren’t available on Windows.

## Node monitoring agent
<a name="node-monitoring-agent"></a>

The EKS node monitoring agent reads node logs to detect health issues. It parses the logs to detect failures and surfaces the health status of the nodes. For each category of issues detected, the agent applies a dedicated `NodeCondition` to the worker nodes. For detailed information on the node health issues detected by the EKS node monitoring agent, see [Detect node health issues with the EKS node monitoring agent](node-health-nma.md).

EKS Auto Mode compute includes the node monitoring agent. For other EKS compute types, you can add the node monitoring agent as an EKS add-on or you can manage it with Kubernetes tooling such as Helm. For more information, see [Configure the node monitoring agent](node-health-nma.md#node-monitoring-agent-configure).

With the EKS node monitoring agent, the following categories of node health issues are surfaced as node conditions. Note that `Ready`, `DiskPressure`, and `MemoryPressure` are standard Kubernetes node conditions that are surfaced even without the EKS node monitoring agent.


| Node Condition | Description | 
| --- | --- | 
|  AcceleratedHardwareReady  |  AcceleratedHardwareReady indicates whether accelerated hardware (GPU, Neuron) on the node is functioning correctly.  | 
|  ContainerRuntimeReady  |  ContainerRuntimeReady indicates whether the container runtime (containerd, etc.) is functioning correctly and able to run containers.  | 
|  DiskPressure  |  DiskPressure is a standard Kubernetes condition indicating the node is experiencing disk pressure (low disk space or high I/O).  | 
|  KernelReady  |  KernelReady indicates whether the kernel is functioning correctly without critical errors, panics, or resource exhaustion.  | 
|  MemoryPressure  |  MemoryPressure is a standard Kubernetes condition indicating the node is experiencing memory pressure (low available memory).  | 
|  NetworkingReady  |  NetworkingReady indicates whether the node’s networking stack is functioning correctly (interfaces, routing, connectivity).  | 
|  StorageReady  |  StorageReady indicates whether the node’s storage subsystem is functioning correctly (disks, filesystems, I/O).  | 
|  Ready  |  Ready is the standard Kubernetes condition indicating the node is healthy and ready to accept pods.  | 

## Automatic node repair
<a name="node-auto-repair"></a>

EKS automatic node repair continuously monitors node health, reacts to detected problems, and replaces or reboots nodes when possible. This improves cluster reliability with minimal manual intervention and helps reduce application downtime.

By itself, EKS automatic node repair reacts to the `Ready` conditions of the kubelet, any manually deleted node objects, and EKS managed node group instances that fail to join the cluster. When EKS automatic node repair is enabled with the node monitoring agent installed, EKS automatic node repair reacts to additional node conditions: `AcceleratedHardwareReady`, `ContainerRuntimeReady`, `KernelReady`, `NetworkingReady`, and `StorageReady`.

EKS automatic node repair does not react to standard Kubernetes `DiskPressure`, `MemoryPressure`, or `PIDPressure` node conditions. These conditions often indicate issues with application behavior, workload configuration, or resource limits rather than node-level failures, making it difficult to determine an appropriate default repair action. In these scenarios, workloads are subject to the Kubernetes [node pressure eviction behavior](https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction).
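
The rules in the preceding two paragraphs can be summarized as a small decision helper. The following is a minimal sketch only: `repair_reacts_to` is a hypothetical function name used for illustration, not an EKS command or API, and it models only the node condition types named above (not deleted node objects or instances that fail to join the cluster).

```shell
# Sketch only: mirrors the rules described above; not an EKS API.
# Usage: repair_reacts_to <condition-type> <agent-installed: yes|no>
# Prints "yes" if automatic node repair reacts to the condition, else "no".
repair_reacts_to() {
  condition="$1"
  agent_installed="$2"
  case "$condition" in
    Ready)
      # Handled by automatic node repair on its own.
      echo yes ;;
    AcceleratedHardwareReady|ContainerRuntimeReady|KernelReady|NetworkingReady|StorageReady)
      # Only surfaced when the node monitoring agent is installed.
      if [ "$agent_installed" = "yes" ]; then echo yes; else echo no; fi ;;
    *)
      # DiskPressure, MemoryPressure, PIDPressure, and anything unrecognized.
      echo no ;;
  esac
}
```

For example, `repair_reacts_to KernelReady no` prints `no`, while `repair_reacts_to KernelReady yes` prints `yes`.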

For more information on EKS automatic node repair, see [Automatically repair nodes in EKS clusters](node-repair.md).


# Detect node health issues with the EKS node monitoring agent
<a name="node-health-nma"></a>

This topic details the node health issues detected by the EKS node monitoring agent, how those issues are surfaced as node conditions or events, and how to configure the node monitoring agent.

The EKS node monitoring agent can be used with or without EKS automatic node repair. For more information on EKS automatic node repair, see [Automatically repair nodes in EKS clusters](node-repair.md).

The source code for the EKS node monitoring agent is published on GitHub in the [aws/eks-node-monitoring-agent](https://github.com/aws/eks-node-monitoring-agent) repository.

## Node health issues
<a name="node-health-issues"></a>

The following tables describe node health issues that can be detected by the node monitoring agent. There are two types of issues:
+ Condition – A terminal issue that warrants a remediation action like an instance replacement or reboot. When auto repair is enabled, Amazon EKS performs a repair action, either a node replacement or a reboot. For more information, see [Node conditions](learn-status-conditions.md#status-node-conditions).
+ Event – A temporary issue or sub-optimal node configuration. No auto repair action will take place. For more information, see [Node events](learn-status-conditions.md#status-node-events).

## AcceleratedHardware node health issues
<a name="node-health-AcceleratedHardware"></a>

The monitoring condition is `AcceleratedHardwareReady` for issues in the following table that have a severity of “Condition”. The events and conditions in the table below are for NVIDIA and Neuron related node health issues.


| Name | Severity | Description | Repair Action | 
| --- | --- | --- | --- | 
|  DCGMDiagnosticFailure  |  Condition  |  A test case from the DCGM active diagnostics test suite failed.  |  None  | 
|  DCGMError  |  Condition  |  Connection to the DCGM host process was lost or could not be established.  |  None  | 
|  DCGMFieldError[Code]  |  Event  |  DCGM detected GPU degradation through a field identifier.  |  None  | 
|  DCGMHealthCode[Code]  |  Event  |  A DCGM health check failed in a non-fatal manner.  |  None  | 
|  DCGMHealthCode[Code]  |  Condition  |  A DCGM health check failed in a fatal manner.  |  None  | 
|  NeuronDMAError  |  Condition  |  A DMA engine encountered an unrecoverable error.  |  Replace  | 
|  NeuronHBMUncorrectableError  |  Condition  |  An HBM encountered an uncorrectable error and produced incorrect results.  |  Replace  | 
|  NeuronNCUncorrectableError  |  Condition  |  A Neuron Core uncorrectable memory error was detected.  |  Replace  | 
|  NeuronSRAMUncorrectableError  |  Condition  |  An on-chip SRAM encountered a parity error and produced incorrect results.  |  Replace  | 
|  NvidiaDeviceCountMismatch  |  Event  |  The number of GPUs visible through NVML is inconsistent with the NVIDIA device count on the filesystem.  |  None  | 
|  NvidiaDoubleBitError  |  Condition  |  A double bit error was produced by the GPU driver.  |  Replace  | 
|  NvidiaNCCLError  |  Event  |  A segfault occurred in the NVIDIA Collective Communications library (`libnccl`).  |  None  | 
|  NvidiaNVLinkError  |  Condition  |  NVLink errors were reported by the GPU driver.  |  Replace  | 
|  NvidiaPCIeError  |  Event  |  PCIe replays were triggered to recover from transmission errors.  |  None  | 
|  NvidiaPageRetirement  |  Event  |  The GPU driver has marked a memory page for retirement. This may occur if there is a single double bit error or two single bit errors are encountered at the same address.  |  None  | 
|  NvidiaPowerError  |  Event  |  Power utilization of GPUs breached the allowed thresholds.  |  None  | 
|  NvidiaThermalError  |  Event  |  Thermal status of GPUs breached the allowed thresholds.  |  None  | 
|  NvidiaXID[Code]Error  |  Condition  |  A critical GPU error occurred.  |  Replace or Reboot  | 
|  NvidiaXID[Code]Warning  |  Event  |  A non-critical GPU error occurred.  |  None  | 

## NVIDIA XID error codes
<a name="nvidia-xid-codes"></a>

The node monitoring agent detects NVIDIA XID errors from GPU kernel logs. XID errors fall into two categories:
+  **Well-known XID codes** – Critical errors that set a node condition (`AcceleratedHardwareReady=False`) and trigger auto repair when enabled. The reason code format is `NvidiaXID[Code]Error`. The well-known XID codes that the EKS node monitoring agent detects may not represent the full list of NVIDIA XID codes that require repair actions.
+  **Unknown XID codes** – Logged as Kubernetes events only. These don’t trigger auto repair. The reason code format is `NvidiaXID[Code]Warning`. To investigate unknown XID errors, review your kernel logs with `dmesg | grep -i nvrm`.

For more information on XID errors, see [Xid Errors](https://docs.nvidia.com/deploy/xid-errors/index.html#topic_5_1) in the *NVIDIA GPU Deployment and Management Documentation*. For more information on the individual XID messages, see [Understanding Xid Messages](https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html#understanding-xid-messages) in the *NVIDIA GPU Deployment and Management Documentation*.

The following table lists the well-known XID codes, their meanings, and the default node repair action if enabled.


| XID Code | Description | Repair Action | 
| --- | --- | --- | 
|  13  |  Graphics Engine Exception – A GPU graphics engine error occurred, typically caused by software issues or driver bugs.  |  Reboot  | 
|  31  |  GPU memory page fault – An application attempted to access GPU memory that is not mapped or accessible.  |  Reboot  | 
|  48  |  Double Bit ECC Error – An uncorrectable double-bit error occurred in GPU memory, indicating potential hardware degradation.  |  Reboot  | 
|  63  |  GPU memory remapping event – The GPU driver remapped a portion of GPU memory due to detected errors. This is often recoverable.  |  Reboot  | 
|  64  |  GPU memory remapping failure – The GPU was unable to remap defective memory, indicating hardware issues.  |  Reboot  | 
|  74  |  NVLink Error – An error occurred on the high-speed NVLink interconnect between GPUs.  |  Replace  | 
|  79  |  GPU has fallen off the bus – The GPU is no longer accessible via PCIe, typically indicating a hardware failure or power issue.  |  Replace  | 
|  94  |  Contained memory error – A memory error occurred but was contained and did not affect other applications.  |  Reboot  | 
|  95  |  Uncontained memory error – A memory error occurred that may have affected other applications or system memory.  |  Reboot  | 
|  119  |  GSP RPC Timeout – Communication with the GPU System Processor timed out, possibly due to firmware issues.  |  Replace  | 
|  120  |  GSP Error – An error occurred in the GPU System Processor.  |  Replace  | 
|  121  |  C2C Error – An error occurred on the chip-to-chip interconnect (used in multi-die GPUs).  |  Replace  | 
|  140  |  ECC Unrecovered Error – An ECC error escaped containment and may have corrupted data.  |  Replace  | 
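
To make the mapping between an XID code, its reason code, and its default repair action concrete, the following sketch encodes the table above as two shell functions. The function names `xid_reason` and `xid_default_action` are hypothetical helpers for illustration; they aren't part of the node monitoring agent. The actions reflect the defaults listed in the table and don't account for configuration overrides or compute types that support only `Replace`.

```shell
# Sketch only: encodes the well-known XID table above.
# Usage: xid_reason <code>
# Prints the reason code the agent would use for the given XID code.
xid_reason() {
  case "$1" in
    13|31|48|63|64|74|79|94|95|119|120|121|140)
      echo "NvidiaXID${1}Error" ;;    # well-known code: node condition
    *)
      echo "NvidiaXID${1}Warning" ;;  # unknown code: Kubernetes event only
  esac
}

# Usage: xid_default_action <code>
# Prints the default repair action from the table: Reboot, Replace, or None.
xid_default_action() {
  case "$1" in
    13|31|48|63|64|94|95)   echo Reboot ;;
    74|79|119|120|121|140)  echo Replace ;;
    *)                      echo None ;;
  esac
}
```

For example, `xid_reason 79` prints `NvidiaXID79Error`, and `xid_default_action 79` prints `Replace`.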

To view the current node conditions related to GPU health, run the following command.

```
kubectl get nodes -o custom-columns='NAME:.metadata.name,ACCELERATOR_READY:.status.conditions[?(@.type=="AcceleratedHardwareReady")].status,REASON:.status.conditions[?(@.type=="AcceleratedHardwareReady")].reason'
```
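
On clusters with many nodes, you might want to see only the nodes that report a GPU problem. The following sketch filters the output of the preceding command. `gpu_unhealthy_nodes` is a hypothetical helper name, and the filter assumes the three-column layout (`NAME`, `ACCELERATOR_READY`, `REASON`) that the command prints.

```shell
# Sketch only: reads the table printed by the kubectl command above on stdin
# and keeps the header row plus any node whose ACCELERATOR_READY column is "False".
# Nodes without the condition show "<none>" and are filtered out.
gpu_unhealthy_nodes() {
  awk 'NR == 1 || $2 == "False"'
}

# Example use (requires kubectl access to a cluster):
# kubectl get nodes -o custom-columns='NAME:.metadata.name,ACCELERATOR_READY:.status.conditions[?(@.type=="AcceleratedHardwareReady")].status,REASON:.status.conditions[?(@.type=="AcceleratedHardwareReady")].reason' \
#   | gpu_unhealthy_nodes
```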

To view XID-related events on your cluster, run the following command.

```
kubectl get events | grep -i "NvidiaXID"
```

## ContainerRuntime node health issues
<a name="node-health-ContainerRuntime"></a>

The monitoring condition is `ContainerRuntimeReady` for issues in the following table that have a severity of “Condition”.


| Name | Severity | Description | Repair Action | 
| --- | --- | --- | --- | 
|  ContainerRuntimeFailed  |  Event  |  The container runtime has failed to create a container, likely related to any reported issues if occurring repeatedly.  |  None  | 
|  DeprecatedContainerdConfiguration  |  Event  |  A container image using deprecated image manifest version 2, schema 1 was recently pulled onto the node through `containerd`.  |  None  | 
|  KubeletFailed  |  Event  |  The kubelet entered a failed state.  |  None  | 
|  LivenessProbeFailures  |  Event  |  A liveness probe failure was detected, potentially indicating application code issues or insufficient timeout values if occurring repeatedly.  |  None  | 
|  PodStuckTerminating  |  Condition  |  A Pod is or was stuck terminating for an excessive amount of time, which can be caused by CRI errors preventing pod state progression.  |  Replace  | 
|  ReadinessProbeFailures  |  Event  |  A readiness probe failure was detected, potentially indicating application code issues or insufficient timeout values if occurring repeatedly.  |  None  | 
|  [Name]RepeatedRestart  |  Event  |  A systemd unit is restarting frequently.  |  None  | 
|  ServiceFailedToStart  |  Event  |  A systemd unit failed to start.  |  None  | 

## Kernel node health issues
<a name="node-health-Kernel"></a>

The monitoring condition is `KernelReady` for issues in the following table that have a severity of “Condition”.


| Name | Severity | Description | Repair Action | 
| --- | --- | --- | --- | 
|  AppBlocked  |  Event  |  The task has been blocked for a long period of time from scheduling, usually caused by being blocked on input or output.  |  None  | 
|  AppCrash  |  Event  |  An application on the node has crashed.  |  None  | 
|  ApproachingKernelPidMax  |  Event  |  The number of processes is approaching the maximum number of PIDs that are available per the current `kernel.pid_max` setting, after which no more processes can be launched.  |  None  | 
|  ApproachingMaxOpenFiles  |  Event  |  The number of open files is approaching the maximum number of possible open files given the current kernel settings, after which opening new files will fail.  |  None  | 
|  ConntrackExceededKernel  |  Event  |  Connection tracking exceeded the maximum for the kernel and new connections could not be established, which can result in packet loss.  |  None  | 
|  ExcessiveZombieProcesses  |  Event  |  Processes which can’t be fully reclaimed are accumulating in large numbers, which indicates application issues and may lead to reaching system process limits.  |  None  | 
|  ForkFailedOutOfPIDs  |  Condition  |  A fork or exec call has failed due to the system being out of process IDs or memory, which may be caused by zombie processes or physical memory exhaustion.  |  Replace  | 
|  KernelBug  |  Event  |  A kernel bug was detected and reported by the Linux kernel itself, though this may sometimes be caused by nodes with high CPU or memory usage leading to delayed event processing.  |  None  | 
|  LargeEnvironment  |  Event  |  The number of environment variables for this process is larger than expected, potentially caused by many services with `enableServiceLinks` set to true, which may cause performance issues.  |  None  | 
|  RapidCron  |  Event  |  A cron job is running faster than every five minutes on this node, which may impact performance if the job consumes significant resources.  |  None  | 
|  SoftLockup  |  Event  |  The CPU stalled for a given amount of time.  |  None  | 

## Networking node health issues
<a name="node-health-Networking"></a>

The monitoring condition is `NetworkingReady` for issues in the following table that have a severity of “Condition”.


| Name | Severity | Description | Repair Action | 
| --- | --- | --- | --- | 
|  BandwidthInExceeded  |  Event  |  Packets have been queued or dropped because the inbound aggregate bandwidth exceeded the maximum for the instance.  |  None  | 
|  BandwidthOutExceeded  |  Event  |  Packets have been queued or dropped because the outbound aggregate bandwidth exceeded the maximum for the instance.  |  None  | 
|  ConntrackExceeded  |  Event  |  Connection tracking exceeded the maximum for the instance and new connections could not be established, which can result in packet loss.  |  None  | 
|  EFAErrorMetric  |  Event  |  EFA driver metrics show an interface with performance degradation.  |  None  | 
|  IPAMDInconsistentState  |  Event  |  The state of the IPAMD checkpoint on disk does not reflect the IPs in the container runtime.  |  None  | 
|  IPAMDNoIPs  |  Event  |  IPAMD is out of IP addresses.  |  None  | 
|  IPAMDNotReady  |  Condition  |  IPAMD fails to connect to the API server.  |  Replace  | 
|  IPAMDNotRunning  |  Condition  |  The Amazon VPC CNI process was not found to be running.  |  Replace  | 
|  IPAMDRepeatedlyRestart  |  Event  |  Multiple restarts in the IPAMD service have occurred.  |  None  | 
|  InterfaceNotRunning  |  Condition  |  This interface appears to not be running or there are network issues.  |  Replace  | 
|  InterfaceNotUp  |  Condition  |  This interface appears to not be up or there are network issues.  |  Replace  | 
|  KubeProxyNotReady  |  Event  |  Kube-proxy failed to watch or list resources.  |  None  | 
|  LinkLocalExceeded  |  Event  |  Packets were dropped because the PPS of traffic to local proxy services exceeded the network interface maximum.  |  None  | 
|  MACAddressPolicyMisconfigured  |  Event  |  The systemd-networkd link configuration has the incorrect `MACAddressPolicy` value.  |  None  | 
|  MissingDefaultRoutes  |  Event  |  There are missing default route rules.  |  None  | 
|  MissingIPRoutes  |  Event  |  There are missing routes for Pod IPs.  |  None  | 
|  MissingIPRules  |  Event  |  There are missing rules for Pod IPs.  |  None  | 
|  MissingLoopbackInterface  |  Condition  |  The loopback interface is missing from this instance, causing failure of services depending on local connectivity.  |  Replace  | 
|  NetworkSysctl  |  Event  |  This node’s network `sysctl` settings are potentially incorrect.  |  None  | 
|  PPSExceeded  |  Event  |  Packets have been queued or dropped because the bidirectional PPS exceeded the maximum for the instance.  |  None  | 
|  PortConflict  |  Event  |  If a Pod uses hostPort, it can write `iptables` rules that override the host’s already bound ports, potentially preventing API server access to `kubelet`.  |  None  | 
|  UnexpectedRejectRule  |  Event  |  An unexpected `REJECT` or `DROP` rule was found in the `iptables`, potentially blocking expected traffic.  |  None  | 

## Storage node health issues
<a name="node-health-Storage"></a>

The monitoring condition is `StorageReady` for issues in the following table that have a severity of “Condition”.


| Name | Severity | Description | Repair Action | 
| --- | --- | --- | --- | 
|  EBSInstanceIOPSExceeded  |  Event  |  Maximum IOPS for the instance was exceeded.  |  None  | 
|  EBSInstanceThroughputExceeded  |  Event  |  Maximum throughput for the instance was exceeded.  |  None  | 
|  EBSVolumeIOPSExceeded  |  Event  |  Maximum IOPS to a particular Amazon EBS volume was exceeded.  |  None  | 
|  EBSVolumeThroughputExceeded  |  Event  |  Maximum throughput to a particular Amazon EBS volume was exceeded.  |  None  | 
|  EtcHostsMountFailed  |  Event  |  Mounting of the kubelet generated `/etc/hosts` failed due to userdata remounting `/var/lib/kubelet/pods` during `kubelet-container` operation.  |  None  | 
|  IODelays  |  Event  |  Input or output delay detected in a process, potentially indicating insufficient input-output provisioning if excessive.  |  None  | 
|  KubeletDiskUsageSlow  |  Event  |  The `kubelet` is reporting slow disk usage while trying to access the filesystem. This potentially indicates insufficient disk input-output or filesystem issues.  |  None  | 
|  XFSSmallAverageClusterSize  |  Event  |  The XFS Average Cluster size is small, indicating excessive free space fragmentation. This can prevent file creation despite available inodes or free space.  |  None  | 

## Configure the node monitoring agent
<a name="node-monitoring-agent-configure"></a>

The EKS node monitoring agent is deployed as a DaemonSet. When you deploy it as an EKS add-on, you can customize the installation with the following configuration values. For default configurations, reference the EKS node monitoring agent [Helm chart](https://github.com/aws/eks-node-monitoring-agent/blob/main/charts/eks-node-monitoring-agent/values.yaml).


| Configuration Option | Description | 
| --- | --- | 
|   `monitoringAgent.resources.requests.cpu`   |  CPU resource request for the monitoring agent.  | 
|   `monitoringAgent.resources.requests.memory`   |  Memory resource request for the monitoring agent.  | 
|   `monitoringAgent.resources.limits.cpu`   |  CPU resource limit for the monitoring agent.  | 
|   `monitoringAgent.resources.limits.memory`   |  Memory resource limit for the monitoring agent.  | 
|   `monitoringAgent.tolerations`   |  Tolerations for scheduling the monitoring agent on tainted nodes.  | 
|   `monitoringAgent.additionalArgs`   |  Additional command-line arguments to pass to the monitoring agent.  | 

**Note**  
You can configure `hostname-override` and `verbosity` as `monitoringAgent.additionalArgs` with EKS add-ons or Helm installation. You currently cannot customize the node monitoring agent’s `probe-address` (`8002`) or `metrics-address` (`8003`) via additional args with EKS add-ons or Helm installation.
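
As an illustration of how these options might be supplied, the following sketch prints a JSON document of configuration values and shows, in comments, how it could be passed to the AWS CLI. `nma_configuration_values` is a hypothetical helper name. The nested JSON layout is an assumption based on the dotted option names in the preceding table, and the CPU, memory, and toleration values are placeholders rather than recommendations. Confirm the accepted schema with `aws eks describe-addon-configuration` before using it.

```shell
# Sketch only: prints illustrative configuration values as JSON.
# The nesting is assumed from the dotted option names; the resource values
# and the toleration are placeholders, not recommended settings.
nma_configuration_values() {
  cat <<'EOF'
{
  "monitoringAgent": {
    "resources": {
      "requests": { "cpu": "10m", "memory": "30Mi" },
      "limits": { "cpu": "250m", "memory": "100Mi" }
    },
    "tolerations": [
      { "operator": "Exists" }
    ]
  }
}
EOF
}

# Example use (requires AWS credentials and an existing cluster; my-cluster is a placeholder):
# aws eks create-addon \
#   --cluster-name my-cluster \
#   --addon-name eks-node-monitoring-agent \
#   --configuration-values "$(nma_configuration_values)"
```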

The node monitoring agent includes an NVIDIA DCGM (Data Center GPU Manager) server component (`nv-hostengine`) for monitoring NVIDIA GPUs. This component runs only on nodes that are NVIDIA GPU instance types, as shown by the `nodeAffinity` in the agent’s [Helm chart](https://github.com/aws/eks-node-monitoring-agent/blob/main/charts/eks-node-monitoring-agent/values.yaml). You cannot use an existing NVIDIA DCGM installation with the EKS node monitoring agent. If you require this functionality, provide feedback on the EKS roadmap [GitHub issue #2763](https://github.com/aws/containers-roadmap/issues/2763).

When you deploy the EKS node monitoring agent as an EKS add-on, you can customize the NVIDIA DCGM installation with the following configuration values.


| Configuration Option | Description | 
| --- | --- | 
|   `dcgmAgent.resources.requests.cpu`   |  CPU resource request for the DCGM agent.  | 
|   `dcgmAgent.resources.requests.memory`   |  Memory resource request for the DCGM agent.  | 
|   `dcgmAgent.resources.limits.cpu`   |  CPU resource limit for the DCGM agent.  | 
|   `dcgmAgent.resources.limits.memory`   |  Memory resource limit for the DCGM agent.  | 
|   `dcgmAgent.tolerations`   |  Tolerations for scheduling the DCGM agent on tainted nodes.  | 

You can use the following AWS CLI commands to get useful information about the versions and schema for the EKS node monitoring agent EKS add-on.

Get the latest agent add-on version for your Kubernetes version. Replace `1.35` with your Kubernetes version.

```
aws eks describe-addon-versions \
  --addon-name eks-node-monitoring-agent \
  --kubernetes-version 1.35 \
  --query='addons[].addonVersions[].addonVersion'
```

Get the agent add-on schema supported in EKS add-ons. Replace `v1.5.1-eksbuild.1` with your agent version.

```
aws eks describe-addon-configuration \
  --addon-name eks-node-monitoring-agent \
  --addon-version v1.5.1-eksbuild.1
```

# Automatically repair nodes in EKS clusters
<a name="node-repair"></a>

This topic details the EKS automatic node repair behavior and how to configure it to meet your requirements. EKS automatic node repair is enabled by default in EKS Auto Mode, and can be used with EKS managed node groups and Karpenter.

The default EKS automatic node repair actions are summarized in the following table. They apply to EKS Auto Mode, EKS managed node groups, and Karpenter. When using EKS Auto Mode or Karpenter, all `AcceleratedHardwareReady` repair actions are `Replace`. Only EKS managed node groups support `Reboot` as a repair action.

For a detailed list of node health issues detected by the EKS node monitoring agent and their corresponding node repair actions, see [Detect node health issues with the EKS node monitoring agent](node-health-nma.md).


| Node Condition | Description | Repair after | Repair action(s) | 
| --- | --- | --- | --- | 
|  AcceleratedHardwareReady  |  AcceleratedHardwareReady indicates whether accelerated hardware (GPU, Neuron) on the node is functioning correctly.  |  10m  |  Replace or Reboot  | 
|  ContainerRuntimeReady  |  ContainerRuntimeReady indicates whether the container runtime (containerd, etc.) is functioning correctly and able to run containers.  |  30m  |  Replace  | 
|  DiskPressure  |  DiskPressure is a standard Kubernetes condition indicating the node is experiencing disk pressure (low disk space or high I/O).  |  N/A  |  None  | 
|  KernelReady  |  KernelReady indicates whether the kernel is functioning correctly without critical errors, panics, or resource exhaustion.  |  30m  |  Replace  | 
|  MemoryPressure  |  MemoryPressure is a standard Kubernetes condition indicating the node is experiencing memory pressure (low available memory).  |  N/A  |  None  | 
|  NetworkingReady  |  NetworkingReady indicates whether the node’s networking stack is functioning correctly (interfaces, routing, connectivity).  |  30m  |  Replace  | 
|  StorageReady  |  StorageReady indicates whether the node’s storage subsystem is functioning correctly (disks, filesystems, I/O).  |  30m  |  Replace  | 
|  Ready  |  Ready is the standard Kubernetes condition indicating the node is healthy and ready to accept pods.  |  30m  |  Replace  | 

EKS automatic node repair actions are disabled in the following scenarios by default. In-progress node repair actions continue in each scenario. See [Configure automatic node repair](#configure-node-repair) for how to override these default settings.

 **EKS managed node groups** 
+ The node group has more than five nodes and more than 20% of the nodes in the node group are unhealthy.
+ A zonal shift for your cluster triggers through the Application Recovery Controller (ARC).

 **EKS Auto Mode and Karpenter** 
+ More than 20% of the nodes in the NodePool are unhealthy.
+ For standalone NodeClaims, 20% of nodes in the cluster are unhealthy.
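
The managed node group threshold above combines two checks: the group must have more than five nodes, and more than 20% of them must be unhealthy. The following sketch works through that arithmetic. `mng_repair_paused` is a hypothetical helper name for illustration; it models only the default unhealthy-node threshold, not zonal shifts or custom thresholds.

```shell
# Sketch only: models the default managed node group threshold described above.
# Usage: mng_repair_paused <total-nodes> <unhealthy-nodes>
# Prints "paused" when new repair actions would be disabled, else "active".
mng_repair_paused() {
  total="$1"
  unhealthy="$2"
  # "more than 20% unhealthy" is equivalent to unhealthy * 100 > total * 20,
  # which avoids fractional arithmetic in the shell.
  if [ "$total" -gt 5 ] && [ $((unhealthy * 100)) -gt $((total * 20)) ]; then
    echo paused
  else
    echo active
  fi
}
```

For example, a node group with 10 nodes and 2 unhealthy nodes is at exactly 20%, so `mng_repair_paused 10 2` prints `active`. With 3 unhealthy nodes, `mng_repair_paused 10 3` prints `paused`.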

## Configure automatic node repair
<a name="configure-node-repair"></a>

Automatic node repair cannot be configured when using EKS Auto Mode and it is always enabled with the same default settings as Karpenter.

### Karpenter
<a name="configure-node-repair-karpenter"></a>

To use automatic node repair with Karpenter, enable the feature gate `NodeRepair=true`. You can enable the feature gates through the `--feature-gates` CLI option or the `FEATURE_GATES` environment variable in the Karpenter deployment. For more information, see the [Karpenter documentation](https://karpenter.sh/docs/concepts/disruption/#node-auto-repair).

### Managed node groups
<a name="configure-node-repair-mng"></a>

You can enable automatic node repair when creating new EKS managed node groups or by updating existing EKS managed node groups.
+  **Amazon EKS console** – Select the **Enable node auto repair** checkbox for the managed node group. For more information, see [Create a managed node group for your cluster](create-managed-node-group.md).
+  ** AWS CLI** – Add `--node-repair-config enabled=true` to the [https://docs.aws.amazon.com/cli/latest/reference/eks/create-nodegroup.html](https://docs.aws.amazon.com/cli/latest/reference/eks/create-nodegroup.html) or [https://docs.aws.amazon.com/cli/latest/reference/eks/update-nodegroup-config.html](https://docs.aws.amazon.com/cli/latest/reference/eks/update-nodegroup-config.html) command.
+  **eksctl** – Configure `managedNodeGroups.nodeRepairConfig.enabled: true`. For an example, see the [eksctl GitHub repository](https://github.com/eksctl-io/eksctl/blob/main/examples/44-node-repair.yaml).

When using EKS managed node groups, you can control node auto repair behavior with the following settings.

To control when node auto repair stops taking action, set a threshold based on the number of unhealthy nodes in the node group. Set either the absolute count or percentage, but not both.


| Setting | Description | 
| --- | --- | 
|   `maxUnhealthyNodeThresholdCount`   |  The absolute number of unhealthy nodes above which node auto repair stops. Use this to limit the scope of repairs.  | 
|   `maxUnhealthyNodeThresholdPercentage`   |  The percentage of unhealthy nodes above which node auto repair stops (0-100).  | 

To control how many nodes repair at the same time, you can configure repair parallelism. As with the unhealthy node threshold, set either the absolute count or percentage, but not both.


| Setting | Description | 
| --- | --- | 
|   `maxParallelNodesRepairedCount`   |  The maximum number of nodes to repair concurrently.  | 
|   `maxParallelNodesRepairedPercentage`   |  The maximum percentage of unhealthy nodes to repair concurrently (0-100).  | 

With `nodeRepairConfigOverrides`, you can customize repair behavior for specific conditions. Use this when you need different repair actions or wait times for different issue types.

Each override requires all of the following fields:


| Field | Description | 
| --- | --- | 
|   `nodeMonitoringCondition`   |  The node condition type reported by the node monitoring agent. For example: `AcceleratedHardwareReady`, `NetworkingReady`, `StorageReady`, `KernelReady`.  | 
|   `nodeUnhealthyReason`   |  The specific reason code for the unhealthy condition. For example: `NvidiaXID31Error`, `IPAMDNotRunning`.  | 
|   `minRepairWaitTimeMins`   |  The minimum time in minutes that the condition must persist before the node becomes eligible for repair. Use this to avoid repairing nodes for temporary issues.  | 
|   `repairAction`   |  The action to take when conditions are met. Valid values: `Replace` (terminate and replace the node), `Reboot` (reboot the node), or `NoAction` (no repair actions).  | 

The following AWS CLI example creates a node group with custom repair settings.

```
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --node-role arn:aws:iam::111122223333:role/NodeRole \
  --subnets subnet-0123456789abcdef0 \
  --node-repair-config '{
    "enabled": true,
    "maxUnhealthyNodeThresholdPercentage": 10,
    "maxParallelNodesRepairedCount": 3,
    "nodeRepairConfigOverrides": [
      {
        "nodeMonitoringCondition": "AcceleratedHardwareReady",
        "nodeUnhealthyReason": "NvidiaXID64Error",
        "minRepairWaitTimeMins": 5,
        "repairAction": "Replace"
      },
      {
        "nodeMonitoringCondition": "AcceleratedHardwareReady",
        "nodeUnhealthyReason": "NvidiaXID31Error",
        "minRepairWaitTimeMins": 15,
        "repairAction": "NoAction"
      }
    ]
  }'
```

This configuration does the following:
+ Enables node auto repair
+ Stops repair actions when more than 10% of nodes are unhealthy
+ Repairs up to 3 nodes at a time
+ Overrides XID 64 errors (GPU memory remapping failure) to replace the node after 5 minutes. The default is reboot after 10 minutes.
+ Overrides XID 31 errors (GPU memory page fault) to take no action. The default is reboot after 10 minutes.
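
When you maintain several overrides, it can help to generate each entry from its four required fields. The following sketch builds one `nodeRepairConfigOverrides` entry and rejects repair actions other than the valid values listed earlier. `repair_override` is a hypothetical helper name, and the cluster and node group names in the commented example are placeholders.

```shell
# Sketch only: prints one nodeRepairConfigOverrides entry as JSON.
# Usage: repair_override <condition> <reason> <wait-mins> <action>
# Returns non-zero if <action> is not Replace, Reboot, or NoAction.
repair_override() {
  case "$4" in
    Replace|Reboot|NoAction) ;;
    *) echo "invalid repairAction: $4" >&2; return 1 ;;
  esac
  printf '{"nodeMonitoringCondition":"%s","nodeUnhealthyReason":"%s","minRepairWaitTimeMins":%s,"repairAction":"%s"}\n' \
    "$1" "$2" "$3" "$4"
}

# Example use against an existing node group (requires AWS credentials):
# aws eks update-nodegroup-config \
#   --cluster-name my-cluster \
#   --nodegroup-name my-nodegroup \
#   --node-repair-config "{\"enabled\": true, \"nodeRepairConfigOverrides\": [$(repair_override NetworkingReady IPAMDNotRunning 10 Replace)]}"
```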

# View the health status of your nodes
<a name="learn-status-conditions"></a>

This topic explains the tools and methods available for monitoring node health status in Amazon EKS clusters. The information covers node conditions, events, and detection cases that help you identify and diagnose node-level issues. Use the commands and patterns described here to inspect node health resources, interpret status conditions, and analyze node events for operational troubleshooting.

You can get some node health information for all nodes with Kubernetes commands. If you use the node monitoring agent through Amazon EKS Auto Mode or the Amazon EKS managed add-on, you get a wider variety of node signals to help you troubleshoot. Descriptions of the health issues detected by the node monitoring agent are also available in the observability dashboard. For more information, see [Detect node health issues with the EKS node monitoring agent](node-health-nma.md).

## Node conditions
<a name="status-node-conditions"></a>

Node conditions represent terminal issues requiring remediation actions like instance replacement or reboot.

 **To get conditions for all nodes:** 

```
kubectl get nodes -o 'custom-columns=NAME:.metadata.name,CONDITIONS:.status.conditions[*].type,STATUS:.status.conditions[*].status'
```

 **To get detailed conditions for a specific node** 

```
kubectl describe node node-name
```

 **Example condition output of a healthy node:** 

```
  - lastHeartbeatTime: "2024-11-21T19:07:40Z"
    lastTransitionTime: "2024-11-08T03:57:40Z"
    message: Monitoring for the Networking system is active
    reason: NetworkingIsReady
    status: "True"
    type: NetworkingReady
```

 **Example condition of an unhealthy node with a networking problem:** 

```
  - lastHeartbeatTime: "2024-11-21T19:12:29Z"
    lastTransitionTime: "2024-11-08T17:04:17Z"
    message: IPAM-D has failed to connect to API Server which could be an issue with
      IPTable rules or any other network configuration.
    reason: IPAMDNotReady
    status: "False"
    type: NetworkingReady
```
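
To scan many nodes at once, you can filter the JSON form of the node list for agent conditions that are not `"True"`. The following Python code is a minimal sketch. It assumes input shaped like the output of `kubectl get nodes -o json`, and it checks only the four condition types named in this guide. The function name and the sample data are illustrative.

```python
# Condition types named in this guide as reported by the node monitoring agent.
AGENT_CONDITION_TYPES = {
    "AcceleratedHardwareReady",
    "NetworkingReady",
    "StorageReady",
    "KernelReady",
}


def unhealthy_agent_conditions(node_list):
    """Return (node, type, reason) for each agent condition whose status is not "True"."""
    findings = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        for cond in node.get("status", {}).get("conditions", []):
            if cond["type"] in AGENT_CONDITION_TYPES and cond["status"] != "True":
                findings.append((name, cond["type"], cond.get("reason", "")))
    return findings


# Illustrative data shaped like `kubectl get nodes -o json` output.
sample_nodes = {
    "items": [
        {
            "metadata": {"name": "node-1"},
            "status": {
                "conditions": [
                    # Standard Kubernetes conditions are ignored by this filter.
                    {"type": "Ready", "status": "True", "reason": "KubeletReady"},
                    {"type": "MemoryPressure", "status": "False"},
                    {"type": "NetworkingReady", "status": "False",
                     "reason": "IPAMDNotReady"},
                    {"type": "StorageReady", "status": "True"},
                ]
            },
        }
    ]
}

findings = unhealthy_agent_conditions(sample_nodes)
for node_name, cond_type, reason in findings:
    print(f"{node_name}: {cond_type} is unhealthy ({reason})")
```

The filter deliberately skips standard conditions such as `MemoryPressure`, where `"False"` is the healthy value.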

## Node events
<a name="status-node-events"></a>

Node events indicate temporary issues or sub-optimal configurations.

 **To get all events reported by the node monitoring agent** 

When the node monitoring agent is available, you can run the following command.

```
kubectl get events --field-selector=reportingComponent=eks-node-monitoring-agent
```

Sample output:

```
LAST SEEN   TYPE      REASON       OBJECT                                              MESSAGE
4s          Warning   SoftLockup   node/ip-192-168-71-251.us-west-2.compute.internal   CPU stuck for 23s
```

 **To get events for all nodes** 

```
kubectl get events --field-selector involvedObject.kind=Node
```

 **To get events for a specific node** 

```
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=node-name
```

 **To watch events in real-time** 

```
kubectl get events -w --field-selector involvedObject.kind=Node
```

 **Example event output:** 

```
LAST SEEN   TYPE     REASON           OBJECT         MESSAGE
2m          Warning  MemoryPressure   Node/node-1    Node experiencing memory pressure
5m          Normal   NodeReady        Node/node-1    Node became ready
```
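
When a cluster produces many events, a summary per node and reason is easier to read than the raw list. The following Python code is a minimal sketch. It assumes input shaped like the output of `kubectl get events -o json`. The function name and the sample events are illustrative.

```python
from collections import Counter


def warning_counts(event_list, component=None):
    """Count Warning events per (node name, reason), optionally for one reporter."""
    counts = Counter()
    for event in event_list.get("items", []):
        involved = event.get("involvedObject", {})
        if involved.get("kind") != "Node" or event.get("type") != "Warning":
            continue
        if component and event.get("reportingComponent") != component:
            continue
        counts[(involved.get("name"), event.get("reason"))] += 1
    return counts


# Illustrative data shaped like `kubectl get events -o json` output.
sample_events = {
    "items": [
        {"type": "Warning", "reason": "SoftLockup",
         "reportingComponent": "eks-node-monitoring-agent",
         "involvedObject": {"kind": "Node", "name": "node-1"}},
        {"type": "Warning", "reason": "SoftLockup",
         "reportingComponent": "eks-node-monitoring-agent",
         "involvedObject": {"kind": "Node", "name": "node-1"}},
        {"type": "Normal", "reason": "NodeReady",
         "reportingComponent": "kubelet",
         "involvedObject": {"kind": "Node", "name": "node-1"}},
    ]
}

agent_counts = warning_counts(sample_events, component="eks-node-monitoring-agent")
for (node_name, reason), count in agent_counts.items():
    print(f"{node_name}: {reason} x{count}")
```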

## Common troubleshooting commands
<a name="status-node-troubleshooting"></a>

```
# Get comprehensive node status
kubectl get node node-name -o yaml

# Watch node status changes
kubectl get nodes -w

# Get node metrics
kubectl top node
```

# Retrieve node logs for a managed node using kubectl and S3
<a name="auto-get-logs"></a>

Learn how to retrieve node logs for an Amazon EKS managed node that has the node monitoring agent.

## Prerequisites
<a name="_prerequisites"></a>

Make sure you have the following:
+ An existing Amazon EKS cluster with the node monitoring agent. For more information, see [Detect node health issues and enable automatic node repair](node-health.md).
+ The `kubectl` command-line tool installed and configured to communicate with your cluster.
+ The AWS CLI installed and logged in with sufficient permissions to create S3 buckets and objects.
+ A recent version of Python 3 installed.
+ The AWS SDK for Python 3, Boto 3, installed.

## Step 1: Create S3 bucket destination (optional)
<a name="_step_1_create_s3_bucket_destination_optional"></a>

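If you don’t already have an S3 bucket to store the logs, create one. Use the following AWS CLI command. The bucket defaults to the `private` access control list. Replace *bucket-name* with your chosen unique bucket name. In Regions other than `us-east-1`, you must also pass `--region <region>` and `--create-bucket-configuration LocationConstraint=<region>`.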

```
aws s3api create-bucket --bucket <bucket-name>
```

## Step 2: Create pre-signed S3 URL for HTTP Put
<a name="_step_2_create_pre_signed_s3_url_for_http_put"></a>

Amazon EKS returns the node logs by making an HTTP PUT request to a URL that you specify. In this tutorial, you generate a pre-signed S3 HTTP PUT URL.

The logs will be returned as a gzip tarball, with the `.tar.gz` extension.

**Note**  
You must use the AWS API or an SDK to create the pre-signed S3 upload URL for EKS to upload the log file. You cannot create a pre-signed S3 upload URL using the AWS CLI.

1. Determine where in the bucket you want to store the logs. For example, you might use *2024-11-12/logs1.tar.gz* as the key.

1. Save the following Python code to the file *presign-upload.py*. Replace *<bucket-name>* and *<key>*. The key should end with `.tar.gz`.

   ```
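   import boto3

   # Generate a pre-signed URL that permits one HTTP PUT to the given key.
   url = boto3.client('s3').generate_presigned_url(
       ClientMethod='put_object',
       Params={'Bucket': '<bucket-name>', 'Key': '<key>'},
       ExpiresIn=1000,  # URL lifetime in seconds
   )
   print(url)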
   ```

1. Run the script:

   ```
   python presign-upload.py
   ```

1. Note the URL output. Use this value in the next step as the *http-put-destination*.

For more information, see [Generate a presigned URL to upload a file](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-presigned-urls.html#generating-a-presigned-url-to-upload-a-file) in the AWS Boto3 SDK for Python Documentation.

## Step 3: Create NodeDiagnostic resource
<a name="_step_3_create_nodediagnostic_resource"></a>

Identify the name of the node you want to collect logs from.

Create a `NodeDiagnostic` manifest that uses the node name as the resource name and provides the HTTP PUT URL as the destination. Save the manifest as *nodediagnostic.yaml*.

```
apiVersion: eks.amazonaws.com/v1alpha1
kind: NodeDiagnostic
metadata:
    name: <node-name>
spec:
    logCapture:
        destination: http-put-destination
```

Apply the manifest to the cluster.

```
kubectl apply -f nodediagnostic.yaml
```
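
If you collect logs from several nodes, you can generate the manifest text instead of editing it by hand. The following Python code is a minimal sketch. The function name `render_node_diagnostic` is hypothetical, and the node name and URL in the example are placeholders. The manifest fields match the example above.

```python
import json


def render_node_diagnostic(node_name, destination):
    """Return a NodeDiagnostic log-capture manifest as YAML text."""
    # json.dumps produces a double-quoted string, which is also a valid YAML
    # scalar. Quoting keeps special characters in the pre-signed URL, such as
    # '&' and '%', from being misread.
    return (
        "apiVersion: eks.amazonaws.com/v1alpha1\n"
        "kind: NodeDiagnostic\n"
        "metadata:\n"
        f"    name: {json.dumps(node_name)}\n"
        "spec:\n"
        "    logCapture:\n"
        f"        destination: {json.dumps(destination)}\n"
    )


# Placeholder values for illustration only.
manifest = render_node_diagnostic(
    "example-node",
    "https://example-bucket.s3.amazonaws.com/2024-11-12/logs1.tar.gz?X-Amz-Expires=1000",
)
print(manifest)
```

You can write the returned text to *nodediagnostic.yaml* and apply it with the `kubectl apply` command shown above.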

You can check the status of the collection by describing the `NodeDiagnostic` resource:
+ A status of `Success` or `SuccessWithErrors` indicates that the task completed and the logs were uploaded to the provided destination. `SuccessWithErrors` indicates that some logs might be missing.
+ If the status is `Failure`, confirm that the upload URL is well-formed and has not expired.

```
kubectl describe nodediagnostics.eks.amazonaws.com/<node-name>
```
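
To rule out an expired upload URL, you can read the expiry time from the URL's query string. The following Python code is a minimal sketch. It assumes the URL carries either the Signature Version 4 parameters `X-Amz-Date` and `X-Amz-Expires` or the older Signature Version 2 parameter `Expires`. The function names and the example URL are illustrative.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def presigned_url_expiry(url):
    """Return the expiry time (UTC) encoded in a pre-signed S3 URL, or None."""
    query = parse_qs(urlparse(url).query)
    if "X-Amz-Date" in query and "X-Amz-Expires" in query:
        # Signature Version 4: signing time plus a lifetime in seconds.
        signed_at = datetime.strptime(query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
        signed_at = signed_at.replace(tzinfo=timezone.utc)
        return signed_at + timedelta(seconds=int(query["X-Amz-Expires"][0]))
    if "Expires" in query:
        # Signature Version 2: absolute expiry as a Unix timestamp.
        return datetime.fromtimestamp(int(query["Expires"][0]), tz=timezone.utc)
    return None


def is_expired(url, now=None):
    """Return True if the pre-signed URL has passed its expiry time."""
    expiry = presigned_url_expiry(url)
    if expiry is None:
        raise ValueError("no expiry information found in URL")
    return (now or datetime.now(timezone.utc)) >= expiry


# Illustrative URL; the signature value is a placeholder.
example_url = (
    "https://example-bucket.s3.amazonaws.com/2024-11-12/logs1.tar.gz"
    "?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20241112T120000Z"
    "&X-Amz-Expires=1000&X-Amz-SignedHeaders=host&X-Amz-Signature=example"
)
print(presigned_url_expiry(example_url))
```

If the URL has expired, generate a new one and recreate the `NodeDiagnostic` resource with the new destination.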

## Step 4: Download logs from S3
<a name="_step_4_download_logs_from_s3"></a>

Wait approximately one minute before attempting to download the logs. Then, use the AWS CLI to download the logs from S3.

```
# Once NodeDiagnostic shows Success status, download the logs
aws s3 cp s3://<bucket-name>/<key> ./<path-to-node-logs>.tar.gz
```
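
Before you extract the archive, you can list its contents to confirm that the download is complete and readable. The following Python code is a minimal sketch that uses the standard `tarfile` module. The function name is hypothetical, and the sketch makes no assumption about which files the archive contains.

```python
import tarfile


def list_log_archive(path):
    """Return (name, size in bytes) for each regular file in a .tar.gz archive."""
    with tarfile.open(path, mode="r:gz") as archive:
        return [(member.name, member.size)
                for member in archive.getmembers() if member.isfile()]
```

For example, `list_log_archive("<path-to-node-logs>.tar.gz")` returns one tuple per file. A `tarfile.ReadError` usually means that the file is truncated or is not a gzip tarball.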

## Step 5: Clean up NodeDiagnostic resource
<a name="_step_5_clean_up_nodediagnostic_resource"></a>

 `NodeDiagnostic` resources are not automatically deleted. Clean them up after you have obtained your log artifacts.

```
# Delete the NodeDiagnostic resource
kubectl delete nodediagnostics.eks.amazonaws.com/<node-name>
```

## NodeDiagnostic `node` destination
<a name="_nodediagnostic_node_destination"></a>

Starting with version `v1.6.1-eksbuild.1` of the node monitoring agent, you can set the log collection destination to `node`. With this destination, the agent collects the logs and temporarily persists them on the node for later retrieval. The node monitoring agent’s GitHub repository also includes a `kubectl` plugin that you can install to simplify log collection. For more information, see the [documentation for the `kubectl ekslogs` plugin](https://github.com/aws/eks-node-monitoring-agent/blob/main/tools/kubectl-ekslogs/README.md).

## Example usage
<a name="_example_usage"></a>

```
# Collect NodeDiagnostic logs from a single node
kubectl ekslogs <node-name>

# Collect NodeDiagnostic logs from multiple nodes
kubectl ekslogs <node-name-1> <node-name-2> <node-name-3>

# Collect NodeDiagnostic logs from all nodes with a specific label
kubectl ekslogs -l <key>=<value>
```

# Capture network traffic on a managed node using kubectl and S3
<a name="auto-get-tcpdump"></a>

Learn how to capture network traffic on an Amazon EKS managed node that has the node monitoring agent. The agent runs tcpdump on the node, compresses capture files, and uploads them to your S3 bucket.

## Prerequisites
<a name="_prerequisites"></a>

Make sure you have the following:
+ An existing Amazon EKS Auto Mode cluster with the node monitoring agent. For more information, see [Detect node health issues and enable automatic node repair](node-health.md).
+ The `kubectl` command-line tool installed and configured to communicate with your cluster.
+ The AWS CLI installed and logged in with sufficient permissions to create S3 buckets and objects.
+ A recent version of Python 3 installed.
+ The AWS SDK for Python 3, Boto 3, installed.
+ The PyYAML library installed (`pip install pyyaml`).

## Step 1: Create S3 bucket destination (optional)
<a name="_step_1_create_s3_bucket_destination_optional"></a>

If you don’t already have an S3 bucket to store the capture files, create one. Replace *bucket-name* and *region* with your values.

```
aws s3api create-bucket --bucket <bucket-name> \
    --region <region> \
    --create-bucket-configuration LocationConstraint=<region>
```

**Note**  
The `--create-bucket-configuration` parameter is required for all regions except `us-east-1`.
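
If you create the bucket from Python instead of the AWS CLI, the same Region rule applies to the Boto3 `create_bucket` call. The following code is a minimal sketch of the conditional. The helper name `create_bucket_kwargs` is hypothetical, and it only builds the arguments without calling AWS.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for the Boto3 S3 client's create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        # Every Region other than us-east-1 needs an explicit LocationConstraint.
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(create_bucket_kwargs("example-bucket", "us-west-2"))
print(create_bucket_kwargs("example-bucket", "us-east-1"))
```

You would then pass the result to the client, for example `boto3.client("s3", region_name=region).create_bucket(**create_bucket_kwargs(bucket_name, region))`.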

## Step 2: Start packet capture
<a name="_step_2_start_packet_capture"></a>

Use the `start-capture.py` script from the [node monitoring agent repository](https://github.com/aws/eks-node-monitoring-agent) (`tools/start-capture.py`) to generate pre-signed S3 credentials, create the `NodeDiagnostic` resource, and apply it to your cluster.

1. Identify the node you want to capture traffic from.

   ```
   kubectl get nodes
   ```

1. Save the [start-capture.py](https://github.com/aws/eks-node-monitoring-agent/blob/main/tools/start-capture.py) script from the node monitoring agent repository to your local machine, then run it. Replace *<bucket-name>* and *<node-name>* with your values.

   ```
   python3 start-capture.py --bucket <bucket-name> --node <node-name>
   ```

   Common options:

   ```
   # Capture for 5 minutes on eth0 with a filter
   python3 start-capture.py --bucket <bucket-name> --node <node-name> \
       --duration 5m --interface eth0 --filter "tcp port 443"
   
   # Preview the YAML without applying
   python3 start-capture.py --bucket <bucket-name> --node <node-name> --dry-run
   ```

   The script requires Python 3 with `boto3` and `pyyaml` installed, and `kubectl` configured for your cluster.

   The script generates a `NodeDiagnostic` resource like the following. This example is provided for reference; note that the `upload` fields require pre-signed S3 POST credentials that are generated programmatically by the script.

   ```
   apiVersion: eks.amazonaws.com/v1alpha1
   kind: NodeDiagnostic
   metadata:
     name: <node-name>                    # Required: node instance ID
   spec:
     packetCapture:
       duration: "30s"                       # Required: capture duration (max 1h)
       # interface: "eth0"                   # Optional: default is primary ENI. Use "any" for all interfaces
       # filter: "tcp port 443"             # Optional: tcpdump filter expression
       # chunkSizeMB: 10                    # Optional: file rotation size in MB (1-100, default: 10)
       upload:                               # Required: pre-signed S3 POST credentials
         url: "https://<bucket>.s3.amazonaws.com/"
         fields:
           key: "captures/<node-name>/${filename}"
           # ... other pre-signed POST fields (generated by the script)
   ```
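
To illustrate how the pre-signed POST credentials fit into the resource, the following Python code assembles the same structure as a dictionary. This is a minimal sketch and not the implementation used by `start-capture.py`. The function name is hypothetical. The `presigned_post` argument is assumed to have the shape that Boto3's `generate_presigned_post` returns, which is a dictionary with `url` and `fields` keys. The field names and the 1-100 range for `chunkSizeMB` come from the reference manifest above.

```python
def build_packet_capture(node_name, presigned_post, duration="30s",
                         interface=None, capture_filter=None, chunk_size_mb=None):
    """Assemble a NodeDiagnostic packet-capture resource as a Python dict."""
    if chunk_size_mb is not None and not 1 <= chunk_size_mb <= 100:
        raise ValueError("chunkSizeMB must be between 1 and 100")
    capture = {
        "duration": duration,
        "upload": {
            "url": presigned_post["url"],
            "fields": dict(presigned_post["fields"]),
        },
    }
    # Optional settings are added only when the caller provides them.
    if interface is not None:
        capture["interface"] = interface
    if capture_filter is not None:
        capture["filter"] = capture_filter
    if chunk_size_mb is not None:
        capture["chunkSizeMB"] = chunk_size_mb
    return {
        "apiVersion": "eks.amazonaws.com/v1alpha1",
        "kind": "NodeDiagnostic",
        "metadata": {"name": node_name},
        "spec": {"packetCapture": capture},
    }


# Placeholder credentials for illustration; real values come from a signing call.
example_post = {
    "url": "https://example-bucket.s3.amazonaws.com/",
    "fields": {"key": "captures/example-node/${filename}"},
}
resource = build_packet_capture("example-node", example_post,
                                duration="5m", capture_filter="tcp port 443")
print(resource["spec"]["packetCapture"]["duration"])
```

With PyYAML installed, you can serialize the dictionary with `yaml.safe_dump(resource)` and pass the result to `kubectl apply -f -`.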

## Step 3: Monitor capture progress
<a name="_step_3_monitor_capture_progress"></a>

Check the status of the capture.

```
kubectl describe nodediagnostic <node-name>
```

The status will show:
+  `Running` while the capture is in progress.
+  `Completed` with reason `Success` when the capture finishes and all files are uploaded.
+  `Completed` with reason `Failure` if the capture encountered errors.

To see the full status including `captureID` (used for S3 path identification):

```
kubectl get nodediagnostic <node-name> -o jsonpath='{.status.captureStatuses}'
```

## Step 4: Download capture files from S3
<a name="_step_4_download_capture_files_from_s3"></a>

Once the status shows `Success`, download the capture files from S3.

```
aws s3 cp s3://<bucket-name>/captures/ ./captures/ --recursive
```

The files are in gzip-compressed pcap format. Decompress them and analyze them with tcpdump or Wireshark:

```
gunzip captures/*.gz
tcpdump -r captures/capture.pcap0000 -n
```
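
If tcpdump is not available on your workstation, you can still confirm that a downloaded file is a valid capture by reading its header. The following Python code is a minimal sketch. It assumes that the files use the classic pcap layout with a 24-byte global header, and it does not handle the pcapng format. The function name is hypothetical.

```python
import gzip
import struct

# Magic numbers for classic pcap files, keyed to their timestamp resolution.
PCAP_MAGICS = {0xA1B2C3D4: "microsecond", 0xA1B23C4D: "nanosecond"}


def read_pcap_header(path):
    """Read the global header of a classic pcap file, gzip-compressed or not."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as handle:
        header = handle.read(24)
    if len(header) < 24:
        raise ValueError("file is too short to be a pcap capture")
    # The magic number reveals the byte order, so try both.
    for byte_order in ("<", ">"):
        magic, major, minor, _zone, _sigfigs, snaplen, linktype = struct.unpack(
            byte_order + "IHHiIII", header
        )
        if magic in PCAP_MAGICS:
            return {
                "version": (major, minor),
                "timestamp_resolution": PCAP_MAGICS[magic],
                "snaplen": snaplen,
                "linktype": linktype,
            }
    raise ValueError("not a classic pcap file")
```

A `ValueError` from this function suggests that the file is incomplete or is not a pcap capture.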

## Step 5: Clean up
<a name="_step_5_clean_up"></a>

 `NodeDiagnostic` resources are not automatically deleted. Clean up after you have obtained your capture files. Deleting the resource while a capture is running will stop the capture immediately.

```
kubectl delete nodediagnostic <node-name>
```

## Configuration options and behavior
<a name="_configuration_options_and_behavior"></a>

For the full `packetCapture` spec reference, configuration options, and behavior details, see the [packet capture documentation](https://github.com/aws/eks-node-monitoring-agent/blob/main/docs/packet-capture.adoc) in the node monitoring agent repository.