This topic covers some common errors that you might see while using local clusters and how to troubleshoot them. Local clusters are similar to Amazon EKS clusters in the cloud, but there are some differences in how they’re managed by Amazon EKS.
Important
Never terminate any managed EKS local cluster Kubernetes control-plane instance running on an Outpost unless explicitly instructed by AWS Support. Terminating these instances poses a risk to local cluster service availability, including loss of the local cluster if multiple instances are terminated at the same time. EKS local cluster Kubernetes control-plane instances are identified by the tag eks-local:controlplane-name in the Amazon EC2 instance console.
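If you need to confirm which Amazon EC2 instances belong to a local cluster control plane before performing any maintenance, you can filter on that tag with the AWS CLI. The following is a minimal sketch; the query and output format are just one way to list the instance IDs and their states.
# List instances that carry the eks-local:controlplane-name tag.
aws ec2 describe-instances \
    --filters "Name=tag-key,Values=eks-local:controlplane-name" \
    --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
    --output table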
Local clusters are created through the Amazon EKS API, but are run in an asynchronous manner. This means that requests to the Amazon EKS API return immediately for local clusters. However, these requests might succeed, fail fast because of input validation errors, or fail and have descriptive validation errors. This behavior is similar to the Kubernetes API.
Local clusters don’t transition to a FAILED status. Amazon EKS attempts to reconcile the cluster state with the user-requested desired state in a continuous manner. As a result, a local cluster might remain in the CREATING state for an extended period of time until the underlying issue is resolved.
Local cluster issues can be discovered with the Amazon EKS describe-cluster AWS CLI command. Local cluster issues are surfaced by the cluster.health field of the describe-cluster command’s response. The message contained in this field includes an error code, a descriptive message, and related resource IDs. This information is available through the Amazon EKS API and AWS CLI only. In the following example, replace my-cluster with the name of your local cluster.
aws eks describe-cluster --name my-cluster --query 'cluster.health'
An example output is as follows.
{
    "issues": [
        {
            "code": "ConfigurationConflict",
            "message": "The instance type 'm5.large' is not supported in Outpost 'my-outpost-arn'.",
            "resourceIds": [
                "my-cluster-arn"
            ]
        }
    ]
}
If the problem is beyond repair, you might need to delete the local cluster and create a new one. For example, this can happen if you tried to provision a cluster with an instance type that’s not available on your Outpost. The following table includes common health-related errors.
| Error scenario | Code | Message | ResourceIds |
|---|---|---|---|
| Provided subnets couldn’t be found. | | | All provided subnet IDs |
| Provided subnets don’t belong to the same VPC. | | | All provided subnet IDs |
| Some provided subnets don’t belong to the specified Outpost. | | | Problematic subnet ID |
| Some provided subnets don’t belong to any Outpost. | | | Problematic subnet ID |
| Some provided subnets don’t have enough free addresses to create elastic network interfaces for control plane instances. | | | Problematic subnet ID |
| The specified control plane instance type isn’t supported on your Outpost. | | | Cluster ARN |
| You terminated a control plane Amazon EC2 instance or | | | Cluster ARN |
| You have insufficient capacity on your Outpost. This can also happen when a cluster is being created if an Outpost is disconnected from the AWS Region. | | | Cluster ARN |
| Your account exceeded your security group quota. | | Error message returned by Amazon EC2 API | Target VPC ID |
| Your account exceeded your elastic network interface quota. | | Error message returned by Amazon EC2 API | Target subnet ID |
| Control plane instances weren’t reachable through AWS Systems Manager. For resolution, see Control plane instances aren’t reachable through AWS Systems Manager. | | Amazon EKS control plane instances are not reachable through SSM. Please verify your SSM and network configuration, and reference the EKS on Outposts troubleshooting documentation. | Amazon EC2 instance IDs |
| An error occurred while getting details for a managed security group or elastic network interface. | Based on Amazon EC2 client error code. | Error message returned by Amazon EC2 API | All managed security group IDs |
| An error occurred while authorizing or revoking security group ingress rules. This applies to both the cluster and control plane security groups. | Based on Amazon EC2 client error code. | Error message returned by Amazon EC2 API | Problematic security group ID |
| An error occurred while deleting an elastic network interface for a control plane instance. | Based on Amazon EC2 client error code. | Error message returned by Amazon EC2 API | Problematic elastic network interface ID |
The following table lists errors from other AWS services that are presented in the health field of the describe-cluster response.

| Amazon EC2 error code | Cluster health issue code | Description |
|---|---|---|
| | | This error can occur for a variety of reasons. The most common reason is that you accidentally removed a tag that the service uses to scope down the service linked role policy from the control plane. If this occurs, Amazon EKS can no longer manage and monitor these AWS resources. |
| | | This error can occur for a variety of reasons. The most common reason is that you accidentally removed a tag that the service uses to scope down the service linked role policy from the control plane. If this occurs, Amazon EKS can no longer manage and monitor these AWS resources. |
| | | This error occurs when the subnet ID for the ingress rules of a security group can’t be found. |
| | | This error occurs when the permissions for the ingress rules of a security group aren’t correct. |
| | | This error occurs when the group of the ingress rules of a security group can’t be found. |
| | | This error occurs when the network interface ID for the ingress rules of a security group can’t be found. |
| | | This error occurs when the subnet resource quota is exceeded. |
| | | This error occurs when the Outpost capacity quota is exceeded. |
| | | This error occurs when the elastic network interface quota is exceeded. |
| | | This error occurs when the security group quota is exceeded. |
| | | This is observed when creating an Amazon EC2 instance in a new account. The error might be similar to the following: " |
| | | Amazon EC2 returns this error code if the specified instance type isn’t supported on the Outpost. |
| All other failures | | None |
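Several of the descriptions in the preceding table point to missing tags on the control plane resources as a common cause. One way to investigate this is to list the tags that are currently attached to a control plane instance with the AWS CLI. The following is a sketch; the instance ID is a placeholder that you replace with one of your control plane instance IDs.
# List all tags on a control plane instance (replace the example instance ID).
aws ec2 describe-tags \
    --filters "Name=resource-id,Values=i-1234567890abcdef0"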
Local clusters require different permissions and policies than Amazon EKS clusters that are hosted in the cloud. When a cluster fails to create and produces an InvalidPermissions error, double-check that the cluster role that you’re using has the AmazonEKSLocalOutpostClusterPolicy managed policy attached to it. All other API calls require the same set of permissions as Amazon EKS clusters in the cloud.
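To verify that the policy is still attached, you can list the managed policies on the cluster role with the AWS CLI. The following is a minimal sketch; my-cluster-role is a placeholder for your own cluster role name, and the output should include AmazonEKSLocalOutpostClusterPolicy.
# List the managed policies attached to the cluster role (replace the role name).
aws iam list-attached-role-policies --role-name my-cluster-role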
The amount of time it takes to create a local cluster varies depending on several factors. These factors include your network configuration, Outpost configuration, and the cluster’s configuration. In general, a local cluster is created and changes to the ACTIVE status within 15–20 minutes. If a local cluster remains in the CREATING state, you can call describe-cluster for information about the cause in the cluster.health output field.
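For example, the following call returns both the current status and any reported health issues in one response. This is a sketch; the --query expression is just one way to narrow the output.
aws eks describe-cluster --name my-cluster --query '{status: cluster.status, health: cluster.health}'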
The most common issues are the following:

- Your cluster can’t connect to the control plane instance from the AWS Region that Systems Manager is in. You can verify this by calling aws ssm start-session --target instance-id from an in-Region bastion host. If that command doesn’t work, check whether Systems Manager is running on the control plane instance. Another workaround is to delete the cluster and then recreate it.
- The control plane instances fail to create due to KMS key permissions for EBS volumes. With user-managed KMS keys for encrypted EBS volumes, the control plane instances terminate if the key is not accessible. If the instances are terminated, either switch to an AWS managed KMS key or make sure that your user-managed key policy grants the necessary permissions to the cluster role.
- Systems Manager control plane instances might not have internet access. Check whether the subnet that you provided when you created the cluster has a NAT gateway and a VPC with an internet gateway. Use VPC Reachability Analyzer to verify that the control plane instance can reach the internet gateway (see the example commands after this list). For more information, see Getting started with VPC Reachability Analyzer.
- The role ARN that you provided is missing policies. Check whether the AWS managed policy AmazonEKSLocalOutpostClusterPolicy was removed from the role. This can also occur if an AWS CloudFormation stack is misconfigured.
- All of the provided subnets must be associated with the same Outpost and must reach each other. When multiple subnets are specified when a cluster is created, Amazon EKS attempts to spread the control plane instances across multiple subnets.
- The Amazon EKS managed security groups are applied at the elastic network interface. However, other configuration elements such as NACL firewall rules might conflict with the rules for the elastic network interface.
- VPC and subnet DNS configuration is misconfigured or missing. Review Create a VPC and subnets for Amazon EKS clusters on AWS Outposts.
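As a sketch of the Reachability Analyzer check mentioned in the list above, the following AWS CLI commands create a path from a control plane instance to the internet gateway and then run an analysis. The instance, internet gateway, and path IDs are placeholders that you replace with your own values.
# Define a reachability path from the control plane instance to the internet gateway.
aws ec2 create-network-insights-path \
    --source i-1234567890abcdef0 \
    --destination igw-0a1b2c3d4e5f67890 \
    --protocol tcp

# Run the analysis, using the path ID returned by the previous command.
aws ec2 start-network-insights-analysis \
    --network-insights-path-id nip-0123456789abcdef0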
Amazon EKS automatically updates all existing local clusters to the latest platform versions for their corresponding Kubernetes minor version. For more information about platform versions, see Learn Kubernetes and Amazon EKS platform versions for AWS Outposts.
During an automatic platform version rollout, the cluster status changes to UPDATING. The update process consists of replacing all Kubernetes control-plane instances with new ones that contain the latest security patches and bug fixes released for the respective Kubernetes minor version. In general, a local cluster platform update completes in less than 30 minutes, and the cluster changes back to the ACTIVE status. If a local cluster remains in the UPDATING state for an extended period of time, you can call describe-cluster to check for information about the cause in the cluster.health output field.
Amazon EKS ensures that at least two out of three Kubernetes control-plane instances are healthy and operational in order to maintain local cluster availability and prevent service interruption. If a local cluster is stalled in the UPDATING state, it’s usually because an infrastructure or configuration issue prevents that minimum of two healthy instances from being guaranteed if the process continues. The update process stops progressing to protect the local cluster from service interruption.
It’s important to troubleshoot a local cluster that’s stuck in the UPDATING status and address the root cause so that the update process can complete and restore the local cluster to ACTIVE with the high availability of three Kubernetes control-plane instances.
Do not terminate any managed EKS local cluster Kubernetes instances on Outposts unless explicitly instructed by AWS Support. This is especially important for local clusters stuck in the UPDATING state, because there’s a high probability that another control-plane node is not completely healthy, and terminating the wrong instance could cause service interruption and risk local cluster data loss.
The most common issues are the following:

- One or more control-plane instances are unable to connect to Systems Manager because of a networking configuration change since the local cluster was first created. You can verify this by calling aws ssm start-session --target instance-id from an in-Region bastion host. If that command doesn’t work, check whether Systems Manager is running on the control plane instance.
- New control plane instances fail to be created due to KMS key permissions for EBS volumes. With user-managed KMS keys for encrypted EBS volumes, the control plane instances terminate if the key is not accessible. If the instances are terminated, either switch to an AWS managed KMS key or make sure that your user-managed key policy grants the necessary permissions to the cluster role.
- Systems Manager control plane instances might have lost internet access. Check whether the subnet that was provided when you created the cluster has a NAT gateway and a VPC with an internet gateway. Use VPC Reachability Analyzer to verify that the control plane instance can reach the internet gateway. For more information, see Getting started with VPC Reachability Analyzer. If your private networks don’t have an outbound internet connection, make sure that all of the required VPC endpoints and the gateway endpoint are still present in the Regional subnet of your cluster (see Subnet access to AWS services and the example command after this list).
- The role ARN that you provided is missing policies. Check that the AWS managed policy AmazonEKSLocalOutpostClusterPolicy was not removed from the role.
- One of the new Kubernetes control-plane instances might have experienced an unexpected bootstrapping failure. File a ticket with AWS Support Center for further guidance on troubleshooting and log collection in this exceptional case.
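As a sketch of the VPC endpoint check mentioned in the list above, the following AWS CLI command lists the endpoints that still exist in the cluster’s VPC so that you can compare them against the required list. The VPC ID is a placeholder that you replace with your own.
# List the VPC endpoints in the cluster VPC (replace the example VPC ID).
aws ec2 describe-vpc-endpoints \
    --filters "Name=vpc-id,Values=vpc-0a1b2c3d4e5f67890" \
    --query 'VpcEndpoints[].[ServiceName,State]' \
    --output table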
If your nodes fail to join the cluster, the most common causes are the following:

- AMI issues:
  - You’re using an unsupported AMI. You must use v20220620 or later of the Amazon EKS optimized Amazon Linux AMI. For more information, see Create nodes with optimized Amazon Linux AMIs.
  - If you used an AWS CloudFormation template to create your nodes, make sure it wasn’t using an unsupported AMI.
- Missing the AWS IAM Authenticator ConfigMap – If it’s missing, you must create it. For more information, see Apply the aws-auth ConfigMap to your cluster.
- The wrong security group is used – Make sure to use eks-cluster-sg-cluster-name-uniqueid for your worker nodes' security group. The selected security group is changed by AWS CloudFormation to allow a new security group each time the stack is used. You can list the security groups that follow this naming convention with the example command after this list.
- Following unexpected private link VPC steps – Wrong CA data (--b64-cluster-ca) or API endpoint (--apiserver-endpoint) is passed.
- Misconfigured Pod security policy:
  - The CoreDNS and Amazon VPC CNI plugin for Kubernetes DaemonSets must run on nodes for nodes to join and communicate with the cluster.
  - The Amazon VPC CNI plugin for Kubernetes requires some privileged networking features to work properly. You can view the privileged networking features with the following command: kubectl describe psp eks.privileged.

  We don’t recommend modifying the default pod security policy. For more information, see Understand Amazon EKS created Pod security policies (PSP).
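As a sketch of the security group check mentioned in the list above, the following AWS CLI command lists the security groups in a VPC whose names follow the eks-cluster-sg- naming convention. The VPC ID is a placeholder; replace it with your cluster’s VPC ID.
# Find security groups that follow the eks-cluster-sg-* naming convention (replace the example VPC ID).
aws ec2 describe-security-groups \
    --filters "Name=vpc-id,Values=vpc-0a1b2c3d4e5f67890" "Name=group-name,Values=eks-cluster-sg-*" \
    --query 'SecurityGroups[].[GroupId,GroupName]' \
    --output table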
When an Outpost gets disconnected from the AWS Region that it’s associated with, the Kubernetes cluster likely will continue working normally. However, if the cluster doesn’t work properly, follow the troubleshooting steps in Prepare local Amazon EKS clusters on AWS Outposts for network disconnects. If you encounter other issues, contact AWS Support. AWS Support can guide you on downloading and running a log collection tool. That way, you can collect logs from your Kubernetes cluster control plane instances and send them to AWS Support for further investigation.
When the Amazon EKS control plane instances aren’t reachable through AWS Systems Manager (Systems Manager), Amazon EKS displays the following error for your cluster.
Amazon EKS control plane instances are not reachable through SSM. Please verify your SSM and network configuration, and reference the EKS on Outposts troubleshooting documentation.
To resolve this issue, make sure that your VPC and subnets meet the requirements in Create a VPC and subnets for Amazon EKS clusters on AWS Outposts and that you completed the steps in Setting up Session Manager in the AWS Systems Manager User Guide.
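One way to confirm whether a control plane instance is registered with Systems Manager is to list it as a managed instance with the AWS CLI. The following is a sketch; the instance ID is a placeholder, and an instance that is reachable typically appears with a ping status of Online.
# Check whether a control plane instance is registered with Systems Manager (replace the example instance ID).
aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=i-1234567890abcdef0" \
    --query 'InstanceInformationList[].[InstanceId,PingStatus]' \
    --output table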