Troubleshoot local Amazon EKS clusters on AWS Outposts
This topic covers some common errors that you might see while using local clusters and how to troubleshoot them. Local clusters are similar to Amazon EKS clusters in the cloud, but there are some differences in how they’re managed by Amazon EKS.
Local clusters are created through the Amazon EKS API, but they run asynchronously. This means that requests to the Amazon EKS API return immediately for local clusters. However, these requests might succeed, fail fast because of input validation errors, or fail and report descriptive validation errors. This behavior is similar to the Kubernetes API.
Local clusters don’t transition to a FAILED status. Amazon EKS continuously attempts to reconcile the cluster state with the user-requested desired state. As a result, a local cluster might remain in the CREATING state for an extended period of time until the underlying issue is resolved.
You can discover local cluster issues with the describe-cluster Amazon EKS AWS CLI command. Issues are surfaced in the cluster.health field of the describe-cluster command’s response. The message contained in this field includes an error code, a descriptive message, and related resource IDs. This information is available through the Amazon EKS API and AWS CLI only. In the following example, replace my-cluster with the name of your local cluster.
aws eks describe-cluster --name my-cluster --query 'cluster.health'
An example output is as follows.
{
    "issues": [
        {
            "code": "ConfigurationConflict",
            "message": "The instance type 'm5.large' is not supported in Outpost 'my-outpost-arn'.",
            "resourceIds": [
                "my-cluster-arn"
            ]
        }
    ]
}
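When a cluster reports several issues, a one-line-per-issue view can be easier to scan than the raw JSON. The following is a minimal sketch rather than an official tool: the function name print_cluster_issues is hypothetical, and it assumes that the AWS CLI is installed and configured and that the health field has the issues, code, message, and resourceIds shape shown in the preceding example output.

```shell
# Hypothetical helper: print one tab-separated line per health issue
# (code, message, comma-joined resource IDs).
print_cluster_issues() {
  cluster_name="$1"
  # The JMESPath query projects each issue into a three-element row.
  aws eks describe-cluster \
    --name "$cluster_name" \
    --query "cluster.health.issues[].[code, message, join(',', resourceIds)]" \
    --output text
}
```

For example, calling print_cluster_issues my-cluster prints nothing when the cluster reports no issues.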
If the problem can’t be repaired, you might need to delete the local cluster and create a new one. An example is a cluster that was provisioned with an instance type that isn’t available on your Outpost. The following table lists common health-related errors.
Error scenario | Code | Message | ResourceIds |
---|---|---|---|
Provided subnets couldn’t be found. | | | All provided subnet IDs |
Provided subnets don’t belong to the same VPC. | | | All provided subnet IDs |
Some provided subnets don’t belong to the specified Outpost. | | | Problematic subnet ID |
Some provided subnets don’t belong to any Outpost. | | | Problematic subnet ID |
Some provided subnets don’t have enough free addresses to create elastic network interfaces for control plane instances. | | | Problematic subnet ID |
The specified control plane instance type isn’t supported on your Outpost. | | | Cluster ARN |
You terminated a control plane Amazon EC2 instance or | | | Cluster ARN |
You have insufficient capacity on your Outpost. This can also happen when a cluster is being created if an Outpost is disconnected from the AWS Region. | | | Cluster ARN |
Your account exceeded your security group quota. | | Error message returned by Amazon EC2 API | Target VPC ID |
Your account exceeded your elastic network interface quota. | | Error message returned by Amazon EC2 API | Target subnet ID |
Control plane instances weren’t reachable through AWS Systems Manager. For resolution, see Control plane instances aren’t reachable through AWS Systems Manager. | | Amazon EKS control plane instances are not reachable through SSM. Please verify your SSM and network configuration, and reference the EKS on Outposts troubleshooting documentation. | Amazon EC2 instance IDs |
An error occurred while getting details for a managed security group or elastic network interface. | Based on Amazon EC2 client error code. | Error message returned by Amazon EC2 API | All managed security group IDs |
An error occurred while authorizing or revoking security group ingress rules. This applies to both the cluster and control plane security groups. | Based on Amazon EC2 client error code. | Error message returned by Amazon EC2 API | Problematic security group ID |
An error occurred while deleting an elastic network interface for a control plane instance. | Based on Amazon EC2 client error code. | Error message returned by Amazon EC2 API | Problematic elastic network interface ID |
The following table lists errors from other AWS services that are presented in the health field of the describe-cluster response.
| Amazon EC2 error code | Cluster health issue code | Description |
|---|---|---|
| | | This error can occur for a variety of reasons. The most common reason is that you accidentally removed a tag that the service uses to scope down the service linked role policy from the control plane. If this occurs, Amazon EKS can no longer manage and monitor these AWS resources. |
| | | This error can occur for a variety of reasons. The most common reason is that you accidentally removed a tag that the service uses to scope down the service linked role policy from the control plane. If this occurs, Amazon EKS can no longer manage and monitor these AWS resources. |
| | | This error occurs when the subnet ID for the ingress rules of a security group can’t be found. |
| | | This error occurs when the permissions for the ingress rules of a security group aren’t correct. |
| | | This error occurs when the group of the ingress rules of a security group can’t be found. |
| | | This error occurs when the network interface ID for the ingress rules of a security group can’t be found. |
| | | This error occurs when the subnet resource quota is exceeded. |
| | | This error occurs when the Outpost capacity quota is exceeded. |
| | | This error occurs when the elastic network interface quota is exceeded. |
| | | This error occurs when the security group quota is exceeded. |
| | | This is observed when creating an Amazon EC2 instance in a new account. The error might be similar to the following: " |
| | | Amazon EC2 returns this error code if the specified instance type isn’t supported on the Outpost. |
| All other failures | | None |
Local clusters require different permissions and policies than Amazon EKS clusters that are hosted in the cloud. When a cluster fails to create and produces an InvalidPermissions error, verify that the cluster role that you’re using has the AmazonEKSLocalOutpostClusterPolicy managed policy attached to it. All other API calls require the same set of permissions as Amazon EKS clusters in the cloud.
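The policy check described above can be sketched with two AWS CLI calls. This is an illustrative example under stated assumptions: the function name check_local_cluster_policy is hypothetical, the caller needs permission to run eks describe-cluster and iam list-attached-role-policies, and the role name is assumed to be the last path segment of the role ARN.

```shell
# Hypothetical helper: report whether the cluster role has the
# AmazonEKSLocalOutpostClusterPolicy managed policy attached.
check_local_cluster_policy() {
  cluster_name="$1"
  # Look up the IAM role that the cluster was created with.
  role_arn="$(aws eks describe-cluster --name "$cluster_name" \
    --query 'cluster.roleArn' --output text)"
  # Strip everything up to the last "/" to get the role name.
  role_name="${role_arn##*/}"
  # The filter returns the policy name only if it is attached to the role.
  attached="$(aws iam list-attached-role-policies --role-name "$role_name" \
    --query "AttachedPolicies[?PolicyName=='AmazonEKSLocalOutpostClusterPolicy'].PolicyName" \
    --output text)"
  if [ -n "$attached" ]; then
    echo "attached"
  else
    echo "missing"
  fi
}
```

Calling check_local_cluster_policy my-cluster prints attached or missing. The check covers only managed policies that are attached directly to the role.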
The amount of time it takes to create a local cluster varies depending on several factors. These factors include your network configuration, Outpost configuration, and the cluster’s configuration. In general, a local cluster is created and changes to the ACTIVE status within 15–20 minutes. If a local cluster remains in the CREATING state, you can call describe-cluster for information about the cause in the cluster.health output field.
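Because a local cluster doesn’t move to a FAILED status, it can help to poll the status and print any health issues while the cluster is still in the CREATING state. The following is a rough sketch, not a recommended procedure: the function name watch_cluster_creation and the POLL_SECONDS and MAX_POLLS variables are hypothetical names chosen for this example, and the defaults (60 seconds, 30 checks) are arbitrary.

```shell
# Hypothetical helper: poll the cluster status and show health issues
# for as long as the cluster stays in CREATING.
watch_cluster_creation() {
  cluster_name="$1"
  poll_seconds="${POLL_SECONDS:-60}"   # illustrative default, not an AWS setting
  max_polls="${MAX_POLLS:-30}"         # illustrative default, not an AWS setting
  i=0
  while [ "$i" -lt "$max_polls" ]; do
    status="$(aws eks describe-cluster --name "$cluster_name" \
      --query 'cluster.status' --output text)"
    echo "status: $status"
    # Stop as soon as the cluster leaves the CREATING state.
    if [ "$status" != "CREATING" ]; then
      return 0
    fi
    # Still creating: print any issues that Amazon EKS has reported so far.
    aws eks describe-cluster --name "$cluster_name" \
      --query 'cluster.health.issues[].[code, message]' --output text
    i=$((i + 1))
    sleep "$poll_seconds"
  done
  echo "still CREATING after $max_polls checks" >&2
  return 1
}
```

For example, POLL_SECONDS=120 watch_cluster_creation my-cluster checks every two minutes and returns a nonzero exit code if the cluster is still in the CREATING state after the last check.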
The most common issues are the following:
- Your cluster can’t connect to the control plane instance from the AWS Region that Systems Manager is in. You can verify this by calling aws ssm start-session --target instance-id from an in-Region bastion host. If that command doesn’t work, check whether Systems Manager is running on the control plane instance. Another workaround is to delete the cluster and then recreate it.
- Systems Manager control plane instances might not have internet access. Check whether the subnet that you provided when you created the cluster has a NAT gateway and a VPC with an internet gateway. Use VPC Reachability Analyzer to verify that the control plane instance can reach the internet gateway. For more information, see Getting started with VPC Reachability Analyzer.
- The role ARN that you provided is missing policies. Check whether the AWS managed policy AmazonEKSLocalOutpostClusterPolicy was removed from the role. This can also occur if an AWS CloudFormation stack is misconfigured.
- All the provided subnets must be associated with the same Outpost and must be able to reach each other. When multiple subnets are specified when a cluster is created, Amazon EKS attempts to spread the control plane instances across multiple subnets.
- The Amazon EKS managed security groups are applied at the elastic network interface. However, other configuration elements such as network ACL (NACL) rules might conflict with the rules for the elastic network interface.
- VPC and subnet DNS configuration is misconfigured or missing. Review Create a VPC and subnets for Amazon EKS clusters on AWS Outposts.
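Several of the checks above concern the subnets that were provided when the cluster was created. As an illustrative sketch, the following lists each cluster subnet with its VPC, Outpost ARN, and free address count so that mismatches stand out. The function name list_cluster_subnets is hypothetical, and the sketch assumes that the caller can run eks describe-cluster and ec2 describe-subnets.

```shell
# Hypothetical helper: show the VPC, Outpost ARN, and free IP address count
# for every subnet that the cluster was created with.
list_cluster_subnets() {
  cluster_name="$1"
  subnet_ids="$(aws eks describe-cluster --name "$cluster_name" \
    --query 'cluster.resourcesVpcConfig.subnetIds' --output text)"
  # $subnet_ids is intentionally unquoted so that each subnet ID
  # becomes its own argument (relies on sh/bash word splitting).
  aws ec2 describe-subnets --subnet-ids $subnet_ids \
    --query 'Subnets[].[SubnetId, VpcId, OutpostArn, AvailableIpAddressCount]' \
    --output text
}
```

In the output, every row is expected to show the same VPC ID and the same Outpost ARN. A subnet that isn’t on an Outpost typically shows None in the Outpost ARN column.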
If nodes can’t join the cluster, the most common causes are the following:
- AMI issues:
  - You’re using an unsupported AMI. You must use v20220620 or later of the Amazon EKS optimized Amazon Linux AMI. For more information, see Create nodes with optimized Amazon Linux AMIs.
  - If you used an AWS CloudFormation template to create your nodes, make sure it wasn’t using an unsupported AMI.
- Missing the AWS IAM Authenticator ConfigMap – If it’s missing, you must create it. For more information, see Apply the aws-auth ConfigMap to your cluster.
- The wrong security group is used – Make sure to use eks-cluster-sg-cluster-name-uniqueid for your worker nodes' security group. The selected security group is changed by AWS CloudFormation to allow a new security group each time the stack is used.
- Following unexpected private link VPC steps – Wrong CA data (--b64-cluster-ca) or API Endpoint (--apiserver-endpoint) are passed.
- Misconfigured Pod security policy:
  - The CoreDNS and Amazon VPC CNI plugin for Kubernetes DaemonSets must run on nodes for nodes to join and communicate with the cluster.
  - The Amazon VPC CNI plugin for Kubernetes requires some privileged networking features to work properly. You can view the privileged networking features with the following command: kubectl describe psp eks.privileged.
  We don’t recommend modifying the default pod security policy. For more information, see Understand Amazon EKS created pod security policies (PSP).
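Some of the checks in the preceding list can be partly scripted. The following sketch assumes that kubectl is configured for the cluster and that the describe-cluster response includes the endpoint, certificate authority data, and cluster security group ID fields. The function names has_aws_auth_configmap and print_node_join_inputs are hypothetical.

```shell
# Hypothetical helper: report whether the aws-auth ConfigMap exists
# in the kube-system namespace.
has_aws_auth_configmap() {
  if kubectl get configmap aws-auth --namespace kube-system >/dev/null 2>&1; then
    echo "present"
  else
    echo "missing"
  fi
}

# Hypothetical helper: print the API endpoint, CA data, and cluster security
# group ID so that they can be compared with the values passed to your nodes.
print_node_join_inputs() {
  cluster_name="$1"
  aws eks describe-cluster --name "$cluster_name" \
    --query 'cluster.[endpoint, certificateAuthority.data, resourcesVpcConfig.clusterSecurityGroupId]' \
    --output text
}
```

The first two values printed by print_node_join_inputs correspond to what is passed as --apiserver-endpoint and --b64-cluster-ca. A missing result from has_aws_auth_configmap can also mean that kubectl couldn’t reach the cluster, so treat it as a prompt for a closer look.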
When an Outpost gets disconnected from the AWS Region that it’s associated with, the Kubernetes cluster will likely continue working normally. However, if the cluster doesn’t work properly, follow the troubleshooting steps in Prepare local Amazon EKS clusters on AWS Outposts for network disconnects. If you encounter other issues, contact AWS Support. AWS Support can guide you on downloading and running a log collection tool. That way, you can collect logs from your Kubernetes cluster control plane instances and send them to AWS Support for further investigation.
When the Amazon EKS control plane instances aren’t reachable through AWS Systems Manager (Systems Manager), Amazon EKS displays the following error for your cluster.
Amazon EKS control plane instances are not reachable through SSM. Please verify your SSM and network configuration, and reference the EKS on Outposts troubleshooting documentation.
To resolve this issue, make sure that your VPC and subnets meet the requirements in Create a VPC and subnets for Amazon EKS clusters on AWS Outposts and that you completed the steps in Setting up Session Manager in the AWS Systems Manager User Guide.
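To see how Systems Manager currently views the control plane instances, you can query their ping status. This is a minimal sketch, assuming that you pass the Amazon EC2 instance IDs reported in the resourceIds of the health issue. The function name ssm_ping_status is hypothetical.

```shell
# Hypothetical helper: print the Systems Manager ping status for the
# given EC2 instance IDs, one instance per line.
ssm_ping_status() {
  # With IFS set to a comma, "$*" joins all arguments into one
  # comma-separated value for the filter.
  ids="$(IFS=,; echo "$*")"
  aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=$ids" \
    --query 'InstanceInformationList[].[InstanceId, PingStatus]' \
    --output text
}
```

For example, ssm_ping_status instance-id-1 instance-id-2 prints one line for each instance that Systems Manager knows about. An instance that doesn’t appear in the output isn’t registered with Systems Manager, which points to the network or Session Manager setup described above.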