Amazon Elastic Kubernetes Service
Amazon EKS provides features that enable you to make your applications more resilient to events such as the degraded health or impairment of an Availability Zone (AZ). When running your workloads in an Amazon EKS cluster, you can further improve your application environment’s fault tolerance and application recovery using zonal shift or zonal autoshift.
Using zonal shift for Amazon Elastic Kubernetes Service
To enable zonal shift for your cluster, follow the steps in Enable Amazon EKS Zonal Shift to avoid impaired Availability Zones.
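As a rough illustration of enabling zonal shift programmatically, the following Python sketch builds the request for the EKS UpdateClusterConfig operation and sends it with boto3. The function names, the cluster name, and the Region are hypothetical placeholders. The sketch assumes that boto3 is installed and that credentials with permission to update the cluster are configured.

```python
def build_enable_zonal_shift_request(cluster_name: str, enabled: bool = True) -> dict:
    """Build keyword arguments for the EKS UpdateClusterConfig call.

    Kept separate from the network call so it can be inspected without AWS access.
    """
    if not cluster_name:
        raise ValueError("cluster_name must not be empty")
    return {"name": cluster_name, "zonalShiftConfig": {"enabled": enabled}}


def enable_zonal_shift(cluster_name: str, region: str) -> dict:
    """Send the update to EKS (assumes boto3 and valid credentials are available)."""
    import boto3  # imported lazily so the builder above works without the SDK

    eks = boto3.client("eks", region_name=region)
    return eks.update_cluster_config(**build_enable_zonal_shift_request(cluster_name))


# Hypothetical usage; "my-cluster" and "us-east-1" are placeholders:
# enable_zonal_shift("my-cluster", "us-east-1")
```

The update is asynchronous, so the cluster might take some time to report the new configuration.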
You can trigger a zonal shift for an Amazon EKS cluster yourself, or you can allow AWS to do it for you by enabling zonal autoshift. Once zonal shift is enabled with ARC for your Amazon EKS cluster, you can trigger a zonal shift or enable zonal autoshift using the ARC console, the AWS CLI, or the zonal shift and zonal autoshift APIs.
For more information on triggering a zonal shift, see Starting, updating, or canceling a zonal shift.
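As a minimal sketch of triggering a zonal shift through the API, the following Python code prepares a StartZonalShift request and submits it with the boto3 arc-zonal-shift client. The function names, the cluster ARN, the AZ ID, and the comment text are hypothetical placeholders. The sketch assumes that the cluster already has zonal shift enabled and that the Availability Zone is given as an AZ ID (for example, use1-az1) with an expiration such as 30m or 2h.

```python
import re

# An expiration is a positive integer followed by m (minutes) or h (hours).
_EXPIRES_IN = re.compile(r"^[1-9][0-9]*[mh]$")


def build_start_zonal_shift_request(
    cluster_arn: str, away_from_az_id: str, expires_in: str, comment: str
) -> dict:
    """Build keyword arguments for the ARC StartZonalShift call."""
    if not _EXPIRES_IN.match(expires_in):
        raise ValueError("expires_in must look like '30m' or '2h'")
    return {
        "resourceIdentifier": cluster_arn,  # the EKS cluster ARN
        "awayFrom": away_from_az_id,        # AZ ID to shift traffic away from
        "expiresIn": expires_in,
        "comment": comment,
    }


def start_zonal_shift(cluster_arn: str, az_id: str, expires_in: str, region: str) -> dict:
    """Start the shift (assumes boto3 and valid credentials are available)."""
    import boto3  # imported lazily so the builder above works without the SDK

    arc = boto3.client("arc-zonal-shift", region_name=region)
    request = build_start_zonal_shift_request(
        cluster_arn, az_id, expires_in, "Shifting away from impaired AZ"
    )
    return arc.start_zonal_shift(**request)
```

Keeping the request builder separate from the network call makes it possible to validate the expiration format before anything is sent to ARC.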
For more information on enabling Amazon EKS with zonal shift, refer to the Learn about ARC Zonal Shift in Amazon EKS topic in the Amazon Elastic Kubernetes Service User Guide.
How zonal shift works for Amazon Elastic Kubernetes Service
During an Amazon EKS zonal shift, the following will automatically take place:
All the nodes in the impacted AZ will be cordoned. This will prevent the Kubernetes Scheduler from scheduling new Pods onto the nodes in the unhealthy AZ.
If you’re using Managed Node Groups, Availability Zone rebalancing will be suspended, and your Auto Scaling Group (ASG) will be updated to ensure that new Amazon EKS Data Plane nodes are only launched in the healthy AZs.
The nodes in the unhealthy AZ will not be terminated, and the Pods will not be evicted from these nodes. This ensures that when a zonal shift expires or is canceled, your traffic can be safely returned to the AZ, which still has full capacity.
The EndpointSlice controller will find all the Pod endpoints in the impaired AZ and remove them from the relevant EndpointSlices. This ensures that only Pod endpoints in healthy AZs are targeted to receive network traffic. When a zonal shift is canceled or expires, the EndpointSlice controller will update the EndpointSlices to include the endpoints in the restored AZ.
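The behavior described in the preceding list can be pictured with a small, self-contained Python model. This is a toy simulation for illustration only, not the Amazon EKS or Kubernetes implementation. The Node, Endpoint, and ShiftResult types and the apply_zonal_shift function are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    zone: str
    schedulable: bool = True  # False models a cordoned node


@dataclass(frozen=True)
class Endpoint:
    pod: str
    node: str


@dataclass
class ShiftResult:
    launch_zones: List[str]            # zones where new nodes may be launched
    serving_endpoints: List[Endpoint]  # endpoints that still receive traffic


def apply_zonal_shift(
    nodes: List[Node], endpoints: List[Endpoint], impaired_zone: str
) -> ShiftResult:
    """Model a zonal shift away from impaired_zone."""
    zone_of = {node.name: node.zone for node in nodes}

    # Cordon nodes in the impaired zone. Nodes stay in the list because
    # they are not terminated and their Pods are not evicted.
    for node in nodes:
        if node.zone == impaired_zone:
            node.schedulable = False

    # New capacity is launched only in the healthy zones.
    launch_zones = sorted({n.zone for n in nodes if n.zone != impaired_zone})

    # Endpoints in the impaired zone are dropped from the serving set.
    serving = [e for e in endpoints if zone_of[e.node] != impaired_zone]
    return ShiftResult(launch_zones=launch_zones, serving_endpoints=serving)
```

Because the model only flips the schedulable flag and filters endpoints, ending the shift amounts to marking the nodes schedulable again and restoring the full endpoint list, which mirrors why capacity in the shifted AZ is preserved.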
For more information, refer to the AWS Containers blog.