Advanced Multi-AZ Resilience Patterns
Publication date: July 11, 2023 (Document revisions)
Many customers run their workloads in highly available, multi-Availability Zone (AZ) configurations. These architectures perform well during binary failure events, but often encounter problems with gray failures. The manifestations of this type of failure can be subtle, and defy quick and definitive detection. This paper provides guidance on how to instrument workloads to detect impact from gray failures that are isolated to a single Availability Zone, and then take action to mitigate that impact in the Availability Zone.
Introduction
The purpose of this document is to help you more effectively implement resilient multi-AZ architectures. One of the best practices for building resilient systems in Amazon Virtual Private Cloud
An Availability Zone
Many AWS services, such as Amazon Elastic Compute Cloud (EC2) Auto Scaling
But there is another category of failures termed gray failures, whose manifestations are subtle and defy quick and definitive detection. This in turn results in longer times to mitigate the impact caused by the failure. This paper focuses on the impacts gray failures can have on multi-AZ architectures, how to detect them, and, finally, how to mitigate them.
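To make the idea of a gray failure isolated to one Availability Zone more concrete, here is a minimal sketch, not the method this paper prescribes. It compares each zone's error rate against the other zones over the same time window. The function name `find_outlier_az`, the thresholds `min_rate` and `ratio`, and the sample Availability Zone IDs and counts are all hypothetical values chosen for illustration.

```python
from statistics import median


def find_outlier_az(requests, errors, min_rate=0.05, ratio=3.0):
    """Return the single AZ whose error rate stands out, or None.

    requests / errors: dicts mapping an AZ ID to counts for one time window.
    min_rate and ratio are hypothetical thresholds, not recommended values.
    """
    # Per-AZ error rate; AZs that served no requests are skipped.
    rates = {az: errors.get(az, 0) / n for az, n in requests.items() if n > 0}

    flagged = []
    for az, rate in rates.items():
        others = [r for other, r in rates.items() if other != az]
        if not others:
            continue  # nothing to compare against
        baseline = median(others)
        # Flag only when the AZ is unhealthy in absolute terms AND
        # clearly worse than its peers.
        if rate >= min_rate and rate >= ratio * max(baseline, 1e-9):
            flagged.append(az)

    # Exactly one outlier suggests impact isolated to a single AZ.
    return flagged[0] if len(flagged) == 1 else None


# Illustrative sample data (assumed, not measured).
requests = {"use1-az1": 1000, "use1-az2": 1000, "use1-az3": 1000}
errors = {"use1-az1": 4, "use1-az2": 90, "use1-az3": 6}
print(find_outlier_az(requests, errors))
```

With this sample data, the error rate aggregated across all three zones is about 3.3 percent, which stays under the assumed 5 percent threshold, while the per-zone comparison singles out `use1-az2`. When every zone degrades equally, no zone is clearly worse than its peers, so the function returns `None` rather than blaming a single Availability Zone.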
The guidance provided in this whitepaper is mostly applicable to specific classes of workloads that:
- Primarily use zonal AWS services
- Need to improve single Region resilience
- Are willing to make a significant investment to build the required observability and resilience patterns
In these workloads, you might not be willing to make some, or all, of the tradeoffs presented in Responding to gray failures, or you might not have the option to use multiple Regions. These types of workloads are likely to represent a small subset of your overall portfolio, so this guidance should be considered at the workload level rather than at the platform level.