Stage 1: Set objectives

Understanding what level of resilience is needed and how you will measure it is the basis for the set objectives stage. It's difficult to improve something if you don't have an objective and you can't measure it.

Not all applications need the same level of resilience. When you set objectives, consider the required level in order to make the correct investments and trade-offs. A good analogy for this is a car: It has four tires but carries only one spare tire. The chance of getting multiple flat tires during a ride is low, and having extra spares could take away from other features, such as cargo space or fuel efficiency, so this is a reasonable trade-off.

After you define objectives, you implement observability controls in later stages (Stage 2: Design and implement and Stage 4: Operate) to understand whether the objectives are being met.

Mapping critical applications

Defining resilience objectives shouldn't exclusively be a technical conversation. Instead, start with a business-oriented focus to understand what the application should deliver and the consequences of impairment. This understanding of business objectives then cascades to areas such as architecture, engineering, and operations. Any resilience objectives you define might be applied to all your applications, but the way the objectives are measured often varies depending on the function of the application. You might be running an application that's critical to the business, and if this application is impaired, your organization could lose significant revenue or suffer reputational harm. Alternatively, you might have another application that isn't as critical and can tolerate some downtime without negatively impacting your organization's ability to do business.

As an example, think of an order management application for a retail company. If the components of the order management application are impaired and don't run properly, new sales won't go through. This retail company also has a coffee shop for its employees that's located in one of its buildings. The coffee shop has an online menu that employees can access on a static webpage. If this webpage becomes unavailable, some employees might complain, but it won't necessarily cause financial harm to the company. Based on this example, the business would likely choose to have more aggressive resilience goals for the order management application but wouldn't make a significant investment to ensure the resilience of the coffee shop menu webpage.

Identifying the most critical applications, where to apply the most effort, and where to make trade-offs is as important as being able to measure an application's resilience in production. To better understand the impact of impairment, you can perform a business impact analysis (BIA). A BIA provides a structured and systematic approach to identify and prioritize critical business applications, assess potential risks and impacts, and identify supporting dependencies. The BIA helps quantify the cost of downtime for your organization's most important applications. This metric helps outline how much it will cost if a specific application is impaired and unable to complete its function. In the previous example, if the order management application is impaired, the retail business could lose significant revenue.
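The cost-of-downtime comparison that a BIA produces can be sketched as a small calculation. The following Python example is a minimal illustration only: the application names, hourly revenue figures, and expected outage hours are hypothetical values chosen to mirror the retail example, and a real BIA considers many more factors (reputational harm, dependencies, regulatory impact).

```python
from dataclasses import dataclass


@dataclass
class AppImpact:
    """Hypothetical BIA inputs for one application."""
    name: str
    revenue_loss_per_hour: float           # assumed figure, in your currency
    expected_outage_hours_per_year: float  # assumed figure

    def annual_downtime_cost(self) -> float:
        # Simplest possible model: hourly loss multiplied by hours of impairment.
        return self.revenue_loss_per_hour * self.expected_outage_hours_per_year


def rank_by_impact(apps):
    """Return applications ordered from highest to lowest downtime cost."""
    return sorted(apps, key=lambda app: app.annual_downtime_cost(), reverse=True)


# Illustrative numbers only; they are not taken from any real assessment.
apps = [
    AppImpact("coffee-shop-menu", 0.0, 20.0),
    AppImpact("order-management", 50_000.0, 4.0),
]
ranked = rank_by_impact(apps)
for app in ranked:
    print(app.name, app.annual_downtime_cost())
```

A ranking like this is one possible input when you decide where to apply the most effort and where to accept trade-offs.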

Mapping user stories

During the BIA process, you might discover that an application is responsible for more than one business function, or that a business function requires multiple applications. Using the previous retail company example, the order management function might require separate applications for checkout, promotion, and pricing. If one application fails, the impact could be felt by the business and by users who interact with the company. For example, the company might not be able to add new orders, provide access to promotions and discounts, or update the price of their products. These different functions required by the order management function might rely on multiple applications. These functions might also have multiple external dependencies, which makes the process of achieving purely component-focused resilience too complex. A better way to handle this scenario is to focus on user stories, which outline the experience that users expect when interacting with one application or a set of applications.

Focusing on user stories helps you understand which pieces of the customer experience are most important, so you can build mechanisms to protect against specific threats. In the previous example, one user story could be checkout, which involves the checkout application and has a dependency on the pricing application. Another user story could be viewing promotions, which involves the promotion application. After you map the most critical applications and their user stories, you can begin to define the metrics you will use to measure resilience for these user stories. These metrics can be applied across an entire portfolio or to individual user stories.
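One way to record this mapping is a simple lookup from each user story to the applications it involves. The following Python sketch uses the two user stories from the retail example; the dictionary layout and the function name affected_stories are assumptions made for illustration, not part of any AWS service or tool.

```python
# User stories from the retail example, mapped to the applications they involve.
# Checkout involves the checkout application and depends on pricing.
USER_STORIES = {
    "checkout": {"checkout", "pricing"},
    "view promotions": {"promotion"},
}


def affected_stories(impaired_app, stories=USER_STORIES):
    """Return the user stories that involve the impaired application."""
    return sorted(
        story for story, apps in stories.items() if impaired_app in apps
    )


print(affected_stories("pricing"))    # user stories touched by a pricing outage
print(affected_stories("promotion"))  # user stories touched by a promotion outage
```

Keeping the mapping explicit makes it easier to see which parts of the customer experience a single application failure would affect.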

Defining measurements

Recovery point objectives (RPOs), recovery time objectives (RTOs), and service-level objectives (SLOs) are standard industry measurements that are used to assess the resilience of a given system. RPO refers to how much data loss the business can tolerate in case of a failure, whereas RTO is a measure of how quickly an application must be available again after an outage. These two metrics are measured in time units: seconds, minutes, or hours. You can also measure the amount of time during which the application is working properly; that is, it performs its functions as designed and is accessible to its users. SLOs detail the expected level of service that customers will receive and are measured by metrics such as the percentage (%) of requests that are serviced without error within a response time of less than one second (for example, 99.99% of requests each month will receive an error-free response in under one second). RPO and RTO are related to disaster recovery strategies, which assume that there will be interruptions in application operation and rely on recovery processes that range from restoring backups to redirecting user traffic. SLOs are addressed by implementing high availability controls, which tend to reduce the downtime for an application.
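To make these measurements concrete, the following Python sketch converts an SLO percentage into an allowed-downtime budget and checks observed values against RPO and RTO targets. It is a simplified illustration under stated assumptions: the SLO is treated as time-based availability over a 30-day period, worst-case data loss is approximated by the backup interval, and all function names and figures are hypothetical.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Downtime budget implied by a time-based availability target."""
    return period * ((100.0 - slo_percent) / 100.0)


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    # Assumption: the most data you can lose is what was written since
    # the last backup, so the backup interval must not exceed the RPO.
    return backup_interval <= rpo


def meets_rto(measured_recovery: timedelta, rto: timedelta) -> bool:
    # Compare a recovery time observed in a test or incident with the target.
    return measured_recovery <= rto


budget = allowed_downtime(99.99, timedelta(days=30))
print("Monthly downtime budget in seconds:", round(budget.total_seconds(), 1))
print("RPO met:", meets_rpo(timedelta(minutes=15), timedelta(hours=1)))
print("RTO met:", meets_rto(timedelta(hours=2), timedelta(hours=1)))
```

Under these assumptions, a 99.99% target over 30 days leaves a budget of about 4.32 minutes of downtime.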

SLO metrics are commonly used in the definition of service-level agreements (SLAs), which are contracts between service providers and end users. SLAs usually come with financial commitments and outline penalties that need to be paid by the provider if these agreements aren't met. However, an SLA isn't a measurement of your resilience posture, and increasing an SLA doesn't make your application more resilient.

You can start to set your objectives based on SLOs, RPOs, and RTOs. After you define your resilience objectives and gain a clear understanding of your RPO and RTO targets, you can use AWS Resilience Hub to run an assessment of your architecture to uncover potential resilience-related weaknesses. AWS Resilience Hub assesses an application architecture against AWS Well-Architected Framework best practices and shares remediation guidance in the context of what specifically needs to be improved to meet your defined RTO and RPO targets.

Creating additional measurements

RPOs, RTOs, and SLOs are good indicators of resilience, but you can also think about goals from a business perspective and define objectives around your application's functions. For example, your objective could be: Successful orders per minute will remain above 98% if latency between my frontend and backend increases by 40%. Or: Streams started per second will remain within a standard deviation from average even if a specific component is lost. You can also create objectives to achieve a reduction in the mean time to recover (MTTR) across known failure types; for example: Recovery times will be reduced by x% if any of these known issues happen. Creating objectives that align with a business need helps you anticipate the types of failures that your application should tolerate. It also helps you identify approaches to reduce the likelihood of impairment to your application.
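An objective such as staying within a standard deviation from average can be evaluated with a few lines of code. The following Python sketch is one possible interpretation: it assumes that you have a baseline of streams-started-per-second samples from normal operation and uses the sample standard deviation of that baseline. The function name and the sample values are hypothetical.

```python
import statistics


def within_one_std(baseline, observed):
    """Check whether an observed value is within one standard deviation
    of the baseline average."""
    mean = statistics.mean(baseline)
    std = statistics.stdev(baseline)  # sample standard deviation
    return abs(observed - mean) <= std


# Hypothetical streams-started-per-second samples from normal operation.
baseline = [100, 102, 98, 101, 99]

print(within_one_std(baseline, 101))  # value observed while a component is lost
print(within_one_std(baseline, 97))   # a larger drop during the same event
```

Whether you use the sample or population standard deviation, and how long the baseline window is, are design choices that depend on how your metric behaves.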

Consider the objective to continue operating if you lose 5% of the instances that power your application. To meet it, you might determine that your application should be prescaled or have the ability to scale fast enough to support the additional traffic caused by that event. Or, you might determine that you should use different architectural patterns, as described in the Stage 2: Design and implement section.
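As a rough illustration of prescaling, the following Python sketch estimates how many instances to provision so that the instances that remain after a loss can still carry peak load. It assumes that every instance has the same capacity, that load is spread evenly, and that the number of lost instances is rounded up. The function name and the figures are hypothetical.

```python
def instances_to_provision(needed_at_peak: int, loss_percent: int) -> int:
    """Smallest fleet size that still has `needed_at_peak` instances
    after losing `loss_percent` percent of the fleet (rounded up)."""
    if needed_at_peak < 0 or not 0 <= loss_percent < 100:
        raise ValueError("needed_at_peak must be >= 0 and 0 <= loss_percent < 100")
    fleet = needed_at_peak
    while True:
        # Integer ceiling of fleet * loss_percent / 100, avoiding float rounding.
        lost = -(-fleet * loss_percent // 100)
        if fleet - lost >= needed_at_peak:
            return fleet
        fleet += 1


print(instances_to_provision(19, 5))
print(instances_to_provision(100, 5))
```

The alternative to prescaling is to rely on scaling that reacts quickly enough during the event, which this sketch doesn't model.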

You should also implement observability measures for your specific business objectives. For example, you can track average order rate, average order price, average number of subscriptions, or other metrics that can provide insights into the health of the business based on your application's behavior. By implementing observability capabilities for your application, you can create alarms and take action if these metrics fall outside your defined boundaries. Observability is covered in more detail in the Stage 4: Operate section.
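The boundary check behind such an alarm can be sketched as follows. This Python example is a minimal illustration and doesn't use any monitoring service API: the metric names, the boundary values, and the choice to treat a missing metric as a breach are all assumptions.

```python
from typing import NamedTuple


class Bound(NamedTuple):
    lower: float
    upper: float


# Hypothetical business metrics and the boundaries defined for them.
BOUNDS = {
    "orders_per_minute": Bound(80.0, 400.0),
    "average_order_price": Bound(15.0, 120.0),
}


def breached_metrics(sample, bounds=BOUNDS):
    """Return the names of metrics that are missing or outside their bounds."""
    breaches = []
    for name, bound in bounds.items():
        value = sample.get(name)
        # Assumption: a metric that stops reporting is treated as a breach.
        if value is None or not (bound.lower <= value <= bound.upper):
            breaches.append(name)
    return breaches


sample = {"orders_per_minute": 50.0, "average_order_price": 30.0}
print(breached_metrics(sample))
```

In practice, you would typically evaluate metrics over a time window before raising an alarm, so that a single unusual data point doesn't trigger action.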
