Stage 2: Design and implement
In the previous stage, you set your resilience objectives. Now, in the design and implement stage, you anticipate failure modes and identify design choices that are guided by those objectives. You also define strategies for change management and develop software code and infrastructure configuration. The following sections highlight AWS best practices to consider while taking trade-offs such as cost, complexity, and operational overhead into account.
AWS Well-Architected Framework
When you architect your application based on your desired resilience objectives, you need
to evaluate multiple factors and make trade-offs to arrive at the most suitable architecture. To build a
highly resilient application, you must consider aspects of design, building and deployment,
security, and operations. The AWS Well-Architected Framework describes key concepts, design principles, and architectural best practices for designing and running workloads in the AWS Cloud.
The following are examples of how the AWS Well-Architected Framework can help you design and implement applications that meet your resilience objectives:
- The reliability pillar: The reliability pillar emphasizes the importance of building applications that can operate correctly and consistently, even during failures or disruptions. For example, the AWS Well-Architected Framework recommends that you use a microservices architecture to make your applications smaller and simpler, so you can differentiate between the availability needs of different components within your application. You can also find detailed descriptions of best practices for building applications by using throttling, retry with exponential backoff, fail fast (load shedding), idempotency, constant work, circuit breakers, and static stability.
- Comprehensive review: The AWS Well-Architected Framework encourages a comprehensive review of your architecture against best practices and design principles. It provides a way to consistently measure your architectures and identify areas for improvement.
- Risk management: The AWS Well-Architected Framework helps you identify and manage risks that might impact the reliability of your application. By addressing potential failure scenarios proactively, you can reduce their likelihood or the resulting impairment.
- Continuous improvement: Resilience is an ongoing process, and the AWS Well-Architected Framework emphasizes continuous improvement. By regularly reviewing and refining your architecture and processes based on the AWS Well-Architected Framework's guidance, you can ensure that your systems stay resilient in the face of evolving challenges and requirements.
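For example, the retry-with-exponential-backoff best practice from the reliability pillar can be sketched in a few lines of Python. This is a minimal illustration, not an AWS SDK API; the function and parameter names are hypothetical:

```python
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Call operation(), retrying transient failures with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: fail fast and surface the error
            # Full jitter: sleep a random duration up to the capped exponential delay
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```

The jitter spreads out retries from many clients so that a recovering dependency isn't hit by a synchronized wave of requests.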
Understanding dependencies
Understanding a system's dependencies is key for resilience. Dependencies include the
connections between components within an application, and connections to components outside
the application, such as third-party APIs and business-owned shared services. Understanding
these connections helps you isolate and manage disruptions, because an impairment in one
component can affect other components. This knowledge helps engineers assess the impact of
impairments, plan accordingly, and ensure that resources are used effectively.
Understanding dependencies helps you create alternate strategies and coordinate recovery
processes. It also helps you determine cases where you can replace a hard dependency with a
soft dependency, so your application can continue to serve its business function when there's
a dependency impairment. Dependencies also influence decisions on load balancing and
application scaling. Understanding dependencies is vital when you make changes to your
application, because it can help you determine potential risks and impacts. This knowledge
helps you create stable, resilient applications, assisting in fault management, impact
assessment, impairment recovery, load balancing, scaling, and change management. You can track
dependencies manually or use tools and services such as AWS X-Ray to map and monitor them.
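To illustrate replacing a hard dependency with a soft one, the following hypothetical sketch degrades to a static fallback when a dependency call fails, so the application keeps serving its business function during the impairment:

```python
def get_recommendations(user_id, recommendation_service, fallback_items):
    """Return personalized results when the dependency is healthy,
    and a static default list when it is impaired (soft dependency)."""
    try:
        return recommendation_service(user_id)
    except Exception:
        # The dependency is impaired: degrade gracefully instead of failing the request
        return fallback_items
```

A hard dependency would propagate the exception and fail the whole request; the soft version trades personalization for availability.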
Disaster recovery strategies
A disaster recovery (DR) strategy plays a pivotal role in building and operating resilient applications, primarily by ensuring business continuity. It helps ensure that crucial business operations can persist with the least possible impairment, even during catastrophic events, thereby minimizing downtime and potential loss of revenue. DR strategies are essential for data protection because they often incorporate regular data backups and data replication across multiple locations, which helps safeguard valuable business information and helps prevent total loss during a disaster. Furthermore, many industries are regulated by policies that require businesses to have a DR strategy in place to protect sensitive data and to ensure that services remain available during a disaster. By assuring minimal service impairment, a DR strategy also bolsters customer trust and satisfaction. A well-implemented and frequently practiced DR strategy reduces the recovery time after a disaster, and helps ensure that applications are quickly brought back online. Moreover, disasters can incur substantial costs, not just from lost revenue due to downtime, but also from the expense of restoring applications and data. A well-designed DR strategy helps shield against these financial losses.
The strategy you choose depends on the specific needs of your application, your RTO and
RPO, and your budget. AWS Elastic Disaster Recovery can help you implement your DR strategy by continuously replicating your servers to AWS, supporting fast, reliable recovery with minimal downtime and data loss.
For more information, see Disaster Recovery of Workloads on AWS and AWS Multi-Region Fundamentals on the AWS website.
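As a simple illustration of how RTO and RPO translate into measurable checks, the following hypothetical helpers test whether a recovery met those objectives (the names and signatures are illustrative):

```python
from datetime import datetime, timedelta

def meets_rpo(last_backup: datetime, failure_time: datetime, rpo: timedelta) -> bool:
    """True if the data lost between the last backup and the failure is within the RPO."""
    return failure_time - last_backup <= rpo

def meets_rto(failure_time: datetime, restored_time: datetime, rto: timedelta) -> bool:
    """True if the application was restored within the RTO."""
    return restored_time - failure_time <= rto
```

For example, with hourly backups and a one-hour RPO, a failure 30 minutes after the last backup is within objective, but a two-hour gap is not.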
Defining CI/CD strategies
One of the common causes of application impairments is code or other changes that alter the application from a previously known working state. If you don't address change management carefully, it can cause frequent impairments. The frequency of changes increases the opportunity for impact. However, making changes less frequently results in larger change sets, which are much more likely to result in impairment due to their high complexity. Continuous integration and continuous delivery (CI/CD) practices are designed to keep changes small and frequent (resulting in increased productivity) while subjecting each change to a high level of inspection through automation. Some of the foundational strategies are:
- Full automation: The fundamental concept of CI/CD is to automate the build and deployment processes as much as possible. This includes building, testing, deployment, and even monitoring. Automated pipelines help reduce the possibility of human error, ensure consistency, and make the process more reliable and efficient.
- Test-driven development (TDD): Write tests before writing the application code. This practice ensures that all code has associated tests, which improves the reliability of the code and the quality of the automated inspection. These tests are run in the CI pipeline to validate changes.
- Frequent commits and integrations: Encourage developers to commit code frequently and perform integrations often. Small, frequent changes are easier to test and debug, which reduces the risk of significant problems. Automation reduces the cost of each commit and deployment, making frequent integrations possible.
- Immutable infrastructure: Treat your servers and other infrastructure components like static, immutable entities. Wherever possible, replace infrastructure instead of modifying it, and build new infrastructure from code that is tested and deployed through your pipeline.
- Rollback mechanism: Always have an easy, reliable, and frequently tested way to roll back changes if something goes wrong. Being able to quickly return to the previous known good state is essential to deployment safety. This can be a simple button to revert to the previous state, or it can be fully automated and initiated by alarms.
- Version control: Maintain all application code, configuration, and even infrastructure as code in a version-controlled repository. This practice helps ensure that you can easily track changes and revert them if needed.
- Canary deployments and blue/green deployments: Deploying new versions of your application to a subset of your infrastructure first, or maintaining two environments (blue/green), allows you to verify a change's behavior in production and quickly roll back if necessary.
CI/CD is not just about the tools but also about the culture. Creating a culture that values automation, testing, and learning from failures is just as important as implementing the right tools and processes. Rollbacks, if done very quickly with minimal impact, should not be considered a failure but a learning experience.
Conducting ORRs
An operational readiness review (ORR) helps identify operational and procedural gaps. At Amazon, we created ORRs to distill the learnings from decades of operating high-scale services into curated questions with best practice guidance. An ORR captures previous lessons learned and requires new teams to ensure that they have accounted for these lessons in their applications. ORRs can provide a list of failure modes or causes of failure that can be carried into the resilience modeling activity described in the resilience modeling section below. For more information, see Operational Readiness Reviews (ORRs) on the AWS Well-Architected Framework website.
Understanding AWS fault isolation boundaries
AWS provides multiple fault isolation boundaries to help you achieve your resilience objectives. You can use these boundaries to take advantage of the predictable scope of impact containment they provide. You should be familiar with how AWS services are designed by using these boundaries so that you can make intentional choices about the dependencies you select for your application. To understand how to use boundaries in your application, see AWS Fault Isolation Boundaries on the AWS website.
Selecting responses
A system can respond in a wide range of ways to an alarm. Some alarms might require a response from the operations team whereas others might trigger self-healing mechanisms within the application. You might decide to keep responses that could be automated as manual operations to control the costs of automation or to manage engineering constraints. The type of response to an alarm is likely to be selected as a function of the cost of implementing the response, the anticipated frequency of the alarm, the accuracy of the alarm, and the potential business loss of not responding to the alarm at all.
For example, when a server process crashes, the process might be restarted by the operating system, or a new server might be provisioned and the old one terminated, or an operator might be instructed to remotely connect to the server and restart it. These responses have the same result—restarting the application server process—but have varying levels of implementation and maintenance costs.
Note
You might select multiple responses in order to take a defense-in-depth approach to resilience. For example, in the previous scenario the application team might choose to implement all three responses with a time delay between each. If the failed-server-process indicator is still in an alarm state after 30 seconds, the team can assume that the operating system has failed to restart the application server. Therefore, they might use an Auto Scaling group to provision a new virtual server and restore the application server process. If the indicator is still in an alarm state after 300 seconds, an alert might be sent to the operational staff to connect to the original server and attempt to restore the process.
The response that the application team and business select should reflect the appetite of the business to offset operational overhead with upfront investment in engineering time. You should choose a response―an architecture pattern such as static stability, a software pattern such as a circuit breaker, or an operational procedure―by carefully considering the constraints and the anticipated maintenance of each response option. Some standard responses might exist to guide application teams, so you can use the libraries and patterns that are managed by your centralized architecture function as an input to this consideration.
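The layered responses in the note above can be sketched as a simple escalation table; the 30-second and 300-second thresholds are the illustrative values from that scenario, and the function name is hypothetical:

```python
def select_responses(seconds_in_alarm: float) -> list:
    """Return the layered responses triggered so far for a failed-server-process
    alarm, ordered cheapest first. Thresholds are illustrative."""
    escalation = [
        (0, "operating system restarts the process"),
        (30, "Auto Scaling group replaces the virtual server"),
        (300, "alert on-call staff to intervene manually"),
    ]
    return [action for threshold, action in escalation if seconds_in_alarm >= threshold]
```

Each tier is cheaper to operate than the next, so the expensive manual response only runs when the automated ones have not cleared the alarm.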
Resilience modeling
Resilience modeling documents how an application will respond to different anticipated disruptions. By anticipating disruptions, your team can implement observability, automated controls, and recovery processes to mitigate or prevent impairment. AWS has created guidance for developing a resilience model by using the resilience analysis framework, which helps you anticipate disruptions and their impact on your application so that you can identify the mitigations needed to build a resilient, reliable application. We recommend that you use the resilience analysis framework to update your resilience model with every iteration of your application's lifecycle. Doing so helps reduce incidents by anticipating disruptions during the design phase and by testing the application before and after production deployment, and it helps ensure that you meet your resilience objectives.
Failing safely
If you're unable to avoid disruptions, fail safely. Consider creating your application with a default fail-safe mode of operation, where no significant business loss can be incurred. An example of a fail-safe state for a database would be to default to read-only operations, where users aren't allowed to create or mutate any data. Depending on the sensitivity of the data, you might even want the application to default to a shutdown state and not even perform read-only queries. Consider what the fail-safe state for your application should be, and default to this mode of operation under extreme conditions.
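As a minimal, hypothetical sketch of such a fail-safe default, the following store wrapper disallows writes in read-only mode and all access in shutdown mode:

```python
class FailSafeStore:
    """A data-store wrapper that degrades to read-only, or to a full shutdown,
    under extreme conditions. Mode names and the API are illustrative."""

    def __init__(self):
        self.mode = "normal"   # one of: "normal", "read_only", "shutdown"
        self._data = {}

    def write(self, key, value):
        if self.mode != "normal":
            # Fail safely: refuse mutations so no significant business loss occurs
            raise PermissionError("writes disabled in fail-safe mode")
        self._data[key] = value

    def read(self, key):
        if self.mode == "shutdown":
            # Most conservative fail-safe state: no queries at all
            raise PermissionError("store is shut down")
        return self._data.get(key)
```

A monitoring or control-plane component would flip `mode` when extreme conditions are detected, and restore `"normal"` once the disruption passes.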