Reducing MTTD

Reducing the MTTD of a failure means discovering the failure as quickly as possible. Shortening the MTTD depends on observability, that is, on how you've instrumented your workload to understand its state. You should monitor the customer experience metrics in your workload's critical subsystems so that you can proactively identify when a problem occurs (refer to Appendix 1 – MTTD and MTTR critical metrics for more information about these metrics). You can use Amazon CloudWatch Synthetics to create canaries that monitor your APIs and consoles and proactively measure the user experience. A number of other health check mechanisms can also help minimize the MTTD, such as Elastic Load Balancing (ELB) health checks and Amazon Route 53 health checks. (See Amazon Builders' Library – Implementing health checks.)
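To make the idea of a proactive check concrete, the following is a minimal sketch of a generic HTTP probe written with only the Python standard library. It is not the CloudWatch Synthetics canary API; the function names, the `ProbeResult` type, and the 500 ms latency threshold are assumptions chosen for illustration. The probe treats a response as healthy only if it returns a 2xx status within the latency budget, so a slow dependency is detected as well as a hard failure.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    healthy: bool
    status: int        # HTTP status code, or 0 if no response was received
    latency_ms: float


def http_status(url: str, timeout_s: float) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx or 5xx status.
        return err.code


def probe(
    url: str,
    timeout_s: float = 2.0,
    max_latency_ms: float = 500.0,  # hypothetical latency budget
    fetch: Callable[[str, float], int] = http_status,
) -> ProbeResult:
    """Run one synthetic check and classify the result."""
    start = time.monotonic()
    try:
        status = fetch(url, timeout_s)
    except OSError:
        # Covers URLError, connection refusals, and socket timeouts.
        status = 0
    latency_ms = (time.monotonic() - start) * 1000.0
    healthy = 200 <= status < 300 and latency_ms <= max_latency_ms
    return ProbeResult(healthy, status, latency_ms)
```

Running a probe like this on a fixed schedule and publishing `healthy` and `latency_ms` as metrics gives you a signal that does not depend on real customer traffic. The `fetch` parameter exists only so the classification logic can be exercised without a network call.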

Your monitoring also needs to be able to detect partial failures, both of the system as a whole and of your individual subsystems. Your availability, failure, and latency metrics should use your fault isolation boundaries as CloudWatch metric dimensions. For example, consider a single EC2 instance in a cell-based architecture that runs in the use1-az1 AZ in the us-east-1 Region and serves the workload's update API, which belongs to its control plane subsystem. When the server pushes its metrics, it can use its instance ID, AZ, Region, API name, and subsystem name as dimensions. This gives you observability across each of these dimensions and lets you set alarms on each of them to detect failure.
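The dimension scheme described above can be sketched as follows. CloudWatch treats each unique combination of dimensions as a separate metric, so this sketch emits the same data point once per fault isolation boundary that you might want to alarm on. The namespace, the instance ID, the dimension names, and the choice of dimension sets are hypothetical values for illustration; only the `put_metric_data` call and its `Namespace` and `MetricData` parameters come from the AWS SDK for Python (boto3).

```python
from typing import Dict, List, Sequence

# Hypothetical values based on the example in the text. The instance ID and
# the dimension names are placeholders, not a prescribed naming scheme.
INSTANCE_DIMENSIONS: Dict[str, str] = {
    "Region": "us-east-1",
    "AZ": "use1-az1",
    "Subsystem": "ControlPlane",
    "API": "Update",
    "InstanceId": "i-0123456789abcdef0",
}

# One entry per fault isolation boundary to observe. Each combination
# becomes its own CloudWatch metric that an alarm can target.
DIMENSION_SETS: Sequence[Sequence[str]] = (
    ("Region",),
    ("Region", "AZ"),
    ("Region", "Subsystem", "API"),
    ("Region", "AZ", "Subsystem", "API", "InstanceId"),
)


def build_metric_data(
    metric_name: str,
    value: float,
    unit: str,
    dimensions: Dict[str, str] = INSTANCE_DIMENSIONS,
    dimension_sets: Sequence[Sequence[str]] = DIMENSION_SETS,
) -> List[dict]:
    """Build one metric datum per dimension set for a single observation."""
    return [
        {
            "MetricName": metric_name,
            "Dimensions": [
                {"Name": name, "Value": dimensions[name]} for name in names
            ],
            "Value": value,
            "Unit": unit,
        }
        for names in dimension_sets
    ]


def publish(metric_data: List[dict], namespace: str = "MyWorkload") -> None:
    """Send the data to CloudWatch. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so build_metric_data works without the SDK

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace, MetricData=metric_data
    )
```

For example, `publish(build_metric_data("Latency", 42.0, "Milliseconds"))` would record one latency observation at the Region, AZ, API, and instance levels. An alarm on the Region and AZ combination could then detect a failure confined to a single Availability Zone even when the Region-wide aggregate still looks healthy.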
