Preparing for incidents in Incident Manager

Planning for an incident begins long before the incident lifecycle starts. As the following illustration shows, before you respond to incidents, you prepare by setting up chat channels, creating escalation plans, specifying contacts, and determining which Automation runbooks to use in incident response. You then use a response plan that specifies how monitoring occurs and whether responses are automated. After remediation is complete, you can analyze the incident and the incident response to further refine your response plan for future incidents.

An Incident Manager workflow for preparing for, responding to, and learning from incidents.

Monitoring

Monitoring the health of your AWS-hosted applications is key to ensuring application uptime and performance. When determining monitoring solutions, consider the following:

Criticality of feature – If the system were to fail, how critical would the impact on downstream users be?
Commonality of failure – How often does the system fail? Systems that require frequent intervention should be closely monitored.
Increased latency – How much has the time to complete a task increased or decreased?
Client-side versus server-side metrics – Is there a discrepancy between related metrics on the client and the server?
Dependency failures – Which dependency failures can and should your team prepare for?
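
As an illustration of the latency and client-side versus server-side considerations above, the following is a minimal Python sketch. The `MetricSample` type, the function names, and the default thresholds (a 50 percent increase and a 200 ms gap) are hypothetical and are not part of Incident Manager or any AWS API. Choose values that fit your own workload.

```python
from dataclasses import dataclass


@dataclass
class MetricSample:
    """Hypothetical snapshot of one operation's latency, in milliseconds."""
    baseline_ms: float  # typical latency under normal conditions
    server_ms: float    # latency currently measured on the server
    client_ms: float    # latency currently measured by the client


def latency_increase(sample: MetricSample) -> float:
    """Return the fractional change in server latency against the baseline."""
    return (sample.server_ms - sample.baseline_ms) / sample.baseline_ms


def client_server_gap(sample: MetricSample) -> float:
    """Return how much longer the client waits than the server reports."""
    return sample.client_ms - sample.server_ms


def needs_attention(sample: MetricSample,
                    max_increase: float = 0.5,
                    max_gap_ms: float = 200.0) -> bool:
    """Flag a sample when either assumed threshold is exceeded."""
    return (latency_increase(sample) > max_increase
            or client_server_gap(sample) > max_gap_ms)


if __name__ == "__main__":
    sample = MetricSample(baseline_ms=100.0, server_ms=180.0, client_ms=190.0)
    print(needs_attention(sample))
```

A large client-server gap with a normal server latency can point to a problem outside the service itself, such as the network or a dependency, which is why the sketch checks the two signals separately.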

After creating response plans, you can use your monitoring solutions to automatically track incidents the moment they happen in your environment. For more information about incident tracking and creation, see Viewing incident details in the Incident Manager console.
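
One possible way to connect monitoring to a response plan is an Amazon CloudWatch alarm that lists the response plan as an alarm action. The following Python sketch only builds the parameters for such an alarm. The helper name `build_alarm_params`, the alarm name, the threshold, and the evaluation settings are illustrative assumptions, and the response plan ARN is left as a placeholder that you supply.

```python
def build_alarm_params(alarm_name: str,
                       namespace: str,
                       metric_name: str,
                       threshold: float,
                       response_plan_arn: str) -> dict:
    """Build keyword arguments for a CloudWatch metric alarm.

    The result is intended to be passed to boto3, for example:
        boto3.client("cloudwatch").put_metric_alarm(**params)
    The period and evaluation settings below are assumptions for this sketch.
    """
    return {
        "AlarmName": alarm_name,
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Average",
        "Period": 60,                # seconds per evaluation period (assumed)
        "EvaluationPeriods": 3,      # consecutive breaching periods (assumed)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # The response plan to engage when the alarm enters the ALARM state.
        "AlarmActions": [response_plan_arn],
    }


if __name__ == "__main__":
    params = build_alarm_params(
        alarm_name="example-high-latency",      # hypothetical name
        namespace="AWS/ApplicationELB",
        metric_name="TargetResponseTime",
        threshold=1.0,                          # seconds, assumed
        response_plan_arn="<response-plan-arn>",  # placeholder, supply your own
    )
    print(params["AlarmActions"])
```

Building the parameters separately from the API call keeps the alarm definition easy to review and test before anything is created in your account.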

For more information about architecting secure, high-performing, resilient, and efficient infrastructure for your applications and workloads, see the AWS Well-Architected Framework.
