Performing a post-incident analysis in Incident Manager
Post-incident analysis guides you through identifying improvements to your incident response, including time to detection and mitigation. An analysis can also help you understand the root cause of an incident. Incident Manager creates recommended action items to improve your incident response.
Benefits of a post-incident analysis
-
Improve incident response
-
Understand the root cause of the problem
-
Address root causes with deliverable action items
-
Analyze the impact of incidents
-
Capture and share learnings within an organization
What not to use an analysis for
An analysis is blameless and doesn't call out people by name.
"Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand." - Norm Kerth, Project Retrospectives: A Handbook for Team Review
Analysis details
The analysis details page guides you through gathering information, assessing improvements, and creating action items. It is similar to the incident details page, with some key differences such as historical metrics, an editable timeline, and questions for improving your response to future incidents.
Overview
The overview is a summary of the incident. This summary includes background, what occurred, why it happened, how it was mitigated, duration, and key action items to prevent the incident from happening again. The overview is high level. You'll explore more details in the Questions tab of the analysis.
Metrics
Use the Metrics tab to visualize key metrics in your application over the duration of the incident. You can add metric graphs here that have one or more metrics depicted in the same graph. Metrics used during an incident are automatically populated on this tab. We recommend adding a description, a title, and annotations for key time points during the incident.
Some key time points you can consider when analyzing a metric graph:
-
Deployment change
-
Configuration change
-
Incident start time
-
Alarm time
-
Time of engagement
-
Mitigation start time
-
Incident resolved time
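As a rough illustration of how these time points relate to response times, the following Python sketch derives a few intervals from them. The function name, the interval names, and the sample timestamps are hypothetical and not part of Incident Manager. The interval definitions (for example, time to detection as alarm time minus incident start time) are assumptions that you should adapt to your own conventions.

```python
from datetime import datetime, timedelta


def response_intervals(start, alarm, engagement, mitigation_start, resolved):
    """Derive response intervals from key incident time points.

    The interval definitions below are assumptions for illustration only.
    """
    return {
        # How long the incident ran before an alarm fired.
        "time_to_detection": alarm - start,
        # How long it took to engage responders after the alarm.
        "time_to_engagement": engagement - alarm,
        # How long responders needed before mitigation began.
        "time_to_mitigation_start": mitigation_start - engagement,
        # Total duration from incident start to resolution.
        "time_to_resolution": resolved - start,
    }


# Hypothetical time points, as you might read them off an annotated graph.
intervals = response_intervals(
    start=datetime(2024, 1, 15, 9, 0),
    alarm=datetime(2024, 1, 15, 9, 12),
    engagement=datetime(2024, 1, 15, 9, 20),
    mitigation_start=datetime(2024, 1, 15, 9, 45),
    resolved=datetime(2024, 1, 15, 10, 30),
)

for name, delta in intervals.items():
    print(f"{name}: {delta}")
```

Comparing these intervals across incidents can show which phase of the response has the most room for improvement.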
Limitations
-
CloudWatch alarms and metric expressions aren't imported from an incident.
-
Metrics that are in a Region that Incident Manager doesn't support aren't imported from the incident.
-
Metrics in application accounts require configuration of the CloudWatch-CrossAccountSharingRole prior to creating the analysis. For more information about the role, see Cross-Account Cross-Region CloudWatch console in the CloudWatch user guide.
Timeline
Describe key time points on the timeline as you dive deeper into understanding the incident. The incident's timeline is automatically populated on this tab. You can delete time points that aren't relevant to the analysis. You can also add and edit time points to more accurately describe the incident and its impact.
Use the Timeline tab to answer questions you find on the Questions tab about the incident response.
Questions
Use Incident Manager questions to improve the time to resolution of incidents in your application and reduce the occurrence of incidents. As you answer questions, update the Metrics and Timeline tabs for accuracy. The questions focus on these key aspects of incident response:
-
Detection – Can you improve the time to detection? Are there updates to metrics and alarms that would detect the incident sooner?
-
Diagnosis – Can you improve the time to diagnosis? Are there updates to your response plans or escalation plans that would engage the correct responders sooner?
-
Mitigation – Can you improve the time to mitigation? Are there runbook steps that you could add or improve?
-
Prevention – Can you prevent future incidents from occurring? To discover the root causes of an incident, Amazon uses the 5-Whys approach in problem investigation.
Actions
Incident Manager creates recommended action items for you to review as you complete the questions. You can choose to accept and complete these actions from this tab or you can dismiss these actions. You can review dismissed action items by choosing Dismissed action items. Action items are a type of OpsItem that are linked to the analysis and incident in OpsCenter.
Checklist
Before closing an analysis, use the checklist to review actions that a responder should take. As responders complete actions in the checklist, the icon next to the action changes from an ellipse to a check-mark, indicating that the action is complete. If you haven't completed checklist items, Incident Manager displays a message to confirm the responder wants to close the analysis without completing it.
Analysis templates
An analysis template provides a set of questions that dive deep into the root cause of incidents. You can use your answers to these questions to improve application performance and incident response.
AWS standard template
Incident Manager provides a standard template of questions based on AWS incident response and problem analysis best practices, titled AWSIncidents-PostIncidentAnalysisTemplate.
Create an analysis template
We encourage you to use the default AWSIncidents-PostIncidentAnalysisTemplate template and add additional questions or sections that are appropriate for your use cases.
Create analysis templates based on the default template
Use this template as a starting point to create analysis templates in your management account. You can then duplicate your analysis templates to each Region where you enabled Incident Manager.
To create an analysis template
-
Call the GetDocument action and use its Name parameter to download AWSIncidents-PostIncidentAnalysisTemplate. For more information about the GetDocument syntax, see Systems Manager API Reference.
-
The content in the response contains the JSON building blocks for the analysis. Use the question building blocks to insert additional questions in the analysis. We recommend that you add questions or sections in the Incident questions section.
-
To create the new template, use the CreateDocument operation with the updated JSON from the previous step. You must include the following, where Analysis_Template_Name is the name of your template:
-
DocumentFormat: "JSON"
-
DocumentType: "ProblemAnalysisTemplate"
-
Name: "Analysis_Template_Name"
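The template procedure above can be sketched in Python. This is a minimal sketch, assuming a boto3 Systems Manager client is passed in as ssm. The function name clone_analysis_template and the optional edit callback are hypothetical helpers introduced for illustration. The internal structure of the template JSON isn't described here, so the sketch leaves any changes to the caller.

```python
import json

DEFAULT_TEMPLATE = "AWSIncidents-PostIncidentAnalysisTemplate"


def clone_analysis_template(ssm, new_name, edit=None):
    """Copy the default analysis template under a new name.

    ssm      -- a Systems Manager client, for example boto3.client("ssm")
    new_name -- the name of the new template (Analysis_Template_Name)
    edit     -- optional callable (hypothetical) that modifies the parsed
                template in place, for example to add questions
    """
    # Step 1: download the default template with GetDocument.
    response = ssm.get_document(Name=DEFAULT_TEMPLATE, DocumentFormat="JSON")
    template = json.loads(response["Content"])

    # Step 2: let the caller insert additional questions or sections.
    if edit is not None:
        edit(template)

    # Step 3: create the new template with the required parameters.
    return ssm.create_document(
        Content=json.dumps(template),
        Name=new_name,
        DocumentType="ProblemAnalysisTemplate",
        DocumentFormat="JSON",
    )


# Example call (requires boto3 and AWS credentials, so it is commented out):
#
#   import boto3
#   clone_analysis_template(boto3.client("ssm"), "Analysis_Template_Name")
```

Passing the client in as an argument keeps the function independent of any one Region. To duplicate a template to another Region where you enabled Incident Manager, you could call it again with a client created for that Region.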
Create an analysis
-
To create an analysis, choose Create analysis from the incident details page of a closed incident.
-
Choose the analysis template to create this analysis from, and enter a descriptive name for the analysis.
-
Choose Create.
Print a formatted incident analysis
You can generate a copy of a complete or incomplete analysis that is formatted for printing. You can also save this copy as a PDF. You can print one analysis at a time. Batch printing of multiple analyses isn't currently supported.
To print a formatted analysis
-
Open the Incident Manager console.
-
Choose the Analysis tab.
-
Choose the title of the analysis that you want to print.
-
In the upper right corner of the analysis detail page, choose Print.
-
In the Print incident analysis dialog box, clear the sections of the analysis you don't want included in the printed version. By default, all sections are selected.
-
Choose Print to open the local print controls for your device.
-
Choose your printing destination or format. You can choose a local or network printer, or you can save the analysis to a PDF. Adjust the remaining printing options as needed, and then choose Print.
Note
Local print controls refers to the user interface provided by your web browser and device.
Printing destinations are those configured for, and accessible from, your device.