

# Managing AWS Fault Injection Service experiments
<a name="testing"></a>

This section describes how to manage AWS Fault Injection Service (AWS FIS) experiments in AWS Resilience Hub. You run AWS FIS experiments to measure the resiliency of your AWS resources and the amount of time it takes to recover from application, infrastructure, Availability Zone, and AWS Region incidents.

To measure resiliency, these AWS FIS experiments simulate disruptions to your AWS resources. Examples of disruptions include network unavailable errors, failovers, stopped processes on Amazon EC2 instances or in Amazon EC2 Auto Scaling groups, boot recovery in Amazon RDS, and problems with your Availability Zone. When an AWS FIS experiment concludes, you can estimate whether an application can recover from the outage types defined in the RTO target of the resiliency policy.

All the experiments in AWS Resilience Hub are built using AWS FIS and run AWS FIS actions. AWS FIS experiments use only AWS FIS automation actions that are customized to specific AWS services (such as Amazon EKS actions). For more information about AWS FIS actions, see [AWS FIS actions reference](https://docs.aws.amazon.com/fis/latest/userguide/fis-actions-reference.html).

You can use the AWS FIS experiments in their default state or customize them based on your requirements. For more information about managing AWS FIS experiments from the AWS Resilience Hub console and the AWS FIS console, see the following topics:
+ AWS Resilience Hub console
  + [Viewing AWS FIS experiments](view-fis-experiment.md)
    + [To view the list of implemented AWS FIS experiments from applications](view-fis-experiment.md#view-active-fis-experiments)
    + [To view the recommended AWS FIS experiments from assessments](view-fis-experiment.md#view-recommended-fis-experiments)
  + [Running AWS FIS experiments](test-assessment-report.md#arh-running-aws-fis-experiments)
  + [AWS Fault Injection Service experiment failures/status check](test-failures.md)
+ AWS FIS console
  + [Managing your AWS FIS experiments](https://docs.aws.amazon.com/fis/latest/userguide/experiments.html)
  + [Working with the AWS FIS scenario library](https://docs.aws.amazon.com/fis/latest/userguide/scenario-library.html)
  + [Managing AWS FIS experiment templates](https://docs.aws.amazon.com/fis/latest/userguide/manage-experiment-template.html)

# Initiating, creating, and running AWS FIS experiments
<a name="test-assessment-report"></a>

AWS Resilience Hub simplifies resilience testing by integrating with AWS FIS. It provides tailored recommendations and lets you initiate AWS FIS experiments from pre-populated templates that are mapped to your Application Components (AppComponents).<a name="arh-initiate-fis-experiment"></a>

**To initiate an AWS FIS experiment from Operational recommendations**

1. Open the AWS Resilience Hub console.

1. In the navigation pane, choose **Applications**.

1. From the list of applications, choose the application for which you want to create an experiment.

1. Choose the **Assessments** tab.

1. Select an assessment from the **Resiliency assessments** table. If you don't have an assessment, complete the procedure in [Running resiliency assessments in AWS Resilience Hub](run-assessment.md) and then return to this step.

1. Choose the **Operational recommendations** tab.

1. Choose the right arrow before **Fault injection experiments**.

   This section lists all the AWS FIS experiments recommended by AWS Resilience Hub for your application to stress-test and improve its resilience. Based on your implementation, the AWS FIS experiments are categorized into the following states:
   + **Implemented** – Indicates that the experiments recommended by AWS Resilience Hub are implemented in your application. Choose the number below to view all the implemented experiments in the **Experiments** table.
   + **Partially implemented** – Indicates that the experiments recommended by AWS Resilience Hub are partially implemented in your application. Choose the number below to view all the partially implemented experiments in the **Experiments** table.
   + **Not implemented** – Indicates that the experiments recommended by AWS Resilience Hub are unimplemented in your application. Choose the number below to view all the unimplemented experiments in the **Experiments** table.
   + **Excluded** – Indicates that the experiments recommended by AWS Resilience Hub are excluded from your application. Choose the number below to view all the excluded experiments in the **Experiments** table. For more information about including and excluding recommended experiments, see [Including or excluding operational recommendations](https://docs.aws.amazon.com/resilience-hub/latest/userguide/exclude-recommend.html?icmpid=docs_resiliencehub_help_panel_operational_recommendations_alarms).

   The **Experiments** table lists all the implemented AWS FIS experiments that impact the resiliency score of your application. You can identify the AWS FIS experiments using the following information:
   + **Action name** – Indicates the AWS FIS action recommended for your application. Choose the action name to view all the recommended AppComponents on the **AWS FIS experiment details** page. When the **State** is set to **Not trackable**, it indicates that the AWS FIS experiment is a scenario. Choose the scenario name to view its details on the **Scenario library** page in the AWS FIS console.
   + **State** – Indicates the current implementation state of the AWS FIS experiment: **Implemented**, **Partially implemented**, **Not implemented**, or **Excluded**.

     **Note**  
     An AWS FIS scenario is a console-only feature with multiple predefined actions. AWS Resilience Hub therefore cannot track it and sets the **State** to **Not trackable**.
   + **Description** – Describes the objective of the AWS FIS action.

1. Select an AWS FIS action for which you want to initiate an experiment.

   In the AWS FIS experiment recommendation section, you can learn more about the experiments that you need to implement on the AppComponents using the following information:
   + **Name** – Name of the AppComponent in which the resources are grouped.
   + **State** – Indicates the current implementation state of the AWS FIS action: **Implemented**, **Partially implemented**, **Not implemented**, or **Excluded**.

     **Note**  
     An AWS FIS scenario is a console-only feature with multiple predefined actions. AWS Resilience Hub therefore cannot track it and sets the **State** to **Not trackable**.
   + **Target selection** – Indicates how the resources will be included in the experiment when you choose **Initiate experiment**. If AWS Resilience Hub doesn't automatically determine target resources, hover over the respective **Target selection** field for guidance on adding them.
   + **Resources** – Indicates the number of resources grouped under the AppComponent. Choose the number to view these resources in the **Resources** dialog box. You can identify the resources using the following:
     + **Logical ID** – Indicates the logical ID of the resource. A logical ID is a name used to identify resources in your AWS CloudFormation stack, Terraform state file, myApplications application, AWS Resource Groups resource, or Amazon Elastic Kubernetes Service cluster.
     + **Physical ID** – Indicates the actual assigned identifier for the resource, such as an Amazon EC2 instance ID or an Amazon S3 bucket name.
     + **Type** – Indicates the type of resource.
     + **Region** – Indicates the AWS Region in which the resource is located. 

1. Select an AppComponent and choose **Include** or **Exclude** to include or exclude the AppComponent in the AWS FIS experiment, respectively.

1. Choose **Initiate experiment**.

   AWS Resilience Hub redirects you to the **Specify template details** page in the AWS FIS console, which opens in a new tab.

1. To create an experiment template, complete the steps in [To create an experiment template using the console](https://docs.aws.amazon.com/fis/latest/userguide/create-template.html).

   After you enter the template details on the **Specify template details** page and choose **Next**, AWS Resilience Hub automatically tries to map **Actions** and **Targets** for your resource types on the **Actions and targets** page. To improve coverage, you can manually add actions and targets by choosing **Add action** and **Add target**, respectively, and then complete the rest of the procedure to create your experiment.
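
If you script template creation instead of using the console, the following minimal sketch shows how actions and targets relate in an AWS FIS experiment template. It only builds a request body in the shape expected by the AWS FIS `CreateExperimentTemplate` API. The function name, the target and action names, the description, and the choice of the `aws:ec2:stop-instances` action are assumptions made for illustration. This is not the template that AWS Resilience Hub pre-populates.

```python
import uuid


def build_stop_instances_template(role_arn, alarm_arn, tag_key, tag_value):
    """Return a request body for the AWS FIS CreateExperimentTemplate API.

    Every value is supplied by the caller; nothing here is generated by
    AWS Resilience Hub. The names "tagged-instances" and "stop-instances"
    are hypothetical labels chosen for this sketch.
    """
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "description": "Stop tagged EC2 instances (illustrative)",
        "roleArn": role_arn,
        # A target names the resources that an action operates on.
        "targets": {
            "tagged-instances": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "ALL",
            }
        },
        # An action refers to a target by the name defined above.
        "actions": {
            "stop-instances": {
                "actionId": "aws:ec2:stop-instances",
                "targets": {"Instances": "tagged-instances"},
            }
        },
        # Stop the experiment if the given CloudWatch alarm goes off.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
    }
```

With boto3 installed and credentials configured, the returned dictionary could be passed as keyword arguments to `boto3.client("fis").create_experiment_template(**body)`.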

## Running AWS FIS experiments
<a name="arh-running-aws-fis-experiments"></a>

After you create an experiment template in the AWS FIS console, follow the steps in [Start an experiment from a template](https://docs.aws.amazon.com/fis/latest/userguide/start-experiment-from-template.html) to run the experiment in the AWS FIS console. For AWS Resilience Hub to detect the latest experiments that you ran in AWS FIS, you must run a new assessment. For more information about running assessments, see [Running resiliency assessments in AWS Resilience Hub](run-assessment.md).
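
The same sequence (start an experiment, wait for it to finish, then run a new assessment) can be sketched with the AWS SDK for Python. This is a rough outline, not a supported workflow. The function names, the polling interval, and the argument values are assumptions. The clients are passed in as parameters and would typically be created with `boto3.client("fis")` and `boto3.client("resiliencehub")`.

```python
import time
import uuid

# Values of experiment.state.status after which an experiment no longer changes.
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(fis_client, template_id, poll_seconds=15, sleep=time.sleep):
    """Start an experiment from a template and wait for a terminal state."""
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    while True:
        experiment = fis_client.get_experiment(id=experiment_id)["experiment"]
        status = experiment["state"]["status"]
        if status in TERMINAL_STATES:
            return experiment_id, status
        sleep(poll_seconds)


def rerun_assessment(resiliencehub_client, app_arn, app_version, name):
    """Start a new assessment so that the latest experiment runs are detected."""
    response = resiliencehub_client.start_app_assessment(
        appArn=app_arn,
        appVersion=app_version,
        assessmentName=name,
    )
    return response["assessment"]["assessmentArn"]
```

The `sleep` parameter exists only so that the polling loop can be exercised without waiting. In normal use you would leave it at its default.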

# Viewing AWS FIS experiments
<a name="view-fis-experiment"></a>

In AWS Resilience Hub, you can view the AWS FIS experiments that you set up to measure the resiliency of your AWS resources and the amount of time it takes to recover from application, infrastructure, Availability Zone, and AWS Region incidents.

To view the list of active AWS FIS experiments from the dashboard, choose **Dashboard** from the left navigation menu. 

In the **Implemented experiments** table, you can identify the AWS FIS experiments using the following information:
+ **Experiment ID** – Identifier of the AWS FIS experiment.
+ **Action** – Indicates the AWS FIS action associated with the AWS FIS experiment. If the experiment has more than one action, this column also shows the number of associated AWS FIS actions. To view their details, hover over the actions or navigate to them.
+ **Experiment template ID** – Identifier of the AWS FIS experiment template that was used to create the AWS FIS experiment.<a name="view-active-fis-experiments"></a>

**To view the list of implemented AWS FIS experiments from applications**

1. In the left navigation menu, choose **Applications**.

1. Select an application from the **Applications** table. 

   To find an application, enter the application name in the **Find applications** box.

1. Choose **Fault injection experiments**.

   In the **Implemented experiments** table, you can identify the AWS FIS experiments implemented in your application using the following information:
   + **Experiment ID** – Identifier of the AWS FIS experiment.
   + **Action** – Indicates the AWS FIS action associated with the AWS FIS experiment. If the experiment has more than one action, this column also shows the number of associated AWS FIS actions. To view their details, hover over the actions or navigate to them.
   + **Experiment template ID** – Identifier of the AWS FIS experiment template that was used to create the AWS FIS experiment.<a name="view-recommended-fis-experiments"></a>

**To view the recommended AWS FIS experiments from assessments**

1. In the left navigation menu, choose **Applications**.

1. Select an application from the **Applications** table. 

   To find an application, enter the application name in the **Find applications** box.

1. Choose the **Assessments** tab.

   In the **Assessments** table, you can identify your assessments using the following information:
   + **Name** – Name that you provided for the assessment when you created it.
   + **Status** – Indicates the execution state of the assessment.
   + **Compliance status** – Indicates if the assessment is compliant with the resiliency policy.
   + **Resiliency** – Indicates whether your application has drifted from the RTO and RPO targets defined in the attached resiliency policy since the previous successful assessment.
   + **App version** – Version of your application that was assessed.
   + **Invoker** – Indicates the role that invoked the assessment.
   + **Start time** – Indicates the start time of the assessment.
   + **End time** – Indicates the end time of the assessment.
   + **ARN** – The Amazon Resource Name (ARN) of the assessment.

1. Select an assessment from the **Assessments** table.

1. Choose **Operational recommendations**.

1. Choose the right arrow before **Fault injection experiments**.

   This section lists all the AWS FIS experiments recommended by AWS Resilience Hub for your application to stress-test and improve its resilience. Based on your implementation, the AWS FIS experiments are categorized into the following states:
   + **Implemented** – Indicates that the experiments recommended by AWS Resilience Hub are implemented in your application. Choose the number below to view all the implemented experiments in the **Experiments** table.
   + **Partially implemented** – Indicates that the experiments recommended by AWS Resilience Hub are partially implemented in your application. Choose the number below to view all the partially implemented experiments in the **Experiments** table.
   + **Not implemented** – Indicates that the experiments recommended by AWS Resilience Hub are unimplemented in your application. Choose the number below to view all the unimplemented experiments in the **Experiments** table.
   + **Excluded** – Indicates that the experiments recommended by AWS Resilience Hub are excluded from your application. Choose the number below to view all the excluded experiments in the **Experiments** table. For more information about including and excluding recommended experiments, see [Including or excluding operational recommendations](https://docs.aws.amazon.com/resilience-hub/latest/userguide/exclude-recommend.html?icmpid=docs_resiliencehub_help_panel_operational_recommendations_alarms).

   The **Experiments** table lists all the implemented AWS FIS experiments that impact the resiliency score of your application. You can identify the AWS FIS experiments using the following information:
   + **Action name** – Indicates the AWS FIS action recommended for your application. When the **State** is set to **Not trackable**, it indicates that the AWS FIS experiment is a scenario. Choose the scenario name to view its details on the **Scenario library** page in the AWS FIS console.
   + **State** – Indicates the current implementation state of the AWS FIS experiment: **Implemented**, **Partially implemented**, **Not implemented**, or **Excluded**.

     **Note**  
     An AWS FIS scenario is a console-only feature with multiple predefined actions. AWS Resilience Hub therefore cannot track it and sets the **State** to **Not trackable**.
   + **Description** – Describes the objective of the AWS FIS action.
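
The columns of the **Implemented experiments** table described earlier in this section (experiment ID, action, and experiment template ID) can also be read from the AWS FIS API. The following sketch is one possible way to do so. It lists every experiment that AWS FIS returns for the account and Region of the client, which is not necessarily the same set that AWS Resilience Hub reports as implemented. The function name is hypothetical, and the client would typically be `boto3.client("fis")`.

```python
def list_experiment_rows(fis_client):
    """Return (experiment ID, action IDs, experiment template ID) tuples."""
    rows = []
    token = None
    while True:
        # Pass nextToken only when a previous page returned one.
        kwargs = {"nextToken": token} if token else {}
        page = fis_client.list_experiments(**kwargs)
        for summary in page.get("experiments", []):
            # The list call returns summaries; actions require a detail call.
            detail = fis_client.get_experiment(id=summary["id"])["experiment"]
            action_ids = sorted(
                action["actionId"]
                for action in detail.get("actions", {}).values()
            )
            rows.append(
                (summary["id"], action_ids, summary["experimentTemplateId"])
            )
        token = page.get("nextToken")
        if not token:
            return rows
```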

# AWS Fault Injection Service experiment failures/status check
<a name="test-failures"></a>

AWS Resilience Hub allows you to track the status of the experiments that you have started. For more information, see the [To view the recommended AWS FIS experiments from assessments](view-fis-experiment.md#view-recommended-fis-experiments) procedure.

**Topics**
+ [Analyzing AWS FIS experiment execution using AWS Systems Manager](test-failures-ssm.md)
+ [AWS FIS experiment failures while testing Kubernetes pods running in your Amazon Elastic Kubernetes Service clusters](test-failures-eks.md)

# Analyzing AWS FIS experiment execution using AWS Systems Manager
<a name="test-failures-ssm"></a>

After running an AWS FIS experiment, you can view the execution details in AWS Systems Manager.

1. Go to **CloudTrail** > **Event History**.

1. Filter events by **User name** using the experiment ID.

1. View the `StartAutomationExecution` entry. **Request ID** is the SSM automation ID.

1. Go to **AWS Systems Manager** > **Automation**.

1. Filter by **Execution ID** using the SSM automation ID and view the automation details.

   You can analyze the execution as you would any Systems Manager automation. For more information, see the [AWS Systems Manager Automation](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-automation.html) user guide. The execution input parameters appear in the **Input parameters** section of the **Execution Detail** page and include optional parameters that do not appear in the AWS FIS experiment.

   You can find information on step status and other step details by drilling down to specific steps within the Execution steps.
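
The console steps above can be approximated in a script. The following sketch assumes, as the procedure states, that the request ID of the `StartAutomationExecution` event identifies the SSM automation execution. The function names are hypothetical, only the first page of CloudTrail results is read, and the clients would typically be `boto3.client("cloudtrail")` and `boto3.client("ssm")`.

```python
import json


def find_automation_ids(cloudtrail_client, experiment_id):
    """Return request IDs of StartAutomationExecution events for an experiment."""
    response = cloudtrail_client.lookup_events(
        LookupAttributes=[
            # Same filter as "User name" in the Event history console page.
            {"AttributeKey": "Username", "AttributeValue": experiment_id}
        ]
    )
    ids = []
    for event in response.get("Events", []):
        if event.get("EventName") != "StartAutomationExecution":
            continue
        # CloudTrailEvent holds the full event record as a JSON string.
        record = json.loads(event["CloudTrailEvent"])
        ids.append(record["requestID"])
    return ids


def failed_steps(ssm_client, execution_id):
    """Return (step name, failure message) pairs for failed automation steps."""
    response = ssm_client.get_automation_execution(
        AutomationExecutionId=execution_id
    )
    steps = response["AutomationExecution"].get("StepExecutions", [])
    return [
        (step["StepName"], step.get("FailureMessage", ""))
        for step in steps
        if step.get("StepStatus") == "Failed"
    ]
```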

**Common failures**

The following are common failures encountered while executing an assessment report:
+ Alarm template was not deployed before the Test/SOP experiment was executed. This causes an error message during the automation step.
  + **Failure message:** `The following parameters were not found: [/ResilienceHub/Alarm/3dee49a1-9877-452a-bb0c-a958479a8ef2/nat-gw-alarm-bytes-out-to-source-2020-09-21_nat-02ad9bc4fbd4e6135]. Make sure all the SSM parameters in automation document are created in SSM Parameter Store.`
  + **Remediation:** Render the relevant alarm and deploy the resulting template before you rerun the fault injection experiment.
+ Missing permissions in the execution role. This error message occurs if the provided execution role is missing a permission and appears within the step details.
  + **Failure message:** `An error occurred (Unauthorized Operation) when calling the DescribeInstanceStatus operation: You are not authorized to perform this operation. Please Refer to Automation Service Troubleshooting Guide for more diagnosis details`.
  + **Remediation:** Verify that you provided the correct execution role. If you did, add the missing permission to the role and rerun the assessment.
+ Execution succeeded but did not have the expected result. This is the result of incorrect parameters or an internal automation issue.
  + **Failure message:** The execution succeeded, so no error message is shown.
  + **Remediation:** Check the input parameters and review the executed steps as explained in [Analyzing AWS FIS experiment execution using AWS Systems Manager](test-failures-ssm.md), and then examine the individual steps for expected inputs and outputs.
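
If you collect failure messages from many executions, a small helper can map each message to the remediation listed above. The following is an illustrative sketch only. The function names and the matching patterns are assumptions derived from the example messages in this topic, and real messages might be worded differently.

```python
import re

# (pattern, remediation) pairs paraphrased from the common failures above.
KNOWN_FAILURES = [
    (
        re.compile(r"The following parameters were not found"),
        "Deploy the rendered alarm template, then rerun the experiment.",
    ),
    (
        re.compile(r"not authorized to perform this operation", re.IGNORECASE),
        "Check the execution role and add the missing permission.",
    ),
]


def suggest_remediation(failure_message):
    """Return a remediation hint for an automation failure message."""
    if not failure_message:
        # The execution succeeded but did not have the expected result.
        return "Check the input parameters and the executed steps."
    for pattern, remediation in KNOWN_FAILURES:
        if pattern.search(failure_message):
            return remediation
    return "Unrecognized failure: inspect the step details."


def missing_parameters(failure_message):
    """Extract SSM parameter names from a 'parameters were not found' message."""
    match = re.search(r"not found: \[([^\]]*)\]", failure_message)
    if not match:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]
```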

# AWS FIS experiment failures while testing Kubernetes pods running in your Amazon Elastic Kubernetes Service clusters
<a name="test-failures-eks"></a>

The following are common Amazon Elastic Kubernetes Service (Amazon EKS) failures encountered while testing Kubernetes pods running in your Amazon EKS clusters:
+ Incorrect configuration of IAM roles for AWS FIS experiments or the Kubernetes service account.
  + **Failure messages:** 
    + `Error resolving targets. Kubernetes API returned ApiException with error code 401`. 
    + `Error resolving targets. Kubernetes API returned ApiException with error code 403`. 
    + `Unable to inject AWS FIS Pod: Kubernetes API returned status code 403. Check Amazon EKS logs for more details`. 
  + **Remediation:** Verify the following.
    + Ensure that you have followed the instructions in [Use the AWS FIS `aws:eks:pod` actions](https://docs.aws.amazon.com/fis/latest/userguide/eks-pod-actions.html).
    + Ensure that you have created and configured a Kubernetes Service Account with the necessary RBAC permissions and the correct namespace.
    + Ensure that you have mapped the provided IAM role (see the output of the CloudFormation stack of the test) to the Kubernetes user.
+ Unable to start AWS FIS Pod: Max failed sidecar containers reached. This usually happens when the memory is not sufficient to run the AWS FIS sidecar container.
  + **Failure message:** `Unable to heartbeat FIS Pod: Max failed sidecar containers reached`.
  + **Remediation:** To avoid this error, one option is to reduce the target load percentage so that it fits within the available memory or CPU.
+ Alarm assertion failed at the beginning of the experiment. This error occurs because the related alarm has no datapoint.
  + **Failure message:** `Assertion failed for the following alarms`. Lists all the alarms for which the assertion has failed.
  + **Remediation:** Ensure that Container Insights is correctly installed for the alarms and that the alarm is not already in the `ALARM` state.
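
To catch the alarm assertion failure before you start an experiment, you could check the state of the related CloudWatch alarms first. The following is a minimal sketch under the assumption that every alarm involved is a metric alarm. The function name is hypothetical, and the client would typically be `boto3.client("cloudwatch")`. An alarm without datapoints is usually reported as `INSUFFICIENT_DATA`.

```python
def alarms_not_ready(cloudwatch_client, alarm_names):
    """Return {alarm name: state} for metric alarms that are not in the OK state."""
    response = cloudwatch_client.describe_alarms(AlarmNames=list(alarm_names))
    return {
        alarm["AlarmName"]: alarm["StateValue"]
        for alarm in response.get("MetricAlarms", [])
        # ALARM and INSUFFICIENT_DATA are both reported as not ready.
        if alarm["StateValue"] != "OK"
    }
```

An empty result means that none of the returned alarms is in the `ALARM` or `INSUFFICIENT_DATA` state. Alarm names that do not exist are not returned by the call, so compare the result against the names that you passed in if you also need to detect missing alarms.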