

# Using Amazon CloudWatch alarms
<a name="CloudWatch_Alarms"></a>

You can create alarms that watch metrics and send notifications or automatically make changes to the resources you are monitoring when a threshold is breached. For example, you can monitor the CPU usage and disk reads and writes of your Amazon EC2 instances and then use that data to determine whether you should launch additional instances to handle increased load. You can also use this data to stop under-used instances to save money.

You can create *metric*, *PromQL*, and *composite* alarms in Amazon CloudWatch.

You can create alarms on Metrics Insights queries that use AWS resource tags to filter and group metrics. This provides context-aware monitoring that automatically adapts to your tagging strategy, so you can monitor all resources tagged with a specific application or environment. To use tags with alarms, open the [CloudWatch console](https://console.aws.amazon.com/cloudwatch/) and choose **Settings**. On the **CloudWatch Settings** page, under **Enable resource tags on telemetry**, choose **Enable**.
+ A *metric alarm* watches a single CloudWatch metric or the result of a math expression based on CloudWatch metrics. The alarm performs one or more actions based on the value of the metric or expression relative to a threshold over a number of time periods. The action can be sending a notification to an Amazon SNS topic, performing an Amazon EC2 action or an Amazon EC2 Auto Scaling action, starting an operational investigation in CloudWatch investigations, or creating an OpsItem or incident in Systems Manager.
+ A *PromQL alarm* monitors metrics using a Prometheus Query Language (PromQL) instant query on metrics ingested through the CloudWatch OTLP endpoint. The alarm tracks individual breaching time series as contributors and uses duration-based pending and recovery periods to control state transitions. For more information, see [PromQL alarms](alarm-promql.md).
+ A *composite alarm* includes a rule expression that takes into account the alarm states of other alarms that you have created. The composite alarm goes into ALARM state only if all conditions of the rule are met. The alarms specified in a composite alarm's rule expression can include metric alarms and other composite alarms.

  Using composite alarms can reduce alarm noise. You can create multiple metric alarms, and also create a composite alarm and set up alerts only for the composite alarm. For example, a composite alarm might go into ALARM state only when all of the underlying metric alarms are in ALARM state.

  Composite alarms can send Amazon SNS notifications when they change state, and can create investigations, Systems Manager OpsItems, or incidents when they go into ALARM state, but can't perform EC2 actions or Auto Scaling actions.

**Note**  
 You can create as many alarms as you want in your AWS account. 

 You can add alarms to dashboards, so you can monitor and receive alerts about your AWS resources and applications across multiple regions. After you add an alarm to a dashboard, the alarm turns gray when it's in the `INSUFFICIENT_DATA` state and red when it's in the `ALARM` state. The alarm is shown with no color when it's in the `OK` state. 

You can also mark recently visited alarms as favorites from the *Favorites and recents* option in the CloudWatch console navigation pane. The *Favorites and recents* option has columns for your favorite alarms and recently visited alarms.

An alarm invokes actions only when the alarm changes state. The exception is for alarms with Auto Scaling actions. For Auto Scaling actions, the alarm continues to invoke the action once per minute that the alarm remains in the new state. 

An alarm can watch a metric in the same account. If you have enabled cross-account functionality in your CloudWatch console, you can also create alarms that watch metrics in other AWS accounts. Creating cross-account composite alarms is not supported. Creating cross-account alarms that use math expressions is supported, except that the `ANOMALY_DETECTION_BAND`, `INSIGHT_RULE`, and `SERVICE_QUOTA` functions are not supported for cross-account alarms.

**Note**  
CloudWatch doesn't test or validate the actions that you specify, nor does it detect any Amazon EC2 Auto Scaling or Amazon SNS errors resulting from an attempt to invoke nonexistent actions. Make sure that your alarm actions exist.

# Concepts
<a name="alarm-concepts"></a>

CloudWatch alarms monitor metrics and trigger actions when thresholds are breached. Understanding how alarms evaluate data and respond to conditions is essential for effective monitoring.

**Topics**
+ [Alarm data queries](alarm-data-queries.md)
+ [Alarm evaluation](alarm-evaluation.md)
+ [PromQL alarms](alarm-promql.md)
+ [Composite alarms](alarm-combining.md)
+ [Alarm actions](alarm-actions.md)
+ [Alarm Mute Rules](alarm-mute-rules.md)
+ [Limits](alarm-limits.md)

# Alarm data queries
<a name="alarm-data-queries"></a>

CloudWatch alarms can monitor various data sources. Choose the appropriate query type based on your monitoring needs.

## Metrics
<a name="alarm-query-metrics"></a>

Monitor a single CloudWatch metric. This is the most common alarm type for tracking resource performance. For more information about metrics, see [CloudWatch Metrics concepts](cloudwatch_concepts.md).

For more information, see [Create a CloudWatch alarm based on a static threshold](ConsoleAlarms.md).

## Metric math
<a name="alarm-query-metric-math"></a>

You can set an alarm on the result of a math expression that is based on one or more CloudWatch metrics. A math expression used for an alarm can include as many as 10 metrics. Each metric must be using the same period.

For an alarm based on a math expression, you can specify how you want CloudWatch to treat missing data points. In this case, the data point is considered missing if the math expression doesn't return a value for that data point.

Alarms based on math expressions can't perform Amazon EC2 actions.

For more information about metric math expressions and syntax, see [Using math expressions with CloudWatch metrics](using-metric-math.md).

For more information, see [Create a CloudWatch alarm based on a metric math expression](Create-alarm-on-metric-math-expression.md).
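As a concrete sketch, the parameters for a math-expression alarm might be built as follows. This is a hypothetical illustration using the parameter shape of boto3's `put_metric_alarm`; the alarm name, metrics, threshold, and instance ID are invented for the example.

```python
def network_total_alarm_params(instance_id, threshold_bytes):
    """Build PutMetricAlarm parameters for an alarm on NetworkIn + NetworkOut.

    Hypothetical example; all names and values are illustrative.
    """
    dimensions = [{"Name": "InstanceId", "Value": instance_id}]

    def metric(query_id, name):
        # Each metric referenced by the expression must use the same period.
        return {
            "Id": query_id,
            "ReturnData": False,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": name,
                    "Dimensions": dimensions,
                },
                "Period": 300,
                "Stat": "Sum",
            },
        }

    return {
        "AlarmName": f"network-total-{instance_id}",
        "Metrics": [
            metric("nin", "NetworkIn"),
            metric("nout", "NetworkOut"),
            # The alarm evaluates the math expression, which is the only
            # entry with ReturnData set to True. An expression can
            # reference as many as 10 metrics.
            {"Id": "total", "Expression": "nin + nout",
             "Label": "Total network traffic", "ReturnData": True},
        ],
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "Threshold": threshold_bytes,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
    }

params = network_total_alarm_params("i-1234567890abcdef0", 5_000_000.0)
```

You would pass these parameters to `boto3.client("cloudwatch").put_metric_alarm(**params)`. Note that only the expression entry returns data; the raw metrics it references do not.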

## Metrics Insights
<a name="alarm-query-metrics-insights"></a>

A CloudWatch Metrics Insights query helps you query metrics at scale using SQL-like syntax. You can create an alarm on any Metrics Insights query, including queries that return multiple time series. When you create an alarm based on a Metrics Insights query, the alarm automatically adjusts as resources are added to or removed from your monitored group. Create the alarm once, and any resource that matches your query definition and filters joins the alarm's monitoring scope when its corresponding metric becomes available. For multi-time series queries, each returned time series becomes a contributor to the alarm, allowing more granular and dynamic monitoring.

Here are two primary use cases for CloudWatch Metrics Insights alarms:
+ Anomaly Detection and Aggregate Monitoring

  Create an alarm on a Metrics Insights query that returns a single aggregated time series. This approach works well for dynamic alarms that monitor aggregated metrics across your infrastructure or applications. For example, you can monitor the maximum CPU utilization across all your instances, with the alarm automatically adjusting as you scale your fleet.

  To create an aggregate monitoring alarm, use this query structure:

  ```
  SELECT FUNCTION(metricName)
  FROM SCHEMA(...)
  WHERE condition;
  ```
+ Per-Resource Fleet Monitoring

  Create an alarm that monitors multiple time series, where each time series functions as a contributor with its own state. The alarm activates when any contributor enters the ALARM state, triggering resource-specific actions. For example, monitor database connections across multiple RDS instances to prevent connection rejections.

  To monitor multiple time series, use this query structure:

  ```
  SELECT AVG(DatabaseConnections)
  FROM "AWS/RDS"
  WHERE condition
  GROUP BY DBInstanceIdentifier
  ORDER BY AVG() DESC;
  ```

  When creating multi-time series alarms, you must include two key clauses in your query:
  + A `GROUP BY` clause that defines how to structure the time series and determines how many time series the query will produce
  + An `ORDER BY` clause that establishes a deterministic sorting of your metrics, enabling the alarm to evaluate the most important signals first

  These clauses are essential for proper alarm evaluation. The `GROUP BY` clause splits your data into separate time series (for example, by instance ID), while the `ORDER BY` clause ensures consistent and prioritized processing of these time series during alarm evaluation.

For more information about how to create a multi-time series alarm, see [Create an alarm based on a Multi Time Series Metrics Insights query](multi-time-series-alarm.md).

## Log group-metric filters
<a name="alarm-query-log-metric-filter"></a>

You can create an alarm based on a log group-metric filter. With metric filters, you can look for terms and patterns in log data as the data is sent to CloudWatch. For more information, see [Create metrics from log events using filters](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) in the *Amazon CloudWatch Logs User Guide*.

For more information on how to create an alarm based on log group-metric filter, see [Alarming on logs](Alarm-On-Logs.md).

## PromQL
<a name="alarm-query-promql"></a>

You can create an alarm that uses a Prometheus Query Language (PromQL) instant query to monitor metrics ingested through the CloudWatch OTLP endpoint.

For more information about how PromQL alarms work, see [PromQL alarms](alarm-promql.md).

For more information on how to create a PromQL alarm, see [Create an alarm using a PromQL query](Create_PromQL_Alarm.md).

## External data source
<a name="alarm-query-external"></a>

You can create alarms that watch metrics from data sources that aren't in CloudWatch. For more information about creating connections to these other data sources, see [Query metrics from other data sources](MultiDataSourceQuerying.md).

For more information on how to create an alarm based on a connected data source, see [Create an alarm based on a connected data source](Create_MultiSource_Alarm.md).

# Alarm evaluation
<a name="alarm-evaluation"></a>

## Metric alarm states
<a name="alarm-states"></a>

A metric alarm has the following possible states:
+ `OK` – The metric or expression is within the defined threshold.
+ `ALARM` – The metric or expression is outside of the defined threshold.
+ `INSUFFICIENT_DATA` – The alarm has just started, the metric is not available, or not enough data is available for the metric to determine the alarm state.

## Alarm evaluation state
<a name="alarm-evaluation-state"></a>

In addition to the alarm state, each alarm has an evaluation state that provides information about the alarm evaluation process. The following states may occur:
+ `PARTIAL_DATA` – Indicates that not all the available data was able to be retrieved due to quota limitations. For more information, see [How partial data is handled](cloudwatch-metrics-insights-alarms-partial-data.md).
+ `EVALUATION_ERROR` – Indicates configuration errors in alarm setup that require review and correction. Refer to the `StateReason` field of the alarm for more details.
+ `EVALUATION_FAILURE` – Indicates temporary CloudWatch issues. We recommend manual monitoring until the issue is resolved.

You can view the evaluation state in the alarm details in the console, or by using the `describe-alarms` CLI command or `DescribeAlarms` API.

## Alarm evaluation settings
<a name="alarm-evaluation-settings"></a>

When you create an alarm, you specify three settings to enable CloudWatch to evaluate when to change the alarm state:
+ **Period** is the length of time to use to evaluate the metric or expression to create each individual data point for an alarm. It is expressed in seconds.
+ **Evaluation Periods** is the number of the most recent periods, or data points, to evaluate when determining alarm state.
+ **Datapoints to Alarm** is the number of data points within the Evaluation Periods that must be breaching to cause the alarm to go to the `ALARM` state. The breaching data points don't have to be consecutive, but they must all be within the last number of data points equal to **Evaluation Periods**.

For any period of one minute or longer, an alarm is evaluated every minute and the evaluation is based on the window of time defined by the **Period** and **Evaluation Periods**. For example, if the **Period** is 5 minutes (300 seconds) and **Evaluation Periods** is 1, then at the end of minute 5 the alarm evaluates based on data from minutes 1 to 5. Then at the end of minute 6, the alarm is evaluated based on the data from minutes 2 to 6.
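The sliding evaluation window described above can be sketched as follows. This is a minimal illustration; minutes are counted from when the metric started reporting.

```python
def evaluation_window(eval_minute, period_minutes, evaluation_periods):
    """Return the (exclusive start, inclusive end) minute range of data that
    an alarm evaluated at `eval_minute` looks at.

    The window length is Period * Evaluation Periods, ending at the
    evaluation time; the alarm is evaluated every minute.
    """
    window = period_minutes * evaluation_periods
    return (eval_minute - window, eval_minute)
```

With a 5-minute period and 1 evaluation period, the alarm evaluated at the end of minute 5 covers minutes 1 through 5, and the evaluation at the end of minute 6 covers minutes 2 through 6, matching the example above.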

If the alarm period is 10 seconds, 20 seconds, or 30 seconds, the alarm is evaluated every 10 seconds. For more information, see [High-resolution alarms](#high-resolution-alarms).

If the number of evaluation periods multiplied by the length of each evaluation period exceeds one day, the alarm is evaluated once per hour. For more details about how these multi-day alarms are evaluated, see [Example of evaluating a multi-day alarm](#evaluate-multiday-alarm).

In the following figure, the alarm threshold for a metric alarm is set to three units. Both **Evaluation Periods** and **Datapoints to Alarm** are 3. That is, when all existing data points in the most recent three consecutive periods are above the threshold, the alarm goes to `ALARM` state. In the figure, this happens in the third through fifth time periods. At period six, the value dips below the threshold, so one of the periods being evaluated is not breaching, and the alarm state changes back to `OK`. During the ninth time period, the threshold is breached again, but for only one period. Consequently, the alarm state remains `OK`.

![\[Alarm threshold trigger alarm\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/alarm_graph.png)


When you configure **Evaluation Periods** and **Datapoints to Alarm** as different values, you're setting an "M out of N" alarm: **Datapoints to Alarm** is "M" and **Evaluation Periods** is "N". The evaluation interval is the number of evaluation periods multiplied by the period length. For example, if you configure 4 out of 5 data points with a period of 1 minute, the evaluation interval is 5 minutes. If you configure 3 out of 3 data points with a period of 10 minutes, the evaluation interval is 30 minutes.
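The "M out of N" rule can be sketched as a couple of small functions. Booleans stand in for datapoints (`True` = breaching); handling of missing data is covered in a later section and is not modeled here.

```python
def m_out_of_n_state(datapoints, m, n):
    """Return ALARM when at least m of the n most recent datapoints breach.

    datapoints: oldest first, True = breaching; assumes no missing data.
    """
    breaching = sum(datapoints[-n:])
    return "ALARM" if breaching >= m else "OK"

def evaluation_interval_seconds(period_seconds, n):
    """The evaluation interval is the period multiplied by N."""
    return period_seconds * n
```

For the examples in the text: 4 out of 5 datapoints with a 1-minute period gives a 5-minute evaluation interval, and 3 out of 3 with a 10-minute period gives 30 minutes.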

**Note**  
If data points are missing soon after you create an alarm, and the metric was being reported to CloudWatch before you created the alarm, CloudWatch retrieves the most recent data points from before the alarm was created when evaluating the alarm.

## High-resolution alarms
<a name="high-resolution-alarms"></a>

If you set an alarm on a high-resolution metric, you can specify a high-resolution alarm with a period of 10 seconds, 20 seconds, or 30 seconds. There is a higher charge for high-resolution alarms. For more information about high-resolution metrics, see [Publish custom metrics](publishingMetrics.md).

## Example of evaluating a multi-day alarm
<a name="evaluate-multiday-alarm"></a>

An alarm is a multi-day alarm if the number of evaluation periods multiplied by the length of each evaluation period exceeds one day. Multi-day alarms are evaluated once per hour. When multi-day alarms are evaluated, CloudWatch takes into account only the metrics up to the current hour at the :00 minute when evaluating.

For example, consider an alarm that monitors a job that runs every 3 days at 10:00.

1. At 10:02, the job fails

1. At 10:03, the alarm evaluates and stays in `OK` state, because the evaluation considers data only up to 10:00.

1. At 11:03, the alarm considers data up to 11:00 and goes into `ALARM` state.

1. At 11:43, you correct the error and the job now runs successfully.

1. At 12:03, the alarm evaluates again, sees the successful job, and returns to `OK` state.

# Configuring how CloudWatch alarms treat missing data
<a name="alarms-and-missing-data"></a>

Sometimes, not every expected data point for a metric gets reported to CloudWatch. For example, this can happen when a connection is lost, a server goes down, or when a metric reports data only intermittently by design.

CloudWatch enables you to specify how to treat missing data points when evaluating an alarm. This helps you to configure your alarm so that it goes to `ALARM` state only when appropriate for the type of data being monitored. You can avoid false positives when missing data doesn't indicate a problem.

Similar to how each alarm is always in one of three states, each specific data point reported to CloudWatch falls under one of three categories:
+ Not breaching (within the threshold)
+ Breaching (violating the threshold)
+ Missing

For each alarm, you can specify CloudWatch to treat missing data points as any of the following:
+ `notBreaching` – Missing data points are treated as "good" and within the threshold
+ `breaching` – Missing data points are treated as "bad" and breaching the threshold
+ `ignore` – The current alarm state is maintained
+ `missing` – If all data points in the alarm evaluation range are missing, the alarm transitions to `INSUFFICIENT_DATA`.

The best choice depends on the type of metric and the purpose of the alarm. For example, if you are creating an application rollback alarm using a metric that continually reports data, you might want to treat missing data points as breaching, because it might indicate that something is wrong. But for a metric that generates data points only when an error occurs, such as `ThrottledRequests` in Amazon DynamoDB, you would want to treat missing data as `notBreaching`. The default behavior is `missing`.
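In boto3 terms, this choice is the `TreatMissingData` parameter of `put_metric_alarm`. A sketch for the two cases above follows; the alarm names are hypothetical.

```python
# Metric that continually reports data: silence is suspicious, so treat
# missing datapoints as breaching (hypothetical alarm name).
rollback_alarm = {
    "AlarmName": "app-rollback",
    "TreatMissingData": "breaching",
}

# Metric that reports data only when errors occur (such as DynamoDB
# ThrottledRequests): no data is good news.
throttle_alarm = {
    "AlarmName": "dynamodb-throttled-requests",
    "TreatMissingData": "notBreaching",
}

VALID_TREATMENTS = {"notBreaching", "breaching", "ignore", "missing"}
DEFAULT_TREATMENT = "missing"
```

These dicts would be merged into the full `put_metric_alarm` parameters for each alarm.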

**Important**  
Alarms configured on Amazon EC2 metrics can temporarily enter the `INSUFFICIENT_DATA` state if there are missing metric data points. This is rare, but can happen when the metric reporting is interrupted, even when the Amazon EC2 instance is healthy. For alarms on Amazon EC2 metrics that are configured to take stop, terminate, reboot, or recover actions, we recommend that you configure those alarms to treat missing data as `missing`, and to have these alarms trigger only when in the `ALARM` state.

Choosing the best option for your alarm prevents unnecessary and misleading alarm condition changes, and also more accurately indicates the health of your system.

**Important**  
Alarms that evaluate metrics in the `AWS/DynamoDB` namespace default to ignore missing data. You can override this if you choose a different option for how the alarm should treat missing data. When an `AWS/DynamoDB` metric has missing data, alarms that evaluate that metric remain in their current state.

## How alarm state is evaluated when data is missing
<a name="alarms-evaluating-missing-data"></a>

Whenever an alarm evaluates whether to change state, CloudWatch attempts to retrieve a higher number of data points than the number specified as **Evaluation Periods**. The exact number of data points it attempts to retrieve depends on the length of the alarm period and whether it is based on a metric with standard resolution or high resolution. The time frame of the data points that it attempts to retrieve is the *evaluation range*.

Once CloudWatch retrieves these data points, the following happens:
+ If no data points in the evaluation range are missing, CloudWatch evaluates the alarm based on the most recent data points collected. The number of data points evaluated is equal to the **Evaluation Periods** for the alarm. The extra data points from farther back in the evaluation range are not needed and are ignored.
+ If some data points in the evaluation range are missing, but the total number of existing data points that were successfully retrieved from the evaluation range is equal to or more than the alarm's **Evaluation Periods**, CloudWatch evaluates the alarm state based on the most recent real data points that were successfully retrieved, including the necessary extra data points from farther back in the evaluation range. In this case, the value you set for how to treat missing data is not needed and is ignored.
+ If some data points in the evaluation range are missing, and the number of actual data points that were retrieved is lower than the alarm's number of **Evaluation Periods**, CloudWatch fills in the missing data points with the result you specified for how to treat missing data, and then evaluates the alarm. However, all real data points in the evaluation range are included in the evaluation. CloudWatch uses missing data points only as few times as possible. 
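The three rules above can be sketched in a few lines. This is a simplified model: the `ignore` treatment (which retains the current state) and the premature-transition special case described later in this section are not modeled.

```python
def evaluate_with_missing(datapoints, n, m, treatment):
    """Sketch of alarm evaluation over an evaluation range with missing data.

    datapoints: evaluation range, oldest first; True = breaching,
    False = not breaching, None = missing. n = Evaluation Periods,
    m = Datapoints to Alarm. treatment: "breaching", "notBreaching",
    or "missing".
    """
    real = [d for d in datapoints if d is not None]
    if len(real) >= n:
        # Enough real datapoints: evaluate the most recent n of them and
        # ignore the missing-data setting.
        evaluated = real[-n:]
    elif treatment == "missing":
        if not real:
            return "INSUFFICIENT_DATA"
        evaluated = real
    else:
        # Fill only as many datapoints as needed to reach n; all real
        # datapoints in the evaluation range are included.
        fill = [treatment == "breaching"] * (n - len(real))
        evaluated = real + fill
    return "ALARM" if sum(evaluated) >= m else "OK"
```

Applied to the first table below (with `True`/`False`/`None` standing for X/0/-, most recent last), `0 - - - -` stays `OK` even when missing data is treated as breaching, because the one real non-breaching datapoint is included, while `- - - - -` gives `INSUFFICIENT_DATA`, `ALARM`, or `OK` depending on the treatment.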

**Note**  
A particular case of this behavior is that CloudWatch alarms might repeatedly re-evaluate the last set of data points for a period of time after the metric has stopped flowing. This re-evaluation might cause the alarm to change state and re-execute actions, if it had changed state immediately prior to the metric stream stopping. To mitigate this behavior, use shorter periods.

The following tables illustrate examples of the alarm evaluation behavior. In the first table, **Datapoints to Alarm** and **Evaluation Periods** are both 3. CloudWatch retrieves the 5 most recent data points when evaluating the alarm, in case some of the most recent 3 data points are missing; the evaluation range for the alarm is 5.

Column 1 shows the 5 most recent data points, because the evaluation range is 5. These data points are shown with the most recent data point on the right. 0 is a non-breaching data point, X is a breaching data point, and - is a missing data point.

Column 2 shows how many of the 3 necessary data points are missing. Even though the most recent 5 data points are evaluated, only 3 (the setting for **Evaluation Periods**) are necessary to evaluate the alarm state. The number of data points in Column 2 is the number of data points that must be "filled in", using the setting for how missing data is being treated. 

In columns 3-6, the column headers are the possible values for how to treat missing data. The rows in these columns show the alarm state that is set for each of these possible ways to treat missing data.


| Data points | Number of data points that must be filled | MISSING | IGNORE | BREACHING | NOT BREACHING | 
| --- | --- | --- | --- | --- | --- | 
|  0 - X - X  |  0  |  `OK`  |  `OK`  |  `OK`  |  `OK`  | 
|  0 - - - -  |  2  |  `OK`  |  `OK`  |  `OK`  |  `OK`  | 
|  - - - - -  |  3  |  `INSUFFICIENT_DATA`  |  Retain current state  |  `ALARM`  |  `OK`  | 
|  0 X X - X  |  0  |  `ALARM`  |  `ALARM`  |  `ALARM`  |  `ALARM`  | 
|  - - X - -   |  2  |  `ALARM`  |  Retain current state  |  `ALARM`  |  `OK`  | 

In the second row of the preceding table, the alarm stays `OK` even if missing data is treated as breaching, because the one existing data point is not breaching, and this is evaluated along with two missing data points which are treated as breaching. The next time this alarm is evaluated, if the data is still missing it will go to `ALARM`, as that non-breaching data point will no longer be in the evaluation range.

The third row, where all five of the most recent data points are missing, illustrates how the various settings for how to treat missing data affect the alarm state. If missing data points are considered breaching, the alarm goes into ALARM state, while if they are considered not breaching, then the alarm goes into OK state. If missing data points are ignored, the alarm retains the current state it had before the missing data points. And if missing data points are just considered as missing, then the alarm does not have enough recent real data to make an evaluation, and goes into `INSUFFICIENT_DATA`.

In the fourth row, the alarm goes to `ALARM` state in all cases because the three most recent data points are breaching, and the alarm's **Evaluation Periods** and **Datapoints to Alarm** are both set to 3. In this case, the missing data point is ignored and the setting for how to evaluate missing data is not needed, because there are 3 real data points to evaluate.

Row 5 represents a special case of alarm evaluation called *premature alarm state*. For more information, see [Avoiding premature transitions to alarm state](#CloudWatch-alarms-avoiding-premature-transition).

In the next table, the **Period** is 5 minutes, and **Datapoints to Alarm** is only 2 while **Evaluation Periods** is 3. This is an "M out of N" alarm with M = 2 and N = 3.

The evaluation range is 5. This is the maximum number of recent data points that are retrieved and can be used in case some data points are missing.


| Data points | Number of missing data points | MISSING | IGNORE | BREACHING | NOT BREACHING | 
| --- | --- | --- | --- | --- | --- | 
|  0 - X - X  |  0  |  `ALARM`  |  `ALARM`  |  `ALARM`  |  `ALARM`  | 
|  0 0 X 0 X  |  0  |  `ALARM`  |  `ALARM`  |  `ALARM`  |  `ALARM`  | 
|  0 - X - -  |  1  |  `OK`  |  `OK`  |  `ALARM`  |  `OK`  | 
|  - - - - 0  |  2  |  `OK`  |  `OK`  |  `ALARM`  |  `OK`  | 
|  - - - X -  |  2  |  `ALARM`  |  Retain current state  |  `ALARM`  |  `OK`  | 

In rows 1 and 2, the alarm always goes to ALARM state because 2 of the 3 most recent data points are breaching. In row 2, the two oldest data points in the evaluation range are not needed because none of the 3 most recent data points are missing, so these two older data points are ignored.

In rows 3 and 4, the alarm goes to ALARM state only if missing data is treated as breaching, in which case the two most recent missing data points are both treated as breaching. In row 4, these two missing data points that are treated as breaching provide the two necessary breaching data points to trigger the ALARM state.

Row 5 represents a special case of alarm evaluation called *premature alarm state*. For more information, see the following section.

### Avoiding premature transitions to alarm state
<a name="CloudWatch-alarms-avoiding-premature-transition"></a>

CloudWatch alarm evaluation includes logic to try to avoid false alarms, where the alarm goes into ALARM state prematurely when data is intermittent. The examples shown in row 5 of the tables in the previous section illustrate this logic. In those rows, and in the following examples, **Evaluation Periods** is 3 and the evaluation range is 5 data points. **Datapoints to Alarm** is 3, except for the M out of N example, where **Datapoints to Alarm** is 2.

Suppose an alarm's most recent data is `- - - - X`, with four missing data points and then a breaching data point as the most recent data point. Because the next data point may be non-breaching, the alarm does not go immediately into ALARM state when the data is either `- - - - X` or `- - - X -` and **Datapoints to Alarm** is 3. This way, false positives are avoided when the next data point is non-breaching and causes the data to be `- - - X O` or `- - X - O`.

However, if the last few data points are `- - X - -`, the alarm goes into ALARM state even if missing data points are treated as missing. This is because alarms are designed to always go into ALARM state when the oldest available breaching datapoint during the **Evaluation Periods** number of data points is at least as old as the value of **Datapoints to Alarm**, and all other more recent data points are breaching or missing. In this case, the alarm goes into ALARM state even if the total number of datapoints available is lower than M (**Datapoints to Alarm**).

This alarm logic applies to M out of N alarms as well. If the oldest breaching data point during the evaluation range is at least as old as the value of **Datapoints to Alarm**, and all of the more recent data points are either breaching or missing, the alarm goes into ALARM state no matter the value of M (**Datapoints to Alarm**).
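This special rule can be sketched as a single check. As before, `True` = breaching, `False` = not breaching, `None` = missing, oldest first; this is a minimal model of the rule, not the full evaluation.

```python
def forced_alarm(datapoints, m, n):
    """Check whether the rule above forces ALARM: a breaching datapoint at
    least m periods old within the last n datapoints, with every more
    recent datapoint breaching or missing.
    """
    recent = datapoints[-n:]
    for age, point in enumerate(reversed(recent), start=1):  # age 1 = newest
        if point is False:
            return False  # a newer non-breaching datapoint blocks the rule
        if point is True and age >= m:
            return True   # old-enough breaching datapoint found
    return False
```

With **Datapoints to Alarm** of 3, `- - X - -` forces ALARM while `- - - - X` does not; with the 2-out-of-3 alarm from the second table, `- - - X -` forces ALARM.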

## Missing Data in CloudWatch Metrics Insights alarms
<a name="mi-missing-data-treatment"></a>

**Alarms based on Metrics Insights queries that aggregate to a single time series**

The missing data scenarios and their effects on alarm evaluation are the same as for a standard metric alarm, in terms of the configured missing data treatment. See [Configuring how CloudWatch alarms treat missing data](#alarms-and-missing-data).

**Alarms based on Metrics Insights queries that produce multiple time series**

Missing data scenarios for Metrics Insights alarms occur when: 
+  Individual datapoints within a time series are not present. 
+  One or more time series disappear when evaluating upon multiple time series. 
+  No time series are retrieved by the query. 

Missing data scenarios affect alarm evaluation in the following manner:
+ When a time series is evaluated, the missing data treatment is applied to individual datapoints within the time series. For example, if 3 datapoints were queried for the time series but only 1 was received, the 2 missing datapoints follow the configured missing data treatment.
+ If a time series is no longer retrieved by the query, it transitions to `OK` regardless of the missing data treatment. Alarm actions associated with the `OK` transition at the contributor level are executed, and the `StateReason` field indicates that the contributor was not found with the message "No data was returned for this contributor". The state of the alarm depends on the state of the other contributors that the query retrieved.
+ At the alarm level, if the query returns an empty result (no time series at all), the missing data treatment is applied. For example, if the missing data treatment is set to `BREACHING`, the alarm transitions to `ALARM`.

# How partial data is handled
<a name="cloudwatch-metrics-insights-alarms-partial-data"></a>

## How partial data from a Metrics Insights query is evaluated
<a name="cloudwatch-metrics-insights-query-evaluation"></a>

If the Metrics Insights query used for the alarm matches more than 10,000 metrics, the alarm is evaluated based on the first 10,000 metrics that the query finds. This means that the alarm is being evaluated on partial data.

You can use the following methods to find whether a Metrics Insights alarm is currently evaluating its alarm state based on partial data: 
+ In the console, if you choose an alarm to see the **Details** page, the message **Evaluation warning: Not evaluating all data** appears on that page.
+ You see the value `PARTIAL_DATA` in the `EvaluationState` field when you use the [describe-alarms](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/cloudwatch/describe-alarms.html) AWS CLI command or the [DescribeAlarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_DescribeAlarms.html) API.

Alarms also publish events to Amazon EventBridge when they go into the partial data state, so you can create an EventBridge rule to watch for these events. In these events, the `evaluationState` field has the value `PARTIAL_DATA`. The following is an example.

```
{
    "version": "0",
    "id": "12345678-3bf9-6a09-dc46-12345EXAMPLE",
    "detail-type": "CloudWatch Alarm State Change",
    "source": "aws.cloudwatch",
    "account": "123456789012",
    "time": "2022-11-08T11:26:05Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:my-alarm-name"
    ],
    "detail": {
        "alarmName": "my-alarm-name",
        "state": {
            "value": "ALARM",
            "reason": "Threshold Crossed: 3 out of the last 3 datapoints [20000.0 (08/11/22 11:25:00), 20000.0 (08/11/22 11:24:00), 20000.0 (08/11/22 11:23:00)] were greater than the threshold (0.0) (minimum 1 datapoint for OK -> ALARM transition).",
            "reasonData": "{\"version\":\"1.0\",\"queryDate\":\"2022-11-08T11:26:05.399+0000\",\"startDate\":\"2022-11-08T11:23:00.000+0000\",\"period\":60,\"recentDatapoints\":[20000.0,20000.0,20000.0],\"threshold\":0.0,\"evaluatedDatapoints\":[{\"timestamp\":\"2022-11-08T11:25:00.000+0000\",\"value\":20000.0}]}",
            "timestamp": "2022-11-08T11:26:05.401+0000",
            "evaluationState": "PARTIAL_DATA"
        },
        "previousState": {
            "value": "INSUFFICIENT_DATA",
            "reason": "Unchecked: Initial alarm creation",
            "timestamp": "2022-11-08T11:25:51.227+0000"
        },
        "configuration": {
            "metrics": [
                {
                    "id": "m2",
                    "expression": "SELECT SUM(PartialDataTestMetric) FROM partial_data_test",
                    "returnData": true,
                    "period": 60
                }
            ]
        }
    }
}
```
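To watch for these events, you can use an EventBridge rule whose event pattern matches the `PARTIAL_DATA` evaluation state. The following sketch shows such a pattern together with a simplified, illustrative matcher (EventBridge's real matching engine supports many more operators than this):

```python
# EventBridge event pattern that matches CloudWatch alarm state-change
# events whose evaluationState is PARTIAL_DATA. The pattern fields
# follow the event shape shown in the example above.
PARTIAL_DATA_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"evaluationState": ["PARTIAL_DATA"]}},
}

def matches(pattern, event):
    """Simplified EventBridge matching: every leaf in the pattern must
    list a value equal to the corresponding event field."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True

event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"state": {"value": "ALARM", "evaluationState": "PARTIAL_DATA"}},
}
print(matches(PARTIAL_DATA_PATTERN, event))  # True
```

You would supply the pattern dictionary (as JSON) when creating the EventBridge rule; the `matches` helper here is only for illustration.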

If the query for the alarm includes a GROUP BY statement that initially returns more than 500 time series, the alarm is evaluated based on the first 500 time series that the query finds. However, if you use an ORDER BY clause, then all the time series that the query finds are sorted, and the 500 that have the highest or lowest values according to your ORDER BY clause are used to evaluate the alarm. 

## How partial data from a multi data source alarm is evaluated
<a name="multi-data-source-partial-data"></a>

If the Lambda function returns partial data:
+ The alarm continues to be evaluated on the data points that are returned.
+ You can use the following methods to find whether an alarm on a Lambda function is currently evaluating its alarm state based on partial data:
  + In the console, choose an alarm and view its **Details** page. If the message **Evaluation warning: Not evaluating all data** appears on that page, the alarm is evaluating on partial data.
  + If you see the value `PARTIAL_DATA` in the `EvaluationState` field when you use the `describe-alarms` AWS CLI command or the DescribeAlarms API, it is evaluating on partial data.
+ An alarm also publishes events to Amazon EventBridge when it goes into the partial data state.

# Percentile-based alarms and low data samples
<a name="percentiles-with-low-samples"></a>

When you set a percentile as the statistic for an alarm, you can specify what to do when there is not enough data for a good statistical assessment. You can choose to have the alarm evaluate the statistic anyway and possibly change the alarm state. Or, you can have the alarm ignore the metric while the sample size is low, and wait to evaluate it until there is enough data to be statistically significant.

For percentiles between 0.5 (inclusive) and 1.00 (exclusive), this setting is used when there are fewer than 10/(1-percentile) data points during the evaluation period. For example, this setting would be used if there were fewer than 1000 samples for an alarm on a p99 percentile. For percentiles between 0 and 0.5 (exclusive), the setting is used when there are fewer than 10/percentile data points.
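The sample-count rule above can be sketched in a few lines (the helper name is hypothetical; results are rounded to the nearest integer):

```python
def min_samples_for_percentile(p):
    """Minimum number of data points needed during the evaluation
    period before a percentile statistic is considered statistically
    significant, per the rule described above: 10/(1-p) for
    0.5 <= p < 1.0, and 10/p for 0 < p < 0.5."""
    if 0.5 <= p < 1.0:
        return round(10 / (1 - p))
    if 0 < p < 0.5:
        return round(10 / p)
    raise ValueError("percentile must be in (0, 1), exclusive of 0.5 boundary rules")

print(min_samples_for_percentile(0.99))  # 1000 samples needed for a p99 alarm
print(min_samples_for_percentile(0.10))  # 100 samples needed for a p10 alarm
```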

# PromQL alarms
<a name="alarm-promql"></a>

A PromQL alarm monitors metrics using a Prometheus Query Language (PromQL) instant query. The query selects metrics ingested through the CloudWatch OTLP endpoint, and all matching time series returned by the query are considered to be breaching. The alarm evaluates the query at a regular interval and tracks each breaching time series independently as a *contributor*.

For information about ingesting metrics using OpenTelemetry, see [OpenTelemetry](CloudWatch-OpenTelemetry-Sections.md).

## How PromQL alarms work
<a name="promql-alarm-how-it-works"></a>

A PromQL alarm evaluates a PromQL instant query on a recurring schedule defined by the `EvaluationInterval`. The query returns only the time series that satisfy the condition. Each returned time series is a *contributor*, identified by its unique set of attributes.

The alarm uses duration-based state transitions:
+ When a contributor is returned by the query, it is considered *breaching*. If the contributor continues breaching for the duration specified by `PendingPeriod`, the contributor transitions to `ALARM` state.
+ When a contributor stops being returned by the query, it is considered *recovering*. If the contributor remains absent for the duration specified by `RecoveryPeriod`, the contributor transitions back to `OK` state.

The alarm is in `ALARM` state when at least one contributor has been breaching for longer than the pending period. The alarm returns to `OK` state when all contributors have recovered.
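The contributor state machine described above can be sketched as follows. This is a simplified illustration, not the CloudWatch implementation; timestamps and the `pending_period`/`recovery_period` parameters are in seconds:

```python
class Contributor:
    """Tracks one PromQL time series through the duration-based state
    transitions described above (illustrative sketch only)."""

    def __init__(self, pending_period, recovery_period):
        self.pending = pending_period
        self.recovery = recovery_period
        self.state = "OK"
        self.breaching_since = None  # time of first evaluation that returned the series
        self.absent_since = None     # time of first evaluation that did not

    def evaluate(self, now, returned):
        if returned:  # series returned by the query: it is breaching
            self.absent_since = None
            if self.breaching_since is None:
                self.breaching_since = now
            if now - self.breaching_since >= self.pending:
                self.state = "ALARM"
        else:  # series absent: it is recovering
            self.breaching_since = None
            if self.absent_since is None:
                self.absent_since = now
            if now - self.absent_since >= self.recovery:
                self.state = "OK"
        return self.state

c = Contributor(pending_period=120, recovery_period=120)
for t in (0, 60, 120):      # breaching for 120 s -> ALARM
    c.evaluate(t, returned=True)
print(c.state)              # ALARM
for t in (180, 240, 300):   # absent for 120 s -> back to OK
    c.evaluate(t, returned=False)
print(c.state)              # OK
```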

## PromQL alarm configuration
<a name="promql-alarm-configuration"></a>

A PromQL alarm is configured with the following parameters:
+ **PendingPeriod** is the duration in seconds that a contributor must continuously breach before the contributor transitions to `ALARM` state. This is equivalent to the Prometheus alert rule's `for` duration.
+ **RecoveryPeriod** is the duration in seconds that a contributor must stop breaching before the contributor transitions back to `OK` state. This is equivalent to the Prometheus alert rule's `keep_firing_for` duration.
+ **EvaluationInterval** is how frequently, in seconds, the alarm evaluates the PromQL query.

To create a PromQL alarm, see [Create an alarm using a PromQL query](Create_PromQL_Alarm.md).

# Composite alarms
<a name="alarm-combining"></a>

With CloudWatch, you can combine several alarms into one *composite alarm* to create a summarized, aggregated health indicator over a whole application or group of resources. Composite alarms are alarms that determine their state by monitoring the states of other alarms. You define rules to combine the status of those monitored alarms using Boolean logic.

You can use composite alarms to reduce alarm noise by taking actions only at an aggregated level. For example, you can create a composite alarm to send a notification to your web server team if any alarm related to your web server triggers. When any of those alarms goes into the ALARM state, the composite alarm itself goes into the ALARM state and sends a notification to your team. If other alarms related to your web server also go into the ALARM state, your team does not get overloaded with new notifications, because the composite alarm has already notified them about the existing situation.

You can also use composite alarms to create complex alarming conditions and take actions only when many different conditions are met. For example, you can create a composite alarm that combines a CPU alarm and a memory alarm, and notifies your team only if both the CPU and memory alarms have triggered.

**Using composite alarms**

When you use composite alarms, you have two options:
+ Configure the actions you want to take only at the composite alarm level, and create the underlying monitored alarms without actions.
+ Configure a different set of actions at the composite alarm level. For example, the composite alarm actions could engage a different team in case of a widespread issue.

Composite alarms can take only the following actions:
+ Notify Amazon SNS topics
+ Invoke Lambda functions
+ Create OpsItems in Systems Manager Ops Center
+ Create incidents in Systems Manager Incident Manager

**Note**  
All the underlying alarms in your composite alarm must be in the same account and the same Region as your composite alarm. However, if you set up a composite alarm in a CloudWatch cross-account observability monitoring account, the underlying alarms can watch metrics in different source accounts and in the monitoring account itself. For more information, see [CloudWatch cross-account observability](CloudWatch-Unified-Cross-Account.md).  
 A single composite alarm can monitor up to 100 underlying alarms, and a single underlying alarm can be monitored by up to 150 composite alarms.

**Rule expressions**

All composite alarms contain rule expressions. Rule expressions tell composite alarms which other alarms to monitor and determine their states from. Rule expressions can refer to metric alarms and composite alarms. When you reference an alarm in a rule expression, you apply a function to the alarm that tests whether it is in one of the following three states:
+ ALARM

  ALARM ("alarm-name or alarm-ARN") is TRUE if the alarm is in ALARM state.
+ OK

  OK ("alarm-name or alarm-ARN") is TRUE if the alarm is in OK state.
+ INSUFFICIENT\_DATA

  INSUFFICIENT\_DATA ("alarm-name or alarm-ARN") is TRUE if the named alarm is in INSUFFICIENT\_DATA state.

**Note**  
TRUE always evaluates to TRUE, and FALSE always evaluates to FALSE.

**Alarm references**

When referencing an alarm, using either the alarm name or ARN, the rule syntax can support referencing the alarm with or without quotation marks (") around the alarm name or ARN.
+ If specified without quotes, alarm names or ARNs must not contain spaces, round brackets, or commas.
+ If specified within quotes, alarm names or ARNs that *include* double quotes (") must escape each " using a backslash (\") for correct interpretation of the reference.

**Syntax**

The syntax of the expression you use to combine several alarms into one composite alarm uses boolean logic and functions. The following table describes the operators and functions available in rule expressions:


| Operator/Function | Description | 
| --- | --- | 
| AND | Logical AND operator. Returns TRUE when all specified conditions are TRUE. | 
| OR | Logical OR operator. Returns TRUE when at least one of the specified conditions is TRUE. | 
| NOT | Logical NOT operator. Returns TRUE when the specified condition is FALSE. | 
| AT\_LEAST | Function that returns TRUE when a minimum number or percentage of specified alarms are in the required state. Format: AT\_LEAST(M, STATE\_CONDITION, (alarm1, alarm2, ...alarmN)) where M can be an absolute number or percentage (for example, 50%), and STATE\_CONDITION can be ALARM, OK, INSUFFICIENT\_DATA, NOT ALARM, NOT OK, or NOT INSUFFICIENT\_DATA. | 

You can use parentheses to group conditions and control the order of evaluation in complex expressions.
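The semantics of the `AT_LEAST` function from the table above can be sketched in a few lines of Python (a hypothetical helper for illustration, not a CloudWatch API):

```python
def at_least(m, state_condition, states):
    """Evaluate AT_LEAST(M, STATE_CONDITION, (...)) over a list of
    current alarm states. M may be an int or a percentage string such
    as '50%'; STATE_CONDITION may be 'ALARM', 'OK',
    'INSUFFICIENT_DATA', or any of those prefixed with 'NOT '."""
    if isinstance(m, str) and m.endswith("%"):
        required = len(states) * float(m[:-1]) / 100.0
    else:
        required = float(m)
    negate = state_condition.startswith("NOT ")
    target = state_condition[4:] if negate else state_condition
    hits = sum((s != target) if negate else (s == target) for s in states)
    return hits >= required

states = ["ALARM", "ALARM", "OK", "OK"]
print(at_least(2, "ALARM", states))   # True: 2 of 4 alarms are in ALARM
print(at_least("75%", "OK", states))  # False: only 50% are in OK
```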

**Example expressions**

The request parameter `AlarmRule` supports the use of the logical operators `AND`, `OR`, and `NOT`, as well as the `AT_LEAST` function, so you can combine multiple functions into a single expression. The following example expressions show how you can configure the underlying alarms in your composite alarm: 
+ `ALARM(CPUUtilizationTooHigh) AND ALARM(DiskReadOpsTooHigh)`

  The expression specifies that the composite alarm goes into `ALARM` only if `CPUUtilizationTooHigh` and `DiskReadOpsTooHigh` are in `ALARM`.
+ `AT_LEAST(2, ALARM, (WebServer1CPU, WebServer2CPU, WebServer3CPU, WebServer4CPU))`

  The expression specifies that the composite alarm goes into `ALARM` when at least 2 out of the 4 web server CPU alarms are in `ALARM` state. This allows you to trigger alerts based on a threshold of affected resources rather than requiring all or just one to be in alarm state.
+ `AT_LEAST(50%, OK, (DatabaseConnection1, DatabaseConnection2, DatabaseConnection3, DatabaseConnection4))`

  The expression specifies that the composite alarm goes into `ALARM` when at least 50% of the database connection alarms are in `OK` state. Using percentages allows the rule to adapt dynamically as you add or remove monitored alarms.
+ `ALARM(CPUUtilizationTooHigh) AND NOT ALARM(DeploymentInProgress)`

  The expression specifies that the composite alarm goes into `ALARM` if `CPUUtilizationTooHigh` is in `ALARM` and `DeploymentInProgress` is not in `ALARM`. This is an example of a composite alarm that reduces alarm noise during a deployment window.
+ `AT_LEAST(2, ALARM, (AZ1Health, AZ2Health, AZ3Health)) AND NOT ALARM(MaintenanceWindow)`

  The expression specifies that the composite alarm goes into `ALARM` when at least 2 out of 3 availability zone health alarms are in `ALARM` state and the maintenance window alarm is not in `ALARM`. This combines the AT\_LEAST function with other logical operators for more complex monitoring scenarios.

# Alarm suppression
<a name="alarm-suppression"></a>

Composite alarm action suppression allows you to temporarily disable alarm actions without deleting or modifying the alarm configuration. This is useful during planned maintenance, deployments, or when investigating known issues.

With composite alarm action suppression, you define alarms as suppressor alarms. Suppressor alarms prevent composite alarms from taking actions. For example, you can specify a suppressor alarm that represents the status of a supporting resource. If the supporting resource is down, the suppressor alarm prevents the composite alarm from sending notifications.

## When to use alarm suppression
<a name="alarm-suppression-use-cases"></a>

Common situations where alarm suppression is useful:
+ Maintenance windows of your application
+ Application deployments
+ Ongoing incident investigation
+ Testing and development activities

## How suppressor alarms work
<a name="alarm-suppression-how-it-works"></a>

You specify suppressor alarms when you configure composite alarms. Any alarm can function as a suppressor alarm. When a suppressor alarm changes states from `OK` to `ALARM`, its composite alarm stops taking actions. When a suppressor alarm changes states from `ALARM` to `OK`, its composite alarm resumes taking actions.

Because composite alarms allow you to get an aggregated view of your health across multiple alarms, there are common situations where those alarms are expected to trigger, for example, during a maintenance window of your application or when you investigate an ongoing incident. In such situations, you may want to suppress the actions of your composite alarms to prevent unwanted notifications or the creation of new incident tickets.

 Composite alarm action suppression helps you reduce alarm noise, so you spend less time managing your alarms and more time focusing on your operations. 

### `WaitPeriod` and `ExtensionPeriod`
<a name="Create_Composite_Alarm_Suppression_Wait_Extension"></a>

 When you specify a suppressor alarm, you set the parameters `WaitPeriod` and `ExtensionPeriod`. These parameters prevent composite alarms from taking actions unexpectedly while suppressor alarms change states. Use `WaitPeriod` to compensate for any delays that can occur when a suppressor alarm changes from `OK` to `ALARM`. For example, if a suppressor alarm changes from `OK` to `ALARM` within 60 seconds, set `WaitPeriod` to 60 seconds. 

![\[Actions suppression within WaitPeriod\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/example1border.png)


 In the image, the composite alarm changes from `OK` to `ALARM` at t2. A `WaitPeriod` starts at t2 and ends at t8. This gives the suppressor alarm time to change states from `OK` to `ALARM` at t4 before it suppresses the composite alarm's actions when the `WaitPeriod` expires at t8. 

 Use `ExtensionPeriod` to compensate for any delays that can occur when a composite alarm changes to `OK` following a suppressor alarm changing to `OK`. For example, if a composite alarm changes to `OK` within 60 seconds of a suppressor alarm changing to `OK`, set `ExtensionPeriod` to 60 seconds. 

![\[Actions suppression within ExtensionPeriod\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/example2border.png)


 In the image, the suppressor alarm changes from `ALARM` to `OK` at t2. An `ExtensionPeriod` starts at t2 and ends at t8. This gives the composite alarm time to change from `ALARM` to `OK` before the `ExtensionPeriod` expires at t8. 

 Composite alarms don't take actions while a `WaitPeriod` or `ExtensionPeriod` is active. When the `WaitPeriod` or `ExtensionPeriod` becomes inactive, composite alarms take actions based on their current states. We recommend that you set the value for each parameter to 60 seconds, because CloudWatch evaluates metric alarms every minute. You can set the parameters to any integer number of seconds. 
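A rough approximation of when actions are held can be written as a small decision function. This is an illustrative sketch only; the real evaluation also handles the timer resets and discards shown in the examples that follow:

```python
def actions_suppressed(now, suppressor_in_alarm,
                       composite_changed_at, wait_period,
                       suppressor_recovered_at, extension_period):
    """Simplified view of the rules above: actions are held while the
    suppressor alarm is in ALARM, during the WaitPeriod after the
    composite alarm changes state, and during the ExtensionPeriod
    after the suppressor alarm returns to OK."""
    if suppressor_in_alarm:
        return True
    if now - composite_changed_at < wait_period:
        return True
    if suppressor_recovered_at is not None and \
       now - suppressor_recovered_at < extension_period:
        return True
    return False

# Example 1 below, with WaitPeriod = 2 time units: the composite alarm
# changes at t2, and the suppressor alarm stays OK.
print(actions_suppressed(3, False, 2, 2, None, 3))  # True: still in WaitPeriod
print(actions_suppressed(4, False, 2, 2, None, 3))  # False: WaitPeriod expired, actions run
```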

 The following examples describe in more detail how `WaitPeriod` and `ExtensionPeriod` prevent composite alarms from taking actions unexpectedly. 

**Note**  
 In the following examples, `WaitPeriod` is configured as 2 time units, and `ExtensionPeriod` is configured as 3 time units. 

#### Examples
<a name="example_scenarios"></a>

**Example 1: Actions are not suppressed after `WaitPeriod`**

![\[first example of action suppression\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/example3border.png)


 In the image, the composite alarm changes states from `OK` to `ALARM` at t2. A `WaitPeriod` starts at t2 and ends at t4, so it can prevent the composite alarm from taking actions. After the `WaitPeriod` expires at t4, the composite alarm takes its actions because the suppressor alarm is still in `OK`. 

**Example 2: Actions are suppressed by alarm before `WaitPeriod` expires**

![\[second example of action suppression\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/example4border.png)


 In the image, the composite alarm changes states from `OK` to `ALARM` at t2. A `WaitPeriod` starts at t2 and ends at t4. This gives the suppressor alarm time to change states from `OK` to `ALARM` at t3. Because the suppressor alarm changes states from `OK` to `ALARM` at t3, the `WaitPeriod` that started at t2 is discarded, and the suppressor alarm now stops the composite alarm from taking actions. 

**Example 3: State transition when actions are suppressed by `WaitPeriod`**

![\[third example of action suppression\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/example5border.png)


 In the image, the composite alarm changes states from `OK` to `ALARM` at t2. A `WaitPeriod` starts at t2 and ends at t4. This gives the suppressor alarm time to change states. The composite alarm changes back to `OK` at t3, so the `WaitPeriod` that started at t2 is discarded. A new `WaitPeriod` starts at t3 and ends at t5. After the new `WaitPeriod` expires at t5, the composite alarm takes its actions. 

**Example 4: State transition when actions are suppressed by alarm**

![\[fourth example of action suppression\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/cwasexamplefourborder.png)


 In the image, the composite alarm changes states from `OK` to `ALARM` at t2. The suppressor alarm is already in `ALARM`. The suppressor alarm stops the composite alarm from taking actions. 

**Example 5: Actions are not suppressed after `ExtensionPeriod`**

![\[fifth example of action suppression\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/example7border.png)


 In the image, the composite alarm changes states from `OK` to `ALARM` at t2. A `WaitPeriod` starts at t2 and ends at t4. This gives the suppressor alarm time to change states from `OK` to `ALARM` at t3 before it suppresses the composite alarm's actions until t6. Because the suppressor alarm changes states from `OK` to `ALARM` at t3, the `WaitPeriod` that started at t2 is discarded. At t6, the suppressor alarm changes to `OK`. An `ExtensionPeriod` starts at t6 and ends at t9. After the `ExtensionPeriod` expires, the composite alarm takes its actions. 

**Example 6: State transition when actions are suppressed by `ExtensionPeriod`**

![\[sixth example of action suppression\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/cwasexamplesixrborder.png)


 In the image, the composite alarm changes states from `OK` to `ALARM` at t2. A `WaitPeriod` starts at t2 and ends at t4. This gives the suppressor alarm time to change states from `OK` to `ALARM` at t3 before it suppresses the composite alarm's actions until t6. Because the suppressor alarm changes states from `OK` to `ALARM` at t3, the `WaitPeriod` that started at t2 is discarded. At t6, the suppressor alarm changes back to `OK`. An `ExtensionPeriod` starts at t6 and ends at t9. When the composite alarm changes back to `OK` at t7, the `ExtensionPeriod` is discarded, and a new `WaitPeriod` starts at t7 and ends at t9. 

**Tip**  
 If you replace the action suppressor alarm, any active `WaitPeriod` or `ExtensionPeriod` is discarded. 

## Action Suppression and Mute Rules
<a name="action-suppression-and-mute-rules"></a>

 When both action suppression and alarm mute rules are active for a composite alarm, mute rules take precedence and suppress all alarm actions. After the mute window ends, the composite alarm's action suppression configuration determines whether actions are executed based on the suppressor alarm state and configured wait or extension periods. For more information about alarm mute rules, see [Alarm Mute Rules](alarm-mute-rules.md). 

# Alarm actions
<a name="alarm-actions"></a>

You can specify what actions an alarm takes when it changes state between the OK, ALARM, and INSUFFICIENT\_DATA states.

Most actions can be set for the transition into each of the three states. Except for Auto Scaling actions, the actions happen only on state transitions, and are not performed again if the condition persists for hours or days.

The following are supported as alarm actions:
+ Notify one or more subscribers by using an Amazon Simple Notification Service topic. Subscribers can be applications as well as persons.
+ Invoke a Lambda function. This is the easiest way for you to automate custom actions on alarm state changes.
+ Alarms based on EC2 metrics can also perform EC2 actions, such as stopping, terminating, rebooting, or recovering an EC2 instance.
+ Alarms can perform actions to scale an Auto Scaling group.
+ Alarms can create OpsItems in Systems Manager Ops Center or create incidents in AWS Systems Manager Incident Manager. These actions are performed only when the alarm goes into ALARM state.
+ An alarm can start an investigation when it goes into ALARM state.

Alarms also emit events to Amazon EventBridge when they change state, and you can set up Amazon EventBridge to trigger other actions for these state changes.

## Alarm actions and notifications
<a name="alarm-actions-notifications"></a>

The following table shows the supported alarm actions and how each behaves for alarms that track multiple time series (contributors):


| Action Type | Metrics Insights Multiple Time Series Alarm support | PromQL Alarm support | More Information | 
| --- | --- | --- | --- | 
| SNS notifications | Contributor Level | Contributor Level | [Amazon SNS event destinations](https://docs.aws.amazon.com/sns/latest/dg/sns-event-destinations.html) | 
| EC2 actions (stop, terminate, reboot, recover) | Not supported | Not supported | [Stop, terminate, reboot, or recover an EC2 instance](UsingAlarmActions.md) | 
| Auto Scaling actions | Not supported | Not supported | [Step and simple scaling policies for Amazon EC2 Auto Scaling](https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-scaling-simple-step.html) | 
| Systems Manager OpsItem creation | Alarm Level | Not supported | [Configure CloudWatch alarms to create OpsItems](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter-create-OpsItems-from-CloudWatch-Alarms.html) | 
| Systems Manager Incident Manager incidents | Alarm Level | Not supported | [Creating incidents automatically with CloudWatch alarms](https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-creation.html#incident-tracking-auto-alarms) | 
| Lambda function invocation | Contributor Level | Contributor Level | [Invoke a Lambda function from an alarm](alarms-and-actions-Lambda.md) | 
| CloudWatch investigations | Alarm Level | Not supported | [Start a CloudWatch investigation from an alarm](Start-Investigation-Alarm.md) | 

The content of alarm notifications differs depending on the alarm type:
+ Single-metric alarms include both a state reason and detailed state reason data, showing the specific datapoints that caused the state change.
+ Multi-time series Metrics Insights alarms provide a simplified state reason for each contributor, without the detailed state reason data block.
+ PromQL alarms do not include a state reason or state reason data in their notifications.

**Notification content examples**  
Single-metric alarm notification includes detailed data:  

```
{
  "stateReason": "Threshold Crossed: 3 out of the last 3 datapoints [32.6 (03/07/25 08:29:00), 33.8 (03/07/25 08:24:00), 41.0 (03/07/25 08:19:00)] were greater than the threshold (31.0)...",
  "stateReasonData": {
    "version": "1.0",
    "queryDate": "2025-07-03T08:34:06.300+0000",
    "startDate": "2025-07-03T08:19:00.000+0000",
    "statistic": "Average",
    "period": 300,
    "recentDatapoints": [41, 33.8, 32.6],
    "threshold": 31,
    "evaluatedDatapoints": [
      {
        "timestamp": "2025-07-03T08:29:00.000+0000",
        "sampleCount": 5,
        "value": 32.6
      }
      // Additional datapoints...
    ]
  }
}
```
Multiple time series Metrics Insights Alarm SNS notification for Contributor example:  

```
{
  "AlarmName": "DynamoDBInsightsAlarm",
  "NewStateValue": "ALARM",
  "NewStateReason": "Threshold Crossed: 1 datapoint was less than the threshold (1.0). The most recent datapoint which crossed the threshold: [0.0 (01/12/25 13:34:00)].",
  "StateChangeTime": "2025-12-01T13:42:04.919+0000",
  "OldStateValue": "OK",
  "AlarmContributorId": "6d442278dba546f6",
  "AlarmContributorAttributes": {
    "TableName": "example-dynamodb-table-name"
  }
  // Additional information...
}
```
PromQL Alarm SNS notification for Contributor example:  

```
{
  "AlarmName": "HighCPUUsageAlarm",
  "NewStateValue": "ALARM",
  "StateChangeTime": "2025-12-01T13:42:04.919+0000",
  "OldStateValue": "OK",
  "AlarmContributorId": "1d502278dcd546a1",
  "AlarmContributorAttributes": {
    "team": "example-team-name"
  }
  // Additional information...
}
```

## Muting Alarm Actions
<a name="mute-alarm-actions"></a>

 Alarm mute rules allow you to automatically mute alarm actions during predefined time windows, such as maintenance periods or operational events. CloudWatch continues monitoring alarm states while preventing unwanted notifications. For more information, see [Alarm Mute Rules](alarm-mute-rules.md). 

**Mute rules vs. disabling alarm actions**  
 Alarm mute rules temporarily mute actions during scheduled time windows and automatically unmute when the window ends. In contrast, the `DisableAlarmActions` API permanently disables alarm actions until you manually call `EnableAlarmActions`. The `EnableAlarmActions` API does not unmute alarms that are muted by active mute rules. 

**Note**  
 Muting an alarm does not stop CloudWatch from sending alarm events for alarm create, update, delete, and state changes to Amazon EventBridge. 

# Alarm Mute Rules
<a name="alarm-mute-rules"></a>

 Alarm Mute Rules is a CloudWatch feature that provides a mechanism to automatically mute alarm actions during predefined time windows. When you create a mute rule, you define specific time periods and target alarms whose actions will be muted. CloudWatch continues monitoring and evaluating alarm states while preventing unwanted notifications or automated alarm actions during expected operational events. 

 Alarm Mute Rules help you manage critical operational scenarios where alarm actions would be unnecessary or disruptive. For example, during planned maintenance windows, you can prevent automated alarm actions while your systems are intentionally offline or experiencing expected issues, allowing you to perform maintenance without interruptions. For operations during non-business hours such as weekends or holidays, you can mute non-critical alarm actions when immediate response is not required, reducing alarm noise and unnecessary notifications to your operations team. In testing environments, mute rules allow you to temporarily mute alarm actions during scenarios such as load testing where high resource usage or error rates are expected and don't require immediate attention. When your team is actively troubleshooting issues, mute rules allow you to prevent duplicate alarm actions from being triggered, helping you focus on resolution without being distracted by redundant alarm notifications. 

## Defining alarm mute rules
<a name="defining-alarm-mute-rules"></a>

 Alarm mute rules are defined using two components: **rules** and **targets**. 
+  **Rules** - define the time windows when alarm actions should be muted. Rules are composed of three attributes: 
  +  **Expression** – Defines when the mute period begins and how it repeats. You can use two types of expressions: 
    +  **Cron expressions** – Use standard cron syntax to create recurring mute windows. This approach is ideal for regular maintenance schedules, such as weekly system updates or daily backup operations. Cron expressions allow you to specify complex recurring patterns, including specific days of the week, months, or times. 

       *Syntax for cron expression* 

      ```
      ┌───────────── minute (0 - 59)
      │ ┌───────────── hour (0 - 23)
      │ │ ┌───────────── day of the month (1 - 31)
      │ │ │ ┌───────────── month (1 - 12) (or JAN-DEC)
      │ │ │ │ ┌───────────── day of the week (0 - 6) (0 or 7 is Sunday, or MON-SUN)
      │ │ │ │ │
      │ │ │ │ │
      * * * * *
      ```
      +  The characters `*`, `,`, and `-` are supported in all fields. 
      +  English names can be used for the `month` (JAN-DEC) and `day of week` (SUN-SAT) fields. 
    +  **At expressions** – Use at expressions for one-time mute windows. This approach works well for planned operational events that occur once at a known time. 

       *Syntax for at expression* 

      ```
      at(yyyy-MM-ddThh:mm)
      ```
  +  **Duration** – Specifies how long the mute rule lasts once activated. Duration must be specified in ISO-8601 format with a minimum of 1 minute (PT1M) and maximum of 15 days (P15D). 
+  **Timezone** – Specifies the timezone in which the mute window will be applied according to the expressions, using standard timezone identifiers such as "America/Los\_Angeles" or "Europe/London". 
+  **Targets** - specify the list of alarm names whose actions will be muted during the defined time windows. You can include both metric alarms and composite alarms in your target list. 
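As an illustration of the cron-expression matching described above, here is a minimal matcher supporting `*`, `,`, and `-`. It is a sketch only; it omits the JAN-DEC and SUN-SAT names and other cron details:

```python
def field_matches(field, value):
    """Match one cron field against a numeric value, supporting the
    '*', ',' and '-' characters described above."""
    if field == "*":
        return True
    for part in field.split(","):
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-"))
            if lo <= value <= hi:
                return True
        elif int(part) == value:
            return True
    return False

def cron_matches(expr, minute, hour, dom, month, dow):
    """True if the given time components satisfy a five-field cron
    expression (minute, hour, day of month, month, day of week)."""
    fields = expr.split()
    return all(field_matches(f, v) for f, v in
               zip(fields, (minute, hour, dom, month, dow)))

# "0 2 * * 1-5" means 02:00 on weekdays (Monday=1 through Friday=5)
print(cron_matches("0 2 * * 1-5", 0, 2, 14, 6, 3))  # True: a Wednesday
print(cron_matches("0 2 * * 1-5", 0, 2, 14, 6, 0))  # False: a Sunday
```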

 You can optionally include start and end timestamps to provide additional boundaries for your mute windows. Start timestamps ensure that mute rules don't activate before a specific date and time, while end timestamps prevent rules from being applied beyond the specified date and time. 

 For more information about creating alarm mute rules programmatically, see [PutAlarmMuteRule](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutAlarmMuteRule.html). 

**Note**  
 The targeted alarms must be present in the same AWS account and same AWS Region in which the mute rule is created. 
 A single alarm mute rule can target up to 100 alarms by alarm names. 

 The CloudWatch console includes a dedicated **Alarm Mute Rules** tab that provides centralized management of all the mute rules in your AWS account. You can search for specific mute rules using mute rule attributes such as the rule name. 

## Mute Rule Status
<a name="mute-rule-status"></a>

 Once created, an alarm mute rule can be in one of the following three statuses: 
+  **SCHEDULED** – The mute rule will become active at some time in the future according to the configured time window expression. 
+  **ACTIVE** – The mute rule is currently active as per the configured time window expression and actively muting targeted alarm actions. 
+  **EXPIRED** – The mute rule will not become SCHEDULED or ACTIVE again. This occurs for one-time mute rules after the mute window has ended, or for recurring mute rules when an end timestamp is configured and that time has passed. 
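
For a one-time rule, the status follows directly from the clock, the `at(…)` start time, and the duration. The helper below is a hypothetical sketch of that logic, not an AWS API:

```python
from datetime import datetime, timedelta

def one_time_rule_status(start, duration, now):
    """Status of a one-time mute rule defined by at(start) plus an ISO-8601 duration."""
    end = start + duration
    if now < start:
        return "SCHEDULED"
    if now < end:
        return "ACTIVE"
    return "EXPIRED"
```

For instance, a rule defined by `at(2024-05-10T14:00)` with duration `PT4H` is SCHEDULED before 2:00 PM, ACTIVE until 6:00 PM, and EXPIRED afterward.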

## Effects of mute rules on alarms
<a name="effects-of-mute-rules"></a>

 During an active mute window, when a targeted alarm changes state and has actions configured, CloudWatch mutes those actions. Mutes apply only to alarm actions, meaning that alarms continue to be evaluated and state changes are visible in the CloudWatch console, but configured actions such as Amazon Simple Notification Service notifications, Amazon EC2 Auto Scaling actions, or Amazon EC2 actions are prevented from executing. CloudWatch continues to evaluate alarm states normally throughout the mute period, and you can view this information through alarm history. 

 When a mute window ends, if a targeted alarm remains in the same state (OK, ALARM, or INSUFFICIENT_DATA) in which its actions were muted, CloudWatch automatically re-triggers the alarm actions that were muted during the window. This ensures that your alarm actions are executed for ongoing issues once the planned mute period ends, maintaining the integrity of your monitoring system. 

**Note**  
 When you mute an alarm:   
 All the actions associated with the targeted alarms are muted 
 Actions associated with all alarm states (OK, ALARM, and INSUFFICIENT_DATA) are muted 

## Viewing and managing muted alarms
<a name="viewing-managing-muted-alarms-link"></a>

For information about viewing and managing muted alarms, see [Viewing and managing muted alarms](viewing-managing-muted-alarms.md).

# How alarm mute rules work
<a name="alarm-mute-rules-behaviour"></a>

The following scenarios illustrate how alarm mute rules affect the targeted alarms and how the alarm actions are muted or executed.

**Note**  
 Muting an alarm will mute actions associated with all alarm states, including OK, ALARM, and INSUFFICIENT_DATA. The behaviours illustrated below apply to actions associated with all alarm states. 
 When you mute a Metrics Insights alarm, all contributor metric series for that alarm are automatically muted as well. 

## Scenario: Alarm actions are muted when a mute rule is active
<a name="scenario-actions-muted-during-active-rule"></a>

Consider the following:
+ An Alarm has actions configured for its ALARM state
+ An alarm mute rule that targets the Alarm is scheduled to be active from t1 to t5

![\[alt text not found\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/alarm_mute_rules_scenario-1.png)

+ At **t0** - Alarm is in OK state, mute rule status is SCHEDULED
+ At **t1** - Mute rule status becomes ACTIVE
+ At **t2** - Alarm transitions to ALARM state, action is muted as the alarm is effectively muted by the mute rule.
+ At **t4** - Alarm returns to OK state while mute rule is still active
+ At **t5** - Mute rule becomes inactive, but no ALARM action is executed because alarm is now in OK state

## Scenario: Alarm action muted when mute rule is active and re-triggered after mute window
<a name="scenario-action-retriggered-after-mute"></a>

Consider the following:
+ An Alarm has actions configured for its ALARM state
+ An alarm mute rule that targets the Alarm is scheduled to be active from t1 to t5

![\[alt text not found\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/alarm_mute_rules_scenario-2.png)

+ At **t0** - Alarm is in OK state, mute rule status is SCHEDULED
+ At **t1** - Mute rule status becomes ACTIVE
+ At **t2** - Alarm transitions to ALARM state, action is muted as the alarm is effectively muted by the mute rule.
+ At **t4** - Alarm is still in ALARM state while the mute rule remains active
+ At **t5** - Mute window ends, and the alarm action is executed because the alarm remains in the same state (ALARM) in which it was originally muted
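
These timelines can be reproduced with a small simulation. This is only an illustration of the documented muting and re-trigger behaviour, not CloudWatch's implementation; times are abstract ticks, and a window is treated as active from its start up to, but not including, its end:

```python
def simulate(mute_window, transitions, horizon):
    """Return the ticks at which the configured ALARM action executes.

    mute_window: (start, end) ticks; the rule is active for start <= t < end.
    transitions: {tick: new_state} for the alarm.
    """
    start, end = mute_window
    state = "OK"
    muted_state = None            # state in which an action was suppressed, if any
    executed = []
    for t in range(horizon + 1):
        if t in transitions:
            state = transitions[t]
            if state == "ALARM":
                if start <= t < end:
                    muted_state = "ALARM"    # action suppressed by the active rule
                else:
                    executed.append(t)
        if t == end and muted_state is not None and state == muted_state:
            executed.append(t)               # re-trigger when the window ends
    return executed
```

With a window from t1 to t5, a transition to ALARM at t2 followed by OK at t4 executes nothing, while remaining in ALARM causes the muted action to be re-triggered when the window ends.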

## Scenario: Multiple overlapping alarm mute rules
<a name="scenario-multiple-overlapping-rules"></a>

Consider the following:
+ An Alarm has actions configured for its ALARM state

Consider that there are two mute rules:
+ Alarm Mute Rule 1 - mutes Alarm from t1 to t5
+ Alarm Mute Rule 2 - mutes Alarm from t3 to t8

![\[alt text not found\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/alarm_mute_rules_scenario-3.png)

+ At **t0** - Alarm is in OK state, both mute rules are SCHEDULED
+ At **t1** - First mute rule becomes ACTIVE
+ At **t2** - Alarm transitions to ALARM state, action is muted
+ At **t3** - Second mute rule becomes ACTIVE
+ At **t5** - First mute rule becomes inactive, but alarm action remains muted because second mute rule is still active
+ At **t8** - Alarm action is executed because the second mute window has ended and the alarm remains in the same state (ALARM) in which it was originally muted
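
With multiple rules, the behaviour above amounts to this: actions stay muted while any targeting rule's window is active, and re-triggering occurs at the first tick when none is. A hypothetical sketch:

```python
def muted(windows, t):
    """Actions are muted while any targeting rule's window is active."""
    return any(start <= t < end for start, end in windows)

def retrigger_time(windows, alarm_tick, horizon):
    """First tick at or after alarm_tick with no active window, when the
    muted action would be re-triggered (None if still muted at the horizon)."""
    for t in range(alarm_tick, horizon + 1):
        if not muted(windows, t):
            return t
    return None
```

For example, with windows covering t1 to t5 and t3 to t8, an alarm that fires at t2 stays muted through t7, and its action is re-triggered at t8.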

## Scenario: Muted alarm actions are executed when mute rule update ends the mute window
<a name="scenario-rule-update-ends-mute"></a>

Consider the following:
+ An Alarm has actions configured for its ALARM state
+ An alarm mute rule that targets the Alarm is scheduled to be active from t1 to t8

![\[alt text not found\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/alarm_mute_rules_scenario-4.png)

+ At **t0** - Alarm is in OK state, mute rule is SCHEDULED
+ At **t1** - Mute rule becomes ACTIVE
+ At **t2** - Alarm transitions to ALARM state, actions are muted
+ At **t6** - The mute rule configuration is updated such that the time window ends at t6. Alarm actions are immediately executed at t6 because the mute rule is no longer active.

**Note**  
The same behaviour applies in the following cases:  
If the mute rule is deleted at t6, the alarm is unmuted immediately.  
If the alarm is removed from the mute rule's targets at t6, the alarm is unmuted immediately.

## Scenario: New actions are executed if alarm actions are updated during mute window
<a name="scenario-actions-updated-during-mute"></a>

Consider the following:
+ An Alarm has actions configured for its ALARM state
+ An alarm mute rule that targets the Alarm is scheduled to be active from t1 to t8

![\[alt text not found\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/alarm_mute_rules_scenario-5.png)

+ At **t0** - Alarm is in OK state, mute rule is SCHEDULED. An SNS action is configured with the alarm state ALARM.
+ At **t1** - Mute rule becomes ACTIVE
+ At **t2** - Alarm transitions to ALARM state, the configured SNS action is muted
+ At **t6** - Alarm configuration is updated to remove SNS action and add Lambda action
+ At **t8** - The Lambda action configured on the Alarm is executed because the mute window has ended and the alarm remains in the same state (ALARM) in which it was originally muted

**Note**  
If all the alarm actions are removed during the mute window (at t6 in the above example), then no actions are executed at the end of the mute window (at t8 above)

## Example schedules for common use cases
<a name="common-use-cases"></a>

 The following examples show how to configure time window expressions for common use cases. 

 **Scenario 1: Muting alarm actions during scheduled maintenance windows** – Regular maintenance activities that occur on a predictable schedule, such as system or database updates when services are intentionally unavailable or operating in degraded mode. 
+  Cron expression `0 2 * * SUN` with duration `PT4H` - Mutes alarms every Sunday from 2:00 AM to 6:00 AM for weekly system maintenance. 
+  Cron expression `0 1 1 * *` with duration `PT6H` - Mutes alarms on the first day of each month from 1:00 AM to 7:00 AM for monthly database maintenance. 

 **Scenario 2: Muting non-critical alarms during non-business hours** – Reducing alert fatigue during weekends or holidays when immediate attention is not required. 
+  Cron expression `0 18 * * FRI` with duration `P2DT12H` - Mutes alarms every weekend from Friday 6:00 PM to Monday 6:00 AM. 

 **Scenario 3: Muting performance alarms during daily backup operations** – Daily automated backup processes that temporarily increase resource utilization and may trigger performance-related alarms during predictable time windows. 
+  Cron expression `0 23 * * *` with duration `PT2H` - Mutes alarms every day from 11:00 PM to 1:00 AM during nightly backup operations that temporarily increase disk I/O and CPU utilization. 

 **Scenario 4: Muting duplicate alarms during active troubleshooting sessions** – Temporary muting of alarm actions while teams are actively investigating and resolving issues, preventing notification noise and allowing focused problem resolution. 
+  At expression `at(2024-05-10T14:00)` with duration `PT4H` - Mutes alarms on May 10, 2024 from 2:00 PM to 6:00 PM during an active incident response session. 

 **Scenario 5: Muting alarm actions during planned company shutdowns** – One-time extended maintenance periods or company-wide shutdowns where all systems are intentionally offline for extended periods. 
+  At expression `at(2024-12-23T00:00)` with duration `P7D` - Mutes alarms for the entire week of December 23-29, 2024 during annual company shutdown. 
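
To sanity-check a one-time schedule like the last example, you can compute the concrete window it produces. The `at_window` helper below is hypothetical, written only to verify the dates:

```python
from datetime import datetime, timedelta

def at_window(at_expr, duration):
    """Concrete window for a one-time rule: at(yyyy-MM-ddThh:mm) plus a duration."""
    start = datetime.strptime(at_expr[3:-1], "%Y-%m-%dT%H:%M")
    return start, start + duration

# Scenario 5: at(2024-12-23T00:00) with duration P7D
start, end = at_window("at(2024-12-23T00:00)", timedelta(days=7))
# The window runs from 2024-12-23 00:00 to 2024-12-30 00:00, covering
# December 23-29 in full.
```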

# Limits
<a name="alarm-limits"></a>

## General CloudWatch quotas
<a name="general-cloudwatch-quotas"></a>

For information about general CloudWatch service quotas that apply to alarms, see [CloudWatch service quotas](cloudwatch_limits.md).

## Limits that apply to alarms based on metric math expressions
<a name="metric-math-alarm-limits"></a>

Alarms based on metric math expressions can reference a maximum of 10 metrics. This is a hard limit that cannot be increased. If you need to monitor more than 10 metrics in a single alarm, consider one of the following approaches:
+ If the metrics are in the same namespace, use a Metrics Insights query in your alarm instead of a metric math expression. Metrics Insights can aggregate across many metrics with a single query.
+ Pre-aggregate metrics into custom metrics using a Lambda function, then reference the aggregated metrics in your alarm expression.
+ Split your logic across multiple alarms and combine them using a composite alarm.
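
For example, instead of a metric math expression that can reference at most 10 `CPUUtilization` metrics by instance, a single Metrics Insights query can aggregate across every matching instance. A sketch following the documented Metrics Insights SQL syntax:

```sql
-- One query, one alarm, regardless of how many instances it matches
SELECT AVG(CPUUtilization) FROM SCHEMA("AWS/EC2", InstanceId)
```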

## Limits that apply to alarms based on Metrics Insights queries
<a name="metrics-insights-alarm-limits"></a>

When working with CloudWatch Metrics Insights alarms, be aware of these functional limits:
+ A default quota of 200 Metrics Insights alarms per account per Region
+ Only the latest 3 hours of data can be used for evaluating the alarm's conditions. However, you can visualize up to two weeks of data on the alarm's detail page graph
+ Alarms evaluating multiple time series will limit the number of contributors in ALARM to 100
  + Assuming the query retrieves 150 time series:
    +  If there are fewer than 100 contributors in ALARM (for example 95), the `StateReason` will be "95 out of 150 time series evaluated to ALARM" 
    +  If there are more than 100 contributors in ALARM (for example 105), the `StateReason` will be "100+ out of 150 time series evaluated to ALARM" 
  + Furthermore, if the volume of attributes is too large, the number of contributors in ALARM can be limited to less than 100.
+ Metrics Insights limits on the maximum number of time series analyzed or returned apply
+ During alarm evaluation, the `EvaluationState` will be set to `PARTIAL_DATA` in the following cases: 
  +  If the Metrics Insights query returns more than 500 time series. 
  +  If the Metrics Insights query matches more than 10,000 metrics. 

For more information on CloudWatch service quotas and limits, see [CloudWatch Metrics Insights service quotas](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/cloudwatch-metrics-insights-limits.html).

## Limits that apply to alarms based on PromQL queries
<a name="promql-limits"></a>

When working with CloudWatch PromQL alarms, be aware of these functional limits:
+ Alarms evaluating multiple time series will limit the number of contributors in ALARM to 100
  +  If there are fewer than 100 contributors in ALARM (for example 95), the `StateReason` will be "95 time series evaluated to ALARM" 
  +  If there are more than 100 contributors in ALARM (for example 105), the `StateReason` will be "100+ time series evaluated to ALARM" 
  + Furthermore, if the volume of attributes is too large, the number of contributors in ALARM can be limited to less than 100.
+ PromQL query limits on the maximum number of time series analyzed or returned apply
+ During alarm evaluation, the `EvaluationState` will be set to `PARTIAL_DATA` if the PromQL query returns more than 500 time series. 

## Limits that apply to alarms based on connected data sources
<a name="MultiSource_Alarm_Details"></a>
+ When CloudWatch evaluates an alarm, it does so every minute, even if the period for the alarm is longer than one minute. For the alarm to work, the Lambda function must be able to return a list of timestamps starting on any minute, not only on multiples of the period length. These timestamps must be spaced one period length apart.

  Therefore, if the data source queried by the Lambda can only return timestamps that are multiples of the period length, the function should "re-sample" the fetched data to match the timestamps expected by the `GetMetricData` request.

  For example, an alarm with a five-minute period is evaluated every minute using five-minute windows that shift by one minute each time. In this case:
  + For the alarm evaluation at 12:15:00, CloudWatch expects data points with timestamps of `12:00:00`, `12:05:00`, and `12:10:00`. 
  + Then for the alarm evaluation at 12:16:00, CloudWatch expects data points with timestamps of `12:01:00`, `12:06:00`, and `12:11:00`. 
+ When CloudWatch evaluates an alarm, any data points returned by the Lambda function that don't align with the expected timestamps are dropped, and the alarm is evaluated using the remaining expected data points. For example, when the alarm is evaluated at `12:15:00` it expects data with timestamps of `12:00:00`, `12:05:00`, and `12:10:00`. If it receives data with timestamps of `12:00:00`, `12:05:00`, `12:06:00`, and `12:10:00`, the data from `12:06:00` is dropped and CloudWatch evaluates the alarm using the other timestamps.

  Then for the next evaluation at `12:16:00`, it expects data with timestamps of `12:01:00`, `12:06:00`, and `12:11:00`. If it only has the data with timestamps of `12:00:00`, `12:05:00`, and `12:10:00`, all of these data points are ignored at 12:16:00 and the alarm transitions into the state according to how you specified the alarm to treat missing data. For more information, see [Alarm evaluation](alarm-evaluation.md).
+ We recommend that you create these alarms to take actions when they transition to the `INSUFFICIENT_DATA` state, because several Lambda function failure use cases will transition the alarm to `INSUFFICIENT_DATA` regardless of the way that you set the alarm to treat missing data. 
+ If the Lambda function returns an error:
  + If there is a permission problem with calling the Lambda function, the alarm begins having missing data transitions according to how you specified the alarm to treat missing data when you created it.
  + Any other error coming from the Lambda function causes the alarm to transition to `INSUFFICIENT_DATA`.
+ If the metric requested by the Lambda function has some delay so that the last data point is always missing, you should use a workaround. You can create an M out of N alarm or increase the evaluation period of the alarm. For more information about M out of N alarms, see [Alarm evaluation](alarm-evaluation.md).
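
One way for the Lambda function to implement the "re-sample" step described above is to map each timestamp that CloudWatch expects onto the most recent period-aligned data point at or before it. This is a hypothetical strategy sketch, not a prescribed implementation:

```python
from datetime import datetime

def resample_to_expected(fetched, expected):
    """Carry each expected timestamp's value forward from the most recent
    fetched data point at or before it; timestamps with no earlier data
    are omitted (and then handled by the alarm's missing-data setting)."""
    out = {}
    for ts in expected:
        earlier = [t for t in fetched if t <= ts]
        if earlier:
            out[ts] = fetched[max(earlier)]
    return out
```

For example, data fetched at 12:00, 12:05, and 12:10 can be re-sampled onto the expected timestamps 12:01, 12:06, and 12:11 for the evaluation at 12:16.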

# Getting started
<a name="alarm-getting-started"></a>

**Topics**
+ [Create alarms](Create-Alarms.md)
+ [Acting on alarm changes](Acting_Alarm_Changes.md)
+ [Configure alarm mute rules](alarm-mute-rules-configure.md)

# Create alarms
<a name="Create-Alarms"></a>

**Topics**
+ [Create a CloudWatch alarm based on a static threshold](ConsoleAlarms.md)
+ [Create a CloudWatch alarm based on a metric math expression](Create-alarm-on-metric-math-expression.md)
+ [Create a CloudWatch alarm based on anomaly detection](Create_Anomaly_Detection_Alarm.md)
+ [Create an alarm based on a Multi Time Series Metrics Insights query](multi-time-series-alarm.md)
+ [Create an alarm based on a connected data source](Create_MultiSource_Alarm.md)
+ [Create an alarm using a PromQL query](Create_PromQL_Alarm.md)
+ [Alarming on logs](Alarm-On-Logs.md)
+ [Create a composite alarm](Create_Composite_Alarm.md)

# Create a CloudWatch alarm based on a static threshold
<a name="ConsoleAlarms"></a>

You choose a CloudWatch metric for the alarm to watch, and the threshold for that metric. The alarm goes to `ALARM` state when the metric breaches the threshold for a specified number of evaluation periods.

If you are creating an alarm in an account set up as a monitoring account in CloudWatch cross-account observability, you can set up the alarm to watch a metric in a source account linked to this monitoring account. For more information, see [CloudWatch cross-account observability](CloudWatch-Unified-Cross-Account.md).

**To create an alarm based on a single metric**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, **All alarms**.

1. Choose **Create alarm**.

1. Choose **Select metric**.

1. Do one of the following:
   + Choose the service namespace that contains the metric that you want. Continue choosing options as they appear to narrow the choices. When a list of metrics appears, select the check box next to the metric that you want.
   + In the search box, enter the name of a metric, account ID, account label, dimension, or resource ID. Then, choose one of the results and continue until a list of metrics appears. Select the check box next to the metric that you want. 

1. Choose the **Graphed metrics** tab.

   1. Under **Statistic**, choose one of the statistics or predefined percentiles, or specify a custom percentile (for example, **p95.45**).

   1. Under **Period**, choose the evaluation period for the alarm. When evaluating the alarm, each period is aggregated into one data point.

      You can also choose whether the y-axis legend appears on the left or right while you're creating the alarm. This preference is used only while you're creating the alarm.

   1. Choose **Select metric**.

      The **Specify metric and conditions** page appears, showing a graph and other information about the metric and statistic that you selected.

1. Under **Conditions**, specify the following:

   1. For **Whenever *metric* is**, specify whether the metric must be greater than, less than, or equal to the threshold. Under **than...**, specify the threshold value.

   1. Choose **Additional configuration**. For **Datapoints to alarm**, specify how many evaluation periods (data points) must be in the `ALARM` state to trigger the alarm. If the two values here match, you create an alarm that goes to `ALARM` state if that many consecutive periods are breaching.

      To create an M out of N alarm, specify a lower number for the first value than you specify for the second value. For more information, see [Alarm evaluation](alarm-evaluation.md).

   1. For **Missing data treatment**, choose how to have the alarm behave when some data points are missing. For more information, see [Configuring how CloudWatch alarms treat missing data](alarms-and-missing-data.md).

   1. If the alarm uses a percentile as the monitored statistic, a **Percentiles with low samples** box appears. Use it to choose whether to evaluate or ignore cases with low sample rates. If you choose **ignore (maintain alarm state)**, the current alarm state is always maintained when the sample size is too low. For more information, see [Percentile-based alarms and low data samples](percentiles-with-low-samples.md). 

1. Choose **Next**.

1. Under **Notification**, select an SNS topic to notify when the alarm is in `ALARM` state, `OK` state, or `INSUFFICIENT_DATA` state.

   To have the alarm send multiple notifications for the same alarm state or for different alarm states, choose **Add notification**.

   In CloudWatch cross-account observability, you can choose to have notifications sent to multiple AWS accounts. For example, to both the monitoring account and the source account.

   To have the alarm not send notifications, choose **Remove**.

1. To have the alarm perform Auto Scaling, EC2, Lambda, investigation, or Systems Manager actions, choose the appropriate button and choose the alarm state and action to perform. Alarms can perform Systems Manager actions and investigation actions only when they go into ALARM state. For more information about Systems Manager actions, see [ Configuring CloudWatch to create OpsItems from alarms](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter-create-OpsItems-from-CloudWatch-Alarms.html) and [ Incident creation](https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-creation.html).

   To have the alarm start an investigation, choose **Add investigation action** and then select your investigation group. For more information, see [CloudWatch investigations](Investigations.md).
**Note**  
To create an alarm that performs an SSM Incident Manager action, you must have certain permissions. For more information, see [ Identity-based policy examples for AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/security_iam_id-based-policy-examples.html).

1. When finished, choose **Next**.

1. Enter a name and description for the alarm. The name must contain only UTF-8 characters, and can't contain ASCII control characters. The description can include markdown formatting, which is displayed only in the alarm **Details** tab in the CloudWatch console. The markdown can be useful to add links to runbooks or other internal resources. Then choose **Next**.

1. Under **Preview and create**, confirm that the information and conditions are what you want, then choose **Create alarm**.

You can also add alarms to a dashboard. For more information, see [Adding an alarm to a CloudWatch dashboard](add_alarm_dashboard.md). 
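
The console walkthrough above maps onto the `PutMetricAlarm` API. As a sketch, the parameters below mirror those steps for a static-threshold alarm; the parameter names follow the PutMetricAlarm API, while the alarm name, instance ID, and SNS topic ARN are placeholders:

```python
# Sketch of PutMetricAlarm parameters equivalent to the console steps above,
# e.g. for boto3: boto3.client("cloudwatch").put_metric_alarm(**params).
# AlarmName, InstanceId, and the SNS topic ARN are placeholder values.
params = {
    "AlarmName": "high-cpu-example",
    "AlarmDescription": "CPU above 80% for 3 of 5 periods. See runbook.",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "InstanceId", "Value": "i-1234567890abcdef0"}],
    "Statistic": "Average",
    "Period": 300,                      # each period aggregates to one data point
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "EvaluationPeriods": 5,             # N
    "DatapointsToAlarm": 3,             # M, making this an M out of N alarm
    "TreatMissingData": "missing",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:my-topic"],
}
```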

# Create a CloudWatch alarm based on a metric math expression
<a name="Create-alarm-on-metric-math-expression"></a>

Metric alarms are designed to evaluate time series that you define from either a single metric, or a metric math expression that combines or transforms one or more metrics into a time series that delivers insights more closely aligned with your unique needs. To create an alarm based on a metric math expression, choose one or more CloudWatch metrics to use in the expression. Then, specify the expression, threshold, and evaluation periods.

You can't create an alarm based on the **SEARCH** expression. Only alarms based on Metrics Insights SQL queries can operate on multiple time series.

**To create an alarm that's based on a metric math expression**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, and then choose **All alarms**.

1. Choose **Create alarm**.

1. Choose **Select metric**, and then perform one of the following actions:
   + Select a namespace from the **AWS namespaces** dropdown or **Custom namespaces** dropdown. After you select a namespace, you continue choosing options until a list of metrics appears, where you select the checkbox next to the correct metric.
   + Use the search box to find a metric, account ID, dimension, or resource ID. After you enter the metric, dimension, or resource ID, you continue choosing options until a list of metrics appears, where you select the check box next to the correct metric.

1. (Optional) If you want to add another metric to a metric math expression, you can use the search box to find a specific metric. You can add as many as 10 metrics to a metric math expression.

1. Select the **Graphed metrics** tab. For each of the metrics that you previously added, perform the following actions:

   1. Under the **Statistic** column, select the dropdown menu. In the dropdown menu, choose one of the predefined statistics or percentiles. Use the search box in the dropdown menu to specify a custom percentile.

   1. Under the **Period** column, select the dropdown menu. In the dropdown menu, choose one of the predefined evaluation periods.

      While you're creating your alarm, you can specify whether the Y-axis legend appears on the left or right side of your graph.
**Note**  
When CloudWatch evaluates alarms, periods are aggregated into single data points.

1. Choose the **Add math** dropdown, and then select **Start with an empty expression** from the list of predefined metric math expressions.

   After you choose **Start with an empty expression**, a math expression box appears where you apply or edit math expressions.

1. In the math expression box, enter your math expression, and then choose **Apply**.

   After you choose **Apply**, an **ID** column appears next to the **Label** column.

   To use a metric or the result of another metric math expression as part of your current math expression's formula, you use the value that's shown under the **ID** column. To change the value of **ID**, you select the pen-and-paper icon next to the current value. The new value must begin with a lowercase letter and can include numbers, letters, and the underscore symbol. Changing the value of **ID** to a more significant name can make your alarm graph easier to understand.

   For information about the functions that are available for metric math, see [Metric math syntax and functions](using-metric-math.md#metric-math-syntax).

1. (Optional) Add more math expressions, using both metrics and the results of other math expressions in the formulas of the new math expressions.

1. When you have the expression to use for the alarm, clear the check boxes to the left of every other expression and every metric on the page. Only the check box next to the expression to use for the alarm should be selected. The expression that you choose for the alarm must produce a single time series and show only one line on the graph. Then choose **Select metric**.

   The **Specify metric and conditions** page appears, showing a graph and other information about the math expression that you have selected.

1. For **Whenever *expression* is**, specify whether the expression must be greater than, less than, or equal to the threshold. Under **than...**, specify the threshold value.

1. Choose **Additional configuration**. For **Datapoints to alarm**, specify how many evaluation periods (data points) must be in the `ALARM` state to trigger the alarm. If the two values here match, you create an alarm that goes to `ALARM` state if that many consecutive periods are breaching.

   To create an M out of N alarm, specify a lower number for the first value than you specify for the second value. For more information, see [Alarm evaluation](alarm-evaluation.md).

1. For **Missing data treatment**, choose how to have the alarm behave when some data points are missing. For more information, see [Configuring how CloudWatch alarms treat missing data](alarms-and-missing-data.md).

1. Choose **Next**.

1. Under **Notification**, select an SNS topic to notify when the alarm is in `ALARM` state, `OK` state, or `INSUFFICIENT_DATA` state.

   To have the alarm send multiple notifications for the same alarm state or for different alarm states, choose **Add notification**.

   To have the alarm not send notifications, choose **Remove**.

1. To have the alarm perform Auto Scaling, Amazon EC2, Lambda, or Systems Manager actions, choose the appropriate button and choose the alarm state and action to perform. If you choose a Lambda function as an alarm action, you specify the function name or ARN, and you can optionally choose a specific version of the function.

   Alarms can perform Systems Manager actions only when they go into ALARM state. For more information about Systems Manager actions, see [ Configuring CloudWatch to create OpsItems from alarms](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter-create-OpsItems-from-CloudWatch-Alarms.html) and [ Incident creation](https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-creation.html).
**Note**  
To create an alarm that performs an SSM Incident Manager action, you must have certain permissions. For more information, see [ Identity-based policy examples for AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/security_iam_id-based-policy-examples.html).

1. When finished, choose **Next**.

1. Enter a name and description for the alarm. Then choose **Next**.

   The name must contain only UTF-8 characters, and can't contain ASCII control characters. The description can include markdown formatting, which is displayed only in the alarm **Details** tab in the CloudWatch console. The markdown can be useful to add links to runbooks or other internal resources.

1. Under **Preview and create**, confirm that the information and conditions are what you want, then choose **Create alarm**.

You can also add alarms to a dashboard. For more information, see [Adding an alarm to a CloudWatch dashboard](add_alarm_dashboard.md). 
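
Programmatically, a metric math alarm passes a `Metrics` array to `PutMetricAlarm` in which exactly one entry returns data, matching the requirement above that only the expression's check box remains selected. A sketch (the structure follows the PutMetricAlarm API; the namespace and metric names are illustrative):

```python
# Two metrics (ReturnData: False) feed one expression, which is the single
# time series the alarm evaluates (ReturnData: True).
metrics = [
    {"Id": "errors", "ReturnData": False,
     "MetricStat": {"Metric": {"Namespace": "AWS/ApplicationELB",
                               "MetricName": "HTTPCode_Target_5XX_Count"},
                    "Period": 300, "Stat": "Sum"}},
    {"Id": "requests", "ReturnData": False,
     "MetricStat": {"Metric": {"Namespace": "AWS/ApplicationELB",
                               "MetricName": "RequestCount"},
                    "Period": 300, "Stat": "Sum"}},
    {"Id": "error_rate", "ReturnData": True,
     "Expression": "(errors / requests) * 100",
     "Label": "5XX error rate (%)"},
]
```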

# Create a CloudWatch alarm based on anomaly detection
<a name="Create_Anomaly_Detection_Alarm"></a>

You can create an alarm based on CloudWatch anomaly detection, which analyzes past metric data and creates a model of expected values. The expected values take into account the typical hourly, daily, and weekly patterns in the metric.

You set a value for the anomaly detection threshold, and CloudWatch uses this threshold with the model to determine the "normal" range of values for the metric. A higher value for the threshold produces a thicker band of "normal" values.

You can choose whether the alarm is triggered when the metric value is above the band of expected values, below the band, or either above or below the band.

You also can create anomaly detection alarms on single metrics and the outputs of metric math expressions. You can use these expressions to create graphs that visualize anomaly detection bands.

In an account set up as a monitoring account for CloudWatch cross-account observability, you can create anomaly detectors on metrics in source accounts in addition to metrics in the monitoring account.

For more information, see [Using CloudWatch anomaly detection](CloudWatch_Anomaly_Detection.md).

**Note**  
If you're already using anomaly detection for visualization purposes on a metric in the Metrics console and you create an anomaly detection alarm on that same metric, then the threshold that you set for the alarm doesn't change the threshold that you already set for visualization. For more information, see [Creating a graph](graph_a_metric.md#create-metric-graph).
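
The relationship between the threshold and the band width can be illustrated with a toy calculation. This is not CloudWatch's actual anomaly model, only its general shape: with an expected value m and a model spread s, the band is roughly [m - k*s, m + k*s] for threshold k, so a larger k yields a thicker band.

```python
# Toy illustration (not CloudWatch's real anomaly model): a larger threshold k
# produces a thicker band around the model's expected value.
def band(expected, spread, threshold):
    return (expected - threshold * spread, expected + threshold * spread)

narrow = band(100.0, 5.0, 2)   # (90.0, 110.0)
wide = band(100.0, 5.0, 4)     # (80.0, 120.0)
```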

**To create an alarm that's based on anomaly detection**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1.  In the navigation pane, choose **Alarms**, **All alarms**. 

1.  Choose **Create alarm**. 

1.  Choose **Select Metric**. 

1. Do one of the following:
   +  Choose the service namespace that contains your metric, and then continue choosing options as they appear to narrow down your options. When a list of metrics appears, select the check box that's next to your metric. 
   +  In the search box, enter the name of a metric, dimension, or resource ID. Select one of the results, and then continue choosing options as they appear until a list of metrics appears. Select the check box that's next to your metric. 

1.  Choose **Graphed metric**. 

   1.  (Optional) For *Statistic*, choose the dropdown, and then select one of the predefined statistics or percentiles. You can use the search box in the dropdown to specify a custom percentile, such as **p95.45**. 

   1.  (Optional) For *Period*, choose the dropdown, and then select one of the predefined evaluation periods. 
**Note**  
 When CloudWatch evaluates your alarm, it aggregates the period into a single data point. For an anomaly detection alarm, the evaluation period must be one minute or longer. 

1.  Choose **Next**. 

1.  Under ***Conditions***, specify the following: 

   1. Choose **Anomaly detection**.

       If the model for this metric and statistic already exists, CloudWatch displays a preview of the anomaly detection band in the graph at the top of the screen. After you create your alarm, it can take up to 15 minutes for the actual anomaly detection band to appear in the graph. Before that, the band that you see is an approximation of the anomaly detection band. 
**Tip**  
 To see the graph at the top of the screen in a longer time frame, choose **Edit** at the top-right of the screen. 

       If the model for this metric and statistic doesn't already exist, CloudWatch generates the anomaly detection band after you finish creating your alarm. For new models, it can take up to 3 hours for the actual anomaly detection band to appear in your graph. It can take up to two weeks for the new model to train, so the anomaly detection band shows more accurate expected values. 

   1. For **Whenever *metric* is**, specify when to trigger the alarm. For example, when the metric is greater than, lower than, or outside the band (in either direction).

   1. For **Anomaly detection threshold**, choose the number to use for the anomaly detection threshold. A higher number creates a thicker band of "normal" values that is more tolerant of metric changes. A lower number creates a thinner band that will go to `ALARM` state with smaller metric deviations. The number does not have to be a whole number.

   1. Choose **Additional configuration**. For **Datapoints to alarm**, specify how many evaluation periods (data points) must be in the `ALARM` state to trigger the alarm. If the two values here match, you create an alarm that goes to `ALARM` state if that many consecutive periods are breaching.

      To create an M out of N alarm, specify a number for the first value that is lower than the number for the second value. For more information, see [Alarm evaluation](alarm-evaluation.md).

   1. For **Missing data treatment**, choose how the alarm behaves when some data points are missing. For more information, see [Configuring how CloudWatch alarms treat missing data](alarms-and-missing-data.md).

   1. If the alarm uses a percentile as the monitored statistic, a **Percentiles with low samples** box appears. Use it to choose whether to evaluate or ignore cases with low sample rates. If you choose **Ignore (maintain alarm state)**, the current alarm state is always maintained when the sample size is too low. For more information, see [Percentile-based alarms and low data samples](percentiles-with-low-samples.md). 

1.  Choose **Next**. 

1. Under **Notification**, select an SNS topic to notify when the alarm is in `ALARM` state, `OK` state, or `INSUFFICIENT_DATA` state.

   To send multiple notifications for the same alarm state or for different alarm states, choose **Add notification**.

   Choose **Remove** if you don't want the alarm to send notifications.

1. You can set up the alarm to perform EC2 actions or invoke a Lambda function when it changes state, or to create a Systems Manager OpsItem or incident when it goes into ALARM state. To do this, choose the appropriate button and then choose the alarm state and action to perform.

   If you choose a Lambda function as an alarm action, you specify the function name or ARN, and you can optionally choose a specific version of the function.

   For more information about Systems Manager actions, see [ Configuring CloudWatch to create OpsItems from alarms](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter-create-OpsItems-from-CloudWatch-Alarms.html) and [ Incident creation](https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-creation.html).
**Note**  
To create an alarm that performs an AWS Systems Manager Incident Manager action, you must have certain permissions. For more information, see [ Identity-based policy examples for AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/security_iam_id-based-policy-examples.html).

1.  Choose **Next**. 

1.  Under ***Name and description***, enter a name and description for your alarm, and choose **Next**. The name must contain only UTF-8 characters, and can't contain ASCII control characters. The description can include markdown formatting, which is displayed only in the alarm **Details** tab in the CloudWatch console. The markdown can be useful to add links to runbooks or other internal resources. 

1.  Under ***Preview and create***, confirm that your alarm's information and conditions are correct, and choose **Create alarm**. 
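You can also create an anomaly detection alarm with the AWS CLI by passing `--threshold-metric-id` and an `ANOMALY_DETECTION_BAND` metric math expression to `put-metric-alarm`. The following is a sketch; the alarm name, instance ID, and SNS topic ARN are hypothetical:

```shell
# Alarm when CPUUtilization rises above the upper edge of a band
# two standard deviations wide, for three consecutive 5-minute periods
aws cloudwatch put-metric-alarm \
  --alarm-name CPUAnomalyAlarm \
  --comparison-operator GreaterThanUpperThreshold \
  --evaluation-periods 3 \
  --datapoints-to-alarm 3 \
  --threshold-metric-id ad1 \
  --metrics '[
    {"Id":"m1","ReturnData":true,"MetricStat":{"Metric":{"Namespace":"AWS/EC2","MetricName":"CPUUtilization","Dimensions":[{"Name":"InstanceId","Value":"i-0123456789abcdef0"}]},"Period":300,"Stat":"Average"}},
    {"Id":"ad1","Expression":"ANOMALY_DETECTION_BAND(m1, 2)"}
  ]' \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:MyTopic
```

To alarm when the metric falls below the band instead, use `LessThanLowerThreshold`; to alarm in either direction, use `LessThanLowerOrGreaterThanUpperThreshold`.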

## Editing an anomaly detection model
<a name="Modify_Anomaly_Detection_Model"></a>

After you create an alarm, you can adjust the anomaly detection model. You can exclude certain time periods from being used in the model creation. It is critical that you exclude unusual events such as system outages, deployments, and holidays from the training data. You can also specify whether to adjust the model for Daylight Saving Time changes.

**To edit the anomaly detection model for an alarm**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, **All alarms**.

1. Choose the name of the alarm. If necessary, use the search box to find the alarm.

1. Choose **View**, **In metrics**.

1. In the **Details** column, choose the **ANOMALY\_DETECTION\_BAND** keyword, and then choose **Edit anomaly detection model** in the popup.  
![\[The Graphed Metrics tab with the ANOMALY_DETECTION_BAND popup menu displayed.\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/Anomaly_Detection_Edit.PNG)

1. To exclude a time period from being used to produce the model, choose the calendar icon by **End date**. Then, select or enter the days and times to exclude from training and choose **Apply**.

1. If the metric is sensitive to Daylight Saving Time changes, select the appropriate time zone in the **Metric timezone** box.

1. Choose **Update**.

## Deleting an anomaly detection model
<a name="Delete_Anomaly_Detection_Model"></a>

Using anomaly detection for an alarm accrues charges. As a best practice, if your alarm no longer needs an anomaly detection model, delete the alarm first and the model second. When anomaly detection alarms are evaluated, any missing anomaly detectors are created on your behalf. If you delete the model without deleting the alarm, the alarm automatically recreates the model.

**To delete an alarm**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, **All Alarms**.

1. Choose the name of the alarm.

1. Choose **Actions**, **Delete**.

1. In the confirmation box, choose **Delete**.

**To delete an anomaly detection model that was used for an alarm**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1.  In the navigation pane, choose **Metrics**, and then choose **All metrics**. 

1.  Choose **Browse**, and then select the metric that includes the anomaly detection model. You can search for your metric in the search box or select your metric by choosing through the options. 
   +  (Optional) If you're using the original interface, choose **All metrics**, and then choose the metric that includes the anomaly detection model. You can search for your metric in the search box or select your metric by choosing through the options. 

1.  Choose **Graphed metrics**. 

1. In the **Graphed metrics** tab, in the **Details** column, choose the **ANOMALY\_DETECTION\_BAND** keyword, and then choose **Delete anomaly detection model** in the popup.  
![\[The Graphed Metrics tab with the ANOMALY_DETECTION_BAND popup menu displayed.\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/Anomaly_Detection_Edit.PNG)
   +  (Optional) If you're using the original interface, choose **Edit model**. You're directed to a new screen. On the new screen, choose **Delete model**, and then choose **Delete**. 
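If you prefer the AWS CLI, the same cleanup can be sketched with `delete-alarms` and `delete-anomaly-detector`. The alarm name, metric, and dimensions below are hypothetical; delete the alarm first so its next evaluation doesn't recreate the model:

```shell
# Delete the alarm first so it doesn't recreate the model
aws cloudwatch delete-alarms --alarm-names CPUAnomalyAlarm

# Then delete the anomaly detection model itself
aws cloudwatch delete-anomaly-detector \
  --single-metric-anomaly-detector '{"Namespace":"AWS/EC2","MetricName":"CPUUtilization","Dimensions":[{"Name":"InstanceId","Value":"i-0123456789abcdef0"}],"Stat":"Average"}'
```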

# Create an alarm based on a Multi Time Series Metrics Insights query
<a name="multi-time-series-alarm"></a>

You can create an alarm that monitors multiple time series across a fleet of resources. Unlike single-instance alarms that trigger actions on individual instances, fleet monitoring alarms let you aggregate metrics across multiple resources and trigger based on fleet-wide conditions.

## Setting up a Multi Time Series alarm using the AWS Management Console
<a name="multi-time-series-alarm-console"></a>

This example shows how to create an alarm that monitors memory utilization across a fleet of instances and alerts you when more than two instances exceed a threshold.

**To create a multi time series alarm**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, **All Alarms**.

1. Choose **Create alarm**.

1. Choose **Select metric**.

1. Under **Metrics**, enter a Metrics Insights query:

   ```
   SELECT MAX(mem_used_percent)
   FROM "CWAgent"
   GROUP BY InstanceId
   ORDER BY MAX() DESC
   ```

1. Choose **Next**.

1. Under **Conditions**, specify the following:
   + For **Threshold type**, choose **Static**.
   + For **When metric is**, choose **Greater than** and enter **80**.
   + For **Datapoints to alarm**, enter **2**.

1. Configure notifications and actions as needed.

1. Add a name and description for your alarm.

1. Choose **Create alarm**.

This alarm differs from single-instance alarms in several ways:
+ It monitors multiple time series simultaneously through the use of a metrics query. The metrics query is refreshed every time the alarm evaluates, thus your alarm automatically adapts as resources are created, paused, or deleted.
+ For each contributor that breaches the threshold, the alarm sends a contributor state change event, which has a different event type in EventBridge than an alarm state change event. The alarm itself also changes state: as soon as at least one contributor is in alarm, the alarm also enters the alarm state.
+ Some actions, however, such as SSM Incident creation, are triggered at the alarm level. Such actions are not repeated when the list of contributors in alarm changes.

This alarm differs from aggregated metric-query alarms in several ways:
+ It monitors time series individually instead of an aggregate, using the `GROUP BY` clause.
+ It follows the level of granularity that you set according to your needs: for example, it can alarm on every Amazon EC2 instance (the most granular level of Amazon EC2 metrics) or per Amazon DynamoDB table (aggregated across the various operations on a table), depending on which fields you set in the `GROUP BY` clause.
+ It prioritizes evaluation using the `ORDER BY` clause.
+ For each contributor that breaches the threshold, the alarm sends a contributor state change event, which has a different event type in EventBridge than an alarm state change event. The alarm itself also changes state: as soon as at least one contributor is in alarm, the alarm also enters the alarm state.
+ Some actions, however, such as SSM Incident creation, are triggered at the alarm level. Such actions are not repeated when the list of contributors in alarm changes.
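The console procedure above can also be approximated with the AWS CLI, assuming the standard `--metrics` form of `put-metric-alarm` with a Metrics Insights query in the `Expression` field. The alarm name and SNS topic ARN are hypothetical:

```shell
# Alarm when any instance's memory utilization exceeds 80%
# for 2 consecutive 1-minute periods
aws cloudwatch put-metric-alarm \
  --alarm-name FleetMemoryAlarm \
  --comparison-operator GreaterThanThreshold \
  --threshold 80 \
  --evaluation-periods 2 \
  --datapoints-to-alarm 2 \
  --metrics '[{"Id":"q1","ReturnData":true,"Period":60,"Expression":"SELECT MAX(mem_used_percent) FROM \"CWAgent\" GROUP BY InstanceId ORDER BY MAX() DESC"}]' \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:MyTopic
```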

# Create an alarm based on a connected data source
<a name="Create_MultiSource_Alarm"></a>

You can create alarms that watch metrics from data sources that aren't in CloudWatch. For more information about creating connections to these other data sources, see [Query metrics from other data sources](MultiDataSourceQuerying.md).

**To create an alarm on metrics from a data source that you have connected to**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Metrics**, **All metrics**.

1. Choose the **Multi source query** tab.

1. For **Data source**, select the data source that you want to use.

1. The query builder prompts you for the information necessary for the query to retrieve the metrics to use for the alarm. The workflow is different for each data source, and is tailored to the data source. For example, for Amazon Managed Service for Prometheus and Prometheus data sources, a PromQL query editor box with a query helper appears.

1. When you have finished constructing the query, choose **Graph query**.

1. If the sample graph looks the way that you expect, choose **Create alarm**.

1. The **Specify metric and conditions** page appears. If the query you are using produces more than one time series, you see a warning banner at the top of the page. If you do, select a function to use to aggregate the time series in **Aggregation function**. 

1. (Optional) Add a **Label** for the alarm.

1.  For **Whenever *your-metric-name* is . . .**, choose **Greater**, **Greater/Equal**, **Lower/Equal**, or **Lower**. Then for **than . . .**, specify a number for your threshold value. 

1. Choose **Additional configuration**. For **Datapoints to alarm**, specify how many evaluation periods (data points) must be in the `ALARM` state to trigger the alarm. If the two values here match, you create an alarm that goes to `ALARM` state if that many consecutive periods are breaching.

   To create an M out of N alarm, specify a number for the first value that is lower than the number for the second value. For more information, see [Alarm evaluation](alarm-evaluation.md).

1. For **Missing data treatment**, choose how the alarm behaves when some data points are missing. For more information, see [Configuring how CloudWatch alarms treat missing data](alarms-and-missing-data.md).

1. Choose **Next**.

1.  For **Notification**, specify an Amazon SNS topic to notify when your alarm transitions to the `ALARM`, `OK`, or `INSUFFICIENT_DATA` state. 

   1.  (Optional) To send multiple notifications for the same alarm state or for different alarm states, choose **Add notification**.
**Note**  
We recommend that you set the alarm to take actions when it goes into **Insufficient data** state in addition to when it goes into **Alarm** state. This is because many issues with the Lambda function that connects to the data source can cause the alarm to transition to **Insufficient data**.

   1.  (Optional) To not send Amazon SNS notifications, choose **Remove**. 

1. To have the alarm perform Auto Scaling, Lambda, or Systems Manager actions, choose the appropriate button and choose the alarm state and action to perform. If you choose a Lambda function as an alarm action, you specify the function name or ARN, and you can optionally choose a specific version of the function.

   Alarms can perform Systems Manager actions only when they go into ALARM state. For more information about Systems Manager actions, see [ Configuring CloudWatch to create OpsItems from alarms](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter-create-OpsItems-from-CloudWatch-Alarms.html) and [ Incident creation](https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-creation.html).
**Note**  
To create an alarm that performs an SSM Incident Manager action, you must have certain permissions. For more information, see [ Identity-based policy examples for AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/security_iam_id-based-policy-examples.html).

1. Choose **Next**.

1.  Under **Name and description**, enter a name and description for your alarm, and choose **Next**. The name must contain only UTF-8 characters, and can't contain ASCII control characters. The description can include markdown formatting, which is displayed only in the alarm **Details** tab in the CloudWatch console. The markdown can be useful to add links to runbooks or other internal resources. 

1.  Under **Preview and create**, confirm that your alarm's information and conditions are correct, and choose **Create alarm**. 

# Create an alarm using a PromQL query
<a name="Create_PromQL_Alarm"></a>

You can create a CloudWatch alarm that uses a PromQL instant query to monitor metrics ingested through the CloudWatch OTLP endpoint. All matching time series returned by the query are considered to be breaching, and the alarm tracks each breaching time series as a contributor. For more information about how PromQL alarms work, see [PromQL alarms](alarm-promql.md).

## Creating a PromQL alarm using the AWS Management Console
<a name="promql-alarm-create-console"></a>

This example shows how to create an alarm that monitors a gauge metric and alerts you when its value drops below 20.

**To create a PromQL alarm**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, **All Alarms**.

1. Choose **Create alarm**.

1. Choose **PromQL** for the metric type.

1. In **Editor** mode, enter the PromQL query:

   ```
   my_gauge_metric < 20
   ```

1. Under **Conditions**, specify the following:
   + For **Evaluation interval**, choose **1 minute** to define how often the PromQL query is evaluated.
   + For **Pending period**, enter **120**, the duration in seconds that a contributor must be breaching before it enters `ALARM` state.
   + For **Recovery period**, enter **300**, the duration in seconds that a contributor must be non-breaching before it returns to `OK` state.

1. Configure notifications and actions as needed.

1. Add a name and description for your alarm.

1. Choose **Next**.

1. Choose **Create alarm**.

## Creating a PromQL alarm (AWS CLI)
<a name="promql-alarm-create-cli"></a>

Use the [PutMetricAlarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricAlarm.html) API action to create a PromQL alarm.

**Example Create a PromQL alarm that triggers when a gauge metric drops below 20**  

```
aws cloudwatch put-metric-alarm \
  --alarm-name MyPromQLAlarm \
  --evaluation-criteria '{"PromQLCriteria":{"Query":"my_gauge_metric < 20"}}' \
  --evaluation-interval 60
```

**Example Create a PromQL alarm with a pending period**  
This alarm waits 300 seconds (5 minutes) before transitioning to `ALARM` state, and waits 600 seconds (10 minutes) before recovering.  

```
aws cloudwatch put-metric-alarm \
  --alarm-name HighLatencyAlarm \
  --evaluation-criteria '{"PromQLCriteria":{"Query":"histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5","PendingPeriod":300,"RecoveryPeriod":600}}' \
  --evaluation-interval 60
```

**Example Create a PromQL alarm with an SNS notification action**  

```
aws cloudwatch put-metric-alarm \
  --alarm-name MyPromQLAlarmWithAction \
  --evaluation-criteria '{"PromQLCriteria":{"Query":"my_gauge_metric < 20","PendingPeriod":0,"RecoveryPeriod":0}}' \
  --evaluation-interval 60 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:MyTopic
```

## Creating a PromQL alarm from Query Studio
<a name="promql-alarm-create-query-studio"></a>

This example shows how to create a PromQL alarm from Query Studio that alerts you when the average HTTP request duration for a service exceeds 500 milliseconds.

Unlike standard CloudWatch alarms where the threshold is configured as a separate step, PromQL alarms define the alarm condition (threshold) as part of the query itself. For example, the comparison operator (`>`) and threshold value (`0.5`) are embedded directly in the PromQL expression.

**To create a PromQL alarm from Query Studio**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane below **Metrics**, choose **Query Studio (Preview)**.

1. Select **PromQL** from the query language drop-down menu.

1. Build your query using one of the following modes:
   + In **Builder** mode, select a metric name from the **Metric** field (for example, `http.server.request.duration`). Add label filters as needed (for example, `@resource.service.name` = `my-api`). To define the alarm threshold, select a **Basic Operation** (for example, `>`) and enter a **Number** (for example, `0.5`).
   + In **Code** mode, enter the PromQL expression directly, for example:

     ```
     histogram_avg({"http.server.request.duration", "@resource.service.name"="my-api"}) > 0.5
     ```

1. Choose **Run** to execute the query and verify it returns the expected results.

1. Choose **Create alarm** from the actions menu.

1. You are redirected to the CloudWatch alarm creation page with your PromQL query pre-populated.

1. Under **Conditions**, specify the following:
   + For **Evaluation interval**, choose **1 minute** to define how often the PromQL query is evaluated.
   + For **Pending period**, enter **60**, the duration in seconds that the query must be breaching before the alarm enters `ALARM` state. This means the latency must exceed the threshold for at least 60 seconds before the alarm fires.
   + For **Recovery period**, enter **120**, the duration in seconds that the query must be non-breaching before the alarm returns to `OK` state. This means the latency must stay below the threshold for at least 120 seconds before the alarm recovers.

1. Configure notifications and actions as needed.

1. Add a name and description for your alarm.

1. Choose **Next**.

1. Choose **Create alarm**.

**Note**  
The PromQL query must return a single time series to create an alarm. If your query returns multiple time series, use aggregation functions such as `sum`, `avg`, or `topk` to reduce the result to a single series before creating the alarm.

# Alarming on logs
<a name="Alarm-On-Logs"></a>

The steps in the following sections explain how to create CloudWatch alarms on logs.

## Create a CloudWatch alarm based on a log group-metric filter
<a name="Create_alarm_log_group_metric_filter"></a>

 The procedure in this section describes how to create an alarm based on a log group-metric filter. With metric filters, you can look for terms and patterns in log data as the data is sent to CloudWatch. For more information, see [Create metrics from log events using filters](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html) in the *Amazon CloudWatch Logs User Guide*. Before you create an alarm based on a log group-metric filter, you must complete the following actions: 
+  Create a log group. For more information, see [Working with log groups and log streams](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Working-with-log-groups-and-streams.html#Create-Log-Group) in the *Amazon CloudWatch Logs User Guide*. 
+  Create a metric filter. For more information, see [Create a metric filter for a log group](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CreateMetricFilterProcedure.html) in the *Amazon CloudWatch Logs User Guide*. 

**To create an alarm based on a log group-metric filter**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1.  From the navigation pane, choose **Logs**, and then choose **Log groups**. 

1.  Choose the log group that includes your metric filter. 

1.  Choose **Metric filters**. 

1.  In the metric filters tab, select the box for the metric filter that you want to base your alarm on. 

1.  Choose **Create alarm**. 

1.  (Optional) Under **Metric**, edit **Metric name**, **Statistic**, and **Period**. 

1.  Under **Conditions**, specify the following: 

   1.  For **Threshold type**, choose **Static** or **Anomaly detection**. 

   1.  For **Whenever *your-metric-name* is . . .**, choose **Greater**, **Greater/Equal**, **Lower/Equal** , or **Lower**. 

   1.  For **than . . .**, specify a number for your threshold value. 

1.  Choose **Additional configuration**. 

   1.  For **Data points to alarm**, specify how many data points trigger your alarm to go into the `ALARM` state. If you specify matching values, your alarm goes into the `ALARM` state if that many consecutive periods are breaching. To create an M-out-of-N alarm, specify a number for the first value that's lower than the number you specify for the second value. For more information, see [Alarm evaluation](alarm-evaluation.md). 

   1.  For **Missing data treatment**, select an option to specify how to treat missing data when your alarm is evaluated. 

1.  Choose **Next**. 

1.  For **Notification**, specify an Amazon SNS topic to notify when your alarm is in the `ALARM`, `OK`, or `INSUFFICIENT_DATA` state. 

   1.  (Optional) To send multiple notifications for the same alarm state or for different alarm states, choose **Add notification**. 

   1.  (Optional) To not send notifications, choose **Remove**. 

1. To have the alarm perform Auto Scaling, EC2, Lambda, or Systems Manager actions, choose the appropriate button and choose the alarm state and action to perform. If you choose a Lambda function as an alarm action, you specify the function name or ARN, and you can optionally choose a specific version of the function.

   Alarms can perform Systems Manager actions only when they go into ALARM state. For more information about Systems Manager actions, see [ Configuring CloudWatch to create OpsItems from alarms](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter-create-OpsItems-from-CloudWatch-Alarms.html) and [ Incident creation](https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-creation.html).
**Note**  
To create an alarm that performs an SSM Incident Manager action, you must have certain permissions. For more information, see [ Identity-based policy examples for AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/security_iam_id-based-policy-examples.html).

1.  Choose **Next**. 

1.  For **Name and description**, enter a name and description for your alarm. The name must contain only UTF-8 characters, and can't contain ASCII control characters. The description can include markdown formatting, which is displayed only in the alarm **Details** tab in the CloudWatch console. The markdown can be useful to add links to runbooks or other internal resources. 

1.  For **Preview and create**, check that your configuration is correct, and choose **Create alarm**. 
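The prerequisite steps and the alarm itself can also be sketched with the AWS CLI: first create a metric filter on the log group, then alarm on the metric it publishes. The log group, filter, metric, and topic names below are hypothetical:

```shell
# Create a metric filter that counts log events containing "ERROR"
aws logs put-metric-filter \
  --log-group-name MyLogGroup \
  --filter-name ErrorCountFilter \
  --filter-pattern "ERROR" \
  --metric-transformations metricName=ErrorCount,metricNamespace=LogMetrics,metricValue=1,defaultValue=0

# Alarm when more than 5 errors occur within a 5-minute period
aws cloudwatch put-metric-alarm \
  --alarm-name TooManyErrors \
  --namespace LogMetrics \
  --metric-name ErrorCount \
  --statistic Sum \
  --period 300 \
  --threshold 5 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --treat-missing-data notBreaching \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:MyTopic
```

Setting `--treat-missing-data notBreaching` keeps the alarm in `OK` state during periods with no matching log events.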

# Create a composite alarm
<a name="Create_Composite_Alarm"></a>

The steps in this section explain how to use the CloudWatch console to create a composite alarm. You can also use the API or AWS CLI to create a composite alarm. For more information, see [PutCompositeAlarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutCompositeAlarm.html) or [put-composite-alarm](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/cloudwatch/put-composite-alarm.html).

**To create a composite alarm**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, and then choose **All alarms**.

1. From the list of alarms, select the check box next to each of the existing alarms that you want to reference in your rule expression, and then choose **Create composite alarm**.

1. Under **Specify composite alarm conditions**, specify the rule expression for your new composite alarm.
**Note**  
The alarms that you selected from the list of alarms are automatically shown in the **Conditions** box. By default, the `ALARM` function is applied to each of your alarms, and the alarms are joined by the logical operator `OR`.

   You can use the following substeps to modify your rule expression:

   1. You can change the required state for each of your alarms from `ALARM` to `OK` or `INSUFFICIENT_DATA`.

   1. You can change the logical operator in your rule expression from `OR` to `AND` or `NOT`, and you can add parentheses to group your functions.

   1. You can include other alarms in your rule expression or delete alarms from your rule expression.

   **Example: Rule expression with conditions**

   ```
   (ALARM("CPUUtilizationTooHigh") OR 
   ALARM("DiskReadOpsTooHigh")) AND 
   OK("NetworkOutTooHigh")
   ```

   In this example rule expression, the composite alarm goes into `ALARM` when either "CPUUtilizationTooHigh" or "DiskReadOpsTooHigh" is in `ALARM` and, at the same time, "NetworkOutTooHigh" is in `OK`.

1. When finished, choose **Next**.

1. Under **Configure actions**, you can choose from the following:

   For ***Notification***
   + **Select an existing SNS topic**, **Create a new SNS topic**, or **Use a topic ARN** to define the SNS topic that will receive the notification.
   + **Add notification**, so your alarm can send multiple notifications for the same alarm state or different alarm states.
   + **Remove** to stop your alarm from sending notifications or taking actions.

   (Optional) To have the alarm invoke a Lambda function when it changes state, choose **Add Lambda action**. Then specify the function name or ARN, and optionally choose a specific version of the function.

   For ***Systems Manager action***
   + **Add Systems Manager action**, so your alarm can perform an SSM action when it goes into ALARM.

   To learn more about Systems Manager actions, see [Configuring CloudWatch to create OpsItems from alarms](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter-create-OpsItems-from-CloudWatch-Alarms.html) in the *AWS Systems Manager User Guide* and [Incident creation](https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-creation.html) in the *Incident Manager User Guide*. To create an alarm that performs an SSM Incident Manager action, you must have the correct permissions. For more information, see [Identity-based policy examples for AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/security_iam_id-based-policy-examples.html) in the *Incident Manager User Guide*.

   To have the alarm start an investigation, choose **Add investigation action** and then select your investigation group. For more information, see [CloudWatch investigations](Investigations.md).

1. When finished, choose **Next**.

1. Under **Add name and description**, enter an alarm name and an optional description for your new composite alarm. The name must contain only UTF-8 characters, and can't contain ASCII control characters. The description can include markdown formatting, which is displayed only in the alarm **Details** tab in the CloudWatch console. The markdown can be useful for adding links to runbooks or other internal resources. 

1. When finished, choose **Next**.

1. Under **Preview and create**, confirm your information, and then choose **Create composite alarm**.
**Note**  
You can create a cycle of composite alarms, where two or more composite alarms depend on each other. If you find yourself in this scenario, your composite alarms stop being evaluated, and you can't delete your composite alarms because they're dependent on each other. The easiest way to break the cycle of dependency between your composite alarms is to change the `AlarmRule` of one of your composite alarms to `FALSE`.
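To make the boolean semantics of rule expressions concrete, the following is a minimal Python sketch of how child alarm states combine under the earlier example rule expression. The helper function and state map are illustrative only; they are not part of any CloudWatch API or the actual CloudWatch evaluator.

```python
# Hypothetical helper that mirrors the boolean semantics of the example
# composite alarm rule expression:
#   (ALARM("CPUUtilizationTooHigh") OR ALARM("DiskReadOpsTooHigh"))
#   AND OK("NetworkOutTooHigh")
def evaluate_example_rule(states):
    """states maps alarm name -> current state string."""
    def in_alarm(name):
        # ALARM(name) is true while the child alarm is in ALARM state
        return states[name] == "ALARM"

    def is_ok(name):
        # OK(name) is true while the child alarm is in OK state
        return states[name] == "OK"

    return (in_alarm("CPUUtilizationTooHigh") or
            in_alarm("DiskReadOpsTooHigh")) and is_ok("NetworkOutTooHigh")

# The composite alarm fires only while NetworkOutTooHigh stays OK:
print(evaluate_example_rule({
    "CPUUtilizationTooHigh": "ALARM",
    "DiskReadOpsTooHigh": "OK",
    "NetworkOutTooHigh": "OK",
}))  # True
```

Note that if `NetworkOutTooHigh` also goes into `ALARM`, the `OK(...)` clause becomes false and the composite alarm returns to `OK`, even though the CPU or disk child alarms are still breaching.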

# Acting on alarm changes
<a name="Acting_Alarm_Changes"></a>

CloudWatch can notify users of two types of alarm changes: when an alarm changes state, and when the configuration of an alarm is updated.

When an alarm evaluates, it might change from one state to another, such as `ALARM` or `OK`. For Metrics Insights alarms that monitor multiple time series, each time series (contributor) can be only in `ALARM` or `OK` state, never in `INSUFFICIENT_DATA` state. This is because a time series only exists when data is present.

Additionally, CloudWatch sends events to Amazon EventBridge whenever alarms change state, and when alarms are created, deleted, or updated. You can write EventBridge rules to take actions or be notified when EventBridge receives these events.

For more information about alarm actions, see [Alarm actions](alarm-actions.md).

**Topics**
+ [Notifying users on alarm changes](Notify_Users_Alarm_Changes.md)
+ [Invoke a Lambda function from an alarm](alarms-and-actions-Lambda.md)
+ [Start a CloudWatch investigation from an alarm](Start-Investigation-Alarm.md)
+ [Stop, terminate, reboot, or recover an EC2 instance](UsingAlarmActions.md)
+ [Alarm events and EventBridge](cloudwatch-and-eventbridge.md)

# Notifying users on alarm changes
<a name="Notify_Users_Alarm_Changes"></a>

This section explains how you can use AWS User Notifications or Amazon Simple Notification Service (Amazon SNS) to notify users of alarm changes.

## Setting up AWS User Notifications
<a name="Alarm_User_Notifications"></a>

You can use [AWS User Notifications](https://docs.aws.amazon.com/notifications/latest/userguide/what-is-service.html) to set up delivery channels to get notified about CloudWatch alarm state change and configuration change events. You receive a notification when an event matches a rule that you specify. You can receive notifications for events through multiple channels, including email, [AWS Chatbot](https://docs.aws.amazon.com/chatbot/latest/adminguide/what-is.html) chat notifications, or [AWS Console Mobile Application push notifications](https://docs.aws.amazon.com/consolemobileapp/latest/userguide/managing-notifications.html). You can also see notifications in the [Console Notifications Center](https://console.aws.amazon.com/notifications). User Notifications supports aggregation, which can reduce the number of notifications you receive during specific events.

The notification configurations you create with AWS User Notifications do not count towards the limit on the number of actions you can configure per target alarm state. As AWS User Notifications matches the events emitted to Amazon EventBridge, it sends notifications for all the alarms in your account and selected Regions, unless you specify an advanced filter to allowlist or denylist specific alarms or patterns.

The following example of an advanced filter matches an alarm state change from OK to ALARM on the alarm named `ServerCpuTooHigh`. 

```
{
"detail": {
    "alarmName": ["ServerCpuTooHigh"],
    "previousState": { "value": ["OK"] },
    "state": { "value": ["ALARM"] }
  }
}
```

You can use any of the properties published by an alarm in EventBridge events to create a filter. For more information, see [Alarm events and EventBridge](cloudwatch-and-eventbridge.md).
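To illustrate how an advanced filter like the one above selects events, the following is a simplified Python sketch of exact-value pattern matching. The real EventBridge matching engine supports many more operators (prefix, numeric, anything-but, and so on); this sketch handles only the nested exact-value lists used in the example filter, and the event payload shown is abbreviated.

```python
# Simplified illustration of EventBridge-style pattern matching: each
# pattern leaf is a list of acceptable values, and nested dicts are
# matched recursively. Not the actual EventBridge implementation.
def matches(pattern, event):
    for key, expected in pattern.items():
        if isinstance(expected, dict):
            if not isinstance(event.get(key), dict):
                return False
            if not matches(expected, event[key]):
                return False
        else:
            # expected is a list of acceptable exact values
            if event.get(key) not in expected:
                return False
    return True

filter_pattern = {
    "detail": {
        "alarmName": ["ServerCpuTooHigh"],
        "previousState": {"value": ["OK"]},
        "state": {"value": ["ALARM"]},
    }
}

# Abbreviated alarm state-change event
event = {
    "detail": {
        "alarmName": "ServerCpuTooHigh",
        "previousState": {"value": "OK"},
        "state": {"value": "ALARM"},
    }
}
print(matches(filter_pattern, event))  # True
```

An event for a different alarm name, or for a transition that doesn't go from `OK` to `ALARM`, fails the corresponding leaf check and is not matched.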

## Setting up Amazon SNS notifications
<a name="US_SetupSNS"></a>

You can use Amazon Simple Notification Service to send both application-to-application (A2A) messaging and application-to-person (A2P) messaging, including mobile text messaging (SMS) and email messages. For more information, see [Amazon SNS event destinations](https://docs.aws.amazon.com/sns/latest/dg/sns-event-destinations.html).

For every state that an alarm can take, you can configure the alarm to send a message to an SNS topic. Every Amazon SNS topic you configure for a state on a given alarm counts towards the limit on the number of actions you can configure for that alarm and state. You can send messages to the same Amazon SNS topic from any alarms in your account, and use the same Amazon SNS topic for both application (A2A) and person (A2P) consumers. Because this configuration is done at the alarm level, only the alarms you have configured send messages to the selected Amazon SNS topic.

First, create a topic, then subscribe to it. You can optionally publish a test message to the topic. For an example, see [Setting up an Amazon SNS topic using the AWS Management Console](#set-up-sns-topic-console). Or for more information, see [Getting started with Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/sns-getting-started.html).

Alternatively, if you plan to use the AWS Management Console to create your CloudWatch alarm, you can skip this procedure because you can create the topic when you create the alarm.

 When you create a CloudWatch alarm, you can add actions for any target state the alarm enters. Add an Amazon SNS notification for the state you want to be notified about, and select the Amazon SNS topic you created in the previous step to send an email notification when the alarm enters the selected state. 

**Note**  
When you create an Amazon SNS topic, you choose to make it a *standard topic* or a *FIFO topic*. CloudWatch guarantees the publication of all alarm notifications to both types of topics. However, even if you use a FIFO topic, in rare cases CloudWatch sends the notifications to the topic out of order. If you use a FIFO topic, the alarm sets the message group ID of the alarm notifications to be a hash of the ARN of the alarm.
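Because the message group ID is derived from the alarm ARN, all notifications from one alarm share one message group on a FIFO topic. CloudWatch doesn't document the exact hash function it uses; the following Python sketch uses SHA-256 purely as a stand-in to illustrate the deterministic per-alarm mapping.

```python
import hashlib

# Illustrative only: CloudWatch sets the FIFO message group ID to a hash
# of the alarm ARN, but the specific hash algorithm is not documented.
# SHA-256 here just demonstrates that the mapping is deterministic, so
# every notification from the same alarm lands in the same message group.
def message_group_id(alarm_arn):
    return hashlib.sha256(alarm_arn.encode("utf-8")).hexdigest()[:32]

arn = "arn:aws:cloudwatch:us-east-1:111122223333:alarm:ServerCpuTooHigh"
# Stable for the same alarm, distinct for different alarms:
print(message_group_id(arn) == message_group_id(arn))  # True
```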

**Topics**
+ [Preventing confused deputy security issues](#SNS_Confused_Deputy)
+ [Setting up an Amazon SNS topic using the AWS Management Console](#set-up-sns-topic-console)
+ [Setting up an SNS topic using the AWS CLI](#set-up-sns-topic-cli)

### Preventing confused deputy security issues
<a name="SNS_Confused_Deputy"></a>

The confused deputy problem is a security issue where an entity that doesn't have permission to perform an action can coerce a more-privileged entity to perform the action. In AWS, cross-service impersonation can result in the confused deputy problem. Cross-service impersonation can occur when one service (the *calling service*) calls another service (the *called service*). The calling service can be manipulated to use its permissions to act on another customer's resources in a way it should not otherwise have permission to access. To prevent this, AWS provides tools that help you protect your data for all services with service principals that have been given access to resources in your account. 

We recommend using the [`aws:SourceArn`](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html#condition-keys-sourcearn), [`aws:SourceAccount`](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html#condition-keys-sourceaccount), [`aws:SourceOrgID`](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html#condition-keys-sourceorgid), and [`aws:SourceOrgPaths`](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html#condition-keys-sourceorgpaths) global condition context keys in resource policies to limit the permissions that Amazon SNS gives another service to the resource. Use `aws:SourceArn` to associate only one resource with cross-service access. Use `aws:SourceAccount` to let any resource in that account be associated with the cross-service use. Use `aws:SourceOrgID` to allow any resource from any account within an organization to be associated with the cross-service use. Use `aws:SourceOrgPaths` to associate any resource from accounts within an AWS Organizations path with the cross-service use. For more information about using and understanding paths, see [aws:SourceOrgPaths](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html#condition-keys-sourceorgpaths) in the *IAM User Guide*.

The most effective way to protect against the confused deputy problem is to use the `aws:SourceArn` global condition context key with the full ARN of the resource. If you don't know the full ARN of the resource or if you are specifying multiple resources, use the `aws:SourceArn` global context condition key with wildcard characters (`*`) for the unknown portions of the ARN. For example, `arn:aws:servicename:*:123456789012:*`. 

If the `aws:SourceArn` value does not contain the account ID, such as an Amazon S3 bucket ARN, you must use both `aws:SourceAccount` and `aws:SourceArn` to limit permissions.

To protect against the confused deputy problem at scale, use the `aws:SourceOrgID` or `aws:SourceOrgPaths` global condition context key with the organization ID or organization path of the resource in your resource-based policies. Policies that include the `aws:SourceOrgID` or `aws:SourceOrgPaths` key will automatically include the correct accounts and you don't have to manually update the policies when you add, remove, or move accounts in your organization.
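For example, an organization-scoped condition block might look like the following fragment. This is a sketch only; the organization ID `o-a1b2c3d4e5` is a placeholder that you would replace with your own organization ID.

```
"Condition": {
    "StringEquals": {
        "aws:SourceOrgID": "o-a1b2c3d4e5"
    }
}
```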

The value of `aws:SourceArn` must be the ARN of the alarm that is sending notifications.

The following example shows how you can use the `aws:SourceArn` and `aws:SourceAccount` global condition context keys in CloudWatch to prevent the confused deputy problem.

```
{
    "Statement": [{
        "Effect": "Allow",
        "Principal": {
            "Service": "cloudwatch.amazonaws.com"
        },
        "Action": "SNS:Publish",
        "Resource": "arn:aws:sns:us-east-2:444455556666:MyTopic",
        "Condition": {
            "ArnLike": {
                "aws:SourceArn": "arn:aws:cloudwatch:us-east-2:111122223333:alarm:*"
            },
            "StringEquals": {
                "aws:SourceAccount": "111122223333"
            }
        }
    }]
}
```

If an alarm ARN includes any non-ASCII characters, use only the `aws:SourceAccount` global condition key to limit the permissions.

### Setting up an Amazon SNS topic using the AWS Management Console
<a name="set-up-sns-topic-console"></a>

First, create a topic, then subscribe to it. You can optionally publish a test message to the topic.

**To create an SNS topic**

1. Open the Amazon SNS console at [https://console.aws.amazon.com/sns/v3/home](https://console.aws.amazon.com/sns/v3/home).

1. On the Amazon SNS dashboard, under **Common actions**, choose **Create Topic**. 

1. In the **Create new topic** dialog box, for **Topic name**, enter a name for the topic (for example, **my-topic**).

1. Choose **Create topic**.

1. Copy the **Topic ARN** for the next task (for example, arn:aws:sns:us-east-1:111122223333:my-topic).

**To subscribe to an SNS topic**

1. Open the Amazon SNS console at [https://console.aws.amazon.com/sns/v3/home](https://console.aws.amazon.com/sns/v3/home).

1. In the navigation pane, choose **Subscriptions**, **Create subscription**.

1. In the **Create subscription** dialog box, for **Topic ARN**, paste the topic ARN that you created in the previous task.

1. For **Protocol**, choose **Email**.

1. For **Endpoint**, enter an email address that you can use to receive the notification, and then choose **Create subscription**.

1. From your email application, open the message from AWS Notifications and confirm your subscription.

   Your web browser displays a confirmation response from Amazon SNS.

**To publish a test message to an SNS topic**

1. Open the Amazon SNS console at [https://console.aws.amazon.com/sns/v3/home](https://console.aws.amazon.com/sns/v3/home).

1. In the navigation pane, choose **Topics**.

1. On the **Topics** page, select a topic and choose **Publish to topic**.

1. In the **Publish a message** page, for **Subject**, enter a subject line for your message, and for **Message**, enter a brief message.

1. Choose **Publish Message**.

1. Check your email to confirm that you received the message.

### Setting up an SNS topic using the AWS CLI
<a name="set-up-sns-topic-cli"></a>

First, you create an SNS topic, and then you publish a message directly to the topic to test that you have properly configured it.

**To set up an SNS topic**

1. Create the topic using the [create-topic](https://docs.aws.amazon.com/cli/latest/reference/sns/create-topic.html) command as follows.

   ```
   aws sns create-topic --name my-topic
   ```

   Amazon SNS returns a topic ARN with the following format:

   ```
   {
       "TopicArn": "arn:aws:sns:us-east-1:111122223333:my-topic"
   }
   ```

1. Subscribe your email address to the topic using the [subscribe](https://docs.aws.amazon.com/cli/latest/reference/sns/subscribe.html) command. If the subscription request succeeds, you receive a confirmation email message.

   ```
   aws sns subscribe --topic-arn arn:aws:sns:us-east-1:111122223333:my-topic --protocol email --notification-endpoint my-email-address
   ```

   Amazon SNS returns the following:

   ```
   {
       "SubscriptionArn": "pending confirmation"
   }
   ```

1. From your email application, open the message from AWS Notifications and confirm your subscription.

   Your web browser displays a confirmation response from Amazon Simple Notification Service.

1. Check the subscription using the [list-subscriptions-by-topic](https://docs.aws.amazon.com/cli/latest/reference/sns/list-subscriptions-by-topic.html) command.

   ```
   aws sns list-subscriptions-by-topic --topic-arn arn:aws:sns:us-east-1:111122223333:my-topic
   ```

   Amazon SNS returns the following:

   ```
   {
     "Subscriptions": [
       {
         "Owner": "111122223333",
         "Endpoint": "me@mycompany.com",
         "Protocol": "email",
         "TopicArn": "arn:aws:sns:us-east-1:111122223333:my-topic",
         "SubscriptionArn": "arn:aws:sns:us-east-1:111122223333:my-topic:64886986-bf10-48fb-a2f1-dab033aa67a3"
       }
     ]
   }
   ```

1. (Optional) Publish a test message to the topic using the [publish](https://docs.aws.amazon.com/cli/latest/reference/sns/publish.html) command.

   ```
   aws sns publish --message "Verification" --topic-arn arn:aws:sns:us-east-1:111122223333:my-topic
   ```

   Amazon SNS returns the following.

   ```
   {
       "MessageId": "42f189a0-3094-5cf6-8fd7-c2dde61a4d7d"
   }
   ```

1. Check your email to confirm that you received the message.

## Schema of Amazon SNS notifications when alarms change state
<a name="alarm-sns-schema"></a>

This section lists the schemas of the notifications sent to Amazon SNS topics when alarms change their state.

**Schema when a metric alarm changes state**

```
{
  "AlarmName": "string",
  "AlarmDescription": "string",
  "AWSAccountId": "string",
  "AlarmConfigurationUpdatedTimestamp": "string",
  "NewStateValue": "string",
  "NewStateReason": "string",
  "StateChangeTime": "string",
  "Region": "string",
  "AlarmArn": "string",
  "OldStateValue": "string",
  "OKActions": ["string"],
  "AlarmActions": ["string"],
  "InsufficientDataActions": ["string"],
  "Trigger": {
    "MetricName": "string",
    "Namespace": "string",
    "StatisticType": "string",
    "Statistic": "string",
    "Unit": "string or null",
    "Dimensions": [
      {
        "value": "string",
        "name": "string"
      }
    ],
    "Period": "integer",
    "EvaluationPeriods": "integer",
    "DatapointsToAlarm": "integer",
    "ComparisonOperator": "string",
    "Threshold": "number",
    "TreatMissingData": "string",
    "EvaluateLowSampleCountPercentile": "string or null"
  }
}
```

**Schema when a composite alarm changes state**

```
{
  "AlarmName": "string",
  "AlarmDescription": "string",
  "AWSAccountId": "string",
  "NewStateValue": "string",
  "NewStateReason": "string",
  "StateChangeTime": "string",
  "Region": "string",
  "AlarmArn": "string",
  "OKActions": [String],
  "AlarmActions": [String],
  "InsufficientDataActions": [String],
  "OldStateValue": "string",
  "AlarmRule": "string",
  "TriggeringChildren": [String]
}
```
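In an SNS subscriber, the schema above arrives as a JSON string in the message body. The following is a minimal Python sketch of extracting a state transition from it; the `summarize` helper and the abbreviated payload are illustrative, not part of any AWS SDK.

```python
import json

# Hypothetical subscriber-side helper (for example, inside a Lambda
# function subscribed to the SNS topic) that parses an alarm
# notification. The payload is abbreviated from the metric-alarm schema.
def summarize(sns_message):
    body = json.loads(sns_message)
    return f"{body['AlarmName']}: {body['OldStateValue']} -> {body['NewStateValue']}"

message = json.dumps({
    "AlarmName": "ServerCpuTooHigh",
    "OldStateValue": "OK",
    "NewStateValue": "ALARM",
})
print(summarize(message))  # ServerCpuTooHigh: OK -> ALARM
```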

# Invoke a Lambda function from an alarm
<a name="alarms-and-actions-Lambda"></a>

CloudWatch alarms guarantee an asynchronous invocation of the Lambda function for a given state change, except in the following cases:
+ When the function doesn't exist.
+ When CloudWatch is not authorized to invoke the Lambda function.

If CloudWatch can't reach the Lambda service or the message is rejected for another reason, CloudWatch retries until the invocation is successful. Lambda queues the message and handles execution retries. For more information about this execution model, including information about how Lambda handles errors, see [ Asynchronous invocation](https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html) in the AWS Lambda Developer Guide.

You can invoke a Lambda function in the same account, or in other AWS accounts.

When you specify an alarm to invoke a Lambda function as an alarm action, you can choose to specify the function name, function alias, or a specific version of a function.

When you specify a Lambda function as an alarm action, you must create a resource policy for the function to allow the CloudWatch service principal to invoke the function.

One way to do this is by using the AWS CLI, as in the following example:

```
aws lambda add-permission \
--function-name my-function-name \
--statement-id AlarmAction \
--action 'lambda:InvokeFunction' \
--principal lambda.alarms.cloudwatch.amazonaws.com \
--source-account 111122223333 \
--source-arn arn:aws:cloudwatch:us-east-1:111122223333:alarm:alarm-name
```

Alternatively, you can create a policy similar to one of the following examples and then assign it to the function.

The following example specifies the account where the alarm is located, so that only alarms in that account (111122223333) can invoke the function.


```
{
    "Version":"2012-10-17",		 	 	 
    "Id": "default",
    "Statement": [{
        "Sid": "AlarmAction",
        "Effect": "Allow",
        "Principal": {
            "Service": "lambda.alarms.cloudwatch.amazonaws.com"
        },
        "Action": "lambda:InvokeFunction",
        "Resource": "arn:aws:lambda:us-east-1:444455556666:function:function-name",
        "Condition": {
            "StringEquals": {
                "AWS:SourceAccount": "111122223333"
            }
        }
    }]
}
```


The following example has a narrower scope, allowing only the specified alarm in the specified account to invoke the function.


```
{
  "Version":"2012-10-17",		 	 	 
  "Id": "default",
  "Statement": [
    {
      "Sid": "AlarmAction",
      "Effect": "Allow",
      "Principal": {
        "Service": "lambda.alarms.cloudwatch.amazonaws.com"
      },
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:us-east-1:444455556666:function:function-name",
      "Condition": {
        "StringEquals": {
          "AWS:SourceAccount": "111122223333",
          "AWS:SourceArn": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:alarm-name"
        }
      }
    }]
}
```


We don't recommend creating a policy that doesn't specify a source account, because such policies are vulnerable to confused deputy issues.

## Add Lambda metrics to CloudWatch investigations
<a name="Lambda-metrics-investigation"></a>

You can add Lambda metrics to your active CloudWatch investigations. When investigating an issue, Lambda metrics can provide valuable insights about function performance and behavior. For example, if you're investigating an application performance issue, Lambda metrics such as duration, error rates, or throttles might help identify the root cause.

To add Lambda metrics to CloudWatch investigations:

1. Open the AWS Lambda console at [https://console.aws.amazon.com/lambda/](https://console.aws.amazon.com/lambda/).

1. In the **Monitor** section, find the metric.

1. Open the context menu for the metric, choose **Investigate**, **Add to investigation**. Then, in the **Investigate** pane, select the name of the investigation.

## Event object sent from CloudWatch to Lambda
<a name="Lambda-action-payload"></a>

When you configure a Lambda function as an alarm action, CloudWatch delivers a JSON payload to the Lambda function when it invokes the function. This JSON payload serves as the event object for the function. You can extract data from this JSON object and use it in your function. The following is an example of an event object from a metric alarm.

```
{
  'source': 'aws.cloudwatch',
  'alarmArn': 'arn:aws:cloudwatch:us-east-1:444455556666:alarm:lambda-demo-metric-alarm',
  'accountId': '444455556666',
  'time': '2023-08-04T12:36:15.490+0000',
  'region': 'us-east-1',
  'alarmData': {
    'alarmName': 'lambda-demo-metric-alarm',
    'state': {
      'value': 'ALARM',
      'reason': 'test',
      'timestamp': '2023-08-04T12:36:15.490+0000'
    },
    'previousState': {
      'value': 'INSUFFICIENT_DATA',
      'reason': 'Insufficient Data: 5 datapoints were unknown.',
      'reasonData': '{"version":"1.0","queryDate":"2023-08-04T12:31:29.591+0000","statistic":"Average","period":60,"recentDatapoints":[],"threshold":5.0,"evaluatedDatapoints":[{"timestamp":"2023-08-04T12:30:00.000+0000"},{"timestamp":"2023-08-04T12:29:00.000+0000"},{"timestamp":"2023-08-04T12:28:00.000+0000"},{"timestamp":"2023-08-04T12:27:00.000+0000"},{"timestamp":"2023-08-04T12:26:00.000+0000"}]}',
      'timestamp': '2023-08-04T12:31:29.595+0000'
    },
    'configuration': {
      'description': 'Metric Alarm to test Lambda actions',
      'metrics': [
        {
          'id': '1234e046-06f0-a3da-9534-EXAMPLEe4c',
          'metricStat': {
            'metric': {
              'namespace': 'AWS/Logs',
              'name': 'CallCount',
              'dimensions': {
                'InstanceId': 'i-12345678'
              }
            },
            'period': 60,
            'stat': 'Average',
            'unit': 'Percent'
          },
          'returnData': True
        }
      ]
    }
  }
}
```

The following is an example of an event object from a composite alarm.

```
{
  'source': 'aws.cloudwatch',
  'alarmArn': 'arn:aws:cloudwatch:us-east-1:111122223333:alarm:SuppressionDemo.Main',
  'accountId': '111122223333',
  'time': '2023-08-04T12:56:46.138+0000',
  'region': 'us-east-1',
  'alarmData': {
    'alarmName': 'CompositeDemo.Main',
    'state': {
      'value': 'ALARM',
      'reason': 'arn:aws:cloudwatch:us-east-1:111122223333:alarm:CompositeDemo.FirstChild transitioned to ALARM at Friday 04 August, 2023 12:54:46 UTC',
      'reasonData': '{"triggeringAlarms":[{"arn":"arn:aws:cloudwatch:us-east-1:111122223333:alarm:CompositeDemo.FirstChild","state":{"value":"ALARM","timestamp":"2023-08-04T12:54:46.138+0000"}}]}',
      'timestamp': '2023-08-04T12:56:46.138+0000'
    },
    'previousState': {
      'value': 'ALARM',
      'reason': 'arn:aws:cloudwatch:us-east-1:111122223333:alarm:CompositeDemo.FirstChild transitioned to ALARM at Friday 04 August, 2023 12:54:46 UTC',
      'reasonData': '{"triggeringAlarms":[{"arn":"arn:aws:cloudwatch:us-east-1:111122223333:alarm:CompositeDemo.FirstChild","state":{"value":"ALARM","timestamp":"2023-08-04T12:54:46.138+0000"}}]}',
      'timestamp': '2023-08-04T12:54:46.138+0000',
      'actionsSuppressedBy': 'WaitPeriod',
      'actionsSuppressedReason': 'Actions suppressed by WaitPeriod'
    },
    'configuration': {
      'alarmRule': 'ALARM(CompositeDemo.FirstChild) OR ALARM(CompositeDemo.SecondChild)',
      'actionsSuppressor': 'CompositeDemo.ActionsSuppressor',
      'actionsSuppressorWaitPeriod': 120,
      'actionsSuppressorExtensionPeriod': 180
    }
  }
}
```
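A Lambda function configured as an alarm action receives an object like the ones above as its `event` argument. The following is a minimal Python sketch of a handler that extracts fields from the metric-alarm payload; the handler logic and the abbreviated sample event are illustrative only.

```python
# Sketch of a Lambda handler that reads the alarm event object. Field
# names follow the example payloads above; replace the return value with
# your own logic.
def lambda_handler(event, context):
    alarm = event["alarmData"]["alarmName"]
    new_state = event["alarmData"]["state"]["value"]
    old_state = event["alarmData"]["previousState"]["value"]
    return {"alarm": alarm, "transition": f"{old_state} -> {new_state}"}

# Abbreviated sample event for local testing
sample = {
    "source": "aws.cloudwatch",
    "alarmData": {
        "alarmName": "lambda-demo-metric-alarm",
        "state": {"value": "ALARM"},
        "previousState": {"value": "INSUFFICIENT_DATA"},
    },
}
print(lambda_handler(sample, None))
# {'alarm': 'lambda-demo-metric-alarm', 'transition': 'INSUFFICIENT_DATA -> ALARM'}
```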

# Start a CloudWatch investigation from an alarm
<a name="Start-Investigation-Alarm"></a>

Start a CloudWatch investigation from an alarm, or from any point in the last two weeks of a CloudWatch alarm's history.

For more information about CloudWatch investigations, see [CloudWatch investigations](Investigations.md).

## Prerequisites
<a name="w2aac19c25b7c17b7"></a>

Before you can start a CloudWatch investigation from a CloudWatch alarm, you must create a resource policy for the investigation group to allow the CloudWatch service principal to start the investigation. To do this using the AWS CLI, use a command similar to the following example:

```
aws aiops put-investigation-group-policy \
    --identifier arn:aws:aiops:us-east-1:111122223333:investigation-group/investigation_group_id \
    --policy "{\"Version\":\"2008-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"Service\":\"aiops.alarms.cloudwatch.amazonaws.com\"},\"Action\":[\"aiops:CreateInvestigation\",\"aiops:CreateInvestigationEvent\"],\"Resource\":\"*\",\"Condition\":{\"StringEquals\":{\"aws:SourceAccount\":\"111122223333\"},\"ArnLike\":{\"aws:SourceArn\":\"arn:aws:cloudwatch:us-east-1:111122223333:alarm:*\"}}}]}" \
    --region us-east-1
```

Replace the example values with your own AWS account ID, Region, and investigation group ID.

**Start an investigation from a CloudWatch alarm**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the left navigation pane, choose **Alarms**, **All alarms**.

1. Choose the name of the alarm.

1. Choose the time period in the alarm history that you want to investigate.

1. Choose **Investigate**, **Start new investigation**.

1. For **New investigation title**, enter a name for the investigation. Then choose **Start investigation**.

   The CloudWatch investigations assistant starts and scans your telemetry data to find data that might be associated with this situation.

1. In the CloudWatch console's navigation pane, choose **Investigations**, then choose the name of the investigation that you just started.

   The **Findings** section displays a natural-language summary of the alarm's status and the reason that it was triggered. 

1. (Optional) In the graph of the alarm, right-click and choose to deep-dive into the alarm or the metric that it watches.

1. On the right side of the screen, choose the **Suggestions** tab.

   A list of other telemetry that CloudWatch investigations has discovered and that might be relevant to the investigation appears. These findings can include other metrics and CloudWatch Logs Insights query results. CloudWatch investigations ran these queries based on the alarm.
   + For each finding, choose **Add to findings** or **Discard**. 

     When you choose **Add to findings**, the telemetry is added to the **Findings** section, and CloudWatch investigations uses this information to direct its further scanning and suggestions.
   + For a CloudWatch Logs Insights query result, to change or edit the query and re-run it, open the context (right-click) menu for the results, and then choose **Open in Logs Insights**. For more information, see [Analyzing log data with CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html).

     To run a different query, when you get to the Logs Insights page, choose to use query assist to form a query using natural language. For more information, see [Use natural language to generate and update CloudWatch Logs Insights queries](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CloudWatchLogs-Insights-Query-Assist.html).
   + (Optional) If you know of telemetry in another AWS service that might apply to this investigation, go to that service's console and add the telemetry to the investigation. 

1. CloudWatch investigations might also add hypotheses to the list in the **Suggestions** tab. These hypotheses are generated by the investigation in natural language.

   For each hypothesis, choose **Add to findings** or **Discard**.

1. When you think you have completed the investigation and found the root cause of the issue, choose the **Overview** tab and then choose **Investigation summary**. CloudWatch investigations then creates a natural-language summary of the important findings and hypotheses from the investigation.

# Stop, terminate, reboot, or recover an EC2 instance
<a name="UsingAlarmActions"></a>

Using Amazon CloudWatch alarm actions, you can create alarms that automatically stop, terminate, reboot, or recover your EC2 instances. You can use the stop or terminate actions to help you save money when you no longer need an instance to be running. You can use the reboot and recover actions to automatically reboot those instances or recover them onto new hardware if a system impairment occurs.

There are a number of scenarios in which you might want to automatically stop or terminate your instance. For example, you might have instances dedicated to batch payroll processing jobs or scientific computing tasks that run for a period of time and then complete their work. Rather than letting those instances sit idle (and accrue charges), you can stop or terminate them, which helps you to save money. The main difference between using the stop and the terminate alarm actions is that you can easily restart a stopped instance if you need to run it again later. You can also keep the same instance ID and root volume. However, you cannot restart a terminated instance. Instead, you must launch a new instance.

You can add the stop, terminate, or reboot actions to any alarm that is set on an Amazon EC2 per-instance metric, including basic and detailed monitoring metrics provided by Amazon CloudWatch (in the `AWS/EC2` namespace), in addition to any custom metrics that include the `InstanceId` dimension, as long as the `InstanceId` value refers to a valid running Amazon EC2 instance. You can also add the recover action to alarms that are set on any Amazon EC2 per-instance metric except for `StatusCheckFailed_Instance`.

**Important**  
Alarms configured on Amazon EC2 metrics can temporarily enter the `INSUFFICIENT_DATA` state if there are missing metric data points. This is rare, but can happen when the metric reporting is interrupted, even when the Amazon EC2 instance is healthy. For alarms on Amazon EC2 metrics that are configured to take stop, terminate, reboot, or recover actions, we recommend that you configure those alarms to treat missing data as `missing`, and to have these alarms trigger only when in the ALARM state.  
For more information about how you can configure CloudWatch to act on missing metrics that have alarms set on them, see [Configuring how CloudWatch alarms treat missing data](alarms-and-missing-data.md).

To set up a CloudWatch alarm action that can reboot, stop, or terminate an instance, you must use a service-linked IAM role, *AWSServiceRoleForCloudWatchEvents*. The AWSServiceRoleForCloudWatchEvents IAM role enables AWS to perform alarm actions on your behalf.

To create the service-linked role for CloudWatch Events, use the following command:

```
aws iam create-service-linked-role --aws-service-name events.amazonaws.com
```

**Console support**  
You can create alarms using the CloudWatch console or the Amazon EC2 console. The procedures in this documentation use the CloudWatch console. For procedures that use the Amazon EC2 console, see [Create alarms that stop, terminate, reboot, or recover an instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/UsingAlarmActions.html) in the *Amazon EC2 User Guide*.

**Permissions**  
If you are using an AWS Identity and Access Management (IAM) account to create or modify an alarm that performs EC2 actions or Systems Manager OpsItem actions, you must have the `iam:CreateServiceLinkedRole` permission.
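For example, an identity-based policy statement that grants this permission might look like the following. The resource ARN shown is illustrative of the service-linked role path; you can also scope the statement to `"Resource": "*"`.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "arn:aws:iam::*:role/aws-service-role/events.amazonaws.com/AWSServiceRoleForCloudWatchEvents*"
        }
    ]
}
```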

**Topics**
+ [Adding stop actions to Amazon CloudWatch alarms](#AddingStopActions)
+ [Adding terminate actions to Amazon CloudWatch alarms](#AddingTerminateActions)
+ [Adding reboot actions to Amazon CloudWatch alarms](#AddingRebootActions)
+ [Adding recover actions to Amazon CloudWatch alarms](#AddingRecoverActions)
+ [Viewing the history of triggered alarms and actions](#ViewAlarmHistory)

## Adding stop actions to Amazon CloudWatch alarms
<a name="AddingStopActions"></a>

You can create an alarm that stops an Amazon EC2 instance when a certain threshold is met. For example, you might run development or test instances and occasionally forget to shut them off. You can create an alarm that is triggered when the average CPU utilization has been lower than 10 percent for 24 hours, signaling that the instance is idle and no longer in use. You can adjust the threshold, duration, and period to suit your needs, and you can add an SNS notification so that you receive an email when the alarm is triggered.

Amazon EC2 instances that use an Amazon Elastic Block Store volume as the root device can be stopped or terminated, whereas instances that use the instance store as the root device can only be terminated.

**To create an alarm to stop an idle instance using the Amazon CloudWatch console**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, **All alarms**.

1. Choose **Create alarm**.

1. Choose **Select Metric**.

1. For **AWS namespaces**, choose **EC2**.

1. Do the following:

   1. Choose **Per-Instance Metrics**.

   1. Select the check box in the row with the correct instance and the **CPUUtilization** metric.

   1. Choose the **Graphed metrics** tab.

   1. For the statistic, choose **Average**.

   1. Choose a period (for example, **1 Hour**).

   1. Choose **Select metric**.

1. Under **Conditions**, do the following:

   1. Choose **Static**. 

   1. Under **Whenever CPUUtilization is**, choose **Lower**.

   1. For **than**, type **10**.

   1. Choose **Next**.

   1. Under **Notification**, for **Send notification to**, choose an existing SNS topic or create a new one.

      To create an SNS topic, choose **New list**. For **Send notification to**, type a name for the SNS topic (for example, Stop_EC2_Instance). For **Email list**, type a comma-separated list of email addresses to be notified when the alarm changes to the `ALARM` state. Each email address is sent a topic subscription confirmation email. You must confirm the subscription before notifications can be sent to an email address.

   1. Choose **Add EC2 Action**.

   1. For **Alarm state trigger**, choose **In alarm**. For **Take the following action**, choose **Stop this instance**.

   1. Choose **Next**.

   1. Enter a name and description for the alarm. The name must contain only ASCII characters. Then choose **Next**.

   1. Under **Preview and create**, confirm that the information and conditions are what you want, then choose **Create alarm**.
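The console steps above can also be expressed programmatically. The following sketch builds a parameter set in the shape of the CloudWatch `PutMetricAlarm` API (for example, to pass to the boto3 `put_metric_alarm` call); the instance ID, alarm name, and region shown are placeholders:

```python
def stop_idle_instance_alarm(instance_id, region="us-east-1"):
    """Build PutMetricAlarm parameters for an alarm that stops an idle instance.

    Mirrors the console procedure: average CPUUtilization below 10 percent
    for 24 one-hour periods, treating missing data as missing.
    """
    return {
        "AlarmName": f"StopIdleInstance-{instance_id}",  # illustrative name
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 3600,                 # one-hour periods
        "EvaluationPeriods": 24,        # idle for 24 hours
        "Threshold": 10.0,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "missing",  # recommended for EC2 action alarms
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:stop"],
    }
```

The `arn:aws:automate:region:ec2:stop` action ARN is the non-console equivalent of choosing **Stop this instance**; `terminate`, `reboot`, and `recover` follow the same pattern.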

## Adding terminate actions to Amazon CloudWatch alarms
<a name="AddingTerminateActions"></a>

You can create an alarm that terminates an EC2 instance automatically when a certain threshold has been met (as long as termination protection is not enabled for the instance). For example, you might want to terminate an instance when it has completed its work, and you don't need the instance again. If you might want to use the instance later, you should stop the instance instead of terminating it. For information about enabling and disabling termination protection for an instance, see [Enabling Termination Protection for an Instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_ChangingDisableAPITermination.html) in the *Amazon EC2 User Guide*.

**To create an alarm to terminate an idle instance using the Amazon CloudWatch console**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, **Create Alarm**.

1. For the **Select Metric** step, do the following:

   1. Under **EC2 Metrics**, choose **Per-Instance Metrics**.

   1. Select the row with the instance and the **CPUUtilization** metric.

   1. For the statistic, choose **Average**.

   1. Choose a period (for example, **1 Hour**).

   1. Choose **Next**.

1. For the **Define Alarm** step, do the following:

   1. Under **Alarm Threshold**, type a unique name for the alarm (for example, Terminate EC2 instance) and a description of the alarm (for example, Terminate EC2 instance when CPU is idle for too long). Alarm names must contain only ASCII characters.

   1. Under **Whenever**, for **is**, choose **<** and type **10**. For **for**, type **24** consecutive periods.

      A graphical representation of the threshold is shown under **Alarm Preview**.

   1. Under **Notification**, for **Send notification to**, choose an existing SNS topic or create a new one.

      To create an SNS topic, choose **New list**. For **Send notification to**, type a name for the SNS topic (for example, Terminate_EC2_Instance). For **Email list**, type a comma-separated list of email addresses to be notified when the alarm changes to the `ALARM` state. Each email address is sent a topic subscription confirmation email. You must confirm the subscription before notifications can be sent to an email address.

   1. Choose **EC2 Action**.

   1. For **Whenever this alarm**, choose **State is ALARM**. For **Take this action**, choose **Terminate this instance**.

   1. Choose **Create Alarm**.

## Adding reboot actions to Amazon CloudWatch alarms
<a name="AddingRebootActions"></a>

You can create an Amazon CloudWatch alarm that monitors an Amazon EC2 instance and automatically reboots the instance. The reboot alarm action is recommended for Instance Health Check failures (as opposed to the recover alarm action, which is suited for System Health Check failures). An instance reboot is equivalent to an operating system reboot. In most cases, it takes only a few minutes to reboot your instance. When you reboot an instance, it remains on the same physical host, so your instance keeps its public DNS name, private IP address, and any data on its instance store volumes.

Rebooting an instance doesn't start a new instance billing hour, unlike stopping and restarting your instance. For more information about rebooting an instance, see [Reboot Your Instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-reboot.html) in the *Amazon EC2 User Guide*.

**Important**  
To avoid a race condition between the reboot and recover actions, avoid setting the same evaluation period for both a reboot alarm and a recover alarm. We recommend that you set reboot alarms to three evaluation periods of one minute each. 

**To create an alarm to reboot an instance using the Amazon CloudWatch console**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, **Create Alarm**.

1. For the **Select Metric** step, do the following:

   1. Under **EC2 Metrics**, choose **Per-Instance Metrics**.

   1. Select the row with the instance and the **StatusCheckFailed_Instance** metric.

   1. For the statistic, choose **Minimum**.

   1. Choose a period (for example, **1 Minute**).

   1. Choose **Next**.

1. For the **Define Alarm** step, do the following:

   1. Under **Alarm Threshold**, type a unique name for the alarm (for example, Reboot EC2 instance) and a description of the alarm (for example, Reboot EC2 instance when health checks fail). Alarm names must contain only ASCII characters.

   1. Under **Whenever**, for **is**, choose **>** and type **0**. For **for**, type **3** consecutive periods.

      A graphical representation of the threshold is shown under **Alarm Preview**.

   1. Under **Notification**, for **Send notification to**, choose an existing SNS topic or create a new one.

      To create an SNS topic, choose **New list**. For **Send notification to**, type a name for the SNS topic (for example, Reboot_EC2_Instance). For **Email list**, type a comma-separated list of email addresses to be notified when the alarm changes to the `ALARM` state. Each email address is sent a topic subscription confirmation email. You must confirm the subscription before notifications can be sent to an email address.

   1. Choose **EC2 Action**.

   1. For **Whenever this alarm**, choose **State is ALARM**. For **Take this action**, choose **Reboot this instance**.

   1. Choose **Create Alarm**.

## Adding recover actions to Amazon CloudWatch alarms
<a name="AddingRecoverActions"></a>

You can create an Amazon CloudWatch alarm that monitors an Amazon EC2 instance and automatically recovers the instance if it becomes impaired due to an underlying hardware failure or a problem that requires AWS involvement to repair. Terminated instances cannot be recovered. A recovered instance is identical to the original instance, including the instance ID, private IP addresses, Elastic IP addresses, and all instance metadata.

When the `StatusCheckFailed_System` alarm is triggered and the recover action is initiated, you are notified by the Amazon SNS topic that you chose when you created the alarm and associated the recover action. During instance recovery, the instance is migrated during an instance reboot, and any in-memory data is lost. When the process is complete, information is published to the SNS topic that you configured for the alarm. Anyone who is subscribed to this SNS topic receives an email notification that includes the status of the recovery attempt and any further instructions. You will notice an instance reboot on the recovered instance.

The recover action can be used only with `StatusCheckFailed_System`, not with `StatusCheckFailed_Instance`.

Examples of problems that cause system status checks to fail include:
+ Loss of network connectivity
+ Loss of system power
+ Software issues on the physical host
+ Hardware issues on the physical host that impact network reachability

The recover action is supported only on some instance types. For more information about supported instance types and other requirements, see [Recover your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html) and [Requirements](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-recover.html#requirements-for-recovery).

**Important**  
To avoid a race condition between the reboot and recover actions, avoid setting the same evaluation period for both a reboot alarm and a recover alarm. We recommend that you set recover alarms to two evaluation periods of one minute each and reboot alarms to three evaluation periods of one minute each.
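The recommended period split above can be captured in a single sketch that builds `PutMetricAlarm`-shaped parameters for both health-check alarms. The function name and alarm naming are illustrative; the metric names, statistics, and action ARNs follow this guide:

```python
def health_check_alarm(instance_id, action, region="us-east-1"):
    """Sketch of PutMetricAlarm parameters for reboot/recover alarms.

    Uses the recommended settings from this guide: reboot on
    StatusCheckFailed_Instance after three one-minute periods, recover on
    StatusCheckFailed_System after two, so the two alarms cannot race.
    """
    metric, periods = {
        "reboot": ("StatusCheckFailed_Instance", 3),
        "recover": ("StatusCheckFailed_System", 2),
    }[action]
    return {
        "AlarmName": f"{action.capitalize()}-{instance_id}",  # illustrative
        "Namespace": "AWS/EC2",
        "MetricName": metric,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": periods,
        "DatapointsToAlarm": periods,
        "Threshold": 0.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:{action}"],
    }
```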

**To create an alarm to recover an instance using the Amazon CloudWatch console**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, **All alarms**. 

1. Choose **Create Alarm**.

1. Choose **Select Metric** and then do the following:

   1. Choose **EC2 Metrics**, **Per-Instance Metrics**.

   1. Select the row with the instance and the **StatusCheckFailed_System** metric, and then choose **Select metric**.

   1. For the statistic, choose **Minimum**.

   1. Choose a period (for example, **1 Minute**).

      **Important**  
      To avoid a race condition between the reboot and recover actions, avoid setting the same evaluation period for both a reboot alarm and a recover alarm. We recommend that you set recover alarms to two evaluation periods of one minute each.

1. For **Conditions**, do the following:

   1. Under **Threshold type**, choose **Static**.

   1. Under **Whenever**, choose **Greater** and enter **0** for **than...**.

   1. Choose **Additional configuration**, then for **Datapoints to alarm** specify 2 **out of** 2.

1. Choose **Next**.

1. Under **Notification**, do the following:

   1. For **Alarm state trigger**, choose **In alarm**.

   1. For **Send notification to the following SNS topic**, choose an existing SNS topic or create a new one.

   1. Choose **Add EC2 Action**.

   1. For **Alarm state trigger**, choose **In alarm**.

   1. For **Take the following action**, choose **Recover this instance**.

   1. Choose **Next**.

1. For **Alarm name**, type a unique name for the alarm (for example, **Recover EC2 instance**) and a description of the alarm (for example, **Recover EC2 instance when health checks fail**). Alarm names must contain only ASCII characters.

1. Choose **Next**.

1. Choose **Create Alarm**.

## Viewing the history of triggered alarms and actions
<a name="ViewAlarmHistory"></a>

You can view alarm and action history in the Amazon CloudWatch console. Amazon CloudWatch keeps the last 30 days of alarm and action history.

**To view the history of triggered alarms and actions**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms** and select an alarm.

1. To view the most recent state transition along with the time and metric values, choose **Details**.

1. To view the most recent history entries, choose **History**.
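The same history is available programmatically from the `DescribeAlarmHistory` API (for example, via the `aws cloudwatch describe-alarm-history` CLI command). The following sketch filters returned items by type; the sample items are illustrative of the response shape:

```python
def filter_history(items, item_type):
    """Return alarm history items of one type.

    Valid HistoryItemType values are ConfigurationUpdate, StateUpdate,
    and Action.
    """
    return [i for i in items if i.get("HistoryItemType") == item_type]

# Illustrative items shaped like DescribeAlarmHistory output
sample = [
    {"AlarmName": "ServerCpuTooHigh", "HistoryItemType": "StateUpdate",
     "HistorySummary": "Alarm updated from OK to ALARM"},
    {"AlarmName": "ServerCpuTooHigh", "HistoryItemType": "Action",
     "HistorySummary": "Successfully executed action"},
]
```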

# Alarm events and EventBridge
<a name="cloudwatch-and-eventbridge"></a>

CloudWatch sends events to Amazon EventBridge whenever a CloudWatch alarm is created, updated, deleted, or changes alarm state. You can use EventBridge and these events to write rules that take actions, such as notifying you, when an alarm changes state. For more information, see [What is Amazon EventBridge?](https://docs.aws.amazon.com/eventbridge/latest/userguide/what-is-amazon-eventbridge.html)

CloudWatch guarantees the delivery of alarm state change events to EventBridge.
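For example, an EventBridge rule that matches every alarm state change event uses an event pattern like the following; you can narrow it by adding an `alarmName` filter under a `detail` key if you only care about specific alarms:

```json
{
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"]
}
```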

## Alarm State Change Events
<a name="CloudWatch-state-change-events"></a>

This section shows example events sent to EventBridge when an alarm's state changes. Select a tab to view different types of alarm state change events.

------
#### [ Single Metric Alarm ]

Events generated when a single metric alarm changes state. These events include `state` and `previousState` fields with the alarm's evaluation results.

```
{
    "version": "0",
    "id": "c4c1c1c9-6542-e61b-6ef0-8c4d36933a92",
    "detail-type": "CloudWatch Alarm State Change",
    "source": "aws.cloudwatch",
    "account": "123456789012",
    "time": "2019-10-02T17:04:40Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ServerCpuTooHigh"
    ],
    "detail": {
        "alarmName": "ServerCpuTooHigh",
        "configuration": {
            "description": "Goes into alarm when server CPU utilization is too high!",
            "metrics": [
                {
                    "id": "30b6c6b2-a864-43a2-4877-c09a1afc3b87",
                    "metricStat": {
                        "metric": {
                            "dimensions": {
                                "InstanceId": "i-12345678901234567"
                            },
                            "name": "CPUUtilization",
                            "namespace": "AWS/EC2"
                        },
                        "period": 300,
                        "stat": "Average"
                    },
                    "returnData": true
                }
            ]
        },
        "previousState": {
            "reason": "Threshold Crossed: 1 out of the last 1 datapoints [0.0666851903306472 (01/10/19 13:46:00)] was not greater than the threshold (50.0) (minimum 1 datapoint for ALARM -> OK transition).",
            "reasonData": "{\"version\":\"1.0\",\"queryDate\":\"2019-10-01T13:56:40.985+0000\",\"startDate\":\"2019-10-01T13:46:00.000+0000\",\"statistic\":\"Average\",\"period\":300,\"recentDatapoints\":[0.0666851903306472],\"threshold\":50.0}",
            "timestamp": "2019-10-01T13:56:40.987+0000",
            "value": "OK"
        },
        "state": {
            "reason": "Threshold Crossed: 1 out of the last 1 datapoints [99.50160229693434 (02/10/19 16:59:00)] was greater than the threshold (50.0) (minimum 1 datapoint for OK -> ALARM transition).",
            "reasonData": "{\"version\":\"1.0\",\"queryDate\":\"2019-10-02T17:04:40.985+0000\",\"startDate\":\"2019-10-02T16:59:00.000+0000\",\"statistic\":\"Average\",\"period\":300,\"recentDatapoints\":[99.50160229693434],\"threshold\":50.0}",
            "timestamp": "2019-10-02T17:04:40.989+0000",
            "value": "ALARM"
        },
        "muteDetail": {
            "mutedByArn": "arn:aws:cloudwatch:us-east-1:1234567890:alarm-mute-rule:testMute",
            "muteWindowStart": "2026-01-01T10:00:00.000+0000",
            "muteWindowEnd": "2026-01-01T12:00:00.000+0000"
        }
    }
}
```
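An EventBridge target (such as a Lambda function) can read the transition directly from the `detail` field of events like the one above. A minimal sketch, with an illustrative function name:

```python
def summarize_transition(event):
    """Summarize a CloudWatch Alarm State Change event as 'name: OLD -> NEW'."""
    detail = event["detail"]
    return "{}: {} -> {}".format(
        detail["alarmName"],
        detail["previousState"]["value"],
        detail["state"]["value"],
    )
```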

------
#### [ Metric Math Alarm ]

Events generated when a metric math alarm changes state. These events include the math expression details in the `configuration` field.

```
{
    "version": "0",
    "id": "2dde0eb1-528b-d2d5-9ca6-6d590caf2329",
    "detail-type": "CloudWatch Alarm State Change",
    "source": "aws.cloudwatch",
    "account": "123456789012",
    "time": "2019-10-02T17:20:48Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:TotalNetworkTrafficTooHigh"
    ],
    "detail": {
        "alarmName": "TotalNetworkTrafficTooHigh",
        "configuration": {
            "description": "Goes into alarm if total network traffic exceeds 10Kb",
            "metrics": [
                {
                    "expression": "SUM(METRICS())",
                    "id": "e1",
                    "label": "Total Network Traffic",
                    "returnData": true
                },
                {
                    "id": "m1",
                    "metricStat": {
                        "metric": {
                            "dimensions": {
                                "InstanceId": "i-12345678901234567"
                            },
                            "name": "NetworkIn",
                            "namespace": "AWS/EC2"
                        },
                        "period": 300,
                        "stat": "Maximum"
                    },
                    "returnData": false
                },
                {
                    "id": "m2",
                    "metricStat": {
                        "metric": {
                            "dimensions": {
                                "InstanceId": "i-12345678901234567"
                            },
                            "name": "NetworkOut",
                            "namespace": "AWS/EC2"
                        },
                        "period": 300,
                        "stat": "Maximum"
                    },
                    "returnData": false
                }
            ]
        },
        "previousState": {
            "reason": "Unchecked: Initial alarm creation",
            "timestamp": "2019-10-02T17:20:03.642+0000",
            "value": "INSUFFICIENT_DATA"
        },
        "state": {
            "reason": "Threshold Crossed: 1 out of the last 1 datapoints [45628.0 (02/10/19 17:10:00)] was greater than the threshold (10000.0) (minimum 1 datapoint for OK -> ALARM transition).",
            "reasonData": "{\"version\":\"1.0\",\"queryDate\":\"2019-10-02T17:20:48.551+0000\",\"startDate\":\"2019-10-02T17:10:00.000+0000\",\"period\":300,\"recentDatapoints\":[45628.0],\"threshold\":10000.0}",
            "timestamp": "2019-10-02T17:20:48.554+0000",
            "value": "ALARM"
        },
        "muteDetail": {
            "mutedByArn": "arn:aws:cloudwatch:us-east-1:1234567890:alarm-mute-rule:testMute",
            "muteWindowStart": "2026-01-01T10:00:00.000+0000",
            "muteWindowEnd": "2026-01-01T12:00:00.000+0000"
        }
    }
}
```

------
#### [ Anomaly Detection Alarm ]

Events generated when an anomaly detection alarm changes state. These events include upper and lower threshold bounds in the `reasonData` field.

```
{
    "version": "0",
    "id": "daafc9f1-bddd-c6c9-83af-74971fcfc4ef",
    "detail-type": "CloudWatch Alarm State Change",
    "source": "aws.cloudwatch",
    "account": "123456789012",
    "time": "2019-10-03T16:00:04Z",
    "region": "us-east-1",
    "resources": ["arn:aws:cloudwatch:us-east-1:123456789012:alarm:EC2 CPU Utilization Anomaly"],
    "detail": {
        "alarmName": "EC2 CPU Utilization Anomaly",
        "state": {
            "value": "ALARM",
            "reason": "Thresholds Crossed: 1 out of the last 1 datapoints [0.0 (03/10/19 15:58:00)] was less than the lower thresholds [0.020599444741798756] or greater than the upper thresholds [0.3006915352732461] (minimum 1 datapoint for OK -> ALARM transition).",
            "reasonData": "{\"version\":\"1.0\",\"queryDate\":\"2019-10-03T16:00:04.650+0000\",\"startDate\":\"2019-10-03T15:58:00.000+0000\",\"period\":60,\"recentDatapoints\":[0.0],\"recentLowerThresholds\":[0.020599444741798756],\"recentUpperThresholds\":[0.3006915352732461]}",
            "timestamp": "2019-10-03T16:00:04.653+0000"
        },
        "previousState": {
            "value": "OK",
            "reason": "Thresholds Crossed: 1 out of the last 1 datapoints [0.166666666664241 (03/10/19 15:57:00)] was not less than the lower thresholds [0.0206719426210418] or not greater than the upper thresholds [0.30076870222143803] (minimum 1 datapoint for ALARM -> OK transition).",
            "reasonData": "{\"version\":\"1.0\",\"queryDate\":\"2019-10-03T15:59:04.670+0000\",\"startDate\":\"2019-10-03T15:57:00.000+0000\",\"period\":60,\"recentDatapoints\":[0.166666666664241],\"recentLowerThresholds\":[0.0206719426210418],\"recentUpperThresholds\":[0.30076870222143803]}",
            "timestamp": "2019-10-03T15:59:04.672+0000"
        },
        "muteDetail": {
            "mutedByArn": "arn:aws:cloudwatch:us-east-1:1234567890:alarm-mute-rule:testMute",
            "muteWindowStart": "2026-01-01T10:00:00.000+0000",
            "muteWindowEnd": "2026-01-01T12:00:00.000+0000"
        },
        "configuration": {
            "description": "Goes into alarm if CPU Utilization is out of band",
            "metrics": [{
                "id": "m1",
                "metricStat": {
                    "metric": {
                        "namespace": "AWS/EC2",
                        "name": "CPUUtilization",
                        "dimensions": {
                            "InstanceId": "i-12345678901234567"
                        }
                    },
                    "period": 60,
                    "stat": "Average"
                },
                "returnData": true
            }, {
                "id": "ad1",
                "expression": "ANOMALY_DETECTION_BAND(m1, 0.8)",
                "label": "CPUUtilization (expected)",
                "returnData": true
            }]
        }
    }
}
```

------
#### [ Composite Alarm ]

Events generated when a composite alarm changes state. These events include suppression information in the `actionsSuppressedBy` and `actionsSuppressedReason` fields.

```
{
    "version": "0",
    "id": "d3dfc86d-384d-24c8-0345-9f7986db0b80",
    "detail-type": "CloudWatch Alarm State Change",
    "source": "aws.cloudwatch",
    "account": "123456789012",
    "time": "2022-07-22T15:57:45Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ServiceAggregatedAlarm"
    ],
    "detail": {
        "alarmName": "ServiceAggregatedAlarm",
        "state": {
            "actionsSuppressedBy": "WaitPeriod",
            "actionsSuppressedReason": "Actions suppressed by WaitPeriod",
            "value": "ALARM",
            "reason": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:SuppressionDemo.EventBridge.FirstChild transitioned to ALARM at Friday 22 July, 2022 15:57:45 UTC",
            "reasonData": "{\"triggeringAlarms\":[{\"arn\":\"arn:aws:cloudwatch:us-east-1:123456789012:alarm:ServerCpuTooHigh\",\"state\":{\"value\":\"ALARM\",\"timestamp\":\"2022-07-22T15:57:45.394+0000\"}}]}",
            "timestamp": "2022-07-22T15:57:45.394+0000"
        },
        "previousState": {
            "value": "OK",
            "reason": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:SuppressionDemo.EventBridge.Main was created and its alarm rule evaluates to OK",
            "reasonData": "{\"triggeringAlarms\":[{\"arn\":\"arn:aws:cloudwatch:us-east-1:123456789012:alarm:TotalNetworkTrafficTooHigh\",\"state\":{\"value\":\"OK\",\"timestamp\":\"2022-07-14T16:28:57.770+0000\"}},{\"arn\":\"arn:aws:cloudwatch:us-east-1:123456789012:alarm:ServerCpuTooHigh\",\"state\":{\"value\":\"OK\",\"timestamp\":\"2022-07-14T16:28:54.191+0000\"}}]}",
            "timestamp": "2022-07-22T15:56:14.552+0000"
        },
        "configuration": {
            "alarmRule": "ALARM(ServerCpuTooHigh) OR ALARM(TotalNetworkTrafficTooHigh)",
            "actionsSuppressor": "ServiceMaintenanceAlarm",
            "actionsSuppressorWaitPeriod": 120,
            "actionsSuppressorExtensionPeriod": 180
        },
        "muteDetail": {
            "mutedByArn": "arn:aws:cloudwatch:us-east-1:1234567890:alarm-mute-rule:testMute",
            "muteWindowStart": "2026-01-01T10:00:00.000+0000",
            "muteWindowEnd": "2026-01-01T12:00:00.000+0000"
        }
    }
}
```

------
#### [ Multi Time Series Alarm ]

Events generated when an alarm contributor or an alarm changes state. Alarm contributor state change events contain the ID and attributes of the alarm contributor, as well as the most recent data point that breached the threshold. Alarm state change events include, in their state reason, a summary of the number of contributors that caused the alarm to transition.

**Alarm Contributor Example**

```
{
  "version": "0",
  "id": "6d226bbc-07f0-9a31-3359-1736968f8ded",
  "detail-type": "CloudWatch Alarm Contributor State Change",
  "source": "aws.cloudwatch",
  "account": "123456789012",
  "time": "2025-12-01T13:42:04Z",
  "region": "us-east-1",
  "resources": [
    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:DynamoDBInsightsAlarm"
  ],
  "detail": {
    "alarmName": "DynamoDBInsightsAlarm",
    "alarmContributor": {
      "id": "6d442278dba546f6",
      "attributes": {
        "TableName": "example-dynamodb-table-name"
      }
    },
    "state": {
      "value": "ALARM",
      "reason": "Threshold Crossed: 1 datapoint was less than the threshold (1.0). The most recent datapoint which crossed the threshold: [0.0 (01/12/25 13:34:00)].",
      "timestamp": "2025-12-01T13:42:04.919+0000"
    },
    "configuration": {
      "metrics": [
        {
          "id": "m1",
          "expression": "SELECT AVG(ConsumedWriteCapacityUnits) FROM \"AWS/DynamoDB\" GROUP BY TableName ORDER BY MAX() DESC",
          "returnData":true,
          "period": 60
        }
      ],
      "description": "Metrics Insights alarm for DynamoDB ConsumedWriteCapacity per TableName"
    },
    "muteDetail": {
        "mutedByArn": "arn:aws:cloudwatch:us-east-1:1234567890:alarm-mute-rule:testMute",
        "muteWindowStart": "2026-01-01T10:00:00.000+0000",
        "muteWindowEnd": "2026-01-01T12:00:00.000+0000"
    }
  }
}
```

**Alarm Example**

```
{
  "version": "0",
  "id": "80ddd249-dedf-7c4d-0708-0eb78132dd78",
  "detail-type": "CloudWatch Alarm State Change",
  "source": "aws.cloudwatch",
  "account": "123456789012",
  "time": "2025-12-01T13:42:04Z",
  "region": "us-east-1",
  "resources": [
    "arn:aws:cloudwatch:us-east-1:123456789012:alarm:DynamoDBInsightsAlarm"
  ],
  "detail": {
    "alarmName": "DynamoDBInsightsAlarm",
    "state": {
      "value": "ALARM",
      "reason": "6 out of 6 time series evaluated to ALARM",
      "timestamp": "2025-12-01T13:42:04.919+0000"
    },
    "previousState": {
      "value": "INSUFFICIENT_DATA",
      "reason": "Unchecked: Initial alarm creation",
      "timestamp": "2025-12-01T13:40:50.600+0000"
    },
    "configuration": {
      "metrics": [
        {
          "id": "m1",
          "expression": "SELECT AVG(ConsumedWriteCapacityUnits) FROM \"AWS/DynamoDB\" GROUP BY TableName ORDER BY MAX() DESC",
          "returnData": true,
          "period": 60
        }
      ],
      "description": "Metrics Insights alarm for DynamoDB ConsumedWriteCapacity per TableName"
    }
  }
}
```
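Because contributor events and alarm events share the same `source`, a consumer can route them by `detail-type`. A minimal sketch:

```python
def route_alarm_event(event):
    """Classify a CloudWatch event as 'contributor', 'alarm', or 'other'."""
    detail_type = event.get("detail-type", "")
    if detail_type == "CloudWatch Alarm Contributor State Change":
        return "contributor"
    if detail_type == "CloudWatch Alarm State Change":
        return "alarm"
    return "other"
```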

------

## Alarm Configuration Change Events
<a name="CloudWatch-config-change-events"></a>

This section shows example events sent to EventBridge when an alarm's configuration changes. Configuration changes include creating, updating, or deleting alarms.

------
#### [ Creation Events ]

Events generated when new alarms are created. These events include the initial alarm configuration in the `configuration` field, with `operation` set to "create".

**Composite Alarm Example**

```
{
    "version": "0",
    "id": "91535fdd-1e9c-849d-624b-9a9f2b1d09d0",
    "detail-type": "CloudWatch Alarm Configuration Change",
    "source": "aws.cloudwatch",
    "account": "123456789012",
    "time": "2022-03-03T17:06:22Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ServiceAggregatedAlarm"
    ],
    "detail": {
        "alarmName": "ServiceAggregatedAlarm",
        "operation": "create",
        "state": {
            "value": "INSUFFICIENT_DATA",
            "timestamp": "2022-03-03T17:06:22.289+0000"
        },
        "configuration": {
            "alarmRule": "ALARM(ServerCpuTooHigh) OR ALARM(TotalNetworkTrafficTooHigh)",
            "alarmName": "ServiceAggregatedAlarm",
            "description": "Aggregated monitor for instance",
            "actionsEnabled": true,
            "timestamp": "2022-03-03T17:06:22.289+0000",
            "okActions": [],
            "alarmActions": [],
            "insufficientDataActions": []
        }
    }
}
```

**Composite Alarm with Suppressor Example**

```
{
    "version": "0",
    "id": "454773e1-09f7-945b-aa2c-590af1c3f8e0",
    "detail-type": "CloudWatch Alarm Configuration Change",
    "source": "aws.cloudwatch",
    "account": "123456789012",
    "time": "2022-07-14T13:59:46Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ServiceAggregatedAlarm"
    ],
    "detail": {
        "alarmName": "ServiceAggregatedAlarm",
        "operation": "create",
        "state": {
            "value": "INSUFFICIENT_DATA",
            "timestamp": "2022-07-14T13:59:46.425+0000"
        },
        "configuration": {
            "alarmRule": "ALARM(ServerCpuTooHigh) OR ALARM(TotalNetworkTrafficTooHigh)",
            "actionsSuppressor": "ServiceMaintenanceAlarm",
            "actionsSuppressorWaitPeriod": 120,
            "actionsSuppressorExtensionPeriod": 180,
            "alarmName": "ServiceAggregatedAlarm",
            "actionsEnabled": true,
            "timestamp": "2022-07-14T13:59:46.425+0000",
            "okActions": [],
            "alarmActions": [],
            "insufficientDataActions": []
        }
    }
}
```

------
#### [ Update Events ]

Events generated when existing alarms are modified. These events contain both `configuration` and `previousConfiguration` fields to show what changed.

**Metric Alarm Example**

```
{
    "version": "0",
    "id": "bc7d3391-47f8-ae47-f457-1b4d06118d50",
    "detail-type": "CloudWatch Alarm Configuration Change",
    "source": "aws.cloudwatch",
    "account": "123456789012",
    "time": "2022-03-03T17:06:34Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ServerCpuTooHigh"
    ],
    "detail": {
        "alarmName": "ServerCpuTooHigh",
        "operation": "update",
        "state": {
            "value": "INSUFFICIENT_DATA",
            "timestamp": "2022-03-03T17:06:13.757+0000"
        },
        "configuration": {
            "evaluationPeriods": 1,
            "threshold": 80,
            "comparisonOperator": "GreaterThanThreshold",
            "treatMissingData": "ignore",
            "metrics": [
                {
                    "id": "86bfa85f-b14c-ebf7-8916-7da014ce23c0",
                    "metricStat": {
                        "metric": {
                            "namespace": "AWS/EC2",
                            "name": "CPUUtilization",
                            "dimensions": {
                                "InstanceId": "i-12345678901234567"
                            }
                        },
                        "period": 300,
                        "stat": "Average"
                    },
                    "returnData": true
                }
            ],
            "alarmName": "ServerCpuTooHigh",
            "description": "Goes into alarm when server CPU utilization is too high!",
            "actionsEnabled": true,
            "timestamp": "2022-03-03T17:06:34.267+0000",
            "okActions": [],
            "alarmActions": [],
            "insufficientDataActions": []
        },
        "previousConfiguration": {
            "evaluationPeriods": 1,
            "threshold": 70,
            "comparisonOperator": "GreaterThanThreshold",
            "treatMissingData": "ignore",
            "metrics": [
                {
                    "id": "d6bfa85f-893e-b052-a58b-4f9295c9111a",
                    "metricStat": {
                        "metric": {
                            "namespace": "AWS/EC2",
                            "name": "CPUUtilization",
                            "dimensions": {
                                "InstanceId": "i-12345678901234567"
                            }
                        },
                        "period": 300,
                        "stat": "Average"
                    },
                    "returnData": true
                }
            ],
            "alarmName": "ServerCpuTooHigh",
            "description": "Goes into alarm when server CPU utilization is too high!",
            "actionsEnabled": true,
            "timestamp": "2022-03-03T17:06:13.757+0000",
            "okActions": [],
            "alarmActions": [],
            "insufficientDataActions": []
        }
    }
}
```
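An EventBridge target, such as a Lambda function, can compare `previousConfiguration` with `configuration` to determine exactly what changed. The following is a minimal sketch in Python; the event shape follows the example above, and `diff_alarm_config` is an illustrative helper, not part of any AWS SDK:

```
def diff_alarm_config(detail):
    """Return {field: (old, new)} for top-level configuration fields that changed."""
    new = detail.get("configuration", {})
    old = detail.get("previousConfiguration", {})
    return {
        key: (old.get(key), new.get(key))
        for key in set(old) | set(new)
        if old.get(key) != new.get(key)
    }

# Applied to the update event above, the diff surfaces the threshold change:
event_detail = {
    "configuration": {"threshold": 80, "evaluationPeriods": 1},
    "previousConfiguration": {"threshold": 70, "evaluationPeriods": 1},
}
print(diff_alarm_config(event_detail))  # {'threshold': (70, 80)}
```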

**Metric Alarm with Suppressor Example**

```
{
    "version": "0",
    "id": "4c6f4177-6bd5-c0ca-9f05-b4151c54568b",
    "detail-type": "CloudWatch Alarm Configuration Change",
    "source": "aws.cloudwatch",
    "account": "123456789012",
    "time": "2022-07-14T13:59:56Z",
    "region": "us-east-1",
    "resources": [
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ServiceAggregatedAlarm"
    ],
    "detail": {
        "alarmName": "ServiceAggregatedAlarm",
        "operation": "update",
        "state": {
            "actionsSuppressedBy": "WaitPeriod",
            "value": "ALARM",
            "timestamp": "2022-07-14T13:59:46.425+0000"
        },
        "configuration": {
            "alarmRule": "ALARM(ServerCpuTooHigh) OR ALARM(TotalNetworkTrafficTooHigh)",
            "actionsSuppressor": "ServiceMaintenanceAlarm",
            "actionsSuppressorWaitPeriod": 120,
            "actionsSuppressorExtensionPeriod": 360,
            "alarmName": "ServiceAggregatedAlarm",
            "actionsEnabled": true,
            "timestamp": "2022-07-14T13:59:56.290+0000",
            "okActions": [],
            "alarmActions": [], Remove 
            "insufficientDataActions": []
        },
        "previousConfiguration": {
            "alarmRule": "ALARM(ServerCpuTooHigh) OR ALARM(TotalNetworkTrafficTooHigh)",
            "actionsSuppressor": "ServiceMaintenanceAlarm",
            "actionsSuppressorWaitPeriod": 120,
            "actionsSuppressorExtensionPeriod": 180,
            "alarmName": "ServiceAggregatedAlarm",
            "actionsEnabled": true,
            "timestamp": "2022-07-14T13:59:46.425+0000",
            "okActions": [],
            "alarmActions": [],
            "insufficientDataActions": []
        }
    }
}
```

------
#### [ Deletion Events ]

Events generated when alarms are deleted. These events include the final alarm configuration and set `operation` to "delete".

**Metric Math Alarm Example**

```
{
    "version": "0",
    "id": "f171d220-9e1c-c252-5042-2677347a83ed",
    "detail-type": "CloudWatch Alarm Configuration Change",
    "source": "aws.cloudwatch",
    "account": "123456789012",
    "time": "2022-03-03T17:07:13Z",
    "region": "us-east-*",
    "resources": [
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:TotalNetworkTrafficTooHigh"
    ],
    "detail": {
        "alarmName": "TotalNetworkTrafficTooHigh",
        "operation": "delete",
        "state": {
            "value": "INSUFFICIENT_DATA",
            "timestamp": "2022-03-03T17:06:17.672+0000"
        },
        "configuration": {
            "evaluationPeriods": 1,
            "threshold": 10000,
            "comparisonOperator": "GreaterThanThreshold",
            "treatMissingData": "ignore",
            "metrics": [{
                    "id": "m1",
                    "metricStat": {
                        "metric": {
                            "namespace": "AWS/EC2",
                            "name": "NetworkIn",
                            "dimensions": {
                                "InstanceId": "i-12345678901234567"
                            }
                        },
                        "period": 300,
                        "stat": "Maximum"
                    },
                    "returnData": false
                },
                {
                    "id": "m2",
                    "metricStat": {
                        "metric": {
                            "namespace": "AWS/EC2",
                            "name": "NetworkOut",
                            "dimensions": {
                                "InstanceId": "i-12345678901234567"
                            }
                        },
                        "period": 300,
                        "stat": "Maximum"
                    },
                    "returnData": false
                },
                {
                    "id": "e1",
                    "expression": "SUM(METRICS())",
                    "label": "Total Network Traffic",
                    "returnData": true
                }
            ],
            "alarmName": "TotalNetworkTrafficTooHigh",
            "description": "Goes into alarm if total network traffic exceeds 10Kb",
            "actionsEnabled": true,
            "timestamp": "2022-03-03T17:06:17.672+0000",
            "okActions": [],
            "alarmActions": [],
            "insufficientDataActions": []
        }
    }
}
```

**Metric Math Alarm with Suppressor Example**

```
{
    "version": "0",
    "id": "e34592a1-46c0-b316-f614-1b17a87be9dc",
    "detail-type": "CloudWatch Alarm Configuration Change",
    "source": "aws.cloudwatch",
    "account": "123456789012",
    "time": "2022-07-14T14:00:01Z",
    "region": "us-east-*",
    "resources": [
        "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ServiceAggregatedAlarm"
    ],
    "detail": {
        "alarmName": "ServiceAggregatedAlarm",
        "operation": "delete",
        "state": {
            "actionsSuppressedBy": "WaitPeriod",
            "value": "ALARM",
            "timestamp": "2022-07-14T13:59:46.425+0000"
        },
        "configuration": {
            "alarmRule": "ALARM(ServerCpuTooHigh) OR ALARM(TotalNetworkTrafficTooHigh)",
            "actionsSuppressor": "ServiceMaintenanceAlarm",
            "actionsSuppressorWaitPeriod": 120,
            "actionsSuppressorExtensionPeriod": 360,
            "alarmName": "ServiceAggregatedAlarm",
            "actionsEnabled": true,
            "timestamp": "2022-07-14T13:59:56.290+0000",
            "okActions": [],
            "alarmActions": [],
            "insufficientDataActions": []
        }
    }
}
```
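To receive only a subset of these configuration change events, an EventBridge rule can filter on the `operation` field in the event detail. For example, the following event pattern, written in the standard EventBridge event pattern syntax, matches only alarm deletion events:

```
{
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm Configuration Change"],
    "detail": {
        "operation": ["delete"]
    }
}
```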

------

# Configure alarm mute rules
<a name="alarm-mute-rules-configure"></a>

 The steps in this section explain how to use the CloudWatch console to create an alarm mute rule. You can also use the API or AWS CLI to create an alarm mute rule. For more information, see [PutAlarmMuteRule](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutAlarmMuteRule.html). 

**To create an alarm mute rule**

1.  Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/). 

1.  In the navigation pane, choose **Alarms**, and then choose the **Mute rules** tab.

1.  Choose **Create alarm mute rule** to open the creation wizard.

1.  In the **Alarm mute rule details** section, enter a name and an optional description for the mute rule.

1.  In the **Schedule pattern** section, choose a one-time or recurring schedule.

   1.  If you choose a one-time schedule, do the following:

      1.  Select the time zone in which to apply the mute schedule.

      1.  Choose a start date and time to define when the mute rule becomes active.

      1.  Choose an end date and time to define when the mute rule expires.

   1.  If you choose a recurring schedule, you have two options: use the console form, or use a cron expression to configure the recurring schedule.

      1.  Under **Schedule creation type**, choose **Specify date, time and recurrence** to use the console form.

         1.  Choose the time zone in which to apply the mute rule.

         1.  Choose **Start date and time** to define when the mute rule becomes active.

         1.  Choose **Duration** to define how long the mute rule lasts after it becomes active.

         1.  Choose **Repeat** to define how the schedule repeats, such as every day, every month, every weekend, or on specific days of the week.

         1.  (Optional) Choose **Until** to define when the mute schedule expires. The default is **Indefinitely**.

      1.  Under **Schedule creation type**, choose **Set from a cron expression** to configure the schedule with a cron expression.

         1.  In the **Cron expression** section, enter the cron expression values that you want.

         1.  Choose **Duration** to define how long the mute rule lasts after it becomes active.

         1.  (Optional) In the **Timeframe** section, enter a start and end date and time to define when the mute schedule becomes active and expires.

1.  In the **Target alarms** section, choose the alarms to which you want to apply this mute rule from the dropdown list.

1.  (Optional) In the **Set tags for your mute rule** section, attach tags to your alarm mute rule. A tag is a key-value pair applied to a resource to hold metadata about that resource. For more information, see [What are tags?](https://docs.aws.amazon.com/whitepapers/latest/tagging-best-practices/what-are-tags.html)

1.  Choose **Create alarm mute rule**.

## Quick mutes
<a name="quick-mutes"></a>

You can mute alarms for a short time period as follows.

1.  Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/). 

1.  In the navigation pane, choose **Alarms**.

1.  In the list, select the alarms that you want to mute.

1.  On the **Actions** menu, choose **Mute**.

1.  In the **Quick mute** section, choose one of the predefined time periods (15 minutes, 1 hour, or 3 hours), or select **Mute until** to set a custom end time.

1.  Choose **Confirm** to mute the alarms immediately for the chosen time period.

## Add alarms to existing mute rules
<a name="add-alarms-to-existing-rules"></a>

You can add alarms to existing mute rules as follows.

1.  Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/). 

1.  In the navigation pane, choose **Alarms**.

1.  In the list, select the alarms that you want to mute.

1.  On the **Actions** menu, choose **Mute**.

1.  Choose **Apply existing mute rules** to open the wizard.

1.  From the dropdown list, select the mute rules to which you want to add the alarms.

1.  Choose **Apply**.

**Note**  
 The quick mute and apply-existing-mute-rules options are also available from the alarm details page. The **Mute rules** tab on the details page displays all mute rules associated with the alarm. 

# Managing alarms
<a name="Manage-CloudWatch-Alarm"></a>

**Topics**
+ [Edit or delete a CloudWatch alarm](Edit-CloudWatch-Alarm.md)
+ [Hide Auto Scaling alarms](hide-autoscaling-alarms.md)
+ [Alarms and tagging](CloudWatch_alarms_and_tagging.md)
+ [Viewing and managing muted alarms](viewing-managing-muted-alarms.md)

# Edit or delete a CloudWatch alarm
<a name="Edit-CloudWatch-Alarm"></a>

You can edit or delete an existing alarm.

You can't change the name of an existing alarm. You can copy an alarm and give the new alarm a different name. To copy an alarm, select the check box next to the alarm name in the alarm list and choose **Action**, **Copy**.

**To edit an alarm**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, **All Alarms**.

1. Choose the name of the alarm.

1. To add or remove tags, choose the **Tags** tab and then choose **Manage tags**.

1. To edit other parts of the alarm, choose **Actions**, **Edit**.

   The **Specify metric and conditions** page appears, showing a graph and other information about the metric and statistic that you selected.

1. To change the metric, choose **Edit**, choose the **All metrics** tab, and do one of the following:
   + Choose the service namespace that contains the metric that you want. Continue choosing options as they appear to narrow the choices. When a list of metrics appears, select the check box next to the metric that you want.
   + In the search box, enter the name of a metric, dimension, or resource ID and press Enter. Then choose one of the results and continue until a list of metrics appears. Select the check box next to the metric that you want. 

   Choose **Select metric**.

1. To change other aspects of the alarm, choose the appropriate options. To change how many data points must be breaching for the alarm to go into `ALARM` state or to change how missing data is treated, choose **Additional configuration**.

1. Choose **Next**.

1. Under **Notification**, **Auto Scaling action**, and **EC2 action**, optionally edit the actions taken when the alarm is triggered. Then choose **Next**.

1. Optionally change the alarm description.

1. Choose **Next**.

1. Under **Preview and create**, confirm that the information and conditions are what you want, then choose **Update alarm**.

**To update an email notification list that was created using the Amazon SNS console**

1. Open the Amazon SNS console at [https://console.aws.amazon.com/sns/v3/home](https://console.aws.amazon.com/sns/v3/home).

1. In the navigation pane, choose **Topics** and then select the ARN for your notification list (topic).

1. Do one of the following:
   + To add an email address, choose **Create subscription**. For **Protocol**, choose **Email**. For **Endpoint**, enter the email address of the new recipient. Choose **Create subscription**.
   + To remove an email address, choose the **Subscription ID**. Choose **Other subscription actions**, **Delete subscriptions**.

1. Choose **Publish to topic**.

**To delete an alarm**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**.

1. Select the check box to the left of the name of the alarm, and choose **Actions**, **Delete**.

1. Choose **Delete**.

# Hide Auto Scaling alarms
<a name="hide-autoscaling-alarms"></a>

When you view your alarms in the AWS Management Console, you can hide the alarms related to both Amazon EC2 Auto Scaling and Application Auto Scaling. This feature is available only in the AWS Management Console.

**To temporarily hide Auto Scaling alarms**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, **All alarms**, and select **Hide Auto Scaling alarms**.

# Alarms and tagging
<a name="CloudWatch_alarms_and_tagging"></a>

*Tags* are key-value pairs that can help you organize and categorize your resources. You can also use them to scope user permissions by granting a user permission to access or change only resources with certain tag values. For more general information about tagging resources, see [Tagging your AWS resources](https://docs.aws.amazon.com/tag-editor/latest/userguide/tagging.html).

The following list explains some details about how tagging works with CloudWatch alarms.
+ To set or update tags for a CloudWatch resource, you must be signed in to an account that has the `cloudwatch:TagResource` permission. For example, to create an alarm and set tags for it, you need the `cloudwatch:TagResource` permission in addition to the `cloudwatch:PutMetricAlarm` permission. We recommend that you make sure anyone in your organization who creates or updates CloudWatch resources has the `cloudwatch:TagResource` permission.
+ Tags can be used for tag-based authorization control. For example, IAM user or role permissions can include conditions to limit CloudWatch calls to specific resources based on their tags. However, keep in mind the following:
  + Tags with names that start with `aws:` can't be used for tag-based authorization control.
  + Composite alarms do not support tag-based authorization control.

# Viewing and managing muted alarms
<a name="viewing-managing-muted-alarms"></a>

 **Viewing muted alarms:** You can identify which alarms are currently muted in the CloudWatch console. In both the alarms list view and individual alarm detail pages, a mute icon appears next to alarms whose actions are being muted by active mute rules. This visual indicator helps you quickly identify alarms whose actions are muted until the mute window expires. 

![\[Mute icon displayed next to muted alarms in the alarms list\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/alarm_mute_rules_icon.png)


 **Alarms timeline:** The CloudWatch alarms console provides a timeline view that shows when your alarm actions were muted. The timeline displays mute periods alongside alarm state changes, giving you a complete historical view of both alarm behavior and muting activity. You can use this timeline to analyze the effectiveness of your mute rules and understand how they correlate with your operational activities. 

![\[Timeline view showing mute periods alongside alarm state changes\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/alarm-mutes-timelineview.png)


 **Programmatically checking alarm mute status:** To determine programmatically whether an alarm is currently muted, use the [ListAlarmMuteRules](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_ListAlarmMuteRules.html) API with the alarm name as the filter criterion. The API returns all active mute rules that affect the specified alarm, so you can integrate mute status checks into your automation workflows, monitoring dashboards, or operational tools. 

 For example, to check whether an alarm named "HighCPUAlarm" is currently muted, call the [ListAlarmMuteRules](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_ListAlarmMuteRules.html) API with the filter parameter set to the alarm name. The response includes all mute rules targeting that alarm, along with their current status (SCHEDULED, ACTIVE, or EXPIRED). 
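A sketch of that check in Python follows. The exact request and response field names (`TargetAlarmName`, `AlarmMuteRules`, `Status`) are assumptions for illustration; confirm them against the ListAlarmMuteRules API reference before use:

```
def is_alarm_muted(cloudwatch, alarm_name):
    """Return True if any ACTIVE mute rule targets the given alarm.

    `cloudwatch` is any client exposing list_alarm_mute_rules (for example,
    a boto3 CloudWatch client). The parameter and field names used here are
    illustrative assumptions -- check the API reference for the real shapes.
    """
    response = cloudwatch.list_alarm_mute_rules(TargetAlarmName=alarm_name)
    return any(
        rule.get("Status") == "ACTIVE"
        for rule in response.get("AlarmMuteRules", [])
    )
```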

 **Alarm history:** Whenever alarm actions are muted due to an active mute rule, CloudWatch writes a history entry to the alarm's history log. This provides a complete audit trail of when your alarms were muted, helping you understand the timeline of muting events and correlate them with operational activities. You can view this history through the CloudWatch console or retrieve it programmatically using the [DescribeAlarmHistory](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_DescribeAlarmHistory.html) API. 

**Note**  
 When multiple alarm mute rules are active simultaneously, the most recently created mute rule name is written to the alarm history along with the total number of other active mute rules. 
 The timeline displays mute periods only when an alarm state transitions during an active mute window and actions were prevented from executing. 

**Tip**  
 You can manage alarm mute rules programmatically using the CloudWatch API. For more information, see [PutAlarmMuteRule](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutAlarmMuteRule.html), [GetAlarmMuteRule](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetAlarmMuteRule.html), [ListAlarmMuteRules](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_ListAlarmMuteRules.html), and [DeleteAlarmMuteRule](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_DeleteAlarmMuteRule.html). 

## Common features of CloudWatch alarms
<a name="common-features-of-alarms"></a>

The following features apply to all CloudWatch alarms:
+ There is no limit to the number of alarms that you can create. To create or update an alarm, you use the CloudWatch console, the [PutMetricAlarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricAlarm.html) API action, or the [put-metric-alarm](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/put-metric-alarm.html) command in the AWS CLI.
+ Alarm names must contain only UTF-8 characters and can't contain ASCII control characters.
+ You can list any or all of the currently configured alarms, and list any alarms in a particular state by using the CloudWatch console, the [DescribeAlarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_DescribeAlarms.html) API action, or the [describe-alarms](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/describe-alarms.html) command in the AWS CLI.
+ You can disable and enable alarm actions by using the [DisableAlarmActions](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_DisableAlarmActions.html) and [EnableAlarmActions](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_EnableAlarmActions.html) API actions, or the [disable-alarm-actions](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/disable-alarm-actions.html) and [enable-alarm-actions](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/enable-alarm-actions.html) commands in the AWS CLI. 
+ You can test an alarm by setting it to any state using the [SetAlarmState](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_SetAlarmState.html) API action or the [set-alarm-state](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/set-alarm-state.html) command in the AWS CLI. This temporary state change lasts only until the next alarm comparison occurs.
+ You can create an alarm for a custom metric before you've created that custom metric. For the alarm to be valid, you must include all of the dimensions for the custom metric in addition to the metric namespace and metric name in the alarm definition. To do this, you can use the [PutMetricAlarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_PutMetricAlarm.html) API action, or the [put-metric-alarm](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/put-metric-alarm.html) command in the AWS CLI.
+ You can view an alarm's history using the CloudWatch console, the [DescribeAlarmHistory](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_DescribeAlarmHistory.html) API action, or the [describe-alarm-history](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/describe-alarm-history.html) command in the AWS CLI. CloudWatch preserves alarm history for 30 days. Each state transition is marked with a unique timestamp. In rare cases, your history might show more than one notification for a state change. The timestamp enables you to confirm unique state changes.
+  You can favorite alarms from the *Favorites and recents* option in the CloudWatch console navigation pane by hovering over the alarm that you want to favorite and choosing the star symbol next to it. 
+ Alarms have an evaluation period quota. The evaluation period is calculated by multiplying the alarm period by the number of evaluation periods used.
  + The maximum evaluation period is seven days for alarms with a period of at least one hour (3600 seconds).
  + The maximum evaluation period is one day for alarms with a shorter period.
  + The maximum evaluation period is one day for alarms that use the custom Lambda data source.

**Note**  
Some AWS resources don't send metric data to CloudWatch under certain conditions.  
For example, Amazon EBS might not send metric data for an available volume that is not attached to an Amazon EC2 instance, because there is no metric activity to be monitored for that volume. If you have an alarm set for such a metric, you might notice its state change to `INSUFFICIENT_DATA`. This might indicate that your resource is inactive, and might not necessarily mean that there is a problem. You can specify how each alarm treats missing data. For more information, see [Configuring how CloudWatch alarms treat missing data](alarms-and-missing-data.md).
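The custom-metric point above is worth making concrete: the alarm definition must spell out the namespace, metric name, and every dimension, even if the metric doesn't exist yet. The following is a sketch of the PutMetricAlarm request parameters in Python; the names `MyApp`, `ErrorRate`, and `Service` are illustrative, and in a real environment you would pass this dict to a boto3 CloudWatch client's `put_metric_alarm`:

```
# PutMetricAlarm request for a custom metric that may not exist yet.
# In practice: boto3.client("cloudwatch").put_metric_alarm(**params)
params = {
    "AlarmName": "MyAppErrorRateHigh",            # illustrative alarm name
    "Namespace": "MyApp",                         # custom namespace (assumption)
    "MetricName": "ErrorRate",                    # custom metric name (assumption)
    "Dimensions": [                               # all dimensions must be listed
        {"Name": "Service", "Value": "checkout"},
    ],
    "Statistic": "Average",
    "Period": 300,                                # seconds
    "EvaluationPeriods": 1,
    "Threshold": 5.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "notBreaching",           # explicit missing-data handling
}
```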

# Best practice alarm recommendations for AWS services
<a name="Best-Practice-Alarms"></a>

CloudWatch provides out-of-the-box alarm recommendations. These are CloudWatch alarms that we recommend that you create for metrics published by other AWS services. These recommendations help you identify the metrics that you should set alarms for to follow monitoring best practices, and they also suggest the alarm thresholds to set. Following these recommendations helps ensure that you don't miss important monitoring of your AWS infrastructure.

To find the alarm recommendations, you use the metrics section of the CloudWatch console, and select the alarm recommendations filter toggle. If you navigate to the recommended alarms in the console and then create a recommended alarm, CloudWatch can pre-fill some of the alarm settings. For some recommended alarms, the alarm threshold value is also pre-filled. You can also use the console to download infrastructure-as-code alarm definitions for recommended alarms, and then use this code to create the alarm in AWS CloudFormation, the AWS CLI, or Terraform.

You can also see the list of recommended alarms in [Recommended alarms](Best_Practice_Recommended_Alarms_AWS_Services.md).

You are charged for the alarms that you create, at the same rate as any other alarms that you create in CloudWatch. Using the recommendations incurs no extra charges. For more information, see [Amazon CloudWatch Pricing](https://aws.amazon.com/cloudwatch/pricing/).

## Find and create recommended alarms
<a name="Best-Practice-Alarms-create"></a>

Follow these steps to find the metrics that CloudWatch recommends that you set alarms for, and optionally to create one of these alarms. The first procedure explains how to find the metrics that have recommended alarms, and how to create one of these alarms.

You can also get a bulk download of infrastructure-as-code alarm definitions for all recommended alarms in an AWS namespace, such as `AWS/Lambda` or `AWS/S3`. Those instructions are later in this topic.

**To find the metrics with recommended alarms, and create a single recommended alarm**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Metrics**, **All metrics**.

1. Above the **Metrics** table, choose **Alarm recommendations**.

   The list of metric namespaces is filtered to include only the metrics that have alarm recommendations and that services in your account are publishing.

1. Choose the namespace for a service.

   The list of metrics under this namespace is filtered to include only those that have alarm recommendations.

1. To see the alarm intent and recommended threshold for a metric, choose **View details**.

1. To create an alarm for one of the metrics, do one of the following:
   + To use the console to create the alarm, do the following:

     1. Select the checkbox for the metric and choose the **Graphed metrics** tab.

     1. Choose the alarm icon.  
![\[Create an alarm from a graphed metric\]](http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/images/metric_graph_alarm.png)

        The alarm creation wizard appears, with the metric name, statistic, and period filled in based on the alarm recommendation. If the recommendation includes a specific threshold value, that value is also pre-filled.

     1. Choose **Next**.

     1. Under **Notification**, select an SNS topic to notify when the alarm transitions to `ALARM` state, `OK` state, or `INSUFFICIENT_DATA` state.

        To have the alarm send multiple notifications for the same alarm state or for different alarm states, choose **Add notification**.

        To have the alarm not send notifications, choose **Remove**.

     1. To have the alarm perform Auto Scaling or EC2 actions, choose the appropriate button and choose the alarm state and action to perform.

     1. When finished, choose **Next**.

     1. Enter a name and description for the alarm. The name must contain only ASCII characters. Then choose **Next**.

     1. Under **Preview and create**, confirm that the information and conditions are what you want, then choose **Create alarm**.
   + To download an infrastructure-as-code alarm definition to use in either AWS CloudFormation, AWS CLI, or Terraform, choose **Download alarm code** and select the format that you want. The downloaded code will have the recommended settings for the metric name, statistic, and threshold.

**To download infrastructure-as-code alarm definitions for all recommended alarms for an AWS service**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Metrics**, **All metrics**.

1. Above the **Metrics** table, choose **Alarm recommendations**.

   The list of metric namespaces is filtered to include only the metrics that have alarm recommendations and that services in your account are publishing.

1. Choose the namespace for a service.

   The list of metrics under this namespace is filtered to include only those that have alarm recommendations.

1. The **Download alarm code** button displays how many alarms are recommended for the metrics in this namespace. To download infrastructure-as-code alarm definitions for all recommended alarms, choose **Download alarm code** and then choose the code format that you want.
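
A downloaded CloudFormation definition might look similar to the following sketch. The resource name, alarm name, and dimension value here are illustrative placeholders, not the exact code the console produces:

```yaml
Resources:
  LambdaErrorsAlarm:                        # hypothetical logical resource name
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: Lambda-Errors-my-function  # illustrative; substitute your own
      Namespace: AWS/Lambda
      MetricName: Errors
      Dimensions:
        - Name: FunctionName
          Value: my-function                # replace with your function name
      Statistic: Sum
      Period: 60
      EvaluationPeriods: 3
      DatapointsToAlarm: 3
      Threshold: 1
      ComparisonOperator: GreaterThanOrEqualToThreshold
      TreatMissingData: notBreaching
```

You can deploy such a template with `aws cloudformation deploy`, or adapt the same property values for the AWS CLI `put-metric-alarm` command or a Terraform `aws_cloudwatch_metric_alarm` resource.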

# Recommended alarms
<a name="Best_Practice_Recommended_Alarms_AWS_Services"></a>

The following sections list the metrics that we recommend that you set best practice alarms for. For each metric, the dimensions, alarm intent, recommended threshold, threshold justification, and the period length and number of datapoints are also displayed.

Some metrics might appear twice in the list. This happens when different alarms are recommended for different combinations of dimensions of that metric.

**Datapoints to alarm** is the number of data points that must be breaching to send the alarm into ALARM state. **Evaluation periods** is the number of periods that are taken into account when the alarm is evaluated. If these numbers are the same, the alarm goes into ALARM state only when that number of consecutive periods have values that breach the threshold. If **Datapoints to alarm** is lower than **Evaluation periods**, then it is an "M out of N" alarm and the alarm goes into ALARM state if at least **Datapoints to alarm** data points are breaching within any **Evaluation periods** set of data points. For more information, see [Alarm evaluation](alarm-evaluation.md).
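
As a rough sketch, ignoring CloudWatch's handling of missing data and partial evaluation windows, the "M out of N" rule described above can be expressed as:

```python
def alarm_state(datapoints, threshold, m, n):
    """Return 'ALARM' if at least m of the last n datapoints breach the threshold.

    datapoints: metric values in chronological order
    threshold:  breaching value (here, breaching means strictly greater than)
    m:          Datapoints to alarm
    n:          Evaluation periods
    """
    window = datapoints[-n:]  # only the most recent n periods are evaluated
    breaching = sum(1 for value in window if value > threshold)
    return "ALARM" if breaching >= m else "OK"
```

When `m == n`, only `n` consecutive breaching periods trigger the alarm; when `m < n`, any `m` breaching datapoints within the window of `n` suffice.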

**Topics**
+ [Amazon API Gateway](#ApiGateway)
+ [Amazon EC2 Auto Scaling](#AutoScaling)
+ [AWS Certificate Manager (ACM)](#CertificateManager)
+ [Amazon CloudFront](#CloudFront)
+ [Amazon Cognito](#Cognito)
+ [Amazon DynamoDB](#DynamoDB)
+ [Amazon EBS](#Recommended_EBS)
+ [Amazon EC2](#EC2)
+ [Amazon ElastiCache](#ElastiCache)
+ [Amazon ECS](#ECS)
+ [Amazon ECS with Container Insights](#ECS-ContainerInsights)
+ [Amazon ECS with Container Insights with enhanced observability](#ECS-ContainerInsights_enhanced)
+ [Amazon EFS](#EFS)
+ [Amazon EKS with Container Insights](#EKS-ContainerInsights)
+ [Amazon EventBridge Scheduler](#Eventbridge-Scheduler)
+ [Amazon Kinesis Data Streams](#Kinesis)
+ [Lambda](#Lambda)
+ [Lambda Insights](#LambdaInsights)
+ [Amazon VPC (`AWS/NATGateway`)](#NATGateway)
+ [AWS PrivateLink (`AWS/PrivateLinkEndpoints`)](#PrivateLinkEndpoints)
+ [AWS PrivateLink (`AWS/PrivateLinkServices`)](#PrivateLinkServices)
+ [Amazon RDS](#RDS)
+ [Amazon Route 53 Public Data Plane](#Route53)
+ [Amazon S3](#S3)
+ [`S3ObjectLambda`](#S3ObjectLambda)
+ [Amazon SNS](#SNS)
+ [Amazon SQS](#SQS)
+ [Site-to-Site VPN](#VPN)

## Amazon API Gateway
<a name="ApiGateway"></a>

**4XXError**  
**Dimensions: **ApiName, Stage  
**Alarm description: **This alarm detects a high rate of client-side errors. This can indicate an issue in the authorization or client request parameters. It could also mean that a resource was removed or a client is requesting one that doesn't exist. Consider enabling CloudWatch Logs and checking for any errors that may be causing the 4XX errors. Moreover, consider enabling detailed CloudWatch metrics to view this metric per resource and method and narrow down the source of the errors. Errors could also be caused by exceeding the configured throttling limit. If the responses and logs are reporting high and unexpected rates of 429 errors, follow [this guide](https://repost.aws/knowledge-center/api-gateway-429-limit) to troubleshoot this issue.  
**Intent: **This alarm can detect high rates of client-side errors for the API Gateway requests.  
**Statistic: **Average  
**Recommended threshold: **0.05  
**Threshold justification: **The suggested threshold detects when more than 5% of total requests are getting 4XX errors. However, you can tune the threshold to suit the traffic of the requests as well as acceptable error rates. You can also analyze historical data to determine the acceptable error rate for the application workload and then tune the threshold accordingly. Frequently occurring 4XX errors need to be alarmed on. However, setting a very low value for the threshold can cause the alarm to be too sensitive.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD
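
The recommended settings in this table can be translated into a `PutMetricAlarm` call with the AWS SDK. The following sketch uses boto3; the alarm naming scheme and dimension values are hypothetical placeholders, and the actual call requires AWS credentials with `cloudwatch:PutMetricAlarm` permission:

```python
def recommended_4xx_alarm(api_name, stage, sns_topic_arn):
    """Build the parameter set for the recommended 4XXError alarm.

    api_name, stage, and sns_topic_arn are placeholders; substitute
    your own API name, stage, and notification topic.
    """
    return {
        "AlarmName": f"APIGateway-4XXError-{api_name}-{stage}",  # hypothetical naming scheme
        "AlarmDescription": "More than 5% of requests are returning 4XX errors",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "4XXError",
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Stage", "Value": stage},
        ],
        "Statistic": "Average",   # averaging 4XXError yields an error rate between 0 and 1
        "Period": 60,
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 5,
        "Threshold": 0.05,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params):
    """Create the alarm; needs AWS credentials with cloudwatch:PutMetricAlarm."""
    import boto3  # imported here so the parameter builder itself has no dependencies
    boto3.client("cloudwatch").put_metric_alarm(**params)
```

The same pattern applies to the other recommended alarms in these tables; only the metric name, dimensions, statistic, threshold, and comparison operator change.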

**5XXError**  
**Dimensions: **ApiName, Stage  
**Alarm description: **This alarm helps to detect a high rate of server-side errors. This can indicate that there is something wrong on the API backend, the network, or the integration between the API gateway and the backend API. This [documentation](https://repost.aws/knowledge-center/api-gateway-5xx-error) can help you troubleshoot the cause of 5xx errors.  
**Intent: **This alarm can detect high rates of server-side errors for the API Gateway requests.  
**Statistic: **Average  
**Recommended threshold: **0.05  
**Threshold justification: **The suggested threshold detects when more than 5% of total requests are getting 5XX errors. However, you can tune the threshold to suit the traffic of the requests as well as acceptable error rates. You can also analyze historical data to determine the acceptable error rate for the application workload and then tune the threshold accordingly. Frequently occurring 5XX errors need to be alarmed on. However, setting a very low value for the threshold can cause the alarm to be too sensitive.  
**Period: **60  
**Datapoints to alarm: **3  
**Evaluation periods: **3  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**Count**  
**Dimensions: **ApiName, Stage  
**Alarm description: **This alarm helps to detect low traffic volume for the REST API stage. This can be an indicator of an issue with the application calling the API such as using incorrect endpoints. It could also be an indicator of an issue with the configuration or permissions of the API making it unreachable for clients.  
**Intent: **This alarm can detect unexpectedly low traffic volume for the REST API stage. We recommend that you create this alarm if your API receives a predictable and consistent number of requests under normal conditions. If you have detailed CloudWatch metrics enabled and you can predict the normal traffic volume per method and resource, we recommend that you create alternative alarms to have more fine-grained monitoring of traffic volume drops for each resource and method. This alarm is not recommended for APIs that don't expect constant and consistent traffic.  
**Statistic: **SampleCount  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold based on historical data analysis to determine what the expected baseline request count for your API is. Setting the threshold at a very high value might cause the alarm to be too sensitive at periods of normal and expected low traffic. Conversely, setting it at a very low value might cause the alarm to miss anomalous smaller drops in traffic volume.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **LESS\_THAN\_THRESHOLD

**Count**  
**Dimensions: **ApiName, Stage, Resource, Method  
**Alarm description: **This alarm helps to detect low traffic volume for the REST API resource and method in the stage. This can indicate an issue with the application calling the API such as using incorrect endpoints. It could also be an indicator of an issue with the configuration or permissions of the API making it unreachable for clients.  
**Intent: **This alarm can detect unexpectedly low traffic volume for the REST API resource and method in the stage. We recommend that you create this alarm if your API receives a predictable and consistent number of requests under normal conditions. This alarm is not recommended for APIs that don't expect constant and consistent traffic.  
**Statistic: **SampleCount  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold based on historical data analysis to determine what the expected baseline request count for your API is. Setting the threshold at a very high value might cause the alarm to be too sensitive at periods of normal and expected low traffic. Conversely, setting it at a very low value might cause the alarm to miss anomalous smaller drops in traffic volume.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **LESS\_THAN\_THRESHOLD

**Count**  
**Dimensions: **ApiId, Stage  
**Alarm description: **This alarm helps to detect low traffic volume for the HTTP API stage. This can indicate an issue with the application calling the API such as using incorrect endpoints. It could also be an indicator of an issue with the configuration or permissions of the API making it unreachable for clients.  
**Intent: **This alarm can detect unexpectedly low traffic volume for the HTTP API stage. We recommend that you create this alarm if your API receives a predictable and consistent number of requests under normal conditions. If you have detailed CloudWatch metrics enabled and you can predict the normal traffic volume per route, we recommend that you create alternative alarms to this in order to have more fine-grained monitoring of traffic volume drops for each route. This alarm is not recommended for APIs that don't expect constant and consistent traffic.  
**Statistic: **SampleCount  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold value based on historical data analysis to determine what the expected baseline request count for your API is. Setting the threshold at a very high value might cause the alarm to be too sensitive at periods of normal and expected low traffic. Conversely, setting it at a very low value might cause the alarm to miss anomalous smaller drops in traffic volume.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **LESS\_THAN\_THRESHOLD

**Count**  
**Dimensions: **ApiId, Stage, Resource, Method  
**Alarm description: **This alarm helps to detect low traffic volume for the HTTP API route in the stage. This can indicate an issue with the application calling the API such as using incorrect endpoints. It could also indicate an issue with the configuration or permissions of the API making it unreachable for clients.  
**Intent: **This alarm can detect unexpectedly low traffic volume for the HTTP API route in the stage. We recommend that you create this alarm if your API receives a predictable and consistent number of requests under normal conditions. This alarm is not recommended for APIs that don't expect constant and consistent traffic.  
**Statistic: **SampleCount  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold value based on historical data analysis to determine what the expected baseline request count for your API is. Setting the threshold at a very high value might cause the alarm to be too sensitive at periods of normal and expected low traffic. Conversely, setting it at a very low value might cause the alarm to miss anomalous smaller drops in traffic volume.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **LESS\_THAN\_THRESHOLD

**IntegrationLatency**  
**Dimensions: **ApiId, Stage  
**Alarm description: **This alarm helps to detect if there is high integration latency for the API requests in a stage. You can correlate the `IntegrationLatency` metric value with the corresponding latency metric of your backend such as the `Duration` metric for Lambda integrations. This helps you determine whether the API backend is taking more time to process requests from clients due to performance issues, or if there is some other overhead from initialization or cold start. Additionally, consider enabling CloudWatch Logs for your API and checking the logs for any errors that may be causing the high latency issues. Moreover, consider enabling detailed CloudWatch metrics to get a view of this metric per route, to help you narrow down the source of the integration latency.  
**Intent: **This alarm can detect when the API Gateway requests in a stage have a high integration latency. We recommend this alarm for WebSocket APIs, and we consider it optional for HTTP APIs because they already have separate alarm recommendations for the Latency metric. If you have detailed CloudWatch metrics enabled and you have different integration latency performance requirements per route, we recommend that you create alternative alarms in order to have more fine-grained monitoring of the integration latency for each route.  
**Statistic: **p90  
**Recommended threshold: **2000.0  
**Threshold justification: **The suggested threshold value does not work for all the API workloads. However, you can use it as a starting point for the threshold. You can then choose different threshold values based on the workload and acceptable latency, performance, and SLA requirements for the API. If it is acceptable for the API to have a higher latency in general, set a higher threshold value to make the alarm less sensitive. However, if the API is expected to provide near real-time responses, set a lower threshold value. You can also analyze historical data to determine the expected baseline latency for the application workload, and then tune the threshold value accordingly.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_OR\_EQUAL\_TO\_THRESHOLD
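
Because this recommendation uses a percentile statistic (p90) rather than a standard statistic, a `PutMetricAlarm` call must pass `ExtendedStatistic` instead of `Statistic`. A hedged boto3 sketch, with hypothetical alarm naming and placeholder dimension values:

```python
def recommended_integration_latency_alarm(api_id, stage, sns_topic_arn):
    """Parameter set for the recommended IntegrationLatency alarm.

    api_id, stage, and sns_topic_arn are placeholders for your own values.
    """
    return {
        "AlarmName": f"APIGateway-IntegrationLatency-{api_id}-{stage}",  # hypothetical name
        "AlarmDescription": "p90 integration latency is at or above 2000 ms",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "IntegrationLatency",
        "Dimensions": [
            {"Name": "ApiId", "Value": api_id},
            {"Name": "Stage", "Value": stage},
        ],
        "ExtendedStatistic": "p90",  # percentile statistics use ExtendedStatistic, not Statistic
        "Period": 60,
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 5,
        "Threshold": 2000.0,         # milliseconds
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

Pass the resulting dict to `boto3.client("cloudwatch").put_metric_alarm(**params)` once you have credentials configured.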

**IntegrationLatency**  
**Dimensions: **ApiId, Stage, Route  
**Alarm description: **This alarm helps to detect if there is high integration latency for the WebSocket API requests for a route in a stage. You can correlate the `IntegrationLatency` metric value with the corresponding latency metric of your backend such as the `Duration` metric for Lambda integrations. This helps you determine whether the API backend is taking more time to process requests from clients due to performance issues or if there is some other overhead from initialization or cold start. Additionally, consider enabling CloudWatch Logs for your API and checking the logs for any errors that may be causing the high latency issues.  
**Intent: **This alarm can detect when the API Gateway requests for a route in a stage have high integration latency.  
**Statistic: **p90  
**Recommended threshold: **2000.0  
**Threshold justification: **The suggested threshold value does not work for all the API workloads. However, you can use it as a starting point for the threshold. You can then choose different threshold values based on the workload and acceptable latency, performance, and SLA requirements for the API. If it is acceptable for the API to have a higher latency in general, you can set a higher threshold value to make the alarm less sensitive. However, if the API is expected to provide near real-time responses, set a lower threshold value. You can also analyze historical data to determine the expected baseline latency for the application workload, and then tune the threshold value accordingly.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_OR\_EQUAL\_TO\_THRESHOLD

**Latency**  
**Dimensions: **ApiName, Stage  
**Alarm description: **This alarm detects high latency in a stage. Find the `IntegrationLatency` metric value to check the API backend latency. If the two metrics are mostly aligned, the API backend is the source of higher latency and you should investigate there for issues. Consider also enabling CloudWatch Logs and checking for errors that might be causing the high latency. Moreover, consider enabling detailed CloudWatch metrics to view this metric per resource and method and narrow down the source of the latency. If applicable, refer to the [troubleshooting with Lambda](https://repost.aws/knowledge-center/api-gateway-high-latency-with-lambda) or [troubleshooting for edge-optimized API endpoints](https://repost.aws/knowledge-center/source-latency-requests-api-gateway) guides.  
**Intent: **This alarm can detect when the API Gateway requests in a stage have high latency. If you have detailed CloudWatch metrics enabled and you have different latency performance requirements for each method and resource, we recommend that you create alternative alarms to have more fine-grained monitoring of the latency for each resource and method.  
**Statistic: **p90  
**Recommended threshold: **2500.0  
**Threshold justification: **The suggested threshold value does not work for all API workloads. However, you can use it as a starting point for the threshold. You can then choose different threshold values based on the workload and acceptable latency, performance, and SLA requirements for the API. If it is acceptable for the API to have a higher latency in general, you can set a higher threshold value to make the alarm less sensitive. However, if the API is expected to provide near real-time responses, set a lower threshold value. You can also analyze historical data to determine what the expected baseline latency is for the application workload and then tune the threshold value accordingly.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_OR\_EQUAL\_TO\_THRESHOLD

**Latency**  
**Dimensions: **ApiName, Stage, Resource, Method  
**Alarm description: **This alarm detects high latency for a resource and method in a stage. Find the `IntegrationLatency` metric value to check the API backend latency. If the two metrics are mostly aligned, the API backend is the source of higher latency and you should investigate there for performance issues. Consider also enabling CloudWatch Logs and checking for any errors that might be causing the high latency. You can also refer to the [troubleshooting with Lambda](https://repost.aws/knowledge-center/api-gateway-high-latency-with-lambda) or [troubleshooting for edge-optimized API endpoints](https://repost.aws/knowledge-center/source-latency-requests-api-gateway) guides if applicable.  
**Intent: **This alarm can detect when the API Gateway requests for a resource and method in a stage have high latency.  
**Statistic: **p90  
**Recommended threshold: **2500.0  
**Threshold justification: **The suggested threshold value does not work for all the API workloads. However, you can use it as a starting point for the threshold. You can then choose different threshold values based on the workload and acceptable latency, performance, and SLA requirements for the API. If it is acceptable for the API to have a higher latency in general, you can set a higher threshold value to make the alarm less sensitive. However, if the API is expected to provide near real-time responses, set a lower threshold value. You can also analyze historical data to determine the expected baseline latency for the application workload and then tune the threshold value accordingly.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_OR\_EQUAL\_TO\_THRESHOLD

**Latency**  
**Dimensions: **ApiId, Stage  
**Alarm description: **This alarm detects high latency in a stage. Find the `IntegrationLatency` metric value to check the API backend latency. If the two metrics are mostly aligned, the API backend is the source of higher latency and you should investigate there for performance issues. Consider also enabling CloudWatch Logs and checking for any errors that may be causing the high latency. Moreover, consider enabling detailed CloudWatch metrics to view this metric per route and narrow down the source of the latency. You can also refer to the [troubleshooting with Lambda integrations guide](https://repost.aws/knowledge-center/api-gateway-high-latency-with-lambda) if applicable.  
**Intent: **This alarm can detect when the API Gateway requests in a stage have high latency. If you have detailed CloudWatch metrics enabled and you have different latency performance requirements per route, we recommend that you create alternative alarms to have more fine-grained monitoring of the latency for each route.  
**Statistic: **p90  
**Recommended threshold: **2500.0  
**Threshold justification: **The suggested threshold value does not work for all the API workloads. However, it can be used as a starting point for the threshold. You can then choose different threshold values based on the workload and acceptable latency, performance, and SLA requirements for the API. If it is acceptable for the API to have a higher latency in general, you can set a higher threshold value to make the alarm less sensitive. However, if the API is expected to provide near real-time responses, set a lower threshold value. You can also analyze historical data to determine the expected baseline latency for the application workload and then tune the threshold value accordingly.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_OR\_EQUAL\_TO\_THRESHOLD

**Latency**  
**Dimensions: **ApiId, Stage, Resource, Method  
**Alarm description: **This alarm detects high latency for a route in a stage. Find the `IntegrationLatency` metric value to check the API backend latency. If the two metrics are mostly aligned, the API backend is the source of higher latency and should be investigated for performance issues. Consider also enabling CloudWatch logs and checking for any errors that might be causing the high latency. You can also refer to the [troubleshooting with Lambda integrations guide](https://repost.aws/knowledge-center/api-gateway-high-latency-with-lambda) if applicable.  
**Intent: **This alarm is used to detect when the API Gateway requests for a route in a stage have high latency.  
**Statistic: **p90  
**Recommended threshold: **2500.0  
**Threshold justification: **The suggested threshold value does not work for all the API workloads. However, it can be used as a starting point for the threshold. You can then choose different threshold values based on the workload and acceptable latency, performance, and SLA requirements for the API. If it is acceptable for the API to have a higher latency in general, you can set a higher threshold value to make the alarm less sensitive. However, if the API is expected to provide near real-time responses, set a lower threshold value. You can also analyze historical data to determine the expected baseline latency for the application workload and then tune the threshold value accordingly.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_OR\_EQUAL\_TO\_THRESHOLD

**4xx**  
**Dimensions: **ApiId, Stage  
**Alarm description: **This alarm detects a high rate of client-side errors. This can indicate an issue in the authorization or client request parameters. It could also mean that a route was removed or a client is requesting one that doesn't exist in the API. Consider enabling CloudWatch Logs and checking for any errors that may be causing the 4xx errors. Moreover, consider enabling detailed CloudWatch metrics to view this metric per route, to help you narrow down the source of the errors. Errors can also be caused by exceeding the configured throttling limit. If the responses and logs are reporting high and unexpected rates of 429 errors, follow [this guide](https://repost.aws/knowledge-center/api-gateway-429-limit) to troubleshoot this issue.  
**Intent: **This alarm can detect high rates of client-side errors for the API Gateway requests.  
**Statistic: **Average  
**Recommended threshold: **0.05  
**Threshold justification: **The suggested threshold detects when more than 5% of total requests are getting 4xx errors. However, you can tune the threshold to suit the traffic of the requests as well as acceptable error rates. You can also analyze historical data to determine the acceptable error rate for the application workload and then tune the threshold accordingly. Frequently occurring 4xx errors need to be alarmed on. However, setting a very low value for the threshold can cause the alarm to be too sensitive.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**5xx**  
**Dimensions: **ApiId, Stage  
**Alarm description: **This alarm helps to detect a high rate of server-side errors. This can indicate that there is something wrong on the API backend, the network, or the integration between the API gateway and the backend API. This [documentation](https://repost.aws/knowledge-center/api-gateway-5xx-error) can help you troubleshoot the cause for 5xx errors.  
**Intent: **This alarm can detect high rates of server-side errors for the API Gateway requests.  
**Statistic: **Average  
**Recommended threshold: **0.05  
**Threshold justification: **The suggested threshold detects when more than 5% of total requests are getting 5xx errors. However, you can tune the threshold to suit the traffic of the requests as well as acceptable error rates. You can also analyze historical data to determine what the acceptable error rate is for the application workload, and then you can tune the threshold accordingly. Frequently occurring 5xx errors need to be alarmed on. However, setting a very low value for the threshold can cause the alarm to be too sensitive.  
**Period: **60  
**Datapoints to alarm: **3  
**Evaluation periods: **3  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**MessageCount**  
**Dimensions: **ApiId, Stage  
**Alarm description: **This alarm helps to detect low traffic volume for the WebSocket API stage. This can indicate an issue when clients call the API such as using incorrect endpoints, or issues with the backend sending messages to clients. It could also indicate an issue with the configuration or permissions of the API, making it unreachable for clients.  
**Intent: **This alarm can detect unexpectedly low traffic volume for the WebSocket API stage. We recommend that you create this alarm if your API receives and sends a predictable and consistent number of messages under normal conditions. If you have detailed CloudWatch metrics enabled and you can predict the normal traffic volume per route, it is better to create alternative alarms to this one, in order to have more fine-grained monitoring of traffic volume drops for each route. We do not recommend this alarm for APIs that don't expect constant and consistent traffic.  
**Statistic: **SampleCount  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold value based on historical data analysis to determine what the expected baseline message count for your API is. Setting the threshold to a very high value might cause the alarm to be too sensitive at periods of normal and expected low traffic. Conversely, setting it to a very low value might cause the alarm to miss anomalous smaller drops in traffic volume.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **LESS\_THAN\_THRESHOLD

**MessageCount**  
**Dimensions: **ApiId, Stage, Route  
**Alarm description: **This alarm helps detect low traffic volume for the WebSocket API route in the stage. This can indicate an issue with the clients calling the API such as using incorrect endpoints, or issues with the backend sending messages to clients. It could also indicate an issue with the configuration or permissions of the API, making it unreachable for clients.  
**Intent: **This alarm can detect unexpectedly low traffic volume for the WebSocket API route in the stage. We recommend that you create this alarm if your API receives and sends a predictable and consistent number of messages under normal conditions. We do not recommend this alarm for APIs that don't expect constant and consistent traffic.  
**Statistic: **SampleCount  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold based on historical data analysis to determine what the expected baseline message count for your API is. Setting the threshold to a very high value might cause the alarm to be too sensitive at periods of normal and expected low traffic. Conversely, setting it to a very low value might cause the alarm to miss anomalous smaller drops in traffic volume.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **LESS\_THAN\_THRESHOLD

**ClientError**  
**Dimensions: **ApiId, Stage  
**Alarm description: **This alarm detects a high rate of client errors. This can indicate an issue in the authorization or message parameters. It could also mean that a route was removed or a client is requesting one that doesn't exist in the API. Consider enabling CloudWatch Logs and checking for any errors that may be causing the 4xx errors. Moreover, consider enabling detailed CloudWatch metrics to view this metric per route, to help you narrow down the source of the errors. Errors could also be caused by exceeding the configured throttling limit. If the responses and logs are reporting high and unexpected rates of 429 errors, follow [this guide](https://repost.aws/knowledge-center/api-gateway-429-limit) to troubleshoot this issue.  
**Intent: **This alarm can detect high rates of client errors for the WebSocket API Gateway messages.  
**Statistic: **Average  
**Recommended threshold: **0.05  
**Threshold justification: **The suggested threshold detects when more than 5% of total requests are getting 4xx errors. You can tune the threshold to suit the traffic of the requests as well as to suit your acceptable error rates. You can also analyze historical data to determine the acceptable error rate for the application workload, and then tune the threshold accordingly. Frequently occurring 4xx errors need to be alarmed on. However, setting a very low value for the threshold can cause the alarm to be too sensitive.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD
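Because `ClientError` is emitted as 0 or 1 per message, its Average statistic over a period is the 4xx rate. A hedged sketch of checking that rate against the recommended 0.05 threshold (the counts are made-up sample data):

```python
def client_error_rate(error_counts, total_counts):
    """Approximate the Average statistic of a 0/1 ClientError metric:
    total 4xx responses divided by total messages across the periods."""
    return sum(error_counts) / sum(total_counts)

THRESHOLD = 0.05  # recommended: alarm when more than 5% of messages return 4xx

rate = client_error_rate(error_counts=[3, 4, 5], total_counts=[100, 100, 100])
print(rate, rate > THRESHOLD)  # 0.04 -> below threshold, no alarm
```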

**ExecutionError**  
**Dimensions: **ApiId, Stage  
**Alarm description: **This alarm helps to detect a high rate of execution errors. This can be caused by 5xx errors from your integration, permission issues, or other factors preventing successful invocation of the integration, such as the integration being throttled or deleted. Consider enabling CloudWatch Logs for your API and checking the logs for the type and cause of the errors. Moreover, consider enabling detailed CloudWatch metrics to get a view of this metric per route, to help you narrow down the source of the errors. This [documentation](https://repost.aws/knowledge-center/api-gateway-websocket-error) can also help you troubleshoot the cause of any connection errors.  
**Intent: **This alarm can detect high rates of execution errors for the WebSocket API Gateway messages.  
**Statistic: **Average  
**Recommended threshold: **0.05  
**Threshold justification: **The suggested threshold detects when more than 5% of total requests are getting execution errors. You can tune the threshold to suit the traffic of the requests, as well as to suit your acceptable error rates. You can analyze historical data to determine the acceptable error rate for the application workload, and then tune the threshold accordingly. Frequently occurring execution errors need to be alarmed on. However, setting a very low value for the threshold can cause the alarm to be too sensitive.  
**Period: **60  
**Datapoints to alarm: **3  
**Evaluation periods: **3  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

## Amazon EC2 Auto Scaling
<a name="AutoScaling"></a>

**GroupInServiceCapacity**  
**Dimensions: **AutoScalingGroupName  
**Alarm description: **This alarm helps to detect when the capacity in the group is below the desired capacity required for your workload. To troubleshoot, check your scaling activities for launch failures and confirm that your desired capacity configuration is correct.  
**Intent: **This alarm can detect a low availability in your auto scaling group because of launch failures or suspended launches.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The threshold value should be the minimum capacity required to run your workload. In most cases, you can set this to match the GroupDesiredCapacity metric.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **LESS\_THAN\_THRESHOLD

## AWS Certificate Manager (ACM)
<a name="CertificateManager"></a>

**DaysToExpiry**  
**Dimensions: **CertificateArn  
**Alarm description: **This alarm helps you detect when a certificate managed by or imported into ACM is approaching its expiration date. It helps to prevent unexpected certificate expirations that could lead to service disruptions. When the alarm transitions into ALARM state, you should take immediate action to renew or re-import the certificate. For ACM-managed certificates, see the instructions at [certificate renewal process](https://docs.aws.amazon.com/acm/latest/userguide/troubleshooting-renewal.html). For imported certificates, see the instructions at [re-import process](https://docs.aws.amazon.com/acm/latest/userguide/import-reimport.html).  
**Intent: **This alarm can proactively alert you about upcoming certificate expirations. It provides sufficient advance notice to allow for manual intervention, enabling you to renew or replace certificates before they expire. This helps you maintain the security and availability of TLS-enabled services. When this goes into ALARM, immediately investigate the certificate status and initiate the renewal process if necessary.  
**Statistic: **Minimum  
**Recommended threshold: **44.0  
**Threshold justification: **The 44-day threshold provides a balance between early warning and avoiding false alarms. It allows sufficient time for manual intervention if automatic renewal fails. Adjust this value based on your certificate renewal process and operational response times.  
**Period: **86400  
**Datapoints to alarm: **1  
**Evaluation periods: **1  
**Comparison Operator: **LESS\_THAN\_OR\_EQUAL\_TO\_THRESHOLD
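The check this alarm performs can be sketched locally from a certificate's expiration timestamp; the dates below are made-up examples, not values from this guide:

```python
from datetime import datetime, timezone

def days_to_expiry(not_after, now=None):
    """Days remaining before the certificate's NotAfter timestamp."""
    now = now or datetime.now(timezone.utc)
    return (not_after - now).days

THRESHOLD = 44  # alarm when DaysToExpiry <= 44

now = datetime(2025, 1, 1, tzinfo=timezone.utc)
not_after = datetime(2025, 2, 10, tzinfo=timezone.utc)
remaining = days_to_expiry(not_after, now)
print(remaining, remaining <= THRESHOLD)  # 40 days -> alarm
```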

## Amazon CloudFront
<a name="CloudFront"></a>

**5xxErrorRate**  
**Dimensions: **DistributionId, Region=Global  
**Alarm description: **This alarm monitors the percentage of 5xx error responses from your origin server, to help you detect if the CloudFront service is having issues. See [Troubleshooting error responses from your origin](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/troubleshooting-response-errors.html) for information to help you understand the problems with your server. Also, [turn on additional metrics](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/viewing-cloudfront-metrics.html#monitoring-console.distributions-additional) to get detailed error metrics.  
**Intent: **This alarm is used to detect problems with serving requests from the origin server, or problems with communication between CloudFront and your origin server.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The recommended threshold value for this alarm is highly dependent on the tolerance for 5xx responses. You can analyze historical data and trends, and then set the threshold accordingly. Because 5xx errors can be caused by transient issues, we recommend that you set the threshold to a value greater than 0 so that the alarm is not too sensitive.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**OriginLatency**  
**Dimensions: **DistributionId, Region=Global  
**Alarm description: **The alarm helps to monitor if the origin server is taking too long to respond. If the server takes too long to respond, it might lead to a timeout. Refer to [find and fix delayed responses from applications on your origin server](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/http-504-gateway-timeout.html#http-504-gateway-timeout-slow-application) if you experience consistently high `OriginLatency` values.  
**Intent: **This alarm is used to detect problems with the origin server taking too long to respond.  
**Statistic: **p90  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **You should calculate the value of about 80% of the origin response timeout, and use the result as the threshold value. If this metric is consistently close to the origin response timeout value, you might start experiencing 504 errors.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD
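For example, with CloudFront's default origin response timeout of 30 seconds, 80% works out to 24,000 milliseconds, the unit in which `OriginLatency` is reported. A small sketch (the function name is ours, not part of any API):

```python
def origin_latency_threshold_ms(origin_response_timeout_s):
    """Suggested p90 OriginLatency threshold: roughly 80% of the
    configured origin response timeout, converted to milliseconds."""
    return 0.8 * origin_response_timeout_s * 1000

# CloudFront's default origin response timeout is 30 seconds.
print(origin_latency_threshold_ms(30))  # 24000.0 ms
```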

**FunctionValidationErrors**  
**Dimensions: **DistributionId, FunctionName, Region=Global  
**Alarm description: **This alarm helps you monitor validation errors from CloudFront functions so that you can take steps to resolve them. Analyze the CloudWatch function logs and look at the function code to find and resolve the root cause of the problem. See [Restrictions on edge functions](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/edge-functions-restrictions.html) to understand the common misconfigurations for CloudFront Functions.  
**Intent: **This alarm is used to detect validation errors from CloudFront functions.  
**Statistic: **Sum  
**Recommended threshold: **0.0  
**Threshold justification: **A value greater than 0 indicates a validation error. We recommend setting the threshold to 0 because validation errors imply a problem when CloudFront functions hand off back to CloudFront. For example, CloudFront needs the HTTP Host header in order to process a request. There is nothing stopping a user from deleting the Host header in their CloudFront functions code. But when CloudFront gets the response back and the Host header is missing, CloudFront throws a validation error.  
**Period: **60  
**Datapoints to alarm: **2  
**Evaluation periods: **2  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**FunctionExecutionErrors**  
**Dimensions: **DistributionId, FunctionName, Region=Global  
**Alarm description: **This alarm helps you monitor execution errors from CloudFront functions so that you can take steps to resolve them. Analyze the CloudWatch function logs and look at the function code to find and resolve the root cause of the problem.  
**Intent: **This alarm is used to detect execution errors from CloudFront functions.  
**Statistic: **Sum  
**Recommended threshold: **0.0  
**Threshold justification: **We recommend setting the threshold to 0 because an execution error indicates a problem with the code that occurs at runtime.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD
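As a sketch, the settings above map onto the CloudWatch `PutMetricAlarm` API parameters like this; the alarm name and dimension values are placeholders, and the actual call (for example, `boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs)`) is left to you:

```python
# Parameters for the FunctionExecutionErrors alarm above, shaped for the
# CloudWatch PutMetricAlarm API. Alarm name, distribution ID, and function
# name are placeholders.
alarm_kwargs = {
    "AlarmName": "cloudfront-function-execution-errors",  # hypothetical
    "Namespace": "AWS/CloudFront",
    "MetricName": "FunctionExecutionErrors",
    "Dimensions": [
        {"Name": "DistributionId", "Value": "EDFDVBD6EXAMPLE"},  # placeholder
        {"Name": "FunctionName", "Value": "example-function"},   # placeholder
        {"Name": "Region", "Value": "Global"},
    ],
    "Statistic": "Sum",
    "Period": 60,
    "EvaluationPeriods": 5,
    "DatapointsToAlarm": 5,
    "Threshold": 0.0,
    "ComparisonOperator": "GreaterThanThreshold",
}
print(alarm_kwargs["AlarmName"])
```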

**FunctionThrottles**  
**Dimensions: **DistributionId, FunctionName, Region=Global  
**Alarm description: **This alarm helps you to monitor if your CloudFront function is throttled. If your function is throttled, it means that it is taking too long to execute. To avoid function throttles, consider optimizing the function code.  
**Intent: **This alarm can detect when your CloudFront function is throttled so that you can react and resolve the issue for a smooth customer experience.  
**Statistic: **Sum  
**Recommended threshold: **0.0  
**Threshold justification: **We recommend setting the threshold to 0, to allow quicker resolution of the function throttles.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

## Amazon Cognito
<a name="Cognito"></a>

**SignUpThrottles**  
**Dimensions: **UserPool, UserPoolClient  
**Alarm description: **This alarm monitors the count of throttled requests. If users are consistently getting throttled, you should increase the limit by requesting a service quota increase. Refer to [Quotas in Amazon Cognito](https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html) to learn how to request a quota increase. To take actions proactively, consider tracking the [usage quota](https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html#track-quota-usage).  
**Intent: **This alarm helps to monitor the occurrence of throttled sign-up requests. This can help you know when to take actions to mitigate any degradation in sign-up experience. Sustained throttling of requests is a negative user sign-up experience.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **A well-provisioned user pool should not encounter any throttling which spans across multiple data points. So, a typical threshold for an expected workload should be zero. For an irregular workload with frequent bursts, you can analyze historical data to determine the acceptable throttling for the application workload, and then you can tune the threshold accordingly. A throttled request should be retried to minimize the impact on the application.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**SignInThrottles**  
**Dimensions: **UserPool, UserPoolClient  
**Alarm description: **This alarm monitors the count of throttled user authentication requests. If users are consistently getting throttled, you might need to increase the limit by requesting a service quota increase. Refer to [Quotas in Amazon Cognito](https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html) to learn how to request a quota increase. To take actions proactively, consider tracking the [usage quota](https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html#track-quota-usage).  
**Intent: **This alarm helps to monitor the occurrence of throttled sign-in requests. This can help you know when to take actions to mitigate any degradation in sign-in experience. Sustained throttling of requests is a bad user authentication experience.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **A well-provisioned user pool should not encounter any throttling which spans across multiple data points. So, a typical threshold for an expected workload should be zero. For an irregular workload with frequent bursts, you can analyze historical data to determine the acceptable throttling for the application workload, and then you can tune the threshold accordingly. A throttled request should be retried to minimize the impact on the application.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**TokenRefreshThrottles**  
**Dimensions: **UserPool, UserPoolClient  
**Alarm description: **You can set the threshold value to suit the traffic of the request, as well as to match acceptable throttling for token refresh requests. Throttling is used to protect your system from too many requests. However, it is important to monitor whether you are under provisioned for your normal traffic as well. You can analyze historical data to find the acceptable throttling for the application workload, and then you can tune your alarm threshold to be higher than your acceptable throttling level. Throttled requests should be retried by the application or service because they are transient. Therefore, a very low value for the threshold can cause the alarm to be too sensitive.  
**Intent: **This alarm helps to monitor the occurrence of throttled token refresh requests. This can help you know when to take actions to mitigate any potential problems, to ensure a smooth user experience and the health and reliability of your authentication system. Sustained throttling of requests is a bad user authentication experience.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **You can set or tune the threshold value to suit the traffic of the request, as well as to match acceptable throttling for token refresh requests. Throttling is used to protect your system from too many requests. However, it is important to monitor whether you are under provisioned for your normal traffic as well, and whether that is causing the impact. You can also analyze historical data to find the acceptable throttling for the application workload, and then tune the threshold to be higher than your usual acceptable throttling level. Throttled requests should be retried by the application or service because they are transient. Therefore, a very low value for the threshold can cause the alarm to be too sensitive.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**FederationThrottles**  
**Dimensions: **UserPool, UserPoolClient, IdentityProvider  
**Alarm description: **This alarm monitors the count of throttled identity federation requests. If you consistently see throttling, it might indicate that you need to increase the limit by requesting a service quota increase. Refer to [Quotas in Amazon Cognito](https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html) to learn how to request a quota increase.  
**Intent: **This alarm helps to monitor the occurrence of throttled identity federation requests. This can help you take proactive responses to performance bottlenecks or misconfigurations, and ensure a smooth authentication experience for your users. Sustained throttling of requests is a bad user authentication experience.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **You can set the threshold to suit the traffic of the request, as well as to match the acceptable throttling for identity federation requests. Throttling is used for protecting your system from too many requests. However, it is important to monitor whether you are under provisioned for your normal traffic as well. You can analyze historical data to find the acceptable throttling for the application workload, and then set the threshold to a value above your acceptable throttling level. Throttled requests should be retried by the application or service because they are transient. Therefore, a very low value for the threshold can cause the alarm to be too sensitive.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

## Amazon DynamoDB
<a name="DynamoDB"></a>

**AccountProvisionedReadCapacityUtilization**  
**Dimensions: **None  
**Alarm description: **This alarm detects if the account’s read capacity is reaching its provisioned limit. You can raise the account quota for read capacity utilization if this occurs. You can view your current quotas for read capacity units and request increases using [Service Quotas](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html).  
**Intent: **The alarm can detect if the account’s read capacity utilization is approaching its provisioned read capacity utilization. If the utilization reaches its maximum limit, DynamoDB starts to throttle read requests.  
**Statistic: **Maximum  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold to 80%, so that action (such as raising the account limits) can be taken before it reaches full capacity to avoid throttling.  
**Period: **300  
**Datapoints to alarm: **2  
**Evaluation periods: **2  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD
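The utilization this metric reports can be illustrated as consumed read capacity over the account-level limit; the numbers below are made up for illustration:

```python
def capacity_utilization_pct(consumed_rcu, account_provisioned_limit_rcu):
    """Percent of the account-level provisioned read capacity in use."""
    return 100.0 * consumed_rcu / account_provisioned_limit_rcu

THRESHOLD = 80.0  # alarm well before utilization reaches the account limit

pct = capacity_utilization_pct(68_000, 80_000)
print(pct, pct > THRESHOLD)  # 85.0% -> alarm
```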

**AccountProvisionedWriteCapacityUtilization**  
**Dimensions: **None  
**Alarm description: **This alarm detects if the account’s write capacity is reaching its provisioned limit. You can raise the account quota for write capacity utilization if this occurs. You can view your current quotas for write capacity units and request increases using [Service Quotas](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html).  
**Intent: **This alarm can detect if the account’s write capacity utilization is approaching its provisioned write capacity utilization. If the utilization reaches its maximum limit, DynamoDB starts to throttle write requests.  
**Statistic: **Maximum  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold to 80%, so that the action (such as raising the account limits) can be taken before it reaches full capacity to avoid throttling.  
**Period: **300  
**Datapoints to alarm: **2  
**Evaluation periods: **2  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**AgeOfOldestUnreplicatedRecord**  
**Dimensions: **TableName, DelegatedOperation  
**Alarm description: **This alarm detects the delay in replication to a Kinesis data stream. Under normal operation, `AgeOfOldestUnreplicatedRecord` should be only milliseconds. This number grows based on unsuccessful replication attempts caused by customer-controlled configuration choices. Examples of configuration choices that lead to unsuccessful replication attempts are under-provisioned Kinesis data stream capacity that leads to excessive throttling, or a manual update to the Kinesis data stream's access policies that prevents DynamoDB from adding data to the data stream. To keep this metric as low as possible, you need to ensure the right provisioning of Kinesis data stream capacity and make sure that DynamoDB's permissions are unchanged.  
**Intent: **This alarm can monitor unsuccessful replication attempts and the resulting delay in replication to the Kinesis data stream.  
**Statistic: **Maximum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold according to the desired replication delay measured in milliseconds. This value depends on your workload's requirements and expected performance.  
**Period: **300  
**Datapoints to alarm: **3  
**Evaluation periods: **3  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**FailedToReplicateRecordCount**  
**Dimensions: **TableName, DelegatedOperation  
**Alarm description: **This alarm detects the number of records that DynamoDB failed to replicate to your Kinesis data stream. Certain items larger than 34 KB might expand in size to change data records that are larger than the 1 MB item size limit of Kinesis Data Streams. This size expansion occurs when these larger than 34 KB items include a large number of Boolean or empty attribute values. Boolean and empty attribute values are stored as 1 byte in DynamoDB, but expand up to 5 bytes when they’re serialized using standard JSON for Kinesis Data Streams replication. DynamoDB can’t replicate such change records to your Kinesis data stream. DynamoDB skips these change data records, and automatically continues replicating subsequent records.  
**Intent: **This alarm can monitor the number of records that DynamoDB failed to replicate to your Kinesis data stream because of the item size limit of Kinesis Data Streams.  
**Statistic: **Sum  
**Recommended threshold: **0.0  
**Threshold justification: **Set the threshold to 0 to detect any records that DynamoDB failed to replicate.  
**Period: **60  
**Datapoints to alarm: **1  
**Evaluation periods: **1  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**ReadThrottleEvents**  
**Dimensions: **TableName  
**Alarm description: **This alarm detects if there is a high number of read requests getting throttled for the DynamoDB table. To troubleshoot the issue, see [Troubleshooting throttling issues in Amazon DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TroubleshootingThrottling.html).  
**Intent: **This alarm can detect sustained throttling for read requests to the DynamoDB table. Sustained throttling of read requests can negatively impact your workload read operations and reduce the overall efficiency of the system.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold according to the expected read traffic for the DynamoDB table, accounting for an acceptable level of throttling. It is important to monitor whether you are under provisioned and not causing consistent throttling. You can also analyze historical data to find the acceptable throttling level for the application workload, and then tune the threshold to be higher than your usual throttling level. Throttled requests should be retried by the application or service as they are transient. Therefore, a very low threshold may cause the alarm to be too sensitive, causing unwanted state transitions.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD
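One way to turn historical data into a threshold, as the justification suggests, is to take the worst per-minute throttle count seen during normal operation and add headroom. The helper, headroom factor, and sample data below are all hypothetical:

```python
def throttle_threshold(history, headroom=1.5):
    """Hypothetical baseline: the worst per-minute throttle count seen in
    normal operation, scaled by a headroom factor so that only abnormal
    sustained throttling breaches the alarm."""
    return max(history) * headroom

# Per-minute ReadThrottleEvents sums observed during a normal week (sample).
history = [0, 0, 2, 1, 0, 4, 0, 3, 0, 2]
print(throttle_threshold(history))  # 6.0
```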

**ReadThrottleEvents**  
**Dimensions: **TableName, GlobalSecondaryIndexName  
**Alarm description: **This alarm detects if there are a high number of read requests getting throttled for the Global Secondary Index of the DynamoDB table. To troubleshoot the issue, see [Troubleshooting throttling issues in Amazon DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TroubleshootingThrottling.html).  
**Intent: **The alarm can detect sustained throttling for read requests for the Global Secondary Index of the DynamoDB Table. Sustained throttling of read requests can negatively impact your workload read operations and reduce the overall efficiency of the system.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold according to the expected read traffic for the DynamoDB table, accounting for an acceptable level of throttling. It is important to monitor if you are under provisioned and not causing consistent throttling. You can also analyze historical data to find an acceptable throttling level for the application workload, and then tune the threshold to be higher than your usual acceptable throttling level. Throttled requests should be retried by the application or service as they are transient. Therefore, a very low threshold may cause the alarm to be too sensitive, causing unwanted state transitions.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**ReplicationLatency**  
**Dimensions: **TableName, ReceivingRegion  
**Alarm description: **The alarm detects if the replica in a Region for the global table is lagging behind the source Region. The latency can increase if an AWS Region becomes degraded and you have a replica table in that Region. In this case, you can temporarily redirect your application's read and write activity to a different AWS Region. If you are using version 2017.11.29 (Legacy) of global tables, you should verify that write capacity units (WCUs) are identical for each of the replica tables. You can also make sure to follow recommendations in [Best practices and requirements for managing capacity](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/globaltables_reqs_bestpractices.html#globaltables_reqs_bestpractices.tables).  
**Intent: **The alarm can detect if the replica table in a Region is falling behind replicating the changes from another Region. This could cause your replica to diverge from the other replicas. It’s useful to know the replication latency of each AWS Region and alert if that replication latency increases continually. The replication of the table applies to global tables only.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The recommended threshold value for this alarm is highly dependent on your use case. Replication latencies longer than 3 minutes are generally a cause for investigation. Review the criticality and requirements of replication delay and analyze historical trends, and then select the threshold accordingly.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**SuccessfulRequestLatency**  
**Dimensions: **TableName, Operation  
**Alarm description: **This alarm detects a high latency for the DynamoDB table operation (indicated by the value of the `Operation` dimension in the alarm). See [this troubleshooting document](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TroubleshootingLatency.html) for troubleshooting latency issues in Amazon DynamoDB.  
**Intent: **This alarm can detect a high latency for the DynamoDB table operation. Higher latency for the operations can negatively impact the overall efficiency of the system.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **DynamoDB provides single-digit millisecond latency on average for singleton operations such as GetItem, PutItem, and so on. However, you can set the threshold based on acceptable tolerance for the latency for the type of operation and table involved in the workload. You can analyze historical data of this metric to find the usual latency for the table operation, and then set the threshold to a number which represents critical delay for the operation.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD
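One hedged way to derive such a threshold from historical latency data is a mean-plus-standard-deviations rule of thumb; the helper name and sample values below are made up:

```python
import statistics

def latency_threshold_ms(samples, sigmas=3.0):
    """Hypothetical rule of thumb: mean of historical per-minute average
    latencies plus a few standard deviations marks a critical delay."""
    return statistics.mean(samples) + sigmas * statistics.stdev(samples)

# Historical GetItem average latencies in milliseconds (sample data).
samples = [4.0, 5.0, 4.5, 5.5, 4.0, 5.0]
print(round(latency_threshold_ms(samples), 2))
```

Whatever rule you use, sanity-check the result against the single-digit-millisecond latencies DynamoDB normally delivers for singleton operations.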

**SystemErrors**  
**Dimensions: **TableName  
**Alarm description: **This alarm detects a sustained high number of system errors for the DynamoDB table requests. If you continue to get 5xx errors, open the [AWS Service Health Dashboard](https://status.aws.amazon.com/) to check for operational issues with the service. You can use this alarm to get notified in case there is a prolonged internal service issue from DynamoDB, and it helps you correlate with the issue your client application is facing. Refer to [Error handling for DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html#Programming.Errors.MessagesAndCodes.http5xx) for more information.  
**Intent: **This alarm can detect sustained system errors for the DynamoDB table requests. System errors indicate internal service errors from DynamoDB, and help you correlate them with the issue that the client is having.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold according to the expected traffic, accounting for an acceptable level of system errors. You can also analyze historical data to find the acceptable error count for the application workload, and then tune the threshold accordingly. System errors should be retried by the application/service as they are transient. Therefore, a very low threshold might cause the alarm to be too sensitive, causing unwanted state transitions.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**ThrottledPutRecordCount**  
**Dimensions: **TableName, DelegatedOperation  
**Alarm description: **This alarm detects the records getting throttled by your Kinesis data stream during the replication of change data capture to Kinesis. This throttling happens because of insufficient Kinesis data stream capacity. If you experience excessive and regular throttling, you might need to increase the number of Kinesis stream shards proportionally to the observed write throughput of your table. To learn more about determining the size of a Kinesis data stream, see [Determining the Initial Size of a Kinesis Data Stream](https://docs.aws.amazon.com/streams/latest/dev/amazon-kinesis-streams.html#how-do-i-size-a-stream).  
**Intent: **This alarm can monitor the number of records that were throttled by your Kinesis data stream because of insufficient Kinesis data stream capacity.  
**Statistic: **Maximum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **You might experience some throttling during exceptional usage peaks, but throttled records should remain as low as possible to avoid higher replication latency (DynamoDB retries sending throttled records to the Kinesis data stream). Set the threshold to a number which can help you catch regular excessive throttling. You can also analyze historical data of this metric to find the acceptable throttling rates for the application workload. Tune the threshold to a value that the application can tolerate based on your use case.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**UserErrors**  
**Dimensions: **None  
**Alarm description: **This alarm detects a sustained high number of user errors for the DynamoDB table requests. You can check client application logs during the issue time frame to see why the requests are invalid. You can check [HTTP status code 400](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html#Programming.Errors.MessagesAndCodes.http400) to see the type of error you are getting and take action accordingly. You might have to fix the application logic to create valid requests.  
**Intent: **This alarm can detect sustained user errors for the DynamoDB table requests. User errors for requested operations mean that the client is producing invalid requests that are failing.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold to zero to detect any client-side errors. Or you can set it to a higher value if you want to avoid the alarm triggering for a very low number of errors. Decide based on your use case and traffic for the requests.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **GREATER_THAN_THRESHOLD
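
As a sketch only, the settings above can be expressed as `PutMetricAlarm` parameters, for example with boto3's `cloudwatch.put_metric_alarm(**params)`. The alarm name and the SNS topic parameter are illustrative assumptions; the threshold of 0 follows the justification above.

```python
# Sketch: builds PutMetricAlarm parameters for the UserErrors alarm described
# above. Pass the result to boto3's cloudwatch.put_metric_alarm(**params).
# The alarm name and SNS topic ARN are hypothetical placeholders.

def user_errors_alarm_params(threshold=0, sns_topic_arn=None):
    """DynamoDB UserErrors alarm: Sum over 60s periods, 10 of 10 datapoints."""
    params = {
        "AlarmName": "dynamodb-user-errors",   # hypothetical name
        "Namespace": "AWS/DynamoDB",
        "MetricName": "UserErrors",            # account-level metric, no dimensions
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 10,
        "DatapointsToAlarm": 10,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params

params = user_errors_alarm_params()
```
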

**WriteThrottleEvents**  
**Dimensions: **TableName  
**Alarm description: **This alarm detects whether a high number of write requests are being throttled for the DynamoDB table. See [Troubleshooting throttling issues in Amazon DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TroubleshootingThrottling.html) to troubleshoot the issue.  
**Intent: **This alarm can detect sustained throttling for write requests to the DynamoDB table. Sustained throttling of write requests can negatively impact your workload write operations and reduce the overall efficiency of the system.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold according to the expected write traffic for the DynamoDB table, accounting for an acceptable level of throttling. It is important to monitor whether your table is under-provisioned and causing consistent throttling. You can also analyze historical data to find the acceptable level of throttling for the application workload, and then tune the threshold to a value higher than your usual acceptable throttling level. Throttled requests should be retried by the application or service because they are transient. Therefore, a very low threshold might cause the alarm to be too sensitive, causing unwanted state transitions.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**WriteThrottleEvents**  
**Dimensions: **TableName, GlobalSecondaryIndexName  
**Alarm description: **This alarm detects whether a high number of write requests are being throttled for the global secondary index of the DynamoDB table. See [Troubleshooting throttling issues in Amazon DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/TroubleshootingThrottling.html) to troubleshoot the issue.  
**Intent: **This alarm can detect sustained throttling for write requests for the global secondary index of the DynamoDB table. Sustained throttling of write requests can negatively impact your workload write operations and reduce the overall efficiency of the system.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold according to the expected write traffic for the DynamoDB table, accounting for an acceptable level of throttling. It is important to monitor whether your table is under-provisioned and causing consistent throttling. You can also analyze historical data to find the acceptable throttling level for the application workload, and then tune the threshold to a value higher than your usual acceptable throttling level. Throttled requests should be retried by the application or service because they are transient. Therefore, a very low value might cause the alarm to be too sensitive, causing unwanted state transitions.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

## Amazon EBS
<a name="Recommended_EBS"></a>

**VolumeStalledIOCheck**  
**Dimensions: **VolumeId, InstanceId  
**Alarm description: **This alarm helps you monitor the I/O performance of your Amazon EBS volumes. This check detects underlying issues with the Amazon EBS infrastructure, such as hardware or software issues on the storage subsystems underlying the Amazon EBS volumes, hardware issues on the physical host that impact the reachability of the Amazon EBS volumes from your Amazon EC2 instance, and connectivity issues between the instance and the Amazon EBS volumes. If the stalled I/O check fails, you can either wait for AWS to resolve the issue, or you can take action such as replacing the affected volume or stopping and restarting the instance to which the volume is attached. In most cases, when this metric fails, Amazon EBS automatically diagnoses and recovers your volume within a few minutes.  
**Intent: **This alarm can detect the status of your Amazon EBS volumes to determine when these volumes are impaired and cannot complete I/O operations.  
**Statistic: **Maximum  
**Recommended threshold: **1.0  
**Threshold justification: **When a status check fails, the value of this metric is 1. The threshold is set so that whenever the status check fails, the alarm is in ALARM state.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **GREATER_THAN_OR_EQUAL_TO_THRESHOLD
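
As a sketch only, the alarm above could be created with boto3's `cloudwatch.put_metric_alarm(**params)` using parameters like the following. The volume ID, instance ID, and alarm naming scheme are hypothetical placeholders.

```python
# Sketch: PutMetricAlarm parameters for the VolumeStalledIOCheck alarm above,
# with the two recommended dimensions. The IDs below are placeholders.

def stalled_io_alarm_params(volume_id, instance_id):
    """EBS VolumeStalledIOCheck: Maximum over 60s, 10 of 10 datapoints >= 1."""
    return {
        "AlarmName": f"ebs-stalled-io-{volume_id}",   # hypothetical naming scheme
        "Namespace": "AWS/EBS",
        "MetricName": "VolumeStalledIOCheck",
        "Dimensions": [
            {"Name": "VolumeId", "Value": volume_id},
            {"Name": "InstanceId", "Value": instance_id},
        ],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 10,
        "DatapointsToAlarm": 10,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }

params = stalled_io_alarm_params("vol-0123456789abcdef0", "i-0123456789abcdef0")
```
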

## Amazon EC2
<a name="EC2"></a>

**CPUUtilization**  
**Dimensions: **InstanceId  
**Alarm description: **This alarm helps to monitor the CPU utilization of an EC2 instance. Depending on the application, consistently high utilization levels might be normal. But if performance is degraded, and the application is not constrained by disk I/O, memory, or network resources, then a maxed-out CPU might indicate a resource bottleneck or application performance problems. High CPU utilization might indicate that an upgrade to a more CPU intensive instance is required. If detailed monitoring is enabled, you can change the period to 60 seconds instead of 300 seconds. For more information, see [Enable or turn off detailed monitoring for your instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-cloudwatch-new.html).  
**Intent: **This alarm is used to detect high CPU utilization.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Typically, you can set the threshold for CPU utilization to 70-80%. However, you can adjust this value based on your acceptable performance level and workload characteristics. For some systems, consistently high CPU utilization may be normal and not indicate a problem, while for others, it may be cause for concern. Analyze historical CPU utilization data to identify the usage, find what CPU utilization is acceptable for your system, and set the threshold accordingly.  
**Period: **300  
**Datapoints to alarm: **3  
**Evaluation periods: **3  
**Comparison Operator: **GREATER_THAN_THRESHOLD
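
The description above notes that the period can drop from 300 seconds to 60 seconds when detailed monitoring is enabled. As a sketch only, that choice can be parameterized; the alarm name is a hypothetical placeholder and the 80% threshold follows the recommendation above.

```python
# Sketch: basic monitoring publishes EC2 metrics at 5-minute granularity,
# detailed monitoring at 1-minute. Pick the alarm period accordingly and
# build PutMetricAlarm parameters (for boto3's put_metric_alarm).

def cpu_alarm_period(detailed_monitoring: bool) -> int:
    """Return 60s when detailed monitoring is on, else 300s."""
    return 60 if detailed_monitoring else 300

def cpu_utilization_alarm_params(instance_id, detailed_monitoring=False):
    return {
        "AlarmName": f"ec2-high-cpu-{instance_id}",   # hypothetical name
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": cpu_alarm_period(detailed_monitoring),
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
    }

params = cpu_utilization_alarm_params("i-0123456789abcdef0")
```
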

**StatusCheckFailed**  
**Dimensions: **InstanceId  
**Alarm description: **This alarm helps to monitor both system status checks and instance status checks. If either type of status check fails, then this alarm should be in ALARM state.  
**Intent: **This alarm is used to detect the underlying problems with instances, including both system status check failures and instance status check failures.  
**Statistic: **Maximum  
**Recommended threshold: **1.0  
**Threshold justification: **When a status check fails, the value of this metric is 1. The threshold is set so that whenever the status check fails, the alarm is in ALARM state.  
**Period: **300  
**Datapoints to alarm: **2  
**Evaluation periods: **2  
**Comparison Operator: **GREATER_THAN_OR_EQUAL_TO_THRESHOLD

**StatusCheckFailed_AttachedEBS**  
**Dimensions: **InstanceId  
**Alarm description: **This alarm helps you monitor whether the Amazon EBS volumes attached to an instance are reachable and able to complete I/O operations. This status check detects underlying issues with the compute or Amazon EBS infrastructure such as the following:  
+ Hardware or software issues on the storage subsystems underlying the Amazon EBS volumes
+ Hardware issues on the physical host that impact reachability of the Amazon EBS volumes
+ Connectivity issues between the instance and Amazon EBS volumes
When the attached EBS status check fails, you can either wait for Amazon to resolve the issue, or you can take an action such as replacing the affected volumes or stopping and restarting the instance.  
**Intent: **This alarm is used to detect unreachable Amazon EBS volumes attached to an instance. These can cause failures in I/O operations.  
**Statistic: **Maximum  
**Recommended threshold: **1.0  
**Threshold justification: **When a status check fails, the value of this metric is 1. The threshold is set so that whenever the status check fails, the alarm is in ALARM state.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **GREATER_THAN_OR_EQUAL_TO_THRESHOLD

## Amazon ElastiCache
<a name="ElastiCache"></a>

**CPUUtilization**  
**Dimensions: **CacheClusterId, CacheNodeId  
**Alarm description: **This alarm helps to monitor the CPU utilization for the entire ElastiCache instance, including the database engine processes and other processes running on the instance. ElastiCache supports two engine types: Memcached and Redis OSS. When you reach high CPU utilization on a Memcached node, you should consider scaling up your instance type or adding new cache nodes. For Redis OSS, if your main workload is from read requests, you should consider adding more read replicas to your cache cluster. If your main workload is from write requests, you should consider adding more shards to distribute the workload across more primary nodes if you’re running in clustered mode, or scaling up your instance type if you’re running Redis OSS in non-clustered mode.  
**Intent: **This alarm is used to detect high CPU utilization of ElastiCache hosts. It is useful to get a broad view of the CPU usage across the entire instance, including non-engine processes.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold to the percentage that reflects a critical CPU utilization level for your application. For Memcached, the engine can use up to num_threads cores. For Redis OSS, the engine is largely single-threaded, but might use additional cores if available to accelerate I/O. In most cases, you can set the threshold to about 90% of your available CPU. Because Redis OSS is single-threaded, the actual threshold value should be calculated as a fraction of the node's total capacity.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD
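
The justification above says that for largely single-threaded Redis OSS, the host-level threshold should be a fraction of the node's total capacity. As a worked sketch, assuming the engine saturates roughly one core, 90% of a single core on an n-vCPU node corresponds to 90/n percent of host CPU:

```python
# Sketch: scale a per-core engine CPU target down to a host-level
# CPUUtilization threshold, assuming Redis OSS uses roughly one core.

def redis_host_cpu_threshold(vcpus: int, engine_target_pct: float = 90.0) -> float:
    """Host-level CPUUtilization threshold equivalent to engine_target_pct of one core."""
    return engine_target_pct / vcpus

# Example: on a 2-vCPU node, 90% of one core is 45% host CPU.
threshold = redis_host_cpu_threshold(2)
```

For Memcached nodes, which can use multiple cores, the host-level threshold would not need this scaling.
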

**CurrConnections**  
**Dimensions: **CacheClusterId, CacheNodeId  
**Alarm description: **This alarm detects high connection count, which might indicate heavy load or performance issues. A constant increase of `CurrConnections` might lead to exhaustion of the 65,000 available connections. It might indicate that connections were improperly closed on the application side and left established on the server side. You should consider using connection pooling or idle connection timeouts to limit the number of connections made to the cluster, or for Redis OSS, consider tuning [tcp-keepalive](https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/ParameterGroups.Redis.html) on your cluster to detect and terminate potential dead peers.  
**Intent: **The alarm helps you identify high connection counts that could impact the performance and stability of your ElastiCache cluster.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The recommended threshold value for this alarm is highly dependent on the acceptable range of connections for your cluster. Review the capacity and the expected workload of your ElastiCache cluster and analyze the historical connection counts during regular usage to establish a baseline, and then select a threshold accordingly. Remember that each node can support up to 65,000 concurrent connections.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**DatabaseMemoryUsagePercentage**  
**Dimensions: **CacheClusterId  
**Alarm description: **This alarm helps you monitor the memory utilization of your cluster. When your `DatabaseMemoryUsagePercentage` reaches 100%, the Redis OSS maxmemory policy is triggered and evictions might occur based on the policy selected. If no object in the cache matches the eviction policy, write operations fail. Some workloads expect or rely on evictions, but if not, you will need to increase the memory capacity of your cluster. You can scale your cluster out by adding more primary nodes, or scale it up by using a larger node type. Refer to [Scaling ElastiCache for Redis OSS clusters](https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Scaling.html) for details.  
**Intent: **This alarm is used to detect high memory utilization of your cluster so that you can avoid failures when writing to your cluster. It is useful to know when you’ll need to scale up your cluster if your application does not expect to experience evictions.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Depending on your application’s memory requirements and the memory capacity of your ElastiCache cluster, you should set the threshold to the percentage that reflects the critical level of memory usage of the cluster. You can use historical memory usage data as reference for acceptable memory usage threshold.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**EngineCPUUtilization**  
**Dimensions: **CacheClusterId  
**Alarm description: **This alarm helps to monitor the CPU utilization of a Redis OSS engine thread within the ElastiCache instance. Common reasons for high engine CPU are long-running commands that consume high CPU, a high number of requests, an increase of new client connection requests in a short time period, and high evictions when the cache doesn’t have enough memory to hold new data. You should consider [Scaling ElastiCache for Redis OSS clusters](https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Scaling.html) by adding more nodes or scaling up your instance type.  
**Intent: **This alarm is used to detect high CPU utilization of the Redis OSS engine thread. It is useful if you want to monitor the CPU usage of the database engine itself.  
**Statistic: **Average  
**Recommended threshold: **90.0  
**Threshold justification: **Set the threshold to a percentage that reflects the critical engine CPU utilization level for your application. You can benchmark your cluster using your application and expected workload to correlate EngineCPUUtilization and performance as a reference, and then set the threshold accordingly. In most cases, you can set the threshold to about 90% of your available CPU.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**ReplicationLag**  
**Dimensions: **CacheClusterId  
**Alarm description: **This alarm helps to monitor the replication health of your ElastiCache cluster. A high replication lag means that the primary node or the replica can’t keep up with the pace of replication. If your write activity is too high, consider scaling your cluster out by adding more primary nodes, or scaling it up by using a larger node type. Refer to [Scaling ElastiCache for Redis OSS clusters](https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Scaling.html) for details. If your read replicas are overloaded by the amount of read requests, consider adding more read replicas.  
**Intent: **This alarm is used to detect a delay between data updates on the primary node and their synchronization to the replica node. It helps to ensure data consistency of a read replica cluster node.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold according to your application's requirements and the potential impact of replication lag. You should consider your application's expected write rates and network conditions for the acceptable replication lag.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER_THAN_THRESHOLD

## Amazon ECS
<a name="ECS"></a>

The following are the recommended alarms for Amazon ECS.

**CPUReservation**  
**Dimensions: **ClusterName  
**Alarm description: **This alarm helps you detect a high CPU reservation of the ECS cluster. High CPU reservation might indicate that the cluster is running out of registered CPUs for the task. To troubleshoot, you can add more capacity, scale the cluster, or set up auto scaling.  
**Intent: **The alarm is used to detect whether the total number of CPU units reserved by tasks on the cluster is reaching the total CPU units registered for the cluster. This helps you know when to scale up the cluster. Reaching the total CPU units for the cluster can result in running out of CPU for tasks. If you have managed scaling turned on for EC2 capacity providers, or you have associated Fargate capacity providers with the cluster, then this alarm is not recommended.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold for CPU reservation to 80%. Alternatively, you can choose a lower value based on cluster characteristics.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**CPUUtilization**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect a high CPU utilization of the ECS service. If there is no ongoing ECS deployment, a maxed-out CPU utilization might indicate a resource bottleneck or application performance problems. To troubleshoot, you can increase the CPU limit.  
**Intent: **This alarm is used to detect high CPU utilization for the Amazon ECS service. Consistent high CPU utilization can indicate a resource bottleneck or application performance problems.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **The service metrics for CPU utilization might exceed 100% utilization. However, we recommend that you monitor the metric for high CPU utilization to avoid impacting other services. Set the threshold to about 80-85%. We recommend that you update your task definitions to reflect actual usage to prevent future issues with other services.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**EBSFilesystemUtilization**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect high storage utilization of the Amazon EBS volume attached to Amazon ECS tasks. If the utilization of the Amazon EBS volume is consistently high, you can check the usage and increase the volume size for new tasks.  
**Intent: **This alarm is used to detect high storage utilization of the Amazon EBS volumes attached to Amazon ECS tasks. Consistently high storage utilization can indicate that the Amazon EBS volume is full and it might lead to failure of the container.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **You can set the threshold for Amazon EBS file system utilization to about 80%. You can adjust this value based on the acceptable storage utilization. For a read-only snapshot volume, a high utilization might indicate that the volume is right sized. For an active data volume, high storage utilization might indicate that the application is writing a large amount of data which may cause the container to fail if there is not enough capacity.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**MemoryReservation**  
**Dimensions: **ClusterName  
**Alarm description: **This alarm helps you detect a high memory reservation of the Amazon ECS cluster. High memory reservation might indicate a resource bottleneck for the cluster. To troubleshoot, analyze the service task for performance to see if memory utilization of the task can be optimized. Also, you can register more memory or set up auto scaling.  
**Intent: **The alarm is used to detect whether the total memory units reserved by tasks on the cluster is reaching the total memory units registered for the cluster. This can help you know when to scale up the cluster. Reaching the total memory units for the cluster can cause the cluster to be unable to launch new tasks. If you have managed scaling turned on for EC2 capacity providers, or you have associated Fargate capacity providers with the cluster, this alarm is not recommended.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold for memory reservation to 80%. You can adjust this to a lower value based on cluster characteristics.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**MemoryUtilization**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect a high memory utilization of the Amazon ECS service. If there is no ongoing Amazon ECS deployment, a maxed-out memory utilization might indicate a resource bottleneck or application performance problems. To troubleshoot, you can increase the memory limit.  
**Intent: **This alarm is used to detect high memory utilization for the Amazon ECS service. Consistent high memory utilization can indicate a resource bottleneck or application performance problems.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **The service metrics for memory utilization might exceed 100% utilization. However, we recommend that you monitor the metric for high memory utilization to avoid impacting other services. Set the threshold to about 80%. We recommend that you update your task definitions to reflect actual usage to prevent future issues with other services.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**HTTPCode_Target_5XX_Count**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect a high server-side error count for the ECS service. This can indicate that there are errors that cause the server to be unable to serve requests. To troubleshoot, check your application logs.  
**Intent: **This alarm is used to detect a high server-side error count for the ECS service.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Calculate the value of about 5% of your average traffic and use this value as a starting point for the threshold. You can find the average traffic by using the `RequestCount` metric. You can also analyze historical data to determine the acceptable error rate for the application workload, and then tune the threshold accordingly. Frequently occurring 5XX errors need to be alarmed on. However, setting a very low value for the threshold can cause the alarm to be too sensitive.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD
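
The justification above suggests starting with about 5% of average traffic, measured via the `RequestCount` metric. As a worked sketch:

```python
# Sketch: compute a starting-point 5XX threshold as a fraction of the
# average requests per alarm period (from the RequestCount metric).

def starting_5xx_threshold(avg_request_count_per_period: float,
                           error_rate: float = 0.05) -> float:
    """Starting threshold: error_rate times the average requests per period."""
    return avg_request_count_per_period * error_rate

# Example: ~1,200 requests per 60-second period gives a starting threshold of 60.
threshold = starting_5xx_threshold(1200)
```

From there, tune the value against historical error rates as the justification describes.
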

**TargetResponseTime**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect a high target response time for ECS service requests. This can indicate that there are problems that cause the service to be unable to serve requests in time. To troubleshoot, check the CPUUtilization metric to see if the service is running out of CPU, or check the CPU utilization of other downstream services that your service depends on.  
**Intent: **This alarm is used to detect a high target response time for ECS service requests.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The recommended threshold value for this alarm is highly dependent on your use case. Review the criticality and requirements of the target response time of the service and analyze the historical behavior of this metric to determine sensible threshold levels.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

## Amazon ECS with Container Insights
<a name="ECS-ContainerInsights"></a>

The following are the recommended alarms for Amazon ECS with Container Insights.

**EphemeralStorageUtilized**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect high ephemeral storage utilization of the Fargate cluster. If ephemeral storage utilization is consistently high, you can check ephemeral storage usage and increase the ephemeral storage.  
**Intent: **This alarm is used to detect high ephemeral storage usage for the Fargate cluster. Consistently high ephemeral storage utilization can indicate that the disk is full and might lead to failure of the container.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold to about 90% of the ephemeral storage size. You can adjust this value based on your acceptable ephemeral storage utilization of the Fargate cluster. For some systems, consistently high ephemeral storage utilization might be normal, while for others, it might lead to failure of the container.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**RunningTaskCount**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect a low running task count of the Amazon ECS service. If the running task count is too low, it can indicate that the application can’t handle the service load and it might lead to performance issues. If there is no running task, the Amazon ECS service might be unavailable or there might be deployment issues.  
**Intent: **This alarm is used to detect whether the number of running tasks is too low. A consistently low running task count can indicate Amazon ECS service deployment or performance issues.  
**Statistic: **Average  
**Recommended threshold: **0.0  
**Threshold justification: **You can adjust the threshold based on the minimum running task count of the Amazon ECS service. If the running task count is 0, the Amazon ECS service will be unavailable.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **LESS_THAN_OR_EQUAL_TO_THRESHOLD
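
As a sketch only, the alarm above could be created with boto3's `cloudwatch.put_metric_alarm(**params)` using parameters like the following. Container Insights publishes this metric in the `ECS/ContainerInsights` namespace; the cluster name, service name, and alarm naming scheme are hypothetical placeholders.

```python
# Sketch: PutMetricAlarm parameters for the RunningTaskCount alarm above.
# Note the LessThanOrEqualToThreshold comparison with a threshold of 0.

def running_task_count_alarm_params(cluster, service, threshold=0.0):
    return {
        "AlarmName": f"ecs-no-running-tasks-{service}",   # hypothetical name
        "Namespace": "ECS/ContainerInsights",
        "MetricName": "RunningTaskCount",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 5,
        "Threshold": float(threshold),
        "ComparisonOperator": "LessThanOrEqualToThreshold",
    }

params = running_task_count_alarm_params("my-cluster", "my-service")
```
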

**TaskCpuUtilization**  
**Dimensions: **ClusterName  
**Alarm description: **This alarm helps you detect high CPU utilization of tasks in your Amazon ECS cluster. If task CPU utilization is consistently high, you might need to optimize your tasks or increase their CPU reservation.  
**Intent: **This alarm is used to detect high CPU utilization for tasks in the Amazon ECS cluster. Consistent high CPU utilization can indicate that the tasks are under stress and might need more CPU resources or optimization to maintain performance.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold to about 80% of the task's CPU reservation. You can adjust this value based on your acceptable CPU utilization for the tasks. For some workloads, consistently high CPU utilization might be normal, while for others, it might indicate performance issues or the need for more resources.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**TaskCpuUtilization**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect high CPU utilization of tasks belonging to the Amazon ECS service. If task CPU utilization is consistently high, you might need to optimize your tasks or increase their CPU reservation.  
**Intent: **This alarm is used to detect high CPU utilization for tasks belonging to the Amazon ECS service. Consistent high CPU utilization can indicate that the tasks are under stress and might need more CPU resources or optimization to maintain performance.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold to about 80% of the task's CPU reservation. You can adjust this value based on your acceptable CPU utilization for the tasks. For some workloads, consistently high CPU utilization might be normal, while for others, it might indicate performance issues or the need for more resources.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**ContainerCpuUtilization**  
**Dimensions: **ClusterName  
**Alarm description: **This alarm monitors the percentage of CPU units used by containers in your Amazon ECS cluster relative to their reserved CPU. It helps detect when containers are approaching their CPU limits based on the `ContainerCpuUtilized/ContainerCpuReserved` ratio.  
**Intent: **This alarm detects when containers in the Amazon ECS cluster are using a high percentage of their reserved CPU capacity, calculated as `ContainerCpuUtilized/ContainerCpuReserved`. Sustained high values indicate containers are operating near their CPU limits and might need capacity adjustments.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold to about 80% of the container's CPU utilization ratio. This provides an early warning when containers are approaching their CPU capacity limits while allowing for normal fluctuations in CPU usage. The threshold can be adjusted based on your workload characteristics and performance requirements.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD
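
The intent above defines the metric as the ratio `ContainerCpuUtilized/ContainerCpuReserved`. As a worked sketch of the percentage that is compared against the 80% threshold:

```python
# Sketch: the percentage of reserved CPU units actually used by a container,
# i.e. ContainerCpuUtilized / ContainerCpuReserved expressed as a percentage.

def container_cpu_utilization_pct(cpu_utilized: float, cpu_reserved: float) -> float:
    """Percentage of reserved CPU units used by the container."""
    return (cpu_utilized / cpu_reserved) * 100.0

# Example: 400 of 500 reserved CPU units used is 80%, right at the threshold.
pct = container_cpu_utilization_pct(400, 500)
```
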

**ContainerCpuUtilization**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm monitors the percentage of CPU units used by containers belonging to the Amazon ECS service relative to their reserved CPU. It helps detect when containers are approaching their CPU limits based on the `ContainerCpuUtilized/ContainerCpuReserved` ratio.  
**Intent: **This alarm detects when containers belonging to the Amazon ECS service are using a high percentage of their reserved CPU capacity, calculated as `ContainerCpuUtilized/ContainerCpuReserved`. Sustained high values indicate containers are operating near their CPU limits and might need capacity adjustments.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold to about 80% of the container's CPU utilization ratio. This provides an early warning when containers are approaching their CPU capacity limits while allowing for normal fluctuations in CPU usage. The threshold can be adjusted based on your workload characteristics and performance requirements.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**TaskEphemeralStorageUtilization**  
**Dimensions: **ClusterName  
**Alarm description: **This alarm helps you detect high ephemeral storage utilization of tasks in your Amazon ECS cluster. If storage utilization is consistently high, you might need to optimize your storage usage or increase the storage reservation.  
**Intent: **This alarm is used to detect high ephemeral storage utilization for tasks in the Amazon ECS cluster. Consistent high storage utilization can indicate that the task is running out of disk space and might need more storage resources or optimization to maintain proper operation.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold to about 80% of the task's ephemeral storage reservation. You can adjust this value based on your acceptable storage utilization for the tasks. For some workloads, consistently high storage utilization might be normal, while for others, it might indicate potential disk space issues or the need for more storage.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**TaskEphemeralStorageUtilization**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect high ephemeral storage utilization of tasks belonging to the Amazon ECS service. If storage utilization is consistently high, you might need to optimize your storage usage or increase the storage reservation.  
**Intent: **This alarm is used to detect high ephemeral storage utilization for tasks belonging to the Amazon ECS service. Consistent high storage utilization can indicate that the task is running out of disk space and might need more storage resources or optimization to maintain proper operation.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold to about 80% of the task's ephemeral storage reservation. You can adjust this value based on your acceptable storage utilization for the tasks. For some workloads, consistently high storage utilization might be normal, while for others, it might indicate potential disk space issues or the need for more storage.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**TaskMemoryUtilization**  
**Dimensions: **ClusterName  
**Alarm description: **This alarm helps you detect high memory utilization of tasks in your Amazon ECS cluster. If memory utilization is consistently high, you might need to optimize your tasks or increase the memory reservation.  
**Intent: **This alarm is used to detect high memory utilization for tasks in the Amazon ECS cluster. Consistent high memory utilization can indicate that the task is under memory pressure and might need more memory resources or optimization to maintain stability.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold to about 80% of the task's memory reservation. You can adjust this value based on your acceptable memory utilization for the tasks. For some workloads, consistently high memory utilization might be normal, while for others, it might indicate memory pressure or the need for more resources.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**TaskMemoryUtilization**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect high memory utilization of tasks belonging to the Amazon ECS service. If memory utilization is consistently high, you might need to optimize your tasks or increase the memory reservation.  
**Intent: **This alarm is used to detect high memory utilization for tasks belonging to the Amazon ECS service. Consistent high memory utilization can indicate that the task is under memory pressure and might need more memory resources or optimization to maintain stability.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold to about 80% of the task's memory reservation. You can adjust this value based on your acceptable memory utilization for the tasks. For some workloads, consistently high memory utilization might be normal, while for others, it might indicate memory pressure or the need for more resources.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**ContainerMemoryUtilization**  
**Dimensions: **ClusterName  
**Alarm description: **This alarm helps you detect high memory utilization of containers in your Amazon ECS cluster. If memory utilization is consistently high, you might need to optimize your containers or increase the memory reservation.  
**Intent: **This alarm is used to detect high memory utilization for containers in the Amazon ECS cluster. Consistent high memory utilization can indicate that the container is under memory pressure and might need more memory resources or optimization to maintain stability.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold to about 80% of the container's memory reservation. You can adjust this value based on your acceptable memory utilization for the containers. For some workloads, consistently high memory utilization might be normal, while for others, it might indicate memory pressure or the need for more resources.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**ContainerMemoryUtilization**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect high memory utilization of containers belonging to the Amazon ECS service. If memory utilization is consistently high, you might need to optimize your containers or increase the memory reservation.  
**Intent: **This alarm is used to detect high memory utilization for containers belonging to the Amazon ECS service. Consistent high memory utilization can indicate that the container is under memory pressure and might need more memory resources or optimization to maintain stability.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold to about 80% of the container's memory reservation. You can adjust this value based on your acceptable memory utilization for the containers. For some workloads, consistently high memory utilization might be normal, while for others, it might indicate memory pressure or the need for more resources.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**instance\_filesystem\_utilization**  
**Dimensions: **InstanceId, ContainerInstanceId, ClusterName  
**Alarm description: **This alarm helps you detect high file system utilization of the Amazon ECS cluster. If the file system utilization is consistently high, check the disk usage.  
**Intent: **This alarm is used to detect high file system utilization for the Amazon ECS cluster. Consistently high file system utilization can indicate a resource bottleneck or application performance problems, and it might prevent new tasks from running.  
**Statistic: **Average  
**Recommended threshold: **90.0  
**Threshold justification: **You can set the threshold for file system utilization to about 90-95%. You can adjust this value based on the acceptable file system capacity level of the Amazon ECS cluster. For some systems, a consistently high file system utilization might be normal and not indicate a problem, while for others, it might be a cause of concern and might lead to performance issues and prevent running new tasks.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD
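The cluster-level alarms above all share the same cadence: a 60-second period with 5 of 5 datapoints breaching. A small helper can stamp out the repeated parameter sets; the metric names and thresholds come from this section, while the `ECS/ContainerInsights` namespace and the alarm-naming scheme are assumptions for illustration:

```python
# Shared settings for the cluster-level utilization alarms in this section.
COMMON = {
    "Namespace": "ECS/ContainerInsights",  # assumed namespace
    "Statistic": "Average",
    "Period": 60,
    "EvaluationPeriods": 5,
    "DatapointsToAlarm": 5,
    "ComparisonOperator": "GreaterThanThreshold",
}

def cluster_alarm(metric, threshold, cluster):
    """Build put_metric_alarm parameters for one cluster-level metric."""
    return {
        **COMMON,
        "AlarmName": f"{cluster}-{metric}-high",  # illustrative naming scheme
        "MetricName": metric,
        "Dimensions": [{"Name": "ClusterName", "Value": cluster}],
        "Threshold": threshold,
    }

alarms = [
    cluster_alarm("TaskEphemeralStorageUtilization", 80.0, "demo-cluster"),
    cluster_alarm("TaskMemoryUtilization", 80.0, "demo-cluster"),
    cluster_alarm("ContainerMemoryUtilization", 80.0, "demo-cluster"),
]
```

Factoring out the shared settings makes it easier to keep the evaluation cadence consistent if you later tune the threshold per metric.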

## Amazon ECS with Container Insights with enhanced observability
<a name="ECS-ContainerInsights_enhanced"></a>

The following are the recommended alarms for Amazon ECS with Container Insights with enhanced observability.

**TaskCpuUtilization**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect the total percentage of CPU units being used by a task.   
**Intent: **This alarm is used to detect high task CPU utilization.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Typically, you can set the threshold for CPU utilization to 80%. However, you can adjust this value based on your acceptable performance level and workload characteristics. For some tasks, consistently high CPU utilization may be normal and not indicate a problem, while for others, it may be a cause for concern. Analyze historical CPU utilization data to identify the usage, find what CPU utilization is acceptable for your tasks, and set the threshold accordingly.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**TaskMemoryUtilization**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect the total percentage of memory being used by a task.   
**Intent: **This alarm is used to detect high memory utilization.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Typically, you can set the threshold for memory utilization to 80%. However, you can adjust this value based on your acceptable performance level and workload characteristics. For some tasks, consistently high memory utilization may be normal and not indicate a problem, while for others, it may be a cause for concern. Analyze historical memory utilization data to identify the usage, find what memory utilization is acceptable for your tasks, and set the threshold accordingly.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**ContainerCpuUtilization**  
**Dimensions: **ContainerName, ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect the total percentage of CPU units being used by a container.   
**Intent: **This alarm is used to detect high container CPU utilization.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Typically, you can set the threshold for CPU utilization to 80%. However, you can adjust this value based on your acceptable performance level and workload characteristics. For some containers, consistently high CPU utilization may be normal and not indicate a problem, while for others, it may be a cause for concern. Analyze historical CPU utilization data to identify the usage, find what CPU utilization is acceptable for your containers, and set the threshold accordingly.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**ContainerMemoryUtilization**  
**Dimensions: **ContainerName, ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect the total percentage of memory units being used by a container.  
**Intent: **This alarm is used to detect high container memory utilization.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Typically, you can set the threshold for memory utilization to 80%. However, you can adjust this value based on your acceptable performance level and workload characteristics. For some containers, consistently high memory utilization may be normal and not indicate a problem, while for others, it may be a cause for concern. Analyze historical memory utilization data to identify the usage, find what memory utilization is acceptable for your containers, and set the threshold accordingly.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**TaskEBSfilesystemUtilization**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect the total percentage of the Amazon EBS file system being used by a task.  
**Intent: **This alarm is used to detect high Amazon EBS file system usage for a task.   
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold to about 80% of the Amazon EBS file system size.   
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**TaskEphemeralStorageUtilization**  
**Dimensions: **ClusterName, ServiceName  
**Alarm description: **This alarm helps you detect the total percentage of ephemeral storage being used by a task.  
**Intent: **This alarm is used to detect high ephemeral storage usage for a task. Consistently high ephemeral storage utilization can indicate that the disk is filling up, which might lead to task failure.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **Set the threshold to about 80% of the ephemeral storage size.   
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD
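All of the entries above use 5 out of 5 evaluation periods, meaning the alarm fires only after the metric breaches the threshold for five consecutive one-minute datapoints. A minimal sketch of that "M out of N" evaluation rule (it ignores CloudWatch's missing-data handling, which the real evaluation also considers):

```python
def should_alarm(datapoints, threshold, n=5, m=5):
    """Return True when at least m of the last n datapoints breach threshold."""
    window = datapoints[-n:]
    if len(window) < n:
        return False  # not enough data to evaluate the full window
    return sum(1 for v in window if v > threshold) >= m

# A single spike does not fire the alarm; sustained breaches do.
print(should_alarm([70, 95, 70, 70, 70], 80))   # False
print(should_alarm([85, 90, 95, 88, 91], 80))   # True
```

Setting both values to 5 is what keeps transient spikes from paging you while still catching sustained pressure within five minutes.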

## Amazon EFS
<a name="EFS"></a>

**PercentIOLimit**  
**Dimensions: **FileSystemId  
**Alarm description: **This alarm helps ensure that the workload stays within the I/O limit available to the file system. If the metric consistently reaches its I/O limit, consider moving the application to a file system that uses the Max I/O performance mode. For troubleshooting, check the clients that are connected to the file system and the client applications that throttle the file system.  
**Intent: **This alarm is used to detect how close the file system is to reaching the I/O limit of the General Purpose performance mode. A consistently high I/O percentage can indicate that the file system cannot scale to meet I/O requests and is becoming a resource bottleneck for the applications that use it.  
**Statistic: **Average  
**Recommended threshold: **100.0  
**Threshold justification: **When the file system reaches its I/O limit, it may respond to read and write requests more slowly. Therefore, we recommend monitoring this metric to avoid impacting applications that use the file system. The threshold can be set around 100%. However, this value can be adjusted to a lower value based on file system characteristics.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER\_THAN\_OR\_EQUAL\_TO\_THRESHOLD

**BurstCreditBalance**  
**Dimensions: **FileSystemId  
**Alarm description: **This alarm helps ensure that a burst credit balance is available for the file system. When no burst credits are available, applications' access to the file system is limited by low throughput. If the metric drops to 0 consistently, consider changing the throughput mode to [Elastic or Provisioned throughput mode](https://docs.aws.amazon.com/efs/latest/ug/performance.html#throughput-modes).  
**Intent: **This alarm is used to detect a low burst credit balance for the file system. A consistently low burst credit balance can indicate slowing throughput and increasing I/O latency.  
**Statistic: **Average  
**Recommended threshold: **0.0  
**Threshold justification: **When a file system runs out of burst credits, even if the baseline throughput rate is lower, EFS continues to provide a metered throughput of 1 MiBps to all file systems. However, we recommend monitoring this metric for a low burst credit balance to avoid the file system acting as a resource bottleneck for applications. The threshold can be set around 0 bytes.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **LESS\_THAN\_OR\_EQUAL\_TO\_THRESHOLD
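Unlike the high-utilization alarms, the `BurstCreditBalance` alarm fires when the metric drops to the threshold, so the comparison is inverted and the evaluation window is longer (15 of 15 minutes). A sketch of the parameters, assuming the standard `AWS/EFS` namespace; the alarm name and file system ID are placeholders:

```python
# Sketch: parameters for the BurstCreditBalance alarm described above.
burst_credit_alarm = {
    "AlarmName": "efs-burst-credit-balance-low",   # placeholder name
    "Namespace": "AWS/EFS",
    "MetricName": "BurstCreditBalance",
    "Dimensions": [
        {"Name": "FileSystemId", "Value": "fs-0123456789abcdef0"},  # placeholder
    ],
    "Statistic": "Average",
    "Period": 60,
    "EvaluationPeriods": 15,
    "DatapointsToAlarm": 15,
    "Threshold": 0.0,
    # Inverted comparison: alarm when the balance falls to zero.
    "ComparisonOperator": "LessThanOrEqualToThreshold",
}
# boto3.client("cloudwatch").put_metric_alarm(**burst_credit_alarm)
```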

## Amazon EKS with Container Insights
<a name="EKS-ContainerInsights"></a>

**node\_cpu\_utilization**  
**Dimensions: **ClusterName  
**Alarm description: **This alarm helps to detect high CPU utilization in worker nodes of the EKS cluster. If the utilization is consistently high, it might indicate a need for replacing your worker nodes with instances that have greater CPU or a need to scale the system horizontally.  
**Intent: **This alarm helps to monitor the CPU utilization of the worker nodes in the EKS cluster so that the system performance doesn't degrade.  
**Statistic: **Maximum  
**Recommended threshold: **80.0  
**Threshold justification: **It is recommended to set the threshold at less than or equal to 80% to allow enough time to debug the issue before the system starts seeing impact.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**node\_filesystem\_utilization**  
**Dimensions: **ClusterName  
**Alarm description: **This alarm helps to detect high file system utilization in the worker nodes of the EKS cluster. If the utilization is consistently high, you might need to update your worker nodes to have larger disk volume, or you might need to scale horizontally.  
**Intent: **This alarm helps to monitor the file system utilization of the worker nodes in the EKS cluster. If the utilization reaches 100%, it can lead to application failure, disk I/O bottlenecks, pod eviction, or cause the node to become entirely unresponsive.  
**Statistic: **Maximum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **If there's sufficient disk pressure (meaning that the disk is getting full), nodes are marked as not healthy, and pods are evicted from the node. Pods on a node with disk pressure are evicted when the available file system space is lower than the eviction thresholds set on the kubelet. Set the alarm threshold so that you have enough time to react before pods are evicted from the node.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**node\_memory\_utilization**  
**Dimensions: **ClusterName  
**Alarm description: **This alarm helps in detecting high memory utilization in worker nodes of the EKS cluster. If the utilization is consistently high, it might indicate a need to scale the number of pod replicas, or optimize your application.  
**Intent: **This alarm helps to monitor the memory utilization of the worker nodes in the EKS cluster so that the system performance doesn't degrade.  
**Statistic: **Maximum  
**Recommended threshold: **80.0  
**Threshold justification: **It is recommended to set the threshold at less than or equal to 80% to allow enough time to debug the issue before the system starts seeing impact.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**pod\_cpu\_utilization\_over\_pod\_limit**  
**Dimensions: **ClusterName, Namespace, Service  
**Alarm description: **This alarm helps in detecting high CPU utilization in pods of the EKS cluster. If the utilization is consistently high, it might indicate a need to increase the CPU limit for the affected pod.  
**Intent: **This alarm helps to monitor the CPU utilization of the pods belonging to a Kubernetes Service in the EKS cluster, so that you can quickly identify if a service's pod is consuming higher CPU than expected.  
**Statistic: **Maximum  
**Recommended threshold: **80.0  
**Threshold justification: **It is recommended to set the threshold at less than or equal to 80% to allow enough time to debug the issue before the system starts seeing impact.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**pod\_memory\_utilization\_over\_pod\_limit**  
**Dimensions: **ClusterName, Namespace, Service  
**Alarm description: **This alarm helps in detecting high memory utilization in pods of the EKS cluster. If the utilization is consistently high, it might indicate a need to increase the memory limit for the affected pod.  
**Intent: **This alarm helps to monitor the memory utilization of the pods in the EKS cluster so that the system performance doesn't degrade.  
**Statistic: **Maximum  
**Recommended threshold: **80.0  
**Threshold justification: **It is recommended to set the threshold at less than or equal to 80% to allow enough time to debug the issue before the system starts seeing impact.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD
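The pod-level alarms above scope the metric with three dimensions and use the Maximum statistic, so any single pod replica breaching the limit is enough to fire. A hedged sketch of the parameter set; the `ContainerInsights` namespace, alarm name, and dimension values are assumptions for illustration:

```python
# Sketch: parameters for the pod_memory_utilization_over_pod_limit alarm.
pod_memory_alarm = {
    "AlarmName": "eks-pod-memory-over-limit",  # placeholder name
    "Namespace": "ContainerInsights",          # assumed EKS Container Insights namespace
    "MetricName": "pod_memory_utilization_over_pod_limit",
    "Dimensions": [
        {"Name": "ClusterName", "Value": "demo-cluster"},   # placeholder
        {"Name": "Namespace", "Value": "default"},          # Kubernetes namespace
        {"Name": "Service", "Value": "demo-service"},       # placeholder
    ],
    # Maximum catches the worst-behaving replica, not the fleet average.
    "Statistic": "Maximum",
    "Period": 60,
    "EvaluationPeriods": 5,
    "DatapointsToAlarm": 5,
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
}
# boto3.client("cloudwatch").put_metric_alarm(**pod_memory_alarm)
```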

## Amazon EventBridge Scheduler
<a name="Eventbridge-Scheduler"></a>

**TargetErrorThrottledCount**  
**Dimensions: **None  
**Alarm description: **This alarm helps you identify target throttling. To avoid target throttling error, consider [configuring flexible time windows](https://docs.aws.amazon.com/scheduler/latest/UserGuide/managing-schedule-flexible-time-windows.html) to spread your invocation load or increasing limits with the target service.  
**Intent: **This alarm is used to detect target throttling errors, which can cause schedule delays.  
**Statistic: **Sum  
**Recommended threshold: **0.0  
**Threshold justification: **If the target throttling error is consistently greater than 0, schedule delivery is delayed. For some systems, target throttling errors for a brief period of time might be normal, while for others, it might be a cause of concern. Set this alarm's threshold, `datapointsToAlarm`, and `evaluationPeriods` accordingly.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**InvocationThrottleCount**  
**Dimensions: **None  
**Alarm description: **This alarm helps you identify invocation throttling by Amazon EventBridge Scheduler. To avoid invocation throttling errors, consider [configuring flexible time windows](https://docs.aws.amazon.com/scheduler/latest/UserGuide/managing-schedule-flexible-time-windows.html) to spread your invocation load or [increasing invocations throttle limit](https://docs.aws.amazon.com/scheduler/latest/UserGuide/scheduler-quotas.html).  
**Intent: **This alarm is used to detect Amazon EventBridge Scheduler invocation throttling errors, which can cause schedule delays.  
**Statistic: **Sum  
**Recommended threshold: **0.0  
**Threshold justification: **If the invocation throttling is consistently greater than 0, schedule delivery is delayed. For some systems, invocation throttling errors for a brief period of time might be normal, while for others, it might be a cause of concern. Set this alarm's threshold, `datapointsToAlarm`, and `evaluationPeriods` accordingly.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**InvocationDroppedCount**  
**Dimensions: **None  
**Alarm description: **This alarm helps you identify invocations dropped by Amazon EventBridge Scheduler. Consider investigating by [configuring a DLQ](https://docs.aws.amazon.com/scheduler/latest/UserGuide/configuring-schedule-dlq.html) for the schedule.  
**Intent: **This alarm is used to detect invocations dropped by Amazon EventBridge Scheduler. If you have configured a DLQ correctly on all of your schedules, dropped invocations appear in the DLQ and you can skip setting up this alarm.  
**Statistic: **Sum  
**Recommended threshold: **0.0  
**Threshold justification: **Set the threshold to 0 to detect dropped invocations.  
**Period: **60  
**Datapoints to alarm: **1  
**Evaluation periods: **1  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**InvocationsFailedToBeSentToDeadLetterCount**  
**Dimensions: **None  
**Alarm description: **This alarm helps you identify invocations that failed to be sent to the configured DLQ by Amazon EventBridge Scheduler. If the metric is consistently greater than 0, modify your DLQ configuration to resolve the issue. Use the `InvocationsFailedToBeSentToDeadLetterCount` metrics to determine the issue.  
**Intent: **This alarm is used to detect invocations that failed to be sent to the configured DLQ by Amazon EventBridge Scheduler.  
**Statistic: **Sum  
**Recommended threshold: **0.0  
**Threshold justification: **Set the threshold to 0 to detect any invocations that failed to be sent to the configured DLQ. Retryable errors also show up in this metric, so `datapointsToAlarm` for this alarm has been set to 15.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD
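The Scheduler alarms above publish account-level metrics with no dimensions, so a single alarm covers all schedules. A sketch for `InvocationsFailedToBeSentToDeadLetterCount`; the alarm name is a placeholder, and `AWS/Scheduler` is assumed as the metric namespace:

```python
# Sketch: parameters for the InvocationsFailedToBeSentToDeadLetterCount alarm.
dlq_failure_alarm = {
    "AlarmName": "scheduler-dlq-delivery-failures",  # placeholder name
    "Namespace": "AWS/Scheduler",                    # assumed namespace
    "MetricName": "InvocationsFailedToBeSentToDeadLetterCount",
    # No Dimensions key: the metric is account-level.
    "Statistic": "Sum",
    "Period": 60,
    "EvaluationPeriods": 15,
    "DatapointsToAlarm": 15,   # 15 of 15 absorbs transient retryable errors
    "Threshold": 0.0,
    "ComparisonOperator": "GreaterThanThreshold",
}
# boto3.client("cloudwatch").put_metric_alarm(**dlq_failure_alarm)
```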

## Amazon Kinesis Data Streams
<a name="Kinesis"></a>

**GetRecords.IteratorAgeMilliseconds**  
**Dimensions: **StreamName  
**Alarm description: **This alarm can detect if iterator maximum age is too high. For real-time data processing applications, configure data retention according to tolerance of the delay. This is usually within minutes. For applications that process historic data, use this metric to monitor catchup speed. A quick solution to stop data loss is to increase the retention period while you troubleshoot the issue. You can also increase the number of workers processing records in your consumer application. The most common causes for gradual iterator age increase are insufficient physical resources or record processing logic that has not scaled with an increase in stream throughput. For more details, see [this AWS re:Post article](https://repost.aws/knowledge-center/kinesis-data-streams-iteratorage-metric).  
**Intent: **This alarm is used to detect if data in your stream is going to expire because of being preserved too long or because record processing is too slow. It helps you avoid data loss after reaching 100% of the stream retention time.  
**Statistic: **Maximum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The recommended threshold value for this alarm is highly dependent on the stream retention period and tolerance of processing delay for the records. Review your requirements and analyze historical trends, and then set the threshold to the number of milliseconds that represents a critical processing delay. If an iterator's age passes 50% of the retention period (by default, 24 hours, configurable up to 365 days), there is a risk for data loss because of record expiration. You can monitor the metric to make sure that none of your shards ever approach this limit.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD
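As a worked example of the 50%-of-retention guidance above: with the default 24-hour retention, a conservative threshold lands at 12 hours expressed in milliseconds, since the metric is reported in milliseconds:

```python
# Worked example: iterator-age threshold at 50% of the stream retention.
retention_hours = 24  # default retention; configurable up to 365 days
threshold_ms = int(retention_hours * 0.5 * 60 * 60 * 1000)
print(threshold_ms)  # 43200000 -> alarm once iterator age passes 12 hours
```

For longer retention periods, recompute the threshold the same way rather than reusing the 12-hour figure.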

**GetRecords.Success**  
**Dimensions: **StreamName  
**Alarm description: **This metric increments whenever your consumers successfully read data from your stream. `GetRecords` doesn't return any data when it throws an exception. The most common exception is `ProvisionedThroughputExceededException` because request rate for the stream is too high, or because available throughput is already served for the given second. Reduce the frequency or size of your requests. For more information, see Streams [Limits](https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html) in the Amazon Kinesis Data Streams Developer Guide, and [Error Retries and Exponential Backoff in AWS](https://docs.aws.amazon.com/sdkref/latest/guide/feature-retry-behavior.html).  
**Intent: **This alarm can detect if the retrieval of records from the stream by consumers is failing. By setting an alarm on this metric, you can proactively detect any issues with data consumption, such as increased error rates or a decline in successful retrievals. This allows you to take timely actions to resolve potential problems and maintain a smooth data processing pipeline.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Depending on the importance of retrieving records from the stream, set the threshold based on your application’s tolerance for failed records. The threshold should be the corresponding percentage of successful operations. You can use historical GetRecords metric data as reference for the acceptable failure rate. You should also consider retries when setting the threshold because failed records can be retried. This helps to prevent transient spikes from triggering unnecessary alerts.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **LESS\_THAN\_THRESHOLD

**PutRecord.Success**  
**Dimensions: **StreamName  
**Alarm description: **This alarm detects when the number of failed `PutRecord` operations breaches the threshold. Investigate the data producer logs to find the root causes of the failures. The most common reason is insufficient provisioned throughput on the shard that caused the `ProvisionedThroughputExceededException`. It happens because the request rate for the stream is too high, or the throughput attempted to be ingested into the shard is too high. Reduce the frequency or size of your requests. For more information, see Streams [Limits](https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html) and [Error Retries and Exponential Backoff in AWS](https://docs.aws.amazon.com/sdkref/latest/guide/feature-retry-behavior.html).  
**Intent: **This alarm can detect if ingestion of records into the stream is failing. It helps you identify issues in writing data to the stream. By setting an alarm on this metric, you can proactively detect any issues of producers in publishing data to the stream, such as increased error rates or a decrease in successful records being published. This enables you to take timely actions to address potential problems and maintain a reliable data ingestion process.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Depending on the importance of data ingestion and processing to your service, set the threshold based on your application’s tolerance for failed records. The threshold should be the corresponding percentage of successful operations. You can use historical PutRecord metric data as reference for the acceptable failure rate. You should also consider retries when setting the threshold because failed records can be retried.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **LESS\_THAN\_THRESHOLD

**PutRecords.FailedRecords**  
**Dimensions: **StreamName  
**Alarm description: **This alarm detects when the number of failed `PutRecords` exceeds the threshold. Kinesis Data Streams attempts to process all records in each `PutRecords` request, but a single record failure does not stop the processing of subsequent records. The main reason for these failures is exceeding the throughput of a stream or an individual shard. Common causes are traffic spikes and network latencies that cause records to arrive to the stream unevenly. You should detect unsuccessfully processed records and retry them in a subsequent call. Refer to [Handling Failures When Using PutRecords](https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-sdk.html) for more details.  
**Intent: **This alarm can detect consistent failures when using batch operation to put records to your stream. By setting an alarm on this metric, you can proactively detect an increase in failed records, enabling you to take timely actions to address the underlying problems and ensure a smooth and reliable data ingestion process.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold to the number of failed records reflecting the application's tolerance for failed records. You can use historical data as a reference for the acceptable failure value. You should also consider retries when setting the threshold, because failed records can be retried in subsequent PutRecords calls.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER\_THAN\_THRESHOLD

**ReadProvisionedThroughputExceeded**  
**Dimensions: **StreamName  
**Alarm description: **The alarm tracks the number of records that result in read throughput capacity throttling. If you find that you are being consistently throttled, you should consider adding more shards to your stream to increase your provisioned read throughput. If there is more than one consumer application running on the stream, and they share the `GetRecords` limit, we recommend that you register new consumer applications via Enhanced Fan-Out. If adding more shards does not lower the number of throttles, you may have a “hot” shard that is being read from more than other shards are. Enable enhanced monitoring, find the “hot” shard, and split it.  
**Intent: **This alarm can detect if consumers are throttled when they exceed your provisioned read throughput (determined by the number of shards you have). In that case, you won’t be able to read from the stream, and the stream can start backing up.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Usually throttled requests can be retried and hence setting the threshold to zero makes the alarm too sensitive. However, consistent throttling can impact reading from the stream and should trigger the alarm. Set the threshold to a percentage according to the throttled requests for the application and retry configurations.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**SubscribeToShardEvent.MillisBehindLatest**  
**Dimensions: **StreamName, ConsumerName  
**Alarm description: **This alarm detects when the delay of record processing in the application breaches the threshold. Transient problems such as API operation failures to a downstream application can cause a sudden increase in the metric. You should investigate if these increases happen consistently. A common cause is that the consumer is not processing records fast enough because of insufficient physical resources or record processing logic that has not scaled with an increase in stream throughput. Blocking calls in the critical path are often the cause of slowdowns in record processing. You can increase your parallelism by increasing the number of shards. You should also confirm that underlying processing nodes have sufficient physical resources during peak demand.  
**Intent: **This alarm can detect delay in the subscription to shard event of the stream. This indicates a processing lag and can help identify potential issues with the consumer application's performance or the overall stream's health. When the processing lag becomes significant, you should investigate and address any bottlenecks or consumer application inefficiencies to ensure real-time data processing and minimize data backlog.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The recommended threshold value for this alarm is highly dependent on the delay that your application can tolerate. Review your application's requirements and analyze historical trends, and then select a threshold accordingly. When the SubscribeToShard call succeeds, your consumer starts receiving SubscribeToShardEvent events over the persistent connection for up to 5 minutes, after which time you need to call SubscribeToShard again to renew the subscription if you want to continue to receive records.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**WriteProvisionedThroughputExceeded**  
**Dimensions: **StreamName  
**Alarm description: **This alarm detects when the number of records resulting in write throughput capacity throttling reaches the threshold. When your producers exceed your provisioned write throughput (determined by the number of shards you have), they are throttled and you won’t be able to put records to the stream. To address consistent throttling, you should consider adding shards to your stream. This raises your provisioned write throughput and prevents future throttling. You should also consider partition key choice when ingesting records. A random partition key is preferred because it spreads records evenly across the shards of the stream, whenever possible.  
**Intent: **This alarm can detect if your producers are being rejected for writing records because of throttling of the stream or shard. If your stream is in Provisioned mode, then setting this alarm helps you proactively take actions when the data stream reaches its limits, allowing you to optimize the provisioned capacity or take appropriate scaling actions to avoid data loss and maintain smooth data processing.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Usually throttled requests can be retried, so setting the threshold to zero makes the alarm too sensitive. However, consistent throttling can impact writing to the stream, and you should set the alarm threshold to detect this. Set the threshold to a percentage according to the throttled requests for the application and retry configurations.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD
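Because the threshold justification suggests alarming on a percentage of throttled requests, a metric math alarm works well here. This sketch expresses throttled writes as a percentage of incoming records; the stream name and the 2% threshold are illustrative assumptions, not recommendations from this guide.

```python
# Hypothetical metric-math alarm: throttled writes as a percentage of
# incoming records. Stream name and threshold are placeholders.
def write_throttle_percent_alarm(stream_name, threshold_percent):
    def stream_sum(metric_name, query_id):
        # One-minute Sum of a stream-level AWS/Kinesis metric,
        # used only as an input to the expression.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Kinesis",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{stream_name}-write-throttle-percent",
        "Metrics": [
            {
                "Id": "pct",
                "Expression": "100 * (throttled / records)",
                "Label": "Throttled write percentage",
                "ReturnData": True,  # the alarm evaluates this expression
            },
            stream_sum("WriteProvisionedThroughputExceeded", "throttled"),
            stream_sum("IncomingRecords", "records"),
        ],
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 5,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }

params = write_throttle_percent_alarm("my-stream", 2.0)
# boto3.client("cloudwatch").put_metric_alarm(**params)
```

Note that when you pass `Metrics`, you do not pass a top-level `Namespace`, `MetricName`, or `Statistic`; the expression result is what the alarm evaluates.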

## Lambda
<a name="Lambda"></a>

**ClaimedAccountConcurrency**  
**Dimensions: **None  
**Alarm description: **This alarm helps to monitor if the concurrency of your Lambda functions is approaching the Region-level concurrency limit of your account. A function starts to be throttled if it reaches the concurrency limit. You can take the following actions to avoid throttling.   

1. [Request a concurrency increase](https://repost.aws/knowledge-center/lambda-concurrency-limit-increase) in this Region.

1. Identify and reduce any unused reserved concurrency or provisioned concurrency.

1. Identify performance issues in the functions to improve the speed of processing and therefore improve throughput.

1. Increase the batch size of the functions, so that more messages are processed by each function invocation.
**Intent: **This alarm can proactively detect if the concurrency of your Lambda functions is approaching the Region-level concurrency quota of your account, so that you can act on it. Functions are throttled if `ClaimedAccountConcurrency` reaches the Region-level concurrency quota of the account. If you are using Reserved Concurrency (RC) or Provisioned Concurrency (PC), this alarm gives you more visibility on concurrency utilization than an alarm on `ConcurrentExecutions` would.  
**Statistic: **Maximum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **You should calculate the value of about 90% of the concurrency quota set for the account in the Region, and use the result as the threshold value. By default, your account has a concurrency quota of 1,000 across all functions in a Region. However, you should check the quota of your account from the Service Quotas dashboard.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **GREATER_THAN_THRESHOLD
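The 90%-of-quota guidance above can be turned into a small helper. This is a sketch under the assumption that you already know your Region-level concurrency quota (1,000 by default; check the Service Quotas dashboard for your account's actual value).

```python
# Hypothetical helper: derive the threshold as a fraction of the account's
# Region-level concurrency quota. The quota value is an input, not looked up.
def claimed_concurrency_alarm(account_quota, fraction=0.9):
    return {
        "AlarmName": "lambda-claimed-account-concurrency",
        "Namespace": "AWS/Lambda",
        "MetricName": "ClaimedAccountConcurrency",  # account-level: no dimensions
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 10,
        "DatapointsToAlarm": 10,
        "Threshold": int(account_quota * fraction),
        "ComparisonOperator": "GreaterThanThreshold",
    }

params = claimed_concurrency_alarm(1000)  # default quota -> threshold of 900
# boto3.client("cloudwatch").put_metric_alarm(**params)
```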

**Errors**  
**Dimensions: **FunctionName  
**Alarm description: **This alarm detects high error counts. The `Errors` metric includes exceptions thrown by your code as well as exceptions thrown by the Lambda runtime. You can check the logs related to the function to diagnose the issue.  
**Intent: **The alarm helps detect high error counts in function invocations.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold to a number greater than zero. The exact value can depend on the tolerance for errors in your application. Understand the criticality of the invocations that the function is handling. For some applications, any error might be unacceptable, while other applications might allow for a certain margin of error.  
**Period: **60  
**Datapoints to alarm: **3  
**Evaluation periods: **3  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**Throttles**  
**Dimensions: **FunctionName  
**Alarm description: **This alarm detects a high number of throttled invocation requests. Throttling occurs when no concurrency is available for scaling up. There are several approaches to resolve this issue. 1) Request a concurrency increase from AWS Support in this Region. 2) Identify performance issues in the function to improve the speed of processing and therefore improve throughput. 3) Increase the batch size of the function, so that more messages are processed by each function invocation.  
**Intent: **The alarm helps detect a high number of throttled invocation requests for a Lambda function. It is important to know if requests are constantly getting rejected due to throttling and if you need to improve Lambda function performance or increase concurrency capacity to avoid constant throttling.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold to a number greater than zero. The exact value of the threshold can depend on the tolerance of the application. Set the threshold according to the usage and scaling requirements of the function.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_OR_EQUAL_TO_THRESHOLD

**Duration**  
**Dimensions: **FunctionName  
**Alarm description: **This alarm detects long duration times for processing an event by a Lambda function. Long durations might be because of changes in function code making the function take longer to execute, or the function's dependencies taking longer.  
**Intent: **This alarm can detect a long running duration of a Lambda function. High runtime duration indicates that a function is taking a longer time for invocation, and can also impact the concurrency capacity of invocation if Lambda is handling a higher number of events. It is critical to know if the Lambda function is constantly taking longer execution time than expected.  
**Statistic: **p90  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The threshold for the duration depends on your application and workloads and your performance requirements. For high-performance requirements, set the threshold to a shorter time to see if the function is meeting expectations. You can also analyze historical data for duration metrics to see if the time taken matches the performance expectation of the function, and then set the threshold to a longer time than the historical average. Make sure to set the threshold lower than the configured function timeout.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER_THAN_THRESHOLD
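Since the threshold should stay below the configured function timeout, one option is to derive it from the timeout. This sketch assumes an illustrative 80% headroom factor; note that percentile statistics such as p90 use the `ExtendedStatistic` parameter rather than `Statistic`.

```python
# Hypothetical helper: threshold in milliseconds (Duration is reported in
# milliseconds), derived from the function timeout with a headroom factor.
def duration_alarm(function_name, timeout_seconds, headroom=0.8):
    return {
        "AlarmName": f"{function_name}-duration-p90",
        "Namespace": "AWS/Lambda",
        "MetricName": "Duration",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "ExtendedStatistic": "p90",  # percentiles use ExtendedStatistic, not Statistic
        "Period": 60,
        "EvaluationPeriods": 15,
        "DatapointsToAlarm": 15,
        "Threshold": timeout_seconds * 1000 * headroom,
        "ComparisonOperator": "GreaterThanThreshold",
    }

params = duration_alarm("my-function", timeout_seconds=30)
# boto3.client("cloudwatch").put_metric_alarm(**params)
```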

**ConcurrentExecutions**  
**Dimensions: **FunctionName  
**Alarm description: **This alarm helps to monitor if the concurrency of the function is approaching the Region-level concurrency limit of your account. A function starts to be throttled if it reaches the concurrency limit. You can take the following actions to avoid throttling.  

1. Request a concurrency increase in this Region.

1. Identify performance issues in the functions to improve the speed of processing and therefore improve throughput.

1. Increase the batch size of the functions, so that more messages are processed by each function invocation.
To get better visibility on reserved concurrency and provisioned concurrency utilization, set an alarm on the new metric `ClaimedAccountConcurrency` instead.  
**Intent: **This alarm can proactively detect if the concurrency of the function is approaching the Region-level concurrency quota of your account, so that you can act on it. A function is throttled if it reaches the Region-level concurrency quota of the account.  
**Statistic: **Maximum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold to about 90% of the concurrency quota set for the account in the Region. By default, your account has a concurrency quota of 1,000 across all functions in a Region. However, you can check the quota of your account, as it can be increased by contacting AWS support.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **GREATER_THAN_THRESHOLD

## Lambda Insights
<a name="LambdaInsights"></a>

We recommend setting best-practice alarms for the following Lambda Insights metrics.

**memory_utilization**  
**Dimensions: **function_name  
**Alarm description: **This alarm is used to detect if the memory utilization of a Lambda function is approaching the configured limit. For troubleshooting, you can try to 1) Optimize your code. 2) Right-size your memory allocation by accurately estimating the memory requirements. You can refer to [Lambda Power Tuning](https://docs.aws.amazon.com/lambda/latest/operatorguide/profile-functions.html) for this. 3) Use connection pooling. Refer to [Using Amazon RDS Proxy with Lambda](https://aws.amazon.com/blogs/compute/using-amazon-rds-proxy-with-aws-lambda/) for connection pooling with an RDS database. 4) You can also consider designing your functions to avoid storing large amounts of data in memory between invocations.  
**Intent: **This alarm is used to detect if the memory utilization for the Lambda function is approaching the configured limit.  
**Statistic: **Average  
**Recommended threshold: **90.0  
**Threshold justification: **Set the threshold to 90% to get an alert when memory utilization exceeds 90% of the allocated memory. You can adjust this to a lower value if you have concerns about the workload's memory utilization. You can also check the historical data for this metric and set the threshold accordingly.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **GREATER_THAN_THRESHOLD

## Amazon VPC (`AWS/NATGateway`)
<a name="NATGateway"></a>

**ErrorPortAllocation**  
**Dimensions: **NatGatewayId  
**Alarm description: **This alarm helps to detect when the NAT Gateway is unable to allocate ports to new connections. To resolve this issue, see [Resolve port allocation errors on NAT Gateway](https://repost.aws/knowledge-center/vpc-resolve-port-allocation-errors).  
**Intent: **This alarm is used to detect if the NAT gateway could not allocate a source port.  
**Statistic: **Sum  
**Recommended threshold: **0.0  
**Threshold justification: **If the value of `ErrorPortAllocation` is greater than zero, that means too many concurrent connections to a single popular destination are open through the NAT gateway.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**PacketsDropCount**  
**Dimensions: **NatGatewayId  
**Alarm description: **This alarm helps to detect when packets are dropped by NAT Gateway. This might happen because of an issue with NAT Gateway, so check the [AWS service health dashboard](https://health.aws.amazon.com/health/status) for the status of NAT Gateway in your Region. This can help you correlate network issues related to traffic using the NAT gateway.  
**Intent: **This alarm is used to detect if packets are being dropped by NAT Gateway.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **You should calculate the value of 0.01 percent of the total traffic on the NAT Gateway and use that result as the threshold value. Use historical data of the traffic on NAT Gateway to determine the threshold.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD
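The 0.01 percent guidance above can be sketched as a small calculation. The traffic figure here is a placeholder you would derive from historical NAT gateway metrics.

```python
# Hypothetical helper: threshold is 0.01 percent of typical per-minute traffic
# through the NAT gateway. avg_packets_per_minute comes from your own history.
def packets_drop_alarm(nat_gateway_id, avg_packets_per_minute):
    return {
        "AlarmName": f"{nat_gateway_id}-packets-dropped",
        "Namespace": "AWS/NATGateway",
        "MetricName": "PacketsDropCount",
        "Dimensions": [{"Name": "NatGatewayId", "Value": nat_gateway_id}],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 5,
        "Threshold": avg_packets_per_minute * 0.0001,  # 0.01 percent
        "ComparisonOperator": "GreaterThanThreshold",
    }

params = packets_drop_alarm("nat-0123456789abcdef0", 1_000_000)
# boto3.client("cloudwatch").put_metric_alarm(**params)
```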

## AWS Private Link (`AWS/PrivateLinkEndpoints`)
<a name="PrivateLinkEndpoints"></a>

**PacketsDropped**  
**Dimensions: **VPC Id, VPC Endpoint Id, Endpoint Type, Subnet Id, Service Name  
**Alarm description: **This alarm helps to detect if the endpoint or endpoint service is unhealthy by monitoring the number of packets dropped by the endpoint. Note that packets larger than 8500 bytes that arrive at the VPC endpoint are dropped. For troubleshooting, see [connectivity problems between an interface VPC endpoint and an endpoint service](https://repost.aws/knowledge-center/connect-endpoint-service-vpc).  
**Intent: **This alarm is used to detect if the endpoint or endpoint service is unhealthy.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold according to the use case. If you want to be aware of the unhealthy status of the endpoint or endpoint service, you should set the threshold low so that you get a chance to fix the issue before a huge data loss. You can use historical data to understand the tolerance for dropped packets and set the threshold accordingly.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

## AWS Private Link (`AWS/PrivateLinkServices`)
<a name="PrivateLinkServices"></a>

**RstPacketsSent**  
**Dimensions: **Service Id, Load Balancer Arn, Az  
**Alarm description: **This alarm helps you detect unhealthy targets of an endpoint service based on the number of reset packets that are sent to endpoints. When you debug connection errors with a consumer of your service, you can validate whether the service is resetting connections with the RstPacketsSent metric, or if something else is failing on the network path.  
**Intent: **This alarm is used to detect unhealthy targets of an endpoint service.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The threshold depends on the use case. If your use case can tolerate targets being unhealthy, you can set the threshold high. If the use case can’t tolerate unhealthy targets, you can set the threshold very low.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

## Amazon RDS
<a name="RDS"></a>

**CPUUtilization**  
**Dimensions: **DBInstanceIdentifier  
**Alarm description: **This alarm helps to monitor consistent high CPU utilization. CPU utilization measures non-idle time. Consider using [Enhanced Monitoring](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Monitoring.OS.Enabling.html) or [Performance Insights](https://aws.amazon.com/rds/performance-insights/) to review which [wait time](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Monitoring-Available-OS-Metrics.html) is consuming the most CPU time (`guest`, `irq`, `wait`, `nice`, and so on) for MariaDB, MySQL, Oracle, and PostgreSQL. Then evaluate which queries consume the highest amount of CPU. If you can't tune your workload, consider moving to a larger DB instance class.  
**Intent: **This alarm is used to detect consistent high CPU utilization in order to prevent very high response time and time-outs. If you want to check micro-bursting of CPU utilization you can set a lower alarm evaluation time.  
**Statistic: **Average  
**Recommended threshold: **90.0  
**Threshold justification: **Random spikes in CPU consumption might not hamper database performance, but sustained high CPU can hinder upcoming database requests. Depending on the overall database workload, high CPU at your RDS/Aurora instance can degrade the overall performance.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**DatabaseConnections**  
**Dimensions: **DBInstanceIdentifier  
**Alarm description: **This alarm detects a high number of connections. Review existing connections and terminate any that are in `sleep` state or that are improperly closed. Consider using connection pooling to limit the number of new connections. Alternatively, increase the DB instance size to use a class with more memory and hence a higher default value for `max_connections` or increase the `max_connections` value in [RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Limits.html) and Aurora [MySQL](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Managing.Performance.html) and [PostgreSQL](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Managing.html) for the current class if it can support your workload.  
**Intent: **This alarm is used to help prevent rejected connections when the maximum number of DB connections is reached. This alarm is not recommended if you frequently change DB instance class, because doing so changes the memory and default maximum number of connections.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The number of connections allowed depends on the size of your DB instance class and database engine-specific parameters related to processes/connections. You should calculate a value between 90-95% of the maximum number of connections for your database and use that result as the threshold value.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD
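One way to estimate the 90-95% threshold is from your engine's `max_connections` value. The sketch below assumes the default RDS MySQL parameter formula (`DBInstanceClassMemory / 12582880`); other engines and custom parameter groups use different values, so verify the actual `max_connections` for your instance before relying on this.

```python
# Hypothetical estimate: 90% of max_connections, assuming MySQL's default
# RDS parameter formula max_connections = DBInstanceClassMemory / 12582880.
# Verify the real value for your engine and parameter group.
def db_connections_threshold(instance_memory_gib, fraction=0.9):
    max_connections = (instance_memory_gib * 1024**3) // 12582880
    return int(max_connections * fraction)

threshold = db_connections_threshold(8)  # e.g. an 8 GiB instance class
```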

**EBSByteBalance%**  
**Dimensions: **DBInstanceIdentifier  
**Alarm description: **This alarm helps to monitor a low percentage of throughput credits remaining. For troubleshooting, check [latency problems in RDS](https://repost.aws/knowledge-center/rds-latency-ebs-iops-bottleneck).  
**Intent: **This alarm is used to detect a low percentage of throughput credits remaining in the burst bucket. Low byte balance percentage can cause throughput bottleneck issues. This alarm is not recommended for Aurora PostgreSQL instances.  
**Statistic: **Average  
**Recommended threshold: **10.0  
**Threshold justification: **A throughput credit balance below 10% is considered to be poor and you should set the threshold accordingly. You can also set a lower threshold if your application can tolerate a lower throughput for the workload.  
**Period: **60  
**Datapoints to alarm: **3  
**Evaluation periods: **3  
**Comparison Operator: **LESS_THAN_THRESHOLD

**EBSIOBalance%**  
**Dimensions: **DBInstanceIdentifier  
**Alarm description: **This alarm helps to monitor low percentage of IOPS credits remaining. For troubleshooting, see [latency problems in RDS](https://repost.aws/knowledge-center/rds-latency-ebs-iops-bottleneck).  
**Intent: **This alarm is used to detect a low percentage of I/O credits remaining in the burst bucket. Low IOPS balance percentage can cause IOPS bottleneck issues. This alarm is not recommended for Aurora instances.  
**Statistic: **Average  
**Recommended threshold: **10.0  
**Threshold justification: **An IOPS credits balance below 10% is considered to be poor and you can set the threshold accordingly. You can also set a lower threshold, if your application can tolerate a lower IOPS for the workload.  
**Period: **60  
**Datapoints to alarm: **3  
**Evaluation periods: **3  
**Comparison Operator: **LESS_THAN_THRESHOLD

**FreeableMemory**  
**Dimensions: **DBInstanceIdentifier  
**Alarm description: **This alarm helps to monitor low freeable memory, which can mean that there is a spike in database connections or that your instance may be under high memory pressure. Check for memory pressure by monitoring the CloudWatch metrics for `SwapUsage` in addition to `FreeableMemory`. If the instance memory consumption is frequently too high, this indicates that you should check your workload or upgrade your instance class. For an Aurora reader DB instance, consider adding additional reader DB instances to the cluster. For information about troubleshooting Aurora, see [freeable memory issues](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_Troubleshooting.html#Troubleshooting.FreeableMemory).  
**Intent: **This alarm is used to help prevent running out of memory which can result in rejected connections.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Depending on the workload and instance class, different values for the threshold can be appropriate. Ideally, available memory should not go below 25% of total memory for prolonged periods. For Aurora, you can set the threshold close to 5%, because the metric approaching 0 means that the DB instance has scaled up as much as it can. You can analyze the historical behavior of this metric to determine sensible threshold levels.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **LESS_THAN_THRESHOLD

**FreeLocalStorage**  
**Dimensions: **DBInstanceIdentifier  
**Alarm description: **This alarm helps to monitor low free local storage. Aurora PostgreSQL-Compatible Edition uses local storage for storing error logs and temporary files. Aurora MySQL uses local storage for storing error logs, general logs, slow query logs, audit logs, and non-InnoDB temporary tables. These local storage volumes are backed by Amazon EBS and can be extended by using a larger DB instance class. For troubleshooting, see the guidance for Aurora [PostgreSQL-Compatible](https://repost.aws/knowledge-center/postgresql-aurora-storage-issue) and [MySQL-Compatible](https://repost.aws/knowledge-center/aurora-mysql-local-storage).  
**Intent: **This alarm is used to detect how close the Aurora DB instance is to reaching the local storage limit, if you do not use Aurora Serverless v2 or higher. Local storage can reach capacity when you store non-persistent data, such as temporary table and log files, in the local storage. This alarm can prevent an out-of-space error that occurs when your DB instance runs out of local storage.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **You should calculate about 10%-20% of the amount of storage available based on velocity and trend of volume usage, and then use that result as the threshold value to proactively take action before the volume reaches its limit.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **LESS_THAN_THRESHOLD

**FreeStorageSpace**  
**Dimensions: **DBInstanceIdentifier  
**Alarm description: **This alarm watches for a low amount of available storage space. Consider scaling up your database storage if you frequently approach storage capacity limits. Include some buffer to accommodate unforeseen increases in demand from your applications. Alternatively, consider enabling RDS storage auto scaling. Additionally, consider freeing up more space by deleting unused or outdated data and logs. For further information, check [RDS run out of storage document](https://repost.aws/knowledge-center/rds-out-of-storage) and [PostgreSQL storage issues document](https://repost.aws/knowledge-center/diskfull-error-rds-postgresql).  
**Intent: **This alarm helps prevent storage full issues. This can prevent downtime that occurs when your database instance runs out of storage. We do not recommend using this alarm if you have storage auto scaling enabled, or if you frequently change the storage capacity of the database instance.  
**Statistic: **Minimum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The threshold value will depend on the currently allocated storage space. Typically, you should calculate the value of 10 percent of the allocated storage space and use that result as the threshold value.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **LESS_THAN_THRESHOLD
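The 10-percent guidance above needs a unit conversion, because `FreeStorageSpace` is reported in bytes while storage is allocated in GiB. A minimal sketch, with the instance identifier and allocated size as placeholders:

```python
# Hypothetical helper: alarm when free storage drops below a fraction of the
# allocated storage. FreeStorageSpace is reported in bytes, so convert GiB.
def free_storage_alarm(db_instance_id, allocated_gib, fraction=0.10):
    return {
        "AlarmName": f"{db_instance_id}-free-storage",
        "Namespace": "AWS/RDS",
        "MetricName": "FreeStorageSpace",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 5,
        "Threshold": int(allocated_gib * fraction * 1024**3),  # bytes
        "ComparisonOperator": "LessThanThreshold",
    }

params = free_storage_alarm("my-db-instance", allocated_gib=100)
# boto3.client("cloudwatch").put_metric_alarm(**params)
```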

**MaximumUsedTransactionIDs**  
**Dimensions: **DBInstanceIdentifier  
**Alarm description: **This alarm helps prevent transaction ID wraparound for PostgreSQL. Refer to the troubleshooting steps in [this blog](https://aws.amazon.com/blogs/database/implement-an-early-warning-system-for-transaction-id-wraparound-in-amazon-rds-for-postgresql/) to investigate and resolve the issue. You can also refer to [this blog](https://aws.amazon.com/blogs/database/understanding-autovacuum-in-amazon-rds-for-postgresql-environments/) to familiarize yourself further with autovacuum concepts, common issues and best practices.  
**Intent: **This alarm is used to help prevent transaction ID wraparound for PostgreSQL.  
**Statistic: **Average  
**Recommended threshold: **1.0E9  
**Threshold justification: **Setting this threshold to 1 billion should give you time to investigate the problem. The default `autovacuum_freeze_max_age` value is 200 million. If the age of the oldest transaction is 1 billion, autovacuum is having a problem keeping this threshold below the target of 200 million transaction IDs.  
**Period: **60  
**Datapoints to alarm: **1  
**Evaluation periods: **1  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**ReadLatency**  
**Dimensions: **DBInstanceIdentifier  
**Alarm description: **This alarm helps to monitor high read latency. If storage latency is high, it's because the workload is exceeding resource limits. You can review I/O utilization relative to instance and allocated storage configuration. Refer to [troubleshoot the latency of Amazon EBS volumes caused by an IOPS bottleneck](https://repost.aws/knowledge-center/rds-latency-ebs-iops-bottleneck). For Aurora, you can switch to an instance class that has [I/O-Optimized storage configuration](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.Aurora_Fea_Regions_DB-eng.Feature.storage-type.html). See [Planning I/O in Aurora](https://aws.amazon.com/blogs/database/planning-i-o-in-amazon-aurora/) for guidance.  
**Intent: **This alarm is used to detect high read latency. Database disks normally have a low read/write latency, but they can have issues that can cause high latency operations.  
**Statistic: **p90  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The recommended threshold value for this alarm is highly dependent on your use case. Read latencies higher than 20 milliseconds are likely a cause for investigation. You can also set a higher threshold if your application can have higher latency for read operations. Review the criticality and requirements of read latency and analyze the historical behavior of this metric to determine sensible threshold levels.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**ReplicaLag**  
**Dimensions: **DBInstanceIdentifier  
**Alarm description: **This alarm helps you understand the number of seconds a replica is behind the primary instance. A PostgreSQL Read Replica reports a replication lag of up to five minutes if there are no user transactions occurring on the source database instance. When the ReplicaLag metric reaches 0, the replica has caught up to the primary DB instance. If the ReplicaLag metric returns -1, then replication is currently not active. For guidance related to RDS PostgreSQL, see [replication best practices](https://aws.amazon.com/blogs/database/best-practices-for-amazon-rds-postgresql-replication/) and for troubleshooting `ReplicaLag` and related errors, see [troubleshooting ReplicaLag](https://repost.aws/knowledge-center/rds-postgresql-replication-lag).  
**Intent: **This alarm can detect the replica lag which reflects the data loss that could happen in case of a failure of the primary instance. If the replica gets too far behind the primary and the primary fails, the replica will be missing data that was in the primary instance.  
**Statistic: **Maximum  
**Recommended threshold: **60.0  
**Threshold justification: **Typically, the acceptable lag depends on the application. We recommend no more than 60 seconds.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**WriteLatency**  
**Dimensions: **DBInstanceIdentifier  
**Alarm description: **This alarm helps to monitor high write latency. If storage latency is high, it's because the workload is exceeding resource limits. You can review I/O utilization relative to instance and allocated storage configuration. Refer to [troubleshoot the latency of Amazon EBS volumes caused by an IOPS bottleneck](https://repost.aws/knowledge-center/rds-latency-ebs-iops-bottleneck). For Aurora, you can switch to an instance class that has [I/O-Optimized storage configuration](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Concepts.Aurora_Fea_Regions_DB-eng.Feature.storage-type.html). See [Planning I/O in Aurora](https://aws.amazon.com/blogs/database/planning-i-o-in-amazon-aurora/) for guidance.  
**Intent: **This alarm is used to detect high write latency. Although database disks typically have low read/write latency, they may experience problems that cause high latency operations. Monitoring this metric helps you ensure that disk latency stays as low as expected.  
**Statistic: **p90  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The recommended threshold value for this alarm is highly dependent on your use case. Write latencies higher than 20 milliseconds are likely a cause for investigation. You can also set a higher threshold if your application can have a higher latency for write operations. Review the criticality and requirements of write latency and analyze the historical behavior of this metric to determine sensible threshold levels.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**DBLoad**  
**Dimensions: **DBInstanceIdentifier  
**Alarm description: **This alarm helps to monitor high DB load. If the number of processes exceeds the number of vCPUs, the processes start queuing. When the queuing increases, performance is impacted. If the DB load is often above the maximum vCPU, and the primary wait state is CPU, the CPU is overloaded. In this case, you can monitor `CPUUtilization`, `DBLoadCPU`, and queued tasks in Performance Insights/Enhanced Monitoring. You might want to throttle connections to the instance, tune any SQL queries with a high CPU load, or consider a larger instance class. High and consistent instances of any wait state indicate that there might be bottlenecks or resource contention issues to resolve.  
**Intent: **This alarm is used to detect a high DB load. High DB load can cause performance issues in the DB instance. This alarm is not applicable to serverless DB instances.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The maximum vCPU value is determined by the number of vCPU (virtual CPU) cores for your DB instance. Depending on the maximum vCPU, different values for the threshold can be appropriate. Ideally, DB load should not go above vCPU line.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER_THAN_THRESHOLD
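As a quick illustration of the justification above, you might derive the threshold from the instance's vCPU count with a little headroom. The 10% headroom factor below is an illustrative assumption, not an AWS recommendation.

```python
# Sketch: derive a DBLoad threshold from the vCPU count of the DB
# instance class, alarming slightly below the vCPU line so you get
# early warning before processes start queuing.
def db_load_threshold(vcpu_count, headroom=0.9):
    """Return a DBLoad threshold a little below the vCPU ceiling."""
    return vcpu_count * headroom
```

For example, an instance class with 4 vCPUs would alarm when average DBLoad exceeds 3.6.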

**AuroraVolumeBytesLeftTotal**  
**Dimensions: **DBClusterIdentifier  
**Alarm description: **This alarm helps to monitor low remaining total volume. When the total volume left reaches the size limit, the cluster reports an out-of-space error. Aurora storage automatically scales with the data in the cluster volume and expands up to 128 TiB or 64 TiB depending on the [DB engine version](https://repost.aws/knowledge-center/aurora-version-number). Consider reducing storage by dropping tables and databases that you no longer need. For more information, check [storage scaling](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Managing.Performance.html).  
**Intent: **This alarm is used to detect how close the Aurora cluster is to the volume size limit. This alarm can prevent an out-of-space error that occurs when your cluster runs out of space. This alarm is recommended only for Aurora MySQL.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **You should calculate 10%-20% of the actual size limit based on velocity and trend of volume usage increase, and then use that result as the threshold value to proactively take action before the volume reaches its limit.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **LESS_THAN_THRESHOLD

**AuroraBinlogReplicaLag**  
**Dimensions: **DBClusterIdentifier, Role=WRITER  
**Alarm description: **This alarm helps to monitor the error state of Aurora writer instance replication. For more information, see [Replicating Aurora MySQL DB clusters across AWS Regions](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Replication.CrossRegion.html). For troubleshooting, see [Aurora MySQL replication issues](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/CHAP_Troubleshooting.html#CHAP_Troubleshooting.MySQL).  
**Intent: **This alarm is used to detect whether the writer instance is in an error state and can’t replicate the source. This alarm is recommended only for Aurora MySQL.  
**Statistic: **Average  
**Recommended threshold: **-1.0  
**Threshold justification: **We recommend that you use -1 as the threshold value because Aurora MySQL publishes this value if the replica is in an error state.  
**Period: **60  
**Datapoints to alarm: **2  
**Evaluation periods: **2  
**Comparison Operator: **LESS_THAN_OR_EQUAL_TO_THRESHOLD

**BlockedTransactions**  
**Dimensions: **DBInstanceIdentifier  
**Alarm description: **This alarm helps to monitor a high blocked transaction count in an Aurora DB instance. Blocked transactions can end in either a rollback or a commit. High concurrency, idle transactions, or long-running transactions can lead to blocked transactions. For troubleshooting, see the [Aurora MySQL](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/ams-waits.row-lock-wait.html) documentation.  
**Intent: **This alarm is used to detect a high count of blocked transactions in an Aurora DB instance in order to prevent transaction rollbacks and performance degradation.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **You should calculate 5% of all transactions of your instance using the `ActiveTransactions` metric and use that result as the threshold value. You can also review the criticality and requirements of blocked transactions and analyze the historical behavior of this metric to determine sensible threshold levels.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**BufferCacheHitRatio**  
**Dimensions: **DBInstanceIdentifier  
**Alarm description: **This alarm helps you monitor a consistent low cache hit ratio of the Aurora cluster. A low hit ratio indicates that your queries on this DB instance are frequently going to disk. For troubleshooting, investigate your workload to see which queries are causing this behavior, and see the [DB instance RAM recommendations](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.BestPractices.html#Aurora.BestPractices.Performance.Sizing) document.  
**Intent: **This alarm is used to detect consistent low cache hit ratio in order to prevent a sustained performance decrease in the Aurora instance.  
**Statistic: **Average  
**Recommended threshold: **80.0  
**Threshold justification: **You can set the threshold for buffer cache hit ratio to 80%. However, you can adjust this value based on your acceptable performance level and workload characteristics.  
**Period: **60  
**Datapoints to alarm: **10  
**Evaluation periods: **10  
**Comparison Operator: **LESS_THAN_THRESHOLD

**EngineUptime**  
**Dimensions: **DBClusterIdentifier, Role=WRITER  
**Alarm description: **This alarm helps to monitor low uptime of the writer DB instance. The writer DB instance can go down due to a reboot, maintenance, upgrade, or failover. When the uptime reaches 0 because of a failover in the cluster, and the cluster has one or more Aurora Replicas, an Aurora Replica is promoted to the primary writer instance during the failure event. To increase the availability of your DB cluster, consider creating one or more Aurora Replicas in two or more different Availability Zones. For more information, check [factors that influence Aurora downtime](https://repost.aws/knowledge-center/aurora-mysql-downtime-factors).  
**Intent: **This alarm is used to detect whether the Aurora writer DB instance is in downtime. This can prevent long-running failure in the writer instance that occurs because of a crash or failover.  
**Statistic: **Average  
**Recommended threshold: **0.0  
**Threshold justification: **A failure event results in a brief interruption, during which read and write operations fail with an exception. However, service is typically restored in less than 60 seconds, and often less than 30 seconds.  
**Period: **60  
**Datapoints to alarm: **2  
**Evaluation periods: **2  
**Comparison Operator: **LESS_THAN_OR_EQUAL_TO_THRESHOLD

**RollbackSegmentHistoryListLength**  
**Dimensions: **DBInstanceIdentifier  
**Alarm description: **This alarm helps to monitor a consistently high rollback segment history length of an Aurora instance. A high InnoDB history list length indicates that a large number of old row versions have accumulated, and that queries and database shutdowns have become slower. For more information and troubleshooting, see the [InnoDB history list length increased significantly](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/proactive-insights.history-list.html) documentation.  
**Intent: **This alarm is used to detect consistent high rollback segment history length. This can help you prevent sustained performance degradation and high CPU usage in the Aurora instance. This alarm is recommended only for Aurora MySQL.  
**Statistic: **Average  
**Recommended threshold: **1000000.0  
**Threshold justification: **Setting this threshold to 1 million should give you time to investigate the problem. However, you can adjust this value based on your acceptable performance level and workload characteristics.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**StorageNetworkThroughput**  
**Dimensions: **DBClusterIdentifier, Role=WRITER  
**Alarm description: **This alarm helps to monitor high storage network throughput. If storage network throughput passes the total network bandwidth of the [EC2 instance](https://aws.amazon.com/ec2/instance-types/), it can lead to high read and write latency, which can cause degraded performance. You can check your EC2 instance type from the AWS Console. For troubleshooting, check for changes in write/read latencies and evaluate whether you've also hit an alarm on this metric. If that is the case, evaluate your workload pattern during the times that the alarm was triggered. This can help you identify whether you can optimize your workload to reduce the total amount of network traffic. If this is not possible, you might need to consider scaling your instance.  
**Intent: **This alarm is used to detect high storage network throughput. Detecting high throughput can prevent network packet drops and degraded performance.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **You should calculate about 80%-90% of the total network bandwidth of the EC2 instance type, and then use that result as the threshold value to proactively take action before the network packets are affected. You can also review the criticality and requirements of storage network throughput and analyze the historical behavior of this metric to determine sensible threshold levels.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

## Amazon Route 53 Public Data Plane
<a name="Route53"></a>

**HealthCheckStatus**  
**Dimensions: **HealthCheckId  
**Alarm description: **This alarm helps to detect unhealthy endpoints as reported by Route 53 health checkers. To understand the reason for a failure that results in unhealthy status, use the Health Checkers tab in the Route 53 Health Check Console to view the status from each Region as well as the last failure of the health check. The status tab also displays the reason that the endpoint is reported as unhealthy. Refer to [troubleshooting steps](https://repost.aws/knowledge-center/route-53-fix-unhealthy-health-checks).  
**Intent: **This alarm uses Route53 health checkers to detect unhealthy endpoints.  
**Statistic: **Average  
**Recommended threshold: **1.0  
**Threshold justification: **The status of the endpoint is reported as 1 when it's healthy. Everything less than 1 is unhealthy.  
**Period: **60  
**Datapoints to alarm: **3  
**Evaluation periods: **3  
**Comparison Operator: **LESS_THAN_THRESHOLD
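As a sketch, the recommendation above can be expressed as `PutMetricAlarm` parameters. The alarm name is a placeholder. One practical detail: Route 53 is a global service that publishes health check metrics to the US East (N. Virginia) Region, so the alarm must be created there.

```python
# Sketch: the recommended HealthCheckStatus alarm as PutMetricAlarm
# parameters. Create this alarm in us-east-1, where Route 53 publishes
# health check metrics. The alarm name is a placeholder.
def health_check_alarm_params(health_check_id):
    """Build PutMetricAlarm parameters for the settings above."""
    return {
        "AlarmName": f"route53-{health_check_id}-unhealthy",
        "Namespace": "AWS/Route53",
        "MetricName": "HealthCheckStatus",
        "Dimensions": [{"Name": "HealthCheckId", "Value": health_check_id}],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 3,
        "DatapointsToAlarm": 3,
        "Threshold": 1.0,                         # 1 = healthy
        "ComparisonOperator": "LessThanThreshold",  # alarm on anything below 1
    }
```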

## Amazon S3
<a name="S3"></a>

**4xxErrors**  
**Dimensions: **BucketName, FilterId  
**Alarm description: **This alarm helps you report the total number of 4xx error status codes that are made in response to client requests. For example, 403 error codes might indicate an incorrect IAM policy, and 404 error codes might indicate a misbehaving client application. [Enabling S3 server access logging](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-server-access-logging.html) on a temporary basis will help you to pinpoint the issue's origin using the fields HTTP status and Error Code. To understand more about the error code, see [Error Responses](https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html).  
**Intent: **This alarm is used to create a baseline for typical 4xx error rates so that you can look into any abnormalities that might indicate a setup issue.  
**Statistic: **Average  
**Recommended threshold: **0.05  
**Threshold justification: **The recommended threshold is to detect if more than 5% of total requests are getting 4XX errors. Frequently occurring 4XX errors should be alarmed. However, setting a very low value for the threshold can cause alarm to be too sensitive. You can also tune the threshold to suit to the load of the requests, accounting for an acceptable level of 4XX errors. You can also analyze historical data to find the acceptable error rate for the application workload, and then tune the threshold accordingly.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER_THAN_THRESHOLD
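A sketch of the recommendation above as `PutMetricAlarm` parameters. With the Average statistic, the per-request 4xxErrors metric behaves as an error rate, which is why the 0.05 threshold reads as "5% of requests". The alarm name is a placeholder, and the filter ID refers to a request metrics filter you have already configured on the bucket.

```python
# Sketch: the recommended S3 4xxErrors alarm as PutMetricAlarm parameters.
# Requires S3 request metrics to be enabled on the bucket with the given
# filter ID. The alarm name is a placeholder.
def s3_4xx_rate_alarm_params(bucket_name, filter_id):
    """Build PutMetricAlarm parameters for the settings above."""
    return {
        "AlarmName": f"s3-{bucket_name}-4xx-rate",
        "Namespace": "AWS/S3",
        "MetricName": "4xxErrors",
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket_name},
            {"Name": "FilterId", "Value": filter_id},
        ],
        # Average over per-request error flags gives the error rate;
        # 0.05 therefore means 5% of requests returned a 4xx code.
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 15,
        "DatapointsToAlarm": 15,
        "Threshold": 0.05,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```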

**5xxErrors**  
**Dimensions: **BucketName, FilterId  
**Alarm description: **This alarm helps you detect a high number of server-side errors. These errors indicate that a client made a request that the server couldn’t complete. This can help you correlate the issue your application is facing because of S3. For more information to help you efficiently handle or reduce errors, see [Optimizing performance design patterns](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-design-patterns.html#optimizing-performance-timeouts-retries). Errors might also be caused by an issue with S3. Check the [AWS service health dashboard](https://health.aws.amazon.com/health/status) for the status of Amazon S3 in your Region.  
**Intent: **This alarm can help to detect if the application is experiencing issues due to 5xx errors.  
**Statistic: **Average  
**Recommended threshold: **0.05  
**Threshold justification: **We recommend setting the threshold to detect if more than 5% of total requests are getting 5XXError. However, you can tune the threshold to suit the traffic of the requests, as well as acceptable error rates. You can also analyze historical data to see what is the acceptable error rate for the application workload, and tune the threshold accordingly.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**OperationsFailedReplication**  
**Dimensions: **SourceBucket, DestinationBucket, RuleId  
**Alarm description: **This alarm helps in understanding a replication failure. This metric tracks the status of new objects replicated using S3 CRR or S3 SRR, and also tracks existing objects replicated using S3 batch replication. See [Replication troubleshooting](https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication-troubleshoot.html) for more details.  
**Intent: **This alarm is used to detect if there is a failed replication operation.  
**Statistic: **Maximum  
**Recommended threshold: **0.0  
**Threshold justification: **This metric emits a value of 0 for successful operations, and nothing when there are no replication operations carried out for the minute. When the metric emits a value greater than 0, the replication operation is unsuccessful.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD
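Because this metric emits nothing during minutes with no replication activity (see the justification above), it helps to tell CloudWatch how to treat missing data. The sketch below adds `TreatMissingData: notBreaching` on the assumption that quiet minutes should not alarm; the alarm name is a placeholder.

```python
# Sketch: the recommended OperationsFailedReplication alarm as
# PutMetricAlarm parameters. The alarm name is a placeholder.
def replication_failure_alarm_params(source_bucket, destination_bucket, rule_id):
    """Build PutMetricAlarm parameters for the settings above."""
    return {
        "AlarmName": f"s3-{source_bucket}-replication-failures",
        "Namespace": "AWS/S3",
        "MetricName": "OperationsFailedReplication",
        "Dimensions": [
            {"Name": "SourceBucket", "Value": source_bucket},
            {"Name": "DestinationBucket", "Value": destination_bucket},
            {"Name": "RuleId", "Value": rule_id},
        ],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 5,
        "Threshold": 0.0,      # any value > 0 indicates a failed operation
        "ComparisonOperator": "GreaterThanThreshold",
        # The metric emits no value in minutes without replication activity,
        # so treat missing data as healthy rather than as a breach.
        "TreatMissingData": "notBreaching",
    }
```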

## S3 Object Lambda
<a name="S3ObjectLambda"></a>

**4xxErrors**  
**Dimensions: **AccessPointName, DataSourceARN  
**Alarm description: **This alarm helps you report the total number of 4xx error status codes that are made in response to client requests. [Enabling S3 server access logging](https://docs.aws.amazon.com/AmazonS3/latest/userguide/enable-server-access-logging.html) on a temporary basis will help you to pinpoint the issue's origin using the fields HTTP status and Error Code.  
**Intent: **This alarm is used to create a baseline for typical 4xx error rates so that you can look into any abnormalities that might indicate a setup issue.  
**Statistic: **Average  
**Recommended threshold: **0.05  
**Threshold justification: **We recommend setting the threshold to detect if more than 5% of total requests are getting 4XXError. Frequently occurring 4XX errors should be alarmed. However, setting a very low value for the threshold can cause alarm to be too sensitive. You can also tune the threshold to suit to the load of the requests, accounting for an acceptable level of 4XX errors. You can also analyze historical data to find the acceptable error rate for the application workload, and then tune the threshold accordingly.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**5xxErrors**  
**Dimensions: **AccessPointName, DataSourceARN  
**Alarm description: **This alarm helps to detect a high number of server-side errors. These errors indicate that a client made a request that the server couldn’t complete. These errors might be caused by an issue with S3. Check the [AWS service health dashboard](https://health.aws.amazon.com/health/status) for the status of Amazon S3 in your Region. This can help you correlate the issue your application is facing because of S3. For information to help you efficiently handle or reduce these errors, see [Optimizing performance design patterns](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-design-patterns.html#optimizing-performance-timeouts-retries).  
**Intent: **This alarm can help to detect if the application is experiencing issues due to 5xx errors.  
**Statistic: **Average  
**Recommended threshold: **0.05  
**Threshold justification: **We recommend setting the threshold to detect if more than 5% of total requests are getting 5XX errors. However, you can tune the threshold to suit the traffic of the requests, as well as acceptable error rates. You can also analyze historical data to see what is the acceptable error rate for the application workload, and tune the threshold accordingly.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**LambdaResponse4xx**  
**Dimensions: **AccessPointName, DataSourceARN  
**Alarm description: **This alarm helps you detect and diagnose 4xx failures in calls to S3 Object Lambda. These errors can be caused by errors or misconfigurations in the Lambda function responsible for responding to your requests. Investigating the CloudWatch Log Streams of the Lambda function associated with the Object Lambda Access Point can help you pinpoint the issue's origin based on the response from S3 Object Lambda.  
**Intent: **This alarm is used to detect 4xx client errors for WriteGetObjectResponse calls.  
**Statistic: **Average  
**Recommended threshold: **0.05  
**Threshold justification: **We recommend setting the threshold to detect if more than 5% of total requests are getting 4XXError. Frequently occurring 4XX errors should be alarmed. However, setting a very low value for the threshold can cause alarm to be too sensitive. You can also tune the threshold to suit to the load of the requests, accounting for an acceptable level of 4XX errors. You can also analyze historical data to find the acceptable error rate for the application workload, and then tune the threshold accordingly.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER_THAN_THRESHOLD

## Amazon SNS
<a name="SNS"></a>

**NumberOfMessagesPublished**  
**Dimensions: **TopicName  
**Alarm description: **This alarm can detect when the number of SNS messages published is too low. For troubleshooting, check why the publishers are sending less traffic.  
**Intent: **This alarm helps you proactively monitor and detect significant drops in notification publishing. This helps you identify potential issues with your application or business processes, so that you can take appropriate actions to maintain the expected flow of notifications. You should create this alarm if you expect your system to serve a minimum level of traffic.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The number of messages published should be in line with the expected number of published messages for your application. You can also analyze the historical data, trends and traffic to find the right threshold.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **LESS_THAN_THRESHOLD

**NumberOfNotificationsDelivered**  
**Dimensions: **TopicName  
**Alarm description: **This alarm can detect when the number of SNS messages delivered is too low. This could be because of unintentional unsubscribing of an endpoint, or because of an SNS event that causes messages to experience delay.  
**Intent: **This alarm helps you detect a drop in the volume of messages delivered. You should create this alarm if you expect your system to serve a minimum level of traffic.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The number of messages delivered should be in line with the expected number of messages produced and the number of consumers. You can also analyze the historical data, trends and traffic to find the right threshold.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **LESS_THAN_THRESHOLD

**NumberOfNotificationsFailed**  
**Dimensions: **TopicName  
**Alarm description: **This alarm can detect when the number of failed SNS messages is too high. To troubleshoot failed notifications, enable logging to CloudWatch Logs. Checking the logs can help you find which subscribers are failing, as well as the status codes they are returning.  
**Intent: **This alarm helps you proactively find issues with the delivery of notifications and take appropriate actions to address them.  
**Statistic: **Sum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The recommended threshold value for this alarm is highly dependent on the impact of failed notifications. Review the SLAs provided to your end users, fault tolerance and criticality of notifications and analyze historical data, and then select a threshold accordingly. The number of notifications failed should be 0 for topics that have only SQS, Lambda or Firehose subscriptions.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**NumberOfNotificationsFilteredOut-InvalidAttributes**  
**Dimensions: **TopicName  
**Alarm description: **This alarm helps to monitor and resolve potential problems with the publisher or subscribers. Check if a publisher is publishing messages with invalid attributes or if an inappropriate filter is applied to a subscriber. You can also analyze CloudWatch Logs to help find the root cause of the issue.  
**Intent: **The alarm is used to detect if the published messages are not valid or if inappropriate filters have been applied to a subscriber.  
**Statistic: **Sum  
**Recommended threshold: **0.0  
**Threshold justification: **Invalid attributes are almost always a mistake by the publisher. We recommend setting the threshold to 0 because invalid attributes are not expected in a healthy system.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**NumberOfNotificationsFilteredOut-InvalidMessageBody**  
**Dimensions: **TopicName  
**Alarm description: **This alarm helps to monitor and resolve potential problems with the publisher or subscribers. Check if a publisher is publishing messages with invalid message bodies, or if an inappropriate filter is applied to a subscriber. You can also analyze CloudWatch Logs to help find the root cause of the issue.  
**Intent: **The alarm is used to detect if the published messages are not valid or if inappropriate filters have been applied to a subscriber.  
**Statistic: **Sum  
**Recommended threshold: **0.0  
**Threshold justification: **Invalid message bodies are almost always a mistake by the publisher. We recommend setting the threshold to 0 because invalid message bodies are not expected in a healthy system.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**NumberOfNotificationsRedrivenToDlq**  
**Dimensions: **TopicName  
**Alarm description: **This alarm helps to monitor the number of messages that are moved to a dead-letter queue.  
**Intent: **The alarm is used to detect messages that moved to a dead-letter queue. We recommend that you create this alarm when SNS is coupled with SQS, Lambda or Firehose.  
**Statistic: **Sum  
**Recommended threshold: **0.0  
**Threshold justification: **In a healthy system of any subscriber type, messages should not be moved to the dead-letter queue. We recommend that you be notified if any messages land in the queue, so that you can identify and address the root cause, and potentially redrive the messages in the dead-letter queue to prevent data loss.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**NumberOfNotificationsFailedToRedriveToDlq**  
**Dimensions: **TopicName  
**Alarm description: **This alarm helps to monitor messages that couldn't be moved to a dead-letter queue. Check whether your dead-letter queue exists and that it's configured correctly. Also, verify that SNS has permissions to access the dead-letter queue. Refer to the [dead-letter queue documentation](https://docs.aws.amazon.com/sns/latest/dg/sns-dead-letter-queues.html) to learn more.  
**Intent: **The alarm is used to detect messages that couldn't be moved to a dead-letter queue.  
**Statistic: **Sum  
**Recommended threshold: **0.0  
**Threshold justification: **It's almost always a mistake if messages can't be moved to the dead-letter queue. The recommendation for the threshold is 0, meaning all messages that fail processing must be able to reach the dead-letter queue when the queue has been configured.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**SMSMonthToDateSpentUSD**  
**Dimensions: **TopicName  
**Alarm description: **The alarm helps to monitor if you have a sufficient quota in your account for SNS to be able to deliver messages. If you reach your quota, SNS won't be able to deliver SMS messages. For information about setting your monthly SMS spend quota, or for information about requesting a spend quota increase with AWS, see [Setting SMS messaging preferences](https://docs.aws.amazon.com/sns/latest/dg/sms_preferences.html).  
**Intent: **This alarm is used to detect if you have a sufficient quota in your account for your SMS messages to be delivered successfully.  
**Statistic: **Maximum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold in accordance with the quota (Account spend limit) for the account. Choose a threshold which informs you early enough that you are reaching your quota limit so that you have time to request an increase.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

**SMSSuccessRate**  
**Dimensions: **TopicName  
**Alarm description: **This alarm helps to monitor the rate of failing SMS message deliveries. You can set up [CloudWatch Logs](https://docs.aws.amazon.com/sns/latest/dg/sms_stats_cloudwatch.html) to understand the nature of the failure and take action based on that.  
**Intent: **This alarm is used to detect failing SMS message deliveries.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **Set the threshold for the alarm in line with your tolerance for failing SMS message deliveries.  
**Period: **60  
**Datapoints to alarm: **5  
**Evaluation periods: **5  
**Comparison Operator: **GREATER_THAN_THRESHOLD

## Amazon SQS
<a name="SQS"></a>

**ApproximateAgeOfOldestMessage**  
**Dimensions: **QueueName  
**Alarm description: **This alarm watches the age of the oldest message in the queue. You can use this alarm to monitor whether your consumers are processing SQS messages at the desired speed. Consider increasing the consumer count or consumer throughput to reduce message age. This metric can be used in combination with `ApproximateNumberOfMessagesVisible` to determine how big the queue backlog is and how quickly messages are being processed. To prevent messages from being deleted before they are processed, consider configuring a dead-letter queue to sideline potential poison-pill messages.  
**Intent: **This alarm is used to detect whether the age of the oldest message in the `QueueName` queue is too high. High age can be an indication that messages are not processed quickly enough or that there are some poison-pill messages that are stuck in the queue and can't be processed.   
**Statistic: **Maximum  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The recommended threshold value for this alarm is highly dependent on the expected message processing time. You can use historical data to calculate the average message processing time, and then set the threshold to 50% higher than the maximum expected SQS message processing time by queue consumers.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER_THAN_OR_EQUAL_TO_THRESHOLD
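The threshold guidance above (50% above the maximum expected processing time) can be sketched as a small helper; the 1.5 multiplier simply encodes that guidance.

```python
# Sketch: derive the ApproximateAgeOfOldestMessage threshold as 50% above
# the maximum expected per-message processing time, per the guidance above.
def oldest_message_age_threshold(max_processing_seconds):
    """Return an alarm threshold in seconds of message age."""
    return max_processing_seconds * 1.5
```

For example, if your slowest messages take about 40 seconds to process, the alarm threshold would be 60 seconds of age.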

**ApproximateNumberOfMessagesNotVisible**  
**Dimensions: **QueueName  
**Alarm description: **This alarm helps to detect a high number of in-flight messages with respect to `QueueName`. For troubleshooting, check [message backlog decreasing](https://repost.aws/knowledge-center/sqs-message-backlog).  
**Intent: **This alarm is used to detect a high number of in-flight messages in the queue. If consumers do not delete messages within the visibility timeout period, when the queue is polled, messages reappear in the queue. For FIFO queues, there can be a maximum of 20,000 in-flight messages. If you reach this quota, SQS returns no error messages. A FIFO queue looks through the first 20,000 messages to determine available message groups. This means that if you have a backlog of messages in a single message group, you cannot consume messages from other message groups that were sent to the queue at a later time until you successfully consume the messages from the backlog.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **The recommended threshold value for this alarm is highly dependent on the expected number of messages in flight. You can use historical data to calculate the maximum expected number of messages in flight and set the threshold to 50% over this value. If consumers of the queue are processing but not deleting messages from the queue, this number will suddenly increase.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER_THAN_OR_EQUAL_TO_THRESHOLD

**ApproximateNumberOfMessagesVisible**  
**Dimensions: **QueueName  
**Alarm description: **This alarm watches for the message queue backlog to be bigger than expected, indicating that consumers are too slow or there are not enough consumers. Consider increasing the consumer count or speeding up consumers, if this alarm goes into ALARM state.  
**Intent: **This alarm is used to detect whether the message count of the active queue is too high and consumers are slow to process the messages or there are not enough consumers to process them.  
**Statistic: **Average  
**Recommended threshold: **Depends on your situation  
**Threshold justification: **An unexpectedly high number of messages visible indicates that messages are not being processed by a consumer at the expected rate. You should consider historical data when you set this threshold.  
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **GREATER_THAN_OR_EQUAL_TO_THRESHOLD

**NumberOfMessagesSent**  
**Dimensions: **QueueName  
**Alarm description: **This alarm helps to detect when no messages are being sent from a producer to the `QueueName` queue. For troubleshooting, check the reason that the producer is not sending messages.  
**Intent: **This alarm is used to detect when a producer stops sending messages.  
**Statistic: **Sum  
**Recommended threshold: **0.0  
**Threshold justification: **If the number of messages sent is 0, the producer is not sending any messages. If this queue has a low TPS, increase the number of EvaluationPeriods accordingly.   
**Period: **60  
**Datapoints to alarm: **15  
**Evaluation periods: **15  
**Comparison Operator: **LESS_THAN_OR_EQUAL_TO_THRESHOLD

## Site-to-Site VPN
<a name="VPN"></a>

**TunnelState**  
**Dimensions: **VpnId  
**Alarm description: **This alarm helps you understand if the state of one or more tunnels is DOWN. For troubleshooting, see [VPN tunnel troubleshooting](https://repost.aws/knowledge-center/vpn-tunnel-troubleshooting).  
**Intent: **This alarm is used to detect if at least one tunnel is in the DOWN state for this VPN, so that you can troubleshoot the impacted VPN. This alarm will always be in the ALARM state for networks that only have a single tunnel configured.  
**Statistic: **Minimum  
**Recommended threshold: **1.0  
**Threshold justification: **A value less than 1 indicates that at least one tunnel is in DOWN state.  
**Period: **300  
**Datapoints to alarm: **3  
**Evaluation periods: **3  
**Comparison Operator: **LESS_THAN_THRESHOLD
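
The choice of the Minimum statistic here can be sketched in a few lines. `TunnelState` reports 1 for an UP tunnel and 0 for a DOWN tunnel, so the minimum across samples drops below 1 as soon as any tunnel is down (the sample states below are hypothetical):

```python
# Sketch: Statistic=Minimum with threshold 1 detects any DOWN tunnel,
# because TunnelState is 1 (UP) or 0 (DOWN) per tunnel sample.

def vpn_alarm_breaching(tunnel_states):
    """tunnel_states: list of 0/1 samples, one per tunnel."""
    return min(tunnel_states) < 1

print(vpn_alarm_breaching([1, 1]))  # False: both tunnels UP
print(vpn_alarm_breaching([1, 0]))  # True: one tunnel DOWN
```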

**TunnelState**  
**Dimensions: **TunnelIpAddress  
**Alarm description: **This alarm helps you understand if the state of this tunnel is DOWN. For troubleshooting, see [VPN tunnel troubleshooting](https://repost.aws/knowledge-center/vpn-tunnel-troubleshooting).  
**Intent: **This alarm is used to detect if the tunnel is in the DOWN state, so that you can troubleshoot the impacted VPN. This alarm will always be in the ALARM state for networks that only have a single tunnel configured.  
**Statistic: **Minimum  
**Recommended threshold: **1.0  
**Threshold justification: **A value less than 1 indicates that the tunnel is in DOWN state.  
**Period: **300  
**Datapoints to alarm: **3  
**Evaluation periods: **3  
**Comparison Operator: **LESS_THAN_THRESHOLD

# Alarm use cases and examples
<a name="Alarm-Use-Cases"></a>

The following sections provide examples and tutorials for alarms for common use cases.

**Topics**
+ [Create a billing alarm to monitor your estimated AWS charges](monitor_estimated_charges_with_cloudwatch.md)
+ [Create a CPU usage alarm](US_AlarmAtThresholdEC2.md)
+ [Create a load balancer latency alarm that sends email](US_AlarmAtThresholdELB.md)
+ [Create a storage throughput alarm that sends email](US_AlarmAtThresholdEBS.md)
+ [Create an alarm on Performance Insights counter metrics from an AWS database](CloudWatch_alarm_database_performance_insights.md)

# Create a billing alarm to monitor your estimated AWS charges
<a name="monitor_estimated_charges_with_cloudwatch"></a>

You can monitor your estimated AWS charges by using Amazon CloudWatch. When you enable the monitoring of estimated charges for your AWS account, the estimated charges are calculated and sent several times daily to CloudWatch as metric data.

Billing metric data is stored in the US East (N. Virginia) Region and represents worldwide charges. This data includes the estimated charges for every service in AWS that you use, in addition to the estimated overall total of your AWS charges.

The alarm triggers when your current account billing exceeds the threshold that you specify. It doesn't use projections based on your usage so far in the month.

If you create a billing alarm at a time when your charges have already exceeded the threshold, the alarm goes to the `ALARM` state immediately.
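
The evaluation described above is a straightforward static comparison, sketched here (the Greater comparison matches the alarm created in the procedure below):

```python
# Sketch: a billing alarm compares the current EstimatedCharges value
# against a static threshold; it does not project month-end charges.

def billing_alarm_state(estimated_charges_usd, threshold_usd):
    return "ALARM" if estimated_charges_usd > threshold_usd else "OK"

print(billing_alarm_state(150.0, 200.0))  # OK
print(billing_alarm_state(215.0, 200.0))  # ALARM: charges already exceed 200
```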

**Note**  
For information about analyzing CloudWatch charges that you have already been billed for, see [Analyzing, optimizing, and reducing CloudWatch costs](cloudwatch_billing.md).

**Topics**
+ [Enabling billing alerts](#turning_on_billing_metrics)
+ [Create a billing alarm](#creating_billing_alarm_with_wizard)
+ [Deleting a billing alarm](#deleting_billing_alarm)

## Enabling billing alerts
<a name="turning_on_billing_metrics"></a>

Before you can create an alarm for your estimated charges, you must enable billing alerts, so that you can monitor your estimated AWS charges and create an alarm using billing metric data. After you enable billing alerts, you can't disable data collection, but you can delete any billing alarms that you created.

After you enable billing alerts for the first time, it takes about 15 minutes before you can view billing data and set billing alarms.

**Requirements**
+ You must be signed in using account root user credentials or as an IAM user that has been given permission to view billing information.
+ For consolidated billing accounts, billing data for each linked account can be found by logging in as the paying account. You can view billing data for total estimated charges and estimated charges by service for each linked account, in addition to the consolidated account.
+ In a consolidated billing account, member linked account metrics are captured only if the payer account enables the **Receive Billing Alerts** preference. If you change which account is your management/payer account, you must enable the billing alerts in the new management/payer account.
+ The account must not be part of the Amazon Partner Network (APN) because billing metrics are not published to CloudWatch for APN accounts. For more information, see [AWS Partner Network](https://aws.amazon.com/partners/).

**To enable the monitoring of estimated charges**

1. Open the AWS Billing and Cost Management console at [https://console.aws.amazon.com/costmanagement/](https://console.aws.amazon.com/costmanagement/).

1. In the navigation pane, choose **Billing Preferences**.

1. Next to **Alert preferences**, choose **Edit**.

1. Choose **Receive CloudWatch Billing Alerts**.

1. Choose **Save preferences**.

## Create a billing alarm
<a name="creating_billing_alarm_with_wizard"></a>

**Important**  
 Before you create a billing alarm, you must set your Region to US East (N. Virginia). Billing metric data is stored in this Region and represents worldwide charges. You also must enable billing alerts for your account or in the management/payer account (if you are using consolidated billing). For more information, see [Enabling billing alerts](#turning_on_billing_metrics). 

 In this procedure, you create an alarm that sends a notification when your estimated charges for AWS exceed a defined threshold. 

**To create a billing alarm using the CloudWatch console**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1.  In the navigation pane, choose **Alarms**, and then choose **All alarms**. 

1.  Choose **Create alarm**. 

1.  Choose **Select metric**. In **AWS Namespaces**, choose **Billing**, and then choose **Total Estimated Charge**. 
**Note**  
 If you don't see the **Billing**/**Total Estimated Charge** metric, enable billing alerts, and change your Region to US East (N. Virginia). For more information, see [Enabling billing alerts](#turning_on_billing_metrics). 

1.  Select the box for the **EstimatedCharges** metric, and then choose **Select metric**. 

1. For **Statistic**, choose **Maximum**.

1. For **Period**, choose **6 hours**.

1.  For **Threshold type**, choose **Static**. 

1.  For **Whenever EstimatedCharges is . . .**, choose **Greater**. 

1.  For **than . . .**, define the value that you want to cause your alarm to trigger. For example, **200** USD. 

   The **EstimatedCharges** metric values are only in US dollars (USD), and the currency conversion is provided by Amazon Services LLC. For more information, see [ What is AWS Billing?](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/billing-what-is.html).
**Note**  
 After you define a threshold value, the preview graph displays your estimated charges for the current month. 

1. Choose **Additional Configuration** and do the following:
   + For **Datapoints to alarm**, specify **1 out of 1**.
   + For **Missing data treatment**, choose **Treat missing data as missing**.

1.  Choose **Next**. 

1.  Under **Notification**, ensure that **In alarm** is selected. Then specify an Amazon SNS topic to be notified when your alarm is in the `ALARM` state. The Amazon SNS topic can include your email address so that you receive email when the billing amount crosses the threshold that you specified.

   You can select an existing Amazon SNS topic, create a new Amazon SNS topic, or use a topic ARN to notify another account. If you want your alarm to send multiple notifications for the same alarm state or for different alarm states, choose **Add notification**. 

1.  Choose **Next**. 

1.  Under **Name and description**, enter a name for your alarm. The name must contain only UTF-8 characters, and can't contain ASCII control characters. 

   1.  (Optional) Enter a description of your alarm. The description can include markdown formatting, which is displayed only in the alarm **Details** tab in the CloudWatch console. The markdown can be useful to add links to runbooks or other internal resources. 

1. Choose **Next**.

1.  Under **Preview and create**, make sure that your configuration is correct, and then choose **Create alarm**. 

## Deleting a billing alarm
<a name="deleting_billing_alarm"></a>

You can delete your billing alarm when you no longer need it.

**To delete a billing alarm**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. If necessary, change the Region to US East (N. Virginia). Billing metric data is stored in this Region and reflects worldwide charges.

1. In the navigation pane, choose **Alarms**, **All alarms**.

1. Select the check box next to the alarm and choose **Actions**, **Delete**.

1. When prompted for confirmation, choose **Yes, Delete**.

# Create a CPU usage alarm
<a name="US_AlarmAtThresholdEC2"></a>

You can create a CloudWatch alarm that sends a notification using Amazon SNS when the alarm changes state from `OK` to `ALARM`.

The alarm changes to the `ALARM` state when the average CPU use of an EC2 instance exceeds a specified threshold for consecutive specified periods.

## Setting up a CPU usage alarm using the AWS Management Console
<a name="cpu-usage-alarm-console"></a>

Use these steps to use the AWS Management Console to create a CPU usage alarm.

**To create an alarm based on CPU usage**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, **All Alarms**.

1. Choose **Create alarm**.

1. Choose **Select metric**.

1. In the **All metrics** tab, choose **EC2 metrics**.

1. Choose a metric category (for example, **Per-Instance Metrics**).

1. Find the row with the instance that you want listed in the **InstanceId** column and **CPUUtilization** in the **Metric Name** column. Select the check box next to this row, and choose **Select metric**.

1. Under **Specify metric and conditions**, for **Statistic**, choose **Average**, choose one of the predefined percentiles, or specify a custom percentile (for example, **p95.45**).

1. Choose a period (for example, **5 minutes**).

1. Under **Conditions**, specify the following:

   1. For **Threshold type**, choose **Static**.

   1. For **Whenever CPUUtilization is**, specify **Greater**. Under **than...**, specify the threshold that is to trigger the alarm to go to ALARM state if the CPU utilization exceeds this percentage. For example, 70.

   1. Choose **Additional configuration**. For **Datapoints to alarm**, specify how many evaluation periods (data points) must be in the `ALARM` state to trigger the alarm. If the two values here match, you create an alarm that goes to `ALARM` state if that many consecutive periods are breaching.

      To create an M out of N alarm, specify a lower number for the first value than you specify for the second value. For more information, see [Alarm evaluation](alarm-evaluation.md).

   1. For **Missing data treatment**, choose how to have the alarm behave when some data points are missing. For more information, see [Configuring how CloudWatch alarms treat missing data](alarms-and-missing-data.md).

   1. If the alarm uses a percentile as the monitored statistic, a **Percentiles with low samples** box appears. Use it to choose whether to evaluate or ignore cases with low sample rates. If you choose **ignore (maintain alarm state)**, the current alarm state is always maintained when the sample size is too low. For more information, see [Percentile-based alarms and low data samples](percentiles-with-low-samples.md). 

1. Choose **Next**.

1. Under **Notification**, choose **In alarm** and select an SNS topic to notify when the alarm is in `ALARM` state.

   To have the alarm send multiple notifications for the same alarm state or for different alarm states, choose **Add notification**.

   To have the alarm not send notifications, choose **Remove**.

1. When finished, choose **Next**.

1. Enter a name and description for the alarm. Then choose **Next**.

   The name must contain only UTF-8 characters, and can't contain ASCII control characters. The description can include markdown formatting, which is displayed only in the alarm **Details** tab in the CloudWatch console. The markdown can be useful to add links to runbooks or other internal resources.

1. Under **Preview and create**, confirm that the information and conditions are what you want, then choose **Create alarm**.
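
The M out of N evaluation mentioned in the **Datapoints to alarm** step can be sketched as follows (the CPU samples are hypothetical 5-minute averages):

```python
# Sketch: an M-out-of-N alarm goes to ALARM when at least M of the last N
# datapoints breach the threshold. M == N gives the consecutive-periods
# behavior described above.

def evaluate(datapoints, threshold, m, n):
    recent = datapoints[-n:]
    breaching = sum(1 for d in recent if d > threshold)
    return "ALARM" if breaching >= m else "OK"

cpu = [65, 72, 40, 75, 71]      # hypothetical 5-minute CPU averages
print(evaluate(cpu, 70, 3, 5))  # ALARM: 3 of the last 5 exceed 70
print(evaluate(cpu, 70, 5, 5))  # OK: not all 5 consecutive periods breach
```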

## Setting up a CPU usage alarm using the AWS CLI
<a name="cpu-usage-alarm-cli"></a>

Use these steps to use the AWS CLI to create a CPU usage alarm.

**To create an alarm based on CPU usage**

1. Set up an SNS topic. For more information, see [Setting up Amazon SNS notifications](Notify_Users_Alarm_Changes.md#US_SetupSNS).

1. Create an alarm using the [put-metric-alarm](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/put-metric-alarm.html) command as follows. 

   ```
   aws cloudwatch put-metric-alarm --alarm-name cpu-mon --alarm-description "Alarm when CPU exceeds 70%" --metric-name CPUUtilization --namespace AWS/EC2 --statistic Average --period 300 --threshold 70 --comparison-operator GreaterThanThreshold --dimensions  Name=InstanceId,Value=i-12345678 --evaluation-periods 2 --alarm-actions arn:aws:sns:us-east-1:111122223333:my-topic --unit Percent
   ```

1. Test the alarm by forcing an alarm state change using the [set-alarm-state](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/set-alarm-state.html) command.

   1. Change the alarm state from `INSUFFICIENT_DATA` to `OK`.

      ```
      aws cloudwatch set-alarm-state --alarm-name cpu-mon --state-reason "initializing" --state-value OK
      ```

   1. Change the alarm state from `OK` to `ALARM`.

      ```
      aws cloudwatch set-alarm-state --alarm-name cpu-mon --state-reason "initializing" --state-value ALARM
      ```

   1. Check that you have received a notification about the alarm.
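
The alarm created with `put-metric-alarm` above uses `--evaluation-periods 2` with no separate datapoints-to-alarm setting, so both of the two most recent 5-minute averages must exceed 70 before it fires. A minimal sketch of that evaluation (with hypothetical datapoints):

```python
# Sketch: with evaluation-periods 2, both of the two most recent 5-minute
# CPU averages must exceed the 70% threshold for the alarm to breach.

def cpu_alarm_breaching(five_min_averages, threshold=70, periods=2):
    recent = five_min_averages[-periods:]
    return len(recent) == periods and all(v > threshold for v in recent)

print(cpu_alarm_breaching([55, 70, 71, 72]))  # True: last two exceed 70
print(cpu_alarm_breaching([55, 72, 71, 70]))  # False: last period is not > 70
```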

# Create a load balancer latency alarm that sends email
<a name="US_AlarmAtThresholdELB"></a>

You can set up an Amazon SNS notification and configure an alarm that monitors latency exceeding 100 ms for your Classic Load Balancer.

## Setting up a latency alarm using the AWS Management Console
<a name="load-balancer-alarm-console"></a>

Use these steps to use the AWS Management Console to create a load balancer latency alarm.

**To create a load balancer latency alarm**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, **All Alarms**.

1. Choose **Create alarm**.

1. Under **CloudWatch Metrics by Category**, choose the **ELB Metrics** category.

1. Select the row with the Classic Load Balancer and the **Latency** metric.

1. For the statistic, choose **Average**, choose one of the predefined percentiles, or specify a custom percentile (for example, **p95.45**).

1. For the period, choose **1 Minute**.

1. Choose **Next**.

1. Under **Alarm Threshold**, enter a unique name for the alarm (for example, **myHighLatencyAlarm**) and a description of the alarm (for example, **Alarm when Latency exceeds 100 ms**).

   The name must contain only UTF-8 characters, and can't contain ASCII control characters. The description can include markdown formatting, which is displayed only in the alarm **Details** tab in the CloudWatch console. The markdown can be useful to add links to runbooks or other internal resources.

1. Under **Whenever**, for **is**, choose **>** and enter **0.1**. For **for**, enter **3**.

1. Under **Additional settings**, for **Treat missing data as**, choose **ignore (maintain alarm state)** so that missing data points don't trigger alarm state changes.

   For **Percentiles with low samples**, choose **ignore (maintain the alarm state)** so that the alarm evaluates only situations with adequate numbers of data samples. 

1. Under **Actions**, for **Whenever this alarm**, choose **State is ALARM**. For **Send notification to**, choose an existing SNS topic or create a new one.

   To create an SNS topic, choose **New list**. For **Send notification to**, enter a name for the SNS topic (for example, **myHighLatencyAlarm**), and for **Email list**, enter a comma-separated list of email addresses to be notified when the alarm changes to the `ALARM` state. Each email address is sent a topic subscription confirmation email. You must confirm the subscription before notifications can be sent.

1. Choose **Create Alarm**.
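
For the percentile statistics mentioned in the procedure (such as **p95.45**), one common definition uses linear interpolation over sorted samples. This is an illustrative sketch only; CloudWatch's internal percentile computation may differ:

```python
# Sketch: a custom percentile over latency samples via linear interpolation.
# Illustrates the statistic conceptually; not CloudWatch's exact algorithm.

def percentile(samples, p):
    s = sorted(samples)
    rank = (p / 100) * (len(s) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (rank - lo)

latencies = [0.02, 0.03, 0.05, 0.08, 0.30]  # hypothetical, in seconds
print(percentile(latencies, 50))     # 0.05
print(percentile(latencies, 95.45))  # near the slowest sample
```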

## Setting up a latency alarm using the AWS CLI
<a name="load-balancer-alarm-cli"></a>

Use these steps to use the AWS CLI to create a load balancer latency alarm.

**To create a load balancer latency alarm**

1. Set up an SNS topic. For more information, see [Setting up Amazon SNS notifications](Notify_Users_Alarm_Changes.md#US_SetupSNS).

1. Create the alarm using the [put-metric-alarm](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/put-metric-alarm.html) command as follows:

   ```
   aws cloudwatch put-metric-alarm --alarm-name lb-mon --alarm-description "Alarm when Latency exceeds 100 ms" --metric-name Latency --namespace AWS/ELB --statistic Average --period 60 --threshold 0.1 --comparison-operator GreaterThanThreshold --dimensions Name=LoadBalancerName,Value=my-server --evaluation-periods 3 --alarm-actions arn:aws:sns:us-east-1:111122223333:my-topic --unit Seconds
   ```

1. Test the alarm by forcing an alarm state change using the [set-alarm-state](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/set-alarm-state.html) command.

   1. Change the alarm state from `INSUFFICIENT_DATA` to `OK`.

      ```
      aws cloudwatch set-alarm-state --alarm-name lb-mon --state-reason "initializing" --state-value OK
      ```

   1. Change the alarm state from `OK` to `ALARM`.

      ```
      aws cloudwatch set-alarm-state --alarm-name lb-mon --state-reason "initializing" --state-value ALARM
      ```

   1. Check that you have received an email notification about the alarm.

# Create a storage throughput alarm that sends email
<a name="US_AlarmAtThresholdEBS"></a>

You can set up an SNS notification and configure an alarm that is triggered when Amazon EBS exceeds 100 MB throughput.

## Setting up a storage throughput alarm using the AWS Management Console
<a name="storage-alarm-console"></a>

Use these steps to use the AWS Management Console to create an alarm based on Amazon EBS throughput.

**To create a storage throughput alarm**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, **All Alarms**.

1. Choose **Create alarm**.

1. Under **EBS Metrics**, choose a metric category.

1. Select the row with the volume and the **VolumeWriteBytes** metric.

1. For the statistic, choose **Average**. For the period, choose **5 Minutes**. Choose **Next**.

1. Under **Alarm Threshold**, enter a unique name for the alarm (for example, **myHighWriteAlarm**) and a description of the alarm (for example, **VolumeWriteBytes exceeds 100,000 KiB/s**). The name must contain only UTF-8 characters, and can't contain ASCII control characters. The description can include markdown formatting, which is displayed only in the alarm **Details** tab in the CloudWatch console. The markdown can be useful to add links to runbooks or other internal resources.

1. Under **Whenever**, for **is**, choose **>** and enter **100000**. For **for**, enter **15** consecutive periods.

   A graphical representation of the threshold is shown under **Alarm Preview**.

1. Under **Additional settings**, for **Treat missing data as**, choose **ignore (maintain alarm state)** so that missing data points don't trigger alarm state changes.

1. Under **Actions**, for **Whenever this alarm**, choose **State is ALARM**. For **Send notification to**, choose an existing SNS topic or create one.

   To create an SNS topic, choose **New list**. For **Send notification to**, enter a name for the SNS topic (for example, **myHighWriteAlarm**), and for **Email list**, enter a comma-separated list of email addresses to be notified when the alarm changes to the `ALARM` state. Each email address is sent a topic subscription confirmation email. You must confirm the subscription before notifications can be sent to an email address.

1. Choose **Create Alarm**.

## Setting up a storage throughput alarm using the AWS CLI
<a name="storage-alarm-cli"></a>

Use these steps to use the AWS CLI to create an alarm based on Amazon EBS throughput.

**To create a storage throughput alarm**

1. Create an SNS topic. For more information, see [Setting up Amazon SNS notifications](Notify_Users_Alarm_Changes.md#US_SetupSNS).

1. Create the alarm.

   ```
   aws cloudwatch put-metric-alarm --alarm-name ebs-mon --alarm-description "Alarm when EBS volume exceeds 100MB throughput" --metric-name VolumeReadBytes --namespace AWS/EBS --statistic Average --period 300 --threshold 100000000 --comparison-operator GreaterThanThreshold --dimensions Name=VolumeId,Value=my-volume-id --evaluation-periods 3 --alarm-actions arn:aws:sns:us-east-1:111122223333:my-alarm-topic --insufficient-data-actions arn:aws:sns:us-east-1:111122223333:my-insufficient-data-topic
   ```

1. Test the alarm by forcing an alarm state change using the [set-alarm-state](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/set-alarm-state.html) command.

   1. Change the alarm state from `INSUFFICIENT_DATA` to `OK`.

      ```
      aws cloudwatch set-alarm-state --alarm-name ebs-mon --state-reason "initializing" --state-value OK
      ```

   1. Change the alarm state from `OK` to `ALARM`.

      ```
      aws cloudwatch set-alarm-state --alarm-name ebs-mon --state-reason "initializing" --state-value ALARM
      ```

   1. Change the alarm state from `ALARM` to `INSUFFICIENT_DATA`.

      ```
      aws cloudwatch set-alarm-state --alarm-name ebs-mon --state-reason "initializing" --state-value INSUFFICIENT_DATA
      ```

   1. Check that you have received an email notification about the alarm.
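
As a side note on the CLI example's threshold, 100000000 is 100 decimal megabytes expressed in bytes, matching the "100 MB throughput" target. The arithmetic, with a hypothetical helper name:

```python
# Sketch: converting a megabyte figure into the byte threshold used with
# VolumeReadBytes in the put-metric-alarm example above (decimal MB).

def mb_to_bytes(mb):
    return mb * 1000 * 1000

print(mb_to_bytes(100))  # 100000000, the threshold used in the example
```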

# Create an alarm on Performance Insights counter metrics from an AWS database
<a name="CloudWatch_alarm_database_performance_insights"></a>

CloudWatch includes a **DB_PERF_INSIGHTS** metric math function which you can use to bring Performance Insights counter metrics into CloudWatch from Amazon Relational Database Service and Amazon DocumentDB (with MongoDB compatibility). **DB_PERF_INSIGHTS** also brings in the `DBLoad` metric at sub-minute intervals. You can set CloudWatch alarms on these metrics.

For more information about Amazon RDS Performance Insights, see [ Monitoring DB load with Performance Insights on Amazon RDS](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.html).

For more information about Amazon DocumentDB Performance Insights, see [Monitoring with Performance Insights](https://docs.aws.amazon.com/documentdb/latest/developerguide/performance-insights.html).

Anomaly detection is not supported for alarms based on the **DB_PERF_INSIGHTS** function.

**Note**  
High-resolution metrics with sub-minute granularity retrieved by **DB_PERF_INSIGHTS** are only applicable to the **DBLoad** metric, or to operating system metrics if you have enabled Enhanced Monitoring at a higher resolution. For more information about Amazon RDS Enhanced Monitoring, see [Monitoring OS metrics with Enhanced Monitoring](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Monitoring.OS.html).  
You can create a high-resolution alarm using the **DB_PERF_INSIGHTS** function. The maximum evaluation range for a high-resolution alarm is three hours. You can use the CloudWatch console to graph metrics retrieved with the **DB_PERF_INSIGHTS** function for any time range.
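
Assembling the expression string entered in the procedure below can be sketched as follows. The resource ID is the placeholder from the example; substitute your database's unique resource ID, not its identifier:

```python
# Sketch: building the DB_PERF_INSIGHTS metric math expression string.
# The helper name and resource ID below are illustrative placeholders.

def db_perf_insights_expr(service, resource_id, metric):
    return f"DB_PERF_INSIGHTS('{service}', '{resource_id}', '{metric}')"

print(db_perf_insights_expr(
    "RDS", "db-ABCDEFGHIJKLMNOPQRSTUVWXY1", "os.cpuUtilization.user.avg"))
```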

**To create an alarm that's based on Performance Insights metrics**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Alarms**, and then choose **All alarms**.

1. Choose **Create alarm**.

1. Choose **Select Metric**.

1. Choose the **Add math** dropdown, and then select **All functions**, **DB_PERF_INSIGHTS** from the list.

   After you choose **DB_PERF_INSIGHTS**, a math expression box appears where you apply or edit math expressions.

1. In the math expression box, enter your **DB_PERF_INSIGHTS** math expression, and then choose **Apply**.

   For example, **DB_PERF_INSIGHTS('RDS', 'db-ABCDEFGHIJKLMNOPQRSTUVWXY1', 'os.cpuUtilization.user.avg')**
**Important**  
When you use the **DB_PERF_INSIGHTS** math expression, you must specify the unique database resource ID of the database. This is different from the database identifier. To find the database resource ID in the Amazon RDS console, choose the DB instance to see its details. Then choose the **Configuration** tab. The **Resource ID** is displayed in the **Configuration** section.

   For information about the **DB_PERF_INSIGHTS** function and other functions that are available for metric math, see [Metric math syntax and functions](using-metric-math.md#metric-math-syntax).

1. Choose **Select metric**.

   The **Specify metric and conditions** page appears, showing a graph and other information about the math expression that you have selected.

1. For **Whenever *expression* is**, specify whether the expression must be greater than, less than, or equal to the threshold. Under **than...**, specify the threshold value.

1. Choose **Additional configuration**. For **Datapoints to alarm**, specify how many evaluation periods (data points) must be in the `ALARM` state to trigger the alarm. If the two values here match, you create an alarm that goes to `ALARM` state if that many consecutive periods are breaching.

   To create an M out of N alarm, specify a lower number for the first value than you specify for the second value. For more information, see [Alarm evaluation](alarm-evaluation.md).

1. For **Missing data treatment**, choose how to have the alarm behave when some data points are missing. For more information, see [Configuring how CloudWatch alarms treat missing data](alarms-and-missing-data.md).

1. Choose **Next**.

1. Under **Notification**, select an SNS topic to notify when the alarm is in `ALARM` state, `OK` state, or `INSUFFICIENT_DATA` state.

   To have the alarm send multiple notifications for the same alarm state or for different alarm states, choose **Add notification**.

   To have the alarm not send notifications, choose **Remove**.

1. To have the alarm perform Auto Scaling, EC2, Lambda, or Systems Manager actions, choose the appropriate button and choose the alarm state and action to perform. If you choose a Lambda function as an alarm action, you specify the function name or ARN, and you can optionally choose a specific version of the function.

   Alarms can perform Systems Manager actions only when they go into ALARM state. For more information about Systems Manager actions, see [ Configuring CloudWatch to create OpsItems from alarms](https://docs.aws.amazon.com/systems-manager/latest/userguide/OpsCenter-create-OpsItems-from-CloudWatch-Alarms.html) and [ Incident creation](https://docs.aws.amazon.com/incident-manager/latest/userguide/incident-creation.html).
**Note**  
To create an alarm that performs an SSM Incident Manager action, you must have certain permissions. For more information, see [ Identity-based policy examples for AWS Systems Manager Incident Manager](https://docs.aws.amazon.com/incident-manager/latest/userguide/security_iam_id-based-policy-examples.html).

1. When finished, choose **Next**.

1. Enter a name and description for the alarm. Then choose **Next**.

   The name must contain only UTF-8 characters, and can't contain ASCII control characters. The description can include markdown formatting, which is displayed only in the alarm **Details** tab in the CloudWatch console. The markdown can be useful to add links to runbooks or other internal resources.

1. Under **Preview and create**, confirm that the information and conditions are what you want, then choose **Create alarm**.