

# Using rules to modify or monitor metrics as they are received
<a name="AMP-Ruler"></a>

You can set up rules to act upon metrics as they are received by Amazon Managed Service for Prometheus. These rules can monitor the metrics or create new, computed metrics based on the metrics received.

Amazon Managed Service for Prometheus supports two types of *rules* that it evaluates at regular intervals:
+ *Recording rules* allow you to precompute frequently needed or computationally expensive expressions and save their results as a new set of time series. Querying the precomputed result is often much faster than running the original expression every time it is needed.
+ *Alerting rules* allow you to define alert conditions based on PromQL and a threshold. When a rule's condition crosses the threshold, a notification is sent to [alert manager](AMP-alert-manager.md), which can be configured to manage the alerts or forward them downstream to receivers such as Amazon Simple Notification Service.

To use rules in Amazon Managed Service for Prometheus, you create one or more YAML rules files that define the rules. An Amazon Managed Service for Prometheus rules file has the same format as a rules file in standalone Prometheus. For more information, see [Defining Recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) and [Alerting rules](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) in the Prometheus documentation.

You can have multiple rules files in a workspace. Each separate rules file is contained within a separate *namespace*. Having multiple rules files lets you import existing Prometheus rules files to a workspace without having to change or combine them. Different rule group namespaces can also have different tags.

**Rule sequencing**

Within a rules file, rules are contained within *rule groups*. Rules within a single rule group are always evaluated in order from top to bottom. Therefore, the result of one recording rule can be used in the computation of a later recording rule or in an alerting rule in the same rule group. However, because you can't specify the order in which separate rule groups or rules files are evaluated, you can't use the results from one recording rule to compute a rule in a different rule group or a different rules file.
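
As a minimal sketch of this ordering, the following hypothetical rule group chains two recording rules and an alerting rule. The group name, the recorded series names, the `http_requests_total` metric with its `code` label, and the threshold are illustrative assumptions, not values taken from a real workspace.

```yaml
groups:
  - name: request_rates            # hypothetical group name
    rules:
      # Evaluated first: per-job request rate over the last 5 minutes.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Evaluated second: reads the series recorded just above, which is
      # safe because both rules are in the same rule group.
      - record: job:http_errors:ratio5m
        expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m])) / job:http_requests:rate5m
      # Evaluated last: alerts on the second recorded series.
      - alert: HighErrorRatio
        expr: job:http_errors:ratio5m > 0.05
        for: 10m
```

If `HighErrorRatio` were moved to a different rule group or rules file, there would be no guarantee that `job:http_errors:ratio5m` had been recorded before the alert expression was evaluated.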

**Topics**
+ [Understanding IAM permissions needed for using rules](AMP-ruler-IAM-permissions.md)
+ [Create a rules file](AMP-ruler-rulesfile.md)
+ [Upload a rules configuration file to Amazon Managed Service for Prometheus](AMP-rules-upload.md)
+ [Edit or replace a rules configuration file](AMP-rules-edit.md)
+ [Troubleshoot rule evaluations](troubleshoot-rule-evaluations.md)
+ [Troubleshooting Ruler](Troubleshooting-rule-fail-error.md)

# Understanding IAM permissions needed for using rules
<a name="AMP-ruler-IAM-permissions"></a>

You must give users permissions to use rules in Amazon Managed Service for Prometheus. Create an AWS Identity and Access Management (IAM) policy with the following permissions, and assign the policy to your users, groups, or roles.

**Note**  
For more information about IAM, see [Identity and Access Management for Amazon Managed Service for Prometheus](security-iam.md).

**Policy to give access to use rules**

The following policy gives access to use rules for all resources in your account.

------
#### [ JSON ]

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "aps:CreateRuleGroupsNamespace",
                "aps:ListRuleGroupsNamespaces",
                "aps:DescribeRuleGroupsNamespace",
                "aps:PutRuleGroupsNamespace",
                "aps:DeleteRuleGroupsNamespace"
            ],
            "Resource": "*"
        }
    ]
}
```

------

**Policy to give access to only one namespace**

You can also create a policy that gives access to only specific namespaces. The following sample policy gives access only to the `RuleGroupNamespace` specified. To use this policy, replace *<account>*, *<region>*, *<workspace-id>*, and *<namespace-name>* with appropriate values for your account.

# Create a rules file
<a name="AMP-ruler-rulesfile"></a>

To use rules in Amazon Managed Service for Prometheus, you create a rules file that defines the rules. An Amazon Managed Service for Prometheus rules file is a YAML text file that has the same format as a rules file in standalone Prometheus. For more information, see [Defining Recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/) and [Alerting rules](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) in the *Prometheus* documentation.

The following is a basic example of a rules file:

```
groups:
  - name: cpu_metrics
    interval: 60s
    rules:
      - record: avg_cpu_usage
        expr: avg(rate(node_cpu_seconds_total[5m])) by (instance)
      - alert: HighAverageCPU
        expr: avg_cpu_usage > 0.8
        for: 10m
        keep_firing_for: 20m
        labels:
          severity: critical
        annotations:
          summary: "Average CPU usage across cluster is too high"
```

This example creates a rule group named `cpu_metrics` that is evaluated every 60 seconds. The rule group uses a recording rule to create a new metric, `avg_cpu_usage`, and then uses that metric in an alert. The following list describes some of the properties used. For more information about alerting rules and other properties you can include, see [Alerting rules](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) in the *Prometheus* documentation.
+ `interval: 60s` – The rule group is evaluated every 60 seconds. If the `interval` property is not specified, the default evaluation interval of rule groups is also 60 seconds.
+ `record: avg_cpu_usage` – This recording rule creates a new metric called `avg_cpu_usage`.
+ `expr: avg(rate(node_cpu_seconds_total[5m])) by (instance)` – This expression for the recording rule calculates the average rate of CPU usage over the last 5 minutes for each node, grouping by the `instance` label.
+ `alert: HighAverageCPU` – This alerting rule creates a new alert called `HighAverageCPU`.
+ `expr: avg_cpu_usage > 0.8` – This expression tells the alert to look for samples where the average CPU usage goes over 80%.
+ `for: 10m` – The alert will only fire if the average CPU usage exceeds 80% for at least 10 minutes.

  In this case, the metric is calculated as an average over the last 5 minutes, and the rule group is evaluated every 60 seconds. So the alert only fires if that 5-minute average is above 80% at every evaluation for 10 consecutive minutes.
+ `keep_firing_for: 20m` – This alert will continue to fire until the samples are below the threshold for at least 20 minutes. This can be useful to avoid the alert going up and down repeatedly in succession.

**Note**  
You can create a rules definition file locally and then upload it to Amazon Managed Service for Prometheus, or you can create, edit and upload the definition directly within the Amazon Managed Service for Prometheus console. Either way, the same formatting rules apply. To learn more about uploading and editing your file, see [Upload a rules configuration file to Amazon Managed Service for Prometheus](AMP-rules-upload.md).

# Upload a rules configuration file to Amazon Managed Service for Prometheus
<a name="AMP-rules-upload"></a>

Once you know what rules you want in your rules configuration file, you can either create and edit it within the console, or you can upload a file with the console or AWS CLI.

**Note**  
If you are running an Amazon EKS cluster, you can also upload a rule configuration file using [AWS Controllers for Kubernetes](integrating-ack.md).

**To use the Amazon Managed Service for Prometheus console to upload your rules configuration and create the namespace**

1. Open the Amazon Managed Service for Prometheus console at [https://console.aws.amazon.com/prometheus/](https://console.aws.amazon.com/prometheus/home).

1. In the upper left corner of the page, choose the menu icon, and then choose **All workspaces**.

1. Choose the workspace ID of the workspace, and then choose the **Rules management** tab.

1. Choose **Add namespace**.

1. Choose **Choose file**, and select the rules definition file.

   Alternatively, you can create and edit a rules definition file directly in the Amazon Managed Service for Prometheus console by selecting **Define configuration**. This creates a sample default definition file that you can edit before uploading.

1. (Optional) To add tags to the namespace, choose **Add new tag**.

   Then, for **Key**, enter a name for the tag. You can add an optional value for the tag in **Value**. 

   To add another tag, choose **Add new tag**.

1. Choose **Continue**. Amazon Managed Service for Prometheus creates a new namespace with the same name as the rules file that you selected.

**To use the AWS CLI to upload a rules configuration to a workspace in a new namespace**

1. Base64 encode the contents of your rules file. On Linux, you can use the following command:

   ```
   base64 input-file > output-file
   ```

   On macOS, you can use the following command:

   ```
   openssl base64 -in input-file -out output-file
   ```

1. Enter one of the following commands to create the namespace and upload the file.

   On AWS CLI version 2, enter:

   ```
   aws amp create-rule-groups-namespace --data file://path_to_base_64_output_file --name namespace-name  --workspace-id my-workspace-id --region region
   ```

   On AWS CLI version 1, enter:

   ```
   aws amp create-rule-groups-namespace --data fileb://path_to_base_64_output_file --name namespace-name  --workspace-id my-workspace-id --region region
   ```

1. It takes a few seconds for your rules file to become active. To check the status, enter the following command:

   ```
   aws amp describe-rule-groups-namespace --workspace-id workspace_id --name namespace-name --region region
   ```

   If the `status` is `ACTIVE`, your rules file has taken effect.

# Edit or replace a rules configuration file
<a name="AMP-rules-edit"></a>

If you want to change the rules in a rule file that you have already uploaded to Amazon Managed Service for Prometheus, you can either upload a new rules file to replace the existing configuration, or you can edit the current configuration directly in the console. Optionally, you can download the current file, edit it in a text editor, then upload the new version.

**To use the Amazon Managed Service for Prometheus console to edit your rules configuration**

1. Open the Amazon Managed Service for Prometheus console at [https://console.aws.amazon.com/prometheus/](https://console.aws.amazon.com/prometheus/home).

1. In the upper left corner of the page, choose the menu icon, and then choose **All workspaces**.

1. Choose the workspace ID of the workspace, and then choose the **Rules management** tab.

1. Select the name of the rules configuration file that you want to edit.

1. (Optional) If you want to download the current rules configuration file, choose **Download** or **Copy**.

1. Choose **Modify** to edit the configuration directly within the console. Choose **Save** when complete.

   Alternatively, you can choose **Replace configuration** to upload a new configuration file. If so, select the new rules definition file, and choose **Continue** to upload it.

**To use the AWS CLI to edit a rules configuration file**

1. Base64 encode the contents of your rules file. On Linux, you can use the following command:

   ```
   base64 input-file > output-file
   ```

   On macOS, you can use the following command:

   ```
   openssl base64 -in input-file -out output-file
   ```

1. Enter one of the following commands to upload the new file.

   On AWS CLI version 2, enter:

   ```
   aws amp put-rule-groups-namespace --data file://path_to_base_64_output_file --name namespace-name  --workspace-id my-workspace-id --region region
   ```

   On AWS CLI version 1, enter:

   ```
   aws amp put-rule-groups-namespace --data fileb://path_to_base_64_output_file --name namespace-name  --workspace-id my-workspace-id --region region
   ```

1. It takes a few seconds for your rules file to become active. To check the status, enter the following command:

   ```
   aws amp describe-rule-groups-namespace --workspace-id workspace_id --name namespace-name --region region
   ```

   If the `status` is `ACTIVE`, your rules file has taken effect. Until then, the previous version of this rules file is still active.

# Troubleshoot rule evaluations
<a name="troubleshoot-rule-evaluations"></a>

This guide provides step-by-step troubleshooting procedures for common issues with rule evaluations in Amazon Managed Service for Prometheus (AMP). Follow these procedures to diagnose and resolve problems with your alerting and recording rules.

**Topics**
+ [Validate alert firing status](#troubleshoot-rule-validate-firing-status)
+ [Resolve missing alert notifications](#troubleshoot-rule-missing-alert-notifications)
+ [Check rule health status](#troubleshoot-rule-check-health-status)
+ [Use offset in queries to handle ingestion delays](#troubleshoot-rule-offset-queries)
+ [Common issues and solutions](#troubleshoot-rule-common-issues)
+ [Best practices for rule evaluations](#troubleshoot-rule-best-practices)

## Validate alert firing status
<a name="troubleshoot-rule-validate-firing-status"></a>

When troubleshooting rule evaluation issues, first verify whether your alert has fired by querying the synthetic time series `ALERTS`. The `ALERTS` time series includes the following labels:
+ **alertname** – The name of the alert.
+ **alertstate** – Either **pending** or **firing**.
  + **pending** – The alert is waiting for the duration specified in the `for` clause.
  + **firing** – The alert has met the conditions for the specified duration.
+ Any additional labels that are defined in your alerting rule.

**Note**  
While an alert is **firing** or **pending**, the sample value is **1**. When your alert is idle, no samples are produced.

## Resolve missing alert notifications
<a name="troubleshoot-rule-missing-alert-notifications"></a>

If alerts are firing but notifications are not arriving, verify the following Alertmanager settings:

1. **Verify your Alertmanager configuration** – Check that routes, receivers, and settings are correctly configured. Review route block settings, including wait times, time intervals, and required labels, which can affect when and whether notifications are sent. Compare alerting rules with their corresponding routes and receivers to confirm proper matching. For routes with `time_interval`, verify that timestamps fall within the specified intervals.

1. **Check alert receiver permissions** – When using an Amazon SNS topic, verify AMP has the required permissions to publish notifications. For more information, see [Giving Amazon Managed Service for Prometheus permission to send alert messages to your Amazon SNS topic](AMP-alertmanager-receiver-AMPpermission.md).

1. **Validate receiver payload compatibility** – Confirm your alert receiver accepts Alertmanager's payload format. For Amazon SNS requirements, see [Understanding Amazon SNS message validation rules](AMP-alertmanager-receiver-validation-truncation.md).

1. **Review Alertmanager logs** – AMP provides vended logs from Alertmanager to help debug notification issues. For more information, see [Monitor Amazon Managed Service for Prometheus events with CloudWatch Logs](CW-logs.md).

For more information about Alertmanager, see [Managing and forwarding alerts in Amazon Managed Service for Prometheus with alert manager](AMP-alert-manager.md).
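
To make the matching check concrete, the following is a minimal sketch of an alert manager configuration in which a child route matches the `severity: critical` label set by an alerting rule. The receiver names, the placeholder topic ARNs and Region, and the timing values are chosen for illustration only. The sketch also assumes the `alertmanager_config` wrapper key, the `sns_configs` receiver format, and support for the `matchers` syntax; confirm these against the alert manager documentation linked above.

```yaml
alertmanager_config: |
  route:
    receiver: default              # fallback when no child route matches
    group_wait: 30s                # delay before the first notification for a new group
    group_interval: 5m             # delay before notifying about new alerts in a group
    repeat_interval: 4h            # delay before re-sending a still-firing alert
    routes:
      - receiver: critical_sns
        matchers:
          - severity = "critical"  # must equal the label set in the alerting rule
  receivers:
    - name: default
      sns_configs:
        - topic_arn: arn:aws:sns:<region>:<account>:<default-topic>
          sigv4:
            region: <region>
    - name: critical_sns
      sns_configs:
        - topic_arn: arn:aws:sns:<region>:<account>:<critical-topic>
          sigv4:
            region: <region>
```

In this sketch, an alert whose rule sets `severity: warning` falls through to the `default` receiver, and even a matching alert is not delivered until the `group_wait` period has passed. Both behaviors can look like a missing notification.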

## Check rule health status
<a name="troubleshoot-rule-check-health-status"></a>

Malformed rules can cause evaluation failures. Use the following methods to identify why a rule failed to evaluate:

**Use the ListRules API**

The [ListRules](AMP-APIReference-ListRules.md) API provides information about rule health. Check the `health` and `lastError` fields to diagnose issues.

**Example response:**

```
{
  "status": "success",
  "data": {
    "groups": [
      {
        "name": "my_rule_group",
        "file": "my_namespace",
        "rules": [
          {
            "state": "firing",
            "name": "broken_alerting_rule",
            "query": "...",
            "duration": 0,
            "keepFiringFor": 0,
            "labels": {},
            "annotations": {},
            "alerts": [],
            "health": "err",
            "lastError": "vector contains metrics with the same labelset after applying alert labels",
            "type": "alerting",
            "lastEvaluation": "1970-01-01T00:00:00.00000000Z",
            "evaluationTime": 0.08
          }
        ]
      }
    ]
  }
}
```

**Use vended logs**

The ListRules API only displays the most recent information. For a more detailed history, enable [vended logs](CW-logs.md) in your workspace to access:
+ Timestamps of evaluation failures
+ Detailed error messages
+ Historical evaluation data

**Example vended log message:**

```
{
  "workspaceId": "ws-a2c55905-e0b4-4065-a310-d83ce597a391",
  "message": {
    "log": "Evaluating rule failed, name=broken_alerting_rule, group=my_rule_group, namespace=my_namespace, err=vector contains metrics with the same labelset after applying alert labels",
    "level": "ERROR",
    "name": "broken_alerting_rule",
    "group": "my_rule_group",
    "namespace": "my_namespace"
  },
  "component": "ruler"
}
```

For more examples of logs from Ruler or Alertmanager, see [Troubleshooting Ruler](Troubleshooting-rule-fail-error.md) and [Managing and forwarding alerts in Amazon Managed Service for Prometheus with alert manager](AMP-alert-manager.md).

## Use offset in queries to handle ingestion delays
<a name="troubleshoot-rule-offset-queries"></a>

By default, expressions are evaluated with no offset (instant query), using values at the evaluation time. If metrics ingestion is delayed, recording rules might not represent the same values as when you manually evaluate the expression after all metrics are ingested.

**Tip**  
Using the offset modifier can reduce issues caused by ingestion delays. For more information, see [Offset modifier](https://prometheus.io/docs/prometheus/latest/querying/basics/#offset-modifier) in the *Prometheus documentation*.

### Example: Handling delayed metrics
<a name="example-delayed-metrics"></a>

If your rule evaluates at 12:00, but the latest sample for the metric is from 11:45 due to ingestion delay, the rule will find no samples at the 12:00 timestamp. To mitigate this, add an offset, such as: `my_metric_name offset 15m`.

### Example: Handle metrics from multiple sources
<a name="example-metrics-multiple-sources"></a>

When metrics originate from different sources, such as two servers, they might be ingested at different times. For example, consider an expression such as: `metric_from_server_A / metric_from_server_B`

If the rule evaluates between the ingestion times of server A and server B, you'll get unexpected results. Using an offset can help align the evaluation times.
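
The two examples in this section can be sketched as recording rules. The following is a minimal illustration that assumes an ingestion delay of up to 15 minutes. The group name, the recorded series names, and the metric names are hypothetical placeholders.

```yaml
groups:
  - name: delayed_metrics          # hypothetical group name
    rules:
      # Read the value from 15 minutes before the evaluation time, so that
      # samples arriving up to 15 minutes late are still picked up.
      - record: my_metric_name:delayed
        expr: my_metric_name offset 15m
      # Apply the same offset to both operands so that both sides are read
      # at the same point in time, after both servers have reported.
      - record: server_a_to_b:ratio
        expr: (metric_from_server_A offset 15m) / (metric_from_server_B offset 15m)
```

Recorded samples are stamped with the evaluation time, so a series recorded this way trails the source data by the offset. Choose the smallest offset that covers the ingestion delay you actually observe.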

## Common issues and solutions
<a name="troubleshoot-rule-common-issues"></a>

**Gaps in recording rule data**

If you notice gaps in your recording rule data compared to manual evaluation (when you directly execute the recording rule's original PromQL expression through the query API or UI), this might be due to one of the following:

1. **Long evaluation times** – A rule group cannot have multiple simultaneous evaluations. If evaluation time exceeds the configured interval, subsequent evaluations may be missed. Multiple consecutive missed evaluations exceeding the configured interval can cause the recording rule to become stale. For more information, see [Staleness](https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness) in the *Prometheus documentation*. You can monitor evaluation duration using the CloudWatch metric `RuleGroupLastEvaluationDuration` to identify rule groups that are taking too long to evaluate.

1. **Monitoring missed evaluations** – AMP provides the `RuleGroupIterationsMissed` CloudWatch metric to track when evaluations are skipped. The ListRules API displays the evaluation time and last evaluation time for each rule/group, which can help identify patterns of missed evaluations. For more information, see [ListRules](AMP-APIReference-ListRules.md).

**Recommendation: Split rules into separate groups**

To reduce evaluation durations, split rules into separate rule groups. Rules within a group execute sequentially, while rule groups can execute in parallel. Keep related rules that depend on each other in the same group. Generally, smaller rule groups ensure more consistent evaluations and fewer gaps.
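
As a rough sketch of this recommendation, the following rules file keeps a dependent recording rule and alert together and moves an unrelated rule into its own group. The group and series names are hypothetical, and the metrics are assumed to come from a standard node exporter.

```yaml
groups:
  # Independent rules go in their own groups so that the groups can be
  # evaluated in parallel.
  - name: cpu_rules
    rules:
      - record: instance:cpu_usage:rate5m
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
      # Depends on the recording rule above, so it stays in the same group.
      - alert: HighCPU
        expr: instance:cpu_usage:rate5m > 0.8
        for: 10m
  - name: memory_rules
    rules:
      - record: instance:memory_available:ratio
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
```

With this layout, a slow CPU expression delays only the `cpu_rules` group and does not cause missed evaluations of `memory_rules`.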

## Best practices for rule evaluations
<a name="troubleshoot-rule-best-practices"></a>

1. **Optimize rule group size** – Keep rule groups small to ensure consistent evaluations. Group related rules together, but avoid large rule groups.

1. **Set appropriate evaluation intervals** – Balance between timely alerts and system load. Review the stability patterns of your monitored metrics to understand their normal fluctuation ranges.

1. **Use offset modifiers for delayed metrics** – Add offsets to compensate for ingestion delays. Adjust offset duration based on observed ingestion patterns.

1. **Monitor evaluation performance** – Track the `RuleGroupIterationsMissed` metric. Review evaluation times in the ListRules API.

1. **Validate rule expressions** – Ensure expressions match exactly between rule definitions and manual queries. Test expressions with different time ranges to understand behavior.

1. **Review rule health regularly** – Check for errors in rule evaluations. Monitor vended logs for recurring issues.

By following these troubleshooting steps and best practices, you can identify and resolve common issues with rule evaluations in Amazon Managed Service for Prometheus.

# Troubleshooting Ruler
<a name="Troubleshooting-rule-fail-error"></a>

Using [Monitor Amazon Managed Service for Prometheus events with CloudWatch Logs](CW-logs.md), you can troubleshoot Alert Manager and Ruler-related issues. This section contains Ruler-related troubleshooting topics.

**When the log contains the following ruler failure error**

```
{
    "workspaceId": "ws-12345c67-89c0-4d12-345b-f14db70f7a99",
    "message": {
        "log": "Evaluating rule failed, name=failure, group=canary_long_running_vl_namespace, namespace=canary_long_running_vl_namespace, err=found duplicate series for the match group {dimension1=\\\"1\\\"} on the right hand-side of the operation: [{__name__=\\\"fake_metric2\\\", dimension1=\\\"1\\\", dimension2=\\\"b\\\"}, {__name__=\\\"fake_metric2\\\", dimension1=\\\"1\\\", dimension2=\\\"a\\\"}];many-to-many matching not allowed: matching labels must be unique on one side",
        "level": "ERROR",
        "name": "failure",
        "group": "canary_long_running_vl_namespace",
        "namespace": "canary_long_running_vl_namespace"
    },
    "component": "ruler"
}
```

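This means that an error occurred while evaluating the rule. In this example, the `err` field reports a many-to-many match: two `fake_metric2` series share the matching label `dimension1="1"` on the right-hand side of the operation, so the matching labels are not unique on either side.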

**Action to take**

Use the error message to troubleshoot the rule execution.
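
For the many-to-many error in the preceding log, one common fix is to aggregate the right-hand side so that each matching label value appears only once. The following sketch is hypothetical. The log doesn't show the original rule expression, so `fake_metric1`, the division operator, the `on (dimension1)` clause, and the choice of `sum` are assumptions made for illustration.

```yaml
groups:
  - name: canary_long_running_vl_namespace
    rules:
      # Assumed failing form: two fake_metric2 series share dimension1="1",
      # so matching on dimension1 alone is many-to-many.
      #   expr: fake_metric1 / on (dimension1) fake_metric2
      #
      # Possible fix: collapse dimension2 on the right-hand side first, so
      # each dimension1 value maps to exactly one series. This assumes that
      # fake_metric1 also has one series per dimension1 value.
      - record: failure
        expr: fake_metric1 / on (dimension1) sum by (dimension1) (fake_metric2)
```

Whether `sum` is the right aggregation depends on what the metric represents. If both label dimensions matter, adding `dimension2` to the `on` clause might be the better fix.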