Alerting fundamentals
This documentation topic is designed for Grafana workspaces that support Grafana version 8.x.
For Grafana workspaces that support Grafana version 10.x, see Working in Grafana version 10.
For Grafana workspaces that support Grafana version 9.x, see Working in Grafana version 9.
This section provides information about the fundamental concepts of Grafana alerting.
Alerting concepts
The following table describes the key concepts in Grafana alerting.
Key concept or feature | Definition |
---|---|
Data sources for Alerting | Select the data sources from which you want to query and visualize metrics, logs, and traces. |
Scheduler | Evaluates your alert rules; the component that periodically runs queries against data sources. Applies only to Grafana-managed rules. |
Alertmanager | Manages the routing and grouping of alert instances. |
Alert rule | A set of evaluation criteria that determines when an alert rule should fire. An alert rule consists of one or more queries and expressions, a condition, the frequency of evaluation, and the duration over which the condition must be met. An alert rule can produce multiple alert instances. |
Alert instance | An instance of an alert rule. A single-dimensional alert rule produces one alert instance; a multidimensional alert rule produces one or more. A single alert rule that matches multiple results, such as CPU usage across 10 VMs, counts as multiple (in this case 10) alert instances. This number can vary over time; for example, an alert rule that monitors CPU usage for all VMs in a system gains alert instances as VMs are added. For more information about alert-instance quotas, see Quota reached errors. |
Alert group | By default, the Alertmanager groups alert instances using the labels of the root notification policy. This controls de-duplication and the grouping of alert instances that are sent to contact points. |
Contact point | Defines how your contacts are notified when an alert rule fires. |
Message templating | Create reusable custom templates and use them in contact points. |
Notification policy | A set of rules for where, when, and how alerts are grouped and routed to contact points. |
Labels and label matchers | Labels uniquely identify alert rules. They link alert rules to notification policies and silences, determining which policy handles an alert and which alert rules are silenced. |
Silences | Stop notifications from one or more alert instances, using label matchers to select which instances to silence. The difference between a silence and a mute timing is that a silence lasts for a specified window of time, whereas a mute timing recurs on a schedule. |
Mute timings | Specify a time interval during which you don't want new notifications to be generated or sent, such as a recurring maintenance period. A mute timing must be linked to an existing notification policy. |
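To make the relationship between labels, matchers, and silences concrete, the following is a minimal Python sketch, not Grafana code; the label sets and equality-matcher behavior are illustrative assumptions, showing how matchers select which alert instances a silence or notification policy applies to:

```python
# Illustrative sketch only: Grafana's Alertmanager implements this internally.
# An alert instance is identified by its set of labels.
def matches(labels: dict, matchers: dict) -> bool:
    """Return True if every equality matcher agrees with the label set."""
    return all(labels.get(name) == value for name, value in matchers.items())

instances = [
    {"alertname": "HighCPU", "host": "web1", "severity": "critical"},
    {"alertname": "HighCPU", "host": "web2", "severity": "warning"},
    {"alertname": "DiskFull", "host": "web1", "severity": "critical"},
]

# A silence with the matcher host=web1 suppresses both web1 instances.
silence = {"host": "web1"}
silenced = [i for i in instances if matches(i, silence)]
print([i["alertname"] for i in silenced])  # ['HighCPU', 'DiskFull']
```

A notification policy selects instances the same way; the difference is what happens to the matched instances (routing to a contact point rather than suppression).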
Alert data sources
Grafana-managed alerts can query the following backend data sources that have alerting enabled:

- Data sources built in to Grafana, or developed and maintained by Grafana: Alertmanager, Graphite, Prometheus (including Amazon Managed Service for Prometheus), Loki, InfluxDB, Amazon OpenSearch Service, Google Cloud Monitoring, Amazon CloudWatch, Azure Monitor, MySQL, PostgreSQL, MSSQL, OpenTSDB, and Oracle.
Alerting on numeric data
Numeric data that is not in a time series format can be alerted on directly, or passed into Server Side Expressions. This allows more of the processing to happen within the data source, which can improve efficiency and simplify alert rules. When alerting on numeric data instead of time series data, there is no need to reduce each labeled time series into a single number; instead, labeled numbers are returned to Grafana directly.
Tabular data
This feature is supported with backend data sources that query tabular data, including SQL data sources, such as MySQL, Postgres, MSSQL, and Oracle.
A query with Grafana-managed alerts or Server Side Expressions is considered numeric with these data sources if:

- The `Format AS` option is set to `Table` in the data source query.
- The table response returned to Grafana from the query includes only one numeric column (for example, int, double, or float), and optionally additional string columns.

If there are string columns, those columns become labels. The name of the column becomes the label name, and the value of each row becomes the value of the corresponding label. If multiple rows are returned, each row should be uniquely identified by its labels.
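As a sketch of how such a table response is interpreted (illustrative Python, not Grafana internals), each string column becomes a label and the single numeric column becomes the value:

```python
# Illustrative only: maps a table with string columns and one numeric
# column to labeled numbers, as described above.
columns = ["Host", "Disk", "PercentFree"]
rows = [
    ("web1", "/etc", 3),
    ("web2", "/var", 4),
]

labeled_values = []
for row in rows:
    labels = {}
    value = None
    for name, cell in zip(columns, row):
        if isinstance(cell, str):
            labels[name] = cell   # string column -> label
        else:
            value = cell          # the single numeric column -> value
    labeled_values.append((labels, value))

print(labeled_values[0])  # ({'Host': 'web1', 'Disk': '/etc'}, 3)
```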
Example
If you have a MySQL table called DiskSpace, as follows:

Time | Host | Disk | PercentFree |
---|---|---|---|
2021-June-7 | web1 | /etc | 3 |
2021-June-7 | web2 | /var | 4 |
2021-June-7 | web3 | /var | 8 |
… | … | … | … |
You can query the data, filtering on time, without returning the time series to Grafana. For example, an alert that initiates per Host and Disk when there is less than 5% free space could look like the following:

```sql
SELECT
  Host,
  Disk,
  CASE WHEN PercentFree < 5.0 THEN PercentFree ELSE 0 END AS PercentFree
FROM (
  SELECT Host, Disk, AVG(PercentFree) AS PercentFree
  FROM DiskSpace
  WHERE __timeFilter(Time)
  GROUP BY Host, Disk
) AS avg_disk
```
This query returns the following table response to Grafana:

Host | Disk | PercentFree |
---|---|---|
web1 | /etc | 3 |
web2 | /var | 4 |
web3 | /var | 0 |
When this query is used as the condition in an alert rule, the cases where the value is non-zero are alerting. As a result, three alert instances are produced, as shown in the following table:

Labels | Status |
---|---|
{Host=web1,Disk=/etc} | Alerting |
{Host=web2,Disk=/var} | Alerting |
{Host=web3,Disk=/var} | Normal |
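The instance states above can be derived mechanically: the condition fires wherever the returned value is non-zero. A small illustrative sketch (again, not Grafana code) of that mapping:

```python
# Illustrative: derive a per-instance state from the labeled values that
# the query above returns (non-zero value -> Alerting).
results = [
    ({"Host": "web1", "Disk": "/etc"}, 3),
    ({"Host": "web2", "Disk": "/var"}, 4),
    ({"Host": "web3", "Disk": "/var"}, 0),
]

states = [(labels, "Alerting" if value != 0 else "Normal")
          for labels, value in results]
for labels, state in states:
    print(labels, state)
```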
Alertmanager
Grafana includes built-in support for Prometheus Alertmanager. The Alertmanager helps group and manage alert rules, adding a layer of orchestration on top of the alerting engines. By default, notifications for Grafana-managed alerts are handled by the embedded Alertmanager that is part of core Grafana. You can configure the Alertmanager's contact points, notification policies, and templates from the Grafana alerting UI by selecting the Grafana option from the Alertmanager dropdown.
Grafana alerting supports external Alertmanager configuration (for more information on using an Alertmanager as an external data source, see Connect to an Alertmanager data source). When you add an external Alertmanager, the Alertmanager dropdown shows a list of available external Alertmanager data sources. Select a data source to create and manage alerting for standalone Cortex or Loki data sources.
State and health of alerting rules
The state and health of alerting rules help you understand several key status indicators about your alerts. There are three key components: alert state, alerting rule state, and alerting rule health. Although related, each component conveys slightly different information.
Alerting rule state
- Normal – None of the time series returned by the evaluation engine is in a `Pending` or `Firing` state.
- Pending – At least one of the time series returned by the evaluation engine is `Pending`.
- Firing – At least one of the time series returned by the evaluation engine is `Firing`.
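The rule state can be read as an aggregation over the states of its instances. A hedged sketch of that precedence (Firing over Pending over Normal), assuming only the three states listed above:

```python
# Illustrative: the alerting rule state summarizes its instance states.
def rule_state(instance_states: list) -> str:
    if any(s == "Firing" for s in instance_states):
        return "Firing"
    if any(s == "Pending" for s in instance_states):
        return "Pending"
    return "Normal"

print(rule_state(["Normal", "Pending", "Normal"]))  # Pending
print(rule_state(["Normal", "Firing", "Pending"]))  # Firing
```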
Alert state
- Normal – The condition of the alerting rule is false for every time series returned by the evaluation engine.
- Alerting – The condition of the alerting rule is true for at least one time series returned by the evaluation engine, and the duration for which the condition must be true before an alert is initiated, if set, has been met or exceeded.
- Pending – The condition of the alerting rule is true for at least one time series returned by the evaluation engine, but the duration for which the condition must be true before an alert is initiated, if set, has not been met.
- NoData – The alerting rule has not returned a time series, all values for the time series are null, or all values for the time series are zero.
- Error – An error occurred when attempting to evaluate the alerting rule.
Alerting rule health
- Ok – No error occurred when evaluating the alerting rule.
- Error – An error occurred when evaluating the alerting rule.
- NoData – Data was absent in at least one time series returned during the rule evaluation.