Recommended CloudWatch alarms for Amazon OpenSearch Service
CloudWatch alarms perform an action when a CloudWatch metric exceeds a specified value for a set amount of time. For example, you might want AWS to email you if your cluster health status is red for longer than one minute. This section lists recommended alarms for Amazon OpenSearch Service and explains how to respond to them.
You can automatically deploy these alarms using AWS CloudFormation. For a sample stack, see the related GitHub repository.
Note
If you deploy the CloudFormation stack, the KMSKeyError and KMSKeyInaccessible alarms will exist in an Insufficient Data state because these metrics appear only if a domain encounters a problem with its encryption key.
For more information about configuring alarms, see Creating Amazon CloudWatch Alarms in the Amazon CloudWatch User Guide.
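Each alarm in the table below is written as "metric, statistic, comparison, threshold, period, number of consecutive periods". As a minimal sketch, that notation can be mapped onto the parameters of the CloudWatch PutMetricAlarm API. The helper name `build_alarm_params`, the domain name, and the account ID are hypothetical placeholders; the sketch assumes the `AWS/ES` namespace with the `DomainName` and `ClientId` dimensions under which OpenSearch Service publishes its metrics.

```python
def build_alarm_params(domain_name, account_id, metric_name, statistic,
                       comparison, threshold, period_seconds, evaluation_periods):
    """Return keyword arguments shaped for the CloudWatch PutMetricAlarm API.

    This only builds a dictionary; it does not call AWS.
    """
    return {
        "AlarmName": f"{domain_name}-{metric_name}",  # naming scheme is an assumption
        "Namespace": "AWS/ES",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "Statistic": statistic,
        "ComparisonOperator": comparison,
        "Threshold": float(threshold),
        "Period": period_seconds,                  # "for 1 minute" -> 60
        "EvaluationPeriods": evaluation_periods,   # "1 consecutive time" -> 1
    }


# "ClusterStatus.red maximum is >= 1 for 1 minute, 1 consecutive time"
red_status = build_alarm_params(
    domain_name="my-domain",        # placeholder
    account_id="123456789012",      # placeholder
    metric_name="ClusterStatus.red",
    statistic="Maximum",
    comparison="GreaterThanOrEqualToThreshold",
    threshold=1,
    period_seconds=60,
    evaluation_periods=1,
)

# "ClusterStatus.yellow maximum is >= 1 for 1 minute, 5 consecutive times"
yellow_status = build_alarm_params(
    "my-domain", "123456789012", "ClusterStatus.yellow", "Maximum",
    "GreaterThanOrEqualToThreshold", 1, 60, 5,
)
```

If you use boto3, a dictionary like this could be passed as `boto3.client("cloudwatch").put_metric_alarm(**red_status)`, typically together with an `AlarmActions` list that names the notification target.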
Alarm | Issue |
---|---|
ClusterStatus.red maximum is >= 1 for 1 minute, 1 consecutive time | At least one primary shard and its replicas are not allocated to a node. See Red cluster status. |
ClusterStatus.yellow maximum is >= 1 for 1 minute, 5 consecutive times | At least one replica shard is not allocated to a node. See Yellow cluster status. |
FreeStorageSpace minimum is <= 20480 for 1 minute, 1 consecutive time | A node in your cluster is down to 20 GiB of free storage space. See Lack of available storage space. This metric is reported in MiB, so 20480 corresponds to 20 GiB. Rather than using a fixed value, we recommend setting the threshold to 25% of the storage space for each node. |
ClusterIndexWritesBlocked is >= 1 for 5 minutes, 1 consecutive time | Your cluster is blocking write requests. See ClusterBlockException. |
Nodes minimum is < x for 1 day, 1 consecutive time | x is the number of nodes in your cluster. This alarm indicates that at least one node in your cluster has been unreachable for one day. See Failed cluster nodes. |
AutomatedSnapshotFailure maximum is >= 1 for 1 minute, 1 consecutive time | An automated snapshot failed. This failure is often the result of a red cluster health status. See Red cluster status. For a summary of all automated snapshots and some information about failures, try one of the following requests: |
CPUUtilization or WarmCPUUtilization maximum is >= 80% for 15 minutes, 3 consecutive times | 100% CPU utilization might occur sometimes, but sustained high usage is problematic. Consider using larger instance types or adding instances. |
JVMMemoryPressure maximum is >= 95% for 1 minute, 3 consecutive times | The cluster could encounter out of memory errors if usage increases. Consider scaling vertically. OpenSearch Service uses half of an instance's RAM for the Java heap, up to a heap size of 32 GiB. You can scale instances vertically up to 64 GiB of RAM, at which point you can scale horizontally by adding instances. |
OldGenJVMMemoryPressure maximum is >= 80% for 1 minute, 3 consecutive times | See the guidance for JVMMemoryPressure above. |
ManagerCPUUtilization maximum is >= 50% for 15 minutes, 3 consecutive times | Consider using larger instance types for your dedicated manager nodes. Because of their role in cluster stability and blue/green deployments, dedicated manager nodes should have lower CPU usage than data nodes. |
ManagerJVMMemoryPressure maximum is >= 95% for 1 minute, 3 consecutive times | See the guidance for ManagerCPUUtilization above. |
ManagerOldGenJVMMemoryPressure maximum is >= 80% for 1 minute, 3 consecutive times | See the guidance for ManagerCPUUtilization above. |
KMSKeyError is >= 1 for 1 minute, 1 consecutive time | The AWS KMS encryption key that is used to encrypt data at rest in your domain is disabled. Re-enable it to restore normal operations. For more information, see Encryption of data at rest for Amazon OpenSearch Service. |
KMSKeyInaccessible is >= 1 for 1 minute, 1 consecutive time | The AWS KMS encryption key that is used to encrypt data at rest in your domain has been deleted or has revoked its grants to OpenSearch Service. You can't recover domains that are in this state. However, if you have a manual snapshot, you can use it to migrate to a new domain. To learn more, see Encryption of data at rest for Amazon OpenSearch Service. |
shards.active is >= 30000 for 1 minute, 1 consecutive time | The total number of active primary and replica shards is greater than 30,000. You might be rotating your indexes too frequently. Consider using ISM to remove indexes once they reach a specific age. |
5xx alarms >= 10% of OpenSearchRequests | One or more data nodes might be overloaded, or requests are failing to complete within the idle timeout period. Consider switching to larger instance types or adding more nodes to the cluster. Confirm that you're following best practices for shard and cluster architecture. |
ManagerReachableFromNode maximum is < 1 for 5 minutes, 1 consecutive time | This alarm indicates that the manager node stopped or is unreachable. These failures are usually the result of a network connectivity issue or an AWS dependency problem. |
ThreadpoolWriteQueue average is >= 100 for 1 minute, 1 consecutive time | The cluster is experiencing high indexing concurrency. Review and control indexing requests, or increase cluster resources. |
ThreadpoolSearchQueue average is >= 500 for 1 minute, 1 consecutive time | The cluster is experiencing high search concurrency. Consider scaling your cluster. You can also increase the search queue size, but increasing it excessively can cause out of memory errors. |
ThreadpoolSearchQueue maximum is >= 5000 for 1 minute, 1 consecutive time | See the guidance for ThreadpoolSearchQueue average above. |
Increase in ThreadpoolSearchRejected SUM is >= 1 (math expression DIFF()) for 1 minute, 1 consecutive time | These alarms notify you of domain issues that might impact performance and stability. |
Increase in ThreadpoolWriteRejected SUM is >= 1 (math expression DIFF()) for 1 minute, 1 consecutive time | See the guidance for ThreadpoolSearchRejected above. |
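Two kinds of rows in the table above rely on CloudWatch metric math rather than a single metric threshold: the thread pool rejection alarms use the DIFF() function, and the 5xx alarm compares one metric to a percentage of another. The following is a minimal sketch of how the `Metrics` list of a PutMetricAlarm request might be assembled for each case. The helper names, the domain name, and the account ID are hypothetical, and the one-minute period and single evaluation period on the 5xx alarm are assumptions because the table does not state them.

```python
def metric_query(query_id, metric_name, stat, domain_name, account_id, period=60):
    """One raw-metric entry for the Metrics list of PutMetricAlarm."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ES",
                "MetricName": metric_name,
                "Dimensions": [
                    {"Name": "DomainName", "Value": domain_name},
                    {"Name": "ClientId", "Value": account_id},
                ],
            },
            "Period": period,
            "Stat": stat,
        },
        "ReturnData": False,  # input to the expression, not alarmed on directly
    }


def expression_query(query_id, expression, label):
    """The math expression that the alarm actually evaluates."""
    return {"Id": query_id, "Expression": expression,
            "Label": label, "ReturnData": True}


DOMAIN, ACCOUNT = "my-domain", "123456789012"  # placeholders

# "Increase in ThreadpoolSearchRejected SUM is >= 1 (DIFF) for 1 minute, 1 time"
search_rejected_alarm = {
    "AlarmName": f"{DOMAIN}-ThreadpoolSearchRejected-increase",
    "Metrics": [
        metric_query("m1", "ThreadpoolSearchRejected", "Sum", DOMAIN, ACCOUNT),
        expression_query("e1", "DIFF(m1)", "Increase in search rejections"),
    ],
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "Threshold": 1.0,
    "EvaluationPeriods": 1,
}

# "5xx alarms >= 10% of OpenSearchRequests"
error_rate_alarm = {
    "AlarmName": f"{DOMAIN}-5xx-rate",
    "Metrics": [
        metric_query("m1", "5xx", "Sum", DOMAIN, ACCOUNT),
        metric_query("m2", "OpenSearchRequests", "Sum", DOMAIN, ACCOUNT),
        expression_query("e1", "(m1 / m2) * 100", "5xx percentage"),
    ],
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "Threshold": 10.0,
    "EvaluationPeriods": 1,  # assumption: not specified in the table
}


def diff(values):
    """Local illustration of DIFF: each value minus the one before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]
```

For example, `diff([7, 7, 9])` gives `[0, 2]`, so an alarm on `DIFF(m1) >= 1` would fire only in the period where the rejection count actually rose.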
Note
If you just want to view metrics, see Monitoring OpenSearch cluster metrics with Amazon CloudWatch.
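The FreeStorageSpace row in the table above recommends a threshold of 25% of each node's storage, expressed in MiB. A small sketch of that arithmetic follows; the function name and the 512 GiB example volume size are hypothetical.

```python
MIB_PER_GIB = 1024


def free_storage_threshold_mib(node_storage_gib, fraction=0.25):
    """Alarm threshold in MiB for a node with the given storage in GiB."""
    return node_storage_gib * MIB_PER_GIB * fraction


# The fixed value from the table: 20 GiB expressed in MiB.
fixed_threshold = 20 * MIB_PER_GIB  # 20480

# Hypothetical node with a 512 GiB volume: alarm at 25% free space.
recommended_threshold = free_storage_threshold_mib(512)  # 131072.0
```

On large volumes the percentage-based threshold leaves considerably more headroom than the fixed 20 GiB value. On volumes smaller than 80 GiB it is the lower of the two.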
Other alarms you might consider
Consider configuring the following alarms depending on which OpenSearch Service features you regularly use.
Alarm | Issue |
---|---|
WarmFreeStorageSpace is >= 10% | You have reached 10% of your total free warm storage. WarmFreeStorageSpace measures the sum of your free warm storage space in MiB. UltraWarm uses Amazon S3 rather than attached disks. |
HotToWarmMigrationQueueSize is >= 20 for 1 minute, 3 consecutive times | A high number of indexes are concurrently moving from hot to UltraWarm storage. Consider scaling your cluster. |
HotToWarmMigrationSuccessLatency is >= 1 day, 1 consecutive time | Configure this alarm so that you're notified if the |
WarmJVMMemoryPressure maximum is >= 95% for 1 minute, 3 consecutive times | The cluster could encounter out of memory errors if usage increases. Consider scaling vertically. OpenSearch Service uses half of an instance's RAM for the Java heap, up to a heap size of 32 GiB. You can scale instances vertically up to 64 GiB of RAM, at which point you can scale horizontally by adding instances. |
WarmOldGenJVMMemoryPressure maximum is >= 80% for 1 minute, 3 consecutive times | See the guidance for WarmJVMMemoryPressure above. |
WarmToColdMigrationQueueSize is >= 20 for 1 minute, 3 consecutive times | A high number of indexes are concurrently moving from UltraWarm to cold storage. Consider scaling your cluster. |
HotToWarmMigrationFailureCount is >= 1 for 1 minute, 1 consecutive time | Migrations might fail during snapshots, shard relocations, or force merges. Failures during snapshots or shard relocation are typically due to node failures or S3 connectivity issues. Lack of disk space is usually the underlying cause of force merge failures. |
WarmToColdMigrationFailureCount is >= 1 for 1 minute, 1 consecutive time | Migrations usually fail when attempts to migrate index metadata to cold storage fail. Failures can also happen when the warm index cluster state is being removed. |
WarmToColdMigrationLatency is >= 1 day, 1 consecutive time | Configure this alarm so that you're notified if the |
AlertingDegraded is >= 1 for 1 minute, 1 consecutive time | Either the alerting index is red, or one or more nodes are not on schedule. |
ADPluginUnhealthy is >= 1 for 1 minute, 1 consecutive time | The anomaly detection plugin isn't functioning properly, either because of high failure rates or because one of the indexes being used is red. |
AsynchronousSearchFailureRate is >= 1 for 1 minute, 1 consecutive time | At least one asynchronous search failed in the last minute, which likely means the coordinator node failed. The lifecycle of an asynchronous search request is managed solely on the coordinator node, so if the coordinator goes down, the request fails. |
AsynchronousSearchStoreHealth is >= 1 for 1 minute, 1 consecutive time | The health of the asynchronous search response store in the persisted index is red. You might be storing large asynchronous responses, which can destabilize a cluster. Try to limit your asynchronous search responses to 10 MB or less. |
SQLUnhealthy is >= 1 for 1 minute, 3 consecutive times | The SQL plugin is returning 5xx response codes or passing invalid query DSL to OpenSearch. Troubleshoot the requests that your clients are making to the plugin. |
LTRStatus.red is >= 1 for 1 minute, 1 consecutive time | At least one of the indexes needed to run the Learning to Rank plugin has missing primary shards and isn't functional. |
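The JVM memory pressure rows in both tables state that OpenSearch Service uses half of an instance's RAM for the Java heap, up to 32 GiB, and that vertical scaling stops helping at 64 GiB of RAM. The following sketch expresses that rule as arithmetic; the function names are hypothetical, and the instance sizes in the usage lines are illustrative only.

```python
MAX_HEAP_GIB = 32.0        # heap cap stated in the tables
VERTICAL_LIMIT_GIB = 64    # RAM at which the heap cap is reached


def heap_size_gib(instance_ram_gib):
    """Java heap size: half of instance RAM, capped at 32 GiB."""
    return min(instance_ram_gib / 2, MAX_HEAP_GIB)


def scaling_direction(instance_ram_gib):
    """Suggest how to relieve JVM memory pressure for a given instance size."""
    if instance_ram_gib < VERTICAL_LIMIT_GIB:
        return "vertical"    # a larger instance still yields a larger heap
    return "horizontal"      # heap is already at the cap, so add instances


small_heap = heap_size_gib(16)     # 8.0
capped_heap = heap_size_gib(128)   # 32.0
```

With these rules, `scaling_direction(32)` returns `"vertical"` and `scaling_direction(64)` returns `"horizontal"`, matching the guidance in the tables.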