Best practices for monitoring event delivery in Amazon EventBridge

To ensure that the business logic in your event-driven applications executes reliably, it is essential to monitor your event delivery behavior. EventBridge provides metrics that enable you to monitor, detect, and mitigate issues early to ensure reliable event delivery. These metrics include:

  • Counter-based metrics, such as InvocationAttempts, SuccessfulInvocationAttempts, RetryInvocationAttempts, and FailedInvocations, that let you observe target throttling and calculate error rates.

  • Latency-based metrics, such as IngestionToInvocationSuccessLatency, to provide insights into event delivery and delays.

These metrics allow you to monitor the health of your event-driven architectures, and to understand and mitigate event delivery issues caused by underperforming, undersized, or unresponsive targets. For example, a permanently under-scaled or throttled target can lead to excessive retries, delays in event delivery, and permanent delivery failures.

We recommend you combine multiple metrics to get a holistic overview, and closely monitor them. Setting up appropriate alarms and dashboards enables you to address persistent issues early.

For information on specific metrics, see EventBridge metrics.

Detecting event delivery failures

EventBridge includes metrics you can configure to report target invocations (that is, event delivery attempts) per rule.

We recommend you monitor the following metrics at the rule level:

  • InvocationAttempts to observe the total number of times EventBridge attempts to invoke the target, including event delivery retries.

  • SuccessfulInvocationAttempts for the number of invocation attempts where EventBridge successfully delivered the event to the target.

  • RetryInvocationAttempts for the number of invocation attempts that are event delivery retries.

    An increase in RetryInvocationAttempts may be an early indication of an undersized target.

In addition, since increased retry attempts can be an early sign of delivery issues, we also recommend creating a single metric that tracks the ratio of successful target invocations to all target invocations. For example, in CloudWatch you can use metric math to create such a metric, called SuccessfulInvocationRate, using the following formula:

SuccessfulInvocationRate = SuccessfulInvocationAttempts / InvocationAttempts

Then, depending on your requirements, you can configure CloudWatch Alarms to create notifications when a certain threshold is hit.

Although an occasional decrease of SuccessfulInvocationRate due to temporary traffic spikes or invocation errors can be considered normal, a constant mismatch is an indication of a misconfigured target and needs to be addressed as part of the shared responsibility model.

For more information on metric math, see Using math expressions with CloudWatch metrics in the Amazon CloudWatch User Guide.
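
As an illustration, the following minimal sketch builds the request parameters for a CloudWatch alarm on SuccessfulInvocationRate by using metric math. It is not a definitive configuration. The function name, alarm name, default threshold, period, and evaluation settings are hypothetical, and the use of the RuleName dimension is an assumption; check EventBridge metrics for the dimensions that apply to each metric in your account.

```python
def build_success_rate_alarm(rule_name, threshold=0.95,
                             period_seconds=300, evaluation_periods=3):
    """Build keyword arguments for the CloudWatch PutMetricAlarm API.

    The alarm fires when SuccessfulInvocationAttempts / InvocationAttempts
    stays below `threshold`. All default values are illustrative assumptions.
    """

    def metric_query(query_id, metric_name):
        # Input metric for the math expression. It is not returned itself.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Events",
                    "MetricName": metric_name,
                    # Assumption: the metric is reported per rule.
                    "Dimensions": [{"Name": "RuleName", "Value": rule_name}],
                },
                "Period": period_seconds,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{rule_name}-successful-invocation-rate-low",
        "ComparisonOperator": "LessThanThreshold",
        "Threshold": threshold,
        "EvaluationPeriods": evaluation_periods,
        # Periods with no invocation attempts produce no data point.
        "TreatMissingData": "notBreaching",
        "Metrics": [
            metric_query("successful", "SuccessfulInvocationAttempts"),
            metric_query("attempts", "InvocationAttempts"),
            {
                "Id": "rate",
                "Expression": "successful / attempts",
                "Label": "SuccessfulInvocationRate",
                "ReturnData": True,  # the alarm evaluates this expression
            },
        ],
    }


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   params = build_success_rate_alarm("my-example-rule")
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Requiring several consecutive evaluation periods below the threshold helps the alarm ignore the occasional short dips that the preceding guidance describes as normal.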

By default, EventBridge retries delivering an event for 24 hours and up to 185 times. After EventBridge exhausts these retry attempts, it either drops the event or sends it to a dead-letter queue if one has been specified. For more information, see Retrying event delivery. To avoid losing events that fail to be delivered, we recommend you configure a dead-letter queue for each rule target. For more information, see Using dead-letter queues.
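
The following sketch shows one way to express a retry policy and a dead-letter queue as a target entry for the EventBridge PutTargets API. The function name, target ID, and ARNs are hypothetical placeholders. The default values mirror the retry behavior described above (185 attempts and 24 hours).

```python
DEFAULT_MAX_RETRY_ATTEMPTS = 185
DEFAULT_MAX_EVENT_AGE_SECONDS = 24 * 60 * 60  # 24 hours


def build_target(target_id, target_arn, dlq_arn,
                 max_retry_attempts=DEFAULT_MAX_RETRY_ATTEMPTS,
                 max_event_age_seconds=DEFAULT_MAX_EVENT_AGE_SECONDS):
    """Build one entry for the Targets list of the PutTargets API."""
    # Ranges accepted by the RetryPolicy structure of PutTargets.
    if not 0 <= max_retry_attempts <= 185:
        raise ValueError("max_retry_attempts must be between 0 and 185")
    if not 60 <= max_event_age_seconds <= 86400:
        raise ValueError("max_event_age_seconds must be between 60 and 86400")

    return {
        "Id": target_id,
        "Arn": target_arn,
        "RetryPolicy": {
            "MaximumRetryAttempts": max_retry_attempts,
            "MaximumEventAgeInSeconds": max_event_age_seconds,
        },
        # Events that exhaust the retry policy are sent to this SQS queue.
        "DeadLetterConfig": {"Arn": dlq_arn},
    }


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   target = build_target(
#       "example-target",
#       "arn:aws:lambda:us-east-1:123456789012:function:example-function",
#       "arn:aws:sqs:us-east-1:123456789012:example-dlq",
#   )
#   boto3.client("events").put_targets(Rule="my-example-rule", Targets=[target])
```

The dead-letter queue must also have a resource policy that allows EventBridge to send messages to it.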

Events that EventBridge fails to deliver to the specified target are reported in the FailedInvocations metric and, if you have configured a dead-letter queue for the target, in the InvocationsSentToDlq metric. If your application reports a large number of FailedInvocations or InvocationsSentToDlq, we recommend you investigate whether the target is properly scaled and able to handle the traffic it receives.

Detecting event delivery delays

EventBridge also provides a metric, IngestionToInvocationSuccessLatency, that lets you observe the end-to-end latency: the time from event ingestion to successful delivery to the target. This metric surfaces the effects of retries and delayed delivery, for example due to timeouts and slow responses from targets. IngestionToInvocationSuccessLatency includes the time the target takes to successfully respond to event delivery. This allows you to monitor the end-to-end latency between EventBridge and your target, and to detect performance variations and degradations of targets, even when there are no target throttling or errors.
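
As a sketch of how delivery delays might be alarmed on, the following function builds PutMetricAlarm parameters for the 99th percentile of IngestionToInvocationSuccessLatency. The function name, alarm name, threshold, and evaluation settings are hypothetical. The RuleName dimension and the assumption that the metric is reported in milliseconds should be verified against EventBridge metrics before use.

```python
def build_latency_alarm(rule_name, threshold_ms=5000.0,
                        period_seconds=300, evaluation_periods=3):
    """Build keyword arguments for the CloudWatch PutMetricAlarm API.

    The alarm fires when p99 end-to-end delivery latency stays above
    `threshold_ms`. All default values are illustrative assumptions.
    """
    return {
        "AlarmName": f"{rule_name}-delivery-latency-high",
        "Namespace": "AWS/Events",
        "MetricName": "IngestionToInvocationSuccessLatency",
        # Assumption: the metric is reported per rule.
        "Dimensions": [{"Name": "RuleName", "Value": rule_name}],
        # A percentile reflects slow deliveries better than an average does.
        "ExtendedStatistic": "p99",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold_ms,  # assumed to be in milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   params = build_latency_alarm("my-example-rule", threshold_ms=2000.0)
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```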