Circuit breaker pattern - AWS Prescriptive Guidance

Intent

The circuit breaker pattern can prevent a caller service from retrying a call to another service (callee) when the call has previously caused repeated timeouts or failures. The pattern is also used to detect when the callee service is functional again.

Motivation

When multiple microservices collaborate to handle requests, one or more services might become unavailable or exhibit high latency. When complex applications use microservices, an outage in one microservice can lead to application failure. Microservices communicate through remote procedure calls, and transient errors could occur in network connectivity, causing failures. (The transient errors can be handled by using the retry with backoff pattern.) During synchronous execution, the cascading of timeouts or failures can cause a poor user experience.

However, in some situations, the failures could take longer to resolve, for example, when the callee service is down or database contention results in timeouts. In such cases, if the calling service retries the calls repeatedly, these retries might result in network contention and database thread pool consumption. Additionally, if multiple users retry the application repeatedly, the problem gets worse and performance can degrade across the entire application.

The circuit breaker pattern was popularized by Michael Nygard in his book, Release It (Nygard 2018). This design pattern can prevent a caller service from retrying a service call that has previously caused repeated timeouts or failures. It can also detect when the callee service is functional again.

Circuit breaker objects work like electrical circuit breakers that automatically interrupt the current when there is an abnormality in the circuit. Electrical circuit breakers shut off, or trip, the flow of the current when there is a fault. Similarly, the circuit breaker object is situated between the caller and the callee service, and trips if the callee is unavailable.

The fallacies of distributed computing are a set of assertions made by Peter Deutsch and others at Sun Microsystems. They describe the false assumptions that programmers who are new to distributed applications invariably make. Assuming that the network is reliable, that latency is zero, and that bandwidth is unlimited results in software applications written with minimal error handling for network errors.

During a network outage, applications might indefinitely wait for a reply and continually consume application resources. Failure to retry the operations when the network becomes available can also lead to application degradation. If API calls to a database or an external service time out because of network issues, repeated calls with no circuit breaker can affect cost and performance.

Applicability

Use this pattern when:

  • The caller service makes a call that is likely to fail.

  • High latency in the callee service (for example, when database connections are slow) causes calls to the callee service to time out.

  • The caller service makes a synchronous call, but the callee service isn't available or exhibits high latency.

Issues and considerations

  • Service agnostic implementation: To prevent code bloat, we recommend that you implement the circuit breaker object in a microservice-agnostic and API-driven way.

  • Circuit closure by callee: When the callee recovers from the performance issue or failure, it can update the circuit status to CLOSED. This is an extension of the circuit breaker pattern and can be implemented if your recovery time objective (RTO) requires it.

  • Multithreaded calls: The expiration timeout value is defined as the period of time the circuit remains tripped before calls are routed again to check for service availability. When the callee service is called in multiple threads, the first call that failed defines the expiration timeout value. Your implementation should ensure that subsequent calls do not move the expiration timeout endlessly.

  • Force open or close the circuit: System administrators should have the ability to open or close a circuit. This can be done by updating the expiration timeout value in the database table.

  • Observability: The application should have logging set up to identify the calls that fail when the circuit breaker is open.
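The multithreaded calls and force open or close considerations above can be sketched as follows. This is a minimal Python illustration, not part of the sample solution: the CircuitState class, its method names, and the 30-second default are hypothetical, and an in-process lock stands in for whatever conditional write your data store provides.

```python
import threading
import time


class CircuitState:
    """Hypothetical shared record of when an open circuit expires."""

    def __init__(self, open_seconds=30.0, clock=time.monotonic):
        self._open_seconds = open_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._expires_at = 0.0  # 0.0 means the circuit is closed

    def is_open(self):
        with self._lock:
            return self._clock() < self._expires_at

    def record_failure(self):
        """Trip the circuit; return True only for the call that tripped it."""
        with self._lock:
            now = self._clock()
            if now < self._expires_at:
                # Already open: keep the expiry set by the first failed call.
                return False
            self._expires_at = now + self._open_seconds
            return True

    def force_open(self, seconds):
        """Administrative override: open the circuit for the given period."""
        with self._lock:
            self._expires_at = self._clock() + seconds

    def force_close(self):
        """Administrative override: close the circuit immediately."""
        with self._lock:
            self._expires_at = 0.0
```

Because record_failure leaves an existing expiry untouched, later failures from other threads cannot keep pushing the expiration timeout forward. The two force methods change the circuit only by updating the expiry value, which mirrors the database-table approach described above.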

Implementation

High-level architecture

In the following example, the caller is the order service and the callee is the payment service.

When there are no failures, the order service routes all calls to the payment service through the circuit breaker, as the following diagram shows.

Circuit breaker pattern with no failures.

If the payment service times out, the circuit breaker can detect the timeout and track the failure.

Circuit breaker with payment service failure.

If the timeouts exceed a specified threshold, the application opens the circuit. When the circuit is open, the circuit breaker object doesn't route the calls to the payment service. It returns an immediate failure when the order service calls the payment service.

Circuit breaker stops routing to payment service.

The circuit breaker object periodically tries to see if the calls to the payment service are successful.

Circuit breaker periodically retries payment service.

When the call to payment service succeeds, the circuit is closed, and all further calls are routed to the payment service again.

Circuit breaker with working payment service.
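The sequence above (track failures, open the circuit at a threshold, fail fast while it is open, then let a trial call through and close the circuit on success) can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the sample solution: the CircuitBreaker and CircuitOpenError names, the threshold of three failures, and the 30-second open period are all hypothetical, and the class is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, call, failure_threshold=3, open_seconds=30.0,
                 clock=time.monotonic):
        self._call = call
        self._failure_threshold = failure_threshold
        self._open_seconds = open_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def invoke(self, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._open_seconds:
                # Circuit is open: return an immediate failure.
                raise CircuitOpenError("circuit is open; failing fast")
            # Open period has elapsed: let this call through as a trial.
        try:
            result = self._call(*args, **kwargs)
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In the order and payment example, the order service would wrap its payment call once, for instance CircuitBreaker(call_payment_service) with a hypothetical call_payment_service function, and route every payment request through invoke.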

Implementation using AWS services

The sample solution uses express workflows in AWS Step Functions to implement the circuit breaker pattern. The Step Functions state machine lets you configure the retry capabilities and decision-based control flow required for the pattern implementation.

The solution also uses an Amazon DynamoDB table as the data store to track the circuit status. This can be replaced with an in-memory datastore such as Amazon ElastiCache (Redis OSS) for better performance.

When a service wants to call another service, it starts the workflow with the name of the callee service. The workflow gets the circuit breaker status from the DynamoDB CircuitStatus table, which stores the currently degraded services. If CircuitStatus contains an unexpired record for the callee, the circuit is open. The Step Functions workflow returns an immediate failure and exits with a FAIL state.

If the CircuitStatus table doesn't contain a record for the callee or contains an expired record, the service is operational. The ExecuteLambda step in the state machine definition calls the Lambda function that's sent through a parameter value. If the call succeeds, the Step Functions workflow exits with a SUCCESS state.

Circuit breaker implementation with AWS Step Functions and DynamoDB.

If the service call fails or a timeout occurs, the application retries with exponential backoff for a defined number of times. If the service call still fails after the retries, the workflow inserts a record for the service in the CircuitStatus table with an ExpiryTimeStamp, and the workflow exits with a FAIL state. Subsequent calls to the same service return an immediate failure as long as the circuit breaker is open. The Get Circuit Status step in the state machine definition checks the service availability based on the ExpiryTimeStamp value. Expired items are deleted from the CircuitStatus table by using the DynamoDB time to live (TTL) feature.
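The workflow logic described above can be approximated outside Step Functions as follows. This Python sketch is illustrative only: a plain dictionary stands in for the CircuitStatus table, and the function name call_through_circuit, the three attempts, the 0.5-second base delay, and the 60-second open period are assumptions rather than values from the sample solution.

```python
import time

# Stand-in for the CircuitStatus table: service name -> ExpiryTimeStamp.
circuit_status = {}


def call_through_circuit(service_name, func, *, max_attempts=3, base_delay=0.5,
                         open_seconds=60.0, table=circuit_status,
                         now=time.time, sleep=time.sleep):
    # Get Circuit Status: an unexpired record means the circuit is open.
    expiry = table.get(service_name)
    if expiry is not None and expiry > now():
        return {"status": "FAIL", "reason": "circuit open"}

    # Call the service, retrying with exponential backoff on any exception.
    for attempt in range(max_attempts):
        try:
            return {"status": "SUCCESS", "result": func()}
        except Exception:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5 s, then 1 s, ...

    # All attempts failed: mark the service as degraded until the expiry time.
    table[service_name] = now() + open_seconds
    return {"status": "FAIL", "reason": "call failed"}
```

The sketch never deletes expired records. It only ignores them, which is the role that the TTL feature plays for the real table.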

Sample code

The following code uses the GetCircuitStatus Lambda function to check the circuit breaker status.

    var serviceDetails = _dbContext.QueryAsync<CircuitBreaker>(
        serviceName,
        QueryOperator.GreaterThan,
        new List<object> { currentTimeStamp }).GetRemainingAsync();

    if (serviceDetails.Result.Count > 0)
    {
        functionData.CircuitStatus = serviceDetails.Result[0].CircuitStatus;
    }
    else
    {
        functionData.CircuitStatus = "";
    }

The following code shows the Amazon States Language statements in the Step Functions workflow.

    "Is Circuit Closed": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.CircuitStatus",
          "StringEquals": "OPEN",
          "Next": "Circuit Open"
        },
        {
          "Variable": "$.CircuitStatus",
          "StringEquals": "",
          "Next": "Execute Lambda"
        }
      ]
    },
    "Circuit Open": {
      "Type": "Fail"
    }

GitHub repository

For a complete implementation of the sample architecture for this pattern, see the GitHub repository at https://github.com/aws-samples/circuit-breaker-netcore-blog.
