Circuit breaker pattern
Intent
The circuit breaker pattern can prevent a caller service from retrying a call to another service (callee) when the call has previously caused repeated timeouts or failures. The pattern is also used to detect when the callee service is functional again.
Motivation
When multiple microservices collaborate to handle requests, one or more services might become unavailable or exhibit high latency. When complex applications use microservices, an outage in one microservice can lead to application failure. Microservices communicate through remote procedure calls, and transient errors could occur in network connectivity, causing failures. (The transient errors can be handled by using the retry with backoff pattern.) During synchronous execution, the cascading of timeouts or failures can cause a poor user experience.
However, some failures take longer to resolve, for example when the callee service is down or database contention results in timeouts. In such cases, if the calling service retries the calls repeatedly, the retries might result in network contention and database thread pool consumption. If multiple users are also retrying their requests to the application, the load increases further and can degrade performance across the entire application.
The circuit breaker pattern was popularized by Michael Nygard in his book, Release It! (Nygard 2018). This design pattern can prevent a caller service from retrying a service call that has previously caused repeated timeouts or failures. It can also detect when the callee service is functional again.
Circuit breaker objects work like electrical circuit breakers that automatically interrupt the current when there is an abnormality in the circuit. Electrical circuit breakers shut off, or trip, the flow of the current when there is a fault. Similarly, the circuit breaker object is situated between the caller and the callee service, and trips if the callee is unavailable.
The fallacies of distributed computing
During a network outage, applications might indefinitely wait for a reply and continually consume application resources. Failure to retry the operations when the network becomes available can also lead to application degradation. If API calls to a database or an external service time out because of network issues, repeated calls with no circuit breaker can affect cost and performance.
Applicability
Use this pattern when:
- The caller service makes a call that is most likely going to fail.
- High latency in the callee service (for example, when database connections are slow) causes calls to the callee service to time out.
- The caller service makes a synchronous call, but the callee service isn't available or exhibits high latency.
Issues and considerations
- Service agnostic implementation: To prevent code bloat, we recommend that you implement the circuit breaker object in a microservice-agnostic and API-driven way.
- Circuit closure by callee: When the callee recovers from the performance issue or failure, it can update the circuit status to CLOSED. This is an extension of the circuit breaker pattern and can be implemented if your recovery time objective (RTO) requires it.
- Multithreaded calls: The expiration timeout value is defined as the period of time the circuit remains tripped before calls are routed again to check for service availability. When the callee service is called in multiple threads, the first call that failed defines the expiration timeout value. Your implementation should ensure that subsequent failed calls do not keep extending the expiration timeout.
- Force open or close the circuit: System administrators should have the ability to open or close a circuit. This can be done by updating the expiration timeout value in the database table.
- Observability: The application should have logging set up to identify the calls that fail when the circuit breaker is open.
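The multithreading, force open or close, and observability considerations above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the sample solution's code: the CircuitStatusStore class, its method names, and the in-memory dictionary that stands in for the database table are all hypothetical.

```python
import logging
import threading

logger = logging.getLogger("circuit_breaker")


class CircuitStatusStore:
    """Hypothetical in-memory stand-in for the circuit status table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._expiry = {}  # service name -> expiry timestamp (epoch seconds)

    def trip(self, service, now, open_seconds):
        """Open the circuit unless it is already open; return the effective expiry."""
        with self._lock:
            current = self._expiry.get(service)
            if current is not None and current > now:
                # Already open: keep the expiry set by the first failed call.
                return current
            self._expiry[service] = now + open_seconds
            return self._expiry[service]

    def is_open(self, service, now):
        """Report whether an unexpired record exists, logging rejected calls."""
        with self._lock:
            expiry = self._expiry.get(service)
        is_open = expiry is not None and expiry > now
        if is_open:
            logger.warning("circuit open for %s until %s; call rejected", service, expiry)
        return is_open

    def force_open(self, service, until):
        """Administrative override: keep the circuit open until the given time."""
        with self._lock:
            self._expiry[service] = until

    def force_close(self, service):
        """Administrative override: remove the record so calls are routed again."""
        with self._lock:
            self._expiry.pop(service, None)
```

Because trip returns the existing expiry when the circuit is already open, concurrent failures from other threads cannot push the expiration timeout further out.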
Implementation
High-level architecture
In the following example, the caller is the order service and the callee is the payment service.
When there are no failures, the order service routes all calls to the payment service through the circuit breaker, as the following diagram shows.
If the payment service times out, the circuit breaker can detect the timeout and track the failure.
If the timeouts exceed a specified threshold, the application opens the circuit. When the circuit is open, the circuit breaker object doesn't route the calls to the payment service. It returns an immediate failure when the order service calls the payment service.
The circuit breaker object periodically lets a call through to check whether calls to the payment service succeed again.
When a call to the payment service succeeds, the circuit is closed, and all further calls are routed to the payment service again.
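The lifecycle described in this section (closed, tripped after repeated failures, then probed after a waiting period) can be sketched as a small in-process Python class. This is an illustrative sketch only: the CircuitBreaker and CircuitOpenError names, the failure_threshold and open_seconds parameters, and the injectable clock are assumptions made for the example and do not come from the sample solution.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.open_seconds:
                # Still inside the open period: fail immediately.
                raise CircuitOpenError("circuit is open; failing fast")
            # Open period elapsed: let this call through as a probe.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Trip the circuit, or re-trip it after a failed probe.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In the order and payment example, the order service would wrap each payment call in breaker.call(...) and treat CircuitOpenError as an immediate failure.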
Implementation using AWS services
The sample solution uses express workflows in AWS Step Functions to implement the circuit breaker pattern.
The solution also uses an Amazon DynamoDB table to store the circuit status.
When a service wants to call another service, it starts the workflow with the name of the callee service. The workflow gets the circuit breaker status from the DynamoDB CircuitStatus table, which stores the currently degraded services. If CircuitStatus contains an unexpired record for the callee, the circuit is open. The Step Functions workflow returns an immediate failure and exits with a FAIL state.
If the CircuitStatus table doesn't contain a record for the callee or contains an expired record, the service is operational. The ExecuteLambda step in the state machine definition calls the Lambda function that's sent through a parameter value. If the call succeeds, the Step Functions workflow exits with a SUCCESS state.
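The decision the workflow makes at this point can be sketched in plain Python. This is a hedged illustration rather than the Step Functions definition: the table is modeled as a dictionary keyed by service name, and get_circuit_status and run_workflow are hypothetical helper names. The sketch covers only the status check and the successful call path, not failure handling.

```python
def get_circuit_status(table, service_name, now):
    """Return "OPEN" if the table holds an unexpired record for the service, else ""."""
    record = table.get(service_name)
    if record is not None and record["ExpiryTimeStamp"] > now:
        return "OPEN"
    return ""


def run_workflow(table, service_name, target, now):
    """Fail fast when the circuit is open; otherwise call the target function."""
    if get_circuit_status(table, service_name, now) == "OPEN":
        return {"state": "FAIL", "reason": "circuit open"}
    return {"state": "SUCCESS", "result": target()}
```

A missing record and an expired record are treated the same way: both leave the status empty, so the call is routed to the callee.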
If the service call fails or a timeout occurs, the application retries with exponential backoff for a defined number of times. If the service call still fails after the retries, the workflow inserts a record for the service in the CircuitStatus table with an ExpiryTimeStamp, and the workflow exits with a FAIL state.
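The retry-then-trip behavior can be sketched as follows. This is a simplified Python illustration under assumptions: the function names, the default attempt count, delays, and open_seconds value are invented for the example, the table is a plain dictionary, and the sleep function is injectable so that the sketch can be exercised without waiting.

```python
import time


def call_with_retries(target, max_attempts=3, base_delay=1.0, backoff_rate=2.0, sleep=time.sleep):
    """Call target, retrying failures with exponentially growing delays."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return target()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last failure
            sleep(delay)
            delay *= backoff_rate


def execute_or_trip(table, service_name, target, now, open_seconds=20, **retry_options):
    """Run the call with retries; on final failure, record an open circuit."""
    try:
        result = call_with_retries(target, **retry_options)
    except Exception:
        table[service_name] = {
            "CircuitStatus": "OPEN",
            "ExpiryTimeStamp": now + open_seconds,
        }
        return {"state": "FAIL"}
    return {"state": "SUCCESS", "result": result}
```

With max_attempts=3 and base_delay=1.0, a persistently failing call waits 1 second and then 2 seconds between attempts before the circuit record is written.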
Subsequent calls to the same service return an immediate failure as long as the circuit breaker is open. The Get Circuit Status step in the state machine definition checks the service availability based on the ExpiryTimeStamp value. Expired items are deleted from the CircuitStatus table by using the DynamoDB time to live (TTL) feature.
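DynamoDB TTL expects the expiry attribute to be a number in Unix epoch seconds, and it removes expired items in the background rather than at the exact moment of expiry. That is why the status check compares ExpiryTimeStamp with the current time instead of relying on the record being absent. The following Python sketch illustrates the record shape and a simulated sweep. The ServiceName attribute name and both helper functions are assumptions made for this example.

```python
import time


def build_circuit_record(service_name, open_seconds, now=None):
    """Build a record whose ExpiryTimeStamp is an integer in epoch seconds."""
    now = int(time.time()) if now is None else int(now)
    return {
        "ServiceName": service_name,  # hypothetical key attribute name
        "CircuitStatus": "OPEN",
        "ExpiryTimeStamp": now + open_seconds,
    }


def sweep_expired(records, now):
    """Simulate the background TTL sweep by keeping only unexpired records."""
    return [r for r in records if r["ExpiryTimeStamp"] > now]
```

Until the sweep runs, an expired record can still be present in the table, so readers must filter on the timestamp themselves.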
Sample code
The following code uses the GetCircuitStatus Lambda function to check the circuit breaker status.
// Query for records for this service that have not yet expired.
var serviceDetails = _dbContext.QueryAsync<CircuitBreaker>(
    serviceName,
    QueryOperator.GreaterThan,
    new List<object> { currentTimeStamp }).GetRemainingAsync();

// An unexpired record means the circuit is open; otherwise the status is empty.
if (serviceDetails.Result.Count > 0)
{
    functionData.CircuitStatus = serviceDetails.Result[0].CircuitStatus;
}
else
{
    functionData.CircuitStatus = "";
}
The following code shows the Amazon States Language statements in the Step Functions workflow.
"Is Circuit Closed": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.CircuitStatus",
      "StringEquals": "OPEN",
      "Next": "Circuit Open"
    },
    {
      "Variable": "$.CircuitStatus",
      "StringEquals": "",
      "Next": "Execute Lambda"
    }
  ]
},
"Circuit Open": {
  "Type": "Fail"
}
GitHub repository
For a complete implementation of the sample architecture for this pattern, see the GitHub repository at https://github.com/aws-samples/circuit-breaker-netcore-blog