Circuit Breaker Stability Pattern
[Update 2014-02-02: I’ve posted a video showing a visualization of this pattern at http://www.youtube.com/watch?v=WEXh1WoppMU]
In a loosely-coupled system connected with asynchronous messages, a flood of client requests can exceed the capacity of a service. When this happens, we would prefer that the service respond with an indication of the overloaded condition, rather than making us wait for service we may never receive. The Twitter “fail whale” is an example. The Circuit Breaker pattern [1] helps maintain the stability of a system in an overload situation.
Timeout Circuit Breaker
A circuit breaker can be either closed or open. When it is closed, communication flows normally. When it is open, the “circuit” is broken, and clients receive a fail-fast rejection notice. Our implementation of the pattern uses a timeout policy to detect an overloaded condition. If the service does not reply within a specified amount of time, the request fails and the client receives a timeout notice. After some number of failures, the circuit is broken (becomes open).
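The closed/open behavior described above can be approximated outside the actor model. The following is a minimal, synchronous Python sketch, not the Humus implementation developed in this article. The `Breaker` class, its field names, and the convention that a service returns a `(reply, elapsed_seconds)` pair are all assumptions made for illustration, with latency simulated rather than measured.

```python
from dataclasses import dataclass


@dataclass
class Breaker:
    """Hypothetical synchronous stand-in for a timeout circuit breaker."""
    timeout: float        # seconds allowed before a request counts as failed
    n_retries: int        # failure count the breaker tolerates while closed
    n_failed: int = 0     # consecutive failures seen so far
    is_open: bool = False

    def call(self, service, req):
        # Open: reject immediately without touching the service.
        if self.is_open:
            return ("reject", req)
        # Assumption: the service reports how long it took (simulated latency).
        reply, elapsed = service(req)
        if elapsed <= self.timeout:
            self.n_failed = 0             # success resets the count
            return reply
        # Timed out: count the failure, or trip once the count exceeds the limit.
        if self.n_failed <= self.n_retries:
            self.n_failed += 1
        else:
            self.is_open = True
        return ("fail", req)


def slow_service(req):
    return (("done", req), 5.0)           # always takes 5 simulated seconds


cb = Breaker(timeout=1.0, n_retries=1)
results = [cb.call(slow_service, i)[0] for i in range(5)]
```

In this sketch the first three calls time out and the last two are rejected without the service being invoked at all, which is the fail-fast effect the pattern is after.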
Closed State
The Circuit Breaker is a Proxy for the service. Clients send requests to the circuit breaker, which acts as the public entry-point for the service, concealing the actual service. The circuit breaker begins in the closed state. A group of configuration parameters defines the actual underlying service, the timeout period, the number of failed attempts needed to trip the breaker, and the backoff period that allows the service time to recover. In addition, we specify an administration actor, used to validate and perform administrative operations.
LET cb_closed_beh(n_failed, config) = \msg.[
    LET $config = (admin, svc, timeout, n_retries, backoff)
    CASE msg OF
    ...
    ($admin, _) : []
    (cust, req) : [
        CREATE k_timer WITH \_.[]
        CREATE k_reply WITH cb_reply_beh(
            SELF, admin, k_timer, cust, req)
        SEND (k_reply, req) TO svc
        AFTER timeout SEND k_timer TO k_reply
    ]
    END
]
When an actor with cb_closed_beh receives a message msg, it first checks to see if it is an administrative message tagged with the admin identity. For now we will ignore administrative messages.
In most cases the message consists of a customer cust and a request req for service. A timer actor k_timer is created to represent the unique (potential) occurrence of a timeout for this request. A reply customer k_reply is created to catch either the reply from the service or the occurrence of the timeout. This customer is sent, along with the original request req, to the underlying service svc. Finally, a delayed message is created that will send the timer k_timer to the reply customer k_reply after the timeout period has elapsed.
Reply Proxy
The reply customer is a proxy for the actual customer, created to handle the reply to a specific request. If the timer message arrives first, then the request timed out and we want to indicate failure. If the reply from the service arrives first, then the request succeeded and we want to pass on the reply. In either case, we want to ignore subsequent messages.
LET cb_reply_beh(cb, admin, timer, cust, req) = \msg.[
    CASE msg OF
    $timer : [
        SEND (admin, #fail) TO cb
        SEND (#fail, req) TO cust
    ]
    reply : [
        SEND (admin, #ok) TO cb
        SEND reply TO cust
    ]
    END
    BECOME \_.[]
]
When an actor with cb_reply_beh receives a message msg, it is either a timeout event tagged with the timer, or a reply from the underlying service. If it is a timeout, we notify the circuit breaker cb with an administrative message indicating failure, and send a #fail response to the original customer cust. Otherwise we send an administrative message indicating success to the circuit breaker cb, and forward the reply to the original customer cust. In either case, we change the behavior of the reply proxy to ignore all subsequent messages. Note how the identity of the timer, created specifically for this request, represents a message that could not possibly be generated as a reply from the service. Also notice that the reply proxy responds directly to the original customer, while concurrently sending a success/failure indication to the circuit breaker.
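The "first message wins" rule of the reply proxy can be illustrated with a rough Python analogue. This is a sketch under assumptions, not a translation of the Humus code: the `ReplyProxy` class is hypothetical, customers and the breaker notification are modeled as plain callables, and a fresh `object()` plays the role of the timer actor whose identity the service cannot forge.

```python
class ReplyProxy:
    """Hypothetical one-shot proxy: first message wins, the rest are ignored."""

    def __init__(self, notify_breaker, timer, cust, req):
        self.notify_breaker = notify_breaker  # callable receiving "ok" or "fail"
        self.timer = timer                    # unique token for this request
        self.cust = cust                      # callable standing in for the customer
        self.req = req
        self.done = False

    def receive(self, msg):
        if self.done:
            return                            # plays the role of BECOME \_.[]
        self.done = True
        if msg is self.timer:                 # identity check, not equality
            self.notify_breaker("fail")
            self.cust(("fail", self.req))
        else:
            self.notify_breaker("ok")
            self.cust(msg)


admin_log, cust_log = [], []
timer = object()                              # cannot be produced by the service
proxy = ReplyProxy(admin_log.append, timer, cust_log.append, "GET /x")
proxy.receive(timer)                          # timeout arrives first
proxy.receive("late reply")                   # ignored
```

Because the comparison uses object identity, no reply value the service could send would be mistaken for the timeout, which is the property the unique timer actor provides in the original.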
Administrative Messages
As shown above, the reply proxy sends administrative messages to the circuit breaker indicating the success or failure of each request. In the closed state, the circuit breaker keeps a count of consecutive failures. If the number of failures exceeds the configured retry limit, the circuit breaker will become open and begin rejecting requests.
LET inc(x) = add(x, 1)
LET cb_closed_beh(n_failed, config) = \msg.[
    LET $config = (admin, svc, timeout, n_retries, backoff)
    CASE msg OF
    ($admin, #ok) : [
        BECOME cb_closed_beh(0, config)
    ]
    ($admin, #fail) : [
        CASE less_equal(n_failed, n_retries) OF
        TRUE : [
            BECOME cb_closed_beh(inc(n_failed), config)
        ]
        FALSE : [
            BECOME cb_open_beh(config)
            SEND (#open, SELF, config) TO admin
            AFTER backoff SEND (admin, #close) TO SELF
        ]
        END
    ]
    ($admin, _) : []
    (cust, req) : [
        CREATE k_timer WITH \_.[]
        CREATE k_reply WITH cb_reply_beh(
            SELF, admin, k_timer, cust, req)
        SEND (k_reply, req) TO svc
        AFTER timeout SEND k_timer TO k_reply
    ]
    END
]
When an actor with cb_closed_beh receives an #ok administrative message, it resets its failure counter to zero and remains closed.
When a #fail administrative message is received, the failure count n_failed is compared against the retry limit n_retries. If the count is less than or equal to the limit, the failure counter is incremented and the circuit breaker remains closed. If the count exceeds the limit, the circuit breaker becomes open, the admin actor is notified of the state change, and a delayed message is sent to close the circuit breaker after the backoff period has elapsed.
Open State
A circuit breaker in the open state rejects all requests until it receives an administrative message to switch back to closed. Extraneous administrative messages, such as #ok and #fail notifications for outstanding requests, are ignored.
LET cb_open_beh(config) = \msg.[
    LET $config = (admin, svc, timeout, n_retries, backoff)
    CASE msg OF
    ($admin, #close) : [
        BECOME cb_closed_beh(n_retries, config)
        SEND (#closed, SELF, config) TO admin
    ]
    ($admin, _) : []
    (cust, req) : [
        SEND (#reject, req) TO cust
    ]
    END
]
When an actor with cb_open_beh receives a #close administrative message, the circuit breaker becomes closed and the admin actor is notified of the state change. The failure counter is set to equal the retry limit, rather than zero, so that the circuit breaker will trip quickly if the underlying service has not yet recovered. This is what Nygard called the half-open state.
If a normal customer request is received, the customer is sent a #reject response. This corresponds to the “fail whale”. The customer is notified immediately that their request will not be processed, and no further load is placed on the service.
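The effect of restarting the counter at the retry limit can be traced with a small state-transition function. This Python sketch is an illustration only: the `step` function, the `(mode, n_failed)` state tuple, and the `("admin", ...)` / `("req", ...)` message tags are assumptions standing in for the actor behaviors and the admin identity check.

```python
def step(state, msg, n_retries):
    """Advance a (mode, n_failed) state by one message; returns (state, reply)."""
    mode, n_failed = state
    kind, body = msg
    if mode == "open":
        if kind == "admin" and body == "close":
            return ("closed", n_retries), None   # restart at the limit, not zero
        if kind == "admin":
            return state, None                   # stale #ok / #fail are ignored
        return state, ("reject", body)           # any request fails fast
    # Closed mode: only the failure counting is modeled here.
    if kind == "admin" and body == "ok":
        return ("closed", 0), None
    if kind == "admin" and body == "fail":
        if n_failed <= n_retries:
            return ("closed", n_failed + 1), None
        return ("open", None), None
    return state, None                           # request forwarding omitted


state = ("open", None)
state, reply = step(state, ("req", "GET /x"), 3)    # rejected while open
state, _ = step(state, ("admin", "close"), 3)       # backoff elapsed
after_close = state
state, _ = step(state, ("admin", "fail"), 3)
state, _ = step(state, ("admin", "fail"), 3)        # second failure re-opens
```

Starting from the limit, the breaker in this sketch re-opens after two consecutive failures instead of the five needed from zero, while a single success would reset it to normal operation.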
Operational Control
Using circuit breakers can improve the stability of your system by disconnecting clients from overloaded services, allowing them time to recover from an overload condition. Operations staff will want ways to monitor and control these important stability components. State changes (open/closed) are already reported to the administrative actor, where they can be logged and/or trigger alerts. However, we would also like to probe a circuit breaker on demand, to determine its current status.
LET cb_open_beh(config) = \msg.[
    LET $config = (admin, svc, timeout, n_retries, backoff)
    CASE msg OF
    ($admin, #close) : [
        BECOME cb_closed_beh(n_retries, config)
        SEND (#closed, SELF, config) TO admin
    ]
    ($admin, #status) : [
        SEND (#open, SELF, config) TO admin
    ]
    ($admin, _) : []
    (cust, req) : [
        SEND (#reject, req) TO cust
    ]
    END
]
When an actor with cb_open_beh receives a #status administrative message, it sends its current status of open, just as it would on a state transition.
We would also like to be able to force state transitions to occur. We may want to test the effect of tripping the breaker, or reset it before the backoff period has expired. In the open state, this is already supported by the #close message, used by the backoff timer. We need an equivalent #open message in the closed state.
LET cb_closed_beh(n_failed, config) = \msg.[
    LET $config = (admin, svc, timeout, n_retries, backoff)
    CASE msg OF
    ($admin, #ok) : [
        BECOME cb_closed_beh(0, config)
    ]
    ($admin, #fail) : [
        CASE less_equal(n_failed, n_retries) OF
        TRUE : [
            BECOME cb_closed_beh(inc(n_failed), config)
        ]
        FALSE : [
            BECOME cb_open_beh(config)
            SEND (#open, SELF, config) TO admin
            AFTER backoff SEND (admin, #close) TO SELF
        ]
        END
    ]
    ($admin, #open) : [
        BECOME cb_open_beh(config)
        SEND (#open, SELF, config) TO admin
    ]
    ($admin, #status) : [
        SEND (#closed, SELF, config) TO admin
    ]
    ($admin, _) : []
    (cust, req) : [
        CREATE k_timer WITH \_.[]
        CREATE k_reply WITH cb_reply_beh(
            SELF, admin, k_timer, cust, req)
        SEND (k_reply, req) TO svc
        AFTER timeout SEND k_timer TO k_reply
    ]
    END
]
When an actor with cb_closed_beh receives an #open administrative message, the circuit breaker becomes open and the admin actor is notified of the state change.
When it receives a #status administrative message, it sends its current status of closed, just as it would on a state transition.
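The operational-control messages can be exercised in isolation with a short Python sketch. Everything here is a hypothetical stand-in: the `Breaker` class models only the #status, #open, and #close handling, records outgoing messages in a list instead of sending them, and uses Python object identity in place of matching on the admin actor.

```python
class Breaker:
    """Hypothetical breaker handling only the operational-control messages."""

    def __init__(self, admin):
        self.admin = admin          # identity that authorizes admin messages
        self.mode = "closed"
        self.sent = []              # (target, message) pairs, in send order

    def _report(self):
        self.sent.append((self.admin, (self.mode, self)))

    def receive(self, msg):
        tag, body = msg
        if tag is not self.admin:
            return                  # customer requests omitted from this sketch
        if body == "status":
            self._report()          # same report as on a state transition
        elif body == "open" and self.mode == "closed":
            self.mode = "open"
            self._report()
        elif body == "close" and self.mode == "open":
            self.mode = "closed"
            self._report()
        # any other admin message is ignored


admin = object()
cb = Breaker(admin)
cb.receive((admin, "status"))       # reports closed
cb.receive((object(), "open"))      # wrong identity: not an admin message
cb.receive((admin, "open"))         # forced trip
cb.receive((admin, "status"))       # reports open
reports = [message[0] for _, message in cb.sent]
```

The message tagged with a different identity has no administrative effect, which is the validation role the admin actor plays in the configuration.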
Policy Parameters
Various aspects of the circuit breaker behavior are hard-coded in the preceding implementations. Let’s abstract out some of those policy decisions. By adding a few elements to the configuration parameters, we can provide policy functions parameterizing the behavior. The aspects we will abstract are: updating the failure count on success, forming a timeout response, and forming a reject response. In addition, we will provide an administrative message for re-configuring the circuit breaker, updating its policy parameters without interrupting its operation.
LET cb_closed_beh(n_failed, config) = \msg.[
    LET $config = (admin, svc, timeout, n_retries, backoff, on_ok, on_fail, on_reject)
    CASE msg OF
    ($admin, #ok) : [
        BECOME cb_closed_beh(on_ok(n_failed), config)
    ]
    ($admin, #fail) : [
        CASE less_equal(n_failed, n_retries) OF
        TRUE : [
            BECOME cb_closed_beh(inc(n_failed), config)
        ]
        FALSE : [
            BECOME cb_open_beh(config)
            SEND (#open, SELF, config) TO admin
            AFTER backoff SEND (admin, #close) TO SELF
        ]
        END
    ]
    ($admin, #open) : [
        BECOME cb_open_beh(config)
        SEND (#open, SELF, config) TO admin
    ]
    ($admin, #status) : [
        SEND (#closed, SELF, config) TO admin
    ]
    ($admin, #config, config') : [
        BECOME cb_closed_beh(n_failed, config')
    ]
    ($admin, _) : []
    (cust, req) : [
        CREATE k_timer WITH \_.[]
        CREATE k_reply WITH cb_reply_beh(
            SELF, admin, k_timer, on_fail, cust, req)
        SEND (k_reply, req) TO svc
        AFTER timeout SEND k_timer TO k_reply
    ]
    END
]
LET cb_reply_beh(cb, admin, timer, on_fail, cust, req) = \msg.[
    CASE msg OF
    $timer : [
        SEND (admin, #fail) TO cb
        SEND on_fail(cust, req) TO cust
    ]
    reply : [
        SEND (admin, #ok) TO cb
        SEND reply TO cust
    ]
    END
    BECOME \_.[]
]
LET cb_open_beh(config) = \msg.[
    LET $config = (admin, svc, timeout, n_retries, backoff, on_ok, on_fail, on_reject)
    CASE msg OF
    ($admin, #close) : [
        BECOME cb_closed_beh(n_retries, config)
        SEND (#closed, SELF, config) TO admin
    ]
    ($admin, #status) : [
        SEND (#open, SELF, config) TO admin
    ]
    ($admin, #config, config') : [
        BECOME cb_open_beh(config')
    ]
    ($admin, _) : []
    (cust, req) : [
        SEND on_reject(cust, req) TO cust
    ]
    END
]
The #config administrative message provides a new set of configuration parameters for the circuit breaker. The new configuration takes effect on the next message processed. This is a powerful mechanism for on-the-fly re-configuration of a system in operation. It allows system operators to alter selected aspects of the component’s behavior without taking the component out-of-service. From here it is a small step to allow a complete hot-swap of the component’s behavior, if we wanted to support that.
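A small Python sketch can show what "takes effect on the next message processed" means in practice. It is an illustration under assumptions: the `step` function, the `(mode, n_failed, config)` state tuple, the dictionary-based config, and the message tags are all hypothetical simplifications of the actor behaviors.

```python
def step(state, msg):
    """state is (mode, n_failed, config); config is a dict of policy values."""
    mode, n_failed, config = state
    kind, body = msg
    if kind == "config":
        return (mode, n_failed, body)         # swap config, keep mode and count
    if kind == "fail" and mode == "closed":
        if n_failed <= config["n_retries"]:
            return ("closed", n_failed + 1, config)
        return ("open", n_failed, config)
    return state                              # other messages omitted here


state = ("closed", 2, {"n_retries": 5})
state = step(state, ("config", {"n_retries": 1}))   # tighten the limit live
state = step(state, ("fail", None))                 # 2 <= 1 is false: trips
```

The failure count accumulated under the old configuration is preserved, so in this sketch tightening the limit causes the very next failure to open the breaker.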
The on_ok function calculates a new value for n_failed given the current value when a request is successful. The on_fail function creates a timeout-failure response given a customer and a request. The on_reject function creates a reject-failure response given a customer and a request. The on_fail and on_reject functions have the same signature, so the same implementation could be used for both, in case we don’t need to distinguish between the two causes of failure.
LET on_success(n_failed) = (div(n_failed, 2))
LET on_failure(cust, req) = (#timeout, req)
LET on_rejection(cust, req) = (#reject, req)
This set of policy functions is an example of how we might choose to parameterize a circuit breaker. To reproduce our hard-coded behavior, the on_success function would have to return a constant 0. Instead, we have chosen to implement exponential decay, dividing the number of failures by two on each successful request. The on_failure and on_rejection functions are used to form a fail/reject response that will be understood by the customer. The implementations here show that we are not required to use all of the available data to form a fail/reject response. The requesting customer is made available in case we want to use it to distinguish between otherwise identical requests.
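The decay policy is easy to trace numerically. The Python sketch below assumes that div performs integer division (rounding down), which the article does not state explicitly. The `decay` helper is a hypothetical addition used only to list successive values.

```python
def on_success(n_failed):
    """Halve the failure count, rounding down (assumes div is integer division)."""
    return n_failed // 2


def decay(n_failed):
    """List the failure count after each consecutive success until it hits zero."""
    trace = [n_failed]
    while trace[-1] > 0:
        trace.append(on_success(trace[-1]))
    return trace


history = decay(5)   # [5, 2, 1, 0]
```

Under this assumption a count of 5 needs three consecutive successes to return to zero, so a service that succeeds only intermittently stays closer to the trip threshold than it would with a hard reset.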
Summary
The Circuit Breaker pattern allows separation of clients from services in an overloaded condition. This enhances the stability of the system in two ways: by removing load from the service, allowing it time to recover from the overload, and by quickly returning a reject response to clients, so they aren’t waiting for a response that may never come. The actor-based implementation of proxies, for both clients and services, makes it easy to compose implementations of this pattern with existing client and service implementations. Parameterizing policies, such as the structure of a fail/reject response, supports adaptation to pre-existing protocols for reporting service failure.
References
[1] M. Nygard. Release It!: Design and Deploy Production-Ready Software. The Pragmatic Bookshelf, 2007.