Fuse — A Circuit Breaker implementation for Erlang

This application implements a so-called circuit-breaker for Erlang.

NOTE: If you need to access FUSE (Filesystem in Userspace) then this is not the project you want. An Erlang implementation can be found in the fuserl project; see fuserl on Google Code or fuserl on GitHub for pointers.

Fuse has seen use in various production systems, and the code should be quite stable. In particular, the extensive QuickCheck testing should give more confidence in the code than its production use alone does.

Alternative implementations

I know of a couple of alternative circuit breaker implementations for Erlang:

Changelog

We use semantic versioning. In release X.Y.Z we bump

2.5.0

2.4.5

2.4.4

Maintenance. No functional changes to the code, but tests were updated.

2.4.3

Another maintenance release, but with one feature

2.4.2

Just a simple maintenance release for Hex.

2.4.1

Maintenance release. Several grave errors were removed due to the extension of the QuickCheck model to also include timing:

2.4.0

Add support for monitoring fuses through prometheus.io. Contribution by Ilya Khaprov.

2.3.0

Support fault_injection style fuses. These are fuses that fail automatically at a certain rate, say 1/500 requests, to test systems for robustness against faulty data.

2.2.0

Add fuse:circuit_disable/1 and fuse:circuit_enable/1.

2.1.0

Add the ability to remove a fuse. Work by Zeeshan Lakhani / Basho.

2.0.0

This major release breaks backwards compatibility in statistics. Cian Synott wrote code which generalizes stats collecting through plugins, with exometer and folsom being the major plugins. Read about how to configure the system for exometer or folsom use in this README file.

No other changes in this version.

1.1.0

Rename fuse_evt to fuse_event. While this is not strictly valid, since it breaks backwards compatibility, I hope no-one has begun using Fuse yet. As such, I decided to make this a minor bump instead.

1.0.0

Initial Release.

Background

When we build large systems, one of the problems we face is what happens when we have long dependency chains of applications. We might have a case where applications call like this:

app_A → app_B → app_C

Now, if we begin having errors in application B, the problem is that application A has to handle this by waiting for application B to time out on every call. This adds latency throughout the system. A Circuit Breaker detects the error in the underlying system and then avoids making further queries. This allows you to handle the breakage systematically in the system. For long cascades, layering circuit breakers allows one to detect exactly which application is responsible for the breakage.

A broken circuit introduces some good characteristics in the system:

The broken circuit will be retried once in a while. The system will then auto-heal if connectivity comes back for the underlying systems.

Thanks

Several companies should be thanked:

Contributors

List of people who have made contributions to the project of substantial size:

Documentation

Read the tutorial in the next section. For a command reference, there is full EDoc documentation via make docs. Note that great care has been taken to produce precise documentation of the stable API fragment of the tool. If you find anything to be undocumented, please open an Issue—or better: a pull request with a patch!

Tutorial

To use fuse, you must first start the fuse application:

application:start(fuse).

but note that in real systems it is better to have other applications depend on fuse and then start it as part of a release boot script. Next, you must add a fuse into the system by installing a fuse description. This is usually done as part of your application's start/2 callback:

Name = database_fuse,
Strategy = {standard, MaxR, MaxT}, %% See below for types
Refresh = {reset, 60000},          %% Reset interval for the fuse
Opts = {Strategy, Refresh},
fuse:install(Name, Opts).

This sets up a fuse with a given Name and a given set of options. Options are given as a tuple with two values: the strategy of the fuse and the refresh of the fuse.
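As a minimal sketch, the installation can be wrapped in a small helper that an application calls while it starts up. The module name fuse_setup and the concrete numbers are illustrative assumptions, and the sketch assumes that MaxR is the number of melts tolerated within a window of MaxT milliseconds:

```erlang
-module(fuse_setup).
-export([install_database_fuse/0]).

%% Installs a standard fuse named database_fuse.
%% All numbers are illustrative, not recommendations.
install_database_fuse() ->
    MaxR = 10,                  %% assumed: melts tolerated ...
    MaxT = 10000,               %% ... within this window (milliseconds)
    Strategy = {standard, MaxR, MaxT},
    Refresh = {reset, 60000},   %% same refresh value as in the snippet above
    fuse:install(database_fuse, {Strategy, Refresh}).
```

The fuse application must already be running when the helper is called.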

Fuses are name-created idempotently, so your application can recreate a fuse if it wants. Note however, that fuse recreation has two major rules:

Once you have installed a fuse, you can ask about its state:

Context = sync,
case fuse:ask(database_fuse, Context) of
  ok -> …;
  blown -> …
end,

This queries the fuse for its state and lets you handle the case where it is currently blown. The Context specifies the context under which the fuse is running (like in mnesia). There are currently two available contexts:

Now suppose you have a working fuse, but you suddenly realize you get errors of the type {error, timeout}. Since you think this is a problem, you can tell the system that the fuse is under strain. You do this by melting the fuse:

case emysql:execute(Stmt) of
    {error, connection_lock_timeout} ->
      ok = fuse:melt(database_fuse),
      …
    …
end,
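Putting the ask and melt steps together might look like the sketch below. The module name fuse_manual and the QueryFun argument are hypothetical; QueryFun stands in for whatever call reaches the underlying system, and it is assumed to return {ok, Rows} or {error, Reason}. The {error, not_found} clause assumes that is what ask/2 returns for a fuse that has not been installed:

```erlang
-module(fuse_manual).
-export([guarded_query/1]).

%% QueryFun is a hypothetical zero-arity fun returning
%% {ok, Rows} | {error, Reason}.
guarded_query(QueryFun) ->
    case fuse:ask(database_fuse, sync) of
        ok ->
            case QueryFun() of
                {ok, _} = Ok ->
                    Ok;
                {error, _} = Err ->
                    %% Report strain on the fuse, then pass the error on.
                    ok = fuse:melt(database_fuse),
                    Err
            end;
        blown ->
            %% Do not touch the underlying system while the fuse is blown.
            {error, fuse_blown};
        {error, not_found} ->
            %% Assumed return value for a fuse that was never installed.
            {error, fuse_not_installed}
    end.
```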

The fuse has a policy, so once it has been melted too many times, it will blow for a while until it has cooled down. Then it will heal back to the initial state. If the underlying system is still broken, the fuse will quickly break again. While this reset methodology is not optimal, it is easy to create a QuickCheck model showing the behaviour is correct. Note that melt is synchronous: it blocks until the fuse can handle the melt. There are two reasons for this:

Another way to run the fuse is to use a wrapper function. Suppose you have a function with the following spec:

-spec exec() -> {ok, Result} | {melt, Result}
  when Result :: term().

%% To use this function:
Context = sync,
case fuse:run(Name, fun exec/0, Context) of
  {ok, Result} -> …;
  blown -> …
end,

This function will do the asking and melting itself based on the output of the underlying function. The run/3 invocation is often easier to handle in programs. As with ask/2, you must supply your desired context.
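A sketch of adapting an ordinary function to the {ok, Result} | {melt, Result} convention follows. The module name fuse_run_example and the Fun argument are hypothetical, and the sketch assumes, as the case expression above suggests, that run/3 returns {ok, Result} whether or not the wrapped function asked for a melt:

```erlang
-module(fuse_run_example).
-export([fetch/1]).

%% Fun is a hypothetical zero-arity fun returning
%% {ok, Value} | {error, Reason}.
fetch(Fun) ->
    Exec = fun() ->
        case Fun() of
            {ok, _} = Ok     -> {ok, Ok};      %% success: leave the fuse alone
            {error, _} = Err -> {melt, Err}    %% failure: ask run/3 to melt
        end
    end,
    case fuse:run(database_fuse, Exec, sync) of
        {ok, Result}       -> Result;          %% the original return value of Fun
        blown              -> {error, fuse_blown};
        {error, not_found} -> {error, fuse_not_installed}  %% assumed return
    end.
```

Wrapping the original return value in both branches lets the caller still tell success from failure after run/3 has handled the melt.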

Fuse types

There are a couple of different fuse types in the system:

Administrative commands

An administrator can manually disable/reenable fuses through the following commands:

ok = fuse:circuit_disable(Name),
…
ok = fuse:circuit_enable(Name),

When you disable a circuit, you blow the fuse until you enable the circuit again.

The interaction rules for disables/enables are that they dominate every other command except the call to remove/1. That is, even reinstalling an already installed fuse will not reenable it. The only ways to reenable it are to call fuse:circuit_enable/1, or to first call fuse:remove/1 on the fuse and then install it again with install/2.
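One possible use is to keep a circuit disabled for the duration of a maintenance task. The sketch below is an assumption about how the two commands could be combined, not a pattern prescribed by fuse; the module name and the Task argument are hypothetical:

```erlang
-module(fuse_admin_example).
-export([maintenance/2]).

%% Disables the fuse Name, runs Task (a hypothetical zero-arity fun),
%% and reenables the fuse even if Task raises an exception.
maintenance(Name, Task) ->
    ok = fuse:circuit_disable(Name),
    try
        Task()
    after
        ok = fuse:circuit_enable(Name)
    end.
```

While Task runs, callers asking the fuse see it as blown, which keeps them away from the underlying system.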

Monitoring fuse state

Fuses installed into the system are automatically instrumented in two ways: stats plugins and the alarm_handler.

Stats plugins

Fuse includes a simple behaviour, fuse_stats_plugin, for integration with statistics and monitoring systems.

Independent of which stats plugin you use, the ok and blown metrics are increased on every ask/2 call to the fuse. The melt metric is increased whenever we see a melt happen.

Note: The metrics are subject to change. Especially if someone can come up with better metrics to instrument for in the system.

Using a plugin

By default, fuse uses the fuse_stats_ets plugin. To use another, set it up in the environment with e.g.

application:set_env(fuse, stats_plugin, fuse_stats_folsom).

or in a .config as

{fuse, [ {stats_plugin, fuse_stats_folsom} ] }

Note that it's up to you to arrange your application's dependencies such that plugin applications like folsom or exometer are available and started. Fuse has no direct dependency on either folsom or exometer.

Writing a plugin

See the source of fuse_stats_plugin and the plugins above for documentation and examples.
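As a rough sketch only: the plugin below assumes the behaviour consists of two callbacks, init/1 (called with the fuse name) and increment/2 (called with the fuse name and one of ok, blown, or melt). Check those signatures against the fuse_stats_plugin source before relying on them. The module name my_fuse_stats_logger is hypothetical:

```erlang
-module(my_fuse_stats_logger).
-behaviour(fuse_stats_plugin).

%% Assumed callbacks of the fuse_stats_plugin behaviour.
-export([init/1, increment/2]).

%% Nothing to set up for a logging-only plugin.
init(_Name) ->
    ok.

%% Log every counter bump instead of storing it anywhere.
increment(Name, Counter) ->
    logger:debug("fuse ~p: counter ~p incremented", [Name, Counter]),
    ok.
```

Such a plugin would be selected in the same way as the built-in ones, by setting stats_plugin to my_fuse_stats_logger in the fuse application environment.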

Integration with the alarm handler

Furthermore, fuses raise alarms when they are blown. They raise an alarm under the same name as the fuse itself. To clear the alarm, the system uses hysteresis. It has to see 3 consecutive ok states on a fuse before clearing the alarm. This is to avoid alarm states from flapping excessively.
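The hysteresis rule can be illustrated with a small pure model. This is a sketch of the rule as described above, not the code fuse uses internally; the module name and the state representation are made up for the example:

```erlang
-module(alarm_hysteresis).
-export([step/2, run/1]).

%% Model state: {Alarm, OkStreak} where Alarm is set | clear and
%% OkStreak counts consecutive ok observations while the alarm is set.
step(blown, _State) ->
    {set, 0};                                   %% any blown observation (re)sets the alarm
step(ok, {set, Streak}) when Streak + 1 >= 3 ->
    {clear, 0};                                 %% third consecutive ok clears it
step(ok, {set, Streak}) ->
    {set, Streak + 1};
step(ok, {clear, _}) ->
    {clear, 0}.

%% Folds a list of observations (ok | blown) over the model.
run(Observations) ->
    lists:foldl(fun step/2, {clear, 0}, Observations).
```

In this model a single blown observation in the middle of a run of ok states restarts the count, which is what keeps the alarm from flapping.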

Fuse events

The fuse system contains an event handler, fuse_event, which can be used to listen for events and react when they trigger in the fuse subsystem. It sends events of the form given by the dialyzer specification {atom(), blown | ok}, where the atom() is the name of the installed fuse.

The intended use is to evict waiters from queues in a system. Suppose you are queueing workers up for answers, blocking the workers in the process. When the workers were queued, the fuse was not blown, but now it suddenly broke. You can then install an event handler which pushes a message to the queueing process and tells it the fuse is broken. It can then react by evicting all the entries in queue as if the fuse was broken.
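A handler for this use case could be sketched as a plain gen_event callback module that forwards blown events to the queueing process. The module name fuse_queue_notifier and the {fuse_blown, Name} message are hypothetical. Registering the handler is assumed to go through fuse_event:add_handler(fuse_queue_notifier, QueuePid); verify that function against the fuse_event source:

```erlang
-module(fuse_queue_notifier).
-behaviour(gen_event).

-export([init/1, handle_event/2, handle_call/2, handle_info/2,
         terminate/2, code_change/3]).

%% The handler state is the pid of the (hypothetical) queueing process.
init(QueuePid) when is_pid(QueuePid) ->
    {ok, QueuePid}.

%% Tell the queue owner that a fuse broke so it can evict its waiters.
handle_event({Name, blown}, QueuePid) ->
    QueuePid ! {fuse_blown, Name},
    {ok, QueuePid};
%% Ignore ok events and anything unexpected.
handle_event(_Other, QueuePid) ->
    {ok, QueuePid}.

handle_call(_Request, State) -> {ok, ok, State}.
handle_info(_Info, State) -> {ok, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
```

The queueing process then handles {fuse_blown, Name} in its own receive loop and evicts the queued entries.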

Speed

Standard fuses

On a Lenovo Thinkpad running Linux 3.14.4 with an Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz (an Ivy Bridge) and Erlang Release 17.0.1, we get a throughput of 2.1 million fuse queries per second by running the stress test in stress/stress.erl. This test also has linear speedup over all the cores. Lookup times are sub-microsecond, usually around 0.5μs.

Running on a Q4 2013 Macbook Pro, OSX 10.9.2, 2 Ghz Intel Core i7 (Haswell) yields roughly the same speed.

In practice, your system will be doing other things as well, but do note that the overhead of asking a fuse is expected to be around 0.5μs.
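To get a rough figure for your own hardware, a single-process timing loop such as the one below can be used. It is a much cruder measurement than stress/stress.erl and is not part of fuse; the module name fuse_bench is hypothetical, and the fuse passed in is assumed to be installed already:

```erlang
-module(fuse_bench).
-export([ask_cost/2]).

%% Returns the average wall-clock cost of one fuse:ask/2 call in
%% microseconds, measured over N sequential calls from one process.
ask_cost(Name, N) when is_integer(N), N > 0 ->
    {Micros, ok} = timer:tc(fun() -> loop(Name, N) end),
    Micros / N.

loop(_Name, 0) ->
    ok;
loop(Name, N) ->
    _ = fuse:ask(Name, sync),
    loop(Name, N - 1).
```

A call such as fuse_bench:ask_cost(database_fuse, 1000000) includes loop overhead, so treat the result as an upper bound on the lookup time.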

Fault injecting fuses

A fuse running with fault injection has the added caveat that it also makes a call to rand:uniform() which in turn will slow down the request rate. It is not expected to be a lot of slowdown, but it is mentioned here for the sake of transparency.
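A fault injecting fuse might be installed as in the following sketch. It assumes the strategy tuple has the shape {fault_injection, Rate, MaxR, MaxT} with Rate a float between 0.0 and 1.0; confirm this against the fuse types documentation. The module name, the fuse name flaky_fuse, and the numbers are illustrative:

```erlang
-module(fuse_fault_example).
-export([install/0, blown_fraction/1]).

%% Installs a fuse that is assumed to report blown for roughly
%% 1 in 500 requests even while the underlying system is healthy.
install() ->
    Rate = 1 / 500,                                  %% 0.002 as a float
    Strategy = {fault_injection, Rate, 10, 10000},   %% assumed tuple shape
    fuse:install(flaky_fuse, {Strategy, {reset, 60000}}).

%% Asks the fuse N times and returns the fraction of blown answers.
blown_fraction(N) when is_integer(N), N > 0 ->
    Blown = length([x || _ <- lists:seq(1, N),
                         fuse:ask(flaky_fuse, sync) =:= blown]),
    Blown / N.
```

Since the injected faults are random, blown_fraction/1 only approximates Rate, and more closely for large N.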

Tests

Fuse is written with two kinds of tests. First of all, it uses a set of Common Test test cases which run the basic functionality of the system. Furthermore, fuse is written with Erlang QuickCheck test cases. EQC tests are written before the corresponding code is written, and as such, this is "Property Driven Development".

To run the standard tests, execute:

rebar3 ct

To run the EQC tests, you need a working EQC installation in your erlang path. Then I tend to do:

rebar3 shell
> cd("eqc_test").
> make:all().
> eqc_cluster:t(10).

I am deliberately keeping them out of the CI chain because running them requires Erlang QuickCheck. There is a set of models, each testing one aspect of the fuse system. Taken together, they provide excellent coverage of the fuse system as a whole.

Great care has been taken in order to make sure fuse can be part of the error kernel of a system. The main fuse server is not supposed to crash under any circumstance. The monitoring application may crash since it is only part of the reporting. While important, it is not essential to correct operation.

Requirements

QuickCheck allows us to test for requirements of a system. This essentially tests for any interleaving of calls toward the Fuse subsystem, so we weed out any error. We test for the following requirements as part of the test suite. We have verified that all of these requirements are being hit by typical EQC runs:

Group heal:
R01 - Heal non-installed fuse (must never be triggered)
R02 - Heal installed fuse (only if blown already)

Group install:
R03 - Installation of a fuse with invalid configuration
R04 - Installation of a fuse with valid configuration

Group Reset:
R05 - Reset of an uninstalled fuse
R06 - Reset of an installed fuse (blown and nonblown)

Group Melt:
R11 - Melting of an installed fuse
R12 - Melting of an uninstalled fuse

Group run/2:
R07 - Use of run/2 on an ok fuse
R08 - Use of run/2 on a melted fuse
R09 - Use of run/2 on an ok fuse which is melted in the process
R10 - Use of run/2 on an uninstalled fuse

Group blow:
R13 - Blowing a fuse
R14 - Removing melts from the window by expiry

Group ask/1:
R15 - Ask on an installed fuse
R16 - Ask on an uninstalled fuse

Group circuitry:
R17 - Disable an installed fuse
R18 - Disable an uninstalled fuse
R19 - Reenable an installed fuse
R20 - Reenable an uninstalled fuse
R21 - Melting a disabled fuse

Group reset commands:
R22 - Heal command execution
R23 - Delay command execution

EQC Test harness features

Furthermore:

Subtle Errors found by EQC

Software construction is a subtle and elusive business. Most errors in software are weeded out early in the development cycle; the errors that remain are really rare and hard to find. Static type systems will only raise the bar and make the remaining errors even more slippery. Thus, the only way to remove these errors is to use a tool which is good at constructing elusive counter-examples. Erlang QuickCheck is such a tool.

Development guided by properties leads to a code base which is considerably smaller. In the course of building fuse we iteratively removed functionality from the code base which proved to be impossible to implement correctly. Rather than ending up with a lot of special cases, development by property simply suggested the removal of certain nasty configurations of the software, which do nothing but make it more bloated. Worse, one could argue some of these configurations would never be used in practice, leading to written code which would never be traversed.

General

Subtleties

The monitor model found the following: