Home

Awesome

fastfilter: Binary fuse & xor filters for Zig <a href="https://hexops.com"><img align="right" alt="Hexops logo" src="https://raw.githubusercontent.com/hexops/media/main/readme.svg"></img></a>

CI

<a href="https://raw.githubusercontent.com/FastFilter/xor_singleheader/master/figures/comparison.png"><img align="right" src="https://raw.githubusercontent.com/FastFilter/xor_singleheader/master/figures/comparison.png" alt="comparison" width="400px"></img></a>

Binary fuse filters & xor filters are probabilistic data structures which allow for quickly checking whether an element is part of a set.

Both are faster and more concise than Bloom filters, and smaller than Cuckoo filters. Binary fuse filters are a bleeding-edge development and are competitive with Facebook's ribbon filters:

Benefits of Zig implementation

This is a Zig implementation, which provides many practical benefits:

  1. Iterator-based: you can populate xor or binary fuse filters using an iterator, without keeping your entire key set in-memory and without it being a contiguous array of keys. This can reduce memory usage when populating filters substantially.
  2. Distinct allocators: you can provide separate Zig std.mem.Allocator implementations for the filter itself and population, enabling interesting opportunities like mmap-backed population of filters with low physical memory usage.
  3. Generic implementation: use Xor(u8), Xor(u16), BinaryFuse(u8), BinaryFuse(u16), or experiment with more exotic variants like Xor(u4) thanks to Zig's bit-width integers and generic type system.

Zig's safety-checking and checked overflows has also enabled us to improve the upstream C/Go implementations where overflow and undefined behavior went unnoticed.[1]

Usage

Decide if xor or binary fuse filters fit your use case better: should I use binary fuse filters or xor filters?

Get your keys into u64 values. If you have strings, structs, etc. then use something like Zig's std.hash_map.getAutoHashFn to convert your keys to u64 first. ("It is not important to have a good hash function, but collisions should be unlikely (~1/2^64).")

Create a build.zig.zon file in your project (replace $LATEST_COMMIT with the latest commit hash):

.{
    .name = "mypkg",
    .version = "0.1.0",
    .dependencies = .{
        .fastfilter = .{
            .url = "https://github.com/hexops/fastfilter/archive/$LATEST_COMMIT.tar.gz",
        },
    },
}

Run zig build in your project, and the compiler instruct you to add a .hash = "..." field next to .url.

Then use the dependency in your build.zig:

pub fn build(b: *std.Build) void {
    ...
    exe.addModule("fastfilter", b.dependency("fastfilter", .{
        .target = target,
        .optimize = optimize,
    }).module("fastfilter"));
}

In your main.zig, make use of the library:

const std = @import("std");
const testing = std.testing;
const fastfilter = @import("fastfilter");

test "mytest" {
    const allocator = std.heap.page_allocator;

    // Initialize the binary fuse filter with room for 1 million keys.
    const size = 1_000_000;
    var filter = try fastfilter.BinaryFuse8.init(allocator, size);
    defer filter.deinit(allocator);

    // Generate some consecutive keys.
    var keys = try allocator.alloc(u64, size);
    defer allocator.free(keys);
    for (keys, 0..) |key, i| {
        _ = key;
        keys[i] = i;
    }

    // Populate the filter with our keys. You can't update a xor / binary fuse filter after the
    // fact, instead you should build a new one.
    try filter.populate(allocator, keys[0..]);

    // Now we can quickly test for containment. So fast!
    try testing.expect(filter.contain(1) == true);
}

(you can just add this project as a Git submodule in yours for now, as Zig's official package manager is still under way.)

Binary fuse filters automatically deduplicate any keys during population. If you are using a different filter type (you probably shouldn't be!) then keys must be unique or else filter population will fail. You can use the fastfilter.AutoUnique(u64)(keys) helper to deduplicate (in typically O(N) time complexity), see the tests in src/unique.zig for usage examples.

Serialization

To serialize the filters, you only need to encode these struct fields:

pub fn BinaryFuse(comptime T: type) type {
    return struct {
        ...
        seed: u64,
        segment_length: u32,
        segment_length_mask: u32,
        segment_count: u32,
        segment_count_length: u32,
        fingerprints: []T,
        ...

T will be the chosen fingerprint size, e.g. u8 for BinaryFuse8 or Xor8.

Look at std.io.Writer and std.io.BitWriter for ideas on actual serialization.

Similarly, for xor filters you only need these struct fields:

pub fn Xor(comptime T: type) type {
    return struct {
        seed: u64,
        blockLength: u64,
        fingerprints: []T,
        ...

Should I use binary fuse filters or xor filters?

If you're not sure, start with BinaryFuse8 filters. They're fast, and have a false-positive probability rate of 1/256 (or 0.4%).

There are many tradeoffs, primarily between:

See the benchmarks section for a comparison of the tradeoffs between binary fuse filters and xor filters, as well as how larger bit sizes (e.g. BinaryFuse(u16)) consume more memory in exchange for a lower false-positive probability rate.

Note that fuse filters are not to be confused with binary fuse filters, the former have issues with construction, often failing unless you have a large number of unique keys. Binary fuse filters do not suffer from this and are generally better than traditional ones in several ways. For this reason, we consider traditional fuse filters deprecated.

Note about extremely large datasets

This implementation supports key iterators, so you do not need to have all of your keys in-memory, see BinaryFuse8.populateIter and Xor8.populateIter.

If you intend to use a xor filter with datasets of 100m+ keys, there is a possible faster implementation for construction found in the C implementation xor8_buffered_populate which is not implemented here.

Changelog

The API is generally finalized, but we may make some adjustments as Zig changes or we learn of more idiomatic ways to express things. We will release v1.0 once Zig v1.0 is released.

v0.11.0

v0.10.3

v0.10.2

v0.10.1

v0.10.0

v0.9.3

v0.9.2

v0.9.1

v0.9.0

v0.8.0

initial release with support for Xor and traditional Fuse filters of varying bit sizes, key iterators, serialization, and a slice de-duplication helper.

Benchmarks

Benchmarks were ran on both a 2019 Macbook Pro and Windows 10 desktop machine using e.g.:

zig run -O ReleaseFast src/benchmark.zig -- --xor 8 --num-keys 1000000
<details> <summary><strong>Benchmarks:</strong> 2019 Macbook Pro, Intel i9 (1M - 100M keys)</summary>
Algorithm# of keyspopulatecontains(k)false+ prob.bits per entrypeak populatefilter total
binaryfuse8100000037.5ms24.0ns0.003911159.0422 MiB1 MiB
binaryfuse16100000045.5ms24.0ns0.0000152418.0924 MiB2 MiB
binaryfuse32100000056.0ms24.0ns036.1828 MiB4 MiB
xor21000000108.0ms25.0ns0.25004799.8452 MiB1 MiB
xor4100000099.0ms25.0ns0.062538659.8452 MiB1 MiB
xor81000000103.4ms25.0ns0.00390559.8452 MiB1 MiB
xor161000000104.7ms26.0ns0.0000150919.6852 MiB2 MiB
xor321000000102.2ms25.0ns039.3652 MiB4 MiB
binaryfuse810000000621.2ms36.0ns0.00391699.02225 MiB10 MiB
binaryfuse1610000000666.6ms102.0ns0.000014718.04245 MiB21 MiB
binaryfuse3210000000769.0ms135.0ns036.07286 MiB43 MiB
xor2100000001.9s43.0ns0.25007039.84527 MiB11 MiB
xor4100000002.0s41.0ns0.06261379.84527 MiB11 MiB
xor8100000001.9s42.0ns0.00393699.84527 MiB11 MiB
xor16100000002.2s106.0ns0.000017319.68527 MiB23 MiB
xor32100000002.2s140.0ns039.36527 MiB46 MiB
binaryfuse81000000007.4s145.0ns0.0039899.012 GiB107 MiB
binaryfuse161000000008.4s169.0ns0.00001618.012 GiB214 MiB
binaryfuse3210000000010.2s173.0ns036.032 GiB429 MiB
xor210000000028.5s144.0ns0.2498439.845 GiB117 MiB
xor410000000027.4s154.0ns0.0623389.845 GiB117 MiB
xor810000000028.0s153.0ns0.0040169.845 GiB117 MiB
xor1610000000029.5s161.0ns0.00001219.685 GiB234 MiB
xor3210000000029.4s157.0ns039.365 GiB469 MiB

Legend:

</details> <details> <summary><strong>Benchmarks:</strong> Windows 10, AMD Ryzen 9 3900X (1M - 100M keys)</summary>
Algorithm# of keyspopulatecontains(k)false+ prob.bits per entrypeak populatefilter total
binaryfuse8100000044.6ms24.0ns0.003907969.0422 MiB1 MiB
binaryfuse16100000048.9ms25.0ns0.0000155318.0924 MiB2 MiB
binaryfuse32100000049.9ms25.0ns0.0000000136.1828 MiB4 MiB
xor2100000077.3ms25.0ns0.250001639.8452 MiB1 MiB
xor4100000080.0ms25.0ns0.062504279.8452 MiB1 MiB
xor8100000076.0ms25.0ns0.003916629.8452 MiB1 MiB
xor16100000083.7ms26.0ns0.0000153619.6852 MiB2 MiB
xor32100000079.1ms27.0ns039.3652 MiB4 MiB
fuse8100000069.4ms25.0ns0.003906639.1049 MiB1 MiB
fuse16100000071.5ms27.0ns0.0000151618.2049 MiB2 MiB
fuse32100000071.1ms27.0ns036.4049 MiB4 MiB
binaryfuse810000000572.3ms33.0ns0.00388679.02225 MiB10 MiB
binaryfuse1610000000610.6ms108.0ns0.000012718.04245 MiB21 MiB
binaryfuse3210000000658.2ms144.0ns036.07286 MiB43 MiB
xor2100000001.2s39.0ns0.2498769.84527 MiB11 MiB
xor4100000001.2s39.0ns0.06250269.84527 MiB11 MiB
xor8100000001.2s41.0ns0.00388819.84527 MiB11 MiB
xor16100000001.3s117.0ns0.000013419.68527 MiB23 MiB
xor32100000001.3s147.0ns039.36527 MiB46 MiB
fuse8100000001.1s36.0ns0.00390899.10499 MiB10 MiB
fuse16100000001.1s112.0ns0.000017218.20499 MiB21 MiB
fuse32100000001.1s145.0ns036.40499 MiB43 MiB
binaryfuse81000000006.9s167.0ns0.003819.012 GiB107 MiB
binaryfuse161000000007.2s171.0ns0.00000918.012 GiB214 MiB
binaryfuse321000000008.5s174.0ns036.032 GiB429 MiB
xor210000000016.8s166.0ns0.2498689.845 GiB117 MiB
xor410000000018.9s183.0ns0.0624179.845 GiB117 MiB
xor810000000019.1s168.0ns0.0038739.845 GiB117 MiB
xor1610000000016.9s171.0ns0.00002119.685 GiB234 MiB
xor3210000000019.4s189.0ns039.365 GiB469 MiB
fuse810000000019.6s167.0ns0.0037979.104 GiB108 MiB
fuse1610000000020.8s171.0ns0.00001518.204 GiB216 MiB
fuse3210000000021.5s176.0ns036.404 GiB433 MiB

Legend:

</details>

Related readings

Special thanks

If it was not for the above people, I (@slimsag) would not have been able to write this implementation and learn from the excellent C implementation. Please credit the above people if you use this library.