binfuse
C++ Library for Binary Fuse Filters
Binary fuse filters are a recent (2022) development in the group of Approximate Membership Query filters.
Approximate membership query filters (hereafter, AMQ filters) comprise a group of space-efficient probabilistic data structures that support approximate membership queries. An approximate membership query answers whether an element is in a set or not with a false positive rate of ϵ.
Binary fuse filters are a further development of XOR filters. Both are more space efficient and faster to build/query than traditional options like Bloom and Cuckoo filters.
This binfuse C++ library builds on the C library by the authors of the relevant research paper.
As well as adding a convenient C++ interface, binfuse::filter also facilitates (de-)serializing the populated filter to/from disk, as well as querying it directly from disk via an mmap, with cross-platform support from mio. Both in-memory and "off-disk" operation are supported.
One of the challenges with binary fuse filters is that they are immutable once populated, so data cannot be added incrementally, and they consume a significant amount of memory during the populate process - 64GB of memory is recommended for populating with 500 million uint64_t keys/hashes. This has, until now, placed an upper bound on the practical application of these filters to very large datasets.
binfuse::sharded_filter allows convenient slicing of the dataset into an arbitrary number of shards, which are written to disk and indexed by the N most significant bits of the uint64_t keys/hashes. Sharding is transparent to the user during queries and is still very fast, with just 3 mmap accesses per query.
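For intuition, the shard for a key is selected by its N most significant bits. The helper below is a hypothetical illustration of that mapping, not part of the binfuse API:

// Illustration only: select a shard prefix from the N most significant bits
// of a 64-bit key. This hypothetical helper is not part of the binfuse API.
#include <cstdint>

std::uint32_t shard_prefix(std::uint64_t key, unsigned n_bits) {
  // assumes 1 <= n_bits <= 32; e.g. n_bits == 1 maps keys with the MSB
  // clear to shard 0 and keys with the MSB set to shard 1
  return static_cast<std::uint32_t>(key >> (64U - n_bits));
}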
binfuse::sharded_filter easily controls RAM requirements during the "populate filter" process and enables datasets of tens of billions of records on common hardware. Query speeds depend on disk hardware and cache conditions, but can be in the 50ns range.
Applications
The motivating use case was this pwned password db project, but there are many common applications:
Typical applications of AMQ filters are distributed systems and database systems. The AMQ filter functions as a proxy to the set of keys of a database or remote memory. Before a presumably slow query to the database or to remote memory is performed, the AMQ filter is used to give an approximate answer as to whether the key is in the database or in remote memory. The slow query is only performed when the AMQ filter returns true. Only in the case of a (hopefully rare) false positive is an unnecessary I/O or remote access performed. The applications are numerous and include package and resource routing, P2P networks, and distributed caching.
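A hypothetical sketch of this proxy pattern using binfuse::filter8; slow_database_lookup and the key hashing are placeholders, not part of binfuse:

// Hypothetical sketch of the proxy pattern: consult the filter first,
// and only fall back to the slow backend when it reports "maybe present".
// `slow_database_lookup` is a placeholder for the expensive query.
#include "binfuse/filter.hpp"

#include <cstdint>
#include <optional>
#include <string>

std::optional<std::string> slow_database_lookup(std::uint64_t key_hash); // placeholder

std::optional<std::string> lookup(binfuse::filter8& filter, std::uint64_t key_hash) {
  if (!filter.contains(key_hash)) {
    return std::nullopt; // definitely not in the set: skip the slow query
  }
  // a (rare) false positive costs at most one unnecessary backend access
  return slow_database_lookup(key_hash);
}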
Usage examples from tests
Singular binfuse::filter for in-memory use:
binfuse::filter8 filter(std::vector<std::uint64_t>{
    0x0000000000000000,
    0x0000000000000001, // order is not important
    0x0000000000000002,
});
EXPECT_TRUE(filter.is_populated());
EXPECT_TRUE(filter.contains(0x0000000000000000));
EXPECT_TRUE(filter.contains(0x0000000000000001));
EXPECT_TRUE(filter.contains(0x0000000000000002));
Singular binfuse::filter save and load:
binfuse::filter8_sink filter_sink(std::vector<std::uint64_t>{
    0x0000000000000000,
    0x0000000000000001, // order is not important
    0x0000000000000002,
});
filter_sink.save("tmp/filter8.bin");
binfuse::filter8_source filter_source;
filter_source.load("tmp/filter8.bin");
EXPECT_TRUE(filter_source.contains(0x0000000000000000));
EXPECT_TRUE(filter_source.contains(0x0000000000000001));
EXPECT_TRUE(filter_source.contains(0x0000000000000002));
Sharded filter, building one shard at a time:
binfuse::filter8 tiny_low(std::vector<std::uint64_t>{
    // note the MSB is clear on all below
    0x0000000000000000,
    0x0000000000000001, // order is not important
    0x0000000000000002,
});

binfuse::filter8 tiny_high(std::vector<std::uint64_t>{
    // note the MSB is set on all below
    0x8000000000000000,
    0x8000000000000001, // order is not important
    0x8000000000000002,
});

binfuse::sharded_filter8_sink sink("tmp/sharded_filter8_tiny.bin",
                                   1); // one bit sharding, i.e. 2 shards
sink.add_shard(tiny_low, 0); // specify the prefix for each shard
sink.add_shard(tiny_high, 1); // order of adding is not important
EXPECT_EQ(sink.shards(), 2);
// now reopen the filter as a "source"
binfuse::sharded_filter8_source source("tmp/sharded_filter8_tiny.bin", 1);
// verify all entries
EXPECT_TRUE(source.contains(0x0000000000000000));
EXPECT_TRUE(source.contains(0x0000000000000001));
EXPECT_TRUE(source.contains(0x0000000000000002));
EXPECT_TRUE(source.contains(0x8000000000000000));
EXPECT_TRUE(source.contains(0x8000000000000001));
EXPECT_TRUE(source.contains(0x8000000000000002));
EXPECT_EQ(source.shards(), 2);
or via the "streaming" API:
binfuse::sharded_filter<binary_fuse8_t, mio::access_mode::write> sink(
    "tmp/sharded_filter8_tiny.bin", 1);

// alternative "streaming" API for building the filter
// the entries below must be strictly in order
sink.stream_prepare();
sink.stream_add(0x0000000000000000);
sink.stream_add(0x0000000000000001);
sink.stream_add(0x0000000000000002);
sink.stream_add(0x8000000000000000);
sink.stream_add(0x8000000000000001);
sink.stream_add(0x8000000000000002);
sink.stream_finalize();
binfuse::sharded_filter<binary_fuse8_t, mio::access_mode::read> source(
    "tmp/sharded_filter8_tiny.bin", 1);
EXPECT_TRUE(source.contains(0x0000000000000000));
EXPECT_TRUE(source.contains(0x0000000000000001));
EXPECT_TRUE(source.contains(0x0000000000000002));
EXPECT_TRUE(source.contains(0x8000000000000000));
EXPECT_TRUE(source.contains(0x8000000000000001));
EXPECT_TRUE(source.contains(0x8000000000000002));
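The streaming API requires strictly ordered input. If the keys are not already sorted, one possible approach is sketched below, assuming the keys fit in memory (a huge dataset would need an external sort instead):

// Sketch only: sort and dedupe keys before streaming them into the sink.
// Assumes the keys fit in memory; huge datasets would need an external sort.
#include "binfuse/sharded_filter.hpp"

#include <algorithm>
#include <cstdint>
#include <vector>

void stream_all(binfuse::sharded_filter8_sink& sink, std::vector<std::uint64_t> keys) {
  std::sort(keys.begin(), keys.end());
  keys.erase(std::unique(keys.begin(), keys.end()), keys.end());

  sink.stream_prepare();
  for (auto key : keys) sink.stream_add(key);
  sink.stream_finalize();
}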
The main classes are templated as follows to select the underlying 8 or 16 bit filters, giving a 1/256 and 1/65536 chance of a false positive respectively. The persistent versions are also templated by the mio::access_mode to select read or write (AKA source or sink):
namespace binfuse {
template <typename T>
concept filter_type = std::same_as<T, binary_fuse8_t> || std::same_as<T, binary_fuse16_t>;
template <filter_type FilterType>
class filter;
template <filter_type FilterType, mio::access_mode AccessMode>
class persistent_filter;
template <filter_type FilterType, mio::access_mode AccessMode>
class sharded_filter;
} // namespace binfuse
The following convenience aliases are provided:
namespace binfuse {
// binfuse::filter
using filter8 = filter<binary_fuse8_t>;
using filter16 = filter<binary_fuse16_t>;
using filter8_sink = persistent_filter<binary_fuse8_t, mio::access_mode::write>;
using filter8_source = persistent_filter<binary_fuse8_t, mio::access_mode::read>;
using filter16_sink = persistent_filter<binary_fuse16_t, mio::access_mode::write>;
using filter16_source = persistent_filter<binary_fuse16_t, mio::access_mode::read>;
// binfuse::sharded_filter
using sharded_filter8_sink = sharded_filter<binary_fuse8_t, mio::access_mode::write>;
using sharded_filter8_source = sharded_filter<binary_fuse8_t, mio::access_mode::read>;
using sharded_filter16_sink = sharded_filter<binary_fuse16_t, mio::access_mode::write>;
using sharded_filter16_source = sharded_filter<binary_fuse16_t, mio::access_mode::read>;
} // namespace binfuse
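The 16-bit variants are used in exactly the same way, trading extra space for a lower false positive rate. A minimal sketch mirroring the in-memory filter8 example above:

binfuse::filter16 filter(std::vector<std::uint64_t>{
    0x0000000000000000,
    0x0000000000000001,
    0x0000000000000002,
});

EXPECT_TRUE(filter.contains(0x0000000000000001)); // false positive rate ~1/65536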
Requirements and building
POSIX systems and Windows (MinGW) are supported, with a recent C++20 compiler (e.g. gcc 13.2, clang 18.1), but it will also build with gcc-10.2 from Debian 11.
git clone https://github.com/oschonrock/binfuse.git
cd binfuse
git submodule update --init --recursive
cmake -S . -B build -DCMAKE_BUILD_TYPE=release
cmake --build build
Tests are run automatically at the end of the build. They can be disabled with -DBINFUSE_TEST=OFF.
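For example, a configure step with tests disabled:

cmake -S . -B build -DCMAKE_BUILD_TYPE=release -DBINFUSE_TEST=OFF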
Including in your project
The library, and the upstream C library it includes, are header-only, but there is a cmake interface target available:
add_subdirectory(ext/binfuse)
add_executable(my_exe main.cpp)
target_link_libraries(my_exe PRIVATE binfuse)
// main.cpp
#include "binfuse/sharded_filter.hpp"
#include "binfuse/filter.hpp"
...
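A complete minimal program might look like the following sketch, reusing the in-memory filter8 example from above:

// main.cpp -- minimal sketch based on the in-memory example above
#include "binfuse/filter.hpp"

#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  binfuse::filter8 filter(std::vector<std::uint64_t>{
      0x0000000000000000,
      0x0000000000000001,
      0x0000000000000002,
  });
  std::cout << filter.contains(0x0000000000000001) << '\n'; // prints 1
  return 0;
}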
A note on file formats and tags
Two different binary formats are used for filter and sharded_filter (details). They each have different build parameters, which further affect the structure. These are the fingerprint size (8 or 16 bit) and, in the case of the sharded filter, the number of shards.
The file format parameters are recorded in the first 16 bytes of each file, so that files are not accidentally opened with the wrong parameters, giving bogus results. These tags are human readable and can be inspected with hexdump.
$ hd filter8.bin | head -n1
00000000 73 62 69 6e 66 75 73 65 30 38 2d 30 30 36 34 00 |sbinfuse08-0064.|
$ hd filter16.bin | head -n1
00000000 73 62 69 6e 66 75 73 65 31 36 2d 30 30 36 34 00 |sbinfuse16-0064.|
binfuse will throw exceptions if files are opened with the wrong type/params.
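A minimal sketch of guarding a load against a mismatched file, assuming only that the thrown exceptions derive from std::exception:

// Sketch: guard a load against a file written with different parameters.
// Assumes only that the exceptions thrown derive from std::exception.
#include "binfuse/filter.hpp"

#include <exception>
#include <iostream>

bool try_load(binfuse::filter8_source& source, const char* path) {
  try {
    source.load(path); // throws if the file tag does not match this filter type
    return true;
  } catch (const std::exception& e) {
    std::cerr << "refusing to open filter: " << e.what() << '\n';
    return false;
  }
}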
Benchmarks
There is a benchmark for the sharded_filter, which includes figures for the filter internally. It can be built with -DBINFUSE_BENCH=ON. Some results for 100m keys are shown below.
It is difficult to run the binfuse_bench_large benchmark for 1 billion or more keys, due to memory pressure, unless the first runs, with low shard counts, are pruned out. This illustrates the precise motivation for the sharded_filter. Edit the benchmark for your needs.
$ ./build/binfuse_bench_large
Shard Size: 50000000 Shards: 2 Keys: 100000000
gen populate verify add query f+ve
f8 16.3ns 87.2ns 33.9ns 5.3ns 48.3ns 0.390792%
f16 16.3ns 93.5ns 39.3ns 10.5ns 50.8ns 0.001542%
Shard Size: 25000000 Shards: 4 Keys: 100000000
gen populate verify add query f+ve
f8 12.4ns 87.1ns 30.7ns 5.4ns 47.7ns 0.390143%
f16 12.6ns 87.6ns 33.6ns 10.6ns 50.1ns 0.001487%
Shard Size: 12500000 Shards: 8 Keys: 100000000
gen populate verify add query f+ve
f8 9.8ns 70.6ns 28.1ns 5.6ns 48.2ns 0.390789%
f16 11.0ns 73.2ns 31.3ns 11.0ns 51.9ns 0.001483%
Shard Size: 6250000 Shards: 16 Keys: 100000000
gen populate verify add query f+ve
f8 5.6ns 69.6ns 23.7ns 6.0ns 47.6ns 0.390566%
f16 5.7ns 70.6ns 28.4ns 11.6ns 50.6ns 0.001511%
Shard Size: 3125000 Shards: 32 Keys: 100000000
gen populate verify add query f+ve
f8 5.1ns 60.4ns 11.7ns 6.8ns 46.7ns 0.390106%
f16 5.0ns 64.8ns 23.3ns 12.1ns 50.6ns 0.001576%
Shard Size: 1562500 Shards: 64 Keys: 100000000
gen populate verify add query f+ve
f8 4.7ns 59.2ns 6.2ns 8.2ns 46.1ns 0.390199%
f16 4.7ns 59.8ns 11.9ns 14.0ns 52.6ns 0.001496%
Shard Size: 781250 Shards: 128 Keys: 100000000
gen populate verify add query f+ve
f8 4.7ns 56.6ns 6.2ns 10.2ns 46.4ns 0.390504%
f16 4.7ns 54.6ns 6.4ns 17.1ns 48.4ns 0.001524%
Shard Size: 390625 Shards: 256 Keys: 100000000
gen populate verify add query f+ve
f8 4.6ns 48.7ns 5.7ns 15.1ns 47.5ns 0.390732%
f16 4.6ns 48.9ns 6.3ns 22.2ns 46.7ns 0.001484%
A note on memory consumption
The first few runs of the above benchmark, with low shard counts, will consume large, transient amounts of memory (in the 1-2GB range) during filter::populate. The later runs show the benefit of the sharded_filter, and maximum memory consumption settles at the size of the filter file (108MB/215MB for 8/16 bit).
Note that even this latter memory consumption is almost all in the mmap. The OS is likely deciding to just let your process cache the mmap fully in memory. This obviously benefits performance, but it is a flexible disk cache: if the machine comes under memory pressure, these pages will be evicted.
A huge dataset of 10 billion entries builds and queries just fine. The 8-bit filter is 11GB and the 16-bit filter is 22GB on disk. The stats below show how the 22GB 16-bit filter did not fit in the physical memory of the 16GB machine, so query times slowed to around 0.7ms, but otherwise the system remained quite stable.
Shard Size: 9765625 Shards: 1024 Keys: 10000000000
gen populate verify add query f+ve
f8 7.5ns 174.4ns 26.2ns 6.7ns 150.7ns 0.390653%
f16 8.4ns 74.7ns 30.2ns 12.6ns 682612.8ns 0.001000%