Home

Awesome

Zini

Zini (Zig + Mini) is a Zig library providing some succinct data structures:

Overview

PTHash, minimal perfect hash function

zini.pthash contains an implementation of PTHash, a minimal perfect hash function construction algorithm. Given a set of n elements, with the only requirement being that you can hash them, it generates a hash function which maps each element to a distinct number between 0 and n - 1. The generated hash function is extremely small, typically consuming less than 4 bits per element, regardless of the size of the input type. The algorithm provides multiple parameters to tune making it possible to optimize for (small) size, (short) construction time, or (short) lookup time.

To give a practical example: In ~0.6 seconds Zini was able to create a hash function for /usr/share/dict/words containing 235886 words. The resulting hash function required in total 865682 bits in memory. This corresponds to 108.2 kB in total or 3.67 bits per word. In comparison, the original file was 2.49 MB and compressing it with gzip -9 only gets it down to 754 kB (which you can't use directly in memory without decompressing it). It should of course be noted that they don't store the equivalent data as you can't use the generated hash function to determine if a word is present or not in the list. The comparison is mainly useful to get a feeling of the magnitudes.

Bumped Ribbon Retrieval, a retrieval data structure

zini.ribbon contains an implementation of Bumped Ribbon Retrieval (BuRR), a retrieval data structure. Given n keys (with the only requirement being that you can hash them) which each have an r-bit value, we'll build a data structure which will return the value for all of the n keys. However, the keys are actually not stored (we're only using the hash) so if you ask for the value for an unknown key you will get a seemingly random answer; there's no way of knowing whether the key was present in the original dataset or not.

The theoretically minimal amount of space needed to store the values is n * r (we have n r-bit values after all). We use the term "overhead" to refer to how much extra amount of data we need. The Bumped Ribbon Retrieval will often have less than 1% overhead.

Usage

Zini is intended to be used as a library, but also ships the command-line tools zini-pthash and zini-ribbon. As the documentation is a bit lacking it might be useful to look through tools/zini-{pthash,ribbon}/main.zig to understand how it's used.

USAGE
  ./zig-out/bin/zini-pthash [build | lookup] <options>

COMMAND: build
  Builds hash function for plain text file.

  -i, --input <file>
  -o, --output <file>
  -c <int>
  -a, --alpha <float>
  -s, --seed <int>

COMMAND: lookup

  -i, --input <file>
  -k, --key <key>
  -b, --benchmark

And here's an example run of using zini-pthash.

# Build zini-pthash:
$ zig build -Drelease-safe

# Build a hash function:
$ ./zig-out/bin/zini-pthash build -i /usr/share/dict/words -o words.pth
Reading /usr/share/dict/words...

Building hash function...

Successfully built hash function:
  seed: 12323441790160983030
  bits: 865554
  bits/n: 3.6693741892269993

Writing to words.pth

# Look up an index in the hash function:
$ ./zig-out/bin/zini-pthash lookup -i words.pth --key hello
Reading words.pth...

Successfully loaded hash function:
  seed: 12323441790160983030
  bits: 865554
  bits/n: 3.6693741892269993

Looking up key=hello:
112576

Acknowledgments

Zini is merely an implementation of existing algorithms and techniques already described in the literature:

License

Zini is licensed under the 0BSD license.