Home

Awesome

emphf

emphf is a minimal perfect hashing library for large-scale key sets focused on speed and low memory usage. A minimal perfect hash function (MPHF) is a data structure that maps injectively a static set of strings of size n to the integer set [0, n-1]. The overall space usage of the MPHFs generated with emphf is 2.61 n bits plus a small constant factor.

The algorithms used in this library are described in the paper Cache-Oblivious Peeling of Random Hypergraphs by Djamal Belazzougui, Paolo Boldi, Giuseppe Ottaviano, Rossano Venturini, and Sebastiano Vigna.

All the algorithms implemented here construct a MPHF for a given key set using a standard scheme based on the Majewski–Wormald–Havas–Czech (MWHC) construction, that is based on finding a peeling order of a random 3-hypergraph.

compute_mphf_seq and compute_mphf_scan_mmap construct the same data structure, but the second should be used when the space needed to construct the MPHF does not fit in main memory. The obtained MPHF data structures can perform lookups significantly faster than other implementations using the same algorithm (see the paper for details).

The script test_all.py executes all the algorithms on a given file (or the standard UNIX dictionary if none is provided) and checks that the generated hash function is indeed minimal and perfect.

The project uses CMake. To build it on Unix systems it should be sufficient to do the following:

$ cmake .
$ make