Home

Awesome

A Benchmark of Minimal Perfect Hash Function Algorithms

This project aims to assess the performance of various minimal perfect hash function algorithms.

Definition: Given a set S of n distinct keys, a function f that bijectively maps the keys of S into the range {0,...,n−1} is called a minimal perfect hash function (MPHF) for S. Algorithms that find such functions when n is large and retain constant evaluation time are of practical interest. For instance, search engines and databases typically use minimal perfect hash functions to quickly assign identifiers to static sets of variable-length keys such as strings. The challenge is to design an algorithm which is efficient in three different aspects: time to find f (construction time), time to evaluate f on a key of S (lookup time), and space of representation for f.

This benchmark reports the three main efficiency aspects of each algorithm: construction time, lookup time, and space of representation. The minimal perfect hash functions are built using as keys either random 64-bits integers or line-delimited strings, which are read from the standard input.

NOTE: This repository has been created during the development of PTHash: Revisiting FCH Minimal Perfect Hashing [1], a C++ library implementing fast and compact minimal perfect hash functions.

Tested Algorithms

Currently implemented/included algorithms are:

Compiling the code

The code depends on several git submodules. If you have cloned the repository without --recursive, you will need to perform the following command before building:

git submodule update --init --recursive

To build the code:

bash configure.sh
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j

Usage

Execute ./mphf_benchmark -h to print an help message describing all command line parameters, whose output is as follows

Usage: ./mphf_benchmark [-h,--help] algorithm [--variant variant] [-n num_keys] [--num_construction_runs num_construction_runs] [--num_lookup_runs num_lookup_runs] [--verbose] [--seed seed] [--threads threads] [--gen generator]

 algorithm
	The name of the algorithm to run. One among `fch`, `chd`, `bbhash`, `emphf`, `recsplit`, `pthash`, `ppthash`.

 [--variant variant]
	Variant of the selected algorithm to test. (default: 0 = all variants)

 [-n num_keys]
	The number of random keys to use for the test. If it is not provided, then keys are read from the input (one per line).

 [--num_construction_runs num_construction_runs]
	Number of times to perform the construction. (default: 1)

 [--num_lookup_runs num_lookup_runs]
	Number of times to perform the lookup test. (default: 1)

 [--verbose]
	Verbose output during construction. (default: false)

 [--seed seed]
	Seed used for construction. (default: 0)

 [--threads threads]
	Number of threads used in multi-threaded calculations. (default: 0 = auto)

 [--gen generator]
	The method of generating keys, one of: `64 (default if -n is given)`, `xs32` (xor-shift 32), `xs64` (xor-shift 64), `stdin` (strings from stdin; default if -n is not given)

 [-h,--help]
	Print this help text and silently exits.

To actually use the benchmark, and reproduce Table 5 of PTHash: Revisiting FCH Minimal Perfect Hashing [1] (except GOV, which is Java-based), execute the following commands

./mphf_benchmark all -n 100000000 --num_construction_runs 5 --num_lookup_runs 5 --seed 1234567890
./mphf_benchmark all -n 1000000000 --num_construction_runs 5 --num_lookup_runs 5 --seed 1234567890

The following table summarizes the output of the second command, which employs 1 billion 64-bit random keys, on a server machine equipped with Intel i9-9900K cores (@3.60 GHz), 64 GB of RAM DDR3 (@2.66 GHz), and running Linux 5 (64 bits)

Methodconstruction<br>(secs)space<br>(bits/key)lookup<br>(ns/key)
FCH, c=4159044.0035
FCH, c=529375.0035
FCH, c=621335.0035
FCH, c=712217.0035
CHD, 𝜆=419722.17419
CHD, 𝜆=559642.07417
CHD, 𝜆=6237462.01416
EMPHF3742.61199
GOV8752.23175
BBHASH, 𝛾=12533.06170
BBHASH, 𝛾=21523.71143
BBHASH, 𝛾=51006.87113
RecSplit, l=5, 𝑏=52332.95220
RecSplit, l=8, 𝑏=1009361.80204
RecSplit, l=12, 𝑏=957002.23197
PTHash, C-C, 𝛼=0.99, 𝑐=710423.2337
PTHash, D-D, 𝛼=0.88, 𝑐=73083.9464
PTHash, EF, 𝛼=0.99, 𝑐=617992.17101
PTHash, D-D, 𝛼=0.94, 𝑐=76892.9955

Authors

References