Home

Awesome

Benchmarking nearest neighbors

Build Status

Doing fast searching of nearest neighbors in high dimensional spaces is an increasingly important problem, but so far there has not been a lot of empirical attempts at comparing approaches in an objective way.

This project contains some tools to benchmark various implementations of approximate nearest neighbor (ANN) search for different metrics. We have pregenerated datasets (in HDF5) formats and we also have Docker containers for each algorithm. There's a test suite that makes sure every algorithm works.

Evaluated

Data sets

We have a number of precomputed data sets for this. All data sets are pre-split into train/test and come with ground truth data in the form of the top 100 neighbors. We store them in a HDF5 format:

DatasetDimensionsTrain sizeTest sizeNeighborsDistanceDownload
DEEP1B969,990,00010,000100AngularHDF5 (3.6GB)
Fashion-MNIST78460,00010,000100EuclideanHDF5 (217MB)
GIST9601,000,0001,000100EuclideanHDF5 (3.6GB)
GloVe251,183,51410,000100AngularHDF5 (121MB)
GloVe501,183,51410,000100AngularHDF5 (235MB)
GloVe1001,183,51410,000100AngularHDF5 (463MB)
GloVe2001,183,51410,000100AngularHDF5 (918MB)
Kosarak2798374,962500100JaccardHDF5 (2.0GB)
MNIST78460,00010,000100EuclideanHDF5 (217MB)
NYTimes256290,00010,000100AngularHDF5 (301MB)
SIFT1281,000,00010,000100EuclideanHDF5 (501MB)
Last.fm65292,38550,000100AngularHDF5 (135MB)

Results

These are all as of 2020-07-12, running all benchmarks on a c5.4xlarge machine on AWS with --parallelism set to 3:

glove-100-angular

glove-100-angular

sift-128-euclidean

glove-100-angular

fashion-mnist-784-euclidean

fashion-mnist-784-euclidean

lastfm-64-dot

lastfm-64-dot

nytimes-256-angular

nytimes-256-angular

glove-25-angular

glove-25-angular

Install

The only prerequisite is Python (tested with 3.6) and Docker.

  1. Clone the repo.
  2. Run pip install -r requirements.txt.
  3. Run python install.py to build all the libraries inside Docker containers (this can take a while, like 10-30 minutes).

Running

  1. Run python run.py (this can take an extremely long time, potentially days)
  2. Run python plot.py or python create_website.py to plot results.

You can customize the algorithms and datasets if you want to:

Including your algorithm

  1. Add your algorithm into ann_benchmarks/algorithms by providing a small Python wrapper.
  2. Add a Dockerfile in install/ for it
  3. Add it to algos.yaml
  4. Add it to .travis.yml

Principles

Authors

Built by Erik Bernhardsson with significant contributions from Martin Aumüller and Alexander Faithfull.

Related Publication

The following publication details design principles behind the benchmarking framework: