Awesome

Non-metric Similarity Graphs for Maximum Inner Product Search

This is the implementation for ip-nsw from the following paper:

S. Morozov, A. Babenko. Non-metric Similarity Graphs for Maximum Inner Product Search. Advances in Neural Information Processing Systems 32 (NIPS 2018).

and it is substantially based on https://github.com/yurymalkov/hnsw.

Requirements

g++ with c++11 support
OpenMP installed

Compilation

To download and compile the code type:

$ git clone https://github.com/stanis-morozov/ip-nsw.git
$ cd ip-nsw
$ cmake CMakeLists.txt
$ make

Dataset

We introduce to the community the new dataset Music100. The database of Music100 consists of 1,000,000 vectors of dimensionality 100 and there is a random sample of 10,000 queries of the same dimensionality.

Database Music100

Query Music100

Groundtruth: Top 1, Top 10, Top 100.

Both files are written in binary format. Vectors are stored consecutively, and numbers are represented in 4-bytes floating poing format (float in C/C++).

Example of reading database_music100.bin in C++:

size_t databaseSize = 1000 * 1000;
size_t dimension = 100;
std::vector <std::vector <float> > data(databaseSize, std::vector <float> (dimension));
std::ifstream input("database_music100.bin", std::ios::binary);
for (size_t i = 0; i < databaseSize; i++) {
    input.read((char*)data[i].data(), dimension * sizeof(float));
}
input.close();

Example of reading database_music100.bin in Python:

databaseSize = 10**6
dimension = 100
data = np.fromfile('database_music100.bin', dtype = np.float32).reshape(databaseSize, dimension)

Run experiments

Usage: main [OPTIONS]
This tool supports two modes: to construct the graph index from the database and to retrieve top K maximum inner product vectors using the constructed index. Each mode has its own set of parameters.

  --mode            "database" or "query". Use "database" for
                    constructing index and "query" for top K
                    maximum inner product retrieval

Database mode supports the following options:
  --database        Database filename. Database should be stored in binary format.
                    Vectors are written consecutive and numbers are
                    represented in 4-bytes floating poing format (float in C/C++)
  --databaseSize    Number of vectors in the database
  --dimension       Dimension of vectors
  --outputGraph     Filename for the output index graph
  --efConstruction  efConstruction parameter. Default: 1024
  --M               M parameter. Default: 32

Query mode supports the following options:
  --query           Query filename. Queries should be stored in binary format.
                    Vectors are written consecutive and numbers are
                    represented in 4-bytes floating poing format (float in C/C++)
  --querySize       Number of queries
  --dimension       Dimension of vectors
  --inputGraph      Filename for the input index graph
  --efSearch        efSearch parameter. Default: 128
  --topK            Top size for retrieval. Default: 1
  --output          Filename to print the result. Default: stdout

Our best performed index graph from the NIPS paper is (we assume that our Music100 dataset is located in the data folder):

./main --mode database --database data/database_music100.bin --databaseSize 1000000 --dimension 100 --outputGraph out_graph.hnsw --efConstruction 1024 --M 32
./main --mode query --query data/query_music100.bin --querySize 10000 --dimension 100 --inputGraph out_graph.hnsw --topK 10 --efSearch 128 --output result.txt

The above commands print the output to result.txt in the following format:

109133 121486 19537 659869 834299 626720 867593 793041 985613 780019 
981712 601975 355926 202115 294738 198925 679900 609952 19993 615818 
77873 764344 292930 982610 65402 80310 638008 446408 209176 974376 
615370 981712 768351 601975 19993 679900 521439 198925 867593 739598 
619843 567912 732469 560702 649392 34342 827097 18669 385074 690897 
473444 752040 342582 154589 881600 292375 171237 223102 370021 219806 
313256 36821 267804 507875 949307 270882 471300 70277 651590 111192 
19993 93160 361868 896032 202115 739598 58964 967143 205517 71338 
894914 943250 85319 431888 615156 428084 909129 687106 884369 942893 
480939 228514 630512 651287 197739 509842 745237 566816 881843 166116
...

where each line corresponds to the answer for the query and the numbers in the lines are zero-based IDs of vectors in the database. The Recall for the above parameters is 0.99458.

Related publication

@inproceedings{ip-nsw18,
    title={Non-metric Similarity Graphs for Maximum Inner Product Search}
    author={Stanislav Morozov and Artem Babenko},
    booktitle={Advances in Neural Information Processing Systems},
    year={2018}
}