Home

Awesome

SRS - Fast Approximate Nearest Neighbor Search in High Dimensional Euclidean Space With a Tiny Index

SRS-Mem is a C++ program for performing approximate nearest neighbor search in high dimension Euclidean space in the main memory. The current implementation is adapted from our VLDB'15 research paper. The main modification is to use an in-memory multidimensional index (rather than an R-tree as in the paper), as it is often the case that our index is so small and can be accommodated in the main memory. Currently, the index is a modified version of the Cover Tree due to its strong theoretical guarantees; nevertheless, any multidimensional index that supports incremental exact kNN search can be used!

Features

How it works

In a nutshell, SRS reduces the problem of "approximate NN search in high dimensional space" to a "exact T-NN search in a low dimensional space".

In the indexing phase, SRS projects data points from the original high-dimensional space into an appropriately chosen m-dimensional space via 2-stable random projections, and then uses a cover-tree to index these projected points.

The key observation is that the inter-point distance in the projected space (called Projected Distance) over that in the original high-dimensional space follows a scaled chi-squared distribution, which has a sharp concentration bound. Therefore, given any threshold on the projected distance, and for any point o, we can compute exactly the probability that o's projected distance is within the threshold.

In the querying phase, SRS performs an incrementally k-NN search on the projected space, until when it has found a satisfactory point (i.e., early-termination condition), or it has examined t * n points (i.e., normal-termination condition).

Before start

How to use SRS

  1. Compile the program:
% make all
  1. Use cal_param to calculate a feasible setting of parameters based on the given constraints. The users can manually set either m or the success probability. A feasible setting of parameters will be printed on the screen and it can be used later in the query processing phase. The implementation is based on Algorithm 6 in the paper. For example (using the toy dataset with 3000 points):
% ./cal_param -n 3000 -m 7 -c 4

The following message will be printed out:

A feasible setting is:
m = 7
prob_thres(-r) = 0.299203
T_max(-t) = 2
t = 0.000544
The output indicates that, in the query processing phase, the users shall
use the arguments `-c 4`, `-m 7`, `-r 0.299203` and `-t 2`.

As a rule of thumb, we recommend setting m between 6 and 10, to begin with.

  1. Use gen_gt to generate the ground truth of given dataset and query workload. The ground truth file will be used when processing the query workload. For example (using the toy dataset):
% ./gen_gt -d 192 -n 3000 -k 10 -q data/toy.q -s data/toy.ds -g data/toy.gt -y i

-y i indicates that each coordinates is an integer. The other option is -y f, indicating that each coordinates is a floating point number.

  1. Use srs with the -I option to index the data. Users need to specify m and index path in this step. For example (using the toy dataset):
% mkdir index
% ./srs -I -d 192 -i index/ -m 7 -n 3000 -s data/toy.ds -y i
  1. Use srs with the -Q option to process the query workload. The implementation is based on Algorithm 1 in the paper and its variants. The top-k approximate nearest neighbors for each query in the query workload will be returned, together with the average ratio and time over all queries.

    Users can use the parameter setting given by cal_param. For example (using the toy dataset):

% ./srs -Q -c 4 -g data/toy.gt -i index/ -k 10 -q data/toy.q -t 2 -r 0.299203
Alternatively, users can also specify the parameters by themselves to achieve another
space-time trade-off.

6. The users can change the -t parameter to n (i.e., the cardinality of the dataset) to force the algorithm rely on the early-termination condition to stop. This will increase the quality and slightly increase the time cost. This is the SRS-2 algorithm in the paper.

  1. The users can change the -r parameter to any number larger than 1, to force the algorithm stop only on the normal-termination condition (i.e., examining tn data points). This will substantially increase the quality and time cost. This is the SRS-1 algorithm in the paper.

  2. The users can change the -c parameter to a smaller value to achieve better quality without affecting the worst case time cost. This is the SRS-12+ algorithm in the paper.

Data Format

  1. Data file should contain n lines, where n is the cardinality of the dataset. The file should be formatted as:
e_1_1 e_1_2 ... e_1_d
...
e_n_1 e_n_2 ... e_n_d

where e_i_js are integers, and are separated by whitespace.

  1. Query file should contain N+1 lines, where N is the number of queries in the query workload. The file should be formatted as:
N d
ID_1 e_1_1 e_1_2 ... e_1_d
...
ID_N e_N_1 e_N_2 ... e_N_d

where d is the dimensionality, e_i_j is an integer, and separated by whitespace.

Hard Dataset

Condition of use

Future work

Contact

Please report bugs, feature requests, comments, or suggestions to Yifang Sun (yifangs AT cse.unsw.edu.au) or Wei Wang (weiw AT cse.unsw.edu.au).