Home

Awesome

Nearest Neighbor Descent (nndescent)

Nearest Neighbor Descent (nndescent) is a C++ implementation of the nearest neighbor descent algorithm, designed for efficient and accurate approximate nearest neighbor search. With seamless integration into Python, it offers a powerful solution for constructing k-nearest neighbor graphs. This algorithm is based on the pynndescent library.

Features

Please note that not all distances have undergone thorough testing. Therefore, it is advised to use them with caution and at your own discretion.

Installation

From PyPI

You can install nndescent directly from PyPI using pip:

pip install nndescent

If you want to run the examples in tests, additional packages are needed. You can install them manually or install nndescent with the full option:

pip install nndescent[full]

From Source

  1. Clone the repository:
git clone https://github.com/brj0/nndescent.git
cd nndescent
  1. Build and install the package:
pip install .

If you want to run the examples in tests, additional packages are needed. You can install them manually or install nndescent with the full option:

pip install .[full]
  1. To run the examples in tests you must first download the datasets:
python tests/make_test_data.py

Usage

In Python you can utilize the nndescent library in the following way:

import numpy as np
import nndescent

# Data must be a 2D numpy array of dtype 'float32'.
data = np.random.randint(50, size=(20,3)).astype(np.float32)

# Run NND algorithm
nnd = nndescent.NNDescent(data, n_neighbors=4)

# Get result
nn_indices, nn_distances = nnd.neighbor_graph

# Query data must be a 2D numpy array of dtype 'float32'.
query_data = np.random.randint(50, size=(5,3)).astype(np.float32)

# Calculate nearest neighbors for each query point
nn_query_indices, nn_query_distances = nnd.query(query_data, k=6)

To compile and run the C++ examples use the following commands within the project folder:

mkdir build
cd build
cmake ..
make
./simple

For detailed usage in C++ and for further Python/C++ examples please refer to the examples provided in the tests directory of the repository and the code documentation.

Performance

On my computer, the training phase of nndescent is approximately 15% faster than pynndescent for dense matrices, and 75% faster for sparse matrices. Furthermore, the search query phase shows a significant improvement, with >70% faster execution time. Below is the output obtained from running tests/benchmark.py, an ad hoc benchmark test. In this test, both nndescent and pynndescent were executed with the same parameters using either 'euclidean' or 'dot' as metric:

Benchmark test pynndescent (py) vs nndescent (c)

Data setpy train [ms]c train [ms]ratiopy vs c matchpy test [ms]c test [ms]ratiopy accuracyc accuracy
faces149.8145.90.9741.0001663.718.40.0111.0000.999
fmnist11959.210768.70.9000.9975754.81334.10.2320.9780.978
glove25149754.2101864.00.6800.96498740.69907.40.1000.7960.808
glove50192965.8137171.80.7110.88599750.810647.70.1070.7050.743
glove100218202.9180088.40.8250.81598770.212080.40.1220.6510.731
glove200287206.6243466.60.8480.772101639.417615.60.1730.6220.773
mnist11319.710188.10.9000.9975725.91273.80.2220.9690.968
nytimes63323.855638.10.8790.81423632.17108.90.3010.6140.811
sift131711.4105826.00.8030.97482503.77957.90.0960.8380.839
20newsgroups107339.028339.70.2640.92267518.622513.10.3330.8580.929

The compilation time and the lengthy numba loading time during runtime and import for 'pynndescent' are not considered in this ad hoc benchmark test. An Ann-Benchmarks wrapper is planned for the future.

Background

The theoretical background of NND is based on the following paper:

In addition, the algorithm utilizes random projection trees for initializing the nearest neighbor graph. The nndescent algorithm constructs a tree by randomly selecting two points and splitting the data along a hyperplane passing through their midpoint. For a more theoretical background, please refer to:

Contributing

Contributions are welcome! If you have any bug reports, feature requests, or suggestions, please open an issue or submit a pull request.

License

This project is licensed under the BSD-2-Clause license.

Acknowledgements

This implementation is based on the original pynndescent library by Leland McInnes. I would like to acknowledge and appreciate his work as a source of inspiration for this project.

For more information, visit the pynndescent GitHub repository.