Home

Awesome

CUHNSW

License Build Status contributions welcome

Efficient CUDA implementation of Hierarchical Navigable Small World (HNSW) graph algorithm for Approximate Nearest Neighbor (ANN)

Introduction

This project is to speed up HNSW algorithm by CUDA. I expect that anyone who will be interested in this project might be already familiar with the following paper and the open source project. If not, I strongly recommend that you check them first.

I also adapted some ideas from the following project.

By brief survey, I found there are several papers and projects to suggest to speed up ANN algorithms by GPU.

I started this project because I was originally interested in both CUDA programming and ANN algorithms. I release this project because it achieved meaningful performance and hope to develop further by community participation.

Literally, this package is implemented to build HNSW graphs using GPU, and to approximate nearest neighbor search through the built graphs, and the format of the model file is compatible with hnswlib. In other words, you can build a HNSW graph from this package, then save it and load it from hnswlib for search, and vice versa.

How to install

  1. pip install
pip install cuhnsw
  1. build from source
# clone repo and submodules
git clone git@github.com:js1010/cuhnsw.git && cd cuhnsw && git submodule update --init

# install requirements
pip install -r requirements.txt

# generate proto
python -m grpc_tools.protoc --python_out cuhnsw/ --proto_path cuhnsw/proto/ config.proto

# install
python setup.py install

How to use

import h5py
from cuhnsw import CuHNSW


h5f = h5py.File("glove-50-angular.hdf5", "r")
data = h5f["train"][:, :].astype(np.float32)
h5f.close()
ch0 = CuHNSW(opt={})
ch0.set_data(data)
ch0.build()
ch0.save_index("cuhnsw.index")
import h5py
from cuhnsw import CuHNSW

h5f = h5py.File("glove-50-angular.hdf5", "r")
data = h5f["test"][:, :].astype(np.float32)
h5f.close()
ch0 = CuHNSW(opt={})
ch0.load_index("cuhnsw.index")
nns, distances, found_cnt = ch0.search_knn(data, topk=10, ef_search=300)

Performance

attr1 vcpu2 vcpu4 vcpu8 vcpugpu
build time343.909179.83689.793670.54768.2847
build quality0.8631930.8633010.8632380.8631650.865471
attr1 vcpu2 vcpu4 vcpu8 vcpugpu
search time556.605287.967146.331115.43129.7008

Thoughts on Future Task

  1. implement parallel compilation using bazel or cmake (easy-medium): bazel is more preferable. compilation time is a little bit painful so far.
  2. achieve significant speed-up by using half-precision operation (medium): I experimented it, but only got around 10 % improvement. I am not sure if I have used the half-precision feature appropriately.
  3. support multi-device (very hard): it only supports single-device (gpu) yet since the graph should be shared across all the building threads.