Home

Awesome

RcppHNSW

AppVeyor Build Status R-CMD-check Coverage Status CRAN Status Badge Dependencies CRAN Monthly Downloads CRAN Downloads Last Commit

Rcpp bindings for hnswlib.

Status

February 4 2024 RcppHNSW 0.6.0 is released to CRAN, supporting hnswlib version 0.8.0.

September 19 2023 RcppHNSW 0.5.0 is released to CRAN, supporting hnswlib version 0.7.0, a getItems method for returning the items used to build the index and some performance improvements if your data is already column-stored. Also, a small roxygen problem with the package documentation was fixed.

July 18 2022 RcppHNSW 0.4.1 is released. Unfortunately, there are valgrind problems with the version of hnswlib used in RcppHNSW 0.4.0, so that has been rolled back.

July 16 2022 RcppHNSW 0.4.0 is released. This release matches hnswlib version 0.6.2, but otherwise adds no new features. Some minor CRAN check NOTEs are fixed and there is also a minor license change: previously the license was GPLv3. From this version, it now supports GPLv3 or later.

September 6 2020 RcppHNSW 0.3.0 is now available on CRAN, with multi-threading support.

August 30 2020. Although not yet on CRAN, support for building and searching an index in parallel (via the n_threads function argument and setNumThreads object method) has been added to the current development version (available via devtools::install_github). Thanks to Dmitriy Selivanov for a lot of the work on this.

September 20 2019. RcppHNSW 0.2.0 is now available on CRAN, up to date with hnswlib at https://github.com/nmslib/hnswlib/commit/c5c38f0, with new methods: size, resizeIndex and markDeleted. Also, a bug that prevented searching with datasets smaller than k has been fixed. Thanks to Yuxing Liao for spotting that.

January 21 2019. RcppHNSW is now available on CRAN.

October 20 2018. By inserting some preprocessor symbols into hnswlib, these bindings no longer require a non-portable compiler flag and hence will pass R CMD CHECK without any warnings: previously you would be warned about -march=native. The price paid is not using specialized functions for the distance calculations that are architecture-specific. I have not checked how bad the performance hit is. The old settings remain in src/Makevars and src/Makevars.win (commented out), if you want to build the project from source directly. Otherwise, Release 0.0.0.9000 is the last version with the old behavior, which can be installed with something like:

devtools::install_github("jlmelville/rcpphnsw@v0.0.0.9000")

hnswlib

hnswlib is a header-only C++ library for finding approximate nearest neighbors (ANN) via Hierarchical Navigable Small Worlds (Yashunin and Malkov, 2016). It is part of the nmslib project.

The RcppHNSW Package

An R package that interfaces with hnswlib, taking enormous amounts of inspiration from Dirk Eddelbuettel's RcppAnnoy package which did the same for the Annoy ANN C++ library.

One difference is that I use roxygen2 to generate the man pages. The NAMESPACE is still built manually, however (I don't believe you can export the classes currently).

Installing

From CRAN:

install.packages("RcppHNSW")

Development versions from github:

devtools::install_github("jlmelville/RcppHNSW")

Function example

irism <- as.matrix(iris[, -5])

# function interface returns results for all rows in nr x k matrices
all_knn <- RcppHNSW::hnsw_knn(irism, k = 4, distance = "l2")
# other distance options: "euclidean", "cosine" and "ip" (inner product distance)

# for high-dimensional data you may see a speed-up if you store the data
# where each *column* is an item to be indexed and searched. Set byrow = TRUE
# for this.
# Admittedly, the iris dataset is *not* high-dimensional
iris_by_col <- t(irism)
all_knn <- RcppHNSW::hnsw_knn(iris_by_col, k = 4, distance = "l2", byrow = FALSE)

# process can be split into two steps, so you can build with one set of data
# and search with another
ann <- hnsw_build(irism[1:100, ])
iris_nn <- hnsw_search(irism[101:150, ], ann, k = 5)

Class Example

As noted in the "Do not use named parameters" section below, you should avoid using named parameters when using class methods. But I do use them in a few places below to document the name of the parameters the positional arguments refer to.

library(RcppHNSW)
data <- as.matrix(iris[, -5])

# Create a new index using the L2 (squared Euclidean) distance
# nr and nc are the number of rows and columns of the data to be added, respectively
# ef and M determines speed vs accuracy trade off
# You must specify the maximum number of items to add to the index when it
# is created. But you can increase this number: see the next example
M <- 16
ef <- 200
dim <- ncol(data)
nitems <- nrow(data)
ann <- new(HnswL2, dim, nitems, M, ef)

# Add items to index
for (i in 1:nitems) {
  ann$addItem(data[i, ])
}

# Find 4 nearest neighbors of row 1
# indexes are in res$item, distances in res$distance
# set include_distances = TRUE to get distances as well as index
res <- ann$getNNsList(data[1, ], k = 4, include_distances = TRUE)

# It's more efficient to use the batch methods if you have all the data you
# need at once
ann2 <- new(HnswL2, dim, nitems, M, ef)
ann2$addItems(data)
# Retrieve the 4 nearest neighbors for every item in data
res2 <- ann2$getAllNNsList(data, 4, TRUE)
# labels of the data are in res$item, distances in res$distance

# If you are able to store your data column-wise, then the overhead of copying
# the data into a form usable by hnsw can be noticeably reduced
data_by_col <- t(data)
ann3 <- new(HnswL2, dim, nitems, M, ef)
ann3$addItemsCol(data_by_col)
# Retrieve the 4 nearest neighbors for every item in data_by_col
res3 <- ann3$getAllNNsListCol(data_by_col, 4, TRUE)
# The returned neared neighbor data matrices are also returned column-wise
all(res2$item == t(res3$item) & res2$distance == t(res3$distance))

# Save the index
ann$save("iris.hnsw")

# load it back in: you do need to know the dimension of the original data
ann4 <- new(HnswL2, dim, "iris.hnsw")
# new index should behave like the original
all(ann$getNNs(data[1, ], 4) == ann4$getNNs(data[1, ], 4))

# other distance classes:
# Cosine: HnswCosine
# Inner Product: HnswIP
# Euclidean: HnswEuclidean

Here's a rough equivalent of the serialization/deserialization example from the hnswlib README, but using the recently-added resizeIndex method to increase the size of the index after its initial specification, avoiding having to read from or write to disk:

library("RcppHNSW")
set.seed(12345)

dim <- 16
num_elements <- 100000

# Generate sample data
data <- matrix(stats::runif(num_elements * dim), nrow = num_elements)

# Split data into two batches
data1 <- data[1:(num_elements / 2), ]
data2 <- data[(num_elements / 2 + 1):num_elements, ]

# Create index
M <- 16
ef <- 10
# Set the initial index size to the size of the first batch
p <- new(HnswL2, dim, num_elements / 2, M, ef)

message("Adding first batch of ", nrow(data1), " elements")
p$addItems(data1)

# Query the elements for themselves and measure recall:
idx <- p$getAllNNs(data1, k = 1)
message("Recall for the first batch: ", formatC(mean(idx == 1:nrow(data1))))

# Increase the total capacity, so that it will handle the new data
p$resizeIndex(num_elements)

message("Adding the second batch of ", nrow(data2), " elements")
p$addItems(data2)

# Query the elements for themselves and measure recall:
idx <- p$getAllNNs(data, k = 1)
# You can get distances with:
# res <- p$getAllNNsList(data, k = 1, include_distances = TRUE)
# res$dist contains the distance matrix, res$item stores the indexes

message("Recall for two batches: ", formatC(mean(idx == 1:num_elements)))

Although there's no longer any need for this, for completeness, here's how you would use save and new to achieve the same effect without resizeIndex:

filename <- "first_half.bin"
# Serialize index
p$save(filename)

# Reinitialize and load the index
rm(p)
message("Loading index from ", filename)
# Increase the total capacity, so that it will handle the new data
p <- new(HnswL2, dim, filename, num_elements)
unlink(filename)

API

DO NOT USE NAMED PARAMETERS

Because these are wrappers around C++ code, you cannot use named parameters in the calling R code. Arguments are parsed by position. This is most annoying in constructors, which take multiple integer arguments, e.g.

### DO THIS ###
dim <- 10
num_elements <- 100
M <- 200
ef_construction <- 16
index <- new(HnswL2, dim, num_elements, M, ef_construction)

### DON'T DO THIS ###
index <- new(HnswL2, dim, ef_construction = 16, M = 200, num_elements = 100)
# treated as if you wrote:
index <- new(HnswL2, dim, 16, 200, 100)

OK onto the API

Differences from Python Bindings

License

GPL-3 or later.