Introduction to approximate nearest neighbor search

Exact nearest neighbor search (k-nearest-neighbor, or KNN) is prohibitively expensive at higher dimensions: approaches that segment the search space and work well in 2D or 3D, such as quadtrees or k-d trees, devolve to linear scans as dimensionality grows. This is one aspect of what is called “the curse of dimensionality.”

With larger datasets, it is almost always more useful to get an approximate answer in logarithmic time than the exact answer in linear time. This is known as approximate nearest neighbor (ANN) search.

There are two broad categories of ANN index: partition-based indexes, which divide the vectors into regions and search only the most promising ones, and graph-based indexes, which link each vector to its near neighbors and answer queries by traversing the graph.

Graph-based indexes tend to be simpler to implement and faster, but more importantly they can be constructed and updated incrementally. This makes them a much better fit for a general-purpose index than partitioning approaches that only work on static datasets that are completely specified up front. That is why all the major commercial vector indexes use graph approaches.

JVector is a graph index in the DiskANN family tree.

JVector Architecture

JVector is a graph-based index that builds on the DiskANN design with composable extensions.

JVector implements a single-layer graph with nonblocking concurrency control, allowing construction to scale linearly with the number of cores.
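
As an illustration of that concurrency model, the following is a minimal sketch (not taken from SiftSmall) of driving construction from multiple application threads, reusing the GraphIndexBuilder and baseVectors set up in Step 1 below, and assuming addGraphNode may be called concurrently, which is what the nonblocking design implies:

        // illustrative sketch: insert nodes into the graph from many threads at once
        IntStream.range(0, baseVectors.size())
                 .parallel()
                 .forEach(i -> builder.addGraphNode(i, baseVectors.get(i)));
        // enforce the maximum degree before searching or persisting the graph
        builder.cleanup();
        var index = builder.getGraph();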

The graph is represented by an on-disk adjacency list per node, with additional data stored inline to support two-pass searches: the first pass is powered by lossily compressed representations of the vectors kept in memory (such as the product-quantized vectors used in Step 5 below), and the second by a more accurate representation read from disk (such as the full-resolution vectors stored inline in the index).

This two-pass design reduces memory usage and reduces latency while preserving accuracy.

Additionally, JVector is unique in offering the ability to construct the index itself using two-pass searches, allowing larger-than-memory indexes to be built.

This is important because it allows you to take advantage of logarithmic search within a single index, instead of spilling over to linear-time merging of results from multiple indexes.

JVector step-by-step

All code samples are from SiftSmall in the JVector source repo, which includes the siftsmall dataset as well. Just import the project in your IDE and click Run to try it out!

Step 1: Build and query an index in memory

First the code:

    public static void siftInMemory(ArrayList<VectorFloat<?>> baseVectors) throws IOException {
        // infer the dimensionality from the first vector
        int originalDimension = baseVectors.get(0).length();
        // wrap the raw vectors in a RandomAccessVectorValues
        RandomAccessVectorValues ravv = new ListRandomAccessVectorValues(baseVectors, originalDimension);

        // score provider using the raw, in-memory vectors
        BuildScoreProvider bsp = BuildScoreProvider.randomAccessScoreProvider(ravv, VectorSimilarityFunction.EUCLIDEAN);
        try (GraphIndexBuilder builder = new GraphIndexBuilder(bsp,
                                                               ravv.dimension(),
                                                               16, // graph degree
                                                               100, // construction search depth
                                                               1.2f, // allow degree overflow during construction by this factor
                                                               1.2f)) // relax neighbor diversity requirement by this factor
        {
            // build the index (in memory)
            OnHeapGraphIndex index = builder.build(ravv);

            // search for a random vector
            VectorFloat<?> q = randomVector(originalDimension);
            SearchResult sr = GraphSearcher.search(q,
                                                   10, // number of results
                                                   ravv, // vectors we're searching, used for scoring
                                                   VectorSimilarityFunction.EUCLIDEAN, // how to score
                                                   index,
                                                   Bits.ALL); // valid ordinals to consider
            for (SearchResult.NodeScore ns : sr.getNodes()) {
                System.out.println(ns);
            }
        }
    }

Commentary:

Step 2: more control over GraphSearcher

Keeping the Builder the same, the updated search code looks like this:

            // search for a random vector using a GraphSearcher and SearchScoreProvider
            VectorFloat<?> q = randomVector(originalDimension);
            try (GraphSearcher searcher = new GraphSearcher(index)) {
                SearchScoreProvider ssp = SearchScoreProvider.exact(q, VectorSimilarityFunction.EUCLIDEAN, ravv);
                SearchResult sr = searcher.search(ssp, 10, Bits.ALL);
                for (SearchResult.NodeScore ns : sr.getNodes()) {
                    System.out.println(ns);
                }
            }

Commentary:

Step 3: Measuring recall

A blisteringly fast vector index isn’t very useful if it doesn’t return accurate results. As a sanity check, SiftSmall includes a helper method testRecall. Wiring that up to our code mostly involves turning the SearchScoreProvider into a factory lambda:

            Function<VectorFloat<?>, SearchScoreProvider> sspFactory = q -> SearchScoreProvider.exact(q, VectorSimilarityFunction.EUCLIDEAN, ravv);
            testRecall(index, queryVectors, groundTruth, sspFactory);

If you run the code, you will see slight differences in recall every time (printed by testRecall):

(OnHeapGraphIndex) Recall: 0.9898
...
(OnHeapGraphIndex) Recall: 0.9890

This is expected given the approximate nature of the index being created and the perturbations introduced by the multithreaded concurrency of the build call.
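
For reference, recall here is simply the fraction of the true nearest neighbors (from the precomputed ground truth) that the index returns. A minimal sketch of that computation, written as a hypothetical helper rather than SiftSmall's actual testRecall implementation:

    // hypothetical helper: average fraction of ground-truth neighbors returned per query
    static double recall(List<int[]> results, List<Set<Integer>> groundTruth, int topK) {
        long found = 0;
        for (int i = 0; i < results.size(); i++) {
            for (int ordinal : results.get(i)) {
                if (groundTruth.get(i).contains(ordinal)) {
                    found++;
                }
            }
        }
        return (double) found / ((long) results.size() * topK);
    }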

Step 4: write and load index to and from disk

The code:

        Path indexPath = Files.createTempFile("siftsmall", ".inline");
        try (GraphIndexBuilder builder = new GraphIndexBuilder(bsp, ravv.dimension(), 16, 100, 1.2f, 1.2f)) {
            // build the index (in memory)
            OnHeapGraphIndex index = builder.build(ravv);
            // write the index to disk with default options
            OnDiskGraphIndex.write(index, ravv, indexPath);
        }

        // on-disk indexes require a ReaderSupplier (not just a Reader) because we will want it to
        // open additional readers for searching
        ReaderSupplier rs = new SimpleMappedReaderSupplier(indexPath);
        OnDiskGraphIndex index = OnDiskGraphIndex.load(rs);
        // measure our recall against the (exactly computed) ground truth
        Function<VectorFloat<?>, SearchScoreProvider> sspFactory = q -> SearchScoreProvider.exact(q, VectorSimilarityFunction.EUCLIDEAN, ravv);
        testRecall(index, queryVectors, groundTruth, sspFactory);
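
The loaded OnDiskGraphIndex can also be queried directly with the same GraphSearcher pattern from Step 2. A minimal sketch, assuming the searcher accepts the on-disk index just as it does the on-heap one, and reusing the in-memory ravv for exact scoring:

        // sketch: search the on-disk index the same way as in Step 2
        VectorFloat<?> q = randomVector(originalDimension);
        try (GraphSearcher searcher = new GraphSearcher(index)) {
            SearchScoreProvider ssp = SearchScoreProvider.exact(q, VectorSimilarityFunction.EUCLIDEAN, ravv);
            SearchResult sr = searcher.search(ssp, 10, Bits.ALL);
            for (SearchResult.NodeScore ns : sr.getNodes()) {
                System.out.println(ns);
            }
        }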

Commentary:

Step 5: use compressed vectors in the search

Compressing the vectors with product quantization is done as follows:

        // compute and write compressed vectors to disk
        Path pqPath = Files.createTempFile("siftsmall", ".pq");
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(pqPath)))) {
            // Compress the original vectors using PQ. This represents a compression ratio of 128 * 4 / 16 = 32x
            ProductQuantization pq = ProductQuantization.compute(ravv,
                                                                 16, // number of subspaces
                                                                 256, // number of centroids per subspace
                                                                 true); // center the dataset
            // Note: before jvector 3.1.0, encodeAll returned an array of ByteSequence.
            PQVectors pqv = pq.encodeAll(ravv);
            // write the compressed vectors to disk
            pqv.write(out);
        }

Then we can wire up the compressed vectors to a two-phase search by getting the fast ApproximateScoreFunction from PQVectors, and the Reranker from the index View:

        ReaderSupplier rs = new MMapReaderSupplier(indexPath);
        OnDiskGraphIndex index = OnDiskGraphIndex.load(rs);
        // load the PQVectors that we just wrote to disk
        try (RandomAccessReader in = new SimpleMappedReader(pqPath)) {
            PQVectors pqv = PQVectors.load(in);
            // SearchScoreProvider that does a first pass with the loaded-in-memory PQVectors,
            // then reranks with the exact vectors that are stored on disk in the index
            Function<VectorFloat<?>, SearchScoreProvider> sspFactory = q -> {
                ApproximateScoreFunction asf = pqv.precomputedScoreFunctionFor(q, VectorSimilarityFunction.EUCLIDEAN);
                Reranker reranker = index.getView().rerankerFor(q, VectorSimilarityFunction.EUCLIDEAN);
                return new SearchScoreProvider(asf, reranker);
            };
            // measure our recall against the (exactly computed) ground truth
            testRecall(index, queryVectors, groundTruth, sspFactory);
        }

This set of functionality is the classic DiskANN design.

Step 6: building a larger-than-memory index

JVector can also apply PQ compression to allow indexing a larger than memory dataset: only the compressed vectors are kept in memory.
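
To make the savings concrete with the Step 5 parameters: a full-resolution siftsmall vector is 128 dimensions × 4 bytes = 512 bytes, while its PQ encoding is 16 bytes (one byte per subspace with 256 centroids), so a hypothetical dataset of one million vectors would need roughly 512 MB of raw vectors in memory versus roughly 16 MB of compressed vectors here.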

First we need to set up a PQVectors instance that we can add new vectors to, and a BuildScoreProvider using it:

        // compute the codebook, but don't encode any vectors yet
        ProductQuantization pq = ProductQuantization.compute(ravv, 16, 256, true);

        // as we build the index we'll compress the new vectors and add them to this List backing a PQVectors;
        // this is used to score the construction searches
        List<ByteSequence<?>> incrementallyCompressedVectors = new ArrayList<>();
        PQVectors pqv = new PQVectors(pq, incrementallyCompressedVectors);
        BuildScoreProvider bsp = BuildScoreProvider.pqBuildScoreProvider(VectorSimilarityFunction.EUCLIDEAN, pqv);

Then we need to set up an OnDiskGraphIndexWriter with full control over the construction process.

        Path indexPath = Files.createTempFile("siftsmall", ".inline");
        Path pqPath = Files.createTempFile("siftsmall", ".pq");
        // Builder creation looks mostly the same
        try (GraphIndexBuilder builder = new GraphIndexBuilder(bsp, ravv.dimension(), 16, 100, 1.2f, 1.2f);
             // explicit Writer for the first time, this is what's behind OnDiskGraphIndex.write
             OnDiskGraphIndexWriter writer = new OnDiskGraphIndexWriter.Builder(builder.getGraph(), indexPath)
                     .with(new InlineVectors(ravv.dimension()))
                     .withMapper(new OnDiskGraphIndexWriter.IdentityMapper())
                     .build();
             // output for the compressed vectors
             DataOutputStream pqOut = new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(pqPath))))

Once that’s done, we can index vectors one at a time:

            // build the index vector-at-a-time (on disk)
            for (VectorFloat<?> v : baseVectors) {
                // compress the new vector and add it to the PQVectors (via incrementallyCompressedVectors)
                int ordinal = incrementallyCompressedVectors.size();
                incrementallyCompressedVectors.add(pq.encode(v));
                // write the full vector to disk
                writer.writeInline(ordinal, Feature.singleState(FeatureId.INLINE_VECTORS, new InlineVectors.State(v)));
                // now add it to the graph -- the previous steps must be completed first since the PQVectors
                // and InlineVectorValues are both used during the search that runs as part of addGraphNode construction
                builder.addGraphNode(ordinal, v);
            }

Finally, we need to run cleanup() and write the index and the PQVectors to disk:

            // cleanup does a final enforcement of maxDegree and handles other scenarios like deleted nodes
            // that we don't need to worry about here
            builder.cleanup();

            // finish writing the index (by filling in the edge lists) and write our completed PQVectors
            writer.write(Map.of());
            pqv.write(pqOut);

Commentary:

Less-obvious points

Advanced features

The research behind the algorithms

Developing and Testing

This project is organized as a multimodule Maven build. The intent is to produce a multirelease jar suitable for use as a dependency from any Java 11 code. When run on a Java 20+ JVM with the Vector module enabled, optimized vector providers will be used. In general, the project is structured to be built with JDK 20+, but when JAVA_HOME points to Java 11 through Java 19, certain build features will still be available.

Base code is in jvector-base and will be built for Java 11 releases, restricting language features and APIs appropriately. Code in jvector-twenty will be compiled for Java 20 language features/APIs and included in the final multirelease jar targeting supported JVMs. jvector-multirelease packages jvector-base and jvector-twenty as a multirelease jar for release. jvector-examples is an additional sibling module that uses the reactor-representation of jvector-base/jvector-twenty to run example code. jvector-tests contains tests for the project, capable of running against both Java 11 and Java 20+ JVMs.

To run tests, use mvn test (by default this runs against the Java 20+ build). To run tests against Java 11, use mvn -Pjdk11 test. To run a single test class, use the Maven Surefire test filtering capability, e.g., mvn -Dtest=TestNeighborArray test. You may also use method-level filtering and patterns, e.g., mvn -Dtest=TestNeighborArray#testRetain* test.

You can run SiftSmall and Bench directly to get an idea of what all is going on here. Bench will automatically download required datasets to the fvec and hdf5 directories. The files used by SiftSmall can be found in the siftsmall directory in the project root.

To run either class, you can use the Maven exec-plugin via the following incantations:

mvn compile exec:exec@bench

or for Sift:

mvn compile exec:exec@sift

Bench takes an optional benchArgs argument that can be set to a list of whitespace-separated regexes. If any of the provided regexes match within a dataset name, that dataset will be included in the benchmark. For example, to run only the glove and nytimes datasets, you could use:

mvn compile exec:exec@bench -DbenchArgs="glove nytimes"

To run Sift/Bench without the JVM vector module available, you can use the following invocations:

mvn -Pjdk11 compile exec:exec@bench

mvn -Pjdk11 compile exec:exec@sift

The ... -Pjdk11 invocations will also work with JAVA_HOME pointing at a Java 11 installation.

To release, configure ~/.m2/settings.xml to point to OSSRH and run mvn -Prelease clean deploy.