SimSearch

Overview

SimSearch is open-source software for top-k similarity search over multi-attribute entity profiles that may reside in different, remote, and heterogeneous data sources.

SimSearch is developed in Java and supports combined similarity search against multi-attribute entities, i.e., datasets with different types of properties (textual/categorical, numerical, spatial, temporal, etc.). Queries enable multi-attribute similarity search for data exploration and may involve a different similarity measure per attribute (Jaccard, Euclidean, Manhattan, etc.).
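To make the per-attribute measures concrete, here is a minimal sketch of the three measures named above, written as distances in plain Python. It is only an illustration of the definitions; the function names are hypothetical and do not correspond to SimSearch's Java classes.

```python
import math


def jaccard_distance(a, b):
    """Jaccard distance between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two empty sets are considered identical
    return 1.0 - len(a & b) / len(a | b)


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(x - y) for x, y in zip(p, q))
```

For example, a keyword attribute could be compared with `jaccard_distance`, while a location given as coordinates could use `euclidean_distance`.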

Attribute data values may come from diverse data sources, and each source can be either ingested or queried in-situ.

With SimSearch, users can request the top-k results across all examined attributes and get each result ranked by an aggregate similarity score. The library builds specialized indices for each attribute type. It supports two alternative methods for similarity search: rank aggregation and pivot-based search.
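The idea of ranking by an aggregate similarity score can be sketched as follows. This is a brute-force illustration and makes two assumptions: that per-attribute similarity scores in [0, 1] are already available, and that the aggregate is a weighted average. SimSearch's index-based algorithms and its exact aggregation function may differ, and the function and attribute names are hypothetical.

```python
import heapq


def top_k(scores_per_entity, weights, k):
    """Rank entities by a weighted average of per-attribute similarities.

    scores_per_entity: {entity_id: {attribute: similarity in [0, 1]}}
    weights:           {attribute: non-negative weight}
    """
    total_weight = sum(weights.values())

    def aggregate(attr_scores):
        # Attributes missing from an entity contribute zero similarity.
        return sum(w * attr_scores.get(a, 0.0) for a, w in weights.items()) / total_weight

    ranked = ((aggregate(s), eid) for eid, s in scores_per_entity.items())
    return [(eid, score) for score, eid in heapq.nlargest(k, ranked)]


candidates = {
    "e1": {"name": 0.9, "location": 0.2},
    "e2": {"name": 0.5, "location": 0.8},
    "e3": {"name": 0.1, "location": 0.1},
}
result = top_k(candidates, {"name": 1.0, "location": 1.0}, k=2)
```

With equal weights, `e2` ranks first because its average similarity is higher than that of `e1`, even though `e1` matches better on a single attribute.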

SimSearch can be deployed either as a standalone Java application or as a RESTful web service.

Documentation

Javadoc is available here.

Usage

Step 1. Download or clone the project:

$ git clone https://github.com/smartdatalake/simsearch.git

Step 2. Open a terminal in the project's root folder and compile by running:

$ mvn clean package spring-boot:repackage

Step 3. Edit the parameters for the various data sources and their queryable attributes in the sources.json file.

Step 4. Edit the parameters in the search.json file for the various attributes involved in the top-k similarity search query.

Standalone execution

To invoke SimSearch locally in standalone mode as a Java application, run the executable:

$ java -jar target/simsearch-0.5.1-SNAPSHOT.jar

Next, choose a number corresponding to a functionality you want to apply:

(1): MOUNT SOURCES -> Specifies the queryable attributes and (if necessary) builds suitable in-memory indices on their values. This mount operation must be applied before any queries are submitted. The user must provide the path to a JSON file containing the specification of data sources and their attributes to be made queryable in SimSearch. An example configuration for enabling SimSearch using rank aggregation is available in the sources.json.example file or in data/gdelt/standalone/sources.json, which must specify a suitable metric distance per attribute. An example configuration for pivot-based SimSearch is available in data/gdelt/standalone/sources_pivot.json. Note that if set-valued textual attributes (e.g., containing sets of keywords) are involved in SimSearch, they must be transformed into word embeddings using a dictionary of terms that must also be specified in the configuration. For instance, this dictionary has been constructed with Latent Dirichlet Allocation (LDA) over the terms used in the organizations attribute of the sample dataset.

(2): DELETE SOURCES -> Disables attributes from querying; attributes may be enabled again using functionality (1). When rank aggregation is used, this operation removes only the specified attribute(s). CAUTION! For pivot-based SimSearch, this operation drops the centralized index built on all attributes; in this case, the index must be rebuilt from scratch using functionality (1) and involving the desired attributes.

(3): CATALOG -> Returns a list of the currently queryable attributes and the operation (categorical, numerical, spatial, temporal, textual, or pivot-based) supported for each one.

(4): SEARCH -> Allows specification of a top-k similarity search query. The user must also specify the path to a JSON file containing the query specification (as in search.json.example file or data/gdelt/standalone/search.json for a search request using rank aggregation). Configuration for search requests using pivot-based SimSearch must specifically define pivot_based as the algorithm to be used, as in this example of a pivot-based search request. In all cases, once evaluation is complete, results will be available in JSON format (as in data/gdelt/standalone/search_results.json).

(5): SQL TERMINAL -> This terminal-based front-end enables users to type conjunctive SQL-like queries, issue them against the locally running SimSearch instance, and readily inspect the query results. See the SQL syntax section for details on this syntax, which is customized for top-k similarity search.

Launching SimSearch as web service

SimSearch also integrates a REST API and can be deployed as a web service application at a specific port (e.g., 8090) as follows:

$ java -Dserver.port=8090 -jar target/simsearch-0.5.1-SNAPSHOT.jar --service

Option --service signifies that a web application will be deployed using Spring Boot. When the user makes some data source(s) available for similarity search, a new instance of the service is created and associated with an auto-generated API key, which is returned to the user. All subsequent requests against this instance of the SimSearch service must specify this API key. Multiple SimSearch instances may be active in parallel but run in isolation, each one responding only to requests that specify its own unique API key.

Once an instance of the SimSearch service is deployed as above, requests can be formulated according to the API documentation (typically accessible at http://localhost:8090/swagger-ui.html#).

Thus, users can issue requests to an instance of the SimSearch service from a client application (e.g., Python scripts).

If all data is available in Elasticsearch, these example scripts demonstrate how to specify a SimSearch instance against various types of ES-indexed attributes and query it with top-k similarity search requests.
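As an illustration of such a client, the sketch below prepares (but does not send) a search request that carries the API key. The endpoint path `/simsearch/api/search`, the `api_key` header name, and the payload fields `algorithm` and `k` are assumptions made for this example only; the authoritative paths, headers, and request schema are those listed in the Swagger UI of your deployment.

```python
import json
import urllib.request


def build_search_request(base_url, api_key, query):
    """Prepare a POST request for a top-k search (endpoint and header are assumed)."""
    # Hypothetical endpoint path; verify it in the API documentation.
    url = base_url.rstrip("/") + "/simsearch/api/search"
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "api_key": api_key,  # hypothetical header name for the instance's API key
        },
        method="POST",
    )


# Hypothetical payload; only the algorithm name pivot_based comes from this README.
request = build_search_request(
    "http://localhost:8090", "my-api-key", {"algorithm": "pivot_based", "k": 5}
)
```

The prepared request can then be sent with `urllib.request.urlopen(request)` and the JSON response decoded with `json.load`.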

SimSearch REST API support for OpenAPI 3.1

Starting from version 0.5.1, SimSearch's REST API also supports the OpenAPI 3.1 specification. To enable OpenAPI 3.1 support, deploy SimSearch as a web service application at a specific port (e.g., 8090) as follows:

$ java -Dserver.port=8090 -Dspringdoc.api-docs.version=openapi_3_1 -jar target/simsearch-0.5.1-SNAPSHOT.jar --service

The description of the SimSearch REST API will then be available at http://localhost:8090/v3/api-docs. Note that the directive -Dspringdoc.api-docs.version=openapi_3_1 makes the API description follow the OpenAPI 3.1 specification; if omitted, the description follows the OpenAPI 3.0.1 specification instead. In both cases, the functionality of SimSearch service requests is as described above.
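One way to confirm which specification version a running service exposes is to read the top-level `openapi` field of the document served at `/v3/api-docs`. The sketch below works on the JSON text of such a document; the helper names are hypothetical, and fetching the document (e.g., with `urllib.request.urlopen`) is left to the caller.

```python
import json


def openapi_version(api_docs_text):
    """Return the value of the top-level 'openapi' field of an OpenAPI document."""
    return json.loads(api_docs_text)["openapi"]


def is_openapi_31(api_docs_text):
    """True if the document declares an OpenAPI 3.1.x version."""
    return openapi_version(api_docs_text).startswith("3.1")
```

For instance, a document starting with `{"openapi": "3.0.1", ...}` would indicate that the service was started without the `springdoc.api-docs.version` directive.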

Value specification in search requests

SimSearch supports several options in specifying query values in search requests. The following examples indicate these alternative specifications for the various types of attributes involved in SimSearch queries:

SQL syntax

When running SimSearch as a standalone application, its terminal-based interface allows users to submit conjunctive SQL-like queries and interactively browse the results. In particular, users can write SELECT statements of the following syntax (optional clauses are enclosed in brackets):

SELECT *, [ extra_attrX [, ...] ]
    [ FROM running_instance ]
      WHERE attr_name1 ~= 'attr_value1' [ AND ...]
    [ WEIGHTS weight_value1 [, ...] ]
    [ ALGORITHM { threshold | partial_random_access | no_random_access | pivot_based } ]
    [ LIMIT count ] ;

More specifically:

These example SQL statements demonstrate how to specify such queries through the terminal over a locally running instance of SimSearch. In the listing of returned results, the applied weights are shown in brackets next to the names of attributes involved in the similarity criteria.
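For readers who generate such statements programmatically, the following sketch assembles a query string that follows the syntax shown above. It is an illustration only: the helper name and the attribute names in the usage example are hypothetical, values are not escaped, and it assumes that `SELECT *` without extra attributes is accepted by the parser.

```python
ALGORITHMS = {"threshold", "partial_random_access", "no_random_access", "pivot_based"}


def build_query(criteria, weights=None, algorithm=None, limit=None, extra_attrs=()):
    """Compose a SimSearch-style SELECT statement.

    criteria: list of (attribute_name, query_value) pairs, combined with AND.
    """
    select = "SELECT *" + "".join(f", {attr}" for attr in extra_attrs)
    where = " AND ".join(f"{attr} ~= '{value}'" for attr, value in criteria)
    parts = [select, f"WHERE {where}"]
    if weights:
        parts.append("WEIGHTS " + ", ".join(str(w) for w in weights))
    if algorithm is not None:
        if algorithm not in ALGORITHMS:
            raise ValueError(f"unknown algorithm: {algorithm}")
        parts.append(f"ALGORITHM {algorithm}")
    if limit is not None:
        parts.append(f"LIMIT {limit}")
    return " ".join(parts) + ";"


statement = build_query(
    [("name", "Acme"), ("revenue", "1000")],
    weights=[0.7, 0.3],
    algorithm="threshold",
    limit=10,
)
```

The resulting string can be pasted into the SQL terminal of a locally running instance.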

Finally, users may set the internal parameter query_timeout, which controls the maximum response time allowed per query (default value: 10000 milliseconds). If this deadline is reached during evaluation of a query, the best results found so far are returned, possibly with approximate scores. The timeout value is specified in milliseconds, as in this example:

SET query_timeout 20000;

Then, execution of any newly submitted query will time out after 20 seconds at the latest, returning all results collected so far, possibly with approximate scores and ranks.
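The timeout behaviour described above follows a common "best results so far" pattern, sketched below in generic form. This is not SimSearch's evaluation code: the function name, the linear scan over candidates, and the injectable `clock` parameter are assumptions made to keep the example small and testable.

```python
import time


def search_with_deadline(candidates, score, k, timeout_ms, clock=time.monotonic):
    """Scan candidates until done or until the deadline passes.

    Returns (top_results, complete); complete is False if the scan timed out.
    """
    deadline = clock() + timeout_ms / 1000.0
    best = []  # (score, candidate) pairs, highest score first, at most k entries
    complete = True
    for candidate in candidates:
        if clock() >= deadline:
            complete = False  # deadline reached: return what has been found so far
            break
        best.append((score(candidate), candidate))
        best.sort(key=lambda pair: pair[0], reverse=True)
        del best[k:]
    return [candidate for _, candidate in best], complete


# Simulated clock: the deadline (0.5 s) passes before the third candidate is examined.
ticks = iter([0.0, 0.0, 0.0, 1.0])
partial = search_with_deadline([3, 1, 2, 9], lambda x: x, k=2, timeout_ms=500,
                               clock=lambda: next(ticks))
```

In the simulated run, the highest-scoring candidate is never examined, which shows why results returned after a timeout may be only approximate.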

Interactive Data Exploration with the SimSearch REST API and Jupyter notebooks

This Jupyter notebook demonstrates how to interact with a deployed SimSearch service and specify requests.

It also demonstrates how results of multi-attribute SimSearch queries can be visualized in various plots (maps, keyword clouds, histograms) for interactive data exploration.

Creating and launching a Docker image

We provide an indicative Dockerfile that may be used to create a Docker image (sdl/simsearch-docker) from the executable:

$ docker build -t sdl/simsearch-docker .

This Docker image can then be used to launch a web service application at a specific port (e.g., 8090) as follows:

$ docker run -p 8090:8080 sdl/simsearch-docker:latest

Once the service is launched, requests can be sent as mentioned above in order to create, manage, and query instances of SimSearch against data source(s).

Demonstration

We have made available two videos demonstrating the current functionality provided by the SimSearch software:

Note on R-tree implementation

SimSearch modifies and extends an R-tree implementation originally published here.

The original code provides an in-memory immutable R-tree implementation for a spatial index in n dimensions.

Basic extensions made for SimSearch include:

License

The contents of this project are licensed under the Apache License 2.0.

Acknowledgement

This software is being developed in the context of the SmartDataLake project. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825041.