Home

Awesome

libsearch 🔎

npm libsearch TypeScript types Build Status

Simple, index-free text search for JavaScript, used across my personal projects like YC Vibe Check, linus.zone/entr, and my personal productivity software. Read the annotated source to understand how it works under the hood.

The API

Let's begin with some quick examples:

import { search } from 'libsearch'; // on Node.js
const { search } = window.libsearch; // in the browser

const articles = [
    { title: 'Weather in Berkeley, California' },
    { title: 'University report: UC Berkeley' },
    { title: 'Berkeley students rise in solidarity...' },
    { title: 'Californian wildlife returning home' },
];

// basic usage
search(articles, 'berkeley cali', a => a.title);
// => [{ title: 'Weather in Berkeley, California' }]
search(articles, 'california', a => a.title);
// => [
//   { title: 'Weather in Berkeley, California' },
//   { title: 'Californian wildlife returning home' },
// ]

// mode: 'word' only returns whole-word matches
search(articles, 'california', a => a.title, { mode: 'word' });
// => [{ title: 'Weather in Berkeley, California' }]

// case sensitivity
search(articles, 'W', a => a.title, { caseSensitive: true });
// => [{ title: 'Weather in Berkeley, California' }]

// empty query returns the full list, unmodified
search(articles, '', a => a.title);
// => [{...}, {...}, {...}, {...}]

More formally, libsearch exposes a single API, the search function. This function takes two required arguments and two optional arguments:

function search<T>(
    items: T[],
    query: string,
    by?: (it: T) => string,
    options?: {
        caseSensitive: boolean,
        mode: 'word' | 'prefix' | 'autocomplete',
    },
): T[]

You can find more examples of how these options combine together in the unit tests.

Installation and usage

On the web, with <script>

Drop this into your HTML:

<script src="https://unpkg.com/libsearch/dist/browser.js"></script>

This will expose the search function as window.libsearch.search.

Via NPM

npm install libsearch
# or
yarn add libsearch

And use in your code:

import { search } from 'libsearch';

// search(...);

Using TypeScript types

libsearch ships with TypeScript type definitions generated from the source file. Using libsearch from NPM should get them picked up by the TypeScript compiler.

How it works

libsearch lets you perform basic full-text search across a list of JavaScript objects quickly, without requiring a pre-built search index, while offering reasonably good TF-IDF ranking of results. It doesn't deliver the wide array of features that come with libraries like FlexSearch and lunr.js, but is a big step above text.indexOf(query) > -1, and is fast enough to be usable for searching thousands of documents on every keystroke in my experience.

There are two key ideas in how libsearch delivers this:

1. Transforming queries into regular expressions

Modern JavaScript engines ship with highly optimized regular expression engines, and libsearch takes advantage of this for fast, index-free text search by transforming query strings into regular expression filters at search time.

Most full-text search libraries work by first requiring the developer to build up an "index" data structure mapping search terms to documents in which they appear. This is usually a good tradeoff, because it moves some of the computational work of "searching" to be done ahead of time, so search itself can remain fast and accurate. It also allows for fancy transformations and data cleanup like lemmatization on the indexed data without destroying search speed. But when building prototypes and simple web apps, I often didn't want to incur the complexity of having a separate "indexing" step to get a "good enough" search solution. An index needs to be stored somewhere and maintained constantly as the underlying dataset changes and grows.

The main task of a search index is mapping "tokens" or keywords that appear in the dataset to the documents in which they appear, so that the question "which documents contain the word X?" is fast (O(1)) to answer at search time. Without an index, this turns into an O(n) question, as every document needs to be scanned for the keyword. But often, on modern hardware, for small-enough datasets (of a few MBs) typical in a client-side web app, the n is pretty small, small enough that O(n) on every keystroke isn't noticeable.

libsearch transforms a query like "Uni of California" into a list of regular expression filters, (^|\W)Uni($|\W), (^|\W)of($|\W), (^|\W)California. It then "searches" without needing an index by filtering the corpus through each of those regular expressions.

2. "Good enough" TF-IDF ranking based on RegExp matches and document length

The conventional TF-IDF metric is computed for each word as:

(# matches) / (# words in the doc) * log(# total docs / # docs that matched)

Getting the number of words in a doc requires tokenizing the document, or at least splitting the document by whitespaces, which is computationally expensive. So libsearch approximates this by using the length of the document (number of characters) instead.

Using the regular expression queries described above, libsearch's TF-IDF formula is:

(# RegExp matches) / (doc.length) * log(# docs / # docs that matched RegExp)

which is computed for each word as the search is performed, and then aggregated at the end for sorting.

Development

libsearch's source code is written in TypeScript. To allow the library to be used across TypeScript, vanilla Node.js and the web, we compile two builds:

The ES module build is produced with tsc, the TypeScript compiler, and the minified browser build is further produced with Webpack.

NPM/Yarn commands:

Before pushing to main or publishing, I usually run

yarn fmt && yarn build:all && yarn test && yarn docs

to make sure I haven't forgotten anything.