Home

Awesome

The Open Guide to Search Engineering

Table of Contents

Introduction

What every software engineer should know about search

[This guide is based on this popular blog post, also discussed on Hacker News.]

Ask a software engineer: “How would you add search functionality to your product?” or “How do I build a search engine?” You’ll probably immediately hear something like: “Oh, we’d just launch an ElasticSearch cluster. Search is easy these days.”

But is it? Numerous current products still have suboptimal search experiences. Any true search expert will tell you that few engineers have a very deep understanding of how search engines work, knowledge that’s often needed to improve search quality.

Even though many open source software packages exist, and the research is vast, the knowledge around building solid search experiences is limited to a select few. Ironically, searching online for search-related expertise doesn’t yield any recent, thoughtful overviews.

Legend

Why read this?

Think of this post as a collection of insights and resources that could help you to build search experiences. It can’t be a complete reference, of course, but hopefully we can improve it based on feedback (please comment or reach out!).

I’ll point at some of the most popular approaches, algorithms, techniques, and tools, based on my work on general purpose and niche search experiences of varying sizes at Google, Airbnb and several startups.

❗️ Not appreciating or understanding the scope and complexity of search problems can lead to bad user experiences, wasted engineering effort, and product failure.

If you’re impatient or already know a lot of this, you might find it useful to jump ahead to the services and tools section.

Principles

Most of what we cover here has four underlying principles:

  1. 🔹 Search is an inherently messy problem:

    • Queries are highly variable. The search problems are highly variable based on product needs.
    • Think about how different Facebook search (searching a graph of people).
    • YouTube search (searching individual videos).
    • Or how different both of those are are from Kayak (air travel planning is a really hairy problem).
    • Google Maps (making sense of geo-spacial data).
    • Pinterest (pictures of a brunch you might cook one day).
  2. Quality, metrics, and processes matter a lot:

    • There is no magic bullet (like PageRank) nor a magic ranking formula that makes for a good approach. Processes are always evolving collection of techniques and processes that solve aspects of the problem and improve overall experience, usually gradually and continuously.
    • ❗️In other words, search is not just just about building software that does ranking or retrieval (which we will discuss below) for a specific domain. Search systems are usually an evolving pipeline of components that are tuned and evolve over time and that build up to a cohesive experience.
    • In particular, the key to success in search is building processes for evaluation and tuning into the product and development cycles. A search system architect should think about processes and metrics, not just technologies.
  3. Use existing technologies first:

    • As in most engineering problems, don’t reinvent the wheel yourself. When possible, use existing services or open source tools. If an existing SaaS (such as Algolia or managed Elasticsearch) fits your constraints and you can afford to pay for it, use it. This solution will likely will be the best choice for your product at first, even if down the road you need to customize, enhance, or replace it.
  4. Even if you buy, know the details:

    • ❗️ Even if you are using an existing open source or commercial solution, you should have some sense of the complexity of the search problem and where there are likely to be pitfalls.

Theory: The search problem

Search is different for every product, and choices depend on many technical details of the requirements. It helps to identify the key parameters of your search problem:

Thinking through these points up front can help you make significant choices designing and building individual search system components.

Theory: The search pipeline

Now let’s go through a list of search sub-problems. These are usually solved by separate subsystems that form a pipeline. What that means is that a given subsystem consumes the output of previous subsystems, and produces input for the following subsystems.

This leads to an important property of the ecosystem: once you change how an upstream subsystem works, you need to evaluate the effect of the change and possibly change the behavior downstream.

Here are the most important problems you need to solve:

Index selection

Given a set of documents (e.g. the entirety of the Internet, all the Twitter posts, all the pictures on Instagram), select a potentially smaller subset of documents that may be worthy for consideration as search results and only include those in the index, discarding the rest. This is done to keep your indexes compact, and is almost orthogonal to selecting the documents to show to the user. Examples of particular classes of documents that don’t make the cut may include:

Spam

Oh, all the different shapes and sizes of search spam! A giant topic in itself, worthy of a separate guide. In the meantime, consult this web spam taxonomy overview.

Undesirable documents

Domain constraints might require filtering: porn, illegal content, etc. The techniques are similar to spam filtering, probably with extra heuristics.

Duplicates

Or near-duplicates and redundant documents. Can be done with Locality-sensitive hashing, similarity measures, clustering techniques or even clickthrough data. A good overview of techniques.

Low-utility documents

The definition of utility depends highly on the problem domain, so it’s hard to recommend the approaches here. Some ideas are: it might be possible to build a utility function for your documents; heuristics might work, or example an image that contains only black pixels is not a useful document; utility might be learned from user behavior.

Index construction

For most search systems, document retrieval is performed using an inverted index — often just called the index.

Query analysis and document retrieval

Most popular search systems allow non-structured queries. That means the system has to extract structure out of the query itself. In the case of an inverted index, you need to extract search terms using NLP techniques.

The extracted terms can be used to retrieve relevant documents. Unfortunately, most queries are not very well formulated, so it pays to do additional query expansion and rewriting, like:

Ranking

Given a list of documents (retrieved in the previous step), their signals, and a processed query, create an optimal ordering (ranking) for those documents.

Originally, most ranking models in use were hand-tuned weighted combinations of all the document signals. Signal sets might include PageRank, clickthrough data, topicality information and others.

To further complicate things, many of those signals, such as PageRank, or ones generated by statistical language models contain parameters that greatly affect the performance of a signal. Those have to be hand-tuned too.

Lately, 🔹 learning to rank, signal-based discriminative supervised approaches are becoming more and more popular. Some popular examples of LtR are McRank and LambdaRank from Microsoft, and MatrixNet from Yandex.

A new, vector space based approach for semantic retrieval and ranking is gaining popularity lately. The idea is to learn individual low-dimensional vector document representations, then build a model which maps queries into the same vector space.

Then, retrieval is just finding several documents that are closest by some metric (e.g. Euclidean distance) to the query vector. Ranking is the distance itself. If the mapping of both the documents and queries is built well, the documents are chosen not by a fact of presence of some simple pattern (like a word), but how close the documents are to the query by meaning.

Indexing pipeline operation

Usually, each of the above pieces of the pipeline must be operated on a regular basis to keep the search index and search experience current.

❗️Operating a search pipeline can be complex and involve a lot of moving pieces. Not only is the data moving through the pipeline, but the code for each module and the formats and assumptions embedded in the data will change over time.

A pipeline can be run in “batch” or based on a regular or occasional basis (if indexing speed does not need to be real time) or in a streamed way (if real-time indexing is needed) or based on certain triggers.

Some complex search engines (like Google) have several layers of pipelines operating on different time scales — for example, a page that changes often (like cnn.com) is indexed with a higher frequency than a static page that hasn’t changed in years.

Serving systems

Ultimately, the goal of a search system is to accept queries, and use the index to return appropriately ranked results. While this subject can be incredibly complex and technical, we mention a few of the key aspects to this part of the system.

Quality, evaluation, and improvement

So you’ve launched your indexing pipeline and search servers, and it’s all running nicely. Unfortunately the road to a solid search experience only begins with running infrastructure.

Next, you’ll need to build a set of processes around continuous search quality evaluation and improvement. In fact, this is actually most of the work and the hardest problem you’ll have to solve.

🔹What is quality? First, you’ll need to determine (and get your boss or the product lead to agree), what quality means in your case:

Metrics: Some of these concepts can be quite hard to quantify. On the other hand, it’s incredibly useful to be able to express how well a search engine is performing in a single number, a quality metric.

Continuously computing such a metric for your (and your competitors’) system you can both track your progress and explain how well you are doing to your boss. Here are some classical ways to quantify quality, that can help you construct your magic quality metric formula:

🔹Human evaluations: Quality metrics might seem like statistical calculations, but they can’t all be done by automated calculations. Ultimately, metrics need to represent subjective human evaluation, and this is where a “human in the loop” comes into play.

❗️Skipping human evaluation is probably the most widespread reason of sub-par search experiences.

Usually, at early stages the developers themselves evaluate the results manually. At later point human raters (or assessors) may get involved. Raters typically use custom tools to look at returned search results and provide feedback on the quality of the results.

Subsequently, you can use the feedback signals to guide development, help make launch decisions or even feed them back into the index selection, retrieval or ranking systems.

Here is the list of some other types of human-driven evaluation, that can be done on a search system:

Evaluation datasets

One should start thinking about the datasets used for evaluation (like “golden sets” mentioned above) early in the search experience design process. How you collect and update them? How you push them to the production eval pipeline? Is there a built-in bias?

Live experiments: After your search engine catches on and gains enough users, you might want to start conducting live search experiments on a portion of your traffic. The basic idea is to turn some optimization on for a group of people, and then compare the outcome with that of a “control” group — a similar sample of your users that did not have the experiment feature on for them. How you would measure the outcome is, once again, very product specific: it could be clicks on results, clicks on ads etc.

Evaluation cycle time: How fast you improve your search quality is directly related to how fast you can complete the above cycle of measurement and improvement. It is essential from the beginning to ask yourself, “how fast can we measure and improve our performance?”

Will it take days, hours, minutes or seconds to make changes and see if they improve quality? ❗️Running evaluation should also be as easy as possible for the engineers and should not take too much hands-on time.

Practical advice

This guide is not meant as a tutorial, but here is a rough outline of a recommended approach to building a search experience right now:

Services and tools

☁️ SaaS

Tools and libraries

Datasets

A few fun or useful data sets to try building a search engine or evaluating search engine quality:

References

This concludes my humble attempt to make a somewhat-useful “map” for an aspiring search engine engineer. Did I miss something important? I’m pretty sure I did — you know, the margin is too narrow to contain this enormous topic. Let me know if you think that something should be here and is not — you can reach me at forwidur@gmail.com or at @forwidur.

P.S. — This post is part of a open, collaborative effort to build an online reference, the Open Guide to Practical AI, which we’ll release in draft form soon. See this popular guide for an example of what’s coming. If you’d like to get updates on or help with with this effort, sign up here.

Credits

Original author: Max Grigorev

Contribution and review: Leo Polovets, Abhishek Das

Editor: Joshua Levy

License

Creative Commons License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.