GutenSearch

A search engine for Project Gutenberg books built with PostgreSQL and Dash. Find it running on GutenSearch.com.

Summary

Source data

Project Gutenberg offers mirroring via rsync. However, in June 2016, Allison Parrish released a corpus of all text files and metadata up to that point in time, which was used here instead of the raw data.

Process

set up the instance, firewall, etc.
create a new Postgres database
stream the JSON metadata into a table
stream the raw text data
transform the data
start the app

Installation

Choosing your hardware

The setup below worked on a dedicated server with an Intel Atom 2.40GHz CPU, 16GB RAM and a 250GB SSD. Queries are mostly CPU-bound, particularly for common phrases. The deployed app uses 128GB of its 217GB partition.

Setting up the instance

I've only tested this on a clean install of Ubuntu 20.04.1 LTS.

You'll need the following to get started:

sudo apt update
sudo apt install screen
sudo apt install unzip
sudo apt install vim # not strictly necessary
sudo apt install postgresql postgresql-contrib

Setting up Postgres

You can use this guide. You may want to increase resources as follows:

sudo vim /etc/postgresql/12/main/postgresql.conf

Changing the following (here shown for a server with 16GB RAM):

shared_buffers = 4GB # (25% of server RAM)
work_mem = 40MB # (RAM * 0.25 / 100)
maintenance_work_mem = 800MB # (RAM * 0.05)
effective_cache_size = 8GB # (RAM * 0.5)
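
If your server has a different amount of RAM, the rules of thumb in the comments above can be turned into a small helper. This is only a sketch: the function name is made up for illustration, and it assumes decimal megabytes (1GB = 1000MB) so that the output matches the round figures quoted for 16GB.

```python
from typing import Dict


def suggest_postgres_settings(ram_gb: int) -> Dict[str, str]:
    """Apply the README's rules of thumb to a given amount of RAM.

    Hypothetical helper, not part of the project. Uses 1GB = 1000MB
    (an assumption) and integer arithmetic to avoid rounding surprises.
    """
    ram_mb = ram_gb * 1000
    return {
        "shared_buffers": f"{ram_mb * 25 // 100}MB",        # 25% of RAM
        "work_mem": f"{ram_mb * 25 // 100 // 100}MB",       # RAM * 0.25 / 100
        "maintenance_work_mem": f"{ram_mb * 5 // 100}MB",   # RAM * 0.05
        "effective_cache_size": f"{ram_mb * 50 // 100}MB",  # RAM * 0.5
    }


if __name__ == "__main__":
    # Print lines in postgresql.conf syntax for a 16GB server.
    for name, value in suggest_postgres_settings(16).items():
        print(f"{name} = {value}")
```

Treat the output as a starting point and adjust it to your workload before pasting it into postgresql.conf.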

Setting up Python

As usual the app relies on an alphabet soup of libraries:

sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.8
sudo apt-get install python3-venv
sudo apt install python3-pip
sudo apt install libpq-dev

Getting the data

Create project folder

mkdir gutensearch

Download raw data

Run this one in a screen, as it might take a while: a few minutes in a decent data centre at 30MB/s, or a night and a morning on a home connection.

screen -S download_data
wget -c http://static.decontextualize.com/gutenberg-dammit-files-v002.zip
mv gutenberg-dammit-files-v002.zip gutensearch/gutenberg-dammit-files-v002.zip
cd gutensearch
unzip gutenberg-dammit-files-v002.zip -d gutenberg-dammit-files-v002
exit
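
Before unzipping, it can be worth checking that the archive will actually fit on the target partition. The sketch below is not part of the project; the function names and the `headroom` parameter are made up for illustration. It sums the uncompressed sizes recorded in the zip and compares them with the free space reported by the OS.

```python
import shutil
import zipfile


def uncompressed_size(zip_path) -> int:
    """Total size in bytes of all members once extracted."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(info.file_size for info in archive.infolist())


def fits_on_disk(zip_path, target_dir, headroom: float = 1.0) -> bool:
    """True if the extracted archive, times `headroom`, fits in target_dir.

    `headroom` is a hypothetical safety multiplier: use a value above 1.0
    to leave room for the tables and indices built later.
    """
    needed = uncompressed_size(zip_path) * headroom
    free = shutil.disk_usage(target_dir).free
    return needed <= free
```

For example, `fits_on_disk("gutenberg-dammit-files-v002.zip", ".", headroom=2.0)` would ask for twice the extracted size to be free in the current directory.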

Insert the metadata

I recommend doing this one by hand, line by line, instead of passing the file to psql. Open a screen, then run server-process-part1.sql line by line.

screen -S process_data
sudo -u postgres psql # run through server-process-part1.sql
exit

SQL part 1 streams the metadata JSON into a table.
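
The project does this step in SQL. To give an idea of the shape of the data, here is a rough Python sketch of flattening metadata records into rows. It is not what server-process-part1.sql does. The field names (`Num`, `Title`, `Author`, `Language`) and the assumption that some values arrive as lists are mine, so check them against the actual JSON before relying on this.

```python
from typing import Any, Dict, Iterable, Iterator, Sequence, Tuple


def first(value: Any) -> str:
    """Return a scalar as text, the first item of a list, or ''."""
    if isinstance(value, list):
        return str(value[0]) if value else ""
    return "" if value is None else str(value)


def metadata_rows(
    records: Iterable[Dict[str, Any]],
    fields: Sequence[str] = ("Num", "Title", "Author", "Language"),  # assumed keys
) -> Iterator[Tuple[str, ...]]:
    """Yield one flat tuple per metadata record, in `fields` order."""
    for record in records:
        yield tuple(first(record.get(field)) for field in fields)
```

Rows in this shape could then be handed to whatever bulk-insert mechanism you prefer. Missing or empty fields come out as empty strings rather than raising.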

Insert the text

This part streams the text files into an SQL file that can be run later. It may take a while so best have it in a screen.

screen -S app_venv
cd ~
python3 -m venv .venvs/dash
source .venvs/dash/bin/activate # so the packages below install into the venv
pip3 install --upgrade pip
python3 -m pip install psycopg2
python3 server-import.py # Change the path to yours first!
exit
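
For readers who want to see the general idea before running the script, here is a minimal sketch of streaming text files into an SQL file. It is an illustration under my own assumptions, not the contents of server-import.py: the table name `gutenberg_raw`, the column names `path` and `body`, and the one-INSERT-per-file layout are all hypothetical.

```python
from pathlib import Path
from typing import Iterator, TextIO


def sql_literal(text: str) -> str:
    """Quote text as a standard SQL string literal.

    Single quotes are doubled; NUL bytes are dropped because Postgres
    text columns cannot store them.
    """
    return "'" + text.replace("\x00", "").replace("'", "''") + "'"


def iter_text_files(root: Path) -> Iterator[Path]:
    """Yield every .txt file under root, in a stable order."""
    for path in sorted(root.rglob("*.txt")):
        if path.is_file():
            yield path


def write_inserts(root: Path, out: TextIO, table: str = "gutenberg_raw") -> int:
    """Write one INSERT per text file to `out`; return the number written.

    Files are read one at a time, so memory use stays bounded by the
    largest single book rather than the whole corpus.
    """
    count = 0
    for path in iter_text_files(root):
        body = path.read_text(encoding="utf-8", errors="replace")
        rel = path.relative_to(root).as_posix()
        out.write(
            f"INSERT INTO {table} (path, body) "
            f"VALUES ({sql_literal(rel)}, {sql_literal(body)});\n"
        )
        count += 1
    return count
```

You would call it with the unzipped folder and an open output file, for example `write_inserts(Path("gutenberg-dammit-files-v002"), open("import.sql", "w", encoding="utf-8"))`, where `import.sql` is again a made-up name.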

Transform the data

This part will take the longest as 6GB zipped is expanded into more than 60GB of tables and indices. \timing for each part is included as comments in the code; on the instance mentioned earlier, you're looking at the better part of a day.

screen -r process_data
sudo -u postgres psql # now run through server-process-part2.sql
exit

Setting up the app

Libraries

You'll need the following:

pip install --upgrade pip
python3 -m pip install dash
python3 -m pip install dash_auth
python3 -m pip install pandas
python3 -m pip install sqlalchemy
python3 -m pip install networkx
python3 -m pip install gunicorn

Set up HTTPS

Follow instructions here.

Don't forget to back up the certs:

scp -r user@host:/etc/letsencrypt /path/to/backup/location

Set up firewall and reverse proxy

Find the relevant instructions for your provider. Mine are here.

You'll need to set up the firewall, instructions here.

Relevant files can be found here:

cd /etc/nginx/sites-available
sudo vim reverse-proxy.conf # add server_name and change the port
sudo ln -s /etc/nginx/sites-available/reverse-proxy.conf /etc/nginx/sites-enabled/reverse-proxy.conf

You can now serve the app

screen -S app_server
gunicorn app:server -b :port --workers=17 --log-level=debug --timeout=700
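
If your server has a different core count, you may want a different --workers value. Gunicorn's documentation suggests (2 x cores) + 1 as a starting point, which gives 17 on an 8-core machine. Whether that is how the number above was chosen is my assumption. The helper below is a hypothetical sketch of that rule, not part of the app.

```python
import os
from typing import Optional


def suggested_workers(cores: Optional[int] = None) -> int:
    """Gunicorn's rule of thumb: (2 x cores) + 1.

    Falls back to the detected core count, or 1 if it cannot be detected.
    """
    if cores is None:
        cores = os.cpu_count() or 1
    return 2 * cores + 1


if __name__ == "__main__":
    print(f"--workers={suggested_workers()}")
```

Since the queries are mostly CPU-bound, more workers than this is unlikely to help, but measure on your own hardware.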

License

In accordance with GutenTag's and Gutenberg Dammit's license:

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.