Exquisite Corpus

This code represents the build process for wordfreq, among other things. I've made it public because it's good to know where the data in wordfreq comes from. However, I make no promises that you'll be able to run it if you don't work at Luminoso.

Dependencies

Exquisite Corpus makes use of various libraries and command-line tools to process data correctly and efficiently. Because it is meant to be run on a development machine, it uses the best, fastest libraries it can, though this leads to somewhat complex system requirements.

You will need Python 3 and Haskell installed. Haskell is managed through stack, so that it can build wikiparsec and handle its Haskell dependencies on its own.

You also need certain tools to be available: the MeCab and ICU development libraries, the jq and xml2 processors, and curl. The apt command in the Installation section below installs all of these system-level packages.

Installation

Some steps here probably need to be filled in better.

# Install system-level dependencies (as root, or with sudo):
apt install python3-dev haskell-stack libmecab-dev libicu-dev jq xml2 curl

# Build and install wikiparsec (stack puts its binaries in ~/.local/bin, which should be on your PATH):
git clone https://github.com/LuminosoInsight/wikiparsec
cd wikiparsec
stack install

# Install Exquisite Corpus itself, from its own directory rather than wikiparsec's:
pip install -e .
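
As a quick sanity check (a suggestion, not an official setup step), you can confirm that the command-line tools ended up on your PATH before building:

which jq xml2 curl snakemake
ls ~/.local/bin    # stack install normally copies the wikiparsec binaries here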

Getting data

Most of the data in Exquisite Corpus will be downloaded from places where it can be found on the Web. However, one input must be provided separately: Twitter data cannot be redistributed due to the Twitter API's terms of use.

If you have a collection of tweets, put their text in data/raw/twitter-2015.txt, one tweet per line. Or just put an empty file there.
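
If you don't have any tweets, one way to create the empty placeholder, from the repository root, is:

mkdir -p data/raw
touch data/raw/twitter-2015.txt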

Building

Make sure you have lots of disk space available in the data directory, which may have to be a symbolic link to an external hard disk.
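
For example, if you have an external disk mounted at /mnt/bigdisk (a made-up path; substitute your own), you could keep the data there and link to it from the repository root:

mkdir -p /mnt/bigdisk/exquisite-corpus-data
ln -s /mnt/bigdisk/exquisite-corpus-data data    # move any existing data directory out of the way first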

Run:

snakemake -j 8

...and wait a day or two for results, or a crash that may tell you what you need to fix.
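
The -j 8 flag lets Snakemake run up to 8 jobs in parallel; adjust it to match the cores and memory you have. To preview which steps would run without building anything, you can do a dry run first:

snakemake -n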

To build the parallel corpus, run ./build.sh parallel. If you want alignment files for an already-built parallel corpus, or want to build the parallel corpus and alignments together, run ./build.sh alignment.