# Exquisite Corpus
This code represents the build process for wordfreq, among other things. I've made it public because it's good to know where the data in wordfreq comes from. However, I make no promises that you'll be able to run it if you don't work at Luminoso.
## Dependencies
Exquisite Corpus makes use of various libraries and command-line tools to process data correctly and efficiently. As something that is run on a development machine, it uses the best, fastest libraries it can, though this leads to somewhat complex system requirements.
You will need these programming environments installed:
- Python 3.4 or later
- Haskell, installed with `haskell-stack`, used to compile and run `wikiparsec`
You also need certain tools to be available:
- The C library for `mecab` (`apt install libmecab-dev`)
- The ICU Unicode libraries (`apt install libicu-dev`)
- The JSON processor `jq` (`apt install jq`)
- The XML processor `xml2` (`apt install xml2`)
- The HTTP downloader `curl` (`apt install curl`)
- wikiparsec (https://github.com/LuminosoInsight/wikiparsec)
## Installation
Some steps here probably need to be filled in better.
- Install system-level dependencies:

  ```
  apt install python3-dev haskell-stack libmecab-dev libicu-dev jq xml2 curl
  ```
- Clone, build, and install `wikiparsec`:

  ```
  git clone https://github.com/LuminosoInsight/wikiparsec
  cd wikiparsec
  stack install
  ```
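  `stack install` places the compiled binaries in `~/.local/bin` by default (stack's standard behavior), so make sure that directory is on your `PATH`:

  ```
  export PATH="$HOME/.local/bin:$PATH"
  ```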
- If building alignment files to get alignments for a parallel corpus:
  - Compile `fast_align` by following the instructions at https://github.com/clab/fast_align
  - Create a symbolic link to the `fast_align` executable inside this directory; the executable is found in the directory where `fast_align` was compiled (see the sketch after this list)
- Finally, return to this directory and install `exquisite-corpus` itself, along with the Python dependencies it manages:

  ```
  pip install -e .
  ```
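As a sketch of the `fast_align` symlink step: assuming you cloned `fast_align` to `~/fast_align` and compiled it with CMake into a `build/` directory (both paths are assumptions; adjust them to wherever you actually built it), the link would look like this:

```
# Link the compiled fast_align binary into this repository's root
ln -s ~/fast_align/build/fast_align ./fast_align
```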
## Getting data
Most of the data in Exquisite Corpus will be downloaded from places where it can be found on the Web. However, one input must be provided separately: Twitter data cannot be redistributed due to the Twitter API's terms of use.
If you have a collection of tweets, put their text in `data/raw/twitter-2015.txt`, one tweet per line. Or just put an empty file there.
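If you have no tweets to supply, one way to create the empty placeholder (assuming `data/raw` does not exist yet):

```
mkdir -p data/raw
touch data/raw/twitter-2015.txt
```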
## Building
Make sure you have lots of disk space available in the `data` directory, which may have to be a symbolic link to an external hard disk.
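For example, here is a sketch of linking `data` to an external disk, with `/mnt/external` as a hypothetical mount point (use whatever large disk you have; if a `data` directory already exists, move its contents first):

```
mkdir -p /mnt/external/exquisite-corpus-data
ln -s /mnt/external/exquisite-corpus-data data
```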
Run:
```
snakemake -j 8
```
...and wait a day or two for results, or a crash that may tell you what you need to fix.
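If you want to preview the plan before committing that much compute, Snakemake's standard dry-run flag lists the jobs without executing them:

```
snakemake -n      # dry run: show the jobs that would be executed
snakemake -j 8    # run for real, with up to 8 parallel jobs
```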
To build a parallel corpus, run `./build.sh parallel`. If you want alignment files for an already-built parallel corpus, or want to build the parallel corpus and alignments together, run `./build.sh alignment`.