Elephant Wrapper version 0.2.3

Overview

This software is a wrapper for the Elephant tokenizer (GitHub repository here). Besides providing a few pre-trained tokenizers, Elephant can train a new tokenizer using Conditional Random Fields (Wapiti implementation) and optionally Recurrent Neural Networks (Grzegorz Chrupala's 'Elman' implementation, based on Tomas Mikolov's Recurrent Neural Networks Language Modeling Toolkit).

This wrapper aims to improve the usability of the original Elephant system. In particular, scripts are provided to facilitate the task of training a new model.

More details can be found in this paper; if you use this software, please cite it.

Installation

Obtaining this repository together with its dependencies

This git repository includes submodules, so it is recommended to clone it with:

git clone --recursive git@github.com:erwanm/elephant-wrapper.git

Alternatively, the dependencies can be downloaded separately. In this case, their executables must be accessible through the PATH environment variable.
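
For example, if the repository was already cloned without --recursive, the submodules can still be fetched afterwards; alternatively, separately installed dependencies can be added to PATH (the paths below are placeholders, not actual install locations):

cd elephant-wrapper
git submodule update --init --recursive

export PATH=$PATH:/path/to/wapiti:/path/to/elephant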

Compiling third-party components

Recommended way to compile and set up the environment:

make
make install
export PATH=$PATH:$(pwd)/bin

By default the executables are copied to the local bin directory, but this can be changed by assigning another path to PREFIX, e.g. make install PREFIX=/usr/local/bin/.
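
For instance, a per-user installation might look like this (the target directory is only an example and must be on your PATH):

make
make install PREFIX=$HOME/.local/bin/
export PATH=$PATH:$HOME/.local/bin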

Usage

Applying a tokenizer

Examples

echo "Hello my old friend, why you didn't call?" | tokenize.sh en
tokenize.sh fr <my-french-text.txt

Print a list of available language codes

tokenize.sh -P

Other options

tokenize.sh -h

Training a tokenizer

Train an Elman language model and then train a Wapiti model

With corpus.conllu as the input data (CoNLL-U format, as in the Universal Dependencies 2 data):

train-lm-from-UD-corpus.sh corpus.conllu elman.lm
train-tokenizer-from-UD-corpus.sh -e corpus.conllu patterns/code7.txt my-output-dir

Other options

For more details, most scripts in the bin directory display a usage message when executed with option -h.
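
For example (the available options differ from one script to another):

train-lm-from-UD-corpus.sh -h
train-tokenizer-from-UD-corpus.sh -h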

Experiments

Download the UD 2.x corpus

Download the data from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2515 (UD version 2.1).

Generating a tokenizer for every corpus in the UD 2.x data

The following command can be used to generate the tokenizers provided in the models directory. For every dataset, 96 patterns are tested using 5-fold cross-validation, and the optimal pattern (the one with maximum accuracy) is then used to train the final tokenizer.

With directory ud-treebanks containing the datasets in the Universal Dependencies 2.x data:

advanced-training-UD.sh -s 0.8 -l -e -m 0 -g 3,8,1,2,2,1 ud-treebanks tokenizers

Running this process will easily take several days on a modern machine. A basic way to process datasets in parallel is to use option -d, which only prints the individual command needed for each dataset. The output can be redirected to a file, and the file can then be split into the required number of batches. For instance, the following shows how to split the UD 2.1 data, which contains 102 datasets, into 17 batches of 6 datasets:

advanced-training-UD.sh -d -s 0.8 -l -e -m 0 -g 3,8,1,2,2,1 ud-treebanks-v2.1/ tokenizers >all.tasks
split -d -l 6 all.tasks batch.
for f in batch.??; do (bash $f &); done

Remark: since the datasets have different sizes, some batches will probably take longer than others.

Reproducing the experiments described in the LREC 18 paper

The experiments directory contains 4 subdirectories, each containing the scripts that were used to perform one of the experiments described in the LREC 18 paper. These scripts can also be used as examples for setting up your own experiments.

Intra-language experiment

Run this experiment with:

experiments/01-training-same-language/apply-to-all-datasets-groups.sh <output dir>

This will generate the results of the experiment for the predefined groups (files experiments/01-training-same-language/*.datasets). After the experiment, the final tables can be found in <output dir>/<language>/perf.out.
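
For example, assuming the output directory was named results and fr is one of the predefined groups (both names are purely illustrative):

cat results/fr/perf.out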

Training size experiment

experiments/02-training-size/training-size-larger-datasets.sh -l experiments/02-training-size/regular-datasets.list <UD2.1 path> 20 10 <output dir>

This will perform the training stage followed by testing on the test set, for 10 different sizes of training data (with proportional increments) and for the 20 largest datasets found in regular-datasets.list.

Option -c can be used to specify a custom list of sizes (warning: you must make sure that the datasets are large enough).

If option -d is used, the commands are only printed. This is convenient for running the processes in parallel (see the example for advanced-training-UD.sh above).
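
A sketch of the same batching approach applied to this experiment (assuming, as above, that option -d prints one command per line; the batch size of 10 is arbitrary):

experiments/02-training-size/training-size-larger-datasets.sh -d -l experiments/02-training-size/regular-datasets.list <UD2.1 path> 20 10 <output dir> >all.tasks
split -d -l 10 all.tasks batch.
for f in batch.??; do (bash $f &); done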

VMWE17 experiment

This experiment shows how to tokenize a third-party resource (here, Europarl) using a model trained on some input data. The input data must be provided in a format that gives both the tokens and the original text (e.g. .conll). Only the parts of the experiment concerned with training tokenizers from the VMWE17 data and applying these tokenizers to the Europarl data are covered here.

experiments/03-train-from-VMWE17-shared-task/train-tokenizers-from-vmwe17.sh <VMWE17 path> <output dir>
split -d -l 1 <output dir>/tasks batch.
for f in batch.??; do (bash $f &); done # Caution: very long!

Finally the models can be applied to Europarl with:

experiments/03-train-from-VMWE17-shared-task/tokenize-europarl.sh <VMWE-trained models dir> <Europarl data dir> <output dir>

This script will print the commands to run as individual files in <output dir>/tasks.
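
These task files can then be run like the batches above, for instance (a sketch assuming each file under <output dir>/tasks contains one self-contained command):

for f in <output dir>/tasks/*; do (bash "$f" &); done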

Changelog

0.2.3

0.2.2

0.2.1

0.2.0

License

Please see file LICENSE.txt in this repository for details.