Awesome
Elephant Wrapper version 0.2.3
Overview
This software is a wrapper for the Elephant tokenizer (github repo here). Besides providing a few pre-trained tokenizers, Elephant can train a new tokenizer using Conditional Random Fields (Wapiti implementation) and optionally Recurrent Neural Networks (Grzegorz Chrupala's 'Elman' implementation, based on Tomas Mikolov's Recurrent Neural Networks Language Modeling Toolkit.
This wrapper aims at improving the usability of the original Elephant system. In particular, scripts are provided to facilitate the task of training a new model:
untokenize.pl
converts files in .conllu format from the Universal Dependencies 2.x corpora to the IOB format (required for training a tokenizer);generate-patterns.pl
generates Wapiti patterns automatically;cv-tokenizers-from-UD-corpus.sh
runs cross-validation experiments with multiple pattern files, in order to select the optimal pattern for a dataset;train-multiple-tokenizers.sh
streamlines the process of training tokenizers for multiple datasets provided in the same format, like the Universal Dependencies 2.x corpora.- The models trained from the Universal Dependencies 2.x datasets; this means that this software include tokenizers for 50 languages (see usage below).
More details can be found in this paper; you can cite it btw ;)
Installation
Obtaining this repository together with its dependencies
This git repository includes submodules. This is why it is recommended to clone it with:
git clone --recursive git@github.com:erwanm/elephant-wrapper.git
Alternatively, the dependencies can be downloaded separately. In this case the executables should be accessible in the PATH
environment variable.
Compiling third-party components
Recommended way to compile and setup the environment:
make
make install
export PATH=$PATH:$(pwd)/bin
By default the executables are copied in the local bin
directory,
but this can be changed by assigning another path to PREFIX
,
e.g. make install PREFIX=/usr/local/bin/
.
Usage
Applying a tokenizer
Examples
echo "Hello my old friend, why you didn't call?" | tokenize.sh en
tokenize.sh fr <my-french-text.txt
Print a list of available language codes
tokenize.sh -P
Other options
tokenize.sh -h
Training a tokenizer
Train an Elman language model and then train a Wapiti model
With corpus.conllu
the input data (conllu
format as in Universal Dependencies 2 data):
train-lm-from-UD-corpus.sh corpus.conllu elman.lm
train-tokenizer-from-UD-corpus.sh -e corpus.conllu patterns/code7.txt my-output-dir
Other options
For more details, most scripts in the bin
directory display a usage message when executed with option -h
.
Experiments
Download the UD 2.x corpus
Download the data from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2515 (UD version 2.1).
Generating a tokenizer for every corpus in the UD 2.x data
The following command can be used to generate the tokenizers provided in models
. For every dataset, 96 patterns are tested using 5-fold cross-validation, then the optimal pattern (according to maximum accuracy) is used to train the final tokenizer.
With directory ud-treebanks
containing the datasets in the Universal Dependencies 2.x data:
advanced-training-UD.sh -s 0.8 -l -e -m 0 -g 3,8,1,2,2,1 ud-treebanks tokenizers
Runnning this process will easily take several days on a modern machine. A basic way to process datasets in parallel consists in using option -d
, which only prints the individual command needed for each dataset. The output can be redirected to a file, then the file can be split into the required number of batches. For instance, the following shows how to split the UD 2.1 data, which contains 102 datasets, into 17 batches of 6 datasets:
advanced-training-UD.sh -d -s 0.8 -l -e -m 0 -g 3,8,1,2,2,1 ud-treebanks-v2.1/ tokenizers >all.tasks
split -d -l 6 all.tasks batch.
for f in batch.??; do (bash $f &); done
Remark: since the datasets have different sizes, some batches will probably take more time than others.
Reproducing the experiments described in the LREC 18 paper
The directory experiments
contains 4 directories, each containing the scripts which were used to perform one of the experiments described in the LREC 18 paper. Of course, these scripts can also be used as examples in order to make your own experiments.
Intra-language experiment
Run this experiment with:
experiments/01-training-same-language/apply-to-all-datasets-groups.sh <output dir>
This will generate the results of the experiment for the predefined groups (files experiments/01-training-same-language/*.datasets
). After the experiment, the final tables can be found in <output dir>/<language>/perf.out
.
Training size experiment
experiments/02-training-size/training-size-larger-datasets.sh -l experiments/02-training-size/regular-datasets.list <UD2.1 path> 20 10 <output dir>
This will perform the training stage followed by testing on the test set for 10 different sizes of training data (proportional increment), and for the 20 largest datasets found in regular-datasets.list
.
Option -c
can be used to specify a custom list of sizes (warning: you must make sure that the datasets are large enough).
If using option -d
the commands will be printed. This is convenient to run the processes in parallel (see example for advanced-training.sh
above).
VMWE17 experiment
This shows how to tokenize a third-party resource (here Europarl) following a model trained on some input data. The input data must be provided in a format which gives both the tokens and the original text (e.g. .conll
). Only the parts of the experiment about training tokenizers from the VMWE17 data and applying these tokenizers to Europarl data are covered here.
experiments/03-train-from-VMWE17-shared-task/train-tokenizers-from-vmwe17.sh <VMWE17 path> <output dir>
split -d -l 1 <output dir>/tasks batch.
for f in batch.??; do (bash $f &); done # Caution: very long!
- Requires the VMWE17 data available in <VMWE17 path>.
train-tokenizers-from-vmwe17.sh
prints the commands to execute for every dataset.- Remark: if you want to skip the training stage, the models can be found in
experiments/03-train-from-VMWE17-shared-task/vmwe-trained-models.tar.bz2
.
Finally the models can be applied to Europarl with:
experiments/03-train-from-VMWE17-shared-task/tokenize-europarl.sh <VMWE-trained models dir> <Europarl data dir> <output dir>
This script will print the commands to run as individual files in <output dir>/tasks
.
- Option
-n
can be used to split Europarl into chunks which can then be processed in parallel. - Warning: the script
tokenize-line-by-line.sh
used here is very unefficient (sorry!).
Changelog
0.2.3
- [fixed] couple bugs, in particular https://github.com/erwanm/elephant-wrapper/issues/39
- [added]
experiments/00-general-perf
: script and results to reproduce the training of all the tokenizers from UD2.1 data (paper LREC 18). - [added]
experiments/01-training-same-language
: scripts and results to reproduce the intra-language experiments (paper LREC 18). - [added]
experiments/02-training-size
: scripts and results to reproduce the training size experiments (paper LREC 18). - [added]
experiments/03-train-from-VMWE17-shared-task
: scripts, models and results to reproduce the part about Europarl tokenization of the VMWE17 Shared Task experiment (paper LREC 18). - [added] Token-based precision and recall available as evaluation measures.
- [added] Script
tokenize-line-by-line.sh
(warning: very slow).
0.2.2
- [added] script advanced-training.sh to process an individual dataset with cross-validation to select the best model.
- [changed] updated pred-tokenized models: (1) better performance using optimal pattern for each dataset; (2) updating corpus to UD 2.1, with more languages/datasets
- [changed] updated version of UD data from 2.0 to 2.x in scripts and examples.
- [added] option
-d
to generate individual commands by dataset intrain-multiple-tokenizers.sh
, so that tasks can be run in parallel. - [changed] updated README documentation.
0.2.1
0.2.0
- [added] pattern files generation.
- [added] cross-validation for multiple patterns.
- [added] word-based evaluation.
License
Please see file LICENSE.txt in this repository for details.
- Elephant and Wapiti are both published under the BSD 2 Clauses license.
-
- Grzegorz Chrupala's 'Elman' is (c) Grzegorz Chrupala (no licensing information found). 'Elman' is a variant of Tomas Mikolov's Recurrent Neural Networks Language Modeling Toolkit, which is published under the BSD 3 Clauses license.
- elephant-wrapper (this repository) is published under the GPLv3 license.