Home

Awesome

efmaral

Efficient Markov Chain word alignment

efmaral is a tool for performing word alignment using Gibbs sampling with a Bayesian extension of the IBM alignment models. It is both fast and accurate. In addition, it can function as a plug-in replacement for fast_align, or be used as a library from a Python program.

Update: one major issue with efmaral is that is uses considerable amounts of memory when aligning large corpora. If this is a problem for you, consider trying eflomal instead.

References

A thorough description can be found in Östling and Tiedemann (2016) (BibTeX).

Additional SMT results can be found in Jörg Tiedemann, Fabienne Cap, Jenna Kanerva, Filip Ginter, Sara Stymne, Robert Östling, and Marion Weller-Di Marco. (BibTeX)

Installing

efmaral is implemented in Python/C and requires the following software to be installed:

These can be installed on Debian 8 (jessie) using the following command as root:

apt-get install python3-dev python3-setuptools cython3 gcc python3-numpy

Then, clone this repository and run setup.py:

git clone https://github.com/robertostling/efmaral.git
cd efmaral
python3 setup.py install

The functions you are probably most interested in are efmaral.cyalign.align and efmaral.cyalign.align_numeric (see docstrings in cyalign.pyx for usage).

If everything works, you can try directly to evaluate with a small part of the English-Hindi data set from WPT-05:

python3 scripts/evaluate.py efmaral \
    3rdparty/data/test.eng.hin.wa \
    3rdparty/data/test.eng 3rdparty/data/test.hin \
    3rdparty/data/trial.eng 3rdparty/data/trial.hin

This uses the small trial set as training data, so the actual figures are poor.

Usage

For the lazy, there is a convenience script to align in both directions and perform symmetrization. You can use it with the example data in the repo:

scripts/align_symmetrize.sh 3rdparty/data/test.eng 3rdparty/data/test.hin test.moses grow-diag-final-and

The default values of efmaral should give acceptable results, but for a full list of options, run:

./align.py --help

Given a parallel text in fast_align format, efmaral can be used in the same way as fast_align. First, we can convert some of the WPT-05 data above to the fast_align format:

python3 scripts/wpt2fastalign.py \
    3rdparty/data/test.eng 3rdparty/data/test.hin >test.fa

Then, we can run efmaral on this file:

./align.py -i test.fa >test-fwd.moses 

If we want symmetrized alignments, also run in the reverse direction, then run atools (which comes with fast_align) to symmetrize:

./align.py -r -i test.fa >test-back.moses
atools -i test-fwd.moses -j test-back.moses -c grow-diag-final-and >test.moses

efmaral also supports reading Europarl-style inputs directly, such as the WPT data, by providing two filename arguments to the -i option:

./align.py -i 3rdparty/data/test.eng 3rdparty/data/test.hin >test-fwd.moses

Performance

Update: additional evaluation results are available in Östling and Tiedemann (2016)

This is a comparison between efmaral and fast_align. Tests were run on a two-CPU Xeon E5645 2.4 GHz machine, each CPU contains 6 cores with hyperthreading. In all, 24 virtual cores are available.

The times given are the total runtime for the evaluate.py script, and includes the (rather insignificant) time used by evaluate.py itself and the atools symmetrization program.

Alignment Error Rate (AER) is used to indicate alignment accuracy (lower is better). Both programs are run with the recommended parameters.

Note: the timings should be taken with a grain of salt for both systems, since they depend critically on the number of iterations used, which could easily be adjusted depending on the performance/accuracy ratio desired. Also note that benchmarking on a 24-core system gives fast_align an advantage.

efmaral

LanguagesSentencesAERCPU time (s)Real time (s)
English-Swedish1,862,4260.1331,719620
English-French1,130,5510.085763279
English-Inkutitut340,6010.23512246
Romanian-English48,6810.28716146
English-Hindi3,5300.4839810

fast_align

LanguagesSentencesAERCPU time (s)Real time (s)
English-Swedish1,862,4260.20511,090672
English-French1,130,5510.1533,840241
English-Inuktitut340,6010.28747747
Romanian-English48,6810.32520817
English-Hindi3,5300.672242

One advantage of the EM algorithm used by fast_align is that it is easy to parallelize, unlike the collapsed Gibbs sampler efmaral uses. Therefore, fast_align is sometimes faster when aligning a single text on a system with many cores (such as this one). The CPU time column gives the total amount of time used by all CPU cores, and should be relatively constant regardless of the number of cores. In contrast, the Real time column gives the actual running time on this 24-core system.

Note that fast_align uses a fixed number of iterations, whereas efmaral tries to increase the number of sampling iterations when aligning small corpora. Currently the number of iterations is proportional to the inverse of the square root of the corpus length, but there is no deep analysis behind this decision (nor a proper evaluation, so this may change in the future).

Both tools have an unusually high CPU time/real time ratio with the very short English-Hindi corpus, this might be due to OpenMP overhead becoming noticeable.

Tips and tricks

The three most important options are probably:

Aligning very large corpora

There are two things to consider: memory usage, and the index table pointer size.

The amount of memory used is proportional to the sum of the products of the lengths of each sentence pair, i.e. \sum_{e,f} (|e| \times |f|). In practice, this amounts to about 20 GB for aligning a large Europarl bitext (such as the 1.86 million sentence English-Swedish corpus mentioned above). If you align both directions in parallel, you need 40 GB.

If the parameter vector can not fit in an unsigned 32-bit integer, it is necessary to change INDEX_t in gibbs.c as well as in cyalign.pyx. This will however further increase memory usage, so the code uses 32-bit integers by default. An error will be printed if the parameter vector becomes too large.

Implementation notes

The advantage of using Gibbs sampling is that, unlike Expectation-Maximization (EM, used in tools such as GIZA++), adding word order and fertility models do not affect the computational complexity of inference. fast_align circumvents this by using a variant of the much simpler Model 2, but this reduces accuracy.

A naively implemented Gibbs sampler can however add a large constant factor, which brings down performance. The method used here to obtain an efficient sampler is to create a pre-computed table indexing the (fixed) sequence of parameters accessed during each sampling iteration, which reduces the inner loop of the sampler to a simple dot product:

    for (size_t i=0; i<ee_len; i++) {
        ps_sum += counts[counts_idx[i]] * counts_sum[ee[i]];
        ps[i] = ps_sum;
    }

The above is the actual code used for the IBM1 Gibbs sampler. Adding the HMM word order model and a fertility model simply consists of adding two more factors to the product (wihch are faster than the above, since they don't need indirect lookups and random access in large arrays).