Awesome

efmaral

Efficient Markov Chain word alignment

efmaral is a tool for performing word alignment using Gibbs sampling with a Bayesian extension of the IBM alignment models. It is both fast and accurate. In addition, it can function as a plug-in replacement for fast_align, or be used as a library from a Python program.

Update: one major issue with efmaral is that is uses considerable amounts of memory when aligning large corpora. If this is a problem for you, consider trying eflomal instead.

References

A thorough description can be found in Östling and Tiedemann (2016) (BibTeX).

Additional SMT results can be found in Jörg Tiedemann, Fabienne Cap, Jenna Kanerva, Filip Ginter, Sara Stymne, Robert Östling, and Marion Weller-Di Marco. (BibTeX)

Installing

efmaral is implemented in Python/C and requires the following software to be installed:

Python 3 (tested with version 3.4) and setuptools
gcc (tested with version 4.9)
Cython (tested with version 0.21 and 0.20)
NumPY (tested with version 1.8.2)
fast_align (needed if you want to use the atools utility for symmetrization)

These can be installed on Debian 8 (jessie) using the following command as root:

apt-get install python3-dev python3-setuptools cython3 gcc python3-numpy

Then, clone this repository and run setup.py:

git clone https://github.com/robertostling/efmaral.git
cd efmaral
python3 setup.py install

The functions you are probably most interested in are efmaral.cyalign.align and efmaral.cyalign.align_numeric (see docstrings in cyalign.pyx for usage).

If everything works, you can try directly to evaluate with a small part of the English-Hindi data set from WPT-05:

python3 scripts/evaluate.py efmaral \
    3rdparty/data/test.eng.hin.wa \
    3rdparty/data/test.eng 3rdparty/data/test.hin \
    3rdparty/data/trial.eng 3rdparty/data/trial.hin

This uses the small trial set as training data, so the actual figures are poor.

Usage

For the lazy, there is a convenience script to align in both directions and perform symmetrization. You can use it with the example data in the repo:

scripts/align_symmetrize.sh 3rdparty/data/test.eng 3rdparty/data/test.hin test.moses grow-diag-final-and

The default values of efmaral should give acceptable results, but for a full list of options, run:

./align.py --help

Given a parallel text in fast_align format, efmaral can be used in the same way as fast_align. First, we can convert some of the WPT-05 data above to the fast_align format:

python3 scripts/wpt2fastalign.py \
    3rdparty/data/test.eng 3rdparty/data/test.hin >test.fa

Then, we can run efmaral on this file:

./align.py -i test.fa >test-fwd.moses

If we want symmetrized alignments, also run in the reverse direction, then run atools (which comes with fast_align) to symmetrize:

./align.py -r -i test.fa >test-back.moses
atools -i test-fwd.moses -j test-back.moses -c grow-diag-final-and >test.moses

efmaral also supports reading Europarl-style inputs directly, such as the WPT data, by providing two filename arguments to the -i option:

./align.py -i 3rdparty/data/test.eng 3rdparty/data/test.hin >test-fwd.moses

Performance

Update: additional evaluation results are available in Östling and Tiedemann (2016)

This is a comparison between efmaral and fast_align. Tests were run on a two-CPU Xeon E5645 2.4 GHz machine, each CPU contains 6 cores with hyperthreading. In all, 24 virtual cores are available.

The times given are the total runtime for the evaluate.py script, and includes the (rather insignificant) time used by evaluate.py itself and the atools symmetrization program.

Alignment Error Rate (AER) is used to indicate alignment accuracy (lower is better). Both programs are run with the recommended parameters.

Note: the timings should be taken with a grain of salt for both systems, since they depend critically on the number of iterations used, which could easily be adjusted depending on the performance/accuracy ratio desired. Also note that benchmarking on a 24-core system gives fast_align an advantage.

efmaral

Languages	Sentences	AER	CPU time (s)	Real time (s)
English-Swedish	1,862,426	0.133	1,719	620
English-French	1,130,551	0.085	763	279
English-Inkutitut	340,601	0.235	122	46
Romanian-English	48,681	0.287	161	46
English-Hindi	3,530	0.483	98	10

fast_align

Languages	Sentences	AER	CPU time (s)	Real time (s)
English-Swedish	1,862,426	0.205	11,090	672
English-French	1,130,551	0.153	3,840	241
English-Inuktitut	340,601	0.287	477	47
Romanian-English	48,681	0.325	208	17
English-Hindi	3,530	0.672	24	2

One advantage of the EM algorithm used by fast_align is that it is easy to parallelize, unlike the collapsed Gibbs sampler efmaral uses. Therefore, fast_align is sometimes faster when aligning a single text on a system with many cores (such as this one). The CPU time column gives the total amount of time used by all CPU cores, and should be relatively constant regardless of the number of cores. In contrast, the Real time column gives the actual running time on this 24-core system.

Note that fast_align uses a fixed number of iterations, whereas efmaral tries to increase the number of sampling iterations when aligning small corpora. Currently the number of iterations is proportional to the inverse of the square root of the corpus length, but there is no deep analysis behind this decision (nor a proper evaluation, so this may change in the future).

Both tools have an unusually high CPU time/real time ratio with the very short English-Hindi corpus, this might be due to OpenMP overhead becoming noticeable.

Tips and tricks

The three most important options are probably:

--length: number of iterations relative to the number determined automatically (based on the number of sentences). The default value of 1.0 means no change, whereas e.g. 0.2 would result in a 5x speedup (unless one hits the bottom number of 4 or top value of 250 iterations) -- probably at the cost of some accuracy.
--null-prior: prior probability of NULL alignment, this can to some extent be used to trade recall for precision, by setting a higher value than the default 0.2 (or vice versa, with a lower value).
--samplers: number of independent samplers to use. This reduces initialization bias, and is the only way for efmaral to utilize multiple CPU cores since each is run in a separate thread. The default is 2, but for maximum speed use a value of 1. A value much higher than 4 is probably not very useful.

Aligning very large corpora

There are two things to consider: memory usage, and the index table pointer size.

The amount of memory used is proportional to the sum of the products of the lengths of each sentence pair, i.e. \sum_{e,f} (|e| \times |f|). In practice, this amounts to about 20 GB for aligning a large Europarl bitext (such as the 1.86 million sentence English-Swedish corpus mentioned above). If you align both directions in parallel, you need 40 GB.

If the parameter vector can not fit in an unsigned 32-bit integer, it is necessary to change INDEX_t in gibbs.c as well as in cyalign.pyx. This will however further increase memory usage, so the code uses 32-bit integers by default. An error will be printed if the parameter vector becomes too large.

Implementation notes

The advantage of using Gibbs sampling is that, unlike Expectation-Maximization (EM, used in tools such as GIZA++), adding word order and fertility models do not affect the computational complexity of inference. fast_align circumvents this by using a variant of the much simpler Model 2, but this reduces accuracy.

A naively implemented Gibbs sampler can however add a large constant factor, which brings down performance. The method used here to obtain an efficient sampler is to create a pre-computed table indexing the (fixed) sequence of parameters accessed during each sampling iteration, which reduces the inner loop of the sampler to a simple dot product:

    for (size_t i=0; i<ee_len; i++) {
        ps_sum += counts[counts_idx[i]] * counts_sum[ee[i]];
        ps[i] = ps_sum;
    }

The above is the actual code used for the IBM1 Gibbs sampler. Adding the HMM word order model and a fertility model simply consists of adding two more factors to the product (wihch are faster than the above, since they don't need indirect lookups and random access in large arrays).