eflomal

Efficient Low-Memory Aligner

This is a word alignment tool based on efmaral, with the following main differences:

Technical details relevant to both efmaral and eflomal can be found in the following article:

Installing

To install the complete Python package, run:

python -m pip install .

If you want to compile and install only the C binary, run:

make -C src
sudo make -C src install

Change the INSTALLDIR parameter in the install step if you want to install somewhere other than the default /usr/local/bin (e.g. make -C src -e INSTALLDIR=~/bin install).

Using

There are three main ways of using eflomal:

  1. Directly call the eflomal binary. Note that this requires some preprocessing.
  2. Use the eflomal-align command-line interface, which is partly compatible with that of efmaral. Run eflomal-align --help for instructions.
  3. Use the Cython module to call the eflomal binary. This takes care of the necessary preprocessing and file conversions. See the docstrings in eflomal.pyx for documentation.

In addition, there are convenience scripts for aligning and symmetrizing (with the atools program from fast_align) as well as for evaluating with data from the WPT shared task datasets. These work the same way as in efmaral; please see its README for details.

Input data format

When used with the -s and -t options for separate source/target files, the eflomal-align interface expects one sentence per line with space-separated tokens, similar to most word alignment software.

The -i option assumes a fast_align style joint source/target file of the format

source sentence ||| target sentence
another source sentence ||| another target sentence
...
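If your data is in two separate line-aligned files, the joint format can be produced with a few lines of Python. This is a minimal sketch, not part of eflomal; the helper name join_parallel is made up for illustration.

```python
def join_parallel(src_lines, trg_lines):
    """Yield 'source ||| target' lines from two line-aligned iterables.

    Hypothetical helper for illustration; not part of eflomal.
    """
    for src, trg in zip(src_lines, trg_lines):
        # split() drops the trailing newline and collapses repeated
        # whitespace, so tokens end up separated by single spaces.
        src_tokens = src.split()
        trg_tokens = trg.split()
        yield ' '.join(src_tokens) + ' ||| ' + ' '.join(trg_tokens)


joint = list(join_parallel(['a black cat\n'], ['kuro neko\n']))
print(joint[0])  # a black cat ||| kuro neko
```

Note that zip() stops at the shorter input, so check that both files have the same number of lines before relying on the result.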

The --priors option expects a file generated by eflomal-makepriors (see below). This file contains user-specified lexical, HMM and/or fertility distribution priors. Since the algorithm is asymmetric, HMM and fertility priors can be stored for both the forward and reverse directions. eflomal-makepriors handles this automatically; see the examples below.

Note that the default value of the Dirichlet priors (defined in eflomal.c as LEX_ALPHA, JUMP_ALPHA and FERT_ALPHA) will be added to whatever is specified in the priors file. This means that integer counts for whatever word forms you have data on are fine in the priors file.
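To make the arithmetic concrete, here is a sketch of how a count from the priors file combines with a default. The value given to LEX_ALPHA below is a placeholder chosen for illustration, not the value defined in eflomal.c, and the dictionary is a hypothetical in-memory stand-in rather than the priors file format.

```python
LEX_ALPHA = 0.5  # placeholder value, NOT the actual default in eflomal.c

# Hypothetical integer counts for (source word, target word) pairs.
lex_counts = {('cat', 'neko'): 3, ('black', 'kuro'): 1}


def effective_lex_prior(src, trg):
    """Return the default prior plus the count given for this pair, if any."""
    return LEX_ALPHA + lex_counts.get((src, trg), 0)


print(effective_lex_prior('cat', 'neko'))  # 3.5
print(effective_lex_prior('cat', 'kuro'))  # 0.5
```

Under this reading, pairs that do not appear in the priors file are presumably left with the default alone.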

It is possible to use the special <NULL> token in the priors file, in case you want to encourage certain word forms to remain unaligned. Currently the eflomal-makepriors script does not generate these, and this feature has not been tested yet.

Generating priors

If you have a large file that you want to use as "training data", en-sv, and a small file that you later want to align quickly, en-sv.small, start by aligning the large file as usual, e.g.:

eflomal-align -i en-sv --model 3 -f en-sv.fwd -r en-sv.rev

Now you can generate priors based on this large aligned file, stored in en-sv.priors:

eflomal-makepriors -i en-sv -f en-sv.fwd -r en-sv.rev --priors en-sv.priors

Alternatively, you can symmetrize en-sv.fwd and en-sv.rev into en-sv.sym and pass the same file to both -f and -r:

atools -c grow-diag-final-and -i en-sv.fwd -j en-sv.rev >en-sv.sym
eflomal-makepriors -i en-sv -f en-sv.sym -r en-sv.sym --priors en-sv.priors

Now, if you have another file to align, en-sv.small, simply use e.g.:

eflomal-align -i en-sv.small --priors en-sv.priors --model 3 \
    -f en-sv.small.fwd -r en-sv.small.rev

This will be much faster than merging en-sv and en-sv.small and aligning them jointly, while being nearly as accurate (assuming en-sv.small is much smaller than en-sv).

Output data format

The alignment output contains the same number of lines as the input files, where each line contains pairs of zero-based token indexes. For instance, if the source input contains the following:

a black cat

and the target input is the following:

kuro neko

the correct output would be:

1-0 2-1

That is, 1-0 indicates token 1 of the source (black) is aligned to token 0 of the target (kuro), and 2-1 that token 2 of the source (cat) is aligned to token 1 of the target (neko). NULL alignments are not present in the output.
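The example above can be checked with a short Python sketch that parses an output line and looks up the aligned tokens. The helper name parse_links is made up for illustration and is not part of the eflomal package.

```python
def parse_links(line):
    """Parse a line such as '1-0 2-1' into (source, target) index pairs."""
    return [tuple(int(i) for i in pair.split('-')) for pair in line.split()]


source = 'a black cat'.split()
target = 'kuro neko'.split()

links = parse_links('1-0 2-1')
aligned_words = [(source[s], target[t]) for s, t in links]
print(aligned_words)  # [('black', 'kuro'), ('cat', 'neko')]
```

The source token "a" has no link, so it simply does not appear in the result.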

Note that the forward and reverse alignments both use source-target order, so the output can be fed directly to atools (see scripts/align_symmetrize.sh for an example).
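Because both files use source-target order, the forward and reverse links for a sentence can be compared directly, without swapping indexes first. The sketch below shows the simplest symmetrization, plain intersection. It is an illustration only: it is not atools, and it is not the grow-diag-final-and heuristic used in the examples above. The function names are hypothetical.

```python
def parse_link_set(line):
    """Parse a line such as '1-0 2-1' into a set of (source, target) pairs."""
    return {tuple(int(i) for i in pair.split('-')) for pair in line.split()}


def intersect(fwd_line, rev_line):
    """Keep only the links present in both directions, in sorted order."""
    both = parse_link_set(fwd_line) & parse_link_set(rev_line)
    return ' '.join(f'{s}-{t}' for s, t in sorted(both))


print(intersect('0-0 1-0 2-1', '1-0 2-1'))  # 1-0 2-1
```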

In case you made a mistake with the direction, you can fix it afterwards with scripts/reverse_moses.py.
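Reversing the direction amounts to swapping the two indexes in every pair. The following is a minimal sketch of that operation, not the code of scripts/reverse_moses.py; the function name reverse_links is made up for illustration.

```python
def reverse_links(line):
    """Swap the two indexes of every pair in one alignment line."""
    pairs = (pair.split('-') for pair in line.split())
    return ' '.join(f'{t}-{s}' for s, t in pairs)


print(reverse_links('1-0 2-1'))  # 0-1 1-2
```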

Python interface

The Python package provides an interface for aligning and estimating priors. Here is a simple example using the files in testdata:

import eflomal

aligner = eflomal.Aligner()

with open('test1.sv', 'r', encoding='utf-8') as src_data, \
     open('test1.en', 'r', encoding='utf-8') as trg_data, \
     open('test1.priors', 'r', encoding='utf-8') as priors_data:
    # Align with priors
    aligner.align(
        src_data, trg_data,
        links_filename_fwd='sv-en.fwd', links_filename_rev='sv-en.rev',
        priors_input=priors_data)

with open('test1.sv', 'r', encoding='utf-8') as src_data, \
     open('test1.en', 'r', encoding='utf-8') as trg_data, \
     open('sv-en.fwd', 'r', encoding='utf-8') as fwd_links, \
     open('sv-en.rev', 'r', encoding='utf-8') as rev_links, \
     open('sv-en.priors', 'w', encoding='utf-8') as priors_f:
    # Estimate priors
    priors_tuple = eflomal.calculate_priors(
        src_data, trg_data, fwd_links, rev_links)
    # Write priors to file
    eflomal.write_priors(priors_f, *priors_tuple)

Note that the output files for Aligner.align() are given as paths, not file objects, as they are written directly by the eflomal binary.

Performance

This is a comparison between eflomal, efmaral and fast_align.

The difference between efmaral and eflomal is in part due to different default parameters, in particular the number of iterations and the number of independent samplers.

Note that all timing figures below include alignments in both directions (run in parallel) and symmetrization.
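The AER column in the tables below is the alignment error rate. As commonly defined, with sure gold links S, possible gold links P (which include S) and predicted links A, AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), so lower is better. The sketch below computes this for a single set of links. It is an illustration under that standard definition, not the evaluation script shipped with eflomal, and the function name aer is hypothetical.

```python
def aer(predicted, sure, possible):
    """Alignment error rate for sets of (source, target) index pairs."""
    # Sure links count as possible links too.
    possible = possible | sure
    if not predicted and not sure:
        return 0.0
    hits = len(predicted & sure) + len(predicted & possible)
    return 1.0 - hits / (len(predicted) + len(sure))


gold_sure = {(1, 0), (2, 1)}
predicted = {(0, 0), (1, 0), (2, 1)}  # one spurious link, (0, 0)

# Two correct links out of three predicted and two sure: 1 - 4/5.
print(aer(predicted, gold_sure, set()))
```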

eflomal

Languages          Sentences  AER    CPU time (s)  Real time (s)
English-French     1,130,551  0.081  1,232         337
English-Inuktitut  340,601    0.203  161           44
Romanian-English   48,681     0.298  159           33
English-Hindi      3,530      0.467  31            6

efmaral

Languages          Sentences  AER    CPU time (s)  Real time (s)
English-Swedish    1,862,426  0.133  1,719         620
English-French     1,130,551  0.085  763           279
English-Inuktitut  340,601    0.235  122           46
Romanian-English   48,681     0.287  161           46
English-Hindi      3,530      0.483  98            10

fast_align

Languages          Sentences  AER    CPU time (s)  Real time (s)
English-Swedish    1,862,426  0.205  11,090        672
English-French     1,130,551  0.153  3,840         241
English-Inuktitut  340,601    0.287  477           47
Romanian-English   48,681     0.325  208           17
English-Hindi      3,530      0.672  24            2