L2Copy4APE

Introduction

This is the code for the paper "Learning to Copy for Automatic Post-Editing" (EMNLP 2019). The implementation is built on top of the open-source NMT toolkit THUMT. You may want to skim the THUMT user manual to learn its basic usage.

Prerequisites

Datasets

Training, development, and test data consist of English-German triplets (source, target, and post-edit) from the IT domain, and are already tokenized.

Preprocess

We apply truecasing (train-truecaser.perl, truecase.perl) and BPE (learn_joint_bpe_and_vocab.py, apply_bpe.py) to these datasets.
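As a concrete reference, the truecasing and BPE steps with these tools might look like the sketch below. The paths, the 32k merge-operation count, and the vocabulary threshold of 50 are assumptions for illustration, not values taken from the paper; adjust them to your setup.

```shell
moses=/path/to/mosesdecoder
subword=/path/to/subword-nmt

# 1. Learn a truecasing model per language on the training side.
perl $moses/scripts/recaser/train-truecaser.perl --model tc.en --corpus train.en
perl $moses/scripts/recaser/train-truecaser.perl --model tc.de --corpus train.de

# 2. Apply truecasing.
perl $moses/scripts/recaser/truecase.perl --model tc.en < train.en > train.tc.en
perl $moses/scripts/recaser/truecase.perl --model tc.de < train.de > train.tc.de

# 3. Learn joint BPE codes and per-language vocabularies.
python $subword/learn_joint_bpe_and_vocab.py \
    --input train.tc.en train.tc.de -s 32000 \
    -o bpe.codes --write-vocabulary vocab.en vocab.de

# 4. Apply BPE with a vocabulary filter.
python $subword/apply_bpe.py -c bpe.codes \
    --vocabulary vocab.en --vocabulary-threshold 50 < train.tc.en > train.bpe.en
python $subword/apply_bpe.py -c bpe.codes \
    --vocabulary vocab.de --vocabulary-threshold 50 < train.tc.de > train.bpe.de
```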

For the PBSMT dataset:

  1. learn truecase model and BPE from WMT16train + WMT17train + artificial-small (549k)
  2. apply truecasing and BPE to the 549k dataset
  3. 5M dataset = (WMT16train + WMT17train)*20 + artificial-small + artificial-big
  4. 12M dataset = 5M dataset + eSCAPE-PBSMT
  5. apply truecasing and BPE to these datasets
  6. generate copying labels file using LCS algorithm

Note that the eSCAPE-PBSMT dataset needs to be tokenized with tokenizer.perl first:

tokenizer.perl -no-escape -l $lang -threads 10 -lines 10000 -time < ${corpus} > ${corpus}.tok
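The LCS-based copy-label step can be sketched as follows: each MT token gets label 1 if it lies on a longest common subsequence with the post-edit, else 0. This is a hypothetical reimplementation; the repo's own script may differ in label values or output format.

```python
def lcs_copy_labels(mt_tokens, pe_tokens):
    """Label each MT token 1 if it is part of an LCS with the post-edit, else 0."""
    m, n = len(mt_tokens), len(pe_tokens)
    # Standard LCS dynamic program.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if mt_tokens[i - 1] == pe_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack to find which MT tokens participate in the LCS.
    labels = [0] * m
    i, j = m, n
    while i > 0 and j > 0:
        if mt_tokens[i - 1] == pe_tokens[j - 1]:
            labels[i - 1] = 1
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return labels

mt = "the cat sat on mat".split()
pe = "the cat sat on the mat".split()
print(lcs_copy_labels(mt, pe))  # → [1, 1, 1, 1, 1]
```

One labels file line per training triplet, with labels aligned to the MT tokens, matches what the trainer's ${train_label} input suggests.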

For the NMT dataset:

  1. learn truecase model and BPE from WMT18train + artificial-small (539k)
  2. apply truecasing and BPE to the 539k dataset
  3. 5M dataset = WMT18train*20 + artificial-small + artificial-big
  4. apply truecasing and BPE to the 5M dataset
  5. generate copying labels file using LCS algorithm

Finally, shuffle these datasets and generate joint vocabulary.
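When shuffling, the source, MT, post-edit, and label files must be permuted with the same permutation so triplets stay aligned. A minimal sketch (the actual preprocessing may use a different script):

```python
import random

def shuffle_parallel(files, seed=1234):
    """Apply one shared random permutation to every parallel list of lines."""
    n = len(files[0])
    assert all(len(lines) == n for lines in files), "parallel files must have equal length"
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    return [[lines[i] for i in perm] for lines in files]

src = ["s1", "s2", "s3"]
mt  = ["m1", "m2", "m3"]
pe  = ["p1", "p2", "p3"]
src_s, mt_s, pe_s = shuffle_parallel([src, mt, pe])
# Triplets stay aligned after shuffling: matching indices carry matching suffixes.
assert [s[1:] for s in src_s] == [m[1:] for m in mt_s] == [p[1:] for p in pe_s]
```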

Training

We employ two-step training:

  1. Train with interactive+CopyNet:
PYTHONPATH=/path/to/L2Copy4APE/ \
python /path/to/L2Copy4APE/thumt/bin/trainer.py \
    --input ${train_src} ${train_mt} ${train_pe} ${train_label} \
    --vocabulary ${vocab_joint} ${vocab_joint} \
    --model transformer \
    --validation ${dev_src} \
    --references ${dev_mt} ${dev_pe} ${dev_label} \
    --parameters eval_steps=500,constant_batch_size=false,batch_size=6250,train_steps=50000,save_checkpoint_steps=9999999,shared_embedding_and_softmax_weights=true,device_list=[0,1,2,3],decode_alpha=1.0,eval_batch_size=64,update_cycle=1,keep_top_checkpoint_max=10,shared_source_target_embedding=true,keep_checkpoint_max=1,enable_tagger=False
mkdir -p average
python /path/to/L2Copy4APE/thumt/scripts/checkpoint_averaging.py --path train/eval --output average --checkpoints 10

Save the averaged model under pretrained/average.

  2. Train with interactive+CopyNet+Predictor+Joint Training:
PYTHONPATH=/path/to/L2Copy4APE/ \
python /path/to/L2Copy4APE/thumt/bin/trainer.py \
    --input ${train_src} ${train_mt} ${train_pe} ${train_label} \
    --vocabulary ${vocab_joint} ${vocab_joint} \
    --model transformer \
    --validation ${dev_src} \
    --references ${dev_mt} ${dev_pe} ${dev_label} \
    --checkpoint /path/to/pretrained/average \
    --parameters eval_steps=500,constant_batch_size=false,batch_size=6250,train_steps=50000,save_checkpoint_steps=9999999,shared_embedding_and_softmax_weights=true,device_list=[0,1,2,3],decode_alpha=1.0,eval_batch_size=64,update_cycle=2,keep_top_checkpoint_max=10,shared_source_target_embedding=true,keep_checkpoint_max=1,multi_task_alpha=0.9,where_to_apply='enc',copy_lambda=1.0,enable_tagger=True
mkdir -p average
python /path/to/L2Copy4APE/thumt/scripts/checkpoint_averaging.py --path train/eval --output average --checkpoints 10

Save the averaged model under final/average.

Note that pre-training seems to bring no benefit on the 5M NMT dataset, so we suggest training the model directly as in step 2 on that dataset.

Inference

PYTHONPATH=/path/to/L2Copy4APE/ \
python /path/to/L2Copy4APE/thumt/bin/translator.py \
    --input $src $mt \
    --output $trg \
    --vocabulary ${vocab_joint} ${vocab_joint} \
    --checkpoints /path/to/final/average \
    --model transformer \
    --parameters device_list=[$devs],decode_alpha=1.0,decode_batch_size=64,shared_embedding_and_softmax_weights=true,shared_source_target_embedding=true,where_to_apply='enc',enable_tagger=True
sed -r 's/(@@ )|(@@ ?$)//g' < ${trg} > ${trg}.rbpe
perl ${moses_dir}/scripts/recaser/detruecase.perl < ${trg}.rbpe > ${trg}.rbpe.detc
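The sed command above merges BPE subwords by stripping the '@@' continuation markers. The same step in Python, useful if sed is unavailable (a convenience sketch, not part of the repo):

```python
import re

def remove_bpe(line):
    """Join subword units marked with the '@@' continuation suffix,
    mirroring: sed -r 's/(@@ )|(@@ ?$)//g'"""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)

print(remove_bpe("Pro@@ gram@@ mier@@ fehler beheben"))  # → "Programmierfehler beheben"
```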

Hyper-params

Evaluation

Download

We provide pre-trained models, so you can skip the training process and run inference directly.

Citation

If you use our code, please cite our paper:

@inproceedings{huang-etal-2019-learning,
    title = "Learning to Copy for Automatic Post-Editing",
    author = "Huang, Xuancheng  and
      Liu, Yang  and
      Luan, Huanbo  and
      Xu, Jingfang  and
      Sun, Maosong",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1634",
    doi = "10.18653/v1/D19-1634",
    pages = "6124--6134",
    abstract = "Automatic post-editing (APE), which aims to correct errors in the output of machine translation systems in a post-processing step, is an important task in natural language processing. While recent work has achieved considerable performance gains by using neural networks, how to model the copying mechanism for APE remains a challenge. In this work, we propose a new method for modeling copying for APE. To better identify translation errors, our method learns the representations of source sentences and system outputs in an interactive way. These representations are used to explicitly indicate which words in the system outputs should be copied. Finally, CopyNet (Gu et al., 2016) can be combined with our method to place the copied words in correct positions in post-edited translations. Experiments on the datasets of the WMT 2016-2017 APE shared tasks show that our approach outperforms all best published results.",
}

Contact

If you have questions, suggestions, or bug reports, please email hxc17@mails.tsinghua.edu.cn.