
TreeSwap

Companion code for our paper TreeSwap: Data Augmentation for Machine Translation via Dependency Subtree Swapping, accepted at RANLP 2023.

Building the data augmentation package

The data augmentation package uses Poetry for packaging and dependency management.

NOTE: Mac users need to install graphviz before following the installation steps:
sudo chown -R $(whoami) /usr/local/bin
brew install graphviz
sudo chown -R root /usr/local/bin

Current server setup

To use all the features in the repo, set up the environment as follows:

conda create --name my-env python=3.8.5
conda activate my-env

pip install -r requirements.txt
pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
conda install -c conda-forge sentencepiece=0.1.95 sacrebleu=1.5.1 fasttext=0.9.2 yq=2.13.0
conda install libgcc
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/miniconda3/lib/

cd src
poetry install

Setup

To install all the necessary dependencies, just run:

cd src/hu_nmt
poetry install
pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
conda install -c conda-forge fasttext=0.9.2 yq=2.13.0

Download model for language detection (used in preprocessing)

wget -O /tmp/lid.176.bin https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
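
As a rough illustration of how the downloaded model can be used, the sketch below queries it through the fasttext Python package and keeps a sentence pair only when both sides are detected as the expected language. The helper names (load_lid_model, detect_language, keep_pair) are hypothetical, and the filtering rule is an assumption; the preprocessing scripts in this repo may apply the model differently.

```python
def load_lid_model(path="/tmp/lid.176.bin"):
    """Load the fastText language identification model downloaded above."""
    import fasttext  # imported lazily so the helpers below can be used without it
    return fasttext.load_model(path)


def detect_language(model, text):
    """Return (language_code, confidence) for a single sentence."""
    # fastText's predict rejects strings that contain newlines.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    # Labels look like "__label__en"; strip the prefix to get the language code.
    return labels[0].replace("__label__", ""), float(probs[0])


def keep_pair(model, src, tgt, src_lang, tgt_lang):
    """Assumed filtering rule: keep a pair only if both sides match the expected language."""
    return (
        detect_language(model, src)[0] == src_lang
        and detect_language(model, tgt)[0] == tgt_lang
    )
```

With the model loaded via load_lid_model(), a call such as keep_pair(model, hungarian_sentence, english_sentence, "hu", "en") would decide whether a pair stays in the corpus.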

All installed dependencies are written to a poetry.lock file.

If you already have a Poetry environment and want to resume work:

git pull
poetry update

Running poetry update also updates the lock file.

You can also launch a shell in your terminal:

poetry shell

To set up PyCharm with this virtual environment, just configure it as the project interpreter.

You can obtain the path of the virtualenv with:

poetry env info --path

Running augmentation

The augment.sh script uses the following parameters from config.yaml:

data.original
augmentation hyperparameters:
- augmentation_type: ged/edge_mapper/base
- similarity_threshold
- augmentation_ratio

Example augmentation config

# go to the directory of the experiment
cd opennmt/experiments/runs/simple_aug_example

# set the data path in the config file
vim config.yaml

../../../bash_scripts/augment.sh

Training models

Setup

Create a new conda environment:

conda create --name my-env python=3.8.5
conda activate my-env

Install the required packages:

pip install -r requirements.txt
pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
conda install -c conda-forge sentencepiece=0.1.95 sacrebleu=1.5.1 fasttext=0.9.2 yq=2.13.0

If you get the following error during vocabulary building:

ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (...)

run the following lines one by one in the given order:

conda install libgcc #1
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/miniconda3/lib/ #2

Run

To train a model, you need a config file like this one, in which you specify all the model parameters and data paths following the OpenNMT documentation (build vocab, train, translate), along with the additional parameters for our scripts.

After you have set up your config.yaml file, build your vocabularies (you only have to do this once). Once the vocabularies exist, you can call the full_train.sh script, which trains your model based on your config, translates your validation set and evaluates BLEU. It also tracks your run as described in the Experiment tracking section.

# create directory for new experiment
cd opennmt/experiments/runs
mkdir new_experiment
cd new_experiment

# create config file
vim config.yaml

# build vocabulary
../../../bash_scripts/1_build_vocab.sh

# run training with evaluation and experiment tracking
../../../bash_scripts/full_train.sh

You can also run the model training and evaluation steps separately with the scripts found in the opennmt/bash_scripts directory.

Experiment tracking

When you run a full training or just the 8_save_history.sh script, your experiment will be tracked.

It saves the following files to the history directory, in a folder named after the date and time you ran your experiment:

config file
final result
final translation of the validation set
tensorboard logs
translation pairs
best model

It saves the following in the history.tsv file in the history directory:

all the parameters specified in the config file (if there are nested fields they are represented as a.b)
date - when the experiment was run
history_path - corresponding history directory
bleu_score - overall BLEU score
bleu_score_n - ngram BLEU score
git_hash - hash of the git commit that was used

If a new parameter is added to the config, the previous runs will have None as the value for that parameter.
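
The two conventions above (nested config fields written as a.b, and None for parameters that older runs never had) can be sketched as follows. This is a minimal illustration using only the standard library, not the code behind 8_save_history.sh; the function names flatten and history_table and the example values are hypothetical.

```python
import csv
import io


def flatten(config, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def history_table(rows):
    """Render runs as TSV text; parameters a run never had are written as None."""
    # Collect column names in first-seen order across all runs.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, delimiter="\t", restval="None", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Illustrative configs: the second run introduces a parameter the first one lacks.
old_run = flatten({"data": {"original": "train.tsv"}})
new_run = flatten({"data": {"original": "train.tsv"}, "augmentation_ratio": 0.5})
print(history_table([old_run, new_run]))
```

Here the older run gets None in the augmentation_ratio column, while data.original is shared by both rows.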

Datasets

The preprocessed datasets and the train/dev/test splits used in the experiments for our paper, TreeSwap: Data Augmentation for Machine Translation via Dependency Subtree Swapping, can be found here.

Trained models

Our trained hu-en and en-hu models from the paper Syntax-based data augmentation for Hungarian-English machine translation are available on the HuggingFace Model Hub, together with usage steps:

HU-EN
EN-HU

Citation

If you use our method please cite the following papers:

@inproceedings{nagy-etal-2023-treeswap,
    title = "{T}ree{S}wap: Data Augmentation for Machine Translation via Dependency Subtree Swapping",
    author = "Nagy, Attila  and
      Lakatos, Dorina  and
      Barta, Botond  and
      {\'A}cs, Judit",
    editor = "Mitkov, Ruslan  and
      Angelova, Galia",
    booktitle = "Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.ranlp-1.82",
    pages = "759--768",
    abstract = "Data augmentation methods for neural machine translation are particularly useful when limited amount of training data is available, which is often the case when dealing with low-resource languages. We introduce a novel augmentation method, which generates new sentences by swapping objects and subjects across bisentences. This is performed simultaneously based on the dependency parse trees of the source and target sentences. We name this method TreeSwap. Our results show that TreeSwap achieves consistent improvements over baseline models in 4 language pairs in both directions on resource-constrained datasets. We also explore domain-specific corpora, but find that our method does not make significant improvements on law, medical and IT data. We report the scores of similar augmentation methods and find that TreeSwap performs comparably. We also analyze the generated sentences qualitatively and find that the augmentation produces a correct translation in most cases. Our code is available on Github.",
}

@inproceedings{nagy2023syntax,
    title = {{Data Augmentation for Machine Translation via Dependency Subtree Swapping}},
    author = {Nagy, Attila and Lakatos, Dorina and Barta, Botond and Nanys, Patrick and {\'{A}}cs, Judit},
    booktitle = {XIX. Conference on Hungarian Computational Linguistics.},
    year = {2023},
}