TreeSwap
Complementary code for our paper TreeSwap: Data Augmentation for Machine Translation via Dependency Subtree Swapping, accepted at RANLP 2023.
Building the data augmentation package
The data augmentator uses Poetry for packaging and dependency management.
NOTE: Mac users need to install graphviz before following the installation steps below.
sudo chown -R $(whoami) /usr/local/bin
brew install graphviz
sudo chown -R root /usr/local/bin
Current server setup
To use all the features in the repo, run:
conda create --name my-env python=3.8.5
conda activate my-env
pip install -r requirements.txt
pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
conda install -c conda-forge sentencepiece=0.1.95 sacrebleu=1.5.1 fasttext=0.9.2 yq=2.13.0
conda install libgcc
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/miniconda3/lib/
cd src
poetry install
Setup
To install all the necessary dependencies, just run:
cd src/hu_nmt
poetry install
pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
conda install -c conda-forge fasttext=0.9.2 yq=2.13.0
Download the model for language detection (used in preprocessing):
wget -O /tmp/lid.176.bin https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
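How the preprocessing step calls this model is not shown here. As a minimal sketch, assuming the standard fasttext Python API (load_model and predict), the downloaded file could be used for language detection as follows. The helper names load_language_model and detect_language are hypothetical and are not part of this repo.

```python
from typing import Tuple

MODEL_PATH = "/tmp/lid.176.bin"  # path used by the wget command above


def load_language_model(path: str = MODEL_PATH):
    """Load the fastText language identification model (hypothetical helper)."""
    import fasttext  # imported lazily so detect_language can be used without it

    return fasttext.load_model(path)


def detect_language(model, text: str) -> Tuple[str, float]:
    """Return (language code, confidence) for a single sentence.

    `model` is any object with a fastText-style `predict(text, k=...)` method.
    """
    # fastText predicts one line at a time, so newlines have to be removed.
    labels, probabilities = model.predict(text.replace("\n", " "), k=1)
    # Labels look like "__label__en"; strip the prefix to get the language code.
    return labels[0].replace("__label__", ""), float(probabilities[0])


# Example (requires the fasttext package and the downloaded model):
# model = load_language_model()
# print(detect_language(model, "Ez egy magyar mondat."))
```

A filter built on this would typically keep a sentence pair only if both sides are detected as the expected languages. Whether the preprocessing in this repo works exactly that way is an assumption.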
All installed dependencies are written to a poetry.lock file.
If you already have a Poetry environment and want to resume work:
git pull
poetry update
Running poetry update also updates the lock file.
You can also launch a shell inside the virtual environment:
poetry shell
To set up PyCharm with this virtual environment, just configure it as the project interpreter.
You can obtain the path of the virtualenv with:
poetry env info --path
Running augmentation
The augment.sh script uses the following parameters from config.yaml:
- data.original
- augmentation hyperparameters:
  - augmentation_type: ged / edge_mapper / base
  - similarity_threshold
  - augmentation_ratio
# go to the directory of the example experiment
cd opennmt/experiments/runs/simple_aug_example
# set the data path in the config file
vim config.yaml
../../../bash_scripts/augment.sh
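Before launching augment.sh, it can help to sanity-check the augmentation hyperparameters. The sketch below is not part of the repo: check_augmentation_params is a hypothetical helper, it expects the hyperparameters as a flat dictionary (where they are nested inside config.yaml is not assumed), and it assumes that similarity_threshold and augmentation_ratio are numeric.

```python
from typing import Any, Dict, List

# Values listed for augmentation_type in this README.
ALLOWED_AUGMENTATION_TYPES = {"ged", "edge_mapper", "base"}
REQUIRED_KEYS = ("augmentation_type", "similarity_threshold", "augmentation_ratio")


def check_augmentation_params(params: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the parameters look fine."""
    problems = [f"missing parameter: {key}" for key in REQUIRED_KEYS if key not in params]

    aug_type = params.get("augmentation_type")
    if aug_type is not None and aug_type not in ALLOWED_AUGMENTATION_TYPES:
        problems.append(f"unknown augmentation_type: {aug_type!r}")

    # Assumption: both remaining hyperparameters are plain numbers.
    for key in ("similarity_threshold", "augmentation_ratio"):
        value = params.get(key)
        # bool is a subclass of int, so it is excluded explicitly.
        if key in params and (isinstance(value, bool) or not isinstance(value, (int, float))):
            problems.append(f"{key} must be a number, got {value!r}")
    return problems
```

The function reports all problems at once instead of raising on the first one, so a misconfigured experiment can be fixed in a single pass.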
Training models
Setup
Create a new conda environment:
conda create --name my-env python=3.8.5
conda activate my-env
Install the required packages:
pip install -r requirements.txt
pip3 install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
conda install -c conda-forge sentencepiece=0.1.95 sacrebleu=1.5.1 fasttext=0.9.2 yq=2.13.0
If you get the following error during vocabulary building:
ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (...)
run the following lines one by one in the given order:
conda install libgcc #1
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/miniconda3/lib/ #2
Run
To train a model, you need to specify a config file like this one, in which you set all the model parameters and data paths based on the OpenNMT documentation (build vocab, train, translate), along with the additional parameters for our scripts.
After you have set up your config.yaml file, build your vocabularies (you only have to do this once). Once the vocabularies have been created, you can call the full_train.sh script, which trains your model based on your config, translates your validation set and evaluates BLEU. It also tracks your experiment as described in the Experiment tracking section.
# create directory for new experiment
cd opennmt/experiments/runs
mkdir new_experiment
cd new_experiment
# create config file
vim config.yaml
# build vocabulary
../../../bash_scripts/1_build_vocab.sh
# run training with evaluation and experiment tracking
../../../bash_scripts/full_train.sh
You can also run the model training and evaluation steps separately with the scripts found in the opennmt/bash_scripts directory.
Experiment tracking
When you run a full training or just the 8_save_history.sh script, your experiment will be tracked.
It saves the following files to the history directory, in a folder named after the date and time at which you ran your experiment:
- config file
- final result
- final translation of the validation set
- tensorboard logs
- translation pairs
- best model
It saves the following in the history.tsv file in the history directory:
- all the parameters specified in the config file (nested fields are represented as a.b)
- date - when the experiment was run
- history_path - corresponding history directory
- bleu_score - overall BLEU score
- bleu_score_n - n-gram BLEU score
- git_hash - hash of the git commit that was used
If a new parameter is added to the config, the previous runs will have None as the value for that parameter.
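The a.b naming and the None filling described above can be illustrated with a short sketch. This is not the code used by 8_save_history.sh: flatten_config and rows_to_tsv are hypothetical helpers, and the column order (first seen, first written) is an assumption.

```python
import csv
import io
from typing import Any, Dict, List


def flatten_config(config: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dictionaries into dotted keys: {"a": {"b": 1}} becomes {"a.b": 1}."""
    flat: Dict[str, Any] = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_config(value, name))
        else:
            flat[name] = value
    return flat


def rows_to_tsv(rows: List[Dict[str, Any]]) -> str:
    """Serialize runs with differing parameters to TSV, writing None for missing values."""
    # Collect columns in first-seen order so existing columns keep their position.
    columns: List[str] = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t", restval="None", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, if an older run only has data.original and a newer run adds another parameter, the older row gets None in the new column when both rows are passed to rows_to_tsv.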
Datasets
The preprocessed datasets and the train/dev/test splits used in the experiments for our paper TreeSwap: Data Augmentation for Machine Translation via Dependency Subtree Swapping can be found here.
Trained models
Our trained models from the paper Syntax-based data augmentation for Hungarian-English machine translation, for hu-en and en-hu specifically, are available on the HuggingFace Model Hub with usage steps:
Citation
If you use our method please cite the following papers:
@inproceedings{nagy-etal-2023-treeswap,
title = "{T}ree{S}wap: Data Augmentation for Machine Translation via Dependency Subtree Swapping",
author = "Nagy, Attila and
Lakatos, Dorina and
Barta, Botond and
{\'A}cs, Judit",
editor = "Mitkov, Ruslan and
Angelova, Galia",
booktitle = "Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing",
month = sep,
year = "2023",
address = "Varna, Bulgaria",
publisher = "INCOMA Ltd., Shoumen, Bulgaria",
url = "https://aclanthology.org/2023.ranlp-1.82",
pages = "759--768",
abstract = "Data augmentation methods for neural machine translation are particularly useful when limited amount of training data is available, which is often the case when dealing with low-resource languages. We introduce a novel augmentation method, which generates new sentences by swapping objects and subjects across bisentences. This is performed simultaneously based on the dependency parse trees of the source and target sentences. We name this method TreeSwap. Our results show that TreeSwap achieves consistent improvements over baseline models in 4 language pairs in both directions on resource-constrained datasets. We also explore domain-specific corpora, but find that our method does not make significant improvements on law, medical and IT data. We report the scores of similar augmentation methods and find that TreeSwap performs comparably. We also analyze the generated sentences qualitatively and find that the augmentation produces a correct translation in most cases. Our code is available on Github.",
}
@inproceedings{nagy2023syntax,
title = {{Data Augmentation for Machine Translation via Dependency Subtree Swapping}},
author = {Nagy, Attila and Lakatos, Dorina and Barta, Botond and Nanys, Patrick and {\'{A}}cs, Judit},
booktitle = {XIX. Conference on Hungarian Computational Linguistics.},
year = {2023},
}