
XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages

In this work, we propose the creation of a cross-lingual fact-to-text dataset, XAlign, accepted at the WebConf 2022 poster-demo track. It consists of English Wikidata triples/facts mapped to sentences from low-resource-language Wikipedias.

We explored two different unsupervised methods for the cross-lingual alignment task: (1) transfer learning from NLI and (2) distant supervision using the KELM dataset.

This repository contains the steps for executing the cross-lingual alignment approaches and for finetuning mT5 for data-to-text generation on XAlign. More details, analyses, and baseline results can be found in our paper.

Installation

Install the required packages as follows:

pip install -r requirements.txt

Dataset

Dataset releases

Data Fields

Each record consists of the following entries:

- sentence: the Wikipedia sentence aligned with the facts
- facts: the list of aligned English Wikidata facts
- language: the language code of the sentence (for example, en or hi)

The facts key contains a list of facts, where each fact is stored as a dictionary. A single entry within the fact list contains the following fields:

- subject: the subject entity of the fact
- predicate: the Wikidata relation
- object: the object value of the fact
- qualifiers: a list of qualifier dictionaries (qualifier_predicate, qualifier_object) that add context such as a point in time
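
As a quick sanity check, the records can be read with a few lines of Python. This sketch assumes a split is shipped as a JSON-lines file (one record per line); the path datasets/train.jsonl is only a placeholder:

```python
import json

def read_records(path):
    """Yield one XAlign record (sentence, facts, language) per line of a JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Placeholder path; adjust to wherever the release files are unzipped.
for record in read_records("datasets/train.jsonl"):
    print(record["language"], record["sentence"])
    for fact in record["facts"]:
        print("  ", fact["subject"], "|", fact["predicate"], "|", fact["object"])
    break  # only inspect the first record
```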

Data Instances

Example from English

{
  "sentence": "Mark Paul Briers (born 21 April 1968) is a former English cricketer.",
  "facts": [
    {
      "subject": "Mark Briers",
      "predicate": "date of birth",
      "object": "21 April 1968",
      "qualifiers": []
    },
    {
      "subject": "Mark Briers",
      "predicate": "occupation",
      "object": "cricketer",
      "qualifiers": []
    },
    {
      "subject": "Mark Briers",
      "predicate": "country of citizenship",
      "object": "United Kingdom",
      "qualifiers": []
    }
  ],
  "language": "en"
}

Example from one of the low-resource languages (Hindi)

{
  "sentence": "बोरिस पास्तेरनाक १९५८ में साहित्य के क्षेत्र में नोबेल पुरस्कार विजेता रहे हैं।",
  "facts": [
    {
      "subject": "Boris Pasternak",
      "predicate": "nominated for",
      "object": "Nobel Prize in Literature",
      "qualifiers": [
        {
          "qualifier_predicate": "point in time",
          "qualifier_object": "1958"
        }
      ]
    }
  ],
  "language": "hi"
}
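
For model input, the fact list usually has to be flattened into a single string. The separator tokens in the sketch below are an illustrative assumption, not the exact linearization used in this repository:

```python
def linearize_fact(fact):
    """Flatten one fact dictionary (and its qualifiers) into a plain string.

    The <S>/<P>/<O>/<QP>/<QO> markers are illustrative placeholders, not the
    separators used in the official XAlign pipeline.
    """
    parts = [f"<S> {fact['subject']} <P> {fact['predicate']} <O> {fact['object']}"]
    for q in fact.get("qualifiers", []):
        parts.append(f"<QP> {q['qualifier_predicate']} <QO> {q['qualifier_object']}")
    return " ".join(parts)

fact = {
    "subject": "Boris Pasternak",
    "predicate": "nominated for",
    "object": "Nobel Prize in Literature",
    "qualifiers": [{"qualifier_predicate": "point in time", "qualifier_object": "1958"}],
}
print(linearize_fact(fact))
# <S> Boris Pasternak <P> nominated for <O> Nobel Prize in Literature <QP> point in time <QO> 1958
```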

Gold standard Test dataset

We manually annotated the test dataset for each of the languages below with the help of crowd-sourced annotators.

| Language | #Count | #Word count (avg/min/max) | #Facts/sentence (avg/min/max) |
| --- | --- | --- | --- |
| Hindi | 842 | 11.1/5/24 | 2.1/1/5 |
| Marathi | 736 | 12.7/6/40 | 2.1/1/8 |
| Telugu | 734 | 9.7/5/30 | 2.2/1/6 |
| Tamil | 656 | 9.5/5/24 | 1.9/1/8 |
| English | 470 | 17.5/8/61 | 2.7/1/7 |
| Gujarati | 530 | 12.7/6/31 | 2.1/1/6 |
| Bengali | 792 | 8.7/5/24 | 1.6/1/5 |
| Kannada | 642 | 10.4/6/45 | 2.2/1/7 |
| Oriya | 529 | 13.4/5/45 | 2.4/1/7 |
| Assamese | 637 | 16.22/5/72 | 2.2/1/9 |
| Malayalam | 615 | 9.2/6/24 | 1.8/1/5 |
| Punjabi | 529 | 13.4/5/45 | 2.4/1/7 |

Train and validation dataset (automatically aligned)

We automatically created a large collection of well-aligned sentence-fact pairs across languages using the best cross-lingual aligner, as evaluated on the gold standard test sets.

| Language | #Count | #Word count (avg/min/max) | #Facts/sentence (avg/min/max) |
| --- | --- | --- | --- |
| Hindi | 56582 | 25.3/5/99 | 2.0/1/10 |
| Marathi | 19408 | 20.4/5/94 | 2.2/1/10 |
| Telugu | 24344 | 15.6/5/97 | 1.7/1/10 |
| Tamil | 56707 | 16.7/5/97 | 1.8/1/10 |
| English | 132584 | 20.2/4/86 | 2.2/1/10 |
| Gujarati | 9031 | 23.4/5/99 | 1.8/1/10 |
| Bengali | 121216 | 19.3/5/99 | 2.0/1/10 |
| Kannada | 25441 | 19.3/5/99 | 1.9/1/10 |
| Oriya | 14333 | 16.88/5/99 | 1.7/1/10 |
| Assamese | 9707 | 19.23/5/99 | 1.6/1/10 |
| Malayalam | 55135 | 15.7/5/98 | 1.9/1/10 |
| Punjabi | 30136 | 32.1/5/99 | 2.1/1/10 |

Cross-lingual Alignment Approaches

1) Transfer learning from NLI

Before executing the code, download the XNLI dataset from here.

To execute the mT5-based approach, follow these steps:

$ cd XNLI-based-models/finetune_mt5

Copy xnli_dataset.zip (downloaded earlier) to ./datasets and unzip it. Finally, execute the command:

$ python main.py --epochs 5 --gpus 1 --batch_size 16 --max_seq_len 200 --learning_rate 1e-3 --model_name google/mt5-large --fp16 0

To execute the MuRIL- or XLM-RoBERTa-based approaches, follow these steps:

$ cd XNLI-based-models/finetune_multilingual_encoder_models

Copy xnli_dataset.zip (downloaded earlier) to ./datasets and unzip it. Finally, execute the command:

$ python main.py --epochs 5 --gpus 1 --batch_size 32 --max_seq_len 200 --learning_rate <lr> --model_name <model_name> --fp16 1

where <model_name> is the Hugging Face identifier of the multilingual encoder to finetune (e.g. google/muril-large-cased or xlm-roberta-large) and <lr> is the corresponding learning rate.
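
For intuition, alignment with an NLI-finetuned encoder can be viewed as asking whether the sentence entails the linearized fact. The sketch below uses the public checkpoint joeddav/xlm-roberta-large-xnli purely for illustration (it is not the model produced by this repository), and the premise/hypothesis formatting is an assumption:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Public XNLI checkpoint used only for illustration; substitute the encoder
# finetuned by this repository when available.
model_name = "joeddav/xlm-roberta-large-xnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def alignment_score(sentence, linearized_fact):
    """Use the entailment probability of (sentence => fact) as an alignment score."""
    inputs = tokenizer(sentence, linearized_fact, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    # Look up the entailment class from the model config instead of hard-coding it.
    entail_id = next(i for i, label in model.config.id2label.items()
                     if label.lower() == "entailment")
    return probs[entail_id].item()

print(alignment_score(
    "Mark Paul Briers (born 21 April 1968) is a former English cricketer.",
    "Mark Briers | date of birth | 21 April 1968",
))
```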

2) Distant supervision using KELM dataset

Before executing the code, download the multi-lingual KELM dataset from here.

To execute the mT5-based approach, follow these steps:

$ cd distant_supervision/finetune_mt5

Copy multilingual-KELM-dataset.zip (downloaded earlier) to the ./datasets directory and unzip it. Finally, execute the command:

$ python main.py --epochs 5 --gpus 1 --batch_size 16 --max_seq_len 200 --learning_rate 1e-3 --model_name google/mt5-large --fp16 0

To execute the MuRIL- or XLM-RoBERTa-based approaches, follow these steps:

$ cd distant_supervision/finetune_multilingual_encoder_models

Copy multilingual-KELM-dataset.zip (downloaded earlier) to the ./datasets directory and unzip it. Finally, execute the command:

$ python main.py --epochs 5 --gpus 1 --batch_size 32 --max_seq_len 200 --learning_rate <lr> --model_name <model_name> --fp16 1

where <model_name> is the Hugging Face identifier of the multilingual encoder to finetune (e.g. google/muril-large-cased or xlm-roberta-large) and <lr> is the corresponding learning rate.
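
As a rough illustration of the distant-supervision idea, silver (sentence, fact) pairs such as those in KELM can be turned into binary alignment training examples by keeping the original pairing as a positive and a randomly mismatched sentence as a negative. This construction is an assumption for illustration only, not necessarily the exact procedure used here:

```python
import random

def build_alignment_examples(pairs, seed=13):
    """pairs: list of (sentence, linearized_fact) silver-aligned examples.

    Returns (sentence, fact, label) triples: label 1 for the original pairing,
    label 0 for a randomly mismatched sentence.
    """
    rng = random.Random(seed)
    sentences = [s for s, _ in pairs]
    examples = []
    for sentence, fact in pairs:
        examples.append((sentence, fact, 1))
        negative = rng.choice(sentences)
        if negative != sentence:
            examples.append((negative, fact, 0))
    return examples
```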

Cross-lingual Alignment Results

The following are the F1-scores for cross-lingual alignment on the gold standard test sets.

| Method | Hindi | Marathi | Telugu | Tamil | English | Gujarati | Bengali | Kannada | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baselines | | | | | | | | | |
| KELM-style | 49.3 | 42.6 | 36.8 | 45.1 | 41.0 | 37.2 | 43.6 | 33.8 | 41.1 |
| WITA-style | 50.7 | 57.4 | 51.7 | 45.9 | 60.2 | 50.0 | 53.5 | 53.0 | 52.8 |
| Stage-1 + TF-IDF | 75.0 | 68.5 | 69.3 | 71.8 | 73.7 | 70.1 | 78.7 | 64.7 | 71.5 |
| Distant supervision based approaches | | | | | | | | | |
| MuRIL-large | 76.3 | 68.4 | 74.0 | 75.5 | 70.5 | 78.5 | 62.4 | 67.7 | 71.7 |
| XLM-RoBERTa-large | 78.1 | 69.0 | 76.5 | 73.9 | 76.5 | 78.5 | 66.9 | 72.4 | 74.0 |
| mT5-large | 79.0 | 71.4 | 77.6 | 78.6 | 76.6 | 80.0 | 69.8 | 70.5 | 75.4 |
| Transfer learning based approaches | | | | | | | | | |
| MuRIL-large | 71.6 | 71.7 | 76.5 | 75.1 | 73.4 | 78.7 | 79.5 | 71.8 | 74.8 |
| XLM-RoBERTa-large | 77.2 | 76.7 | 78.0 | 81.2 | 79.0 | 80.5 | 83.1 | 72.7 | 78.6 |
| mT5-large | 90.2 | 83.1 | 84.1 | 88.6 | 84.5 | 85.1 | 75.1 | 78.5 | 83.7 |

Cross-lingual Data-to-Text Generation

Before proceeding, copy XAlign-dataset.zip (available upon request) to the data-to-text-generator/mT5-baseline/datasets folder and unzip it.

To finetune the best baseline on XAlign, follow these steps:

$ cd data-to-text-generator/mT5-baseline
$ python main.py --epochs 30 --gpus 1 --batch_size 2 --src_max_seq_len 250 --tgt_max_seq_len 200 --learning_rate 1e-3 --model_name google/mt5-small --online_mode 0 --use_pretrained 1 --lang hi,mr,te,ta,en,gu,bn,kn --verbose --enable_script_unification 1 
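
For reference, generation with a trained checkpoint can also be run directly through the Hugging Face API. The snippet below loads the vanilla google/mt5-small weights only as a stand-in (in practice, load the checkpoint saved by main.py), and the linearized-fact prompt with a target-language tag is an assumed format, not necessarily the one used by the training script:

```python
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

# Vanilla checkpoint used as a placeholder; substitute the finetuned weights.
model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)
model.eval()

# Illustrative input: linearized facts plus a target-language tag.
source = "generate in hi: Boris Pasternak | nominated for | Nobel Prize in Literature | point in time | 1958"
inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=250)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_length=200, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```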

To evaluate the trained model, follow these steps:

$ cd data-to-text-generator/mT5-baseline
$ python main.py --epochs 30 --gpus 1 --batch_size 4 --src_max_seq_len 250 --tgt_max_seq_len 200 --learning_rate 1e-3 --model_name google/mt5-small --online_mode 0 --use_pretrained 1 --lang hi,mr,te,ta,en,gu,bn,kn --enable_script_unification 1 --inference
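
Generation quality is reported as BLEU (next section). Assuming the inference step produces plain-text hypothesis and reference files (the file names below are placeholders), a corpus-level score can be computed with sacrebleu:

```python
import sacrebleu

# Placeholder file names; one sentence per line, hypotheses aligned with references.
with open("predictions.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("references.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# sacrebleu expects a list of reference streams (here, a single reference per sentence).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```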

Cross-lingual Data-to-Text Generation Results

BLEU scores obtained on the XAlign test set.

| Model | Hindi | Marathi | Telugu | Tamil | English | Gujarati | Bengali | Kannada | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline (fact translation) | 2.71 | 2.04 | 0.95 | 1.68 | 1.01 | 0.64 | 2.73 | 0.45 | 1.53 |
| GAT-Transformer | 29.54 | 17.94 | 4.91 | 7.19 | 40.33 | 11.34 | 30.15 | 5.08 | 18.31 |
| Vanilla Transformer | 35.42 | 17.31 | 6.94 | 8.82 | 38.87 | 13.21 | 35.61 | 3.16 | 19.92 |
| mT5-small | 40.61 | 20.23 | 11.39 | 13.61 | 43.65 | 16.61 | 45.28 | 8.77 | 25.02 |

Contributors

Citation

One can cite our paper as follows:

@article{abhishek2022xalign,
  title={XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages},
  author={Abhishek, Tushar and Sagare, Shivprasad and Singh, Bhavyajeet and Sharma, Anubhav and Gupta, Manish and Varma, Vasudeva},
  journal={arXiv preprint arXiv:2202.00291},
  year={2022}
}