XSemPLR

This repository maintains datasets and models for ACL 2023 paper XSemPLR: Cross-Lingual Semantic Parsing in Multiple Natural Languages and Meaning Representations.

Navigation: Overview, Datasets, Models and Experiments, Repository Structure, Citation, License

Overview

We present XSemPLR, a unified benchmark for cross-lingual semantic parsing featuring 22 natural languages and 8 meaning representations, built by examining and selecting 9 existing datasets to cover 5 tasks and 164 domains. The figure below shows an overview of cross-lingual semantic parsing over various natural languages and meaning representations.

Datasets

Statistics

Datasets in XSemPLR. We assemble 9 datasets in various domains for 5 semantic parsing tasks, covering 8 meaning representations. The questions cover 22 languages in 15 language families. The Train/Dev/Test columns indicate the number of MRs, each paired with multiple NLs.

JSON Files

The train/valid/test splits of XSemPLR are available in the `dataset` folder, which contains one sub-folder for each of the 9 datasets. The format of one instance in the JSON data is as follows:

{
  "db_id": "concert_singer",               // database id, only avaliable for MSpider dataset.
  "question": {                            // queries in natural language.
    "en": "How many singers do we have?",  // English query.
    "zh": "我们有多少歌手?",                // Chinese query.
    "vi": "Có tất cả bao nhiêu ca sĩ ?"    // Vietnamese query.
  },
  "mr": {                                  // meaning representations.
    "sql": "SELECT count(*) FROM singer"   // sql query.
  }                                        // *note that some datasets may contain multiple languages and meaning representations.
}

This shows an example from `dataset/mspider/dev.json`. Note that some datasets contain multiple natural languages (MSchema2QA, MTOP) or multiple meaning representations (MGeoQuery).
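For example, here is a minimal snippet to load a split and inspect one instance (assuming each split is stored as a single JSON array; if a dataset uses JSON Lines instead, read it line by line):

```python
import json

# Load the MSpider dev split; adjust the path to your checkout.
with open("dataset/mspider/dev.json", encoding="utf-8") as f:
    examples = json.load(f)

first = examples[0]
print(first["question"]["en"])  # English query
print(first["mr"]["sql"])       # SQL meaning representation
```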

Dataset Creation

All datasets can be reproduced from the raw data by running:

python read_dataset.py

Models and Experiments

Settings

We consider the following 6 settings for training and testing: Monolingual, Monolingual Few-shot, Multilingual, Cross-lingual Zero-shot Transfer, Cross-lingual Few-shot Transfer, and Translate-Test.

Models

We consider 6 models in total, organized into three groups.

Multilingual Pretrained Encoders with Pointer-based Decoders (Enc-PTR)

The first group is multilingual pretrained encoders paired with decoders augmented with pointers. Both encoders and decoders use Transformers. The decoder uses pointers to copy entities from the natural language input when generating meaning representations. We use two types of multilingual pretrained encoders, mBERT and XLM-R, both trained on web data covering over 100 languages.
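As an illustration of the copy mechanism (a minimal sketch in the style of a pointer-generator, not the exact implementation in `model/seq2seqPTR`), the decoder can blend a distribution over the MR vocabulary with a copy distribution over source tokens:

```python
import torch
import torch.nn.functional as F

def pointer_mixture(vocab_logits, copy_attn, src_token_ids, p_gen):
    """Blend generating from the MR vocabulary with copying source tokens.

    vocab_logits:  (batch, vocab_size) decoder scores over the MR vocabulary.
    copy_attn:     (batch, src_len) attention weights over source NL tokens.
    src_token_ids: (batch, src_len) vocabulary ids of the source tokens.
    p_gen:         (batch, 1) probability of generating rather than copying.
    """
    p_vocab = F.softmax(vocab_logits, dim=-1)
    # Scatter attention mass onto the vocabulary ids of the source tokens,
    # so entities in the NL input can be copied verbatim into the MR.
    p_copy = torch.zeros_like(p_vocab).scatter_add(-1, src_token_ids, copy_attn)
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy
```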

Multilingual Pretrained Encoder-Decoder Models (Enc-Dec)

The second group uses pretrained encoder-decoder models, including mBART and mT5, which use a text-to-text denoising objective for pretraining over multilingual corpora.
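As a hedged sketch of applying such a model (using the Hugging Face `transformers` API; the actual training code lives in `model/UniPSP`), one would fine-tune a checkpoint on XSemPLR's (NL, MR) pairs and then generate the MR directly from the question:

```python
from transformers import MT5ForConditionalGeneration, MT5Tokenizer

# Illustrative only: a pretrained checkpoint that has not been fine-tuned
# on XSemPLR's (NL, MR) pairs will not produce meaningful MRs.
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

inputs = tokenizer("How many singers do we have?", return_tensors="pt")
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```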

Multilingual Large Language Models (LLMs)

The third group is multilingual large language models based on the GPT architecture, including Codex and BLOOM. Codex is fine-tuned on publicly available code from GitHub. We mainly use these models to evaluate few-shot in-context learning without any further finetuning. Specifically, we prepend 8 samples to the test query and let the model predict the MR. For Monolingual Few-shot, the samples and the query are in the same NL, while for Cross-lingual Zero-shot Transfer, the samples are in English and the query is in the target NL.
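A hypothetical sketch of assembling such a prompt (the template below is illustrative, not the exact format used for Codex/BLOOM):

```python
def build_prompt(demonstrations, test_question):
    """Concatenate 8 (NL, MR) demonstrations with the test query.

    For Monolingual Few-shot, the demonstrations share the test query's
    language; for Cross-lingual Zero-shot Transfer, they are in English.
    """
    parts = [f"Question: {nl}\nMR: {mr}\n" for nl, mr in demonstrations[:8]]
    parts.append(f"Question: {test_question}\nMR:")
    return "\n".join(parts)
```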

Run the model

Please check the README of each model and follow the instructions to set up the environment and run the models. To run Enc-PTR, `cd model/seq2seqPTR`. To run Enc-Dec, `cd model/UniPSP`. To run LLMs, `cd model/Codex` for Codex and `cd model/BLOOM` for BLOOM.

Experiment Results

Results on XSemPLR. We consider 6 settings: 2 Monolingual, 1 Multilingual, 2 Cross-lingual, and 1 Translate-Test. Each number is averaged across the languages in that dataset. *Codex/BLOOM are evaluated on only two settings, as we apply 8-shot in-context learning without finetuning the model parameters. Two settings are not applicable to MCoNaLa because it has no training set in NLs other than English. Translate-Test performance on MSchema2QA and MTOP is especially low because the MRs in these datasets also contain tokens in the target languages.

Repository Structure

The following tree shows the repository structure:

.
├── assets             // Figures of README.md.                       
├── dataset            // XSemPLR dataset, one dataset per sub-folder.                                
├── model              // Models of XSemPLR, including BLOOM, Codex, mBERT+PTR, XLM-R+PTR, mBART, mT5 on 6 settings.
│   ├── BLOOM          // BLOOM model. See readme in this folder for running.
│   ├── Codex          // Codex model. See readme in this folder for running.  
│   ├── seq2seqPTR     // 2 Enc-PTR models, including mBERT+PTR, XLM-R+PTR. See readme in this folder for running.                 
│   └── UniPSP         // 2 Enc-Dec models, including mBART and mT5. See readme in this folder for running.  
├── utils              // The code to create multilingual, fewshot dataset, and to combine translations.                         
└── read_dataset.py    // Preprocess raw data to create XSemPLR.

Citation

@inproceedings{zhang2023xsemplr,
  title={XSemPLR: Cross-Lingual Semantic Parsing in Multiple Natural Languages and Meaning Representations},
  author={Zhang, Yusen and Wang, Jun and Wang, Zhiguo and Zhang, Rui},
  booktitle={ACL},
  year={2023}
}

License

Dataset License

Each dataset in XSemPLR retains the license of its original source.

Code License

Our code is distributed under the MIT License. Our implementation is based on the following libraries/repos: