LREBench: A low-resource relation extraction benchmark.

This repo is the official implementation of the EMNLP 2022 (Findings) paper Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study [poster].

This paper presents an empirical study on building relation extraction systems in low-resource settings. Building on recent PLMs, three schemes are comprehensively investigated: $(i)$ different types of prompt-based methods with few-shot labeled data; $(ii)$ diverse balancing methods for addressing the long-tailed distribution issue; $(iii)$ data augmentation techniques and self-training for generating more labeled in-domain data.

<div align=center> <img src="figs/intro.png" alt="intro" width=70% height=70% /> </div>


Environment

To install requirements:

>> conda create -n LREBench python=3.9
>> conda activate LREBench
>> pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu113
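
To verify that the CUDA 11.3 build of PyTorch installed correctly, a quick sanity check (not part of the repo):

>> python -c "import torch; print(torch.__version__, torch.cuda.is_available())"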

Datasets

We provide 8 benchmark datasets and prompts used in our experiments.

All processed full-shot datasets can be downloaded and should be placed in the dataset folder. Each dataset is expected to contain rel2id.json, train.json, and test.json.
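
For reference, rel2id.json maps each relation label to an integer id. A hypothetical fragment for SemEval might look like the following (the exact label set depends on the dataset):

    {
      "Other": 0,
      "Cause-Effect(e1,e2)": 1,
      "Cause-Effect(e2,e1)": 2
    }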

Normal Prompt-based Tuning

<div align=center> <img src="figs/prompt.png" alt="prompt" width=70% height=70% /> </div>

1 Initialize Answer Words

Use the command below to get answer words first.

>> python get_label_word.py --modelpath roberta-large --dataset semeval

The {modelpath}_{dataset}.pt file will be saved in the dataset folder; set the modelpath and dataset arguments to the names of the pre-trained language model and the dataset you plan to use.
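
For intuition, answer-word initialization for a prompt verbalizer typically maps each relation label to tokenizer ids and caches them. A minimal sketch (hypothetical; see get_label_word.py for the repo's actual logic):

    import json
    import torch
    from transformers import AutoTokenizer

    modelpath, dataset = "roberta-large", "semeval"
    tokenizer = AutoTokenizer.from_pretrained(modelpath)
    with open(f"dataset/{dataset}/rel2id.json") as f:
        rel2id = json.load(f)

    # Tokenize each relation label into candidate answer-word ids.
    label_word_ids = [tokenizer(rel, add_special_tokens=False)["input_ids"] for rel in rel2id]
    torch.save(label_word_ids, f"dataset/{modelpath}_{dataset}.pt")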

2 Split Datasets

We provide sampling code for obtaining the 8-shot (sample_8shot.py) and 10% (sample_10.py) datasets; the remaining instances are kept as unlabeled data for self-training. When sampling 8-shot datasets, classes with fewer than 8 instances are removed from the training and test sets, producing new_test.json and new_rel2id.json.

>> python sample_8shot.py -h
    usage: sample_8shot.py [-h] --input_dir INPUT_DIR --output_dir OUTPUT_DIR

    optional arguments:
      -h, --help            show this help message and exit
      --input_dir INPUT_DIR, -i INPUT_DIR
                            The directory of the training file.
      --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                            The directory of the sampled files.
>> python sample_10.py -h
    usage: sample_10.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR

    optional arguments:
      -h, --help            show this help message and exit
      --input_file INPUT_FILE, -i INPUT_FILE
                            The directory of the training file.
      --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                            The directory of the sampled files.

For example:

>> python sample_8shot.py -i dataset/semeval -o dataset/semeval/8-shot
>> cd dataset/semeval
>> mkdir 8-1
>> cp 8-shot/new_rel2id.json 8-1/rel2id.json
>> cp 8-shot/new_test.json 8-1/test.json
>> cp 8-shot/train_8_1.json 8-1/train.json
>> cp 8-shot/unlabel_8_1.json 8-1/label.json
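
For intuition, per-class K-shot sampling boils down to grouping by relation and splitting off K instances per class. A simplified sketch (hypothetical; sample_8shot.py is authoritative, and the "relation" key is an assumption about the data format):

    import json
    import random
    from collections import defaultdict

    K = 8
    with open("dataset/semeval/train.json") as f:
        train = [json.loads(line) for line in f]  # assumes one example per line

    by_rel = defaultdict(list)
    for ex in train:
        by_rel[ex["relation"]].append(ex)

    labeled, unlabeled = [], []
    for rel, examples in by_rel.items():
        if len(examples) < K:           # classes with fewer than K instances are dropped
            continue
        random.shuffle(examples)
        labeled.extend(examples[:K])    # K-shot training set
        unlabeled.extend(examples[K:])  # the rest serve as unlabeled data for self-training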

3 Prompt-based Tuning

All running scripts for each dataset are in the scripts folder. For example, train KnowPrompt on SemEval, CMeIE, and ChemProt with the following commands:

>> bash scripts/semeval.sh  # RoBERTa-large
>> bash scripts/CMeIE.sh    # Chinese RoBERTa-large
>> bash scripts/ChemProt.sh # BioBERT-large

4 Different prompts

<div align=center> <img src="figs/prompts.png" alt="prompts" width=70% height=70% /> </div>

Simply add the following parameters to the scripts:

Template Prompt: --use_template_words 0

Schema Prompt: --use_template_words 0 --use_schema_prompt True

PTR: refer to the PTR repository

Balancing

<div align=center> <img src="figs/balance.png" alt="balance" width=40% height=40% /> </div>

1 Re-sampling
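
No command is tied to this step in the README; as a general illustration, class-balanced re-sampling can be implemented with PyTorch's WeightedRandomSampler. A sketch, not necessarily the repo's approach:

    from collections import Counter
    from torch.utils.data import DataLoader, WeightedRandomSampler

    labels = [ex["relation"] for ex in train]  # `train` as loaded above
    freq = Counter(labels)
    weights = [1.0 / freq[y] for y in labels]  # rarer classes are sampled more often

    sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
    loader = DataLoader(train, batch_size=16, sampler=sampler)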

2 Re-weighting Loss

Simply add the useloss parameter to the script to choose among various re-weighting losses.

For example: --useloss MultiFocalLoss (choices: MultiDSCLoss, MultiFocalLoss, GHMC_Loss, LDAMLoss).
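
For intuition, focal loss down-weights well-classified examples by a factor $(1-p_t)^\gamma$. A compact multi-class sketch (not necessarily identical to the repo's MultiFocalLoss):

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0):
        """Multi-class focal loss: cross-entropy scaled by (1 - p_t) ** gamma."""
        ce = F.cross_entropy(logits, targets, reduction="none")
        p_t = torch.exp(-ce)                       # probability of the true class
        return ((1.0 - p_t) ** gamma * ce).mean()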

Data Augmentation

<div align=center> <img src="figs/DA.png" alt="DA" width=70% height=70% /> </div>

1 Prepare the environment

>> pip install nlpaug nlpcda

Please follow the instructions from nlpaug and nlpcda for more information (many thanks to both projects!).

2 Try different DA methods

We provide many data augmentation methods.
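
As a quick illustration, word-level synonym replacement with nlpaug looks roughly like this (a sketch; the repo's DA scripts may configure augmenters differently, and for RE the entity mentions must be kept intact):

    import nlpaug.augmenter.word as naw

    # WordNet-based synonym replacement; may require NLTK WordNet data first.
    aug = naw.SynonymAug(aug_src="wordnet")
    text = "The author of the book lives in Hangzhou."
    print(aug.augment(text))  # recent nlpaug versions return a list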

Self-training for Semi-supervised Learning

<div align=center> <img src="figs/self-training.png" alt="st" width=70% height=70% /> </div>

Standard Fine-tuning Baseline

<div align=center> <img src="figs/fine-tuning.png" alt="ft" width=70% height=70% /> </div>

Fine-tuning
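
The baseline casts RE as standard sequence classification over the relation label set. A minimal sketch with Hugging Face Transformers (illustrative only; entity markers such as [E1] are a common convention and would normally be registered as special tokens):

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-large")
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-large", num_labels=19  # 19 relation classes in SemEval-2010 Task 8
    )

    inputs = tokenizer("[E1] He [/E1] poured the milk into the [E2] bowl [/E2].",
                       return_tensors="pt")
    loss = model(**inputs, labels=torch.tensor([1])).loss  # cross-entropy over relations
    loss.backward()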

Citation

If you use the code, please cite the following paper:

@inproceedings{xu-etal-2022-towards-realistic,
    title = "Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study",
    author = "Xu, Xin  and
      Chen, Xiang  and
      Zhang, Ningyu  and
      Xie, Xin  and
      Chen, Xi  and
      Chen, Huajun",
    editor = "Goldberg, Yoav  and
      Kozareva, Zornitsa  and
      Zhang, Yue",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.29",
    doi = "10.18653/v1/2022.findings-emnlp.29",
    pages = "413--427"
}