LREBench: A low-resource relation extraction benchmark.
This repo is the official implementation for the EMNLP 2022 (Findings) paper Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study [poster].
This paper presents an empirical study of building relation extraction systems in low-resource settings. Building upon recent pre-trained language models (PLMs), we comprehensively investigate three schemes for low-resource performance: $(i)$ different types of prompt-based methods with few-shot labeled data; $(ii)$ diverse balancing methods to address the long-tailed distribution issue; $(iii)$ data augmentation techniques and self-training to generate more labeled in-domain data.
<div align=center> <img src="figs/intro.png" alt="intro" width=70% height=70% /> </div>
Environment
To install requirements:
>> conda create -n LREBench python=3.9
>> conda activate LREBench
>> pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu113
Datasets
We provide 8 benchmark datasets and prompts used in our experiments.
All processed full-shot datasets can be downloaded and should be placed in the dataset folder. Each dataset is expected to contain rel2id.json, train.json, and test.json.
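For orientation, a quick way to inspect a downloaded dataset is sketched below; the instance schema shown in the comment (token list plus head/tail entities and a relation label) is an assumption based on common RE layouts, so check the actual files for the authoritative format:

```python
import json

# Illustrative only: the field names in the comment below (token/h/t/relation)
# are assumed from common RE dataset layouts, not confirmed by this repo.
with open("dataset/semeval/rel2id.json") as f:
    rel2id = json.load(f)  # maps relation names to integer ids
with open("dataset/semeval/train.json") as f:
    train = json.load(f)

print(len(rel2id), "relations,", len(train), "training instances")
print(train[0])  # e.g. {"token": [...], "h": {...}, "t": {...}, "relation": "..."}
```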
Normal Prompt-based Tuning
<div align=center> <img src="figs/prompt.png" alt="prompt" width=70% height=70% /> </div>
1 Initialize Answer Words
Use the command below to get answer words first.
>> python get_label_word.py --modelpath roberta-large --dataset semeval
A file named {modelpath}_{dataset}.pt will be saved in the dataset folder; set modelpath and dataset to the names of the pre-trained language model and the dataset you are using.
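The saved file is torch-serialized, so it can be inspected directly. A small exploratory sketch (the structure of the stored object is an assumption to verify, not documented behavior):

```python
import torch

# Exploratory check of the generated answer-word file; the structure of the
# loaded object is an assumption to verify, not documented behavior.
label_words = torch.load("dataset/roberta-large_semeval.pt")
print(type(label_words))
```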
2 Split Datasets
We provide sampling code for obtaining the 8-shot (sample_8shot.py) and 10% (sample_10.py) datasets; the remaining instances serve as unlabeled data for self-training. When sampling 8-shot datasets, any class with fewer than 8 instances is removed from the training and test sets, and new_test.json and new_rel2id.json are produced (a sampling sketch follows the example below).
>> python sample_8shot.py -h
usage: sample_8shot.py [-h] --input_dir INPUT_DIR --output_dir OUTPUT_DIR
optional arguments:
-h, --help show this help message and exit
--input_dir INPUT_DIR, -i INPUT_DIR
The directory of the training file.
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
The directory of the sampled files.
>> python sample_10.py -h
usage: sample_10.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR
optional arguments:
-h, --help show this help message and exit
--input_file INPUT_FILE, -i INPUT_FILE
The directory of the training file.
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
The directory of the sampled files.
For example:
>> python sample_8shot.py -i dataset/semeval -o dataset/semeval/8-shot
>> cd dataset/semeval
>> mkdir 8-1
>> cp 8-shot/new_rel2id.json 8-1/rel2id.json
>> cp 8-shot/new_test.json 8-1/test.json
>> cp 8-shot/train_8_1.json 8-1/train.json
>> cp 8-shot/unlabel_8_1.json 8-1/label.json
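The sampling rule above can be pictured with a short sketch. This is not the repo's sample_8shot.py, only an illustration under the assumption that each instance stores its label in a relation field:

```python
import random
from collections import defaultdict

def sample_k_shot(train, k=8, seed=0):
    """Illustrative K-shot sampler: drop classes with fewer than k instances,
    then draw exactly k instances per surviving class."""
    by_rel = defaultdict(list)
    for inst in train:
        by_rel[inst["relation"]].append(inst)  # "relation" field is an assumption
    rng = random.Random(seed)
    sampled, kept_rels = [], []
    for rel, insts in by_rel.items():
        if len(insts) < k:
            continue  # class removed from both training and test sets
        kept_rels.append(rel)
        sampled.extend(rng.sample(insts, k))
    return sampled, kept_rels  # new_test.json keeps only relations in kept_rels
```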
3 Prompt-based Tuning
All running scripts for each dataset are in the scripts folder. For example, train KnowPrompt on SemEval, CMeIE, and ChemProt with the following commands:
>> bash scripts/semeval.sh # RoBERTa-large
>> bash scripts/CMeIE.sh # Chinese RoBERTa-large
>> bash scripts/ChemProt.sh # BioBERT-large
4 Different Prompts
<div align=center> <img src="figs/prompts.png" alt="prompts" width=70% height=70% /> </div>
Simply add parameters to the scripts.
Template Prompt: --use_template_words 0
Schema Prompt: --use_template_words 0 --use_schema_prompt True
PTR: refer to PTR
Balancing
<div align=center> <img src="figs/balance.png" alt="balance" width=40% height=40% /> </div>
1 Re-sampling
- Create the re-sampled training file based on the 10% training set with resample.py.
>> python resample.py -h
usage: resample.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --rel_file REL_FILE
optional arguments:
-h, --help show this help message and exit
--input_file INPUT_FILE, -i INPUT_FILE
The path of the training file.
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
The directory of the sampled files.
--rel_file REL_FILE, -r REL_FILE
The path of the relation file.
For example:
>> mkdir dataset/semeval/10sa-1
>> python resample.py -i dataset/semeval/10/train10per_1.json -r dataset/semeval/rel2id.json -o dataset/semeval/sa
>> cd dataset/semeval
>> cp rel2id.json test.json 10sa-1/
>> cp sa/sa_1.json 10sa-1/train.json
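resample.py's exact strategy is not spelled out here, so the sketch below shows one generic possibility: oversampling rare relations until every class matches the largest one. Treat it as an illustration rather than the repo's implementation:

```python
import random
from collections import defaultdict

def oversample(train, seed=0):
    """Generic re-sampling sketch (not necessarily resample.py's strategy):
    duplicate instances of rare relations until every class matches the
    size of the largest class."""
    by_rel = defaultdict(list)
    for inst in train:
        by_rel[inst["relation"]].append(inst)  # "relation" field is an assumption
    target = max(len(v) for v in by_rel.values())
    rng = random.Random(seed)
    balanced = []
    for insts in by_rel.values():
        balanced.extend(insts)
        balanced.extend(rng.choices(insts, k=target - len(insts)))
    return balanced
```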
2 Re-weighting Loss
Simply add the useloss parameter to the script to choose a re-weighting loss, e.g. --useloss MultiFocalLoss (choices: MultiDSCLoss, MultiFocalLoss, GHMC_Loss, LDAMLoss).
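As a reference point for what these losses do, the sketch below is a generic multi-class focal loss in PyTorch, which down-weights easy examples by the factor (1 - p_t)^gamma; it is not the repo's MultiFocalLoss, just the standard formulation:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Generic multi-class focal loss sketch: scale the per-example
    cross-entropy by (1 - p_t)^gamma so confidently-correct (high p_t)
    examples contribute less to the total loss."""
    log_probs = F.log_softmax(logits, dim=-1)                      # (N, C)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # (N,) log p_t
    pt = log_pt.exp()
    return ((1.0 - pt) ** gamma * -log_pt).mean()

logits = torch.randn(4, 19)            # e.g. 4 examples, 19 relation classes
targets = torch.randint(0, 19, (4,))
print(focal_loss(logits, targets))
```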
Data Augmentation
<div align=center> <img src="figs/DA.png" alt="DA" width=70% height=70% /> </div>
1 Prepare the environment
>> pip install nlpaug nlpcda
Please follow the instructions from nlpaug and nlpcda for more information (Thanks a lot!).
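As a quick check that the installation works, a minimal nlpaug call could look like the following; the model name and input sentence are merely examples, and DA.py (below) drives the actual pipeline:

```python
import nlpaug.augmenter.word as naw

# Minimal nlpaug smoke test: contextual word-embedding substitution.
# The model name and sentence are examples, not values from this repo.
aug = naw.ContextualWordEmbsAug(
    model_path="bert-base-uncased",
    action="substitute",
)
text = "The inhibitor blocks the kinase activity of the target protein."
print(aug.augment(text))  # recent nlpaug versions return a list of strings
```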
2 Try different DA methods
We provide several data augmentation (DA) methods:
- English (nlpaug): TF-IDF, contextual word embeddings (BERT and RoBERTa), and WordNet synonyms (-lan==en, -d).
- Chinese (nlpcda): synonyms (-lan==cn).
- All DA methods can be applied to contexts, entities, or both (--locations).
- Generate augmented data:
>> python DA.py -h
usage: DA.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --language {en,cn}
[--locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]]
[--DAmethod {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}]
[--model_dir MODEL_DIR] [--model_name MODEL_NAME] [--create_num CREATE_NUM]
[--change_rate CHANGE_RATE]
optional arguments:
-h, --help show this help message and exit
--input_file INPUT_FILE, -i INPUT_FILE
the training set file
--output_dir OUTPUT_DIR, -o OUTPUT_DIR
The directory of the sampled files.
--language {en,cn}, -lan {en,cn}
DA for English or Chinese
--locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...], -l {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]
List of positions that you want to manipulate
--DAmethod {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}, -d {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}
Data augmentation method
--model_dir MODEL_DIR, -m MODEL_DIR
the path of pretrained models used in DA methods
--model_name MODEL_NAME, -mn MODEL_NAME
model from huggingface
--create_num CREATE_NUM, -cn CREATE_NUM
The number of samples augmented from one instance.
--change_rate CHANGE_RATE, -cr CHANGE_RATE
the changing rate of text
Take context-level DA with contextual word embeddings on ChemProt as an example:
>> python DA.py \
-i dataset/ChemProt/10/train10per_1.json \
-o dataset/ChemProt/aug \
-d word_embedding_bert \
-mn dmis-lab/biobert-large-cased-v1.1 \
-l sent1 sent2 sent3
- Delete repeated instances and get the final augmented data:
>> python merge_dataset.py -h
usage: merge_dataset.py [-h] [--input_files INPUT_FILES [INPUT_FILES ...]] [--output_file OUTPUT_FILE]
optional arguments:
-h, --help show this help message and exit
--input_files INPUT_FILES [INPUT_FILES ...], -i INPUT_FILES [INPUT_FILES ...]
List of input files containing datasets to merge
--output_file OUTPUT_FILE, -o OUTPUT_FILE
Output file containing merged dataset
For example:
>> python merge_dataset.py \
-i dataset/ChemProt/train10per_1.json dataset/ChemProt/aug/aug.json \
-o dataset/ChemProt/aug/merge.json
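The duplicate-removal step can be approximated in a few lines; merge_dataset.py may use a different duplicate criterion, so this is only a sketch, assuming each input file holds a JSON array of instances:

```python
import json

def merge_unique(files, out_path):
    """Sketch of merging augmented data with the original set while dropping
    exact duplicates; merge_dataset.py may differ in its duplicate criterion."""
    seen, merged = set(), []
    for path in files:
        with open(path) as f:
            for inst in json.load(f):
                key = json.dumps(inst, sort_keys=True)  # exact-match dedup key
                if key not in seen:
                    seen.add(key)
                    merged.append(inst)
    with open(out_path, "w") as f:
        json.dump(merged, f, ensure_ascii=False)
```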
Self-training for Semi-supervised Learning
<div align=center> <img src="figs/self-training.png" alt="st" width=70% height=70% /> </div>
- Train a teacher model on the small labeled set (8-shot or 10%).
- Place the unlabeled data label.json in the corresponding dataset folder.
- Assign pseudo labels with the trained teacher model: add --labeling True to the script to obtain the pseudo-labeled dataset label2.json.
- Put the gold-labeled and pseudo-labeled data together. For example:
>> python self-train_combine.py -g dataset/semeval/10-1/train.json -p dataset/semeval/10-1/label2.json -la dataset/semeval/10la-1
>> cd dataset/semeval
>> cp rel2id.json test.json 10la-1/
- Train the final student model: add --stutrain True to the script. A high-level sketch of the full loop follows.
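Putting the steps together, the pipeline amounts to the following Python-shaped pseudocode; train_fn, predict_fn, and the relation field are placeholders, not names from this repo:

```python
def self_training(labeled, unlabeled, train_fn, predict_fn):
    """High-level sketch of the teacher-student loop described above."""
    teacher = train_fn(labeled)                          # 1. teacher on 8-shot / 10% gold data
    pseudo = [dict(x, relation=predict_fn(teacher, x))   # 2. pseudo-label the unlabeled data
              for x in unlabeled]
    student = train_fn(labeled + pseudo)                 # 3. student on gold + pseudo data
    return student
```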
Standard Fine-tuning Baseline
<div align=center> <img src="figs/fine-tuning.png" alt="ft" width=70% height=70% /> </div>
Citation
If you use the code, please cite the following paper:
@inproceedings{xu-etal-2022-towards-realistic,
title = "Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study",
author = "Xu, Xin and
Chen, Xiang and
Zhang, Ningyu and
Xie, Xin and
Chen, Xi and
Chen, Huajun",
editor = "Goldberg, Yoav and
Kozareva, Zornitsa and
Zhang, Yue",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-emnlp.29",
doi = "10.18653/v1/2022.findings-emnlp.29",
pages = "413--427"
}