REPLM: Document-Level In-Context Few-Shot Relation Extraction via Pre-Trained Language Models
This is the original implementation of the paper. You can cite the paper as follows:
@article{ozyurt2023context,
title={Document-Level In-Context Few-Shot Relation Extraction via Pre-Trained Language Models},
author={Ozyurt, Yilmazcan and Feuerriegel, Stefan and Zhang, Ce},
journal={arXiv preprint arXiv:2310.11085},
year={2023}
}
We used Python 3.8.5 in our experiments. You can install the required libraries into a fresh virtual Python environment via pip install -r requirements.txt.
Data pre-processing
The first step is to download the DocRED dataset, following the instructions from the original repository. As a result, you should have a new folder ./DocRED.
Then you can run the pre-processing pipeline DocRED_preprocess/main.sh.
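To sanity-check the download before pre-processing, you can inspect one annotated document. The snippet below is a minimal sketch that assumes the standard DocRED file names (e.g., train_annotated.json) and field layout (title, sents, vertexSet, labels); adjust the path if your copy differs.

```python
import json

# Load the annotated training documents from the downloaded DocRED folder
# (file and field names are the standard DocRED ones; adjust if needed).
with open("./DocRED/train_annotated.json") as f:
    docs = json.load(f)

doc = docs[0]
print(doc["title"])                       # document title
print(len(doc["sents"]), "sentences")     # tokenized sentences
print(len(doc["vertexSet"]), "entities")  # entity clusters with their mentions
print(len(doc["labels"]), "relation facts")

# Each label links two entity clusters by a relation id such as P131.
for label in doc["labels"][:3]:
    head = doc["vertexSet"][label["h"]][0]["name"]
    tail = doc["vertexSet"][label["t"]][0]["name"]
    print(head, "--", label["r"], "->", tail)
```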
Running our REPLM framework
Run the inference for L different sets of in-context few-shot examples (by changing <seed_no>):
python extract_relations.py --relation <rel_id> --seed <seed_no> --experiments_main_folder experiment_<rel_id> --experiment_folder seed<seed_no>
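If you want to launch all seeds for a relation in one go, a small driver like the sketch below can help. The relation id and seed range here are placeholders, not values prescribed by the repository.

```python
import subprocess

rel_id = "P131"   # placeholder relation id
seeds = range(5)  # placeholder: one run per set of in-context few-shot examples

# Launch extract_relations.py once per seed, mirroring the command above.
for seed in seeds:
    subprocess.run(
        [
            "python", "extract_relations.py",
            "--relation", rel_id,
            "--seed", str(seed),
            "--experiments_main_folder", f"experiment_{rel_id}",
            "--experiment_folder", f"seed{seed}",
        ],
        check=True,
    )
```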
After running for different seeds, aggregate their results as follows:
python aggregate_extractions.py --temperature <temperature> --threshold <threshold> --experiments_main_folder experiment_<rel_id>
This should yield experiment_<rel_id>/aggregated_predictions.csv.
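Conceptually, the aggregation step pools the candidate triples extracted under the different in-context example sets and keeps the ones that are supported strongly enough. The sketch below is only an illustration of such temperature-scaled, threshold-based voting; it is not the repository's implementation, and the scores and weighting scheme are assumptions.

```python
import math
from collections import defaultdict

def aggregate(candidates, temperature=1.0, threshold=0.5):
    """Illustrative aggregation of per-seed extractions.

    `candidates` is a list of (head, tail, score) tuples pooled over all
    seeds; the same (head, tail) pair may appear multiple times.
    """
    weights = defaultdict(float)
    for head, tail, score in candidates:
        # Temperature-scaled weight of each individual extraction.
        weights[(head, tail)] += math.exp(score / temperature)

    total = sum(weights.values()) or 1.0
    # Keep the pairs whose normalized aggregated weight clears the threshold.
    return {pair: w / total for pair, w in weights.items()
            if w / total >= threshold}
```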
Running our REPLM framework on OpenAI's GPT models
As our REPLM framework is transferable to other LM backbones, you can easily replace the default LM with your favourite one, such as one of the GPT models from OpenAI. Specifically, if you want to experiment with gpt-3.5-turbo, you can run the inference via the following:
python extract_relations_openai.py --model_name gpt-3.5-turbo --relation <rel_id> --seed <seed_no> --experiments_main_folder experiment_<rel_id> --experiment_folder seed<seed_no>
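Under the hood, the OpenAI backbone simply sends the few-shot prompt to the chat completions endpoint. The sketch below shows the general pattern with the openai Python client (v1 interface); the prompt text and message layout are illustrative assumptions, not the exact prompt built by extract_relations_openai.py.

```python
import os
from openai import OpenAI

# The client reads OPENAI_API_KEY from the environment if not passed explicitly.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Illustrative few-shot prompt: demonstrations followed by the query text.
prompt = (
    "Extract all (head, tail) pairs for the relation 'country' from the text.\n\n"
    "Text: Leeuwarden is a city in the Netherlands.\n"
    "Pairs: (Leeuwarden, Netherlands)\n\n"
    "Text: Skarsterlan was a municipality in Friesland, Netherlands.\n"
    "Pairs:"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```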
Important note: Don't forget to set the OPENAI_API_KEY environment variable before running the experiments.
Evaluation via external knowledge base
Extracted relations from DocRED can be further evaluated by comparing them against WikiData.
To achieve this, you first need to clone simple-wikidata-db and follow the steps there to obtain a local copy of WikiData. (Of note: one could ideally use WikiData's online query service instead of a local copy; however, it is too slow for running queries over thousands of extracted relations.)
For convenience we repeat the important steps here:
- Open the ./simple-wikidata-db directory, which you recently cloned.
- Fetch the most recent dump of WikiData: wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
- Run the pre-processing (note: it is a CPU-heavy process): python simple_wikidata_db/preprocess_dump.py --input_file latest-all.json.gz --out_dir PROCESSED_DATA/ --num_lines_in_dump -1 --processes 200 --batch_size 250000
- The local copy of WikiData should then be ready at ./simple-wikidata-db/PROCESSED_DATA.
The final step is to compare the extracted relations against the WikiData entries:
python pred_to_wikidata.py --pred_folder experiment_<rel_id> --pred_file aggregated_predictions.csv --rel_id <rel_id> -p_alias "simple-wikidata-db/PROCESSED_DATA/aliases" -p_rels "simple-wikidata-db/PROCESSED_DATA/entity_rels"
This should yield experiment_<rel_id>/preds_in_wikidata.csv, which lists the extracted relations from the documents that also appear in the WikiData knowledge base.
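For intuition, the comparison amounts to mapping each predicted head and tail surface form to WikiData QIDs via the aliases table and then checking whether (head_qid, <rel_id>, tail_qid) occurs in the entity_rels table. The sketch below is a simplified illustration, not pred_to_wikidata.py itself; the JSONL field names ("qid", "alias", "property_id", "value") and the prediction CSV columns ("head", "tail") are assumptions about the processed data, so adjust them to your files.

```python
import glob
import json
from collections import defaultdict

import pandas as pd

def load_jsonl_dir(path):
    # simple-wikidata-db writes each table as a directory of JSONL shards.
    for fname in glob.glob(f"{path}/*.jsonl"):
        with open(fname) as f:
            for line in f:
                yield json.loads(line)

rel_id = "P131"  # placeholder relation id

# Map lower-cased surface forms to the QIDs they may refer to.
alias_to_qids = defaultdict(set)
for rec in load_jsonl_dir("simple-wikidata-db/PROCESSED_DATA/aliases"):
    alias_to_qids[rec["alias"].lower()].add(rec["qid"])

# Collect all (head_qid, tail_qid) pairs WikiData states for this relation.
kb_pairs = {
    (rec["qid"], rec["value"])
    for rec in load_jsonl_dir("simple-wikidata-db/PROCESSED_DATA/entity_rels")
    if rec["property_id"] == rel_id
}

# Check which aggregated predictions also appear in WikiData.
preds = pd.read_csv(f"experiment_{rel_id}/aggregated_predictions.csv")
for _, row in preds.iterrows():
    heads = alias_to_qids.get(str(row["head"]).lower(), set())
    tails = alias_to_qids.get(str(row["tail"]).lower(), set())
    if any((h, t) in kb_pairs for h in heads for t in tails):
        print(row["head"], "->", row["tail"], "is supported by WikiData")
```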