Welcome to LSOIE: A Large-Scale Dataset for Supervised Open Information Extraction

Introduction

This repository contains the data published in the paper LSOIE: A Large-Scale Dataset for Supervised Open Information Extraction (EACL 2021), along with the training procedures needed to train your own open information extractor on these data and the evaluation techniques for measuring its performance.

Dataset Creation

The dataset_creation folder contains code to transform QA-SRL 2.0 data into open information extraction data. The full LSOIE dataset is provided in dataset_creation\dataset_creation\lsoie_data.zip.
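If you want a quick look at what the archive contains before training, here is a minimal Python sketch; the zip path mirrors the one above and the inspection is purely illustrative, so adjust it to wherever you unpacked the repository.

import zipfile

# Minimal sketch: list the files bundled in the LSOIE archive and peek at the
# first few lines of one member to see the conll-style layout.
zip_path = "dataset_creation/dataset_creation/lsoie_data.zip"

with zipfile.ZipFile(zip_path) as zf:
    for name in zf.namelist():
        print(name)
    first_file = next(n for n in zf.namelist() if not n.endswith("/"))
    with zf.open(first_file) as f:
        for i, line in enumerate(f):
            print(line.decode("utf-8").rstrip())
            if i >= 5:
                break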

Model Training

This repository contains code to train your open information extractor using the allennlp research framework. The models explored in this work are specified in the large_scale_oie package, which is organized into components for dataset reading, modeling, and evaluation.

The models folder offers a few model variants to choose from.

Prepare Data for Training

To get started with training, first convert your chosen LSOIE data into allennlp conll format with the following command, run from ./large_scale_oie/dataset_readers:

python convert_conll_to_onenotes.py --inp extraction_conlls/train.oie.conll --domain train --out onto_notes_conlls/train.gold_conll

Specify and Run Experiment

Now that your data are ready, specify an experiment json in ./experiments. There you can set your train and validation paths (the standard allennlp train_data_path and validation_data_path fields).

Then, write a training script such as run_training.sh; running ./run_training.sh will kick off the allennlp framework. Model weights and tensorboard logs are saved in ./results.
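For reference, such a script usually boils down to a single allennlp train invocation along the lines of the command below; the experiment name your_experiment is a placeholder for whichever json you wrote above.

allennlp train experiments/your_experiment.json -s results/your_experiment --include-package large_scale_oie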

Making Predictions

Once you have trained your open information extractor, you can use it to make some predictions!

The predictor is built into the large_scale_oie/evaluation folder and is pre-configured with the LSOIE test set, but you can point it at a txt file of any sentences you would like. To predict, run:

python large_scale_oie/evaluation/predict_conll.py --in results/srl_bert_no_science/ --out large_scale_oie/evaluation/srl_bert_noscience.conll

where --in points to the checkpoint directory of your trained model. By default it uses best.th.

Performing Evaluation

The evaluation script here corrects the original Supervised Open Information Extraction evaluation in two ways: it does not invert confidence estimates, and it matches predicted arguments against the syntactic head of each gold extraction argument rather than relying on lexical overlap, which has many pitfalls.

To use the evaluation script after prediction, you must first convert your predictions to tabbed extractions with the following command:

python large_scale_oie/evaluation/trained_oie_extractor.py --in large_scale_oie/evaluation/ls_long_sort_orig_qdist_old_model_crf_on_science.conll --out large_scale_oie/evaluation/ls_long_sort_orig_qdist_old_model_crf_on_science.txt

This takes the predictions made by the extractor and transforms them into tabbed extractions with confidence scores that can be thresholded during evaluation.
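To get a feel for the tabbed format and for confidence thresholding, here is a minimal Python sketch. The column order (sentence, confidence, predicate, then arguments) is an assumption; verify it against trained_oie_extractor.py before relying on it.

from collections import namedtuple

# Hypothetical reader for the tabbed extraction file; the column order is an
# assumption -- check trained_oie_extractor.py for the actual format.
Extraction = namedtuple("Extraction", ["sentence", "confidence", "predicate", "args"])

def read_tabbed(path):
    extractions = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 4:
                continue  # skip malformed or empty lines
            sentence, confidence, predicate, *args = parts
            extractions.append(Extraction(sentence, float(confidence), predicate, args))
    return extractions

# Keep only extractions above a confidence threshold, mirroring what the
# benchmark does when it sweeps thresholds to build a precision-recall curve.
extractions = read_tabbed("large_scale_oie/evaluation/ls_long_sort_orig_qdist_old_model_crf_on_science.txt")
confident = [e for e in extractions if e.confidence >= 0.5]
print(f"{len(confident)} of {len(extractions)} extractions above threshold")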

Now move to the oie-evaluation folder, where you have a couple of options for the test set.

Running the evaluation command

python3 benchmark.py --gold=./oie_corpus/eval.oie --out=eval/srl_bert_oie2016.dat --tabbed=./systems_output/ls_long_sort_orig_qdist_old_model_new_eval.txt

An example evaluation series that creates eval plots is in the file

oie_eval/paper_eval.sh
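If you would rather build a plot from a single .dat file yourself, a minimal sketch is below. It assumes each line of the .dat file holds a precision/recall pair, which you should verify against what benchmark.py actually writes.

import matplotlib.pyplot as plt

# Hypothetical precision-recall plot from a benchmark .dat file; the
# two-column precision/recall layout is an assumption.
precisions, recalls = [], []
with open("eval/srl_bert_oie2016.dat", encoding="utf-8") as f:
    for line in f:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip headers or blank lines
        precisions.append(float(fields[0]))
        recalls.append(float(fields[1]))

plt.plot(recalls, precisions)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("LSOIE precision-recall curve")
plt.savefig("eval/pr_curve.png")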

Please Stay in Touch

If you have encountered this work and are interested in building upon it, please reach out! Many pieces shifted during the research process, and I can certainly be a resource for you as you explore the contents here.

Citation

If you find the LSOIE paper or dataset useful, please cite:

@article{lsoie-2021,
  title={{LSOIE}: A Large-Scale Dataset for Supervised Open Information Extraction},
  author={Solawetz, Jacob and Larson, Stefan},
  journal={arXiv preprint arXiv:2101.11177},
  year={2021},
  url={https://arxiv.org/pdf/2101.11177.pdf}
}