Awesome
Semi-Open Relation Extraction
The Focused Open Biology Information Extraction (FOBIE) dataset aims to support IE from Computer-Aided Biomimetics. The dataset contains ~1,500 sentences from scientific biological texts. These sentences are annotated with TRADE-OFFS and syntactically similar relations between unbounded arguments, as well as argument-modifiers.
The FOBIE dataset has been used to explore Semi-Open Relation Extraction (SORE). The code for this and instructions can be found inside the SORE folder Readme.md, or in the ReadTheDocs documentations.
Format
The train/test/dev data files are provided in two formats. A verbose json format inspired on the Semeval2018 task 7 dataset:
{"[document_ID]":
{"[relation_ID_within_document]":
{"annotations":
{"modifiers":
{"[within_sentence_modifier_ID]":
{"Arg0": {"span_start": "[token_index]",
"span_end": "[token_index]",
"span_id": "[brat_ID]",
"text": "[string]"},
"Arg1": {"span_start": "[token_index]",
"span_end": "[token_index]",
"span_id": "[brat_ID]",
"text": "[string]"}
}
},
"tradeoffs":
{"[within_sentence_tradeoff_ID]":
{"Arg0": {"span_start": "[token_index]",
"span_end": "[token_index]",
"span_id": "[brat_ID]",
"text": "[string]"},
"Arg1": {"span_start": "[token_index]",
"span_end": "[token_index]",
"span_id": "[brat_ID]",
"text": "[string]"},
"TO_indicator": {"span_start": "[token_index]",
"span_end": "[token_index]",
"span_id": "[brat_ID]",
"text": "[string]"},
"labels": {"Confidence": "High"}
}
}
},
"sentence": "[string]"
}
},
And the Sci-ERC dataset format, which is used to train the SciIE system:
{ "clusters": [],
"sentences": [["List", "of", "some", "tokens", "."]],
"ner": [[[4, 4, "Generic"]]],
"relations": [[[4, 4, 6, 17, "Tradeoff"]]],
"doc_key": "XXX"}
We also provide a script to convert data from the verbose format to SciIE format, as well as a script to convert BRAT annotations to the verbose format.
Statistics
Also see dataset_statistics.py under the scripts folder.
Train | Dev | Test | Total | |
---|---|---|---|---|
<sub># Unique documents </sub> | <sub>1010</sub> | <sub>138</sub> | <sub>144</sub> | <sub>1292</sub> |
<sub># Sentences</sub> | <sub>1248</sub> | <sub>150</sub> | <sub>150</sub> | <sub>1548</sub> |
<sub>Avg. sent. length</sub> | <sub>37.42</sub> | <sub>38.91</sub> | <sub>40.02</sub> | <sub>37.81</sub> |
<sub>% of sents ≥ 25 tokens</sub> | <sub>82.21 %</sub> | <sub>85.33 %</sub> | <sub>83.33 %</sub> | <sub>82.62%</sub> |
<sub>Relations:</sub> | ||||
<sub> - Trade-Off</sub> | <sub>639</sub> | <sub>54</sub> | <sub>72</sub> | <sub>765</sub> |
<sub> - Not-a-Trade-Off</sub> | <sub>2004</sub> | <sub>258</sub> | <sub>240</sub> | <sub>2502</sub> |
<sub> - Arg-Modifier</sub> | <sub>1247</sub> | <sub>142</sub> | <sub>132</sub> | <sub>1521</sub> |
<sub>Triggers</sub> | <sub>1292</sub> | <sub>155</sub> | <sub>153</sub> | <sub>1600</sub> |
<sub>Keyphrases</sub> | <sub>3436</sub> | <sub>401</sub> | <sub>398</sub> | <sub>4235</sub> |
<sub>Keyphrases w/ multiple relations</sub> | <sub>1600</sub> | <sub>188</sub> | <sub>163</sub> | <sub>1951</sub> |
<sub>Spans</sub> | <sub>4728</sub> | <sub>556</sub> | <sub>551</sub> | <sub>5835</sub> |
<sub>Max relations/sent</sub> | <sub>9 </sub> | <sub>8 </sub> | <sub>8 </sub> | |
<sub>Max spans/sent</sub> | <sub>9</sub> | <sub>8 </sub> | <sub>8 </sub> | |
<sub>Max triggers/sent</sub> | <sub>2 </sub> | <sub>2 </sub> | <sub>2 </sub> | |
<sub>Max args/trigger</sub> | <sub>5 </sub> | <sub>4 </sub> | <sub>4 </sub> | |
<sub>Unique spans</sub> | <sub>3643</sub> | |||
<sub>Unique triggers</sub> | <sub>41 </sub> | |||
<sub># single-word keyphrases</sub> | <sub>864 (20.4%) </sub> | |||
<sub>Avg. tokens per keyphrase</sub> | <sub>3.46 </sub> |
If you use the FOBIE dataset or SORE code in your research, please consider citing the following papers:
@inproceedings{Kruiper2020_SORE,
author = "Kruiper, Ruben
and Vincent, Julian F V
and Chen-Burger, Jessica
and Desmulliez, Marc P Y
and Konstas, Ioannis",
title = "In Layman's Terms: Semi-Open Relation Extraction from Scientific Texts"
year = "2020",
url = "https://arxiv.org/pdf/2005.07751.pdf",
arxivId = "2005.07751"
}
@inproceedings{Kruiper2020_FOBIE,
author = "Kruiper, Ruben
and Vincent, Julian F V
and Chen-Burger, Jessica
and Desmulliez, Marc P Y
and Konstas, Ioannis",
title = "A Scientific Information Extraction Dataset for Nature Inspired Engineering"
booktitle = "Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)",
year = "2020",
keywords = "Biomimetics,Relation Extraction,Scientific Information Extraction,Trade-Offs",
pages = "2078--2085",
url = "http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.255.pdf",
arxivId = "2005.07753"
}
The FOBIE dataset along with SORE code in this repository are licensed under a Creative Commons Attribution 4.0 License. <img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-sa.png" width="134" height="47">