Datasets of Annotated Semantic Relationships

This repository contains annotated datasets that can be used to train supervised models for the task of semantic relationship extraction. If you know of any other datasets and want to contribute, please notify me or submit a PR.

The datasets are divided into three groups:

Traditional Information Extraction: relationships are manually annotated and belong to pre-determined types, i.e. a closed set of classes.

Open Information Extraction: relationships are manually annotated, but don't have any specific type.

Distantly Supervised: relationships are annotated by applying some Distant Supervision technique and belong to pre-determined types.

<br><br>

| Dataset | Nr. Classes | Language | Year | Cite |
|---|---|---|---|---|
| aimed.tar.gz | 2 | English | 2005 | Subsequence Kernels for Relation Extraction |
| wikipedia_datav1.0.tar.gz | 53 | English | 2006 | Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text |
| SemEval2007-Task4.tar.gz | 7 | English | 2007 | SemEval-2007 Task 04: Classification of Semantic Relations between Nominals |
| hlt-naacl08-data.txt | 2 | English | 2007 | Learning to Extract Relations from the Web using Minimal Supervision |
| ReRelEM.tar.gz | 4 | Portuguese | 2009 | Relation detection between named entities: report of a shared task |
| SemEval2010_task8_all_data.tar.gz | 10 / 19 (directional) | English | 2010 | SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals |
| BioNLP.tar.gz | 2 | English | 2011 | Overview of BioNLP Shared Task 2011 |
| DDICorpus2013.zip | 4 | English | 2012 | The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions |
| ADE-Corpus-V2.zip | 2 | English | 2013 | Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports |
| DBpediaRelations-PT-0.2.txt.bz2 | 10 | Portuguese | 2013 | Exploring DBpedia and Wikipedia for Portuguese Semantic Relationship Extraction |
| kbp37-master.zip | 37 directional | English | 2015 | Relation Classification via Recurrent Neural Network |

<br><br>

| Dataset | Nr. Classes | Language | Year | Cite |
|---|---|---|---|---|
| DataSet-IJCNLP2011.tar.gz | Open | English | 2011 | Extracting Relation descriptors with Conditional Random Fields |
| reverb_emnlp2011_data.tar.gz | Open | English | 2011 | Identifying Relations for Open Information Extraction |
| ClausIE-datasets.tar.gz | Open | English | 2013 | ClausIE: Clause-Based Open Information Extraction |
| emnlp13_ualberta_experiments_v2.zip | Open | English | 2013 | Effectiveness and Efficiency of Open Relation Extraction |

<br><br>

| Dataset | Nr. Classes | Language | Year | Cite |
|---|---|---|---|---|
| http://iesl.cs.umass.edu/riedel/ecml/ | Distant | English | 2010 | Modeling Relations and Their Mentions without Labeled Text |
| https://github.com/google-research-datasets/relation-extraction-corpus | Distant | English | 2013 | https://research.googleblog.com/2013/04/50000-lessons-on-how-to-read-relation.html |
| PGR.zip | Distant | English | 2019 | A Silver Standard Corpus of Human Phenotype-Gene Relations |
| PGR-crowd.zip | Distant + Crowdsourced | English | 2020 | A hybrid approach toward biomedical relation extraction training corpora: combining distant supervision with crowdsourcing |

<br><br>

<a name="tie"></a>

Traditional Information Extraction

DBpediaRelations-PT

Dataset: DBpediaRelations-PT-0.2.txt.bz2

Cite: Exploring DBpedia and Wikipedia for Portuguese Semantic Relationship Extraction

Description: A collection of sentences in Portuguese that express semantic relationships between pairs of entities extracted from DBpedia. The sentences were collected by distant supervision and were then manually revised.


AImed

Dataset: aimed.tar.gz

Cite: Subsequence Kernels for Relation Extraction

Description: It consists of 225 Medline abstracts, of which 200 are known to describe interactions between human proteins, while the other 25 do not refer to any interaction. There are 4084 protein references and around 1000 tagged interactions in this dataset.


SemEval 2007

Dataset: SemEval2007-Task4.tar.gz

Cite: SemEval-2007 Task 04: Classification of Semantic Relations between Nominals

Description: Small data set, containing 7 relationship types and a total of 1,529 annotated examples.


SemEval 2010

Dataset: SemEval2010_task8_all_data.tar.gz

Cite: SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals

Description: SemEval-2010 Task 8 is a multi-way classification task in which the label for each example must be chosen from the complete set of ten relations, and the mapping from nouns to argument slots is not provided in advance. It also provides more data: 10,717 annotated examples, compared to 1,529 in SemEval-1 Task 4.
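The "10 / 19 (directional)" class count listed for this dataset can be reproduced with a small sketch: nine relations each take two argument orders, plus one undirected `Other` class. The helper names `directional_labels` and `split_label` are hypothetical, and the label spelling `Relation(e1,e2)` is assumed from the task's published label format.

```python
# Nine directed relations of SemEval-2010 Task 8; "Other" is the tenth, undirected class.
BASE_RELATIONS = [
    "Cause-Effect", "Instrument-Agency", "Product-Producer",
    "Content-Container", "Entity-Origin", "Entity-Destination",
    "Component-Whole", "Member-Collection", "Message-Topic",
]


def directional_labels(base_relations, other="Other"):
    """Expand each relation into both argument orders and append the undirected class."""
    labels = []
    for rel in base_relations:
        labels.append(f"{rel}(e1,e2)")
        labels.append(f"{rel}(e2,e1)")
    labels.append(other)
    return labels


def split_label(label):
    """Split 'Cause-Effect(e2,e1)' into ('Cause-Effect', 'e2,e1'); 'Other' has no direction."""
    if label.endswith(")") and "(" in label:
        name, _, rest = label.partition("(")
        return name, rest[:-1]
    return label, None


if __name__ == "__main__":
    labels = directional_labels(BASE_RELATIONS)
    print(len(BASE_RELATIONS) + 1, "undirected classes,", len(labels), "directional labels")
```

Collapsing the direction with `split_label` gives the ten-class view, while keeping the full string gives the 19-class view.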


ReRelEM

Dataset: ReRelEM.tar.gz

Cite: Relation detection between named entities: report of a shared task

Description: ReRelEM was the first evaluation contest (track) for Portuguese whose goal was to detect and classify relations between named entities in running text. Given a collection annotated with named entities belonging to ten different semantic categories, we marked all relationships between them within each document. We used the following fourfold relationship classification: identity, included-in, located-in, and other (which was later on explicitly detailed into twenty different relations).


Wikipedia

Dataset: wikipedia_datav1.0.tar.gz

Cite: Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text

Description: We sampled 1127 paragraphs from 271 articles from the online encyclopedia Wikipedia and labeled a total of 4701 relation instances. In addition to a large set of person-to-person relations, we also included links between people and organizations, as well as biographical facts such as birthday and jobTitle. In all, there are 53 labels in the training data.


Web

Dataset: hlt-naacl08-data.txt

Cite: Learning to Extract Relations from the Web using Minimal Supervision

Description: Corporate Acquisition Pairs and Person-Birthplace Pairs taken from the web. The corporate acquisition test set has a total of 995 instances, out of which 156 are positive. The person-birthplace test set has a total of 601 instances, and only 45 of them are positive.


BioNLP Shared Task

Dataset: BioNLP.tar.gz

Cite: Overview of BioNLP Shared Task 2011

Description: The task involves the recognition of two binary part-of relations between entities: PROTEIN-COMPONENT and SUBUNIT-COMPLEX. The task is motivated by specific challenges: the identification of the components of proteins in text is relevant e.g. to the recognition of Site arguments (cf. GE, EPI and ID tasks), and relations between proteins and their complexes are relevant to any task involving them. The REL setup is informed by recent semantic relation tasks (Hendrickx et al., 2010). The task data, consisting of new annotations for GE data, extends a previously introduced resource (Pyysalo et al., 2009; Ohta et al., 2010a).


The DDI corpus

Dataset: DDICorpus2013.zip

Cite: The DDI corpus: An annotated corpus with pharmacological substances and drug–drug interactions

Description: The DDI corpus contains MedLine abstracts on drug-drug interactions as well as documents describing drug-drug interactions from the DrugBank database. The task is designed to address the extraction of drug-drug interactions as a whole, but is divided into two subtasks to allow separate evaluation of the performance for different aspects of the problem. Four types of DDIs are proposed.


ADE-V2

Dataset: ADE-Corpus-V2.zip

Cite: Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports

Description: The work presented here aims at generating a systematically annotated corpus that can support the development and validation of methods for the automatic extraction of drug-related adverse effects from medical case reports. The documents are systematically double annotated in various rounds to ensure consistent annotations. The annotated documents are finally harmonized to generate representative consensus annotations. In order to demonstrate an example use case scenario, the corpus was employed to train and validate models for the classification of informative against the non-informative sentences. A Maximum Entropy classifier trained with simple features and evaluated by 10-fold cross-validation resulted in the F1 score of 0.70 indicating a potential useful application of the corpus.


KBP-37

Dataset: kbp37-master.zip

Cite: Relation Classification via Recurrent Neural Network

Description: This dataset is a revision of the MIML-RE annotation dataset provided by Gabor Angeli et al. (2014). They use both the 2010 and 2013 KBP official document collections, as well as a July 2013 dump of Wikipedia, as the text corpus for annotation; 33811 sentences have been annotated. To make the dataset more suitable for our task, we made several refinements:

  1. First, we add direction to the relation names, such that ‘per:employee of’ is split into two relations, ‘per:employee of(e1,e2)’ and ‘per:employee of(e2,e1)’; ‘no relation’ is the only exception. According to the description of the KBP task, we replace ‘org:parents’ with ‘org:subsidiaries’ and replace ‘org:member of’ with ‘org:member’ (by their reverse directions). This leads to 76 relations in the dataset.

  2. Then, we count the frequency of each relation in its two directions separately. Relations with low frequency are discarded so that both directions of each relation occur more than 100 times in the dataset. To better balance the dataset, 80% of the ‘no relation’ sentences are also randomly discarded.

  3. After that, the dataset is randomly shuffled and the sentences under each relation are split into three groups: 70% for training, 10% for development, and 20% for test. Finally, we remove those sentences in the development and test sets whose entity pair and relation both appear in a training sentence.
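The three refinement steps above can be sketched roughly as follows. This is an illustrative reconstruction and not the authors' code: the record layout (dicts with `e1`, `e2`, `label`), the `no_relation` spelling, and all function names are assumptions made for the example.

```python
import random
from collections import Counter

NO_RELATION = "no_relation"  # assumed spelling of the negative class


def directed_label(name, subject_first):
    """Step 1: attach a direction to every relation name except the negative class."""
    if name == NO_RELATION:
        return name
    return f"{name}(e1,e2)" if subject_first else f"{name}(e2,e1)"


def filter_rare(examples, min_count=100):
    """Step 2: keep a relation only if both directions occur more than min_count times."""
    counts = Counter(ex["label"] for ex in examples)
    kept = []
    for ex in examples:
        name = ex["label"].split("(")[0]
        if name == NO_RELATION or all(
            counts[f"{name}({d})"] > min_count for d in ("e1,e2", "e2,e1")
        ):
            kept.append(ex)
    return kept


def downsample_no_relation(examples, drop_fraction=0.8, rng=None):
    """Step 2: randomly discard roughly drop_fraction of the negative sentences."""
    rng = random.Random(0) if rng is None else rng
    return [
        ex for ex in examples
        if ex["label"] != NO_RELATION or rng.random() >= drop_fraction
    ]


def split_per_relation(examples, rng=None):
    """Step 3: shuffle, then split each relation 70/10/20 into train/dev/test."""
    rng = random.Random(0) if rng is None else rng
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex["label"], []).append(ex)
    train, dev, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = len(group) * 7 // 10  # integer arithmetic avoids float rounding
        n_dev = len(group) // 10
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]
    return train, dev, test


def remove_leaks(train, held_out):
    """Step 3: drop held-out sentences whose (entity pair, relation) also occurs in training."""
    seen = {(ex["e1"], ex["e2"], ex["label"]) for ex in train}
    return [ex for ex in held_out if (ex["e1"], ex["e2"], ex["label"]) not in seen]
```

A typical call order under these assumptions would be `filter_rare`, then `downsample_no_relation`, then `split_per_relation`, and finally `remove_leaks` applied to the development and test lists separately.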

<br><br>

<a name="oie"></a>

Open Information Extraction

ReVerb

Dataset: reverb_emnlp2011_data.tar.gz

Cite: Identifying Relations for Open Information Extraction

Description: 500 sentences sampled from the Web, using Yahoo’s random link service.


ClausIE

Dataset: ClausIE-datasets.tar.gz

Cite: ClausIE: Clause-Based Open Information Extraction

Description: Three different datasets. First, the Reverb dataset consists of 500 sentences with manually labeled extractions. The sentences have been obtained via the random-link service of Yahoo and are generally very noisy. Second, 200 random sentences from Wikipedia pages. These sentences are shorter, simpler, and less noisy than those of the Reverb dataset. Since some Wikipedia articles are written by non-native speakers, however, the Wikipedia sentences do contain some incorrect grammatical constructions. Third, 200 random sentences from the New York Times collection; these sentences are generally very clean but tend to be long and complex.


Effectiveness and Efficiency of Open Relation Extraction

Dataset: emnlp13_ualberta_experiments_v2.zip

Cite: Effectiveness and Efficiency of Open Relation Extraction

Description: WEB-500 is a commonly used dataset, developed for the TextRunner experiments (Banko and Etzioni, 2008). These sentences are often incomplete and grammatically unsound, representing the challenges of dealing with web text. NYT-500 represents the other end of the spectrum, with formal, well-written news stories from the New York Times Corpus (Sandhaus, 2008). PENN-100 contains sentences from the Penn Treebank recently used in an evaluation of the TreeKernel method (Xu et al., 2013). We manually annotated the relations for WEB-500 and NYT-500 and use the PENN-100 annotations provided by TreeKernel’s authors (Xu et al., 2013).


Extracting Relation descriptors with Conditional Random Fields

Dataset: DataSet-IJCNLP2011.tar.gz

Cite: Extracting Relation descriptors with Conditional Random Fields

Description: The New York Times data set contains 150 business articles from the New York Times. The articles were crawled from the NYT website between November 2009 and January 2010. After sentence splitting and tokenization, we used the Stanford NER tagger (URL: http://nlp.stanford.edu/ner/index.shtml) to identify PER and ORG named entities in each sentence. For named entities that contain multiple tokens, we concatenated them into a single token. We then took each pair of (PER, ORG) entities that occur in the same sentence as a single candidate relation instance, where the PER entity is treated as ARG-1 and the ORG entity is treated as ARG-2.

The Wikipedia data set was previously created by Aron Culotta et al. Since the original data set did not contain the annotation information we need, we re-annotated it. Similarly, we performed sentence splitting, tokenization and NER tagging, and took pairs of (PER, PER) entities occurring in the same sentence as a candidate relation instance. We always treat the first PER entity as ARG-1 and the second PER entity as ARG-2.
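The candidate-generation procedure described above can be illustrated with a minimal sketch. It assumes the sentence is already tokenized and NER-tagged (one tag per token, `O` for non-entities); the function names, the `_` separator used to concatenate tokens, and the rule that adjacent tokens with the same tag form one entity are all assumptions of this example, not details taken from the paper.

```python
from itertools import combinations


def merge_entities(tokens, tags):
    """Concatenate consecutive tokens that share an entity tag into a single token."""
    merged = []
    for token, tag in zip(tokens, tags):
        if merged and tag != "O" and merged[-1][1] == tag:
            merged[-1] = (merged[-1][0] + "_" + token, tag)
        else:
            merged.append((token, tag))
    return merged


def per_org_candidates(merged):
    """Every (PER, ORG) pair in the sentence: PER is ARG-1, ORG is ARG-2."""
    pers = [tok for tok, tag in merged if tag == "PER"]
    orgs = [tok for tok, tag in merged if tag == "ORG"]
    return [(p, o) for p in pers for o in orgs]


def per_per_candidates(merged):
    """Every (PER, PER) pair: the earlier mention is ARG-1, the later one ARG-2."""
    pers = [tok for tok, tag in merged if tag == "PER"]
    return list(combinations(pers, 2))


if __name__ == "__main__":
    tokens = ["Jane", "Doe", "joined", "Acme", "with", "John", "Smith"]
    tags = ["PER", "PER", "O", "ORG", "O", "PER", "PER"]
    merged = merge_entities(tokens, tags)
    print(per_org_candidates(merged))
    print(per_per_candidates(merged))
```

`itertools.combinations` preserves input order, which is what keeps the first PER mention in the ARG-1 slot.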

<br><br>

<a name="ds"></a>

Distant Supervision for Relation Extraction

NYT dataset

Dataset: http://iesl.cs.umass.edu/riedel/ecml/

Cite: Modeling Relations and Their Mentions without Labeled Text

Description: The NYT dataset is widely used for the distantly supervised relation extraction task. It was generated by aligning Freebase relations with the New York Times (NYT) corpus, with sentences from the years 2005-2006 used as the training corpus and sentences from 2007 used as the testing corpus.
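As a rough illustration of the alignment idea, the sketch below labels each ordered entity pair in a sentence with the relation a knowledge base holds for it, then splits by year as described. The toy knowledge base, the record layout, the `NA` label for pairs with no known relation, and the function names are assumptions for the example; the actual dataset construction involves more steps than shown here.

```python
from itertools import permutations


def distant_label(sentences, kb, negative="NA"):
    """Label every ordered entity pair in a sentence with the relation the KB holds for it."""
    labeled = []
    for sent in sentences:
        for e1, e2 in permutations(sent["entities"], 2):
            labeled.append({
                "text": sent["text"],
                "e1": e1,
                "e2": e2,
                "relation": kb.get((e1, e2), negative),
                "year": sent["year"],
            })
    return labeled


def split_by_year(labeled, train_years=(2005, 2006), test_years=(2007,)):
    """Sentences from 2005-2006 form the training set, sentences from 2007 the test set."""
    train = [x for x in labeled if x["year"] in train_years]
    test = [x for x in labeled if x["year"] in test_years]
    return train, test


if __name__ == "__main__":
    # Toy knowledge base: (subject, object) -> relation name.
    kb = {("Jane Doe", "Acme"): "/business/person/company"}
    sentences = [
        {"text": "Jane Doe joined Acme.", "entities": ["Jane Doe", "Acme"], "year": 2005},
        {"text": "Acme praised Jane Doe.", "entities": ["Acme", "Jane Doe"], "year": 2007},
    ]
    train, test = split_by_year(distant_label(sentences, kb))
    print(len(train), len(test))
```

Note that the second toy sentence receives the same KB relation even though it does not express it, which is the kind of label noise distant supervision is known for.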


Google's relation-extraction-corpus

Dataset: https://github.com/google-research-datasets/relation-extraction-corpus

Cite: https://research.googleblog.com/2013/04/50000-lessons-on-how-to-read-relation.html

Description: https://research.googleblog.com/2013/04/50000-lessons-on-how-to-read-relation.html


PGR Corpus

Dataset: PGR.zip

Cite: A Silver Standard Corpus of Human Phenotype-Gene Relations

Description: Human phenotype-gene relations are fundamental to fully understand the origin of some phenotypic abnormalities and their associated diseases. Biomedical literature is the most comprehensive source of these relations, however, we need Relation Extraction tools to automatically recognize them. Most of these tools require an annotated corpus and to the best of our knowledge, there is no corpus available annotated with human phenotype-gene relations. This paper presents the Phenotype-Gene Relations (PGR) corpus, a silver standard corpus of human phenotype and gene annotations and their relations. The corpus consists of 1712 abstracts, 5676 human phenotype annotations, 13835 gene annotations, and 4283 relations. We generated this corpus using Named-Entity Recognition tools, whose results were partially evaluated by eight curators, obtaining a precision of 87.01%. By using the corpus we were able to obtain promising results with two state-of-the-art deep learning tools, namely 78.05% of precision. The PGR corpus was made publicly available to the research community.


PGR-crowd Corpus

Dataset: PGR-crowd.zip

Cite: A hybrid approach toward biomedical relation extraction training corpora: combining distant supervision with crowdsourcing

Description: Biomedical relation extraction (RE) datasets are vital in the construction of knowledge bases and to potentiate the discovery of new interactions. There are several ways to create biomedical RE datasets, some more reliable than others, such as resorting to domain expert annotations. However, the emerging use of crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk), can potentially reduce the cost of RE dataset construction, even if the same level of quality cannot be guaranteed. There is a lack of power of the researcher to control who, how and in what context workers engage in crowdsourcing platforms. Hence, allying distant supervision with crowdsourcing can be a more reliable alternative. The crowdsourcing workers would be asked only to rectify or discard already existing annotations, which would make the process less dependent on their ability to interpret complex biomedical sentences. In this work, we use a previously created distantly supervised human phenotype–gene relations (PGR) dataset to perform crowdsourcing validation. We divided the original dataset into two annotation tasks: Task 1, 70% of the dataset annotated by one worker, and Task 2, 30% of the dataset annotated by seven workers. Also, for Task 2, we added an extra rater on-site and a domain expert to further assess the crowdsourcing validation quality. Here, we describe a detailed pipeline for RE crowdsourcing validation, creating a new release of the PGR dataset with partial domain expert revision, and assess the quality of the MTurk platform. We applied the new dataset to two state-of-the-art deep learning systems (BiOnt and BioBERT) and compared its performance with the original PGR dataset, as well as combinations between the two, achieving a 0.3494 increase in average F-measure. The code supporting our work and the new release of the PGR dataset is available at https://github.com/lasigeBioTM/PGR-crowd.