OPIEC: An Open Information Extraction Corpus

<img src="img/opiec-logo.png" align="right" width=200>

Introduction

OPIEC is an Open Information Extraction (OIE) corpus consisting of more than 341M triples extracted from the entire English Wikipedia. Each triple in the corpus comes with rich metadata: every token of the subject/relation/object along with its NLP annotations (POS tag, NER tag, ...), the provenance sentence along with its dependency parse, the original (golden) links from Wikipedia, sentence order, space/time annotations, etc. (for a more detailed explanation of the metadata, see the Metadata section below).
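For illustration, from a sentence such as "Bill Gates is the founder of Microsoft", an OIE system extracts a triple of the form ("Bill Gates"; "is the founder of"; "Microsoft"); in OPIEC, each token of such a triple additionally carries the annotations listed above.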

There are two major corpora released with OPIEC:

  1. OPIEC: an OIE corpus containing hundreds of millions of triples.
  2. WikipediaNLP: the entire English Wikipedia with NLP annotations.

For more details on the construction, analysis, and statistics of the corpus, read the AKBC paper "OPIEC: An Open Information Extraction Corpus". To download the data and get additional resources, please visit the project page. For the code used to construct the corpus, please visit the GitHub repository OPIEC-pipeline.

Reading the data

The data is stored in Avro format. For details about the metadata, see the Metadata section below. To read the data, you need the Avro schema files found in the avroschema directory: TripleLinked.avsc for OPIEC and WikiArticleLinkedNLP.avsc for WikipediaNLP.
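If you want to inspect a schema before reading any data, the avro Python package can parse the .avsc files directly. A minimal sketch (the path is a placeholder; point it at your copy of the schema):

import avro.schema

# Parse the schema file and list the top-level fields of the record.
with open("avroschema/TripleLinked.avsc") as f:
    schema = avro.schema.parse(f.read())
print(schema.name)                              # record name
print([field.name for field in schema.fields])  # top-level field names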

Python

There are two corpora that you can read: OPIEC and WikipediaNLP. For reading OPIEC, see src/main/py3/read_triples_from_avro_demo.py:

from avro.datafile import DataFileReader
from avro.io import DatumReader
import pdb

# The schema is embedded in the Avro data file itself; the .avsc file is
# referenced here only as documentation of the TripleLinked record.
AVRO_SCHEMA_FILE = "../../../avroschema/TripleLinked.avsc"
AVRO_FILE = "../../../data/triples.avro"  # edit this line

reader = DataFileReader(open(AVRO_FILE, "rb"), DatumReader())
for triple in reader:
    print(triple)
    # use triple.keys() to see every field in the schema (it's a dictionary)
    pdb.set_trace()  # drop into the debugger after each triple
reader.close()
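Each triple is deserialized as a plain Python dictionary. As a sketch of how the surface form of a triple might be reconstructed: the field names used below (subject, relation, object, word) are assumptions based on the metadata description above, so verify them against TripleLinked.avsc or triple.keys():

# Assumed field names -- verify against triple.keys() / TripleLinked.avsc.
subj = " ".join(token["word"] for token in triple["subject"])
rel = " ".join(token["word"] for token in triple["relation"])
obj = " ".join(token["word"] for token in triple["object"])
print((subj, rel, obj))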

Similarly, for reading WikipediaNLP, see src/main/py3/read_articles_from_avro_demo.py:

from avro.datafile import DataFileReader
from avro.io import DatumReader
import pdb

# As above, the schema is embedded in the Avro data file; the .avsc file
# documents the WikiArticleLinkedNLP record.
AVRO_SCHEMA_FILE = "../../../avroschema/WikiArticleLinkedNLP.avsc"
AVRO_FILE = "../../../data/articles.avro"  # edit this line

reader = DataFileReader(open(AVRO_FILE, "rb"), DatumReader())
for article in reader:
    print(article['title'])
    # use article.keys() to see every field in the schema (it's a dictionary)
    pdb.set_trace()  # drop into the debugger after each article
reader.close()
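If you prefer not to call reader.close() yourself, DataFileReader also works as a context manager, closing the underlying file even if the loop raises:

from avro.datafile import DataFileReader
from avro.io import DatumReader

# The reader closes the file automatically at the end of the with-block.
with DataFileReader(open("../../../data/articles.avro", "rb"), DatumReader()) as reader:
    for article in reader:
        print(article['title'])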

Java

There are two corpora that you can read: OPIEC and WikipediaNLP. For reading OPIEC, see src/main/java/de/uni_mannheim/ReadTriplesAvro.java:

package de.uni_mannheim;

import avroschema.linked.TripleLinked;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;

import java.io.File;
import java.io.IOException;

public class ReadTriplesAvro {
    public static void main(String[] args) throws IOException {
        File f = new File("data/triples.avro");
        DatumReader<TripleLinked> userDatumReader = new SpecificDatumReader<>(TripleLinked.class);
        DataFileReader<TripleLinked> dataFileReader = new DataFileReader<>(f, userDatumReader);

        // Iterate over all triples in the file.
        while (dataFileReader.hasNext()) {
            TripleLinked triple = dataFileReader.next();
            System.out.println("Processing triple: " + triple.getTripleId());
        }
        dataFileReader.close();
    }
}
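Note that TripleLinked (and WikiArticleLinkedNLP below) are Java classes generated from the Avro schemas; if they are not already included in the repository, they can be produced with Avro's code-generation tooling (e.g., the avro-tools compile schema command or the Avro Maven plugin).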

Similarly, for reading WikipediaNLP, see src/main/java/de/uni_mannheim/ReadArticlesAvro.java:

package de.uni_mannheim;

import avroschema.linked.WikiArticleLinkedNLP;

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;

public class ReadArticlesAvro {
    public static void main(String[] args) throws IOException {
        File f = new File("data/articles.avro");
        DatumReader<WikiArticleLinkedNLP> userDatumReader = new SpecificDatumReader<>(WikiArticleLinkedNLP.class);
        DataFileReader<WikiArticleLinkedNLP> dataFileReader = new DataFileReader<>(f, userDatumReader);

        // Iterate over all articles in the file.
        while (dataFileReader.hasNext()) {
            WikiArticleLinkedNLP article = dataFileReader.next();
            System.out.println("Processing article: " + article.getTitle());
        }
        dataFileReader.close();
    }
}
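Since DataFileReader implements Closeable, both readers can also be opened in a try-with-resources statement, which closes the file even when an exception is thrown.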

Metadata

There are two corpora that we are releasing: OPIEC and WikipediaNLP. This section describes the metadata of each.

WikipediaNLP

WikipediaNLP is the NLP annotation corpus for the entire English Wikipedia. Each object is a Wikipedia article containing the article text with token-level NLP annotations (POS tags, NER tags, dependency parses, ...) and the original (golden) Wikipedia links; see WikiArticleLinkedNLP.avsc for the complete list of fields.

OPIEC

Each OIE triple in OPIEC contains the following metadata (see TripleLinked.avsc for the complete list of fields):

  1. the tokens of the subject, relation, and object, each with NLP annotations (POS tag, NER tag, ...);
  2. the provenance sentence along with its dependency parse;
  3. the original (golden) links from Wikipedia;
  4. the sentence order;
  5. space/time annotations.

Citation

If you use any of these corpora, or use the findings from the paper, please cite:

@inproceedings{gashteovski2019opiec,
  title={OPIEC: An Open Information Extraction Corpus},
  author={Gashteovski, Kiril and Wanner, Sebastian and Hertling, Sven and Broscheit, Samuel and Gemulla, Rainer},
  booktitle={Proceedings of the Conference on Automatic Knowledge Base Construction (AKBC)},
  year={2019}
}
