USC Distantly-supervised Relation Extraction System
This repository brings together recent models and datasets for sentence-level relation extraction using knowledge bases (i.e., distant supervision). In particular, it contains the source code for the WWW'17 paper CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases.
Please also check out our new repository on handling shifted label distribution in distant supervision.
Task: Given a text corpus in which entity mentions have been detected and heuristically labeled using distant supervision, identify the relation types/labels between each pair of entity mentions based on the sentence context in which they co-occur.
Quick Start
Blog Posts
- [08/2017] Indirect Supervision for Relation Extraction Using Question-Answer Pairs
- [08/2016] Heterogeneous Supervision for Relation Extraction
Data
For evaluating sentence-level extraction, we converted three public datasets to our JSON format using our data pipeline. We ran Stanford NER on the training set to detect entity mentions, mapped entity names to Freebase entities using DBpedia Spotlight, aligned Freebase facts to sentences, and assigned the entity types of Freebase entities to their mapped names in sentences:
- PubMed-BioInfer: 100k PubMed paper abstracts as training data and 1,530 manually labeled biomedical paper abstracts from BioInfer (Pyysalo et al., 2007) as test data. It consists of 94 relation types (protein-protein interactions) and over 2,000 entity types (from the MeSH ontology). (Download)
- NYT-manual: 1.18M sentences sampled from 294k New York Times news articles, which were then aligned with Freebase facts by Riedel et al. (ECML'10) (link to Riedel's data). For the test set, 395 sentences are manually annotated with 24 relation types and 47 entity types (Hoffmann et al., ACL'11) (link to Hoffmann's data). (Download)
- Wiki-KBP: the training corpus contains 1.5M sentences sampled from 780k Wikipedia articles (Ling & Weld, 2012) plus ~7,000 sentences from the 2013 KBP corpus. Test data consists of 14k system-labeled sentences from the 2013 KBP slot filling assessment results. It has 7 relation types and 126 entity types after filtering out numeric value relations. (Download)
Please put the data files in the corresponding subdirectories under data/source.
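The step of aligning knowledge-base facts to sentences can be pictured with a toy sketch. The fact table, the entity names, the output field names, and the `align_facts` helper below are all hypothetical and made up for illustration; this is not the repository's data pipeline. Pairs with no matching fact are given the label "None" here.

```python
from itertools import permutations

# Toy knowledge base: (head entity, tail entity) -> set of relation labels.
# These facts are illustrative only and are not taken from Freebase.
KB_FACTS = {
    ("Barack Obama", "Honolulu"): {"place_of_birth"},
    ("Barack Obama", "United States"): {"nationality"},
}


def align_facts(entity_mentions, kb_facts):
    """Label every ordered pair of entity mentions found in one sentence.

    A pair receives all relation labels the KB holds for it, or
    ["None"] when the KB has no fact for that pair.
    """
    labeled = []
    for head, tail in permutations(entity_mentions, 2):
        labels = sorted(kb_facts.get((head, tail), {"None"}))
        labeled.append({"em1": head, "em2": tail, "labels": labels})
    return labeled


mentions = ["Barack Obama", "Honolulu"]
for pair in align_facts(mentions, KB_FACTS):
    print(pair)
```

Because the labels come from the knowledge base rather than from reading the sentence, some of them will be wrong for a given sentence, which is the noise that distantly supervised models have to cope with.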
Benchmark
Performance comparison with several relation extraction systems on the KBP 2013 dataset (sentence-level extraction).
Method | Precision | Recall | F1 |
---|---|---|---|
Mintz (our implementation, Mintz et al., 2009) | 0.296 | 0.387 | 0.335 |
LINE + Dist Sup (Tang et al., 2015) | 0.360 | 0.257 | 0.299 |
MultiR (Hoffmann et al., 2011) | 0.325 | 0.278 | 0.301 |
FCM + Dist Sup (Gormley et al., 2015) | 0.151 | 0.498 | 0.300 |
HypeNet (our implementation, Shwartz et al., 2016) | 0.210 | 0.315 | 0.252 |
CNN (our implementation, Zeng et al., 2014) | 0.198 | 0.334 | 0.242 |
PCNN (our implementation, Zeng et al., 2015) | 0.220 | 0.452 | 0.295 |
LSTM (our implementation) | 0.274 | 0.500 | 0.350 |
Bi-GRU (our implementation) | 0.301 | 0.465 | 0.362 |
SDP-LSTM (our implementation, Xu et al., 2015) | 0.300 | 0.436 | 0.356 |
Position-Aware LSTM (Zhang et al., 2017) | 0.265 | 0.598 | 0.367 |
CoType-RM (Ren et al., 2017) | 0.303 | 0.407 | 0.347 |
CoType (Ren et al., 2017) | 0.348 | 0.406 | 0.369 |
Note: for models that are trained on sentences annotated with a single label (HypeNet, CNN/PCNN, LSTM, SDP/PA-LSTMs, Bi-GRU), we form one training instance for each sentence-label pair based on their DS-annotated data.
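The expansion into sentence-label pairs can be sketched as follows. The instance layout (`sentence`, `em1`, `em2`, `labels`) and the helper name `expand_to_single_label` are assumptions chosen for illustration, not the repository's actual format or code.

```python
def expand_to_single_label(instances):
    """Turn each multi-label instance into one instance per label."""
    expanded = []
    for inst in instances:
        for label in inst["labels"]:
            single = dict(inst)      # shallow copy of the other fields
            del single["labels"]     # drop the multi-label field
            single["label"] = label  # keep exactly one label
            expanded.append(single)
    return expanded


# Hypothetical DS-annotated data: the first sentence carries two labels.
ds_annotated = [
    {"sentence": "s1", "em1": "A", "em2": "B", "labels": ["r1", "r2"]},
    {"sentence": "s2", "em1": "C", "em2": "D", "labels": ["r3"]},
]
for row in expand_to_single_label(ds_annotated):
    print(row["sentence"], row["label"])
```

With this layout, a sentence annotated with two labels contributes two training instances that share the same text and entity pair.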
Usage
Dependencies
We use Ubuntu as an example.
- python 2.7
- Python library dependencies
$ pip install pexpect ujson tqdm
- Stanford CoreNLP 3.7.0 and its Python wrapper. Please put the library under `code/DataProcessor/`.
$ cd code/DataProcessor/
$ git clone git@github.com:stanfordnlp/stanza.git
$ cd stanza
$ pip install -e .
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
$ unzip stanford-corenlp-full-2016-10-31.zip
- eigen 3.2.5 (already included).
We have included compiled binaries. If you need to re-compile retype.cpp under your own g++ environment:
$ cd code/Model/retype; make
Default Run
As an example, we show how to run CoType on the Wiki-KBP dataset.
Start the Stanford CoreNLP server for the Python wrapper.
$ java -mx4g -cp "code/DataProcessor/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
Feature extraction, embedding learning on training data, and evaluation on test data.
$ ./run.sh
For relation classification, the "none"-labeled instances first need to be removed from the train/test JSON files. The hyperparameters for embedding learning are included in the run.sh script.
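A minimal sketch of that filtering step is shown below. It rests on several assumptions that should be checked against your own files: that each line of the JSON file holds one sentence object, that relation mentions live under a `relationMentions` key with a `label` field, and that the none label is spelled "None". The helper `drop_none_relations` is hypothetical and is not part of the repository.

```python
import json


def drop_none_relations(json_lines, none_label="None"):
    """Yield JSON lines with "none"-labeled relation mentions removed.

    Sentences left with no relation mention are skipped entirely
    (a choice made for this sketch).
    """
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        sent = json.loads(line)
        # Assumed field names: "relationMentions" and "label".
        kept = [rm for rm in sent.get("relationMentions", [])
                if rm.get("label") != none_label]
        if kept:
            sent["relationMentions"] = kept
            yield json.dumps(sent)


example_lines = [
    json.dumps({"sentText": "a", "relationMentions": [{"label": "None"}, {"label": "r1"}]}),
    json.dumps({"sentText": "b", "relationMentions": [{"label": "None"}]}),
]
for cleaned in drop_none_relations(example_lines):
    print(cleaned)
```

To use it on real data, the same generator can be fed an open file object and its output written line by line to a new file, leaving the original untouched.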
Parameters
Dataset to run on.
Data="KBP"
- Hyperparameters for relation extraction:
- KBP: -negative 3 -iters 400 -lr 0.02 -transWeight 1.0
- NYT: -negative 5 -iters 700 -lr 0.02 -transWeight 7.0
- BioInfer: -negative 5 -iters 700 -lr 0.02 -transWeight 7.0
Hyperparameters for relation classification are included in the run.sh script.
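For scripting runs across datasets, the relation extraction settings listed above can be kept in one lookup table. This is only a convenience sketch: the values are copied from the list, while the names `RE_HYPERPARAMS`, `FLAG_ORDER`, and `build_flags` are hypothetical and do not appear in run.sh.

```python
# Values copied from the hyperparameter list; keys are the dataset names.
RE_HYPERPARAMS = {
    "KBP": {"negative": 3, "iters": 400, "lr": 0.02, "transWeight": 1.0},
    "NYT": {"negative": 5, "iters": 700, "lr": 0.02, "transWeight": 7.0},
    "BioInfer": {"negative": 5, "iters": 700, "lr": 0.02, "transWeight": 7.0},
}

# Keep the flags in the same order as they are listed in this README.
FLAG_ORDER = ["negative", "iters", "lr", "transWeight"]


def build_flags(dataset):
    """Return the flag list for one dataset, e.g. ["-negative", "3", ...]."""
    params = RE_HYPERPARAMS[dataset]
    flags = []
    for name in FLAG_ORDER:
        flags.extend(["-" + name, str(params[name])])
    return flags


print(" ".join(build_flags("KBP")))
# prints: -negative 3 -iters 400 -lr 0.02 -transWeight 1.0
```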
Evaluation
Evaluate relation extraction performance (precision, recall, F1): first produce predictions along with their confidence scores, then filter the predicted instances by tuning the threshold.
$ python code/Evaluation/emb_test.py extract KBP retype cosine 0.0
$ python code/Evaluation/tune_threshold.py extract KBP emb retype cosine
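The idea behind threshold tuning can be illustrated with a small, self-contained sketch. It is not the code in tune_threshold.py; the data layout (dicts keyed by instance id), the function names, and the toy predictions are assumptions made for illustration.

```python
def precision_recall_f1(predictions, gold, threshold):
    """Score the predictions whose confidence is at least `threshold`.

    predictions: dict mapping instance id -> (label, confidence)
    gold: dict mapping instance id -> true label
    """
    kept = {k: lab for k, (lab, conf) in predictions.items() if conf >= threshold}
    correct = sum(1 for k, lab in kept.items() if gold.get(k) == lab)
    precision = correct / float(len(kept)) if kept else 0.0
    recall = correct / float(len(gold)) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_threshold(predictions, gold, candidates):
    """Return (threshold, (precision, recall, f1)) with the highest F1."""
    scored = [(t, precision_recall_f1(predictions, gold, t)) for t in candidates]
    return max(scored, key=lambda item: item[1][2])


# Toy data: three instances, all with true label "r1".
preds = {1: ("r1", 0.9), 2: ("r2", 0.4), 3: ("r1", 0.2)}
gold = {1: "r1", 2: "r1", 3: "r1"}
print(best_threshold(preds, gold, [0.0, 0.3, 0.5]))
```

Raising the threshold discards low-confidence predictions, which tends to trade recall for precision; the sweep simply picks the candidate value where F1 is highest.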
In-text Prediction
The last command in run.sh generates a JSON file of predicted results, in the same format as test.json in data/source/$DATANAME, except that only the predicted relation mention labels are output. Replace the second parameter with whatever threshold you would like.
$ python code/Evaluation/convertPredictionToJson.py $Data 0.0
Customized Run
Code for producing the JSON files from a raw corpus for running CoType and baseline models is here.
Baselines
You can find our implementations of some recent relation extraction models under the code/Model/ directory.
References
- Xiang Ren, Zeqiu Wu, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, Tarek F. Abdelzaher, Jiawei Han. "CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases", WWW 2017.
- Meng Qu, Xiang Ren, Yu Zhang, Jiawei Han. "Weakly-supervised Relation Extraction by Pattern-enhanced Embedding Learning", WWW 2018.
- Liyuan Liu*, Xiang Ren*, Qi Zhu, Shi Zhi, Huan Gui, Heng Ji, Jiawei Han. "Heterogeneous Supervision for Relation Extraction: A Representation Learning Approach", EMNLP 2017.
- Ellen Wu, Xiang Ren, Frank Xu, Ji Li, Jiawei Han. "Indirect Supervision for Relation Extraction using Question-Answer Pairs", WSDM 2018.
Contributors
- Ellen Wu
- Meng Qu
- Frank Xu
- Wenqi He
- Maosen Zhang
- Qinyuan Ye
- Xiang Ren