HacRED

Dataset for HacRED: A Large-Scale Relation Extraction Dataset Toward Hard Cases in Practical Applications

We first analyze the performance gap between popular datasets and practical applications; the underlying reason is that practical applications intrinsically contain more hard cases.

To make RE models more robust on such practical hard cases, we propose the Hard Case Relation Extraction Dataset (HacRED).

HacRED consists of 65,225 relational facts annotated from 9,231 wiki documents, with sufficient and diverse hard cases.

Notably, HacRED is one of the largest Chinese document-level RE datasets; it achieves a high 96% F1 score on data quality and has a more reasonable data distribution.

Statistics

Common Statistics                     HacRED
# Text                                9,231
# Relation                            26
# Triple                              67,047
# Fact                                65,225
Avg. Sentences                        5.0
Avg. Words                            126.6
Avg. Chars                            204.2
Avg. Entities                         10.8
Avg. Triples                          7.4

Data Distribution                     HacRED
Ratio of Duplicated Triples           2.72%
Ratio of Biased Relations             0.00%
Ratio of Top-20% Relational Triples   49.96%

File List

main files:

meta-data files:

hard cases for detailed analysis:

Download

  1. Download from Google Drive:

https://drive.google.com/drive/folders/1T6QUfDV_ILAr6UJ_fROYQd4-NaFxIzqN?usp=sharing

  2. Download from the homepage of the Knowledge Works Laboratory at Fudan University:

http://kw.fudan.edu.cn/

Data Format

The data files (train.json, dev.json, test.json) are in JSONL format.

Each line contains the annotations of one document. The overall format is similar to that of DocRED. Because Chinese text can be tokenized at either the character level or the word level, we provide both kinds of annotation.

To use the data at the character level, sents_char, vertex_char, and labels_char in HacRED correspond to sents, vertexSet, and labels in DocRED, respectively.

To use the data at the word level, sents_word, vertex_word, and labels_word in HacRED correspond to sents, vertexSet, and labels in DocRED, respectively.
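The field correspondence described above can be sketched as a simple renaming step. This is only an illustration: `FIELD_MAP` and `to_docred_style` are hypothetical names that do not come from the HacRED release, and the sketch renames only the three fields listed above without claiming full DocRED compatibility.

```python
# Hypothetical helper: maps DocRED field names to HacRED field names
# for each tokenization level, following the correspondence in the text.
FIELD_MAP = {
    "char": {"sents": "sents_char", "vertexSet": "vertex_char", "labels": "labels_char"},
    "word": {"sents": "sents_word", "vertexSet": "vertex_word", "labels": "labels_word"},
}


def to_docred_style(doc, level="char"):
    """Return a new dict exposing one tokenization level under DocRED-style keys."""
    if level not in FIELD_MAP:
        raise ValueError(f"level must be 'char' or 'word', got {level!r}")
    # Keep the document-level fields unchanged.
    out = {"id": doc.get("id"), "text": doc.get("text")}
    for docred_key, hacred_key in FIELD_MAP[level].items():
        out[docred_key] = doc[hacred_key]
    return out
```

Calling `to_docred_style(doc, level="word")` on a parsed record would then expose the word-level annotations under the keys `sents`, `vertexSet`, and `labels`.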

{
    "id": 6020,   # unique key identifying the document
    "text": "......",
    "sents_char": [
        ["token_1", "token_2", "token_3",..., "token_m"],   # tokens of sentence 0, at character level
        [],   # tokens of sentence 1
        [],
        ...
        []
    ],
    "vertex_char": [
        [
            {
                "name":"<a BOOK_NAME mention>",   # entity mention
                "sent_id":0,       # index of the sentence containing the mention, starting from 0
                "type":"WORK",     # NER type of the entity
                "pos":[8,13]       # mention boundaries within that sentence; sents_char[sent_id][pos[0]: pos[1]] yields the mention
            }, 
            {}, 
            {}   # one entity may have multiple mentions in a document; name, sent_id, and pos may differ between them
        ],
        [{}, {}],	# another entity
        ...,
        [{}] 
    ],
    "labels_char":[
        {
            "r":"<a RELATION label>",	# textual description of the relation, not a relation label id
            "h":0,	# index of the head entity (subject, e1) in vertex_char, starting from 0
            "t":1	# index of the tail entity (object, e2) in vertex_char
        },
        {},
        ...
    ],
    "sents_word":[
        [],
        [],
        ...,
        []
    ], # same format as sents_char, but with word-level tokens
    "vertex_word":[
        [{}, {}],
        [{}],
        ...
    ], # same format as vertex_char, but sentence indices and boundaries are computed on sents_word
    "labels_word":[
        {}, 
        {},
        ...
    ]	# same format as labels_char, but subject/object indices refer to vertex_word
}
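As a minimal sketch of how a record with this layout might be read, the snippet below loads a JSONL file, recovers a mention's tokens from `sent_id` and `pos`, and decodes the labels into (head, relation, tail) triples. The function names `read_jsonl`, `mention_tokens`, and `triples` are hypothetical and not part of the release. The sketch assumes that each line of the released files is plain JSON (the `#` comments above being annotations of the example only) and, as a simplification, uses the first mention's name to represent each entity.

```python
import json


def read_jsonl(path):
    """Yield one document (a dict) per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def mention_tokens(doc, mention, level="char"):
    """Return the tokens of a mention via sents_<level>[sent_id][pos[0]:pos[1]]."""
    start, end = mention["pos"]
    return doc[f"sents_{level}"][mention["sent_id"]][start:end]


def triples(doc, level="char"):
    """Yield (head name, relation, tail name) for each label of one document.

    Assumption: the first mention of an entity is used as its display name.
    """
    vertices = doc[f"vertex_{level}"]
    for label in doc[f"labels_{level}"]:
        head = vertices[label["h"]][0]["name"]
        tail = vertices[label["t"]][0]["name"]
        yield head, label["r"], tail
```

For example, `for doc in read_jsonl("train.json"): print(list(triples(doc)))` would print the character-level triples of each training document, assuming train.json sits in the working directory.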

Cite

@inproceedings{cheng-etal-2021-hacred,
    title = "{H}ac{RED}: A Large-Scale Relation Extraction Dataset Toward Hard Cases in Practical Applications",
    author = "Cheng, Qiao  and
      Liu, Juntao  and
      Qu, Xiaoye  and
      Zhao, Jin  and
      Liang, Jiaqing  and
      Wang, Zhefeng  and
      Huai, Baoxing  and
      Yuan, Nicholas Jing  and
      Xiao, Yanghua",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    year = "2021",
    pages = "2819--2831",
}