DocEE Dataset

DocEE: A Large-Scale and Fine-grained Benchmark for Document-level Event Extraction

Introduction

DocEE is a new document-level event extraction dataset with 27,000+ events and 180,000+ arguments. It has three features: large-scale manual annotation, a fine-grained argument schema, and application-oriented settings. DocEE focuses on extracting the main event of each document, that is, one event per document.

Our academic paper can be found here: https://tongmeihan1995.github.io/meihan.github.io/research/NAACL2022.pdf.

Download DocEE

DocEE is now available at https://drive.google.com/drive/folders/1_cRnc2leAmOKT9Ma8koz6X8Ivl-_lapp?usp=sharing. The folder includes three files.

Download DocEE-zh

DocEE-zh is now available. The dataset and its ontology can be downloaded from https://drive.google.com/drive/folders/15YDTsiTvt7qMC9itKoK5IyUAdcD8ezXB?usp=share_link. DocEE-zh contains 36,729 annotated records.

Baseline

For the baseline and evaluation metrics, refer to the project at https://github.com/tongmeihan1995/DocEE-Application

Event Schema of DocEE

To construct the event schema, we draw on journalism, which divides events into hard news and soft news (Reinemann et al., 2012; Tuchman, 1973).

Hard news covers social emergencies that must be reported immediately, such as earthquakes, road accidents and armed conflicts.

Soft news refers to interesting incidents related to human life, such as celebrity deeds, sports events and other entertainment-centric reports.

Based on the hard/soft news theory and the category framework in (Lehman-Wilzig and Seletzky, 2010), we define a total of 59 event types: 31 hard news event types and 28 soft news event types. We provide the full event ontology in Event Schema.md.

Example of DocEE

DocEE targets two tasks: event classification and event argument extraction. Here is an example of DocEE. [Image: an example of DocEE]

For each event argument, we annotate four keys:

{'start': 82, 'end': 96, 'type': 'Date', 'text': 'Friday evening'}
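As a rough illustration of how these four keys fit together, the sketch below checks that an argument's offsets recover its text. It assumes that `start` and `end` are character offsets into the document text and that `end` is exclusive. The README does not state this, but it is consistent with the example above, where 96 - 82 equals the length of 'Friday evening'. The names `Argument` and `span_matches` and the synthetic document are hypothetical and not part of the DocEE release.

```python
from typing import TypedDict


class Argument(TypedDict):
    start: int  # offset where the span begins (assumed: character offset)
    end: int    # offset just past the span (assumed: exclusive)
    type: str   # argument type, e.g. 'Date'
    text: str   # surface string of the annotated span


def span_matches(document: str, arg: Argument) -> bool:
    """Return True if document[start:end] equals the annotated text."""
    return document[arg["start"]:arg["end"]] == arg["text"]


# Synthetic stand-in for a document: 82 filler characters, then the phrase.
document = "x" * 82 + "Friday evening" + " and some trailing text."
arg: Argument = {"start": 82, "end": 96, "type": "Date", "text": "Friday evening"}

print(span_matches(document, arg))
```

A check like this can be useful after any preprocessing that might shift offsets, such as whitespace normalization.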

Statistics of DocEE

DocEE is currently the largest dataset for document-level event extraction.

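| Datasets | #isDocEvent | #EvTyp. | #ArgTyp. | #Doc. | #ArgInst. | #ArgScat. |
| --- | --- | --- | --- | --- | --- | --- |
| ACE2005 | | 33 | 35 | 599 | 9,590 | 1.0 |
| KBP2016 | | 18 | 20 | 169 | 7,919 | 1.0 |
| KBP2017 | | 18 | 20 | 167 | 10,929 | 1.0 |
| MUC-4 | | 4 | 5 | 1,700 | 2,641 | 4.0 |
| WikiEvents | | 50 | 59 | 246 | 5,536 | 2.2 |
| RAMS | | 139 | 65 | 9,124 | 21,237 | 4.8 |
| DocEE (ours) | | 59 | 356 | 27,485 | 180,528 | 10.2 |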

How do I cite DocEE?

For now, cite the NAACL paper:

@article{tongdocee,
  title={DocEE: A Large-Scale and Fine-grained Benchmark for Document-level Event Extraction},
  author={Tong, Meihan and Xu, Bin and Wang, Shuai and Han, Meihuan and Cao, Yixin and Zhu, Jiangqi and Chen, Siyu and Hou, Lei and Li, Juanzi},
  journal={2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics},
  year={2022}
}