Home

Awesome

GENTLE

Repository for the Genre Tests for Linguistic Evaluation (GENTLE) Corpus

This repository contains release versions of the Genre Tests for Linguistic Evaluation (GENTLE) corpus, an English out-of-domain test set following the same multilayer annotations found in the GUM corpus. The texts are of the following 8 genres:

Splits - test only

The entire corpus is designed to be a test set of challenging genres for NLP systems to be evaluated on. Although one can train a model on this corpus, or concatenate it to another training set, we present this entire corpus as a test set, and do not provide any official train / dev data.

Citing

To cite this corpus, please refer to the following article:

@inproceedings{aoyama-etal-2023-gentle,
    title = "{GENTLE}: A Genre-Diverse Multilayer Challenge Set for {E}nglish {NLP} and Linguistic Evaluation",
    author = "Aoyama, Tatsuya  and
      Behzad, Shabnam  and
      Gessler, Luke  and
      Levine, Lauren  and
      Lin, Jessica  and
      Liu, Yang Janet  and
      Peng, Siyao  and
      Zhu, Yilun  and
      Zeldes, Amir",
    booktitle = "Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII)",
    year = "2023",
    address = "Toronto, Canada",
    url = "https://aclanthology.org/2023.law-1.17",
    doi = "10.18653/v1/2023.law-1.17",
    pages = "166--178",
}

Directories

The corpus is downloadable in multiple formats. Not all formats contain all annotations: The most accessible format is probably CoNLL-U dependencies (in dep/), but the most complete XML representation is in PAULA XML, and the easiest way to search in the corpus is using ANNIS. Here is an example query for phrases headed by 'one' bridging back to a different, previously mentioned entity. Other formats may be useful for other purposes. See website for more details.