<div align="center">


Paper Hugging Face Datasets License: CC BY-NC 4.0

</div>

Data and evaluation code for the paper MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation).

@inproceedings{tedeschi-navigli-2022-multinerd,
    title = "{M}ulti{NERD}: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguation)",
    author = "Tedeschi, Simone  and
      Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-naacl.60",
    doi = "10.18653/v1/2022.findings-naacl.60",
    pages = "801--812",
    abstract = "Named Entity Recognition (NER) is the task of identifying named entities in texts and classifying them through specific semantic categories, a process which is crucial for a wide range of NLP applications. Current datasets for NER focus mainly on coarse-grained entity types, tend to consider a single textual genre and to cover a narrow set of languages, thus limiting the general applicability of NER systems. In this work, we design a new methodology for automatically producing NER annotations, and address the aforementioned limitations by introducing a novel dataset that covers 10 languages, 15 NER categories and 2 textual genres. We also introduce a manually-annotated test set, and extensively evaluate the quality of our novel dataset on both this new test set and standard benchmarks for NER. In addition, in our dataset, we include: i) disambiguation information to enable the development of multilingual entity linking systems, and ii) image URLs to encourage the creation of multimodal systems. We release our dataset at https://github.com/Babelscape/multinerd.",
}

Please consider citing our work if you use data and/or code from this repository.

In a nutshell, MultiNERD is the first language-agnostic methodology for automatically creating multilingual, multi-genre and fine-grained annotations for Named Entity Recognition and Entity Disambiguation. Specifically, it can be seen as an extension that combines two prior works from our research group: WikiNEuRal, from which we took inspiration for the state-of-the-art silver-data creation methodology, and NER4EL, from which we took the fine-grained classes and inspiration for the entity linking part.

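The produced dataset covers 10 languages (EN, ES, NL, DE, RU, IT, FR, PL, PT, ZH), 15 NER categories (PER, ORG, LOC, ANIM, BIO, CEL, DIS, EVE, FOOD, INST, MEDIA, MYTH, PLANT, TIME, VEHI) and 2 textual genres (Wikipedia and Wikinews).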

Additionally, we included image URLs to encourage the creation of multimodal systems.

Finally, MultiNERD shows consistent improvements over state-of-the-art alternative data production methods on common NER benchmarks, while covering a broader set of NER categories (15 vs. 4).

<br>

Data

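| Dataset Version | Sentences | Tokens | PER | ORG | LOC | ANIM | BIO | CEL | DIS | EVE | FOOD | INST | MEDIA | MYTH | PLANT | TIME | VEHI | OTHER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MultiNERD EN | 164.1K | 3.6M | 75.8K | 33.7K | 78.5K | 15.5K | 0.2K | 2.8K | 11.2K | 3.2K | 11.0K | 0.4K | 7.5K | 0.7K | 9.5K | 3.2K | 0.5K | 3.1M |
| MultiNERD ES | 173.2K | 4.3M | 70.9K | 20.6K | 90.2K | 10.5K | 0.3K | 2.4K | 8.6K | 6.8K | 7.8K | 0.6K | 8.0K | 1.6K | 7.6K | 45.3K | 0.3K | 3.8M |
| MultiNERD NL | 171.7K | 3.0M | 56.9K | 21.4K | 78.7K | 34.4K | 0.1K | 2.1K | 6.1K | 4.7K | 5.6K | 0.2K | 3.8K | 1.3K | 6.3K | 31.0K | 0.4K | 2.7M |
| MultiNERD DE | 156.8K | 2.7M | 79.2K | 31.2K | 72.8K | 11.5K | 0.1K | 1.4K | 5.2K | 4.0K | 3.6K | 0.1K | 2.8K | 0.8K | 7.8K | 3.3K | 0.5K | 2.4M |
| MultiNERD RU | 129.0K | 2.3M | 43.4K | 21.5K | 75.2K | 7.3K | 0.1K | 1.2K | 1.9K | 2.8K | 3.2K | 1.1K | 11.3K | 0.6K | 4.8K | 22.8K | 0.5K | 2.0M |
| MultiNERD IT | 181.9K | 4.7M | 75.3K | 19.3K | 98.5K | 8.8K | 0.1K | 5.2K | 6.5K | 5.8K | 5.8K | 0.8K | 8.6K | 1.8K | 5.1K | 71.2K | 0.6K | 4.2M |
| MultiNERD FR | 176.2K | 4.3M | 89.6K | 28.2K | 90.9K | 11.4K | 0.1K | 2.3K | 3.1K | 7.4K | 3.2K | 0.7K | 8.0K | 2.0K | 4.4K | 27.4K | 0.6K | 3.8M |
| MultiNERD PL | 195.0K | 3.0M | 66.5K | 29.2K | 100.0K | 19.7K | 0.1K | 3.3K | 6.5K | 6.7K | 3.3K | 0.6K | 4.9K | 1.3K | 6.6K | 44.1K | 0.7K | 2.5M |
| MultiNERD PT | 177.6K | 3.9M | 54.0K | 13.2K | 124.8K | 14.7K | 0.1K | 4.2K | 6.8K | 5.9K | 5.4K | 0.6K | 9.1K | 1.6K | 9.2K | 48.6K | 0.3K | 3.4M |
| MultiNERD ZH | 195.3K | 5.8M | 68.3K | 20.8K | 49.6K | 26.1K | 0.4K | 0.8K | 0.1K | 5.1K | 1.9K | 1.1K | 55.9K | 1.8K | 6.1K | 0.4K | 0.3K | 3.4M |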
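
As a quick way to work with the figures above, the following sketch parses the abbreviated counts (`K` for thousands, `M` for millions) and sums the sentence counts over the ten languages. The helper `parse_count` and the `SENTENCES` dictionary are illustrative names and not part of this repository; the values are copied from the table and are rounded, so the total is only approximate.

```python
# Illustrative helper, not part of the MultiNERD repository.
# Sentence counts copied from the statistics table (rounded values).
SENTENCES = {
    "EN": "164.1K", "ES": "173.2K", "NL": "171.7K", "DE": "156.8K", "RU": "129.0K",
    "IT": "181.9K", "FR": "176.2K", "PL": "195.0K", "PT": "177.6K", "ZH": "195.3K",
}

_SUFFIXES = {"K": 1_000, "M": 1_000_000}


def parse_count(text: str) -> float:
    """Convert an abbreviated count such as '164.1K' or '3.6M' to a number."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in _SUFFIXES:
        return float(text[:-1]) * _SUFFIXES[suffix]
    return float(text)


# Approximate total across languages, and the split with the most sentences.
total_sentences = sum(parse_count(v) for v in SENTENCES.values())
largest = max(SENTENCES, key=lambda lang: parse_count(SENTENCES[lang]))

print(f"Total sentences (approx.): {total_sentences:,.0f}")
print(f"Largest split by sentences: {largest}")
```

The same helper can be applied to any other column of the table, for example to compare how many TIME or MEDIA annotations each language contributes.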

We remark that the datasets are automatically created and may therefore contain errors. Specifically, the highest-quality classes (in terms of both precision and number of annotations, according to Table 3 and Figure 1 in the paper) are PER, ORG, LOC, CEL, DIS, EVE and MEDIA, while the others can often be very noisy due to the ambiguity of their instances.
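
One way to act on this caveat is to keep only the higher-quality classes and map every other tag to `O`. The sketch below assumes BIO-style string tags such as `B-PER` and `I-ANIM`; that tag format and the function name `keep_reliable` are assumptions made for illustration, so adapt them to the format of the files you actually load.

```python
# Classes named above as highest quality (Table 3 and Figure 1 of the paper).
RELIABLE = frozenset({"PER", "ORG", "LOC", "CEL", "DIS", "EVE", "MEDIA"})


def keep_reliable(tags: list[str], reliable: frozenset = RELIABLE) -> list[str]:
    """Map tags of noisier classes to 'O'; assumes BIO tags like 'B-PER'."""
    cleaned = []
    for tag in tags:
        # 'B-PER' splits into prefix 'B' and class 'PER'; a plain 'O' has no
        # class part, so its label is the empty string and it stays 'O'.
        _, _, label = tag.partition("-")
        cleaned.append(tag if label in reliable else "O")
    return cleaned


example = ["B-PER", "I-PER", "O", "B-ANIM", "I-ANIM", "B-LOC"]
print(keep_reliable(example))
```

Masking noisy classes in this way turns their mentions into negative examples. Whether that is preferable to dropping the affected sentences entirely depends on the downstream task.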

<br>

License

MultiNERD is licensed under the CC BY-NC-SA 4.0 license. The text of the license can be found here.

We underline that the sources from which the raw sentences were extracted are Wikipedia (wikipedia.org) and Wikinews (wikinews.org), and that the NER annotations were produced by Babelscape.

<br>

Acknowledgments

We gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union’s Horizon 2020 research and innovation programme (http://mousse-project.org/).