NYTK-NerKor

The home repository of the NYTK-NerKor corpus, a Hungarian gold standard named entity annotated corpus containing 1 million tokens.

:construction: We are currently checking the morphological annotation layers related to Universal Dependencies. An update is expected soon, see all the details here. :construction:

License and usage

The corpus creation was funded by the Hungarian Research Centre for Linguistics (Nyelvtudományi Kutatóközpont, NYTK). The project leaders were Eszter Simon and Noémi Vadász.

The corpus is available under the CC-BY-SA 4.0 license. If you use this corpus, please cite our paper (see below).

Data

Corpus files are under the data folder. The train, devel and test subfolders contain the data files grouped by genre: fiction, legal, news, web, wikipedia.

The corpus contains gold standard morphological annotation besides NE labels.

The proportion of train, devel and test sets is around 80%-10%-10%. All sets provide a balanced selection from all genres and sources. For exact numbers, see the train-devel-test table below.

The fiction subcorpus contains i) novels from MEK (Hungarian Electronic Library) and Project Gutenberg; and ii) subtitles from OpenSubtitles.

The legal texts come from EU sources: they are a selection from the EU Constitution, documents from the European Economic and Social Committee, DGT-Acquis and JRC-Acquis.

The sources of the news subcorpus are the Press Release Database of the European Commission, Global Voices and the NewsCrawl Corpus.

Web texts contain a selection from the Hungarian Webcorpus 2.0.

Wikipedia texts are from the Hungarian Wikipedia. :)

Token numbers

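| genre | file | sentence | token |
| --- | ---: | ---: | ---: |
| fiction | 122 | 24690 | 203014 |
| legal | 39 | 7272 | 191984 |
| news | 82 | 9767 | 213157 |
| web | 398 | 10886 | 187853 |
| wikipedia | 157 | 14702 | 221332 |
| altogether | 798 | 67317 | 1017340 |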

NE labels and density

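| genre | PER | LOC | ORG | MISC | NE | NE density |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| fiction | 5206 | 1010 | 212 | 281 | 6709 | 0.03304698198 |
| legal | 249 | 1247 | 6536 | 1798 | 9830 | 0.05120218352 |
| news | 4588 | 2309 | 5325 | 3681 | 15903 | 0.07460697983 |
| web | 2826 | 1343 | 1789 | 2434 | 8392 | 0.04467322854 |
| wikipedia | 8897 | 9156 | 5386 | 4403 | 27842 | 0.1257929265 |
| altogether | 21766 | 15065 | 19248 | 12597 | 68676 | 0.0675054554 |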
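The NE density column appears to be the number of NEs divided by the number of tokens in the same genre. That definition is an assumption inferred from the figures, not something stated explicitly here. A minimal sketch that recomputes it from the two tables above:

```python
# Token and NE counts copied from the tables above.
tokens = {
    "fiction": 203014,
    "legal": 191984,
    "news": 213157,
    "web": 187853,
    "wikipedia": 221332,
}
entities = {
    "fiction": 6709,
    "legal": 9830,
    "news": 15903,
    "web": 8392,
    "wikipedia": 27842,
}


def ne_density(genre):
    """NE count divided by token count (assumed definition of the density column)."""
    return entities[genre] / tokens[genre]


# Density over the whole corpus: all NEs divided by all tokens.
overall_density = sum(entities.values()) / sum(tokens.values())

for genre in tokens:
    print(f"{genre}: {ne_density(genre):.4f}")
print(f"altogether: {overall_density:.4f}")
```

Under this definition, Wikipedia is the densest genre and fiction the sparsest, which is worth keeping in mind when comparing per-genre tagger scores.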

Train-devel-test sets

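| genre | train | devel | test |
| --- | ---: | ---: | ---: |
| fiction | 161318 | 20903 | 20793 |
| legal | 151910 | 20454 | 19620 |
| news | 170747 | 20673 | 21737 |
| web | 150725 | 18401 | 18727 |
| wikipedia | 176515 | 22667 | 22150 |
| altogether | 811215 | 103098 | 103027 |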
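As a quick sanity check on the "around 80%-10%-10%" figure, the split shares can be recomputed from the token counts in the "altogether" row. This is only an illustrative sketch; the variable names are not part of the corpus tooling.

```python
# Token counts from the "altogether" row of the train-devel-test table.
splits = {"train": 811215, "devel": 103098, "test": 103027}

total = sum(splits.values())  # should equal the corpus size, 1017340 tokens

# Share of each split in the whole corpus.
shares = {name: count / total for name, count in splits.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The shares come out at roughly 79.7%, 10.1% and 10.1% of the tokens.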

Data format

The data files are in CoNLL-U Plus format with the standard .conllup file extension. The first line in each file is: # global.columns = FORM LEMMA UPOS XPOS FEATS CONLL:NER, where:

FORM: the token itself;

LEMMA: the lemma of the token (according to the UD guidelines);

UPOS: UD POS tags;

XPOS: full morphological annotation (POS + morphosyntactic features) provided by emMorph;

FEATS: UD morphosyntactic features;

CONLL:NER: NE annotation;

EMMORPH:LEMMA: the lemma of the token (dictionary form without derivation).

For details on UD part-of-speech tags and morphosyntactic features, see ud_pos_feats.md.
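A minimal sketch of reading this format in Python, assuming tab-separated token lines, blank lines between sentences, and column names taken from the global.columns header rather than hard-coded. The function name read_conllup and the two-token sample are hypothetical and made up for illustration; they are not taken from the corpus.

```python
import io


def read_conllup(lines):
    """Yield sentences as lists of {column: value} dicts from CoNLL-U Plus lines."""
    columns, sentence = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # Column names come from the header, so extra columns are handled too.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#") and "\t" not in line:
            # Assumption: comment lines contain no tab, so a token whose FORM
            # starts with '#' is still read as a token.
            continue
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


# Made-up sample for illustration only.
sample = io.StringIO(
    "# global.columns = FORM LEMMA UPOS XPOS FEATS CONLL:NER\n"
    "Budapest\tBudapest\tPROPN\t[/N][Nom]\tCase=Nom|Number=Sing\tB-LOC\n"
    ".\t.\tPUNCT\t[Punct]\t_\tO\n"
    "\n"
)
sentences = list(read_conllup(sample))
print(sentences[0][0]["FORM"], sentences[0][0]["CONLL:NER"])
```

To read a real file, pass an open file handle instead of the sample, for example open(path, encoding="utf-8").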

The NE annotation follows the CoNLL-2002 labelling standard. The four NE categories are: PER, LOC, MISC, ORG. The tags are in the IOB2 format: a B- prefix marks the first token of an NE phrase and an I- prefix marks every non-initial token. Non-names are marked by an O label.
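To illustrate how IOB2 tags group into entities, here is a small sketch that turns a tag sequence into (label, start, end) spans with an exclusive end index. The function name iob2_spans is hypothetical, and the lenient treatment of a stray I- tag as the start of a new span is an assumption of this sketch, not a rule of the corpus.

```python
def iob2_spans(tags):
    """Return (label, start, end) triples, end exclusive, from IOB2 tags."""
    spans, start, label = [], None, None
    # The trailing "O" is a sentinel that flushes a span ending at the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        if tag.startswith("I-") and tag[2:] == label:
            continue  # still inside the current entity
        if label is not None:
            # Any other tag closes the entity that was open.
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # B- opens a new entity; a stray I- is leniently treated the same way.
            start, label = i, tag[2:]
    return spans


tags = ["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG", "I-ORG"]
print(iob2_spans(tags))
# [('PER', 0, 2), ('LOC', 3, 4), ('ORG', 4, 7)]
```

Because every entity starts with B-, two adjacent entities of the same category remain distinguishable, which is the main advantage of IOB2 over plain IO tagging.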

Guidelines

The annotation guidelines, the WebAnno guidelines and the annotation scheme are available in the Guidelines folder (in Hungarian only).

Citation

If you use this resource or any part of its documentation, please refer to:

Simon, Eszter; Vadász, Noémi. (2021) Introducing NYTK-NerKor, A Gold Standard Hungarian Named Entity Annotated Corpus. In: Ekštein K., Pártl F., Konopík M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_19

@inproceedings{DBLP:conf/tsd/SimonV21,
  author    = {Eszter Simon and
               No{\'{e}}mi Vad{\'{a}}sz},
  editor    = {Kamil Ekstein and
               Frantisek P{\'{a}}rtl and
               Miloslav Konop{\'{\i}}k},
  title     = {Introducing NYTK-NerKor, {A} Gold Standard Hungarian Named Entity
               Annotated Corpus},
  booktitle = {Text, Speech, and Dialogue - 24th International Conference, {TSD}
               2021, Olomouc, Czech Republic, September 6-9, 2021, Proceedings},
  series    = {Lecture Notes in Computer Science},
  volume    = {12848},
  pages     = {222--234},
  publisher = {Springer},
  year      = {2021},
  doi       = {10.1007/978-3-030-83527-9\_19},
}