Awesome

NYTK-NerKor

The home repository of the NYTK-NerKor corpus, a Hungarian gold standard named entity annotated corpus containing 1 million tokens.

:construction: We are currently checking the morphological annotation layers related to Universal Dependencies. An update is expected soon, see all the details here. :construction:

License and usage

The corpus creation was funded by the Hungarian Research Centre for Linguistics (Nyelvtudományi Kutatóközpont, NYTK). The project leaders were Eszter Simon and Noémi Vadász.

The corpus is available under the license CC-BY-SA 4.0. If you use this corpus, please cite our paper (see below).

Data

Corpus files are under the data folder. The train, devel and test subfolders contain the data files grouped by genre: fiction, legal, news, web, wikipedia.

The corpus contains gold standard morphological annotation besides NE labels.

The proportion of train, devel and test sets is around 80%-10%-10%. All sets provide a balanced selection from all genres and sources. For exact numbers, see the train-devel-test table below.

The fiction subcorpus contains i) novels from MEK (Hungarian Electronic Library) and Project Gutenberg; and ii) subtitles from OpenSubtitles.

The legal texts come from EU sources: it is a selection from the EU Constitution, documents from the European Economic and Social Committee, DGT-Acquis and JRC-Acquis.

The sources of the news subcorpus are: Press Release Database of European Commission, Global Voices and NewsCrawl Corpus.

Web texts contain a selection from the Hungarian Webcorpus 2.0.

Wikipedia texts are from the Hungarian Wikipedia. :)

Token numbers

genre	file	sentence	token
fiction	122	24690	203014
legal	39	7272	191984
news	82	9767	213157
web	398	10886	187853
wikipedia	157	14702	221332
altogether	798	67317	1017340

NE labels and density

genre	PER	LOC	ORG	MISC	NE	NE density
fiction	5206	1010	212	281	6709	0.03304698198
legal	249	1247	6536	1798	9830	0.05120218352
news	4588	2309	5325	3681	15903	0.07460697983
web	2826	1343	1789	2434	8392	0.04467322854
wikipedia	8897	9156	5386	4403	27842	0.1257929265
altogether	21766	15065	19248	12597	68676	0.0675054554

Train-devel-test sets

genre	train	devel	test
fiction	161318	20903	20793
legal	151910	20454	19620
news	170747	20673	21737
web	150725	18401	18727
wikipedia	176515	22667	22150
altogether	811215	103098	103027

Data format

The format of data files are CoNLL-U Plus with the standard .conllup file extension. The first line in each file is: # global.columns = FORM LEMMA UPOS XPOS FEATS CONLL:NER, where:

FORM: the token itself;

LEMMA: the lemma of the token (according to the UD guidelines);

UPOS: UD POS tags;

XPOS: full morphological annotation (POS + morphosyntactic features) provided by emMorph;

FEATS: UD morphosyntactic features;

CONLL:NER: NE annotation;

EMMORPH:LEMMA: the lemma of the token (dictionary form without derivation);

For details on UD part-of-speech tags and morphosyntactic features, see ud_pos_feats.md.

The NE annotation follows the CoNLL2002 labelling standard. The four NE categories are: PER, LOC, MISC, ORG. The tags are in the IOB2 format: a B- prefix denotes the first item of a NE phrase and an I- prefix any non-initial word. Non-names are marked by an O label.

Guidelines

Annotation guidelines, WebAnno guidelines and Annotation scheme are available in the Guidelines folder. (Only in Hungarian.)

Citation

If you use this resource or any part of its documentation, please refer to:

Simon, Eszter; Vadász, Noémi. (2021) Introducing NYTK-NerKor, A Gold Standard Hungarian Named Entity Annotated Corpus. In: Ekštein K., Pártl F., Konopík M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_19

@inproceedings{DBLP:conf/tsd/SimonV21,
  author    = {Eszter Simon and
               No{\'{e}}mi Vad{\'{a}}sz},
  editor    = {Kamil Ekstein and
               Frantisek P{\'{a}}rtl and
               Miloslav Konop{\'{\i}}k},
  title     = {Introducing NYTK-NerKor, {A} Gold Standard Hungarian Named Entity
               Annotated Corpus},
  booktitle = {Text, Speech, and Dialogue - 24th International Conference, {TSD}
               2021, Olomouc, Czech Republic, September 6-9, 2021, Proceedings},
  series    = {Lecture Notes in Computer Science},
  volume    = {12848},
  pages     = {222--234},
  publisher = {Springer},
  year      = {2021},
  doi       = {10.1007/978-3-030-83527-9\_19},
}