NorNE: Norwegian Named Entities

This dataset is described in the paper NorNE: Annotating Named Entities for Norwegian by Fredrik Jørgensen, Tobias Aasmoe, Anne-Stine Ruud Husevåg, Lilja Øvrelid, and Erik Velldal, accepted for LREC 2020 and available as pre-print here: https://arxiv.org/abs/1911.12146

NorNE adds named entity annotations on top of the Norwegian Dependency Treebank and was created as a collaboration between Schibsted Media Group, Språkbanken at the National Library of Norway and the Language Technology Group at the University of Oslo.

The NorNE corpus is published under the same license as the Norwegian Dependency Treebank.

About the Norwegian Dependency Treebank (NDT)

The texts in the Norwegian Dependency Treebank (NDT) are manually annotated with morphological features, syntactic functions and hierarchical structure. The formalism used for the syntactic annotation is dependency grammar.

The treebank consists of two parts, one in Norwegian Bokmål (nob) and one in Norwegian Nynorsk (nno). Each part contains around 300,000 tokens and is a mix of different non-fictional genres.

See the NDT webpage for more details.

About the Named Entity annotations

NDT has been extended with NER annotations. The texts, tokenization and syntactic annotations from the original NDT have not been changed in any way.

The annotated files are distributed in two different collections:

  1. ndt/, the same files as in the NDT resource, extended with entity annotations.
  2. ud/, the files in ndt/ in a train/dev/test split, as distributed in the Universal Dependencies project, extended with entity annotations. More details on the splits can be found in the documentation of the Norwegian Bokmål UD project.

Each subdirectory contains one folder for each of the two written variants of Norwegian, Bokmål (nob) and Nynorsk (nno).

Entity types

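The following types of entities are annotated: person (PER), organization (ORG), location (LOC), geo-political entity (GPE), product (PROD), event (EVT) and derived (DRV). A few MISC annotations also occur in the training split, as shown in the class distribution table.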

Furthermore, all GPE entities are additionally sub-categorized as being either ORG or LOC, with the two annotation levels separated by an underscore, giving the labels GPE_LOC and GPE_ORG.

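The two special types GPE_LOC and GPE_ORG can easily be altered depending on the task, choosing either the more general GPE tag or the more specific LOC/ORG tags, conflating them with the other annotations of the same type. This means that the following sets of entity types can be derived:

  1. The full set, keeping GPE_LOC and GPE_ORG as distinct types.
  2. A set where GPE_LOC and GPE_ORG are both replaced by the general GPE type.
  3. A set where GPE_LOC is merged into LOC and GPE_ORG is merged into ORG.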
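The conflation described above can be sketched in a few lines of Python. This is only an illustration and not part of the corpus distribution: the function name `conflate_gpe` and the mode names "keep", "general" and "specific" are made up for this example.

```python
def conflate_gpe(label: str, mode: str = "keep") -> str:
    """Map GPE_LOC / GPE_ORG labels to a coarser scheme.

    mode="keep":     leave the label unchanged
    mode="general":  GPE_LOC, GPE_ORG -> GPE
    mode="specific": GPE_LOC -> LOC, GPE_ORG -> ORG

    Accepts bare types ("GPE_LOC") as well as IOB2 labels ("B-GPE_LOC").
    The function and mode names are hypothetical, chosen for this sketch.
    """
    if mode not in ("keep", "general", "specific"):
        raise ValueError(f"unknown mode: {mode!r}")

    # Separate an optional IOB2 prefix ("B-" or "I-") from the entity type.
    if "-" in label:
        prefix, etype = label.split("-", 1)
        prefix += "-"
    else:
        prefix, etype = "", label

    if mode != "keep" and etype.startswith("GPE_"):
        general, specific = etype.split("_", 1)
        etype = general if mode == "general" else specific

    return prefix + etype
```

For instance, `conflate_gpe("B-GPE_LOC", "general")` gives `"B-GPE"`, while `conflate_gpe("B-GPE_LOC", "specific")` gives `"B-LOC"`. Labels of other types, and the `O` tag, pass through unchanged.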

The class distribution is as follows, broken down across the data splits of the UD version of NDT, and sorted by total counts (i.e. the number of examples, not tokens within the spans of the annotations):

    Type       Train   Dev   Test   Total
    PER         4033   607    560    5200
    ORG         2828   400    283    3511
    GPE_LOC     2132   258    257    2647
    PROD         671   162     71     904
    LOC          613   109    103     825
    GPE_ORG      388    55     50     493
    DRV          519    77     48     644
    EVT          131     9      5     145
    MISC           8     0      0       0

Entity definitions

Annotation principles

  1. A name in this context is close to Saul Kripke's definition of a name, in that a name has a unique reference and its meaning is constant (there are exceptions in the annotations, e.g. "Regjeringen" (en. "Government")).
  2. It is the usage of a name that determines the entity type, not the default/literal sense of the name.
  3. If there is an ambiguity in the type/sense of a name, then the default/literal sense of the name is chosen (following Markert and Nissim, 2002).

For more details, see the "Annotation Guidelines.pdf" distributed with the corpus.

Annotation scheme

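The entities are annotated using the IOB2 format, where the prefix B- marks the first token of an entity, I- marks a subsequent token inside the same entity, and O marks a token outside any entity.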

Example:

    1  John      ...   name=B-PER
    2  Towner    ...   name=I-PER
    3  Williams  ...   name=I-PER
    4  is        ...   name=O
    ...
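To illustrate how IOB2 tags map back to entity spans, here is a minimal decoding sketch. The function name `iob2_spans` is hypothetical. The lenient handling of an I- tag that does not follow a matching B- tag is an assumption of this example, not something prescribed by the corpus.

```python
def iob2_spans(tags):
    """Return a list of (start, end, type) spans from a sequence of IOB2 tags.

    `start` is inclusive and `end` is exclusive, both 0-based token offsets.
    A stray "I-X" with no open span of type X is treated as starting a new
    span (a lenient choice made for this sketch).
    """
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag of the same type.
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    if start is not None:
        # Close a span that runs to the end of the sentence.
        spans.append((start, len(tags), etype))
    return spans
```

Applied to the tags of the example above, `iob2_spans(["B-PER", "I-PER", "I-PER", "O"])` returns `[(0, 3, "PER")]`, i.e. one PER entity covering the first three tokens.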

File format

The texts are in the CoNLL-U format, a tab-separated columnar format, as described below.

Named entity annotations are found in the MISC field (10th column), in the format name=<TYPE>.
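As a rough sketch, the entity tag can be pulled out of a word line as follows. The function name `entity_tag` is made up for this example. The code assumes that the MISC field may hold several `key=value` items separated by `|`, as is conventional in CoNLL-U, so it looks for the `name` key instead of assuming it is the only item.

```python
from typing import Optional


def entity_tag(word_line: str) -> Optional[str]:
    """Return the value of name=... from the MISC column, or None if absent.

    Hypothetical helper: assumes a 10-column, tab-separated word line whose
    MISC field is "_" or a "|"-separated list of key=value items.
    """
    fields = word_line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    misc = fields[9]
    if misc == "_":  # underscore means the field is empty
        return None
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        if key == "name" and sep:
            return value
    return None
```

For a word line whose last column is `SpaceAfter=No|name=B-PER`, the function returns `"B-PER"`.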

From the Universal Dependencies description of the CoNLL-U format:

Annotations are encoded in plain text files (UTF-8, using only the LF character as line break, including an LF character at the end of file) with three types of lines:

  1. Word lines containing the annotation of a word/token in 10 fields separated by single tab characters; see below.
  2. Blank lines marking sentence boundaries.
  3. Comment lines starting with hash (#).
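The three line types are enough to split a file into sentences. The generator below is a minimal sketch of that idea; the name `read_sentences` is hypothetical, and it takes any iterable of lines (for example an open file object) rather than assuming a particular file name.

```python
def read_sentences(lines):
    """Yield (comments, words) for each sentence in an iterable of lines.

    `comments` is a list of comment lines (starting with "#") and `words`
    is a list of word lines, each split into its tab-separated fields.
    """
    comments, words = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # Blank line: sentence boundary.
            if words:
                yield comments, words
            comments, words = [], []
        elif line.startswith("#"):
            comments.append(line)
        else:
            words.append(line.split("\t"))
    if words:
        # Tolerate a missing blank line after the last sentence.
        yield comments, words
```

With a file opened as `open(path, encoding="utf-8")`, where `path` points to one of the distributed files, `list(read_sentences(f))` gives one `(comments, words)` pair per sentence.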

Sentences consist of one or more word lines, and word lines contain the following fields:

  1. ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes.
  2. FORM: Word form or punctuation symbol.
  3. LEMMA: Lemma or stem of word form.
  4. UPOS: Universal part-of-speech tag.
  5. XPOS: Language-specific part-of-speech tag; underscore if not available.
  6. FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
  7. HEAD: Head of the current word, which is either a value of ID or zero (0).
  8. DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
  9. DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
  10. MISC: Any other annotation.
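The ten fields can be given names in code to make word lines easier to work with. The sketch below is one possible representation, not an official API: the `Word` type and the helpers `parse_word_line` and `id_kind` are names invented for this example. All values are kept as strings, since ID may be an integer, a range or a decimal number.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """One CoNLL-U word line; field order follows the list above."""
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def parse_word_line(line: str) -> Word:
    """Split a tab-separated word line into its 10 named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    return Word(*fields)


def id_kind(word_id: str) -> str:
    """Classify an ID value as 'multiword' (range), 'empty' (decimal) or 'word'."""
    if "-" in word_id:
        return "multiword"
    if "." in word_id:
        return "empty"
    return "word"
```

A parsed line can then be accessed by field name, for example `parse_word_line(line).misc` for the column that holds the entity annotation.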

The fields DEPS and MISC replace the obsolete fields PHEAD and PDEPREL of the CoNLL-X format. In addition, we have modified the usage of the ID, FORM, LEMMA, XPOS, FEATS and HEAD fields as explained below.

The fields must additionally meet the following constraints: