Summary

The Swedish-Talbanken treebank is based on Talbanken, a treebank developed at Lund University in the 1970s.

Introduction

The Swedish-Talbanken treebank is a conversion of the Prose section of Talbanken (Einarsson, 1976), originally annotated by a team led by Ulf Teleman at Lund University according to the MAMBA annotation scheme (Teleman, 1974). It consists of roughly 6,000 sentences and 95,000 tokens taken from a variety of informative text genres, including textbooks, information brochures, and newspaper articles. The syntactic annotation is converted directly from the original MAMBA annotation, while the morphological annotation is based on the reannotation performed when incorporating Talbanken into the Swedish Treebank (Nivre and Megyesi, 2007). Tokenization mostly follows the standard of the Stockholm-Umeå Corpus, Version 2.0 (2006), and lemmatization is based on SALDO (Borin et al., 2008).

Acknowledgments

The new conversion has been performed by Joakim Nivre and Aaron Smith at Uppsala University. We thank everyone who has been involved in previous conversion efforts at Växjö University and Uppsala University, including Bengt Dahlqvist, Sofia Gustafson-Capkova, Johan Hall, Anna Sågvall Hein, Beáta Megyesi, Jens Nilsson, and Filip Salomonsson. Special thanks also to Lars Borin and Markus Forsberg at Språkbanken for help with the lemmatization. Finally, we owe a huge debt to the team who produced the original treebank in the 1970s.

References

Data Splits

The test set (sv-ud-test.conllu) is the standard test set from the Swedish Treebank, which is a balanced sample of complete documents from different parts of the treebank.

The rest of the treebank has been split by taking the first 90% as the training set (sv-ud-train.conllu) and the last 10% as the development set (sv-ud-dev.conllu).
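The positional split described above can be sketched as follows. This is a minimal illustration, not the script used to produce the released files: the function name split_train_dev and the placeholder sentence identifiers are hypothetical, and the sketch assumes the split is taken over an ordered list of sentences, which the description above does not state explicitly.

```python
def split_train_dev(items, train_percent=90):
    """Return (leading part, trailing part) of an ordered sequence.

    Integer arithmetic is used so that the cut point does not depend on
    floating-point rounding.
    """
    cut = len(items) * train_percent // 100
    return items[:cut], items[cut:]


# Hypothetical stand-in for the non-test sentences, in treebank order.
remaining = [f"sent-{i}" for i in range(1, 21)]

train, dev = split_train_dev(remaining)
print(len(train), len(dev))  # prints "18 2"
```

Because the split is positional rather than random, the development set comes entirely from the end of the remaining material.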

Document and paragraph boundaries are explicitly represented by comment lines (# newdoc id = DOC_ID, # newpar id = PAR_ID), but genre classification is not available for documents.
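As an illustration of how these comment lines can be used, the sketch below groups sentence identifiers by document and paragraph. The sample text, its identifiers, and the helper names (token_line, index_boundaries) are invented for the example, and the sketch assumes each sentence also carries a "# sent_id = ..." comment as in the CoNLL-U format.

```python
import re

NEWDOC = re.compile(r"^# newdoc id = (.+)$")
NEWPAR = re.compile(r"^# newpar id = (.+)$")
SENT_ID = re.compile(r"^# sent_id = (.+)$")


def token_line(idx, form):
    # Only ID and FORM are filled in; the other eight columns are left as "_".
    return "\t".join([str(idx), form] + ["_"] * 8)


# Hypothetical sample; the identifiers do not come from the treebank.
SAMPLE = "\n".join([
    "# newdoc id = doc-a",
    "# newpar id = doc-a-p1",
    "# sent_id = s1",
    token_line(1, "Hej"),
    "",
    "# sent_id = s2",
    token_line(1, "Ja"),
    "",
    "# newpar id = doc-a-p2",
    "# sent_id = s3",
    token_line(1, "Nej"),
    "",
    "# newdoc id = doc-b",
    "# newpar id = doc-b-p1",
    "# sent_id = s4",
    token_line(1, "Tack"),
    "",
])


def index_boundaries(conllu_text):
    """Map each document id to {paragraph id: [sentence ids]}."""
    docs = {}
    doc_id = par_id = None
    for line in conllu_text.splitlines():
        m = NEWDOC.match(line)
        if m:
            doc_id, par_id = m.group(1).strip(), None
            docs[doc_id] = {}
            continue
        m = NEWPAR.match(line)
        if m:
            par_id = m.group(1).strip()
            docs.setdefault(doc_id, {})[par_id] = []
            continue
        m = SENT_ID.match(line)
        if m:
            paragraphs = docs.setdefault(doc_id, {})
            paragraphs.setdefault(par_id, []).append(m.group(1).strip())
    return docs


index = index_boundaries(SAMPLE)
for doc, paragraphs in index.items():
    print(doc, {par: len(sents) for par, sents in paragraphs.items()})
```

A sentence without its own newdoc or newpar comment is assigned to the most recently opened document and paragraph.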

Tokenization

The tokenization in the Swedish-Talbanken treebank follows the principles of the Stockholm-Umeå Corpus, Version 2.0 (SUC, 2006), which has become the de facto standard for Swedish tokenization and part-of-speech tagging. This is a straightforward segmentation based on whitespace and punctuation, but the following special cases deserve to be mentioned:

The Swedish-Talbanken treebank contains the following tokens with spaces (all abbreviations):

Bl a, bl a, d v s, e d, f n, fr o m, Fr o m, m fl, m m, o s v, s k, t ex, t o m, t v

The Swedish-Talbanken treebank does not contain multiword tokens.
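Both statements can be checked mechanically on a CoNLL-U file. The sketch below collects word forms that contain a space and any multiword token ranges (IDs of the form "3-4"). The sample sentence and the function name scan_tokens are invented for illustration; columns that are not needed are left as "_".

```python
def scan_tokens(conllu_text):
    """Return (set of forms containing a space, list of multiword token IDs)."""
    forms_with_spaces = set()
    multiword_ranges = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        token_id, form = cols[0], cols[1]
        if "-" in token_id:
            multiword_ranges.append(token_id)  # e.g. "3-4"
        elif " " in form:
            forms_with_spaces.add(form)
    return forms_with_spaces, multiword_ranges


def row(idx, form, upos, xpos):
    # ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
    return "\t".join([str(idx), form, "_", upos, xpos, "_", "_", "_", "_", "_"])


# Hypothetical sample sentence, not taken from the treebank.
SAMPLE = "\n".join([
    "# text = Vi läser t ex böcker .",
    row(1, "Vi", "PRON", "PN"),
    row(2, "läser", "VERB", "VB"),
    row(3, "t ex", "ADV", "AB"),
    row(4, "böcker", "NOUN", "NN"),
    row(5, ".", "PUNCT", "MAD"),
    "",
])

spaced, ranges = scan_tokens(SAMPLE)
print(sorted(spaced), ranges)
```

Because the columns are separated by tabs, a space inside the FORM column is unambiguous and needs no special escaping.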

Morphology

The morphological annotation in the Swedish-Talbanken treebank follows the general guidelines and does not add any language-specific features. The language-specific tags (including features) follow the guidelines of the Stockholm-Umeå Corpus.

The mapping from language-specific tags and features to universal tags and features was done automatically. We are not aware of any remaining errors or inconsistencies, but the mapping has not been validated manually.
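One simple way to inspect an automatic mapping of this kind is to tabulate which universal tags co-occur with each language-specific tag. The sketch below is only an illustration: the function names and sample rows are invented, and it assumes that the XPOS column holds the language-specific tag followed by its features separated by "|". A tag that maps to several universal tags is not necessarily an error (verbs, for instance, may be split into VERB and AUX), so the output is a list of cases to review, not a list of mistakes.

```python
from collections import defaultdict


def upos_by_xpos(conllu_text):
    """Map each language-specific tag to the set of universal tags seen with it."""
    table = defaultdict(set)
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip comments, blank lines, and non-word lines
        upos, xpos = cols[3], cols[4]
        base_tag = xpos.split("|")[0]  # assumption: features follow the tag after "|"
        table[base_tag].add(upos)
    return table


def tags_to_review(table):
    """Keep only language-specific tags that map to more than one universal tag."""
    return {tag: sorted(upos) for tag, upos in table.items() if len(upos) > 1}


def row(idx, form, upos, xpos):
    # ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
    return "\t".join([str(idx), form, "_", upos, xpos, "_", "_", "_", "_", "_"])


# Hypothetical sample rows, not taken from the treebank.
SAMPLE = "\n".join([
    row(1, "har", "AUX", "VB|PRS|AKT"),
    row(2, "läst", "VERB", "VB|SUP|AKT"),
    row(3, "boken", "NOUN", "NN|UTR|SIN|DEF|NOM"),
    row(4, ".", "PUNCT", "MAD"),
])

review = tags_to_review(upos_by_xpos(SAMPLE))
print(review)
```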

Lemmas were assigned using SALDO (Borin et al., 2008) in combination with the language-specific SUC tags. Cases of remaining ambiguity were resolved heuristically, which may have introduced errors. For words and symbols not covered by SALDO, lemmas were added manually.

Syntax

The syntactic annotation in the Swedish-Talbanken treebank follows the general guidelines but adds four language-specific relations:

The syntactic annotation has been automatically converted from the original MAMBA annotation scheme in Talbanken. The following phenomena are known to deviate from the general guidelines and will be fixed in future versions:

Changelog

From v1 to v1.1, an extensive (but not complete) manual validation was carried out, resulting in a large number of conversion errors being corrected. Specifically, all non-projective trees were validated.

From v1.1 to v1.2, complex names and multiword expressions have been manually validated. As a result, the annotation of complex names now conforms to the universal guidelines.

From v1.2 to v1.3, we fixed the following annotation bugs/inconsistencies:

From v1.3 to v1.4, only the documentation has been updated to reflect the fact that there are two treebanks for Swedish.

From v1.4 to v2.0, we have implemented the following changes to conform to v2 of the guidelines:

From v2.0 to v2.1, no changes have been made.

From v2.1 to v2.2:

From v2.2 to v2.3:

From v2.6 to v2.7:

From v2.10 to v2.11:

From v2.13 to v2.14:

From v2.14 to v2.15:

=== Machine readable metadata ==============
Data available since: UD v1.0
License: CC BY-SA 4.0
Includes text: yes
Genre: news nonfiction
Lemmas: automatic with corrections
UPOS: converted with corrections
XPOS: manual native
Features: converted with corrections
Relations: converted with corrections
Contributors: Nivre, Joakim; Smith, Aaron; Norrman, Victor
Contributing: elsewhere
Contact: joakim.nivre@lingfil.uu.se