Awesome Danish
A curated list of awesome resources for Danish language technology
Data
Corpora
- Danish Gigaword - Collection of over one billion (10^9) words of Danish text. Described in The Danish Gigaword Corpus (Scholia)
- Danish review dataset - Trustpilot-crawled dataset by Alessandro Gianfelici with 44,085 reviews.
- OSCAR - Danish corpus derived from the Common Crawl corpus. Described in Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures (Scholia)
- CLARIN-DK-UCPH
  - The Danish Parliament Corpus 2009 - 2017, v1. The license is Creative Commons - Attribution 4.0 International
  - Grundtvig's Works Corpus. Not for commercial use as the license is Creative Commons - Attribution-NonCommercial 4.0 International.
  - DK-CLARIN Reference Corpus of General Danish - Only for academic use.
- DanFEVER - Danish text corpus with over 6,400 claims and support. Described in DanFEVER: claim verification dataset for Danish (Scholia)
- DanNet - wordnet with usage examples. The usage examples have been used for word sense disambiguation, see XL-WSD: An Extra-Large and Cross-Lingual Evaluation Framework for Word Sense Disambiguation
- SemDaX - POS-tagged (only adjectives, nouns and verbs), super sense tagged and BIO-tagged sentences. For educational, teaching or research purposes only.
- NOMCO - "an annotated multimodal collection of conversational Danish". Apparently not directly available for download. (Scholia)
- Danish Propbank - commercial resource with 87,000 tokens annotated with morphosyntactic, VerbNet classes and semantic roles.
- Danish Dependency Treebank v. 1.0 - Matthias Trautner Kromann et al.'s dependency annotation of some texts from PAROLE.
- Mr. Bean corpus - Small Danish-Italian corpus with written and spoken retelling (of Mr Bean episodes) and argumentative text (about smoking). Possibly described in Tekststrukturering på italiensk og dansk
- Køge Corpus - Danish-Turkish transcribed corpus by Jens Normann Jørgensen.
- Danske taler - Collection of Danish speeches. API available at https://dansketaler.dk/wp-json/wp/v2/tale
- DKhate - corpus of 3,600 hate speech items from Twitter and Reddit as well as news comments. Described in Offensive Language and Hate Speech Detection for Danish (Scholia)
- DaNewsroom - Danish summarization dataset. Probably to appear in 2020. Described in DaNewsroom: A Large-scale Danish Summarisation Dataset (Scholia)
- Wikipedia
  - wiki40b/da - Cleaned-up text from Danish Wikipedia. Described in Wiki-40B: Multilingual Language Model Dataset. (Scholia)
- XED - emotion annotated movie subtitles. Described in XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection (Scholia).
- DaN+ - annotated for nested named entities on top of the entire Danish Universal Dependencies (UD_Danish-DDT) and 3 new web domains and includes lexical normalization. Described in DaN+: Danish Nested Named Entities and Lexical Normalization
- WikiANN - Named entity annotated corpus. Described in Cross-lingual Name Tagging and Linking for 282 Languages (Scholia)
- Corona Dataset - Question dataset from Certainly annotated for domain and intent.
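
The Danske taler entry above points to an API endpoint. The following is a minimal sketch of how such an endpoint might be paged through and its results read. It assumes the endpoint accepts the usual WordPress REST paging parameters `per_page` and `page` and returns a JSON list whose items carry a `title.rendered` field. Neither assumption is confirmed by this list, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the Danske taler entry in the list above.
BASE_URL = "https://dansketaler.dk/wp-json/wp/v2/tale"


def build_page_url(page=1, per_page=10):
    """Build a paged request URL (assumes WordPress-style paging parameters)."""
    return BASE_URL + "?" + urlencode({"per_page": per_page, "page": page})


def extract_titles(payload):
    """Extract titles from a decoded JSON list.

    Assumes each item looks like {"title": {"rendered": "..."}};
    items without that shape are skipped.
    """
    titles = []
    for item in payload:
        title = item.get("title", {})
        if isinstance(title, dict) and "rendered" in title:
            titles.append(title["rendered"])
    return titles


# Made-up sample response, used here instead of a live request.
sample = json.loads('[{"id": 1, "title": {"rendered": "Nytårstale"}}, {"id": 2}]')
print(build_page_url(page=2, per_page=5))
print(extract_titles(sample))
```

For a live request, the URL from `build_page_url` could be passed to `urllib.request.urlopen` and the response decoded with `json.load`; the actual response fields should be checked against what the service returns.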
Parallel corpora
- Europarl - parallel sentences between Danish and English from the European Parliament.
- ITU Faroese Pairs Dataset - Faroese-Danish parallel text. Described in The ITU Faroese Pairs Dataset (Scholia)
- JW300 - "a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average"
- OpenSubtitles2018 - Parallel corpus from movie and tv subtitles. Described in OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles.
- Tatoeba - Sentences
- WikiMatrix, parallel sentences from Wikipedias. 1620 language pairs, including Danish
Spoken language corpora
- CoRal - Danish Conversational and Read-aloud Dataset
- DanPASS - Described in DanPASS - A Danish Phonetically Annotated Spontaneous Speech corpus (Scholia)
- LANCHART - Centre for Language Change In Real Time. Various audio recordings. Whether the data is available is not immediately apparent. Described in, e.g., The data and design of the LANCHART study (Scholia).
- Common Voice - Crowdsourced multilingual annotated speech dataset. As of March 2023, 11 hours of validated speech are distributed. Sentences can be entered collaboratively at https://commonvoice.mozilla.org/sentence-collector. Common Voice is described in Common Voice: A Massively-Multilingual Speech Corpus (Scholia).
- FT Speech - Described in FT SPEECH : Danish Parliament Speech Corpus (Scholia).
- NST
  - NST-speech-22khz - A 22kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is dictation.
  - NST-speech-16kHz - A 16kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. The speech genre is read-aloud and the text is phonetically balanced. Designed for ASR training and testing.
  - NST-speech-44kHz - A 44kHz speech corpus compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service. Designed for speech synthesis.
- VoxLingua107 - 28 hours audio with unannotated Danish speech sampled from YouTube videos. Described in VoxLingua107: a Dataset for Spoken Language Recognition (Scholia)
- VoxPopuli - Speech from the European Parliament including 13,600 hours of unannotated Danish. Described in VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation (Scholia)
- Wikimedia Commons Audio files of Danish language - Recordings of readings of articles from the Danish Wikipedia, Danish words and a few Danish literary works.
Dictionaries and ontologies
- Det Centrale Ordregister - identifier for words and their inflections with 516,017 forms (COR).
- The Danish Sentiment Lexicon - Det Danske Sentimentleksikon (DDS) with 13,859 headwords assigned polarity values.
- NST-lexical-database - A pronunciation dictionary compiled by Nordisk Språkteknologi and made available by the Norwegian Library Service.
- DanNet, Danish Wordnet (v 2.2) - owl format - Danish wordnet with three-clause BSD-like license.
- Retskrivningsordbogen. The official Danish spelling dictionary digitally available under its own special license.
  - Opslagsord og ordklasser in CSV format.
  - Lexemes, word classes and inflections. Excerpt in the CSF format available. Full list presumably available upon request.
  - Lexemes, word classes, inflections, grammatical information, hyphenation and usage examples in XML. Full list presumably available upon request.
- Stavekontrolden - word list with 160,132 Danish words. Used, e.g., for spelling suggestion in LibreOffice. Licensed under GPL, LGPL, and MPL.
- The Concise Danish Dictionary/The Comprehensive Danish Dictionary/Den Store Danske Ordliste (DSDO), word list created by Skåne Sjælland Linux User Group and distributed under a GPL license
  - In Debian-based distributions the word list may be installed with `sudo aptitude install aspell-da` and extracted with `aspell -d da dump master`.
- Interactive Terminology for Europe (IATE) - European Union terminology database. October 2020 version contains over 500,000 Danish terms.
- The Danish FrameNet Lexicon - a 40,267-line resource containing 5,300 verbs and 6,490 verbal nouns
- Wikidata lexemes - structured database with metadata about lexemes, their forms and their sense. Over 1,290,000 lexemes including over 81,000 Danish lexemes in April 2024.
  - Overview over Danish lexemes in Ordia - webapp with overview of content of Wikidata lexemes based on SPARQL queries.
  - Wikidata lexemes latest lexemes dump in ttl - official dump of lexeme-only part of Wikidata.
- NST-ngrams - An n-gram frequency list compiled by Nordisk Språkteknologi from newspaper text and made available by the Norwegian Library Service. Can be compiled to an n-gram LM with SRILM.
- AFINN - Danish lexicons annotated for sentiment.
- concreteness-estimates-da - Bill D. Thompson's concreteness estimates for Danish words, as detailed in Automatic Estimation of Lexical Concreteness in 77 Languages (Scholia).
- SAM lexicon - sentiment analysis word list extended from AFINN to 4275 lines. Described in Sentiment Analysis Multitool, SAM.
- Danish Swadesh List - List of Danish words of basic concepts from The Rosetta Project.
- Sketch Engine - cloud service with wordlists, thesaurus, collocations, n-grams etc. Free for academic use in the European Union and paid service for commercial use.
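
Several of the resources above are plain word lists, for example the one that the DSDO entry describes extracting from aspell. A minimal sketch of loading such a list (one entry per line) into a set and flagging tokens that are not in it might look as follows. The sample words, the stripping of anything after a `/` (on the assumption that it is an affix flag), and the function names are illustrative assumptions and are not taken from any of the listed resources.

```python
import io


def load_word_list(lines):
    """Read one word per line into a set, lower-cased.

    Anything after a "/" is dropped, on the assumption that it is an
    affix flag rather than part of the word.
    """
    words = set()
    for line in lines:
        word = line.strip().split("/", 1)[0]
        if word:
            words.add(word.lower())
    return words


def unknown_words(text, words):
    """Return tokens of `text` that are not in the word list, in order."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return [t for t in tokens if t and t not in words]


# Stand-in for a file holding a dumped word list; contents are made up.
dump = io.StringIO("hus\nbil/AB\nrød\n\nog\n")
vocabulary = load_word_list(dump)
print(sorted(vocabulary))
print(unknown_words("Hus og bil, rød og blåå.", vocabulary))
```

With a real dump, `io.StringIO(...)` would be replaced by an opened text file, for instance `open("da-words.txt", encoding="utf-8")`, where the file name is hypothetical.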
Word sets
- Danish-Similarity-Dataset - Similarity scores for 99 Danish word pairs by Nina Schneidermann and Bolette Sandford Pedersen. Also available in danlp.
- Wordsim353-da - Danish translation by Finn Årup Nielsen of the English Wordsim353 word pair set. Also available in danlp.
- Clinical similarity dataset - 289 word pairs scored for similarity.
- Four words - 100 odd-one-out sets of 4 words or phrases.
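
Odd-one-out sets such as those in Four words are often used to probe word embeddings. One common approach, sketched below, picks the word whose mean cosine similarity to the other words in the set is lowest. This is a generic illustration, not the evaluation procedure of any listed dataset; the 2-dimensional vectors are invented, and in practice the vectors would come from an embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def odd_one_out(words, vectors):
    """Return the word with the lowest mean similarity to the other words."""
    def mean_similarity(word):
        others = [w for w in words if w != word]
        return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)
    return min(words, key=mean_similarity)


# Toy 2-dimensional vectors, invented for illustration only.
toy_vectors = {
    "æble": [0.9, 0.1],
    "pære": [0.8, 0.2],
    "banan": [0.85, 0.15],
    "bil": [0.1, 0.9],
}
print(odd_one_out(["æble", "pære", "banan", "bil"], toy_vectors))
```

The same `cosine` function could also be used to score word pairs against the human similarity ratings in the datasets above.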
Embeddings
- cc.da.300 (bin file, several GB large) - fastText-trained embedding on Danish part of Common Crawl and Danish Wikipedia. Read more about the method in Learning Word Vectors for 157 Languages (Scholia).
- wiki.da (bin+text file) - fastText-trained embedding on Danish Wikipedia. Read more about the method in Enriching Word Vectors with Subword Information (Scholia).
- Byte-Pair Encoding embedding - Gensim-based subword embedding. A large number of Danish embeddings are available. They differ in the size of the vocabulary (from 1000 to 200000) and subspace dimensions (from 25 to 300).
- NLPL word embeddings repository - NLPL word embeddings repository by Language Technology Group at the University of Oslo. Two Danish embedding models as of November 2020.
  - Danish NLPL word embedding - 100-dimensional word2vec skipgram model trained by Andrey Kutuzov based on the Danish CoNLL17 corpus.
- Danish DSL and Reddit word2vec word embeddings - 300-dimensional CBOW word2vec word embedding by Emil Middelboe and Anders Lillie trained on Danish DSL corpus and Reddit.
Neural text models
- A-ttack - Ælæctra-based model for detection of "textual attacks" developed by Analyse & Tal. Related to the Ha-te model.
- Danish BERT - Certainly's (Botxo/Møllerhøj) weights for a BERT model trained on a large Danish corpus.
- Danish ELECTRA - Philip Tamimi-Sarnikowski's Danish ELECTRA model. Available in the transformers library.
- daT5-summariser - Danish abstractive summarisation of news articles based on mT5-base.
- ConvBERT - Philip Tamimi-Sarnikowski's model
- Danish ELMo on OSCAR - (Link does not work as of December 2020)
- Ha-te - Hate speech detection based on Ælæctra developed by Analyse & Tal. Related to the A-ttack model.
- mfaq - Multilingual FAQ retrieval model. Described in MFAQ: a Multilingual FAQ Dataset (Scholia)
- Ælæctra - Malte Højmark-Bertelsen's Danish Gigaword-trained Electra-based model
- Multilingual sentence transformers - Pre-trained multilingual sentence transformers.
- wiki40b-lm-da - language model trained on Danish from Wiki40B dataset
- WikiBERT - BERT model for many languages, including Danish. Described in WikiBERT models: deep transfer learning for many languages (Scholia)
Neural speech models
- Hugging Face - List of models for Danish automatic speech recognition.
- Alvenir Wav2vec2 - Pretrained Danish neural model.
- Whisper - Multilingual neural model from OpenAI.
- xls-r-300m-danish-nst-cv9 - Pretrained Danish neural model.
Tools
Lemmatization
- Lemmy - Lemmatizer for Danish in Python.
- cstlemma - lemmatiser.
- spaCy - Python-based package with lemmatization.
Punctuation
- punctfix - "Adds punctuation and capitalization for a given text."
Named entity recognition
- ScandiNER - Scandinavian named entity recognition, achieving state-of-the-art performance in Danish, Norwegian (both Bokmål and Nynorsk), Swedish, Icelandic and Faroese.
- DaLUKE - Danish named entity recognition based on LUKE. Described in DaLUKE: The Entity-aware, Danish Language Model.
- spaCy - Python-based named entity extraction
- daner - Named entity extraction from ITU NLP. Described in DKIE: Open Source Information Extraction for Danish (Scholia).
- flair+danlp ner-tagger - Flair NER tagger trained by the Alexandra Institute.
- Polyglot named entity extraction
Entity linking
- Babelfy - Web app and service for linking words and entities.
- DBpedia Spotlight - DBpedia-based entity linker. Described in Improving Efficiency and Accuracy in Multilingual Entity Extraction (Scholia)
Sentiment analysis
- afinn - Python package with AFINN Danish lexicon annotated for sentiment, also installable with `pip install afinn`.
- Hisia - Python package with pre-trained machine-learning based Danish sentiment analysis by Prayson Wilfred Daniel.
- senda - Python package with transformer-based sentiment analysis from Ekstra Bladet Analyse, with state-of-the-art performance on one dataset as of 2021.
- Sentida - R package with Danish sentiment lexicon and handling of, e.g., negation. Detailed in SENTIDA: A New Tool for Sentiment Analysis in Danish (Scholia).
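
To illustrate the lexicon-based approach behind several of the tools above, here is a minimal sketch that sums per-word valences and flips the sign of a word directly following a negation. It is not the implementation of afinn, Sentida, or any other listed package. The lexicon entries, their scores, the negation list, and the function name are all made up for the example.

```python
import re

# Tiny made-up lexicon in the style of AFINN (integer valence per word).
# The words and scores are illustrative, not taken from the real lexicon.
TOY_LEXICON = {"god": 3, "dejlig": 3, "dårlig": -3, "forfærdelig": -3}
NEGATIONS = {"ikke", "aldrig"}


def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum word valences; flip the sign of a word directly after a negation."""
    tokens = re.findall(r"\w+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        valence = lexicon.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            valence = -valence
        score += valence
    return score


print(sentiment_score("Maden var god og dejlig"))
print(sentiment_score("Filmen var ikke god"))
print(sentiment_score("Det var en forfærdelig dag"))
```

A real lexicon such as AFINN or DDS could be loaded into the dictionary in place of the toy entries; the one-token negation window is a deliberately crude simplification.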
Automatic Speech Recognition
- danspeech - DeepSpeech2-based Danish speech recognition in Python
- kaldi-sprakbanken - A recipe for training a state-of-the-art (as of 2017) speech recogniser for Danish based on the 16kHz NST database.
Speech Synthesis (text-to-speech)
- espeak - An open-source speech synthesis program for ~56 languages including Danish. eSpeak can also be used as a grapheme-to-phoneme converter and was used to create the Danish Kaldi recipe.
- ResponsiveVoice - Commercial Web-based (Javascript-based) text-to-speech synthesis for a number of languages, including Danish. The commercial service is currently free for limited and non-commercial use.
- Google Cloud Text-to-Speech - Commercial Web-based text-to-speech synthesis for a number of languages, including Danish.
- Amazon Polly - Commercial Web-based text-to-speech synthesis for a number of languages, including Danish. Part of Amazon's commercial AWS services. Female and male voices are available as examples. Limited unregistered free service available at TTSMP3.
Fundamental processing
- DaNLP - "a repository for Natural Language Processing resources for the Danish Language."
- dapipe - Danish UD-pipe: tokenisation, lemmatisation, PoS tagging, morphology, dependencies.
- UDPipe - Non-language specific version of dapipe. Newer version of the Danish-DDT model than that which is offered by dapipe is available at https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998
- DKIE - GATE pipeline including wrapped Danish models for Stanford CoreNLP.
- StanfordNLP - Python software package for dependency parsing, including tokenization, lemmatization and part-of-speech tagging. A pre-trained model for Danish is available.
- bornholmsk - Datasets and embeddings for the Bornholmsk dialect.
- spaCy - Python-based natural language processing package
- dacy - Danish spaCy pipeline.
Competitions
- ELEXIS Monolingual Word Sense Alignment Task - Predicting the relationship between two senses in each of several languages, including Danish.
- OffensEval 2020 - Danish - Offensive Language Identification in Social Media competition. Described in Offensive Language and Hate Speech Detection for Danish (Scholia)
Benchmarks
- Danoliterate - Overview of the performance of language models on a range of individual benchmarks.
- ScandEval - Overview of the performance of language models on a range of individual benchmarks, Danish as well as other Germanic languages.
Resources about resources
- Danish resources - Finn Årup Nielsen's PDF with pointers to Danish resources.
- Scholia's topic aspect for Danish, works (mostly scientific articles) about "Danish" as listed in Wikidata.
- DaNLP - Alexandra Institute's list of Danish resources
- Language Technology Resources for Danish, list from Det Danske Sprog- og Litteraturselskab
- European Language Resources Association (ELRA) list for Danish, list of various annotated corpora available for purchase with both commercial and non-commercial licenses.
- sprogteknologi.dk - List of Danish language resources. Compiled by the Agency for Digitisation.