A Survey of Corpora for Germanic Low-Resource Languages and Dialects

You can read more about this corpus collection here. If you find this overview useful for your research, please cite:

@inproceedings{blaschke-etal-2023-survey,
  title = {A survey of corpora for {G}ermanic low-resource languages and dialects},
  author = {Blaschke, Verena and Sch{\"u}tze, Hinrich and Plank, Barbara},
  year = {2023},
  month = may,
  booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
  address = {T{\'o}rshavn, Faroe Islands},
  publisher = {University of Tartu Library},
  url = {https://aclanthology.org/2023.nodalida-1.41},
  pages = {392--414},
}

Inclusion criteria:

We focus on manual or manually corrected annotations rather than fully automatically annotated data. For corpora with an “uncurated” note, we strongly recommend manually checking the data quality, as it might be low or mixed. We've excluded corpora where we were able to determine large-scale data quality issues. Note that the webcrawl-based corpora likely overlap with the contents of some of the other corpora, and for languages with especially few resources, the overlap with Wikipedia tends to be extremely high.
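The overlap warning above can be checked before combining corpora. The following is a minimal sketch, not part of the survey itself: it assumes each corpus is available as a list of sentences, and the names `normalize`, `overlap_ratio`, and the toy data are hypothetical. It only detects exact matches after light normalization, so near-duplicates would need a fuzzier comparison.

```python
import re
import unicodedata


def normalize(sentence: str) -> str:
    """Apply Unicode NFKC, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", sentence).lower()
    return re.sub(r"\s+", " ", text).strip()


def overlap_ratio(corpus, reference):
    """Fraction of non-empty sentences in `corpus` that also occur in `reference`."""
    reference_set = {normalize(s) for s in reference if s.strip()}
    sentences = [normalize(s) for s in corpus if s.strip()]
    if not sentences:
        return 0.0
    shared = sum(1 for s in sentences if s in reference_set)
    return shared / len(sentences)


# Hypothetical toy data; in practice, read one sentence per line from each corpus.
web_crawl = ["This is a  sentence.", "Another line.", ""]
wikipedia = ["this is a sentence.", "Something else."]
ratio = overlap_ratio(web_crawl, wikipedia)
print(f"{ratio:.0%} of the web-crawl sentences also occur in the reference")
```

A high ratio suggests that the two sources should not be treated as independent training or evaluation data.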

The license names link to where the license is mentioned on the corpus website, unless the license is mentioned on the site linked in the first column, in the article accompanying the dataset, or in the downloaded corpus files. Always refer to the original corpus websites/papers to double-check the license information; we cannot guarantee that the information here is up to date.

Did we forget a corpus for a Germanic low-resource language or dialect that fits these inclusion criteria? Please reach out to us via a GitHub issue or an email to verena DOT blaschke ÄT cis.lmu.de!

General

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Sound Comparisons: Germanic (Paschen ea 2019) | word-based, 120 locations/doculects from all Germanic sub-branches | 106 words × 120 locations | audio, phono (IPA), English ortho, ortho of relevant std languages | CC BY-NC-ND 4.0 |

Faroese · fao · faro1244

CorpusNotesSizeRepresentationLicense
UD Faroese OFT (Tyers ea 2018)POS (UPOS, Giellatekno-FAO), dependencies (UD), morpho (UD), lemmas. Contains material from Wikipedia1.2k sentencesFaroese orthoGNU GPL 2.0, GNU LGPL 2.1, Mozilla Public License 1.1
FarPaHC (Ingason ea 2012, Rögnvaldsson ea 2012)POS (mod. Penn-historical), phrase structure (mod. Penn-historical)53k tokensFaroese orthoCC BY 4.0
UD Faroese FarPaHC (Ingason ea 2012, Rögnvaldsson ea 2012)POS (UPOS), dependencies (UD), morpho (UD)40k tokensFaroese orthoCC BY-SA 4.0
FoNE (Snæbjarnarson ea 2023)named entities (8 classes). The text overlaps with the BLARK background corpus (Sosialurin subcorpus)118k tokensFaroese orthoCC BY 4.0
Fo-STS (Snæbjarnarson ea 2023)semantic text similarity (sentence-level), translated subset of the English STS corpus (Cer ea 2017)729 sentence pairsFaroese orthoCC BY 4.0
BLARK 1.0 (background corpus) (Simonsen ea 2022)25M tokensFaroese orthoCC BY 4.0
Sprotin translationsEnglish–Faroese parallel sentences126k sentence pairsFaroese orthoMIT license
Føroyskur talumálsbanki (Jacobsen 2022)599.9k tokensFaroese ortho(, audio?)CLARIN RES-PLAN-BY-PRIV-NORED
Faroese text collection (FTS)in BLARK 1.0 background corpus1.1M tokensFaroese orthoCC BY 4.0
Korp (Giellatekno)in BLARK 1.0 background corpus (download via BLARK), contains Wikipedia articles?Faroese orthoCC BY 4.0
BLARK 1.0 (audio) (Simonsen ea 2022)locations (Suðuroy, Sandoy, Suðurstreymoy, Norðurstreymoy/​Eysturoy, Vágar, Norðuroyggjar)100 hrsaudio, Faroese ortho, some phonoCC BY 4.0
Faroese Danish Corpus Hamburg (FADAC Hamburg) (subset) (Debess 2019)locations (Tórshavn, Vágar, Suðuroy, Eysturoy/​Norðuroyggjar)31 hrsaudio, Faroese orthoHZSK-RES
FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022)parallel with ~200 languages2k sentencesFaroese orthoCC BY-SA 4.0
Tatoeba (fao subset)translations into other languages417 sentencesFaroese orthoCC BY 2.0 FR
ITU Faroese/Danish (Derczynski ea 2022)Danish translations; overlaps with (Danish) Tatoeba4k sentencesCC BY 4.0
Ubuntu via OPUS (Tiedemann 2012)translations into other languages20.2k tokensFaroese ortho?
QED via OPUS (Abdelali ea 2014, Tiedemann 2012)translations into other languages6.4k tokensFaroese ortho?
UDHR-LID (subset) (Kargaran ea 2023, Unicode)57 sentencesCC0 1.0
OpenLID (subset) (Burchell ea 2023)combines other corpora40k linesdepend on source datasets
FAO News 2020 (Goldhahn ea 2012)uncurated?33.8k sentences?
FAO Newscrawl 2011 (Goldhahn ea 2012)uncurated?8.8k sentences?
Faroese Mixed Corpus (Goldhahn ea 2012)uncurated?300k sentences?
Faroese Web Corpus (Goldhahn ea 2012)uncurated?1M sentences?
FC3 (Snæbjarnarson ea 2023)Faroese subset of CommonCrawl (uncurated)98k paragraphs / 9M tokensFaroese orthounspecified CC license
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012)uncurated102 MBFaroese orthoCC BY-SA 3.0
MADLAD-400 (subset) (Kudugunta ea 2023)uncurated, subset of CommonCrawl1.8M sentencesCC-BY-4.0
Glot500-c (subset) (Imani ea 2023)partially uncurated, corpus overlap documented in data2.3M sentencesApache 2.0 + licenses of source datasets
Wikipedia (fo subset)uncurated14k articlesFaroese orthotext: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

For additional resources/tools, see also the resource list of the Faroese Centre for Language Technology.

Norwegian · nor · norw1258

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| LIA Treebank (+transcriptions) (Øvrelid ea 2018) | POS (mod. NDT), dependencies (mod. NDT), morpho (mod. NDT), lemmas, locations (17 places in Norway). Annotated subset of LIA Norsk | 7.5k speech segments / 78k tokens | Nynorsk ortho, phono | CC BY-NC-SA 4.0 |
| UD Norwegian Nynorsk LIA (+transcriptions) (Øvrelid ea 2018) | POS (UPOS), dependencies (UD), morpho (UD), lemmas, locations (10 places in Norway). Annotated subset of LIA Norsk | 5.3k speech segments / 55k tokens | Nynorsk ortho, phono; aligned Nynorsk+phono here (Blaschke ea 2023) | treebank: CC BY-SA 4.0, transcriptions: CC BY-NC-SA 4.0 |
| NDC Treebank (+transcriptions; website) (Kåsen ea 2022, Johannessen ea 2009) | POS (mod. NDT), dependencies (mod. NDT), morpho (mod. NDT), lemmas, locations (17 places) | 4.6k speech segments / 66k tokens | Bokmål ortho, phono | treebank and transcriptions: CC BY-NC-SA 4.0 |
| NoMusic (Mæhlum & Scherrer 2024), subset of xSID | slot filling, intent detection, translations into Bokmål and 16 other languages; location (8 dialects) | 8×800 sentences | ad-hoc pronunciation spelling | CC BY-SA 4.0 |
| NorDial (subset) (Barnes ea 2021) |  | 348 tweets | ad-hoc spelling | CC0 1.0 |
| NorDial (POS-annotated subset) (Mæhlum ea 2022 – contact authors) | POS (UPOS) | 35+ tweets | ad-hoc spelling |  |
| Nordic Dialect Corpus (subset) (Johannessen ea 2009) | locations (>100 places) | 1.9M tokens | Bokmål ortho, phono; aligned Bokmål+phono here (Scherrer 2023) | CC BY-NC-SA 4.0 |
| LIA Norsk (Øvrelid ea 2018) | locations (222 places) | 3.5M tokens | Nynorsk ortho, phono | CC BY-NC-SA 4.0 |
| LIA Norsk (downloadable audio subset) (Øvrelid ea 2018) | locations (178 places) | ? | audio, Nynorsk ortho, phono | CC BY-NC-SA 4.0 |
| The spoken language investigation in Oslo (TAUS) | locations (East vs. West Oslo) | 387k tokens | Bokmål ortho, phono | CC BY-NC-SA 4.0 |
| American Nordic Speech Corpus (CANS) (subset) (Johannessen ea 2015) | locations (57 places in USA/Canada) | 773k tokens | Bokmål ortho, phono | CC BY-NC-SA 4.0 |
| Speech Database for Norwegian (NB Tale) | locations (24 areas) | 365 × 2 mins (spontaneous speech), 7.6k sentences (reading) | audio, Bokmål ortho, mod. X-SAMPA | CC0 |
| Norwegian Parliamentary Speech Corpus (NPSC) | locations (5 dialect regions) | 140 hrs / 65k sentences / 1.2M tokens | audio, Bokmål/Nynorsk ortho | CC0 |

Jutish · juti1236

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Danish Gigaword Corpus (synne subset) (Derczynski ea 2021) | South Jutish | ca. 20k tokens |  | CC BY 4.0 |

East Danish · scan1238

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Danish Gigaword Corpus (botxt subset) (Derczynski ea 2021, Kjeldsen 2019) | Bornholmsk | ca. 400k tokens |  | CC BY 4.0 |

Elfdalian/Övdalian · ovd · elfd1234

Glottolog 4.7 categorizes Elfdalian as a dialect of Dalecarlian/dale1238.

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Nordic Dialect Corpus (subset) (Johannessen ea 2009) | locations (7 places) | 15.7k tokens | Elfdalian ortho (Råðdjärum's orthography), Swedish ortho | CC BY-NC-SA 4.0 |

Swedish · swe · swed1254

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Parallel dialectal-standard Swedish data (Hämäläinen ea 2020, Ivars & Södergård 2007) | Finland Swedish (with locations) | 86.5k tokens | transcription, Swedish ortho | CC BY-NC-SA 4.0 |
| American Nordic Speech Corpus (CANS) (subset) (Johannessen ea 2015) | locations (7 places in the US) | 46k tokens | Swedish ortho, phono | CC BY-NC-SA 4.0 |

Anglo-Frisian

Scots · sco · scot1243

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| POS-tagged Scots corpus (Lameris & Stymne 2021) | POS (UPOS); overlaps with the SCOTS corpus | 1k tokens | partially ad hoc (SCOTS), partially with a standardized orthography (Mak Forrit) |  |
| Scottish Corpus of Texts & Speech (SCOTS) (subset) (Anderson ea 2007) | partially annotated in the POS-tagged Scots corpus | unknown (4.6M tokens total) | mix of ad-hoc spelling and English ortho | custom |
| UDHR-LID (subset) (Kargaran ea 2023, Unicode) |  | 58 sentences |  | CC0 1.0 |
| Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012) | uncurated | 35 MB | ? | CC BY-SA 3.0 |
| Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 410k sentences |  | Apache 2.0 + licenses of source datasets |
| Wikipedia (sco subset) | uncurated, see reports here and here | 39k articles | Scots spelling recommendations | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

English · eng · stan1293

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| TwitterAAE-UD (Blodgett ea 2016) | dependencies (UD); AAVE | 250 tweets | ad-hoc spelling |  |
| Diachronic Electronic Corpus of Tyneside English (DECTE) (Corrigan ea 2012) | locations (19 places in NE England). Contains the Newcastle Electronic Corpus of Tyneside English (NECTE) and NECTE2, and NECTE in turn contains the Tyneside Linguistic Survey (TLS) and the Phonological Variation and Change in Contemporary Spoken English (PVC) corpus. | 72 hrs / 804k tokens | audio, English ortho, partially: phono | custom |
| Intonational Variation in English (IViE) (Nolan & Post 2013) | locations (British Isles: Belfast, Dublin, Newcastle, Leeds, Bradford, Liverpool, Cambridge, Cardiff, London) | 36 hrs | audio, English ortho | custom |
| Crowdsourced high-quality UK and Ireland English Dialect speech data set (Demirsahin ea 2020) | locations (British Isles: Ireland, Midlands, Northern England, Scotland, Southern England, Wales) | 31 hrs | audio, English ortho | CC BY-SA 4.0 |
| Helsinki Corpus of British English Dialects | locations (UK: Cambridgeshire, Devon, Essex/Lancashire, Isle of Ely, Somerset, Suffolk) | 1M tokens | audio, English ortho |  |
| Nationwide Speech Project (NSP) (Clopper & Pisoni 2006) | locations (USA: West, Midland, North, South, New England, Mid-Atlantic) | 60 × 1 hr | audio, partially: English ortho |  |
| Corpus of Regional African American Language (CORAAL) (Kendall & Farrington 2021) | 6 locations, AAVE | 135.6 hrs / 1.5M tokens | audio, English ortho | CC BY-NC-SA 4.0 |
| Sound Comparisons: Englishes (Maguire ea 2019) | word-based, 51 locations | 110 words × 51 locations | audio, phono (IPA), English ortho | CC BY-NC-ND 4.0 |

See also: SPADE: SPeech Across Dialects of English (Stuart-Smith ea 2017–2020) and their corpus collection.

West(ern) Frisian · fry · west2354

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| UD Frisian/Dutch Fame (Braggaar & van der Goot 2021, Yılmaz ea 2016) | POS (UPOS), dependencies (UD), code-switching; code-mixed Frisian and Dutch. Annotated subset of FAME. | 400 sentences | Frisian(/Dutch) ortho | CC BY-SA 4.0 |
| UD Frisian Frysk (Heeringa ea 2021) | under development!; POS (UPOS), dependencies (UD), morpho (UD), lemmas | 2.9k sentences | Frisian ortho | CC BY-SA 3.0 |
| Common Voice (subset) (Ardila ea 2020) |  | 211 hrs | audio, Frisian ortho | CC0 |
| Frisian AudioMining Enterprise (FAME!) (Yılmaz ea 2016) | partially: locations | 18.5 hrs | audio, Frisian ortho |  |
| Recordings of Dutch-Frisian council meetings (Bentum ea 2022) |  | 26 hrs / 281k tokens | audio, Frisian ortho |  |
| Corpus Spoken Frisian / Korpus Sprutsen Frysk (KSF) |  | 200 hrs (65 hrs thereof transcribed) | audio, partially: Frisian ortho |  |
| Boarnsterhim Corpus (BHC) (subset) (Sloos ea 2018) | under revision! | unknown (250 hrs total, with Dutch) | audio |  |
| Tatoeba (fry subset) | translations into other languages | 641 sentences | Frisian ortho | CC BY 2.0 FR |
| Ubuntu via OPUS (Tiedemann 2012) | translations into other languages | 22.4k tokens | Frisian ortho |  |
| KDE4 via OPUS (Tiedemann 2012) | translations into other languages | ca. 300k tokens | Frisian ortho |  |
| GNOME via OPUS (Tiedemann 2012) | translations into other languages | 55.7k tokens | Frisian ortho |  |
| Mozilla-I10n | translations into other languages | ca. 400k tokens | Frisian ortho | Mozilla Public License 2.0 |
| UDHR-LID (subset) (Kargaran ea 2023, Unicode) |  | 58 sentences |  | CC0 1.0 |
| FRY News 2020 (Goldhahn ea 2012) | uncurated? | 107.5k sentences | ? (written) | ? |
| Western Frisian Newscrawl (Goldhahn ea 2012) | uncurated? | 100k sentences |  |  |
| Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012) | uncurated | 72 MB | Frisian ortho | CC BY-SA 3.0 |
| CC-100 (subset) (Wenzek ea 2020) | uncurated, subset of CommonCrawl | 174 MB | Frisian ortho |  |
| OSCAR (subset) (Abadji ea 2022) | uncurated, subset of CommonCrawl | 9.9M tokens / 70.4 MB | Frisian ortho | Metadata/annotations: CC0 1.0, Common Crawl: custom |
| CulturaX (subset) (Nguyen ea 2023) | uncurated, subset of mc4 and OSCAR | 223k sentences |  | see mc4 & OSCAR |
| MADLAD-400 (subset) (Kudugunta ea 2023) | uncurated, subset of CommonCrawl | 3.7M sentences |  | CC-BY-4.0 |
| Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 927k sentences |  | Apache 2.0 + licenses of source datasets |
| Wikipedia (fy subset) | uncurated | 50k articles | Frisian ortho | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

North(ern) Frisian · frr · nort2626

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Tatoeba (frr subset) | translations into other languages | 2.9k sentences | ? | CC BY 2.0 FR |
| Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 55.3k sentences |  | Apache 2.0 + licenses of source datasets |
| Wikipedia (frr subset) | uncurated, partially tagged with dialect information | 17k articles | different dialect-based (ad-hoc?) orthographies | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

Saterland Frisian/Saterfrisian · stq · sate1242

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Tatoeba (stq subset) | translations into other languages | 96 sentences | ? | CC BY 2.0 FR |
| MADLAD-400 (subset) (Kudugunta ea 2023) | uncurated, subset of CommonCrawl | 27.7k sentences |  | CC-BY-4.0 |
| Wikipedia (stq subset) | uncurated | 4k articles | revised Kramer orthography for Saterfrisian (unclear if example, recommendation or rule for this wiki) | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

Low German

Low Saxon/Low German · nds · lowg1239

(The relationship between the ISO 639-3 code and the Glottocode is complicated.)

CorpusNotesSizeRepresentationLicense
UD Low Saxon LSDC (Siewert & Rueter 2024)POS (UPOS), dependencies (UD), morphological features (UD), glosses (Middle Low Saxon), lemmas, locations (18 dialect areas, see also LSDC note); overlaps with LSDC1000 sentencesad-hoc spelling, Nysassiske SkryvwyseCC BY-SA 4.0
TaPaCo (subset) (Scherrer 2020)paraphrases; annotated subset of Tatoeba1107 sentences?CC BY 2.0
Low Saxon Dialect Classification (LSDC) (Siewert ea 2020)locations (15 dialect areas); overlaps with UD Low Saxon LSDC; also contains FRS, WEP, TWD, ACT content88.9k sentences (incl. FRS etc.)ad-hoc spellingCC BY-NC-SA 4.0
Sprachvariation in Norddeutschland (SiN, Hamburg collection) (Schröder 2011, Elmentaler ea 2015) (Low German subset)varieties of Low Saxon (Nordhannoversch, Emsländisch Oldenburgisch), East Frisian Low Saxon and (Northern) Germanunknown (300 hrs total)audioHZSK-RES
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD))locations80 min / 10.7k tokensaudio, German orthocustom terms
Tatoeba (nds subset)translations into other languages18.1k sentences?CC BY 2.0 FR
Ubuntu via OPUS (Tiedemann 2012)translations into other languages35.3k tokens?
KDE4 via OPUS (Tiedemann 2012)translations into other languages1.1M tokens?
GNOME via OPUS (Tiedemann 2012)translations into other languagesca. 700k tokens?
UDHR-LID (subset) (Kargaran ea 2023, Unicode)58 sentencesCC0 1.0
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012)uncurated24 MB?CC BY-SA 3.0
OSCAR (subset) (Abadji ea 2022)uncurated, subset of CommonCrawl1.6M tokens / 10.7 MB?Metadata/annotations: CC0 1.0, Common Crawl: custom
CulturaX (subset) (Nguyen ea 2023)uncurated, subset of mc4 and OSCAR15.1k sentencessee mc4 & OSCAR
Glot500-c (subset) (Imani ea 2023)partially uncurated, corpus overlap documented in data934k sentencesApache 2.0 + licenses of source datasets
Wikipedia (nds subset)uncurated, partially tagged with dialect information84k articlesSass'sche Schrievwiestext: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0
Wikipedia (nds-nl subset)uncurated, partially tagged with dialect information8k articlesNysassiske Skryvwyse (preferred) and Algemene Nedersaksische Schriefwieze (older articles)text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

East Frisian Low Saxon · frs · east2288

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Sprachvariation in Norddeutschland (SiN, Hamburg collection) (East Frisian Low Saxon subset) | varieties of Low Saxon, East Frisian Low Saxon and (Northern) German | unknown (300 hrs total) | audio | HZSK-RES |
| Low Saxon Dialect Classification (LSDC) (OFR subset) (Siewert ea 2020) | minor overlaps with UD Low Saxon LSDC | 240 sentences | ad-hoc spelling | CC BY-NC-SA 4.0 |

Gronings · gos · gron1242

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| TaPaCo (subset) (Scherrer 2020) | paraphrases; annotated subset of Tatoeba | 122 sentences | ? | CC BY 2.0 |
| Automatic speech recognition dataset for Gronings (Bartelds ea 2023) |  | 4 hours | audio, written | CC BY 4.0 |
| Dataset: Gronings (Bartelds & San 2021, San ea 2021) |  | 23 mins | audio, written | CC BY 4.0 |
| Tatoeba (gos subset) | translations into other languages | 5.7k sentences | ? | CC BY 2.0 FR |

Twents · twd · twen1241

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Low Saxon Dialect Classification (LSDC) (TWE subset) (Siewert ea 2020) | minor overlaps with UD Low Saxon LSDC | 668 sentences | ad-hoc spelling | CC BY-NC-SA 4.0 |

Achterhoeks · act · acht1238

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Low Saxon Dialect Classification (LSDC) (ACH subset) (Siewert ea 2020) | minor overlaps with UD Low Saxon LSDC | 988 sentences | ad-hoc spelling | CC BY-NC-SA 4.0 |

Westphalic/Westphalish/Westphalian · wep · west2356

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) |  | 15 min / 2.4k tokens | audio, German ortho | custom terms |
| Low Saxon Dialect Classification (LSDC) (OWL subset) (Siewert ea 2020) | minor overlaps with UD Low Saxon LSDC | 15k sentences | ad-hoc spelling | CC BY-NC-SA 4.0 |

Macro-Dutch

Dutch · nld · dutc1256

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Corpus of Southern Dutch Dialects (GCND) (Breitbarth ea 2018) | under construction!; might also include West Flemish, Zeelandic, and/or Limburgs |  | audio, transcriptions |  |
| SAND (Barbiers ea 2006) | locations | ? | phono | custom |
| MAND/FAND/GTRP (Goeman ea) (contact institute) | locations |  | phono (K-IPA) |  |

West(ern) Flemish · vls · vlaa1240

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Stemmen uit het verleden (annotated subset) (Lybaert ea 2019, Van Keymeulen ea 2019) | V2 variation, locations (25 places) | 1.4k sentences | phono | CC BY-NC 4.0 |
| Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 102k sentences |  | Apache 2.0 + licenses of source datasets |
| VLS Community 2017 (Goldhahn ea 2012) | possibly uncurated | 36.4k sentences | ? (written) | ? |
| Wikipedia (vls subset) | uncurated, partially tagged with dialect information | 8k articles | Standoardvlams (orthography developed by vls.wikipedia.org editors) | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

Zeelandic/Zeeuws · zea · zeeu1238

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 34.4k sentences |  | Apache 2.0 + licenses of source datasets |
| Wikipedia (zea subset) | uncurated | 6k articles | ? | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

Central German

Upper Saxon · sxu · uppe1400

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| SXUCorpus (Herms ea 2016) (contact authors) | 8 locations | 500 min / 70k tokens | audio, German ortho |  |
| Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) |  | 12 min / 1.7k tokens | audio, German ortho | custom terms |

Moselle Franconian · luxe1241

Luxembourgish · ltz · luxe1243

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| UD Luxembourgish LuxBank (Plum ea 2024) | POS tags (UPOS), dependencies (UD) | 20 sentences | Luxembourgish ortho |  |
| Banking Client Support (BCS) Dataset (Lothritz ea 2021) | intent detection, slot filling, parallel with DEU, FRA, ENG | 1k sentences | Luxembourgish ortho | ? |
| Luxembourgish translation of Winograd Natural Language Inference (L-WNLI) (Lothritz ea 2022) | NLI, parallel with other languages (Levesque ea 2012) | 767 samples | Luxembourgish ortho | ? |
| Luxembourgish POS and NER (Lothritz ea 2022) (contact authors) | POS tags (15 tags), NER (PER, ORG, LOC, GPE, MISC) | 5.5k sentences | Luxembourgish ortho | ? |
| Luxembourgish news classification (Lothritz ea 2022) (contact authors) | 8 classes | 10k articles | Luxembourgish ortho | ? |
| SA1 (Lothritz ea 2023; contact authors) | sentiment | 1.8k sentences |  |  |
| Luxembourgish sentence negation (Lothritz ea 2023) | position of negation particle; overlaps with Leipzig corpora (Newscrawl and/or Web and/or Wikipedia) | 46k sentences |  |  |
| LuxId (Lavergne ea 2014) | code-switching (LTZ, DEU, FRA) | 924 sentences (most with LTZ content) | Luxembourgish(/German/French) ortho | CC BY-SA 3.0 |
| FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022) | parallel with ~200 languages | 2k sentences | Luxembourgish ortho | CC BY-SA 4.0 |
| FLEURS (subset) (Conneau ea 2023) | parallel with ~100 languages; audio version of FLORES (Goyal ea 2022) | 1-3 recordings each of 1.9k sentences (3.8k recordings total) | audio, Luxembourgish ortho | CC BY 4.0 |
| Tatoeba (ltz subset) | translations into other languages | 884 sentences | Luxembourgish ortho | CC BY 2.0 FR |
| Ubuntu via OPUS (Tiedemann 2012) | translations into other languages | 17k tokens | Luxembourgish ortho | ? |
| KDE4 via OPUS (Tiedemann 2012) | translations into other languages | 28.8k tokens | Luxembourgish ortho | ? |
| Mozilla-I10n | translations into other languages | 6.9k tokens | Luxembourgish ortho | Mozilla Public License 2.0 |
| QED via OPUS (Abdelali ea 2014, Tiedemann 2012) | translations into other languages | 19.2k tokens | Luxembourgish ortho | ? |
| TED2020 via OPUS (Reimers & Gurevych, Tiedemann 2012) | translations into other languages | 1.7k tokens | Luxembourgish ortho | CC BY-NC-ND 4.0 |
| UDHR-LID (subset) (Kargaran ea 2023, Unicode) |  | 59 sentences |  | CC0 1.0 |
| OpenLID (subset) (Burchell ea 2023) | combines other corpora | 37.7k lines |  | depend on source datasets |
| Luxembourgish Newscrawl (Goldhahn ea 2012) | uncurated? | 300k sentences |  |  |
| Luxembourgish Web Corpus (Goldhahn ea 2012) | uncurated? | 1M sentences |  |  |
| Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012) | uncurated | 81 MB | ? | CC BY-SA 3.0 |
| OSCAR (subset) (Abadji ea 2022) | uncurated, subset of CommonCrawl | 2.5M tokens / 18.4 MB | ? | Metadata/annotations: CC0 1.0, Common Crawl: custom |
| CulturaX (subset) (Nguyen ea 2023) | uncurated, subset of mc4 and OSCAR | 166k sentences |  | see mc4 & OSCAR |
| MADLAD-400 (subset) (Kudugunta ea 2023) | uncurated, subset of CommonCrawl | 3.4M sentences |  | CC-BY-4.0 |
| Wikipedia (lb subset) | uncurated | 61k articles | Luxembourgish ortho | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

For other kinds of resources/tools, see also questoph/NLPforLTZ.

Transylvanian Saxon · tran1294

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Audioatlas siebenbürgisch-sächsischer Dialekte (ASD) (University of Munich) |  | 360 hrs / 450k tokens | audio, German ortho, partially phono | CLARIN RES |

Colognian · ksh · kols1241

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Tatoeba (ksh subset) | translations into other languages | 82 sentences | ? | CC BY 2.0 FR |
| Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 33.5k sentences |  | Apache 2.0 + licenses of source datasets |
| Wikipedia (ksh subset) | uncurated, Colognian and other varieties of Ripuarian, partially tagged with dialect and/or orthography information | 3k articles | ad-hoc spelling, some articles according to various Ripuarian orthographies | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

Limburgish/Limburgan · lim · limb1263

CorpusNotesSizeRepresentationLicense
FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022)parallel with ~200 languages; Maastrichtian Limburgs2k sentencesCC BY-SA 4.0
Ubuntu via OPUS (Tiedemann 2012)translations into other languages18.4k tokens?
GNOME via OPUS (Tiedemann 2012)translations into other languagesca. 400k tokens?
OpenLID (subset) (Burchell ea 2023)combines other corpora48k linesdepend on source datasets
LIM Community 2017 (Goldhahn ea 2012)possibly uncurated84.4k sentences? (written)?
LIM Web 2010 (Netherlands) (Goldhahn ea 2012)uncurated?35.4k sentences? (written)?
CC-100 (subset) (Wenzek ea 2020)uncurated, subset of CommonCrawl8.3 MB
CulturaX (subset) (Nguyen ea 2023)uncurated, subset of mc4 and OSCAR206 sentencessee mc4 & OSCAR
Glot500-c (subset) (Imani ea 2023)partially uncurated, corpus overlap documented in data652k sentencesApache 2.0 + licenses of source datasets
Wikipedia (li subset)uncurated, partially tagged with dialect and/or orthography information14k articlesVeldeke-sjpelling, Algemein Gesjreve Limburgstext: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0

Rhine/Rhenish Franconian · rhin1244

Includes Palatin(at)e German · pfl · pala1330.

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Thorsten-Voice Dataset 2023.09 Hessisch (Müller & Kreutz 2024) | Hessian | 2 hrs / 2.1k sentences | audio, German ortho | CC0 |
| Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) | Hessian | 8 min / 1.4k tokens | audio, German ortho | custom terms |
| Wikipedia (pfl subset) | uncurated, partially tagged with dialect information; contains articles in Palatine German, Lorraine Franconian, Hessian | 3k articles | (implied) ad-hoc spelling | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

Pennsylvania Dutch · pdc · penn1240

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Tatoeba (pdc subset) | translations into other languages | 57 sentences | ? | CC BY 2.0 FR |
| Wikipedia (pdc subset) | uncurated | 2k articles | ? | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

Yiddish · yid · west2361/east2295

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Penn Parsed Corpus of Historical Yiddish (Santorini 2021) | POS (Penn-historical), phrase structure (Penn-historical) | 200k tokens | partially YIVO transliteration, partially YIVO-inspired ad-hoc transliteration | CC BY-NC-SA 4.0 |
| CABank Yiddish Corpus (Newman 2015) | New York | 1 hr | audio, transcriptions (partially IPA, partially orthography-based (YIVO-transliteration-based?)) | CC BY-NC-SA 3.0 |
| FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022) | parallel with ~200 languages; Eastern Yiddish (Hasidic) | 2k sentences |  | CC BY-SA 4.0 |
| UDHR-LID (subset) (Kargaran ea 2023, Unicode) | Eastern Yiddish | 59 sentences |  | CC0 1.0 |
| OpenLID (subset) (Burchell ea 2023) | combines other corpora; Eastern Yiddish | 911 lines |  | depend on source datasets |
| YDD Community 2017 (Goldhahn ea 2012) | Eastern Yiddish; possibly uncurated | 21.8k sentences | ? (written) | ? |
| CC-100 (subset) (Wenzek ea 2020) | uncurated, subset of CommonCrawl | 51 MB |  |  |
| OSCAR (subset) (Abadji ea 2022) | uncurated, subset of CommonCrawl | 14.3M tokens / 171.7 MB | ? | Metadata/annotations: CC0 1.0, Common Crawl: custom |
| CulturaX (subset) (Nguyen ea 2023) | uncurated, subset of mc4 and OSCAR | 141k sentences |  | see mc4 & OSCAR |
| MADLAD-400 (subset) (Kudugunta ea 2023) | uncurated, subset of CommonCrawl | 1.9M sentences |  | CC-BY-4.0 |
| Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 220k sentences |  | Apache 2.0 + licenses of source datasets |
| Wikipedia (yi subset) | uncurated | 15k articles |  | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

Upper German

German · deu · stan1295

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Sprachvariation in Norddeutschland (SiN, Hamburg collection) (Schröder 2011, Elmentaler ea 2015) (German subset) | varieties of Low Saxon, East Frisian Low Saxon and (Northern) German | unknown (300 hrs total) | audio | HZSK-RES |
| Regional Variants of German 1 (RVG1) (+link2) (Burger & Schiel 1998) | unclear whether all of the recordings are in regionally accented (Standard) German or some are in Low Saxon/Bavarian/Colognian/etc. instead | 500 × 1 min spontaneous speech | audio, phono (SAMPA), German ortho | CLARIN ACA |
| Texas German Sample Corpus (TGSC) (Blevins 2022) |  | 13.5 hrs / 75k tokens | audio, German ortho | CC0 1.0 |
| Wenkersätze (Wenker 1889–1923: Sprachatlas des Deutschen Reichs. Handdrawn by Emil Maurmann, Georg Wenker and Ferdinand Wrede. Published online as Digitaler Wenker-Atlas, Schmidt ea 2020-) | 40 German sentences, translated into various lects spoken in the German Reich at the turn of the century | 40 sentences × 2210 samples | various phonetic transcription styles and ad-hoc spellings | CC BY-SA 4.0 |

For (mostly non-downloadable) resources for studying German dialect variation, see also the updated overview by Fischer & Limper (2019).

Upper/High Franconian · uppe1464

Including East Franconian · vmf · main1267.

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) | South Franconian and East Franconian | South: 10 min / 1.6k tokens; East: between 13 and 26 min / between 1.9k and 2.3k tokens | audio, German ortho | custom terms |

Bavarian · bar · bava1246

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| UD Bavarian MaiBaam (Blaschke ea 2024) | POS (UPOS), dependencies (UD), German lemmas; dialect/location information; overlaps with wiki, xSID, NaLiBaSID | 1k sentences | ad-hoc pronunciation spelling | CC BY-SA 4.0 |
| Kontatto (Dal Negro & Ciccolone 2020) | POS (unknown), lemmas (German). South Tyrolean | 147k tokens | audio, phono | custom |
| BarNER (Peng ea 2024) | named entities (based on CoNLL2003); overlaps with wiki | 11k sentences | ad-hoc pronunciation spelling | CC BY 4.0 |
| xSID (van der Goot ea 2021; Aepli ea 2023; Winkler ea 2024) (de-st and de-ba subsets) | slot filling, intent detection, translations into 16 languages; South Tyrolean and Central Bavarian | 2×800 sentences | ad-hoc pronunciation spelling | CC BY-SA 4.0 |
| NaLiBaSID MAS:de-ba (Winkler ea 2024) | slot filling, intent detection; Central Bavarian; translation of MASSIVE hence parallel with 50+ other languages | 2k sentences | ad-hoc pronunciation spelling |  |
| NaLiBaSID nat:de-ba (Winkler ea 2024) | slot filling, intent detection | 315 sentences | ad-hoc pronunciation spelling |  |
| DiDi (Frey ea 2015, 2019) (subset) | South Tyrolean | 9.6k messages | ad-hoc pronunciation spelling | CLARIN ACA-BY-NC-NORED |
| Kontatti (Ghilardi 2019) (subset) | South Tyrolean | unknown (6:48 hrs total) | audio, German ortho | custom |
| Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) |  | between 21 and 34 min / between 2.7k and 3.2k tokens | audio, German ortho | custom terms |
| AlpiLinK (Rabanus ea 2023) (tir subset) | South Tyrolean; location information | 1908 files (49 sentences, up to 43 speakers) | audio, German ortho | CC BY-NC-SA 4.0 |
| VinKo (tir subset) (Rabanus ea 2023, Kruijt ea 2023) | South Tyrolean; location information | 148 sentences + 71 words (up to 195 speakers per entry) | audio, German ortho | CC BY-NC-ND 4.0 |
| Tatoeba (bar subset) | translations into other languages | 226 sentences | ad-hoc pronunciation spelling | CC BY 2.0 FR |
| Wikipedia (bar subset) | uncurated, partially tagged with dialect information | 27k articles | ad-hoc pronunciation spelling with some optional conventions | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

Cimbrian · cim · cimb1238

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Kontatti (Ghilardi 2019) (subset) |  | unknown (6:48 hrs total) | audio, German ortho | custom |
| AlpiLinK (Rabanus ea 2023) (cim subset) | location information | 530 files (42 sentences, up to 14 speakers) | audio, German ortho | CC BY-NC-SA 4.0 |
| VinKo (cim subset) (Rabanus ea 2023, Kruijt ea 2023) | location information | 159 sentences + 40 words (up to 14 speakers per entry) | audio, German ortho | CC BY-NC-ND 4.0 |

Mòcheno · mhn · moch1255

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| AlpiLinK (Rabanus ea 2023) (mhn subset) | location information | 42 sentences (1 speaker) | audio, German ortho | CC BY-NC-SA 4.0 |
| VinKo (mhn subset) (Rabanus ea 2023, Kruijt ea 2023) | location information | 159 sentences + 30 words (up to 17 speakers per entry) | audio, German ortho | CC BY-NC-ND 4.0 |

Swabian · swg · swab1242

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Tatoeba (swg subset) | translations into other languages | 1.9k sentences | ad-hoc pronunciation spelling | CC BY 2.0 FR |
| Wikipedia (subset of als subset) | uncurated | 927 (of 27k) articles tagged as Swabian | no defined standard, but a set of recommendations based on published works, the (Swiss German) Dieth orthography and the (Alsatian) Orthal orthography | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

Central Alemannic (incl. Swiss German & Alsatian) · gsw · swis1247

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| Annotated Corpus for the Alsatian Dialects (Bernhard ea 2018, 2019) | POS (UPOS, mod. UPOS), lemmas, glosses (French), NEs (locations); Alsatian; overlap with Wikipedia | 798 sentences | ad-hoc pronunciation spelling | CC BY-SA 4.0 |
| BISAME GSW (STIH 2020, Millour & Fort 2018) | POS (mod. UPOS); Alsatian | 382 sentences | ad-hoc pronunciation spelling | CC BY-NC-SA 3.0 FR |
| NOAH's corpus (Hollenstein & Aepli 2015) | POS (mod. STTS, partially also STTS and UPOS); overlap with UD Swiss German UZH and Wikipedia | 115k tokens | (mostly?) ad-hoc pronunciation spelling | annotations: CC BY 4.0 |
| UD Swiss German UZH (Aepli & Clematide 2018) | POS (UPOS, mod. STTS), dependencies (UD); overlap with NOAH's corpus and Wikipedia | 100 sentences | (mostly?) ad-hoc pronunciation spelling | CC BY-SA 4.0 |
| WUS DIALOG GSW (Stark ea 2014-20, Ueberwasser & Stark 2017) (subset) | POS (mod. STTS), locations | 34.7k tokens | ad-hoc pronunciation spelling, German ortho | CC BY-NC-ND |
| xSID (Aepli ea 2023) (gsw subset) | slot filling, intent detection, translations into 16 languages; Bernese | 800 sentences | | |
| SwissDial (Dogan-Schönberger ea 2021) | topics (14 classes), translations (across dialects and into German), locations (Aargau, Bern, Basel, Graubünden, Luzern, St. Gallen, Wallis, Zürich); the Wallis data are presumably in Walser (wae) | 2.5-4.6 hrs × 7-8 dialects | audio, pronunciation spelling, German ortho | CC BY-NC 4.0 |
| Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) | | 10 min / 612 tokens | audio, German ortho | custom terms |
| SpinningBytes Swiss German Corpus (SB-CH) (annotated subset) (Grubenmann ea 2018) | sentiment; potential overlap with NOAH's corpus | 2.8k sentences | pronunciation spelling | CC BY 4.0 |
| anko Schweizerdeutsch (subset of the Picture postcard corpus) (Sugisaki ea 2023) | discourse-related text spans | 600 postcards | pronunciation spelling? | |
| What's up, Switzerland? (subset) (Stark ea 2014-20, Ueberwasser & Stark 2017) | locations | 507k messages / 3.6M tokens | pronunciation spelling | CC BY-NC-ND |
| Swatchgroup Geschäftsbericht (subset) via PaCoCo (Graën ea 2019) | | 79.6k tokens | pronunciation spelling | CC BY-SA |
| Schweizerdeutsches Mundartkorpus (CHMK; downloadable subcorpus) (Weibel & Peter 2020) | locations | | ? | CC BY-SA 4.0 |
| Text+Berg via PaCoCo (subset) (Bubenhofer ea 2015, Graën ea 2019) | | 156 sentences / 3.1k tokens | | CC BY-SA |
| ArchiMob (Scherrer ea 2019) | | 70 hrs | audio, transcription based on the Dieth orthography for Swiss German, German ortho | CC BY-NC-SA 4.0 |
| STT4SG-350 (Plüss ea 2023) | locations (7 regions) | 343 hrs | audio, German ortho | META-SHARE NonCommercial NoRedistribution |
| SDS-200 (Plüss ea 2022) | | 200 hrs | audio, German ortho | META-SHARE NonCommercial NoRedistribution |
| Swiss Parliaments Corpus (Plüss ea 2021a) | | 293 hrs | audio, German ortho | |
| All Swiss German Dialects Test Set (Plüss ea 2021b) | locations (cantons, incl. Wallis) | 13 hrs / 5.8k utterances | audio, German ortho | MIT |
| Gemeinderat Zürich Audio Corpus (Plüss ea 2021b) | | 1208 hrs | audio | MIT |
| Ein geparstes und grammatisch annotiertes Korpus schweizerdeutscher Spontansprachdaten (Schönenberger & Haeberli 2019) (contact authors) | POS (mod. Penn-historical), phrase structure (Penn-historical); location: Wil (SG) | 100k+ tokens | Dieth orthography | |
| UDHR-LID (subset) (Kargaran ea 2023, Unicode) | | 59 sentences | ? | CC0 1.0 |
| Swiss Crawl (Linder ea 2020) | uncurated | 500k+ sentences | ? | CC BY-NC 4.0 |
| SpinningBytes Swiss German Corpus (SB-CH) (Grubenmann ea 2018) | uncurated; contains NOAH's corpus | 116k sentences | | CC BY 4.0 |
| SwigSpot (Linder 2018) | uncurated | 8k sentences | ? | Apache 2.0 |
| Tatoeba (gsw subset) | translations into other languages | 474 sentences | ? | CC BY 2.0 FR |
| Swiss German Web Corpus (Goldhahn ea 2012) | uncurated? | 100k+ sentences | ? | |
| OSCAR (subset) (Abadji ea 2022) | uncurated, subset of CommonCrawl | 34k tokens / 233 KB | ? | Metadata/annotations: CC0 1.0, Common Crawl: custom |
| CulturaX (subset) (Nguyen ea 2023) | uncurated, subset of mc4 and OSCAR | 6.9k sentences | | see mc4 & OSCAR |
| MADLAD-400 (subset) (Kudugunta ea 2023) | uncurated, subset of CommonCrawl; the dataset audit notes issues with the Swiss German subcorpus ⚠ | 1M sentences | | CC BY 4.0 |
| Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 441k sentences | | Apache 2.0 + licenses of source datasets |
| Wikipedia (subset of als subset) | uncurated, partially tagged with dialect information | 27k articles total (including Swabian and Walser), thereof 2.3k (directly or indirectly) tagged as Alsatian, and 1.7k (directly or indirectly) tagged as Swiss German | no defined standard, but a set of recommendations based on published works, the (Swiss German) Dieth orthography and the (Alsatian) Orthal orthography | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |

Walser · wae · wals1238

| Corpus | Notes | Size | Representation | License |
| --- | --- | --- | --- | --- |
| ArchiWals / CLiMAlp (Angster ea 2017, Gaeta 2020) | locations (Gressoney, Issime, Formazza, Rimella, Alagna) | 80k+ tokens | pronunciation spelling | |
| Walliserdeutsch/RRO (Garner 2014, Garner ea 2014) | | 8.3 hrs | audio, non-standardized transcription | custom |
| SwissDial (subset) (Dogan-Schönberger ea 2021) | topics (14 classes), translations (into German and 7 Swiss German dialects) | 3.3 hrs | audio, pronunciation spelling, German ortho | CC BY-NC 4.0 |
| All Swiss German Dialects Test Set (Plüss ea 2021b) | locations (cantons, incl. Wallis) | unknown | audio, German ortho | MIT |
| AlpiLinK (Rabanus ea 2023) (wae subset) | location information | 122 files (42 sentences, up to 3 speakers) | audio, German ortho | CC BY-NC-SA 4.0 |
| Wikipedia (subset of als subset) | uncurated | 35 (of 27k total) articles tagged as Wal(li)ser | no defined standard, but a set of recommendations based on published works, the (Swiss German) Dieth orthography and the (Alsatian) Orthal orthography | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
