A Survey of Corpora for Germanic Low-Resource Languages and Dialects
You can read more about this corpus collection here. If you find this overview useful for your research, please cite:
@inproceedings{blaschke-etal-2023-survey,
title = {A survey of corpora for {G}ermanic low-resource languages and dialects},
author = {Blaschke, Verena and Sch{\"u}tze, Hinrich and Plank, Barbara},
year = {2023},
month = may,
booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
address = {T{\'o}rshavn, Faroe Islands},
publisher = {University of Tartu Library},
url = {https://aclanthology.org/2023.nodalida-1.41},
pages = {392--414},
}
Language varieties:
- General
- North Germanic (Faroese · (non-std.) Norwegian · Jutish · East Danish · Elfdalian · (non-std.) Swedish)
- West Germanic
- North Sea Germanic
- Macro-Dutch ((non-std.) Dutch · West Flemish · Zeelandic)
- High German
- Central German (Upper Saxon · Moselle Franconian incl. Luxembourgish · Colognian · Limburgish · Rhine Franconian incl. Palatine German · Pennsylvania Dutch · Yiddish)
- Upper German ((non-std.) German · Upper Franconian · Bavarian · Cimbrian · Mòcheno · Swabian · Central Alemannic (Swiss German & Alsatian) · Walser)
Inclusion criteria:
- Accessible to researchers
- Can be downloaded (easily)
- No extensive pre-processing required (appropriate file formats; no abundance of OCR errors)
- Full sentences/utterances rather than word lists. (We have relaxed this criterion and are now also including word-based resources useful for variationist research.)
- Data are contemporaneous or from the past century
- If only a written version is available, it should be (manually) annotated and/or showcase variation through phone[t/m]ic transcriptions or orthographies used specifically for that language variety
We focus on manual or manually corrected annotations rather than fully automatically annotated data. For corpora with an “uncurated” note, we strongly recommend manually checking the data quality, as it might be low or mixed. We've excluded corpora where we were able to determine large-scale data quality issues. Note that the webcrawl-based corpora likely overlap with the contents of some of the other corpora, and for languages with especially few resources, the overlap with Wikipedia tends to be extremely high.
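The overlap mentioned above can be estimated before combining corpora. The following is a minimal sketch, not part of the survey itself: it assumes each corpus has already been loaded as a list of sentence strings, and the function names (`normalize`, `overlap_ratio`) and the toy sentences are hypothetical. It only counts exact matches after light normalization, so near-duplicates are not detected.

```python
import re
import unicodedata


def normalize(sentence: str) -> str:
    """Apply NFKC normalization, lowercase, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", sentence).lower()
    return re.sub(r"\s+", " ", text).strip()


def overlap_ratio(corpus: list[str], reference: list[str]) -> float:
    """Share of distinct sentences in `corpus` that also occur in `reference`."""
    corpus_set = {normalize(s) for s in corpus if s.strip()}
    reference_set = {normalize(s) for s in reference if s.strip()}
    if not corpus_set:
        return 0.0
    return len(corpus_set & reference_set) / len(corpus_set)


# Made-up toy data standing in for a web-crawled corpus and a Wikipedia dump.
webcrawl = [
    "This is the first sentence.",
    "This is   the second sentence.",
    "Only the crawl has this one.",
]
wikipedia = [
    "this is the first sentence.",
    "This is the second sentence.",
]

print(f"{overlap_ratio(webcrawl, wikipedia):.2f}")
```

A high ratio suggests that the two corpora should not be treated as independent sources, for example when splitting data into training and evaluation sets.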
The license names link to where the license is mentioned on the corpus website, unless the license is mentioned on the site linked in the first column, in the article accompanying the dataset, or in the downloaded corpus files. Always refer to the original corpus websites/papers to double-check the license information; we cannot guarantee that the information here is up to date.
Did we forget a corpus for a Germanic low-resource language or dialect that fits these inclusion criteria? Please reach out to us via a GitHub issue or an email to verena DOT blaschke ÄT cis.lmu.de
General
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Sound Comparisons: Germanic (Paschen ea 2019) | word-based, 120 locations/doculects from all Germanic sub-branches | 106 words × 120 locations | audio, phono (IPA), English ortho, ortho of relevant std languages | CC BY-NC-ND 4.0 |
Faroese · fao · faro1244
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
UD Faroese OFT (Tyers ea 2018) | POS (UPOS, Giellatekno-FAO), dependencies (UD), morpho (UD), lemmas. Contains material from Wikipedia | 1.2k sentences | Faroese ortho | GNU GPL 2.0, GNU LGPL 2.1, Mozilla Public License 1.1 |
FarPaHC (Ingason ea 2012, Rögnvaldsson ea 2012) | POS (mod. Penn-historical), phrase structure (mod. Penn-historical) | 53k tokens | Faroese ortho | CC BY 4.0 |
UD Faroese FarPaHC (Ingason ea 2012, Rögnvaldsson ea 2012) | POS (UPOS), dependencies (UD), morpho (UD) | 40k tokens | Faroese ortho | CC BY-SA 4.0 |
FoNE (Snæbjarnarson ea 2023) | named entities (8 classes). The text overlaps with the BLARK background corpus (Sosialurin subcorpus) | 118k tokens | Faroese ortho | CC BY 4.0 |
Fo-STS (Snæbjarnarson ea 2023) | semantic text similarity (sentence-level), translated subset of the English STS corpus (Cer ea 2017) | 729 sentence pairs | Faroese ortho | CC BY 4.0 |
BLARK 1.0 (background corpus) (Simonsen ea 2022) | | 25M tokens | Faroese ortho | CC BY 4.0 |
Sprotin translations | English–Faroese parallel sentences | 126k sentence pairs | Faroese ortho | MIT license |
Føroyskur talumálsbanki (Jacobsen 2022) | | 599.9k tokens | Faroese ortho(, audio?) | CLARIN RES-PLAN-BY-PRIV-NORED |
Faroese text collection (FTS) | in BLARK 1.0 background corpus | 1.1M tokens | Faroese ortho | CC BY 4.0 |
Korp (Giellatekno) | in BLARK 1.0 background corpus (download via BLARK), contains Wikipedia articles | ? | Faroese ortho | CC BY 4.0 |
BLARK 1.0 (audio) (Simonsen ea 2022) | locations (Suðuroy, Sandoy, Suðurstreymoy, Norðurstreymoy/Eysturoy, Vágar, Norðuroyggjar) | 100 hrs | audio, Faroese ortho, some phono | CC BY 4.0 |
Faroese Danish Corpus Hamburg (FADAC Hamburg) (subset) (Debess 2019) | locations (Tórshavn, Vágar, Suðuroy, Eysturoy/Norðuroyggjar) | 31 hrs | audio, Faroese ortho | HZSK-RES |
FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022) | parallel with ~200 languages | 2k sentences | Faroese ortho | CC BY-SA 4.0 |
Tatoeba (fao subset) | translations into other languages | 417 sentences | Faroese ortho | CC BY 2.0 FR |
ITU Faroese/Danish (Derczynski ea 2022) | Danish translations; overlaps with (Danish) Tatoeba | 4k sentences | | CC BY 4.0 |
Ubuntu via OPUS (Tiedemann 2012) | translations into other languages | 20.2k tokens | Faroese ortho | ? |
QED via OPUS (Abdelali ea 2014, Tiedemann 2012) | translations into other languages | 6.4k tokens | Faroese ortho | ? |
UDHR-LID (subset) (Kargaran ea 2023, Unicode) | | 57 sentences | | CC0 1.0 |
OpenLID (subset) (Burchell ea 2023) | combines other corpora | 40k lines | | depends on source datasets |
FAO News 2020 (Goldhahn ea 2012) | uncurated? | 33.8k sentences | ? | |
FAO Newscrawl 2011 (Goldhahn ea 2012) | uncurated? | 8.8k sentences | ? | |
Faroese Mixed Corpus (Goldhahn ea 2012) | uncurated? | 300k sentences | ? | |
Faroese Web Corpus (Goldhahn ea 2012) | uncurated? | 1M sentences | ? | |
FC3 (Snæbjarnarson ea 2023) | Faroese subset of CommonCrawl (uncurated) | 98k paragraphs / 9M tokens | Faroese ortho | unspecified CC license |
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012) | uncurated | 102 MB | Faroese ortho | CC BY-SA 3.0 |
MADLAD-400 (subset) (Kudugunta ea 2023) | uncurated, subset of CommonCrawl | 1.8M sentences | | CC BY 4.0 |
Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 2.3M sentences | | Apache 2.0 + licenses of source datasets |
Wikipedia (fo subset) | uncurated | 14k articles | Faroese ortho | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
For additional resources/tools, see also the resource list of the Faroese Centre for Language Technology.
Norwegian · nor · norw1258
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
LIA Treebank (+transcriptions) (Øvrelid ea 2018) | POS (mod. NDT), dependencies (mod. NDT), morpho (mod. NDT), lemmas, locations (17 places in Norway). Annotated subset of LIA Norsk | 7.5k speech segments / 78k tokens | Nynorsk ortho, phono | CC BY-NC-SA 4.0 |
UD Norwegian Nynorsk LIA (+transcriptions) (Øvrelid ea 2018) | POS (UPOS), dependencies (UD), morpho (UD), lemmas, locations (10 places in Norway). Annotated subset of LIA Norsk | 5.3k speech segments / 55k tokens | Nynorsk ortho, phono; aligned Nynorsk+phono here (Blaschke ea 2023) | treebank: CC BY-SA 4.0, transcriptions: CC BY-NC-SA 4.0 |
NDC Treebank (+transcriptions; website) (Kåsen ea 2022, Johannessen ea 2009) | POS (mod. NDT), dependencies (mod. NDT), morpho (mod. NDT), lemmas, locations (17 places) | 4.6k speech segments / 66k tokens | Bokmål ortho, phono | treebank and transcriptions: CC BY-NC-SA 4.0 |
NoMusic (Mæhlum & Scherrer 2024) subset of xSID | slot filling, intent detection, translations into Bokmål and 16 other languages; location (8 dialects) | 8×800 sentences | ad-hoc pronunciation spelling | CC BY-SA 4.0 |
NorDial (subset) (Barnes ea 2021) | | 348 tweets | ad-hoc spelling | CC0 1.0 |
NorDial (POS-annotated subset) (Mæhlum ea 2022 – contact authors) | POS (UPOS) | 35+ tweets | ad-hoc spelling | |
Nordic Dialect Corpus (subset) (Johannessen ea 2009) | locations (>100 places) | 1.9M tokens | Bokmål ortho, phono; aligned Bokmål+phono here (Scherrer 2023) | CC BY-NC-SA 4.0 |
LIA Norsk (Øvrelid ea 2018) | locations (222 places) | 3.5M tokens | Nynorsk ortho, phono | CC BY-NC-SA 4.0 |
LIA Norsk (downloadable audio subset) (Øvrelid ea 2018) | locations (178 places) | ? | audio, Nynorsk ortho, phono | CC BY-NC-SA 4.0 |
The spoken language investigation in Oslo (TAUS) | locations (East vs. West Oslo) | 387k tokens | Bokmål ortho, phono | CC BY-NC-SA 4.0 |
American Nordic Speech Corpus (CANS) (subset) (Johannessen ea 2015) | locations (57 places in USA/Canada) | 773k tokens | Bokmål ortho, phono | CC BY-NC-SA 4.0 |
Speech Database for Norwegian (NB Tale) | locations (24 areas) | 365 × 2 mins (spontaneous speech), 7.6k sentences (reading) | audio, Bokmål ortho, mod. X-SAMPA | CC0 |
Norwegian Parliamentary Speech Corpus (NPSC) | locations (5 dialect regions) | 140 hrs / 65k sentences / 1.2M tokens | audio, Bokmål/Nynorsk ortho | CC0 |
Jutish · juti1236
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Danish Gigaword Corpus (synne subset) (Derczynski ea 2021) | South Jutish | ca. 20k tokens | | CC BY 4.0 |
East Danish · scan1238
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Danish Gigaword Corpus (botxt subset) (Derczynski ea 2021, Kjeldsen 2019) | Bornholmsk | ca. 400k tokens | | CC BY 4.0 |
Elfdalian/Övdalian · ovd · elfd1234
Glottolog 4.7 categorizes Elfdalian as a dialect of Dalecarlian/dale1238.
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Nordic Dialect Corpus (subset) (Johannessen ea 2009) | locations (7 places) | 15.7k tokens | Elfdalian ortho (Råðdjärum's orthography), Swedish ortho | CC BY-NC-SA 4.0 |
Swedish · swe · swed1254
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Parallel dialectal-standard Swedish data (Hämäläinen ea 2020, Ivars & Södergård 2007) | Finland Swedish (with locations) | 86.5k tokens | transcription, Swedish ortho | CC BY-NC-SA 4.0 |
American Nordic Speech Corpus (CANS) (subset) (Johannessen ea 2015) | locations (7 places in the US) | 46k tokens | Swedish ortho, phono | CC BY-NC-SA 4.0 |
Anglo-Frisian
Scots · sco · scot1243
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
POS-tagged Scots corpus (Lameris & Stymne 2021) | POS (UPOS); overlaps with the SCOTS corpus | 1k tokens | partially ad hoc (SCOTS), partially with a standardized orthography (Mak Forrit) | |
Scottish Corpus of Texts & Speech (SCOTS) (subset) (Anderson ea 2007) | partially annotated in the POS-tagged Scots corpus | unknown (4.6M tokens total) | mix of ad-hoc spelling and English ortho | custom |
UDHR-LID (subset) (Kargaran ea 2023, Unicode) | | 58 sentences | | CC0 1.0 |
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012) | uncurated | 35 MB | ? | CC BY-SA 3.0 |
Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 410k sentences | | Apache 2.0 + licenses of source datasets |
Wikipedia (sco subset) | uncurated, see reports here and here ⚠ | 39k articles | Scots spelling recommendations | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
English · eng · stan1293
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
TwitterAAE-UD (Blodgett ea 2016) | dependencies (UD); AAVE | 250 tweets | ad-hoc spelling | |
Diachronic Electronic Corpus of Tyneside English (DECTE) (Corrigan ea 2012) | locations (19 places in NE England). Contains the Newcastle Electronic Corpus of Tyneside English (NECTE) and NECTE2, and NECTE in turn contains the Tyneside Linguistic Survey (TLS) and the Phonological Variation and Change in Contemporary Spoken English (PVC) corpus. | 72 hrs / 804k tokens | audio, English ortho, partially: phono | custom |
Intonational Variation in English (IViE) (Nolan & Post 2013) | locations (British Isles: Belfast, Dublin, Newcastle, Leeds, Bradford, Liverpool, Cambridge, Cardiff, London) | 36 hrs | audio, English ortho | custom |
Crowdsourced high-quality UK and Ireland English Dialect speech data set (Demirsahin ea 2020) | locations (British Isles: Ireland, Midlands, Northern England, Scotland, Southern England, Wales) | 31 hrs | audio, English ortho | CC BY-SA 4.0 |
Helsinki Corpus of British English Dialects | locations (UK: Cambridgeshire, Devon, Essex/Lancashire, Isle of Ely, Somerset, Suffolk) | 1M tokens | audio, English ortho | |
Nationwide Speech Project (NSP) (Clopper & Pisoni 2006) | locations (USA: West, Midland, North, South, New England, Mid-Atlantic) | 60 × 1 hr | audio, partially: English ortho | |
Corpus of Regional African American Language (CORAAL) (Kendall & Farrington 2021) | 6 locations, AAVE | 135.6 hrs / 1.5M tokens | audio, English ortho | CC BY-NC-SA 4.0 |
Sound Comparisons: Englishes (Maguire ea 2019) | word-based, 51 locations | 110 words × 51 locations | audio, phono (IPA), English ortho | CC BY-NC-ND 4.0 |
See also: SPADE: SPeech Across Dialects of English (Stuart-Smith ea 2017–2020) and their corpus collection.
West(ern) Frisian · fry · west2354
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
UD Frisian/Dutch Fame (Braggaar & van der Goot 2021, Yılmaz ea 2016) | POS (UPOS), dependencies (UD), code-switching; code-mixed Frisian and Dutch. Annotated subset of FAME. | 400 sentences | Frisian(/Dutch) ortho | CC BY-SA 4.0 |
UD Frisian Frysk (Heeringa ea 2021) | under development!; POS (UPOS), dependencies (UD), morpho (UD), lemmas | 2.9k sentences | Frisian ortho | CC BY-SA 3.0 |
Common Voice (subset) (Ardila ea 2020) | | 211 hrs | audio, Frisian ortho | CC0 |
Frisian AudioMining Enterprise (FAME!) (Yılmaz ea 2016) | partially: locations | 18.5 hrs | audio, Frisian ortho | |
Recordings of Dutch-Frisian council meetings (Bentum ea 2022) | | 26 hrs / 281k tokens | audio, Frisian ortho | |
Corpus Spoken Frisian / Korpus Sprutsen Frysk (KSF) | | 200 hrs (65 hrs thereof transcribed) | audio, partially: Frisian ortho | |
Boarnsterhim Corpus (BHC) (subset) (Sloos ea 2018) | under revision! | unknown (250 hrs total, with Dutch) | audio | |
Tatoeba (fry subset) | translations into other languages | 641 sentences | Frisian ortho | CC BY 2.0 FR |
Ubuntu via OPUS (Tiedemann 2012) | translations into other languages | 22.4k tokens | Frisian ortho | |
KDE4 via OPUS (Tiedemann 2012) | translations into other languages | ca. 300k tokens | Frisian ortho | |
GNOME via OPUS (Tiedemann 2012) | translations into other languages | 55.7k tokens | Frisian ortho | |
Mozilla-I10n | translations into other languages | ca. 400k tokens | Frisian ortho | Mozilla Public License 2.0 |
UDHR-LID (subset) (Kargaran ea 2023, Unicode) | | 58 sentences | | CC0 1.0 |
FRY News 2020 (Goldhahn ea 2012) | uncurated? | 107.5k sentences | ? (written) | ? |
Western Frisian Newscrawl (Goldhahn ea 2012) | uncurated? | 100k sentences | ||
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012) | uncurated | 72 MB | Frisian ortho | CC BY-SA 3.0 |
CC-100 (subset) (Wenzek ea 2020) | uncurated, subset of CommonCrawl | 174 MB | Frisian ortho | |
OSCAR (subset) (Abadji ea 2022) | uncurated, subset of CommonCrawl | 9.9M tokens / 70.4 MB | Frisian ortho | Metadata/annotations: CC0 1.0, Common Crawl: custom |
CulturaX (subset) (Nguyen ea 2023) | uncurated, subset of mc4 and OSCAR | 223k sentences | | see mc4 & OSCAR |
MADLAD-400 (subset) (Kudugunta ea 2023) | uncurated, subset of CommonCrawl | 3.7M sentences | | CC BY 4.0 |
Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 927k sentences | | Apache 2.0 + licenses of source datasets |
Wikipedia (fy subset) | uncurated | 50k articles | Frisian ortho | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
North(ern) Frisian · frr · nort2626
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Tatoeba (frr subset) | translations into other languages | 2.9k sentences | ? | CC BY 2.0 FR |
Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 55.3k sentences | | Apache 2.0 + licenses of source datasets |
Wikipedia (frr subset) | uncurated, partially tagged with dialect information | 17k articles | different dialect-based (ad-hoc?) orthographies | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
Saterland Frisian/Saterfrisian · stq · sate1242
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Tatoeba (stq subset) | translations into other languages | 96 sentences | ? | CC BY 2.0 FR |
MADLAD-400 (subset) (Kudugunta ea 2023) | uncurated, subset of CommonCrawl | 27.7k sentences | | CC BY 4.0 |
Wikipedia (stq subset) | uncurated | 4k articles | revised Kramer orthography for Saterfrisian (unclear if example, recommendation or rule for this wiki) | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
Low German
Low Saxon/Low German · nds · lowg1239
(The relationship between the ISO 639-3 code and the Glottocode is complicated.)
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
UD Low Saxon LSDC (Siewert & Rueter 2024) | POS (UPOS), dependencies (UD), morphological features (UD), glosses (Middle Low Saxon), lemmas, locations (18 dialect areas, see also LSDC note); overlaps with LSDC | 1000 sentences | ad-hoc spelling, Nysassiske Skryvwyse | CC BY-SA 4.0 |
TaPaCo (subset) (Scherrer 2020) | paraphrases; annotated subset of Tatoeba | 1107 sentences | ? | CC BY 2.0 |
Low Saxon Dialect Classification (LSDC) (Siewert ea 2020) | locations (15 dialect areas); overlaps with UD Low Saxon LSDC; also contains FRS, WEP, TWD, ACT content | 88.9k sentences (incl. FRS etc.) | ad-hoc spelling | CC BY-NC-SA 4.0 |
Sprachvariation in Norddeutschland (SiN, Hamburg collection) (Schröder 2011, Elmentaler ea 2015) (Low German subset) | varieties of Low Saxon (Nordhannoversch, Emsländisch Oldenburgisch), East Frisian Low Saxon and (Northern) German | unknown (300 hrs total) | audio | HZSK-RES |
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) | locations | 80 min / 10.7k tokens | audio, German ortho | custom terms |
Tatoeba (nds subset) | translations into other languages | 18.1k sentences | ? | CC BY 2.0 FR |
Ubuntu via OPUS (Tiedemann 2012) | translations into other languages | 35.3k tokens | ? | |
KDE4 via OPUS (Tiedemann 2012) | translations into other languages | 1.1M tokens | ? | |
GNOME via OPUS (Tiedemann 2012) | translations into other languages | ca. 700k tokens | ? | |
UDHR-LID (subset) (Kargaran ea 2023, Unicode) | | 58 sentences | | CC0 1.0 |
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012) | uncurated | 24 MB | ? | CC BY-SA 3.0 |
OSCAR (subset) (Abadji ea 2022) | uncurated, subset of CommonCrawl | 1.6M tokens / 10.7 MB | ? | Metadata/annotations: CC0 1.0, Common Crawl: custom |
CulturaX (subset) (Nguyen ea 2023) | uncurated, subset of mc4 and OSCAR | 15.1k sentences | | see mc4 & OSCAR |
Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 934k sentences | | Apache 2.0 + licenses of source datasets |
Wikipedia (nds subset) | uncurated, partially tagged with dialect information | 84k articles | Sass'sche Schrievwies | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
Wikipedia (nds-nl subset) | uncurated, partially tagged with dialect information | 8k articles | Nysassiske Skryvwyse (preferred) and Algemene Nedersaksische Schriefwieze (older articles) | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
East Frisian Low Saxon · frs · east2288
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Sprachvariation in Norddeutschland (SiN, Hamburg collection) (East Frisian Low Saxon subset) | varieties of Low Saxon, East Frisian Low Saxon and (Northern) German | unknown (300 hrs total) | audio | HZSK-RES |
Low Saxon Dialect Classification (LSDC) (OFR subset) (Siewert ea 2020) | minor overlaps with UD Low Saxon LSDC | 240 sentences | ad-hoc spelling | CC BY-NC-SA 4.0 |
Gronings · gos · gron1242
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
TaPaCo (subset) (Scherrer 2020) | paraphrases; annotated subset of Tatoeba | 122 sentences | ? | CC BY 2.0 |
Automatic speech recognition dataset for Gronings (Bartelds ea 2023) | | 4 hours | audio, written | CC BY 4.0 |
Dataset: Gronings (Bartelds & San 2021, San ea 2021) | | 23 mins | audio, written | CC BY 4.0 |
Tatoeba (gos subset) | translations into other languages | 5.7k sentences | ? | CC BY 2.0 FR |
Twents · twd · twen1241
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Low Saxon Dialect Classification (LSDC) (TWE subset) (Siewert ea 2020) | minor overlaps with UD Low Saxon LSDC | 668 sentences | ad-hoc spelling | CC BY-NC-SA 4.0 |
Achterhoeks · act · acht1238
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Low Saxon Dialect Classification (LSDC) (ACH subset) (Siewert ea 2020) | minor overlaps with UD Low Saxon LSDC | 988 sentences | ad-hoc spelling | CC BY-NC-SA 4.0 |
Westphalic/Westphalish/Westphalian · wep · west2356
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) | | 15 min / 2.4k tokens | audio, German ortho | custom terms |
Low Saxon Dialect Classification (LSDC) (OWL subset) (Siewert ea 2020) | minor overlaps with UD Low Saxon LSDC | 15k sentences | ad-hoc spelling | CC BY-NC-SA 4.0 |
Macro-Dutch
Dutch · nld · dutc1256
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Corpus of Southern Dutch Dialects (GCND) (Breitbarth ea 2018) | under construction!; might also include West Flemish, Zeelandic, and/or Limburgs | | audio, transcriptions | |
SAND (Barbiers ea 2006) | locations | ? | phono | custom |
MAND/FAND/GTRP (Goeman ea) (contact institute) | locations | | phono (K-IPA) | |
West(ern) Flemish · vls · vlaa1240
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Stemmen uit het verleden (annotated subset) (Lybaert ea 2019, Van Keymeulen ea 2019) | V2 variation, locations (25 places) | 1.4k sentences | phono | CC BY-NC 4.0 |
Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 102k sentences | | Apache 2.0 + licenses of source datasets |
VLS Community 2017 (Goldhahn ea 2012) | possibly uncurated | 36.4k sentences | ? (written) | ? |
Wikipedia (vls subset) | uncurated, partially tagged with dialect information | 8k articles | Standoardvlams (orthography developed by vls.wikipedia.org editors) | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
Zeelandic/Zeeuws · zea · zeeu1238
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 34.4k sentences | | Apache 2.0 + licenses of source datasets |
Wikipedia (zea subset) | uncurated | 6k articles | ? | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
Central German
Upper Saxon · sxu · uppe1400
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
SXUCorpus (Herms ea 2016) (contact authors) | 8 locations | 500 min / 70k tokens | audio, German ortho | |
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) | | 12 min / 1.7k tokens | audio, German ortho | custom terms |
Moselle Franconian · luxe1241
Luxembourgish · ltz · luxe1243
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
UD Luxembourgish LuxBank (Plum ea 2024) | POS tags (UPOS), dependencies (UD) | 20 sentences | Luxembourgish ortho | |
Banking Client Support (BCS) Dataset (Lothritz ea 2021) | intent detection, slot filling, parallel with DEU, FRA, ENG | 1k sentences | Luxembourgish ortho | ? |
Luxembourgish translation of Winograd Natural Language Inference (L-WNLI) (Lothritz ea 2022) | NLI, parallel with other languages (Levesque ea 2012) | 767 samples | Luxembourgish ortho | ? |
Luxembourgish POS and NER (Lothritz ea 2022) (contact authors) | POS tags (15 tags), NER (PER, ORG, LOC, GPE, MISC) | 5.5k sentences | Luxembourgish ortho | ? |
Luxembourgish news classification (Lothritz ea 2022) (contact authors) | 8 classes | 10k articles | Luxembourgish ortho | ? |
SA1 (Lothritz ea 2023; contact authors) | sentiment | 1.8k sentences | ||
Luxembourgish sentence negation (Lothritz ea 2023) | position of negation particle; overlaps with Leipzig corpora (Newscrawl and/or Web and/or Wikipedia) | 46k sentences | ||
LuxId (Lavergne ea 2014) | code-switching (LTZ, DEU, FRA) | 924 sentences (most with LTZ content) | Luxembourgish(/German/French) ortho | CC BY-SA 3.0 |
FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022) | parallel with ~200 languages | 2k sentences | Luxembourgish ortho | CC BY-SA 4.0 |
FLEURS (subset) (Conneau ea 2023) | parallel with ~100 languages; audio version of FLORES (Goyal ea 2022) | 1-3 recordings each of 1.9k sentences (3.8k recordings total) | audio, Luxembourgish ortho | CC BY 4.0 |
Tatoeba (ltz subset) | translations into other languages | 884 sentences | Luxembourgish ortho | CC BY 2.0 FR |
Ubuntu via OPUS (Tiedemann 2012) | translations into other languages | 17k tokens | Luxembourgish ortho | ? |
KDE4 via OPUS (Tiedemann 2012) | translations into other languages | 28.8k tokens | Luxembourgish ortho | ? |
Mozilla-I10n | translations into other languages | 6.9k tokens | Luxembourgish ortho | Mozilla Public License 2.0 |
QED via OPUS (Abdelali ea 2014, Tiedemann 2012) | translations into other languages | 19.2k tokens | Luxembourgish ortho | ? |
TED2020 via OPUS (Reimers & Gurevych, Tiedemann 2012) | translations into other languages | 1.7k tokens | Luxembourgish ortho | CC BY-NC-ND 4.0 |
UDHR-LID (subset) (Kargaran ea 2023, Unicode) | | 59 sentences | | CC0 1.0 |
OpenLID (subset) (Burchell ea 2023) | combines other corpora | 37.7k lines | | depends on source datasets |
Luxembourgish Newscrawl (Goldhahn ea 2012) | uncurated? | 300k sentences | ||
Luxembourgish Web Corpus (Goldhahn ea 2012) | uncurated? | 1M sentences | ||
Web to Corpus (W2C) (subset) (Majliš 2011, Majliš & Žabokrtský 2012) | uncurated | 81 MB | ? | CC BY-SA 3.0 |
OSCAR (subset) (Abadji ea 2022) | uncurated, subset of CommonCrawl | 2.5M tokens / 18.4 MB | ? | Metadata/annotations: CC0 1.0, Common Crawl: custom |
CulturaX (subset) (Nguyen ea 2023) | uncurated, subset of mc4 and OSCAR | 166k sentences | | see mc4 & OSCAR |
MADLAD-400 (subset) (Kudugunta ea 2023) | uncurated, subset of CommonCrawl | 3.4M sentences | | CC BY 4.0 |
Wikipedia (lb subset) | uncurated | 61k articles | Luxembourgish ortho | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
For other kinds of resources/tools, see also questoph/NLPforLTZ.
Transylvanian Saxon · tran1294
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Audioatlas siebenbürgisch-sächsischer Dialekte (ASD) (University of Munich) | | 360 hrs / 450k tokens | audio, German ortho, partially phono | CLARIN RES |
Colognian · ksh · kols1241
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Tatoeba (ksh subset) | translations into other languages | 82 sentences | ? | CC BY 2.0 FR |
Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 33.5k sentences | | Apache 2.0 + licenses of source datasets |
Wikipedia (ksh subset) | uncurated, Colognian and other varieties of Ripuarian, partially tagged with dialect and/or orthography information | 3k articles | ad-hoc spelling, some articles according to various Ripuarian orthographies | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
Limburgish/Limburgan · lim · limb1263
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022) | parallel with ~200 languages; Maastrichtian Limburgs | 2k sentences | | CC BY-SA 4.0 |
Ubuntu via OPUS (Tiedemann 2012) | translations into other languages | 18.4k tokens | ? | |
GNOME via OPUS (Tiedemann 2012) | translations into other languages | ca. 400k tokens | ? | |
OpenLID (subset) (Burchell ea 2023) | combines other corpora | 48k lines | | depends on source datasets |
LIM Community 2017 (Goldhahn ea 2012) | possibly uncurated | 84.4k sentences | ? (written) | ? |
LIM Web 2010 (Netherlands) (Goldhahn ea 2012) | uncurated? | 35.4k sentences | ? (written) | ? |
CC-100 (subset) (Wenzek ea 2020) | uncurated, subset of CommonCrawl | 8.3 MB | ||
CulturaX (subset) (Nguyen ea 2023) | uncurated, subset of mc4 and OSCAR | 206 sentences | | see mc4 & OSCAR |
Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 652k sentences | | Apache 2.0 + licenses of source datasets |
Wikipedia (li subset) | uncurated, partially tagged with dialect and/or orthography information | 14k articles | Veldeke-sjpelling, Algemein Gesjreve Limburgs | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
Rhine/Rhenish Franconian · rhin1244
Includes Palatin(at)e German · pfl · pala1330.
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Thorsten-Voice Dataset 2023.09 Hessisch (Müller & Kreutz 2024) | Hessian | 2 hrs / 2.1k sentences | audio, German ortho | CC0 |
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) | Hessian | 8 min / 1.4k tokens | audio, German ortho | custom terms |
Wikipedia (pfl subset) | uncurated, partially tagged with dialect information; contains articles in Palatine German, Lorraine Franconian, Hessian | 3k articles | (implied) ad-hoc spelling | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
Pennsylvania Dutch · pdc · penn1240
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Tatoeba (pdc subset) | translations into other languages | 57 sentences | ? | CC BY 2.0 FR |
Wikipedia (pdc subset) | uncurated | 2k articles | ? | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
Yiddish · yid · west2361/east2295
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Penn Parsed Corpus of Historical Yiddish (Santorini 2021) | POS (Penn-historical), phrase structure (Penn-historical) | 200k tokens | partially YIVO transliteration, partially YIVO-inspired ad-hoc transliteration | CC BY-NC-SA 4.0 |
CABank Yiddish Corpus (Newman 2015) | New York | 1 hr | audio, transcriptions (partially IPA, partially orthography-based (YIVO-transliteration-based?)) | CC BY-NC-SA 3.0 |
FLORES-200 (subset) (Goyal ea 2022, NLLB Team 2022) | parallel with ~200 languages; Eastern Yiddish (Hasidic) | 2k sentences | | CC BY-SA 4.0 |
UDHR-LID (subset) (Kargaran ea 2023, Unicode) | Eastern Yiddish | 59 sentences | | CC0 1.0 |
OpenLID (subset) (Burchell ea 2023) | combines other corpora; Eastern Yiddish | 911 lines | | depends on source datasets |
YDD Community 2017 (Goldhahn ea 2012) | Eastern Yiddish; possibly uncurated | 21.8k sentences | ? (written) | ? |
CC-100 (subset) (Wenzek ea 2020) | uncurated, subset of CommonCrawl | 51 MB | ||
OSCAR (subset) (Abadji ea 2022) | uncurated, subset of CommonCrawl | 14.3M tokens / 171.7 MB | ? | Metadata/annotations: CC0 1.0, Common Crawl: custom |
CulturaX (subset) (Nguyen ea 2023) | uncurated, subset of mc4 and OSCAR | 141k sentences | | see mc4 & OSCAR |
MADLAD-400 (subset) (Kudugunta ea 2023) | uncurated, subset of CommonCrawl | 1.9M sentences | | CC BY 4.0 |
Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 220k sentences | | Apache 2.0 + licenses of source datasets |
Wikipedia (yi subset) | uncurated | 15k articles | | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
Upper German
German · deu · stan1295
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Sprachvariation in Norddeutschland (SiN, Hamburg collection) (Schröder 2011, Elmentaler ea 2015) (German subset) | varieties of Low Saxon, East Frisian Low Saxon and (Northern) German | unknown (300 hrs total) | audio | HZSK-RES |
Regional Variants of German 1 (RVG1) (+link2) (Burger & Schiel 1998) | unclear whether all of the recordings are in regionally accented (Standard) German or some are in Low Saxon/Bavarian/Colognian/etc. instead | 500 × 1 min spontaneous speech | audio, phono (SAMPA), German ortho | CLARIN ACA |
Texas German Sample Corpus (TGSC) (Blevins 2022) | | 13.5 hrs / 75k tokens | audio, German ortho | CC0 1.0 |
Wenkersätze (Wenker 1889–1923: Sprachatlas des Deutschen Reichs. Hand-drawn by Emil Maurmann, Georg Wenker and Ferdinand Wrede. Published online as Digitaler Wenker-Atlas, Schmidt ea 2020-) | 40 German sentences, translated into various lects spoken in the German Reich at the turn of the century | 40 sentences × 2210 samples | various phonetic transcription styles and ad-hoc spellings | CC BY-SA 4.0 |
For (mostly non-downloadable) resources for studying German dialect variation, see also the updated overview by Fischer & Limper (2019).
Upper/High Franconian · uppe1464
Including East Franconian · vmf · main1267.
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) | South Franconian and East Franconian | South: 10 min / 1.6k tokens; East: between 13 and 26 min / between 1.9k and 2.3k tokens | audio, German ortho | custom terms |
Bavarian · bar · bava1246
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
UD Bavarian MaiBaam (Blaschke ea 2024) | POS (UPOS), dependencies (UD), German lemmas; dialect/location information; overlaps with wiki, xSID, NaLiBaSID | 1k sentences | ad-hoc pronunciation spelling | CC BY-SA 4.0 |
Kontatto (Dal Negro & Ciccolone 2020) | POS (unknown), lemmas (German); South Tyrolean | 147k tokens | audio, phono | custom |
BarNER (Peng ea 2024) | named entities (based on CoNLL2003); overlaps with wiki | 11k sentences | ad-hoc pronunciation spelling | CC BY 4.0 |
xSID (van der Goot ea 2021; Aepli ea 2023; Winkler ea 2024) (de-st and de-ba subsets) | slot filling, intent detection, translations into 16 languages; South Tyrolean and Central Bavarian | 2×800 sentences | ad-hoc pronunciation spelling | CC BY-SA 4.0 |
NaLiBaSID MAS:de-ba (Winkler ea 2024) | slot filling, intent detection; Central Bavarian; translation of MASSIVE hence parallel with 50+ other languages | 2k sentences | ad-hoc pronunciation spelling | |
NaLiBaSID nat:de-ba (Winkler ea 2024) | slot filling, intent detection | 315 sentences | ad-hoc pronunciation spelling | |
DiDi (Frey ea 2015, 2019) (subset) | South Tyrolean | 9.6k messages | ad-hoc pronunciation spelling | CLARIN ACA-BY-NC-NORED |
Kontatti (Ghilardi 2019) (subset) | South Tyrolean | unknown (6:48 hrs total) | audio, German ortho | custom |
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) | | between 21 and 34 min / between 2.7k and 3.2k tokens | audio, German ortho | custom terms |
AlpiLinK (Rabanus ea 2023) (tir subset) | South Tyrolean; location information | 1908 files (49 sentences, up to 43 speakers) | audio, German ortho | CC BY-NC-SA 4.0 |
VinKo (tir subset) (Rabanus ea 2023, Kruijt ea 2023) | South Tyrolean; location information | 148 sentences + 71 words (up to 195 speakers per entry) | audio, German ortho | CC BY-NC-ND 4.0 |
Tatoeba (bar subset) | translations into other languages | 226 sentences | ad-hoc pronunciation spelling | CC BY 2.0 FR |
Wikipedia (bar subset) | uncurated, partially tagged with dialect information | 27k articles | ad-hoc pronunciation spelling with some optional conventions | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
Cimbrian · cim · cimb1238
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Kontatti (Ghilardi 2019) (subset) | | unknown (6:48 hrs total) | audio, German ortho | custom |
AlpiLinK (Rabanus ea 2023) (cim subset) | location information | 530 files (42 sentences, up to 14 speakers) | audio, German ortho | CC BY-NC-SA 4.0 |
VinKo (cim subset) (Rabanus ea 2023, Kruijt ea 2023) | location information | 159 sentences + 40 words (up to 14 speakers per entry) | audio, German ortho | CC BY-NC-ND 4.0 |
Mòcheno · mhn · moch1255
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
AlpiLinK (Rabanus ea 2023) (mhn subset) | location information | 42 sentences (1 speaker) | audio, German ortho | CC BY-NC-SA 4.0 |
VinKo (mhn subset) (Rabanus ea 2023, Kruijt ea 2023) | location information | 159 sentences + 30 words (up to 17 speakers per entry) | audio, German ortho | CC BY-NC-ND 4.0 |
Swabian · swg · swab1242
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Tatoeba (swg subset) | translations into other languages | 1.9k sentences | ad-hoc pronunciation spelling | CC BY 2.0 FR |
Wikipedia (subset of als subset) | uncurated | 927 (of 27k) articles tagged as Swabian | no defined standard, but a set of recommendations based on published works, the (Swiss German) Dieth orthography and the (Alsatian) Orthal orthography | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
Central Alemannic (incl. Swiss German & Alsatian) · gsw · swis1247
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
Annotated Corpus for the Alsatian Dialects (Bernhard ea 2018, 2019) | POS (UPOS, mod. UPOS), lemmas, glosses (French), NEs (locations); Alsatian; overlap with Wikipedia | 798 sentences | ad-hoc pronunciation spelling | CC BY-SA 4.0 |
BISAME GSW (STIH 2020, Millour & Fort 2018) | POS (mod. UPOS); Alsatian | 382 sentences | ad-hoc pronunciation spelling | CC BY-NC-SA 3.0 FR |
NOAH's corpus (Hollenstein & Aepli 2015) | POS (mod. STTS, partially also STTS and UPOS); overlap with UD Swiss German UZH and Wikipedia | 115k tokens | (mostly?) ad-hoc pronunciation spelling | annotations: CC BY 4.0 |
UD Swiss German UZH (Aepli & Clematide 2018) | POS (UPOS, mod. STTS), dependencies (UD); overlap with NOAH's corpus and Wikipedia | 100 sentences | (mostly?) ad-hoc pronunciation spelling | CC BY-SA 4.0 |
WUS DIALOG GSW (Stark ea 2014-20, Ueberwasser & Stark 2017) (subset) | POS (mod. STTS), locations | 34.7k tokens | ad-hoc pronunciation spelling, German ortho | CC BY-NC-ND |
xSID (Aepli ea 2023) (gsw subset) | slot filling, intent detection, translations into 16 languages; Bernese | 800 sentences | | |
SwissDial (Dogan-Schönberger ea 2021) | topics (14 classes), translations (across dialects and into German), locations (Aargau, Bern, Basel, Graubünden, Luzern, St. Gallen, Wallis, Zürich); the Wallis data are presumably in Walser (wae) | 2.5-4.6 hrs × 7-8 dialects | audio, pronunciation spelling, German ortho | CC BY-NC 4.0 |
Zwirner-Korpus (subset of downloadable subcorpus) (Zwirner & Bethge 1958, IDS: Datenbank für gesprochenes Deutsch (DGD)) | | 10 min / 612 tokens | audio, German ortho | custom terms |
SpinningBytes Swiss German Corpus (SB-CH) (annotated subset) (Grubenmann ea 2018) | sentiment; potential overlap with NOAH's corpus | 2.8k sentences | pronunciation spelling | CC BY 4.0 |
anko Schweizerdeutsch (subset of the Picture postcard corpus) (Sugisaki ea 2023) | discourse-related text spans | 600 postcards | pronunciation spelling | ? |
What's up, Switzerland? (subset) (Stark ea 2014-20, Ueberwasser & Stark 2017) | locations | 507k messages / 3.6M tokens | pronunciation spelling | CC BY-NC-ND |
Swatchgroup Geschäftsbericht (subset) via PaCoCo (Graën ea 2019) | | 79.6k tokens | pronunciation spelling | CC BY-SA |
Schweizerdeutsches Mundartkorpus (CHMK; downloadable subcorpus) (Weibel & Peter 2020) | locations | ? | | CC BY-SA 4.0 |
Text+Berg via PaCoCo (subset) (Bubenhofer ea 2015, Graën ea 2019) | | 156 sentences / 3.1k tokens | | CC BY-SA |
ArchiMob (Scherrer ea 2019) | | 70 hrs | audio, transcription based on the Dieth orthography for Swiss German, German ortho | CC BY-NC-SA 4.0 |
STT4SG-350 (Plüss ea 2023) | locations (7 regions) | 343 hrs | audio, German ortho | META-SHARE NonCommercial NoRedistribution |
SDS-200 (Plüss ea 2022) | | 200 hrs | audio, German ortho | META-SHARE NonCommercial NoRedistribution |
Swiss Parliaments Corpus (Plüss ea 2021a) | | 293 hrs | audio, German ortho | |
All Swiss German Dialects Test Set (Plüss ea 2021b) | locations (cantons, incl. Wallis) | 13 hrs / 5.8k utterances | audio, German ortho | MIT |
Gemeinderat Zürich Audio Corpus (Plüss ea 2021b) | | 1208 hrs | audio | MIT |
Ein geparstes und grammatisch annotiertes Korpus schweizerdeutscher Spontansprachdaten (Schönenberger & Haeberli 2019) (contact authors) | POS (mod. Penn-historical), phrase structure (Penn-historical); location: Wil (SG) | 100k+ tokens | Dieth orthography | |
UDHR-LID (subset) (Kargaran ea 2023, Unicode) | | 59 sentences | ? | CC0 1.0 |
Swiss Crawl (Linder ea 2020) | uncurated | 500k+ sentences | ? | CC BY-NC 4.0 |
SpinningBytes Swiss German Corpus (SB-CH) (Grubenmann ea 2018) | uncurated; contains NOAH's corpus | 116k sentences | | CC BY 4.0 |
SwigSpot (Linder 2018) | uncurated | 8k sentences | ? | Apache 2.0 |
Tatoeba (gsw subset) | translations into other languages | 474 sentences | ? | CC BY 2.0 FR |
Swiss German Web Corpus (Goldhahn ea 2012) | uncurated? | 100k+ sentences | ? | |
OSCAR (subset) (Abadji ea 2022) | uncurated, subset of CommonCrawl | 34k tokens / 233 KB | ? | Metadata/annotations: CC0 1.0, Common Crawl: custom |
CulturaX (subset) (Nguyen ea 2023) | uncurated, subset of mc4 and OSCAR | 6.9k sentences | | see mc4 & OSCAR |
MADLAD-400 (subset) (Kudugunta ea 2023) | uncurated, subset of CommonCrawl; the dataset audit notes issues with the Swiss German subcorpus ⚠ | 1M sentences | | CC BY 4.0 |
Glot500-c (subset) (Imani ea 2023) | partially uncurated, corpus overlap documented in data | 441k sentences | | Apache 2.0 + licenses of source datasets |
Wikipedia (subset of als subset) | uncurated, partially tagged with dialect information | 27k total (including Swabian and Walser), thereof 2.3k (directly or indirectly) tagged as Alsatian, and 1.7k (directly or indirectly) tagged as Swiss German | no defined standard, but a set of recommendations based on published works, the (Swiss German) Dieth orthography and the (Alsatian) Orthal orthography | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |
Walser · wae · wals1238
Corpus | Notes | Size | Representation | License |
---|---|---|---|---|
ArchiWals / CLiMAlp (Angster ea 2017, Gaeta 2020) | locations (Gressoney, Issime, Formazza, Rimella, Alagna) | 80k+ tokens | pronunciation spelling | |
Walliserdeutsch/RRO (Garner 2014, Garner ea 2014) | | 8.3 hrs | audio, non-standardized transcription | custom |
SwissDial (subset) (Dogan-Schönberger ea 2021) | topics (14 classes), translations (into German and 7 Swiss German dialects) | 3.3 hrs | audio, pronunciation spelling, German ortho | CC BY-NC 4.0 |
All Swiss German Dialects Test Set (Plüss ea 2021b) | locations (cantons, incl. Wallis) | unknown | audio, German ortho | MIT |
AlpiLinK (Rabanus ea 2023) (wae subset) | location information | 122 files (42 sentences, up to 3 speakers) | audio, German ortho | CC BY-NC-SA 4.0 |
Wikipedia (subset of als subset) | uncurated | 35 (of 27k total) tagged as Wal(li)ser | no defined standard, but a set of recommendations based on published works, the (Swiss German) Dieth orthography and the (Alsatian) Orthal orthography | text: GFDL, CC BY-SA 3.0; images: CC BY-SA 4.0 |