💎 Open Speech Corpora

A list of open speech corpora for Speech Technology research and development.

This list favors corpora that are free (i.e. no $ cost) and truly open (e.g. released under a Creative Commons license or a Community Data License Agreement). Some of the corpora listed may not meet both criteria, but all of them are accessible and usable for research and/or commercial use.

Feel free to propose additions to the list!

There's a long backlog of corpora to be added in the Issues, and Pull Requests are very welcome :)
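The sections below group corpora by license, and the Creative Commons labels encode what each license permits. As a rough illustration only, the following Python sketch maps a label such as `CC-BY-NC-SA 3.0` to the conditions it implies. The function name `cc_terms` and the returned dictionary keys are hypothetical and not part of this list; always read the actual license text before using a corpus.

```python
def cc_terms(license_name: str) -> dict:
    """Map a Creative Commons license label to the conditions it implies."""
    # Keep only the code itself, e.g. "CC-BY-NC-SA" from "CC-BY-NC-SA 3.0".
    code = license_name.strip().upper().split()[0]
    if code in ("CC-0", "CC0"):
        # Public-domain dedication: no conditions attached.
        return {"attribution": False, "commercial_use": True,
                "derivatives": True, "share_alike": False}
    parts = code.split("-")
    if parts[:2] != ["CC", "BY"]:
        raise ValueError(f"not a Creative Commons label: {license_name!r}")
    return {
        "attribution": True,                # BY: credit must be given
        "commercial_use": "NC" not in parts,  # NC: non-commercial only
        "derivatives": "ND" not in parts,     # ND: no derivative works
        "share_alike": "SA" in parts,         # SA: derivatives keep the license
    }


if __name__ == "__main__":
    for label in ("CC-0", "CC-BY 4.0", "CC-BY-NC-ND 3.0"):
        print(label, cc_terms(label))
```

The sketch only covers the Creative Commons sections; labels such as `Apache 2.0` or the custom licenses raise `ValueError` and need to be checked by hand.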

📜 CC-0

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Common Voice | Multilingual | >15,000 hours (validated); >20,000 hours (total) | Multi-speaker | https://voice.mozilla.org/en/datasets | CC-0 |
| Yesno | Hebrew | 6 mins | one male | http://www.openslr.org/1/ | CC-0 |
| LJ Speech Corpus | English | ~24 hours | one female | https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 | CC-0 |
| NST Danish ASR Database | Danish | 229,992 utterances | 616 speakers | original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-19/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-55/ | CC-0 |
| NST Danish Dictation | Danish | 34,955 utterances | 151 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-20/ | CC-0 |
| NST Danish Speech Synthesis | Danish | 4,108 utterances | 1 male speaker | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-21/ | CC-0 |
| NST Swedish ASR Database | Swedish | 366,000 utterances | 1,000 speakers | original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-16/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-56/ | CC-0 |
| NST Swedish Dictation | Swedish | 45,620 utterances | 195 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-17/ | CC-0 |
| NST Swedish Speech Synthesis | Swedish | 5,279 utterances | 1 male speaker | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-18/ | CC-0 |
| NST Norwegian ASR Database | Norwegian | 359,760 utterances | 980 speakers | original: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-13/, reorganized: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-54/ | CC-0 |
| NST Norwegian Dictation | Norwegian | 33,360 utterances | 144 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-14/ | CC-0 |
| NST Norwegian Speech Synthesis | Norwegian | 5,363 utterances | 1 male speaker | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-15/ | CC-0 |
| NB Tale – Speech Database for Norwegian | Norwegian | 7,600 utterances + ~12 hours | 380 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-31/ | CC-0 |
| Norwegian Parliamentary Speech Corpus (v0.1) | Norwegian | ~59 hours | 203 speakers | https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-58/ | CC-0 |
| Wikimedia Commons Odia | Odia | ~8 hours | ~20 speakers | https://commons.wikimedia.org/wiki/Category:Odia_pronunciation | mostly(?) CC-0 |
| Thorsten-21.02-neutral | German | ~24 hours | 1 male speaker | https://www.Thorsten-Voice.de | CC-0 |
| Thorsten-21.06-emotional | German | 2,400 utterances (8 emotions) | 1 male speaker | https://www.Thorsten-Voice.de | CC-0 |
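Several DOWNLOAD entries in this list point directly at a tar archive, for example the LJ Speech Corpus URL above. As a minimal sketch of fetching and unpacking such an archive using only the Python standard library: the helper names `archive_name` and `fetch_and_extract` and the `corpora` target directory are assumptions made for illustration. Entries that link to a landing page rather than an archive will not work with this approach.

```python
import tarfile
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def archive_name(url: str) -> str:
    """Return the file name at the end of a download URL."""
    return PurePosixPath(urlparse(url).path).name


def fetch_and_extract(url: str, dest: str = "corpora") -> Path:
    """Download a tar archive (skipped if already present) and unpack it."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / archive_name(url)
    if not archive.exists():
        # Large corpora can take a long time; this call blocks until done.
        urllib.request.urlretrieve(url, str(archive))
    # The default read mode detects gzip/bzip2/xz compression automatically.
    # Only extract archives from sources you trust.
    with tarfile.open(archive) as tar:
        tar.extractall(dest_dir)
    return dest_dir


if __name__ == "__main__":
    # URL taken from the LJ Speech Corpus row; the download is several GB.
    fetch_and_extract("https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2")
```

Zip archives, such as some of the other entries, would need `zipfile` instead of `tarfile`.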

📜 CC-BY

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| ARU Speech Corpus | English (UK) | 720 utterances / speaker | 12 (6 female; 6 male) | http://datacat.liverpool.ac.uk/681/1/ARU_Speech_Corpus_v1_0.zip | CC-BY 3.0 |
| Althingi Parliamentary Speech Corpus | Icelandic | 542 hours and 25 minutes | 196 speakers | http://www.malfong.is/index.php?dlid=73&lang=en | CC-BY 4.0 |
| Alþingisumræður Parliamentary Speech Corpus | Icelandic | ~21 hours | | http://www.malfong.is/index.php?dlid=8&lang=en | CC-BY 3.0 |
| Hjal Corpus | Icelandic | ~41,000 recordings | 883 speakers | http://www.malfong.is/index.php?dlid=5&lang=en | CC-BY 3.0 |
| The Malromur Corpus | Icelandic | 152 hours | 563 speakers | http://www.malfong.is/index.php?dlid=65&lang=en | CC-BY 4.0 |
| Telecooperation German Corpus for Kinect | German | ~35 hours | ~180 speakers | http://www.repository.voxforge1.org/downloads/de/german-speechdata-TUDa-2015.tar.gz | CC-BY 2.0 |
| African Speech Technology English-English Speech Corpus | English | ~21 hours | | https://repo.sadilar.org/handle/20.500.12185/283 | CC-BY 2.5 South Africa |
| African Speech Technology isiXhosa Speech Corpus | isiXhosa | ~26 hours | | https://repo.sadilar.org/handle/20.500.12185/305 | CC-BY 2.5 South Africa |
| NCHLT Afrikaans | Afrikaans | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/280 | CC-BY 3.0 |
| NCHLT English | English | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/274 | CC-BY 3.0 |
| NCHLT isiNdebele | isiNdebele | 56 hours | 148 speakers (78 female / 70 male) | https://repo.sadilar.org/handle/20.500.12185/272 | CC-BY 3.0 |
| NCHLT isiXhosa | isiXhosa | 56 hours | 209 speakers (106 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/279 | CC-BY 3.0 |
| NCHLT isiZulu | isiZulu | 56 hours | 210 speakers (98 female / 112 male) | https://repo.sadilar.org/handle/20.500.12185/275 | CC-BY 3.0 |
| NCHLT Sepedi | Sepedi | 56 hours | 210 speakers (100 female / 110 male) | https://repo.sadilar.org/handle/20.500.12185/270 | CC-BY 3.0 |
| NCHLT Sesotho | Sesotho | 56 hours | 210 speakers (113 female / 97 male) | https://repo.sadilar.org/handle/20.500.12185/278 | CC-BY 3.0 |
| NCHLT Setswana | Setswana | 56 hours | 210 speakers (109 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/281 | CC-BY 3.0 |
| NCHLT Siswati | Siswati | 56 hours | 197 speakers (96 female / 101 male) | https://repo.sadilar.org/handle/20.500.12185/271 | CC-BY 3.0 |
| NCHLT Tshivenda | Tshivenda | 56 hours | 208 speakers (83 female / 125 male) | https://repo.sadilar.org/handle/20.500.12185/276 | CC-BY 3.0 |
| NCHLT Xitsonga | Xitsonga | 56 hours | 198 speakers (95 female / 103 male) | https://repo.sadilar.org/handle/20.500.12185/277 | CC-BY 3.0 |
| Lwazi II Cross-lingual Proper Name Corpus | Afrikaans; English; isiZulu; Sesotho | 2 hours 5 mins | 20 speakers | https://repo.sadilar.org/handle/20.500.12185/445 | CC-BY 3.0 |
| Lwazi II Proper Name Call Routing Telephone Corpus | English | 2 hours 7 mins | | https://repo.sadilar.org/handle/20.500.12185/448 | CC-BY 3.0 |
| Lwazi II Afrikaans Trajectory Tracking Corpus | Afrikaans | 4 hours | one male | https://repo.sadilar.org/handle/20.500.12185/442 | CC-BY 3.0 |
| LibriSpeech | English | ~1000 hours | 2484 speakers (1201 female / 1283 male) | http://www.openslr.org/12/ | CC-BY 4.0 |
| Zeroth-Korean | Korean | 52.8 hours | 115 speakers | http://www.openslr.org/40/ | CC-BY 4.0 |
| Speech Commands | English | 17.8 hours | >1,000 speakers | https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html | CC-BY 4.0 |
| ParlamentParla | Catalan | 320 hours | | https://www.openslr.org/59/ | CC-BY 4.0 |
| SIWIS | French | ~10 hours | one female | http://datashare.is.ed.ac.uk/download/DS_10283_2353.zip | CC-BY 4.0 |
| VCTK | English | 44 hours | 109 speakers | http://datashare.is.ed.ac.uk/download/DS_10283_3443.zip | CC-BY 4.0 |
| LibriTTS | English | 586 hours | 2,456 speakers (1,185 female / 1,271 male) | http://www.openslr.org/60/ | CC-BY 4.0 |
| Augmented LibriSpeech | Audio (English); Text (English, French) | 236 hours | | https://persyval-platform.univ-grenoble-alpes.fr/datasets/DS91 | CC-BY 4.0 |
| Helsinki Prosody Corpus | English | 262.5 hours | 1,230 speakers | https://github.com/Helsinki-NLP/prosody | CC-BY 4.0 |
| Tuva Speech Database | Norwegian | 24 hours | 40 speakers | https://www.nb.no/sprakbanken/show?serial=oai:nb.no:sbr-44&lang= | CC-BY 4.0 |
| COERLL Kʼicheʼ corpus | Kʼicheʼ | 34 minutes | ? speakers | https://cl.indiana.edu/~ftyers/resources/utexas-kiche-audio.tar.gz | CC-BY 4.0 |
| Timers and Such v0.1 | English (synthetic: US, real: various nationalities) | synthetic: 172 hours, real: 0.29 hours | 21 synthetic, 11 real | https://zenodo.org/record/4110812#.X9j0RmBOkYM | CC-BY 4.0 |
| Large Corpus of Czech Parliament Plenary Hearings | Czech | 444 hours | | https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3126 | CC-BY 4.0 |

📜 CC-BY-SA

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Iban | Iban | 8 hours | | http://www.openslr.org/24/ https://github.com/sarahjuan/iban | CC-BY-SA 2.0 |
| Vystadial 2013 | English; Czech | 41 hours; 15 hours | | http://www.openslr.org/6/ | CC-BY-SA 3.0 US |
| Vystadial 2016 Czech | Czech | 77 hours; includes Vystadial 2013 Czech | | https://lindat.cz/repository/xmlui/handle/11234/1-1740 | CC-BY-SA 4.0 |
| Free Spoken Digit Dataset | English | 2,000 isolated digits | 4 speakers | https://github.com/Jakobovski/free-spoken-digit-dataset | CC-BY-SA 4.0 |
| Google Javanese | Javanese | 296 hours | 1019 speakers | http://www.openslr.org/35/ | CC-BY-SA 4.0 |
| Google Nepali | Nepali | 165 hours | 527 speakers | http://www.openslr.org/54/ | CC-BY-SA 4.0 |
| Google Bengali | Bengali | 229 hours | 508 speakers | http://www.openslr.org/53/ | CC-BY-SA 4.0 |
| Google Sinhala | Sinhala | 224 hours | 478 speakers | http://www.openslr.org/52/ | CC-BY-SA 4.0 |
| Google Sundanese | Sundanese | 333 hours | 542 speakers | http://www.openslr.org/36/ | CC-BY-SA 4.0 |
| Spoken Wikipedia Corpus (SWC-2017) | English; German; Dutch | 182 hours; 249 hours; 79 hours | 395 speakers; 339 speakers; 145 speakers | https://nats.gitlab.io/swc/ | CC-BY-SA 4.0 |
| Chuvash TTS | Chuvash | 4 hours | 1 speaker | https://github.com/ftyers/Turkic_TTS | CC-BY-SA 4.0 |
| Forschergeist | German | 2 hours | 2 speakers (1 female; 1 male) | female speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/annettevogt-20180320-rec.tgz; male speaker: https://goofy.zamia.org/zamia-speech/corpora/forschergeist/timpritlove-20180320-rec.tgz | CC-BY-SA 4.0 |
| Malayalam Speech Corpus by SMC | Malayalam | 1:36 hours | 75 speakers (3 female, 12 male, 60 unidentified) | https://releases.smc.org.in/msc-reviewed-speech/ | CC-BY-SA 4.0 |
| Google Malayalam | Malayalam | 3.02 hours | 24 speakers | http://www.openslr.org/63/ | CC-BY-SA 4.0 |

📜 CC-BY-ND

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| IBM Recorded Debates v1 | English | 5 hours | 10 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |
| IBM Recorded Debates v2 | English | ~14 hours | 14 speakers | https://www.research.ibm.com/haifa/dept/vst/debating_data.shtml#Debate%20Speech%20Analysis | CC-BY-ND |

📜 CC-BY-NC

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| TV3Parla | Catalan | 240 hours | | http://laklak.eu/share/tv3_0.3.tar.gz | CC-BY-NC 4.0 |
| Russian Open STT Corpus | Russian | ~10,000 hours public, ~10,000 more upon request | | https://github.com/snakers4/open_stt/#links | CC-BY-NC 4.0 with some exceptions |
| Russian Open TTS Corpus | Russian | 145 hours | 3 males | https://github.com/snakers4/open_tts/#links | CC-BY-NC 4.0 with some exceptions |
| OVM – Otázky Václava Moravce | Czech | 35 hours | | https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-000D-EC98-3 | CC-BY-NC 3.0 |

📜 CC-BY-NC-SA

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| CHiME-Home | English | 6.8 hours | | https://archive.org/details/chime-home | CC-BY-NC-SA 3.0 |
| Cameroon Pidgin English Corpus | Cameroon Pidgin English | ~17 hours | | http://ota.ox.ac.uk/text/2563.zip | CC-BY-NC-SA 3.0 |

📜 CC-BY-NC-ND

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Tatoeba-Eng | English | ~250 hours (rough estimate) | 6 speakers | https://voice.mozilla.org/en/datasets | CC-BY-NC 4.0 (some audio) / CC-BY-NC-ND 3.0 (most audio) / CC-BY 2.0 (all text) |
| TED-LIUM | English | 118 hours | 685 speakers (36h female / 81h male) | http://www.openslr.org/7/ | CC-BY-NC-ND 3.0 |
| TED-LIUM-2 | English | 207 hours | 1242 speakers (66h female / 141h male) | http://www.openslr.org/19/ | CC-BY-NC-ND 3.0 |
| TED-LIUM-3 | English | 452 hours | 2028 speakers (134h female / 316h male) | http://www.openslr.org/51/ | CC-BY-NC-ND 3.0 |
| Pansori TEDxKR | Korean | 3 hours | 41 speakers | http://www.openslr.org/58/ | CC-BY-NC-ND 4.0 |
| Primewords Mandarin | Mandarin | 100 hours | 296 speakers | http://www.openslr.org/47/ | CC-BY-NC-ND 4.0 |
| MuST-C v1.0 | Audio (English); Text (Dutch, French, German, Italian, Portuguese, Romanian, Russian, Spanish) | 408, 504, 492, 465, 442, 385, 432, 489 hours per language pair | | https://ict.fbk.eu/must-c-release-v1-0/ | CC-BY-NC-ND 4.0 |
| Czech Parliament Meetings | Czech | 88 hours | | https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0005-CF9C-4 | CC-BY-NC-ND 3.0 |
| BembaSpeech | Bemba | 24 hours | 17 speakers (9 male / 8 female) | https://github.com/csikasote/BembaSpeech | CC-BY-NC-ND 4.0 |

📜 CDLA-Permissive

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| DiPCo | English | ~5 hours | 32 speakers (13 female; 19 male) | https://s3.amazonaws.com/dipco/DiPCo.tgz | CDLA-Permissive-1.0 |

📜 GNU General Public License

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| VoxForge | English | ~120 hours | ~2966 speakers | http://www.repository.voxforge1.org/downloads/en/Trunk/Audio/Main/16kHz_16bit/ https://voice.mozilla.org/en/datasets | GNU-GPL 3.0 |
| VoxForge | Russian | | | http://www.repository.voxforge1.org/downloads/ru/Trunk/Audio/Main/16kHz_16bit/ http://www.repository.voxforge1.org/downloads/Russian/Trunk/Audio/Main/16kHz_16bit/ | GNU-GPL 3.0 |
| VoxForge | German | | | http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/ | GNU-GPL 3.0 |

📜 Apache License

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| AISHELL-1 | Mandarin | 170 hours | 400 speakers | http://www.openslr.org/33/ | Apache 2.0 |
| Tunisian_MSA | Modern Standard Arabic (Tunisia) | 11.2 hours | 118 speakers | http://www.openslr.org/46/ | Apache 2.0 |
| African Accented French | French | 22 hours | 232 speakers | http://www.openslr.org/57/ | Apache 2.0 |
| THCHS-30 | Mandarin Chinese | 33.57 hours (13,389 utterances) | 40 speakers (31 female; 9 male) | http://www.openslr.org/18/ | Apache 2.0 |
| Living Audio Dataset - Dutch | Dutch | 57:49 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
| Living Audio Dataset - English | English | 50:50 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
| Living Audio Dataset - Irish | Irish | 61:56 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |
| Living Audio Dataset - Russian | Russian | 34:58 min | 1 speaker | https://github.com/Idlak/Living-Audio-Dataset | Apache 2.0 |

📜 MIT License

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| ALFFA | Amharic; Hausa (paid); Swahili; Wolof | | | http://www.openslr.org/25/ https://github.com/besacier/ALFFA_PUBLIC | MIT |

📜 BSD 3-Clause License

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| M-AILABS German Corpus | German | 237 hours and 22 minutes | | http://www.caito.de/data/Training/stt_tts/de_DE.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Queen's English Corpus | Queen's English | 45 hours and 35 minutes | | http://www.caito.de/data/Training/stt_tts/en_UK.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS US English Corpus | American English | 102 hours and 7 minutes | | http://www.caito.de/data/Training/stt_tts/en_US.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Spanish Corpus | Spanish Spanish | 108 hours and 34 minutes | | http://www.caito.de/data/Training/stt_tts/es_ES.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Italian Corpus | Italian | 127 hours and 40 minutes | | http://www.caito.de/data/Training/stt_tts/it_IT.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Ukrainian Corpus | Ukrainian | 87 hours and 8 minutes | | http://www.caito.de/data/Training/stt_tts/uk_UK.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Russian Corpus | Russian | 46 hours and 47 minutes | | http://www.caito.de/data/Training/stt_tts/ru_RU.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS French-v0.9 Corpus | French | 190 hours and 30 minutes | | http://www.caito.de/data/Training/stt_tts/fr_FR.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |
| M-AILABS Polish Corpus | Polish | 53 hours and 50 minutes | | http://www.caito.de/data/Training/stt_tts/pl_PL.tgz | M-AILABS LICENSE (a data-specific BSD 3-Clause License) |

📜 Custom License

| CORPUS | LANGUAGES | # HOURS | # SPEAKERS | DOWNLOAD | LICENSE |
| --- | --- | --- | --- | --- | --- |
| Fluent Speech Commands Corpus | English | 19 hours (30,043 utterances) | 97 speakers | http://fluent.ai:2052/jf8398hf30f0381738rucj3828chfdnchs.tar.gz | Fluent Speech Commands Public License |
| CMU Wilderness | 700 languages | Alignments distributed without audio or text; total: ~14,000 hours; per language: ~20 hours | | https://github.com/festvox/datasets-CMU_Wilderness | https://live.bible.is/terms |
| CHiME-5 | English | 50 hours | 48 speakers | http://spandh.dcs.shef.ac.uk/chime_challenge/data.html | CHiME-5 License |
| Fearless Steps Corpus | English | 19,000 hours (20 hours transcribed) | ~450 speakers | https://fearless-steps.github.io/ChallengePhase3/#19k_Corpus_Access | NASA Media Usage Guidelines |
| Microsoft Speech Corpus (Indian languages) | Telugu; Tamil; Gujarati | | | https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e | Microsoft Speech Corpus (Indian Languages) License |
| Microsoft Speech Language Translation Corpus | English; Chinese; Japanese | | | https://msropendata.com/datasets/54813518-4ea6-4c39-9bb2-b0d1e5f0c187 | Microsoft Research Data License Agreement |
| Hey Snips Corpus | English | 11K positive "Hey Snips" (~4.4 hours) and 87K negative (~89 hours) utterances | 2215 speakers (positive & negative) and 4028 speakers (negative only) | https://research.snips.ai/datasets/keyword-spotting | Snips Data License |
| Snips SLU Corpus | English; French | 1660 "Smart Lights EN" (~1.3 hours), 1286 "Smart Speaker EN" (~55 minutes), 1138 "Smart Speaker FR" (~50 minutes) utterances | English: 69 speakers; French: 30 speakers | https://research.snips.ai/datasets/spoken-language-understanding | Snips Data License |
| CMU Sphinx Group - AN4 | English | "an4_clstk" (~50 minutes); "an4test_clstk" (~6 minutes) | "an4_clstk": 21 female, 53 male; "an4test_clstk": 3 female, 7 male | http://www.speech.cs.cmu.edu/databases/an4/an4_raw.bigendian.tar.gz | AN4 |
| FT Speech | Danish | ~1,857 hours (1,017,244 utterances) | 434 speakers (176 female, 258 male) | https://ftspeech.dk | FT Speech License |
| FalaBrasil-LAPS-Constituicao | Brazilian-Portuguese | 9 hours | 1 speaker | https://drive.google.com/uc?export=download&confirm=SrvW&id=1Nf849u-27CYRzJqedLaI-FaZfMRO7FT | "Transcribed audio datasets and normalized text datasets (without punctuation, with numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available." |
| FalaBrasil-LaPSMail | Brazilian-Portuguese | 1 hour | 25 speakers | https://drive.google.com/uc?export=download&confirm=PecV&id=1B_Vq8MDSE4fBQefVxqCGSl-EcKAcjJLb | "Transcribed audio datasets and normalized text datasets (without punctuation, with numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available." |
| FalaBrasil-LaPS Benchmark | Brazilian-Portuguese | 1 hour | 1 speaker | https://drive.google.com/uc?export=download&confirm=XFfF&id=1nZ8L9nJTt4blFC0RGT9Y7XRu02aAvDIo | "Transcribed audio datasets and normalized text datasets (without punctuation, with numbers written out in full, etc.) made available free of charge* by the FalaBrasil Group. [made available free of charge*] / Therefore, only the free datasets are being made available." |