CORAA ASR - v1.1
CORAA ASR is a publicly available dataset for Automatic Speech Recognition (ASR) in Brazilian Portuguese. It contains 290.77 hours of audio with the corresponding transcriptions (more than 400k audio segments). The audio comes from five original projects:
- ALIP (Gonçalves, 2019)
- C-ORAL Brazil (Raso and Mello, 2012)
- NURC-Recife (Oliveira Jr., 2016)
- SP-2010 (Mendes and Oushiro, 2012)
- TEDx talks (talks in Portuguese)
The audio segments were either validated by annotators or transcribed for the first time, with the ASR task in mind.
Metadata
- file_path: the path to an audio file
- task: transcription (annotators revised original transcriptions); annotation (annotators classified the audio-transcription pair according to votes_for_* metrics); annotation_and_transcription (both tasks were performed)
- variety: European Portuguese (PT_PT) or Brazilian Portuguese (PT_BR)
- dataset: one of five datasets (ALIP, C-oral Brasil, NURC-RE, SP2010, TEDx Portuguese)
- accent: one of four accents (Minas Gerais, Recife, Sao Paulo cities, Sao Paulo capital) or the value "miscellaneous"
- speech_genre: Interviews, Dialogues, Monologues, Conversations, Conference, Class Talks, Stage Talks or Reading
- speech_style: Spontaneous Speech or Prepared Speech or Read Speech
- up_votes: for annotation, the number of votes marking the audio as valid (most audios were reviewed by one annotator, but some were analyzed by more than one)
- down_votes: for annotation, the number of votes marking the audio as invalid (always smaller than up_votes)
- votes_for_hesitation: for annotation, votes categorizing the audio as having the hesitation phenomenon
- votes_for_filled_pause: for annotation, votes categorizing the audio as having the filled pause phenomenon
- votes_for_noise_or_low_voice: for annotation, votes categorizing the audio as having either noise or a low voice, without impairing comprehension of the audio
- votes_for_second_voice: for annotation, votes categorizing the audio as having a second voice, without impairing comprehension of the audio
- votes_for_no_identified_problem: for annotation, votes categorizing the audio as having no identified phenomenon (of the four described above)
- text: the transcription for the audio
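The fields above can be used to select a subset of the corpus. The following is a minimal sketch, assuming the metadata is available as a comma-separated file whose header uses the field names listed above and whose vote columns hold integers. The sample rows, the helper names `load_rows` and `select_clean`, and the file paths are hypothetical and only illustrate the idea; they are not taken from the corpus.

```python
import csv
import io

# Hypothetical sample rows for illustration only (not real corpus entries).
# Assumption: the real metadata file has a header with these field names.
SAMPLE_CSV = """file_path,task,variety,dataset,accent,speech_genre,speech_style,up_votes,down_votes,votes_for_hesitation,votes_for_filled_pause,votes_for_noise_or_low_voice,votes_for_second_voice,votes_for_no_identified_problem,text
example/a.wav,annotation,PT_BR,ALIP,Sao Paulo cities,Interviews,Spontaneous Speech,2,0,0,1,0,0,1,texto de exemplo um
example/b.wav,annotation,PT_BR,NURC-RE,Recife,Dialogues,Spontaneous Speech,1,0,0,0,1,0,0,texto de exemplo dois
example/c.wav,transcription,PT_PT,TEDx Portuguese,miscellaneous,Stage Talks,Prepared Speech,0,0,0,0,0,0,0,texto de exemplo tres
"""

# Columns that hold vote counts (assumed to be integers).
VOTE_COLUMNS = [
    "up_votes",
    "down_votes",
    "votes_for_hesitation",
    "votes_for_filled_pause",
    "votes_for_noise_or_low_voice",
    "votes_for_second_voice",
    "votes_for_no_identified_problem",
]


def load_rows(handle):
    """Read metadata rows from a file-like object, converting vote counts to int."""
    rows = []
    for row in csv.DictReader(handle):
        for col in VOTE_COLUMNS:
            row[col] = int(row[col])
        rows.append(row)
    return rows


def select_clean(rows, variety="PT_BR"):
    """Keep rows of one variety that received no noise and no second-voice votes."""
    return [
        r
        for r in rows
        if r["variety"] == variety
        and r["votes_for_noise_or_low_voice"] == 0
        and r["votes_for_second_voice"] == 0
    ]


rows = load_rows(io.StringIO(SAMPLE_CSV))
clean = select_clean(rows)
print([r["file_path"] for r in clean])  # prints ['example/a.wav']
```

To apply the same filter to a real metadata file, `io.StringIO(SAMPLE_CSV)` would be replaced with an open file handle; the actual file name and delimiter should be checked against the downloaded data.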
Downloads
Dataset:
The following link contains RAW audios (without annotation), kept separate from the other audios available for download: https://zenodo.org/record/6794924#.YsXWMEjMJkg
Experiments:
Model trained on this corpus: Wav2Vec 2.0 XLSR-53 (multilingual pretraining)
Citation
- Full Paper:
@article{candido2022coraa,
title={CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese},
author={Candido Junior, Arnaldo and Casanova, Edresson and Soares, Anderson and de Oliveira, Frederico Santos and Oliveira, Lucas and Junior, Ricardo Corso Fernandes and da Silva, Daniel Peixoto Pinto and Fayet, Fernando Gorgulho and Carlotto, Bruno Baldissera and Gris, Lucas Rafael Stefanel and others},
journal={Language Resources and Evaluation},
pages={1--33},
year={2022},
publisher={Springer}
}
- Official site: Tarsila Project
Partners / Sponsors / Funding
References
- Gonçalves SCL (2019) Projeto ALIP (amostra linguística do interior paulista) e banco de dados iboruna: 10 anos de contribuição com a descrição do Português Brasileiro. Revista Estudos Linguísticos 48(1):276–297.
- Raso T, Mello H, Mittmann MM (2012) The C-ORAL-BRASIL I: Reference corpus for spoken Brazilian Portuguese. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), European Language Resources Association (ELRA), Istanbul, Turkey, pp 106–113, URL http://www.lrec-conf.org/proceedings/lrec2012/pdf/624_Paper.pdf
- Oliveira Jr M (2016) Nurc digital um protocolo para a digitalização, anotação, arquivamento e disseminação do material do projeto da norma urbana linguística culta (NURC). CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Linguísticos 3(2):149–174, URL https://revistas.uam.es/chimera/article/view/6519
- Mendes RB, Oushiro L (2012) Mapping Paulistano Portuguese: the SP2010 Project. In: Proceedings of the VIIth GSCP International Conference: Speech and Corpora, Firenze University Press, Firenze, Italy, pp 459–463.