CORAA ASR - v1.1

CORAA ASR is a publicly available dataset for Automatic Speech Recognition (ASR) in Brazilian Portuguese. It contains 290.77 hours of audio with the corresponding transcriptions (400k+ segmented audios). The audios come from five source projects:

ALIP (Gonçalves, 2019)
C-ORAL Brazil (Raso and Mello, 2012)
NURC-Recife (Oliveira Jr., 2016)
SP-2010 (Mendes and Oushiro, 2012)
TEDx talks (talks in Portuguese)

The audios were either validated by annotators or transcribed for the first time, with the ASR task in mind.

Metadata

file_path: the path to an audio file
task: transcription (annotators revised original transcriptions); annotation (annotators classified the audio-transcription pair according to votes_for_* metrics); annotation_and_transcription (both tasks were performed)
variety: European Portuguese (PT_PT) or Brazilian Portuguese (PT_BR)
dataset: one of five datasets (ALIP, C-oral Brasil, NURC-RE, SP2010, TEDx Portuguese)
accent: one of four accents (Minas Gerais, Recife, Sao Paulo cities, Sao Paulo capital) or the value "miscellaneous"
speech_genre: Interviews, Dialogues, Monologues, Conversations, Conference, Class Talks, Stage Talks or Reading
speech_style: Spontaneous Speech or Prepared Speech or Read Speech
up_votes: for annotation, the number of votes validating the audio (most audios were reviewed by one annotator, but some were analyzed by more than one)
down_votes: for annotation, the number of votes invalidating the audio (always smaller than up_votes)
votes_for_hesitation: for annotation, votes categorizing the audio as having the hesitation phenomenon
votes_for_filled_pause: for annotation, votes categorizing the audio as having the filled pause phenomenon
votes_for_noise_or_low_voice: for annotation, votes categorizing the audio as having either noise or low voice, without impairing comprehension of the audio
votes_for_second_voice: for annotation, votes categorizing the audio as having a second voice, without impairing comprehension of the audio
votes_for_no_identified_problem: for annotation, votes categorizing the audio as having none of the four phenomena described above
text: the transcription for the audio
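
As an illustration of how these fields can be combined, the sketch below selects Brazilian Portuguese segments that received no down votes and no noise or second-voice votes. It assumes the metadata is a comma-separated file whose header uses the field names listed above; the actual file layout may differ. The sample rows, the file paths in them, and the helper names `load_rows` and `clean_brazilian_subset` are invented for this example and are not part of the dataset.

```python
import csv
import io

# Invented sample rows standing in for a real metadata file (assumed CSV layout).
SAMPLE_CSV = """\
file_path,task,variety,dataset,accent,speech_genre,speech_style,up_votes,down_votes,votes_for_hesitation,votes_for_filled_pause,votes_for_noise_or_low_voice,votes_for_second_voice,votes_for_no_identified_problem,text
train/a.wav,annotation,PT_BR,ALIP,Sao Paulo cities,Interviews,Spontaneous Speech,2,0,0,0,0,0,2,exemplo um
train/b.wav,annotation,PT_BR,NURC-RE,Recife,Dialogues,Spontaneous Speech,1,0,1,0,1,0,0,exemplo dois
train/c.wav,transcription,PT_PT,TEDx Portuguese,miscellaneous,Stage Talks,Prepared Speech,0,0,0,0,0,0,0,exemplo tres
train/d.wav,annotation_and_transcription,PT_BR,SP2010,Sao Paulo capital,Interviews,Spontaneous Speech,3,1,0,1,0,0,2,exemplo quatro
"""

VOTE_COLUMNS = [
    "up_votes",
    "down_votes",
    "votes_for_hesitation",
    "votes_for_filled_pause",
    "votes_for_noise_or_low_voice",
    "votes_for_second_voice",
    "votes_for_no_identified_problem",
]


def load_rows(handle):
    """Read metadata rows and convert the vote columns to integers."""
    rows = []
    for row in csv.DictReader(handle):
        for col in VOTE_COLUMNS:
            # Treat an empty cell as zero votes (an assumption of this sketch).
            row[col] = int(row[col] or 0)
        rows.append(row)
    return rows


def clean_brazilian_subset(rows):
    """Keep PT_BR rows with no down votes and no noise/second-voice votes."""
    return [
        r for r in rows
        if r["variety"] == "PT_BR"
        and r["down_votes"] == 0
        and r["votes_for_noise_or_low_voice"] == 0
        and r["votes_for_second_voice"] == 0
    ]


rows = load_rows(io.StringIO(SAMPLE_CSV))
print([r["file_path"] for r in clean_brazilian_subset(rows)])
```

To run this against a downloaded metadata file, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle and adjust the delimiter or column names if the file differs from the layout assumed here.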

Downloads

Dataset:

Gdrive	Internal	Hugging Face
Train audios	Train audios	Train audios
Train transcriptions and metadata	Train transcriptions and metadata	Train transcriptions and metadata
Dev audios	Dev audios	Dev audios
Dev transcriptions and metadata	Dev transcriptions and metadata	Dev transcriptions and metadata
Test audios	Test audios	Test audios
Test transcriptions and metadata	Test transcriptions and metadata	Test transcriptions and metadata

The following link contains RAW audios (without annotation), separate from the other audios available for download: https://zenodo.org/record/6794924#.YsXWMEjMJkg

Experiments:

Model trained in this corpus: Wav2Vec 2.0 XLSR-53 (multilingual pretraining)

Citation

Full Paper:

@article{candido2022coraa,
  title={CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese},
  author={Candido Junior, Arnaldo and Casanova, Edresson and Soares, Anderson and de Oliveira, Frederico Santos and Oliveira, Lucas and Junior, Ricardo Corso Fernandes and da Silva, Daniel Peixoto Pinto and Fayet, Fernando Gorgulho and Carlotto, Bruno Baldissera and Gris, Lucas Rafael Stefanel and others},
  journal={Language Resources and Evaluation},
  pages={1--33},
  year={2022},
  publisher={Springer}
}

Official site: Tarsila Project

Partners / Sponsors / Funding

References

Gonçalves SCL (2019) Projeto ALIP (amostra linguística do interior paulista) e banco de dados iboruna: 10 anos de contribuição com a descrição do Português Brasileiro. Revista Estudos Linguísticos 48(1):276–297.
Raso T, Mello H, Mittmann MM (2012) The C-ORAL-BRASIL I: Reference corpus for spoken Brazilian Portuguese. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), European Language Resources Association (ELRA), Istanbul, Turkey, pp 106–113, URL http://www.lrec-conf.org/proceedings/lrec2012/pdf/624_Paper.pdf
Oliveira Jr M (2016) Nurc digital um protocolo para a digitalização, anotação, arquivamento e disseminação do material do projeto da norma urbana linguística culta (NURC). CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Linguísticos 3(2):149–174, URL https://revistas.uam.es/chimera/article/view/6519
Mendes RB, Oushiro L (2012) Mapping Paulistano Portuguese: the SP2010 Project. In: Proceedings of the VIIth GSCP International Conference: Speech and Corpora, Firenze University Press, Firenze, Italy, pp 459–463.