CORAA ASR - v1.1

CORAA ASR is a publicly available dataset for Automatic Speech Recognition (ASR) in Brazilian Portuguese. It contains 290.77 hours of audio with the corresponding transcriptions (more than 400k segmented audio files). The dataset is composed of audio from 5 original projects.

The audio was either validated by annotators or transcribed for the first time with the ASR task in mind.
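For a rough sense of scale, the figures above imply a mean segment length of about 2.6 seconds. The sketch below works through that arithmetic; it assumes exactly 400,000 segments, whereas the source only says "400k+", so the true mean is somewhat lower. The function name `mean_segment_seconds` is a hypothetical helper introduced for illustration.

```python
def mean_segment_seconds(total_hours: float, n_segments: int) -> float:
    """Return the mean segment duration in seconds."""
    if n_segments <= 0:
        raise ValueError("n_segments must be positive")
    return total_hours * 3600.0 / n_segments


# 290.77 hours is the figure stated above.
# 400_000 is an assumed lower bound on the segment count ("400k+").
TOTAL_HOURS = 290.77
ASSUMED_SEGMENTS = 400_000

if __name__ == "__main__":
    mean_s = mean_segment_seconds(TOTAL_HOURS, ASSUMED_SEGMENTS)
    print(f"Upper bound on mean segment length: {mean_s:.2f} s")
```

Because the segment count is a lower bound, the result is an upper bound on the mean, not an exact statistic of the corpus.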

Metadata

Downloads:

Dataset:

Gdrive | Internal | Hugging Face
Train audios | Train audios | Train audios
Train transcriptions and metadata | Train transcriptions and metadata | Train transcriptions and metadata
Dev audios | Dev audios | Dev audios
Dev transcriptions and metadata | Dev transcriptions and metadata | Dev transcriptions and metadata
Test audios | Test audios | Test audios
Test transcriptions and metadata | Test transcriptions and metadata | Test transcriptions and metadata

The following link contains raw audio (without annotation), kept separate from the other audio available for download: https://zenodo.org/record/6794924#.YsXWMEjMJkg

Experiments:

Model trained on this corpus: Wav2Vec 2.0 XLSR-53 (multilingual pretraining)

Citation

@article{candido2022coraa,
  title={CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese},
  author={Candido Junior, Arnaldo and Casanova, Edresson and Soares, Anderson and de Oliveira, Frederico Santos and Oliveira, Lucas and Junior, Ricardo Corso Fernandes and da Silva, Daniel Peixoto Pinto and Fayet, Fernando Gorgulho and Carlotto, Bruno Baldissera and Gris, Lucas Rafael Stefanel and others},
  journal={Language Resources and Evaluation},
  pages={1--33},
  year={2022},
  publisher={Springer}
}

Partners / Sponsors / Funding

References