# VoxPopuli

https://aclanthology.org/2021.acl-long.80

A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation.

## Overview

VoxPopuli provides
- 400K hours of unlabelled speech data for 23 languages
- 1.8K hours of transcribed speech data for 16 languages
- 17.3K hours of speech-to-speech interpretation data for 15x15 directions
- 29 hours of transcribed non-native English speech intended for research in ASR for accented speech

The raw data is collected from 2009-2020 European Parliament event recordings. We acknowledge the European Parliament for creating and sharing these materials.

Detailed statistics

<details><summary>Unlabelled and transcribed data</summary>

| Language | Code | Unlabelled Hours (v1/v2) | Transcribed Hours | Transcribed Speakers | Transcribed Tokens | LM Tokens |
|---|---|---|---|---|---|---|
| English | En | 4.5K/24.1K | 543 | 1313 | 4.8M | 60.1M |
| German | De | 4.5K/23.2K | 282 | 531 | 2.3M | 50.0M |
| French | Fr | 4.5K/22.8K | 211 | 534 | 2.1M | 58.6M |
| Spanish | Es | 4.4K/21.4K | 166 | 305 | 1.6M | 57.4M |
| Polish | Pl | 4.5K/21.2K | 111 | 282 | 802K | 13.6M |
| Italian | It | 4.6K/21.9K | 91 | 306 | 757K | 52.1M |
| Romanian | Ro | 4.5K/17.9K | 89 | 164 | 739K | 10.3M |
| Hungarian | Hu | 4.4K/17.7K | 63 | 143 | 431K | 13.0M |
| Czech | Cs | 4.5K/18.7K | 62 | 138 | 461K | 13.5M |
| Dutch | Nl | 4.5K/19.0K | 53 | 221 | 488K | 54.6M |
| Finnish | Fi | 4.4K/14.2K | 27 | 84 | 160K | 34.5M |
| Croatian | Hr | 2.7K/8.1K | 43 | 83 | 337K | 285K |
| Slovak | Sk | 4.4K/12.1K | 35 | 96 | 270K | 13.3M |
| Slovene | Sl | 4.4K/11.3K | 10 | 45 | 76K | 12.6M |
| Estonian | Et | 4.3K/10.6K | 3 | 29 | 18K | 11.3M |
| Lithuanian | Lt | 4.3K/14.4K | 2 | 21 | 10K | 11.5M |
| Portuguese | Pt | 4.4K/17.5K | - | - | - | - |
| Bulgarian | Bg | 4.3K/17.6K | - | - | - | - |
| Greek | El | 4.4K/17.7K | - | - | - | - |
| Latvian | Lv | 4.4K/13.1K | - | - | - | - |
| Maltese | Mt | 4.4K/9.1K | - | - | - | - |
| Swedish | Sv | 4.5K/16.3K | - | - | - | - |
| Danish | Da | 4.3K/13.6K | - | - | - | - |
| Total | | 100K/384K | 1791 | 4295 | 15M | 467M |

</details>

<details><summary>Speech-to-speech interpretation data</summary>

| Source/Target | En | De | Fr | Es | Pl | It | Ro | Hu | Cs | Nl | Fi | Sk | Sl | Lt | Da | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| En | - | 463 | 427 | 441 | 432 | 461 | 457 | 382 | 427 | 400 | 442 | 433 | 434 | 398 | 370 | 6.0K |
| De | 187 | - | 196 | 204 | 214 | 217 | 198 | 205 | 214 | 196 | 217 | 208 | 218 | 164 | 179 | 2.8K |
| Fr | 169 | 187 | - | 187 | 172 | 197 | 195 | 144 | 170 | 158 | 168 | 168 | 156 | 139 | 134 | 2.3K |
| Es | 130 | 138 | 135 | - | 118 | 148 | 128 | 93 | 118 | 115 | 124 | 114 | 108 | 83 | 86 | 1.6K |
| Pl | 68 | 66 | 54 | 55 | - | 67 | 55 | 43 | 67 | 42 | 55 | 62 | 57 | 50 | 34 | 775 |
| It | 69 | 77 | 76 | 79 | 72 | - | 75 | 61 | 68 | 64 | 71 | 66 | 70 | 53 | 60 | 961 |
| Ro | 60 | 59 | 59 | 58 | 49 | 61 | - | 38 | 50 | 43 | 48 | 50 | 46 | 38 | 29 | 688 |
| Hu | 30 | 38 | 25 | 27 | 29 | 30 | 27 | - | 27 | 20 | 31 | 29 | 26 | 21 | 18 | 378 |
| Cs | 39 | 35 | 29 | 30 | 36 | 32 | 31 | 23 | - | 23 | 29 | 55 | 29 | 25 | 18 | 434 |
| Nl | 31 | 43 | 35 | 29 | 27 | 38 | 24 | 25 | 25 | - | 32 | 25 | 23 | 19 | 25 | 401 |
| Fi | 15 | 18 | 15 | 13 | 13 | 13 | 13 | 12 | 13 | 11 | - | 14 | 12 | 11 | 9 | 182 |
| Hr | 31 | 27 | 27 | 24 | 27 | 28 | 24 | 22 | 24 | 22 | 24 | 26 | 37 | 21 | 20 | 384 |
| Sk | 21 | 22 | 14 | 16 | 19 | 16 | 16 | 14 | 32 | 13 | 16 | - | 17 | 13 | 10 | 239 |
| Sl | 6 | 6 | 4 | 5 | 5 | 6 | 5 | 4 | 5 | 4 | 5 | 6 | - | 4 | 3 | 68 |
| Lt | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | - | 0 | 13 |
| Total | 857 | 1.2K | 1.1K | 1.2K | 1.2K | 1.3K | 1.2K | 1.1K | 1.2K | 1.1K | 1.3K | 1.3K | 1.2K | 1.0K | 995 | 17.3K |

</details>

<details><summary>Accented speech transcribed data</summary>

| Accent | Code | Transcribed Hours | Transcribed Speakers |
|---|---|---|---|
| Dutch | en_nl | 3.52 | 45 |
| German | en_de | 3.52 | 84 |
| Czech | en_cs | 3.30 | 26 |
| Polish | en_pl | 3.23 | 33 |
| French | en_fr | 2.56 | 27 |
| Hungarian | en_hu | 2.33 | 23 |
| Finnish | en_fi | 2.18 | 20 |
| Romanian | en_ro | 1.85 | 27 |
| Slovak | en_sk | 1.46 | 17 |
| Spanish | en_es | 1.42 | 18 |
| Italian | en_it | 1.11 | 15 |
| Estonian | en_et | 1.08 | 6 |
| Lithuanian | en_lt | 0.65 | 7 |
| Croatian | en_hr | 0.42 | 9 |
| Slovene | en_sl | 0.25 | 7 |

</details>

## Getting Data

We provide raw audios as well as scripts to segment and align them with transcriptions or interpretations. The output format is Ogg Vorbis (16000Hz, 16-bit, mono-channel), which is supported by common libraries such as libsndfile and libsox (which have Python front-ends such as soundfile and torchaudio).
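Once you have generated segments with the steps below, you can sanity-check the output format from the shell. This is a minimal sketch assuming sox is installed with Ogg Vorbis support; the path is a placeholder following the output templates below.

```bash
# Print the header info of one segment; expect Ogg Vorbis,
# 16000 Hz sample rate, 16-bit precision, mono.
soxi ${ROOT}/unlabelled_data/[language]/[year]/[segment_id].ogg
```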

As the first step, clone this repo for the processing scripts

```bash
git clone https://github.com/facebookresearch/voxpopuli.git
```

and install required PyPI packages:

```bash
pip install -r requirements.txt
```

### Unlabelled Data

First, download raw audios via

```bash
python -m voxpopuli.download_audios --root [ROOT] --subset [SUBSET]
```

which saves audios to `${ROOT}/raw_audios/[language]/[year]/[recording_id].ogg`.

`SUBSET` specifies the data subset to download:

| --subset | # Languages | Hours | Years | Size |
|---|---|---|---|---|
| en, de, fr, es, pl, it, ro, hu, cs, nl, fi, hr, sk, sl, et, lt, pt, bg, el, lv, mt, sv or da | 1 | 2.7K-4.6K | 2009-2020 | 44G-75G |
| en_v2, de_v2, fr_v2, es_v2, pl_v2, it_v2, ro_v2, hu_v2, cs_v2, nl_v2, fi_v2, hr_v2, sk_v2, sl_v2, et_v2, lt_v2, pt_v2, bg_v2, el_v2, lv_v2, mt_v2, sv_v2 or da_v2 | 1 | 8.1K-24.1K | 2009-2020 | 130G-385G |
| 10k | 23 | 10K | 2019-2020 | 170G |
| 100k | 23 | 100K | 2009-2020 | 1.7T |
| 400k | 23 | 400K | 2009-2020 | 6.4T |

Then, segment these audios via

```bash
python -m voxpopuli.get_unlabelled_data --root [ROOT] --subset [SUBSET]
```

which outputs segments to `${ROOT}/unlabelled_data/[language]/[year]/[segment_id].ogg`.
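For example, a minimal end-to-end run for a single language (here hr, the smallest v1 subset per the table above; ./data is a placeholder root) would be:

```bash
# Download the Croatian (hr) v1 raw audios.
python -m voxpopuli.download_audios --root ./data --subset hr
# Segment them into ./data/unlabelled_data/hr/[year]/[segment_id].ogg
python -m voxpopuli.get_unlabelled_data --root ./data --subset hr
```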

### Transcribed (ASR) Data

First, download raw audios via

```bash
python -m voxpopuli.download_audios --root [ROOT] --subset asr
```

which saves audios to `${ROOT}/raw_audios/original/[year]/[recording_id].ogg`.

Then, segment these audios and align them with transcripts via

```bash
python -m voxpopuli.get_asr_data --root [ROOT] --lang [LANGUAGE]
```

which outputs the segmented audios together with the aligned transcripts.
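Putting both steps together, a hypothetical run for the German transcribed set (282 hours, per the statistics above; ./data is a placeholder root) would be:

```bash
# Download the original-channel raw audios (shared across ASR languages).
python -m voxpopuli.download_audios --root ./data --subset asr
# Segment and align the German portion with its transcripts.
python -m voxpopuli.get_asr_data --root ./data --lang de
```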

**Accented transcribed data** To retrieve the transcribed accented speech data, follow the steps above with `--lang [LANGUAGE]_accented` (e.g. `--lang en_accented`). Note that the accented speech data currently consists of a test set only.

### Speech-to-Speech Interpretation Data

First, follow the instructions above to set up ASR data (source audios and transcripts).

Then, download target audios via

```bash
python -m voxpopuli.download_audios --root [ROOT] --subset [TARGET_LANGUAGE]
```

which saves audios to `${ROOT}/raw_audios/[target_language]/[year]/[recording_id].ogg`.

Finally, segment these audios and match them with source ones via

```bash
python -m voxpopuli.get_s2s_data --root [ROOT] --source-lang [SOURCE_LANGUAGE] --target-lang [TARGET_LANGUAGE]
```

which outputs the paired source and target segments.

We also provide human transcriptions for part of the target audios (English, French and Spanish only) to allow more accurate alignments. To use them instead of machine transcriptions in the alignments, add `--use-annotated-target` to the command line.
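As a concrete sketch, the full pipeline for English-to-French interpretation data, using the human-transcribed French targets mentioned above (./data is a placeholder root), would be:

```bash
# 1) Source side: set up the English ASR data (audios and transcripts).
python -m voxpopuli.download_audios --root ./data --subset asr
python -m voxpopuli.get_asr_data --root ./data --lang en
# 2) Target side: download the French interpretation audios.
python -m voxpopuli.download_audios --root ./data --subset fr
# 3) Segment the target audios and match them with the source segments,
#    preferring human transcriptions for the alignments.
python -m voxpopuli.get_s2s_data --root ./data --source-lang en --target-lang fr \
    --use-annotated-target
```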

### Language Modeling (LM) Data

We combine VoxPopuli transcripts and text data from Europarl for LM training.

Download VoxPopuli and Europarl text data, process the raw text and generate the vocabulary via

```bash
python -m voxpopuli.get_lm_data --root [ROOT] --lang [LANGUAGE]
```

which outputs the processed text and vocabulary files.

To train an n-gram LM with KenLM, run

```bash
${KENLM_PATH}/lmplz -o ${n} --limit_vocab_file [OUT_VOCAB_FILE] < [OUT_TEXT_FILE] > ${n}gram_lm.arpa
${KENLM_PATH}/build_binary ${n}gram_lm.arpa ${n}gram_lm.bin
```
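For instance, a 5-gram English LM could be built as follows; ./data is a placeholder root, and [OUT_VOCAB_FILE] / [OUT_TEXT_FILE] stand for the files produced by the previous step.

```bash
# Prepare the English LM text and vocabulary.
python -m voxpopuli.get_lm_data --root ./data --lang en
# Train a 5-gram LM restricted to the generated vocabulary,
# then binarize it for faster loading.
${KENLM_PATH}/lmplz -o 5 --limit_vocab_file [OUT_VOCAB_FILE] < [OUT_TEXT_FILE] > 5gram_lm.arpa
${KENLM_PATH}/build_binary 5gram_lm.arpa 5gram_lm.bin
```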

## Pre-trained Models

### wav2vec 2.0

We provide pre-trained wav2vec 2.0 models (implemented in fairseq and wav2letter/flashlight) for downstream speech tasks. Each language is covered by a monolingual Base model and by multilingual Large models that combine languages in the same family or all languages. See also XLS-R for larger-scale (up to 2B parameters) multilingual models trained on VoxPopuli (400K hours).

<details><summary><b>Download</b></summary>

| Language(s) | Family | PT Hours | Base Model (95M) | Large Model (317M) |
|---|---|---|---|---|
| Es (V1/V2) | Romance | 4.4K/21.4K | fairseq V1 / V2 | fairseq V1 / V2 Romance |
| Fr (V1/V2) | Romance | 4.5K/22.8K | fairseq V1 / V2 | fairseq V1 / V2 Romance |
| It (V1/V2) | Romance | 4.6K/21.9K | fairseq V1 / V2 | fairseq V1 / V2 Romance |
| Pt (V2) | Romance | 17.5K | fairseq | fairseq V2 Romance |
| Ro (V2) | Romance | 17.9K | fairseq | fairseq V2 Romance |
| Nl (V1/V2) | West Germanic | 4.5K/19.0K | fairseq V1 / V2 | fairseq V1 / V2 West Germanic |
| En (V2) | West Germanic | 24.1K | fairseq | fairseq V2 West Germanic |
| De (V2) | West Germanic | 23.2K | fairseq | fairseq V2 West Germanic |
| Sv (V1/V2) | North Germanic | 4.5K/16.3K | fairseq V1 / V2 | fairseq V1 / V2 North Germanic |
| Da (V2) | North Germanic | 13.6K | fairseq | fairseq V2 North Germanic |
| Bg (V2) | Slavic | 17.6K | fairseq | fairseq V2 Slavic |
| Cs (V2) | Slavic | 18.7K | fairseq | fairseq V2 Slavic |
| Hr (V2) | Slavic | 8.1K | fairseq | fairseq V2 Slavic |
| Pl (V2) | Slavic | 21.2K | fairseq | fairseq V2 Slavic |
| Sk (V2) | Slavic | 12.1K | fairseq | fairseq V2 Slavic |
| Sl (V2) | Slavic | 11.3K | fairseq | fairseq V2 Slavic |
| Et (V2) | Uralic | 10.6K | fairseq | fairseq V2 Uralic |
| Fi (V2) | Uralic | 14.2K | fairseq | fairseq V2 Uralic |
| Hu (V2) | Uralic | 17.7K | fairseq | fairseq V2 Uralic |
| Lv (V2) | Baltic | 13.1K | fairseq | fairseq V2 Baltic |
| Lt (V2) | Baltic | 14.4K | fairseq | fairseq V2 Baltic |
| El (V2) | Greek | 17.7K | fairseq | fairseq |
| Mt (V2) | Semitic | 9.1K | fairseq | fairseq |
| All 23 languages | - | 10K | fairseq | fairseq |
| All 23 languages | - | 100K | fairseq / wav2letter | fairseq |

</details>

In our paper (Section 4.3.1), we evaluated a subset of these models on the Common Voice corpus in both the normal setting and the few-shot phoneme recognition setting.

#### Wav2letter C++ implementation

A wav2letter implementation, as well as a checkpoint pretrained on VoxPopuli 100k (Base model), is also available in the wav2letter repository.

The complete fine-tuned ASR baselines for this codebase should be available soon. The wav2letter implementation follows this paper.

### ASR and LM

For the VoxPopuli ASR task, we provide Transformer baselines, fine-tuned wav2vec 2.0 models (Base 10K), as well as n-gram LMs (trained with KenLM) and their lexicons.

<details><summary><b>Download</b></summary>

| Language | ASR (fairseq) | LM (KenLM) | Lexicon |
|---|---|---|---|
| Cs | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
| De | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
| En | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
| Es | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
| Et | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
| Fi | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
| Fr | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
| Hr | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
| Hu | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
| It | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
| Lt | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
| Nl | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
| Pl | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
| Ro | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
| Sk | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |
| Sl | baseline, fine-tuned wav2vec2 | 3-gram, 5-gram | lexicon |

</details>

We also provide CoVoST 2 + EuroParl-ST ASR Transformer models that are self-trained on 3,000 hours of VoxPopuli unlabelled data.

<details><summary><b>Download</b></summary>

| Language | CoVoST 2 Test (WER) | EuroParl-ST Test (WER) | Model (fairseq) |
|---|---|---|---|
| De | 17.3 | 21.4 | s2t_transformer_l |
| Es | 13.2 | 15.3 | s2t_transformer_l |
| Fr | 17.0 | 19.0 | s2t_transformer_l |

</details>

Please refer to the S2T examples for the use of Transformer model checkpoints.

### Speech-to-Text Translation (ST)

We provide CoVoST 2 + EuroParl-ST ST Transformer models that are jointly trained with 400 hours of VoxPopuli weakly labelled data.

<details><summary><b>Download</b></summary>

| Direction | CoVoST 2 Test (BLEU) | EuroParl-ST Test (BLEU) | Model (fairseq) |
|---|---|---|---|
| De-En | 23.4 | 24.4 | s2t_transformer_l |
| Es-En | 29.7 | 28.4 | s2t_transformer_l |
| Fr-En | 30.3 | 31.1 | s2t_transformer_l |

</details>

Please refer to the S2T examples for the use of these checkpoints.

## License

| | License |
|---|---|
| VoxPopuli Data | CC0 (see also the European Parliament's legal notice for the raw data) |
| LM Data | (please check out the Europarl website for the Europarl portion) |
| Pre-trained Models | CC BY-NC 4.0 |
| Code | CC BY-NC 4.0 |

## Contact

Changhan Wang (changhan@fb.com), Morgane Rivière (mriviere@fb.com), Ann Lee (annl@fb.com)

## Citation

```bibtex
@inproceedings{wang-etal-2021-voxpopuli,
    title = "{V}ox{P}opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation",
    author = "Wang, Changhan  and
      Riviere, Morgane  and
      Lee, Ann  and
      Wu, Anne  and
      Talnikar, Chaitanya  and
      Haziza, Daniel  and
      Williamson, Mary  and
      Pino, Juan  and
      Dupoux, Emmanuel",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.80",
    pages = "993--1003",
}
```