# UniSpeech


The family of UniSpeech:

- **WavLM** (arXiv): WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing
- **UniSpeech** (ICML 2021): Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR
- **UniSpeech-SAT** (ICASSP 2022 Submission): Universal Speech Representation Learning with Speaker Aware Pre-Training
- **ILS-SSL** (ICASSP 2022 Submission): Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision

Model introductions, evaluation results, and inference instructions are located in the corresponding model folders. The source code is available at <https://github.com/microsoft/UniSpeech/tree/main/src>.


## Pre-trained Models

We strongly recommend the UniSpeech-SAT models for speaker-related tasks, as they achieve strong performance across a range of speaker benchmarks. A minimal feature-extraction sketch follows the table below.

| Model | Pre-training Dataset | Fine-tuning Dataset | Download |
| --- | --- | --- | --- |
| UniSpeech Large EN | Labeled: 1350 hrs en | - | download |
| UniSpeech Large Multilingual | Labeled: 1350 hrs en + 353 hrs fr + 168 hrs es + 90 hrs it | - | download |
| UniSpeech Large+ | Labeled: 1350 hrs en, Unlabeled: 353 hrs fr | - | download |
| UniSpeech Large+ | Labeled: 1350 hrs en, Unlabeled: 168 hrs es | - | download |
| UniSpeech Large+ | Labeled: 1350 hrs en, Unlabeled: 90 hrs it | - | download |
| UniSpeech Large Multilingual | Labeled: 1350 hrs en + 353 hrs fr + 168 hrs es + 90 hrs it, Unlabeled: 17 hrs ky | - | download |
| UniSpeech Large+ | Labeled: 1350 hrs en, Unlabeled: 353 hrs fr | 1 hr fr | download |
| UniSpeech Large+ | Labeled: 1350 hrs en, Unlabeled: 168 hrs es | 1 hr es | download |
| UniSpeech Large+ | Labeled: 1350 hrs en, Unlabeled: 90 hrs it | 1 hr it | download |
| UniSpeech Large Multilingual | Labeled: 1350 hrs en + 353 hrs fr + 168 hrs es + 90 hrs it, Unlabeled: 17 hrs ky | 1 hr ky | download |
| UniSpeech-SAT Base | 960 hrs LibriSpeech | - | download |
| UniSpeech-SAT Base+ | 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli | - | download |
| UniSpeech-SAT Large | 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli | - | download |
| WavLM Base | 960 hrs LibriSpeech | - | download |
| WavLM Base+ | 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli | - | download |
| WavLM Large | 60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli | - | download |
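The checkpoints above are standard PyTorch checkpoints. As a minimal sketch of loading one of them for feature extraction, the snippet below uses the Hugging Face ports of the WavLM weights; the hub name `microsoft/wavlm-base-plus` and the `transformers` API are assumptions here, and the inference instructions in each model folder are the authoritative route:

```python
# Minimal feature-extraction sketch, assuming the Hugging Face port of
# WavLM ("microsoft/wavlm-base-plus"); the loading code shipped in this
# repo's model folders may differ.
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
model.eval()

# One second of dummy 16 kHz audio; replace with a real waveform.
waveform = torch.randn(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Frame-level representations, shape (batch, frames, hidden_size);
# these are the features the downstream benchmarks below build on.
features = outputs.last_hidden_state
print(features.shape)
```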

## Universal Representation Evaluation on SUPERB

*Figure: evaluation results on the SUPERB benchmark.*

## Downstream Task Performance

We also evaluate our models on typical speaker-related benchmarks.

### Speaker Verification

We fine-tune the models on the VoxCeleb2 dev set and evaluate them on the VoxCeleb1 trial lists (Vox1-O, Vox1-E, Vox1-H). Numbers are equal error rate (EER, %; lower is better), and "Fix pre-train" indicates whether the pre-trained model is kept frozen during fine-tuning. A minimal scoring/EER sketch follows the table.

| Model | Fix pre-train | Vox1-O | Vox1-E | Vox1-H |
| --- | --- | --- | --- | --- |
| ECAPA-TDNN | - | 0.87 | 1.12 | 2.12 |
| HuBERT large | Yes | 0.888 | 0.912 | 1.853 |
| Wav2Vec2.0 (XLSR) | Yes | 0.915 | 0.945 | 1.895 |
| UniSpeech-SAT large | Yes | 0.771 | 0.781 | 1.669 |
| WavLM large | Yes | 0.59 | 0.65 | 1.328 |
| WavLM large | No | 0.505 | 0.579 | 1.176 |
| *+ Large Margin Finetune and Score Calibration* | | | | |
| HuBERT large | No | 0.585 | 0.654 | 1.342 |
| Wav2Vec2.0 (XLSR) | No | 0.564 | 0.605 | 1.23 |
| UniSpeech-SAT large | No | 0.564 | 0.561 | 1.23 |
| WavLM large (New) | No | 0.33 | 0.477 | 0.984 |
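To make the numbers concrete: each trial pairs two utterances, the pair is scored with the cosine similarity of the two speaker embeddings, and EER is the operating point where false acceptances and false rejections balance. Below is a minimal sketch; the embedding extraction is elided, and `cosine_score` and `compute_eer` are illustrative helpers, not part of this repo:

```python
import numpy as np
from sklearn.metrics import roc_curve

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two fixed-size speaker embeddings."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def compute_eer(labels: np.ndarray, scores: np.ndarray) -> float:
    """Equal error rate: the ROC operating point where the
    false-acceptance rate equals the false-rejection rate."""
    fpr, tpr, _ = roc_curve(labels, scores)  # labels: 1 = same speaker
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))    # closest crossing point
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Toy usage: six trials with ground-truth labels and cosine scores.
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.1])
print(f"EER: {100 * compute_eer(labels, scores):.2f}%")
```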

For details of the speaker verification system, see *Large-Scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification*.

### Speech Separation

Evaluation on the LibriCSS dataset. Numbers are WER (%; lower is better): 0S and 0L are the 0% overlap conditions with short and long inter-utterance silence, and OVxx denotes an xx% overlap ratio. A minimal WER sketch follows the table.

| Model | 0S | 0L | OV10 | OV20 | OV30 | OV40 |
| --- | --- | --- | --- | --- | --- | --- |
| Conformer (SOTA) | 4.5 | 4.4 | 6.2 | 8.5 | 11 | 12.6 |
| UniSpeech-SAT base | 4.4 | 4.4 | 5.4 | 7.2 | 9.2 | 10.5 |
| UniSpeech-SAT large | 4.3 | 4.2 | 5.0 | 6.3 | 8.2 | 8.8 |
| WavLM base+ | 4.5 | 4.4 | 5.6 | 7.5 | 9.4 | 10.9 |
| WavLM large | 4.2 | 4.1 | 4.8 | 5.8 | 7.4 | 8.5 |
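For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is a generic implementation of that formula, not the scoring script used for these results:

```python
def wer(ref_words: list[str], hyp_words: list[str]) -> float:
    """Word error rate via Levenshtein distance over words."""
    # d[i][j] = edit distance between ref_words[:i] and hyp_words[:j]
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref_words)

# Toy usage: one insertion against a 3-word reference -> WER ~ 0.33.
print(wer("the cat sat".split(), "the cat sat down".split()))
```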

### Speaker Diarization

Evaluation on the CALLHOME dataset. Numbers are DER (%; lower is better), broken down by the number of speakers per recording (spk_all aggregates all conditions). A frame-level DER sketch follows the table.

| Model | spk_2 | spk_3 | spk_4 | spk_5 | spk_6 | spk_all |
| --- | --- | --- | --- | --- | --- | --- |
| EEND-vector clustering | 7.96 | 11.93 | 16.38 | 21.21 | 23.1 | 12.49 |
| EEND-EDA clustering (SOTA) | 7.11 | 11.88 | 14.37 | 25.95 | 21.95 | 11.84 |
| UniSpeech-SAT large | 5.93 | 10.66 | 12.9 | 16.48 | 23.25 | 10.92 |
| WavLM Base | 6.99 | 11.12 | 15.20 | 16.48 | 21.61 | 11.75 |
| WavLM large | 6.46 | 10.69 | 11.84 | 12.89 | 20.70 | 10.35 |
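As a rough illustration of the metric, here is a frame-level DER computation. It assumes hypothesis speakers have already been optimally mapped to reference speakers and ignores the forgiveness collar used in official scoring, so it is a sketch rather than a replacement for the standard md-eval tooling:

```python
import numpy as np

def frame_der(ref: np.ndarray, hyp: np.ndarray) -> float:
    """Frame-level diarization error rate.

    ref, hyp: boolean arrays of shape (num_frames, num_speakers), True
    where a speaker is active. Assumes hypothesis speaker columns are
    already aligned to reference speakers (no optimal-mapping search).
    """
    n_ref = ref.sum(axis=1)              # active reference speakers per frame
    n_hyp = hyp.sum(axis=1)              # active hypothesis speakers per frame
    n_correct = (ref & hyp).sum(axis=1)  # correctly attributed speakers
    # Per frame, missed speech + false alarm + speaker confusion
    # collapses to max(n_ref, n_hyp) - n_correct.
    errors = np.maximum(n_ref, n_hyp) - n_correct
    return float(errors.sum() / n_ref.sum())

# Toy usage: 6 frames, 2 speakers; one missed frame and one confusion
# give 2 errors over 6 reference speaker-frames -> DER = 33.3%.
ref = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]], dtype=bool)
hyp = np.array([[1, 0], [0, 0], [1, 1], [1, 0], [0, 1], [0, 0]], dtype=bool)
print(f"DER: {100 * frame_der(ref, hyp):.1f}%")
```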

## License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ project.

This project has adopted the Microsoft Open Source Code of Conduct.

## Reference

If you find our work useful in your research, please cite the following papers:

```bibtex
@inproceedings{Wang2021UniSpeech,
  author    = {Chengyi Wang and Yu Wu and Yao Qian and Kenichi Kumatani and Shujie Liu and Furu Wei and Michael Zeng and Xuedong Huang},
  editor    = {Marina Meila and Tong Zhang},
  title     = {UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning, {ICML} 2021, 18-24 July 2021, Virtual Event},
  series    = {Proceedings of Machine Learning Research},
  volume    = {139},
  pages     = {10937--10947},
  publisher = {{PMLR}},
  year      = {2021},
  url       = {http://proceedings.mlr.press/v139/wang21y.html},
  timestamp = {Thu, 21 Oct 2021 16:06:12 +0200},
  biburl    = {https://dblp.org/rec/conf/icml/0002WQK0WZ021.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{Chen2021WavLM,
  title         = {WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing},
  author        = {Sanyuan Chen and Chengyi Wang and Zhengyang Chen and Yu Wu and Shujie Liu and Zhuo Chen and Jinyu Li and Naoyuki Kanda and Takuya Yoshioka and Xiong Xiao and Jian Wu and Long Zhou and Shuo Ren and Yanmin Qian and Yao Qian and Jian Wu and Michael Zeng and Furu Wei},
  eprint        = {2110.13900},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  year          = {2021}
}

@article{Chen2021UniSpeechSAT,
  title         = {UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training},
  author        = {Sanyuan Chen and Yu Wu and Chengyi Wang and Zhengyang Chen and Zhuo Chen and Shujie Liu and Jian Wu and Yao Qian and Furu Wei and Jinyu Li and Xiangzhan Yu},
  eprint        = {2110.05752},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  year          = {2021}
}
```

## Contact Information

For help or issues using UniSpeech models, please submit a GitHub issue.

For other communications related to UniSpeech, please contact Yu Wu (yuwu1@microsoft.com).