Awesome
UniSpeech
<!--**Pre-trained models for speech related tasks**-->The family of UniSpeech:
WavLM (
arXiv
): WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing
UniSpeech (
ICML 2021
): Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR
UniSpeech-SAT (
ICASSP 2022 Submission
): Universal Speech Representation Learning with Speaker Aware Pre-Training
ILS-SSL (
ICASSP 2022 Submission
): Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision
Model introductions, evaluation results, and model inference instructions are located in their corresponding folders. The source code is here [https://github.com/microsoft/UniSpeech/tree/main/src].
Update
- [HuggingFace Integration] Dec 23, 2021: WavLM models are on HuggingFace .
- [HuggingFace Integration] Octorber 26, 2021: UniSpeech-SAT models are on HuggingFace .
- [Model Release] Octorber 13, 2021: UniSpeech-SAT models are releaseed.
- [HuggingFace Integration] Octorber 11, 2021: UniSpeech models are on HuggingFace .
- [Model Release] June, 2021: UniSpeech v1 models are released.
Pre-trained models
We strongly suggest using our UniSpeech-SAT model for speaker related tasks, since it shows very powerful performance on various speaker related benchmarks.
Universal Representation Evaluation on SUPERB
Downstream Task Performance
We also evaluate our models on typical speaker related benchmarks.
Speaker Verification
Finetune the model with VoxCeleb2 dev data, and evaluate it on the VoxCeleb1
Model | Fix pre-train | Vox1-O | Vox1-E | Vox1-H |
---|---|---|---|---|
ECAPA-TDNN | - | 0.87 | 1.12 | 2.12 |
HuBERT large | Yes | 0.888 | 0.912 | 1.853 |
Wav2Vec2.0 (XLSR) | Yes | 0.915 | 0.945 | 1.895 |
UniSpeech-SAT large | Yes | 0.771 | 0.781 | 1.669 |
WavLM large | Yes | 0.59 | 0.65 | 1.328 |
WavLM large | No | 0.505 | 0.579 | 1.176 |
+Large Margin Finetune and Score Calibration | ||||
HuBERT large | No | 0.585 | 0.654 | 1.342 |
Wav2Vec2.0 (XLSR) | No | 0.564 | 0.605 | 1.23 |
UniSpeech-SAT large | No | 0.564 | 0.561 | 1.23 |
WavLM large (New) | No | 0.33 | 0.477 | 0.984 |
Large-scale Self-Supervised Speech Representation Learning for Automatic Speaker Verification
Speech Separation
Evaluation on LibriCSS
Model | 0S | 0L | OV10 | OV20 | OV30 | OV40 |
---|---|---|---|---|---|---|
Conformer (SOTA) | 4.5 | 4.4 | 6.2 | 8.5 | 11 | 12.6 |
UniSpeech-SAT base | 4.4 | 4.4 | 5.4 | 7.2 | 9.2 | 10.5 |
UniSpeech-SAT large | 4.3 | 4.2 | 5.0 | 6.3 | 8.2 | 8.8 |
WavLM base+ | 4.5 | 4.4 | 5.6 | 7.5 | 9.4 | 10.9 |
WavLM large | 4.2 | 4.1 | 4.8 | 5.8 | 7.4 | 8.5 |
Speaker Diarization
Evaluation on CALLHOME
Model | spk_2 | spk_3 | spk_4 | spk_5 | spk_6 | spk_all |
---|---|---|---|---|---|---|
EEND-vector clustering | 7.96 | 11.93 | 16.38 | 21.21 | 23.1 | 12.49 |
EEND-EDA clustering (SOTA) | 7.11 | 11.88 | 14.37 | 25.95 | 21.95 | 11.84 |
UniSpeech-SAT large | 5.93 | 10.66 | 12.9 | 16.48 | 23.25 | 10.92 |
WavLM Base | 6.99 | 11.12 | 15.20 | 16.48 | 21.61 | 11.75 |
WavLm large | 6.46 | 10.69 | 11.84 | 12.89 | 20.70 | 10.35 |
License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ project.
Microsoft Open Source Code of Conduct
Reference
If you find our work is useful in your research, please cite the following paper:
@inproceedings{Wang2021UniSpeech,
author = {Chengyi Wang and Yu Wu and Yao Qian and Kenichi Kumatani and Shujie Liu and Furu Wei and Michael Zeng and Xuedong Huang},
editor = {Marina Meila and Tong Zhang},
title = {UniSpeech: Unified Speech Representation Learning with Labeled and
Unlabeled Data},
booktitle = {Proceedings of the 38th International Conference on Machine Learning,
{ICML} 2021, 18-24 July 2021, Virtual Event},
series = {Proceedings of Machine Learning Research},
volume = {139},
pages = {10937--10947},
publisher = {{PMLR}},
year = {2021},
url = {http://proceedings.mlr.press/v139/wang21y.html},
timestamp = {Thu, 21 Oct 2021 16:06:12 +0200},
biburl = {https://dblp.org/rec/conf/icml/0002WQK0WZ021.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{Chen2021WavLM,
title = {WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing},
author = {Sanyuan Chen and Chengyi Wang and Zhengyang Chen and Yu Wu and Shujie Liu and Zhuo Chen and Jinyu Li and Naoyuki Kanda and Takuya Yoshioka and Xiong Xiao and Jian Wu and Long Zhou and Shuo Ren and Yanmin Qian and Yao Qian and Jian Wu and Michael Zeng and Furu Wei},
eprint={2110.13900},
archivePrefix={arXiv},
primaryClass={cs.CL},
year={2021}
}
@article{Chen2021UniSpeechSAT,
title = {UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training},
author = {Sanyuan Chen and Yu Wu and Chengyi Wang and Zhengyang Chen and Zhuo Chen and Shujie Liu and Jian Wu and Yao Qian and Furu Wei and Jinyu Li and Xiangzhan Yu},
eprint={2110.05752},
archivePrefix={arXiv},
primaryClass={cs.CL},
year={2021}
}
Contact Information
For help or issues using UniSpeech models, please submit a GitHub issue.
For other communications related to UniSpeech, please contact Yu Wu (yuwu1@microsoft.com
).