<p align="center"><img width="160" src="doc/lip_white.png" alt="logo"></p>
<h1 align="center">Visual Speech Recognition for Multiple Languages</h1>
<div align="center">📘Introduction | 🛠️Preparation | 📊Benchmark | 🔮Inference | 🐯Model zoo | 📝License</div>

## Authors
Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic.
## Update

- **2023-07-26**: We have released our training recipe for real-time AV-ASR, see here.
- **2023-06-16**: We have released our training recipe for Auto-AVSR, see here.
- **2023-03-27**: We have released our Auto-AVSR models for LRS3, see here.
## Introduction

This is the repository of Visual Speech Recognition for Multiple Languages, the successor of End-to-End Audio-Visual Speech Recognition with Conformers. With this repository, you can achieve WERs of 19.1%, 1.0%, and 0.9% for visual-only, audio-only, and audio-visual speech recognition (VSR, ASR, and AV-ASR) on LRS3.
## Tutorial

We provide a tutorial that shows how to use our Auto-AVSR models to perform speech recognition (ASR, VSR, and AV-ASR), crop mouth ROIs, and extract visual speech features.
## Demo

| English -> Mandarin -> Spanish | French -> Portuguese -> Italian |
|---|---|
| <img src='doc/vsr_1.gif' title='vsr1' style='max-width:320px'></img> | <img src='doc/vsr_2.gif' title='vsr2' style='max-width:320px'></img> |
## Preparation

- Clone the repository and enter it locally:

```Shell
git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
cd Visual_Speech_Recognition_for_Multiple_Languages
```

- Set up the environment:

```Shell
conda create -y -n autoavsr python=3.8
conda activate autoavsr
```

- Install pytorch, torchvision, and torchaudio by following the instructions here, then install the remaining packages:

```Shell
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
```

- Download and extract a pre-trained model and/or language model from the model zoo to:
  - `./benchmarks/${dataset}/models`
  - `./benchmarks/${dataset}/language_models`
- [For VSR and AV-ASR] Install the RetinaFace or MediaPipe tracker. A consolidated setup sketch follows this list.
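Putting the preparation steps together, an end-to-end setup might look like the sketch below. It is only a sketch: the PyTorch install line depends on your platform and CUDA version (follow the official instructions linked above), and `LRS3` stands in for whichever `${dataset}` you downloaded models for.

```Shell
# Clone and enter the repository
git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
cd Visual_Speech_Recognition_for_Multiple_Languages

# Create and activate the environment
conda create -y -n autoavsr python=3.8
conda activate autoavsr

# Install PyTorch (example command; match it to your CUDA setup)
pip install torch torchvision torchaudio

# Install the remaining dependencies
pip install -r requirements.txt
conda install -c conda-forge ffmpeg

# Create the expected checkpoint directories (LRS3 is an example dataset)
mkdir -p benchmarks/LRS3/models benchmarks/LRS3/language_models
```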
## Benchmark evaluation

```Shell
python eval.py config_filename=[config_filename] \
               labels_filename=[labels_filename] \
               data_dir=[data_dir] \
               landmarks_dir=[landmarks_dir]
```

- `[config_filename]` is the model configuration path, located in `./configs`.
- `[labels_filename]` is the labels path, located in `${lipreading_root}/benchmarks/${dataset}/labels`.
- `[data_dir]` and `[landmarks_dir]` are the directories of the original dataset and the corresponding landmarks.
- `gpu_idx=-1` can be added to switch from `cuda:0` to `cpu`.
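As a concrete example, evaluating a visual-only model on LRS3 might look like the following. The bracketed names are placeholders and the LRS3 paths are illustrative, not shipped defaults; substitute the real filenames from `./configs` and `./benchmarks/LRS3/labels` in your checkout.

```Shell
# Illustrative invocation; replace the bracketed names with real files
python eval.py config_filename=./configs/[config_filename] \
               labels_filename=./benchmarks/LRS3/labels/[labels_filename] \
               data_dir=/path/to/LRS3 \
               landmarks_dir=/path/to/LRS3_landmarks \
               gpu_idx=-1   # optional: run on CPU instead of cuda:0
```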
## Speech prediction

```Shell
python infer.py config_filename=[config_filename] data_filename=[data_filename]
```

- `[data_filename]` is the path to the audio/video file.
- `detector=mediapipe` can be added to switch from the RetinaFace tracker to the MediaPipe tracker.
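For instance, transcribing a single video with the MediaPipe tracker might look like this; the config placeholder and the video path are illustrative.

```Shell
# Illustrative paths; pick a real config from ./configs
python infer.py config_filename=./configs/[config_filename] \
                data_filename=/path/to/video.mp4 \
                detector=mediapipe
```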
## Mouth ROIs cropping

```Shell
python crop_mouth.py data_filename=[data_filename] dst_filename=[dst_filename]
```

- `[dst_filename]` is the path where the cropped mouth video will be saved.
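A minimal invocation, with illustrative input and output paths:

```Shell
# Illustrative paths
python crop_mouth.py data_filename=/path/to/video.mp4 dst_filename=/path/to/mouth_roi.mp4
```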
## Model zoo

### Overview
We support a number of datasets for speech recognition:
- Lip Reading Sentences 2 (LRS2)
- Lip Reading Sentences 3 (LRS3)
- Chinese Mandarin Lip Reading (CMLR)
- CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)
- GRID
- Lombard GRID
- TCD-TIMIT
### Auto-AVSR models

<details open>
<summary>Lip Reading Sentences 3 (LRS3)</summary>

| Components | WER | url | size (MB) |
|---|---|---|---|
| **Visual-only** | | | |
| - | 19.1 | GoogleDrive or BaiduDrive (key: dqsy) | 891 |
| **Audio-only** | | | |
| - | 1.0 | GoogleDrive or BaiduDrive (key: dvf2) | 860 |
| **Audio-visual** | | | |
| - | 0.9 | GoogleDrive or BaiduDrive (key: sai5) | 1540 |
| **Language models** | | | |
| - | - | GoogleDrive or BaiduDrive (key: t9ep) | 191 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: mi3c) | 18577 |

</details>
### VSR models for multiple languages

<details open>
<summary>Lip Reading Sentences 2 (LRS2)</summary>

| Components | WER | url | size (MB) |
|---|---|---|---|
| **Visual-only** | | | |
| - | 26.1 | GoogleDrive or BaiduDrive (key: 48l1) | 186 |
| **Language models** | | | |
| - | - | GoogleDrive or BaiduDrive (key: 59u2) | 180 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: 53rc) | 9358 |

</details>
<details open>
<summary>Lip Reading Sentences 3 (LRS3)</summary>

| Components | WER | url | size (MB) |
|---|---|---|---|
| **Visual-only** | | | |
| - | 32.3 | GoogleDrive or BaiduDrive (key: 1b1s) | 186 |
| **Language models** | | | |
| - | - | GoogleDrive or BaiduDrive (key: 59u2) | 180 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: mi3c) | 18577 |

</details>
<details open>
<summary>Chinese Mandarin Lip Reading (CMLR)</summary>

| Components | CER | url | size (MB) |
|---|---|---|---|
| **Visual-only** | | | |
| - | 8.0 | GoogleDrive or BaiduDrive (key: 7eq1) | 195 |
| **Language models** | | | |
| - | - | GoogleDrive or BaiduDrive (key: k8iv) | 187 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: 1ret) | 3721 |

</details>
<details open>
<summary>CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)</summary>

| Components | WER | url | size (MB) |
|---|---|---|---|
| **Visual-only** | | | |
| Spanish | 44.5 | GoogleDrive or BaiduDrive (key: m35h) | 186 |
| Portuguese | 51.4 | GoogleDrive or BaiduDrive (key: wk2h) | 186 |
| French | 58.6 | GoogleDrive or BaiduDrive (key: t1hf) | 186 |
| **Language models** | | | |
| Spanish | - | GoogleDrive or BaiduDrive (key: 0mii) | 180 |
| Portuguese | - | GoogleDrive or BaiduDrive (key: l6ag) | 179 |
| French | - | GoogleDrive or BaiduDrive (key: 6tan) | 179 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: vsic) | 3040 |

</details>
<details open>
<summary>GRID</summary>

| Components | WER | url | size (MB) |
|---|---|---|---|
| **Visual-only** | | | |
| Overlapped | 1.2 | GoogleDrive or BaiduDrive (key: d8d2) | 186 |
| Unseen | 4.8 | GoogleDrive or BaiduDrive (key: ttsh) | 186 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: 16l9) | 1141 |

You can include `data_ext=.mpg` in your command line to match the video file extension in the GRID dataset.

</details>
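For example, a GRID benchmark evaluation would carry the extension override alongside the usual arguments (bracketed values are placeholders, as above):

```Shell
python eval.py config_filename=[config_filename] \
               labels_filename=[labels_filename] \
               data_dir=[data_dir] \
               landmarks_dir=[landmarks_dir] \
               data_ext=.mpg
```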
<details open>
<summary>Lombard GRID</summary>

| Components | WER | url | size (MB) |
|---|---|---|---|
| **Visual-only** | | | |
| Unseen (Front Plain) | 4.9 | GoogleDrive or BaiduDrive (key: 38ds) | 186 |
| Unseen (Side Plain) | 8.0 | GoogleDrive or BaiduDrive (key: k6m0) | 186 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: cusv) | 309 |

You can include `data_ext=.mov` in your command line to match the video file extension in the Lombard GRID dataset.

</details>
<details open>
<summary>TCD-TIMIT</summary>

| Components | WER | url | size (MB) |
|---|---|---|---|
| **Visual-only** | | | |
| Overlapped | 16.9 | GoogleDrive or BaiduDrive (key: jh65) | 186 |
| Unseen | 21.8 | GoogleDrive or BaiduDrive (key: n2gr) | 186 |
| **Language models** | | | |
| - | - | GoogleDrive or BaiduDrive (key: 59u2) | 180 |
| **Landmarks** | | | |
| - | - | GoogleDrive or BaiduDrive (key: bnm8) | 930 |

</details>
## Citation

If you use the Auto-AVSR training code or models, please consider citing the following paper:

```bibtex
@inproceedings{ma2023auto,
  author={Ma, Pingchuan and Haliassos, Alexandros and Fernandez-Lopez, Adriana and Chen, Honglie and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels},
  year={2023},
}
```
If you use the VSR models for multiple languages, please consider citing the following paper:

```bibtex
@article{ma2022visual,
  title={{Visual Speech Recognition for Multiple Languages in the Wild}},
  author={Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
  journal={{Nature Machine Intelligence}},
  volume={4},
  pages={930--939},
  year={2022},
  url={https://doi.org/10.1038/s42256-022-00550-z},
  doi={10.1038/s42256-022-00550-z}
}
```
## License

Note that the code may only be used for comparative or benchmarking purposes. Code supplied under the License may be used for non-commercial purposes only.
## Contact
[Pingchuan Ma](pingchuan.ma16[at]imperial.ac.uk)