<p align="center"><img width="160" src="doc/lip_white.png" alt="logo"></p> <h1 align="center">Visual Speech Recognition for Multiple Languages</h1> <div align="center">

📘Introduction | 🛠️Preparation | 📊Benchmark | 🔮Inference | 🐯Model zoo | 📝License

</div>

## Authors

Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic.

## Update

2023-07-26: We have released our training recipe for real-time AV-ASR, see here.

2023-06-16: We have released our training recipe for AutoAVSR, see here.

2023-03-27: We have released our AutoAVSR models for LRS3, see here.

## Introduction

This is the repository for Visual Speech Recognition for Multiple Languages, the successor of End-to-End Audio-Visual Speech Recognition with Conformers. With this repository, you can achieve 19.1%, 1.0%, and 0.9% WER for visual, automatic, and audio-visual speech recognition (VSR, ASR, and AV-ASR), respectively, on LRS3.

## Tutorial

We provide a tutorial (Open In Colab) that shows how to use our Auto-AVSR models to perform speech recognition (ASR, VSR, and AV-ASR), crop mouth ROIs, or extract visual speech features.

## Demo

| English -> Mandarin -> Spanish | French -> Portuguese -> Italian |
|:------------------------------:|:-------------------------------:|
| <img src='doc/vsr_1.gif' title='vsr1' style='max-width:320px'> | <img src='doc/vsr_2.gif' title='vsr2' style='max-width:320px'> |
<div align="center">

YouTube | Bilibili

</div>

## Preparation

1. Clone the repository and enter it locally:

   ```bash
   git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
   cd Visual_Speech_Recognition_for_Multiple_Languages
   ```

2. Set up the environment:

   ```bash
   conda create -y -n autoavsr python=3.8
   conda activate autoavsr
   ```

3. Install pytorch, torchvision, and torchaudio by following the instructions here, then install the remaining packages:

   ```bash
   pip install -r requirements.txt
   conda install -c conda-forge ffmpeg
   ```

4. Download and extract a pre-trained model and/or language model from the model zoo to:

5. [For VSR and AV-ASR] Install the RetinaFace or MediaPipe tracker.
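
Putting the steps above together, a minimal end-to-end setup might look like the sketch below. The PyTorch line is only one example and should be replaced with the install command recommended for your platform and CUDA version; MediaPipe is shown as one tracker option because it can be installed directly from PyPI.

```bash
# Sketch of a full environment setup (adapt the PyTorch line to your CUDA version).
git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages
cd Visual_Speech_Recognition_for_Multiple_Languages

conda create -y -n autoavsr python=3.8
conda activate autoavsr

# Example only: install the default PyPI wheels for pytorch/torchvision/torchaudio.
pip install torch torchvision torchaudio

pip install -r requirements.txt
conda install -c conda-forge ffmpeg

# [VSR and AV-ASR only] one tracker option is MediaPipe, available from PyPI.
pip install mediapipe
```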

## Benchmark evaluation

```bash
python eval.py config_filename=[config_filename] \
               labels_filename=[labels_filename] \
               data_dir=[data_dir] \
               landmarks_dir=[landmarks_dir]
```
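
As a concrete illustration, a visual-only evaluation on LRS3 might be launched as below; the config and label filenames are placeholders for whichever files you downloaded from the model zoo, not guaranteed paths in this repository.

```bash
# Hypothetical example: all four values are placeholders for your own files/paths.
python eval.py config_filename=configs/lrs3_vsr.ini \
               labels_filename=labels/lrs3_test.csv \
               data_dir=/data/LRS3 \
               landmarks_dir=/data/LRS3_landmarks
```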

## Speech prediction

```bash
python infer.py config_filename=[config_filename] data_filename=[data_filename]
```
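
For example, transcribing a single video with a downloaded model could look like the following; both filenames are placeholders.

```bash
# Hypothetical example: point infer.py at a model config and a single video file.
python infer.py config_filename=configs/lrs3_vsr.ini data_filename=clip.mp4
```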

## Mouth ROIs cropping

```bash
python crop_mouth.py data_filename=[data_filename] dst_filename=[dst_filename]
```
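
For instance, to crop the mouth region of a single clip and save it as a new video (filenames are placeholders):

```bash
# Hypothetical example: write the cropped mouth ROI video next to the input clip.
python crop_mouth.py data_filename=clip.mp4 dst_filename=clip_roi.mp4
```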

## Model zoo

### Overview

We support a number of datasets for speech recognition; the models, language models, and landmarks available for each dataset are listed below.

### AutoAVSR models

<details open> <summary>Lip Reading Sentences 3 (LRS3)</summary> <p> </p>
| Components | WER | url | size (MB) |
|:-----------|:---:|:---:|:---------:|
| Visual-only | 19.1 | GoogleDrive or BaiduDrive (key: dqsy) | 891 |
| Audio-only | 1.0 | GoogleDrive or BaiduDrive (key: dvf2) | 860 |
| Audio-visual | 0.9 | GoogleDrive or BaiduDrive (key: sai5) | 1540 |
| Language models | - | GoogleDrive or BaiduDrive (key: t9ep) | 191 |
| Landmarks | - | GoogleDrive or BaiduDrive (key: mi3c) | 18577 |
</details>

### VSR models for multiple languages

<details open> <summary>Lip Reading Sentences 2 (LRS2)</summary> <p> </p>
| Components | WER | url | size (MB) |
|:-----------|:---:|:---:|:---------:|
| Visual-only | 26.1 | GoogleDrive or BaiduDrive (key: 48l1) | 186 |
| Language models | - | GoogleDrive or BaiduDrive (key: 59u2) | 180 |
| Landmarks | - | GoogleDrive or BaiduDrive (key: 53rc) | 9358 |
</details> <details open> <summary>Lip Reading Sentences 3 (LRS3)</summary> <p> </p>
| Components | WER | url | size (MB) |
|:-----------|:---:|:---:|:---------:|
| Visual-only | 32.3 | GoogleDrive or BaiduDrive (key: 1b1s) | 186 |
| Language models | - | GoogleDrive or BaiduDrive (key: 59u2) | 180 |
| Landmarks | - | GoogleDrive or BaiduDrive (key: mi3c) | 18577 |
</details> <details open> <summary>Chinese Mandarin Lip Reading (CMLR)</summary> <p> </p>
| Components | CER | url | size (MB) |
|:-----------|:---:|:---:|:---------:|
| Visual-only | 8.0 | GoogleDrive or BaiduDrive (key: 7eq1) | 195 |
| Language models | - | GoogleDrive or BaiduDrive (key: k8iv) | 187 |
| Landmarks | - | GoogleDrive or BaiduDrive (key: 1ret) | 3721 |
</details> <details open> <summary>CMU Multimodal Opinion Sentiment, Emotions and Attributes (CMU-MOSEAS)</summary> <p> </p>
| Components | WER | url | size (MB) |
|:-----------|:---:|:---:|:---------:|
| Visual-only (Spanish) | 44.5 | GoogleDrive or BaiduDrive (key: m35h) | 186 |
| Visual-only (Portuguese) | 51.4 | GoogleDrive or BaiduDrive (key: wk2h) | 186 |
| Visual-only (French) | 58.6 | GoogleDrive or BaiduDrive (key: t1hf) | 186 |
| Language model (Spanish) | - | GoogleDrive or BaiduDrive (key: 0mii) | 180 |
| Language model (Portuguese) | - | GoogleDrive or BaiduDrive (key: l6ag) | 179 |
| Language model (French) | - | GoogleDrive or BaiduDrive (key: 6tan) | 179 |
| Landmarks | - | GoogleDrive or BaiduDrive (key: vsic) | 3040 |
</details> <details open> <summary>GRID</summary> <p> </p>
| Components | WER | url | size (MB) |
|:-----------|:---:|:---:|:---------:|
| Visual-only (Overlapped) | 1.2 | GoogleDrive or BaiduDrive (key: d8d2) | 186 |
| Visual-only (Unseen) | 4.8 | GoogleDrive or BaiduDrive (key: ttsh) | 186 |
| Landmarks | - | GoogleDrive or BaiduDrive (key: 16l9) | 1141 |

You can include `data_ext=.mpg` in your command line to match the video file extension used in the GRID dataset.
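
For instance, a GRID evaluation might add the option as shown below (the other arguments remain placeholders); the same pattern applies to datasets with other extensions, such as `.mov` for Lombard GRID.

```bash
# Hypothetical example: override the expected video extension for GRID.
python eval.py config_filename=[config_filename] \
               labels_filename=[labels_filename] \
               data_dir=[data_dir] \
               landmarks_dir=[landmarks_dir] \
               data_ext=.mpg
```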

</details> <details open> <summary>Lombard GRID</summary> <p> </p>
| Components | WER | url | size (MB) |
|:-----------|:---:|:---:|:---------:|
| Visual-only (Unseen, Front Plain) | 4.9 | GoogleDrive or BaiduDrive (key: 38ds) | 186 |
| Visual-only (Unseen, Side Plain) | 8.0 | GoogleDrive or BaiduDrive (key: k6m0) | 186 |
| Landmarks | - | GoogleDrive or BaiduDrive (key: cusv) | 309 |

You can include `data_ext=.mov` in your command line to match the video file extension used in the Lombard GRID dataset.

</details> <details open> <summary>TCD-TIMIT</summary> <p> </p>
| Components | WER | url | size (MB) |
|:-----------|:---:|:---:|:---------:|
| Visual-only (Overlapped) | 16.9 | GoogleDrive or BaiduDrive (key: jh65) | 186 |
| Visual-only (Unseen) | 21.8 | GoogleDrive or BaiduDrive (key: n2gr) | 186 |
| Language models | - | GoogleDrive or BaiduDrive (key: 59u2) | 180 |
| Landmarks | - | GoogleDrive or BaiduDrive (key: bnm8) | 930 |
</details>

## Citation

If you use the AutoAVSR models or training code, please consider citing the following paper:

```bibtex
@inproceedings{ma2023auto,
  author={Ma, Pingchuan and Haliassos, Alexandros and Fernandez-Lopez, Adriana and Chen, Honglie and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels},
  year={2023},
}
```

If you use the VSR models for multiple languages, please consider citing the following paper:

```bibtex
@article{ma2022visual,
  title={{Visual Speech Recognition for Multiple Languages in the Wild}},
  author={Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
  journal={{Nature Machine Intelligence}},
  volume={4},
  pages={930--939},
  year={2022},
  url={https://doi.org/10.1038/s42256-022-00550-z},
  doi={10.1038/s42256-022-00550-z}
}
```

## License

Note that the code may only be used for comparative or benchmarking purposes. It is supplied under a license that permits non-commercial use only.

## Contact

[Pingchuan Ma](pingchuan.ma16[at]imperial.ac.uk)