Auto-AVSR: Lip-Reading Sentences Project

Update

2023-07-26: We released the implementation of Real-Time AV-ASR.

Introduction

This repository is an open-source framework for speech recognition, with a primary focus on visual speech (lip-reading). It is designed for end-to-end training and aims to deliver state-of-the-art models and reproducible results on audio-visual speech benchmarks.

<div align="center"><img src="doc/pipeline.png" width="640"/></div>

By using this repository, you can achieve a word error rate (WER) of 20.3% for visual speech recognition (VSR) and 1.0% for audio speech recognition (ASR) on LRS3.

Setup

1. Set up the environment:

   ```
   conda create -y -n auto_avsr python=3.8
   conda activate auto_avsr
   ```

2. Clone the repository:

   ```
   git clone https://github.com/mpc001/auto_avsr
   cd auto_avsr
   ```

3. Install fairseq within the repository:

   ```
   git clone https://github.com/pytorch/fairseq
   cd fairseq
   pip install --editable ./
   cd ..
   ```

4. Install PyTorch (tested with v2.0.1) and other packages:

   ```
   pip install torch torchvision torchaudio
   pip install pytorch-lightning==1.5.10
   pip install sentencepiece
   pip install av
   pip install hydra-core --upgrade
   ```

5. Install ffmpeg (a quick way to verify the installation is sketched after this list):

   ```
   conda install "ffmpeg<5" -c conda-forge
   ```

6. Prepare the dataset. See the instructions in the preparation folder.
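
After completing the installation steps above, a quick way to verify the environment is the optional check below. It is not part of the original instructions; it simply confirms that the installed packages import and that ffmpeg is on the PATH.

```
# Optional sanity check: confirm the core Python packages import and ffmpeg is available.
python -c "import torch, torchvision, torchaudio; print(torch.__version__)"
python -c "import pytorch_lightning, sentencepiece, av, hydra; print('imports ok')"
ffmpeg -version
```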

Training

```
python train.py exp_dir=[exp_dir] \
                exp_name=[exp_name] \
                data.modality=[modality] \
                data.dataset.root_dir=[root_dir] \
                data.dataset.train_file=[train_file] \
                trainer.num_nodes=[num_nodes]
```
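
As a concrete illustration, a single-node, visual-only training run might look like the sketch below. The experiment name, paths, and label file are hypothetical placeholders, and `video` is assumed to be the value that selects the visual modality; adjust these to your own setup.

```
# Hypothetical example: single-node VSR training on LRS3.
# The experiment directory, experiment name, dataset root, and label file are placeholders;
# "video" is assumed to be the modality value for visual-only training.
python train.py exp_dir=./exp \
                exp_name=vsr_lrs3_base \
                data.modality=video \
                data.dataset.root_dir=/path/to/lrs3 \
                data.dataset.train_file=train.csv \
                trainer.num_nodes=1
```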

Testing

```
python eval.py data.modality=[modality] \
               data.dataset.root_dir=[root_dir] \
               data.dataset.test_file=[test_file] \
               pretrained_model_path=[pretrained_model_path]
```
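
For instance, evaluating one of the visual-only checkpoints from the model zoo below could look like the following sketch. The dataset root and test file name are placeholders, and `video` is assumed to select the visual modality.

```
# Hypothetical example: evaluate a downloaded VSR checkpoint on an LRS3 test split.
# The dataset root and test file are placeholders.
python eval.py data.modality=video \
               data.dataset.root_dir=/path/to/lrs3 \
               data.dataset.test_file=test.csv \
               pretrained_model_path=./vsr_trlrwlrs2lrs3vox2avsp_base.pth
```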

Demo

Want to see how our ASR/VSR models perform on your own audio/video? Just run this command:

```
python demo.py data.modality=[modality] \
               pretrained_model_path=[pretrained_model_path] \
               file_path=[file_path]
```
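
For example, running the demo with a visual-only checkpoint on a local clip might look like the sketch below; the video path is a placeholder and `video` is assumed to be the visual-modality value.

```
# Hypothetical example: run the demo with a VSR checkpoint on a local video file.
python demo.py data.modality=video \
               pretrained_model_path=./vsr_trlrwlrs2lrs3vox2avsp_base.pth \
               file_path=./clip.mp4
```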

Model zoo

We provide audio-only, visual-only, and audio-visual models for LRS3.

<details open> <summary>LRS3</summary>

| Model | Training data (h) | WER [%] | MD5 |
|-------|------------------:|--------:|-----|
| vsr_trlrs3_23h_base.pth | 23 | 96.6 | 50c88 |
| vsr_trlrs3_base.pth | 438 | 36.7 | ea3ec |
| vsr_trlrs3vox2_base.pth | 1759 | 25.0 | 0a126 |
| vsr_trlrwlrs2lrs3vox2avsp_base.pth | 3448 | 20.3 | a896f |
| asr_trlrs3_23h_base.pth | 23 | 72.5 | 87d45 |
| asr_trlrs3_base.pth | 438 | 2.04 | 4fa87 |
| asr_trlrs3vox2_base.pth | 1759 | 1.07 | 7beab |
| asr_trlrwlrs2lrs3vox2avsp_base.pth | 3448 | 0.99 | dc759 |
| avsr_trlrwlrs2lrs3vox2avsp_base.pth | 3448 | 0.93 | 6b3c5 |
</details>
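
The MD5 column appears to list a short prefix of each checkpoint's hash; this interpretation is an assumption. To check a downloaded checkpoint against it, compare the leading characters of the computed hash with the table entry:

```
# Compute the checkpoint's MD5 and compare its leading characters with the table above.
# Assumes the MD5 column lists a prefix of the full hash.
md5sum vsr_trlrwlrs2lrs3vox2avsp_base.pth
```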

Citation

If you find this repository helpful, please consider citing our work:

```
@inproceedings{ma2023auto,
  author={Ma, Pingchuan and Haliassos, Alexandros and Fernandez-Lopez, Adriana and Chen, Honglie and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels},
  year={2023},
  pages={1-5},
  doi={10.1109/ICASSP49357.2023.10096889}
}
```

Acknowledgement

This repository is built using the espnet, fairseq, raven and avhubert repositories.

License

Code is Apache 2.0 licensed. The pre-trained models provided in this repository may have their own licenses or terms and conditions derived from the datasets used for training.

Contact

Contributions are welcome; feel free to create a PR or email me:

[Pingchuan Ma](pingchuan.ma16[at]imperial.ac.uk)