Lipreading using Temporal Convolutional Networks

Authors

Pingchuan Ma, Brais Martinez, Yujiang Wang, Stavros Petridis, Jie Shen, Maja Pantic.

Update

2022-09-09: We have released our DC-TCN models, see here.

2021-06-09: We have released our official training code, see here.

2020-12-08: We have released our audio-only models, see here.

Content

Deep Lipreading

Model Zoo

Citation

License

Contact

Deep Lipreading

Introduction

This is the repository of Training Strategies For Improved Lip-reading, Towards Practical Lipreading with Distilled and Efficient Models, and Lipreading using Temporal Convolutional Networks. In this repository, we provide training code, pre-trained models, and network settings for end-to-end visual speech recognition (lipreading). We trained our model on LRW. The network architecture is based on a 3D convolutional front-end followed by ResNet-18 and MS-TCN.

<div align="center"><img src="doc/pipeline.png" width="640"/></div>

Using this repository, you can reach an accuracy of 89.6% on the LRW dataset. A script for feature extraction is also provided.

Preprocessing

As described in our paper, each video sequence from the LRW dataset is processed by 1) performing face detection and face alignment, 2) aligning each frame to a reference mean face shape, 3) cropping a fixed 96 × 96 pixel ROI from the aligned face image so that the mouth region is always roughly centred in the crop, and 4) converting the cropped image to grayscale.

You can run the pre-processing script provided in the preprocessing folder to extract the mouth ROIs.
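A typical invocation is sketched below. The flag names (--video-direc, --landmark-direc, --save-direc) are illustrative assumptions, so please check the argument parser of crop_mouth_from_video.py for the exact interface.

python preprocessing/crop_mouth_from_video.py --video-direc <LRW-DIRECTORY> \
                                              --landmark-direc <LANDMARK-DIRECTORY> \
                                              --save-direc <MOUTH-ROIS-DIRECTORY>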

<table style="display: inline-table;"> <tr><td><img src="doc/demo/original.gif" width="144"></td><td><img src="doc/demo/detected.gif" width="144"></td><td><img src="doc/demo/transformed.gif" width="144"></td><td><img src="doc/demo/cropped.gif" width="144"></td></tr> <tr><td>0. Original</td> <td>1. Detection</td> <td>2. Transformation</td> <td>3. Mouth ROIs</td> </tr> </table>

How to install environment

  1. Clone the repository into a directory. We refer to that directory as TCN_LIPREADING_ROOT.
git clone --recursive https://github.com/mpc001/Lipreading_using_Temporal_Convolutional_Networks.git
  2. Install all required packages.
pip install -r requirements.txt
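If you prefer an isolated setup, you can create a fresh virtual environment first; the environment name below is arbitrary and the commands are only a sketch.

python -m venv tcn_lipreading_env
source tcn_lipreading_env/bin/activate
pip install -r requirements.txt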

How to prepare dataset

  1. Download a pre-trained model from Model Zoo and put the model into the $TCN_LIPREADING_ROOT/models/ folder.

  2. For audio-only experiments, please pre-process the audio waveforms using the script extract_audio_from_video.py in the preprocessing folder and save them to $TCN_LIPREADING_ROOT/datasets/audio_data/ (see the example command after this list).

  3. For the VSR benchmarks reported in Table 1, please download our pre-computed landmarks from GoogleDrive or BaiduDrive (key: m00k) and unzip them into the $TCN_LIPREADING_ROOT/landmarks/ folder. Then pre-process the mouth ROIs using the script crop_mouth_from_video.py in the preprocessing folder and save them to $TCN_LIPREADING_ROOT/datasets/visual_data/.

  4. For the VSR benchmarks reported in Table 2, please download our pre-computed landmarks from GoogleDrive or BaiduDrive (key: kumy) and unzip them into the $TCN_LIPREADING_ROOT/landmarks/ folder. Then pre-process the mouth ROIs using the script crop_mouth_from_video.py in the legacy_preprocessing folder and save them to $TCN_LIPREADING_ROOT/datasets/visual_data/.
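For step 2, the audio extraction command might look roughly like the sketch below; the flag names (--video-direc, --save-direc) are assumptions, so please verify them against the argument parser of extract_audio_from_video.py.

python preprocessing/extract_audio_from_video.py --video-direc <LRW-DIRECTORY> \
                                                 --save-direc $TCN_LIPREADING_ROOT/datasets/audio_data/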

How to train

  1. Train a visual-only model.
CUDA_VISIBLE_DEVICES=0 python main.py --modality video \
                                      --config-path <MODEL-JSON-PATH> \
                                      --annonation-direc <ANNONATION-DIRECTORY> \
                                      --data-dir <MOUTH-ROIS-DIRECTORY>
  2. Train an audio-only model.
CUDA_VISIBLE_DEVICES=0 python main.py --modality audio \
                                      --config-path <MODEL-JSON-PATH> \
                                      --annonation-direc <ANNONATION-DIRECTORY> \
                                      --data-dir <AUDIO-WAVEFORMS-DIRECTORY>

We refer to the original LRW directory, which contains the timestamp files (.txt), as <ANNONATION-DIRECTORY>.
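For orientation, the timestamp files in the original LRW release are organised per word class and per split, roughly as follows (illustrative paths only):

<ANNONATION-DIRECTORY>/ABOUT/train/ABOUT_00001.txt
<ANNONATION-DIRECTORY>/ABOUT/val/ABOUT_00001.txt
<ANNONATION-DIRECTORY>/ABOUT/test/ABOUT_00001.txt
... (one folder per word class)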

  3. Resume from the last checkpoint.

You can pass the checkpoint path (.pth or .pth.tar) <CHECKPOINT-PATH> to the argument --model-path and set --init-epoch to 1 to resume training.
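For example, a resumed visual-only run combines the arguments described above as follows:

CUDA_VISIBLE_DEVICES=0 python main.py --modality video \
                                      --config-path <MODEL-JSON-PATH> \
                                      --annonation-direc <ANNONATION-DIRECTORY> \
                                      --data-dir <MOUTH-ROIS-DIRECTORY> \
                                      --model-path <CHECKPOINT-PATH> \
                                      --init-epoch 1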

How to test

You need to specify <ANNONATION-DIRECTORY> if you use a model that utilises word-boundary indicators.

  1. Evaluate the visual-only performance (lipreading).
CUDA_VISIBLE_DEVICES=0 python main.py --modality video \
                                      --config-path <MODEL-JSON-PATH> \
                                      --model-path <MODEL-PATH> \
                                      --data-dir <MOUTH-ROIS-DIRECTORY> \
                                      --test
  2. Evaluate the audio-only performance.
CUDA_VISIBLE_DEVICES=0 python main.py --modality audio \
                                      --config-path <MODEL-JSON-PATH> \
                                      --model-path <MODEL-PATH> \
                                      --data-dir <AUDIO-WAVEFORMS-DIRECTORY> \
                                      --test
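For a model that uses word-boundary indicators, also pass the annotation directory when testing, for example:

CUDA_VISIBLE_DEVICES=0 python main.py --modality video \
                                      --config-path <MODEL-JSON-PATH> \
                                      --model-path <MODEL-PATH> \
                                      --annonation-direc <ANNONATION-DIRECTORY> \
                                      --data-dir <MOUTH-ROIS-DIRECTORY> \
                                      --test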

How to extract embeddings

We assume you have cropped the mouth patches and put them into <MOUTH-PATCH-PATH>. The mouth embeddings will be saved in the .npz format.

CUDA_VISIBLE_DEVICES=0 python main.py --modality video \
                                      --extract-feats \
                                      --config-path <MODEL-JSON-PATH> \
                                      --model-path <MODEL-PATH> \
                                      --mouth-patch-path <MOUTH-PATCH-PATH> \
                                      --mouth-embedding-out-path <OUTPUT-PATH>
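To sanity-check the output, you can list the arrays stored in a saved file; the file name below is a placeholder and the array keys depend on the saving script.

python -c "import numpy as np; d = np.load('<SAVED-NPZ-FILE>'); print({k: v.shape for k, v in d.items()})"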

Model Zoo

<details open> <summary>Table 1. Results of the audio-only and visual-only models on LRW. Mouth patches and audio waveforms are extracted in the <a href="https://github.com/mpc001/Lip-reading-with-Densely-Connected-Temporal-Convolutional-Networks/tree/main/preprocessing">preprocessing</a> folder.</summary> <p> </p>

| Architecture | Acc. | url | size (MB) |
|:---|:---:|:---|:---:|
| **Audio-only** | | | |
| resnet18_dctcn_audio_boundary | 99.2 | GoogleDrive or BaiduDrive (key: w3jh) | 173 |
| resnet18_dctcn_audio | 99.1 | GoogleDrive or BaiduDrive (key: hw8e) | 173 |
| resnet18_mstcn_audio | 98.9 | GoogleDrive or BaiduDrive (key: bnhd) | 111 |
| **Visual-only** | | | |
| resnet18_dctcn_video_boundary | 92.1 | GoogleDrive or BaiduDrive (key: jb7l) | 201 |
| resnet18_dctcn_video | 89.6 | GoogleDrive or BaiduDrive (key: f3hd) | 201 |
| resnet18_mstcn_video | 88.9 | GoogleDrive or BaiduDrive (key: 0l63) | 139 |
</details> <details open> <summary>Table 2. Results of the visual-only models on LRW. Mouth patches are extracted in the <a href="https://github.com/mpc001/Lip-reading-with-Densely-Connected-Temporal-Convolutional-Networks/tree/main/legacy_preprocessing">legacy_preprocessing</a> folder.</summary> <p> </p>

| Architecture | Acc. | url | size (MB) |
|:---|:---:|:---|:---:|
| **Visual-only** | | | |
| snv1x_dsmstcn3x | 85.3 | GoogleDrive or BaiduDrive (key: 86s4) | 36 |
| snv1x_tcn2x | 84.6 | GoogleDrive or BaiduDrive (key: f79d) | 35 |
| snv1x_tcn1x | 82.7 | GoogleDrive or BaiduDrive (key: 3caa) | 15 |
| snv05x_tcn2x | 82.5 | GoogleDrive or BaiduDrive (key: ej9e) | 32 |
| snv05x_tcn1x | 79.9 | GoogleDrive or BaiduDrive (key: devg) | 11 |
</details>

Citation

If you find this code useful in your research, please consider citing the following papers:

@INPROCEEDINGS{ma2022training,
  author={Ma, Pingchuan and Wang, Yujiang and Petridis, Stavros and Shen, Jie and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Training Strategies for Improved Lip-Reading},
  year={2022},
  pages={8472-8476},
  doi={10.1109/ICASSP43922.2022.9746706}
}

@INPROCEEDINGS{ma2021lip,
  title={Lip-reading with densely connected temporal convolutional networks},
  author={Ma, Pingchuan and Wang, Yujiang and Shen, Jie and Petridis, Stavros and Pantic, Maja},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  pages={2857-2866},
  year={2021},
  doi={10.1109/WACV48630.2021.00290}
}

@INPROCEEDINGS{ma2020towards,
  author={Ma, Pingchuan and Martinez, Brais and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Towards Practical Lipreading with Distilled and Efficient Models},
  year={2021},
  pages={7608-7612},
  doi={10.1109/ICASSP39728.2021.9415063}
}

@INPROCEEDINGS{martinez2020lipreading,
  author={Martinez, Brais and Ma, Pingchuan and Petridis, Stavros and Pantic, Maja},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={Lipreading Using Temporal Convolutional Networks},
  year={2020},
  pages={6319-6323},
  doi={10.1109/ICASSP40776.2020.9053841}
}

License

Please note that this code may only be used for comparative or benchmarking purposes. Code supplied under the License may be used for non-commercial purposes only.

Contact

[Pingchuan Ma](pingchuan.ma16[at]imperial.ac.uk)