
(B)RAVEn: A PyTorch Lightning Implementation

Introduction

We provide code for reproducing the main results of "Jointly Learning Visual and Auditory Speech Representations from Raw Data" (RAVEn) and "BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition". Our implementation is based on PyTorch Lightning.

Preparation

Installation

Create the conda environment:

    conda env create -f environment.yml

Change the environment prefix to match the location of miniconda3, if necessary.
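
Alternatively, the prefix can be passed on the command line rather than edited in environment.yml (the target path below is only an example):

    conda env create -f environment.yml -p /path/to/miniconda3/envs/raven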

Data

  1. Download the datasets used in the paper (LRS3, VoxCeleb2, and AVSpeech).
  2. Compute 68 facial landmarks per frame using, e.g., RetinaFace and 2-D FAN, or download them, e.g., from this repo. Each landmark file should have the same name as its corresponding video (except that it ends in .npy).
  3. Crop the mouths with the following command (a minimal check that videos and landmark files are paired correctly is sketched after this list):
    python preprocessing/extract_mouths.py --src_dir ${SOURCE_DIR} --tgt_dir ${TARGET_DIR} --landmarks_dir ${LANDMARKS_DIR}
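
A minimal sketch of this pairing check, assuming the layout described above (one .npy landmark file per video, same relative path and stem); the directory names and the .mp4 extension are assumptions:

    from pathlib import Path

    import numpy as np

    # Hypothetical directories; use the same ones passed to extract_mouths.py.
    src_dir = Path("data/videos")
    landmarks_dir = Path("data/landmarks")

    for video in sorted(src_dir.rglob("*.mp4")):
        # Same relative path and name as the video, but ending in .npy.
        lm_path = (landmarks_dir / video.relative_to(src_dir)).with_suffix(".npy")
        if not lm_path.exists():
            print(f"missing landmarks for {video}")
            continue
        landmarks = np.load(lm_path, allow_pickle=True)
        print(video.name, len(landmarks), "frames of 68-point landmarks")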
    

RAVEn pre-trained models

Below are the checkpoints of the Base and Large models pre-trained with RAVEn on LRS3+Vox2-en.

| Model | Modality | Checkpoint |
| --- | --- | --- |
| Base | Video | Download |
| Base | Audio | Download |
| Large | Video | Download |
| Large | Audio | Download |

BRAVEn pre-trained models

Below are the checkpoints of the Base, Base+, and Large models pre-trained with BRAVEn.

| Model | Modality | Checkpoint |
| --- | --- | --- |
| Base (LRS3) | Video | Download |
| Base (LRS3) | Audio | Download |
| Base+ (LRS3+Vox2) | Video | Download |
| Base+ (LRS3+Vox2) | Audio | Download |
| Large (LRS3+Vox2+AVS) | Video | Download |
| Large (LRS3+Vox2+AVS) | Audio | Download |
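
Below is a minimal sketch of how a downloaded checkpoint might be inspected, assuming the files are standard PyTorch/Lightning checkpoints; the filename is hypothetical:

    import torch

    # Load a downloaded checkpoint on CPU. Lightning checkpoints typically nest
    # the weights under "state_dict"; fall back to the raw object otherwise.
    ckpt = torch.load("raven_base_video.ckpt", map_location="cpu")  # hypothetical filename
    state_dict = ckpt.get("state_dict", ckpt)

    # Print a few parameter names and shapes to verify the download.
    for name, tensor in list(state_dict.items())[:5]:
        print(name, tuple(tensor.shape))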

Testing

WERs are reported for VSR (visual speech recognition) and ASR (auditory speech recognition) models fine-tuned in the low-resource setting (LRS3 trainval, 30 hours of labeled data) and the high-resource setting (full LRS3, 433 hours).

VSR

RAVEn low-resource

| Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
| --- | --- | --- | --- | --- |
| Base | LRS3 | 47.0 | Download | scripts/vsr/lrs3_trainval/base_lrs3.sh |
| Base | LRS3+Vox2-en | 40.2 | Download | scripts/vsr/lrs3_trainval/base_lrs3vox2.sh |
| Large | LRS3+Vox2-en | 32.5 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2.sh |
| Large w/ ST | LRS3+Vox2-en | 24.8 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2_self.sh |
| Large w/ ST + LM | LRS3+Vox2-en | 23.8 | same as row above | scripts/vsr/lrs3_trainval/large_lrs3vox2_self_lm.sh |
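
The scripts in the last column of these tables ship with the repository; presumably each evaluation is launched by running its script from the repository root, e.g. as below (any arguments, such as dataset or checkpoint paths, depend on the individual script):

    bash scripts/vsr/lrs3_trainval/base_lrs3.sh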

BRAVEn low-resource

| Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
| --- | --- | --- | --- | --- |
| Base | LRS3 | 43.4 | Download | scripts/vsr/lrs3_trainval/base_lrs3_braven.sh |
| Base Plus | LRS3+Vox2-en | 35.1 | Download | scripts/vsr/lrs3_trainval/baseplus_lrs3vox2_braven.sh |
| Large | LRS3+Vox2-en | 30.8 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2_braven.sh |
| Large | LRS3+Vox2-en+AVS | 24.8 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2avs_braven.sh |
| Large w/ ST | LRS3+Vox2-en+AVS | 21.3 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2avs_self_braven.sh |
| Large w/ ST + LM | LRS3+Vox2-en+AVS | 20.0 | same as row above | scripts/vsr/lrs3_trainval/large_lrs3vox2avs_self_lm_braven.sh |

RAVEn high-resource

| Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
| --- | --- | --- | --- | --- |
| Base | LRS3 | 39.1 | Download | scripts/vsr/lrs3/base_lrs3.sh |
| Base | LRS3+Vox2-en | 33.1 | Download | scripts/vsr/lrs3/base_lrs3vox2.sh |
| Large | LRS3+Vox2-en | 27.8 | Download | scripts/vsr/lrs3/large_lrs3vox2.sh |
| Large w/ ST | LRS3+Vox2-en | 24.4 | Download | scripts/vsr/lrs3/large_lrs3vox2_self.sh |
| Large w/ ST + LM | LRS3+Vox2-en | 23.1 | same as row above | scripts/vsr/lrs3/large_lrs3vox2_self_lm.sh |

BRAVEn high-resource

| Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
| --- | --- | --- | --- | --- |
| Base | LRS3 | 36.0 | Download | scripts/vsr/lrs3/base_lrs3_braven.sh |
| Base Plus | LRS3+Vox2-en | 28.8 | Download | scripts/vsr/lrs3/baseplus_lrs3vox2_braven.sh |
| Large | LRS3+Vox2-en | 26.6 | Download | scripts/vsr/lrs3/large_lrs3vox2_braven.sh |
| Large | LRS3+Vox2-en+AVS | 23.6 | Download | scripts/vsr/lrs3/large_lrs3vox2avs_braven.sh |
| Large w/ ST | LRS3+Vox2-en+AVS | 20.9 | Download | scripts/vsr/lrs3/large_lrs3vox2avs_self_braven.sh |
| Large w/ ST + LM | LRS3+Vox2-en+AVS | 20.1 | same as row above | scripts/vsr/lrs3/large_lrs3vox2avs_self_lm_braven.sh |

ASR

RAVEn low-resource

| Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
| --- | --- | --- | --- | --- |
| Base | LRS3 | 4.7 | Download | scripts/asr/lrs3_trainval/base_lrs3.sh |
| Base | LRS3+Vox2-en | 3.8 | Download | scripts/asr/lrs3_trainval/base_lrs3vox2.sh |
| Large | LRS3+Vox2-en | 2.7 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2.sh |
| Large w/ ST | LRS3+Vox2-en | 2.3 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2_self.sh |
| Large w/ ST + LM | LRS3+Vox2-en | 1.9 | same as row above | scripts/asr/lrs3_trainval/large_lrs3vox2_self_lm.sh |

BRAVEn low-resource

| Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
| --- | --- | --- | --- | --- |
| Base | LRS3 | 4.0 | Download | scripts/asr/lrs3_trainval/base_lrs3_braven.sh |
| Base Plus | LRS3+Vox2-en | 3.0 | Download | scripts/asr/lrs3_trainval/baseplus_lrs3vox2_braven.sh |
| Large | LRS3+Vox2-en | 2.3 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2_braven.sh |
| Large | LRS3+Vox2-en+AVS | 2.1 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2avs_braven.sh |
| Large w/ ST | LRS3+Vox2-en+AVS | 1.9 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2avs_self_braven.sh |
| Large w/ ST + LM | LRS3+Vox2-en+AVS | 1.7 | same as row above | scripts/asr/lrs3_trainval/large_lrs3vox2avs_self_lm_braven.sh |

RAVEn high-resource

| Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
| --- | --- | --- | --- | --- |
| Base | LRS3 | 2.2 | Download | scripts/asr/lrs3/base_lrs3.sh |
| Base | LRS3+Vox2-en | 1.9 | Download | scripts/asr/lrs3/base_lrs3vox2.sh |
| Large | LRS3+Vox2-en | 1.4 | Download | scripts/asr/lrs3/large_lrs3vox2.sh |
| Large w/ ST | LRS3+Vox2-en | 1.4 | Download | scripts/asr/lrs3/large_lrs3vox2_self.sh |
| Large w/ ST + LM | LRS3+Vox2-en | 1.4 | same as row above | scripts/asr/lrs3/large_lrs3vox2_self_lm.sh |

BRAVEn high-resource

| Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
| --- | --- | --- | --- | --- |
| Base | LRS3 | 1.9 | Download | scripts/asr/lrs3/base_lrs3_braven.sh |
| Base Plus | LRS3+Vox2-en | 1.4 | Download | scripts/asr/lrs3/baseplus_lrs3vox2_braven.sh |
| Large | LRS3+Vox2-en | 1.2 | Download | scripts/asr/lrs3/large_lrs3vox2_braven.sh |
| Large | LRS3+Vox2-en+AVS | 1.2 | Download | scripts/asr/lrs3/large_lrs3vox2avs_braven.sh |
| Large w/ ST | LRS3+Vox2-en+AVS | 1.2 | Download | scripts/asr/lrs3/large_lrs3vox2avs_self_braven.sh |
| Large w/ ST + LM | LRS3+Vox2-en+AVS | 1.1 | same as row above | scripts/asr/lrs3/large_lrs3vox2avs_self_lm_braven.sh |
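
All numbers above are word error rates (WER): the word-level edit distance between predicted and reference transcriptions, divided by the number of reference words. A minimal illustrative implementation (not the repository's evaluation code):

    import numpy as np

    def wer(reference: str, hypothesis: str) -> float:
        # Word-level Levenshtein distance between hypothesis and reference,
        # normalized by the number of reference words (assumed non-empty).
        ref, hyp = reference.split(), hypothesis.split()
        d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
        d[:, 0] = np.arange(len(ref) + 1)  # deletions
        d[0, :] = np.arange(len(hyp) + 1)  # insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i, j] = min(substitution, d[i - 1, j] + 1, d[i, j - 1] + 1)
        return float(d[len(ref), len(hyp)]) / len(ref)

    print(wer("the cat sat", "the cat sat down"))  # 0.333... -> 33.3% WER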

Code for pre-training and fine-tuning coming soon...

Citation

If you find this repo useful for your research, please consider citing the following:

@article{haliassos2022jointly,
  title={Jointly Learning Visual and Auditory Speech Representations from Raw Data},
  author={Haliassos, Alexandros and Ma, Pingchuan and Mira, Rodrigo and Petridis, Stavros and Pantic, Maja},
  journal={arXiv preprint arXiv:2212.06246},
  year={2022}
}
@inproceedings{haliassos2024braven,
  title={BRAVEn: Improving Self-supervised pre-training for Visual and Auditory Speech Recognition},
  author={Haliassos, Alexandros and Zinonos, Andreas and Mira, Rodrigo and Petridis, Stavros and Pantic, Maja},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={11431--11435},
  year={2024},
  organization={IEEE}
}