# AV-RelScore
This code is part of the paper: Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring accepted at CVPR 2023.
## Overview
This repository provides the audio-visual corruption modeling code for testing audio-visual speech recognition on the LRS2 and LRS3 datasets. The video demo is available here.
## Prerequisite
- Python >= 3.6
- Clone this repository.
- Install Python requirements.

  ```shell
  pip install -r requirements.txt
  ```
- Download the LRS2-BBC and LRS3-TED datasets.
- Download the landmarks of LRS2 and LRS3 from this repository.
- Download `coco_object.7z` from here, extract it, and put `object_image_sr` and `object_mask_x4` in the `occlusion_patch` folder.
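
After these steps, the occlusion resources should be laid out as follows (a sketch of the expected structure, inferred from the instructions above):

```
occlusion_patch/
├── object_image_sr/
└── object_mask_x4/
```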
## Audio-Visual corruption modeling
- We utilize babble noise from NOISEX-92 for the audio corruption modeling.
- The occlusion patches for the visual corruption modeling are provided by this paper.
- Please create separate audio (.wav) files from the LRS2 and LRS3 video datasets (e.g., with ffmpeg, as sketched below).
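
A minimal sketch of the .wav extraction step, assuming ffmpeg is installed; the helper name, paths, and the mono 16 kHz format are illustrative choices, not requirements of this repository:

```python
# Hypothetical helper: dump each clip's audio track to a mono 16 kHz .wav file.
import subprocess
from pathlib import Path

def extract_wavs(video_root: str, wav_root: str, sr: int = 16000) -> None:
    for mp4 in Path(video_root).rglob("*.mp4"):
        # Mirror the dataset's directory structure under wav_root.
        wav = Path(wav_root) / mp4.relative_to(video_root).with_suffix(".wav")
        wav.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(mp4), "-vn", "-ac", "1", "-ar", str(sr), str(wav)],
            check=True,
        )

extract_wavs("LRS2/main", "LRS2_wav/main")  # illustrative paths
```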
### Audio corruption modeling
- LRS2

  ```shell
  python LRS2_audio_gen.py --split_file <SPLIT-FILENAME-PATH> \
                           --LRS2_main_dir <DATA-DIRECTORY-PATH> \
                           --LRS2_save_loc <OUTPUT-DIRECTORY-PATH> \
                           --babble_noise <BABBLE-NOISE-LOCATION>
  ```
- LRS3

  ```shell
  python LRS3_audio_gen.py --split_file <SPLIT-FILENAME-PATH> \
                           --LRS3_test_dir <DATA-DIRECTORY-PATH> \
                           --LRS3_save_loc <OUTPUT-DIRECTORY-PATH> \
                           --babble_noise <BABBLE-NOISE-LOCATION>
  ```
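
Conceptually, audio corruption amounts to mixing babble noise into the clean speech at a target signal-to-noise ratio. The sketch below illustrates that idea only; the function name, the use of soundfile/NumPy, and the SNR handling are assumptions, not the exact logic of `LRS2_audio_gen.py` / `LRS3_audio_gen.py`:

```python
# Illustrative SNR-controlled babble-noise mixing (assumes mono .wav inputs).
import numpy as np
import soundfile as sf

def mix_babble(speech_path: str, babble_path: str, out_path: str, snr_db: float = 0.0) -> None:
    speech, sr = sf.read(speech_path)
    babble, _ = sf.read(babble_path)
    # Loop the babble noise if it is shorter than the speech, then crop.
    reps = int(np.ceil(len(speech) / len(babble)))
    babble = np.tile(babble, reps)[: len(speech)]
    # Scale the noise so that the speech-to-noise power ratio matches snr_db.
    p_speech = np.mean(speech ** 2)
    p_babble = np.mean(babble ** 2)
    scale = np.sqrt(p_speech / (p_babble * 10 ** (snr_db / 10)))
    sf.write(out_path, speech + scale * babble, sr)

mix_babble("clip.wav", "babble.wav", "clip_snr0.wav", snr_db=0.0)  # illustrative paths
```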
### Visual corruption modeling
- LRS2

  ```shell
  python LRS2_gen.py --split_file <SPLIT-FILENAME-PATH> \
                     --LRS2_main_dir <DATA-DIRECTORY-PATH> \
                     --LRS2_landmark_dir <LANDMARK-DIRECTORY-PATH> \
                     --LRS2_save_loc <OUTPUT-DIRECTORY-PATH> \
                     --occlusion <OCCLUSION-LOCATION> \
                     --occlusion_mask <OCCLUSION-MASK-LOCATION>
  ```
- LRS3

  ```shell
  python LRS3_gen.py --split_file <SPLIT-FILENAME-PATH> \
                     --LRS3_test_dir <DATA-DIRECTORY-PATH> \
                     --LRS3_landmark_dir <LANDMARK-DIRECTORY-PATH> \
                     --LRS3_save_loc <OUTPUT-DIRECTORY-PATH> \
                     --occlusion <OCCLUSION-LOCATION> \
                     --occlusion_mask <OCCLUSION-MASK-LOCATION>
  ```
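
At its core, visual corruption alpha-composites an occlusion patch (with its mask) onto each frame around a landmark such as the mouth center. The sketch below illustrates the idea under that assumption; all names and the compositing details are ours, not the exact code of `LRS2_gen.py` / `LRS3_gen.py`:

```python
# Illustrative landmark-guided patch occlusion for a single video frame.
import numpy as np

def occlude_frame(frame: np.ndarray, patch: np.ndarray, mask: np.ndarray,
                  center_xy: tuple) -> np.ndarray:
    """frame: HxWx3 uint8, patch: hxwx3 uint8, mask: hxw floats in [0, 1],
    center_xy: (x, y) landmark around which the patch is pasted."""
    h, w = patch.shape[:2]
    x0 = int(center_xy[0]) - w // 2
    y0 = int(center_xy[1]) - h // 2
    # Clip the paste region to the frame boundaries.
    fx0, fy0 = max(x0, 0), max(y0, 0)
    fx1 = min(x0 + w, frame.shape[1])
    fy1 = min(y0 + h, frame.shape[0])
    px0, py0 = fx0 - x0, fy0 - y0
    px1, py1 = px0 + (fx1 - fx0), py0 + (fy1 - fy0)
    # Alpha-composite: mask=1 shows the patch, mask=0 keeps the frame.
    region = frame[fy0:fy1, fx0:fx1].astype(np.float32)
    p = patch[py0:py1, px0:px1].astype(np.float32)
    m = mask[py0:py1, px0:px1, None]
    frame[fy0:fy1, fx0:fx1] = (m * p + (1 - m) * region).astype(np.uint8)
    return frame
```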
## Test datasets
Note that the extracted corrupted data may differ from the actual corrupted test datasets used in our experiments: the audio-visual corruption modeling relies on random functions, so it may not produce identical outputs on all devices.
For fair comparisons, please contact us (joanna2587@kaist.ac.kr) to request the actual test datasets.
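
If exact reproduction of our test sets is not required and you only want your own runs to be self-consistent, one option is to pin the global random seeds near the top of the generation scripts. A minimal sketch; the scripts do not necessarily expose a seed option, so this is an assumption about where such a change would go:

```python
# Hypothetical snippet for the top of a generation script (e.g., LRS2_gen.py).
import random
import numpy as np

SEED = 1234           # illustrative value
random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy's global RNG
```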
## Acknowledgement
We refer to Visual Speech Recognition for Multiple Languages for the landmarks of the datasets and Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets for the visual occlusion patches. We thank the authors for their amazing work.
## Citation
If you find our AV-RelScore useful in your research, please cite our paper:
```bibtex
@article{hong2023watch,
  title={Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring},
  author={Hong, Joanna and Kim, Minsu and Choi, Jeongsoo and Ro, Yong Man},
  journal={arXiv preprint arXiv:2303.08536},
  year={2023}
}
```