# AV-RelScore
This code is part of the paper: Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring accepted at CVPR 2023.
## Overview
This repository provides the audio-visual corruption modeling code for testing audio-visual speech recognition on the LRS2 and LRS3 datasets. The video demo is available here.
## Prerequisite
- Python >= 3.6
- Clone this repository.
- Install Python requirements.

  ```shell
  pip install -r requirements.txt
  ```
- Download the LRS2-BBC and LRS3-TED datasets.
- Download the landmarks of LRS2 and LRS3 from this repository.
- Download `coco_object.7z` from here, extract it, and put `object_image_sr` and `object_mask_x4` in the `occlusion_patch` folder.
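
After these steps, the occlusion resources should be laid out as follows (a sketch of the expected structure, inferred from the instructions above):

```
occlusion_patch/
├── object_image_sr/
└── object_mask_x4/
```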
## Audio-Visual corruption modeling
- We utilize babble noise from NOISEX-92 for the audio corruption modeling.
- The occlusion patches for the visual corruption modeling are provided by this paper.
- Please create separate audio (.wav) files from the LRS2 and LRS3 video datasets (e.g., with ffmpeg, as sketched below).
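
A minimal sketch of the .wav extraction step, assuming ffmpeg is installed; the helper name, paths, and the mono 16 kHz format are illustrative choices, not requirements of this repository:

```python
# Hypothetical helper: dump each clip's audio track to a mono 16 kHz .wav file.
import subprocess
from pathlib import Path

def extract_wavs(video_root: str, wav_root: str, sr: int = 16000) -> None:
    for mp4 in Path(video_root).rglob("*.mp4"):
        # Mirror the dataset's directory structure under wav_root.
        wav = Path(wav_root) / mp4.relative_to(video_root).with_suffix(".wav")
        wav.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(mp4), "-vn", "-ac", "1", "-ar", str(sr), str(wav)],
            check=True,
        )

extract_wavs("LRS2/main", "LRS2_wav/main")  # illustrative paths
```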
### Audio corruption modeling
- LRS2

  ```shell
  python LRS2_audio_gen.py --split_file <SPLIT-FILENAME-PATH> \
                           --LRS2_main_dir <DATA-DIRECTORY-PATH> \
                           --LRS2_save_loc <OUTPUT-DIRECTORY-PATH> \
                           --babble_noise <BABBLE-NOISE-LOCATION>
  ```
- LRS3

  ```shell
  python LRS3_audio_gen.py --split_file <SPLIT-FILENAME-PATH> \
                           --LRS3_test_dir <DATA-DIRECTORY-PATH> \
                           --LRS3_save_loc <OUTPUT-DIRECTORY-PATH> \
                           --babble_noise <BABBLE-NOISE-LOCATION>
  ```
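
Conceptually, audio corruption amounts to mixing babble noise into the clean speech at a target signal-to-noise ratio. The sketch below illustrates that idea only; the function name, the use of soundfile/NumPy, and the SNR handling are assumptions, not the exact logic of `LRS2_audio_gen.py` / `LRS3_audio_gen.py`:

```python
# Illustrative SNR-controlled babble-noise mixing (assumes mono .wav inputs).
import numpy as np
import soundfile as sf

def mix_babble(speech_path: str, babble_path: str, out_path: str, snr_db: float = 0.0) -> None:
    speech, sr = sf.read(speech_path)
    babble, _ = sf.read(babble_path)
    # Loop the babble noise if it is shorter than the speech, then crop.
    reps = int(np.ceil(len(speech) / len(babble)))
    babble = np.tile(babble, reps)[: len(speech)]
    # Scale the noise so that the speech-to-noise power ratio matches snr_db.
    p_speech = np.mean(speech ** 2)
    p_babble = np.mean(babble ** 2)
    scale = np.sqrt(p_speech / (p_babble * 10 ** (snr_db / 10)))
    sf.write(out_path, speech + scale * babble, sr)

mix_babble("clip.wav", "babble.wav", "clip_snr0.wav", snr_db=0.0)  # illustrative paths
```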
### Visual corruption modeling
- LRS2

  ```shell
  python LRS2_gen.py --split_file <SPLIT-FILENAME-PATH> \
                     --LRS2_main_dir <DATA-DIRECTORY-PATH> \
                     --LRS2_landmark_dir <LANDMARK-DIRECTORY-PATH> \
                     --LRS2_save_loc <OUTPUT-DIRECTORY-PATH> \
                     --occlusion <OCCLUSION-LOCATION> \
                     --occlusion_mask <OCCLUSION-MASK-LOCATION>
  ```
- LRS3

  ```shell
  python LRS3_gen.py --split_file <SPLIT-FILENAME-PATH> \
                     --LRS3_test_dir <DATA-DIRECTORY-PATH> \
                     --LRS3_landmark_dir <LANDMARK-DIRECTORY-PATH> \
                     --LRS3_save_loc <OUTPUT-DIRECTORY-PATH> \
                     --occlusion <OCCLUSION-LOCATION> \
                     --occlusion_mask <OCCLUSION-MASK-LOCATION>
  ```
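
At its core, visual corruption alpha-composites an occlusion patch (with its mask) onto each frame around a landmark such as the mouth center. The sketch below illustrates the idea under that assumption; all names and the compositing details are ours, not the exact code of `LRS2_gen.py` / `LRS3_gen.py`:

```python
# Illustrative landmark-guided patch occlusion for a single video frame.
import numpy as np

def occlude_frame(frame: np.ndarray, patch: np.ndarray, mask: np.ndarray,
                  center_xy: tuple) -> np.ndarray:
    """frame: HxWx3 uint8, patch: hxwx3 uint8, mask: hxw floats in [0, 1],
    center_xy: (x, y) landmark around which the patch is pasted."""
    h, w = patch.shape[:2]
    x0 = int(center_xy[0]) - w // 2
    y0 = int(center_xy[1]) - h // 2
    # Clip the paste region to the frame boundaries.
    fx0, fy0 = max(x0, 0), max(y0, 0)
    fx1 = min(x0 + w, frame.shape[1])
    fy1 = min(y0 + h, frame.shape[0])
    px0, py0 = fx0 - x0, fy0 - y0
    px1, py1 = px0 + (fx1 - fx0), py0 + (fy1 - fy0)
    # Alpha-composite: mask=1 shows the patch, mask=0 keeps the frame.
    region = frame[fy0:fy1, fx0:fx1].astype(np.float32)
    p = patch[py0:py1, px0:px1].astype(np.float32)
    m = mask[py0:py1, px0:px1, None]
    frame[fy0:fy1, fx0:fx1] = (m * p + (1 - m) * region).astype(np.uint8)
    return frame
```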
## Test datasets
Note that the extracted corrupted data may differ from the actual corrupted test datasets used in our experiments: the audio-visual corruption modeling relies on random functions, so it may not produce identical outputs on all devices.
For fair comparisons, please contact us (joanna2587@kaist.ac.kr) to request the actual test datasets.
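
If exact reproduction of our test sets is not required and you only want your own runs to be self-consistent, one option is to pin the global random seeds near the top of the generation scripts. A minimal sketch; the scripts do not necessarily expose a seed option, so this is an assumption about where such a change would go:

```python
# Hypothetical snippet for the top of a generation script (e.g., LRS2_gen.py).
import random
import numpy as np

SEED = 1234           # illustrative value
random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy's global RNG
```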
## Acknowledgement
We refer to Visual Speech Recognition for Multiple Languages for the landmarks of the datasets and Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets for the visual occlusion patches. We thank the authors for their amazing work.
## Citation
If you find our AV-RelScore useful in your research, please cite our paper:
```bibtex
@article{hong2023watch,
  title={Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring},
  author={Hong, Joanna and Kim, Minsu and Choi, Jeongsoo and Ro, Yong Man},
  journal={arXiv preprint arXiv:2303.08536},
  year={2023}
}
```