Awesome

Awesome Audio-Visual:

A curated list of papers and datsets for various audio-visual tasks, inspired by awesome-computer-vision.

Audio-Visual Localization
Audio-Visual Separation
Audio-Visual Representation/Classification/Retrieval
Audio-Visual Action Recognition
Audio-Visual Spatial/Depth
Audio-Visual RIR
Audio-Visual Deepfake/Robustness
Lightweight Audio-Visual Model
Audio-Visual Navigation/RL
Audio-Visual Faces/Speech
Audio-Visual Question Answering
Audio-Visual Stylization/Generation
Cross-modal Generation (Audio-Video / Video-Audio)
Multi-modal Architectures
Uncategorized Papers

Self-Supervised Visual Acoustic Matching - Somayazulu, A., Chen, C., & Grauman, K. (NeurIPS 2023) [project page]
Novel-View Acoustic Synthesis - Chen, C., Richard, A., Shapovalov, R., Ithapu, V. K., Neverova, N., Grauman, K., & Vedaldi, A. (CVPR 2023) [code]
Few-shot audio-visual learning of environment acoustics - Majumder, S., Chen, C., Al-Halah, Z., & Grauman, K. (NeurIPS 2022)
Learning Neural Acoustic Fields - Luo, A., Du, Y., Tarr, M., Tenenbaum, J., Torralba, A., & Gan, C. (NeurIPS 2022) [code]
Learning Audio-Visual Dereverberation - Chen, C., Sun, W., Harwath, D., & Grauman, K. (ICASSP 2023) [code]
Visual acoustic matching - Chen, C., Gao, R., Calamia, P., & Grauman, K. (CVPR 2022) [code]
Image2reverb: Cross-modal reverb impulse response synthesis - Singh, N., Mentch, J., Ng, J., Beveridge, M., & Drori, I. (ICCV 2021). [code]

Temporal Cue Guided Video Highlight Detection With Low-Rank Audio-Visual Fusion - Ye, Q., Shen, X., Gao, Y., Wang, Z., Bi, Q., Li, P., & Yang, G. (ICCV 2021)
Joint Visual and Audio Learning for Video Highlight Detection - Badamdorj, T., Rochan, M., Wang, Y., & Cheng, L. (ICCV 2021)

Sound Adversarial Audio-Visual Navigation - Yu, Y., Huang, W., Sun, F., Chen, C., Wang, Y., & Liu, X. (ICLR 2022) [project page] [code]
AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments - Paul, S., Roy-Chowdhury, A., & Cherian, A. (NeurIPS 2022)
Semantic Audio-Visual Navigation - Chen, C., Al-Halah, Z., & Grauman, K. (CVPR 2021) [project page] [code]
Learning to set waypoints for audio-visual navigation - Chen, C., Majumder, S., Al-Halah, Z., Gao, R., Ramakrishnan, S. K., & Grauman, K. (ICLR 2021) [project page] [code]
See, hear, explore: Curiosity via audio-visual association - Dean, V., Tulsiani, S., & Gupta, A. (arXiv 2020) [project page] [code]
Audio-Visual Embodied Navigation - Chen, C., Jain, U., Schissler, C., Gari, S. V. A., Al-Halah, Z., Ithapu, V. K., Robinson P., Grauman, K. (ECCV 2020) [project page]
Look, listen, and act: Towards audio-visual embodied navigation - Gan, C., Zhang, Y., Wu, J., Gong, B., & Tenenbaum, J. B. (ICRA 2020) [project page/dataset]

INRAS: Implicit Neural Representations of Audio Scenes - Su, K.*, Chen, M.*, Shilzerman, E. (NeurIPS 2022)
Learning Neural Acoustic Fields - Luo, A., Du, Y., Tarr, M., Tenenbaum, J., Torralba, A., & Gan, C. (NeurIPS 2022) [code] [project page]

PACS: A Dataset for Physical Audiovisual CommonSense Reasoning - Yu, S., Wu, P., Liang, P. P., Salakhutdinov, R., & Morency, L. P. (ECCV 2022) [code]
Learning To Answer Questions in Dynamic Audio-Visual Scenarios - Li, G., Wei, Y., Tian, Y., Xu, C., Wen, J. R., & Hu, D. (CVPR 2022) [project page] [code]

What Makes Training Multi-Modal Networks Hard? - Wang, W., Tran, D., & Feiszli, M. (arXiv 2019)
MFAS: Multimodal Fusion Architecture Search - Pérez-Rúa, J. M., Vielzeuf, V., Pateux, S., Baccouche, M., & Jurie, F. (CVPR 2019)

Datasets

AudioSet - Audio-Visual Classification
MUSIC - Audio-Visual Source Separation
AudioSetZSL - Audio-Visual Zero-shot Learning
Visually Engaged and Grounded AudioSet (VEGAS) - Sound generation from video
SoundNet-Flickr - Image-Audio pair for cross-modal learning
Audio-Visual Event (AVE) - Audio-Visual Event Localization
AudioSet Single Source - Subset of AudioSet videos containing only a single souding object
Kinetics-Sounds - Subset of Kinetics dataset
EPIC-Kitchens - Egocentric Audio-Visual Action Recogniton
Audio-Visually Indicated Actions Dataset - Multimodal dataset (RGB, acoustic data as raw audio) acquired using the acoustic-optical camera
IMSDb dataset - Movie scripts downloaded from The Internet Script Movie Database
YOUTUBE-ASMR-300K dataset - ASMR videos collected from YouTube that contains stereo audio
FAIR-Play - 1,871 video clips and their corresponding binaural audio clips recorded in a music room
VGG-Sound - audio-visual correspondent dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube
XD-Violence - weakly annotated dataset for audio-visual violence detection
AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE) - Geotagged aerial images and sounds, classified into 13 scene classes
auDIoviSual Crowd cOunting dataset (DISCO) - 1,935 Images and audios from various typical scenes, a total of 170, 270 instances annotated with the head locations.
MUSIC-Synthetic dataset- Category-balanced multi-source videos by artificially synthesizing solo videos from the MUSIC dataset, to facilitate the learning and evaluation of multiple-soundings-sources localization in the cocktail-party scenario.
ACAV100M - 140 million full-length videos (total duration 1,030 years) and produce a dataset of 100 million 10-second clips (31 years) with high audio-visual correspondence.
AIST++ - A large-scale 3D human dance motion dataset, which contains a wide variety of 3D motion paired with music It is built upon the AIST Dance Database, which is an uncalibrated multi-view collection of dance videos.
VideoCC - A dataset containing (video-URL, caption) pairs for training video-text machine learning models. It is created using an automatic pipeline starting from the Conceptual Captions Image-Captioning Dataset.
ssw60 - A dataset for research on adiovisual fine-grained categorization. The dataset covers 60 species of birds that all occur in a specific geographic location: Sapsucker Woods, Ithaca, NY. It is comprised of images from existing datasets, and brand new, expert curated audio and video data.
PACS - A dataset designed to help create and evaluate a new generation of AI algorithms able to reason about physical commonsense using both audio and visual modalities.
AVSBench - A dataset for audio-visual pixel-wise segmentation task.
UnAV-100 - The dataset consists of more than 10K untrimmed videos with over 30K audio-visual events covering 100 different event categories. There are often multiple audio-visual events that might be very short or long, and occur concurrently in each video as in real-life audio-visual scenes.

VoxCeleb - Audio-Visual Speaker Identification, contains two versions
EmoVoxCeleb
Speech2Gesture - Gesture prediction from speech
AVSpeech
LRW Dataset
LRS2, LRS3, LRS3 Language - Lip Reading Datasets

Licenses

To the extent possible under law, Kranti Kumar Parida has waived all copyright and related or neighboring rights to this work.

Contributing

Please feel free to send me pull requests or email (kranti@cse.iitk.ac.in) to add links, correct wrong ones or if you find any broken links.

Awesome

Awesome Audio-Visual:

Contents

Audio-Visual Localization

Audio-Visual Separation

Audio-Visual Representation/Classification/Retrieval

Audio-Visual Action Recognition

Audio-Visual Spatial/Depth

Audio-Visual RIR

Audio-Visual Highlight Detection

Audio-Visual Deepfake/Robustness

Lightweight Audio-Visual Model

Audio-Visual Navigation/RL

Audio-Visual Faces/Speech

Audio-Visual Learning of Scene Acoustics

Audio-Visual Question Answering

Cross-modal Generation (Audio-Video / Video-Audio)

Audio-Visual Stylization/Generation

Multi-modal Architectures

Uncategorized Papers

Datasets

General Audio-Visual Tasks

Face-Voice Dataset

Licenses

Contributing