


This repository contains the code and dataset for our NeurIPS'20 paper.

Learning Representations from Audio-Visual Spatial Alignment. Pedro Morgado*, Yi Li*, Nuno Vasconcelos. Advances in Neural Information Processing Systems (NeurIPS), 2020.


Requirements are listed in environment.yml.

Data preparation

YT-360 dataset

YouTube IDs of the videos in the YT-360 dataset are provided in datasets/assets/yt360/[train|test].txt, and segment timestamps in datasets/assets/yt360/segments.txt. Please use your favorite YouTube dataset downloader to download the videos (e.g., link), and split them into 10s clips. The dataset should be stored in data/yt360/video and data/yt360/audio with filenames {YOUTUBE_ID}-{SEGMENT_START_TIME}.{EXTENSION}.
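The naming convention above can be sketched in Python as follows. This is a minimal illustration, not the authors' preprocessing script: the helper names `clip_name` and `ffmpeg_clip_cmd`, the `mp4`/`m4a` extensions, integer start times in seconds, and the choice to write video and audio as separate files are all assumptions. Only the directory layout, the 10s clip length, and the filename pattern come from this README.

```python
from pathlib import Path

CLIP_SECONDS = 10  # clip length stated in this README


def clip_name(youtube_id, start_time, extension):
    """Build a filename following {YOUTUBE_ID}-{SEGMENT_START_TIME}.{EXTENSION}."""
    return f"{youtube_id}-{start_time}.{extension}"


def ffmpeg_clip_cmd(source, youtube_id, start_time, root="data/yt360"):
    """Return ffmpeg argument lists that cut one clip into a video and an audio file.

    Assumptions (not from the README): start_time is in whole seconds, and
    the output containers are mp4 (video) and m4a (audio).
    """
    video_out = Path(root) / "video" / clip_name(youtube_id, start_time, "mp4")
    audio_out = Path(root) / "audio" / clip_name(youtube_id, start_time, "m4a")
    # -ss seeks to the segment start, -t limits the duration to one clip.
    common = ["ffmpeg", "-y", "-ss", str(start_time),
              "-t", str(CLIP_SECONDS), "-i", source]
    return {
        "video": common + ["-an", video_out.as_posix()],  # -an drops audio streams
        "audio": common + ["-vn", audio_out.as_posix()],  # -vn drops video streams
    }
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once ffmpeg is installed. The encoder settings are left at ffmpeg defaults here and may need adjusting for 360° video or multi-channel audio.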

The pre-extracted segmentation maps can be downloaded from here and extracted to data/yt360/segmentation/.
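Before training, it may be worth checking that every clip in data/yt360/video has a counterpart in data/yt360/audio. The sketch below assumes that paired files share the same {YOUTUBE_ID}-{SEGMENT_START_TIME} stem and differ only in extension. The function name `unpaired_clips` is hypothetical and not part of this repository.

```python
from pathlib import Path


def unpaired_clips(root="data/yt360"):
    """Compare clip stems in <root>/video and <root>/audio.

    Returns the stems that appear in only one of the two directories.
    """
    video = {p.stem for p in (Path(root) / "video").glob("*") if p.is_file()}
    audio = {p.stem for p in (Path(root) / "audio").glob("*") if p.is_file()}
    return {
        "missing_audio": sorted(video - audio),  # video clip without audio
        "missing_video": sorted(audio - video),  # audio clip without video
    }
```

Calling `unpaired_clips()` from the repository root returns two lists; both being empty suggests the video and audio folders are consistent with each other.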

If you experience issues downloading or processing the dataset, please email the authors at {pmaravil, yil898}@eng.ucsd.edu for assistance.

Pre-trained model

The AVSA model that yields the top performance (trained with configs/main/avsa/Cur-Loc4-TransfD2.yaml) is available here.

Self-supervised training

python main-video-ssl.py [--quiet] cfg

Training configs (cfg) for the following models are provided:


Four downstream tasks are supported: binary audio-visual correspondence (AVC-Bin), binary audio-visual spatial alignment (AVSA-Bin), video action recognition (on UCF/HMDB), and audio-visual semantic segmentation.

Action recognition

python eval-action-recg.py [--quiet] cfg model_cfg

Evaluation configs (cfg) for the UCF and HMDB datasets are provided:

model_cfg is the training config of the model to evaluate, e.g. configs/main/avsa/Cur-Loc4-TransfD2.yaml for AVSA pre-training.
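All four evaluation scripts in this README take the same arguments: an optional --quiet flag, an evaluation config, and the training config of the model being evaluated. A small wrapper along the following lines could help when scripting several evaluations. The task keys and the `eval_command` helper are hypothetical. Only the script names and the argument order are taken from this README.

```python
import sys

# Script names as they appear in this README; the dictionary keys are
# illustrative labels, not identifiers used by the repository.
EVAL_SCRIPTS = {
    "action-recg": "eval-action-recg.py",
    "segmentation": "eval-audiovisual-segm.py",
    "avc": "eval-avc.py",
    "avsa": "eval-avsa.py",
}


def eval_command(task, cfg, model_cfg, quiet=False):
    """Build the argument list for: python <script> [--quiet] cfg model_cfg."""
    cmd = [sys.executable, EVAL_SCRIPTS[task]]
    if quiet:
        cmd.append("--quiet")
    return cmd + [cfg, model_cfg]
```

For example, `subprocess.run(eval_command("action-recg", eval_cfg, "configs/main/avsa/Cur-Loc4-TransfD2.yaml"), check=True)` would launch action recognition for the AVSA pre-trained model, where `eval_cfg` stands for the path to one of the provided evaluation configs.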

Semantic segmentation

python eval-audiovisual-segm.py [--quiet] cfg model_cfg

Evaluation configs (cfg) for three settings are provided:

Binary audio-visual correspondence

python eval-avc.py [--quiet] cfg model_cfg

Evaluation configs (cfg) for two settings are provided:

Binary audio-visual spatial alignment

python eval-avsa.py [--quiet] cfg model_cfg

Evaluation configs (cfg) for two settings are provided:


Please cite our work if you find it helpful for your research:

  title={Learning Representations from Audio-Visual Spatial Alignment},
  author={Morgado, Pedro and Li, Yi and Vasconcelos, Nuno},
  journal={Advances in Neural Information Processing Systems},


This work was partially funded by NSF award IIS-1924937 and NVIDIA GPU donations. We also acknowledge the use of the Nautilus platform for some of the experiments in the paper.