

A Closer Look at Weakly-Supervised Audio-Visual Source Localization

Official codebase for SLAVC.

SLAVC is a new approach to weakly-supervised visual sound source localization that identifies negatives and addresses significant overfitting problems.

A Closer Look at Weakly-Supervised Audio-Visual Source Localization <br>Shentong Mo, Pedro Morgado<br> NeurIPS 2022.

<div align="center"> <img width="100%" alt="SLAVC Illustration" src="images/framework.png"> </div>


To set up the environment, run

pip install -r requirements.txt



Flickr-SoundNet

Data can be downloaded from Learning to localize sound sources

VGG-Sound Source

Data can be downloaded from Localizing Visual Sounds the Hard Way

Extended Flickr-SoundNet

Data can be downloaded from Extended-Flickr-SoundNet

Extended VGG-Sound Source

Data can be downloaded from Extended-VGG-Sound Source

Model Zoo

We release the SLAVC model pre-trained on VGG-Sound 144k data, along with scripts for reproducing results on the Extended Flickr-SoundNet and Extended VGG-Sound Source benchmarks.

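| Method | Train Set | Test Set | AP | max-F1 | Precision | url | Train | Test |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SLAVC | VGG-Sound 144k | Extended Flickr-SoundNet | 51.63 | 59.10 | 83.60 | model | script | script |
| SLAVC | VGG-Sound 144k | Extended VGG-SS | 32.95 | 40.00 | 37.79 | model | script | script |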
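
For readers unfamiliar with the max-F1 column, the sketch below shows the generic idea of taking the best F1 score over all confidence thresholds. It is an illustration only and is not taken from this repository's evaluation code: the function name `max_f1`, its arguments, and the toy inputs are hypothetical, and the exact rule for deciding when a prediction counts as correct is assumed to be handled upstream.

```python
def max_f1(scores, is_correct, num_positives):
    """Best F1 over confidence thresholds (generic sketch, ties not handled).

    scores:        confidence of each prediction
    is_correct:    True if the prediction would count as a true positive
    num_positives: number of ground-truth samples with a visible source
    """
    if num_positives == 0:
        return 0.0
    # Visit predictions from most to least confident; keeping the top-k
    # predictions corresponds to one choice of threshold.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best, true_pos = 0.0, 0
    for kept, i in enumerate(order, start=1):
        true_pos += int(is_correct[i])
        precision = true_pos / kept
        recall = true_pos / num_positives
        if precision + recall > 0:
            f1 = 2 * precision * recall / (precision + recall)
            best = max(best, f1)
    return best


# Toy example with made-up values.
toy = max_f1([0.9, 0.8, 0.3, 0.2], [True, True, False, True], num_positives=3)
print(round(toy, 3))  # 0.857
```

In this toy case the best threshold keeps all four predictions (precision 0.75, recall 1.0).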


To train an SLAVC model, run

python train.py --multiprocessing_distributed \
    --train_data_path /path/to/VGGSound-all/ \
    --test_data_path /path/to/Flickr-SoundNet/ \
    --test_gt_path /path/to/Flickr-SoundNet/Annotations/ \
    --experiment_name vggss144k_slavc \
    --model 'slavc' \
    --trainset 'vggss_144k' \
    --testset 'flickr' \
    --epochs 20 \
    --batch_size 128 \
    --init_lr 0.0001 \
    --use_momentum --use_mom_eval \
    --m_img 0.999 --m_aud 0.999 \
    --dropout_img 0.9 --dropout_aud 0
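
The `--use_momentum`, `--m_img` and `--m_aud` flags suggest that slowly updated (momentum) copies of the image and audio encoders are kept during training. Assuming these coefficients are used in the usual exponential-moving-average way, the update can be sketched in plain Python as follows. This is not the repository's implementation: `ema_update` and the toy parameter lists are hypothetical names used only for illustration.

```python
def ema_update(target, online, m):
    """Return momentum-averaged parameters: m * target + (1 - m) * online."""
    if not 0.0 <= m <= 1.0:
        raise ValueError("momentum must be in [0, 1]")
    return [m * t + (1.0 - m) * o for t, o in zip(target, online)]


# Toy values standing in for encoder weights (hypothetical).
target_params = [0.0, 1.0]
online_params = [1.0, 1.0]

# Apply 1000 updates with the coefficient used in the command above.
for _ in range(1000):
    target_params = ema_update(target_params, online_params, m=0.999)
```

With a coefficient of 0.999 the momentum copy moves slowly: after 1000 updates it has covered roughly 63% of the distance to the online value.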


For testing and visualization, run

python test.py --test_data_path /path/to/Extended-VGGSound-test/ \
    --model_dir checkpoints \
    --experiment_name vggss144k_slavc \
    --testset 'vggss_plus_silent' \
    --alpha 0.9


If you find this repository useful, please cite our paper:

  title={A Closer Look at Weakly-Supervised Audio-Visual Source Localization},
  author={Mo, Shentong and Morgado, Pedro},
  booktitle={Advances in Neural Information Processing Systems},