Sound Localization by Self-Supervised Time Delay Estimation

<h4> Ziyang Chen, David F. Fouhey, Andrew Owens <br> <span style="font-size: 14pt; color: #555555"> University of Michigan </span> <br> </h4> <hr>

This repository contains the official codebase for Sound Localization by Self-Supervised Time Delay Estimation. [Project Page]

<div align="center"> <img width="100%" alt="StereoCRW Illustration" src="images/method.png"> </div>

Environment

To set up the environment, simply run:

conda env create -f environment.yml
conda activate Stereo

Datasets

Free Music Archive (FMA)

We perform self-supervised learning on this dataset. The data can be downloaded from the official FMA GitHub repo.

FAIR-Play

The data can be downloaded from the official FAIR-Play GitHub repo.

TDE-Simulation <span id="TDE-Simulation"></span>

We create a simulated test set using Pyroomacoustics. It contains approximately 6K stereo audio samples from three simulated environments with rooms of different sizes and microphone positions; we use TIMIT as the sound source database. The data can be downloaded from Here, or by running

cd Dataset/TDE-Simulation
chmod +x download_tde.sh
./download_tde.sh

We also provide the code for generating the stereo sound in Dataset/TDE-Simulation/data-generation-advance.py, so you can create your own evaluation set; a minimal example of the simulation setup is sketched below. The evaluation information is provided in Dataset/TDE-Simulation/data-split.
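For reference, here is a minimal sketch of generating a stereo clip with a known time delay using Pyroomacoustics. The room size, microphone spacing, absorption, and source position below are illustrative assumptions, not the exact parameters used in data-generation-advance.py:

```python
# Minimal sketch: simulate a stereo recording with a known ground-truth ITD.
# All geometry and acoustics parameters here are illustrative assumptions.
import numpy as np
import pyroomacoustics as pra

fs = 16000
rng = np.random.default_rng(0)
signal = rng.standard_normal(fs)  # 1 s of noise; swap in a TIMIT utterance

# Shoebox room with moderate absorption and a low reflection order.
room = pra.ShoeBox([6.0, 4.0, 3.0], fs=fs,
                   materials=pra.Material(0.4), max_order=3)

# Two microphones 20 cm apart (each column is one microphone position).
mics = np.array([[2.9, 3.1],
                 [2.0, 2.0],
                 [1.5, 1.5]])
room.add_microphone_array(pra.MicrophoneArray(mics, fs))

# Off-axis source, so the two channels receive the sound with a nonzero delay.
src = np.array([4.0, 3.0, 1.6])
room.add_source(src, signal=signal)
room.simulate()

stereo = room.mic_array.signals  # shape: (2, num_samples)
c = 343.0  # speed of sound (m/s)
itd = (np.linalg.norm(src - mics[:, 0]) - np.linalg.norm(src - mics[:, 1])) / c
print(f"direct-path ITD: {itd * 1000:.3f} ms")
```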

In-the-wild data

We collected 1K samples from 30 internet binaural videos and used human judgments to label the sound directions. These videos contain a variety of sounds, including engine noise and human speech, which are often far from the viewer. The processed data can be downloaded from Here. You can simply download the dataset by running

cd Dataset/Youtube-Binaural
chmod +x download_inthewild.sh
./download_inthewild.sh

We also provide the YouTube IDs and timestamps in Dataset/Youtube-Binaural/data-info/in-the-wild.csv; you can download and process the videos yourself with Dataset/Youtube-Binaural/multi-download-process.sh (a sketch of iterating over the clip list is shown below). Labels are provided in Dataset/Youtube-Binaural/data-split/in-the-wild/test_with_label.csv.
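As a rough illustration, the clip list can be enumerated as follows. The column names youtube_id and start_time are assumptions about the CSV schema; check the file header and multi-download-process.sh for the actual fields:

```python
# Minimal sketch: enumerate the in-the-wild clips from the provided CSV.
# The column names below are assumptions; inspect the CSV header first.
import csv

with open("Dataset/Youtube-Binaural/data-info/in-the-wild.csv") as f:
    for row in csv.DictReader(f):
        url = f"https://www.youtube.com/watch?v={row['youtube_id']}"
        print(url, row["start_time"])  # feed these to your download tool
```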

Visually-guided Time Delay Simulation Dataset

We use audio clips from VoxCeleb2 with the simulation parameters from TDE-Simulation. We select 500 speakers from the database and pair them with their corresponding face images. The processed data can be downloaded from Here. You can simply download the dataset by running

cd Dataset/VoxCeleb2
chmod +x download_voxceleb2_simulation.sh
./download_voxceleb2_simulation.sh

We have provided the evaluation information in Dataset/VoxCeleb2/data-split/voxceleb-tde/Easy/test.csv.

Model Zoo

We release several models pre-trained with our proposed methods. We hope they benefit the research community.

| Method | size, stride, num | Train Set | Test Set | MAE (ms) | RMSE (ms) | url |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| MonoCLR | 1024, 4, 49 | Free-Music | TDE-Simulation | 0.187 | 0.335 | url |
| ZeroNCE | 1024, 4, 49 | Free-Music | TDE-Simulation | 0.174 | 0.319 | url |
| StereoCRW | 1024, 4, 49 | Free-Music | TDE-Simulation | 0.133 | 0.259 | url |
| AV-MonoCLR | 15360, 4, 49 | VoxCeleb2 | VoxCeleb2-Simulation | - | 0.304 | url |

Note that the models above are trained on 0.064 s audio inputs, but you can run inference directly with different audio lengths without retraining. We also provide pre-trained models trained on longer audio inputs (0.48 s), intended only to accelerate training. To download all the checkpoints, simply run

./scripts/download_models.sh
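After downloading, a checkpoint can be inspected with PyTorch as below. This is a minimal sketch: the file name and checkpoint layout (a wrapper dict with a state_dict key vs. a bare state dict) are assumptions, so refer to the evaluation scripts for the exact loading code.

```python
# Minimal sketch: peek inside a downloaded checkpoint with PyTorch.
# File name and checkpoint layout are assumptions; see the eval scripts.
import torch

ckpt = torch.load("checkpoints/stereocrw.pth.tar", map_location="cpu")
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
for name, tensor in list(state.items())[:5]:
    print(name, tuple(tensor.shape))
```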

Train & Evaluation

We provide training and evaluation scripts under scripts; please check each bash file before running.

Training

Evaluation

Visualization Demo

We provide code for visualizing the ITD predictions for a video over time in vis_scripts/vis_video_itd.py, which you can use to generate visualization results for your own videos. A rough sketch of the underlying idea is shown below.
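For intuition, here is a minimal, self-contained sketch of plotting an ITD track over time for a stereo clip. It uses classical GCC-PHAT as a stand-in for the learned model, and the input file name is a hypothetical placeholder; the 0.064 s window matches the training clip length noted above:

```python
# Minimal sketch: per-window ITD over time via GCC-PHAT (a classical
# stand-in for the learned model). Input file name is hypothetical.
import numpy as np
import soundfile as sf
import matplotlib.pyplot as plt

def gcc_phat(x, y, fs, max_tau=1e-3):
    """Estimate the delay (s) of x relative to y with PHAT weighting."""
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n=n), np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12                       # PHAT whitening
    cc = np.fft.irfft(R, n=n)
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

audio, fs = sf.read("example_stereo.wav")        # (num_samples, 2) expected
win = int(0.064 * fs)                            # 0.064 s analysis windows
times, itds = [], []
for start in range(0, len(audio) - win, win):
    seg = audio[start:start + win]
    itds.append(gcc_phat(seg[:, 0], seg[:, 1], fs) * 1000)  # ms
    times.append(start / fs)

plt.plot(times, itds)
plt.xlabel("time (s)")
plt.ylabel("ITD (ms)")
plt.savefig("itd_over_time.png")
```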

Citation

If you find this code useful, please consider citing:

@article{chen2022sound,
    title={Sound Localization by Self-Supervised Time Delay Estimation},
    author={Chen, Ziyang and Fouhey, David F. and Owens, Andrew},
    journal={arXiv},
    year={2022}
}

Acknowledgment

This work was funded in part by DARPA Semafor and Cisco Systems. The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.