# Sound Localization by Self-Supervised Time Delay Estimation

<h4> Ziyang Chen, David F. Fouhey, Andrew Owens <br> <span style="font-size: 14pt; color: #555555"> University of Michigan </span> <br> </h4>
<hr>

This repository contains the official codebase for *Sound Localization by Self-Supervised Time Delay Estimation*. [Project Page]
<div align="center"> <img width="100%" alt="StereoCRW Illustration" src="images/method.png"> </div>

## Environment
To set up the environment, simply run:

```bash
conda env create -f environment.yml
conda activate Stereo
```
## Datasets

### Free Music Archive (FMA)
We perform self-supervised learning on this training dataset; the data can be downloaded from the official FMA GitHub repo.
### FAIR-Play

Data can be downloaded from the official FAIR-Play GitHub repo.
### TDE-Simulation <span id="TDE-Simulation"></span>
We create a simulated test set using Pyroomacoustics. It contains approximately 6K stereo audio samples from three simulated environments with rooms of different sizes and microphone positions. We use TIMIT as the sound database. Our data can be downloaded from Here, or by running:

```bash
cd Dataset/TDE-Simulation
chmod +x download_tde.sh
./download_tde.sh
```
We also provide the code for generating the stereo audio in `Dataset/TDE-Simulation/data-generation-advance.py`, so you can create your own evaluation set; see the sketch below for the general recipe. The evaluation information is provided in `Dataset/TDE-Simulation/data-split`.
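For orientation, the following is a minimal sketch of simulating a stereo clip with a known time delay in Pyroomacoustics. It is an illustration, not the repo's `data-generation-advance.py`; the room size, absorption, microphone spacing, and source position are assumed values.

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs,
                   materials=pra.Material(0.3),  # wall absorption (assumed)
                   max_order=10)                 # image-source reflection order

# Two microphones 20 cm apart (each column is one mic position).
mics = np.array([[2.9, 3.1],
                 [2.5, 2.5],
                 [1.5, 1.5]])
room.add_microphone_array(mics)

# An off-axis source, so the two channels receive the sound at different times.
src = np.array([4.5, 3.5, 1.5])
room.add_source(src, signal=np.random.randn(fs))  # stand-in for a TIMIT clip

room.simulate()
stereo = room.mic_array.signals            # shape: (2, num_samples)

# Ground-truth ITD from geometry: path-length difference over speed of sound.
d_left, d_right = (np.linalg.norm(src - mics[:, i]) for i in range(2))
itd_s = (d_left - d_right) / 343.0         # positive => sound reaches the right mic first
```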
### In-the-wild data

We collected 1K samples from 30 internet binaural videos and used human judgments to label sound directions. These videos contain a variety of sounds, including engine noise and human speech, which are often far from the viewer. The processed data can be downloaded from Here, or by running:

```bash
cd Dataset/Youtube-Binaural
chmod +x download_inthewild.sh
./download_inthewild.sh
```
We also provide the YouTube IDs and timestamps in `Dataset/Youtube-Binaural/data-info/in-the-wild.csv`; you can download and process the videos with `Dataset/Youtube-Binaural/multi-download-process.sh`. Labels are provided in `Dataset/Youtube-Binaural/data-split/in-the-wild/test_with_label.csv`.
### Visually-guided Time Delay Simulation Dataset

We use audio clips from VoxCeleb2 with the simulation parameters from [TDE-Simulation](#TDE-Simulation). We select 500 speakers from the database and pair them with their corresponding face images. The processed data can be downloaded from Here, or by running:

```bash
cd Dataset/VoxCeleb2
chmod +x download_voxceleb2_simulation.sh
./download_voxceleb2_simulation.sh
```
We have provided the evaluation information in `Dataset/VoxCeleb2/data-split/voxceleb-tde/Easy/test.csv`.
## Model Zoo

We release several models pre-trained with our proposed methods, and hope they will benefit the research community.
| Method | size, stride, num | Train Set | Test Set | MAE (ms) | RMSE (ms) | URL |
|---|---|---|---|---|---|---|
| MonoCLR | 1024, 4, 49 | Free-Music | TDE-Simulation | 0.187 | 0.335 | url |
| ZeroNCE | 1024, 4, 49 | Free-Music | TDE-Simulation | 0.174 | 0.319 | url |
| StereoCRW | 1024, 4, 49 | Free-Music | TDE-Simulation | 0.133 | 0.259 | url |
| AV-MonoCLR | 15360, 4, 49 | VoxCeleb2 | VoxCeleb2-Simulation | - | 0.304 | url |
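The MAE and RMSE columns are the mean absolute error and root-mean-square error of the predicted time delays, in milliseconds. A minimal sketch of the computation (the function and variable names are illustrative, not taken from the repo):

```python
import numpy as np

def tde_errors(pred_ms, gt_ms):
    """MAE and RMSE (both in ms) between predicted and ground-truth delays."""
    err = np.asarray(pred_ms) - np.asarray(gt_ms)
    return np.abs(err).mean(), np.sqrt((err ** 2).mean())
```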
Note that the models above are trained on 0.064 s audio clips, but you can run inference directly with different audio lengths without retraining. We also provide some pre-trained models trained on longer audio inputs (0.48 s), intended only to accelerate the training process. To download all the checkpoints, simply run:

```bash
./scripts/download_models.sh
```
## Train & Evaluation

We provide training and evaluation scripts under `scripts`; please check each bash file before running.
### Training

Run each of the following from the repository root:

- To train our StereoCRW method on FMA:

  ```bash
  ./scripts/training/train-StereoCRW-FMA.sh
  ```

- To train our MonoCLR method on FMA:

  ```bash
  ./scripts/training/train-MonoCLR-FMA.sh
  ```

- To train our ZeroNCE method on FMA:

  ```bash
  ./scripts/training/train-ZeroNCE-FMA.sh
  ```

- To train our AV-MonoCLR method on VoxCeleb2:

  ```bash
  ./scripts/training/train-AVMonoCLR-VoxCeleb2.sh
  ```
### Evaluation

Run each of the following from the repository root; you can change the checkpoint in each bash file.

- To evaluate our model on the TDE-Simulation dataset:

  ```bash
  ./scripts/evaluation/evaluation_tde.sh
  ```

- To evaluate our model on the TDE-Simulation dataset under the mixture condition:

  ```bash
  ./scripts/evaluation/evaluation_mixture_tde.sh
  ```

- To evaluate our model on the in-the-wild dataset:

  ```bash
  ./scripts/evaluation/evaluation_inthewild.sh
  ```

- To evaluate our visually-guided ITD estimation model on the Visually-guided Time Delay Simulation Dataset:

  ```bash
  ./scripts/evaluation/evaluation_vgITD.sh
  ```
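As context for these evaluations, time delay estimation is classically done with generalized cross-correlation. Below is a minimal GCC-PHAT sketch, a standard baseline for comparison; it is not part of this repository's scripts, and the function name and default lag window are assumptions.

```python
import numpy as np

def gcc_phat(left, right, fs, max_delay_s=0.001):
    """Estimate the time delay between two channels via GCC-PHAT.

    Returns the delay of `left` relative to `right` in seconds
    (sign conventions vary across implementations).
    """
    n = len(left) + len(right)
    L = np.fft.rfft(left, n=n)
    R = np.fft.rfft(right, n=n)
    cross = L * np.conj(R)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * max_delay_s)         # restrict to physically plausible lags
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```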
## Visualization Demo

We provide code for visualizing the ITD predictions of videos over time in `vis_scripts/vis_video_itd.py`. You can follow the steps below to generate visualization results for your own videos:
- Create a folder for your test videos with

  ```bash
  mkdir Dataset/DemoVideo/RawVideos/YourVideo
  ```

  and save your videos to this path.

- To preprocess the videos, simply run:

  ```bash
  cd Dataset/DemoVideo
  chmod +x process.sh
  ./process.sh 'YourVideo'
  ```

- To run inference on the videos, go back to the repository root and run:

  ```bash
  ./scripts/visualization_video.sh 'YourVideo' YOUR_SAVE_PATH
  ```

  The video results will appear under `results/YOUR_SAVE_PATH`.
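If you want to inspect the per-window predictions yourself, a minimal plotting sketch might look like the following. The timestamps and ITD values here are placeholders; in practice they would come from the model's output rather than a synthetic curve.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical per-window ITD predictions (ms) and their timestamps (s).
times = np.arange(0, 10, 0.1)
itds_ms = 0.3 * np.sin(0.5 * times)            # placeholder trajectory

plt.figure(figsize=(8, 3))
plt.plot(times, itds_ms)
plt.axhline(0.0, color="gray", linewidth=0.5)  # 0 ms = source straight ahead
plt.xlabel("time (s)")
plt.ylabel("predicted ITD (ms)")
plt.tight_layout()
plt.show()
```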
## Citation

If you find this code useful, please consider citing:

```bibtex
@article{chen2022sound,
  title={Sound Localization by Self-Supervised Time Delay Estimation},
  author={Chen, Ziyang and Fouhey, David F. and Owens, Andrew},
  journal={arXiv},
  year={2022}
}
```
## Acknowledgment

This work was funded in part by DARPA SemaFor and Cisco Systems. The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.