l3embedding

Code for running the experiments presented in:

Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings<br/> Jason Cramer, Ho-Hsiang Wu, Justin Salamon and Juan Pablo Bello<br/> Under review, 2018.

For the pre-trained embedding models (openL3), please go to: github.com/marl/openl3

This repository contains an implementation of the model proposed in Look, Listen and Learn (Arandjelović, R., Zisserman, A. 2017). The model uses videos to learn visual and audio features in an unsupervised fashion by training on the proposed Audio-Visual Correspondence (AVC) task: determining whether a piece of audio and an image frame come from the same video and occur simultaneously.
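The AVC decision described above can be sketched as a simple binary scorer over a pair of embeddings. This is a minimal NumPy illustration, not the repository's actual architecture: the function name, the 128-dimensional embeddings, and the single linear layer are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def correspondence_score(audio_emb, image_emb, w, b):
    """Score whether an audio clip and an image frame co-occur.

    The two embeddings are concatenated and passed through one linear
    layer plus a sigmoid, yielding P(correspondence). (Hypothetical
    head; the real model uses deep audio and vision subnetworks.)
    """
    fused = np.concatenate([audio_emb, image_emb])
    logit = fused @ w + b
    return 1.0 / (1.0 + np.exp(-logit))

# Toy 128-d embeddings and a randomly initialized scoring head.
audio_emb = rng.standard_normal(128)
image_emb = rng.standard_normal(128)
w = rng.standard_normal(256) * 0.01
b = 0.0

p = correspondence_score(audio_emb, image_emb, w, b)
print(0.0 < p < 1.0)  # the score is a valid probability
```

In the actual model each embedding comes from a learned subnetwork, and the whole pipeline is trained end to end on matched and mismatched audio/frame pairs.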

Dependencies

The code for the model and training implementation can be found in l3embedding/. Note that the metadata format expected is the same used in AudioSet (Gemmeke, J., Ellis, D., et al. 2017), as training this model on AudioSet was one of the goals for this implementation.
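For reference, AudioSet distributes its segment metadata as CSV rows of the form YouTube ID, excerpt start time, excerpt end time, and a quoted comma-separated label list. A small parsing sketch (the helper name and the sample row are illustrative, though the row follows the published CSV layout):

```python
import csv
import io

# One row in the AudioSet segments CSV layout: YouTube ID, start and end
# time of the 10-second excerpt, and a quoted list of label IDs.
sample = '--PJHxphWEs, 30.000, 40.000, "/m/09x0r,/m/05zppz"\n'

def parse_segment(line):
    # skipinitialspace handles the space after each comma in the CSV.
    ytid, start, end, labels = next(
        csv.reader(io.StringIO(line), skipinitialspace=True)
    )
    return {
        "ytid": ytid,
        "start_seconds": float(start),
        "end_seconds": float(end),
        "labels": labels.split(","),
    }

seg = parse_segment(sample)
print(seg["ytid"], seg["end_seconds"] - seg["start_seconds"], len(seg["labels"]))
# → --PJHxphWEs 10.0 2
```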

You can train an AVC/embedding model using train.py. Run python train.py -h to read the help message regarding how to use the script.

There is also a module classifier/ which contains code to train a classifier on embeddings extracted from new audio with the embedding model. Currently this only supports the UrbanSound8K dataset (Salamon, J., Jacoby, C., Bello, J. 2014).

You can train an urban sound classification model using train_classifier.py. Run python train_classifier.py -h to read the help message regarding how to use the script.
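UrbanSound8K pre-assigns every clip to one of 10 folds, and the standard evaluation holds out one fold at a time. A minimal sketch of that split, using field names from the dataset's metadata CSV (the sample records here are made up for illustration):

```python
# Toy records mimicking rows of UrbanSound8K's metadata CSV.
records = [
    {"slice_file_name": "100032-3-0-0.wav", "fold": 5, "classID": 3},
    {"slice_file_name": "100263-2-0-117.wav", "fold": 5, "classID": 2},
    {"slice_file_name": "100648-1-0-0.wav", "fold": 10, "classID": 1},
]

def split_by_fold(records, test_fold):
    """Hold out one pre-assigned fold for testing, train on the rest."""
    train = [r for r in records if r["fold"] != test_fold]
    test = [r for r in records if r["fold"] == test_fold]
    return train, test

train, test = split_by_fold(records, test_fold=5)
print(len(train), len(test))  # → 1 2
```

Respecting the official folds matters because clips sliced from the same source recording share a fold; re-shuffling them would leak near-duplicates between train and test.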

Download VGGish models:

If you use a SLURM environment, sbatch scripts are available in jobs/.