VGGish

This repository is a fork of Google Research's VGGish code and is used to extract 128-D audio features from video files.

  1. Install Python packages and download two data files
  2. Follow the structure below to place your files (remember to modify base_data_path in Constants.py)
└── $base_data_path
    ├── $dataset
    │   ├── all_videos
    │   │   ├── video0.mp4 (or .avi)
    │   │   └── ...
    │   ├── info_corpus.pkl
    │   └── refs.pkl
    └── $another_dataset
        ├── all_videos
        │   ├── video0.mp4 (or .avi)
        │   └── ...
        ├── info_corpus.pkl
        └── refs.pkl
  3. Extract wav files from video files (a sketch of this conversion is shown after these steps)
  python video2wav.py --dataset MSRVTT
  4. Extract features
  python extract_features.py --dataset MSRVTT --n_frames 60 --video_postfix .mp4
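
The wav-extraction step presumably pulls the audio track out of each video as 16 kHz mono, which is the sample rate VGGish expects (see "Input: Audio Features" below). The exact behavior of video2wav.py may differ; the following is only a minimal sketch of that conversion, assuming ffmpeg is installed and using illustrative file paths.

import subprocess
from pathlib import Path

def extract_wav(video_path, wav_path, sample_rate=16000):
    # Drop the video stream (-vn), downmix to mono (-ac 1) and resample
    # to 16 kHz (-ar 16000), which is what VGGish expects as input.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn", "-ac", "1", "-ar", str(sample_rate), wav_path],
        check=True,
    )

if __name__ == "__main__":
    # Illustrative: convert every .mp4 under all_videos/ to a sibling .wav.
    for video in Path("all_videos").glob("*.mp4"):
        extract_wav(str(video), str(video.with_suffix(".wav")))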

Original README

The initial AudioSet release included 128-dimensional embeddings of each AudioSet segment produced from a VGG-like audio classification model that was trained on a large YouTube dataset (a preliminary version of what later became YouTube-8M).

We provide a TensorFlow definition of this model, which we call VGGish, as well as supporting code to extract input features for the model from audio waveforms and to post-process the model embedding output into the same format as the released embedding features.

Installation

VGGish depends on the following Python packages:

  * numpy
  * resampy
  * tensorflow
  * tf_slim
  * six
  * soundfile

These are all easily installable via, e.g., pip install numpy (as in the sample installation session below). Any reasonably recent version of these packages should work.

VGGish also requires downloading two data files:

  * VGGish model checkpoint: vggish_model.ckpt
  * Embedding PCA parameters: vggish_pca_params.npz

After downloading these files into the same directory as this README, the installation can be tested by running python vggish_smoke_test.py, which runs a known signal through the model and checks the output.

Here's a sample installation and test session:

# You can optionally install and test VGGish within a Python virtualenv, which
# is useful for isolating changes from the rest of your system. For example, you
# may have an existing version of some packages that you do not want to upgrade,
# or you want to try Python 3 instead of Python 2. If you decide to use a
# virtualenv, you can create one by running
#   $ virtualenv vggish   # For Python 2
# or
#   $ python3 -m venv vggish # For Python 3
# and then enter the virtual environment by running
#   $ source vggish/bin/activate  # Assuming you use bash
# Leave the virtual environment at the end of the session by running
#   $ deactivate
# Within the virtual environment, do not use 'sudo'.

# Upgrade pip first. Also make sure wheel is installed.
$ sudo python -m pip install --upgrade pip wheel

# Install all dependencies.
$ sudo pip install numpy resampy tensorflow tf_slim six soundfile

# Clone TensorFlow models repo into a 'models' directory.
$ git clone https://github.com/tensorflow/models.git
$ cd models/research/audioset/vggish
# Download data files into same directory as code.
$ curl -O https://storage.googleapis.com/audioset/vggish_model.ckpt
$ curl -O https://storage.googleapis.com/audioset/vggish_pca_params.npz

# Installation ready, let's test it.
$ python vggish_smoke_test.py
# If we see "Looks Good To Me", then we're all set.

Usage

VGGish can be used in two ways:

  * As a feature extractor: run audio through the model and use the resulting 128-D embedding as input to a downstream model (this is how this repository uses it for video files).
  * As part of a larger model: treat VGGish as the lower layers of a bigger network and train additional layers stacked on top of the embedding, optionally fine-tuning the VGGish weights (see "About the Model" below).
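
As a concrete illustration of the feature-extractor path, here is a minimal sketch modeled on vggish_inference_demo.py. It assumes the two data files have been downloaded into the working directory and that example.wav is a placeholder path to a wav file.

import tensorflow.compat.v1 as tf

import vggish_input
import vggish_params
import vggish_slim

# Convert a wav file into a batch of log mel spectrogram examples,
# shaped [num_examples, 96, 64].
examples_batch = vggish_input.wavfile_to_examples('example.wav')

with tf.Graph().as_default(), tf.Session() as sess:
    # Define VGGish in inference mode and load the released checkpoint.
    vggish_slim.define_vggish_slim(training=False)
    vggish_slim.load_vggish_slim_checkpoint(sess, 'vggish_model.ckpt')

    features_tensor = sess.graph.get_tensor_by_name(vggish_params.INPUT_TENSOR_NAME)
    embedding_tensor = sess.graph.get_tensor_by_name(vggish_params.OUTPUT_TENSOR_NAME)

    # One 128-D embedding per 0.96 s example.
    [embedding_batch] = sess.run([embedding_tensor],
                                 feed_dict={features_tensor: examples_batch})
    print(embedding_batch.shape)  # (num_examples, 128)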

About the Model

The VGGish code layout is as follows:

  * vggish_slim.py: model definition in TensorFlow Slim notation.
  * vggish_params.py: hyperparameters and tensor names.
  * vggish_input.py: converter from audio waveforms into input examples.
  * mel_features.py: audio feature extraction helpers (log mel spectrogram).
  * vggish_postprocess.py: embedding postprocessing (PCA/whitening and quantization).
  * vggish_inference_demo.py: demo of VGGish in inference mode.
  * vggish_train_demo.py: demo of VGGish as part of a larger model in training mode.
  * vggish_smoke_test.py: simple test of a VGGish installation.

Architecture

See vggish_slim.py and vggish_params.py.

VGGish is a variant of the VGG model, in particular Configuration A with 11 weight layers. Specifically, here are the changes we made:

  * The input size was changed to 96x64 for log mel spectrogram audio inputs.
  * We drop the last group of convolutional and maxpool layers, so we now have only four groups of convolution/maxpool layers instead of five.
  * At the end of the network, instead of a 1000-wide fully connected layer, we use a 128-wide fully connected layer. This acts as a compact embedding layer.

The model definition provided here defines layers up to and including the 128-wide embedding layer. Note that the embedding layer does not include a final non-linear activation, so the embedding value is pre-activation. When training a model stacked on top of VGGish, you should send the embedding through a non-linearity of your choice before adding more layers.
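
For example, a small classifier could be stacked on top roughly as follows. This is only a sketch loosely modeled on vggish_train_demo.py: num_classes and the 100-unit hidden layer are placeholders, and the point is simply that the pre-activation embedding is passed through a ReLU before further layers are added.

import tensorflow.compat.v1 as tf
import tf_slim as slim

import vggish_slim

num_classes = 10  # placeholder: number of downstream classes

with tf.Graph().as_default():
    # VGGish layers up to and including the 128-D pre-activation embedding.
    embeddings = vggish_slim.define_vggish_slim(training=True)

    with tf.variable_scope('mymodel'):
        # Apply a non-linearity to the embedding before stacking more layers.
        net = tf.nn.relu(embeddings)
        net = slim.fully_connected(net, 100)  # placeholder hidden layer
        logits = slim.fully_connected(net, num_classes,
                                      activation_fn=None, scope='logits')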

Input: Audio Features

See vggish_input.py and mel_features.py.

VGGish was trained with audio features computed as follows:

  * All audio is resampled to 16 kHz mono.
  * A spectrogram is computed using magnitudes of the Short-Time Fourier Transform with a window size of 25 ms, a window hop of 10 ms, and a periodic Hann window.
  * A mel spectrogram is computed by mapping the spectrogram to 64 mel bins covering the range 125-7500 Hz.
  * A stabilized log mel spectrogram is computed by applying log(mel-spectrum + 0.01), where the offset is used to avoid taking a logarithm of zero.
  * These features are then framed into non-overlapping examples of 0.96 seconds, where each example covers 64 mel bands and 96 frames of 10 ms each.

We provide our own NumPy implementation that produces features that are very similar to those produced by our internal production code.
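
For instance, the converters in vggish_input.py can be called on a wav file or an in-memory waveform; the following is a short sketch (example.wav is a placeholder path).

import numpy as np

import vggish_input

# From a wav file on disk: a batch shaped [num_examples, 96, 64].
examples = vggish_input.wavfile_to_examples('example.wav')

# Or from an in-memory waveform with values in [-1.0, +1.0].
sample_rate = 16000
waveform = np.random.uniform(-1.0, 1.0, size=sample_rate * 3)  # 3 s of noise
examples = vggish_input.waveform_to_examples(waveform, sample_rate)
print(examples.shape)  # (3, 96, 64): one example per 0.96 s of audio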

Output: Embeddings

See vggish_postprocess.py.

The released AudioSet embeddings were postprocessed before release by applying a PCA transformation (which performs both PCA and whitening) as well as quantization to 8 bits per embedding element. This was done to be compatible with the YouTube-8M project which has released visual and audio embeddings for millions of YouTube videos in the same PCA/whitened/quantized format.

We provide a Python implementation of the postprocessing which can be applied to batches of embeddings produced by VGGish. vggish_inference_demo.py shows how the postprocessor can be run after inference.

If you don't need to use the released embeddings or YouTube-8M, then you could skip postprocessing and use raw embeddings.
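
If you do want embeddings in the released format, the postprocessor can be applied as in the following sketch. It assumes embedding_batch is a raw [num_examples, 128] batch produced by VGGish, as in the inference example above, and that vggish_pca_params.npz is in the working directory.

import vggish_postprocess

# Load the PCA parameters downloaded during installation.
pproc = vggish_postprocess.Postprocessor('vggish_pca_params.npz')

# Apply PCA, whitening and 8-bit quantization to the raw embeddings.
postprocessed_batch = pproc.postprocess(embedding_batch)
# postprocessed_batch is a uint8 array of shape [num_examples, 128].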

A Colab showing how to download the model and calculate the embeddings on your own sound data is available here: AudioSet Embedding Colab.