Benchmarking for Audio-Text and Audio-Visual Generation

Overview

This repository supports the evaluation of:

Installation

This repository has been tested on Ubuntu and requires Python 3.9+ and PyTorch 2.5.1 or later. Follow the steps below to set up the environment:

1. Recommended Setup

We recommend using a miniforge environment.
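For example, a minimal setup could look like the following; the environment name and exact Python version are placeholders (any Python 3.9+ works):

conda create -n av-benchmark python=3.10 -y
conda activate av-benchmark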

2. Install PyTorch

Before proceeding, install PyTorch with the appropriate CUDA version by following the instructions on the official PyTorch website: https://pytorch.org/
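For instance, a CUDA 12.1 build can be installed as shown below; treat the index URL as an example and pick the one that matches your CUDA version from the selector on pytorch.org:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121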

3. Clone and Install the Repository

git clone https://github.com/hkchengrex/av-benchmark.git
cd av-benchmark
pip install -e .

4. Download Pretrained Models

Download music_speech_audioset_epoch_15_esc_89.98.pt and the Synchformer checkpoint and put them in the weights directory:

(Run these commands from the root directory of the repository.)

mkdir weights
wget https://huggingface.co/lukewys/laion_clap/resolve/main/music_speech_audioset_epoch_15_esc_89.98.pt -O weights/music_speech_audioset_epoch_15_esc_89.98.pt
wget https://github.com/hkchengrex/MMAudio/releases/download/v0.1/synchformer_state_dict.pth -O weights/synchformer_state_dict.pth

5. Optional: For Video Evaluation

If you plan to evaluate on videos, you will also need ffmpeg. Note that torchaudio imposes a maximum version limit (ffmpeg<7). You can install it as follows:

conda install -c conda-forge 'ffmpeg<7'

Usage

Overview

Evaluation is a two-stage process:

  1. Extraction: extract video/text/audio features for the ground truth and audio features for the predicted audio. The extracted features are saved in gt_cache and pred_cache, respectively.
  2. Evaluation: compute the desired metrics using the extracted features.

By default, if gt_cache or pred_cache is not specified, we use gt_audio/cache and pred_audio/cache. gt_audio and pred_audio should each point to a directory containing audio files in .wav or .flac format.
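For instance, a typical layout with the default cache locations might look like this (file names are illustrative; the cache directories are created during extraction):

gt_audio/
    clip_0001.flac
    clip_0002.flac
    cache/
pred_audio/
    clip_0001.flac
    clip_0002.flac
    cache/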

Extraction

1. Video feature extraction (optional).

For video-to-audio applications, visual features are extracted from the input videos. This also applies to generated videos in audio-to-video or audio-visual joint generation tasks.

Input requirements:

Run the following to extract visual features using Synchformer and ImageBind:

python extract_video.py --gt_cache <output cache directory> --video_path <directory containing videos> --gt_batch_size <batch size> --audio_length=<length of video in seconds>
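As a concrete example, the invocation below uses placeholder paths and values (batch size 16, 8-second clips); substitute your own directories and duration:

python extract_video.py --gt_cache ./gt_cache --video_path ./videos --gt_batch_size 16 --audio_length=8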

Some of the precomputed caches for VGGSound/AudioCaps can be found here: https://huggingface.co/datasets/hkchengrex/MMAudio-precomputed-results

2. Text feature extraction (optional).

For text-to-audio applications, text features are extracted from input text data.

Input requirements:

Run the following to extract text features using LAION-CLAP and MS-CLAP:

python extract_text.py --text_csv <path to the csv> --output_cache_path <output cache directory>
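For example, with placeholder paths for the captions file and the output cache:

python extract_text.py --text_csv ./captions.csv --output_cache_path ./gt_cache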

3. Audio feature extraction.

Audio features are automatically extracted during the evaluation stage.

Manual extraction: You can force feature extraction by specifying:

This is useful if the extraction is interrupted or the cache is corrupted.

Evaluation

python evaluate.py  --gt_audio <gt audio path> --gt_cache <gt cache path> --pred_audio <pred audio path> --pred_cache <pred cache path> --audio_length=<length of audio wanted in seconds> 
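For example, using the default cache locations and 8-second audio (all paths and the duration are placeholders):

python evaluate.py --gt_audio ./gt_audio --gt_cache ./gt_audio/cache --pred_audio ./pred_audio --pred_cache ./pred_audio/cache --audio_length=8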

You can specify --skip_clap or --skip_video_related to speed up evaluation if you don't need those metrics.

Supporting Libraries

To address issues with deprecated code in some underlying libraries, we have forked and modified several of them. These forks are included as dependencies to ensure long-term compatibility.

Citation

This repository is part of the accompanying code for MMAudio. To cite this repository, please use the following BibTeX entry:

@inproceedings{cheng2024taming,
  title={Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis},
  author={Cheng, Ho Kei and Ishii, Masato and Hayakawa, Akio and Shibuya, Takashi and Schwing, Alexander and Mitsufuji, Yuki},
  booktitle={arXiv},
  year={2024}
}

References

Many thanks to