Awesome

D2M-GAN: Music Generation for Dance Videos.

This is the official Pytorch implementation for D2M-GAN, a VQ-GAN based complex music generative model for dance videos. (ECCV 2022)

Ye Zhu, Kyle Olszewski, Yu Wu, Panos Achlioptas, Menglei Chai, Yan Yan, Sergey Tulyakov

Paper | Samples

Updates:

If you are interested in this Dance-to-Music generation task or the conditional cross-modality synthesis in general, please also check our follow-up work using the Contrastive Diffusion here, thanks!

1. Project Overview

In our paper Quantized GAN for Complex Music Generation from Dance Videos, we present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates complex musical samples conditioned on dance videos.

2. Environment Setup

The envirunment can be set up following the instructions below. Since our proposed D2M-GAN utilizes the VQ-VAE decoder of JukeBox for synthesizing the music, part of the code for loading the pre-trained VQ-VAE are adapted from the JukeBox repo.

conda create --name d2m python=3.8
source activate d2m
conda install mpi4py==3.0.3
# install the pytorch and cudatoolkit based on your machine env.
conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch
git clone https://github.com/L-YeZhu/D2M-GAN.git
cd D2M-GAN
pip install -r requirements.txt
pip install -e .

3. Data

We conduct experiments on the following two datasets.

3.1 AIST++ Dataset

The AIST++ dataset is a subset of AIST dataset, which can be downloaded from here. We use the cross-modality data split for training and testing. We also provide our preprocessed motion, vision, and music audio data.

3.2 TikTok Dance-Music Dataset

We build a TikTok Dance-Music dataset, which contains the "in-the-wild" dance videos. The current dataset has 445 dance videos with an average length of 12.5 seconds. We provide the full dataset annotation including the video link, music ID, video ID, the duration and the number of dancers here, as well as our splits for training and testing. To facilitate the usage, the preprocessed motion, vision, and music audio are also available.

4. Training

4.1 Training on the AIST++

To train the D2M-GAN on the AIST++ dataset.

python d2m_aist.py --model=5b --save_sample_path=./samples --model_level=high

4.2 Training on the TikTok Dance-Music

To train the D2M-GAN on the TikTok Dance-Music dataset.

python d2m_tiktok.py --model=5b --save_sample_path=./samples --model_level=high

5. Generation

After finish training the D2M-GAN, run the following command to generate the music samples.

For the AIST++ dataset. We also release the pre-trained models for the motion encoder and music VQ generator to run the high level model. Low level checkpoints can also be downloaded here: motion encoder and music VQ generator.

python generate_aist.py --model=5b --load_path=./logs_aist_high --result_path=./audio_result_aist --model_level=high

For the TikTok dance-music dataset.

python generate_tiktok.py --model=5b --load_path=./logs_tiktok_high --result_path=./audio_result_tiktok --model_level=high

We also provide a simple tool for calculating the beats cover and hit scores here.

6. Citation

Please consider citing our paper if you find it useful.

@inproceedings{yezhu2022quantizedgan,
  title={Quantized GAN for Complex Music Generation from Dance Videos},
  author={Zhu, Ye and Olszewski, Kyle and Wu, Yu and Achlioptas, Panos and Chai, Menglei and Yan, Yan and Tulyakov, Sergey},
  booktitle={The European Conference on Computer Vision (ECCV)},
  year={2022}
}