<div align="center"> <h1>Audio-Synchronized Visual Animation</h1><a href=https://arxiv.org/abs/2403.05659><img src="https://img.shields.io/badge/arXiv-2403.05659-b31b1b.svg"></a> <a href='https://lzhangbj.github.io/projects/asva/asva.html'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://huggingface.co/spaces/Linz99/ASVA'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face%20-Demo-blue'></a> <a href='https://www.youtube.com/watch?v=0quckDGL7hk'><img src='https://img.shields.io/badge/Youtube-Video-b31b1b.svg'></a><br> <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg"></a>
<p style="font-size: 18px;"> <strong> <a href="https://lzhangbj.github.io/">Lin Zhang<sup>1</sup></a>, <a href="https://scholar.google.com/citations?user=6aYncPAAAAAJ">Shentong Mo<sup>2</sup></a>, <a href="https://yijingz02.github.io/">Yijing Zhang<sup>1</sup></a>, <a href="https://pedro-morgado.github.io/">Pedro Morgado<sup>1</sup></a> </strong> </p> <p style="font-size: 18px;"> <strong> University of Wisconsin Madison<sup>1</sup><br> Carnegie Mellon University<sup>2</sup> </strong> </p><strong style="font-size: 25px;">ECCV 2024</strong><br> <strong style="font-size: 25px;">Oral Presentation</strong>
</div>

Checklist
- Release pretrained checkpoints
- Release inference code on audio-conditioned image animation and sync metrics
- Release ASVA training and evaluation code
- Release AVSync classifier training and evaluation code
- Release Huggingface Demo
1. Create environment
We use the `video_reader` backend of torchvision to load audio and videos, which requires building torchvision locally.
```bash
conda create -n asva python==3.10 -y
conda activate asva
pip install torch==2.1.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121

# Build torchvision from source
mkdir -p submodules
cd submodules
git clone https://github.com/pytorch/vision.git
cd vision
git checkout tags/v0.16.0
conda install -c conda-forge 'ffmpeg<4.3' -y
python setup.py install
cd ../..

pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:$(pwd):$(pwd)/submodules/ImageBind
```
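After installation, you can quickly confirm that the locally built torchvision actually exposes the `video_reader` backend. A minimal check (a sketch; `sample.mp4` is a placeholder for any local video):

```python
# Sanity check: the repo relies on the video_reader backend,
# which is only available when torchvision is built from source.
import torchvision

torchvision.set_video_backend("video_reader")  # raises RuntimeError if the backend is missing
print("active backend:", torchvision.get_video_backend())

# Optionally decode one frame to verify ffmpeg was picked up correctly.
reader = torchvision.io.VideoReader("sample.mp4", "video")  # placeholder path
frame = next(reader)
print("first frame:", tuple(frame["data"].shape), "pts:", frame["pts"])
```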
2. Download pretrained models
Download required features/models
- ImageBind: Pretrained frozen audio encoder
- I3D: Evaluating FVD
- Stable Diffusion V1.5: Load pretrained image generation model
- AVID-CMA: Initialize AVSync Classifier's encoders
- Precomputed null text encodings: For ease of computation
Please download them and structure them as follows:
- submodules/
- ImageBind/
- pretrained/
- i3d_torchscript.pt
- stable-diffusion-v1-5/
- openai-clip-l_null_text_encoding.pt
- AVID-CMA_Audioset_InstX-N1024-PosW-N64-Top32_checkpoint.pth.tar
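Before moving on, a quick check that the layout above is complete can save a failed run later; a minimal sketch:

```python
# Verify the pretrained assets listed above are present.
from pathlib import Path

expected = [
    "submodules/ImageBind",
    "pretrained/i3d_torchscript.pt",
    "pretrained/stable-diffusion-v1-5",
    "pretrained/openai-clip-l_null_text_encoding.pt",
    "pretrained/AVID-CMA_Audioset_InstX-N1024-PosW-N64-Top32_checkpoint.pth.tar",
]
missing = [p for p in expected if not Path(p).exists()]
print("all pretrained assets found" if not missing else f"missing: {missing}")
```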
Download pretrained AVSyncD and AVSync Classifier checkpoints
<table> <tr> <th>Model</th> <th>Dataset</th> <th>Checkpoint</th> <th>Config</th> <th>Audio CFG</th> <th>FVD</th> <th>AlignSync</th> </tr> <tr> <td rowspan="9">AVSyncD</td> <td rowspan="3">AVSync15</td> <td rowspan="3"><a href="https://drive.google.com/file/d/17ZYopMVM1ZuJ1CBZPzhwAyOa4rR9Eo-_/view?usp=sharing">GoogleDrive</a></td> <td rowspan="3"><a href="configs/audio-cond_animation/avsync15_audio-cond_cfg.yaml">Link</a></td> <td>1.0</td> <td>323.06</td> <td>22.21</td> </tr> <tr> <td>4.0</td> <td>300.82</td> <td>22.64</td> </tr> <tr> <td>8.0</td> <td>375.02</td> <td>22.70</td> </tr> <tr> <td rowspan="3">Landscapes</td> <td rowspan="3"><a href="https://drive.google.com/file/d/1Wa0wK9D_qlkT8U2O8zCz6UoQql-A3zjD/view?usp=sharing">GoogleDrive</a></td> <td rowspan="3"><a href="configs/audio-cond_animation/landscapes_audio-cond_cfg.yaml">Link</a></td> <td>1.0</td> <td>491.37</td> <td>24.94</td> </tr> <tr> <td>4.0</td> <td>449.59</td> <td>25.02</td> </tr> <tr> <td>8.0</td> <td>547.97</td> <td>25.16</td> </tr> <tr> <td rowspan="3">TheGreatestHits</td> <td rowspan="3"><a href="https://drive.google.com/file/d/1u8Ksc9TrDhcr6tV_7xH9RsbdkklH-2y9/view?usp=sharing">GoogleDrive</a></td> <td rowspan="3"><a href="configs/audio-cond_animation/thegreatesthits_audio-cond_cfg.yaml">Link</a></td> <td>1.0</td> <td>305.41</td> <td>22.56</td> </tr> <tr> <td>4.0</td> <td>255.49</td> <td>22.89</td> </tr> <tr> <td>8.0</td> <td>279.12</td> <td>23.14</td> </tr> </table>

Model | Dataset | Checkpoint | Config | A2V Sync Acc | V2A Sync Acc |
---|---|---|---|---|---|
AVSync Classifier | VGGSS | GoogleDrive | Link | 40.76 | 40.86 |
Please download the checkpoints you need and structure them as follows:
- checkpoints/
- audio-cond_animation/
- avsync15_audio-cond_cfg/
- landscapes_audio-cond_cfg/
- thegreatesthits_audio-cond_cfg/
- avsync/
- vggss_sync_contrast/
3. Demo
Generate animation on audio / image / video
The program first tries to load audio from `audio` and the image from `image`. If they are not specified, the program then loads the audio or image from `video`.
```bash
python -W ignore scripts/animation_demo.py --dataset AVSync15 --category "lions roaring" --audio_guidance 4.0 \
  --audio ./assets/lions_roaring.wav --image ./assets/lion_and_gun.png --save_path ./assets/generation_lion_roaring.mp4

python -W ignore scripts/animation_demo.py --dataset AVSync15 --category "machine gun shooting" --audio_guidance 4.0 \
  --audio ./assets/machine_gun_shooting.wav --image ./assets/lion_and_gun.png --save_path ./assets/generation_lion_shooting_gun.mp4
```
<div align="center">
<img src="assets/generation_lion_roaring.gif" alt="Lion roaring">
<img src="assets/generation_lion_shooting_gun.gif" alt="Lion shooting gun">
</div>
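To animate the same image with several different sounds, one option is to loop over the demo CLI from Python; a minimal sketch (the categories, audio files, and output names below are just the two examples above):

```python
# Hypothetical batch driver around scripts/animation_demo.py.
import subprocess

image = "./assets/lion_and_gun.png"
jobs = [  # (category, audio, output)
    ("lions roaring", "./assets/lions_roaring.wav", "./assets/generation_lion_roaring.mp4"),
    ("machine gun shooting", "./assets/machine_gun_shooting.wav", "./assets/generation_lion_shooting_gun.mp4"),
]
for category, audio, out in jobs:
    subprocess.run(
        [
            "python", "-W", "ignore", "scripts/animation_demo.py",
            "--dataset", "AVSync15", "--category", category, "--audio_guidance", "4.0",
            "--audio", audio, "--image", image, "--save_path", out,
        ],
        check=True,
    )
```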
Compute sync metrics for audio-video pairs
We have 3 metrics:
AVSync score
The raw output value of the AVSync Classifier for an input (audio, video) pair. It lies in the range (-∞, ∞).
```bash
python -W ignore scripts/avsync_metric.py --metric avsync_score --audio {audio path} --video {video path}
```
RelSync
Measures synchronization of an (audio, video) pair by using a reference.
To measure synchronization of audio generation, the reference is a groundtruth audio.
```bash
python -W ignore scripts/avsync_metric.py --metric relsync --audio {generated audio path} --video {video path} --ref_audio {groundtruth audio path}
```
To measure synchronization of video generation, the reference is a groundtruth video.
```bash
python -W ignore scripts/avsync_metric.py --metric relsync --audio {audio path} --video {generated video path} --ref_video {groundtruth video path}
```
AlignSync
Measures synchronization of an (audio, video) pair by using a reference video. It is only used to measure sync for video generation.
```bash
python -W ignore scripts/avsync_metric.py --metric alignsync --audio {audio path} --video {generated video path} --ref_video {groundtruth video path}
```
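The commands above score a single pair; to score a whole folder of generations you could wrap the same CLI in a loop, as in the sketch below (the directory layout and file naming are assumptions, so adapt the globbing to your own setup):

```python
# Hypothetical batch scoring: run AlignSync for every generated clip
# against its groundtruth counterpart via the documented CLI.
import subprocess
from pathlib import Path

gen_dir, gt_dir, audio_dir = Path("generations"), Path("gt_videos"), Path("audios")
for gen in sorted(gen_dir.glob("*.mp4")):
    subprocess.run(
        [
            "python", "-W", "ignore", "scripts/avsync_metric.py", "--metric", "alignsync",
            "--audio", str(audio_dir / f"{gen.stem}.wav"),
            "--video", str(gen),
            "--ref_video", str(gt_dir / gen.name),
        ],
        check=True,
    )
```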
4. Download datasets
Each dataset has 3 files/folders:
- `videos/`: the directory storing all .mp4 video files
- `train.txt`: training file names
- `test.txt`: testing file names

Optionally, we precomputed two files for ease of computation:
- `class_mapping.json`: maps the category string in a file name to the text string used for conditioning
- `class_clip_text_encodings_stable-diffusion-v1-5.pt`: maps the text string used for conditioning to its CLIP text encoding
Download these files from GoogleDrive, and place them under the `datasets/` folder.
To download videos:
- AVSync15: download videos from the link above. (Last update: July 26, 2024)
- Landscapes: download videos from MMDiffusion.
- TheGreatestHits: download videos from Visually Indicated Sounds.
- VGGSS: for AVSync Classifier training/evaluation, download videos from VGGSound. Only videos listed in `train.txt` and `test.txt` are needed.
Overall, the `datasets/` folder has the following structure:
- datasets/
- AVSync15/
- videos/
- baby_babbling_crying/
- cap_gun_shooting/
- ...
- train.txt
- test.txt
- class_mapping.json
- class_clip_text_encodings_stable-diffusion-v1-5.pt
- Landscapes/
- videos/
- train/
- explosion
- ...
- test/
- explosion
- ...
- ...
- train.txt
- test.txt
- class_mapping.json
- class_clip_text_encodings_stable-diffusion-v1-5.pt
- TheGreatestHits/
- videos/
- xxxx_denoised_thumb.mp4
- ...
- train.txt
- test.txt
- class_clip_text_encodings_stable-diffusion-v1-5.pt
- VGGSS/
- videos/
- air_conditioning_noise/
- air_horn/
- ...
- train.txt
- test.txt
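Once a dataset is in place, a quick sanity check can confirm the file lists, videos, and precomputed files line up. A sketch, assuming each line of `train.txt`/`test.txt` is a video path relative to `videos/` and that the `.pt` file is a dict keyed by the conditioning text:

```python
# Sanity-check a downloaded dataset against its file lists and precomputed encodings.
import json
from pathlib import Path

import torch

root = Path("datasets/AVSync15")  # or Landscapes / TheGreatestHits / VGGSS
for split in ("train.txt", "test.txt"):
    names = (root / split).read_text().split()
    missing = [n for n in names if not (root / "videos" / n).exists()]
    print(f"{split}: {len(names)} entries, {len(missing)} missing videos")

mapping = json.loads((root / "class_mapping.json").read_text())
encodings = torch.load(root / "class_clip_text_encodings_stable-diffusion-v1-5.pt", map_location="cpu")
print("classes without a precomputed encoding:", set(mapping.values()) - set(encodings.keys()))
```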
5. Train and evaluate AVSyncD
Train
Training is done on 8 RTX-A4500 GPUs (20 GB) for AVSync15/Landscapes or 4 A100 GPUs for TheGreatestHits, with a total batch size of 64, using accelerate for distributed training and wandb for logging.
Checkpoints are flushed every `checkpointing_steps` iterations. In addition, the checkpoints at the `checkpointing_milestones`-th iteration and at the last iteration are both saved. Please adjust these two parameters in the `.yaml` config file so that important weights are not flushed away when you customize the training recipe.
PYTHONWARNINGS="ignore" accelerate launch scripts/animation_train.py --config_file configs/audio-cond_animation/{datasetname}_audio-cond_cfg.yaml
Results are saved to `exps/audio-cond_animation/{dataset}_audio-cond_cfg`, with the same structure as the pretrained checkpoints.
Evaluation
Evaluation is two-step:
- Generate 3 clips per video for the test set using `scripts/animation_gen.py`
- Evaluate the generated clips against the groundtruth clips using `scripts/animation_eval.py`
Please refer to `scripts/animation_test_{dataset}.sh` for the detailed steps.
For example, to evaluate AVSyncD pretrained on AVSync15 with audio guidance scale = 4.0:
```bash
bash scripts/animation_test_avsync15.sh checkpoints/audio-cond_animation/avsync15_audio-cond_cfg 37000 4.0
```
6. Train and evaluate AVSync Classifier
Train
The AVSync Classifier is trained on the VGGSS training split for 4 days on 8 RTX-A4500 GPUs with a batch size of 32.
PYTHONWARNINGS="ignore" accelerate launch scripts/avsync_train.py --config_file configs/avsync/vggss_sync_contrast.yaml
Evaluation
We follow VGGSoundSync and sample 31 clips from each video, with a 0.04 s gap between neighboring clips. Given the audio/video clip at the center, we predict the index of its synchronized video/audio clip. A tolerance range of 5 is applied, since humans are tolerant to 0.2 s of asynchrony.
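Concretely, with 31 clips the groundtruth is the center clip (index 15), and a prediction counts as correct when it falls within 5 indices of it, i.e. 5 × 0.04 s = 0.2 s. A toy sketch of that accuracy rule (the predictions are made up):

```python
# Toy illustration of the tolerant sync accuracy: 31 clips per video,
# 0.04 s apart, groundtruth at the center index, tolerance of 5 indices.
NUM_CLIPS, GAP_S, TOLERANCE = 31, 0.04, 5
center = NUM_CLIPS // 2  # index 15

predicted_indices = [15, 13, 20, 21, 10]  # made-up predictions for 5 videos
correct = [abs(p - center) <= TOLERANCE for p in predicted_indices]
print(f"tolerance window: +/-{TOLERANCE * GAP_S:.2f} s")    # +/-0.20 s
print(f"sync accuracy: {sum(correct) / len(correct):.2%}")  # 80.00% here
```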
For example, to evaluate our pretrained AVSync Classifier on 8 GPUs, run:
PYTHONWARNINGS="ignore" accelerate launch --num_processes=8 scripts/avsync_eval.py --checkpoint checkpoints/avsync/vggss_sync_contrast/ckpts/checkpoint-40000/modules --mixed_precision fp16
Citation
Please consider citing our paper if you find this repo useful:
```bibtex
@inproceedings{linz2024asva,
    title={Audio-Synchronized Visual Animation},
    author={Lin Zhang and Shentong Mo and Yijing Zhang and Pedro Morgado},
    booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
    year={2024}
}
```