# Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing

Code for the CVPR 2021 paper *Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing*.
## The Audio-Visual Video Parsing task

We aim to identify the audible and visible events in a video and localize them in time. Note that the audio and visual events may be asynchronous: for example, a dog may be on screen for the whole clip but audible only while it barks.
<div align=center><img src="https://github.com/Yu-Wu/Modaily-Aware-Audio-Visual-Video-Parsing/blob/master/task.png" width="600"></div>

## Prepare data
Please refer to https://github.com/YapengTian/AVVP-ECCV20 to download the LLP dataset and the preprocessed audio and visual features.
Put the downloaded `r2plus1d_18`, `res152`, and `vggish` features into the `feats` folder.
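As a quick sanity check before training, you can inspect the downloaded features. The snippet below is a minimal sketch that assumes each video's features are stored as one `.npy` array per modality with shape roughly `[num_segments, dim]`; adapt it if your copies are packaged differently.

```python
# Minimal sanity check for the feature folders (the .npy-per-video layout
# is an assumption; adapt if the downloaded archives differ).
import os
import numpy as np

FEAT_ROOT = "feats"
for subdir in ["vggish", "res152", "r2plus1d_18"]:
    path = os.path.join(FEAT_ROOT, subdir)
    files = sorted(f for f in os.listdir(path) if f.endswith(".npy"))
    print(f"{subdir}: {len(files)} feature files")
    if files:
        feat = np.load(os.path.join(path, files[0]))
        print("  example shape:", feat.shape)  # typically [num_segments, dim]
```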
## Training pipeline

Training consists of three stages.
### Train a base model

We first train a base model using multiple instance learning (MIL) and our proposed contrastive learning.
```bash
cd step1_train_base_model
python main_avvp.py --mode train --audio_dir ../feats/vggish/ --video_dir ../feats/res152/ --st_dir ../feats/r2plus1d_18
```
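For intuition, here is a minimal sketch (not the repository's exact code) of the two training signals in this stage: an MIL loss that pools per-segment, per-modality predictions against the weak video-level labels, and a simple audio-visual contrastive term. The tensor shapes, mean pooling, and temperature are assumptions for illustration.

```python
# A minimal sketch of the two stage-1 objectives (illustrative only).
import torch
import torch.nn.functional as F

def mil_loss(audio_logits, visual_logits, video_labels):
    """audio_logits/visual_logits: [B, T, C]; video_labels: [B, C] multi-hot."""
    # Pool per-segment predictions over time, then average the two
    # modalities into a single video-level prediction per class.
    a_prob = torch.sigmoid(audio_logits).mean(dim=1)   # [B, C]
    v_prob = torch.sigmoid(visual_logits).mean(dim=1)  # [B, C]
    video_prob = (a_prob + v_prob) / 2
    return F.binary_cross_entropy(video_prob, video_labels)

def contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """audio_emb/visual_emb: [B, D]; row i of each comes from the same video."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature  # [B, B] cross-modal similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Matching audio-visual pairs are positives; other videos in the
    # batch serve as negatives, in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```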
### Generate modality-aware labels

We then freeze the trained model and evaluate each video by swapping its audio and visual tracks with those of unrelated videos.
```bash
cd step2_find_exchange
python main_avvp.py --mode estimate_labels --audio_dir ../feats/vggish/ --video_dir ../feats/res152/ --st_dir ../feats/r2plus1d_18 --model_save_dir ../step1_train_base_model/models/
```
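The intuition behind the swap is sketched below, under assumed shapes and a hypothetical thresholding rule (the repository's actual estimation procedure may differ): if the frozen model stays confident about an event after the audio track is replaced with an unrelated one, the visual stream alone supports that event, so it need not carry an audio label.

```python
# A sketch of the swap intuition (hypothetical rule, assumed shapes and
# model signature): compare the frozen model's confidence on the original
# pairing vs. a pairing with an unrelated video's audio track.
import torch

@torch.no_grad()
def estimate_audio_labels(model, audio, visual, swapped_audio, video_labels, ratio=0.5):
    """audio: [B, T, Da]; visual: [B, T, Dv]; video_labels: [B, C] multi-hot.
    Assumes model(audio, visual) returns per-segment logits of shape [B, T, C]."""
    orig = torch.sigmoid(model(audio, visual)).mean(dim=1)          # [B, C]
    swap = torch.sigmoid(model(swapped_audio, visual)).mean(dim=1)  # [B, C]
    # A large confidence drop after the audio swap suggests the event is
    # genuinely audible; otherwise remove it from the audio label set.
    drop = (orig - swap) > ratio * orig
    return video_labels * drop.float()
```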
### Re-train using modality-aware labels

We then re-train the model from scratch using the modality-aware labels.
```bash
cd step3_retrain
python main_avvp.py --mode retrain --audio_dir ../feats/vggish/ --video_dir ../feats/res152/ --st_dir ../feats/r2plus1d_18
```
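The key difference from stage 1, sketched below with assumed shapes (the repository's loss may differ in its pooling and weighting), is that each modality is now supervised by its own estimated label set rather than the shared video-level labels.

```python
# A minimal sketch of re-training with modality-aware labels (illustrative).
import torch
import torch.nn.functional as F

def modality_aware_loss(audio_logits, visual_logits, audio_labels, visual_labels):
    """audio_logits/visual_logits: [B, T, C]; *_labels: [B, C] multi-hot,
    estimated separately for each modality in stage 2."""
    a_prob = torch.sigmoid(audio_logits).mean(dim=1)   # pool segments -> video level
    v_prob = torch.sigmoid(visual_logits).mean(dim=1)
    return (F.binary_cross_entropy(a_prob, audio_labels) +
            F.binary_cross_entropy(v_prob, visual_labels))
```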
## Citation

Please cite the following paper in your publications if it helps your research:

```bibtex
@inproceedings{wu2021explore,
  title     = {Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing},
  author    = {Wu, Yu and Yang, Yi},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2021}
}
```