# Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks, NeurIPS 2023

This is the PyTorch implementation of our paper:

**Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks**

[Paper] [arXiv] [Video] [Poster] [Slides]

Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao

In NeurIPS 2023
## 📝Requirements and Installation

- **Getting Started**

  ```bash
  git clone https://github.com/haoyi-duan/DG-SCT
  cd DG-SCT
  pip install -r requirements.txt
  ```
- **Download HTS-AT Backbone**

  Download `checkpoints.zip` from Google Drive or Baidu Disk (pwd: 2023), and extract it into the directory `./DG-SCT/`.
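  A minimal sketch of this step, assuming `checkpoints.zip` was downloaded to the repository root (the archive name and target directory come from the instruction above; any unzip tool works):

  ```bash
  unzip checkpoints.zip -d ./DG-SCT/
  ```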
## AVE
- **Download Data**

  Download `frames.zip` from Google Drive or Baidu Disk (pwd: 2023) and `wave.zip` from Google Drive or Baidu Disk (pwd: 2023), and extract them into the directory `./data/AVE/`.
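  For example, assuming both archives were downloaded to the repository root (paths as listed above):

  ```bash
  mkdir -p ./data/AVE
  unzip frames.zip -d ./data/AVE/
  unzip wave.zip -d ./data/AVE/
  ```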
- **Usage**

  - Go to AVE task directory.

    ```bash
    cd DG-SCT/AVE
    ```

  - Training

    ```bash
    bash train.sh
    ```

  - Testing

    Checkpoint `./models/best_82.18.pt`: Google Drive or Baidu Disk (pwd: 2023)

    ```bash
    bash test.sh
    ```
- **Results**
## AVS
- **Download Data**

  - **Download Dataset**

    The updated AVSBench dataset (AVSBench-object) is available here. You may request the dataset by filling out the Google Form. The downloaded data should be placed in the directory `./data/`.

  - **Download Wave**

    Download the wave files for task S4 (Google Drive or Baidu Disk (pwd: 2023)) and task MS3 (Google Drive or Baidu Disk (pwd: 2023)), and extract them into the directories `./data/AVSBench_data/Single-source/s4_data/` and `./data/AVSBench_data/Multi-sources/ms3_data/`, respectively.
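    A sketch of this step, assuming the two wave archives were saved as `s4_wave.zip` and `ms3_wave.zip` (hypothetical file names; the target directories are the ones listed above):

    ```bash
    unzip s4_wave.zip -d ./data/AVSBench_data/Single-source/s4_data/
    unzip ms3_wave.zip -d ./data/AVSBench_data/Multi-sources/ms3_data/
    ```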
- **Download pretrained backbones**

  The pretrained ResNet50/PVT-v2-b5 (vision) and VGGish (audio) backbones can be downloaded from here and placed in the directory `./DG-SCT/AVS/pretrained_backbones/`.
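  For example (the placeholder below stands for whichever backbone weight files you downloaded):

  ```bash
  mkdir -p ./DG-SCT/AVS/pretrained_backbones
  mv <downloaded_backbone_weights> ./DG-SCT/AVS/pretrained_backbones/
  ```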
- **Usage**

  - Go to AVS task directory.

    ```bash
    # for S4 task:
    cd DG-SCT/AVS/avs_scripts/avs_s4
    # for MS3 task:
    cd DG-SCT/AVS/avs_scripts/avs_ms3
    ```
  - Training

    ```bash
    bash train.sh
    ```

  - Testing

    Checkpoint for the S4 task (`./DG-SCT/AVS/avs_scripts/avs_s4/train_logs`): Google Drive or Baidu Disk (pwd: 2023)

    Checkpoint for the MS3 task (`./DG-SCT/AVS/avs_scripts/avs_ms3/train_logs`): Google Drive or Baidu Disk (pwd: 2023)

    ```bash
    bash test.sh
    ```
- **Results**
## AVVP
- **Download Data**

  Download the extracted features, frames, and wave files of the LLP dataset from Baidu Disk (pwd: 2023), and extract them into the directory `./data/AVVP/`.
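  For example, assuming the download was saved as `AVVP.zip` (a hypothetical archive name; the target directory is the one above):

  ```bash
  mkdir -p ./data/AVVP
  unzip AVVP.zip -d ./data/AVVP/
  ```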
- **Usage**

  - Go to AVVP task directory.

    ```bash
    cd DG-SCT/AVVP
    ```

  - Training

    ```bash
    bash train.sh
    ```

  - Testing

    Checkpoint `./models/MGN_Net.pt`: Google Drive or Baidu Disk (pwd: 2023)

    ```bash
    bash test.sh
    ```
- **Results**
## AVQA
- **Download Data**

  Download `frames.zip` from Google Drive or Baidu Disk (pwd: 2023) and `audio_wave.zip` from Google Drive or Baidu Disk (pwd: 2023), and extract them into the directory `./data/AVQA/`.
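  For example, assuming both archives were downloaded to the repository root (paths as listed above):

  ```bash
  mkdir -p ./data/AVQA
  unzip frames.zip -d ./data/AVQA/
  unzip audio_wave.zip -d ./data/AVQA/
  ```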
- **Usage**

  - Go to AVQA task directory.

    ```bash
    cd DG-SCT/AVQA
    ```

  - Audio-Visual Grounding Generation

    ```bash
    python grounding_gen/main_grd_gen.py
    ```

    You can download `./grounding_gen/models_grounding_gen/lavish_grounding_gen_best.pt` from Google Drive or Baidu Disk (pwd: 2023) to skip the Audio-Visual Grounding Generation process.
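    If you use the pretrained grounding checkpoint, a placement sketch (assuming the file was downloaded to the current directory):

    ```bash
    mkdir -p ./grounding_gen/models_grounding_gen
    mv lavish_grounding_gen_best.pt ./grounding_gen/models_grounding_gen/
    ```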
  - Training

    ```bash
    bash train.sh
    ```

  - Testing

    Checkpoint `./net_grd_avst/avst_models/avst.pt`: Google Drive or Baidu Disk (pwd: 2023)

    ```bash
    bash test.sh
    ```
- **Results**
## Few-shot/Zero-shot

We use the audio-text backbones from CLAP: `630k-audioset-fusion-best.pt` and `630k-fusion-best.pt`. Please download them and place them in the directory `./pretrain/models/`.
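A minimal placement sketch, assuming the two CLAP checkpoints were downloaded to the repository root:

```bash
mkdir -p ./pretrain/models
mv 630k-audioset-fusion-best.pt 630k-fusion-best.pt ./pretrain/models/
```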
- **Few-shot**

  - Go to Few-shot Directory

    ```bash
    cd few-shot
    ```

  - **AVE**
    - 1 shot

      ```bash
      python main_AVE.py --dataset_name AVE --shot 1 --alpha 0.2 --beta 0.05 --gamma 0.01 --weak 0 --classification 0
      ```

    - 2 shots

      ```bash
      python main_AVE.py --dataset_name AVE --shot 2 --alpha 0.2 --beta 0.05 --gamma 0.01 --weak 0 --classification 0
      ```

    - 4 shots

      ```bash
      python main_AVE.py --dataset_name AVE --shot 4 --alpha 0.2 --beta 0.05 --gamma 0.01 --weak 0 --classification 0
      ```

    - 8 shots

      ```bash
      python main_AVE.py --dataset_name AVE --shot 8 --alpha 0.2 --beta 0.05 --gamma 0.01 --weak 0 --classification 0
      ```

    - 16 shots

      ```bash
      python main_AVE.py --dataset_name AVE --shot 16 --alpha 0.2 --beta 0.05 --gamma 0.01 --weak 0 --classification 0
      ```
  - **AVE Classification**
    - 1 shot

      ```bash
      python main_AVE_class.py --dataset_name AVE --shot 1 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      ```

    - 2 shots

      ```bash
      python main_AVE_class.py --dataset_name AVE --shot 2 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      ```

    - 4 shots

      ```bash
      python main_AVE_class.py --dataset_name AVE --shot 4 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      ```

    - 8 shots

      ```bash
      python main_AVE_class.py --dataset_name AVE --shot 8 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      ```

    - 16 shots

      ```bash
      python main_AVE_class.py --dataset_name AVE --shot 16 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      ```
  - **LLP Classification**
    - 1 shot

      ```bash
      python main_LLP_class.py --dataset_name LLP --shot 1 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      ```

    - 2 shots

      ```bash
      python main_LLP_class.py --dataset_name LLP --shot 2 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      ```

    - 4 shots

      ```bash
      python main_LLP_class.py --dataset_name LLP --shot 4 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      ```

    - 8 shots

      ```bash
      python main_LLP_class.py --dataset_name LLP --shot 8 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      ```

    - 16 shots

      ```bash
      python main_LLP_class.py --dataset_name LLP --shot 16 --alpha 0.2 --beta 0.05 --gamma 0.05 --weak 1 --classification 1
      ```
- **Zero-shot**
  - **Download Data**

    Download VGG-Sound(40K) from Baidu Disk (pwd: 2023), and extract it into the directory `./data/`.
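    A sketch of this step, assuming the download is an archive named `vggsound_40k.zip` (a hypothetical name; the target directory is the one above):

    ```bash
    unzip vggsound_40k.zip -d ./data/
    ```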
  - **Usage**

    - Pretrain on VGG-Sound(40K)

      ```bash
      cd pretrain
      bash train.sh
      ```

      The pretrained model will be placed at `pretrain/models/`.
    - Zero-shot

      ```bash
      MODEL_NAME="name of the pretrained model in pretrain/models/"

      # AVE
      python zero_shot.py --test_dataset_name AVE --backbone $MODEL_NAME --is_event_score 1
      # AVE classification
      python zero_shot.py --test_dataset_name AVE --backbone $MODEL_NAME --is_event_score 0
      # LLP classification
      python zero_shot.py --test_dataset_name LLP --backbone $MODEL_NAME --is_event_score 0
      ```
- **Results**
## 🎓Cite

If you find this work useful, please consider citing it:

```bibtex
@inproceedings{duan2023cross,
  title={Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks},
  author={Duan, Haoyi and Xia, Yan and Zhou, Mingze and Tang, Li and Zhu, Jieming and Zhao, Zhou},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023}
}
```
## 👍Acknowledgments

Our code is based on CMBS, AVSBench, MGN, MUSIC-AVQA, and LAVisH.
## ✏Model Checkpoints

| Tasks   | Checkpoints                            |
|---------|----------------------------------------|
| AVE     | Google Drive or Baidu Disk (pwd: 2023) |
| AVS_S4  | Google Drive or Baidu Disk (pwd: 2023) |
| AVS_MS3 | Google Drive or Baidu Disk (pwd: 2023) |
| AVVP    | Google Drive or Baidu Disk (pwd: 2023) |
| AVQA    | Google Drive or Baidu Disk (pwd: 2023) |