
Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks, NeurIPS 2023

[Figure: model overview]

This is the PyTorch implementation of our paper:

Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks

[Paper] [arXiv] [Video] [Poster] [Slides]

Haoyi Duan, Yan Xia, Mingze Zhou, Li Tang, Jieming Zhu, Zhou Zhao

In NeurIPS 2023


📝Requirements and Installation

git clone https://github.com/haoyi-duan/DG-SCT
cd DG-SCT
pip install -r requirements.txt
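After installing, a quick way to confirm the environment is usable is to check that PyTorch imports and can see a GPU. This is a minimal sketch and not part of the repo; the actual package versions come from requirements.txt.

```python
# Minimal post-install sanity check (not part of the repo).
# Assumes requirements.txt installed a CUDA-enabled PyTorch build.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```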

AVE

AVS

AVVP

AVQA

Few-shot/Zero-shot

We use the audio-text backbones from CLAP: 630k-audioset-fusion-best.pt and 630k-fusion-best.pt. Please download them and place them in the directory ./pretrain/models/.
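A minimal sketch to verify the two checkpoints ended up in the expected location (the file and directory names come from the instructions above; the check itself is an assumption, not part of the repo):

```python
import os

# Verify the CLAP audio-text checkpoints are where the code expects them
# (./pretrain/models/, per the instructions above).
CKPT_DIR = "./pretrain/models"
for name in ("630k-audioset-fusion-best.pt", "630k-fusion-best.pt"):
    path = os.path.join(CKPT_DIR, name)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Missing CLAP checkpoint: {path}")
print("CLAP checkpoints found.")
```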


🎓Cite

If you find this work useful, please consider citing it.

@inproceedings{duan2023cross,
  title={Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks},
  author={Duan, Haoyi and Xia, Yan and Zhou, Mingze and Tang, Li and Zhu, Jieming and Zhao, Zhou},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023}
}

👍Acknowledgments

Our code is based on CMBS, AVSBench, MGN, MUSIC-AVQA, and LAVisH.

✏Model Checkpoints

Tasks | Checkpoints
AVE | Google Drive or Baidu Disk (pwd: 2023)
AVS_S4 | Google Drive or Baidu Disk (pwd: 2023)
AVS_MS3 | Google Drive or Baidu Disk (pwd: 2023)
AVVP | Google Drive or Baidu Disk (pwd: 2023)
AVQA | Google Drive or Baidu Disk (pwd: 2023)
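
Once a checkpoint is downloaded, it can be inspected with torch.load before wiring it into the task scripts. This is a hedged sketch; the file path below is a placeholder for whichever file you download from the table above.

```python
import torch

# Inspect a downloaded task checkpoint (path is a placeholder, not a
# real file shipped with the repo).
ckpt = torch.load("path/to/downloaded_checkpoint.pt", map_location="cpu")

# Checkpoints are typically either a state_dict or a dict wrapping one;
# printing the top-level keys shows which layout this file uses.
keys = ckpt.keys() if isinstance(ckpt, dict) else []
print(list(keys)[:10])
```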