# Siamese Vision Transformers are Scalable Audio-visual Learners
<img src="https://raw.githubusercontent.com/facebookresearch/unbiased-teacher/main/teaser/pytorch-logo-dark.png" width="10%">
This is the PyTorch implementation of our paper: <br>
Siamese Vision Transformers are Scalable Audio-visual Learners <br>
Yan-Bo Lin and Gedas Bertasius<br>
<p align="center"> <img src="https://i.imgur.com/z6A5kGd.png" width="70%"> </p> <br><br>Our Method<br>
<p align="center"> <img src="https://i.imgur.com/1gOUGh3.png" width="80%"> </p>π Preparation
- Install the dependencies: `pip3 install -r requirement`
- Download AudioSet and VGGSound
- Download `jx_vit_base_patch16_224_in21k-e5005f0a.pth` and save it at `./src/adapt_weights`.
- (Not strictly necessary, but it can slightly affect results.) Download the sqlite3 files and save them wherever you like. Reading annotations from sqlite3 instead of CSV avoids running out of CPU memory (see the sketch after this list).
- Edit `./src/dataloader.py` and `./src/dataloader_ft.py` to make sure your video path and sqlite3 path are correct.
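
The point of the sqlite3 files is that annotations can be queried row by row instead of being loaded into memory all at once. Below is a minimal sketch of that pattern, not the repository's actual dataloader: the database filename `annotations.sqlite3`, the table name `annotations`, and the column names are assumptions, so adapt them to the schema of the files you download.

```python
import sqlite3

from torch.utils.data import Dataset


class SqliteAnnotationDataset(Dataset):
    """Sketch: fetch one annotation row per __getitem__ instead of holding a
    whole CSV in CPU memory. Table and column names here are hypothetical."""

    def __init__(self, db_path="annotations.sqlite3"):  # placeholder path
        self.db_path = db_path
        self.conn = None  # opened lazily so each DataLoader worker gets its own handle
        conn = sqlite3.connect(db_path)
        self.length = conn.execute("SELECT COUNT(*) FROM annotations").fetchone()[0]
        conn.close()

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.conn is None:
            self.conn = sqlite3.connect(self.db_path, check_same_thread=False)
        # Assumes rowids run 1..N; adjust the query if your table uses an id column.
        video_path, label = self.conn.execute(
            "SELECT video_path, label FROM annotations WHERE rowid = ?", (idx + 1,)
        ).fetchone()
        # ...load and preprocess the audio/video clip from video_path here...
        return video_path, label
```

Opening the connection inside `__getitem__` rather than in `__init__` matters when `num_workers > 0`: each worker process then creates its own sqlite3 handle instead of sharing one across processes.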
## Pretraining
Run `./egs/audioset/run_pretrain_base.sh`
## Fine-tuning
- AudioSet 2M: run `./egs/audioset/run_base_ft_2m.sh`
- AudioSet 20K: run `./egs/audioset/run_base_ft.sh`
- VGGSound: run `./egs/vggsound/run_base_ft.sh`
## Cite
If you use this code in your research, please cite:

    @article{lin2024siamese,
      title={Siamese Vision Transformers are Scalable Audio-visual Learners},
      author={Lin, Yan-Bo and Bertasius, Gedas},
      journal={arXiv preprint arXiv:2403.19638},
      year={2024}
    }
## Acknowledgments
Our code is based on CAV-MAE.
## Model Checkpoints
More checkpoints and training scripts will be made available.
| Base | Base+ | Large | Huge |
|---|---|---|---|
| PT AS-2M | PT AS-2M+VGG+ACAV2.4M | | |
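
Once a checkpoint is released, a quick sanity check before fine-tuning is to inspect its state dict. The snippet below is a generic PyTorch sketch, not a loader shipped with this repo; the filename `avsiam_base_as2m.pth` is a placeholder for whichever checkpoint you download.

```python
import torch

# Placeholder filename; substitute the checkpoint you actually downloaded.
ckpt = torch.load("avsiam_base_as2m.pth", map_location="cpu")

# Checkpoints written from DistributedDataParallel runs often nest the weights
# under a "model" key and prefix parameter names with "module."; normalize both.
state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
state_dict = {k[len("module."):] if k.startswith("module.") else k: v
              for k, v in state_dict.items()}

print(f"{len(state_dict)} parameter tensors")
print(list(state_dict)[:5])  # peek at the first few parameter names
```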