# Siamese Vision Transformers are Scalable Audio-visual Learners
<img src="https://raw.githubusercontent.com/facebookresearch/unbiased-teacher/main/teaser/pytorch-logo-dark.png" width="10%">
This is the PyTorch implementation of our paper: <br>
Siamese Vision Transformers are Scalable Audio-visual Learners <br>
Yan-Bo Lin and Gedas Bertasius<br>
<p align="center"> <img src="https://i.imgur.com/z6A5kGd.png" width="70%"> </p> <br><br>Our Method<br>
<p align="center"> <img src="https://i.imgur.com/1gOUGh3.png" width="80%"> </p>π Preparation
- Install the dependencies: `pip3 install -r requirement`
- Download AudioSet and VGGSound
- Download `jx_vit_base_patch16_224_in21k-e5005f0a.pth` and save it at `./src/adapt_weights`.
- (Not strictly necessary, but it can slightly affect results.) Download the sqlite3 files and save them wherever you like. Reading annotations from sqlite3 instead of CSV avoids running out of CPU memory (see the sketch after this list).
- Edit `./src/dataloader.py` and `./src/dataloader_ft.py` to make sure your video path and sqlite3 path are correct.
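
The point of the sqlite3 files is that annotations can be queried row by row instead of being loaded into memory all at once. Below is a minimal sketch of that pattern, not the repository's actual dataloader: the database filename `annotations.sqlite3`, the table name `annotations`, and the column names are assumptions, so adapt them to the schema of the files you download.

```python
import sqlite3

from torch.utils.data import Dataset


class SqliteAnnotationDataset(Dataset):
    """Sketch: fetch one annotation row per __getitem__ instead of holding a
    whole CSV in CPU memory. Table and column names here are hypothetical."""

    def __init__(self, db_path="annotations.sqlite3"):  # placeholder path
        self.db_path = db_path
        self.conn = None  # opened lazily so each DataLoader worker gets its own handle
        conn = sqlite3.connect(db_path)
        self.length = conn.execute("SELECT COUNT(*) FROM annotations").fetchone()[0]
        conn.close()

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.conn is None:
            self.conn = sqlite3.connect(self.db_path, check_same_thread=False)
        # Assumes rowids run 1..N; adjust the query if your table uses an id column.
        video_path, label = self.conn.execute(
            "SELECT video_path, label FROM annotations WHERE rowid = ?", (idx + 1,)
        ).fetchone()
        # ...load and preprocess the audio/video clip from video_path here...
        return video_path, label
```

Opening the connection inside `__getitem__` rather than in `__init__` matters when `num_workers > 0`: each worker process then creates its own sqlite3 handle instead of sharing one across processes.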
## Pretraining
Run `./egs/audioset/run_pretrain_base.sh`
## Fine-tuning
- AudioSet 2M: run `./egs/audioset/run_base_ft_2m.sh`
- AudioSet 20K: run `./egs/audioset/run_base_ft.sh`
- VGGSound: run `./egs/vggsound/run_base_ft.sh`
## Cite
If you use this code in your research, please cite:

    @article{lin2024siamese,
      title={Siamese Vision Transformers are Scalable Audio-visual Learners},
      author={Lin, Yan-Bo and Bertasius, Gedas},
      journal={arXiv preprint arXiv:2403.19638},
      year={2024}
    }
## Acknowledgments
Our code is based on CAV-MAE.
## Model Checkpoints
More checkpoints and training scripts will be made available.
| Base | Base+ | Large | Huge |
|---|---|---|---|
| PT AS-2M | PT AS-2M+VGG+ACAV2.4M | | |
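
Once a checkpoint is released, a quick sanity check before fine-tuning is to inspect its state dict. The snippet below is a generic PyTorch sketch, not a loader shipped with this repo; the filename `avsiam_base_as2m.pth` is a placeholder for whichever checkpoint you download.

```python
import torch

# Placeholder filename; substitute the checkpoint you actually downloaded.
ckpt = torch.load("avsiam_base_as2m.pth", map_location="cpu")

# Checkpoints written from DistributedDataParallel runs often nest the weights
# under a "model" key and prefix parameter names with "module."; normalize both.
state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
state_dict = {k[len("module."):] if k.startswith("module.") else k: v
              for k, v in state_dict.items()}

print(f"{len(state_dict)} parameter tensors")
print(list(state_dict)[:5])  # peek at the first few parameter names
```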