
Siamese Vision Transformers are Scalable Audio-visual Learners

[📗 Paper](https://arxiv.org/abs/2403.19638)

License: MIT <img src="https://raw.githubusercontent.com/facebookresearch/unbiased-teacher/main/teaser/pytorch-logo-dark.png" width="10%">

This is the PyTorch implementation of our paper: <br>

Siamese Vision Transformers are Scalable Audio-visual Learners <br>

Yan-Bo Lin and Gedas Bertasius<br>

<p align="center"> <img src="https://i.imgur.com/z6A5kGd.png" width="70%"> </p> <br>

Our Method

<p align="center"> <img src="https://i.imgur.com/1gOUGh3.png" width="80%"> </p>
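The core idea illustrated above is that a single weight-shared ViT backbone encodes both audio spectrograms and video frames, with only the patch-embedding layers kept modality-specific. Below is a minimal, illustrative sketch of that weight sharing; the module names, dimensions, and the plain `nn.TransformerEncoder` backbone are assumptions for exposition, not the released implementation:

```python
# Illustrative sketch (not the released code): one shared transformer
# encodes both modalities; only the patch embeddings differ.
import torch
import torch.nn as nn

class SharedAVEncoder(nn.Module):
    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        # Modality-specific patch embeddings (sizes are assumptions).
        self.audio_patch = nn.Conv2d(1, dim, kernel_size=16, stride=16)   # 1-ch log-mel spectrogram
        self.visual_patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # RGB frame
        # One transformer shared by both modalities (the "Siamese" part).
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)

    def forward(self, x, modality):
        patch = self.audio_patch if modality == "audio" else self.visual_patch
        tokens = patch(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.backbone(tokens)

enc = SharedAVEncoder()
audio_tokens = enc(torch.randn(2, 1, 128, 1024), "audio")   # batch of spectrograms
video_tokens = enc(torch.randn(2, 3, 224, 224), "visual")   # batch of frames
```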

πŸ“ Preparation

πŸƒ Pretraining

πŸƒ Finetuneing

🎓 Cite

If you use this code in your research, please cite:

```bibtex
@article{lin2024siamese,
  title={Siamese Vision Transformers are Scalable Audio-visual Learners},
  author={Lin, Yan-Bo and Bertasius, Gedas},
  journal={arXiv preprint arXiv:2403.19638},
  year={2024}
}
```

πŸ‘ Acknowledgments

Our code is based on CAV-MAE.

✏ Model Checkpoints

More checkpoints and training scripts will be available.

|                        | Base | Base+ | Large | Huge |
| ---------------------- | ---- | ----- | ----- | ---- |
| PT AS-2M               |      |       |       |      |
| PT AS-2M+VGG+ACAV2.4M  |      |       |       |      |