# VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
<div align=center><img src="img/radar_compare_alldata_vast.png" width="75%" height="75%"></div>

This is the official repository of VAST, which will provide the code, model checkpoints, and dataset. They will be released after the paper is accepted.
<div align=center><img src="img/VAST-model.jpg"></div>

## Citation
If you find this code useful for your research, please consider citing:
@article{chen2023vast,
  title={VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset},
  author={Chen, Sihan and Li, Handong and Wang, Qunbo and Zhao, Zijia and Sun, Mingzhen and Zhu, Xinxin and Liu, Jing},
  journal={arXiv preprint arXiv:2305.18500},
  year={2023}
}