TVLT: Textless Vision-Language Transformer [NeurIPS 2022 bib]
Zineng Tang*, Jaemin Cho*, Yixin Nie*, Mohit Bansal
Learning compact visual-linguistic Transformer representations from low-level, continuous visual and audio perception signals, without assuming the prior existence of written text or tokens.
Introduction
Transformers for Vision-Language (VL) representation learning rely heavily on text-based inputs. (Some works use the audio channel only as an auxiliary channel.)
TVLT takes audio and visual inputs for VL representation learning with minimal modality-specific design and without text-specific modules such as tokenization and automatic speech recognition (ASR).
TVLT is pre-trained with vision-audio matching and masked autoencoding (mask and then reconstruct the continuous inputs of video frames and audio spectrograms), following the idea of Masked Autoencoders (MAE), which trains scalable vision learners with masked autoencoding on images.
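To make the objective concrete, here is a minimal sketch of the kind of input preparation masked autoencoding operates on: the audio waveform is converted to a log-mel spectrogram, the spectrogram and video frames are split into patches, and a random subset of patches is kept for the encoder while the rest are reconstructed by the decoder. The spectrogram settings, patch size, and 75% masking ratio below are illustrative assumptions, not the exact configuration used in this repository.

```python
# Illustrative sketch only: spectrogram settings, patch size, and masking
# ratio are assumptions, not the repository's actual preprocessing code.
import torch
import torchaudio

def log_mel_spectrogram(waveform, sample_rate=16000, n_mels=128):
    # Convert a mono waveform of shape [1, T] into a log-mel spectrogram [n_mels, frames].
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=160, n_mels=n_mels
    )(waveform)
    return torch.log(mel + 1e-6).squeeze(0)

def patchify(x, patch=16):
    # Split a 2D map [H, W] into non-overlapping patch tokens [N, patch * patch].
    H, W = x.shape
    x = x[: H - H % patch, : W - W % patch]                 # drop the remainder
    x = x.unfold(0, patch, patch).unfold(1, patch, patch)   # [H/p, W/p, p, p]
    return x.reshape(-1, patch * patch)

def random_mask(tokens, mask_ratio=0.75):
    # Keep a random subset of patch tokens; the decoder reconstructs the rest.
    n = tokens.shape[0]
    keep = torch.randperm(n)[: int(n * (1 - mask_ratio))]
    return tokens[keep], keep

waveform = torch.randn(1, 16000 * 10)        # 10 seconds of dummy audio
frame = torch.rand(224, 224)                 # one dummy (grayscale) video frame

audio_tokens = patchify(log_mel_spectrogram(waveform))
frame_tokens = patchify(frame)
visible_audio, _ = random_mask(audio_tokens)
visible_frame, _ = random_mask(frame_tokens)
print(visible_audio.shape, visible_frame.shape)
```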
<p align="center"> <img align="middle" width="800" src="assets/architecture.png"/> </p> <details> <summary>More</summary>TVLT attains performance comparable to its text-based counterpart on various multimodal tasks, such as visual question answering and multimodal sentiment analysis, with 28x faster inference speed and only 1/3 of the parameters.
<p align="center"> <img align="middle" width="800" src="assets/teaser.png"/> </p> </details>Install
Set up a Python environment:
conda create -n TVLT python=3.8 # You can also use another environment.
Install pytorch, torchvision, and torchaudio.
The following versions have been tested:
torch        1.10.0, 1.12.1
torchvision  0.11.1, 0.12.1
torchaudio   0.10.0, 0.13.1
You can try other versions of pytorch, but make sure they are compatible with your cuda and cudnn.
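A quick way to check that the installed builds match your setup (a simple sanity check, not something the training scripts require):

```python
# Print the installed versions and check that PyTorch can see the GPU.
import torch
import torchvision
import torchaudio

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```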
Install other dependencies
pip install -r requirements.txt
Demos
Get familiar with TVLT by trying the following demos.
- Masked Autoencoding on Video Frames and Audio Spectrogram
- Sentiment Analysis on Video and Audio
- Emotional Analysis on Video and Audio
Training
Pretraining (Data + scripts) -> TVLT Pretraining
Download the MAE checkpoint here
# Example
bash scripts/pretrain_mae_vam.sh
Finetuning on Downstream Tasks (Data + scripts) -> TVLT Finetuning
# Example
bash scripts/finetune_mosei.sh
Released Models
The model weights are hosted on the [Huggingface Hub](https://huggingface.co/TVLT/models).
If you have tried the demos, some models should have already been downloaded.
The details of each released TVLT model are described in the table below.
Training | Input Format | Component | Link |
---|---|---|---|
Pre-trained on HowTo100M + YTTemporal videos | Video + Audio | Encoder + Decoder | [link] |
Pre-trained on HowTo100M + YTTemporal videos, then finetuned on CMU-MOSEI sentiment analysis | Video + Audio | Encoder + Classification Head | [link] |
Pre-trained on HowTo100M + YTTemporal videos, then finetuned on CMU-MOSEI emotional analysis | Video + Audio | Encoder + Classification Head | [link] |
Pre-trained on HowTo100M + YTTemporal videos + ASR, then finetuned on CMU-MOSEI emotional analysis | Video + Text | Encoder + Classification Head | [link] |
To be continued... (Stay tuned, more pre-trained variants coming soon)
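As a rough sketch of how a released checkpoint can be fetched and inspected programmatically, the snippet below uses `huggingface_hub` and `torch.load`. The repo id follows the Hub page linked above; the filename is only an example, so check the Hub page for the files that are actually available.

```python
# Sketch: download a released TVLT checkpoint from the Huggingface Hub and
# inspect its contents. "TVLT.ckpt" is an example filename; see
# https://huggingface.co/TVLT/models for the files that actually exist.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="TVLT/models", filename="TVLT.ckpt")
state = torch.load(ckpt_path, map_location="cpu")

# Lightning-style checkpoints typically nest the weights under "state_dict".
weights = state.get("state_dict", state)
print(len(weights), "tensors, e.g.:")
for name in list(weights)[:5]:
    print(" ", name, tuple(weights[name].shape))
```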
Folder Structure
See Folder Structure
Updates
- Initial Code Release
- Notebook Demos
- Colab
- Release TTS question audios for VQA (We convert all the textual questions of VQAv2 to audio using Google TTS API.)
...
Recommended Usage
In our experiments, we pre-train TVLT on HowTo100M and YTTemporal videos. However, we recommend unlocking the full potential of TVLT by pre-training it on larger-scale video data to obtain more generic vision-language representations.
The resulting models can either be used directly on video (with audio channel) inputs for tasks such as audio-image/video retrieval, audio-VQA, and TTS-based VQA, or be used to extract visual-acoustic features for other tasks such as speech translation, multimodal content understanding, etc.
Citation
@inproceedings{tang2022tvlt,
title = {TVLT: Textless Vision-Language Transformer},
author = {Zineng Tang and Jaemin Cho and Yixin Nie and Mohit Bansal},
booktitle = {NeurIPS},
year = {2022}
}
Acknowledgement
The idea of this paper is heavily inspired by Masked Autoencoders Are Scalable Vision Learners.
Our codebase is based on ViLT.
We thank the authors for their open-source contributions.
Contact
Zineng Tang (zn.tang.terran@gmail.com)